Multilevel spatiotemporal validation of snow/ice mass balance and runoff modeling in glacierized catchments

In this study, the fully distributed, physically based hydroclimatological model AMUNDSEN is set up for catchments in the highly glacierized Ötztal Alps (Austria, 558 km2 in total). The model is applied for the period 1997–2013, using a spatial resolution of 50 m and a temporal resolution of 1 h. A novel parameterization for lateral snow redistribution based on topographic openness is presented to account for the highly heterogeneous snow accumulation patterns in the complex topography of the study region. Multilevel spatiotemporal validation is introduced as a systematic, independent, complete, and redundant validation procedure based on the observation scale of temporal and spatial support, spacing, and extent. This new approach is demonstrated using a comprehensive set of eight independent validation sources: (i) mean areal precipitation over the period 1997–2006 derived by conserving mass in the closure of the water balance, (ii) time series of snow depth recordings at the plot scale, (iii–iv) multitemporal snow extent maps derived from Landsat and MODIS satellite data products, (v) the snow accumulation distribution for the winter season 2010/2011 derived from airborne laser scanning data, (vi) specific surface mass balances for three glaciers in the study area, (vii) spatially distributed glacier surface elevation changes for the entire area over the period 1997–2006, and (viii) runoff recordings for several subcatchments. The results indicate a high overall model skill and especially demonstrate the benefit of the new validation approach. The method can serve as guideline for systematically validating the coupled components in integrated snow-hydrological and glacio-hydrological models.

1 Introduction 15 Assessing the amount of water resources stored in mountain catchments as snow and ice as well as the timing of meltwater production and the resulting runoff is of high interest for glaciological and hydrological investigations and hydropower production. Climate change induced shifts in snow and ice melt will alter the hydrological regimes in glacierized catchments in terms of both timing and magnitude of streamflow discharge. Longer periods of negative glacier mass balances first result in increased runoff due to the enlarged contribution of glacier melt later on, followed by a decline of runoff amounts as a con- 20 sequence of the reduced glacier area (e. g., Jansson et al., 2003;Beniston, 2003;Collins, 2008;Bliss et al., 2014). For some regions, this moment of "peak water" has already passed, while for others it is expected over the course of the current century (Salzmann et al., 2014;Bliss et al., 2014). Consequently, these catchments undergo a regime shift from icemelt-dominated physically based model, all parameters have a physical interpretation and can, in principle, be derived from field measurements.
In practice however, often few direct measurements are available and a number of required input parameters has to be interand extrapolated in space and time.
When validating hydrological models, commonly only runoff records are applied as direct field measurements. It is however well established that validating hydrological models by only comparing observed and simulated runoff at the catchment outlet 15 is not sufficient, as multiple parameter sets may yield the same results (the equifinality problem (Beven, 1993)) -for example, in glacierized catchments underestimations in simulated precipitation volumes may be compensated by increased ice melt contributions. This might be acceptable for short-term applications such as operational flood forecasting, where the main aim is to acquire accurate discharge simulations, however when applying such model parameterizations for long-term simulations (e. g., to determine future runoff changes due to climate change), the errors due to the misrepresentation of specific processes 20 can accumulate and the model provides misleading results. Hence, whenever possible multiple independent data sets should be used for model calibration and validation (e. g., Refsgaard, 1997;Grayson et al., 2002;Schaefli et al., 2005;Finger et al., 2015).
For this purpose, for snow-hydrological applications frequently satellite-derived snow extent observations are additionally used in the validation process (e. g., Parajka and Blöschl, 2008;Finger et al., 2011). As they are available operationally in high 25 temporal resolution (up to daily, however always with the constraint of frequent cloud obstructions), they allow to spatially assess snow accumulation and melt processes in the model. Additionally, incorporating measured glacier mass balances has shown to lead to more realistic process representations in models (however not necessarily in terms of improved runoff results) (e. g., Konz and Seibert, 2010;Finger et al., 2011;Schaefli et al., 2005;Schaefli and Huss, 2011). Glacier mass balance is an integral measure of the accumulation and ablation processes over the glacier in a defined period, but observations are either 30 available only for very few glaciers with direct measurements or are acquired using DEM differencing covering multi-year periods.
Since several studies have shown that especially accurately capturing the amount and distribution of snow accumulation during the winter is a prerequisite for reliable long-term runoff simulations (e. g., Huss et al., 2014;Magnusson et al., 2011), much work has been put in parameterizing models to better capture winter snow accumulation in terms of spatial distribu- 35 2 The Cryosphere Discuss., doi:10.5194/tc-2016-58, 2016 Manuscript under review for journal The Cryosphere Published: 5 April 2016 c Author(s) 2016. CC-BY 3.0 License. tion and volume. For this purpose, commonly point measurements of snow depth, snow water equivalent or precipitation are inter-and extrapolated, but information on the total water volume stored in entire mountain catchments is rare (e. g., Jonas et al., 2009). Measurements of solid precipitation are impaired with errors (e. g., Sevruk, 1986;Goodison et al., 1998), and the representativeness of point measurements for entire mountain catchments is uncertain (e. g., Grünewald and Lehning, 2011).
A relatively recent technology is to use lidar-derived surface elevation differences to obtain snow depth maps in high spatial 5 resolution (e. g., Deems et al., 2006;Grünewald et al., 2010;Grünewald and Lehning, 2011;Grünewald et al., 2013;Schöber et al., 2014;Helfricht et al., 2012Helfricht et al., , 2014a. Schöber et al. (2014) demonstrated that using lidar-derived snow water equivalent (SWE) maps in the calibration of a spatially distributed snow hydrological model significantly improved the results for simulated snow accumulation, compared to the assimilation of optical remote sensing data of the snow-covered area only.
In our study, we set up the physically based hydroclimatological model AMUNDSEN (Strasser, 2008) for a study region in 10 the highly glacierized Ötztal Alps (Austria), where a maximum amount of validation data is available: (i) mean areal precipitation over the period 1997-2006, (ii) time series of point-based snow depth recordings at several locations, (iii-iv) multitemporal snow extent maps acquired from Landsat and MODIS imagery, (v) the snow accumulation distribution for the winter season 2010/11 acquired using airborne laser scanning surveys, (vi) glacier-averaged annual surface mass balances for three glaciers in the study area, (vii) spatially distributed glacier surface elevation changes for the entire area over the period 1997-2006, and 15 (viii) hourly runoff records for several subcatchments. By utilizing these data sets, we present a concept for systematic model validation using the "observation scale" (Blöschl and Sivapalan, 1995) of temporal and spatial support, spacing, and extent.

Study site and data
The study site is located in the mountain region of the Ötztal Alps (Tyrol, Austria, fig. 1) and comprises the headwater catchments of the valleys Ötztal, Pitztal, and Kaunertal, which contribute to the streamflow of the river Inn. The combined area of 20 the investigated catchments is 558 km 2 , of which 480 km 2 are gauged. Elevations range between 1760 m a.s.l. at the lowest elevated runoff gauge and 3770 m a.s.l. at the top of Wildspitze, the highest summit of Tyrol. The Ötztal Alps are the most glacierized mountain region of the Eastern Alps, accounting for almost one third of the glacier area in Austria. According to the glacier outlines of the second Austrian glacier inventory (Fischer et al., 2015), in 1997 137 km 2 (24 %) of the investigated area were covered by glaciers. Only 11 km 2 (2 %) of the investigated area are covered by forests, and another 3 km 2 (0.5 %) 25 by shrubs (source: Land Tirol -data.tirol.gv.at). Main surface types besides the ice-covered areas are alpine grass vegetation, debris cover, and rock walls. Due to the touristic development and the production of hydropower in the region, a large number of weather stations and discharge measurements ensure an extensive basis of meteorological and hydrological data. Approx.
half of the study area (277 km 2 ) are catchments which supply water to the Gepatsch hydropower reservoir (see fig. 1). Table 1 lists the area, minimum/maximum/mean elevation, and glacierization (as of the year 1997) of the investigated catchments.

30
An airborne laser scanning (ALS)-derived digital elevation model (DEM) from the year 2006 (source: Land Tirol -data.tirol.gv.at) resampled to 50 m resolution was used as input for the model and the calculation of derived terrain variables (e. g., slope, aspect, sky-view factor, and catchment boundaries). Meteorological records (air temperature, precipitation, relative humidity, global radiation, and wind speed) in hourly resolution from 14 automatic weather stations in and surrounding the study region were used to drive the model. Initial ice thickness distribution of all glaciers in the Ötztal Alps was calculated using the approach by Huss and Farinotti (2012). The method is based on glacier mass turnover and ice flow mechanics and requires glacier outlines and a DEM. Mass balance gradients and constants recommended by M. Huss (pers. comm., 2011) were used to calculate volumetric balance fluxes of the individual glaciers. Required glacier surface elevations and glacier outlines of the 5 year 1997 exist from the second Austrian Glacier Inventory (Fischer et al., 2015).

The hydroclimatological model AMUNDSEN
The modular, physically based, distributed modeling system AMUNDSEN (Strasser, 2008) was applied for the simulation of the snow and ice surface mass balance. AMUNDSEN has been designed to specifically address the requirements of snow 10 modeling in mountain regions under climate change conditions and has already been extensively validated in various Alpine sites (Strasser, 2004;Pellicciotti et al., 2005;Strasser et al., 2008;Strasser, 2008;Hanzer et al., 2014;Marke et al., 2015).
As input data for the model, a digital elevation model of the model domain with a spatial resolution typically in the order of tens to hundreds of meters (the comparatively high resolution is necessary for adequately capturing the small-scale processes shaping the snow cover in complex terrain), as well as hourly to three-hourly recordings of the meteorological variables 15 air temperature, relative humidity, precipitation, global radiation, and wind speed are required. Several derived topographic parameters (slope, aspect, sky-view factor, openness) can either be preprocessed or calculated during runtime. In order to enable specific submodules (canopy module, evapotranspiration, runoff), various other spatial input fields (land cover, soil, catchment boundaries) have to be prescribed. The calculations presented in this study have been performed on a 50 m grid and with hourly meteorological recordings. 20 Interpolated fields from the scattered point measurements are -in the case of temperature, precipitation, humidity, and wind speed -obtained using a combined lapse rate-inverse distance weighting scheme, either using automatically calculated lape rates for each time step or (as in this study) using prescribed monthly lapse rates. In the case of radiation, first clear-sky global radiation is calculated following Corripio (2002) taking into account hillshading, transmission losses due to scattering (Rayleigh and aerosol scattering) and absorption (by water vapor, ozone, and other trace gases), transmission gains due to 25 multiple reflections between the atmosphere and the ground, and reflections from surrounding terrain. Subsequently, actual global radiation is obtained by correcting the clear-sky radiation with interpolated cloud factor fields (obtained using radiation recordings at the meteorological stations). Incoming longwave radiation is also derived following Corripio (2002) using parameterizations for the radiation fractions coming from the clear sky, from clouds, and from surrounding slopes.
Precipitation phase is then determined using a wet-bulb temperature threshold of T w = 2 • C. Wet-bulb temperature is calcu-30 lated by iteratively solving the psychrometric equation. Four types of snow/ice layers are distinguished in the model, namely new snow, old snow, firn, and ice, with each layer having distinct properties in terms of water equivalent, density, and albedo.
Fresh snowfall is always added to the new snow layer. New snow is converted to old snow when reaching a transition density 4 The Cryosphere Discuss., doi:10.5194/tc-2016Discuss., doi:10.5194/tc- -58, 2016 Manuscript under review for journal The Cryosphere Published: 5 April 2016 c Author(s) 2016. CC-BY 3.0 License. of 200 kg m −3 , old snow to firn always on September 30, and firn to ice when reaching a transition density of 900 kg m −3 .
Fresh snow density ρ ns is calculated as a function of air temperature T a [ • C] following Anderson (1976) and Jordan (1991): Snow compaction for the new snow and old snow layer is calculated following Anderson (1976) and Jordan (1991), taking into account the effects of compaction and metamorphism: with ρ s [kg m −3 ] being the layer (new snow or old snow) density, W * [kg m −2 ] the load of snow water equivalent (snow in the layer above and 50 % of the snow in the current layer), For the firn layer, a linear transition from firn to ice in 10 years is assumed, while ice density is kept constant at 900 kg m −3 .
Snow surface albedo α is parameterized following Rohrer (1992) taking into account snow age and temperature: where α min is the (prescribed) minimum albedo, α t−1 the albedo in the previous time step, and k a temperature-dependent 15 recession factor (implemented by prescribing two factors k pos and k neg for positive and negative air temperatures, respectively).
For the present study, fresh snow albedo was set to 0.85, while α min , k pos , and k neg for new snow and old snow were set to 0.55, 0.12, and 0.05, respectively. Firn and ice albedo were held constant with α firn = 0.4 and α ice = 0.2.
In forested areas, a canopy submodule optionally modifies the meteorological variables for inside-canopy conditions (Strasser, 2008) and accounts for the forest snow processes of interception, sublimation, and melt unload following Liston and Elder 20 (2006). Evapotranspiration over vegetated areas is calculated using the FAO Penman-Monteith approach (Allen et al., 1998).
The surface energy balance is calculated as with Q being the shortwave and longwave radiation balance, H the sensible heat flux, E the latent heat flux, A the advective energy supplied by solid or liquid precipitation, B the soil heat flux, and M the energy potentially available for melt. For a 25 detailed description of the calculation of the individual energy fluxes see Strasser (2008).
For the application in this study, the original model setup of AMUNDSEN was adapted to the mountain region of the Ötztal Alps by adding modules which enable a more realistic simulation of the catchment precipitation, of the timing of snowmelt (by considering cold content and liquid water content of the snowpack), and of runoff concentration. These are described in the following sections.

Precipitation correction
With respect to the undercatch of solid precipitation by common rain gauges (e. g., Sevruk, 1986;Goodison et al., 1998), a number of previous studies showed that the measured values of solid precipitation have to be corrected for systematic errors due 5 to wetting loss, evaporation loss, and wind-induced undercatch (e. g., Rohrer and Braun, 1994;Farinotti et al., 2011;Schöber et al., 2014).
A common and straightforward method to apply a correction for snow undercatch is to introduce a fixed snow correction factor (SCF) which is applied to the fraction of precipitation identified as snow. However, it has been shown that errors due to wind-induced undercatch are especially large at lower temperatures, where snowfall mainly consists of smaller particles which 10 are blown away more easily (e. g., Sevruk, 1983;Goodison et al., 1998). Using fixed SCF values thus tends to result either in underestimations of winter snowfall amounts or overestimations of snowfall events during spring and fall. A more robust method of snow correction is hence to introduce a variable correction factor derived from the meteorological variables (most importantly wind speed and temperature) measured at the gauge site.
For our study, we used the empirical correction for the Hellmann type precipitation gauge presented in Goodison et al. 15 (1998). Catch ratio CR (i. e., the fraction of the actual precipitation amount that is captured by the gauge) in percent is thereby Adjusted precipitation values are then obtained by dividing the original values by CR.
This correction is applied to the station measurements at each model time step prior to the spatial interpolation of precipita-20 tion. However, since model results indicated that the simulated snowfall amounts were still underestimated, a post-interpolation adjustment of solid precipitation using a fixed snow correction factor (for details see section 5.1) was additionally implemented.

Snow redistribution
With the standard method for the spatial interpolation of point precipitation measurements implemented in AMUNDSEN, obtained interpolation values are influenced by the respective grid cell elevation and its distance to surrounding weather stations. 25 However, it is well known that snow accumulation patterns in complex terrain are fundamentally influenced by topographic controls beyond elevation alone, most importantly being attributed to redistribution of snow by wind and gravitational forces (McKay and Gray, 1981;Blöschl and Kirnbauer, 1992;Grünewald et al., 2014;Bernhardt et al., 2010;Warscher et al., 2013).
Considering these processes is a prerequisite for reliable long-term mass balance simulations, hence the distribution of solid precipitation in AMUNDSEN has been updated using an empirical relation between SWE and topographic parameters.

30
Numerous studies have applied statistical models to explain snow cover variability using multiple regressions of topographic parameters such as elevation, slope, aspect, curvature, viewshed, and terrain roughness (e. g., Elder et al., 1991; Chang and  Li, 2000;Pomeroy et al., 2002;Winstral et al., 2002;Grünewald et al., 2013), However, Grünewald et al. (2013) showed that statistical relations between snow depth and topography are site-specific and performance decreases considerably when applying calibrated regression formulas to snow depth distributions in other catchments. Additionally, the topographic derivatives depend distinctly on the spatial scale used for calculation. Helfricht et al. (2014b) showed that the spatial variability of snow depth in a glacierized catchment is caused by a short-range variability based on small-scale terrain 5 roughness, but also by a long-range variability with respect to the glacierized and wind-sheltered cirques and valleys in contrast to wind-exposed mountain ridges.
In this study we used topographic openness (Yokoyama et al., 2002) for the parameterization of the spatial snow distribution.
Openness is a parameter originally developed to visualize topographic character and features in images, as it expresses the degree of dominance or enclosure of a location on an irregular surface. Topographic openness has two viewer perspectives in larger length scale was determined by manual optimization for the best fit with model results. The value of 5000 m is in the order of the mean ridge-to-ridge distance in Alpine terrain as applied by Marke (2008).
A linear relation was applied between the minimum and the maximum threshold of negative openness to derive the snow redistribution factor SRF, shown in fig. 2 (bottom): where 1.6 ⊥0.1Ψ denotesΨ clipped to values between 0.1 and 1.6. Hence, at least 10 % of the initial atmospheric solid precipitation 30 can be stored even in almost vertical slopes and in very exposed areas for the time of the precipitation event, while windsheltered areas can hold a maximum of 1.6 times the initial amount. At each time step, solid precipitation of each raster cell is multiplied with the corresponding redistribution factor. The new total amount of solid precipitation over the entire area is related to the initial precipitation amount in order to keep the total precipitation volume constant (mass conservation). Consequently, simulated accumulation is reduced in areas of low negative openness (i. e., exposed ridges, sheer rock faces) and increased in sheltered areas (i. e., cirques and low elevated valley floors).
To summarize, in total three steps of snow adjustments are applied: (i) the wind speed and temperature-dependent correction 5 of measured precipitation at the meteorological stations (eq. (7)), (ii) the additional post-interpolation snowfall adjustment using a fixed snow correction factor, and (iii) the adjustment using the snow redistribution factors acquired using eq. (10).
Whereas the first two steps are required to correct precipitation input towards a realistic precipitation volume, the latter does not change the total volume but rather redistributes the solid precipitation with respect to the terrain.

10
To account for temperature changes inside the snow cover, a parameterization for cold content and liquid water content based on the work of Braun (1984) has been added to the model. Meltwater is thereby not immediately removed from the snowpack, but a certain amount of liquid water (originating from surface melt or rain) can be retained in the snowpack. In the case of negative surface energy balances, this liquid water can refreeze. Further heat loss is used to build up a cold content, which needs to be depleted again before actual melt can occur. 15 Following Blöschl and Kirnbauer (1991), for the new snow and old snow layers the liquid water retention capacity of the snowpack was set to 10 % of the total snowpack weight, and the maximum possible cold content (expressed for simplicity also in mm w.e. -to convert it to an energy amount, the value is multiplied by the latent heat of fusion) to 3 % of the total snowpack weight for this study. The refreezing factor (the amount of heat loss that is used to build up the cold content) was set to 0.65. This value, slightly higher than the literature value of 0.5, was determined by comparison of observed and simulated 20 snow depth recordings.

Runoff concentration
For runoff concentration, a linear reservoir cascade approach (Nash, 1960) following Asztalos et al. (2007) was implemented in the model. Liquid precipitation and melt is thereby cumulated in each time step and catchment and routed through four parallel linear reservoir cascades for unglacierized areas, bare ice areas, firn-covered areas on glaciers, and snow-covered areas 25 on glaciers (the latter consisting of the AMUNDSEN "new snow" and "old snow" layers, which are not treated separately in terms of runoff concentration). A constant fraction f glacierized of the inflow into the snow, firn, and ice reservoirs, as well as a fraction f unglacierized of the inflow into the unglacierized reservoir is diverted into an additional soil reservoir. The parameters of the linear reservoir model (for each cascade the number of parallel reservoirs n and the storage constant k, as well as f glacierized and f unglacierized ) are determined by calibration separately for each catchment. Calibration is performed using an automatic 30 optimization routine with the aim to maximize the objective function i. e., maximizing the Nash-Sutcliffe efficiency NSE and minimizing the relative volume error V E (following Lindström (1997)). Table 2 shows the allowed ranges for the parameters during the automatic calibration.

Validation approach
For the validation of the individual model components, eight observation data sets are used: (i) mean areal precipitation values for the period 1997-2006 for the gauged catchments, (ii) daily to hourly snow depth recordings at five locations in the study 5 area, (iii-iv) snow-covered area maps acquired by Landsat and MODIS imagery, respectively, (v) the snow accumulation distribution for the winter season 2010/11 acquired using airborne laser scanning surveys, (vi) glacier-averaged annual surface mass balances for three glaciers in the study area, (vii) spatially distributed glacier surface elevation changes for the entire area over the period 1997-2006, and (viii) hourly runoff records for 8 of the investigated catchments. Besides the satellite-derived snow extent maps which only give binary information (snow yes/no), all other data sets also include volumetric information 10 about the water resources in the study area (in varying spatial and temporal scales).
According to Blöschl and Sivapalan (1995), any finite set of observations is accompanied by an "observation scale" in space and time, that can be defined by a "scale triplet" of support, spacing, and extent. Support is the integration volume or time of a single sample, spacing is the distance or time between individual samples, and extent is the total coverage in space or time of the entire data set. Table 3 lists the spatial and temporal support, spacing, and extent of the used validation data sets. 15 However, as the tabular representation of the scales makes it rather difficult to interpret and compare them, a visual method to display the observation scale of a data set is proposed: The six dimensions (support, spacing, and extent in both space and time) are arranged as axes on a radar chart, where the left and right halves of the chart represent the spatial and temporal dimensions, respectively. The ranges of the individual axes and their ordering (from low to high or from high to low values) are designed such that a "perfect" validation data set (i. e., having the lowest possible support and spacing, and the highest 20 possible extent) is represented by a regular hexagon with the maximum possible diameter (i. e., extending to the maximum axis extent in each dimension). This implies that the ordering of the axis values is not consistent (extent ranges from low to high values, while support and spacing range from high to low values), however it allows for an easier visual interpretation (bigger is always better). Fig. 3 shows the resulting axes including the possible values for our particular case study, while fig. 4 shows the resulting charts for the eight validation data sets used in this study. As the charts are supposed mainly for a qualitative 25 interpretation and comparison between the different data sets, they are shown without axis and tick labels in the latter.
The various data sets and the respective validation strategies are outlined in the following sections.

Areal precipitation
As a method to estimate long-term mean annual catchment precipitation for all gauged catchments to use as a validation data set for the respective AMUNDSEN simulation results, the OEZ approach (Kuhn and Batlogg, 1998;Kuhn, 2000) was used.

30
This method calculates catchment-scale precipitation P as the remainder from the water balance equation 9 The Cryosphere Discuss. in monthly time steps and has proven to be very robust for the simulation of decadal values (Kuhn, 2000). Evaporation E, typically comparatively small in glacierized catchments, is thereby approximated using constant values for snow and snow-free vegetation, respectively. For runoff Q, the respective measurements at the catchment outlets were used, while glacier volume changes ∆S were derived from the surface elevation changes between 1997 and 2006 according to the glacier inventories performed in these years (Abermann et al., 2009).

Snow depth
Comparisons with snow depth measurements at the locations of meteorological stations allow to evaluate model performance in terms of realistic representation of accumulation (adequate correction for gauge undercatch, snow/rain separation) and ablation (surface albedo evolution, cold content and liquid water content, surface energy balance) at the point scale, as well the conversion of snow water equivalent to snow depth. The latter has however already been evaluated in previous studies (e. g., Satellite-derived snow cover images allow to spatially validate simulation results in comparatively high spatial and temporal resolutions. For this study, we used a comprehensive set of Landsat (5/7) and MODIS scenes to derive snow extent maps for the study area. Landsat products are available in 30 m spatial resolution and a 16-day revisit time, while MODIS snow products are available daily from two satellites (Aqua and Terra), however with a coarser spatial resolution of 500 m. However, both Landsat and MODIS images are subject to frequent cloud obstructions, hence only a limited subset of the available scenes was 20 usable for this study.
With regard to Landsat data, a set of 26 suitable scenes covering the period 1998-2012 was manually selected. Besides the requirement of no or minimum cloud coverage over the study area, images acquired in the early morning hours especially during winter were not taken into account, as illumination effects from low sun elevation angles often pose problems in terms of sensor saturation on sun-facing slopes and topographic shadowing, making it difficult to retrieve snow cover. 25 To derive snow maps from the raw Landsat bands, first the digital numbers (DNs) of the individual bands were converted to top-of-atmosphere reflectances using the i.landsat.toar module from GRASS GIS (GRASS Development Team, 2012).
Subsequently, the reflectances were topographically corrected using i.topo.corr. Then, the normalized difference snow index (NDSI) (Hall et al., 1995) was calculated using the band ratio where green and SWIR (shortwave infrared) correspond to TM bands 2 and 5, respectively, for both Landsat 5 and 7. Several NDSI thresholds for the binary snow/no snow classification were subsequently evaluated, however the commonly chosen value of 0.4 (Hall et al., 1995) provided adequate results and was thus used for all 26 scenes. Following Hall et al. (1998), additional to the NDSI threshold a threshold of the near-infrared (NIR) channel (TM band 4) was used to avoid misclassifying water bodies as snow -pixels with NIR < 11 % are thereby never classified as snow.

5
The applied NDSI threshold generally also classifies ice surfaces as snow. To discriminate snow from glacier ice, for all glacier pixels (according to the used glacier mask) broadband albedo α was calculated from Landsat TM bands 2 and 4 (green and NIR) following the relation by Knap et al. (2010): where α 2 = TM 2 and α 4 = TM 4. Then, for each scene an α threshold for discriminating snow from ice was manually derived In contrast to Landsat, precomputed MODIS snow cover products are readily available for download (Hall et al., 2002). For our study we used the binary snow cover products MYD10A1 and MOD10A1 (for the Aqua and Terra satellites, respectively), which are available in 500 m resolution on a daily basis. To reduce the influence of clouds and misclassifications, Aqua and 15 Terra scenes for each day were merged into a composite image: if a pixel in one scene was classified as cloud-covered in one scene and cloud-free in the other, the respective value of the cloud-free scene was taken, while if one pixel was considered snow-covered in one scene and snow-free in the other, the pixel was classified as snow. MODIS snow products are calculated using a NDSI thresholding approach and hence also do not discriminate between snow and ice surfaces. Some studies have used the 250 m visible and near-infrared MODIS bands to classify snow and ice surfaces (e. g., Shea et al. (2013)), however 20 we found this method not applicable for our study, since the coarse resolution of the MODIS scenes makes it challenging to differentiate between snow and ice facies for the majority of (small) glaciers in the study area. As a pragmatic solution we excluded all MODIS scenes taken in the months July-September from the analyses, since it can be assumed that outside of this period the glaciers are snow-covered and MODIS "snow" pixels are actually snow rather than ice.
To compare simulation results with the satellite products, the daily AMUNDSEN SWE maps were converted into binary 25 snow cover images using an SWE threshold of 1 mm (this amount of snowfall is required to turn a summer landscape into "white winter" landscape). For the Landsat validation, the Landsat 30 m pixels were resampled to the 50 m model resolution, while for the MODIS validation the AMUNDSEN snow maps were resampled to the MODIS resolution using mode resampling, i. e., a 500 m pixel was classified as snow if at least 50 % of the 50 m pixels it comprises were snow-covered.
For the comparison and evaluation of observed and simulated snow cover patterns, we use the contingency table-based 30 efficiency criteria ACC, BIAS, and CSI (Zappa, 2008): For the definitions of n 00 to n xx see table 4. Accuracy ACC is the number of correct predictions divided by the total number 5 of samples (values between 0 and 1, with a perfect score being 1). This score, however, tends to be comparatively optimistic, as it usually yields high values both during winter (where most of the pixels are snow-covered) and summer (where most of the pixels are snow-free). A more sensitive score is the critical success index CSI, which is the number of correctly predicted snow events divided by the number of times where snow is predicted in the model and/or observed. Finally, BIAS corresponds to the frequency of correct snow predictions, i. e., the number of times where snow is present in the simulations divided by 10 the number of times where it is observed. Again, a value of 1 is a perfect match, while here values below 1 indicate that snow cover is underrepresented in the model, and values above 1 indicate that the model overestimates the snow cover.

ALS-derived snow distribution
While satellite-derived snow cover maps can be used to analyze binary snow coverage (snow yes/no) in high temporal resolution, lidar-derived surface elevation fields allow to obtain snow depth maps in very high spatial resolution. For this purpose, 15 two lidar surveys are required: one for mapping the snow-free terrain, and one for recording the snow-covered terrain. The difference between both surface elevations can be interpreted as snow depth (in case of glacier surfaces, the vertical component of the ice flow has to be evaluated and considered for error analysis (e. g., Sold et al., 2013;Helfricht et al., 2014a)).
Airborne laser scanning ( This relation has already been applied for the investigation area in previous studies and has proven to be robust, with average errors in the order of 10 % (Schöber, 2014).

Glacier mass balance
For three glaciers in the study region (Hintereisferner, Kesselwandferner, and Vernagtferner), long-term annual mass balance series are available. For these glaciers, the simulated specific annual mass balance is compared with the observations, as well as the cumulative values over the entire simulation period 1997-2013. In addition, using two DEMs of the study region acquired in 1997 and 2006 spatially distributed glacier surface elevation changes during this period were calculated, which allows 5 assessing the performance of the multiannual simulated mass balance for all glaciers in the study region over this period. The respective snow/ice volume changes as simulated by AMUNDSEN are thereby converted to surface elevation changes using the respective layer densities as calculated by the model (see section 3.1).

Runoff
Hydrological data from eight runoff gauges with adequate hourly records are used. Runoff is modeled in an hourly resolution Nash-Sutcliffe efficiency NSE, as well as the benchmark efficiency BE (Schaefli and Gupta, 2007): For BE, the benchmark model runoff Q bench is calculated as the multiannual mean observed runoff per calendar day and hour.
5 Results and discussion 20

Precipitation
Mean areal precipitation for the gauged catchments in the region for the period 1997-2006 as derived by closing the water balance using the OEZ method (see section 4.1) was compared to the respective AMUNDSEN simulation results.
As the applied correction of the precipitation recordings using wind speed and temperature (eq. (7)) still resulted in an underestimation of precipitation amounts (as indicated by comparisons with (i) the catchment precipitation simulated by closing step) was applied. By comparing precipitation amounts derived by closing the water balance with the respective AMUNDSEN results over all gauged catchments, this factor was set to a value of 1.15 (i. e., 15 % increase of snowfall amounts). Using this additional adjustment, AMUNDSEN mean areal precipitation for all gauged catchments deviates less than 1 % from the respective values as acquired by closing the water balance, as compared to a difference of −31 % when using uncorrected values ( can be considered satisfactory. The largest differences occur in the medium-sized catchments, with the maximum deviation being approx. 25 %, while the smaller catchments generally show the smallest deviations. In several of the following sections, results obtained using uncorrected and corrected precipitation are compared and evaluated. "Uncorrected" in this terminology corresponds to the elevation-dependent remapping of unaltered precipitation recordings alone, while "corrected" refers to the combination of the adjustment of snowfall amounts (eq. (7) and the additional increase 10 by 15 %) and the topographic snow redistribution (eq. (10)). Table 6 lists the performance measures R 2 , Nash-Sutcliffe efficiency NSE, and percent bias PBIAS for the comparison of daily snow depth observations at the point scale to the respective simulation results for five stations located in elevations between 871 and 2864 m a.s.l. for the model runs using uncorrected and corrected precipitation, respectively.

15
At the stations Obergurgl, Weisssee, and Pitztaler Gletscher, using corrected precipitation leads to significantly improved results in terms of all three performance measures. At the low-elevated stations Prutz and Nauders, the applied precipitation correction leads to a severe overestimation of snow depth of 223 and 120 %, respectively, on average. However, while these stations are within the simulation area (i. e., the rectangle surrounding the study catchments shown in fig. 1), they are located outside of the investigated catchments and are significantly lower elevated (minimum elevation of the study site is 1760 m). in forested areas the NDSI threshold is lowered in order to increase the classification accuracy (Klein et al., 1998), no such 35 15 The Cryosphere Discuss. a correct representation of snow/rain distinction is especially crucial, hence the lower performance scores for Landsat are not unexpected. Fig. 9 shows the observed and simulated snow distributions (i. e., observed surface elevation differences between the ALS acquisition dates converted to SWE using eq. (18), and simulated SWE differences for the same period) for the winter 2010/11.

10
A 746 km 2 large area was covered by the ALS acquisitions, resulting in a sample size of 298468 pixels (with 50 m resolution) available for model validation. As seen in fig. 9, despite the applied precipitation corrections the total snow volume is still underrepresented in the model by approx. 15 % in this particular period, however the applied snow redistribution parameterization leads to a significantly improved representation of the accumulation patterns (snow-free ridges, increased accumulation on glacierized areas) as compared to model results using elevation-dependent precipitation fields only. Fig. 10 shows the 15 high correlation of simulated with observed snow water equivalent values with an R 2 value of 0.57. This is a considerable improvement in model performance considering the results using uncorrected precipitation (elevation-dependent remapping only) (R 2 = 0.07) and using undercatch-corrected precipitation without topographic redistribution (R 2 = 0.23).
The obtained correlations are in the range of the results by Schöber et al. (2014), who obtained R 2 = 0.52 in a 166 km 2 large subcatchment of our study region, also using a cell size of 50 m. Grünewald et al. (2013) obtained R 2 values between 20 0.30 and 0.91 for different investigation areas between 1.5 and 28 km 2 , however using a coarser cell size of 400 m, and Schirmer et al. (2011) obtained R 2 = 0.42 for a site in Switzerland using 10 m resolution (however, both of these studies used statistical models based on topographic parameters alone to model snow depth distribution and did not employ snow cover models). Hence, considering the size of the investigated area and the high spatial resolution, our results can be considered very satisfactory. However, again it has to be emphasized that -as also shown in other studies (e. g., Grünewald et al. (2013)) - 25 the relations between snow depth and topography are site-specific and hence have to be calibrated for different study regions individually. In the case of the openness-based correction applied in this study, this concerns the choice of the scale parameters L for negative openness, the value range of negative openness for the different scale parameters (eqs. (8) and (9)), and the final relation to precipitation correction (eq. (10)). 2013 is captured very well for VF and HEF, with differences of only 39 mm (0.3 %) for VF and -377 mm (2.1 %) for HEF, while KWF shows the lowest performance with a deviation of -2633 mm (32 %). For all three glaciers, results improve with increasing simulation time, possibly due to the improved representation of the interpolated meteorological fields due to increased station density (cf. Schöber, 2014, p. 40f.). A meteorological station is located in direct vicinity to VF, which is a possible explanation for the distinctively good performance of the model with regard to this glacier. However, since no 5 specific calibration of the model for reproducing the glacier mass balances has been performed, the results can be considered satisfactory also for the other two glaciers and indicate that the model setup is suitable for glacier mass balance simulations at the regional scale. This relatively low value is in the same order of magnitude as the uncertainties arising from glacier volume change measure-15 ments. Hence, the model can be considered capable of realistically simulating the runoff contribution by glacier melt. Possible explanations for the deviation -beside the measurement uncertainty -are processes such as basal melt or collapsing glacier tongues, which are not represented in the model. Additionally, the albedo of the dirty glacier tongues covered with mineral dust and cryoconite might be overestimated by the model. Since no ice flow is considered in AMUNDSEN up to now, the spatial distribution of glacier thickness changes is biased at the low elevated glacier tongues and in high elevated accumulation areas. 20 Although glacier ice flow velocities slowed down in the past decades and the contribution of vertical ice flow is low at most of the glaciers in the study region (e. g., Helfricht et al., 2014a), surface elevation changes caused by ice flow accumulate to considerable values in multi-annual time periods. Hence, in order to prevent systematic model biases in long-term applications it is crucial to integrate a dynamic tracking of the glacier surface, which is proposed to be implemented in a future version of AMUNDSEN. 25

Runoff
For the runoff simulations, calibration and validation periods were generally 1998-2006 and 2007-2013, respectively, however for some catchments only shorter time series of runoff measurements were available. Table 7 shows the results (Nash-Sutcliffe efficiency NSE, benchmark efficiency BE, and percent bias PBIAS) of the simulations for the catchments in which respective hourly runoff observations were available, while fig. 14 shows the fractions of the individual runoff components as simulated 30 by AMUNDSEN using uncorrected and corrected precipitation, respectively, for the calibration period (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006). Fig. 15 exemplarily shows the observed and simulated runoff as well as the individual runoff components for the gauge Gepatschalm By means of the applied precipitation correction, NSE values for almost all catchments slightly improve as compared to the model run using uncorrected precipitation, with values of around or above 0.8, with the exception of two small catchments with very low glacierization (as the runoff module is specifically designed for glacierized catchments it is expected to perform worse in catchments with no or low glacierization). However, for most catchments even the use of uncorrected precipitation leads to comparatively high NSE values. This is typical for catchments with a strong runoff seasonality: the fact that NSE 5 uses the mean observed runoff as reference makes it a rather inefficient skill score in these cases -high NSE values can be obtained as soon as the seasonality is represented in the model, regardless if smaller scale fluctuations are misrepresented (Schaefli and Gupta, 2007). Looking at benchmark efficiency BE, variations between catchments as well as the effects of the applied precipitation correction are much more pronounced. Using corrected precipitation significantly increases BE in almost all catchments in both the calibration and validation period. Despite using corrected precipitation, in the two smallest 10 and least glacierized gauged catchments BE values are below zero, indicating that the multiannual mean per calendar day and hour as predictor would be better than the model. The total runoff volume bias PBIAS is (absolutely) smaller for the model runs with uncorrected precipitation (with runoff on average underestimated by approx. 5 % in both the calibration and the validation period) than when corrected precipitation is used, where runoff is on average overestimated by 6.4 % and 15.6 %, respectively, with even higher values for the two largest catchments. This can also be seen in fig. 14, which also illustrates vations for 8 catchments. A correction function for solid precipitation has been added to the model in order to compensate for the severe undercatch of precipitation measurements in high elevated and wind-exposed areas. This pre-interpolation correction of the recorded precipitation depths using temperature and wind speed has proven to significantly improve model results, however an additional snow correction factor of 15 % had to be applied to the resulting snowfall fields. A parameterization for cold content and liquid water content of the snowpack has been implemented to account for the delayed start of melting of the 30 snowpack following cold periods and nights with clear-sky conditions. Finally, a parameterization for snow redistribution using topographic openness has been added to the model, helping to significantly better represent the complex snow accumulation patterns in the study site as compared to assuming an elevation dependency alone. 18 The Cryosphere Discuss., doi:10.5194/tc-2016-58, 2016 Manuscript under review for journal The Cryosphere balances from field surveys as well as long-term mass balances derived from DEM differencing allow the evaluation of model performance in reproducing the long-term surface mass balance over the glaciers, which is essential prior to applying the model for scenario studies. Currently, an ice flow parameterization is being implemented in the model, which will allow to apply the model setup for climate change scenarios in order to estimate future runoff conditions and glacier retreat in the study site.
While the value of using multiple data sets for model validation has already been investigated in previous studies, our concept 15 of multi-level spatiotemporal validation represents a new approach for assessing glaciohydrological simulation accuracy on a fully comprehensive level. Usually, exemplary comparisons of single simulated model variables are utilized for validation, either (i) with time series of local recordings of the respective variable, (ii) with discrete (in time) shots of the spatial distribution of the variable, or (iii) runoff is compared to gauge recordings. The latter is a measure that summarizes all relevant processes (of accumulation, redistribution and ablation of snow and ice in our case), but does not allow for separating these processes. 20 Multi-level spatiotemporal validation is a systematic, independent, complete and -here -even fully redundant validation procedure for the simulated key variables in the model. The spatial distribution of the simulation result is validated in all six dimensions -temporal as well as spatial support, spacing, and extent -by means of two independent satellite data products, and thereby fully covering the entire shape of the respective radar chart hexagons (see fig. 4). The small remaining gap in temporal spacing could be filled by, e. g., using automatic cameras. Likewise, the water volume (i. e., the mass balance) of the 25 simulation results is validated by at least two independent validation data sources for each dimension, e. g., spatial spacing (in high model resolution) by long-term glacier mass balances and ALS. Redundancy of the validation data sources will only be seldomly available in most applications and in practice. It may serve here as additional proof of robustness and reliance for both the simulation results and the validation concept. This concept may serve as a next piece for a new generation of integrated environmental system models with coupled components, separately validated, and defined interface design in between them              (2008)). O and S denote observations and simulations, respectively, while the subscripts 0 and 1 correspond to snow-free and snow-covered situations, respectively.