Intercomparison of snow depth retrievals over Arctic sea ice from radar data acquired by Operation IceBridge

Since 2009, the ultra-wideband snow radar on Operation IceBridge (OIB; a NASA airborne mission to survey the polar ice covers) has acquired data in annual campaigns conducted during the Arctic and Antarctic springs. Progressive improvements in radar hardware and data processing methodologies have led to improved data quality for subsequent retrieval of snow depth. Existing retrieval algorithms differ in the way the air–snow (a–s) and snow–ice (s–i) interfaces are detected and localized in the radar returns and in how the system limitations are addressed (e.g., noise, resolution). In 2014, the Snow Thickness On Sea Ice Working Group (STOSIWG) was formed and tasked with investigating how radar data quality affects snow depth retrievals and how retrievals from the various algorithms differ. The goal is to understand the limitations of the estimates and to produce a well-documented, long-term record that can be used for understanding broader changes in the Arctic climate system. Here, we assess five retrieval algorithms by comparisons with field measurements from two groundbased campaigns, including the BRomine, Ozone, and Mercury EXperiment (BROMEX) at Barrow, Alaska; a field program by Environment and Climate Change Canada at Eureka, Nunavut; and available climatology and snowfall from ERA-Interim reanalysis. The aim is to examine available algorithms and to use the assessment results to inform the development of future approaches. We present results from these assessments and highlight key considerations for the production of a long-term, calibrated geophysical record of springtime snow thickness over Arctic sea ice.

tion, delaying the onset of surface ice melt. During the melt season, water from snowmelt pools into depressions to form melt ponds, which lower the surface albedo and further enhance the sea ice melt rate. This creates a strong positive feedback between the absorbed downwelling shortwave radiation and melt pond coverage on the ice surface (Kwok and Untersteiner, 2011). Further, available meltwater spreads over larger areas on smoother, seasonal ice compared to rougher, deformed ice (e.g., Fetterer and Untersteiner, 1998;Polashenski et al., 2012;Webster et al., 2015). When this meltwater drains through the ice cover and into the surface ocean (Polashenski et al., 2012), it decreases the salinity and density structure of the ocean, thereby affecting stratification and mixing.
From a remote sensing perspective, estimates of snow loading are required for accurate retrievals of ice thickness from sea ice freeboard . Current and planned satellite altimeters provide only sea ice freeboard (e.g., radar altimeters on CryoSat-2, AltiKa, Sentinel-3) or the combined snow and ice freeboard (lidars on ICESat, ICESat-2), with snow loading (depth and bulk density) left to be measured or modeled by other methods. However, routine measurements of snow depth and density over the Arctic Ocean are not available. Hence, there is extensive interest in the climatology, seasonal and interannual variability, and spatial distribution of snow depth for ice thickness retrievals, climate analyses and modeling, and forecast of sea ice behavior.
Previous understanding of snow depth over Arctic sea ice has been derived from various field surveys (e.g., Sturm et al., 2002Sturm et al., , 2006 and the climatology based on snow data from drifting stations that operated in 1937and 1954-1991(Warren et al., 1999. The W99 snow climatology is still widely used in ice thickness retrievals. Because of the wide-ranging importance of snow depth, remote determination of snow depth at almost any spatial scale has long been desired. The sensor suite of Operation IceBridge (OIB) (Koenig et al., 2010) includes an ultra-wideband snow radar that allows estimates of snow depth by resolving the range location of the air-snow (a-s) and snow-ice (s-i) interfaces. Early examination shows that snow depth can be estimated to an uncertainty of about several centimeters and that the mean snow depth is broadly consistent with the W99 climatology except over seasonal sea ice (e.g., Kurtz and Farrell, 2011;Farrell et al., 2012). To date, OIB has acquired 8 years (2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017) of radar data, including repeat surveys of the early spring snow and ice conditions in different parts of the western Arctic. These snow depth datasets from the OIB snow radar provide an unparalleled opportunity to examine snow depth across the Arctic Ocean and both its recent spatial and interannual variability.
Multiple algorithms now exist for the retrieval of snow depth, but they differ in how the air-snow and snow-ice interfaces are detected and localized in the radar returns and in how the system limitations are addressed (e.g., noise, resolution). In 2014, the Snow Thickness On Sea Ice Working Group (STOSIWG) was formed and tasked with investigating how radar data quality affects snow depth retrievals and how retrievals from the various algorithms differ. In this paper, we report on the assessment of retrievals from five algorithms by comparisons of retrieved snow depths with each other, with measurements from two field surveys, with modified climatology, and with snowfall from ERA-Interim products. The comparisons with field measurements allow a detailed assessment of the retrievals locally, while the comparisons with climatology and analyzed snowfall provide a large-scale multiyear perspective of their year-to-year retrieval consistency and robustness to changes in radar parameters and their relative agreements with basin-scale fields. The aim of this paper is to examine these algorithms and to use the assessment results to inform the development of the next-generation algorithm. The paper is organized as follows. The next section summarizes the snow radar, the five retrieval algorithms, and the datasets against which we compare the snow-radar data. The datasets include snow measurements from the two field campaigns, modified snow climatology, and snow depth estimates based on snowfall from ERA-Interim products. The retrieved snow depths are compared with the field surveys in Sect. 3 and with the modified climatology and ERA-Interim snowfall in Sect. 4. The last section concludes the paper with our recommendations for future approaches.

Data description
In this section, we describe (1) the snow radar and the steps followed in producing the echograms from which the airsnow and snow-ice interfaces are identified, (2) the data from two field programs, (3) the construction of snow depth estimates from climatology and from snowfall, and (4) the five retrieval algorithms.

Snow radar
In the spring of 2009, an early version of the snow radar was installed and flown on the NASA P-3B to survey Arctic sea and land ice from Alaska and Greenland. This version of the radar employed a fast-tuning, wideband voltage-controlled oscillator in a phase-locked loop (PLL) configuration that used a frequency-modulated continuous wave chirp signal from a direct digital synthesizer (DDS) as a reference to produce a fast-sweeping, linear, and ultra-wideband chirp capable of collecting high-resolution sounding measurements from fixed-wing aircraft . The measurements collected during these initial surveys demonstrated a new capability to routinely measure snow depth over sea ice and annual snow layering over land ice. Since then, the system has been routinely deployed for both Arctic and Antarctic airborne campaigns as part of NASA OIB and to support the University of Kansas' Center for Remote Sensing of Figure 1. (a) Snow-radar hardware block diagram. Strength and location of sidelobes in the system impulse response (b) before and (c) after deconvolution of the radar data acquired during 2012 OIB Arctic campaign. Family of curves shows dependence on peak-signal-to-noise ratio (PSNR in dB) and is constructed using all the radar echoes ( ∼ 4 × 10 6 ) from the campaign. Normalized returns are for every 1 dB increment of PSNR between 10 and 50 dB.
Ice Sheets (KU CReSIS) and other collaborators' field programs (e.g., Patel et al., 2017;Yan et al., 2017). Below, we describe the system hardware and software processing algorithms used to generate the resulting data products and the changes that have been incorporated throughout the evolution of the system to what is used today.

Hardware
Since 2009, the snow-radar hardware has been modified to improve system performance in terms of wider bandwidth, improved phase linearity at faster chirp rates, increased data acquisition rates, and real-time hardware processing. Figure 1 shows the system block diagram, although specific values have changed throughout the various OIB deployments. The basic components include a digital system, microwave chirp generator, microwave transmitter and receiver, and intermediate frequency (IF) receiver. The digital system generates a 600-900 MHz reference chirp from the second Nyquist band of a 1 gigasample s −1 DDS, which is then multiplied by a factor of 20 by the PLL to generate a 12 to 18 GHz Ku-band signal. The 12-18 GHz signal, which is typically used by the CReSIS Ku-band altimeter, is then down-converted using a mixer driven by a 10 GHz phase-locked oscillator to produce the system's 2-8 GHz transmit waveform. A directional coupler is used to replicate the chirp for the receiver. The receiver filters, amplifies, and mixes the received waveform with the replicated chirp to produce an IF signal. This IF signal consists of a collection of near-constant "beat" frequency components in which target delay is expressed by a simple relationship as the ratio of frequency by chirp rate. The spectral purity of each component is related to the frequency linearity of the 2-8 GHz chirp. Deviations from a linear frequency sweep result in reduced resolution and sidelobes. Finally, the sampled IF signal is coherently integrated or stacked four to eight times.

Processing
The raw data recorded by the snow radar is a stream of coherently integrated intermediate frequency signals time tagged with UTC time. In some datasets, an optional digital down converter (DDC) was used. The recorded data are post-processed using the following steps: 1. Locating IF spectrum: the IF spectrum is found by taking the discrete Fourier transform of the raw data and mapping this result to time delay based on the Nyquist zone and DDC settings. The raw data are windowed to reduce range sidelobes. Additionally, any DDC-induced modifications to coherent noise components must be compensated.
2. Coherent noise removal: there are several unwanted coherent components within the system resulting in noise signals that vary slowly with respect to time. These signals are estimated by analyzing the Doppler components of the data and then removed.
3. Platform altitude and position corrections: precision GPS and inertial measurement unit information is used to position the antennas along the flight track. Relative altitude variations of the antennas within the coherent averaging window are corrected so that the subsequent coherent averaging focuses the along-track beam pattern towards nadir.
4. Windowing/deconvolution: phase and amplitude nonlinearities of the system are estimated by analyzing data collected over specular returns produced by sea ice leads. A catalog of these responses is saved during a campaign and used to deconvolve nonlinearities to produce low sidelobe waveforms. Figure 1 illustrates the effectiveness of the deconvolution process in suppressing sidelobes (noted in Kwok and Haas, 2015) for data collected during the 2012 OIB Arctic campaign.
5. Coherent/incoherent integration: finally, additional coherent and incoherent integrations are used to improve signal-to-noise ratio (SNR), along-track resolution, and to reduce speckle.

Snow depth from field surveys
Coordinated surveys of field measurements and OIB overflights occurred in 2 years (2012 and 2014) under the auspices of two different programs. Both were located on landfast ice to minimize to the variability introduced by the spatial mismatch of the airborne and ground-based measurements due to ice drift. A brief description of these field programs is provided below.

BROMEX 2012
In coordination with OIB, in situ snow data were collected as part of the 2012 BROMEX (BRomine, Ozone, and Mer-cury EXperiment) near Barrow, Alaska (Nghiem et al., 2013;Webster et al., 2014). Field measurements were conducted on smooth, first-year sea ice on Elson Lagoon, a location where landfast ice undergoes little deformation. Following the OIB pass on 15 March 2012, snow depths were measured every 1-5 m in a two-dimensional layout along two transects using a GPS-enabled Magna probe unit (Sturm and Holmgren, 1999); the probe has an accuracy of 0.3 cm over level sea ice and snow (Sturm et al., 2006). The first transect (used here) consisted of three lines ∼ 1000 m in length, each 5 m apart, for a width of 10 m. Snow density was measured every ∼ 100 m with a Federal Sampler (Marr, 1940); the average density was 306 ± 91 kg m −3 . More information about the in situ data and field conditions is available in Webster et al. (2014).

Eureka 2014
During March and April of 2014, a coordinated flight and field-based campaigns (sponsored by Environment and Climate Change Canada) were carried out near Eureka, Nunavut, Canada, to evaluate OIB estimates of snow depth on landfast first-year ice (FYI). A predetermined set of 11 parallel OIB flight lines was executed within Eureka Sound, a large inlet separating Axel Heiberg and Ellesmere islands, on 25 March 2014. A spacing of 12 m between OIB flight lines ensured strong coincidence between the narrow swath of the snow radar and the point-based nature of the planned field measurements. Snow depths within the radar footprints were sampled after the overflights. An extended description of the 2014 Eureka study site and field campaign can be found in King et al. (2015). As part of the field campaign, a linear transect was established to characterize local-scale variations in snow depth and density in proximity to the OIB snow-radar footprint. Between the dates of 26 and 29 March 2015 a total of 37 320 snow depth measurements were made along the transect covering a distance of 46 km. Measurements of snow depth were made, also with Magna probe units (as above), spaced by approximately 2 m between samples. In addition to measurements in the along-track flight direction, orthogonal transects of up to 100 m were completed at random intervals to characterize variation in the radar across-track direction. Snow density was measured along the sampling transect at 174 locations with a coring device commonly referred to as the ESC-30 (Eastern Snow Conference 30 cm cross-section corer). Extracted cores were weighed in situ with a hanging scale to estimate bulk snow density and its water equivalent.
The observed snow layer along the 2014 Eureka transect was shallow, with a mean snow depth of 17.8 ± 9.9 cm. This general condition corresponded with spatial predominance of large undeformed FYI pans (62 % of ice regions under flight of the Eureka mission). Here, smooth ice topography and sustained Arctic winds allowed rapid redistribution of snow accumulation and limited mean depths to 16.0 ± 8.3 cm. Increased snow depth was associated with local regions of deformed ice where drifted accumulation was found in proximity to convergence features (i.e., rafting, rubble, and pressure ridging). Those areas were described as deformed FYI in King et al. (2015) and were shown to have a higher mean depth of 20.7 ± 11.4 cm. Transect variations in density were conservative, with a mean of 306 kg m −3 and standard deviation of 50 kg m −3 , which is comparable to the assumed climatological mean of ∼ 320 kg m −3 near the end of the winter.

Snow depth from snowfall and climatology
Averaged retrievals (25 km × 25 km) from the five algorithms are compared with both snow depths from snowfall in ERA-Interim products and a modification of the W99 climatology. We elected to use the ERA-Interim products here because it is one of the more broadly used reanalysis in climate work and compares reasonably with the modified climatology (see Sect. 4). Below, the procedures for constructing these daily fields of basin-wide snow depths are described.

Snow depth from ERA-Interim snowfall (ERAI-sf)
Fields of snow depth from ERA-interim snowfall are constructed following Kwok and Cunningham (2008). Daily snowfall on ice parcels (100 km × 100 km) across the Arctic Ocean is recorded on a daily basis. Ice drift is from optimally interpolated motion fields (described in Kwok et al., 2013). A daily cycle of accumulation and ice advection is carried out for each Lagrangian parcel, which mimics the process of snow accumulation over sea ice in motion. Starting from 15 August, the snowfall on each drifting parcel is recorded and accumulated through the end of spring. Surface conditions (air temperature and ice concentration) determine when and where snow is allowed to accumulate. Accumulation is permitted only when the ERA-Interim 2 m air temperature is below freezing and the AMSR-E ice concentration exceeds 50 %. When ice concentration drops below 50 %, the accumulated snow is removed from that parcel. As ice concentrations rarely drop below 50 % within the perennial ice pack, this condition is only relevant to the accumulation process over seasonal ice during the advance of the ice cover in the fall. Typically, the snow is thinner where the ice cover is formed later in the season (W99; Sturm et al., 2002;Webster et al., 2014). The snow density climatology of Kwok and Cunningham (2008) (a modified version of that used in W99) is used to convert snow water equivalent into snow depth. No initial snow cover, representing snow that survived the melt season, is added to the multiyear ice at the beginning of the accumulation season. Henceforth, we refer to snow depth from this procedure as ERAI-sf estimates.

Snow depth from modified climatology (modW99)
In this paper, we compute snow depth (h fs ) from the climatology (W99) following Kwok and Cunningham (2015) as We use the annual space-varying snow depth and density, h W s (X, t), from the snow climatology in W99. We note that this W99 climatology is from in situ data collected between 1954 and 1991, and it is considered to be representative of snow depth over multiyear ice, so it does not address the snow depth over the seasonal ice cover of the Arctic Ocean. To account for reduced snow depth over FYI, we follow Laxon et al. (2013), who used a fraction (α) of the climatological snow depth to represent the reduced snow accumulation over FYI identified by Kurtz and Farrell (2011). Here h s is dependent on the fractional coverage of FYI (f FY ), derived from ASCAT scatterometer data (following Kwok, 2004). The f FY retrievals from ASCAT impart time-varying spatial patterns upon these otherwise static climatological fields. While Laxon et al. (2013) used a fixed value of α = 0.5, here we use α = 0.7, based on a subsequent analysis of CryoSat-2 ice thickness (Kwok and Cunningham, 2015). Since the above construction represents a modification of the W99 climatology, we henceforth refer to these snow depths as modW99 estimates.

Retrieval algorithms
Five sets of retrievals from five separate algorithms are considered here. The first algorithm below produced the standard product (2009)(2010)(2011)(2012)(2013) that is archived at and distributed by the National Snow and Ice Data Center (NSIDC). The remaining algorithms were developed by members of the Snow Thickness On Sea Ice Working Group. Only brief summaries of each algorithm are provided here, and the reader is referred to the published literature for details on each of the algorithms. These retrieval algorithms differ in the way the air-snow and snow-ice interfaces are detected and localized in the radar returns. Figure 2 shows the different range locations of the interfaces in a collection of radar returns from Eureka; these differences are discussed in Sect. 3.5.
We also note that the following analysis includes only retrieval results contributed by those who elected to participate in STOSIWG at the time of this assessment; therefore, it does not include all relevant algorithms (e.g., Holt et al., 2015). Table 2 shows the datasets that were available and used herein.

Existing NSIDC product (NSIDC)
This algorithm was used for the Sea Ice Freeboard, Snow Depth, and Thickness data product that is archived at NSIDC with the following designation: IDSI4 . Details of the full algorithm methodology are described in Kurtz et al. (2013). Briefly, the algorithm is an empirical method that selects the a-s interface using a combined peak and threshold method. The a-s interface is taken to be either the first significant peak above a defined threshold or the first point when the rise in the radar return reaches a specified threshold if no peak is found. The method uses a linear scaling relation for the first version of the snow-radar data collected in 2009 to select the thresholds used in the algorithm (as described in Kurtz and Farrell, 2011). The location of the s-i interface is defined as the maxima in the radar signal below the a-s interface.

GSFC-NK
The GSFC-NK algorithm was used for the processing of quick-look IceBridge data in 2014-2016 (https: //nsidc.org/data/docs/daac/icebridge/evaluation_products/ sea-ice-freeboard-snowdepth-thickness-quicklook-index. html). Full details of the algorithm methodology are described in the product documentation at NSIDC. The algorithm is a waveform fitting method that follows the algorithm used for CryoSat-2 surface height retrievals described by Kurtz et al. (2014). A modification to that algorithm accounts for the different instrument characteristics of the snow radar and accounts for coherent scattering of the return using a heterogeneous flat patch surface model (Brown, 1982). The algorithm fits a model waveform to the snow-radar data using a bounded trust region method (Coleman and Li, 1996). Both the a-s and s-i interfaces are selected from the model fit results. This algorithm is highly sensitive to parameters used in the fitting process, and these are selected to provide a balance between the processing time needed and the quality of the fit obtained. The most important of these parameters are the initial guess and model fit bounds for the a-s and s-i interfaces along with the maximum number of iterations used in the fitting process. The initial guesses for the interface locations are taken from the empirical algorithm of the existing NSIDC product, and the bounds for the possible interface locations are restricted to be within ±1.5 ns of the initial guesses. The maximum number of fit iterations was set to 300. Due to the large processing time needed for the algorithm, it was run on every other point for this comparison. The model fit quality was determined by calculating the sum of the waveform power divided by the squared norm of the fit residual, and fits with a value less than 0.2 were discarded.

SRLD
The Snow Radar Layer Detection (SRLD) algorithm was first developed for layer detection on land ice, in which the detection of the a-s interface and s-i interface is determined prior to the subsequent detection of deeper layers within the firn (Koenig et al., 2016). Over sea ice, only the a-s and s-i interfaces are returned by the algorithm. The initial application to existing OIB snow-radar data from various campaigns (2009)(2010)(2011)(2012) and the need for it to be applicable to future campaigns required a process that would adapt to the data and not be dependent on fixed thresholds in the radar return signal. The SRLD method uses the gradient between the open-air return values and the maximum return value to locate the two interfaces. The a-s interface is taken to be where the presence of snow has caused the radar return level to be elevated above the open-air values, near the beginning of the gradient. The s-i interface is taken to occur where the transition from snow to ice has produced a maximum peak in radar return level -at the end of the gradient. The open-air value is first determined by taking a sample median of the values above the surface for the data frame being analyzed. The median of the peak values is used, along with the open-air value, to define the gradient to which a threshold is then applied for the determination of the a-s interface. These gradient endpoints, comprised of the median air value and median peak value, adjust with the data from frame to frame. In the current implementation, the midpoint radar power level is used as the threshold. The point at which this radar power level maps onto each radar return profile determines the a-s interface, while the peak level in the return determines the s-i interface.

Wavelet
The Wavelet algorithm, described by Newman et al. (2014), operates on each trace (column) of an echogram independently and has three components: interface detection through the use of the Haar wavelet-continuous wavelet transform (Haar-CWT), topographic filtering using the h topo parame-ter (to mitigate against heavily deformed ice topography on the radar point of closest approach), and the assignment of precision to each derived snow depth using radar system parameters.
The Haar-CWT is optimized for detecting abrupt transitions within a signal, such as those arising from interface returns, with the largest Haar-CWT coefficients localized at the largest magnitude signal transition. To detect the a-s and s-i interfaces, the echograms were preconditioned in different ways. To detect the a-s interface the logarithm of the echogram results in the largest magnitude signal transition occurring at the a-s interface. To detect the s-i interface the echogram is left in its original form, wherein the largest magnitude signal transition occurs at the s-i interface. The Haar-CWT is then applied to each echogram trace and the resulting coefficients are summed over a range of different scales -from the broad localization of the interface at large wavelet scales to precise localization at small wavelet scales. The interface location is assigned to the location of the maximum of the summed coefficients. The benefits of this algorithm are that (1) it is robust and does not depend on a set of fixed thresholds and (2) the interface detection process is not affected by changes in transmitted power and receiver noise, which vary both during and between different OIB flight campaigns.
Interface picks associated with h topo values greater than 50 cm are ignored as they are associated with heavily deformed ice topography, such as sea ice pressure ridges, where derived snow depths have been shown to be unreliable due to uncertainty in the radar scattering surface. Interface picks associated with h topo values of less than 50 cm are deemed valid and converted to snow depth by considering two-way travel time in the snow pack and the permittivity of snow.
Derived snow depths are assigned a precision based upon concurrent snow-radar system parameters. Snow depth precision is calculated for each derived snow depth independently, with the final precision value the sum of terms relating to the SNR at the a-s interface, the signal-to-clutter ratio at the s-i interface, the fast-time range bin spacing, and the bandwidthdependent range resolution of the snow radar.

JPL
This is a simplified version of the algorithm by Kwok and Maksym (2014) that deals with residual system sidelobes in the returns. Since a reprocessed version of the radar dataset with suppressed system sidelobes is now available for all years except 2013, this aspect of the algorithm has been disabled.
In this algorithm, both the s-i and a-s interfaces are detected and localized by determining the significance of each local maximum above the noise floor in individual echo returns. Significance is determined by the strength and width of the local maxima (power) and its associated leading/trailing edges relative to the expected noise power of the system. The system bandwidth controls the width (or sharpness) of a local maximum and the rate of rise of its leading edge. The algorithm uses these system-dependent parameters to adapt to the changes in the radar system as the bandwidth and noise level of the snow radar have progressively improved over the course of the OIB mission. The highest significant peak in the echo profile is designated as the return from the s-i interface. Returns from a-s interfaces are assumed to be weaker and are the first significant range returns, determined using the above criteria, above the s-i interface. Once the interfaces are detected, the radar range to the interface is localized in an oversampled (by 16 times) version of the echo return; this reduces the range error in the identification of the local maxima in the echo return. From a scattering perspective, this restricts the detected returns to a-s interfaces that are more specular and appear as a detectable peak, rather than just the strength of the leading edge.

Comparisons with field surveys
In this section, we first describe the comparison approach and the expected variability of the differences given the statistics of the retrievals and field measurements. Next, results from comparisons with measurements from BROMEX and Eureka are discussed. Lastly, we contrast the absolute range locations of the a-s and s-i interfaces from the different algorithms, which provides insights into the preferred location on the echo profile that each algorithm designates as an in-terface. To account for the reduced propagation speed of the radar wave in the snow layer, all radar measurements below were converted to snow depth, assuming an end-of-winter bulk density of 320 kg m −3 (W99); snow depth estimates are relatively insensitive to uncertainties in bulk density (see error estimates in .

Comparison approach
The spatial correlation length scales (at ρ = 0.5) of the point samples from both BROMEX and Eureka are short (∼ 5-7 m; Fig. 3). Hence, it is essential to select an averaging length scale that is more compatible with the coarser resolution of the snow-radar retrievals (5-10 m). We select an averaging radius of 20 m to allow for and to reduce the sensitivity of the comparisons to uncertainties in the snowradar footprint as well as to accommodate for uncertainties in the spatial overlap between the snow-radar footprint and the point samples from the field measurements. The number of in situ snow depth measurements in each averaged field sample for BROMEX and Eureka is 21 ± 5 and 15 ± 4, respectively. Moreover, a 20 m radius represents an averaging of ∼ 9 radar spots along track, assuming a nominal spacing of ∼ 5 m. The pre-and post-averaging spatial statistics of the field data are shown in Fig. 3. For both datasets, the correlation length scale becomes broader (> 20 m) and variability is reduced. The resulting mean standard deviation (i.e., mean of the distribution of σ f , see   within a 20 m circle reduced from 6.1 ± 1.5 to 2.2 ± 1.2 cm and for the Eureka data from 7.1 ± 3.7 to 1.9 ± 1.6 cm. To compare the averaged measurements (at a 40 m spacing), we take the difference between the estimates of the snow radar (d sr ) and field (d f ) as d = d sr − d f . Assuming that these two estimates are random variables that are uncorrelated and normally distributed, the variance of d is expected to be σ 2 d = σ 2 sr + σ 2 f . The contribution of σ f is expected to be bounded by the values in the distribution shown in Fig. 2, and the contribution of σ sr is discussed below. With the bounds on the two expected variances, σ d can be estimated for assessment of their differences. Figure 4 shows the comparison between four algorithms (NSIDC, GSFC-NK, SRLD, and JPL) with BROMEX field measurements. These results for the first transect are summarized in Table 4. The Wavelet retrievals were not available for this assessment. The OIB overflight of BROMEX covered a short distance of ∼ 1000 m. Along this track, the field-measured snow depth (averaged) range between 15 and  30 cm. Except for the mean differences, it is apparent that the along-track variability in snow depth is reproduced reasonably well, as measured by the correlations between the radar and field estimates (0.45-0.67). Of note is the lower scatter and reduced sample size in the distributions of σ sr for both the NSIDC and GSFC-NK retrievals prior to averaging (Fig. 4). This pattern can be attributed to the fact that both these algorithms used averaged waveforms in their retrievals, rather than raw waveforms provided in the radar data, and thus the data have already been smoothed and subsampled before the retrieval process. The averaging of the SRLD and JPL retrievals, in contrast, reduced σ sr by a factor of ∼ 5. In any case, the standard deviation of the differences of ∼ 3-4 cm (Fig. 4) is approximately what one would expect when calculating (σ 2 d = σ 2 sr + σ 2 f ) using the values of σ f in Fig. 2a and values of σ sr in Fig. 4 (center panels). In these comparisons, the mean differences vary between a maximum of −5 cm (GSFC-NK) and a minimum of +0.3 cm (NSIDC). These relative differences will be discussed in Sect. 3.5. Table 4. Comparison of radar snow depth with field measurements at BROMEX and Eureka using retrievals at locations that are common to both algorithms. In the matrix presentation below, the snow depth retrieved by the row algorithm (at locations common to the column algorithm) are used in the comparisons with field data. Note that the matrix is not symmetric. Quantities in italics are from

Comparisons with snow depth from Eureka
The comparisons with Eureka field measurements are shown in Fig. 5 and the results are summarized in Table 3. This comparison includes many more samples from the multiple OIB tracks and more extensive sampling of the surface compared to the BROMEX survey. Averaged snow depth ranges from ∼ 5 cm to more than 45 cm. This richer dataset tests the skill of five algorithms over a broad range of snow depths. Variability is between 0.29 and 0.72, as measured by the correlations between the radar and field estimates. The NSIDC and GSFC-NK algorithms have lower correlations, as their retrievals tend to underestimate snow depth; these two algorithms seem to be insensitive to snow depths that are more than ∼ 20 cm in this case. The SRLD, Wavelet, and JPL re-trievals have comparable correlations (0.66-0.72) with field data. Broadly, these algorithms tend to overestimate snow depths below ∼ 10 cm, with higher variability in retrievals over thicker snow (Fig. 5). Except for the NSIDC retrievals, the standard deviation of the differences of ∼ 4-5 cm is approximately as expected when calculated using the values of σ f in Fig. 2b and values of σ sr (σ 2 d = σ 2 sr + σ 2 f ; not shown here). In these comparisons, the mean differences vary between a maximum of −6 cm (GSFC-NK) and a minimum of −1 cm (NSIDC).
We also examined the impact of surface roughness on the quality of the retrievals from the different algorithms (Table 3). We define roughness as σ f , as defined above and as the standard deviation of the snow depth of the field samples used in creating the spatial average. If samples with rough-ness > 10 cm were not included in the comparisons, there would be a consistent decrease in the standard deviation of the differences for all retrievals. This suggests that all the retrievals seem to be affected by surface roughness, consistent with the results reported by King et al. (2015). However, the correlation values did not increase for all algorithms (NSIDC: 0.29 to 0.34; GSFC-NK: 0.45 to 0.61; SRLD: 0.66 to 0.71; Wavelet: 0.72 to 0.67; JPL: 0.63 to 0.60).

Algorithm-dependent filtering strategies: BROMEX and Eureka
For both the BROMEX and Eureka comparisons, the different algorithms do not provide the same number of retrievals (Table 3) because of algorithm-dependent filtering strategies for removing low-quality estimates. More conservative strategies reduce the retrieval rate, which may bias these comparisons. We address this concern by examining the retrieval results for only those samples that are common to a pair of algorithms. Table 4 shows the results from all pairings of the five algorithms. Broadly, the changes in the comparison measures (i.e., ρ and "Diff") did not change significantly for both BROMEX and Eureka. Compared to the diagonal element of the matrices in Table 4, the mean differences and standard deviations (relative to the field data) remain similar even when only the subset of common samples was used. Hence, we conclude that the statistics in Table 3 are fairly robust measures of the retrievals and relatively insensitive to the filtering strategies devised.

Different approaches to localize the air-snow and snow-ice interfaces
As observed earlier, the comparisons show that the biases (or mean differences) for individual algorithms are consistent (similar in magnitude) at both BROMEX and Eureka. For example, the mean differences of the SRLD and JPL retrievals are approximately +2 and −2 cm, respectively, at both field sites (see Table 3). Thus, on average, the SRLD snow depth retrievals are 4 cm higher than the JPL retrievals. These patterns suggest that systematic biases exist within individual algorithms due to their range determinations of the a-s and s-i interfaces in the echograms. In each retrieval algorithm, a particular characteristic of a leading edge (see Fig. 3b) or return peak is typically used to determine the range point to be the location of an interface. In this case, the same characteristic has to be used consistently for both the a-s and s-i interfaces, otherwise the range distance between the two interfaces will be biased.
Here, we compare the range locations of the two interfaces selected by four of the five algorithms to assess the contribution of algorithm-induced biases seen in the observed mean differences at the Eureka site (Fig. 6). The left panel (Fig. 6) shows whether there are mean differences in the retrieved range distances (calculated from the range locations of the two identified interfaces) from two algorithms. We first discuss the comparison of the SRLD and JPL retrievals (Fig. 6f) since this is the simplest case, where the range distance calculated by the SRLD algorithm is always greater than that obtained by JPL. Here both algorithms use the peak in a given echogram as the location of the s-i interface, so they have a narrow distribution of differences in the range location of that interface (∼ 1 cm). Nearly all of their difference in retrieved snow depth can be attributed to differences in the location of the a-s interfaces (Fig. 6f). The SRLD algorithm locates the a-s interface on the leading edge, while the JPL algorithm located it at the local peak following the leading edge. Relative to the JPL a-s interfaces, the SRLD a-s interfaces are always displaced toward the radar, i.e., in the near and therefore shorter range. Thus, the range distance to the JPL a-s interface is always higher. The same arguments apply to explaining some of the systematic differences in the range location observed in the other comparisons shown in Fig. 6. This pattern suggests that systematic choices made in the localization of the interfaces in each algorithm largely explain the consistent mean inter-algorithm differences (Table 3). The related question of where on the echo profile one should pick as the location of an interface will not be addressed here.

Basin-scale assessments
We next assess the retrieved OIB snow depths from all Arctic campaigns at a longer averaging length scale (> 10 km). First, we summarize the overall retrievals from four algorithms available for this analysis (NSIDC, GSFC-NK, SRLD, and JPL). Second, the consistency of retrievals at crossovers and repeat tracks (12.5 km) is examined. Last, we compare the averaged retrievals (at 25 km) with the fields of snow depth constructed from ERA-Interim snowfall and modified climatology (modW99). These large-scale comparisons provide a broad assessment of the spatial and interannual variability of the snow-radar retrievals, along with their relative agreement with the two reconstructed fields of snow depth. As seen in the spatial maps and distributions (Figs. 7 and 8), the snow depth within the MYI cover north of Greenland and next to the Canadian Arctic Archipelago is highest in all the retrievals. In general, the average snow depth over the seasonal ice is thinner than those within the MYI cover. Of note is that the snow depth in the southern Beaufort Sea is always the thinnest, especially for those tracks between 2012 and 2015. The differences in retrievals from the four algorithms are examined in more detail in Sect. 4.3.

Comparisons at crossovers and repeat tracks
Here we examine the consistency of the snow-radar estimates (12.5 km averages) at available crossovers of the OIB flight tracks over the 7 years of available data. Figure 9a shows the retrieval crossover differences (up to 164) and the time separation between them. When the time separations are short (< 10 days), the differences are at the centimeter level and quite consistent (high correlation) for all algorithms. Even with the expected spatial variability, these differences remain stable over the 7 years and thus indicate the general consistency of individual retrieval algorithms. Crossover consistency also suggests a relatively long, isotropic spatial correlation length scale for snow depth. That is, at longer length scales snow depth varies slowly spatially and is likely due to synoptic-scale patterns. As expected, when the time separation increases, so do the differences. At separations of more than 20 days, the differences are generally higher and positive, which seems consistent with the initial expectation that changes could be attributed to snowfall. However, differences are also likely due to advection of ice parcels with thicker snow covers into the crossover point. As most of the crossovers are located in regions with fairly large spatial gradients in snow depth, attribution of the observed differences is difficult.  Figure 9b compares the snow depth at four available repeat tracks. These near-exact repeat tracks are of outbound and return segments flown during the same flight, so the time separations are typically less than a few hours. Except for ice drift, the snow radar should be acquiring data over similar snow and ice conditions. The difference distributions show that the mean differences are at most ∼ 2-3 cm. The sign and magnitude of the differences are consistent across all the retrievals and suggest that these are valid mean differences in the estimated snow depth. These comparisons also suggest that even though the algorithms produce self-consistent results, the mean snow depth can be quite different between the algorithms, as is evident in the difference in retrieved snow depths from the four algorithms (Fig. 9b).

Comparisons with snow depth from ERA-Interim snowfall (ERAI-sf) and modified climatology (modW99)
The snow depth estimates from ERAI-sf, modW99, and their differences along OIB flight tracks are shown in Fig. 10a. The large-scale patterns are similar to the thinner snow over seasonal ice and thicker snow in regions with higher MYf, especially north of Greenland and the Canadian Arctic Archipelago. Relative to the modW99 estimates, the ERAI-sf snow depths are broadly lower over the entire Arctic Ocean ice cover, except in 2012. Figure 11 shows the relative interannual variability (IAV) of the mean snow depth of the two fields over 7 years for the same three sea ice categories. As expected, the IAV values of modW99 fields are lower than those from ERAI-sf, because the modW99 estimates are a static monthly clima- vected into this region, which was used in the construction of the modW99 fields but is absent in the ERAI-sf fields (because the MYf is not used in the estimates). Away from the coastal zones and away from the transition zone between the MYI and seasonal ice cover, these differences are less pronounced. Figure 11 also shows the IAV in the mean snow depth from the four algorithms. Three of the algorithms (NSIDC, GSFC-NK, and SRLD) have relatively large IAV compared to the two constructed fields. In the mean, the NSIDC snow depths are lower than the constructed fields. The large increase/decrease in the GSFC-NK retrievals in 2012/2013 is much larger than the expected IAV of ∼ 6 cm between March and April (W99). Over the 7 years, the SRLD retrievals show significant snow depths trend in all three categories of MYI coverage, which seems unrealistic. The IAV values of the JPL retrievals are more similar to those of the constructed fields. Figures 12-15 show the spatial comparisons with the constructed fields. Variability in the differences between the retrievals and ERAI-sf and modW99 is expected. The mean snow depth from the snow radar represents the average of all age types within a track, whereas the ERAI-sf and modW99 estimates represent the accumulated snow depth since the beginning of the season with no consideration of wind-driven redistribution, loss into leads, or the introduction of seasonal ice of variable age (due to deformation of the ice cover). If any of those factors were significant for a given year, it would change the mean snow depth observed by the snow radar.
Broadly, the IAV of the retrievals for the different ice categories in Figure 11 is consistent with the variability in the difference maps (Figs. 12-15).
The contrast in IAV in retrievals from the four algorithms underscores the importance of multiyear assessments. The large IAV values of retrievals from a given algorithm suggest issues in algorithmic robustness in adapting to changes in radar data quality. Since the snow radar has changed and improved over time (Table 1), the retrieval algorithms must adapt to changes in data quality. It is important to note that even though the retrievals may be internally consistent for a given algorithm, as suggested in Sect. 4.2, whether algorithms are sensitive to changes in radar data quality or radar hardware parameters depends on the specific characteristics of each algorithm.

Conclusions
In this study, we compared the retrievals from five different algorithms with each other and with snow depth measurements from two field surveys, modified climatology, and snow depths derived from ERA-Interim products. The intercomparisons amongst different retrievals and comparisons with field measurements allow a detailed assessment of the retrievals, while the comparisons with climatology and analyzed snowfall provide a broader multiyear perspective of their interannual retrieval consistency and robustness to changes in radar parameters and their relative agreements R. Kwok et al.: Intercomparison of snow depth retrievals over Arctic sea ice with basin-scale fields. We reiterate that the aim of the work was not to select the best algorithm, but rather to provide results that would serve to inform the development of the next-generation retrieval algorithm. The next step is to harmonize the differences in the present approaches and to provide an improved snow depth product. Below we highlight the salient points from the above intercomparisons and important considerations in the development and assessment of retrieval algorithms.
-When comparing snow-radar retrievals with point snow depth samples, it is essential to select an appropriate length scale such that the differences in the comparisons are not dominated by the geophysical variability. Hence, the fairly large number of point samples within a radar footprint is crucial in generating the averaged field dataset.
-Comparisons with BROMEX and Eureka field data show that, even though there are residual biases, the profile of along-track snow depth can be reproduced and the sensitivity of the snow radar to the a-s and s-i interfaces in the varied conditions at the two field sites can be evaluated.
-Retrieval algorithms differ in the way the a-s and s-i interfaces are detected and localized in the radar returns, and algorithm choices made in the localization of the interfaces by each algorithm mostly explain the consistent mean differences seen in the intercomparisons and in the comparisons to field data. A related and important question regarding where on the echo profile one should pick as the location of an interface is not addressed here.
-Most of the retrievals were able to reproduce the expected spatial pattern of higher snow depth in MYI regions north of Greenland and near to the Canadian Arc-tic Archipelago, along with the thinner sea ice cover toward the Beaufort Sea to the southwest.
-The examination of retrievals at crossovers and repeat tracks shows internal consistency of individual retrieval algorithms. These comparisons also suggest that, even though the algorithms may produce self-consistent results, the mean snow depth may be very different between algorithms, so this metric cannot be used as a measure of the overall performance of a retrieval approach -Over the 7 years, several algorithms have relatively large interannual variability that is higher than expected by climatology (W99). Large interannual variability of retrievals from a given algorithm suggests issues in algorithmic robustness in adapting to changes in radar data quality. The contrast in interannual variability in retrievals from the four algorithms underscores the importance of multiyear assessments, especially when the instrument performance has changed and improved, as is the case here. Of note is that the 2009 snow-radar data has lower SNR, and the processed 2013 data had large, persistent sidelobes that affected the retrieval algorithms. To provide a long-term, consistent record of snow depth for understanding changes, algorithms with set thresholds may need to be tuned appropriately for the duration of the record. The new 2-18 GHz system was flown this past spring (2017), and with the associated large changes in bandwidth and SNR it is likely that the algorithm adaptations to these hardware improvements will be significantly more important in future campaigns.
Data availability. The datasets from NASA's Operation Ice-Bridge are provided by the National Snow and Ice Data Center Distributed Active Archive Center, Boulder, CO, USA (https://doi.org/10.5067/G519SHCKWQV6).