Improved Arctic sea-ice thickness projections using bias corrected CMIP5 simulations

. Projections of Arctic sea ice thickness (SIT) have the potential to inform stakeholders about accessibility to the region, but are currently rather uncertain. The latest suite of CMIP5 global climate models (GCMs) produce a wide range of simulated SIT in the historical period (1979–2014) and exhibit various biases when compared with the Pan-Arctic Ice–Ocean Modelling and Assimilation System (PIOMAS) sea ice reanalysis. We present a new method to constrain such GCM simulations of SIT via a statistical bias correction technique. The bias correction successfully constrains the spatial SIT distribution and temporal variability in the CMIP5 projections whilst retaining the climatic ﬂuctuations from individual ensemble members. The bias correction acts to reduce the spread in projections of SIT and reveals the signiﬁcant contributions of climate internal variability in the ﬁrst half of the century and of scenario uncertainty from the mid-century onwards. The projected date of ice-free conditions in the Arctic under the RCP8.5 high emission scenario occurs in the 2050s, which is a decade earlier than without the bias correction, with potentially signiﬁcant implications for stakeholders in the Arctic such as the shipping industry. The bias correction methodology developed could be similarly applied to other variables to reduce spread in climate projections more


Introduction
Global climate models (GCMs) are the primary tool for making climate predictions on seasonal to decadal timescales, and climate projections over the next century (Flato et al., 2013). In a warming climate, changes to sea ice thickness (SIT) are expected to lead to significant implications for polar regions and beyond. A reduction in SIT will likely open up the Arctic Ocean to economic diversification including new marine shipping routes  and extraction of natural resources, as well as changes to the Arctic ecosystem and potential links to mid-latitude weather (Francis and Vavrus, 2012). Many of these economic opportunities may rely on SIT evolution, but current projections have considerable uncertainty. SIT is also much more informative than sea ice concentration (SIC), especially in the central Arctic, where future thinning can occur without major changes in the local SIC.
The GCMs from the Coupled Model Intercomparison Project, phase 5 (CMIP5) (Taylor et al., 2012) exhibit a large range in sea ice volume (SIV), spatial SIT distribution, and temporal SIT variability under present-day forcing conditions (e.g. Blanchard-Wrigglesworth and Bitz, 2014). For September sea ice extent, Swart et al. (2015) showed that the uncertainty in CMIP5 projections over the next few decades is dominated by these differences between models, termed "model uncertainty" by Sutton (2009, 2011). Uncertainty in climate projections arises from three distinct sources: (1) model uncertainty, (2) internal variability, and (3) scenario uncertainty, as discussed by Sutton (2009, 2011) for temperature and precipitation respectively. In contrast to projections of temperature where the anomalies are often used, the absolute value of SIT is importantfor example, ships have critical SIT thresholds above which their use is not possible .
Bias correction (BC) of GCM simulations has the potential to reduce the differences between models and hence potentially increase confidence in near-term climate projections. The importance of BC in impact-based climate change studies was described in a special report of the IPCC (Seneviratne et al., 2012), but BC has not previously been applied to projections of SIT; this manuscript is novel in that it recalibrates SIT, and does it locally. There are many different types of proposed BC techniques (e.g. Boe et al. (2009) ;Christensen et al. (2008); Ho et al. (2011); Mahlstein and Knutti (2012); Vrac and Friederichs (2014); Watanabe et al. (2012), and references therein) which have mainly been applied to temperature and precipitation. However, these existing methods need refining for sea ice as SIT is a particularly challenging variable. This is due to its positive semi-definite nature, and the spatial and temporal occurrence of zeros, in observations and projections of SIT.
This study addresses the development of a new BC technique that constrains both the mean and variance of SIT in GCMs to an estimate of the observed statistics. It is important to correct the mean as this corrects the spatial SIT distribution. Variability in SIT also has a significant impact on the simulated range of regional ice-free dates, something of great interest to stakeholders, and the CMIP5 GCMs exhibit a wide range in their SIT variability. The study also uses multiple ensemble members from the same model when performing the BC, something that is often not utilised in other studies. This is important as it enables an assessment of the role of internal variability in future projections to be made. The techniques described in this paper are not limited to SIT, and would work for many climate variables. The exact implementation used in this study should also be calibrated to the user's needs based on factors such as the length of reliable observations and number of ensemble members.
In this paper we use the Pan-Arctic Ice-Ocean Modelling and Assimilation System (PIOMAS) (Zhang and Rothrock, 2003) as a reanalysis-based estimate of recent SIT, along with climate projections from a subset of six GCMs from the CMIP5 archive (Sect. 2). We first test the performance of increasingly complex BC approaches in a "toy" model environment (Sect. 3) and then apply our favoured method to the subset of CMIP5 GCMs in Sect. 4. We test the BC method by splitting the historical PIOMAS data, and then explore how the range in SIT projections is reduced using these techniques (Sect. 4) and summarise and discuss the results in Sect. 5.

PIOMAS
To represent observed SIT, we use estimates from the PI-OMAS reanalysis. PIOMAS is a coupled ice-ocean model that is forced with the National Centers for Environmental Prediction (NCEP) atmospheric reanalysis, and assimilates satellite observed sea ice concentration (Lindsay and Zhang, 2006) and sea surface temperature (Schweiger et al., 2011). It does not however assimilate sea ice thickness (SIT), although this has been attempted using the NASA Operation IceBridge and SIZONet campaigns of 2012 (Lindsay et al., 2012).
As a reanalysis, PIOMAS is constrained by the quality of the assimilated observations. Lindsay et al. (2014) force PI-OMAS with four different atmospheric reanalysis products producing differing results. Schweiger et al. (2011) found biases in PIOMAS of 0.26 m in autumn and 0.1 m in spring when compared with ICESat (Zwally et al., 2002) although the spring bias is within the range of uncertainties found by Zygmuntowska et al. (2014). Larger differences are found in the areas of thickest ice, north of Greenland and the Canadian Archipelago, with ICESat retrievals around 0.7 m larger than PIOMAS. However in this region PIOMAS agrees better with in situ data (Schweiger et al., 2011). Zygmuntowska et al. (2014) suggest that this discrepancy is due to the choice of sea ice density in ICESat, and they support this explanation by finding lower discrepancies between PIOMAS and CryoSat-2 (Laxon et al., 2013) which utilises an alternative sea ice density value. Stroeve et al. (2014), in a comprehensive study of SIT across CMIP5 and observations, find that the spatial correlations in thickness between CMIP5 models and PIOMAS are generally higher than those between CMIP5 models and ICESat. It should be noted that these results will be sensitive to the data set chosen to represent observed SIT.
We choose PIOMAS to represent estimates of SIT as satellite observations are limited in their spatial and temporal range. For example, data from ICESat are only available between October and March 2003-2008(Kwok et al., 2009). More recently CryoSat-2 has started producing real-time SIT data sets but only for the non-summer months (Tilling et al., 2015). This is also not ideal as it is the summer and autumn months when the ice is thinnest that are most relevant for potential economic activity. The spatial consistency, temporal length, and completeness of the data are important considerations when computing climatological means and variances, as the longest time series possible is needed to validate the statistics. It is primarily for this reason that PIOMAS has been chosen to represent observations in this study. Several studies (e.g. Laxon et al. (2013), Schweiger et al. (2011), Lindsay andZhang (2006), and Stroeve et al., 2014) have compared PIOMAS to satellite and in situ observations and models and find it a suitable estimate of observed SIT. PI-OMAS is also deemed realistic enough to initialise numerical models for seasonal forecasts e.g. the Sea Ice Outlook (Blanchard-Wrigglesworth and Bitz, 2014) where the accuracy of the initial conditions is vital. Figure 1 shows the mean September SIT and temporal standard deviation (SD) after linear detrending for PI-OMAS over the satellite era . In the heart of the Canadian Archipelago, PIOMAS ice thickness is up to 1.5 m, which is reasonable when compared to Haas and Howell (2015) who measured ice along the Northwest Passage in May 2011 and April 2015 using airborne electromagnetic induction soundings, and to Tilling et al. (2015) who used CryoSat-2 for October and November 2010-2014. North of Greenland SIT exceeds 3.5 m, which is again comparable to CryoSat-2 for October and November 2010-2014 and is between 0 and 1 m along the north Russian coast. The SIT is most variable around the edge of the ice pack and especially near land. An effective BC should ensure that the simulations replicate these patterns of mean SIT and SD over this recent period.

Global climate models
This paper utilises a subset of six GCMs from CMIP5. Since a large part of this work assesses SIT variability, it is necessary for each GCM to have multiple ensemble simulations in the historical period and for each of the representative concentration pathways (RCPs) 2.6, 4.5, and 8.5 for future scenarios (Van Vuuren et al., 2011). In addition, the GCM mean spring thickness must fall within the 10th and 90th percentile of PIOMAS (Stroeve et al., 2014), have a reasonable spatial resolution, and a somewhat resolved Canadian Archipelago. A consistent spatial distribution of land is needed for realistic and spatially complete multi-model means. The six GCMs that comprise this CMIP5 subset are listed in Table 1.
For the CMIP5 subset the historical simulations are used for the period 1979-2005. In most of the analysis for the period post-2005, the RCP8.5 scenario is used, which ramps up the amount of greenhouse gases to have a cumulative effect of increasing the direct radiative forcing by 8.5 W m −2 (approximately 1370 ppm CO 2 equivalent) by 2100 (Van Vuuren et al., 2011). The impact of other scenarios is compared later in the analysis. Figure 2 shows the 1979-2014 ensemble-mean September SIT for the CMIP5 subset, highlighting the considerable differences between the model sim- ulations, and indicating that model bias is likely to be the dominant uncertainty in near-term projections.
The aim of the SIT BC outlined in this paper is to correct the mean and variance in the CMIP5 subset shown in Fig. 2 to the PIOMAS statistics. Although this should improve short-term predictions, a caveat to this approach is that PIOMAS only yields one realisation of the past (see Lindsay et al. (2014) for discussion of PIOMAS forced with alternative atmospheric forcings). We have to assume that the relatively short period over which we have observations (36 years) captures a representative sample of the behaviour we expect from the climate system. In the short term, this is probably a reasonable assumption, as the GCMs will not have evolved far from their corrected state of the recent past; this assumption is explored further in Sect. 4.

Bias correction methodology
Bias correction methods effectively aim to reduce model uncertainty by constraining GCMs to observations. There are two components to model uncertainty: the overall mean difference (or bias), and differences in the amplitude of response to specified forcings. We have deliberately chosen not to try and correct the simulated ice loss trend to that which PIOMAS depicts. Our reasoning is to keep this as prescribed by the different GCMs because the response of the SIT to future warming is unknown, likely non-linear, and the GCMs are designed to give an estimate of this. It is also doubtful how well the forced current trend can be determined from 36 years of data given the high noise to signal ratio for trends, especially on grid point scales. It is also uncertain how much of the recent ice loss seen in the observations can be attributed to changes in external forcing as opposed to internal variability, although previous studies have attempted   . "Ice-free" is defined as the first occurrence of any ensemble member below 0.15 m. The ice-free ensemble range is shown; i.e. the year of the first ensemble member to be ice-free to the last ensemble member to be ice-free. The black line represents "observations"; the blue and red lines represent high and low ice models respectively. The thin coloured lines represent ensemble members, and the thick lines represent the ensemble mean.  (2015), Swart et al. (2015) and Zhang (2015). We are also cautious of overfit-ting; applying a trend correction would potentially result in an over-confident projection.
To test the performance of different possible BC methods, a toy model was used as proxy ensemble time series (representing SIT at a single grid point for the same month each year for the period 1979-2100). The time series are shown in Fig. 3a for a high-mean-high-variance model (blue) and a low-mean-low-variance model (red), where the black line shows the "truth" (observations with one realisation over the historical period only). The time series were all produced using a first-order auto-regressive (with an AR(1) parameter of 0.3 chosen to be representative of CMIP5 SIT autocorrelation) model imposed on a declining linear trend with negative numbers reset to zero. Each model has five separate model ensemble members (thin coloured lines); the thick lines represent the ensemble means. The statistics in all the legends are calculated over the observation window . "Ice-free" in Fig. 3 is here defined as the first occurrence of an ensemble member below 0.15 m. Shown is the ice-free ensemble range, i.e. the year of the first ensemble member to be ice-free to the last ensemble member to be icefree. A successful BC method should transform the individual ensemble members (thin red and blue lines) to match the mean and variance of the observations (black line), producing matched statistics. We test various approaches for such a bias correction. The mathematical notation for the following equations is in Table 2.

Additive correction
A basic additive correction, which has previously been used for temperature projections, is shown in Fig. 3b. This approach simply corrects the time mean by subtracting the difference between the historical model ensemble-mean time mean, M h , and observation time mean, O h , from each of the model ensemble members, M.

Additive corrected thickness
However, as the low ice model is adjusted up by the addition of a constant, it equilibrates at a positive value in the future rather than zero. Likewise the high ice model equilibrates at negative values. Neither of these properties are sensible.
This study makes use of multiple ensemble members from the same model, raising the question of how to treat ensemble member statistics when calculating a particular GCM's bias. For calculating the mean SIT, each GCM's ensemble mean is used because it is the GCM's mean bias that we wish to correct. This is important because a particular ensemble member's deviation from the ensemble mean is retained; it allows an individual ensemble member's time mean to be different to the observations over the historical period, but not the ensemble mean. The treatment of ensemble members for the SD calculation is described in Sect. 3.4. Temporally detrended x over the historical period σ Standard deviation

Multiplicative correction
If a multiplicative correction is used (Fig. 3c), where the ratio of the observed time mean and model ensemble-mean time mean, O h / M h , is multiplied as a factor to the model ensemble members, M, then the corrected thickness is as follows.

Multiplicative corrected thickness
Multiplicative methods effectively preserve the future zero ice year, which is potentially an important value for a wide range of stakeholders. However, when applied as above this approach has the undesired effect of distorting the variances by the same factor as the mean correction, as visible in Fig. 3c.

Mean multiplicative correction
To avoid altering the variances, the mean multiplicative correction can be introduced (Fig. 3d), where the multiplicative mean correction, O h / M h , is applied only to the 11-yearcentred running-mean ensemble mean, M . This corrects the model mean evolution without corrupting the sub-decadal variance as M is smoothed. The model anomalies for each ensemble member, M − M , are then added back to the corrected mean evolution.

Mean multiplicative corrected thickness
This works to correct the mean SIT and does not suffer from any peculiarities of the previous two methods. The model variance now remains unchanged but the approach opens up the possibility of correcting the variance towards that observed in the historical period. Note that by using the ensemble mean, M h , for all these corrections we ensure that each ensemble member is corrected in the same way, thus preserving certain ensemble properties into the future.

Mean and variance correction
The GCMs from CMIP5 show a large range in SIT variance, and the magnitude of these variations is a significant factor determining when regions of the Arctic may first become accessible (when one ensemble member may first become ice-free). Therefore a variance correction is incorporated into Eq. (3) by taking the ratio of the temporal standard deviation of the detrended observations, σ O h , to the square root of the ensemble mean of the variance of the detrended model ensembles, σ M h (detrended mean ensemble SD), over the historical period. The detrending in the models is calculated using each model's ensemble mean linear trend. This has some similarities to the approach of Ho et al. (2011) in application to temperature projections for Europe. See also Appendix A for some further discussion of the choices made.
To incorporate the variance correction, the mean multiplicative correction (Eq. 3) is first de-trended, the variance correction applied, and the trend re-applied. This creates the Mean And VaRIance Correction (MAVRIC), shown in Eq. (4). Figure 3e shows the MAVRIC does a near-perfect job of correcting both the mean and variance to the observed statistics while still retaining the individual ensemble members' own climate fluctuations, but being fractionally scaled by the variance ratio.
Comparing the ensemble range in projected ice-free date between the correction methods, it is apparent that although the shapes of time series have qualitatively changed, this does not always result in a different range in projected ice-free date. For example the difference evident on comparing the high-mean-high-variance GCM (blue) between (a) to (c) and (b) to (d) is partly coincidence and partly due to how the four correction methods shown manipulate the time series. The MAVRIC method (e) results in a unique set of ice-free dates. This is an important attribute that the MAVRIC method displays, as the ice-free date is of vital importance to stakeholders in the Arctic and more basic methods of bias correction fail to appropriately adjust this parameter. Figure 3e illustrates that the MAVRIC successfully corrects the mean and variance in a toy model environment. Before proceeding to investigate the impact of the MAVRIC on SIT projections, it is prudent to test whether the MAVRIC can improve GCM performance by validating with PIOMAS. We use CSIRO-Mk3.6.0 (CSIRO) as the GCM to test. The ice in CSIRO generally has too much areal coverage and too little variability and is a CMIP5 outlier model with regards to SIT (Stroeve et al., 2014). However, CSIRO benefits from having 10 ensemble members, increasing the robustness of the statistics. For these two reasons, it is considered a thorough test of the MAVRIC's performance within a real GCM.

Bias corrected sea ice thickness projections
The test uses a data denial method where we train the MAVRIC on a subset of PIOMAS observations, 1979-1999, termed the "calibration window". From this we examine how the MAVRIC predicts the observations for 2000-2014, termed the "validation window". A limitation of this method is the length of observations: the period over which the MAVRIC calibration takes place must be long enough to capture a robust measure of the observed statistics. The validation period must also be long enough to be able to draw robust conclusions. It is not clear whether either the 21year calibration or the 15-year validation windows are long enough for robust method calibration and results verification, but we are limited by the data available. An additional limitation to this method is that the calibration and validation periods are very close to each other. Figure 4 shows the performance of the MAVRIC at three grid points for September. The raw CSIRO ensembles (grey) are bias-corrected via the MAVRIC using the PIOMAS observations (black) over the calibration window, producing the MAVRIC corrected ensembles (green) for the validation window. If the MAVRIC can produce plausible predictions, the characteristics of PIOMAS should be indistinguishable from individual corrected ensemble members in the validation window. It is clear from the validation bean plots (right), that the distribution from the corrected ensembles resembles PIOMAS much more closely than the raw distribution, e.g. non-zero probability of zero ice. We do not expect the distribution from PIOMAS to match the corrected distribution perfectly as PIOMAS only has one realisation (15 data points) while CSIRO has 10 realisations. We can tentatively accept that this test demonstrates the validity of the MAVRIC approach.
In the following sections the MAVRIC is applied to the CMIP5 subset of six GCMs used in this study (Table 1). PI-OMAS estimates of Arctic SIT are available from 1979 to 2014. This 36-year window is the period over which statistics are calculated in the observations, and in the CMIP5 subset (using historical runs for 1979-2005and RCP8.5 for 2006. Each model, month, and grid point has its own specific correction which is applied to all years (1979-2100). However, separate ensemble members from the same GCM are treated with the same correction, as we wish to correct the model bias and retain the ensemble spread. Results are shown for September, initially only for CSIRO and later for all six models combined to form the CMIP5 subset used for this study.  lar SIT, requiring only a small mean adjustment; however CSIRO requires a big increase in variance. The MAVRIC moves the first possible ice-free date about 30 years earlier and increases the ensemble range from 32 to 63 years. It is worth noting that the dominant cause of this shift to an earlier ice-free date at this location is due to the variance correction term in the MAVRIC rather than the mean correction term. This highlights the importance of correcting the variance in addition to the mean. Figure 5 demonstrates that the MAVRIC can lead to simulations that look significantly more like reality in the historical period and have an impact on regional ice-free projections.

Historical spatial perspective
In addition to examining the MAVRIC in a temporal sense, it is important to evaluate the results spatially to see where the MAVRIC is having the most effect and if it works at all locations. Figures 2 and 6 show that the mean September SIT distribution is very different in HadGEM2-ES and CSIRO.
After the MAVRIC has been applied, the mean SIT fields are almost identical for the historical period (Fig. 6). It is important to note there are still differences when considering individual years and ensemble members i.e. the year-to-year variability and ensemble spread is preserved (although adjusted by the MAVRIC). Figure 6 also shows the SD before and after the MAVRIC. The SD shown is the detrended mean ensemble SD as before. CSIRO has a variability that is too low in the majority of locations, although it correctly places the maximum SD near the edges of the ice pack similarly to PIOMAS. HadGEM2-ES exhibits about the same magnitude of variability as the observations but the variability is too high in the centre of the ice pack and too low at the edges. After the correction, the SD fields in both GCMs now look more similar to each other, with the highest variability located at the edge of the ice pack and at coastal locations. They are now also both similar to the estimate from PIOMAS (Fig. 1).

CMIP5 subset multi-model sea ice thickness projections
The bias-corrected SIT from each GCM can be brought together to form the multi-model mean CMIP5 subset, computed using three ensemble members (the maximum available across all models) from each of the six GCMs for the historical and future decadal periods (Fig. 7). It is remarkable how the raw multi-model mean product for the historical period is not too different from PIOMAS in Fig. 1, showing that the location and magnitude of model biases cancel out to a considerable degree, at least with this subset of models. Given this result it is not so surprising that the raw and corrected fields are fairly similar for the future projections also. Nevertheless, even in this multi-model multi-ensemble framework the MAVRIC is still making some discernible differences. These differences are most apparent in the Canadian Archipelago and the Russian Arctic seas, where the correction leads to a reduction in SIT of approximately 1 m in both regions. Both the raw and bias-corrected fields predict a SIT loss of about 0.25 m per decade.
The fact that the MAVRIC is still making a significant difference on the regional scale is critical, e.g. for ship route availability. Currently, studies that assess the future opening of Arctic shipping routes, which critically depend on the ab- solute value of SIT, do not yet account for such factors and will need to be reassessed.

Sources of uncertainty in projections of sea ice thickness
The uncertainty in climate projections can be partitioned into three distinct sources: (1) model uncertainty: for the same radiative forcing, different models simulate different mean distributions and temporal changes; (2) internal variability: the natural fluctuations of the climate present with or without any anthropogenic induced changes to radiative forcing; (3) scenario uncertainty: uncertainty in future radiative forcing resulting from unknown future emissions. Sutton (2009, 2011) assessed these sources of uncertainty in global and regional temperature and precipitation projections, and here we quantify the sources of uncertainty in SIT, utilising the CMIP5 subset multi-model ensemble. Crucially we use the absolute values of SIT rather than considering anomalies as is often done for other climate variables. The methodology for partitioning these sources of uncertainty is detailed in Appendix B. An additional source of uncertainty that we neglect here is the PIOMAS calibration uncertainty emerging from the choice of atmospheric reanalysis and model tuning. This could be assessed by sampling the different versions of the PIOMAS reanalysis described in Lindsay et al. (2014). They find the different versions are broadly similar and can be accounted for by appropriate tuning of the ice model component. This bias in PIOMAS itself will introduce systematic biases to the MAVRIC projections. This bias is not a flaw in MAVRIC however but a limitation intrinsic to the observational data set one is correcting to. The MAVRIC method outlined in this study acts to eliminate the model bias in the MAVRIC calibration period . After this period the model uncertainty grows due to the GCM's differing responses to changes in external forcing. The sources of uncertainty for SIT for the decade 2015-2024, immediately following the MAVRIC calibration period, are shown in Fig. 8. The total uncertainty in the corrected CMIP5 subset is strikingly lower than in the raw CMIP5 subset. Closer analysis reveals that this is due to the substantial reduction in model uncertainty owing to the MAVRIC. The other sources of uncertainty do not change as much.
The temporal evolution of these sources of uncertainty is shown in Fig. 9a by taking the median variance from each of the panels in Fig. 8 for this and other periods. There are three competing factors for how the uncertainty will change with time. First, the SIT is decreasing, and this will reduce the uncertainty as the range of values of which the SIT can occupy shrinks. Second, the separate GCMs' simulated SIT responses due to external forcing will differ from each other, causing GCMs to drift apart over time. Thirdly, sea ice at the grid point scale becomes more mobile and vulnerable to external factors as it thins. This will increase variability, initially at least (Sou and Flato, 2009). All of these factors are involved in the evolution of the uncertainties.
The raw CMIP5 subset exhibits a decrease in total uncertainty with time (dashed black in Fig. 9a). This is primarily due to the reduction in model uncertainty (dashed blue), likely because the mean SIT is reducing. The corrected total uncertainty is lower than the raw uncertainty until at least the end of the century. This means that the MAVRIC can reduce the model spread (or bias) and so may potentially increase confidence in climate projections of SIT throughout this period. The corrected model uncertainty increases for the first three decades, as the models start from a similar state and subsequently diverge because of differing responses to the changes in external forcing. Later the corrected model uncertainty reduces as the mean SIT decreases towards zero.
The total uncertainty is the sum of model uncertainty, internal variability, and scenario uncertainty (see Appendix B for more details). The other panels in Fig. 9 illustrate the relative importance of these sources of uncertainty in terms of the percentage total variance explained, for the raw data, and after the MAVRIC. Figure 9b illustrates that in the raw projections, model uncertainty remains the dominant (> 50 %) source of uncertainty until at least 2100, whereas it only becomes dominant for a few decades mid-century after the MAVRIC (Fig. 9c). The absolute magnitude of internal variability, and its contribution to the total uncertainty, decreases with time because SIT also decreases with time. In the corrected projections, the internal variability is the major contributor to the total uncertainty for the first 25 years, compared to a maximum contribution of only 26 % in the raw projections. This highlights the importance of correcting the variance to realistic magnitudes and also the key role of natural variations in predicting the near-future evolution of sea ice. The scenario uncertainty accounts for less than 10 % of the total uncertainty for the first 50+ years. Additional analysis metrics on the improvement that the MAVRIC method affords can be found in Appendix C.
Although we have demonstrated here that the MAVRIC method reduces the model uncertainty as seen by the reduction in spread of projected SIT with our selection of GCMs, we acknowledge that this may not necessarily correspond to a reduction in uncertainty in the real world.

Reduced spread in timing of ice-free conditions
By reducing the model spread the range of possible outcomes has been reduced, this potentially leads to greater confidence in SIT projections. Figure 10 shows the raw and corrected CMIP5 subset SIV* projections until 2100 using the 18 multi-model ensemble members in each scenario as before (* calculated here does not consider spatial SIC as it is not bias-corrected). To find a representative SIC for the SIV* calculation we use the September SIC in CCSM4 RCP8.5 and find a mean (of the non-zero grid cells) SIC of approximately 50 % for 2006-2100.
The thick coloured lines show the multi-model scenario mean and the coloured regions represent the 16-84 % (equivalent to 1σ around the mean of a Gaussian distribution) of the ensemble members. To account for the large range in SIT at any particular time in the CMIP5 subset, we use a method similar to that of Massonnet et al. (2012) to calculate first icefree conditions. We postulate that SIV for ice-free conditions is 1 × 10 3 km 3 , which is in agreement with previous studies calculating first ice-free dates (e.g. Massonnet et al. (2012) and Overland and Wang, 2013), and is equivalent to 1 m thick ice for an ice extent of 10 6 km 2 .  Panels (a, b) show the projected SIV* from all six models (18 ensemble members total) in both the raw and corrected GCMs (11-year running mean), and shaded regions are the 16th-84th percentiles. Panel (c) shows the number of ensemble members having passed the ice-free threshold. Panel (d) shows the statistics of (c), with the whiskers representing the range (1st and 18th ensemble member ice-free), the box capturing the 16th-84th percentiles, and the bold line showing the median (9th ensemble member). Ice-free is defined as the first year the pan-Arctic SIV* dips below 1 × 10 3 km 3 for a particular ensemble member. *Volume (SIV*) is calculated using a constant 50 % SIC throughout.
The MAVRIC reduces the total SIV, but the relative magnitude of this reduction decreases as SIV declines. The 16-84 % range has also been vastly reduced, particularly for the near future. For example, in 2025 the MAVRIC has reduced the 16-84 % range from 6 × 10 3 km 3 to 2.5 × 10 3 km 3 . It is this reduction in the plausible range of SIV that leads to potential increased confidence in projections of SIT and SIV. To assess when the Arctic will first display ice-free conditions, we focus on RCP8.5, the most realistic scenario from the last 10 years (Fuss et al., 2014). The cumulative number of ensemble members having satisfied the ice-free criterion as a function of time is shown in Fig. 10c. If the range in this parameter has reduced, this will be shown by the gradient of the line increasing post-MAVRIC, and this is clearly seen. Figure 10d further illustrates the spread reduction with box plots, where the internal line represents the median (9th) ensemble member to go ice-free. This occurs in 2052 with the MAVRIC, 9 years earlier than before. The box represents 16-84 % of the ensemble members. This range has been reduced by about 20 years; dates after 2085 can now be eliminated.
Corrected results from the other emission scenarios show similar features but with later ice-free dates, as expected for lower emissions, and some ensemble members fail to become ice-free by 2100. For RCP4.5 the MAVRIC makes a profound difference with the median ice-free date occurring 35 years earlier in 2060. For RCP2.6 there is spread reduction mid-century but the CMIP5 subsets before and after the The Cryosphere, 9, 2237-2251, 2015 www.the-cryosphere.net/9/2237/2015/ MAVRIC are in good agreement by the end of the century, with projected ice-free dates around 2090.

Summary
This study has developed a bias correction methodology for simulations of sea ice thickness (SIT). By constraining CMIP5 simulations with the PIOMAS reanalysis we have demonstrated the following.
-GCMs simulate a wide range of SIT in the historical period and exhibit various spatial and temporal biases when compared with the PIOMAS reanalysis. This model uncertainty (or bias) is the dominant source of uncertainty in CMIP5 future climate projections of SIT.
-The Mean And VaRIance Correction (MAVRIC) technique outlined in this paper significantly reduces the total uncertainty in future projections of SIT as far as 2100 by reducing model uncertainty. Correcting both mean and variance of models is found to be critical for improving the robustness of the projections.
-The MAVRIC results in internal variability being the dominant source of uncertainty until 2022, and model uncertainty is dominant thereafter. From the midcentury onwards, scenario uncertainty becomes increasingly important and as influential as model uncertainty by 2100.
-The MAVRIC results in projected September ice-free conditions in the Arctic under RCP8.5 occurring up to 10 years earlier (2050s) than without the correction, and with a considerably narrower range, e.g. excluding post-2085 dates.

Discussion
Without the MAVRIC, the true magnitude of the internal variability and scenario uncertainty in projections of SIT is concealed by the dominant model uncertainty. This demonstrates that time invested in running many ensemble members to sample internal variability in SIT may be more beneficial than running many future emission scenarios for near-term projections. These findings implicate that there is room for improvement in GCMs at least for 50-year projections where the scenario differences are negligible. However, for projections at the end of the century, the scenarios become more important.
The MAVRIC bias correction technique developed in this study results in a significant improvement in model simulations of SIT with respect to observations. In future projections, the MAVRIC results in a substantial reduction in the range of SIT, potentially leading to increased confidence in climate projections. As absolute values of SIT are utilised, this reduction in spread potentially has important implications for stakeholder sectors operating in Arctic waters such as the shipping industry. The application of the bias correction results in a 60 % reduction in the likely range (16-84 %) of sea ice volume in September 2025.
There are a number of caveats to these findings. No attempt is made to constrain the trend in the GCMs. This would be difficult because of the short timescale over which observations are available, raising serious questions about the robustness of calculated historical trends. However future studies could consider this further and assess the feasibility of a trend correction to GCMs. In addition, it is important to recognise that PIOMAS, used here as observations, will also have errors. It would be possible to reduce the multiplicative weightings in Eq. (4) to reflect some uncertainty in the historical data. Other temporally and spatially complete sea ice reanalyses could also be used in future to address this issue.
The simulations tend to show an increase in variance as the sea ice thins, before subsequently declining as the thickness approaches zero (Goosse et al., 2009). Blanchard-Wrigglesworth andBitz (2014) assessed the relationship of this mean state-dependent variance in 19 GCMs, including five of the six used in this study, in addition to PIOMAS. They find a relationship between mean thickness variability and mean thickness in models; i.e. models with thicker SIT depict more variable SIT. In the 19 GCMs assessed, PI-OMAS sits on the trend line for the correlation between mean thickness variability and mean thickness. However, in the developed MAVRIC, the change in variance is decoupled from the applied change to the mean state. This aspect could be further developed, but only by making additional assumptions about future changes in SIT variability.
Studies should make use of the MAVRIC in assessing the impact on potential stakeholders sensitive to SIT; and a paper utilising the MAVRIC to investigate the opening of the Arctic sea routes is in preparation. We also make the bias-corrected SIT fields (Melia, 2015)  For the PIOMAS observations we choose to linearly detrend the monthly data. A smoothed detrending was considered, however this might remove longer timescale variability which is undesirable. Using similar reasoning it is possible that the linear detrending removes some variability on the multi-decadal timescale. This is assumed to be significantly less than variability on smaller timescales, and much of the trend is attributed to be externally forced over the 36 years, hence should not be included as internal variability. The performance of a smoothed detrend was tested in a theoretical framework and resulted in a 10 % loss of accuracy in the standard deviation correction due to describing variance as trend.
The calculation of variance in the models is more complicated due to the fact that there is more than one realisation. It is obvious that the required variance should be calculated from the individual ensemble members rather than the ensemble mean. The variance should be calculated in each ensemble member and then the mean taken. There is another choice to make, i.e. whether each ensemble member should be detrended with its own trend, or whether the ensemble mean trend should be used. We propose that the ensemble mean trend should be used as this is the models' response to the changes in forcings. The model detrended ensemble mean standard deviation, σ M h , was calculated by calculating the detrended ensemble variances, then taking the square root of their mean.
The running mean for the future model correction term M is calculated over an 11-year period of the ensemble mean; this window hence starts at 1975 for the historical calculations. The chosen period must be long enough to adequately smooth the time series, whilst still being able to capture variations in the sea ice decline trend. This was also tested and found to outperform a 21-year period.
-The total uncertainty is the variance calculated across all 540 fields.
-The internal variability is calculated similarly to the total variability except instead of the absolute values, the anomalies from the models' decadal-mean ensemble mean for each scenario are used.
-To calculate the model uncertainty, each of the six models' decadal-mean ensemble mean is calculated, resulting in six fields. The variance is then calculated across these six fields, and repeated for all three scenarios separately (to eliminate differential model dependent responses to the different emission scenarios). The model uncertainty is the square root of the mean of these three fields.
-The scenario uncertainty is calculated in a similar way. For each model, each of the three scenarios decadalmean ensemble means are calculated resulting in three (scenario-dependent) decadal-mean ensemble means for each of the six models. The variance is then calculated through these three scenario mean fields for each of the six models, resulting in six fields of the variance in each model. The square root of the mean of the six models' scenario uncertainty is the scenario uncertainty.
To create Fig. 8b and c it is assumed that the total variance (total uncertainty, T 2 ) is the sum of the variance due to model uncertainty (M 2 ), internal variability (I 2 ), and scenario uncertainty (S 2 ), formally: We note that the variances calculated above do not always sum exactly in this way due to small interaction terms (approximately 10 %) which we ignore. To highlight whether the estimated uncertainties are reliable, we examine the errors in the projections when considering one member as "truth". As all ensemble members are constrained by PIOMAS, one individual ensemble member out of sample should fall within the distribution of the remaining ensemble members. This principle should hold true for all ensemble members out of sample in turn. The root-mean-square error (RMSE) is calculated using Eq. (C1):

The
where E n is the ensemble member between 1 and 18; E 15 is the mean of the 15 ensemble members from the models of which E n is not a member. Figure C1 shows the advantage of the MAVRIC method in this out of sample RMSE test. A decreasing RMSE means that the models are initially biased though are converging to a common value (as we expect in this case as the models trend towards being ice-free). An increasing RMSE means that the models are diverging as they have different ice loss trends.
The MAVRIC ensemble trained on every individual ensemble member within MAVRIC results in an RMSE of 0.1 m initially and up to a maximum RMSE of 0.5 m. The fact that the raw RMSE decreases (as opposed to increases) highlights that the models have biases. The 0.1 m in the MAVRIC RMSE indicates that initially the MAVRIC ensemble members differ only in internal variability. The RMSE then grows due to differing ice loss trends, which is expected as there is no attempt to correct the trends in this study.
To find the dispersion of the MAVRIC multi-model ensemble we repeat this style of experiment with the standard error (SE) metric, using Eq. (C2): where E n is the ensemble member between 1 and 18; E 15 is the mean of the 15 ensemble members from the models of which E n is not a member. σ 15 is the standard deviation of the 15 ensemble members of which E n is not a member. This is repeated for all 18 ensemble members, giving 18 SEs of how different each ensemble member is to the rest of the multimodel ensemble set. The SD across these 18 SEs is the dispersion of the multi-model ensemble. A perfectly dispersed ensemble set will have a dispersion of 1. Numbers less than 1 mean the ensemble set is under-dispersed and hence predictions/projections from that set will be under-confident as the SD is too large. Values greater than 1 indicate that the system is over-dispersive and hence over-confident.
The results of the dispersion calculation are shown in Fig. C2. The MAVRIC ensemble is approximately 15-30 % over-dispersed for lead times of up to 60 years. This  means that the ensemble is slightly over-confident and thus has slightly too little overall variance. The rapid increase in dispersion from 60 years is solely due to the CSIRO GCM, specifically its comparatively slow ice loss trend. This was tested by repeating the dispersion experiment omitting CSIRO (not shown). At this lead time many models are starting to be ice-free in September while CSIRO retains ice. It is to the merit of MAVRIC that it is less over-dispersed than the raw output; hence more reliance can be placed on MAVRIC than the raw output as its ensemble distribution is more representative.