Scientific Investigations Report 2009–5015
U.S. GEOLOGICAL SURVEY
Statistical regression models are developed using measured data. Estimates of prediction error can be developed for data that fall inside the range of values used to develop the regression model. However, outside the range of the values used to develop the regression model, the bounds of statistical confidence are unknown. Regression analyses often yield equations that provide valid results in the range of input values, but may produce invalid or unrealistic estimates when extrapolating beyond the range of input values. Therefore, common statistical principles dictate that regression equations should not be used in extrapolation, but rather should be used only for interpolation (that is, within the range of measured values of the explanatory variables used in the regression analysis).
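The interpolation principle described above can be expressed as a simple range check. The following is a minimal sketch only: the function name, variable names, and calibration ranges are hypothetical placeholders chosen for illustration and are not taken from this report.

```python
def extrapolation_flags(inputs, calibration_ranges):
    """Return {variable: True} for each input outside its calibration range."""
    flags = {}
    for name, value in inputs.items():
        low, high = calibration_ranges[name]
        # True means the regression would be applied in extrapolation.
        flags[name] = not (low <= value <= high)
    return flags


# Hypothetical calibration ranges (illustrative values, not from the report).
ranges = {"drainage_area_mi2": (5.0, 500.0), "basin_slope_pct": (5.0, 60.0)}

# A very small basin: area is far below the hypothetical minimum.
site = {"drainage_area_mi2": 0.01, "basin_slope_pct": 20.0}

flags = extrapolation_flags(site, ranges)
```

A site is inside the range of interpolation only if every flag is False.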
The purpose of the project was to distinguish between perennial and intermittent streams in Idaho; however, most streamflow measurements used to develop regression equations come from streamflow-gaging stations, which are rarely placed on small, intermittent streams. As a result, the sample data typically do not represent the size of streams of interest in this application, and the regression equations must be applied in extrapolation. This was referred to as extreme extrapolation because the equations for 7Q2 were applied to very small watersheds (as small as a single 10 m by 10 m grid cell). Standard statistical measures are not always useful for assessing the validity of a model in extrapolation. Regression equations therefore must be evaluated to determine whether they are well behaved in extreme extrapolation for small streams. The spatial behavior of an equation can be evaluated based on the pattern of perennial streams simulated by the model. Regression models that simulated unrealistic stream network patterns were determined to be badly behaved or ineffective for this application.
If a regression equation includes only one explanatory variable (such as drainage area), regression estimates for a large number of values in the range of interest can be generated to determine how well the regression equation functions. A Monte-Carlo analysis often is used to generate simulated parameter values when multiple variables are involved. This analysis involves estimating the statistical distribution of the observed variables and randomly generating parameter values using this distribution for the range of interest. The combination of large numbers of randomly generated multi-parameter values then can be used to examine the behavior of the regression equation in extrapolation; however, randomly generated parameter values were not needed for this analysis. The complete set of parameter values for every grid cell in the domain of interest can be computed; therefore, the regression equation results can be evaluated for every combination of parameter values that exists in the modeled area, which produces a map of the spatial behavior of the equation.
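The exhaustive evaluation described above, in which the equation is computed for every grid cell rather than for randomly sampled parameter values, might be sketched as follows. The helper name, the toy equation and its coefficients, and the 2 by 2 grids are all hypothetical; in practice a raster library would likely be used instead of nested lists.

```python
def evaluate_on_grids(equation, *grids):
    """Apply equation(cell values...) to every cell of equally shaped grids."""
    return [
        [equation(*cells) for cells in zip(*rows)]
        for rows in zip(*grids)
    ]


def toy_equation(area_mi2, slope_pct):
    # Hypothetical power-law form and coefficients, for illustration only.
    return 2.0 * area_mi2 * slope_pct ** 0.5


# Hypothetical parameter grids: one value per grid cell.
area_grid = [[1.0, 4.0], [9.0, 16.0]]
slope_grid = [[4.0, 4.0], [16.0, 16.0]]

# One estimate per cell, covering every parameter combination that
# actually occurs in the modeled area.
estimate_grid = evaluate_on_grids(toy_equation, area_grid, slope_grid)
```

Because the output has the same shape as the input grids, it can be displayed directly as a map of the spatial behavior of the equation.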
As a test of regression equation behavior in extrapolation, continuous parameter grids were evaluated for the 7Q2 equation for region 8 published by Hortness (2006), and given here as:
$$7Q2 = 3.86\,A^{0.930}\,BS^{-0.648}, \tag{1}$$

where:
7Q2 is the 7-day, 2-year low flow, in ft³/s;
A is drainage area, in mi²; and
BS is average basin slope, in percent.
This equation was determined to be not well behaved in extreme extrapolation. The minimum drainage area in region 8 used to develop this regression equation was 6.6 mi², and the minimum basin slope was 6.15 percent. In the continuous parameter grid, the minimum drainage area is one 10 m by 10 m grid cell, or 3.86 × 10⁻⁵ mi². Model-generated values for basin slope for very small basins can be zero, which causes a numerically invalid result; therefore, slope is a poor predictor variable for estimating flow characteristics of very small basins, unless it is censored to remove zero values.
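The numerical problem with zero slope can be reproduced directly from equation 1. The sketch below is illustrative only: the function name is hypothetical, and returning None for censored cells is an assumed convention, not the method used in the report. It also checks the unit conversion for a single grid cell.

```python
M_PER_MILE = 1609.344  # international mile, in meters

# Area of one 10 m by 10 m grid cell, in square miles (about 3.86e-5).
CELL_AREA_MI2 = (10.0 * 10.0) / M_PER_MILE ** 2


def q7_2_region8(area_mi2, slope_pct):
    """Equation 1 (Hortness, 2006, region 8), with zero slope censored.

    Returns None when slope is not positive (assumed censoring convention).
    """
    if slope_pct <= 0.0:
        return None
    return 3.86 * area_mi2 ** 0.930 * slope_pct ** -0.648


# Without censoring, a zero slope raised to a negative power is undefined.
try:
    3.86 * CELL_AREA_MI2 ** 0.930 * 0.0 ** -0.648
    uncensored_failed = False
except ZeroDivisionError:
    uncensored_failed = True

# Estimate at the smallest area and slope reported for the calibration data.
q_at_calibration_minimums = q7_2_region8(6.6, 6.15)

# Estimate for a single grid cell with zero slope is censored.
q_single_cell_flat = q7_2_region8(CELL_AREA_MI2, 0.0)
```

The negative exponent on slope also means that estimates grow without bound as slope approaches zero, so slopes that are very small but nonzero inflate the estimate as well.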
Alternative regression models were evaluated due to the poor spatial behavior of some of the previously developed equations. Regression models that produced stream network patterns that clearly were not reasonable in comparison with topographic maps were rejected in favor of models that produced more reasonable patterns. Methods described in Hortness (2006) were followed when developing revised regression equations.
Certain types of categorical variables such as forested area tended to produce unrealistic patterns. In small drainage areas, categorical variables tended to have extreme values of either zero or 100 percent. These extremes are rarely if ever present in the large basins upstream of streamflow-gaging stations and usually would be outside the range of data values used in the regression analysis. For this reason, categorical variables such as forested area were avoided where possible.
The final regression equations developed for modeling perennial streams in Idaho are shown in table 1. The ranges of basin characteristics used to develop the final regression equations are shown in table 2. Revised regression models were selected for five of the eight regions; the original equations by Hortness (2006) were retained for regions 2, 4, and 6. These regression equations were applied to create revised continuous grids of 7Q2 estimates for the eight regions in the study area.
The scatterplot in figure 3 shows 7Q2 values plotted against drainage area for the (A) original and (B) revised regression models in small drainages in region 8. Note that basin slope and mean annual precipitation, although parameters in the original and revised equations, respectively, are not shown on the plot. The minimum drainage area used to develop the regression equations, 6.6 mi², is shown in figure 3 as a vertical line. Although some spatial patterns can be attributed to variability in additional explanatory variables, it is evident that the original model does not produce stable linear results in extreme extrapolation for small drainage areas. The revised model simulates linear results in extreme extrapolation for small drainage areas better than the original model.
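As a worked illustration of what a linear result means for a power-law model, taking the base-10 logarithm of the original region 8 equation (equation 1) gives a form that is linear in the logarithms of the explanatory variables. This assumes that linearity in figure 3 is judged in log-log space, which is not stated explicitly here.

$$\log_{10}(7Q2) = \log_{10}(3.86) + 0.930\,\log_{10}(A) - 0.648\,\log_{10}(BS)$$

For a fixed basin slope, 7Q2 would plot against drainage area as a straight line with a slope of 0.930 on logarithmic axes. Scatter about that line in the original model therefore reflects cell-to-cell variation in BS.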
Screen captures of the stream network derived using the (A) original (Hortness, 2006) and (B) revised perennial streams models for a small area in region 8 are shown in figure 4. The grid cells with 7Q2 estimates greater than or equal to 0.1 ft³/s are shown in pink transparency over a USGS 1:24,000-scale topographic base map. On the original map (fig. 4A), unreasonably dense and discontinuous drainage patterns appear that do not correspond to locations that appear to be stream channels based on topographic maps and knowledge of the area. This illustrates that regression equations that are not well behaved (in extreme extrapolation) may be detected by looking for unrealistic spatial patterns in the drainage network derived from application of the regression equation in a continuous manner. The spatial pattern for the revised model (fig. 4B) is much more reasonable than that for the original model (Hortness, 2006). However, the original model would still be preferred for predicting streamflow statistics in streams where extrapolation is not required because the standard error of prediction is lower for the original model than for the revised model.
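The mapping step described above, in which grid cells with 7Q2 estimates at or above 0.1 ft³/s are displayed, amounts to thresholding the continuous grid. The sketch below assumes a grid stored as nested lists, with None marking cells that have no valid estimate; the function name and the sample values are hypothetical.

```python
THRESHOLD_CFS = 0.1  # display threshold for 7Q2 stated in the text, in ft3/s


def perennial_mask(q7_2_grid, threshold=THRESHOLD_CFS):
    """Return a grid of booleans: True where 7Q2 >= threshold."""
    return [
        [(q is not None) and (q >= threshold) for q in row]
        for row in q7_2_grid
    ]


# Hypothetical 7Q2 estimates (ft3/s); None marks a censored cell.
sample_grid = [
    [0.02, 0.10, 0.35],
    [None, 0.09, 1.20],
]

mask = perennial_mask(sample_grid)
```

Cells flagged True correspond to the cells that would be drawn over the base map for visual comparison with mapped stream channels.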
The results of the revised regression models for region 8 in a broader spatial context are shown in figure 5. Figure 5A shows the results of the original model by Hortness (2006) for one 8-digit HUC (16010201). Figure 5B shows the results of the revised model for the same area.
The standard error of prediction for each regional regression equation is shown in table 1. The standard error of prediction is a measure of the overall model error as well as the sample error and is a good indicator of the overall predictive ability of the model within the range of input variables (Pope and others, 2001); however, it is not a suitable measure of the model error in the range of extrapolation because it is calculated based on measured values. The values presented in log₁₀ format represent the errors in the log-transformed equations, and the percentage values represent the range of errors for the untransformed equations, as presented in table 1. The values were determined using error transformation equations presented in Riggs (1968). For example, the percent range for standard error of prediction for region 1 is +86.7 to −46.4 percent. In this region, the error range for a 7Q2 value of 1.0 ft³/s would be 0.54 to 1.87 ft³/s. The highest standard errors of prediction are observed in regions 3 and 7. These regions encompass the Columbia Plateau, Snake River Plain, and Owyhee Uplift physiographic regions (not labeled in fig. 1), and are characterized by lower elevations, smaller terrain relief, and less rainfall compared to the remaining regions. Large standard errors of prediction also were observed for the original Hortness (2006) regression equations for regions 3 and 7. Hortness (2006) speculated that the large errors in region 3 likely were due to few streamflow-gaging stations available for developing the regression model. Region 7 covers a large part of the study area, and Hortness and Berenbrock (2001) noted a high degree of natural variability in streamflow in region 7 that could contribute to a high error of prediction. Revising the regression model did not improve the standard error of prediction but substantially improved spatial patterns observed in extrapolation.
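The asymmetric percent range in the region 1 example is consistent with the usual conversion of a standard error in log₁₀ units: the upper bound is 100(10^SE − 1) percent and the lower bound is 100(10^−SE − 1) percent. The sketch below assumes that this is the transformation meant by the reference to Riggs (1968), which is not confirmed here. Because the table 1 value is not reproduced in this section, the log₁₀ standard error is back-calculated from the +86.7 percent bound quoted in the text.

```python
import math


def percent_error_range(se_log10):
    """Convert a standard error in log10 units to (lower, upper) percent."""
    upper = 100.0 * (10.0 ** se_log10 - 1.0)
    lower = 100.0 * (10.0 ** -se_log10 - 1.0)
    return lower, upper


# Back-calculate the log10 standard error from the +86.7 percent bound
# quoted for region 1 (the table 1 value itself is not used here).
se_log10 = math.log10(1.0 + 86.7 / 100.0)

lower_pct, upper_pct = percent_error_range(se_log10)

# Error range for a 7Q2 estimate of 1.0 ft3/s.
estimate_cfs = 1.0
low_flow = estimate_cfs * (1.0 + lower_pct / 100.0)
high_flow = estimate_cfs * (1.0 + upper_pct / 100.0)
```

Under this assumption the lower bound recovers the quoted −46.4 percent, and the flow range rounds to the 0.54 to 1.87 ft³/s given in the text.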