Scientific Investigations Report 2006–5316

U.S. GEOLOGICAL SURVEY
Scientific Investigations Report 2006–5316
DOE/ID-22201

Back to Table of Contents

Geostatistical Methods

Geostatistics is a specialized branch of statistical analysis concerned with the spatial relationships among data situated in two- or three-dimensional coordinate space. The process of geostatistical analysis and modeling includes: (1) assembling the data for analysis (data preparation); (2) calculating basic statistics, evaluating statistical stationarity, and performing data transformations to reduce the effect of outliers (exploratory analysis); (3) characterizing the spatial nature of the data (autocorrelation analysis); and (4) applying these statistical insights to create a model of spatial variability using a kriging estimator.

Data Selection

The database of basalt and interbed stratigraphy compiled by Anderson and others (1996) was used as the basis for this work. However, several limitations in their data set had to be addressed prior to analysis. Foremost was the fact that their stratigraphic interpretations represented best available knowledge at the time of compilation. Subsequent drilling, coring, and paleomagnetic and radiometric measurements have added to our understanding of subsurface stratigraphy beneath the INL. In many locations, new data have corroborated their interpretations; in other locations, new data have forced a re‑evaluation of local stratigraphic relationships.

Such limitations do not diminish the value of the 1996 compilation for two reasons. First, the 1996 database defines the spatial occurrences of various lithologies, regardless of their interpreted stratigraphic context; therefore, the Anderson database represents a valid snapshot of subsurface sedimentary abundances at the time of data compilation. Second, as new subsurface data accrue, an appropriate temporal reference point will have to be selected on which future data syntheses can build. Although recent stratigraphic compilations were and currently are being made by various groups at the INL, these are not as extensively documented or vetted as the interpretations published by Anderson and coworkers. However, such compilations will provide a useful means of evaluating the accuracy and quality of earlier compilations after selecting an appropriate reference against which new information can be compared.

For these reasons, the database of Anderson and others (1996) serves as a reasonable baseline for analyzing the subsurface distribution of sediment in the study area. Geostatistical analysis and modeling performed during the current work did not rely on the interpreted stratigraphy of Anderson and others (1996) except as it was used by Anderson and Liszewski (1997) to assign a composite unit stratigraphy.

Data Preparation

The stratigraphic data set compiled by Anderson and others (1996) was in the form of a series of ASCII text files containing well locations, boring identifiers, lithologies, and interpreted stratigraphic unit designations. These were converted to spreadsheet form and then to .dbf tables, and ultimately incorporated in a relational database.

Several factors influenced the decision regarding which data fields to incorporate in the database tables. Relational queries were used to extract information for specific analysis tasks, and the information was reformatted to be compatible with different graphical display and visualization software, as well as with Geographic Information System (GIS) and geostatistical software used in subsequent analyses. Accordingly, database fields were assigned, containing modified identifiers with which data can be located by lithologic and stratigraphic codes and, in the future, to trace the pedigree of interpreted stratigraphic assignments. Appropriate coordinates were added, reflecting the USGS subregional ground-water flow-model domain, and various modifications were carried out, including removing duplicate coordinates, assigning missing-value flags, and modifying anomalous stratigraphic designations which, in their original form, were incompatible with the software used for graphical display and visualization. These modifications are described below.

Modified Coordinates

Anderson and others (1996) reported their well locations in geographic coordinates of latitude and longitude, as decimal degrees. To be compatible with the USGS subregional flow model, Albers (NAD 27) coordinates for all wells were provided by L. Davis (U.S. Geological Survey, written commun., 2004) and used to update the Anderson database. Altitude coordinates, originally referenced to the National Geodetic Vertical Datum of 1929 (NGVD 29), were not updated from those provided by Anderson and others (1996).

Duplicate Coordinates

A total of 12 pairs, triplets, and quadruplets of well locations in the original Anderson data set shared identical geographic coordinates, which usually reflects wells that were drilled close together. These locations were updated with more accurate Global Positioning System (GPS) coordinates provided by L. Davis (U.S. Geological Survey, written commun., 2004).

Ambiguous Stratigraphic Designations

The stratigraphic assignments made in the original database contain almost two dozen instances of a sedimentary unit interpreted to be a lens within a basalt unit; for example, the “B(1)s” sedimentary unit appears between two occurrences of the basalt unit “B(1)b”. Such a designation is physically impossible unless the basalt unit is actually an aggregate of two or more separate basalt flows. These designations had to be corrected to avoid incompatibility with the visualization software used. Omission of such lenses was considered inappropriate, as was assigning different stratigraphic designations to the basalt above and below the lens (which would affect the entire database). The solution was to aggregate the thickness of the sedimentary lens with its stratigraphically nearest interbed. This approach introduced the least disruption into the modified database while preserving the lens’s contribution to total sediment abundance.

Database Organization

To facilitate organization and data-query capabilities, a number of minor modifications were made to the original data set published by Anderson and others (1996). Some borehole names were modified, usually by deleting spaces or other ambiguous alphanumeric designators, to ensure standardized well names for linking tables. A field was added to designate the composite unit into which a particular stratigraphic unit falls (table 1). Finally, a field was incorporated to allow for future changes to the interpreted stratigraphic assignments to track the “pedigree” of future stratigraphic interpretations and database modifications.

Analysis Approach

Almost all of the lithologic information contained in the database compiled by Anderson and others (1996) reflects a dichotomous designation of sediment or basalt. Thus, the proportion of sediment encountered by a borehole can be expressed either as a cumulative thickness or as a percentage over a designated depth interval. Although both measures were evaluated during exploratory data analysis, the use of sediment percentages introduces statistical bias where sediment constitutes all of a composite unit’s thickness (most frequently, in locations where a composite unit is relatively thin or incompletely penetrated by boreholes) and skews the statistics of sediment abundance relative to basalt. Where practical, cumulative thicknesses as well as percentage of sediment were statistically analyzed; note that for model layers A, B, and C, each of which is 100-ft thick, measures of sediment thickness and percentage are numerically equivalent.

Exploratory analysis involved the computation of basic statistics for each composite unit’s sediment content, an evaluation of the effect of geographically clustered wells on the computed statistics, and an examination of the hypothesis of statistical stationarity (both in a geographic spatial sense and in a time-stratigraphic spatial sense). As one test of statistical stationarity, nonparametric hypothesis tests were used to evaluate the degree of similarity between sample medians. A Mann-Whitney test, equivalent to a Student’s t-test for normally distributed populations, was used to compare pairs of sample means; the Kruskal-Wallis test (equivalent to a one-way analysis of variance for normally distributed populations) was used to compare differences among three or more sample medians. All hypothesis tests were evaluated at a 95 percent confidence level (where a p-value greater than 0.05 indicates no statistically significant difference among the samples).

Autocorrelation analysis focused on the computation of various spatial statistics to characterize the scale over which the thicknesses of composite units and interbeds persist. These statistical measures provided additional support for the hypothesis of stationarity and served as the basis for developing models of sediment thickness via kriging.

Kriging Approach

Kriging is a nearest-neighbor predictor that makes statistically “best” estimates at unsampled locations on the basis of available information in local neighborhoods. It differs from other linear prediction methods in its reliance on a model of spatial autocorrelation (inferred from the data) to assign weights to nearest data neighbors. Kriging estimates are optimal in the sense that their estimation error variances are minimized. In ordinary kriging, the error variance provides an estimate of prediction uncertainty based on the assumption that estimation errors are normally distributed. In this sense, ordinary kriging has been described as a “parametric” estimator (Isaaks and Srivastava, 1989). It is often unrealistic, however, to assume that estimation uncertainty is normally distributed or that the ordinary kriging estimate, the mean, is an appropriate estimator of central tendency at all kriged locations. To avoid this limitation, multiple indicator kriging (mIK) uses nonlinear (indicator) transforms of data around multiple thresholds to estimate the cumulative frequency distribution (CFD) within a local neighborhood. The local CFD is unconstrained by assumptions of normality, and provides a true measure of estimation uncertainty due to local data variability as well as a more appropriate estimate of central tendency, the median. The drawback is that autocorrelation statistics must be calculated and an autocorrelation model inferred at each threshold used to discretize the CFD, making mIK more time consuming to implement. However, the power, flexibility, and generality of mIK make it the superior kriging technique for modeling environmental data (Goovaerts, 1997). All kriging results were discretized and converted to industry-standard shapefile format on a quarter-mile grid corresponding to the cell centers of the USGS subregional flow model.

Software Selection

Data organization, query building, and data extraction were performed with Microsoft Access™ database software; Microsoft Excel™ was used for data manipulation, sorting, and indicator transformation. Basic statistical analyses were performed with Minitab™ and StatMost™ statistical software, and ArcMap™ Geographic Information System software was used for spatial data visualization and graphics generation. Variogram analysis was performed with the VARIOWIN™ software package (Pannatier, 1996), and declustering analysis and multiple indicator kriging were performed with the Stanford Geostatistical Software Library™ (GSLIB) (Deutsch and Journel, 1998). Results of geostatistical modeling were imported as GSLIB grids into ArcMap and converted to shapefiles and raster grids for visualization and raster-based computations.

Back to Table of Contents


AccessibilityFOIAPrivacyPolicies and Notices

Take Pride in America home page.FirstGov buttonU.S. Department of the Interior | U.S. Geological Survey
Persistent URL: https://pubs.water.usgs.gov/sir20065316
Page Contact Information: Publications Team
Page Last Modified: Thursday, 01-Dec-2016 19:33:13 EST