Note: Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Introduction

R is a software environment for statistical computing and graphics. In R, the primary mechanism for sharing with others is the package. Packages are collections of computer code, data, and documentation in a well-defined format. Datasets pertaining to the United States Geological Survey-Idaho National Laboratory (USGS-INL) water-quality and water-level aquifer monitoring networks are stored in an R package named inldata. And computer code (functions) used to analyze this data are bundled together in an R package named ObsNetQW. This document is a vignette in the ObsNetQW package and provides an overview of the software used to optimize the USGS-INL water-quality aquifer monitoring network. A package vignette is a fully reproducible document that combines narrative text and R code to produce static (PDF) and dynamic (HTML) output formats. The R code is run when the vignette is built, and data analysis output (figures and tables) are created extemporaneously and inserted into the final document. Small chunks of stylized code are shown in this vignette and are intended to be used interactively in an R terminal. These code chunks may be useful for testing, development, and validation purposes. This document provides (1) software installation instructions, (2) dataset and function documentation, (3) processing instructions for performing the optimization analysis, and (4) guidance on recreating the R environment used when compiling this report.

Software

Software used in this report are all available as free and open-source software (FOSS), can be run on multiple platforms (Windows, OS X, Linux), and do not require administrative privileges to install or run.

R

The R software environment can be obtained from the Comprehensive R Archive Network (CRAN). And the capabilities of R may be extended by installing an assorted group of user-contributed R packages available on CRAN and the U.S. Geological Survey R Archive Network (GRAN). Typing the following commands into an R terminal (console) will install the latest versions of the R packages used in this report.

update.packages()
remotes::install_gitlab("inl/obsnetqw",
                        host = "code.usgs.gov",
                        dependencies = TRUE)

LaTeX

To create PDF vignette documents with R requires LaTeX, a document preparation system for typesetting. The LaTeX distribution used in this report is TinyTeX. To install TinyTeX, first uninstall any other LaTeX distribution that may be installed, and then run the following command in the R terminal:

tinytex::install_tinytex()

The capabilities of LaTeX may be extended by installing an assorted group of user-contributed LaTeX packages available on the Comprehensive TeX Archive Network (CTAN). To install additional packages that were used in this report and not included in the default installation of TinyTeX, run the following command:

inlmisc:::InstallLatexPackages()

Pandoc

To create HTML vignette documents with R requires pandoc, a universal document converter. Instructions for installing pandoc are available for multiple platforms.

Datasets

Datasets in the R-package inldata may be accessed from the R terminal by running the following command:

library("inldata")

The help documentation for each dataset is given below.

Background Concentrations

Description

Water-quality background concentrations for selected radionuclides, organic compounds, and chemical constituents analyzed for in water from the Eastern Snake River Plain aquifer at and near the Idaho National Laboratory (INL). Background concentrations are either naturally occurring or anthropogenic, and are not influenced by waste and wastewater disposal at the INL (Bartholomay and Hall, 2016, p. 3).

Usage

background

Format

A data frame with 73 records and 4 variables:

parm_cd: U.S. Geological Survey 5-digit parameter code, see parameters dataset for details.
bkgrd_min, bkgrd_max: minimum and maximum limits of background concentration
reference: source of background concentration limits. Reference citations are: "Bartholomay and Hall (2016)", "Knobel and others (1992)", "Michel (1989)", and "Orr and others (1991)".

Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.

References

Bartholomay, R.C., and Hall, L.F., 2016, Evaluation of background concentrations of selected chemical and radiochemical constituents in groundwater in the eastern Snake River Plain aquifer at and near the Idaho National Laboratory, Idaho: U.S. Geological Survey Scientific Investigations Report 2016–5056, (DOE/ID–22237), 19 p., https://doi.org/10.3133/sir20165056.

Knobel, L.L., Orr, B.R., and Cecil, L.D., 1992, Summary of background concentrations of selected radiochemical and chemical constituents in groundwater from the Snake River Plain aquifer, Idaho: estimated from an analysis of previously published data: Journal of the Idaho Academy of Science, v. 28, no. 1, p. 48–61.

Michel, R.L., 1989, Tritium deposition in the continental United States, 1953–83: U.S. Geological Survey Water Resources Investigations Report 89–4072, 46 p., https://doi.org/10.3133/wri894072.

Orr, B.R., Cecil, L.D., and Knobel, L.L., 1991, Background concentrations of selected radionuclides, organic compounds, and chemical constituents in ground water in the vicinity of the Idaho National Engineering Laboratory: U.S. Geological Survey Water-Resources Investigations Report 91–4015 (DOE/ID–22094), 52 p., https://doi.org/10.3133/wri914015.

Examples

str(background)
rownames(background)

Benchmark Concentrations

Description

Water-quality benchmark concentrations of selected radionuclides, organic compounds, and chemical constituents. Benchmarks include: the U.S. Environmental Protection Agency (USEPA) Maximum Contaminant Levels (MCLs) and Human Health Benchmarks for Pesticides (HHBPs), and the U.S. Geological Survey (USGS) Health-Based Screening Levels (HBSLs).

Usage

benchmarks

Format

A data frame with 74 records and 9 variables:

parm_cd: USGS 5-digit parameter code, see parameters dataset for details.
mcl: USEPA MCLs.
hhbp_noncancer: USEPA Chronic Noncancer HHBPs
hhbp_cancer_min: USEPA Carcinogenic HHBPs for a one-in-one million cancer risk.
hhbp_cancer_max: USEPA Carcinogenic HHBPs for a one-in-ten thousand cancer risk.
hbsl_noncancer: USGS Noncancer HBSLs
hbsl_cancer_min: USGS Cancer HBSLs for a one-in-one million cancer risk.
hbsl_cancer_max: USGS Cancer HBSLs for a one-in-ten thousand cancer risk.
remark: comments

Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.

Source

Many of the water-quality benchmarks were accessed from the U.S. Geological Survey Health-Based Screening Levels database, accessed on August 8, 2017, from https://water.usgs.gov/water-resources/hbsl/. Benchmarks for total Trihalomethanes, Tritium, and Strontium-90 were provided by the U.S. Environmental Protection Agency (2015). Note that MCL benchmark values reported in millirem per year were substituted with a 50 picocuries per liter screening level.

References

U.S. Environmental Protection Agency, 2015, Protection of environment—Code of Federal Regulations 40, Part 141, Subpart G, National Primary Drinking Water Regulations, Maximum Contaminant Levels and Maximum Residual Disinfectant Levels: Washington, D.C., Office of the Federal Register, National Archives and Records Administration.

Examples

str(benchmarks)
rownames(benchmarks)

Cities and Towns

Description

Cities and towns (populated places) in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

cities

Format

A SpatialPointsDataFrame of the sp package with 24 records and 16 variables. See projection dataset for coordinate reference system information.

Source

U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.

Examples

inlmisc::PlotMap(cities, dms.tick = TRUE)
sp::plot(cities, pch = 19, add = TRUE)
raster::text(cities, cities@data$NAME,
             pos = 1, cex = 0.6)
str(cities@data)

County Boundaries

Description

Boundaries of counties in the vicinity of Idaho National Laboratory, eastern Idaho, as of January 1, 2015.

Usage

counties

Format

A SpatialLines of the sp package with 45 features. See projection dataset for coordinate reference system information.

Source

Examples

inlmisc::PlotMap(counties, dms.tick = TRUE)
sp::plot(counties, lty = 5, add = TRUE)

Laboratory Detection Limits

Description

Analytical method detection limits of selected radionuclides, which are based on laboratory procedures.

Usage

dl

Format

A data frame with 8 records and 6 variables:

srsname: Substance Registry Services (SRS) name.
parm_cd: U.S. Geological Survey 5-digit parameter code.
parameter_units: parameter units
lab_det_lim_va: laboratory detection limit concentration
sdate: date that the detection limit went into effect.
reference: source of detection limit.

Source

Detection limits reported by: Bartholomay and others (2003, table 9); Bartholomay and others (2014, table D1); and Bodnar and Percival (1982).

References

Bartholomay, R.C., Knobel, L.L., and Rousseau, J.P., 2003, Field methods and quality-assurance plan for quality-of-water activities, U.S. Geological Survey, Idaho National Engineering and Environmental Laboratory, Idaho: U.S. Geological Survey Open-File Report 03–42 (DOE/ID–22182), 45 p. https://doi.org/10.3133/ofr0342.

Bartholomay, R.C., Maimer, N.V., and Wehnke, A.J., 2014, Field methods and quality-assurance plan for water-quality activities and water-level measurements, U.S. Geological Survey, Idaho National Laboratory, Idaho: U.S. Geological Survey Open-File Report 2014–1146 (DOE/ID–22230), 64 p. https://pubs.usgs.gov/of/2014/1146/.

Bodnar, L.Z., and Percival, D.R., eds., 1982, Analytical Chemistry Branch procedures manual—Radiological and Environmental Sciences Laboratory: U.S. Department of Energy Report IDO–12096 [variously paged].

Examples

str(dl)

Eastern Snake River Plain Boundary

Description

Boundary of the eastern Snake River Plain, Idaho.

Usage

esrp

Format

A SpatialPolygonsDataFrame of the sp package object with 1 feature and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(esrp, dms.tick = TRUE)
sp::plot(esrp, col = "red", add = TRUE)
str(esrp@data)

Idaho National Laboratory Facilities

Description

Federal research facilities at the Idaho National Laboratory (INL).

Usage

facilities

Format

A SpatialPolygonsDataFrame object with 7 features and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(facilities, dms.tick = TRUE)
sp::plot(facilities, col = "red", add = TRUE)
raster::text(facilities, facilities@data$NAME,
             cex = 0.6, pos = 1)
str(facilities@data)

Groundwater Level Measurements

Description

Groundwater level measurements in wells in the U.S. Geological Survey (USGS) water-level aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (NWIS) (U.S. Geological Survey, 2020).

Usage

gwl

Format

A data frame with 54,417 records and 5 variables:

site_no: USGS site number
lev_dt: date the water level was measured.
lev_va: water level value, in feet below land surface. A water-level elevation may be calculated by subtracting the depth-to-water (lev_va) from the land-surface elevation (alt_va in the sites dataset).
lev_status_cd: a code indicating the status of the site at the time the water level was measured. The codes and their meanings are: "A" water level was affected by atmospheric pressure, "D" the site was dry (no water level is recorded), "N" the measurement was discontinued, "O" an obstruction was encountered in the well (no water level was recorded), "P" the site was being pumped, "R" the site had been pumped recently, "S" a nearby site that taps the same aquifer was being pumped, and "X" the water level was affected by stage in nearby surface-water site.
date_time: date and time the water level was measured. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).

Source

Data obtained from the NWIS database (U.S. Geological Survey, 2020).

References

U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.

Examples

site_no <- "432700112470801"  # well USGS 1
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- gwl[gwl$site_no == site_no, c("lev_dt", "lev_va")]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
ylab <- sprintf("Water level, in %s below land surface",
                c("feet", "meters"))
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
                   type = "p", pch = 19, seq.date.by = "year",
                   conversion.factor = 0.3048,
                   center.date.labels = TRUE)
str(gwl)

alt_va <- sites@data[sites@data$site_no == site_no, "alt_va"]
y <- alt_va - d$lev_va
ylab <- sprintf("Water level, in %s above sea level",
                c("feet", "meters"))
inlmisc::PlotGraph(d$lev_dt, y, ylab = ylab, main = main,
                   xlim = xlim, type = "p", pch = 19,
                   seq.date.by = "year",
                   conversion.factor = 0.3048,
                   center.date.labels = TRUE)

State of Idaho Boundary

Description

Simplified representation of the boundary of Idaho, a state in the northwestern region of the United States.

Usage

idaho

Format

A SpatialPolygons of the sp package with 1 feature. See projection dataset for coordinate reference system information.

Source

Examples

inlmisc::PlotMap(idaho, dms.tick = TRUE)
sp::plot(idaho, col = "red", add = TRUE)

Idaho National Laboratory Boundary

Description

Boundary of the Idaho National Laboratory (INL).

Usage

inl

Format

A SpatialPolygonsDataFrame of the sp package object with 1 feature and 4 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(inl, dms.tick = TRUE)
sp::plot(inl, col = "red", add = TRUE)
str(inl@data)

Map Labels

Description

Map labels in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

labels

Format

A SpatialPointsDataFrame of the sp package with 50 features and 7 variables:

labels: text to be written
cex: character expansion factor
col, font: color and font to be used, respectively.
srt: string rotation in degrees.
halo: if true, a "halo" is printed around text.
map: map index

See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(labels, dms.tick = TRUE)
lab <- labels[labels$map == 1L, names(labels) != "map"]
for (i in seq_along(lab))
  do.call(raster::text, c(lab[i, ], as.list(lab@data[i, ])))
str(labels@data)

Lakes and Ponds

Description

Perennial lakes and ponds in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

lakes

Format

A SpatialPolygonsDataFrame of the sp package with 155 features of 12 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.

Examples

inlmisc::PlotMap(lakes, dms.tick = TRUE,
                 lakes = list(lakes, "lwd" = 1))
str(lakes@data)

Mountain Ranges and Buttes

Description

Simplified representation of the mountain ranges and buttes in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

mountains

Format

A SpatialPolygonsDataFrame of the sp package with 17 features and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(mountains, dms.tick = TRUE)
sp::plot(mountains, col = "gray", add = TRUE)
str(mountains@data)

Parameter Information for Analytes

Description

Parameter code information for selected chemical constituents, organic compounds, and radionuclides measured for in water samples collected from wells in the U.S. Geological Survey (USGS) water-quality aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho.

Usage

parameters

Format

A data frame with 74 records and 7 variables:

parm_cd: USGS 5-digit parameter code
parameter_group_nm: parameter group name, such as "Radiochemical"
parameter_nm: long parameter name, such as "Strontium-90, water, unfiltered, picocuries per liter"
casrn: Chemical Abstracts Service (CAS) registry number, such as "10098-97-2" for Strontium-90
srsname: Substance Registry Services (SRS) name, such as "Strontium-90"
parameter_units: parameter units. Unit abbreviations and descriptions are: "mg/L" milligrams per liter, "mg/L as N" milligrams per liter as Nitrogen, "ug/L" microgram per liter, and "pCi/L" picocuries per liter.
siunitx: parameter units formatted for LaTeX using the siunitx package notation.

Source

USGS water data acquired from the National Water Information System (U.S. Geological Survey, 2019). The SRS name (srsname) for "Trihalomethanes (four), total, from SDWA NPDWR" was shorten to its preferred acronym "TTHM4".

References

U.S. Geological Survey, 2019, National Water Information System—web services, accessed June 11, 2019, from https://doi.org/10.5066/F7P55KJN.

Examples

str(parameters)

Coordinate Reference System

Description

PROJ.4 string defining a custom coordinate reference system (CRS) used by the U.S. Geological Survey Idaho National Laboratory Project Office. The CRS is based on the following attributes: Albers equal-area conic projection; latitude of first and second standard parallel is 42.83 and 44.16 decimal degrees, respectively; latitude and longitude of false origin is 41.5 and -113 decimal degrees, respectively; easting and northing of false origin is 200,000 and 0 meters, respectively; Clarke (1966) reference ellipsoid; North American Datum of 1983; and units of meters.

Usage

projection

Format

A character string describing the CRS in PROJ.4 format.

Examples

crs <- sp::CRS(projection)  # convert to class "CRS"
print(crs)

Primary and Secondary Roads

Description

Primary and secondary roads in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

roads

Format

A SpatialLinesDataFrame of the sp package with 1,616 features and 5 variables. See projection dataset for coordinate reference system information.

Source

Examples

inlmisc::PlotMap(roads, dms.tick = TRUE,
                 roads = list(roads, "lwd" = 1))
sp::plot(roads[roads@data$PRISEC, ], col = "red", add = TRUE)
str(roads@data)

Water-Quality Data Records

Description

Water-quality data from laboratory analyses of groundwater samples collected from wells in the U.S. Geological Survey (USGS) water-quality monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (U.S. Geological Survey, 2019).

Usage

samples

Format

A data frame with 159,554 records and 19 variables:

site_no: USGS site identification number
sample_dt: date sample was collected
medium_cd: 3-digit medium code that identifies the material type and quality assurance type of the sample. The codes and their meanings are: "WG" water below land surface contained in the saturated zone (groundwater); and "WGQ" groundwater quality-control (QC) sample.
db_no: 2-digit NWIS database number. The codes and their meanings are: "01" is the environmental database, and "10" is the quality-assurance (QA) database.
anl_ent_cd: analyzing entity code of the organizational unit that performed the sample analysis used to obtain the result.
parm_cd: USGS 5-digit parameter code. For example, the parameter code for Tritium is "07000".
remark_cd: 1-digit remark code (result level) used to qualify the parameter value. The codes and their meanings are: "" quantified value. "<" actual value is known to be less than the value reported, that is, the measured concentration is below the reporting limit (RL) and represented as a censored (or nondetection) value. For censored values, the value reported is the RL. "E" value is estimated, that is, the actual value is greater than the minimum detection limit (MDL) and less than the laboratory reporting level (LRL). "R" nondetect, result less than sample-specific critical level. "U" material specifically analyzed for but not detected. "V" value affected by contamination.
result_va: parameter value; see parameter_units variable in the parameters dataset for the units of measurement.
lab_std_dev_va: laboratory standard deviation (SD). For radiochemical data, SD is determined from the counting error. Prior to January 1, 2018, counting error was reported as two SD, therefore, these values were divied by 2.
dqi_cd: 1-digit data quality indicator code that indicates the review status of a result. The codes and their meanings are: "A" historical data, "S" presumed satisfactory, and "R" reviewed and accepted.
rpt_lev_va: laboratory reporting limit in effect for the parameter and method at the time the measurement was made.
rpt_lev_cd: reporting level code that identifies the analytical reporting level appropriate for the analytical method. The codes and their meanings are: "DLBLK" detection limit by blank data; "DLDQC" detection limit by DQCALC, lowest concentration that with 90 percent confidence will be exceeded no more than 1 percent of the time when a blank sample is measured; "IRL" interim reporting level, a temporary reporting level; "LRL" laboratory reporting level, equal to twice the yearly-determined LT-MDL; "LT-MDL" long-term method detection limit, a detection level derived by determining the standard deviation of a minimum of 24 MDL spike sample measurements over an extended period of time; "MDL" method detection limit, minimum concentration of a substance that can be measured and reported with a 99 percent confidence that the analyte concentration in greater than zero; "PQL" practical quantitation limits; "MRL" minimum reporting level, smallest measured concentration that can be reliably measured using a given analytical method; "RLDQC" reporting limit by DQCALC, is greater than or equal to two times the DLDQC; "SSLC" sample-specific critical level, the calculated and reported value is below which the radiochemistry result is considered a non-detect; and "SSMDC" sample-specific minimum detectable concentration, a reporting level that varies for each sample and is primarily used in radiochemical analyses.
samp_type_cd: 1-digit sample type code that identifies the QA type of a sample. The codes and their meanings are: "H" samples collected over a period of time (composite); "2" samples prepared from a reference material where none of the analytes of interest are present in detectable quantities (blank); "6" reference material that provide the baseline analytical results for comparison with reference samples; "7" replicate samples; and "9" sample taken from the environment (regular).
meth_cd: method code, the codes are documented in the "NWIS Method Code Dictionary"
result_cm_tx: comment about the water quality result
date_time: date and time the sample was collected. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).
sample_cd: unique identifier for the water sample. The sample code is a concatenation of the date_time and site_no variables.
comments: comments pertaining to changes applied after the records were obtained from NWIS.
rep_pair_no: number used to identify replicate samples. Replicates are identified by matching a replicate sample (samp_type_cd eqaul to 7) with its corresponding environmental sample (samp_type_cd equal to 9).

Source

Data obtained from the NWIS-QWDATA database on June 11, 2019 using the QWDATA system (U.S. Geological Survey, 2019).

References

U.S. Geological Survey, 2019, National Water Information System—Water-Quality System (QWDATA) data retrieval program.

Examples

site_no <- "433002113021701"  # well RWMC PROD
parm_cd <- "32102"            # carbon tetrachloride
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- samples[samples$site_no == site_no & samples$parm_cd == parm_cd,
             c("sample_dt", "result_va")]
ylab <- parameters[parameters$parm_cd == parm_cd, "parameter_nm"]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
                   type = "p", pch = 19, seq.date.by = "year",
                   center.date.labels = TRUE)
str(samples)

Descriptive Site Information

Description

Information for sampling sites in the U.S. Geological Survey (USGS) aquifer monitoring networks, Idaho National Laboratory and vicinity, Idaho.

Usage

sites

Format

A SpatialPointsDataFrame of the sp package with 182 records and 21 variables:

site_no: USGS site number for monitoring well
station_nm: USGS site name
coord_meth_cd: latitude/longitude coordinate method code. The codes and their meanings are: "D" differentially corrected Global Positioning System (GPS), "G" mapping grade GPS unit (handheld accuracy range 3.7 to 12.2 meters), "L" long range navigation system, "M" interpolated from topographic map, and "S" transit, theodolite, or other surveying method.
coord_acy_cd: accuracy code for latitude/longitude values. The codes and their meanings are: "H" accurate to +/- 0.1 second, "5" accurate to +/- 0.5 second, "S" accurate to +/- 1 second, and "F" accurate to +/- 5 seconds.
alt_va: elevation of the land surface reference point, in feet above the North American Vertical Datum of 1988.
alt_meth_cd: method code for measuring elevation. The codes and their meanings are: "D" differentially corrected global positioning system, "L" level or other surveyed method, and "M" interpolated from topographic map.
alt_acy_va: accuracy of the elevation value (alt_va), does not account for vertical datum shift.
construction_dt: date the well was completed.
huc_cd: hydrologic unit code (HUC). Hydrologic units are geographic areas representing part or all of a surface drainage basin or distinct hydrologic feature and are delineated on the Hydrologic Unit Map.
reliability_cd: reliability code for data available for the site. The codes and their meanings are: "C" data have been checked by the reporting agency, and "U" unchecked data.
nat_aqfr_cd: national aquifer codes
aqfr_cd: aquifer codes defined by the catalog of aquifer names and geologic unit codes used by the Water Mission Area.
aqfr_type_cd: aquifer type code. The codes and their meanings are: "C" confined single aquifer, "M" confined multiple aquifers, "U" unconfined single aquifer, and "X" mixed (confined and unconfined) multiple aquifers.
well_depth_va: depth of the finished well, in feet below the land surface datum
hole_depth_va: total depth of the borehole, in feet below the land surface datum
depth_src_cd: source code for depth measurements. The codes and their meanings are: "A" reported by another government agency, "D" from driller's log or report, "G" private geologist-consultant or university associate, "L" interpreted from geophysical logs by personnel of source agency, "O" reported by owner of well, "R" reported by person other than the owner, driller, or another government agency, and "S" measured by personnel of reporting agency.
site_nm: local site name
completion_cd: borehole completion code. The codes and their meanings are: "O" open hole completion, "M" multilevel completion, and "P" open hole completion prior to multilevel completion.
network: aquifer monitoring network, either "aquifer" or "perched".
pos: a position specifier for site-labels on a map. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the site coordinates.

See projection dataset for coordinate reference system information.

Source

USGS site data acquired from the NWIS (U.S. Geological Survey, 2020).

References

U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.

Examples

inlmisc::PlotMap(sites, dms.tick = TRUE)
sp::plot(sites, add = TRUE)
str(sites@data)

Rivers and Streams

Description

Stream segments in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

streams

Format

A SpatialLinesDataFrame of the sp package with 197 features of 14 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.

Examples

inlmisc::PlotMap(streams, dms.tick = TRUE,
                 rivers = list(streams, "lwd" = 1))
str(streams@data)

Land-Surface Topography

Description

Land-surface topography and shaded relief in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

topo

Format

A RasterStack of the raster package with 2 layers. Grid cells in the "elevation" layer represent land-surface elevations in meters above the North American Vertical Datum of 1988 (NAVD 88). Grid cells in the "hillshade" layer represent shaded relief calculated from the slope and aspect of land-surface elevations. The spatial grid is composed of 900 rows and 900 columns, and has cell sizes that are constant at 100 meters by 100 meters. See projection dataset for coordinate reference system information.

Source

The National Map (TNM) 1/3-arc-second DEM (Gesch, 2007; Gesch and others, 2002), accessed on August 4, 2020. This dataset can be downloaded in a Esri ArcGRID â„¢ format using TNM Download. Elevation datasets are distributed in geographic coordinates in units of decimal degrees, and in conformance with the NAD 83. Elevation values are in meters above the NAVD 88. The west, east, south, and north bounding coordinates for this dataset are -114, -112, 43, and 45 decimal degrees, respectively.

References

Gesch, D.B., 2007, The National Elevation Dataset, in Maune, D., ed., Digital Elevation Model Technologies and Applications—The DEM Users Manual, 2nd ed.: Bethesda, Maryland, American Society for Photogrammetry and Remote Sensing, p. 99–118.

Gesch, D., Oimoen, M., Greenlee, S., Nelson, C., Steuck, M., and Tyler, D., 2002, The National Elevation Dataset: Photogrammetric Engineering and Remote Sensing, v. 68, no. 1, p. 5–11.

Examples

inlmisc::PlotMap(topo[["elevation"]], bg.image = topo[["hillshade"]],
                 pal = inlmisc::GetColors(scheme = "dem3", alpha = 0.5),
                 dms.tick = TRUE, useRaster = TRUE)

Functions

Functions in the R-package ObsNetQW may be accessed from the R terminal by running the following command:

library("ObsNetQW")

The help documentation for each function is given below.

Describe Sampling Frequency

Description

Suggests an adjective that best describes a sampling frequency.

Usage

DescribeFrequency(x, frac = 0.2)

Arguments

`x`	numeric or difftime vector. Sampling frequencies; if numeric, specified in days.
`frac`	numeric. Fraction to move cutoffs to ensure that sampling frequencies fall within a sensible frequency.

Details

Invoking the DescribeFrequency() command without an x argument returns a data frame with the frequency values adjusted for frac.

Value

A character vector of frequency descriptors.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

x <- c(0.1, 0.8, 1, 6, 6.6, 7.6, 15, 27, 28, 31, 370)
data.frame(x, "description" = DescribeFrequency(x))
DescribeFrequency()

data.frame(x, "description" = DescribeFrequency(x, frac = 0.50))
DescribeFrequency(frac = 0.50)

DescribeFrequency(as.difftime(c(1, NA, 10, 100, 1000),
  units = "weeks"
))

Fit Local Regression Curve

Description

Uses locally weighted scatterplot smoothing (LOESS) to fit a polynomial to time-series data. LOESS is a non-parametric regression method originally proposed by Cleveland (1979) and further developed by Cleveland and Devlin (1988). The smooth curve fitted through the data is known as the Loess curve, where each smoothed value is given by a weighted linear least squares regression of the span. Values returned from this function include the points on the Loess curve, their standard error and confidence interval.

Usage

FitLoessCurve(
  samples,
  site_no = NULL,
  parm_cd = NULL,
  Fun = stats::median,
  mult = 0.5,
  sdate = NA,
  edate = NA,
  by = "month",
  conf = 0.9,
  span = NULL,
  degree = 1,
  family = c("symmetric", "gaussian"),
  ...,
  surface = c("direct", "interpolate")
)

Arguments

`samples`	data.frame. Water-quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `remark_cd`, and `result_va`; see `samples` dataset for variable descriptions.
`site_no`	character vector. Site numbers to include in the analysis; all sites are included by default.
`parm_cd`	character vector. Parameter codes to include in the analysis; all parameters are included by default.
`Fun`	function. Function used to aggregate daily values. By default, the `stats::median` function is applied to values reported on the same sample date.
`mult`	numeric. Multiplier for censored values (`remark_cd` equal to `"<"`).
`sdate, edate`	Date or character. Start and end date of prediction interval, respectively. Required date format is YYYY-MM-DD.
`by`	number, difftime, or character string. Increment of the sequence of values with which to predict data points on the Loess curve, monthly by default.
`conf`	numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve.
`span`	numeric. Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit. Larger values give more smoothness. Specify as `NULL` to automate the selection of the smoothing parameter using the bias-corrected Akaike information criterion (Hurvich and others, 1998). Estimated values are restricted to lie with the range of values from 0.2 to 0.9.
`degree`	numeric. Degree of the polynomials to use, first degree by default.
`family`	character. Indicates the method used for fitting. By default, a re-descending M estimator is used with Tukey's biweight function.
`...`	Arguments passed to the `stats::loess` function.
`surface`	character. Indicates how the fitted surface is calculated, exactly by default.

Details

Smoothed values are given by a weighted linear least squares regression over the span. The polynomial surface is fit using the stats::loess function.

Value

A data.frame with the following components:

site_no: site number
parm_cd: parameter code
obs: observations in the unevenly-spaced time series stored as a matrix with columns x y, and cens. Where x is the sample date, y is the measured value, and cens indicates whether the measurement is censored.
pred: predicted points on the Loess curve. Where x and y are coordinate values, se is the estimated standard error for predicted value, lcl and ucl are the lower and upper confidence interval for predicted value, respectively.
span: is the smoothing parameter.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Cleveland, W.S., 1979, Robust locally weighted regression and smoothing scatterplots: J. American Statistical Association, v. 74, p. 829–836.

Cleveland, W.S., Devlin, S.J., 1988, Locally-weighted regression—an approach to regression analysis by local fitting: Journal of the American Statistical Association, v. 83, no. 403, p. 596–610.

Hurvich, C.M., Simonoff, J.S., and Tsai, C.L., 1998, Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion: Journal of the Royal Statistical Society, Series B, vol. 60, no. 2, p. 271–293.

Examples

site_no <- "432307112583101" # well USGS 124
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- FitLoessCurve(inldata::samples, site_no, parm_cd)

is <- d$site_no == site_no & d$parm_cd == parm_cd[2]
obs <- d[is, "obs"][[1]][, 1:2]
inlmisc::PlotGraph(obs,
  ylab = "Tritium, in picocuries per liter",
  type = "p", pch = 3, col = "purple", xpd = TRUE
)
pred <- d[is, "pred"][[1]]
graphics::matpoints(pred$x, pred[, c("y", "lcl", "ucl")],
  pch = 19,
  col = c("black", "red", "red")
)

Fit Survival Regression Model

Description

Fits a parametric survival regression model to time series data. It is a wrapper around the survival::survreg function. Before running this function, ensure that the data are appropriate for survival analysis; the FlagTemporalData function is provided for this purpose.

Usage

FitSurvivalModel(
  samples,
  parm_cd = NULL,
  dl = NULL,
  conf = 0.95,
  dist = "lognormal",
  alpha = 0.05,
  ...
)

Arguments

`samples`	data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `remark_cd`, `result_va`, and `lab_std_dev_va`; see `samples` dataset for variable descriptions.
`parm_cd`	character vector. Parameter codes; all parameters are included by default.
`dl`	data.frame. Laboratory detection limits; see `dl` dataset for table format.
`conf`	numeric number. Confidence level expressed as a fraction, used to calculate the confidence interval for a measured value (`result_va`) using its laboratory standard deviation value (`lab_std_dev_va`).
`dist`	character. Assumed distribution for the response variable of the regression model (`result_va`). Distribution types include `"lognormal"` (default), `"logistic"`, `"loglogistic"`, `"weibull"`, `"exponential"`, and `"gaussian"`.
`alpha`	numeric. Level of significance, a threshold for p-value's.
`...`	Arguments passed to the `survival::survreg.control` function.

Details

The regression model requires data values that are greater than zero. To satisfy this requirement, an interval-censored value with a lower bound less than or equal to zero (or a detection limit) is represented as left-censored.

Value

A list object with survreg components. Components are named using a concatenation of variables specified in the subset_by argument. The returned object includes a diagnostic_messages attribute that contains a list object with diagnostic messages that result from regression problems, and a data attribute that contains a data.frame object with the data fit by the model.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

parm_cd <- "07000" # tritium
is <- inldata::samples$sample_dt > as.Date("1989-01-01")
models <- FitSurvivalModel(inldata::samples[is, ],
  parm_cd = parm_cd, dl = inldata::dl
)

site_no <- "433204112562001" # well CFA 1
model <- models[[paste(site_no, parm_cd)]]
summary(model)

d <- attr(models, "data")
d <- d[d$site_no == site_no & d$parm_cd == parm_cd, ]
m <- as.matrix(d$surv)
is <- m[, "status"] %in% 1:2
m[is, "time2"] <- m[is, "time1"]
m[m[, "status"] == 2, "time1"] <- -Inf
inlmisc::PlotGraph(d$sample_dt, m[, 1:2], type = "i", col = "blue")
x <- seq(par("usr")[1], par("usr")[2], length.out = 100)
newdata <- list(x)
names(newdata) <- colnames(model$x)[2]
y <- stats::predict(model, newdata,
  type = "quantile",
  p = c(0.025, 0.5, 0.975)
)
graphics::matlines(x, y, col = "red", lty = c(2, 1, 2))
coefs <- stats::coef(model)
fmt <- "y = exp(%sx + %s),\nwhere x is number of days since 1970-01-01"
txt <- sprintf(fmt, signif(coefs[2], 2), signif(coefs[1], 2))
graphics::mtext(txt, line = -2)

Fit Variability Model

Description

Fits a model of variability (standard deviation as a function of concentration) to replicate data. Replicates are defined as two or more water samples that are collected, prepared, and analyzed such that they are considered to be essentially identical in composition and analysis. The variability of replicate data is defined as the random error in independent measurements as the result of repeated application of the measurement process under identical conditions (Mueller and others, 2015). A model of variability is created for each analyte. Variability is a function of concentration and estimates of variability are developed for discrete ranges of concentration.

Usage

FitVariabilityModel(samples, parm_cd = NULL, h = 1)

Arguments

`samples`	data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `result_va`, `samp_type_cd`, and `rep_pair_no`; see `samples` dataset for variable descriptions.
`parm_cd`	character vector. Parameter codes to include in the analysis; all parameters are included by default.
`h`	integer count. The minimal segment size in the low or high range either given as the minimal number of valid replicate pairs in each segment.

Details

A two-range model, as described by Mueller and Titus (2005) and Mueller and others (2015, p. 32–34), is used to estimate standard deviation as a function of concentration. The two-range model splits the replicate data into two pieces: a low concentration range for which standard deviation is about constant, and a high concentration range for which the relative standard deviation (RSD) is about constant. For concentrations within the low range, variability is estimated as the average standard deviation of replicates; within the high range, variability is estimated as the average RSD. Data subsetting was approached by identifying abrupt structural changes (breakpoints) in the variance of replicates sorted by their average concentration. Optimal breakpoints were calculated using the algorithm described by Bai and Perron (2003); see strucchange::breakpoints function for R implementation. These breakpoints define the end-points of the replicate-data subsets.

Value

A list object with function components. Each component defines a parameter's model of variability, that is, the standard deviation as a function of concentration, or the measurement error if a alpha argument is specified. The U.S. Geological Survey (USGS) parameter code (parm_cd) is assigned to the component name. The returned object includes 3 attributes: (1) replicates, a list with replicate plotting coordinates, see returned value from grDevices::xy.coords function for component format. (2) subsets, a data.frame with the following variables:

parm_cd: USGS 5-digit parameter code;
subset_min, subset_max: minimum and maximum concentrations of replicate subset, in parameter units;
nreplicates: number of replicates;
ave_sd: mean standard deviation, in parameter units; and
ave_rsd: mean relative standard deviation, in percent.

(3) data, a list with data.frame components. Each component stores the data used to create a parameter's variability model.

Note

Replicate sets with nondetection or negative values in one or more samples are removed.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Bai, J., and Perron, P., 2003, Computation and Analysis of Multiple Structural Change Models: Journal of Applied Econometrics, v. 18, p. 1–22.

Mueller, D.K., Schertz, T.L., Martin, J.D., and Sandstrom, M.W., 2015, Design, analysis, and interpretation of field quality-control data for water-sampling projects: U.S. Geological Survey Techniques and Methods, book 4, chap. C4, 54 p., https://doi.org/10.3133/tm4C4.

Mueller, D.K., and Titus, C.J., 2005, Quality of nutrient data from streams and ground water sampled during water years 1992–2001: U.S. Geological Survey Scientific Investigations Report 2005–5106, 27 p., https://pubs.usgs.gov/sir/2005/5106/.

Examples

parm_cd <- "00930" # sodium
models <- FitVariabilityModel(inldata::samples, parm_cd)

FUN <- models[[parm_cd]] # model of variability
conc <- c(20, 50) # measured concentration
stdev <- FUN(conc) # standard deviation
error <- FUN(conc, alpha = 0.05) # measurement error

xy <- attr(models, "replicates")[[parm_cd]] # replicate data
inlmisc::PlotGraph(xy$x, xy$y,
  xlab = "Mean concentration",
  ylab = "Standard deviation", type = "p",
  pch = 21
)
x <- seq(par("usr")[1], par("usr")[2], length.out = 1000)
graphics::lines(x, y = FUN(x), col = "red", lwd = 1.2)

str(attr(models, "subsets"))

Flag Temporal Data for Regression Analysis

Description

Identifies data records that are suitable for temporal regression analysis. Data preprocessing can ensure that at least one continuous record block exists in each unevenly-spaced time series. A continuous record block adheres to the following limits: (1) a minimum duration for the length of time that continuous reliable observations are available; (2) a minimum number of unique observation dates; and (3) a minimum temporal spacing between consecutive observations.

Usage

FlagTemporalData(
  samples,
  subset_by = c("site_no", "parm_cd"),
  remark_cd = c("", "<", "E"),
  samp_type_cd = c("9", "7", "A"),
  sdate = NA,
  edate = NA,
  min_por = NULL,
  max_gap = NULL,
  min_ndates = NULL,
  site_no = NULL,
  parm_cd = NULL,
  robust = FALSE
)

Arguments

`samples`	data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include `sample_dt`, `parm_cd`, `remark_cd`, `result_va`, and `samp_type_cd`.
`subset_by`	character or integer vector. Variables (column names or numbers) in `samples` that should be used to subset a time series within the temporal data set.
`remark_cd`	character vector. Remark codes. Only records with the specified remark codes will be included in the data set.
`samp_type_cd`	character vector. Sample type codes. Only records with the specified sample type codes will be included in the data set.
`sdate, edate`	Date or character. Start and end date for the period of interest, respectively. Required date format is YYYY-MM-DD. The default value, `NA`, indicates that the earliest (or latest) date should be used.
`min_por`	numeric. Minimum duration for a continuous record block, in days.
`max_gap`	numeric. Maximum days between consecutive observations in a continuous record block.
`min_ndates`	integer. Minimum number of unique observation dates in a continuous record block.
`site_no`	character vector. Site numbers; all sites are included by default.
`parm_cd`	character vector. Parameter codes; all parameters are included by default.
`robust`	logical flag. If true, displays additional information.

Value

A logical vector of length equal to the number of records (rows) in the samples data table. Values equal to TRUE indicate a valid record that adheres to the criteria specified by the user.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

site_no <- "433144112563501" # well CRA 2
is <- FlagTemporalData(inldata::samples,
  min_por = udunits2::ud.convert(15, "year", "day"),
  max_gap = udunits2::ud.convert(5, "year", "day"),
  min_ndates = 15, site_no = site_no, robust = TRUE
)
d <- droplevels(inldata::samples[is, , drop = FALSE])
sprintf(
  "Removed %s out of %s records (%s%%)",
  nrow(inldata::samples) - nrow(d), nrow(inldata::samples),
  signif((nrow(inldata::samples) - nrow(d)) /
    nrow(inldata::samples) * 100, 2)
)

parm_cd <- "00940" # chloride
plot(d[d$parm_cd == parm_cd, c("sample_dt", "result_va")])

Get Summary Statistics for Survival Model

Description

Retrieves selected statistics from a censored regression object.

Usage

GetSurvivalStats(model)

Arguments

model

survreg. A fitted parametric survival model, see survival::survreg.object for component descriptions. The survreg object must include component x, a vector of response times; see x argument in the survival::survreg function for the flag used to include this component (not included by default).

Value

A named numeric vector containing the following information:

n: number of observations
incr: long-term monotonic trend, estimated using the fractional increase in the response variable over a unit of measure for x (1-day).
rmse: root-mean-squared error, also known as residual standard error.
pval: p-value, statistical significance of the regression model.
r2_mkz: McKelvey and Zavoina's (1973) pseudo-R^2.
r2_mf: McFadden's (1975) pseudo-R^2.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

McFadden, D., 1973, Conditional logit analysis of qualitative choice behavior: p. 105–142 in Zarembka, P. (ed.), Frontiers in Econometrics. New York, NY, Academic Press.

McKelvey, R.D., and Zavoina, W., 1975, A statistical model for the analysis of ordinal dependent variables: The Journal of Mathematical Sociology, v. 4, no. 1, p. 103–120.

Examples

d <- inldata::samples
d <- d[
  d$site_no == "433322112564301" & # well USGS 38
    d$parm_cd == "01030" & # chromium
    d$sample_dt > as.Date("1989-01-01"),
  c("sample_dt", "result_va", "remark_cd")
]
t1 <- t2 <- d$result_va
t1[d$remark_cd == "<"] <- -Inf
d$surv <- I(survival::Surv(t1, t2, type = "interval2"))
model <- survival::survreg(surv ~ sample_dt, d,
  dist = "lognormal",
  x = TRUE
)
summary(model)
print(out <- GetSurvivalStats(model))
sprintf(
  "Percent change in concentration per year is %2.1f",
  100 * abs(out["incr"]) * 365
)

Krige Surface Map from Spatial Points

Description

Estimates a surface map of a feature (such as a contaminant concentration) across an area using a kriging interpolation method. Sample point data is averaged over time, with nondetection values set equal to zero. The sample variogram is fit using a Matern class model (Matern, 1960; Stein, 1999).

Usage

KrigeSurfaceMap(
  samples,
  sites,
  parm_cd = NULL,
  sdate = NA,
  edate = NA,
  resolution = NULL,
  newdata = NULL,
  Trans = NULL,
  BackTrans = NULL,
  type = c("QK", "OK"),
  kappa = seq(0.1, 10, by = 0.1),
  cutoff = NULL,
  nbins = 20,
  cloud = FALSE,
  nmax = Inf,
  maxdist = Inf,
  alpha = NULL,
  minarea = NULL,
  nfold = length(sites),
  nreps = 1,
  seed = NULL,
  minr2 = NULL,
  benchmarks = NULL,
  performance = FALSE,
  parallel = FALSE,
  robust = FALSE,
  old = NULL,
  ...
)

Arguments

`samples`	data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `remark_cd`, and `result_va`; see `samples` dataset for variable descriptions.
`sites`	SpatialPointsDataFrame. Site coordinates with `site_no` and `site_nm` a required variable, see `sites` dataset for details.
`parm_cd`	character vector. Parameter codes; all parameters are included by default.
`sdate, edate`	Date or character. Start and end date, respectively, for the period over which measured values are described using their average value. Required date format is YYYY-MM-DD. The default value, `NA`, indicates that the earliest (or latest) date should be used.
`resolution`	numeric vector of length 1 or 2. Resolution (see `raster::res` function) of grid cells in the area of interpolation, defined by the convex hull of the observation points in `sites`.
`newdata`	SpatialPoints*. Prediction locations for kriging (optional). Used to bypass the default grid prediction locations.
`Trans`	function. Transformation applied to the data, such as the `log1p` function.
`BackTrans`	function. Back-transformation applied to kriging results, such as the `expm1` function.
`type`	character. Type of regression method, specify `"QK"` for quantile kriging (Journel and Deutsch, 1997; Juang and others, 2001) and `"OK"` for ordinary kriging (Clark, 1997; Kitanidis, 1997; Goovaerts, 1997).
`kappa`	numeric vector. A sequence of kappa values that will be searched for optimal Matern variogram model fit. Kappa is a shape parameter that controls the smoothness of the random field. A value of 0.5 results in the exponential model, while a smoothness of infinity is exactly the Gaussian model. The Matern model typically begins to look and behave as the Gaussian model for values greater than 10. Small values are considered ‘smooth’, while larger values describe a ‘rough’ random field.
`cutoff`	numeric. Spatial separation distance up to which data point pairs are included in empirical variance estimates; by default, set equal to the length of the diagonal of the box spanning the data divided by three.
`nbins`	integer. Number of bins (distance intervals) within the `cutoff` distance into which data point pairs are grouped for empirical variance estimates.
`cloud`	logical flag. Whether to calculate the variogram cloud.
`nmax`	integer. For local kriging: the number of nearest observations that should be used for a kriging prediction or simulation.
`maxdist`	numeric. For local kriging: the search radius that defines the local neighborhood; if combined with the `nmax` argument, both criteria apply.
`alpha`	numeric. Level of significance, a threshold for p-value's calculated using Moran's I test. Evaluates whether the spatial distribution of feature values is the result of random spatial processes. Models with p-value's greater than `alpha` are rejected. Requires that the ape package is available.
`minarea`	numeric. Minimum area for the kriging observations specified as a fraction of the area inside the convex hull of the kriging surface. Models are rejected if less than this fraction. Useful for removing models with excessive extrapolation.
`nfold`	integer. Number of folds to apply in k-fold cross validation (CV). A value greater than equal to the number of observation points results in leave-one-out CV. The CV method is used to determine the function type for the variogram model, and calculate performance metrics for the kriging model.
`nreps`	integer. Number of times to repeat cross validation. Final results are based on the median performance metric from all repetitions.
`seed`	integer. Random number generator state for random number generation, used to replicate the CV results.
`minr2`	numeric vector of length 1 or 2, value is recycled as necessary. Minimum value for R^2, between 0 and 1. Values applied separately to variogram and kriging analysis. Models are rejected if R^2 is less than the minimum value.
`benchmarks`	data.frame. Benchmarks, such as a selected hazardous threshold. A byproduct of quantile kriging, used to calculate the probability of a predicted value being greater than the benchmark value. The required variables include `parm_cd` and `mcl`, see `benchmarks` dataset for variable descriptions.
`performance`	logical flag. Whether to calculate performance metrics for the kriging model.
`parallel`	logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work.
`robust`	logical flag. Whether to display output from variogram fitting and kriging. Only available if the `parallel` argument is false.
`old`	list. Output from a past run of the `KrigeSurfaceMap()` command. This is an optional way to provide the prediction locations and variogram models.
`...`	Not used

Value

A list with list components, one for each parameter kriged, and named after their respective parameter code (see parm_cd argument). Each list component has the following components:

kr_output: results from the kriging interpolation, that is, the prediction (var1.pred), standard error (var1.se), and kriging variance (var1.var). Where the kriging variance remains in transformed space. If using quantile kriging, the following variables are also returned: interval_range, the interval range, or the deviation between upper and lower bounds in units of measurement; and probability, the estimated probability that a predicted value is higher than the benchmark value.
kr_cv: kriging model prediction performance metrics derived from CV. Metrics include: me, mean error (ideally 0); mse, mean squared error (ideally small); rmse, root mean squared error (ideally small); nrmse, normalized root mean squared error (ideally small); mae, mean absolute error (ideally small); cor, correlation between observed and predicted values (ideally 1); and r2, coefficient of determination, or R^2 (ideally 1).
vg_cloud: variogram cloud.
vg_sample: sample variogram.
vg_model: fitted variogram model.
vg_summary: a summary of the variogram model with list components: model, is the model type; nugget effect, represents a discontinuity of the model at the origin; sill, is the variance when the model either reaches or becomes asymptotic to a constant value as lag distance increases; range, controls the rate of increase with distance; effective_range, is the effective range, that is the distance at which the variance value achieves 95 percent of the sill; kappa, is a shape parameter that controls the smoothness of the random field that models the spatial variability of the observed data; sserr, is the sum of squared error (ideally 0); and r2, is the coefficient of determination (ideally 1).
obs: observation points used in the prediction.

Note

Consider the discussion by Kitanidis (1997, p. 71, section 3.7) before kriging with a moving neighborhood using the nmax and maxdist arguments.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Clark, Isobel, 1979, Practical Geostatistics: London, Applied Science Publishers, 119 p.

Goovaerts, Pierre, 1997, Geostatistics for Natural Resource Evaluation: New York, Oxford University Press, 483 p.

Journel, A.G., and Deutsch, C.V., 1997, Rank order geostatistics: a proposal for a unique coding and common processing of diverse data, In Geostatistics Wollongton '96, Volume 1, ed. E.Y. Baafi and N.A. Schofield, Dordrecht, Boston: Kluwer Academic, p. 174–187.

Juang, K.-W., Lee, D.-Y., and Ellsworth, T.R., 2001, Using rank-order geostatistics for spatial interpolation of highly skewed data in a heavy-metal contaminated site: Journal of Environmental Quality, v. 30, no. 3, p. 894–903.

Kitanidis, P.K., 1997, Introduction to geostatistics—Applications to hydrogeology: New York, Cambridge University Press, 249 p.

Matern, B. 1960, Spatial variation—stochastic models and their applications to some problems in forest survey sampling investigations: Report of the Forest Research Institute of Sweden, v. 49, no. 5, 144 p.

Stein, M.L., 1999, Interpolation of spatial data—Some theory for kriging: Springer, New York, 247 p.

Examples

parm_cd <- c(chloride = "00940", tritium = "07000")
models <- KrigeSurfaceMap(inldata::samples, inldata::sites,
  parm_cd = parm_cd, sdate = "1989-01-01",
  resolution = 500, cutoff = 10 * 1609,
  parallel = TRUE
)
names(models)

# Plot kriging results for each parameter
xlim <- sp::bbexpand(sp::bbox(inldata::sites)[1, ], 0.04)
ylim <- sp::bbexpand(sp::bbox(inldata::sites)[2, ], 0.04)
for (cd in names(models)) {
  srsname <- inldata::parameters[cd, "srsname"]
  x <- models[[cd]]
  try(gridExtra::grid.arrange(sp::bubble(x$obs,
    maxsize = 1,
    main = srsname
  ),
  gstat:::plot.gstatVariogram(
    x$vg_sample,
    x$vg_model
  ),
  sp::spplot(x$kr_output, "var1.pred",
    main = "Predicted",
    xlim = xlim, ylim = ylim
  ),
  sp::spplot(x$kr_output, "var1.se",
    main = "Standard Error",
    xlim = xlim, ylim = ylim
  ),
  ncol = 2
  ))
  prompt <- sprintf("%s (%s), press key to continue...", srsname, cd)
  if (interactive()) invisible(readline(prompt))
}

Perform Spatial Optimization Analysis

Description

Identifies an optimal subset of wells to exclude from an existing long-term monitoring network because they provide the least amount of information. A distributed multiple-population genetic algorithm (GA) is used to preserve spatial accuracy and long-term temporal trends in the reduced monitoring network.

Usage

OptimizeSpatial(
  k,
  sites,
  samples,
  site_no = NULL,
  parm_cd = NULL,
  arg = list(),
  surv = NULL,
  loess = NULL,
  w1 = 1,
  w2 = 1,
  w3 = 1,
  w4 = 1,
  penalty = NULL,
  exclude = NULL,
  suggestions = NULL,
  ...
)

Arguments

`k`	integer. Number of sites to exclude from the existing monitoring network.
`sites`	SpatialPointsDataFrame. Site coordinates with `site_no` and `site_nm` a required variable, see `sites` dataset for details.
`samples`	data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `remark_cd`, `result_va`, `samp_type_cd`, and `rep_pair_no`; see `samples` dataset for variable descriptions.
`site_no`	character vector. Site numbers to include in the analysis; all sites are included by default.
`parm_cd`	character vector. Parameter codes to include in the analysis; all parameters are included by default.
`arg`	list. Additional arguments passed to the `KrigeSurfaceMap` function.
`surv`	list. Models from survival-regression analysis, the resulting object from running the `FitSurvivalModel` function (optional).
`loess`	data.frame. Predicted points from local-regression analysis, the resulting object from running the `FitLoessCurve` function (optional).
`w1, w2, w3, w4`	numeric number. Weighting factor applied to objective function, see ‘Details’ section for objective descriptions.
`penalty`	numeric number. Penalty to apply to infeasible sampling schemes that result in unestimated points in the interpolation domain using kriging.
`exclude`	character vector. Site numbers to exclude from the decision space but include in the kriging analysis.
`suggestions`	matrix. Monitoring networks to be included as initial guesses. See returned `solution` component for a suggested value for this argument.
`...`	GA control parameters passed to the `inlmisc::FindOptimalSubset` function.

Details

The GA::gaisl function (Scrucca, 2013, 2017) is used to search for an optimal subset of sites to remove from a existing monitoring network because they add little or no beneficial information. The number of sites to remove is fixed and specified by the user using the k argument. The performance of a network design is measured with respect to multiple objectives, each characterizes a contribution to the overall value (or fitness) of the network design. The design problem is formulated as a single-objective optimization using a weighted sum approach. More specifically, an islands GA method is used to minimize a fitness function that comprises a set of weighted objective functions. Weights are assigned to individual objective functions to establish the relative influence of each objective and (or) resolve conflicting objectives. Objectives are described as follows:

Minimize the normalized root-mean-square deviation between kriged concentrations estimated using the existing and reduced monitoring networks.
Decrease spatial interpolation errors by minimizing the ratio between the sum kriging variance estimated for the reduced and existing monitoring networks.
Preserve significant long-term monotonic temporal trends in wells by minimizing the maximum fractional change per year estimated from survival-regression analysis, in wells removed from the existing monitoring network.
Preserve continuous record blocks that have the largest variability by minimizing the maximum normalized range of concentrations estimated from local-regression analysis, in wells removed from the existing monitoring network.

Value

A list object with components: call, solution, ga_output, and ga_time, (see returned value from inlmisc::FindOptimalSubset function); rm_sites, a vector of site numbers indicating the wells that have been optimally selected for removal; fitness, a data table of weighted fitness values for individual objectives in the fitness function. The fitness component has the following variables:

parm_cd: USGS 5-digit parameter code;
epoch: interval between migrations in the islands GA run; and
w1f1, w2f2, w3f3, w4f4: weighted fitness value for individual objective functions.

The fitness component has a "weights" attribute that contains the weights assigned to individual objective functions. Fitness values can be displayed using the PlotFitness function.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Scrucca, Luca, 2013, GA: A Package for Genetic Algorithms in R: Journal of Statistical Software, v. 53, no. 4, p. 1–37.

Scrucca, Luca, 2017, On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution: The R Journal, v. 9, no. 1, 187–206.

Examples

site_no <- with(inldata::sites@data, site_no[network == "aquifer"])
parm_cd <- "07000" # tritium
arg <- list("sdate" = "1989-01-01", "cutoff" = 16093, "nbins" = 20)
exclude <- with(inldata::sites@data, site_no[completion_cd != "O"])
ans <- OptimizeSpatial(5, inldata::sites, inldata::samples,
  site_no, parm_cd, arg,
  w1 = 10, w4 = 0.1,
  exclude = exclude, run = 5, maxiter = 100,
  numIslands = 4, parallel = TRUE, seed = 123
)
print(ans)

print(m <- ans[["solution"]])
locations <- inldata::sites[m[1, ], ]
print(locations@data[, c("site_no", "site_nm")])
print(ans$fitness)
print(attr(ans$fitness, "weights"))
PlotFitness(ans$fitness, type = "objectives")

inlmisc::PlotMap(locations,
  dms.tick = TRUE,
  roads = list(inldata::roads),
  lakes = list(inldata::lakes),
  rivers = list(inldata::streams),
  scale.loc = "bottomleft"
)
points(inldata::sites, pch = 19, col = "black", xpd = FALSE)
points(locations, pch = 19, col = "red")
raster::text(locations,
  labels = locations@data$site_nm,
  cex = 0.7, pos = 4, halo = TRUE
)

plot(ans[["ga_output"]])
summary(ans[["ga_output"]])

Perform Temporal Optimization Analysis

Description

Provides a method for analyzing the temporal spacing of observations. It determines if fewer observations could have been used to characterize analyte concentrations at a sampling site over time. An iterative thinning approach, based on an algorithm described by Cameron (2004), is used to identify an optimal sampling interval that may be used for future long-term monitoring, one that reduces temporal redundancy. Before running this function, ensure that the data are appropriate for temporal redundancy analysis; the FlagTemporalData function is provided for this purpose.

Usage

OptimizeTemporal(
  samples,
  site_no = NULL,
  parm_cd = NULL,
  span = NULL,
  by = NULL,
  n = 200,
  conf = 0.9,
  nsim = 500,
  threshold = 0.8,
  tol = 0.05,
  Fun = stats::median,
  mult = 0.5,
  save_mc = FALSE,
  parallel = FALSE,
  monitor = FALSE,
  seed = NULL
)

Arguments

`samples`	data.frame. Water-quality data from analysis of environmental and quality-control samples. The required variables include `site_no`, `sample_dt`, `parm_cd`, `remark_cd`, and `result_va`; see `samples` dataset for variable descriptions.
`site_no`	character vector. Site numbers to include in the analysis; all sites are included by default.
`parm_cd`	character vector. Parameter codes to include in the analysis; all parameters are included by default.
`span`	numeric. Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit. Larger values give more smoothness. Specify as `NULL` to automate the selection of the smoothing parameter using the bias-corrected Akaike information criterion (Hurvich and others, 1998). Estimated values are restricted to lie with the range of values from 0.2 to 0.9.
`by`	number, difftime, or string. Increment of the predicted points over the period of record. Specify as either a number in days, object of class `difftime`, or character string containing one of "day", "week", "month", "quarter" or "year".
`n`	integer. Number of predicted points uniformly distributed over the period of record. Only used if the `by` argument is not specified.
`conf`	numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve.
`nsim`	integer. Number of data-thinning simulations to perform during each iteration of the bisection method. For each simulation, a different set of randomly selected observations is removed from the full-data set and a new Loess curve is computed.
`threshold`	numeric. Fraction of predicted points on the Loess curve fitted to the reduced-data set, that must fall within the confidence band computed from the full-data set.
`tol`	numeric. Error tolerance for the bisection method; if crossed, and the regression is robust (see `threshold` argument), the solution is considered optimal.
`Fun`	function. Function used to aggregate daily values. By default, the `stats::median` function is applied to values reported on the same sample date.
`mult`	numeric. Multiplier for censored values (`remark_cd` equal to `"<"`).
`save_mc`	logical flag. Whether to return the data-thinning simulations.
`parallel`	logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work.
`monitor`	logical flag. If true, displays output after each iteration of the bisection method. Monitoring is only available if the `parallel` argument is false.
`seed`	integer. Random number generator state for random number generation, used to replicate the results. The doRNG package must be installed if using parallel computing.

Details

The median date gap (that is, the median number of days between consecutive and unique sampling dates) is first calculated and used as the historic sampling interval. The iterative thinning algorithm (Cameron, 2004) computes a confidence band around a Loess curve fitted to the observations; see FitLoessCurve function for details. A fraction of the observations is removed from the full-data set and a new Loess curve is fitted to the reduced-data set. The ability of the reduced-data set to reproduce the historical trend is evaluated by calculating the fraction of the new Loess curve fitted to the reduced-data set that falls within the confidence band established using the full-data set. Increasing numbers of observations are removed from the full-data set and evaluated for their ability to reproduce the historical trend. A data-thinning method is used to safeguard against irregular trends that may arise from the selection of a single set of observations to remove. The optimal fraction of observations to remove, while still reproducing the historical trend, is computed using the bisection method. Finally, an optimal sampling interval, that minimizes temporal redundancy, is expressed as the historic sampling interval divided by one minus the optimal fraction to remove.

Value

A data.frame with the following components:

site_no: site number
parm_cd: parameter code
obs: observations in the unevenly-spaced time series stored as a matrix with columns x y, and cens. Where x is the sample date, y is the measured value, and cens indicates whether the measurement is censored.
frac_cens: fraction of censored measurements.
ndates: number of unique sampling dates
longest_gap: longest number of days between consecutive samples.
freq: historic sampling interval, in days, estimated as the median date gap.
pred: Loess curve and confidence band stored as a matrix with columns x, y, se, lcl, and ucl. See returned value from FitLoessCurve function for column descriptions.
reduction: largest fraction of observations that can be removed while preserving the temporal trend of the full-data set.
mc: Loess curve for each data-thinning simulation stored as a n by nsim matrix of predicted values. This component is only available if the save_mc argument is specified as true.
opt_freq: optimal sampling interval, in days.
freq_desc: adjective describing the historic sampling frequency; see DescribeFrequency function for details.
opt_freq_desc: adjective describing the optimal sampling frequency.

The returned object includes an elapsed_time attribute, an object of class difftime giving the total time taken to execute the function.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Cameron, K., 2004, Better optimization of long-term monitoring networks: Bioremediation Journal, v. 8, no. 3–4, p. 89–107.

Examples

site_no <- "432019112563201" # well USGS 14 MV-61
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- OptimizeTemporal(inldata::samples, site_no, parm_cd,
  by = "month", save_mc = TRUE,
  monitor = TRUE, seed = 123
)
print(d)
print(attr(d, "elapsed_time"))

i <- 1L # row index
obs <- d$obs[[i]][, 1:2]
graphics::plot(obs,
  type = "n", xlab = "Julian day",
  ylab = d$parm_cd[i], main = d$site_no[i]
)
x <- d$pred[[i]][, "x"]
graphics::matlines(x, d$mc[[i]], lty = 1, lwd = 0.5, col = "gray")
graphics::matlines(x, d$pred[[i]][, c("lcl", "ucl")],
  lty = 2,
  lwd = 2, col = 2
)
graphics::lines(x, rowMeans(d$mc[[i]], na.rm = TRUE),
  lwd = 2, col = 3
)
graphics::points(obs)

Plot Fitness Values

Description

Visualize the performance of the sampling network optimization problem by graphing fitness values across generations.

Usage

PlotFitness(
  fitness,
  type = c("monitor", "objectives", "parameters"),
  ylim = NULL,
  show_main = TRUE,
  digits = getOption("digits")
)

Arguments

`fitness`	data.frame. See `"fitness"` component returned from the `OptimizeSpatial` function.
`type`	character. Plot type, possible types are `"monitor"`, `"objectives"`, and `"parameters"`.
`ylim`	numeric vector of length 2. Minimum and maximum values for the y-axis.
`show_main`	character. Whether to show the main title.
`digits`	integer. Number of decimal places used when formatting the range of the best fitness value.

Value

Invisible NULL

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Processing

An optimization analysis is performed when building the vignette documents of the R-package ObsNetQW. Prior to re-building these vignettes, be aware of the long run times (on the order of weeks) associated with the spatial component of the optimization analysis. First create a local copy of the ObsNetQW package source material (stored in a remote repository) using the following command:

git2r::clone("https://code.usgs.gov/inl/obsnetqw.git", "obsnetqw")

The package vignettes are located under obsnetqw/inst/doc of the working directory. To re-build the vignettes from their source files, run:

inlmisc::BuildVignettes("obsnetqw")

Reproducibility

R packages get updated over time, so future versions of dependent packages may result in errors. To address this, it may be necessary to recreate an identical R environment where all packages are consistent with package versions used in this report. Version information about R, the operating system, and attached and loaded packages at the time of processing is as follows:

R version 4.0.5 (2021-03-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:

graphics
grDevices
grid
methods
parallel
stats
tools
utils

other attached packages:

alphahull(v.2.2)
ape(v.5.5)
checkmate(v.2.0.0)
data.table(v.1.14.0)
dataRetrieval(v.2.7.7)
dichromat(v.2.0-0)
doParallel(v.1.0.16)
doRNG(v.1.8.2)
dplyr(v.1.0.5)
foreach(v.1.5.1)
GA(v.3.2.1)
git2r(v.0.28.0)
gridExtra(v.2.3)
gstat(v.2.0-7)
htmltools(v.0.5.1.1)
htmlwidgets(v.1.5.3)
igraph(v.1.2.6)
inldata(v.1.0.3)
inlmisc(v.0.5.2)
iterators(v.1.0.13)
knitr(v.1.33)
leaflet(v.2.0.4.1)
leaflet.extras(v.1.0.0)
maptools(v.1.1-1)
pander(v.0.6.3)
png(v.0.1-7)
raster(v.3.4-10)
remotes(v.2.3.0)
rgdal(v.1.5-23)
rgeos(v.0.5-5)
rmarkdown(v.2.7)
roxygen2(v.7.1.1)
scales(v.1.1.1)
sp(v.1.4-5)
strucchange(v.1.5-2)
survival(v.3.2-11)
testthat(v.3.0.2)
tinytex(v.0.31)
udunits2(v.0.13)
webshot(v.0.5.2)
wordcloud2(v.0.2.1)
xtable(v.1.8-4)
yaml(v.2.2.1)

loaded via a namespace (and not attached):

Rcpp(v.1.0.6)
pillar(v.1.6.0)
bslib(v.0.2.4)
compiler(v.4.0.5)
jquerylib(v.0.1.4)
digest(v.0.6.27)
jsonlite(v.1.7.2)
evaluate(v.0.14)
lifecycle(v.1.0.0)
tibble(v.3.1.1)
lattice(v.0.20-44)
pkgconfig(v.2.0.3)
ObsNetQW(v.1.0.0)
rlang(v.0.4.11)
crosstalk(v.1.1.1)
xfun(v.0.22)
stringr(v.1.4.0)
generics(v.0.1.0)
sass(v.0.3.1)
vctrs(v.0.3.8)
datasets(v.4.0.5)
tidyselect(v.1.1.1)
glue(v.1.4.2)
base(v.4.0.5)
R6(v.2.5.0)
fansi(v.0.4.2)
purrr(v.0.3.4)
magrittr(v.2.0.1)
backports(v.1.2.1)
codetools(v.0.2-18)
ellipsis(v.0.3.2)
sandwich(v.3.0-0)
utf8(v.1.2.1)
stringi(v.1.5.3)
crayon(v.1.4.1)
zoo(v.1.8-9)

For reproducibility purposes, Microsoft® R Application Network (MRAN) stores daily snapshots of the CRAN packages and R releases. Use the May 5, 2021 snapshot to recreate the R environment used in this report. To do so, download and install R version 4.0.5 (2021-03-31) from the MRAN and use the following commands to recreate the R environment:

install.packages("remotes")
repos <- "https://cran.microsoft.com/snapshot/2021-05-05/"
update.packages(repos = repos)
remotes::install_gitlab("inl/obsnetqw@v1.0.0",
  host = "code.usgs.gov",
  dependencies = TRUE,
  repos = repos,
)

The optimization analysis was performed on a 4 cores Intel® Xeon® CPU E5-1620 v3 running at 3.50GHz and with 32GB of RAM.

Appendix 2. Software User Manual

Introduction

Software

R

LaTeX

Pandoc

Datasets

Background Concentrations

Description

Usage

Format

References

Examples

Benchmark Concentrations

Description

Usage

Format

Source

References

Examples

Cities and Towns

Description

Usage

Format

Source

Examples

County Boundaries

Description

Usage

Format

Source

Examples

Laboratory Detection Limits

Description

Usage

Format

Source

References

Examples

Eastern Snake River Plain Boundary

Description

Usage

Format

Source

Examples

Idaho National Laboratory Facilities

Description

Usage

Format

Source

Examples

Groundwater Level Measurements

Description

Usage

Format

Source

References

Examples

State of Idaho Boundary

Description

Usage

Format

Source

Examples

Idaho National Laboratory Boundary

Description

Usage

Format

Source

Examples

Map Labels

Description

Usage

Format

Source

Examples

Lakes and Ponds

Description

Usage

Format