Note: Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Introduction

R is a software environment for statistical computing and graphics. In R, the primary mechanism for sharing with others is the package. Packages are collections of computer code, data, and documentation in a well-defined format. Datasets pertaining to the United States Geological Survey-Idaho National Laboratory (USGS-INL) water-quality and water-level aquifer monitoring networks are stored in an R package named inldata. And computer code (functions) used to analyze this data are bundled together in an R package named ObsNetQW. This document is a vignette in the ObsNetQW package and provides an overview of the software used to optimize the USGS-INL water-quality aquifer monitoring network. A package vignette is a fully reproducible document that combines narrative text and R code to produce static (PDF) and dynamic (HTML) output formats. The R code is run when the vignette is built, and data analysis output (figures and tables) are created extemporaneously and inserted into the final document. Small chunks of stylized code are shown in this vignette and are intended to be used interactively in an R terminal. These code chunks may be useful for testing, development, and validation purposes. This document provides (1) software installation instructions, (2) dataset and function documentation, (3) processing instructions for performing the optimization analysis, and (4) guidance on recreating the R environment used when compiling this report.

Software

Software used in this report are all available as free and open-source software (FOSS), can be run on multiple platforms (Windows, OS X, Linux), and do not require administrative privileges to install or run.

R

The R software environment can be obtained from the Comprehensive R Archive Network (CRAN). And the capabilities of R may be extended by installing an assorted group of user-contributed R packages available on CRAN and the U.S. Geological Survey R Archive Network (GRAN). Typing the following commands into an R terminal (console) will install the latest versions of the R packages used in this report.

update.packages()
remotes::install_gitlab("inl/obsnetqw",
                        host = "code.usgs.gov",
                        dependencies = TRUE)

LaTeX

To create PDF vignette documents with R requires LaTeX, a document preparation system for typesetting. The LaTeX distribution used in this report is TinyTeX. To install TinyTeX, first uninstall any other LaTeX distribution that may be installed, and then run the following command in the R terminal:

tinytex::install_tinytex()

The capabilities of LaTeX may be extended by installing an assorted group of user-contributed LaTeX packages available on the Comprehensive TeX Archive Network (CTAN). To install additional packages that were used in this report and not included in the default installation of TinyTeX, run the following command:

inlmisc:::InstallLatexPackages()

Pandoc

To create HTML vignette documents with R requires pandoc, a universal document converter. Instructions for installing pandoc are available for multiple platforms.

Datasets

Datasets in the R-package inldata may be accessed from the R terminal by running the following command:

library("inldata")

The help documentation for each dataset is given below.

Background Concentrations

Description

Water-quality background concentrations for selected radionuclides, organic compounds, and chemical constituents analyzed for in water from the Eastern Snake River Plain aquifer at and near the Idaho National Laboratory (INL). Background concentrations are either naturally occurring or anthropogenic, and are not influenced by waste and wastewater disposal at the INL (Bartholomay and Hall, 2016, p. 3).

Usage

background

Format

A data frame with 73 records and 4 variables:

parm_cd

U.S. Geological Survey 5-digit parameter code, see parameters dataset for details.

bkgrd_min, bkgrd_max

minimum and maximum limits of background concentration

reference

source of background concentration limits. Reference citations are: "Bartholomay and Hall (2016)", "Knobel and others (1992)", "Michel (1989)", and "Orr and others (1991)".

Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.

References

Bartholomay, R.C., and Hall, L.F., 2016, Evaluation of background concentrations of selected chemical and radiochemical constituents in groundwater in the eastern Snake River Plain aquifer at and near the Idaho National Laboratory, Idaho: U.S. Geological Survey Scientific Investigations Report 2016–5056, (DOE/ID–22237), 19 p., https://doi.org/10.3133/sir20165056.

Knobel, L.L., Orr, B.R., and Cecil, L.D., 1992, Summary of background concentrations of selected radiochemical and chemical constituents in groundwater from the Snake River Plain aquifer, Idaho: estimated from an analysis of previously published data: Journal of the Idaho Academy of Science, v. 28, no. 1, p. 48–61.

Michel, R.L., 1989, Tritium deposition in the continental United States, 1953–83: U.S. Geological Survey Water Resources Investigations Report 89–4072, 46 p., https://doi.org/10.3133/wri894072.

Orr, B.R., Cecil, L.D., and Knobel, L.L., 1991, Background concentrations of selected radionuclides, organic compounds, and chemical constituents in ground water in the vicinity of the Idaho National Engineering Laboratory: U.S. Geological Survey Water-Resources Investigations Report 91–4015 (DOE/ID–22094), 52 p., https://doi.org/10.3133/wri914015.

Examples

str(background)
rownames(background)

Benchmark Concentrations

Description

Water-quality benchmark concentrations of selected radionuclides, organic compounds, and chemical constituents. Benchmarks include: the U.S. Environmental Protection Agency (USEPA) Maximum Contaminant Levels (MCLs) and Human Health Benchmarks for Pesticides (HHBPs), and the U.S. Geological Survey (USGS) Health-Based Screening Levels (HBSLs).

Usage

benchmarks

Format

A data frame with 74 records and 9 variables:

parm_cd

USGS 5-digit parameter code, see parameters dataset for details.

mcl

USEPA MCLs.

hhbp_noncancer

USEPA Chronic Noncancer HHBPs

hhbp_cancer_min

USEPA Carcinogenic HHBPs for a one-in-one million cancer risk.

hhbp_cancer_max

USEPA Carcinogenic HHBPs for a one-in-ten thousand cancer risk.

hbsl_noncancer

USGS Noncancer HBSLs

hbsl_cancer_min

USGS Cancer HBSLs for a one-in-one million cancer risk.

hbsl_cancer_max

USGS Cancer HBSLs for a one-in-ten thousand cancer risk.

remark

comments

Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.

Source

Many of the water-quality benchmarks were accessed from the U.S. Geological Survey Health-Based Screening Levels database, accessed on August 8, 2017, from https://water.usgs.gov/water-resources/hbsl/. Benchmarks for total Trihalomethanes, Tritium, and Strontium-90 were provided by the U.S. Environmental Protection Agency (2015). Note that MCL benchmark values reported in millirem per year were substituted with a 50 picocuries per liter screening level.

References

U.S. Environmental Protection Agency, 2015, Protection of environment—Code of Federal Regulations 40, Part 141, Subpart G, National Primary Drinking Water Regulations, Maximum Contaminant Levels and Maximum Residual Disinfectant Levels: Washington, D.C., Office of the Federal Register, National Archives and Records Administration.

Examples

str(benchmarks)
rownames(benchmarks)

Cities and Towns

Description

Cities and towns (populated places) in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

cities

Format

A SpatialPointsDataFrame of the sp package with 24 records and 16 variables. See projection dataset for coordinate reference system information.

Source

U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.

Examples

inlmisc::PlotMap(cities, dms.tick = TRUE)
sp::plot(cities, pch = 19, add = TRUE)
raster::text(cities, cities@data$NAME,
             pos = 1, cex = 0.6)
str(cities@data)

County Boundaries

Description

Boundaries of counties in the vicinity of Idaho National Laboratory, eastern Idaho, as of January 1, 2015.

Usage

counties

Format

A SpatialLines of the sp package with 45 features. See projection dataset for coordinate reference system information.

Source

U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.

Examples

inlmisc::PlotMap(counties, dms.tick = TRUE)
sp::plot(counties, lty = 5, add = TRUE)

Laboratory Detection Limits

Description

Analytical method detection limits of selected radionuclides, which are based on laboratory procedures.

Usage

dl

Format

A data frame with 8 records and 6 variables:

srsname

Substance Registry Services (SRS) name.

parm_cd

U.S. Geological Survey 5-digit parameter code.

parameter_units

parameter units

lab_det_lim_va

laboratory detection limit concentration

sdate

date that the detection limit went into effect.

reference

source of detection limit.

Source

Detection limits reported by: Bartholomay and others (2003, table 9); Bartholomay and others (2014, table D1); and Bodnar and Percival (1982).

References

Bartholomay, R.C., Knobel, L.L., and Rousseau, J.P., 2003, Field methods and quality-assurance plan for quality-of-water activities, U.S. Geological Survey, Idaho National Engineering and Environmental Laboratory, Idaho: U.S. Geological Survey Open-File Report 03–42 (DOE/ID–22182), 45 p. https://doi.org/10.3133/ofr0342.

Bartholomay, R.C., Maimer, N.V., and Wehnke, A.J., 2014, Field methods and quality-assurance plan for water-quality activities and water-level measurements, U.S. Geological Survey, Idaho National Laboratory, Idaho: U.S. Geological Survey Open-File Report 2014–1146 (DOE/ID–22230), 64 p. https://pubs.usgs.gov/of/2014/1146/.

Bodnar, L.Z., and Percival, D.R., eds., 1982, Analytical Chemistry Branch procedures manual—Radiological and Environmental Sciences Laboratory: U.S. Department of Energy Report IDO–12096 [variously paged].

Examples

str(dl)

Eastern Snake River Plain Boundary

Description

Boundary of the eastern Snake River Plain, Idaho.

Usage

esrp

Format

A SpatialPolygonsDataFrame of the sp package object with 1 feature and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(esrp, dms.tick = TRUE)
sp::plot(esrp, col = "red", add = TRUE)
str(esrp@data)

Idaho National Laboratory Facilities

Description

Federal research facilities at the Idaho National Laboratory (INL).

Usage

facilities

Format

A SpatialPolygonsDataFrame object with 7 features and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(facilities, dms.tick = TRUE)
sp::plot(facilities, col = "red", add = TRUE)
raster::text(facilities, facilities@data$NAME,
             cex = 0.6, pos = 1)
str(facilities@data)

Groundwater Level Measurements

Description

Groundwater level measurements in wells in the U.S. Geological Survey (USGS) water-level aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (NWIS) (U.S. Geological Survey, 2020).

Usage

gwl

Format

A data frame with 54,417 records and 5 variables:

site_no

USGS site number

lev_dt

date the water level was measured.

lev_va

water level value, in feet below land surface. A water-level elevation may be calculated by subtracting the depth-to-water (lev_va) from the land-surface elevation (alt_va in the sites dataset).

lev_status_cd

a code indicating the status of the site at the time the water level was measured. The codes and their meanings are: "A" water level was affected by atmospheric pressure, "D" the site was dry (no water level is recorded), "N" the measurement was discontinued, "O" an obstruction was encountered in the well (no water level was recorded), "P" the site was being pumped, "R" the site had been pumped recently, "S" a nearby site that taps the same aquifer was being pumped, and "X" the water level was affected by stage in nearby surface-water site.

date_time

date and time the water level was measured. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).

Source

Data obtained from the NWIS database (U.S. Geological Survey, 2020).

References

U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.

Examples

site_no <- "432700112470801"  # well USGS 1
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- gwl[gwl$site_no == site_no, c("lev_dt", "lev_va")]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
ylab <- sprintf("Water level, in %s below land surface",
                c("feet", "meters"))
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
                   type = "p", pch = 19, seq.date.by = "year",
                   conversion.factor = 0.3048,
                   center.date.labels = TRUE)
str(gwl)

alt_va <- sites@data[sites@data$site_no == site_no, "alt_va"]
y <- alt_va - d$lev_va
ylab <- sprintf("Water level, in %s above sea level",
                c("feet", "meters"))
inlmisc::PlotGraph(d$lev_dt, y, ylab = ylab, main = main,
                   xlim = xlim, type = "p", pch = 19,
                   seq.date.by = "year",
                   conversion.factor = 0.3048,
                   center.date.labels = TRUE)

State of Idaho Boundary

Description

Simplified representation of the boundary of Idaho, a state in the northwestern region of the United States.

Usage

idaho

Format

A SpatialPolygons of the sp package with 1 feature. See projection dataset for coordinate reference system information.

Source

U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.

Examples

inlmisc::PlotMap(idaho, dms.tick = TRUE)
sp::plot(idaho, col = "red", add = TRUE)

Idaho National Laboratory Boundary

Description

Boundary of the Idaho National Laboratory (INL).

Usage

inl

Format

A SpatialPolygonsDataFrame of the sp package object with 1 feature and 4 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(inl, dms.tick = TRUE)
sp::plot(inl, col = "red", add = TRUE)
str(inl@data)

Map Labels

Description

Map labels in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

labels

Format

A SpatialPointsDataFrame of the sp package with 50 features and 7 variables:

labels

text to be written

cex

character expansion factor

col, font

color and font to be used, respectively.

srt

string rotation in degrees.

halo

if true, a "halo" is printed around text.

map

map index

See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(labels, dms.tick = TRUE)
lab <- labels[labels$map == 1L, names(labels) != "map"]
for (i in seq_along(lab))
  do.call(raster::text, c(lab[i, ], as.list(lab@data[i, ])))
str(labels@data)

Lakes and Ponds

Description

Perennial lakes and ponds in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

lakes

Format

A SpatialPolygonsDataFrame of the sp package with 155 features of 12 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.

Examples

inlmisc::PlotMap(lakes, dms.tick = TRUE,
                 lakes = list(lakes, "lwd" = 1))
str(lakes@data)

Mountain Ranges and Buttes

Description

Simplified representation of the mountain ranges and buttes in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

mountains

Format

A SpatialPolygonsDataFrame of the sp package with 17 features and 1 variable. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey Idaho National Laboratory Project Office

Examples

inlmisc::PlotMap(mountains, dms.tick = TRUE)
sp::plot(mountains, col = "gray", add = TRUE)
str(mountains@data)

Parameter Information for Analytes

Description

Parameter code information for selected chemical constituents, organic compounds, and radionuclides measured for in water samples collected from wells in the U.S. Geological Survey (USGS) water-quality aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho.

Usage

parameters

Format

A data frame with 74 records and 7 variables:

parm_cd

USGS 5-digit parameter code

parameter_group_nm

parameter group name, such as "Radiochemical"

parameter_nm

long parameter name, such as "Strontium-90, water, unfiltered, picocuries per liter"

casrn

Chemical Abstracts Service (CAS) registry number, such as "10098-97-2" for Strontium-90

srsname

Substance Registry Services (SRS) name, such as "Strontium-90"

parameter_units

parameter units. Unit abbreviations and descriptions are: "mg/L" milligrams per liter, "mg/L as N" milligrams per liter as Nitrogen, "ug/L" microgram per liter, and "pCi/L" picocuries per liter.

siunitx

parameter units formatted for LaTeX using the siunitx package notation.

Source

USGS water data acquired from the National Water Information System (U.S. Geological Survey, 2019). The SRS name (srsname) for "Trihalomethanes (four), total, from SDWA NPDWR" was shorten to its preferred acronym "TTHM4".

References

U.S. Geological Survey, 2019, National Water Information System—web services, accessed June 11, 2019, from https://doi.org/10.5066/F7P55KJN.

Examples

str(parameters)

Coordinate Reference System

Description

PROJ.4 string defining a custom coordinate reference system (CRS) used by the U.S. Geological Survey Idaho National Laboratory Project Office. The CRS is based on the following attributes: Albers equal-area conic projection; latitude of first and second standard parallel is 42.83 and 44.16 decimal degrees, respectively; latitude and longitude of false origin is 41.5 and -113 decimal degrees, respectively; easting and northing of false origin is 200,000 and 0 meters, respectively; Clarke (1966) reference ellipsoid; North American Datum of 1983; and units of meters.

Usage

projection

Format

A character string describing the CRS in PROJ.4 format.

Examples

crs <- sp::CRS(projection)  # convert to class "CRS"
print(crs)

Primary and Secondary Roads

Description

Primary and secondary roads in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

roads

Format

A SpatialLinesDataFrame of the sp package with 1,616 features and 5 variables. See projection dataset for coordinate reference system information.

Source

U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.

Examples

inlmisc::PlotMap(roads, dms.tick = TRUE,
                 roads = list(roads, "lwd" = 1))
sp::plot(roads[roads@data$PRISEC, ], col = "red", add = TRUE)
str(roads@data)

Water-Quality Data Records

Description

Water-quality data from laboratory analyses of groundwater samples collected from wells in the U.S. Geological Survey (USGS) water-quality monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (U.S. Geological Survey, 2019).

Usage

samples

Format

A data frame with 159,554 records and 19 variables:

site_no

USGS site identification number

sample_dt

date sample was collected

medium_cd

3-digit medium code that identifies the material type and quality assurance type of the sample. The codes and their meanings are: "WG" water below land surface contained in the saturated zone (groundwater); and "WGQ" groundwater quality-control (QC) sample.

db_no

2-digit NWIS database number. The codes and their meanings are: "01" is the environmental database, and "10" is the quality-assurance (QA) database.

anl_ent_cd

analyzing entity code of the organizational unit that performed the sample analysis used to obtain the result.

parm_cd

USGS 5-digit parameter code. For example, the parameter code for Tritium is "07000".

remark_cd

1-digit remark code (result level) used to qualify the parameter value. The codes and their meanings are: "" quantified value. "<" actual value is known to be less than the value reported, that is, the measured concentration is below the reporting limit (RL) and represented as a censored (or nondetection) value. For censored values, the value reported is the RL. "E" value is estimated, that is, the actual value is greater than the minimum detection limit (MDL) and less than the laboratory reporting level (LRL). "R" nondetect, result less than sample-specific critical level. "U" material specifically analyzed for but not detected. "V" value affected by contamination.

result_va

parameter value; see parameter_units variable in the parameters dataset for the units of measurement.

lab_std_dev_va

laboratory standard deviation (SD). For radiochemical data, SD is determined from the counting error. Prior to January 1, 2018, counting error was reported as two SD, therefore, these values were divied by 2.

dqi_cd

1-digit data quality indicator code that indicates the review status of a result. The codes and their meanings are: "A" historical data, "S" presumed satisfactory, and "R" reviewed and accepted.

rpt_lev_va

laboratory reporting limit in effect for the parameter and method at the time the measurement was made.

rpt_lev_cd

reporting level code that identifies the analytical reporting level appropriate for the analytical method. The codes and their meanings are: "DLBLK" detection limit by blank data; "DLDQC" detection limit by DQCALC, lowest concentration that with 90 percent confidence will be exceeded no more than 1 percent of the time when a blank sample is measured; "IRL" interim reporting level, a temporary reporting level; "LRL" laboratory reporting level, equal to twice the yearly-determined LT-MDL; "LT-MDL" long-term method detection limit, a detection level derived by determining the standard deviation of a minimum of 24 MDL spike sample measurements over an extended period of time; "MDL" method detection limit, minimum concentration of a substance that can be measured and reported with a 99 percent confidence that the analyte concentration in greater than zero; "PQL" practical quantitation limits; "MRL" minimum reporting level, smallest measured concentration that can be reliably measured using a given analytical method; "RLDQC" reporting limit by DQCALC, is greater than or equal to two times the DLDQC; "SSLC" sample-specific critical level, the calculated and reported value is below which the radiochemistry result is considered a non-detect; and "SSMDC" sample-specific minimum detectable concentration, a reporting level that varies for each sample and is primarily used in radiochemical analyses.

samp_type_cd

1-digit sample type code that identifies the QA type of a sample. The codes and their meanings are: "H" samples collected over a period of time (composite); "2" samples prepared from a reference material where none of the analytes of interest are present in detectable quantities (blank); "6" reference material that provide the baseline analytical results for comparison with reference samples; "7" replicate samples; and "9" sample taken from the environment (regular).

meth_cd

method code, the codes are documented in the "NWIS Method Code Dictionary"

result_cm_tx

comment about the water quality result

date_time

date and time the sample was collected. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).

sample_cd

unique identifier for the water sample. The sample code is a concatenation of the date_time and site_no variables.

comments

comments pertaining to changes applied after the records were obtained from NWIS.

rep_pair_no

number used to identify replicate samples. Replicates are identified by matching a replicate sample (samp_type_cd eqaul to 7) with its corresponding environmental sample (samp_type_cd equal to 9).

Source

Data obtained from the NWIS-QWDATA database on June 11, 2019 using the QWDATA system (U.S. Geological Survey, 2019).

References

U.S. Geological Survey, 2019, National Water Information System—Water-Quality System (QWDATA) data retrieval program.

Examples

site_no <- "433002113021701"  # well RWMC PROD
parm_cd <- "32102"            # carbon tetrachloride
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- samples[samples$site_no == site_no & samples$parm_cd == parm_cd,
             c("sample_dt", "result_va")]
ylab <- parameters[parameters$parm_cd == parm_cd, "parameter_nm"]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
                   type = "p", pch = 19, seq.date.by = "year",
                   center.date.labels = TRUE)
str(samples)

Descriptive Site Information

Description

Information for sampling sites in the U.S. Geological Survey (USGS) aquifer monitoring networks, Idaho National Laboratory and vicinity, Idaho.

Usage

sites

Format

A SpatialPointsDataFrame of the sp package with 182 records and 21 variables:

site_no

USGS site number for monitoring well

station_nm

USGS site name

coord_meth_cd

latitude/longitude coordinate method code. The codes and their meanings are: "D" differentially corrected Global Positioning System (GPS), "G" mapping grade GPS unit (handheld accuracy range 3.7 to 12.2 meters), "L" long range navigation system, "M" interpolated from topographic map, and "S" transit, theodolite, or other surveying method.

coord_acy_cd

accuracy code for latitude/longitude values. The codes and their meanings are: "H" accurate to +/- 0.1 second, "5" accurate to +/- 0.5 second, "S" accurate to +/- 1 second, and "F" accurate to +/- 5 seconds.

alt_va

elevation of the land surface reference point, in feet above the North American Vertical Datum of 1988.

alt_meth_cd

method code for measuring elevation. The codes and their meanings are: "D" differentially corrected global positioning system, "L" level or other surveyed method, and "M" interpolated from topographic map.

alt_acy_va

accuracy of the elevation value (alt_va), does not account for vertical datum shift.

construction_dt

date the well was completed.

huc_cd

hydrologic unit code (HUC). Hydrologic units are geographic areas representing part or all of a surface drainage basin or distinct hydrologic feature and are delineated on the Hydrologic Unit Map.

reliability_cd

reliability code for data available for the site. The codes and their meanings are: "C" data have been checked by the reporting agency, and "U" unchecked data.

nat_aqfr_cd

national aquifer codes

aqfr_cd

aquifer codes defined by the catalog of aquifer names and geologic unit codes used by the Water Mission Area.

aqfr_type_cd

aquifer type code. The codes and their meanings are: "C" confined single aquifer, "M" confined multiple aquifers, "U" unconfined single aquifer, and "X" mixed (confined and unconfined) multiple aquifers.

well_depth_va

depth of the finished well, in feet below the land surface datum

hole_depth_va

total depth of the borehole, in feet below the land surface datum

depth_src_cd

source code for depth measurements. The codes and their meanings are: "A" reported by another government agency, "D" from driller's log or report, "G" private geologist-consultant or university associate, "L" interpreted from geophysical logs by personnel of source agency, "O" reported by owner of well, "R" reported by person other than the owner, driller, or another government agency, and "S" measured by personnel of reporting agency.

site_nm

local site name

completion_cd

borehole completion code. The codes and their meanings are: "O" open hole completion, "M" multilevel completion, and "P" open hole completion prior to multilevel completion.

network

aquifer monitoring network, either "aquifer" or "perched".

pos

a position specifier for site-labels on a map. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the site coordinates.

See projection dataset for coordinate reference system information.

Source

USGS site data acquired from the NWIS (U.S. Geological Survey, 2020).

References

U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.

Examples

inlmisc::PlotMap(sites, dms.tick = TRUE)
sp::plot(sites, add = TRUE)
str(sites@data)

Rivers and Streams

Description

Stream segments in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

streams

Format

A SpatialLinesDataFrame of the sp package with 197 features of 14 variables. See projection dataset for coordinate reference system information.

Source

U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.

Examples

inlmisc::PlotMap(streams, dms.tick = TRUE,
                 rivers = list(streams, "lwd" = 1))
str(streams@data)

Land-Surface Topography

Description

Land-surface topography and shaded relief in the vicinity of Idaho National Laboratory, eastern Idaho.

Usage

topo

Format

A RasterStack of the raster package with 2 layers. Grid cells in the "elevation" layer represent land-surface elevations in meters above the North American Vertical Datum of 1988 (NAVD 88). Grid cells in the "hillshade" layer represent shaded relief calculated from the slope and aspect of land-surface elevations. The spatial grid is composed of 900 rows and 900 columns, and has cell sizes that are constant at 100 meters by 100 meters. See projection dataset for coordinate reference system information.

Source

The National Map (TNM) 1/3-arc-second DEM (Gesch, 2007; Gesch and others, 2002), accessed on August 4, 2020. This dataset can be downloaded in a Esri ArcGRID â„¢ format using TNM Download. Elevation datasets are distributed in geographic coordinates in units of decimal degrees, and in conformance with the NAD 83. Elevation values are in meters above the NAVD 88. The west, east, south, and north bounding coordinates for this dataset are -114, -112, 43, and 45 decimal degrees, respectively.

References

Gesch, D.B., 2007, The National Elevation Dataset, in Maune, D., ed., Digital Elevation Model Technologies and Applications—The DEM Users Manual, 2nd ed.: Bethesda, Maryland, American Society for Photogrammetry and Remote Sensing, p. 99–118.

Gesch, D., Oimoen, M., Greenlee, S., Nelson, C., Steuck, M., and Tyler, D., 2002, The National Elevation Dataset: Photogrammetric Engineering and Remote Sensing, v. 68, no. 1, p. 5–11.

Examples

inlmisc::PlotMap(topo[["elevation"]], bg.image = topo[["hillshade"]],
                 pal = inlmisc::GetColors(scheme = "dem3", alpha = 0.5),
                 dms.tick = TRUE, useRaster = TRUE)

Functions

Functions in the R-package ObsNetQW may be accessed from the R terminal by running the following command:

library("ObsNetQW")

The help documentation for each function is given below.

Describe Sampling Frequency

Description

Suggests an adjective that best describes a sampling frequency.

Usage

DescribeFrequency(x, frac = 0.2)

Arguments

x

numeric or difftime vector. Sampling frequencies; if numeric, specified in days.

frac

numeric. Fraction to move cutoffs to ensure that sampling frequencies fall within a sensible frequency.

Details

Invoking the DescribeFrequency() command without an x argument returns a data frame with the frequency values adjusted for frac.

Value

A character vector of frequency descriptors.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

x <- c(0.1, 0.8, 1, 6, 6.6, 7.6, 15, 27, 28, 31, 370)
data.frame(x, "description" = DescribeFrequency(x))
DescribeFrequency()

data.frame(x, "description" = DescribeFrequency(x, frac = 0.50))
DescribeFrequency(frac = 0.50)

DescribeFrequency(as.difftime(c(1, NA, 10, 100, 1000),
  units = "weeks"
))

Fit Local Regression Curve

Description

Uses locally weighted scatterplot smoothing (LOESS) to fit a polynomial to time-series data. LOESS is a non-parametric regression method originally proposed by Cleveland (1979) and further developed by Cleveland and Devlin (1988). The smooth curve fitted through the data is known as the Loess curve, where each smoothed value is given by a weighted linear least squares regression of the span. Values returned from this function include the points on the Loess curve, their standard error and confidence interval.

Usage

FitLoessCurve(
  samples,
  site_no = NULL,
  parm_cd = NULL,
  Fun = stats::median,
  mult = 0.5,
  sdate = NA,
  edate = NA,
  by = "month",
  conf = 0.9,
  span = NULL,
  degree = 1,
  family = c("symmetric", "gaussian"),
  ...,
  surface = c("direct", "interpolate")
)

Arguments

samples

data.frame. Water-quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, remark_cd, and result_va; see samples dataset for variable descriptions.

site_no

character vector. Site numbers to include in the analysis; all sites are included by default.

parm_cd

character vector. Parameter codes to include in the analysis; all parameters are included by default.

Fun

function. Function used to aggregate daily values. By default, the stats::median function is applied to values reported on the same sample date.

mult

numeric. Multiplier for censored values (remark_cd equal to "<").

sdate, edate

Date or character. Start and end date of prediction interval, respectively. Required date format is YYYY-MM-DD.

by

number, difftime, or character string. Increment of the sequence of values with which to predict data points on the Loess curve, monthly by default.

conf

numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve.

span

numeric. Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit. Larger values give more smoothness. Specify as NULL to automate the selection of the smoothing parameter using the bias-corrected Akaike information criterion (Hurvich and others, 1998). Estimated values are restricted to lie with the range of values from 0.2 to 0.9.

degree

numeric. Degree of the polynomials to use, first degree by default.

family

character. Indicates the method used for fitting. By default, a re-descending M estimator is used with Tukey's biweight function.

...

Arguments passed to the stats::loess function.

surface

character. Indicates how the fitted surface is calculated, exactly by default.

Details

Smoothed values are given by a weighted linear least squares regression over the span. The polynomial surface is fit using the stats::loess function.

Value

A data.frame with the following components:

site_no

site number

parm_cd

parameter code

obs

observations in the unevenly-spaced time series stored as a matrix with columns x y, and cens. Where x is the sample date, y is the measured value, and cens indicates whether the measurement is censored.

pred

predicted points on the Loess curve. Where x and y are coordinate values, se is the estimated standard error for predicted value, lcl and ucl are the lower and upper confidence interval for predicted value, respectively.

span

is the smoothing parameter.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Cleveland, W.S., 1979, Robust locally weighted regression and smoothing scatterplots: J. American Statistical Association, v. 74, p. 829–836.

Cleveland, W.S., Devlin, S.J., 1988, Locally-weighted regression—an approach to regression analysis by local fitting: Journal of the American Statistical Association, v. 83, no. 403, p. 596–610.

Hurvich, C.M., Simonoff, J.S., and Tsai, C.L., 1998, Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion: Journal of the Royal Statistical Society, Series B, vol. 60, no. 2, p. 271–293.

See Also

OptimizeTemporal, OptimizeSpatial

Examples

site_no <- "432307112583101" # well USGS 124
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- FitLoessCurve(inldata::samples, site_no, parm_cd)

is <- d$site_no == site_no & d$parm_cd == parm_cd[2]
obs <- d[is, "obs"][[1]][, 1:2]
inlmisc::PlotGraph(obs,
  ylab = "Tritium, in picocuries per liter",
  type = "p", pch = 3, col = "purple", xpd = TRUE
)
pred <- d[is, "pred"][[1]]
graphics::matpoints(pred$x, pred[, c("y", "lcl", "ucl")],
  pch = 19,
  col = c("black", "red", "red")
)

Fit Survival Regression Model

Description

Fits a parametric survival regression model to time series data. It is a wrapper around the survival::survreg function. Before running this function, ensure that the data are appropriate for survival analysis; the FlagTemporalData function is provided for this purpose.

Usage

FitSurvivalModel(
  samples,
  parm_cd = NULL,
  dl = NULL,
  conf = 0.95,
  dist = "lognormal",
  alpha = 0.05,
  ...
)

Arguments

samples

data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, remark_cd, result_va, and lab_std_dev_va; see samples dataset for variable descriptions.

parm_cd

character vector. Parameter codes; all parameters are included by default.

dl

data.frame. Laboratory detection limits; see dl dataset for table format.

conf

numeric number. Confidence level expressed as a fraction, used to calculate the confidence interval for a measured value (result_va) using its laboratory standard deviation value (lab_std_dev_va).

dist

character. Assumed distribution for the response variable of the regression model (result_va). Distribution types include "lognormal" (default), "logistic", "loglogistic", "weibull", "exponential", and "gaussian".

alpha

numeric. Level of significance, a threshold for p-value's.

...

Arguments passed to the survival::survreg.control function.

Details

The regression model requires data values that are greater than zero. To satisfy this requirement, an interval-censored value with a lower bound less than or equal to zero (or a detection limit) is represented as left-censored.

Value

A list object with survreg components. Components are named using a concatenation of variables specified in the subset_by argument. The returned object includes a diagnostic_messages attribute that contains a list object with diagnostic messages that result from regression problems, and a data attribute that contains a data.frame object with the data fit by the model.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

See Also

GetSurvivalStats

Examples

parm_cd <- "07000" # tritium
is <- inldata::samples$sample_dt > as.Date("1989-01-01")
models <- FitSurvivalModel(inldata::samples[is, ],
  parm_cd = parm_cd, dl = inldata::dl
)

site_no <- "433204112562001" # well CFA 1
model <- models[[paste(site_no, parm_cd)]]
summary(model)

d <- attr(models, "data")
d <- d[d$site_no == site_no & d$parm_cd == parm_cd, ]
m <- as.matrix(d$surv)
is <- m[, "status"] %in% 1:2
m[is, "time2"] <- m[is, "time1"]
m[m[, "status"] == 2, "time1"] <- -Inf
inlmisc::PlotGraph(d$sample_dt, m[, 1:2], type = "i", col = "blue")
x <- seq(par("usr")[1], par("usr")[2], length.out = 100)
newdata <- list(x)
names(newdata) <- colnames(model$x)[2]
y <- stats::predict(model, newdata,
  type = "quantile",
  p = c(0.025, 0.5, 0.975)
)
graphics::matlines(x, y, col = "red", lty = c(2, 1, 2))
coefs <- stats::coef(model)
fmt <- "y = exp(%sx + %s),\nwhere x is number of days since 1970-01-01"
txt <- sprintf(fmt, signif(coefs[2], 2), signif(coefs[1], 2))
graphics::mtext(txt, line = -2)

Fit Variability Model

Description

Fits a model of variability (standard deviation as a function of concentration) to replicate data. Replicates are defined as two or more water samples that are collected, prepared, and analyzed such that they are considered to be essentially identical in composition and analysis. The variability of replicate data is defined as the random error in independent measurements as the result of repeated application of the measurement process under identical conditions (Mueller and others, 2015). A model of variability is created for each analyte. Variability is a function of concentration and estimates of variability are developed for discrete ranges of concentration.

Usage

FitVariabilityModel(samples, parm_cd = NULL, h = 1)

Arguments

samples

data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, result_va, samp_type_cd, and rep_pair_no; see samples dataset for variable descriptions.

parm_cd

character vector. Parameter codes to include in the analysis; all parameters are included by default.

h

integer count. The minimal segment size in the low or high range either given as the minimal number of valid replicate pairs in each segment.

Details

A two-range model, as described by Mueller and Titus (2005) and Mueller and others (2015, p. 32–34), is used to estimate standard deviation as a function of concentration. The two-range model splits the replicate data into two pieces: a low concentration range for which standard deviation is about constant, and a high concentration range for which the relative standard deviation (RSD) is about constant. For concentrations within the low range, variability is estimated as the average standard deviation of replicates; within the high range, variability is estimated as the average RSD. Data subsetting was approached by identifying abrupt structural changes (breakpoints) in the variance of replicates sorted by their average concentration. Optimal breakpoints were calculated using the algorithm described by Bai and Perron (2003); see strucchange::breakpoints function for R implementation. These breakpoints define the end-points of the replicate-data subsets.

Value

A list object with function components. Each component defines a parameter's model of variability, that is, the standard deviation as a function of concentration, or the measurement error if a alpha argument is specified. The U.S. Geological Survey (USGS) parameter code (parm_cd) is assigned to the component name. The returned object includes 3 attributes: (1) replicates, a list with replicate plotting coordinates, see returned value from grDevices::xy.coords function for component format. (2) subsets, a data.frame with the following variables:

parm_cd

USGS 5-digit parameter code;

subset_min, subset_max

minimum and maximum concentrations of replicate subset, in parameter units;

nreplicates

number of replicates;

ave_sd

mean standard deviation, in parameter units; and

ave_rsd

mean relative standard deviation, in percent.

(3) data, a list with data.frame components. Each component stores the data used to create a parameter's variability model.

Note

Replicate sets with nondetection or negative values in one or more samples are removed.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Bai, J., and Perron, P., 2003, Computation and Analysis of Multiple Structural Change Models: Journal of Applied Econometrics, v. 18, p. 1–22.

Mueller, D.K., Schertz, T.L., Martin, J.D., and Sandstrom, M.W., 2015, Design, analysis, and interpretation of field quality-control data for water-sampling projects: U.S. Geological Survey Techniques and Methods, book 4, chap. C4, 54 p., https://doi.org/10.3133/tm4C4.

Mueller, D.K., and Titus, C.J., 2005, Quality of nutrient data from streams and ground water sampled during water years 1992–2001: U.S. Geological Survey Scientific Investigations Report 2005–5106, 27 p., https://pubs.usgs.gov/sir/2005/5106/.

Examples

parm_cd <- "00930" # sodium
models <- FitVariabilityModel(inldata::samples, parm_cd)

FUN <- models[[parm_cd]] # model of variability
conc <- c(20, 50) # measured concentration
stdev <- FUN(conc) # standard deviation
error <- FUN(conc, alpha = 0.05) # measurement error

xy <- attr(models, "replicates")[[parm_cd]] # replicate data
inlmisc::PlotGraph(xy$x, xy$y,
  xlab = "Mean concentration",
  ylab = "Standard deviation", type = "p",
  pch = 21
)
x <- seq(par("usr")[1], par("usr")[2], length.out = 1000)
graphics::lines(x, y = FUN(x), col = "red", lwd = 1.2)

str(attr(models, "subsets"))

Flag Temporal Data for Regression Analysis

Description

Identifies data records that are suitable for temporal regression analysis. Data preprocessing can ensure that at least one continuous record block exists in each unevenly-spaced time series. A continuous record block adheres to the following limits: (1) a minimum duration for the length of time that continuous reliable observations are available; (2) a minimum number of unique observation dates; and (3) a minimum temporal spacing between consecutive observations.

Usage

FlagTemporalData(
  samples,
  subset_by = c("site_no", "parm_cd"),
  remark_cd = c("", "<", "E"),
  samp_type_cd = c("9", "7", "A"),
  sdate = NA,
  edate = NA,
  min_por = NULL,
  max_gap = NULL,
  min_ndates = NULL,
  site_no = NULL,
  parm_cd = NULL,
  robust = FALSE
)

Arguments

samples

data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include sample_dt, parm_cd, remark_cd, result_va, and samp_type_cd.

subset_by

character or integer vector. Variables (column names or numbers) in samples that should be used to subset a time series within the temporal data set.

remark_cd

character vector. Remark codes. Only records with the specified remark codes will be included in the data set.

samp_type_cd

character vector. Sample type codes. Only records with the specified sample type codes will be included in the data set.

sdate, edate

Date or character. Start and end date for the period of interest, respectively. Required date format is YYYY-MM-DD. The default value, NA, indicates that the earliest (or latest) date should be used.

min_por

numeric. Minimum duration for a continuous record block, in days.

max_gap

numeric. Maximum days between consecutive observations in a continuous record block.

min_ndates

integer. Minimum number of unique observation dates in a continuous record block.

site_no

character vector. Site numbers; all sites are included by default.

parm_cd

character vector. Parameter codes; all parameters are included by default.

robust

logical flag. If true, displays additional information.

Value

A logical vector of length equal to the number of records (rows) in the samples data table. Values equal to TRUE indicate a valid record that adheres to the criteria specified by the user.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

See Also

FitSurvivalModel, OptimizeTemporal

Examples

site_no <- "433144112563501" # well CRA 2
is <- FlagTemporalData(inldata::samples,
  min_por = udunits2::ud.convert(15, "year", "day"),
  max_gap = udunits2::ud.convert(5, "year", "day"),
  min_ndates = 15, site_no = site_no, robust = TRUE
)
d <- droplevels(inldata::samples[is, , drop = FALSE])
sprintf(
  "Removed %s out of %s records (%s%%)",
  nrow(inldata::samples) - nrow(d), nrow(inldata::samples),
  signif((nrow(inldata::samples) - nrow(d)) /
    nrow(inldata::samples) * 100, 2)
)

parm_cd <- "00940" # chloride
plot(d[d$parm_cd == parm_cd, c("sample_dt", "result_va")])

Get Summary Statistics for Survival Model

Description

Retrieves selected statistics from a censored regression object.

Usage

GetSurvivalStats(model)

Arguments

model

survreg. A fitted parametric survival model, see survival::survreg.object for component descriptions. The survreg object must include component x, a vector of response times; see x argument in the survival::survreg function for the flag used to include this component (not included by default).

Value

A named numeric vector containing the following information:

n

number of observations

incr

long-term monotonic trend, estimated using the fractional increase in the response variable over a unit of measure for x (1-day).

rmse

root-mean-squared error, also known as residual standard error.

pval

p-value, statistical significance of the regression model.

r2_mkz

McKelvey and Zavoina's (1973) pseudo-R^2.

r2_mf

McFadden's (1975) pseudo-R^2.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

McFadden, D., 1973, Conditional logit analysis of qualitative choice behavior: p. 105–142 in Zarembka, P. (ed.), Frontiers in Econometrics. New York, NY, Academic Press.

McKelvey, R.D., and Zavoina, W., 1975, A statistical model for the analysis of ordinal dependent variables: The Journal of Mathematical Sociology, v. 4, no. 1, p. 103–120.

See Also

FitSurvivalModel

Examples

d <- inldata::samples
d <- d[
  d$site_no == "433322112564301" & # well USGS 38
    d$parm_cd == "01030" & # chromium
    d$sample_dt > as.Date("1989-01-01"),
  c("sample_dt", "result_va", "remark_cd")
]
t1 <- t2 <- d$result_va
t1[d$remark_cd == "<"] <- -Inf
d$surv <- I(survival::Surv(t1, t2, type = "interval2"))
model <- survival::survreg(surv ~ sample_dt, d,
  dist = "lognormal",
  x = TRUE
)
summary(model)
print(out <- GetSurvivalStats(model))
sprintf(
  "Percent change in concentration per year is %2.1f",
  100 * abs(out["incr"]) * 365
)

Krige Surface Map from Spatial Points

Description

Estimates a surface map of a feature (such as a contaminant concentration) across an area using a kriging interpolation method. Sample point data is averaged over time, with nondetection values set equal to zero. The sample variogram is fit using a Matern class model (Matern, 1960; Stein, 1999).

Usage

KrigeSurfaceMap(
  samples,
  sites,
  parm_cd = NULL,
  sdate = NA,
  edate = NA,
  resolution = NULL,
  newdata = NULL,
  Trans = NULL,
  BackTrans = NULL,
  type = c("QK", "OK"),
  kappa = seq(0.1, 10, by = 0.1),
  cutoff = NULL,
  nbins = 20,
  cloud = FALSE,
  nmax = Inf,
  maxdist = Inf,
  alpha = NULL,
  minarea = NULL,
  nfold = length(sites),
  nreps = 1,
  seed = NULL,
  minr2 = NULL,
  benchmarks = NULL,
  performance = FALSE,
  parallel = FALSE,
  robust = FALSE,
  old = NULL,
  ...
)

Arguments

samples

data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, remark_cd, and result_va; see samples dataset for variable descriptions.

sites

SpatialPointsDataFrame. Site coordinates with site_no and site_nm a required variable, see sites dataset for details.

parm_cd

character vector. Parameter codes; all parameters are included by default.

sdate, edate

Date or character. Start and end date, respectively, for the period over which measured values are described using their average value. Required date format is YYYY-MM-DD. The default value, NA, indicates that the earliest (or latest) date should be used.

resolution

numeric vector of length 1 or 2. Resolution (see raster::res function) of grid cells in the area of interpolation, defined by the convex hull of the observation points in sites.

newdata

SpatialPoints*. Prediction locations for kriging (optional). Used to bypass the default grid prediction locations.

Trans

function. Transformation applied to the data, such as the log1p function.

BackTrans

function. Back-transformation applied to kriging results, such as the expm1 function.

type

character. Type of regression method, specify "QK" for quantile kriging (Journel and Deutsch, 1997; Juang and others, 2001) and "OK" for ordinary kriging (Clark, 1997; Kitanidis, 1997; Goovaerts, 1997).

kappa

numeric vector. A sequence of kappa values that will be searched for optimal Matern variogram model fit. Kappa is a shape parameter that controls the smoothness of the random field. A value of 0.5 results in the exponential model, while a smoothness of infinity is exactly the Gaussian model. The Matern model typically begins to look and behave as the Gaussian model for values greater than 10. Small values are considered ‘smooth’, while larger values describe a ‘rough’ random field.

cutoff

numeric. Spatial separation distance up to which data point pairs are included in empirical variance estimates; by default, set equal to the length of the diagonal of the box spanning the data divided by three.

nbins

integer. Number of bins (distance intervals) within the cutoff distance into which data point pairs are grouped for empirical variance estimates.

cloud

logical flag. Whether to calculate the variogram cloud.

nmax

integer. For local kriging: the number of nearest observations that should be used for a kriging prediction or simulation.

maxdist

numeric. For local kriging: the search radius that defines the local neighborhood; if combined with the nmax argument, both criteria apply.

alpha

numeric. Level of significance, a threshold for p-value's calculated using Moran's I test. Evaluates whether the spatial distribution of feature values is the result of random spatial processes. Models with p-value's greater than alpha are rejected. Requires that the ape package is available.

minarea

numeric. Minimum area for the kriging observations specified as a fraction of the area inside the convex hull of the kriging surface. Models are rejected if less than this fraction. Useful for removing models with excessive extrapolation.

nfold

integer. Number of folds to apply in k-fold cross validation (CV). A value greater than equal to the number of observation points results in leave-one-out CV. The CV method is used to determine the function type for the variogram model, and calculate performance metrics for the kriging model.

nreps

integer. Number of times to repeat cross validation. Final results are based on the median performance metric from all repetitions.

seed

integer. Random number generator state for random number generation, used to replicate the CV results.

minr2

numeric vector of length 1 or 2, value is recycled as necessary. Minimum value for R^2, between 0 and 1. Values applied separately to variogram and kriging analysis. Models are rejected if R^2 is less than the minimum value.

benchmarks

data.frame. Benchmarks, such as a selected hazardous threshold. A byproduct of quantile kriging, used to calculate the probability of a predicted value being greater than the benchmark value. The required variables include parm_cd and mcl, see benchmarks dataset for variable descriptions.

performance

logical flag. Whether to calculate performance metrics for the kriging model.

parallel

logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work.

robust

logical flag. Whether to display output from variogram fitting and kriging. Only available if the parallel argument is false.

old

list. Output from a past run of the KrigeSurfaceMap() command. This is an optional way to provide the prediction locations and variogram models.

...

Not used

Value

A list with list components, one for each parameter kriged, and named after their respective parameter code (see parm_cd argument). Each list component has the following components:

kr_output

results from the kriging interpolation, that is, the prediction (var1.pred), standard error (var1.se), and kriging variance (var1.var). Where the kriging variance remains in transformed space. If using quantile kriging, the following variables are also returned: interval_range, the interval range, or the deviation between upper and lower bounds in units of measurement; and probability, the estimated probability that a predicted value is higher than the benchmark value.

kr_cv

kriging model prediction performance metrics derived from CV. Metrics include: me, mean error (ideally 0); mse, mean squared error (ideally small); rmse, root mean squared error (ideally small); nrmse, normalized root mean squared error (ideally small); mae, mean absolute error (ideally small); cor, correlation between observed and predicted values (ideally 1); and r2, coefficient of determination, or R^2 (ideally 1).

vg_cloud

variogram cloud.

vg_sample

sample variogram.

vg_model

fitted variogram model.

vg_summary

a summary of the variogram model with list components: model, is the model type; nugget effect, represents a discontinuity of the model at the origin; sill, is the variance when the model either reaches or becomes asymptotic to a constant value as lag distance increases; range, controls the rate of increase with distance; effective_range, is the effective range, that is the distance at which the variance value achieves 95 percent of the sill; kappa, is a shape parameter that controls the smoothness of the random field that models the spatial variability of the observed data; sserr, is the sum of squared error (ideally 0); and r2, is the coefficient of determination (ideally 1).

obs

observation points used in the prediction.

Note

Consider the discussion by Kitanidis (1997, p. 71, section 3.7) before kriging with a moving neighborhood using the nmax and maxdist arguments.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Clark, Isobel, 1979, Practical Geostatistics: London, Applied Science Publishers, 119 p.

Goovaerts, Pierre, 1997, Geostatistics for Natural Resource Evaluation: New York, Oxford University Press, 483 p.

Journel, A.G., and Deutsch, C.V., 1997, Rank order geostatistics: a proposal for a unique coding and common processing of diverse data, In Geostatistics Wollongton '96, Volume 1, ed. E.Y. Baafi and N.A. Schofield, Dordrecht, Boston: Kluwer Academic, p. 174–187.

Juang, K.-W., Lee, D.-Y., and Ellsworth, T.R., 2001, Using rank-order geostatistics for spatial interpolation of highly skewed data in a heavy-metal contaminated site: Journal of Environmental Quality, v. 30, no. 3, p. 894–903.

Kitanidis, P.K., 1997, Introduction to geostatistics—Applications to hydrogeology: New York, Cambridge University Press, 249 p.

Matern, B. 1960, Spatial variation—stochastic models and their applications to some problems in forest survey sampling investigations: Report of the Forest Research Institute of Sweden, v. 49, no. 5, 144 p.

Stein, M.L., 1999, Interpolation of spatial data—Some theory for kriging: Springer, New York, 247 p.

See Also

gstat::fit.variogram, gstat::krige, gstat::krige.cv

Examples

parm_cd <- c(chloride = "00940", tritium = "07000")
models <- KrigeSurfaceMap(inldata::samples, inldata::sites,
  parm_cd = parm_cd, sdate = "1989-01-01",
  resolution = 500, cutoff = 10 * 1609,
  parallel = TRUE
)
names(models)

# Plot kriging results for each parameter
xlim <- sp::bbexpand(sp::bbox(inldata::sites)[1, ], 0.04)
ylim <- sp::bbexpand(sp::bbox(inldata::sites)[2, ], 0.04)
for (cd in names(models)) {
  srsname <- inldata::parameters[cd, "srsname"]
  x <- models[[cd]]
  try(gridExtra::grid.arrange(sp::bubble(x$obs,
    maxsize = 1,
    main = srsname
  ),
  gstat:::plot.gstatVariogram(
    x$vg_sample,
    x$vg_model
  ),
  sp::spplot(x$kr_output, "var1.pred",
    main = "Predicted",
    xlim = xlim, ylim = ylim
  ),
  sp::spplot(x$kr_output, "var1.se",
    main = "Standard Error",
    xlim = xlim, ylim = ylim
  ),
  ncol = 2
  ))
  prompt <- sprintf("%s (%s), press key to continue...", srsname, cd)
  if (interactive()) invisible(readline(prompt))
}

Perform Spatial Optimization Analysis

Description

Identifies an optimal subset of wells to exclude from an existing long-term monitoring network because they provide the least amount of information. A distributed multiple-population genetic algorithm (GA) is used to preserve spatial accuracy and long-term temporal trends in the reduced monitoring network.

Usage

OptimizeSpatial(
  k,
  sites,
  samples,
  site_no = NULL,
  parm_cd = NULL,
  arg = list(),
  surv = NULL,
  loess = NULL,
  w1 = 1,
  w2 = 1,
  w3 = 1,
  w4 = 1,
  penalty = NULL,
  exclude = NULL,
  suggestions = NULL,
  ...
)

Arguments

k

integer. Number of sites to exclude from the existing monitoring network.

sites

SpatialPointsDataFrame. Site coordinates with site_no and site_nm a required variable, see sites dataset for details.

samples

data.frame. Water quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, remark_cd, result_va, samp_type_cd, and rep_pair_no; see samples dataset for variable descriptions.

site_no

character vector. Site numbers to include in the analysis; all sites are included by default.

parm_cd

character vector. Parameter codes to include in the analysis; all parameters are included by default.

arg

list. Additional arguments passed to the KrigeSurfaceMap function.

surv

list. Models from survival-regression analysis, the resulting object from running the FitSurvivalModel function (optional).

loess

data.frame. Predicted points from local-regression analysis, the resulting object from running the FitLoessCurve function (optional).

w1, w2, w3, w4

numeric number. Weighting factor applied to objective function, see ‘Details’ section for objective descriptions.

penalty

numeric number. Penalty to apply to infeasible sampling schemes that result in unestimated points in the interpolation domain using kriging.

exclude

character vector. Site numbers to exclude from the decision space but include in the kriging analysis.

suggestions

matrix. Monitoring networks to be included as initial guesses. See returned solution component for a suggested value for this argument.

...

GA control parameters passed to the inlmisc::FindOptimalSubset function.

Details

The GA::gaisl function (Scrucca, 2013, 2017) is used to search for an optimal subset of sites to remove from a existing monitoring network because they add little or no beneficial information. The number of sites to remove is fixed and specified by the user using the k argument. The performance of a network design is measured with respect to multiple objectives, each characterizes a contribution to the overall value (or fitness) of the network design. The design problem is formulated as a single-objective optimization using a weighted sum approach. More specifically, an islands GA method is used to minimize a fitness function that comprises a set of weighted objective functions. Weights are assigned to individual objective functions to establish the relative influence of each objective and (or) resolve conflicting objectives. Objectives are described as follows:

  1. Minimize the normalized root-mean-square deviation between kriged concentrations estimated using the existing and reduced monitoring networks.

  2. Decrease spatial interpolation errors by minimizing the ratio between the sum kriging variance estimated for the reduced and existing monitoring networks.

  3. Preserve significant long-term monotonic temporal trends in wells by minimizing the maximum fractional change per year estimated from survival-regression analysis, in wells removed from the existing monitoring network.

  4. Preserve continuous record blocks that have the largest variability by minimizing the maximum normalized range of concentrations estimated from local-regression analysis, in wells removed from the existing monitoring network.

Value

A list object with components: call, solution, ga_output, and ga_time, (see returned value from inlmisc::FindOptimalSubset function); rm_sites, a vector of site numbers indicating the wells that have been optimally selected for removal; fitness, a data table of weighted fitness values for individual objectives in the fitness function. The fitness component has the following variables:

parm_cd

USGS 5-digit parameter code;

epoch

interval between migrations in the islands GA run; and

w1f1, w2f2, w3f3, w4f4

weighted fitness value for individual objective functions.

The fitness component has a "weights" attribute that contains the weights assigned to individual objective functions. Fitness values can be displayed using the PlotFitness function.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Scrucca, Luca, 2013, GA: A Package for Genetic Algorithms in R: Journal of Statistical Software, v. 53, no. 4, p. 1–37.

Scrucca, Luca, 2017, On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution: The R Journal, v. 9, no. 1, 187–206.

Examples

site_no <- with(inldata::sites@data, site_no[network == "aquifer"])
parm_cd <- "07000" # tritium
arg <- list("sdate" = "1989-01-01", "cutoff" = 16093, "nbins" = 20)
exclude <- with(inldata::sites@data, site_no[completion_cd != "O"])
ans <- OptimizeSpatial(5, inldata::sites, inldata::samples,
  site_no, parm_cd, arg,
  w1 = 10, w4 = 0.1,
  exclude = exclude, run = 5, maxiter = 100,
  numIslands = 4, parallel = TRUE, seed = 123
)
print(ans)

print(m <- ans[["solution"]])
locations <- inldata::sites[m[1, ], ]
print(locations@data[, c("site_no", "site_nm")])
print(ans$fitness)
print(attr(ans$fitness, "weights"))
PlotFitness(ans$fitness, type = "objectives")

inlmisc::PlotMap(locations,
  dms.tick = TRUE,
  roads = list(inldata::roads),
  lakes = list(inldata::lakes),
  rivers = list(inldata::streams),
  scale.loc = "bottomleft"
)
points(inldata::sites, pch = 19, col = "black", xpd = FALSE)
points(locations, pch = 19, col = "red")
raster::text(locations,
  labels = locations@data$site_nm,
  cex = 0.7, pos = 4, halo = TRUE
)

plot(ans[["ga_output"]])
summary(ans[["ga_output"]])

Perform Temporal Optimization Analysis

Description

Provides a method for analyzing the temporal spacing of observations. It determines if fewer observations could have been used to characterize analyte concentrations at a sampling site over time. An iterative thinning approach, based on an algorithm described by Cameron (2004), is used to identify an optimal sampling interval that may be used for future long-term monitoring, one that reduces temporal redundancy. Before running this function, ensure that the data are appropriate for temporal redundancy analysis; the FlagTemporalData function is provided for this purpose.

Usage

OptimizeTemporal(
  samples,
  site_no = NULL,
  parm_cd = NULL,
  span = NULL,
  by = NULL,
  n = 200,
  conf = 0.9,
  nsim = 500,
  threshold = 0.8,
  tol = 0.05,
  Fun = stats::median,
  mult = 0.5,
  save_mc = FALSE,
  parallel = FALSE,
  monitor = FALSE,
  seed = NULL
)

Arguments

samples

data.frame. Water-quality data from analysis of environmental and quality-control samples. The required variables include site_no, sample_dt, parm_cd, remark_cd, and result_va; see samples dataset for variable descriptions.

site_no

character vector. Site numbers to include in the analysis; all sites are included by default.

parm_cd

character vector. Parameter codes to include in the analysis; all parameters are included by default.

span

numeric. Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit. Larger values give more smoothness. Specify as NULL to automate the selection of the smoothing parameter using the bias-corrected Akaike information criterion (Hurvich and others, 1998). Estimated values are restricted to lie with the range of values from 0.2 to 0.9.

by

number, difftime, or string. Increment of the predicted points over the period of record. Specify as either a number in days, object of class difftime, or character string containing one of "day", "week", "month", "quarter" or "year".

n

integer. Number of predicted points uniformly distributed over the period of record. Only used if the by argument is not specified.

conf

numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve.

nsim

integer. Number of data-thinning simulations to perform during each iteration of the bisection method. For each simulation, a different set of randomly selected observations is removed from the full-data set and a new Loess curve is computed.

threshold

numeric. Fraction of predicted points on the Loess curve fitted to the reduced-data set, that must fall within the confidence band computed from the full-data set.

tol

numeric. Error tolerance for the bisection method; if crossed, and the regression is robust (see threshold argument), the solution is considered optimal.

Fun

function. Function used to aggregate daily values. By default, the stats::median function is applied to values reported on the same sample date.

mult

numeric. Multiplier for censored values (remark_cd equal to "<").

save_mc

logical flag. Whether to return the data-thinning simulations.

parallel

logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work.

monitor

logical flag. If true, displays output after each iteration of the bisection method. Monitoring is only available if the parallel argument is false.

seed

integer. Random number generator state for random number generation, used to replicate the results. The doRNG package must be installed if using parallel computing.

Details

The median date gap (that is, the median number of days between consecutive and unique sampling dates) is first calculated and used as the historic sampling interval. The iterative thinning algorithm (Cameron, 2004) computes a confidence band around a Loess curve fitted to the observations; see FitLoessCurve function for details. A fraction of the observations is removed from the full-data set and a new Loess curve is fitted to the reduced-data set. The ability of the reduced-data set to reproduce the historical trend is evaluated by calculating the fraction of the new Loess curve fitted to the reduced-data set that falls within the confidence band established using the full-data set. Increasing numbers of observations are removed from the full-data set and evaluated for their ability to reproduce the historical trend. A data-thinning method is used to safeguard against irregular trends that may arise from the selection of a single set of observations to remove. The optimal fraction of observations to remove, while still reproducing the historical trend, is computed using the bisection method. Finally, an optimal sampling interval, that minimizes temporal redundancy, is expressed as the historic sampling interval divided by one minus the optimal fraction to remove.

Value

A data.frame with the following components:

site_no

site number

parm_cd

parameter code

obs

observations in the unevenly-spaced time series stored as a matrix with columns x y, and cens. Where x is the sample date, y is the measured value, and cens indicates whether the measurement is censored.

frac_cens

fraction of censored measurements.

ndates

number of unique sampling dates

longest_gap

longest number of days between consecutive samples.

freq

historic sampling interval, in days, estimated as the median date gap.

pred

Loess curve and confidence band stored as a matrix with columns x, y, se, lcl, and ucl. See returned value from FitLoessCurve function for column descriptions.

reduction

largest fraction of observations that can be removed while preserving the temporal trend of the full-data set.

mc

Loess curve for each data-thinning simulation stored as a n by nsim matrix of predicted values. This component is only available if the save_mc argument is specified as true.

opt_freq

optimal sampling interval, in days.

freq_desc

adjective describing the historic sampling frequency; see DescribeFrequency function for details.

opt_freq_desc

adjective describing the optimal sampling frequency.

The returned object includes an elapsed_time attribute, an object of class difftime giving the total time taken to execute the function.

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

References

Cameron, K., 2004, Better optimization of long-term monitoring networks: Bioremediation Journal, v. 8, no. 3–4, p. 89–107.

See Also

FitLoessCurve

Examples

site_no <- "432019112563201" # well USGS 14 MV-61
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- OptimizeTemporal(inldata::samples, site_no, parm_cd,
  by = "month", save_mc = TRUE,
  monitor = TRUE, seed = 123
)
print(d)
print(attr(d, "elapsed_time"))

i <- 1L # row index
obs <- d$obs[[i]][, 1:2]
graphics::plot(obs,
  type = "n", xlab = "Julian day",
  ylab = d$parm_cd[i], main = d$site_no[i]
)
x <- d$pred[[i]][, "x"]
graphics::matlines(x, d$mc[[i]], lty = 1, lwd = 0.5, col = "gray")
graphics::matlines(x, d$pred[[i]][, c("lcl", "ucl")],
  lty = 2,
  lwd = 2, col = 2
)
graphics::lines(x, rowMeans(d$mc[[i]], na.rm = TRUE),
  lwd = 2, col = 3
)
graphics::points(obs)

Plot Fitness Values

Description

Visualize the performance of the sampling network optimization problem by graphing fitness values across generations.

Usage

PlotFitness(
  fitness,
  type = c("monitor", "objectives", "parameters"),
  ylim = NULL,
  show_main = TRUE,
  digits = getOption("digits")
)

Arguments

fitness

data.frame. See "fitness" component returned from the OptimizeSpatial function.

type

character. Plot type, possible types are "monitor", "objectives", and "parameters".

ylim

numeric vector of length 2. Minimum and maximum values for the y-axis.

show_main

character. Whether to show the main title.

digits

integer. Number of decimal places used when formatting the range of the best fitness value.

Value

Invisible NULL

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Processing

An optimization analysis is performed when building the vignette documents of the R-package ObsNetQW. Prior to re-building these vignettes, be aware of the long run times (on the order of weeks) associated with the spatial component of the optimization analysis. First create a local copy of the ObsNetQW package source material (stored in a remote repository) using the following command:

git2r::clone("https://code.usgs.gov/inl/obsnetqw.git", "obsnetqw")

The package vignettes are located under obsnetqw/inst/doc of the working directory. To re-build the vignettes from their source files, run:

inlmisc::BuildVignettes("obsnetqw")

Reproducibility

R packages get updated over time, so future versions of dependent packages may result in errors. To address this, it may be necessary to recreate an identical R environment where all packages are consistent with package versions used in this report. Version information about R, the operating system, and attached and loaded packages at the time of processing is as follows:

R version 4.0.5 (2021-03-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:

other attached packages:

loaded via a namespace (and not attached):

For reproducibility purposes, Microsoft® R Application Network (MRAN) stores daily snapshots of the CRAN packages and R releases. Use the May 5, 2021 snapshot to recreate the R environment used in this report. To do so, download and install R version 4.0.5 (2021-03-31) from the MRAN and use the following commands to recreate the R environment:

install.packages("remotes")
repos <- "https://cran.microsoft.com/snapshot/2021-05-05/"
update.packages(repos = repos)
remotes::install_gitlab("inl/obsnetqw@v1.0.0",
  host = "code.usgs.gov",
  dependencies = TRUE,
  repos = repos,
)

The optimization analysis was performed on a 4 cores Intel® Xeon® CPU E5-1620 v3 running at 3.50GHz and with 32GB of RAM.