Note: Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
R is a software environment for statistical computing and graphics. In R, the primary mechanism for sharing with others is the package. Packages are collections of computer code, data, and documentation in a well-defined format. Datasets pertaining to the United States Geological Survey-Idaho National Laboratory (USGS-INL) water-quality and water-level aquifer monitoring networks are stored in an R package named inldata. And computer code (functions) used to analyze this data are bundled together in an R package named ObsNetQW. This document is a vignette in the ObsNetQW package and provides an overview of the software used to optimize the USGS-INL water-quality aquifer monitoring network. A package vignette is a fully reproducible document that combines narrative text and R code to produce static (PDF) and dynamic (HTML) output formats. The R code is run when the vignette is built, and data analysis output (figures and tables) are created extemporaneously and inserted into the final document. Small chunks of stylized code are shown in this vignette and are intended to be used interactively in an R terminal. These code chunks may be useful for testing, development, and validation purposes. This document provides (1) software installation instructions, (2) dataset and function documentation, (3) processing instructions for performing the optimization analysis, and (4) guidance on recreating the R environment used when compiling this report.
Software used in this report are all available as free and open-source software (FOSS), can be run on multiple platforms (Windows, OS X, Linux), and do not require administrative privileges to install or run.
The R software environment can be obtained from the Comprehensive R Archive Network (CRAN). And the capabilities of R may be extended by installing an assorted group of user-contributed R packages available on CRAN and the U.S. Geological Survey R Archive Network (GRAN). Typing the following commands into an R terminal (console) will install the latest versions of the R packages used in this report.
update.packages()
remotes::install_gitlab("inl/obsnetqw",
host = "code.usgs.gov",
dependencies = TRUE)
To create PDF vignette documents with R requires LaTeX, a document preparation system for typesetting. The LaTeX distribution used in this report is TinyTeX. To install TinyTeX, first uninstall any other LaTeX distribution that may be installed, and then run the following command in the R terminal:
tinytex::install_tinytex()
The capabilities of LaTeX may be extended by installing an assorted group of user-contributed LaTeX packages available on the Comprehensive TeX Archive Network (CTAN). To install additional packages that were used in this report and not included in the default installation of TinyTeX, run the following command:
inlmisc:::InstallLatexPackages()
To create HTML vignette documents with R requires pandoc, a universal document converter. Instructions for installing pandoc are available for multiple platforms.
Datasets in the R-package inldata may be accessed from the R terminal by running the following command:
library("inldata")
The help documentation for each dataset is given below.
Water-quality background concentrations for selected radionuclides, organic compounds, and chemical constituents analyzed for in water from the Eastern Snake River Plain aquifer at and near the Idaho National Laboratory (INL). Background concentrations are either naturally occurring or anthropogenic, and are not influenced by waste and wastewater disposal at the INL (Bartholomay and Hall, 2016, p. 3).
background
A data frame with 73 records and 4 variables:
parm_cd
U.S. Geological Survey 5-digit parameter code,
see parameters
dataset for details.
bkgrd_min
, bkgrd_max
minimum and maximum limits of background concentration
reference
source of background concentration limits.
Reference citations are:
"Bartholomay and Hall (2016)"
,
"Knobel and others (1992)"
,
"Michel (1989)"
, and
"Orr and others (1991)"
.
Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.
Bartholomay, R.C., and Hall, L.F., 2016, Evaluation of background concentrations of selected chemical and radiochemical constituents in groundwater in the eastern Snake River Plain aquifer at and near the Idaho National Laboratory, Idaho: U.S. Geological Survey Scientific Investigations Report 2016–5056, (DOE/ID–22237), 19 p., https://doi.org/10.3133/sir20165056.
Knobel, L.L., Orr, B.R., and Cecil, L.D., 1992, Summary of background concentrations of selected radiochemical and chemical constituents in groundwater from the Snake River Plain aquifer, Idaho: estimated from an analysis of previously published data: Journal of the Idaho Academy of Science, v. 28, no. 1, p. 48–61.
Michel, R.L., 1989, Tritium deposition in the continental United States, 1953–83: U.S. Geological Survey Water Resources Investigations Report 89–4072, 46 p., https://doi.org/10.3133/wri894072.
Orr, B.R., Cecil, L.D., and Knobel, L.L., 1991, Background concentrations of selected radionuclides, organic compounds, and chemical constituents in ground water in the vicinity of the Idaho National Engineering Laboratory: U.S. Geological Survey Water-Resources Investigations Report 91–4015 (DOE/ID–22094), 52 p., https://doi.org/10.3133/wri914015.
str(background)
rownames(background)
Water-quality benchmark concentrations of selected radionuclides, organic compounds, and chemical constituents. Benchmarks include: the U.S. Environmental Protection Agency (USEPA) Maximum Contaminant Levels (MCLs) and Human Health Benchmarks for Pesticides (HHBPs), and the U.S. Geological Survey (USGS) Health-Based Screening Levels (HBSLs).
benchmarks
A data frame with 74 records and 9 variables:
parm_cd
USGS 5-digit parameter code, see parameters
dataset for details.
mcl
USEPA MCLs.
hhbp_noncancer
USEPA Chronic Noncancer HHBPs
hhbp_cancer_min
USEPA Carcinogenic HHBPs for a one-in-one million cancer risk.
hhbp_cancer_max
USEPA Carcinogenic HHBPs for a one-in-ten thousand cancer risk.
hbsl_noncancer
USGS Noncancer HBSLs
hbsl_cancer_min
USGS Cancer HBSLs for a one-in-one million cancer risk.
hbsl_cancer_max
USGS Cancer HBSLs for a one-in-ten thousand cancer risk.
remark
comments
Row names for the data frame indicate the Substance Registry Services (SRS) name and parameter units.
Many of the water-quality benchmarks were accessed from the U.S. Geological Survey Health-Based Screening Levels database, accessed on August 8, 2017, from https://water.usgs.gov/water-resources/hbsl/. Benchmarks for total Trihalomethanes, Tritium, and Strontium-90 were provided by the U.S. Environmental Protection Agency (2015). Note that MCL benchmark values reported in millirem per year were substituted with a 50 picocuries per liter screening level.
U.S. Environmental Protection Agency, 2015, Protection of environment—Code of Federal Regulations 40, Part 141, Subpart G, National Primary Drinking Water Regulations, Maximum Contaminant Levels and Maximum Residual Disinfectant Levels: Washington, D.C., Office of the Federal Register, National Archives and Records Administration.
str(benchmarks)
rownames(benchmarks)
Cities and towns (populated places) in the vicinity of Idaho National Laboratory, eastern Idaho.
cities
A SpatialPointsDataFrame of the sp package with 24 records and 16 variables.
See projection
dataset for coordinate reference system information.
U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.
inlmisc::PlotMap(cities, dms.tick = TRUE)
sp::plot(cities, pch = 19, add = TRUE)
raster::text(cities, cities@data$NAME,
pos = 1, cex = 0.6)
str(cities@data)
Boundaries of counties in the vicinity of Idaho National Laboratory, eastern Idaho, as of January 1, 2015.
counties
A SpatialLines of the sp package with 45 features.
See projection
dataset for coordinate reference system information.
U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.
inlmisc::PlotMap(counties, dms.tick = TRUE)
sp::plot(counties, lty = 5, add = TRUE)
Analytical method detection limits of selected radionuclides, which are based on laboratory procedures.
dl
A data frame with 8 records and 6 variables:
srsname
Substance Registry Services (SRS) name.
parm_cd
U.S. Geological Survey 5-digit parameter code.
parameter_units
parameter units
lab_det_lim_va
laboratory detection limit concentration
sdate
date that the detection limit went into effect.
reference
source of detection limit.
Detection limits reported by: Bartholomay and others (2003, table 9); Bartholomay and others (2014, table D1); and Bodnar and Percival (1982).
Bartholomay, R.C., Knobel, L.L., and Rousseau, J.P., 2003, Field methods and quality-assurance plan for quality-of-water activities, U.S. Geological Survey, Idaho National Engineering and Environmental Laboratory, Idaho: U.S. Geological Survey Open-File Report 03–42 (DOE/ID–22182), 45 p. https://doi.org/10.3133/ofr0342.
Bartholomay, R.C., Maimer, N.V., and Wehnke, A.J., 2014, Field methods and quality-assurance plan for water-quality activities and water-level measurements, U.S. Geological Survey, Idaho National Laboratory, Idaho: U.S. Geological Survey Open-File Report 2014–1146 (DOE/ID–22230), 64 p. https://pubs.usgs.gov/of/2014/1146/.
Bodnar, L.Z., and Percival, D.R., eds., 1982, Analytical Chemistry Branch procedures manual—Radiological and Environmental Sciences Laboratory: U.S. Department of Energy Report IDO–12096 [variously paged].
str(dl)
Boundary of the eastern Snake River Plain, Idaho.
esrp
A SpatialPolygonsDataFrame of the sp package object with 1 feature and 1 variable.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey Idaho National Laboratory Project Office
inlmisc::PlotMap(esrp, dms.tick = TRUE)
sp::plot(esrp, col = "red", add = TRUE)
str(esrp@data)
Federal research facilities at the Idaho National Laboratory (INL).
facilities
A SpatialPolygonsDataFrame object with 7 features and 1 variable.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey Idaho National Laboratory Project Office
inlmisc::PlotMap(facilities, dms.tick = TRUE)
sp::plot(facilities, col = "red", add = TRUE)
raster::text(facilities, facilities@data$NAME,
cex = 0.6, pos = 1)
str(facilities@data)
Groundwater level measurements in wells in the U.S. Geological Survey (USGS) water-level aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (NWIS) (U.S. Geological Survey, 2020).
gwl
A data frame with 54,417 records and 5 variables:
site_no
USGS site number
lev_dt
date the water level was measured.
lev_va
water level value, in feet below land surface.
A water-level elevation may be calculated by subtracting the depth-to-water
(lev_va
) from the land-surface elevation (alt_va
in the sites
dataset).
lev_status_cd
a code indicating the status of the site at the time the water level was measured. The codes and their meanings are: "A" water level was affected by atmospheric pressure, "D" the site was dry (no water level is recorded), "N" the measurement was discontinued, "O" an obstruction was encountered in the well (no water level was recorded), "P" the site was being pumped, "R" the site had been pumped recently, "S" a nearby site that taps the same aquifer was being pumped, and "X" the water level was affected by stage in nearby surface-water site.
date_time
date and time the water level was measured. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).
Data obtained from the NWIS database (U.S. Geological Survey, 2020).
U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.
site_no <- "432700112470801" # well USGS 1
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- gwl[gwl$site_no == site_no, c("lev_dt", "lev_va")]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
ylab <- sprintf("Water level, in %s below land surface",
c("feet", "meters"))
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
type = "p", pch = 19, seq.date.by = "year",
conversion.factor = 0.3048,
center.date.labels = TRUE)
str(gwl)
alt_va <- sites@data[sites@data$site_no == site_no, "alt_va"]
y <- alt_va - d$lev_va
ylab <- sprintf("Water level, in %s above sea level",
c("feet", "meters"))
inlmisc::PlotGraph(d$lev_dt, y, ylab = ylab, main = main,
xlim = xlim, type = "p", pch = 19,
seq.date.by = "year",
conversion.factor = 0.3048,
center.date.labels = TRUE)
Simplified representation of the boundary of Idaho, a state in the northwestern region of the United States.
idaho
A SpatialPolygons of the sp package with 1 feature.
See projection
dataset for coordinate reference system information.
U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.
inlmisc::PlotMap(idaho, dms.tick = TRUE)
sp::plot(idaho, col = "red", add = TRUE)
Boundary of the Idaho National Laboratory (INL).
inl
A SpatialPolygonsDataFrame of the sp package object with 1 feature and 4 variables.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey Idaho National Laboratory Project Office
inlmisc::PlotMap(inl, dms.tick = TRUE)
sp::plot(inl, col = "red", add = TRUE)
str(inl@data)
Map labels in the vicinity of Idaho National Laboratory, eastern Idaho.
labels
A SpatialPointsDataFrame of the sp package with 50 features and 7 variables:
labels
text to be written
cex
character expansion factor
col
, font
color and font to be used, respectively.
srt
string rotation in degrees.
halo
if true, a "halo" is printed around text.
map
map index
See projection
dataset for coordinate reference system information.
U.S. Geological Survey Idaho National Laboratory Project Office
inlmisc::PlotMap(labels, dms.tick = TRUE)
lab <- labels[labels$map == 1L, names(labels) != "map"]
for (i in seq_along(lab))
do.call(raster::text, c(lab[i, ], as.list(lab@data[i, ])))
str(labels@data)
Perennial lakes and ponds in the vicinity of Idaho National Laboratory, eastern Idaho.
lakes
A SpatialPolygonsDataFrame of the sp package with 155 features of 12 variables.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.
inlmisc::PlotMap(lakes, dms.tick = TRUE,
lakes = list(lakes, "lwd" = 1))
str(lakes@data)
Simplified representation of the mountain ranges and buttes in the vicinity of Idaho National Laboratory, eastern Idaho.
mountains
A SpatialPolygonsDataFrame of the sp package with 17 features and 1 variable.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey Idaho National Laboratory Project Office
inlmisc::PlotMap(mountains, dms.tick = TRUE)
sp::plot(mountains, col = "gray", add = TRUE)
str(mountains@data)
Parameter code information for selected chemical constituents, organic compounds, and radionuclides measured for in water samples collected from wells in the U.S. Geological Survey (USGS) water-quality aquifer monitoring network, Idaho National Laboratory and vicinity, Idaho.
parameters
A data frame with 74 records and 7 variables:
parm_cd
USGS 5-digit parameter code
parameter_group_nm
parameter group name, such as "Radiochemical"
parameter_nm
long parameter name, such as "Strontium-90, water, unfiltered, picocuries per liter"
casrn
Chemical Abstracts Service
(CAS) registry number,
such as "10098-97-2"
for Strontium-90
srsname
Substance Registry Services
(SRS) name,
such as "Strontium-90"
parameter_units
parameter units.
Unit abbreviations and descriptions are:
"mg/L"
milligrams per liter,
"mg/L as N"
milligrams per liter as Nitrogen,
"ug/L"
microgram per liter, and
"pCi/L"
picocuries per liter.
siunitx
parameter units formatted for LaTeX using the siunitx
package notation.
USGS water data acquired from the National Water Information System (U.S. Geological Survey, 2019).
The SRS name (srsname
)
for "Trihalomethanes (four), total, from SDWA NPDWR" was shorten to its preferred acronym "TTHM4".
U.S. Geological Survey, 2019, National Water Information System—web services, accessed June 11, 2019, from https://doi.org/10.5066/F7P55KJN.
str(parameters)
PROJ.4 string defining a custom coordinate reference system (CRS) used by the U.S. Geological Survey Idaho National Laboratory Project Office. The CRS is based on the following attributes: Albers equal-area conic projection; latitude of first and second standard parallel is 42.83 and 44.16 decimal degrees, respectively; latitude and longitude of false origin is 41.5 and -113 decimal degrees, respectively; easting and northing of false origin is 200,000 and 0 meters, respectively; Clarke (1966) reference ellipsoid; North American Datum of 1983; and units of meters.
projection
A character string describing the CRS in PROJ.4 format.
crs <- sp::CRS(projection) # convert to class "CRS"
print(crs)
Primary and secondary roads in the vicinity of Idaho National Laboratory, eastern Idaho.
roads
A SpatialLinesDataFrame of the sp package with 1,616 features and 5 variables.
See projection
dataset for coordinate reference system information.
U.S. Department of Commerce, U.S. Census Bureau, Geography Division/Cartographic Products Branch. Spatial extract from the Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB), 2019 data collection, released April 2, 2020.
inlmisc::PlotMap(roads, dms.tick = TRUE,
roads = list(roads, "lwd" = 1))
sp::plot(roads[roads@data$PRISEC, ], col = "red", add = TRUE)
str(roads@data)
Water-quality data from laboratory analyses of groundwater samples collected from wells in the U.S. Geological Survey (USGS) water-quality monitoring network, Idaho National Laboratory and vicinity, Idaho. Data was obtained from the National Water Information System (U.S. Geological Survey, 2019).
samples
A data frame with 159,554 records and 19 variables:
site_no
USGS site identification number
sample_dt
date sample was collected
medium_cd
3-digit medium code that identifies the material type and quality assurance type of the sample.
The codes and their meanings are:
"WG"
water below land surface contained in the saturated zone (groundwater); and
"WGQ"
groundwater quality-control (QC) sample.
db_no
2-digit NWIS database number.
The codes and their meanings are:
"01"
is the environmental database, and
"10"
is the quality-assurance (QA) database.
anl_ent_cd
analyzing entity code of the organizational unit that performed the sample analysis used to obtain the result.
parm_cd
USGS 5-digit parameter code.
For example, the parameter code for Tritium is "07000"
.
remark_cd
1-digit remark code (result level) used to qualify the parameter value.
The codes and their meanings are:
""
quantified value.
"<"
actual value is known to be less than the value reported,
that is, the measured concentration is below the reporting limit (RL) and represented as a censored (or nondetection) value.
For censored values, the value reported is the RL.
"E"
value is estimated, that is, the actual value is greater than the minimum detection limit (MDL)
and less than the laboratory reporting level (LRL).
"R"
nondetect, result less than sample-specific critical level.
"U"
material specifically analyzed for but not detected.
"V"
value affected by contamination.
result_va
parameter value;
see parameter_units
variable in the parameters
dataset for the units of measurement.
lab_std_dev_va
laboratory standard deviation (SD). For radiochemical data, SD is determined from the counting error. Prior to January 1, 2018, counting error was reported as two SD, therefore, these values were divied by 2.
dqi_cd
1-digit data quality indicator code that indicates the review status of a result.
The codes and their meanings are:
"A"
historical data,
"S"
presumed satisfactory, and
"R"
reviewed and accepted.
rpt_lev_va
laboratory reporting limit in effect for the parameter and method at the time the measurement was made.
rpt_lev_cd
reporting level code that identifies the analytical reporting level appropriate for the analytical method.
The codes and their meanings are:
"DLBLK"
detection limit by blank data;
"DLDQC"
detection limit by DQCALC, lowest concentration that with 90 percent confidence will be exceeded no more than
1 percent of the time when a blank sample is measured;
"IRL"
interim reporting level, a temporary reporting level;
"LRL"
laboratory reporting level, equal to twice the yearly-determined LT-MDL;
"LT-MDL"
long-term method detection limit, a detection level derived by determining the standard deviation of
a minimum of 24 MDL spike sample measurements over an extended period of time;
"MDL"
method detection limit, minimum concentration of a substance that can be measured and reported with
a 99 percent confidence that the analyte concentration in greater than zero;
"PQL"
practical quantitation limits;
"MRL"
minimum reporting level, smallest measured concentration that can be reliably measured using
a given analytical method;
"RLDQC"
reporting limit by DQCALC, is greater than or equal to two times the DLDQC;
"SSLC"
sample-specific critical level, the calculated and reported value is below which the radiochemistry result
is considered a non-detect; and
"SSMDC"
sample-specific minimum detectable concentration, a reporting level that varies for each sample and is primarily used
in radiochemical analyses.
samp_type_cd
1-digit sample type code that identifies the QA type of a sample.
The codes and their meanings are:
"H"
samples collected over a period of time (composite);
"2"
samples prepared from a reference material where none of the analytes of interest are
present in detectable quantities (blank);
"6"
reference material that provide the baseline analytical results for comparison with reference samples;
"7"
replicate samples; and
"9"
sample taken from the environment (regular).
meth_cd
method code, the codes are documented in the "NWIS Method Code Dictionary"
result_cm_tx
comment about the water quality result
date_time
date and time the sample was collected. Missing values of time were substituted with “00:00” (12:00 midnight and start of day).
sample_cd
unique identifier for the water sample.
The sample code is a concatenation of the date_time
and site_no
variables.
comments
comments pertaining to changes applied after the records were obtained from NWIS.
rep_pair_no
number used to identify replicate samples.
Replicates are identified by matching a replicate sample (samp_type_cd
eqaul to 7) with
its corresponding environmental sample (samp_type_cd
equal to 9).
Data obtained from the NWIS-QWDATA database on June 11, 2019 using the QWDATA system (U.S. Geological Survey, 2019).
U.S. Geological Survey, 2019, National Water Information System—Water-Quality System (QWDATA) data retrieval program.
site_no <- "433002113021701" # well RWMC PROD
parm_cd <- "32102" # carbon tetrachloride
xlim <- as.Date(c("1989-01-01", "2019-01-01"))
d <- samples[samples$site_no == site_no & samples$parm_cd == parm_cd,
c("sample_dt", "result_va")]
ylab <- parameters[parameters$parm_cd == parm_cd, "parameter_nm"]
main <- sites@data[sites@data$site_no == site_no, "site_nm"]
inlmisc::PlotGraph(d, ylab = ylab, main = main, xlim = xlim,
type = "p", pch = 19, seq.date.by = "year",
center.date.labels = TRUE)
str(samples)
Information for sampling sites in the U.S. Geological Survey (USGS) aquifer monitoring networks, Idaho National Laboratory and vicinity, Idaho.
sites
A SpatialPointsDataFrame of the sp package with 182 records and 21 variables:
site_no
USGS site number for monitoring well
station_nm
USGS site name
coord_meth_cd
latitude/longitude coordinate method code. The codes and their meanings are: "D" differentially corrected Global Positioning System (GPS), "G" mapping grade GPS unit (handheld accuracy range 3.7 to 12.2 meters), "L" long range navigation system, "M" interpolated from topographic map, and "S" transit, theodolite, or other surveying method.
coord_acy_cd
accuracy code for latitude/longitude values. The codes and their meanings are: "H" accurate to +/- 0.1 second, "5" accurate to +/- 0.5 second, "S" accurate to +/- 1 second, and "F" accurate to +/- 5 seconds.
alt_va
elevation of the land surface reference point, in feet above the North American Vertical Datum of 1988.
alt_meth_cd
method code for measuring elevation. The codes and their meanings are: "D" differentially corrected global positioning system, "L" level or other surveyed method, and "M" interpolated from topographic map.
alt_acy_va
accuracy of the elevation value (alt_va), does not account for vertical datum shift.
construction_dt
date the well was completed.
huc_cd
hydrologic unit code (HUC). Hydrologic units are geographic areas representing part or all of a surface drainage basin or distinct hydrologic feature and are delineated on the Hydrologic Unit Map.
reliability_cd
reliability code for data available for the site. The codes and their meanings are: "C" data have been checked by the reporting agency, and "U" unchecked data.
nat_aqfr_cd
national aquifer codes
aqfr_cd
aquifer codes defined by the catalog of aquifer names and geologic unit codes used by the Water Mission Area.
aqfr_type_cd
aquifer type code. The codes and their meanings are: "C" confined single aquifer, "M" confined multiple aquifers, "U" unconfined single aquifer, and "X" mixed (confined and unconfined) multiple aquifers.
well_depth_va
depth of the finished well, in feet below the land surface datum
hole_depth_va
total depth of the borehole, in feet below the land surface datum
depth_src_cd
source code for depth measurements. The codes and their meanings are: "A" reported by another government agency, "D" from driller's log or report, "G" private geologist-consultant or university associate, "L" interpreted from geophysical logs by personnel of source agency, "O" reported by owner of well, "R" reported by person other than the owner, driller, or another government agency, and "S" measured by personnel of reporting agency.
site_nm
local site name
completion_cd
borehole completion code. The codes and their meanings are: "O" open hole completion, "M" multilevel completion, and "P" open hole completion prior to multilevel completion.
network
aquifer monitoring network, either "aquifer" or "perched".
pos
a position specifier for site-labels on a map. Values of 1, 2, 3 and 4, respectively indicate positions below, to the left of, above and to the right of the site coordinates.
See projection
dataset for coordinate reference system information.
USGS site data acquired from the NWIS (U.S. Geological Survey, 2020).
U.S. Geological Survey, 2020, National Water Information System—web services, accessed August 4, 2020, from https://doi.org/10.5066/F7P55KJN.
inlmisc::PlotMap(sites, dms.tick = TRUE)
sp::plot(sites, add = TRUE)
str(sites@data)
Stream segments in the vicinity of Idaho National Laboratory, eastern Idaho.
streams
A SpatialLinesDataFrame of the sp package with 197 features of 14 variables.
See projection
dataset for coordinate reference system information.
U.S. Geological Survey (USGS), National Geospatial Technical Operations Center, USGS National Hydrography Dataset (NHD) Medium Resolution for Idaho, released August 4, 2014.
inlmisc::PlotMap(streams, dms.tick = TRUE,
rivers = list(streams, "lwd" = 1))
str(streams@data)
Land-surface topography and shaded relief in the vicinity of Idaho National Laboratory, eastern Idaho.
topo
A RasterStack of the raster package with 2 layers.
Grid cells in the "elevation"
layer represent land-surface elevations in meters above the
North American Vertical Datum of 1988 (NAVD 88).
Grid cells in the "hillshade"
layer represent shaded relief calculated from
the slope and aspect of land-surface elevations.
The spatial grid is composed of 900 rows and 900 columns,
and has cell sizes that are constant at 100 meters by 100 meters.
See projection
dataset for coordinate reference system information.
The National Map (TNM) 1/3-arc-second DEM (Gesch, 2007; Gesch and others, 2002), accessed on August 4, 2020. This dataset can be downloaded in a Esri ArcGRID â„¢ format using TNM Download. Elevation datasets are distributed in geographic coordinates in units of decimal degrees, and in conformance with the NAD 83. Elevation values are in meters above the NAVD 88. The west, east, south, and north bounding coordinates for this dataset are -114, -112, 43, and 45 decimal degrees, respectively.
Gesch, D.B., 2007, The National Elevation Dataset, in Maune, D., ed., Digital Elevation Model Technologies and Applications—The DEM Users Manual, 2nd ed.: Bethesda, Maryland, American Society for Photogrammetry and Remote Sensing, p. 99–118.
Gesch, D., Oimoen, M., Greenlee, S., Nelson, C., Steuck, M., and Tyler, D., 2002, The National Elevation Dataset: Photogrammetric Engineering and Remote Sensing, v. 68, no. 1, p. 5–11.
inlmisc::PlotMap(topo[["elevation"]], bg.image = topo[["hillshade"]],
pal = inlmisc::GetColors(scheme = "dem3", alpha = 0.5),
dms.tick = TRUE, useRaster = TRUE)
Functions in the R-package ObsNetQW may be accessed from the R terminal by running the following command:
library("ObsNetQW")
The help documentation for each function is given below.
Suggests an adjective that best describes a sampling frequency.
DescribeFrequency(x, frac = 0.2)
x |
numeric or difftime vector. Sampling frequencies; if numeric, specified in days. |
frac |
numeric. Fraction to move cutoffs to ensure that sampling frequencies fall within a sensible frequency. |
Invoking the DescribeFrequency()
command without an x
argument returns
a data frame with the frequency values adjusted for frac
.
A character vector of frequency descriptors.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
x <- c(0.1, 0.8, 1, 6, 6.6, 7.6, 15, 27, 28, 31, 370)
data.frame(x, "description" = DescribeFrequency(x))
DescribeFrequency()
data.frame(x, "description" = DescribeFrequency(x, frac = 0.50))
DescribeFrequency(frac = 0.50)
DescribeFrequency(as.difftime(c(1, NA, 10, 100, 1000),
units = "weeks"
))
Uses locally weighted scatterplot smoothing (LOESS) to fit a polynomial to time-series data. LOESS is a non-parametric regression method originally proposed by Cleveland (1979) and further developed by Cleveland and Devlin (1988). The smooth curve fitted through the data is known as the Loess curve, where each smoothed value is given by a weighted linear least squares regression of the span. Values returned from this function include the points on the Loess curve, their standard error and confidence interval.
FitLoessCurve(
samples,
site_no = NULL,
parm_cd = NULL,
Fun = stats::median,
mult = 0.5,
sdate = NA,
edate = NA,
by = "month",
conf = 0.9,
span = NULL,
degree = 1,
family = c("symmetric", "gaussian"),
...,
surface = c("direct", "interpolate")
)
samples |
data.frame.
Water-quality data from analysis of environmental and quality-control samples.
The required variables include |
site_no |
character vector. Site numbers to include in the analysis; all sites are included by default. |
parm_cd |
character vector. Parameter codes to include in the analysis; all parameters are included by default. |
Fun |
function.
Function used to aggregate daily values.
By default, the |
mult |
numeric.
Multiplier for censored values ( |
sdate, edate |
Date or character. Start and end date of prediction interval, respectively. Required date format is YYYY-MM-DD. |
by |
number, difftime, or character string. Increment of the sequence of values with which to predict data points on the Loess curve, monthly by default. |
conf |
numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve. |
span |
numeric.
Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit.
Larger values give more smoothness.
Specify as |
degree |
numeric. Degree of the polynomials to use, first degree by default. |
family |
character. Indicates the method used for fitting. By default, a re-descending M estimator is used with Tukey's biweight function. |
... |
Arguments passed to the |
surface |
character. Indicates how the fitted surface is calculated, exactly by default. |
Smoothed values are given by a weighted linear least squares regression over the span.
The polynomial surface is fit using the stats::loess
function.
A data.frame with the following components:
site_no
site number
parm_cd
parameter code
obs
observations in the unevenly-spaced time series
stored as a matrix with columns x
y
, and cens
. Where
x
is the sample date,
y
is the measured value, and
cens
indicates whether the measurement is censored.
pred
predicted points on the Loess curve. Where
x
and y
are coordinate values,
se
is the estimated standard error for predicted value,
lcl
and ucl
are the lower and upper confidence interval for predicted value, respectively.
span
is the smoothing parameter.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
Cleveland, W.S., 1979, Robust locally weighted regression and smoothing scatterplots: J. American Statistical Association, v. 74, p. 829–836.
Cleveland, W.S., Devlin, S.J., 1988, Locally-weighted regression—an approach to regression analysis by local fitting: Journal of the American Statistical Association, v. 83, no. 403, p. 596–610.
Hurvich, C.M., Simonoff, J.S., and Tsai, C.L., 1998, Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion: Journal of the Royal Statistical Society, Series B, vol. 60, no. 2, p. 271–293.
OptimizeTemporal
, OptimizeSpatial
site_no <- "432307112583101" # well USGS 124
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- FitLoessCurve(inldata::samples, site_no, parm_cd)
is <- d$site_no == site_no & d$parm_cd == parm_cd[2]
obs <- d[is, "obs"][[1]][, 1:2]
inlmisc::PlotGraph(obs,
ylab = "Tritium, in picocuries per liter",
type = "p", pch = 3, col = "purple", xpd = TRUE
)
pred <- d[is, "pred"][[1]]
graphics::matpoints(pred$x, pred[, c("y", "lcl", "ucl")],
pch = 19,
col = c("black", "red", "red")
)
Fits a parametric survival regression model to time series data.
It is a wrapper around the survival::survreg
function.
Before running this function, ensure that the data are appropriate for survival analysis;
the FlagTemporalData
function is provided for this purpose.
FitSurvivalModel(
samples,
parm_cd = NULL,
dl = NULL,
conf = 0.95,
dist = "lognormal",
alpha = 0.05,
...
)
samples |
data.frame.
Water quality data from analysis of environmental and quality-control samples.
The required variables include |
parm_cd |
character vector. Parameter codes; all parameters are included by default. |
dl |
data.frame.
Laboratory detection limits;
see |
conf |
numeric number.
Confidence level expressed as a fraction,
used to calculate the confidence interval for a measured value ( |
dist |
character.
Assumed distribution for the response variable of the regression model ( |
alpha |
numeric. Level of significance, a threshold for p-value's. |
... |
Arguments passed to the |
The regression model requires data values that are greater than zero. To satisfy this requirement, an interval-censored value with a lower bound less than or equal to zero (or a detection limit) is represented as left-censored.
A list object with survreg components.
Components are named using a concatenation of variables specified in the subset_by
argument.
The returned object includes a diagnostic_messages
attribute that contains
a list object with diagnostic messages that result from regression problems,
and a data
attribute that contains a data.frame object with the data fit by the model.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
parm_cd <- "07000" # tritium
is <- inldata::samples$sample_dt > as.Date("1989-01-01")
models <- FitSurvivalModel(inldata::samples[is, ],
parm_cd = parm_cd, dl = inldata::dl
)
site_no <- "433204112562001" # well CFA 1
model <- models[[paste(site_no, parm_cd)]]
summary(model)
d <- attr(models, "data")
d <- d[d$site_no == site_no & d$parm_cd == parm_cd, ]
m <- as.matrix(d$surv)
is <- m[, "status"] %in% 1:2
m[is, "time2"] <- m[is, "time1"]
m[m[, "status"] == 2, "time1"] <- -Inf
inlmisc::PlotGraph(d$sample_dt, m[, 1:2], type = "i", col = "blue")
x <- seq(par("usr")[1], par("usr")[2], length.out = 100)
newdata <- list(x)
names(newdata) <- colnames(model$x)[2]
y <- stats::predict(model, newdata,
type = "quantile",
p = c(0.025, 0.5, 0.975)
)
graphics::matlines(x, y, col = "red", lty = c(2, 1, 2))
coefs <- stats::coef(model)
fmt <- "y = exp(%sx + %s),\nwhere x is number of days since 1970-01-01"
txt <- sprintf(fmt, signif(coefs[2], 2), signif(coefs[1], 2))
graphics::mtext(txt, line = -2)
Fits a model of variability (standard deviation as a function of concentration) to replicate data. Replicates are defined as two or more water samples that are collected, prepared, and analyzed such that they are considered to be essentially identical in composition and analysis. The variability of replicate data is defined as the random error in independent measurements as the result of repeated application of the measurement process under identical conditions (Mueller and others, 2015). A model of variability is created for each analyte. Variability is a function of concentration and estimates of variability are developed for discrete ranges of concentration.
FitVariabilityModel(samples, parm_cd = NULL, h = 1)
samples |
data.frame.
Water quality data from analysis of environmental and quality-control samples.
The required variables include |
parm_cd |
character vector. Parameter codes to include in the analysis; all parameters are included by default. |
h |
integer count. The minimal segment size in the low or high range either given as the minimal number of valid replicate pairs in each segment. |
A two-range model, as described by Mueller and Titus (2005) and Mueller and others (2015, p. 32–34),
is used to estimate standard deviation as a function of concentration.
The two-range model splits the replicate data into two pieces:
a low concentration range for which standard deviation is about constant,
and a high concentration range for which the relative standard deviation (RSD) is about constant.
For concentrations within the low range, variability is estimated as the average standard deviation of replicates;
within the high range, variability is estimated as the average RSD.
Data subsetting was approached by identifying abrupt structural changes (breakpoints) in the
variance of replicates sorted by their average concentration.
Optimal breakpoints were calculated using the algorithm described by Bai and Perron (2003);
see strucchange::breakpoints
function for R implementation.
These breakpoints define the end-points of the replicate-data subsets.
A list object with function components.
Each component defines a parameter's model of variability, that is,
the standard deviation as a function of concentration,
or the measurement error if a alpha
argument is specified.
The U.S. Geological Survey (USGS) parameter code (parm_cd
) is assigned to the component name.
The returned object includes 3 attributes:
(1) replicates
, a list with replicate plotting coordinates,
see returned value from grDevices::xy.coords
function for component format.
(2) subsets
, a data.frame with the following variables:
parm_cd
USGS 5-digit parameter code;
subset_min
, subset_max
minimum and maximum concentrations of replicate subset, in parameter units;
nreplicates
number of replicates;
ave_sd
mean standard deviation, in parameter units; and
ave_rsd
mean relative standard deviation, in percent.
(3) data
, a list with data.frame components.
Each component stores the data used to create a parameter's variability model.
Replicate sets with nondetection or negative values in one or more samples are removed.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
Bai, J., and Perron, P., 2003, Computation and Analysis of Multiple Structural Change Models: Journal of Applied Econometrics, v. 18, p. 1–22.
Mueller, D.K., Schertz, T.L., Martin, J.D., and Sandstrom, M.W., 2015, Design, analysis, and interpretation of field quality-control data for water-sampling projects: U.S. Geological Survey Techniques and Methods, book 4, chap. C4, 54 p., https://doi.org/10.3133/tm4C4.
Mueller, D.K., and Titus, C.J., 2005, Quality of nutrient data from streams and ground water sampled during water years 1992–2001: U.S. Geological Survey Scientific Investigations Report 2005–5106, 27 p., https://pubs.usgs.gov/sir/2005/5106/.
parm_cd <- "00930" # sodium
models <- FitVariabilityModel(inldata::samples, parm_cd)
FUN <- models[[parm_cd]] # model of variability
conc <- c(20, 50) # measured concentration
stdev <- FUN(conc) # standard deviation
error <- FUN(conc, alpha = 0.05) # measurement error
xy <- attr(models, "replicates")[[parm_cd]] # replicate data
inlmisc::PlotGraph(xy$x, xy$y,
xlab = "Mean concentration",
ylab = "Standard deviation", type = "p",
pch = 21
)
x <- seq(par("usr")[1], par("usr")[2], length.out = 1000)
graphics::lines(x, y = FUN(x), col = "red", lwd = 1.2)
str(attr(models, "subsets"))
Identifies data records that are suitable for temporal regression analysis. Data preprocessing can ensure that at least one continuous record block exists in each unevenly-spaced time series. A continuous record block adheres to the following limits: (1) a minimum duration for the length of time that continuous reliable observations are available; (2) a minimum number of unique observation dates; and (3) a minimum temporal spacing between consecutive observations.
FlagTemporalData(
samples,
subset_by = c("site_no", "parm_cd"),
remark_cd = c("", "<", "E"),
samp_type_cd = c("9", "7", "A"),
sdate = NA,
edate = NA,
min_por = NULL,
max_gap = NULL,
min_ndates = NULL,
site_no = NULL,
parm_cd = NULL,
robust = FALSE
)
samples |
data.frame.
Water quality data from analysis of environmental and quality-control samples.
The required variables include |
subset_by |
character or integer vector.
Variables (column names or numbers) in |
remark_cd |
character vector. Remark codes. Only records with the specified remark codes will be included in the data set. |
samp_type_cd |
character vector. Sample type codes. Only records with the specified sample type codes will be included in the data set. |
sdate, edate |
Date or character.
Start and end date for the period of interest, respectively.
Required date format is YYYY-MM-DD.
The default value, |
min_por |
numeric. Minimum duration for a continuous record block, in days. |
max_gap |
numeric. Maximum days between consecutive observations in a continuous record block. |
min_ndates |
integer. Minimum number of unique observation dates in a continuous record block. |
site_no |
character vector. Site numbers; all sites are included by default. |
parm_cd |
character vector. Parameter codes; all parameters are included by default. |
robust |
logical flag. If true, displays additional information. |
A logical vector of length equal to
the number of records (rows) in the samples
data table.
Values equal to TRUE
indicate a valid record that adheres to the criteria specified by the user.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
FitSurvivalModel
, OptimizeTemporal
site_no <- "433144112563501" # well CRA 2
is <- FlagTemporalData(inldata::samples,
min_por = udunits2::ud.convert(15, "year", "day"),
max_gap = udunits2::ud.convert(5, "year", "day"),
min_ndates = 15, site_no = site_no, robust = TRUE
)
d <- droplevels(inldata::samples[is, , drop = FALSE])
sprintf(
"Removed %s out of %s records (%s%%)",
nrow(inldata::samples) - nrow(d), nrow(inldata::samples),
signif((nrow(inldata::samples) - nrow(d)) /
nrow(inldata::samples) * 100, 2)
)
parm_cd <- "00940" # chloride
plot(d[d$parm_cd == parm_cd, c("sample_dt", "result_va")])
Retrieves selected statistics from a censored regression object.
GetSurvivalStats(model)
model |
survreg.
A fitted parametric survival model,
see |
A named numeric vector containing the following information:
n
number of observations
incr
long-term monotonic trend, estimated using the fractional increase
in the response variable over a unit of measure for x
(1-day).
rmse
root-mean-squared error, also known as residual standard error.
pval
p-value, statistical significance of the regression model.
r2_mkz
McKelvey and Zavoina's (1973) pseudo-R^2.
r2_mf
McFadden's (1975) pseudo-R^2.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
McFadden, D., 1973, Conditional logit analysis of qualitative choice behavior: p. 105–142 in Zarembka, P. (ed.), Frontiers in Econometrics. New York, NY, Academic Press.
McKelvey, R.D., and Zavoina, W., 1975, A statistical model for the analysis of ordinal dependent variables: The Journal of Mathematical Sociology, v. 4, no. 1, p. 103–120.
d <- inldata::samples
d <- d[
d$site_no == "433322112564301" & # well USGS 38
d$parm_cd == "01030" & # chromium
d$sample_dt > as.Date("1989-01-01"),
c("sample_dt", "result_va", "remark_cd")
]
t1 <- t2 <- d$result_va
t1[d$remark_cd == "<"] <- -Inf
d$surv <- I(survival::Surv(t1, t2, type = "interval2"))
model <- survival::survreg(surv ~ sample_dt, d,
dist = "lognormal",
x = TRUE
)
summary(model)
print(out <- GetSurvivalStats(model))
sprintf(
"Percent change in concentration per year is %2.1f",
100 * abs(out["incr"]) * 365
)
Estimates a surface map of a feature (such as a contaminant concentration) across an area using a kriging interpolation method. Sample point data is averaged over time, with nondetection values set equal to zero. The sample variogram is fit using a Matern class model (Matern, 1960; Stein, 1999).
KrigeSurfaceMap(
samples,
sites,
parm_cd = NULL,
sdate = NA,
edate = NA,
resolution = NULL,
newdata = NULL,
Trans = NULL,
BackTrans = NULL,
type = c("QK", "OK"),
kappa = seq(0.1, 10, by = 0.1),
cutoff = NULL,
nbins = 20,
cloud = FALSE,
nmax = Inf,
maxdist = Inf,
alpha = NULL,
minarea = NULL,
nfold = length(sites),
nreps = 1,
seed = NULL,
minr2 = NULL,
benchmarks = NULL,
performance = FALSE,
parallel = FALSE,
robust = FALSE,
old = NULL,
...
)
samples |
data.frame.
Water quality data from analysis of environmental and quality-control samples.
The required variables include |
sites |
SpatialPointsDataFrame.
Site coordinates with |
parm_cd |
character vector. Parameter codes; all parameters are included by default. |
sdate, edate |
Date or character.
Start and end date, respectively,
for the period over which measured values are described using their average value.
Required date format is YYYY-MM-DD.
The default value, |
resolution |
numeric vector of length 1 or 2.
Resolution (see |
newdata |
SpatialPoints*. Prediction locations for kriging (optional). Used to bypass the default grid prediction locations. |
Trans |
function.
Transformation applied to the data, such as the |
BackTrans |
function.
Back-transformation applied to kriging results, such as the |
type |
character.
Type of regression method, specify
|
kappa |
numeric vector. A sequence of kappa values that will be searched for optimal Matern variogram model fit. Kappa is a shape parameter that controls the smoothness of the random field. A value of 0.5 results in the exponential model, while a smoothness of infinity is exactly the Gaussian model. The Matern model typically begins to look and behave as the Gaussian model for values greater than 10. Small values are considered ‘smooth’, while larger values describe a ‘rough’ random field. |
cutoff |
numeric. Spatial separation distance up to which data point pairs are included in empirical variance estimates; by default, set equal to the length of the diagonal of the box spanning the data divided by three. |
nbins |
integer.
Number of bins (distance intervals) within the |
cloud |
logical flag. Whether to calculate the variogram cloud. |
nmax |
integer. For local kriging: the number of nearest observations that should be used for a kriging prediction or simulation. |
maxdist |
numeric.
For local kriging: the search radius that defines the local neighborhood;
if combined with the |
alpha |
numeric.
Level of significance, a threshold for p-value's calculated using Moran's I test.
Evaluates whether the spatial distribution of feature values is the result of random spatial processes.
Models with p-value's greater than |
minarea |
numeric. Minimum area for the kriging observations specified as a fraction of the area inside the convex hull of the kriging surface. Models are rejected if less than this fraction. Useful for removing models with excessive extrapolation. |
nfold |
integer. Number of folds to apply in k-fold cross validation (CV). A value greater than equal to the number of observation points results in leave-one-out CV. The CV method is used to determine the function type for the variogram model, and calculate performance metrics for the kriging model. |
nreps |
integer. Number of times to repeat cross validation. Final results are based on the median performance metric from all repetitions. |
seed |
integer. Random number generator state for random number generation, used to replicate the CV results. |
minr2 |
numeric vector of length 1 or 2, value is recycled as necessary. Minimum value for R^2, between 0 and 1. Values applied separately to variogram and kriging analysis. Models are rejected if R^2 is less than the minimum value. |
benchmarks |
data.frame.
Benchmarks, such as a selected hazardous threshold.
A byproduct of quantile kriging, used to calculate the probability
of a predicted value being greater than the benchmark value.
The required variables include |
performance |
logical flag. Whether to calculate performance metrics for the kriging model. |
parallel |
logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work. |
robust |
logical flag.
Whether to display output from variogram fitting and kriging.
Only available if the |
old |
list.
Output from a past run of the |
... |
Not used |
A list with list components,
one for each parameter kriged, and named after their respective
parameter code (see parm_cd
argument).
Each list component has the following components:
kr_output
results from the kriging interpolation, that is,
the prediction (var1.pred
), standard error (var1.se
),
and kriging variance (var1.var
).
Where the kriging variance remains in transformed space.
If using quantile kriging, the following variables are also returned:
interval_range
, the interval range, or the deviation between
upper and lower bounds in units of measurement; and
probability
, the estimated probability that a predicted value
is higher than the benchmark value.
kr_cv
kriging model prediction performance metrics derived from CV.
Metrics include:
me
, mean error (ideally 0);
mse
, mean squared error (ideally small);
rmse
, root mean squared error (ideally small);
nrmse
, normalized root mean squared error (ideally small);
mae
, mean absolute error (ideally small);
cor
, correlation between observed and predicted values (ideally 1); and
r2
, coefficient of determination, or R^2 (ideally 1).
vg_cloud
variogram cloud.
vg_sample
sample variogram.
vg_model
fitted variogram model.
vg_summary
a summary of the variogram model with list components:
model
, is the model type;
nugget
effect, represents a discontinuity of the model at the origin;
sill
, is the variance when the model either reaches or becomes asymptotic to
a constant value as lag distance increases;
range
, controls the rate of increase with distance;
effective_range
, is the effective range, that is the distance at which
the variance value achieves 95 percent of the sill;
kappa
, is a shape parameter that controls the smoothness of the random field
that models the spatial variability of the observed data;
sserr
, is the sum of squared error (ideally 0); and
r2
, is the coefficient of determination (ideally 1).
obs
observation points used in the prediction.
Consider the discussion by Kitanidis (1997, p. 71, section 3.7) before
kriging with a moving neighborhood using the nmax
and maxdist
arguments.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
Clark, Isobel, 1979, Practical Geostatistics: London, Applied Science Publishers, 119 p.
Goovaerts, Pierre, 1997, Geostatistics for Natural Resource Evaluation: New York, Oxford University Press, 483 p.
Journel, A.G., and Deutsch, C.V., 1997, Rank order geostatistics: a proposal for a unique coding and common processing of diverse data, In Geostatistics Wollongton '96, Volume 1, ed. E.Y. Baafi and N.A. Schofield, Dordrecht, Boston: Kluwer Academic, p. 174–187.
Juang, K.-W., Lee, D.-Y., and Ellsworth, T.R., 2001, Using rank-order geostatistics for spatial interpolation of highly skewed data in a heavy-metal contaminated site: Journal of Environmental Quality, v. 30, no. 3, p. 894–903.
Kitanidis, P.K., 1997, Introduction to geostatistics—Applications to hydrogeology: New York, Cambridge University Press, 249 p.
Matern, B. 1960, Spatial variation—stochastic models and their applications to some problems in forest survey sampling investigations: Report of the Forest Research Institute of Sweden, v. 49, no. 5, 144 p.
Stein, M.L., 1999, Interpolation of spatial data—Some theory for kriging: Springer, New York, 247 p.
gstat::fit.variogram
, gstat::krige
, gstat::krige.cv
parm_cd <- c(chloride = "00940", tritium = "07000")
models <- KrigeSurfaceMap(inldata::samples, inldata::sites,
parm_cd = parm_cd, sdate = "1989-01-01",
resolution = 500, cutoff = 10 * 1609,
parallel = TRUE
)
names(models)
# Plot kriging results for each parameter
xlim <- sp::bbexpand(sp::bbox(inldata::sites)[1, ], 0.04)
ylim <- sp::bbexpand(sp::bbox(inldata::sites)[2, ], 0.04)
for (cd in names(models)) {
srsname <- inldata::parameters[cd, "srsname"]
x <- models[[cd]]
try(gridExtra::grid.arrange(sp::bubble(x$obs,
maxsize = 1,
main = srsname
),
gstat:::plot.gstatVariogram(
x$vg_sample,
x$vg_model
),
sp::spplot(x$kr_output, "var1.pred",
main = "Predicted",
xlim = xlim, ylim = ylim
),
sp::spplot(x$kr_output, "var1.se",
main = "Standard Error",
xlim = xlim, ylim = ylim
),
ncol = 2
))
prompt <- sprintf("%s (%s), press key to continue...", srsname, cd)
if (interactive()) invisible(readline(prompt))
}
Identifies an optimal subset of wells to exclude from an existing long-term monitoring network because they provide the least amount of information. A distributed multiple-population genetic algorithm (GA) is used to preserve spatial accuracy and long-term temporal trends in the reduced monitoring network.
OptimizeSpatial(
k,
sites,
samples,
site_no = NULL,
parm_cd = NULL,
arg = list(),
surv = NULL,
loess = NULL,
w1 = 1,
w2 = 1,
w3 = 1,
w4 = 1,
penalty = NULL,
exclude = NULL,
suggestions = NULL,
...
)
k |
integer. Number of sites to exclude from the existing monitoring network. |
sites |
SpatialPointsDataFrame.
Site coordinates with |
samples |
data.frame.
Water quality data from analysis of environmental and quality-control samples.
The required variables include |
site_no |
character vector. Site numbers to include in the analysis; all sites are included by default. |
parm_cd |
character vector. Parameter codes to include in the analysis; all parameters are included by default. |
arg |
list.
Additional arguments passed to the |
surv |
list.
Models from survival-regression analysis,
the resulting object from running the |
loess |
data.frame.
Predicted points from local-regression analysis,
the resulting object from running the |
w1, w2, w3, w4 |
numeric number. Weighting factor applied to objective function, see ‘Details’ section for objective descriptions. |
penalty |
numeric number. Penalty to apply to infeasible sampling schemes that result in unestimated points in the interpolation domain using kriging. |
exclude |
character vector. Site numbers to exclude from the decision space but include in the kriging analysis. |
suggestions |
matrix.
Monitoring networks to be included as initial guesses.
See returned |
... |
GA control parameters passed to the
|
The GA::gaisl
function
(Scrucca, 2013, 2017) is used to search for an optimal
subset of sites to remove from a existing monitoring network
because they add little or no beneficial information.
The number of sites to remove is fixed and specified by the user using the k
argument.
The performance of a network design is measured with respect to multiple objectives,
each characterizes a contribution to the overall value (or fitness) of the network design.
The design problem is formulated as a single-objective optimization using a weighted sum approach.
More specifically, an islands GA method is used to minimize a fitness function
that comprises a set of weighted objective functions.
Weights are assigned to individual objective functions to establish
the relative influence of each objective and (or) resolve conflicting objectives.
Objectives are described as follows:
Minimize the normalized root-mean-square deviation between kriged concentrations estimated using the existing and reduced monitoring networks.
Decrease spatial interpolation errors by minimizing the ratio between the sum kriging variance estimated for the reduced and existing monitoring networks.
Preserve significant long-term monotonic temporal trends in wells by minimizing the maximum fractional change per year estimated from survival-regression analysis, in wells removed from the existing monitoring network.
Preserve continuous record blocks that have the largest variability by minimizing the maximum normalized range of concentrations estimated from local-regression analysis, in wells removed from the existing monitoring network.
A list object with components:
call
, solution
, ga_output
, and ga_time
,
(see returned value from inlmisc::FindOptimalSubset
function);
rm_sites
, a vector of site numbers indicating the wells that have been optimally selected for removal;
fitness
, a data table of weighted fitness values for individual objectives in the fitness function.
The fitness
component has the following variables:
parm_cd
USGS 5-digit parameter code;
epoch
interval between migrations in the islands GA run; and
w1f1, w2f2, w3f3, w4f4
weighted fitness value for individual objective functions.
The fitness
component has a "weights"
attribute that
contains the weights assigned to individual objective functions.
Fitness values can be displayed using the PlotFitness
function.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
Scrucca, Luca, 2013, GA: A Package for Genetic Algorithms in R: Journal of Statistical Software, v. 53, no. 4, p. 1–37.
Scrucca, Luca, 2017, On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution: The R Journal, v. 9, no. 1, 187–206.
site_no <- with(inldata::sites@data, site_no[network == "aquifer"])
parm_cd <- "07000" # tritium
arg <- list("sdate" = "1989-01-01", "cutoff" = 16093, "nbins" = 20)
exclude <- with(inldata::sites@data, site_no[completion_cd != "O"])
ans <- OptimizeSpatial(5, inldata::sites, inldata::samples,
site_no, parm_cd, arg,
w1 = 10, w4 = 0.1,
exclude = exclude, run = 5, maxiter = 100,
numIslands = 4, parallel = TRUE, seed = 123
)
print(ans)
print(m <- ans[["solution"]])
locations <- inldata::sites[m[1, ], ]
print(locations@data[, c("site_no", "site_nm")])
print(ans$fitness)
print(attr(ans$fitness, "weights"))
PlotFitness(ans$fitness, type = "objectives")
inlmisc::PlotMap(locations,
dms.tick = TRUE,
roads = list(inldata::roads),
lakes = list(inldata::lakes),
rivers = list(inldata::streams),
scale.loc = "bottomleft"
)
points(inldata::sites, pch = 19, col = "black", xpd = FALSE)
points(locations, pch = 19, col = "red")
raster::text(locations,
labels = locations@data$site_nm,
cex = 0.7, pos = 4, halo = TRUE
)
plot(ans[["ga_output"]])
summary(ans[["ga_output"]])
Provides a method for analyzing the temporal spacing of observations.
It determines if fewer observations could have been used to
characterize analyte concentrations at a sampling site over time.
An iterative thinning approach, based on an algorithm described by Cameron (2004),
is used to identify an optimal sampling interval that may be used for future long-term monitoring,
one that reduces temporal redundancy.
Before running this function, ensure that the data are appropriate for temporal redundancy analysis;
the FlagTemporalData
function is provided for this purpose.
OptimizeTemporal(
samples,
site_no = NULL,
parm_cd = NULL,
span = NULL,
by = NULL,
n = 200,
conf = 0.9,
nsim = 500,
threshold = 0.8,
tol = 0.05,
Fun = stats::median,
mult = 0.5,
save_mc = FALSE,
parallel = FALSE,
monitor = FALSE,
seed = NULL
)
samples |
data.frame.
Water-quality data from analysis of environmental and quality-control samples.
The required variables include |
site_no |
character vector. Site numbers to include in the analysis; all sites are included by default. |
parm_cd |
character vector. Parameter codes to include in the analysis; all parameters are included by default. |
span |
numeric.
Smoothing parameter, defined as the fraction of the total number of data points that is used in each local fit.
Larger values give more smoothness.
Specify as |
by |
number, difftime, or string.
Increment of the predicted points over the period of record.
Specify as either a number in days, object of class |
n |
integer.
Number of predicted points uniformly distributed over the period of record.
Only used if the |
conf |
numeric number. Confidence level expressed as a fraction, used to construct confidence intervals that make up the confidence band for the Loess curve. |
nsim |
integer. Number of data-thinning simulations to perform during each iteration of the bisection method. For each simulation, a different set of randomly selected observations is removed from the full-data set and a new Loess curve is computed. |
threshold |
numeric. Fraction of predicted points on the Loess curve fitted to the reduced-data set, that must fall within the confidence band computed from the full-data set. |
tol |
numeric.
Error tolerance for the bisection method; if crossed,
and the regression is robust (see |
Fun |
function.
Function used to aggregate daily values.
By default, the |
mult |
numeric.
Multiplier for censored values ( |
save_mc |
logical flag. Whether to return the data-thinning simulations. |
parallel |
logical flag or integer count. Whether to use parallel computing. This argument can also be used to specify the number of cores to employ; by default, this is the number of physical CPUs/cores. The parallel and doParallel packages must be installed for parallel computing to work. |
monitor |
logical flag.
If true, displays output after each iteration of the bisection method.
Monitoring is only available if the |
seed |
integer. Random number generator state for random number generation, used to replicate the results. The doRNG package must be installed if using parallel computing. |
The median date gap (that is, the median number of days between consecutive
and unique sampling dates) is first calculated and used as the historic sampling interval.
The iterative thinning algorithm (Cameron, 2004) computes
a confidence band around a Loess curve fitted to the observations;
see FitLoessCurve
function for details.
A fraction of the observations is removed from the full-data set and
a new Loess curve is fitted to the reduced-data set.
The ability of the reduced-data set to reproduce the historical trend is evaluated by
calculating the fraction of the new Loess curve fitted to the reduced-data set
that falls within the confidence band established using the full-data set.
Increasing numbers of observations are removed from the full-data set and
evaluated for their ability to reproduce the historical trend.
A data-thinning method is used to safeguard against irregular trends that may arise from
the selection of a single set of observations to remove.
The optimal fraction of observations to remove, while still reproducing the historical trend,
is computed using the bisection method.
Finally, an optimal sampling interval, that minimizes temporal redundancy,
is expressed as the historic sampling interval divided by one minus the optimal fraction to remove.
A data.frame with the following components:
site_no
site number
parm_cd
parameter code
obs
observations in the unevenly-spaced time series
stored as a matrix with columns x
y
, and cens
. Where
x
is the sample date,
y
is the measured value, and
cens
indicates whether the measurement is censored.
frac_cens
fraction of censored measurements.
ndates
number of unique sampling dates
longest_gap
longest number of days between consecutive samples.
freq
historic sampling interval, in days, estimated as the median date gap.
pred
Loess curve and confidence band stored as a matrix with columns
x
, y
, se
, lcl
, and ucl
.
See returned value from FitLoessCurve
function for column descriptions.
reduction
largest fraction of observations that can be removed while preserving the temporal trend of the full-data set.
mc
Loess curve for each data-thinning simulation stored as
a n
by nsim
matrix of predicted values.
This component is only available if the save_mc
argument is specified as true.
opt_freq
optimal sampling interval, in days.
freq_desc
adjective describing the historic sampling frequency;
see DescribeFrequency
function for details.
opt_freq_desc
adjective describing the optimal sampling frequency.
The returned object includes an elapsed_time
attribute,
an object of class difftime
giving the total time taken to execute the function.
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
Cameron, K., 2004, Better optimization of long-term monitoring networks: Bioremediation Journal, v. 8, no. 3–4, p. 89–107.
site_no <- "432019112563201" # well USGS 14 MV-61
parm_cd <- c(chloride = "00940", tritium = "07000")
d <- OptimizeTemporal(inldata::samples, site_no, parm_cd,
by = "month", save_mc = TRUE,
monitor = TRUE, seed = 123
)
print(d)
print(attr(d, "elapsed_time"))
i <- 1L # row index
obs <- d$obs[[i]][, 1:2]
graphics::plot(obs,
type = "n", xlab = "Julian day",
ylab = d$parm_cd[i], main = d$site_no[i]
)
x <- d$pred[[i]][, "x"]
graphics::matlines(x, d$mc[[i]], lty = 1, lwd = 0.5, col = "gray")
graphics::matlines(x, d$pred[[i]][, c("lcl", "ucl")],
lty = 2,
lwd = 2, col = 2
)
graphics::lines(x, rowMeans(d$mc[[i]], na.rm = TRUE),
lwd = 2, col = 3
)
graphics::points(obs)
Visualize the performance of the sampling network optimization problem by graphing fitness values across generations.
PlotFitness(
fitness,
type = c("monitor", "objectives", "parameters"),
ylim = NULL,
show_main = TRUE,
digits = getOption("digits")
)
fitness |
data.frame.
See |
type |
character.
Plot type, possible types are
|
ylim |
numeric vector of length 2. Minimum and maximum values for the y-axis. |
show_main |
character. Whether to show the main title. |
digits |
integer. Number of decimal places used when formatting the range of the best fitness value. |
Invisible NULL
J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center
An optimization analysis is performed when building the vignette documents of the R-package ObsNetQW. Prior to re-building these vignettes, be aware of the long run times (on the order of weeks) associated with the spatial component of the optimization analysis. First create a local copy of the ObsNetQW package source material (stored in a remote repository) using the following command:
git2r::clone("https://code.usgs.gov/inl/obsnetqw.git", "obsnetqw")
The package vignettes are located under obsnetqw/inst/doc of the working directory. To re-build the vignettes from their source files, run:
inlmisc::BuildVignettes("obsnetqw")
R packages get updated over time, so future versions of dependent packages may result in errors. To address this, it may be necessary to recreate an identical R environment where all packages are consistent with package versions used in this report. Version information about R, the operating system, and attached and loaded packages at the time of processing is as follows:
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
attached base packages:
other attached packages:
loaded via a namespace (and not attached):
For reproducibility purposes, Microsoft® R Application Network (MRAN) stores daily snapshots of the CRAN packages and R releases. Use the May 5, 2021 snapshot to recreate the R environment used in this report. To do so, download and install R version 4.0.5 (2021-03-31) from the MRAN and use the following commands to recreate the R environment:
install.packages("remotes")
repos <- "https://cran.microsoft.com/snapshot/2021-05-05/"
update.packages(repos = repos)
remotes::install_gitlab("inl/obsnetqw@v1.0.0",
host = "code.usgs.gov",
dependencies = TRUE,
repos = repos,
)
The optimization analysis was performed on a 4 cores Intel® Xeon® CPU E5-1620 v3 running at 3.50GHz and with 32GB of RAM.