
Digital Mapping Techniques '02 -- Workshop Proceedings
U.S. Geological Survey Open-File Report 02-370

Digital Archives and Metadata as Mechanisms to Preserve Institutional Memory

By John C. Steinmetz, Richard T. Hill, and Kimberly H. Sowder

Indiana Geological Survey
Indiana University
611 North Walnut Grove
Bloomington, IN 47405-2208
Telephone: (812) 855-7636
Fax: (812) 855-2862
e-mail: jsteinm@indiana.edu, hill2@indiana.edu, sowderk@indiana.edu

ABSTRACT

Metadata are essential for any reliable geographic information system (GIS). Their utility extends beyond GIS, however. The Indiana Geological Survey (IGS) is employing metadata as a means to help preserve institutional memory. Because of its size and diversity, the IGS generates a multitude of data types in any given period of a few months. Unless properly documented, field and analytical data, samples, unpublished maps, well and mine records, and other similar data and information can be lost within the organization. Because these data are commonly used by various staff for different projects and for different reasons, the opportunity to misplace or lose them is great. Moreover, the permanent departure of any employee, whether through retirement or normal attrition, is an occasion for an organization to lose a large amount of knowledge about how things were done, the stage at which projects were abandoned, and even the physical location of important data. Once data are lost, precious staff time must be spent searching for, reconstructing, or even re-collecting them, and then only if resources permit.

Virtually all data collected or processed by the IGS are geospatially oriented. Hence, the application of metadata to their cataloging within an organization is easy to envision. By constructing metadata incrementally throughout the life of a project, a means is provided to ensure that adequate documentation is captured upon publication of products, and also prior to an investigator's departure.

INTRODUCTION

Congratulations! You have just won the $10 million PowerBall Lottery! After carefully reconsidering your life's priorities, you promptly turn in your resignation before retiring to the Bahamas, leaving no forwarding address. But what about those data sets you have been working on for the last five years? Does your organization lose your knowledge about the data? Will the person who replaces you and inherits your data know enough about them to use them? Even though this will not be of concern to you as you bask on the beach, it should be of great concern to the organization.

INSTITUTIONAL MEMORY AND DISCOVERY

Data are expensive. Whether they are gathered by an investigator in the field or generated by an analyst in the laboratory, the cost of their acquisition is great. As expensive as data are, the replacement cost is even greater: Scientific staff must be paid again, and equipment maintained and refurbished, assuming the original collection site still exists.

Beyond the cost of collection, the value of scientific data lies in their use. Whether used by the original investigator or years later by someone else, scientific data only maintain value if utilized. To be of utility, they must be accessible; to be accessible, they must be discoverable. Many organizations rely heavily on institutional memory -- the collective knowledge and history of an organization held by employees, especially those who have been there for a number of years (National Research Council, 2002) -- to aid in the discovery and accessibility of data. All too often when a long-term and productive employee retires or leaves an organization, an institutional memory of inestimable value is lost. With him or her commonly goes such simple information as the physical whereabouts of a data set. Once lost within the organization -- namely, once its location is no longer known within the building -- the data set is essentially useless.

Finding (discovering) data within an organization involves identifying the existence and location of desired data sets and collections. Ancillary considerations include ascertaining data availability, quality, and format. Within the petroleum industry, it has been estimated that between 60 and 80 percent of a geoscientist's time is spent searching for data; the balance is spent organizing and analyzing them (S. Natali, Barrett Resources, personal commun., 2001). One internal goal of most public or private organizations is to shorten the discovery time so the investigator can invest more valuable time in using the data. In many instances, however, potential users gain knowledge of, and access to, data by traditional means: through personal acquaintances, letters, on-site visits, or by telephone, fax, or e-mail. Too often, knowledge of the mere existence of geoscience data relies on personal relations, that is, on institutional memory.

Discovery of data from outside an organization requires a certain degree of public relations by the organization that is archiving the data. The IGS, like many state and federal Earth science institutions, promotes its holdings not only via e-mail and the Internet, but also through mass mailings, professional meetings, posters, and CD-ROM's. Digital data catalogs, and Internet access to them, are increasingly common, but many investigators are surprised to learn that digital access to the collections those catalogs describe is not yet available. In many instances, funds to build an electronic catalog and provide Internet access can be garnered only from existing operational funds; new money for these efforts is rarely provided.

Adequate cataloging of data may seem time consuming and not terribly exciting, yet the costs involved in data acquisition generally far outweigh all other costs combined; reacquisition of data, if even possible, is more costly than initial acquisition and retention. In the current economy, organizations simply cannot afford to lose the usefulness of valuable data by neglecting documentation procedures. The IGS has undertaken the task of inventorying, cataloging, and creating metadata for new and historical data sets. To promote this initiative, the IGS has formed the Data at the Indiana Geological Survey Committee (DIGS Committee). The chair of the committee is the head of the Technology Transfer Section, and the members include staff across all IGS disciplines. They are charged with examining the factors involved in conducting a Survey-wide inventory of files and records, samples, archives, and publications. The committee's immediate goals are to capture IGS data, to develop the means of data retrieval, to develop a database to organize and access records of all IGS data, and to provide Internet access to an inventory of selected records. Some considerations the committee is taking into account are the lumping and splitting of items into various categories, design of inventory forms, necessary resources (personnel, equipment, supplies), metadata and quality assurance, public vs. proprietary data, barcoding, prioritization of data capture, efficiency and ease of the inventory process, staff training, and Internet deliverability.

Geospatial Data

With the exception of administrative records, virtually all of the data at the IGS have a geospatial component to them. ("Geospatial" refers to information that identifies the geographic location and characteristics of natural or constructed features and boundaries on the Earth.) Some of the types of data at the IGS include maps, publications, open-file studies, CD-ROM's, rock and mineral specimens, thin sections and rock analyses, fossil specimens and paleontological data, outcrop descriptions and photographs, cores and core descriptions, ground-penetrating radar data, downhole sample data and interpretations, shallow and deep geophysical data, field measurements, field and laboratory chemical and physical analyses, aerial photographs, digital information of an increasing variety, card files, lithologic strips, project files, reports, drawings, transparencies, X-ray data, seismic data, field notes, log picks, photographic negatives, cross sections, test procedures, gravity and magnetic data, and more.

Metadata

The phenomenal development of telecommunications in the late 1990's has been accompanied by fundamental shifts in how scientific data are gathered, accessed, and used. Until recently, most data and information were published in paper format (for example, books or maps), and access was provided through card catalogs (paper or electronic) with limited search capabilities. Now, data can be posted on the Internet, searched in a multitude of ways, and accessed through clearinghouses and numerous other portals. To facilitate discovery and access, metadata are used.

Metadata are descriptive information about data and information resources. Typically, metadata describe, point to, or otherwise complement the information content of the data to which they are related. Metadata provide a concise aid in locating desired information and help make such information easily accessible. They are particularly useful for geospatial information because federal standards have been written for their content.

On April 11, 1994, President Clinton signed Executive Order 12906. This order, among other things, established "a coordinated National Spatial Data Infrastructure (NSDI) to support public and private sector applications of geospatial data in such areas as transportation, community development, agriculture, emergency response, environmental management, and information technology." Additionally, Executive Order 12906 mandated "the Standardized Documentation of Data . . . each agency shall document all new geospatial data it collects or produces, either directly or indirectly, using the standard under development by the FGDC [Federal Geographic Data Committee], and make that standardized documentation electronically accessible to the Clearinghouse network."

The FGDC standard describes what information the metadata are to provide and in what format it should be provided. For example, the standard directs that the producer of a data set must describe the data's quality. Metadata help ensure that data remain usable in perpetuity. Moreover, metadata provide assurance that the data are of sufficient quality and validity, and they remove one of the greatest barriers to the use of scientific data: the difficulty of discovery.
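To make the standard's content requirements concrete, the following sketch assembles a skeletal metadata record using only the Python standard library. The element names (idinfo, citation, descript, keywords, and so on) follow the FGDC Content Standard for Digital Geospatial Metadata, but a compliant record requires many more sections, including data quality, spatial reference, and distribution information; the sample values and the helper function are invented for illustration.

    # Minimal sketch of an FGDC CSDGM-style metadata record. A real,
    # compliant record needs many more sections (data quality, spatial
    # reference, distribution, metadata reference).
    import xml.etree.ElementTree as ET

    def build_metadata(title, originator, pubdate, abstract, purpose, theme_keywords):
        meta = ET.Element("metadata")
        idinfo = ET.SubElement(meta, "idinfo")

        citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
        ET.SubElement(citeinfo, "origin").text = originator
        ET.SubElement(citeinfo, "pubdate").text = pubdate
        ET.SubElement(citeinfo, "title").text = title

        descript = ET.SubElement(idinfo, "descript")
        ET.SubElement(descript, "abstract").text = abstract
        ET.SubElement(descript, "purpose").text = purpose

        theme = ET.SubElement(ET.SubElement(idinfo, "keywords"), "theme")
        ET.SubElement(theme, "themekt").text = "None"
        for kw in theme_keywords:
            ET.SubElement(theme, "themekey").text = kw
        return meta

    # Hypothetical sample values, not an actual IGS data set.
    record = build_metadata(
        title="Bedrock geology of Monroe County, Indiana (hypothetical example)",
        originator="Indiana Geological Survey",
        pubdate="2002",
        abstract="Digital bedrock geologic map data.",
        purpose="Support geologic mapping and resource assessment.",
        theme_keywords=["geology", "bedrock", "geospatial"],
    )
    print(ET.tostring(record, encoding="unicode"))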

Training of IGS staff in metadata guidelines as specified by the FGDC Content Standard for Digital Geospatial Metadata has been provided through a series of in-house workshops. A policy has been prepared by the IGS administration (see Appendix) to require metadata creation for all new data sets and the creation of project metadata for final products. The ultimate goal is to increase the value of already valuable data and to make them easier to access and retrieve. The IGS Intranet Web site currently gives staff easy access to metadata keyword and category searches, and the IGS Internet site will ultimately provide users with data that can be downloaded.

Access

Balanced against the cost of acquisition, the cost of retaining geoscience data is a mere fraction, and data may acquire an increased value through time (Montgomery, 1999; Figure 1), yet unless those data are accessible, they are useless. Before the electronic age, lists of data in collections were kept in (serial) logbooks or on (alphabetic) file cards. An individual familiar with the order of the record-keeping system was essential to look up the data listing and to locate the physical whereabouts of the desired data. Access depended on a high degree of institutional memory, and on individuals who cared about the system and its organization. Archives that rely on institutional memory are prone to degrade when staff transfer, retire, or otherwise leave the institution. Today, computer databases that catalog a collection's holdings can be searched and queried by any number of descriptive parameters, even remotely over the Internet, utilizing much of the same technology developed by libraries.

Figure 1. The short-term cost and value of data, either gathered in the field or generated in the laboratory, differ from their long-term cost and value. While the initial cost to acquire data may be quite high (left), the annual and ongoing costs for retention can be low. The costs of reacquiring the data at some time in the future (to the right of the jagged lines), if reacquisition is even possible, are typically much higher than the original acquisition costs. (Figure modified from National Research Council, 2002.)
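The advantage of such a database over the logbook or card file is that a search need not follow the single ordering a paper system imposes. A minimal sketch in Python, using an in-memory SQLite catalog with an illustrative schema (not an actual IGS one), shows the kind of multi-parameter query involved.

    # Sketch of a digital catalog that replaces the logbook or card file:
    # records can be queried by any descriptive parameter. Table, field
    # names, and rows are illustrative only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE holdings (
        item_id   TEXT PRIMARY KEY,   -- e.g., a barcode
        data_type TEXT,               -- core, thin section, field notes, ...
        county    TEXT,
        location  TEXT,               -- physical shelf/drawer location
        keywords  TEXT)""")
    conn.executemany(
        "INSERT INTO holdings VALUES (?, ?, ?, ?, ?)",
        [("IGS-000123", "core", "Monroe", "Annex, rack 14", "limestone; Salem"),
         ("IGS-000124", "field notes", "Greene", "Room 212, cab. 3", "coal; outcrop")])

    # A query by data type and keyword -- the kind of multi-parameter
    # search that institutional memory alone cannot support.
    for row in conn.execute(
            "SELECT item_id, location FROM holdings "
            "WHERE data_type = ? AND keywords LIKE ?", ("core", "%limestone%")):
        print(row)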

The American Association of Petroleum Geologists (AAPG) has been promoting geoscience preservation and access for over 50 years. It has had a standing committee for core and sample preservation since 1948, and supports the American Geological Institute (AGI) proposal to create a centralized repository, the National Geoscience Data Repository System (NGDRS) -- in effect, a Library of Congress for samples in the public domain (American Geological Institute, 1994, 1997; Montgomery, 1999). To initiate the formation of the NGDRS, AGI secured support from the U.S. Department of Energy and some petroleum companies, developed a repository data model, facilitated the transfer of some data (cores, cuttings, paleontological samples, seismic data, logs, and scout tickets) from the private to the public sector, and implemented and is currently operating GeoTrek, a software data catalog and access system available on the Internet at http://www.agiweb.org/NGDRS/. The location of the centralized facility has not been determined, and petroleum industry support is mixed. Companies are not willing to donate materials until a repository is located (Montgomery, 1999), and many individuals feel that a network of distributed repositories at key locations in the country would foster a greater degree of use. Finally, many state geological surveys would not contribute their materials, since they already have a statutory obligation to archive state-derived data.

Computerization

Computerization involves the digitizing of paper records, copying from one electronic medium to another, and/or re-formatting existing digital data. Increasingly, collections are cataloged in digital databases. Nevertheless, paper serves as an important medium of storage, if only as a visible backup. At most institutions, few specimens are accompanied by digital data when they arrive. In nearly all cases, specimen data arrive as collector-generated labels, scientific publications, and maps that accompany the samples. Specimen data are usually prepared for computer entry by initially organizing them on handwritten forms. Although this seems cumbersome, the two-step process cuts down on errors and leaves a tangible trail. The goal is an error-free inventory database.

Once in digital form, data are not guaranteed immortality. Data loss can result from physical degradation of the magnetic medium (particularly tape, which should be re-written about every 5 years), obsolete formats (and obsolete equipment to access them), the migration from one format to another, or the lack of complete auxiliary data (such as header information, recording parameters, calibration data, metadata).

CD-ROM storage currently is one of the more popular forms of digital data storage. Benefits include a simple and low-cost replication process, the ability to store multiple data types (e.g., text, images, video, and audio), and random access to the information. CD-ROM's also have a shelf life projected to exceed 25 years under standard office conditions.

Digital data also require periodic refreshing. Accessibility and retrievability can be guaranteed only if data are migrated to protect against media deterioration and technology evolution.
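One common safeguard for such refresh cycles, sketched below rather than drawn from IGS procedure, is a fixity check: a checksum recorded when a file is archived is recomputed periodically, and again after each migration, to detect silent degradation. File names and the recorded digest are placeholders.

    # Sketch of a fixity check supporting periodic refreshing: a stored
    # checksum is recomputed to detect silent media degradation and to
    # verify each migration to new media or formats.
    import hashlib

    def checksum(path, algorithm="sha256", chunk_size=1 << 20):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify(path, recorded_digest):
        ok = checksum(path) == recorded_digest
        print(f"{path}: {'intact' if ok else 'CORRUPTED -- restore from a second copy'}")
        return ok

    # On each refresh cycle (e.g., every few years for tape), recompute and
    # compare against the digest recorded when the file was archived:
    # verify("archive/pwdb_export_1999.dat", "<digest recorded at ingest>")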

EARLY LESSONS

Some of the early lessons the IGS DIGS Committee learned in undertaking an institution-wide inventory effort are:
  1. There are five stages of data preservation:
    1. Data acquisition or assimilation;
    2. Storage and maintenance;
    3. Awareness;
    4. Accessibility;
    5. Usefulness (sufficient quality and validity to be believable).
     Failure of any stage results in all stages being repeated, and metadata are necessary for each stage (National Research Council, 2002).
  2. Standardization of data structure is essential. It contributes to a consistent vocabulary of keywords and it facilitates metadata creation and ease of use. Similarly, templates for metadata enhance the ease of data capture and help ensure compliance with FGDC standards.
  3. Bar coding serves numerous purposes and is a popular means of controlling inventory. Barcoding of items in a collection not only enhances sample identity by connecting the user immediately to more complete metadata than can be recorded on a small label or box top, but also achieves another important and simple task: it easily signifies whether or not an item has been inventoried and is already part of the collection's catalog (see the sketch following this list).
  4. Each piece should be handled only once. Inventorying an institution with many different data types is a task of such magnitude that a certain efficiency is necessary if the inventory is to be successful. Physically handling each piece once and only once is an important step in that process.
  5. Resident staff participation is essential. Individual memories are an important part of the data-capturing process. Familiarity with, and personal investment in, the data would be lost if an "outsider" (for example, a contract laborer) were brought in merely to inventory physical objects.
  6. Staff participation must be sought, but only after the entire process has been thoroughly designed and rigorously tested. Since staff buy-in is critical, their participation can be assured only if they understand that the inventory will be taken only once. Everyone knows that staff time is expensive. Enthusiastic and determined participation of the staff can be won and sustained only if the inventory procedure is a tested and efficient process instead of a time-wasting experiment.
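The bar-coding point in lesson 3 can be illustrated with a small sketch: a scanned code either resolves to the full catalog record, which holds far more than fits on a label or box top, or its absence flags an item that still needs to be inventoried. The catalog contents here are invented for illustration.

    # Sketch of the two roles bar codes play: a scanned code resolves to
    # the full catalog record, and its presence in the catalog signifies
    # that the item has already been inventoried. Contents are illustrative.
    catalog = {
        "IGS-000123": {"type": "core", "county": "Monroe",
                       "location": "Annex, rack 14", "metadata": "meta/igs000123.xml"},
    }

    def scan(barcode):
        record = catalog.get(barcode)
        if record is None:
            print(f"{barcode}: not in catalog -- item still needs to be inventoried")
        else:
            print(f"{barcode}: {record['type']}, stored at {record['location']}; "
                  f"full metadata in {record['metadata']}")

    scan("IGS-000123")   # already inventoried
    scan("IGS-000999")   # flags an uninventoried item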

CONCLUSION

Properly cataloged geoscience data (geospatial data) are a unique and unconventional resource library of increasing value. Metadata provide a means to efficiently catalog and readily access those data. Data documentation can be a long and time-consuming process, but the value of knowing the details about data far outweighs the trouble of documentation. More information is created and shared today than at any time in the past. Users of data want easy access and quick results, as well as information guaranteeing the accuracy of the data they wish to use. Organizations should make the commitment to provide data with proper metadata, and to garner information from the individuals who have created the data before they hit the lucky numbers on that big lottery ticket pay-off.

FUTURE GOALS

Numerous state geological surveys are in the process of digitizing data and providing wider access by publishing catalogs on the Internet. Financial resources for staff and equipment are, in many cases, the only impediments to digitizing and providing Internet access to data.

The application of informatics may be an important goal in geoscience data discovery and access. In such a scenario, all data would be in digital form and accessible over the Internet. Each sample could be located by its spatial coordinates, and attendant metadata would record the circumstances under which the sample was collected and would provide quality control. Such a system requires standardized formats for data archiving, software support, data mining tools, and a knowledgeable end-user community (see, for example, the Kansas Geological Survey's Geoinformatics efforts at http://www.kgs.ukans.edu/Geoinfo2/).
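As a small illustration of that scenario, the sketch below discovers samples by a bounding-box query on their spatial coordinates; the sample identifiers and coordinates are invented.

    # Sketch of the informatics scenario: every sample carries spatial
    # coordinates, so holdings can be discovered by a bounding-box query.
    samples = [
        ("IGS-000123", 39.17, -86.53),   # (id, latitude, longitude)
        ("IGS-000124", 39.04, -86.98),
        ("IGS-000125", 41.60, -87.35),
    ]

    def in_bbox(south, west, north, east):
        """Return sample IDs whose coordinates fall inside the box."""
        return [sid for sid, lat, lon in samples
                if south <= lat <= north and west <= lon <= east]

    # All samples in a box covering south-central Indiana:
    print(in_bbox(south=38.8, west=-87.2, north=39.4, east=-86.3))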

The Smithsonian's National Museum of Natural History (NMNH) is creating a "Research and Collections Information System" that approaches an informatics-based system. The intention is to accomplish three main goals: (1) better collections management to track the disposition and locations of specimens acquired, loaned, borrowed, or disposed of; (2) online access to all digital specimen data for the benefit of museum research, collections, and public programs staff, scientists, and the general public worldwide; and (3) participation in national and international informatics initiatives. Using a suite of internationally used software applications, NMNH staff have begun to implement the system slowly in a number of science departments. The software was chosen for its stability, scalability, flexibility across diverse NMNH disciplines, and capacity for customization. Museum officials estimate that between 40 and 50 million records will adequately represent NMNH specimens, at a cost of $55 to $75 million. Presently, there are no funds for data entry, and the collections care and informatics initiatives are stalled for lack of funds.

REFERENCES

American Geological Institute (AGI), 1994, National Geoscience Data Repository System feasibility and assessment study, submitted to the Office of Fossil Energy, U.S. Department of Energy: Alexandria, Virginia, American Geological Institute, 68 p.

American Geological Institute (AGI), 1997, National directory of geoscience data repositories: Alexandria, Virginia, American Geological Institute, 91 p.

Montgomery, S.L., 1999, Core values: the growing need for repositories: Oil & Gas Journal, v. 97, no. 46, November 15, 1999, p. 84-87.

National Research Council (NRC), Committee on the Preservation of Geoscience Data and Collections, Committee on Earth Resources, 2002, Geoscience collections and data: National resources in peril: Washington, D.C., National Academy Press, 205 p.

APPENDIX

The Indiana Geological Survey Metadata Policy

The Administration and Staff of the Indiana Geological Survey recognize the inherent value of the work they undertake and the data they generate. Additionally, they recognize the geospatial nature of virtually all of these data and information. Further, they recognize that data without proper documentation lose their worth, are vulnerable to being lost, support only questionable and tentative decisions, and may never be used again.

As stewards of public information, the IGS has an obligation to provide high-quality, well-documented data sets and information through readily searchable and easily accessible means. This policy is established and designed to ensure and facilitate full and open access to quality data for research and education.

Metadata serve as a means to efficiently collect, preserve, manage, access, and disseminate these data and information.

The intentions of establishing a Metadata Policy are to:

Preserve data for future use;
Save time, resources, and duplicated effort;
Contribute toward building the National Spatial Data Infrastructure;
Support sound science and decision-making;
Serve as a basis for an inventory of IGS holdings.

IGS Metadata Policy

The Metadata Working Group (of the DIGS Committee) serves as an internal resource, providing guidance by answering questions, establishing metadata templates, and helping to ensure ease of use in the metadata creation process. It will not, however, write metadata for the general staff.

After the 6-month implementation period, comments about the metadata system will be solicited. The Metadata Working Group will review all comments and determine what, if any, changes need to be made to the system.

The Metadata Process at IGS

What products are required to have metadata?
Any completed data product
IGS publications
IGS Open-File Studies
Final reports on projects, both internal and external
Maps and GIS products
Digital data images
Databases
Collections of samples or data
Who needs to create metadata?
Project director and/or project staff, that is, those closest to the actual generation of the data.
Procedure to create metadata
Follow the file-naming conventions established by the Metadata Working Group
Utilize the IGS metadata template to create metadata
Follow authorship/citation guidelines
Insert all necessary keywords
Categories: theme, place, stratum, temporal
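A draft record can be checked against such a procedure before submission. The sketch below tests a hypothetical file-naming pattern and confirms that each required keyword category is populated; the pattern and field names are assumptions, not the Metadata Working Group's actual conventions.

    # Sketch of a pre-submission check: confirm the file name follows a
    # convention and that each required keyword category is populated.
    import re

    REQUIRED_CATEGORIES = ("theme", "place", "stratum", "temporal")
    NAME_PATTERN = re.compile(r"^[a-z0-9_]+_meta\.xml$")   # hypothetical convention

    def check_submission(filename, keywords):
        problems = []
        if not NAME_PATTERN.match(filename):
            problems.append(f"file name '{filename}' violates naming convention")
        for category in REQUIRED_CATEGORIES:
            if not keywords.get(category):
                problems.append(f"no '{category}' keywords supplied")
        return problems or ["ready for editorial review"]

    print(check_submission("bedrock_monroe_meta.xml",
                           {"theme": ["geology"], "place": ["Monroe County"],
                            "stratum": ["Salem Limestone"], "temporal": ["Mississippian"]}))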
How should the metadata be submitted?
Completed metadata should be submitted to the publications review coordinator for internal technical and editorial review.
Following approval by the director, the metadata will be available for public release, and they will also be archived by Technology Transfer in the Document Archive Database and included in the metadata search engine on the IGS Intranet.

Handling of proprietary data

Proprietary data will be kept physically separate from those that are publicly available, and they may be used only by IGS staff or publicly released with permission of the director or his designate.
How are incomplete databases to be handled?
The IGS Petroleum Well Data Base (PWDB), for example, is large and comprehensive, yet its data are by no means perfect, nor is it a completed database. It is therefore necessary to document data quality as precisely (and candidly) as possible so that the end user can evaluate whether or not to use the data set.

