Content Metadata Standards for Marine Science: A Case Study, USGS Open-File Report 2004-1002
MRIB Case Study
There is broad interest, for both educational and practical reasons, in understanding the processes of the coastal, marine, and lake environments through the scientific lens. (For brevity, the remainder of this paper will use marine to indicate all three environments.) Marine scientists are challenged to share their knowledge with coastal residents, government decision-makers, fishers, learners, and other lay people.
The rise of the Internet, particularly the World Wide Web, has made the sharing of scientific information with both academic and lay audiences an apparently instantaneous process. A scientist may author a Web page to share his or her data the same day that the data are processed. (Or he or she may instruct a computer to generate the page automatically at regular intervals.) This paper will use the general term information resources to include Web pages (individual HTML documents that may have other media types embedded in them), Web sites (agglomerations of one or more Web pages), and other Web-served media that present scientific information at any level of technical difficulty.
Access to the wealth of scientific information available over the Internet is impeded because most Web searching tools (among them search engines such as the popular Google) do not provide an efficient way to find scientific information. First, traditional search engines cannot infer synonymous words. If one asks Google to find pages containing the text "Southeastern U.S.," it will not intuit that one also desires pages that substitute for "Southeastern U.S." the names of individual states constituting that region. Second, traditional search engines cannot rule out homographs, or phrases that include the searcher's words but are irrelevant to the searcher's purpose. For instance, a searcher wanting scientific information about Monterey Bay might run a search for the phrase "Monterey Bay." The results would likely omit relevant pages that refer to the area as "Monterey Sanctuary" (but not "Monterey Bay") while including irrelevant pages such as travel guides and menus for local restaurants. Third, from the perspective of one seeking scientific information, traditional search engines do not provide any evaluation of scientific merit.
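The synonym problem can be illustrated with a minimal sketch. Everything in it is hypothetical: the page texts, the synonym table, and the function names are invented for illustration, and the list of states is deliberately partial. It only contrasts exact phrase matching with matching that consults a table of alternative terms; it does not represent how any real search engine works.

```python
# Hypothetical page texts; none of these are real search results.
PAGES = {
    "page_a": "Sediment transport studies in Monterey Bay",
    "page_b": "Habitat mapping of the Monterey Sanctuary seafloor",
    "page_c": "Shoreline change along the Georgia coast",
}

# Hypothetical synonym table mapping a search phrase to alternative phrasings.
# A purely literal text search has no access to knowledge of this kind.
SYNONYMS = {
    "monterey bay": ["monterey sanctuary"],
    "southeastern u.s.": ["georgia", "florida", "south carolina"],
}


def literal_search(phrase, pages):
    """Return ids of pages whose text contains the exact phrase."""
    needle = phrase.lower()
    return sorted(pid for pid, text in pages.items() if needle in text.lower())


def expanded_search(phrase, pages, synonyms):
    """Return ids of pages matching the phrase or any listed alternative."""
    terms = [phrase.lower()] + synonyms.get(phrase.lower(), [])
    return sorted(
        pid for pid, text in pages.items()
        if any(term in text.lower() for term in terms)
    )


print(literal_search("Monterey Bay", PAGES))              # ['page_a']
print(expanded_search("Monterey Bay", PAGES, SYNONYMS))   # ['page_a', 'page_b']
print(literal_search("Southeastern U.S.", PAGES))         # []
print(expanded_search("Southeastern U.S.", PAGES, SYNONYMS))  # ['page_c']
```

The literal search misses the "Monterey Sanctuary" page and finds nothing for "Southeastern U.S.," while the expanded search recovers both. Note that expansion alone does nothing to exclude irrelevant pages that happen to contain the phrase.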
If search engines based on automated textual analysis fail to meet the needs of one who seeks specific scientific information, then what? Efforts like the Open Directory Project (http://www.dmoz.org), which for convenience we will call "Web directories," provide an alternative to traditional search engines (which often link to them), but they also have their limits. For one thing, they usually will only place a single Web resource in a single classification. This one-page-one-listing strategy prevents flooding of the directory by particular Web sites of especially broad scope. Thus, a Web site whose pages represent the words taken from a dictionary is not guaranteed an entry in every category of the directory. On the other hand, the one-page-one-listing strategy also ensures that Web sites will only be listed in the most general terms, and prevents searching for information resources by more than one criterion. A second problem is that the form of Web directory listings is simplistic, providing the searcher with only a title, a URL, and a brief description. The user therefore has little information at hand to compare listings, and must "surf" to find the most useful resources, if any.
Specialized Web directories (for instance, annotated lists of links gathered by a specialist in a particular field) may offer more up-front information, but they too have their drawbacks. They cannot usually aggregate information between scientific fields. If they are maintained manually, then they are limited both in the number of information resources they can describe and in the precision with which these resources are differentiated. Finally, because they do not use metadata, they cannot interchange data with similarly functioning directories.
The U.S. Geological Survey needed a better way to present its coastal and marine geology Web resources as an organized collection, and to eventually integrate resources from other agencies, so it developed the Marine Realms Information Bank (MRIB) project. The MRIB, which can be found at http://mrib.usgs.gov, is a Web-based distributed library about marine environments. By calling it a library, we mean that it catalogues information resources in a consistent, rigorous way, just as a library catalogue does, and permits searching of the catalogue. Moreover, it attempts to duplicate the most basic forms of "good advice" that could be offered to a searcher by a knowledgeable reference librarian, using detailed descriptions of information resources to suggest potential avenues for further searching. By qualifying the "library" as "distributed," we mean that the information resources are not held in the MRIB. They are described in the MRIB, in documents that remain in the MRIB catalogue, but the resources themselves are elsewhere. A central part of the MRIB is an ontology for Web materials about the marine environments. That ontology is what sets the MRIB apart from search engines and Web directories. The ontology, which is an abstract organizational schema, is manifested in a metadata standard, and that standard is the focus of this paper.
As suggested earlier, information resources listed in the MRIB catalogue remain on their originating servers (and in the control of their creators), while the MRIB stores and searches locally-held metadata that describe those resources. The MRIB currently catalogues information resources according to an ontology constituted by thirteen facets (metadata fields with controlled vocabulary lists) as well as a suite of other textual and numerical metadata fields that do not rely on controlled vocabularies. The MRIB's metadata fields and vocabularies have been tailored to adequately describe the distributed library content (coastal, marine, and lake science documents) while serving a broad audience spectrum ranging from elementary school students to policy makers to oceanographers.
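The idea of describing a remote resource with facets drawn from controlled vocabularies, and then searching on several facets at once, can be sketched as follows. This is an illustration under stated assumptions, not the MRIB's implementation: the facet names ("location", "audience"), their terms, the URLs, and the Record and search names are all hypothetical, and only two facets are shown rather than thirteen.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabularies for two facets. The facet names and
# terms are invented for illustration; they are not the MRIB's actual lists.
VOCABULARIES = {
    "location": {"Monterey Bay", "Gulf of Maine", "Lake Michigan"},
    "audience": {"K-12 students", "policy makers", "researchers"},
}


@dataclass
class Record:
    """Locally held description of a resource that lives on another server."""
    url: str
    title: str  # free-text field, not tied to a controlled vocabulary
    facets: dict = field(default_factory=dict)  # facet name -> set of terms

    def __post_init__(self):
        # Reject any facet or term that is absent from the controlled vocabularies.
        for name, terms in self.facets.items():
            unknown = set(terms) - VOCABULARIES.get(name, set())
            if unknown:
                raise ValueError(f"{name}: terms not in vocabulary: {sorted(unknown)}")


def search(catalogue, **criteria):
    """Return records that carry every requested facet term."""
    return [
        rec for rec in catalogue
        if all(term in rec.facets.get(name, ()) for name, term in criteria.items())
    ]


# Placeholder catalogue; the URLs point nowhere in particular.
CATALOGUE = [
    Record("http://example.org/a", "Seafloor maps",
           {"location": {"Monterey Bay"},
            "audience": {"researchers", "policy makers"}}),
    Record("http://example.org/b", "Lake levels for classrooms",
           {"location": {"Lake Michigan"}, "audience": {"K-12 students"}}),
]

hits = search(CATALOGUE, location="Monterey Bay", audience="policy makers")
print([rec.title for rec in hits])  # ['Seafloor maps']
```

Because each record may carry several terms per facet, one resource can be found under more than one criterion, which the one-page-one-listing strategy of Web directories does not allow. Validating terms against the vocabulary at creation time is one simple way to keep the catalogue consistent.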
In this paper, we will outline the special challenges of categorizing digital information about the marine realms for a heterogeneous audience. Then we will review several metadata standards and their usefulness in light of those challenges. Next we will describe the process and outcome of the MRIB metadata standard development. Finally, we will evaluate the suitability of both the metadata standard and its development process to meet those challenges.