USGS Open-File Report 2004-1002, Content Metadata Standards for Marine Science: A Case Study, Discussion: Meeting the Challenges of Cataloguing Digital Marine Science Resources

Discussion: Meeting the Challenges of Cataloguing Digital Marine Science Resources

Earlier in this paper, we listed six special challenges posed by the creation of a digital library for the marine sciences. These challenges were critical considerations during the development and expansion of the MRIB metadata fields and controlled vocabularies. The MRIB became officially "public" in January 2003 (earlier versions were available online but were not actively promoted), so it is especially timely to evaluate critically both the MRIB metadata standard and the Web interface to the catalogue that uses that standard.

Early informal user testing of the MRIB suggests that the main user difficulties occur at the interface level. At this level, crucial considerations include 1) arranging the facets so that all of them are visible on the page and the purpose of each is clear, and 2) providing integrated definitions of the words used in the categorization scheme. In the meantime, despite the lack of a stable interface, it is possible to evaluate the MRIB metadata standard, at least on a preliminary basis, by considering the six challenges noted previously.

Challenge One: Accommodate geospatial and temporal "footprint" of information.

This challenge is perhaps the one most thoroughly met by the MRIB. The MRIB metadata fields include six fields specifically dealing with location (maximal, minimal, and mean latitudes and longitudes). Because these fields store information in a nearly universal format, decimal coordinates, any front-end can process the data for a variety of display and matching interfaces. The current MRIB front-end uses these data to 1) plot information resources on a global map based on their area of study and 2) match latitudinal and longitudinal ranges with entries in a gazetteer of named locations. Another location-related field, Physiographic Features, provides information about kinds of spatial features (such as mountains) rather than individual spatial features (such as Sand Mountain). Natural processes should be very similar, or at least worth comparing, at locations of the same feature type (for instance, two locations with coral reefs). Many users will therefore find the Physiographic Features field useful for finding relevant information about, for instance, coral reefs in general as well as particular coral reefs.
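
The gazetteer matching described above can be sketched as a bounding-box overlap test. This is a minimal Python illustration, not the MRIB front-end's actual code: the `BoundingBox` class, its field names, and the two gazetteer entries with their rough coordinates are assumptions made for the example, and the sketch ignores extents that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Extent in decimal degrees. Field names are hypothetical, not MRIB's."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def mean_point(self):
        """Return the (mean latitude, mean longitude) of the extent."""
        return ((self.min_lat + self.max_lat) / 2,
                (self.min_lon + self.max_lon) / 2)

    def overlaps(self, other):
        """True if the two extents share any area or touch at an edge."""
        return (self.min_lat <= other.max_lat and other.min_lat <= self.max_lat
                and self.min_lon <= other.max_lon and other.min_lon <= self.max_lon)


# Illustrative gazetteer; the coordinates are rough and for demonstration only.
GAZETTEER = {
    "Monterey Bay": BoundingBox(36.6, 37.0, -122.1, -121.8),
    "Chesapeake Bay": BoundingBox(36.9, 39.6, -77.4, -75.9),
}


def match_places(extent, gazetteer=GAZETTEER):
    """Return the names of gazetteer entries whose extent overlaps `extent`."""
    return sorted(name for name, box in gazetteer.items() if extent.overlaps(box))


study_area = BoundingBox(36.7, 36.9, -122.0, -121.9)
print(match_places(study_area))
```

Because the stored values are plain decimal numbers, the same six fields could feed a map plot (via the mean point) or a range query without any conversion step.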

The MRIB standard also records six time-related fields. The dates over which the research that contributed to an information resource was carried out are noted in Research Start Time and Research Stop Time. The geological time that the information resource discusses, which for the marine sciences is often not the same as the time of the research, is placed in Geologic Time. The date when a document was last updated (prior to indexing) and the date of the last re-indexing, modification, or verification of a metadata profile are recorded in Item Last-Updated and EIC Last Updated, respectively. The date a document was first indexed using the MRIB standard is recorded in the EIC Created field.

The MRIB metadata fields for time and spatial information are very thorough. One minor problem is that, because the Geologic Time facet uses the standard geologic time scale, studies over the past 10,000 years are all grouped under either of the terms "Holocene" or "Present." Because this period is one for which a very high resolution of information is available, it would be worthwhile to divide this period into millennial- or even decadal-scale blocks.
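
The finer subdivision suggested above could be sketched as a function that assigns an age to a fixed-width block. This is only an illustration under stated assumptions: the 10,000-year span comes from the text, but the function name, the "years before present" input, and the label format are hypothetical and are not part of the MRIB vocabulary.

```python
def holocene_block(years_before_present, block_size=1000):
    """Return a hypothetical sub-Holocene label for an age in years BP.

    block_size=1000 gives millennial blocks; block_size=10 gives decadal ones.
    """
    age = int(years_before_present)
    if not 0 <= age < 10_000:
        raise ValueError("age falls outside the last 10,000 years")
    start = (age // block_size) * block_size
    return f"Holocene/{start}-{start + block_size} yr BP"


print(holocene_block(3500))
print(holocene_block(3500, block_size=10))
```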

Challenge Two: Integrate Information From a Broad Spectrum of Academic Disciplines

Currently, the MRIB has limited itself in scope to information from the computational, natural, and social sciences. This removes some of the difficulty involved in distinguishing scientific from artistic understanding that would be posed by incorporation of materials from the arts and humanities. The Earth and social sciences are frequently concerned with space and time, and the approach of the MRIB to such information is outlined above.

The MRIB metadata standard allows users to choose from a hierarchical list of disciplines, the Disciplines facet. Additionally, the MRIB metadata include several fields that describe information relevant to specific discipline groups. Because the biological sciences are important to oceanography, it was necessary to develop a scheme for recording information about organisms discussed in a document. The Biota facet serves this purpose by providing a Linnean hierarchy of taxonomic clades of organisms.

The Biota facet does have some drawbacks. One is that biological taxonomy is in constant flux, meaning that not all biologists would agree with the placement of a given organism in a given clade. Moreover, the terms currently listed in the controlled vocabulary for this facet require a scientific background; they are derived as well as possible from the current scientific classification of organisms, and usually go only as deep as the Order level. It will eventually be necessary for front-ends using the MRIB metadata standard to map from scientific names to common names. Additionally, it is becoming evident that the term list needs to be extended all the way to the species or subspecies level. Such detail, though inconvenient unless the cataloguer is familiar with the biological taxonomy, will enable seamless switching between Latin organism names and folk names in any language and at any level of depth. For instance, a mapping could be developed from the vague, general term "fish" to the scientific names it includes, and it would function as well as a mapping from a scientific species name to a common species name.
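
A mapping from folk names to the taxonomy could be sketched as follows. This is a hypothetical illustration, not the MRIB vocabulary: the slash-separated path format, the dictionary, and the function names are assumptions, and the clades listed under "fish" are deliberately incomplete.

```python
# Hypothetical mapping from common names to Biota paths (incomplete on purpose).
COMMON_TO_SCIENTIFIC = {
    "fish": [
        "Animalia/Chordata/Actinopterygii",   # ray-finned fishes
        "Animalia/Chordata/Chondrichthyes",   # cartilaginous fishes
    ],
    "stony coral": ["Animalia/Cnidaria/Anthozoa/Scleractinia"],
}


def expand_common_name(name):
    """Return the taxonomic paths that a common name stands for."""
    return COMMON_TO_SCIENTIFIC.get(name.strip().lower(), [])


def record_matches(biota_path, common_name):
    """True if a record's Biota path equals or lies beneath a mapped clade."""
    return any(
        biota_path == clade or biota_path.startswith(clade + "/")
        for clade in expand_common_name(common_name)
    )


print(record_matches("Animalia/Chordata/Chondrichthyes/Lamniformes", "Fish"))
```

Because matching is done on path prefixes, one vague term can reach records indexed at any depth below the clades it covers, which is the behavior the paragraph above asks of a front-end.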

Lastly, the MRIB metadata include an Other Keywords field that can encode information not stored elsewhere. Such keywords can be used as the basis for expanding the MRIB vocabulary lists and fields as needed, and they provide additional terms for text searching in a front-end (for instance, the MRIB's current Web interface enables both browsing of the facets and a text search that queries all of the facets and the Other Keywords field).

Challenge Three: Organize Information So That a Variety of Searching Strategies Can Succeed

It is often said that the essential journalistic question is "Who, What, Where, When, Why, and How?" The MRIB metadata standard was developed with this question as a guide, because users with different backgrounds and objectives may find any one of its components the most natural starting point. Moreover, answering such questions can guide the user to browse for information along lines that seem interesting to him or her, even without a specific goal in mind. Each of the six components of the journalistic question is answered by one or more facets (and some additional information about them is stored in non-searchable metadata). "Who?" is answered by Authors, Agencies, and Projects. "What?" is answered by Disciplines, Features, and Biota. "Where?" is answered by Location. "When?" is answered by Geologic Time. "Why?" is answered by Hot Topics, and "How?" is answered by Methods, Content Type, and File Type.

Moreover, the vocabularies are designed to address concepts redundantly by alluding to a single concept in different facets, with each occurrence of the concept being tempered by its relationship to the whole facet. For instance, a researcher interested in sediments might find relevant resources through the avenues noted in Table 2.

Table 2: Subconcepts of the “sediment” concept.

TERM	FACET
Geology/Sedimentology	Disciplines
Geochemistry/Sediment Geochemistry	Disciplines
Soil	Physiographic Features
Geological Features/Sediments	Physiographic Features
Environment/Environmental Issues Relating to Sediments	Hot Topics
Environment/Environmental Issues Relating to Habitats/Sediments in Habitats	Hot Topics
Disasters/Types of Disasters/Erosion	Hot Topics
Disasters/Types of Disasters/Subsidence	Hot Topics
Field Observation Methods/Sampling Methods/Surface Sampling Methods	Methods
Field Observation Methods/Sampling Methods/Sampling Methods Using Cores	Methods

The terms and facets shown in Table 2 all represent subtle aspects of a single concept, "sediment." It should be noted that this redundancy does not mean the same concept is represented by different terms (which would defeat the purpose of a controlled vocabulary); rather, many variations or sub-concepts of a broader concept are represented. A sophisticated front-end to the MRIB metadata standard might analyze these related concepts and provide "Related To..." options for the user.
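
One way a front-end might offer such "Related To..." options is sketched below. The grouping of terms under a shared concept is taken from a subset of the Table 2 rows, but the `CONCEPT_TERMS` structure and the `related_to` function are assumptions made for illustration; the MRIB standard itself does not define them.

```python
from collections import defaultdict

# A subset of the Table 2 rows, grouped under the concept they share.
CONCEPT_TERMS = {
    "sediment": [
        ("Geology/Sedimentology", "Disciplines"),
        ("Geochemistry/Sediment Geochemistry", "Disciplines"),
        ("Soil", "Physiographic Features"),
        ("Geological Features/Sediments", "Physiographic Features"),
        ("Disasters/Types of Disasters/Erosion", "Hot Topics"),
        ("Field Observation Methods/Sampling Methods/Sampling Methods Using Cores",
         "Methods"),
    ],
}


def related_to(term):
    """Return {facet: [terms]} for terms sharing a concept with `term`."""
    suggestions = defaultdict(list)
    for entries in CONCEPT_TERMS.values():
        if any(candidate == term for candidate, _ in entries):
            for candidate, facet in entries:
                if candidate != term:
                    suggestions[facet].append(candidate)
    return dict(suggestions)


for facet, terms in related_to("Soil").items():
    print(facet, terms)
```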

Challenge Four: Minimize Jargon

Generally, the MRIB standard minimizes jargon, but in some cases jargon is unavoidable. The Biota facet is an example of this: there is no precise way to describe organisms besides the taxonomic standard (a "standard" which is really in flux). Nonetheless, the taxonomic standard was chosen with the intent that it could be hidden from non-biologist users behind a variety of interfaces, which would themselves depend on the precise structure of the scientific taxonomy. Another such case is Geologic Time, which again relies on standard naming conventions with which some users may be unfamiliar. In these instances, it is necessary for the front-end to provide term definitions and guidance to the user.

This is where the additional information stored in the MRIB's valid term lists proves useful. These lists, in tab-delimited form, store not only term names but also definitions of the terms, which can be incorporated into a front-end that reads the MRIB standard. (These standardized definitions are also useful to indexers, because they may provide more precise connotations than a term generally carries, or clarify terms that are applied differently among academic disciplines.)
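
A front-end could read such a list with a few lines of code. The sketch below assumes a two-column layout (term, then definition); the actual column order of the MRIB term lists is not given here, and the sample rows and definitions are invented for the example.

```python
import csv
import io

# Invented sample data; real MRIB term lists may use a different column layout.
SAMPLE_TERM_LIST = (
    "Geology/Sedimentology\tStudy of sediments and how they are deposited\n"
    "Soil\tLoose surface material that supports plant growth\n"
)


def load_term_definitions(handle):
    """Read tab-delimited rows into a {term: definition} dictionary."""
    definitions = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank lines and terms that lack a definition
        definitions[row[0].strip()] = row[1].strip()
    return definitions


definitions = load_term_definitions(io.StringIO(SAMPLE_TERM_LIST))
print(definitions["Soil"])
```

The resulting dictionary could back tooltips or a glossary page, so that a user browsing Biota or Geologic Time sees a definition beside each unfamiliar term.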

Where possible, the MRIB avoided terms that were likely to be confusing. For instance, terms with multiple meanings across fields were avoided. One example was the choice to use "soil science" rather than "pedology" for that geological field in the Disciplines facet, because "pedology" sounds too similar to "pediatrics," the medical treatment of children.

Challenge Five: Use Enough — But Not Too Many — Metadata Fields

When adding new fields, the MRIB team was cautious: it first verified that a concept could not be incorporated logically into an existing field, to avoid splitting essentially similar concepts into an ever-growing (and confusing) set of browseable fields. In some cases, facets were actually merged when unexpectedly significant overlap between them became evident (for example, the Format, Audience, and Class facets were transformed into File Type and Content Type).

Despite this cautious approach to creating facets, the number of faceted metadata fields is still large enough to pose an interface design hurdle. Making each of the facets available, and their potential usefulness clear, is an ongoing process. As the interface is refined through continual adjustment in response to user testing and feedback, we will be able to discern with some clarity whether the MRIB fields are few enough to function well in an entry-level user interface. Regardless, it is possible to develop multiple user interfaces, some of which present for browsing the full breadth of metadata information available, and others which simplify the scheme to meet less rigorous (or more specific) searching needs.

Challenge Six: Encourage Composition of Metadata Records by Resource Creators Themselves

This challenge is closely tied to the other challenges, and is also a broad test of whether the categorization scheme is intuitive and consistent. If different cataloguers who have not extensively used the MRIB metadata standard can develop very similar records for the same document, that is solid evidence of intuitiveness and consistency. (The records need not be exactly the same, since some of the free-text fields, such as abstract, will inherently vary.) Because the cataloguing process is independent of the main MRIB interface, and thus freed of interface concerns, feedback from document authors and maintainers who have catalogued their own Web pages may provide more insight into the sturdiness of the categorization scheme than user testing of the main MRIB. (Although a cataloguing interface is involved if the cataloguer chooses not to generate records manually, that interface, unlike the main Web interface, steps through each metadata field in sequence, so cataloguers are certain to see each field and have the opportunity to use it.)

So far, cataloguing by people other than the MRIB staff has been successful so long as cataloguers are provided with concise definitions of terms. When definitions are absent, these cataloguers often become overwhelmed with the sheer number of metadata fields. However, with the term definitions in hand, the users are usually able to compose records that agree with those created by the MRIB staff.
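
Agreement between two cataloguers' records could be quantified facet by facet. The sketch below uses the Jaccard index (shared terms divided by all terms used by either cataloguer); this measure, the record layout, and the sample records are assumptions for illustration, not a procedure described by the MRIB team. Free-text fields such as the abstract are skipped, since they are expected to vary.

```python
def facet_agreement(record_a, record_b, skip=("Abstract",)):
    """Return {facet: Jaccard index} for every facet used in either record."""
    scores = {}
    facets = (set(record_a) | set(record_b)) - set(skip)
    for facet in sorted(facets):
        terms_a = set(record_a.get(facet, ()))
        terms_b = set(record_b.get(facet, ()))
        union = terms_a | terms_b
        # Two empty term sets count as full agreement.
        scores[facet] = len(terms_a & terms_b) / len(union) if union else 1.0
    return scores


# Hypothetical records for the same document, using terms from Table 2.
staff_record = {
    "Abstract": "Core samples from an estuary.",
    "Disciplines": ["Geology/Sedimentology"],
    "Methods": [
        "Field Observation Methods/Sampling Methods/Surface Sampling Methods",
        "Field Observation Methods/Sampling Methods/Sampling Methods Using Cores",
    ],
}
author_record = {
    "Abstract": "We describe sediment cores taken in an estuary.",
    "Disciplines": ["Geology/Sedimentology"],
    "Methods": [
        "Field Observation Methods/Sampling Methods/Sampling Methods Using Cores",
    ],
}

print(facet_agreement(staff_record, author_record))
```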

Other elements of meeting this sixth challenge involve providing a cataloguing interface (and promoting it) to document authors who do not wish to enter the messy-looking world of manually encoding catalogue data in the EIC format. However, this is a relatively simple matter compared to providing an end-user interface to the metadata records, so it will not be elaborated here.
