usSEABED: East Coast Offshore Surficial Sediment Data Release, A simplified description of dbSEABED processing

A feature of dbSEABED is its ability to parse word-based descriptive data such as "brown fine sand with abundant shells; seagrass and some pebbles; whiff of H2S". These data are held in their original terms, though some abbreviation and coding are necessary. Thus dbSEABED is not a natural-language parser, even for noun-phrase constructions such as the description above. A simplified description of the parsing functions follows.

Most descriptions consist of quantifiers, modifiers, and objects (qmO) and can be written as linear expressions:

In most cases the sediment fraction is the whole sample, but dbSEABED records explicitly where that is or is not the case in outputs. In the previous example which in dbSEABED coding is

where "-" on a modifier points to the modified object, and "/" on a quantifier points to the quantified object. The use of abbreviations helps distinguish data from metadata in the data files and makes descriptions shorter, easier to process, and more human readable. The qmO coding for texture is then

m(fne)O(snd) + q(wi_ab)O(shls) + q(som)O(pbls)

where "brown", "seagrass", and "h2s" are not shown because they are neutral for texture. Each textural object is assigned a grain size, which may be cast as fuzzy set memberships across a range of grain sizes or as a single size-fraction percentage (crisp set). The grain-size assignment is taken from a dictionary and is usually based on published scales such as Wentworth (1922) and the Unified Soil Classification System (USCS), on analysis of a region's sediment components, or on the sedimentologic experience of one or more people. For "sand", the assigned grain size is the median of the grain sizes observed for sediments labeled simply as "sand", which is 1.5 phi. Where a modifier is applied, the grain size is adjusted: with the modifier "fine", the mean grain size of the sand becomes 2.5 phi. "Shells" has a textural meaning, typically fine gravel at -2 phi, though this varies with species and preservation. "Pebble" is assigned its Wentworth scale grain size.
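The dictionary lookup described above can be sketched roughly as follows. This is a minimal illustration, not the dbSEABED implementation: the names `OBJECT_PHI`, `MODIFIER_SHIFT`, `assigned_phi`, and `phi_to_mm` are hypothetical, each object gets a single (crisp) phi value rather than fuzzy memberships, and the pebble value is assumed to be the midpoint of the Wentworth pebble range (-2 to -6 phi). The phi-to-millimeter conversion uses the standard definition d = 2^(-phi) mm.

```python
# Object grain sizes in phi. "sand" and "shells" use the values quoted in
# the text; "pebble" is ASSUMED to be the midpoint of the Wentworth pebble
# range (-2 to -6 phi) for this sketch only.
OBJECT_PHI = {"sand": 1.5, "shells": -2.0, "pebble": -4.0}

# Modifier adjustments in phi. The text moves "sand" from 1.5 to 2.5 phi
# for "fine", so the shift is +1.0. Treating a modifier as a fixed additive
# shift is an assumption of this sketch.
MODIFIER_SHIFT = {"fine": 1.0}


def assigned_phi(obj, modifier=None):
    """Return the phi grain size for an object, adjusted by an optional modifier."""
    phi = OBJECT_PHI[obj]
    if modifier is not None:
        phi += MODIFIER_SHIFT[modifier]
    return phi


def phi_to_mm(phi):
    """Convert a phi value to a grain diameter in millimeters (d = 2**-phi)."""
    return 2.0 ** (-phi)


fine_sand_phi = assigned_phi("sand", "fine")  # m(fne)O(snd)
shells_phi = assigned_phi("shells")           # O(shls)
print(fine_sand_phi, phi_to_mm(fine_sand_phi))
print(shells_phi, phi_to_mm(shells_phi))
```

Larger phi values correspond to finer grains, which is why "fine" raises the phi value of the sand.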

The quantifiers "abundant" and "some" assign weights to the "shell" and "pebble" parts and adjust the memberships of the assigned phi grain sizes. These memberships are specified in the dictionary: "abundant" 0.5, "some" 0.3. The unquantified "sand" is usually assigned 1.0. After normalization to 1.0, the "sand", "shell", and "pebble" components have weights of 0.56, 0.28, and 0.17, respectively. (Note: the normalization depends on the syntax: rear-significant syntax, as in ODP data, with weights increasing to the right; front-significant syntax, as in naval data; or flat-significance syntax, common in ecologic studies. Which of these applies is specified in dbSEABED at data entry.) For the standard textural classes, gravel ("shell" + "pebble") membership totals about 0.4, sand about 0.6, and mud 0.0. (Note: silt and clay proportions cannot be determined from visual descriptive data.)
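The arithmetic in this example can be reproduced with a short sketch. It assumes the flat-significance case (no positional weighting) and simple division by the sum of the raw weights; the data layout and the names `QUANTIFIER_WEIGHTS`, `normalized_weights`, and `class_totals` are hypothetical and are not taken from dbSEABED.

```python
# Quantifier weights quoted in the text; the dictionary layout is hypothetical.
QUANTIFIER_WEIGHTS = {"abundant": 0.5, "some": 0.3}
UNQUANTIFIED_WEIGHT = 1.0  # weight usually given to an unquantified object

# (object, quantifier or None, textural class) for the example description.
components = [
    ("sand", None, "sand"),
    ("shell", "abundant", "gravel"),
    ("pebble", "some", "gravel"),
]


def raw_weight(quantifier):
    """Look up the raw weight for a quantifier (None means unquantified)."""
    if quantifier is None:
        return UNQUANTIFIED_WEIGHT
    return QUANTIFIER_WEIGHTS[quantifier]


def normalized_weights(components):
    """Scale raw weights so they sum to 1.0 (flat significance assumed)."""
    raw = {obj: raw_weight(q) for obj, q, _ in components}
    total = sum(raw.values())
    return {obj: w / total for obj, w in raw.items()}


def class_totals(components):
    """Sum normalized weights into the gravel, sand, and mud classes."""
    weights = normalized_weights(components)
    totals = {"gravel": 0.0, "sand": 0.0, "mud": 0.0}
    for obj, _, cls in components:
        totals[cls] += weights[obj]
    return totals


weights = normalized_weights(components)
totals = class_totals(components)
print({k: round(v, 2) for k, v in weights.items()})
print({k: round(v, 2) for k, v in totals.items()})
```

The raw weights sum to 1.8, so the normalized values are 1.0/1.8, 0.5/1.8, and 0.3/1.8, which round to the 0.56, 0.28, and 0.17 quoted above. The gravel and sand totals come out near 0.44 and 0.56, consistent with the rounded class memberships of 0.4 and 0.6.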

This is reported to the parsed outputs (_PRS). The existence of shell and pebbles is also noted in the "shell" component outputs (_CMP), with relative memberships of 0.28 and 0.17, respectively.

A similar process is performed successively for the other parameters: consolidation (no information here, so the sediment is assumed to be loose), features such as "seagrass", and color. "Brown" produces the Munsell color code 5YR 4/4 through calibrated processes described in Jenkins (2002).

The numbers and results above are examples only, and the explanation is simplified. Iterative comparisons of hundreds of verbal descriptions with each other and with the results of lab-based analyses have established that the accuracy is about 1 standard deviation at +/- 2 phi range.

Most input linguistic data are reasonably well organized, although there are a variety of descriptive linguistic structures familiar to geologists, ecologists, biologists, navy divers, and other seafloor samplers. To be usable in dbSEABED, sediment descriptions and analysis data do not need to be precise or fit a particular pattern. The dbSEABED program copes with a wide set of vocabularies (for example, foreign language, USACE codes, NOS Bottom Type Codes), and a variety of linguistic structures and ways of attaching quantities. There are currently over 5300 terms defined in the parsing dictionary.

Further, the data need not be absolutely complete. The dbSEABED program mines what data are there, giving outputs only where data are sufficiently complete to be reliable, while rejecting (and reporting for future attention) incomplete or erroneous structures.

Each description is kept close to its original form and structure but is coded in phonetically sensible terms that include links between quantifiers, modifiers, and objects. Data coders choose between a variety of data types to organize the data so the dbSEABED program can quickly parse words into values. The following table represents data types used in usSEABED output files:

The dbSEABED program contains a thesaurus where various terms used to describe the seabed are given lithologic, textural, and biological classes and weightings. Modifiers and quantifiers are given relative weightings, and values are assigned to other categories as needed.

Several special semantic structures are supported, notably the joint abundance structure, such as "snd som/ { pbl + shl + blk- nods }", in which the pebble, shell, and nodules together total 0.3 in abundance ("som"). A component may also be expressed as a proportion of some special fraction, such as "snd // acid_residu", in effect the non-carbonate sand proportion.
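As a rough illustration of the joint abundance structure, the sketch below shares one quantifier weight across the members of a braced group. The text states only that the group totals 0.3; splitting that total equally among the members is an assumption made here for illustration, and the function name `expand_joint_group` is hypothetical.

```python
def expand_joint_group(group_weight, members):
    """Share one quantifier weight across a group of objects.

    ASSUMPTION: the weight is split equally. The source text only says
    that the members together total the quantifier weight.
    """
    share = group_weight / len(members)
    return {member: share for member in members}


group_members = ["pbl", "shl", "nods"]

# "snd som/ { pbl + shl + blk- nods }": unquantified sand plus a joint
# group carrying the "som" weight of 0.3.
raw_weights = {"snd": 1.0}
raw_weights.update(expand_joint_group(0.3, group_members))

group_total = sum(raw_weights[m] for m in group_members)
print(raw_weights)
print(group_total)
```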

The parsing includes a number of quality control devices. If a term is not recognized in the dictionary, the process is aborted. The process is also aborted if component weightings from certain observations, such as grain counts, fail to total 100%. If a description is too complex (currently defined as more than 32 terms), it is not parsed. Problems encountered are reported to a diagnostics file for later attention and perhaps correction. Homonym terms, such as "dense" used for consolidation and for abundance, are distinguished (as "dens(phy)" and "dens(ab)"). Terms that are marked "meaning unknown" in the dictionary will cause parsing to fail. Terms that have a special meaning in one survey, such as "iron" in DSDP data, are also specially marked ("iron(dsdp)"). XRD data are not parsed, since they are not regarded as reliable enough in comparison to petrologic counts or visual descriptions. At present, the location of structures in a core (for example, "below yellow layer") and gradients (for example, "grading upwards to") are not parsed.
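The checks listed above might look something like the following sketch. It is not the dbSEABED code: the function `check_description`, the toy dictionary, and the numeric tolerance on the 100% total are all hypothetical. Instead of aborting at the first failure, this version collects every problem in a list, standing in for the diagnostics file.

```python
MAX_TERMS = 32  # complexity limit quoted in the text


def check_description(terms, dictionary, count_percentages=None, tolerance=1e-6):
    """Return a list of problems; an empty list means the description passes.

    `dictionary` maps coded terms to a meaning label (hypothetical layout).
    `tolerance` on the 100% total is an assumption of this sketch.
    """
    problems = []
    if len(terms) > MAX_TERMS:
        problems.append(f"too complex: {len(terms)} terms (limit {MAX_TERMS})")
    for term in terms:
        meaning = dictionary.get(term)
        if meaning is None:
            problems.append(f"unrecognized term: {term}")
        elif meaning == "meaning unknown":
            problems.append(f"term marked meaning unknown: {term}")
    if count_percentages is not None:
        total = sum(count_percentages)
        if abs(total - 100.0) > tolerance:
            problems.append(f"component weightings total {total}, not 100")
    return problems


# Toy dictionary: homonyms are kept apart by their suffixes, as in the text.
toy_dictionary = {
    "fne": "modifier",
    "snd": "object",
    "dens(phy)": "consolidation",
    "dens(ab)": "abundance",
    "zzz": "meaning unknown",
}

ok = check_description(["fne", "snd"], toy_dictionary)
bad = check_description(["fne", "snd", "dense"], toy_dictionary,
                        count_percentages=[60, 30])
print(ok)
print(bad)
```

In this toy run the bare term "dense" is rejected because only the disambiguated forms "dens(phy)" and "dens(ab)" exist in the dictionary, and the grain counts are flagged because they total 90 rather than 100.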

More information about the dbSEABED software is obtainable from a number of articles listed in the Frequently Asked Questions section.
