
Introduction to Metadata
Crosswalks, Metadata Harvesting, Federated Searching, Metasearching: Using Metadata to Connect Users and Information

Mary S. Woodley

The author would like to thank Karim Boughida of George Washington University for his invaluable input about metasearching and metadata harvesting and Diane Hillmann, of Cornell University, who graciously commented on the chapter as a whole. The author takes full responsibility for any errors or omissions.

Since the turn of the millennium, instantaneous access to a wide variety of content via the Web has ceased to be considered "bleeding-edge technology" and instead has become expected. In fact, from 2000 to the time of this writing, there has been continued exponential growth in the number of digital projects providing online access to a range of information resources: Web pages, full-text articles and books, cultural heritage resources (including images of works of art, architecture, and material culture), and other intellectual content, including born-digital objects. Users increasingly expect the Web to serve as a portal to the entire universe of knowledge. Recently, Google Scholar, Yahoo!, and OCLC's WorldCat (a union catalog of the holdings of national and international libraries) have joined forces to direct users to the closest library that owns the book they are seeking, whether it is available in print, online, or both.1 Global access to the universe of traditional print materials and digital resources has become more than ever the goal of many institutions that create and/or manage digital resources.

Unfortunately, there are still no magic programming scripts that can create seamless access to the right information in the right context so that it can be efficiently retrieved and understood. At this point, most institutions (including governments, libraries, archives, museums, and commercial enterprises) have moved from in-house manual systems to automated systems in order to provide the most efficient means to control and provide access to their collections and assets.2 Some institutions have a single information system for managing all their content; others support multiple systems that may or may not be interoperable. Individual institutions, or communities of similar institutions, have created shared metadata standards to help to organize their particular content. These standards might include elements or fields, with their definitions (also known as metadata element sets or data structure standards);3 codified rules or best practices for recording the information with which the fields or elements are populated (data content standards); and vocabularies, thesauri, and controlled lists of terms or the actual data values that go into the data structures (data value standards).4 The various specialized communities or knowledge domains tend to maintain their own data structure, data content, and data value standards, tailored to serve their specific types of collections and their core users. It is when communities want to share their content in a broader arena, or reuse the information for other purposes, that problems of interoperability arise. Seamless, precise retrieval of information objects formulated according to diverse sets of rules and standards is still far from a reality.

The development of sophisticated tools that enable users to discover, access, and share digital content (such as link resolvers and OAI-PMH harvesters), along with the emergence of the Semantic Web, has increased users' expectations that they will be able to search simultaneously across many different metadata structures.5

The goal of seamless access has motivated institutions to convert their legacy data, originally developed for in-house use, to standards more readily accessible for public display or sharing; or to provide a single interface to search many heterogeneous databases or Web resources at the same time. Metadata crosswalks are at the heart of our ability to make this possible, whether they are used to convert data to a new or different standard, to harvest and repackage data from multiple resources, to search across heterogeneous resources, or to merge diverse information resources.

Definitions and Scope

For the purposes of this chapter, "mapping" refers to the intellectual activity of comparing and analyzing two or more metadata schemas; "crosswalks" are the visual and textual product of the mapping process.

A crosswalk is a table or chart that shows the relationships and equivalencies (and highlights the inevitable gaps) between two or more metadata formats. An example of a simple crosswalk is given in table 1, where subsets of elements from four different metadata schemas are mapped to one another. Table 2 is a more detailed mapping between MARC21 and Simple Dublin Core. Note that in almost all cases there is a many-to-one relationship between the richer element set (in this example, MARC) and the simpler set (Dublin Core).

Metadata Mapping and Crosswalks

Crosswalks are used to compare metadata elements from one schema or element set to one or more other schemas. In comparing two metadata element sets or schemas, similarities and differences must be understood on multiple levels so as to evaluate the degree to which the schemas are interoperable; crosswalks are the visual representations, or "maps," that show these relationships of similarity and difference.

One definition of interoperability is "the ability of different types of computers, networks, operating systems, and applications to work together effectively, without prior communication, in order to exchange information in a useful and meaningful manner. Interoperability can be seen as having three aspects: semantic, structural and syntactic."6 Semantic mapping is the process of analyzing the definitions of the elements or fields to determine whether they have the same or similar meanings. A crosswalk supports the ability of a search engine to query fields with the same or similar content in different databases; in other words, it supports "semantic interoperability." Crosswalks are not only important for supporting the demand for "one-stop shopping," or cross-domain searching; they are also instrumental for converting data from one format to another.7 "Structural interoperability" refers to the presence of data models or wrappers that specify the semantic schema being used. For example, the Resource Description Framework, or RDF, is a standard that allows metadata to be defined and shared by different communities.8 "Syntactic interoperability," also called technical interoperability, refers to the ability to communicate, transport, store, and represent metadata and other types of information between and among different systems and schemas.9
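To make the idea of semantic mapping concrete, the sketch below models a crosswalk as a simple lookup table in Python. The element names are drawn loosely from CDWA and Simple Dublin Core as discussed in this chapter, but the mapping, the sample record, and the function are purely illustrative, not part of any published standard.

```python
# A crosswalk represented as a lookup table from source-schema elements
# to their closest target-schema equivalents. Names are illustrative,
# loosely based on the CDWA-to-Dublin-Core mapping in this chapter.
CDWA_TO_DC = {
    "Object/Work-Type": "Type",
    "Titles or Names": "Title",
    "Creation-Date": "Date",
    "Creation-Creator-Identity": "Creator",
    "Subject Matter": "Subject",
}

def crosswalk_record(record: dict, mapping: dict) -> dict:
    """Convert a record from the source schema to the target schema.

    Elements with no equivalent in the target schema are dropped,
    illustrating the lossiness of mapping a rich schema to a simple one.
    """
    converted = {}
    for element, value in record.items():
        target = mapping.get(element)
        if target is not None:
            # Many-to-one mappings are common: several source elements
            # may collapse into a single target element.
            converted.setdefault(target, []).append(value)
    return converted

work = {
    "Object/Work-Type": "painting",
    "Titles or Names": "Irises",
    "Creation-Creator-Identity": "Gogh, Vincent van",
    "Current Location": "J. Paul Getty Museum",  # no Simple DC equivalent
}
print(crosswalk_record(work, CDWA_TO_DC))
```

Note that "Current Location" silently disappears in the conversion; this is exactly the kind of loss a crosswalk table makes visible before any data is migrated.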

Table 1. Example of a Crosswalk of a Subset of Elements from Different Metadata Schemes

CDWA                      | MARC                                     | EAD                                                | Dublin Core
--------------------------|------------------------------------------|----------------------------------------------------|-------------
Object/Work-Type          | 655 Genre/form                           | <controlaccess><genreform>                         | Type
Titles or Names           | 24Xa Title and Title-Related Information | <unittitle>                                        | Title
Creation-Date             | 260c Imprint-Date of Publication         | <unitdate>                                         | Date.Created
Creation-Creator-Identity | 1XX Main Entry; 7XX Added Entry          | <origination><persname>; <origination><corpname>; <origination><famname>; <controlaccess><persname>; <controlaccess><corpname> | Creator
Subject Matter            | 520 Summary, etc.; 6XX Subject Headings  | <abstract>; <scopecontent>; <controlaccess><subject> | Subject
Current Location          | 852 Location                             | <repository><physloc>                              | (none)


Table 2. Example of a Crosswalk: MARC21 to Simple Dublin Core

MARC Fields                     | Dublin Core Elements
--------------------------------|---------------------
130, 240, 245, 246              | Title
100, 110, 111                   | Creator
100, 110, 111, 700, 710, 711*   | Contributor
600, 610, 630, 650, 651, 653    | Subject / Keyword
500, 505, 520, 562, 583 (Notes) | Description
260 $b                          | Publisher
581, 700 $t, 730, 787, 776      | Relation
008/07-10; 260 $c               | Date

Mapping metadata elements from different schemas is only one level of crosswalking. At another level of semantic interoperability are the data content standards for formulating the data values that populate the metadata elements, for example, rules for recording personal names or encoding standards for dates. A significant weakness of crosswalks of metadata elements alone is that results of a query will be less successful if the name or concept is expressed differently in each database. By using standardized ways to express terms and phrases for identifying people, places, corporate bodies, and concepts, it is possible to greatly improve retrieval of relevant information associated with a particular concept. Some online resources provide access to controlled terms, along with cross-references for variant forms of terms or names that point the searcher to the preferred form. This optimizes the searching and retrieval of information objects such as bibliographic records, images, and sound files. However, there is no universal authority file,10 much less a universal set of cataloging rules that catalogers, indexers, and users consult. Each cataloging or indexing domain has developed its own cataloging rules as well as its own domain-specific thesauri or lists of terms that are designed to support the research needs of a particular community. Crosswalks have been used to migrate the data structure of information resources from one format to another, but only recently have there been projects to map the data values that populate those structures.11 When searching many databases at once, precision and relevance become even more crucial. This is especially true if one is searching bibliographic records, records from citation databases, and full-text resources at the same time. Integrated authority control would significantly improve both retrieval and interoperability in searching disparate resources like these.
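The role of authority control described above can be illustrated with a toy example: resolving variant name forms to a single preferred form before a query is issued. The entries below are invented for illustration; a real authority file, such as the Library of Congress Name Authority File, is vastly larger and records cross-references in a much richer structure.

```python
# A toy authority file: variant forms point to a single preferred form.
# The names and preferred heading below are purely illustrative.
AUTHORITY = {
    "Gogh, Vincent van": "Gogh, Vincent van, 1853-1890",
    "Van Gogh, Vincent": "Gogh, Vincent van, 1853-1890",
    "van Gogh": "Gogh, Vincent van, 1853-1890",
}

def normalize_query(name: str) -> str:
    """Resolve a variant name to its preferred form before searching.

    Names not found in the authority file pass through unchanged.
    """
    return AUTHORITY.get(name, name)

# All three variants now retrieve under the same access point.
assert normalize_query("Van Gogh, Vincent") == normalize_query("van Gogh")
print(normalize_query("van Gogh"))
```

Performing this resolution before the query is dispatched is what lets searches across disparate resources converge on the same person, place, or concept despite differing local forms.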

The Gale Group attempted to solve the problem of multiple subject thesauri by creating a single thesaurus and mapping the controlled vocabulary from the individual databases to their own in-house thesaurus. It is unclear to what extent the depth and coverage of the controlled terms in the individual databases are compromised by this merging.12

The Simple Knowledge Organization System (SKOS Core) project by the W3C Semantic Web Best Practices and Deployment Working Group is a set of specifications for organizing, documenting, and publishing taxonomies, classification schemes, and controlled vocabularies, such as thesauri, subject lists, and glossaries or terminology lists, within an RDF framework.13 SKOS mapping is a specific application that is used to express mappings between diverse knowledge organization schemes. The National Science Digital Library's Metadata Registry is one of the first production deployments of SKOS.14 Mapping and crosswalks of metadata elements are fairly well developed activities in the digital library world; mapping of data values is still in an early phase. But, clearly, the ability to map vocabularies (data value standards), as well as the metadata element sets (data structure standards) that are "filled" with the data values, will significantly enhance the ability of search engines to effectively conduct queries across heterogeneous databases.15
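SKOS expresses mappings between vocabularies as RDF triples. The sketch below models a few such triples as plain Python tuples; the concept URIs are hypothetical (example.org), and a real deployment would use an RDF toolkit and the published SKOS vocabulary rather than hand-built tuples. Only the mapping properties (skos:exactMatch, skos:broadMatch) come from the actual SKOS namespace.

```python
# SKOS mapping triples modeled as (subject, predicate, object) tuples.
# The SKOS namespace is real; every concept URI below is hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"

mappings = [
    # A term in a local thesaurus matches another vocabulary's concept exactly.
    ("http://example.org/localthes/watercolors",
     SKOS + "exactMatch",
     "http://example.org/aat/watercolor"),
    # A narrower local term maps to a broader concept elsewhere.
    ("http://example.org/localthes/gouaches",
     SKOS + "broadMatch",
     "http://example.org/aat/watercolor"),
]

def matches_for(concept_uri: str, triples):
    """Return the (predicate, object) pairs recorded for a concept."""
    return [(p, o) for s, p, o in triples if s == concept_uri]

print(matches_for("http://example.org/localthes/watercolors", mappings))
```

A search engine with access to such triples can expand a query on a local term to the matched concepts in other vocabularies, which is precisely the cross-vocabulary interoperability the paragraph above anticipates.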

Syntactic interoperability is achieved by shared markup languages and data format standards that make it possible to transmit and share data between computers. For instance, in addition to being a data structure standard, MARC (Machine-Readable Cataloging) is the transmission format used by bibliographic utilities and libraries;16 EAD (Encoded Archival Description), a standard for archival finding aids, can be expressed as a DTD (document type definition) or an XML schema, with instances encoded in SGML or XML; CDWA Lite is an XML schema for metadata records for works of art, architecture, and material culture; and Dublin Core metadata records can be expressed in HTML or XML.17
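As a small illustration of the last point, the sketch below serializes a Simple Dublin Core record as XML using only the Python standard library. The Dublin Core element namespace is the published one; the field values are illustrative.

```python
# Serialize a minimal Simple Dublin Core record as XML.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)  # use the conventional "dc" prefix

record = ET.Element("record")
for element, value in [
    ("title", "Irises"),
    ("creator", "Gogh, Vincent van, 1853-1890"),
    ("type", "painting"),
    ("date", "1889"),
]:
    child = ET.SubElement(record, "{%s}%s" % (DC, element))
    child.text = value

xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Because the elements are namespace-qualified, any XML-aware consumer can recognize dc:title, dc:creator, and so on regardless of what other markup surrounds them; that is syntactic interoperability at work.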

The Role of Crosswalks in Repurposing and Transforming Metadata


The process of repurposing metadata covers a broad spectrum of activities: converting or transforming records from one metadata schema to another, migrating from a legacy schema (whether standard or local) to a different schema, integrating records created according to different metadata schemas, and harvesting or aggregating metadata records that were created using a shared community standard or different metadata standards. Dushay and Hillmann note that the library community has an extensive and fairly successful history of aggregating metadata records (in the MARC format) created by many different libraries that share data content and data value standards (Anglo-American Cataloguing Rules, Library of Congress authorities) as well as a common data structure standard and transmission format (MARC). However, aggregating metadata records from different repositories may create confusing display results, especially if some of the metadata was automatically generated or created by institutions or individuals that did not follow best practices or standard thesauri and controlled vocabularies.18

Data conversion projects transfer the values in metadata fields or elements from one system (and often one schema) to another. Institutions convert data for a variety of reasons: for example, when upgrading to a new system, when the legacy system has become obsolete, or when the institution has decided to provide public access to some or all of its content and therefore wishes to convert from a proprietary schema to a standard schema for publishing data. Conversion is accomplished by mapping the structural elements in the older system to those in the new system. In practice, the fields in the two systems often differ in granularity, which makes the process of converting data from one system to another more complex. Data fields in the legacy database may not have been well defined or may contain a mix of types of information. In the new database, this information may reside in separate fields. Identifying the unique information within a field to map to a separate field may not always be possible and may require manipulating the same data several times before migrating it.

Some of the common misalignments that occur when migrating data are as follows:19
  1. There may be fuzzy matches. A metadata element in the original database does not have a perfect equivalent in the target database; for example, when mapping the CDWA element20 "Styles/Periods/Groups/Movements" to simple Dublin Core, we find that there is not a DC element with the exact same meaning. The Dublin Core Subject element can be used, but the semantic mapping is far from accurate, since it's the subject, not the style, that a work of art is "about."
  2. Although some metadata standards follow the principle of a one-to-one relationship,21 as in the case of Dublin Core, in practice many memory institutions use the same record to record information about the original object and its related image or digital surrogate, thus creating a sort of hybrid work/image or work/digital surrogate record. When migrating and harvesting data, this may pose problems if the harvester cannot distinguish between the elements that describe the original work or item and those that describe the surrogate (which is often a digital copy, full or partial, of the original item).
  3. Data that exists in one metadata element in the original schema may be mapped to more than one element in the target schema. For example, data values from the CDWA Creation-Place element may be mapped to the "Subject" element and/or the "Coverage" element in Dublin Core.
  4. Data in separate fields in the original schema may be in a single field in the target schema; for example, in CDWA, the birth and death dates for a "creator" are recorded in the Creator-Identity-Dates, as well as in separate fields—all apart from the creator's name. In MARC, both dates are a "subfield" in the string for the "author's" name.
  5. There is no field in the target schema with an equivalent meaning, so that unrelated information may be forced into a metadata element with unrelated or only loosely related content.
  6. The original "standard" is actually a mix of standards. Kurth, Ruddy, and Rupp have pointed out that even when metadata is being transformed from a single schema, it may not be possible to use the same conversion mapping for all the records that are being converted. Staff working on the Cornell University Library (CUL) projects became aware of the difficulties of "transforming" library records originally formulated in the MARC format to TEI XML headers. Not only were there subtle (and at times not so subtle) differences over time in the use of MARC, but the cataloging rules guiding how the content was entered had undergone changes from pre-Anglo-American Cataloguing Rules to the revised edition of AACR2.22
  7. In only a few cases does the mapping work equally well in both directions, due to differences in granularity and community-specific information. (See no. 2 above.) The Getty metadata crosswalk maps in a single direction:23 CDWA was analyzed and the other data systems were mapped to its elements. However, there are types of information that are recorded in MARC that are lost in this process; for example, the concepts of publisher and language are important in library records but are less relevant to CDWA, which focuses on one-of-a-kind cultural objects.
  8. One metadata element set may have a hierarchical structure with complex relationships between elements (e.g., EAD), while the other may be a flat structure (e.g., MARC).24
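Two of the misalignments enumerated above can be sketched in a few lines of Python; the field names, subfield conventions, and values are illustrative rather than drawn from any institution's actual mapping tables.

```python
# (3) One-to-many mapping: a single CDWA Creation-Place value may need
# to populate both the Subject and the Coverage elements in Dublin Core.
def map_creation_place(place: str) -> dict:
    return {"Subject": place, "Coverage": place}

# (4) Granularity loss: CDWA records a creator's birth and death dates
# in separate fields, apart from the name; MARC carries the dates as a
# single subfield ($d) attached to the name string.
def to_marc_name(name: str, birth: str, death: str) -> str:
    return "%s, $d %s-%s" % (name, birth, death)

print(map_creation_place("Saint-Remy, France"))
print(to_marc_name("Gogh, Vincent van", "1853", "1890"))
```

Going from the MARC-style string back to separate fields would require parsing the name string apart again, which is why mappings like these rarely work equally well in both directions (see no. 7 above).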


Methods for Integrated Access/Cross-Collection Searching

Traditional Union Catalogs
The most time-tested and in some ways still the most reliable way of enabling users to search across records from a variety of institutions is the traditional union catalog. In this method, various institutions contribute records to an aggregator or service provider, preferably using a single, standard metadata schema (such as MARC for bibliographic records), a single data content standard (for libraries, AACR, to be superseded by RDA in the future), and shared controlled vocabularies (e.g., Library of Congress Subject Headings, the Library of Congress Name Authority File, the Thesaurus for Graphic Materials, and the Art & Architecture Thesaurus).

Within a single community, union catalogs can be created in which records from different contributing institutions are united in a single, centrally maintained database and searched through a single interface. This is possible because the contributing community shares the same rules for description and access and the same protocol for encoding the information. OCLC's WorldCat and RLG's RLIN25 bibliographic file are two major union catalogs that make records from a wide variety of libraries available for searching from a single interface, in a single schema (MARC). There are also "local" union catalogs that aggregate records from a particular consortium or educational system; for example, the University of California and the California State Universities maintain their own union catalogs of library holdings (Melvyl and PHAROS, respectively). Interoperability is high because of the shared schemas and rules for creating the "metadata," or cataloging records.26

Metadata Harvesting
A more recent model for union catalogs is to create single repositories by "harvesting" metadata records from various resources. (See Tony Gill's discussion of metadata harvesting and figure 1 in the preceding chapter.) Metadata harvesting, unlike metasearching, is not a search protocol; rather, it is a protocol that allows the gathering or collecting of metadata records from various repositories or databases; the harvested records are then "physically" aggregated in a single database, with links from individual records back to their home environments. The current standard protocol being used to harvest metadata is the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) Version 2.27 The challenge has been to collect these records in such a way that they make sense to users in the union environment while maintaining their integrity and their relationship to their original context, both institutional and intellectual.

To simplify the process for implementation and to preserve interoperability, the OAI-PMH has adopted unqualified Dublin Core as its minimum metadata standard. Data providers that expose their metadata for harvesting are required to provide records in unqualified Dublin Core expressed in XML and to use UTF-8 character encoding,28 in addition to any other metadata formats they may choose to expose. The data providers may expose all or selective metadata sets for harvesting and may also decide how rich or "lean" the individual records they make available for harvesting will be. Service providers operating downstream of the harvesting source may add value to the metadata in the form of added elements that can enhance the metadata records (such as adding audience or grade level to educational resources). Service providers also have the potential to provide a richer contextual environment for users to find related and relevant content. Repositories using a richer, more specific metadata schema than Dublin Core (such as CDWA Lite, MARC XML, MODS, or ONIX) need to map their content to unqualified Dublin Core in order to conform to the harvesting protocol.29 Part of the exercise of creating a crosswalk is understanding the pros and cons of mapping all the content from a particular schema or metadata element set and the institution's specific records expressed in that schema, or deciding which subset of the content should be mapped.
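A minimal harvesting client for the protocol described above can be written with only the Python standard library. The repository URL is hypothetical, and a production harvester must also handle resumption tokens, error responses, and incremental (from/until) harvesting; this sketch shows only the core request and the extraction of dc:title values from the oai_dc records every data provider must expose.

```python
# A minimal OAI-PMH ListRecords client (standard library only).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def extract_titles(xml_text):
    """Pull dc:title values out of an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [title.text
            for record in root.iter(OAI + "record")
            for title in record.iter(DC + "title")]

def harvest_titles(base_url: str):
    """Issue a ListRecords request for oai_dc and return the titles."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urllib.request.urlopen("%s?%s" % (base_url, query)) as response:
        return extract_titles(response.read())

# Example (requires network access to a real repository):
# print(harvest_titles("http://example.org/oai"))
```

Separating the HTTP request from the XML extraction makes the parsing logic testable offline, which matters when a service provider is debugging records harvested from dozens of differently configured repositories.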

The pitfalls of mapping between metadata standards have been outlined above. Bruce and Hillmann established a set of criteria for measuring the quality of metadata records harvested and aggregated into a "union" collection. The criteria may be divided into two groups: one that evaluates the intellectual content of the metadata records in terms of its completeness, currency, accuracy, and provenance; and one that evaluates the records at a more detailed level: their conformance to the expected metadata element sets and application profiles, and the consistency and coherence of the data encoded in the harvested records.30 In the context of harvesting data for reuse, Dushay and Hillmann have identified four categories of metadata problems within the second group of criteria: (1) missing data, because it was considered unnecessary by the creating institution (e.g., metadata records that do not indicate that the objects being described are maps or photographs, because they reside in a homogeneous collection where all the objects have the same format); (2) incorrect data (e.g., data that is included in the wrong metadata element or encoded improperly); (3) confusing data that uses inconsistent formatting or punctuation; and (4) insufficient data concerning the encoding schemes or vocabularies used.31 A recent study evaluating the quality of harvested metadata found that collections from a single institution did not vary much in terms of the criteria outlined above, but the amount of "variance" increased dramatically when the aggregations of harvested metadata came from many different institutions.32

Tennant echoes the argument that the problem may be mapping to simple Dublin Core. He suggests that both data providers and service providers consider exposing and harvesting records encoded in metadata schemas that are richer and more appropriate to the collections at hand than unqualified Dublin Core. Tennant argues that the metadata harvested should be as granular as possible and that the service provider should transform and normalize data such as dates, which are expressed in a variety of encoding schemes by the various data providers.33
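The kind of normalization Tennant recommends can be illustrated with dates. The input strings below are representative of the variety a service provider actually encounters in harvested records; the normalization rule itself (extract a four-digit year when one is present) is a deliberately simple sketch, not a complete date-handling scheme.

```python
# Normalize free-text dates from harvested records to a sortable year.
import re

def normalize_year(value: str):
    """Extract a four-digit year from a free-text date, if possible.

    Returns None when no four-digit year can be found, so callers can
    flag the record for review rather than guess.
    """
    match = re.search(r"\b(\d{4})\b", value)
    return match.group(1) if match else None

print(normalize_year("ca. 1889"))      # circa dates
print(normalize_year("1889-05-01"))    # ISO-style dates
print(normalize_year("19th century"))  # no recoverable year
```

Normalizing on the service-provider side, rather than demanding that every data provider reformat its records, is what makes sorting and date-range limiting feasible across an aggregation.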

Like the traditional union catalog model, the metadata harvesting model creates a single "place" for searching instead of providing real-time decentralized or distributed searching of diverse resources, as in the metasearching model. In the harvesting model, to facilitate searching, an extra "layer" is added to the aggregation of harvested records; this layer manages the mapping and searching of heterogeneous metadata records within a single aggregated resource. Godby, Young, and Childress have suggested a model for creating a repository of metadata crosswalks that could be exploited by OAI harvesters. Documentation about the mapping would be associated with the standard used by the data providers, and the metadata presented by the service providers would be encoded in METS.34 This would provide a mechanism for facilitating the transformation of OAI-harvested metadata records by service providers.

Metasearching
The number of metadata standards continues to grow, and it is unrealistic to think that records from every system can be converted to a common standard that will satisfy both general and domain-specific user needs. An alternative is to maintain the separate metadata element sets and schemas that have been developed to support the needs of specific communities and offer a search interface that allows users to search simultaneously across a range of heterogeneous databases. This can be achieved through a variety of methods and protocols that are generally grouped under the rubric metasearch.

Many different terms and definitions have been used for metasearching, including broadcast searching, parallel searching, and search portal. I follow the definition given by the NISO MetaSearch Initiative: "search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at one time."35

The best-known and most widely used metasearch engines in the library world are based on the Z39.50 protocol.36 The development of this protocol was initiated to allow simultaneous searching of the Library of Congress, OCLC's WorldCat, and the RLIN bibliographic file to create a virtual union catalog and to allow libraries to share their cataloging records. With the advent of the Internet, the protocol was extended to enable searching of abstracting and indexing services and full-text resources when they were Z39.50 compliant. Some people touted Z39.50 as the holy grail of search: one-stop shopping with seamless access to all authoritative information. At the time of its implementation, Z39.50 had no competitors, but it was not without its detractors.37

The library community is split over the efficacy of metasearching. When is "good enough" really acceptable? Often, the results created through a keyword query of multiple heterogeneous resources have high recall and little precision, leaving the patron at a loss as to how to proceed. Users who are used to Web search engines will often settle for the first hits generated from a metasearch, regardless of their suitability for their information needs. Authors have pointed to Google's "success" to reaffirm the need for federated searching without referring to any studies that evaluate the satisfaction of researchers.38 A recent preliminary study conducted by Lampert and Dabbour on the efficacy of federated searching laments that until recently studies have focused on the technical aspects of metasearch, without considering student search and selection habits or the impact of federated searching on information literacy.39

What are some of the issues related to metasearch? In some interfaces, search results may be displayed in the order retrieved, or by relevance, either sorted by categories or integrated. As we know, relevance ranking often has little or nothing to do with what the searcher is really seeking. Having the choice of searching a single database or multiple databases allows users to take advantage of the specialized indexing and controlled vocabulary of a single database or to cast a broader net, with less vocabulary control.

There are several advantages of a single gateway, or portal, to information. Users do not always know which of the many databases they have access to will provide them with the best information. Libraries have attempted to list databases by categories and provide brief descriptions; but users tend not to read lists, and this type of "segregation" of resources neglects the interdisciplinary nature of research. Few users have the tenacity to read lengthy alphabetic lists of databases or to ferret out databases relevant to their queries when they are buried in lengthy menus. On the other hand, users can be overwhelmed by large result sets from federated searches and may have difficulty finding what they need, even if the results are sorted by relevance.40

As of this writing, the commercial metasearch engines for libraries are still using the Z39.50 protocol to search across multiple repositories simultaneously.41 In simple terms, this protocol allows two computers to communicate in order to retrieve information; a client computer will query another computer, a server, which provides a result. Libraries employ this protocol to support searching of other library catalogs as well as abstracting and indexing services and full-text repositories. Searches and results are restricted to databases that are Z39.50 compatible. The results that users see from searching multiple repositories through a single interface and those achieved when searching their native interfaces individually may differ significantly, for the following reasons:
  • The way the server interprets the query from the client. This is especially the case when the query uses multiple keywords. Some databases will search a keyword string as a phrase; others automatically add the Boolean operator "and" between keywords; yet others automatically add the Boolean operator "or."
  • How a specific person, place, event, object, idea, or concept is expressed in one database may not be how it is expressed in another. This is the vocabulary issue, which has a significant impact on search results when querying single resources (e.g., the name or term that the user employs may or may not match the name or term employed in the database to express the same concept). This is exacerbated when querying multiple resources, where different name forms and terms proliferate.
  • Metasearch engines vary in how results are displayed. Some display results in the order in which they were retrieved; others, by the database in which they were found; still others, sorted by date or integrated and ranked by relevance. The greater the number of results, the more advantages may be derived from sorting by relevance and/or date.42
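The first bullet above, on how servers interpret a multi-keyword query, can be demonstrated concretely. The tiny "database" below is invented for illustration; the three modes correspond to phrase searching, an implicit Boolean "and," and an implicit Boolean "or."

```python
# The same two-keyword query under three server-side interpretations.
records = [
    "dutch painting of the nineteenth century",
    "painting by dutch masters",
    "dutch still life drawings",
]

def search(query: str, mode: str):
    words = query.lower().split()
    hits = []
    for r in records:
        rwords = set(r.split())
        if mode == "phrase" and query.lower() in r:
            hits.append(r)  # keywords treated as an exact phrase
        elif mode == "and" and all(w in rwords for w in words):
            hits.append(r)  # implicit Boolean AND between keywords
        elif mode == "or" and any(w in rwords for w in words):
            hits.append(r)  # implicit Boolean OR between keywords
    return hits

print(len(search("dutch painting", "phrase")))  # 1
print(len(search("dutch painting", "and")))     # 2
print(len(search("dutch painting", "or")))      # 3
```

Three servers, three different result counts for an identical query: merged into one result list by a metasearch engine, these differences are invisible to the user, which is precisely why metasearch results can diverge so sharply from native-interface results.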


ZING (Z39.50 International: Next Generation)43 strives to improve the functionality and flexibility of the Z39.50 protocol while making implementation easier for vendors and data publishers, in the hope of encouraging its adoption. ZING incorporates a series of services. One is a Web service for search and retrieval (SRW) from a client to a server using SOAP (Simple Object Access Protocol), which uses XML for the exchange of structured information in a distributed environment.44 Another is SRU (Search/Retrieve via URL), a standard search protocol for the Web that performs search and retrieval through a URL.45 Although the development of ZING holds the promise of better performance and interoperability, as of this writing it has not been widely adopted.
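Because SRU carries the whole search as parameters of an ordinary HTTP request, constructing a query is straightforward. The sketch below builds an SRU 1.1 searchRetrieve URL; the base URL is hypothetical, and the query string uses CQL (Common Query Language), the query syntax SRU specifies.

```python
# Build an SRU 1.1 searchRetrieve request URL.
from urllib.parse import urlencode

def sru_url(base: str, query: str, start: int = 1, maximum: int = 10) -> str:
    """Assemble an SRU searchRetrieve URL with a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,          # CQL, e.g. dc.title = "irises"
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return "%s?%s" % (base, urlencode(params))

url = sru_url("http://example.org/sru", 'dc.title = "irises"')
print(url)
```

A plain URL like this can be issued from any HTTP client, with no SOAP envelope or persistent session, which is the chief reason SRU lowers the implementation barrier relative to classic Z39.50.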

The limitations of Z39.50 have encouraged the development of alternative solutions to federated searching to improve the way results are presented to users. One approach is the XML Gateway (MXG), which allows queries in an XML format from a client to generate result sets from a server in an XML format.46 Another approach used by metasearch engines when the database does not support Z39.50 relies on HTTP parsing, or "screen scraping." In this approach, the search retrieves an HTML page that is parsed and submitted to the user in the retrieved set. Unfortunately, this approach requires a high level of maintenance, as the target databases are continually changing and the level of accuracy in retrieving content varies among the databases.
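A bare-bones illustration of the screen-scraping approach: parsing result titles out of an HTML page with the standard-library HTMLParser. The markup pattern assumed here (each title wrapped in a span with class "title") is hypothetical; real result pages differ by vendor and change without notice, which is exactly why this approach requires so much maintenance.

```python
# Extract result titles from an HTML results page ("screen scraping").
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects text found inside <span class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Assumption: each result title is wrapped in <span class="title">.
        if tag == "span" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

page = ('<ul><li><span class="title">Irises</span></li>'
        '<li><span class="title">Sunflowers</span></li></ul>')
scraper = TitleScraper()
scraper.feed(page)
print(scraper.titles)
```

If the target site renames the class or restructures the list, the scraper silently returns nothing, so each target database effectively needs its own hand-maintained parser.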

Table 3. Methods for Enabling Integrated Access/Cross-Collection Searching

Method: Federated searching of physically aggregated contributed metadata records
Description: Records from various data providers are aggregated in a single database, in a single metadata schema (either in the form contributed, e.g., in the MARC format, or "massaged" by the aggregator into a common schema), and searched in a single database with a single protocol. The service provider preprocesses the contributed data prior to it being searched by users and stores it locally. For records to be added or updated, data providers must contribute fresh records, and aggregators must batch process and incorporate the new and updated records into the union catalog.
Examples: Traditional union catalogs such as OCLC's WorldCat and the Online Archive of California (OAC); "local" or consortial union catalogs such as OhioLink (a consortium of Ohio's college and university libraries and the State Library of Ohio) and Melvyl (the catalog of the University of California libraries)

Method: Federated searching of physically aggregated harvested metadata records
Description: Records expressed in a standard metadata schema (e.g., Dublin Core) are made available by data providers on specially configured servers. Metadata records are harvested, batch processed, and made available by service providers from a single database. Metadata records usually contain a link back to the original records in their home environment, which may be in a different schema than the one used for the harvested records. The service provider preprocesses the contributed data prior to it being searched by users and stores it locally. In order for records to be added or updated, data providers must post fresh metadata records, and service providers must reharvest, batch process, and integrate the new and updated records into the union database.
Examples: OAI-harvested union catalogs such as the National Science Digital Library (NSDL), OAIster, the Sheet Music Consortium, and the UIUC Digital Gateway to Cultural Heritage Materials

Method: Metasearch of distributed metadata records
Description: Diverse databases on diverse platforms with diverse metadata schemas are searched in real time via one or more protocols. The service provider does not preprocess or store data but rather processes data only when a user launches a query. Fresh records are always available because searching is in real time, in a distributed environment.
Examples: Arts and Humanities Data Service, Boston College CrossSearch, Cornell University Find Articles search service, University of Notre Dame Article QuickSearch, University of Michigan Library Quick Search, University of Minnesota Libraries MNCAT
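The harvesting model in table 3, in which a service provider periodically collects Dublin Core records exposed by data providers and stores them locally, can be sketched as the processing of an OAI-PMH ListRecords response. The XML below is a hand-made, simplified fragment (a real oai_dc response wraps each record's metadata in an oai_dc:dc container with additional header elements), and the record content is invented.

```python
# A minimal sketch of harvesting Dublin Core records from an
# OAI-PMH-style ListRecords response into a local union catalog.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Simplified stand-in for a data provider's response.
sample_response = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Sheet Music Consortium sample record</title>
          <identifier>http://example.org/item/42</identifier>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Extract (title, link-back) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        title = rec.find(".//" + DC + "title").text
        link = rec.find(".//" + DC + "identifier").text
        records.append((title, link))
    return records

union_catalog = harvest(sample_response)
```

Note that the harvested record keeps an identifier pointing back to the item in its home environment, as the table's description of harvesting requires.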


The key to improvement may lie in the implementation of multiple protocols rather than a single protocol. As of this writing, some vendors are combining Z39.50 and XML Gateway techniques to increase the number of "targets," or servers, that can be queried in a single search.47

Case Studies

Each instance of data conversion, transformation, metasearching, or metadata harvesting will bring its own unique set of issues. Below are examples of projects that illustrate the complexities and pitfalls of using crosswalks and metadata mapping to convert existing metadata records from one schema to another, to enhance existing records, or to support cross-collection searching.

Case Study 1: Repurposing Metadata. Links to ONIX metadata added to MARC records.

In 2001 a task force was created by the Cataloging and Classification: Access and Description Committee, an Association for Library Collections & Technical Services (ALCTS) committee under the aegis of the American Library Association (ALA), to review a standard developed by the publishing industry and to evaluate the usefulness of data in records produced by publishers to enhance the bibliographic records used by libraries. The task force reviewed and analyzed the ONIX (Online Information Exchange) element set48 and found that some of the metadata elements developed to help bookstores increase sales could have value for the library user as well.49 In response, the Library of Congress directed the Bibliographic Enrichment Advisory Team (BEAT) to repurpose data values from three metadata elements supplied by publishers in the ONIX format—tables of contents, descriptions, and sample texts from published books—to enhance the metadata in MARC records for the same works. The ONIX metadata is stored on servers at the Library of Congress and is accessed via hyperlinks in the corresponding MARC records,50 as shown in figure 1. In this way, ONIX metadata originally created to manage business assets and to help bookstores increase sales has been repurposed to enrich the bibliographic records used by libraries, so that users can more easily evaluate a given publication.
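The enrichment step can be pictured as adding an electronic-location field to a MARC record that points at the publisher-supplied content hosted elsewhere. The sketch below is illustrative only, not the Library of Congress's actual process: the ONIX dictionary keys and the URL are invented, and the field is rendered as a display string rather than true MARC structure. MARC field 856 commonly carries such links, with subfield $3 labeling the linked material and subfield $u holding the URI.

```python
# An illustrative sketch of enriching a MARC record with a link to
# publisher-supplied (ONIX-derived) table-of-contents data.

def toc_link_field(onix):
    """Format a MARC 856 display string linking to a hosted table of contents.

    '$' marks subfield delimiters in this human-readable form:
    $3 labels what the link points to, $u carries the URI.
    """
    return "856 41 $3 Table of contents $u " + onix["toc_url"]

# Hypothetical ONIX-derived data for one publication.
onix_record = {"isbn": "0892367504",
               "toc_url": "http://example.org/toc/0892367504.html"}
field = toc_link_field(onix_record)
```

The point of the example is the repurposing itself: metadata created for the book trade becomes, via a single added field, an evaluation aid inside a library catalog record.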