By Tony Gill
Introduction
Few people would argue with the assertion that catalogs are
useful tools for managing collections of items, and that their
usefulness increases proportionately with the size of the
collection being managed. A catalog of concise, well-structured
descriptions of the items in a collection should always be
easier to manage than the collection itself, since it should
provide both a distillation of the collection in terms of
volume and a consistent, easily-understood structure. However,
perhaps fewer people appreciate that the act of cataloging
a collection is actually a process of knowledge representation.
Designing a catalog for a collection is ultimately a philosophical
problem-solving exercise; it is an attempt to determine the
most significant attributes or properties of the items in
the collection, so that the essence of the items can be captured
as concise descriptions. These concise descriptions then represent
the items in the catalog, and provide a route back to the
items themselves. The catalog should be much easier to search,
sort and browse than the collection itself, provided that
sufficient consistency in the structure and content of the
descriptions is achieved, because it contains only the most
essential information characterizing the items in the collection.
Computers are innately well-suited for managing catalogs;
in fact, it could be argued that storing and manipulating
large collections of structured data is a core component of
their raison d'être. Database management systems
have been used to store every conceivable type of catalog,
from mailing lists to stock inventories to museum collections
to library holdings, since they were first developed.
Computers have always employed catalogs internally as well,
to keep track of different discrete data objects. In order
to function correctly, they must keep an accurate record of
the identity and location of every item of data stored in
the various memories. For example, the operating system of
a computer uses a catalog called the File Allocation Table
to store the names of files and their physical position on
a disk.
This type of data catalog is itself stored by the computer
as data, a recursive relationship that has resulted in it
being referred to as "metadata."
Many introductory articles about metadata begin by defining
it simply and economically as "data about data",
in an attempt to demystify a term that is used considerably
more often than it is fully understood. This concise and accurate
definition is often then incorrectly generalized, either implicitly
by the reader or explicitly by the author, to mean "information
about information".
The unhelpful result of this undoubtedly well-intentioned
semantic lenience is that the term "metadata" is now increasingly
used in contexts where the term "data" would have sufficed
just a few short years ago (for example, descriptions of people,
objects and events), often resulting in confusion and misunderstanding.
This variety of interpretations of the term "metadata" is
not altogether surprising — it is formed from two root
terms that have both been adopted and re-purposed by practitioners
of diverse disciplines over several millennia, ranging from
epistemology and metaphysics to chemistry and computer science.
The usage of the term "metadata" in the context of this essay
will borrow and synthesize meaning from the disciplines of
both computer science and philosophy. Computer science provides
a useful constraint for the concept of "data", by limiting
it to the realm of discrete identifiable pieces of digital
"computer data" — certainly still a fairly abstract
concept, but considerably less so than the more general interpretation
of data as facts or assertions used for analysis and inference.
Philosophy, specifically metaphysics, provides the example
usage of "meta" as a prefix to denote an alternate or second-order
kind of relationship between two similar types of entities,
and the underlying notion of the essential attributes that
make up a metadata description.1
So, moving from the abstract realm to the practical, the
term "metadata" in the context of this essay refers to structured
descriptions, stored as computer data, that attempt to describe
the essential properties of other discrete computer data objects—specifically,
the data objects that make up the information on the World
Wide Web, the world’s largest and fastest-growing collection
of data.
The Rise and Rise of the World Wide Web
It is impossible to determine the exact size of the World
Wide Web; it has grown so large, so fast, and is so impenetrable
to practical survey methodologies that it has effectively
transcended our ability to measure it with any degree of precision.
However, although the actual numerical quantities will never
be entirely accurate and are instantaneously out of date,
carefully-designed surveys carried out at regular intervals
can at least provide some insight into the trends in Web growth
and usage over time.
The most recent Netcraft survey,2
carried out on 1 April 2000, received responses to HTTP requests
for server names from 14,322,950 "sites," where a site in
this case represents a unique hostname such as http://www.hostname.com
or http://www.hostname.org. This is an arbitrary
but simple approximation for the total number of Web sites
that counts different hostnames on the same IP address as
separate, but does not count separate distinct Web sites that
share the same hostname: For example, http://www.hostname.com/mywebsite/
and http://www.hostname.com/yourwebsite/
would not be counted separately.
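The counting rule described above can be sketched in a few lines. This is only an illustration of counting unique hostnames, not the survey's actual software; the function name and the sample paths are hypothetical.

```python
from urllib.parse import urlsplit

def count_sites(urls):
    """Count unique hostnames, the survey's approximation of a 'site'."""
    hostnames = {urlsplit(url).hostname for url in urls}
    hostnames.discard(None)  # ignore strings that carry no hostname
    return len(hostnames)

# Hypothetical sample: two hostnames, one of which hosts two distinct sites.
sample_urls = [
    "http://www.hostname.com/",
    "http://www.hostname.org/",
    "http://www.hostname.com/site-a/",
    "http://www.hostname.com/site-b/",
]
site_count = count_sites(sample_urls)
print(site_count)
```

The two sites that share www.hostname.com collapse into a single entry, which is exactly the undercount the text describes.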
To put this number into context, a similar type of survey
conducted by Matthew Gray of the Massachusetts Institute of
Technology found just 130 Web hosts in June of 1993; the Web
grew by roughly eleven million percent in less than
seven years.3
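The scale of that growth can be checked directly from the two host counts quoted above. A minimal sketch of the arithmetic follows; the function name is illustrative.

```python
# Host counts quoted in the text.
hosts_1993 = 130          # Matthew Gray's survey, June 1993
hosts_2000 = 14_322_950   # Netcraft survey, April 2000

def percent_growth(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100

growth = percent_growth(hosts_1993, hosts_2000)
print(f"{growth:,.0f}%")
```

The result is a little over eleven million percent, or about 110,000 times as many hosts.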
[Figure: Growth in the Number of Web Sites]
The number of hosts is only one metric for determining the
size of the Web, however; there have also been a number of
attempts to count the number of individual pages available.
The most recent attempt at the time of writing is the Inktomi
WebMap,4 a joint survey
by the search engine company Inktomi and the NEC Research
Institute, which announced in a press release dated 18 January
2000 that the Web contained in excess of one billion unique,
indexable documents.5
This does not include duplicate documents on mirror servers
or documents that are "hidden" from Web crawlers, such as
documents that are dynamically generated by querying underlying
databases or that require some kind of user log-on.
A Selection of Web Facts
- The Tenth GVU Web Survey,6
conducted in October 1998, found that 85% of respondents
used search engines to find information on the Web,
making it the second most common way of finding content
(the most common method, used by 88%, is to follow
hyperlinks from other pages).
- The survey by Lawrence & Giles found that, of the
15 terabytes of data that made up the estimated 800
million pages of the publicly indexable Web in February
1999, only 6 terabytes (40%) contained useful text
after removing HTML tags, comments, and white space.7
- The same 1999 survey found that the mean number
of Web pages per server was 289 and that search engines
were more likely to index pages that were accessed
via links from other pages.
- According to the results of a survey by Alexa Internet
at the end of 1999, 80% of Web traffic is directed
at just 0.5% of sites, with the top 5 sites (Yahoo,
Microsoft, Excite, eBay, and AltaVista), Disney (Go.com),
and AOL accounting for one click in five.8
- As of April 2000, new domains were being registered
at a rate of one per second.9
Finding Needles in a Global Haystack
In view of the huge size and explosive rate of growth of
the World Wide Web, it is clear that catalogs of some kind
would be invaluable in helping users discover relevant information
resources. Unfortunately, neither the Internet nor the World
Wide Web was originally designed with the cataloging of its
contents in mind; the TCP/IP suite of network protocols that
enables the basic infrastructure of the Internet to function
is solely a transport layer, concerned with getting packets
of data from one point to another as quickly and reliably
as possible, whereas the Hypertext Transfer Protocol (or
HTTP) only deals with the delivery of hyperlinked World Wide
Web information.
This means that the existing network protocols do not provide
any dedicated support for locating specific information resources
available on the network. This sorry state of affairs falls
very short of the vision of the Memex, a comprehensive
and affordable personal reference and research tool originally
proposed way back in 1945 by Vannevar Bush and believed by many
to be the precursor of hypertext.10
The disappointment of the hypertext community with the World
Wide Web is clearly illustrated by this quote from Ted Nelson
(the man who first coined the term "hypertext" in
1965), delivered at the Hypertext 97 conference:
The reaction of the hypertext research community to
the World Wide Web is like finding out that you have a fully
grown child. And it's a delinquent.11
Unsurprisingly, tools designed to address the resource location
problem and help make sense of the Internet’s vast information
resources started to appear soon after the launch of the first
Web browsers in the early 1990s; for example, Tim Berners-Lee
founded the WWW Virtual Library12
shortly after inventing the Web itself, and Yahoo!,13
Lycos14, and Webcrawler15
were all launched during 1994.
The tools currently available to help users find Web resources
are many times larger and more powerful than their 1994 predecessors
— they have to be, in order to keep up with the explosive
growth in both the amount of information available and the
number of users accessing it. However, there are still only
two principal classes of Web resource locating tools: directories
and search engines.
Directories are listings of network resources created by
real people, who select, catalog and classify Web resources
that they feel are appropriate for their constituency, based
on factors such as accuracy, authority, and currency. Directories
can either be general in scope, such as the World Wide Web
Virtual Library and Yahoo!, or they can specialize in particular
subject areas, such as the Art, Design, Architecture & Media
Information Gateway (ADAM)16
and the Edinburgh Engineering Virtual Library (EEVL).17
Directories typically provide access to the resources they
have cataloged both by searching and by browsing a hierarchical
set of subject headings.
Search engines, often called "spiders," "crawlers" or "robots,"
are automated systems that continuously traverse the Web visiting
sites, saving copies of the pages and their locations as they
go in order to build up a huge catalog of fully-indexed pages.
They typically provide powerful searching facilities and extremely
large result sets, which are relevance-ranked (using closely-guarded
proprietary algorithms) in an effort to make them usable.
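The catalog a crawler builds can be pictured as an inverted index that maps each word to the locations of the pages containing it. The following is a minimal sketch only: the page texts and locations are invented for illustration, no network access is performed, and the proprietary relevance ranking that real search engines apply is omitted.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page locations containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the locations of pages containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical saved copies of pages, keyed by location.
pages = {
    "http://www.hostname.com/a.html": "Paintings by Picasso",
    "http://www.hostname.com/b.html": "Museum opening hours",
    "http://www.hostname.org/c.html": "Picasso museum guide",
}
index = build_index(pages)
hits = search(index, "picasso museum")
print(hits)
```

Because every word of every page is indexed, a one-word query such as "museum" already matches two of the three sample pages; on the Web as a whole the same effect produces the very large result sets mentioned above.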
In recent years some hybrid approaches have started to appear
— for example, the Northern Light18
search engine, which attempts to automatically cluster results
into dynamically-generated "Custom Search Folders" according
to subject, type of document, source or language, giving the
kind of hierarchical organization of results more usually
associated with directory services.
However, there are serious problems with both the directory
and search engine approaches. Human-mediated directories generally
provide good search precision at the broad subject level,
and are normally considered to provide higher-quality information
overall because of the human intervention in the indexing
and classification process. However, this mediation is a costly,
labor-intensive process that is not sufficiently scalable
to provide comprehensive up-to-date coverage of the whole
Web, much of which is highly transient.
Another problem with the hand-crafted approach to cataloging
Web resources is deciding upon the granularity of the resources
to be described; should descriptions be created for Web sites
as a whole, or should each page be cataloged individually?
Clearly, a cost-benefit tradeoff will always need to be made.
The crawler-based search engines also suffer from a number
of serious problems, which affect their ability to provide
an index that is both comprehensive and current, and the likelihood
that users will find what they are looking for even if it
has been indexed:
- Increasingly, information on the Web is being generated
dynamically from databases in response to user input. This
information is sometimes referred to as "the hidden Web,"
because it is beyond the indexing reach of the Web crawlers.
- The Web crawling components of the search engines are
fully automated, which means that the indexed Web resources
are selected by software algorithms rather than people,
and are therefore variable in both quality and depth of
indexing.
- The Web indexing playing field is not a level one: Recent
research suggests that "search engines are typically
more likely to index US sites than non-US sites (AltaVista
is an exception), and more likely to index commercial sites
than educational ones."19
- Searching large automatically-indexed databases often
results in extremely large result sets, which are frequently
unusable despite increasingly sophisticated information
retrieval tools, relevance ranking procedures and context-aware
artificial intelligence algorithms.
- As the volume of information on the Web continues to
increase exponentially, the amount of network bandwidth
(information-carrying capacity) required by the crawlers
in order to maintain current and comprehensive indices could
eventually reach unacceptable levels; ethical "codes of
conduct" for Web crawlers have already existed for some
years.
The search engines seem to be showing signs of strain in
attempting to keep up with the explosive growth of the Web.
Steve Lawrence & C. Lee Giles of the NEC Research Institute conducted
a scientifically rigorous survey of the search engines’
coverage of Web content in February 1999.
The findings of their survey, published in the peer-reviewed
journal Nature, suggest that the combined coverage of the
11 search engines used for the study was about 42% of the
total number of unique indexable pages on the Web (i.e. not
including the ever-expanding "hidden Web"), with no search
engine indexing more than about 16%. In summary:
Our results show that the search engines are increasingly
falling behind in their efforts to index the Web.20
The publication of these findings subsequently seemed to
prompt the search engines both to increase the size and currency
of their indices, and to start quoting ever-larger numbers
of the pages visited in order to generate their indices. As
Danny Sullivan observes in the March 2000 issue of the Search
Engine Report:
One of the latest trends these days is for crawlers
to flaunt both how many pages they have in their index plus
the larger number of pages visited to create that index.
[..] Why have dual numbers returned? Because no matter how
big your competition is, the Web is even bigger.21
However, despite these renewed efforts by the search engines
(according to Sullivan, Inktomi claimed to have an index of
over 500 million pages in April 2000),22
the outlook for their ability to keep up with the growth of
the Web in the long term is not promising.
Cataloging the Web
Although initially it appears that directories and search
engines suffer from different types of problems, it seems
clear that most if not all of the difficulties are the result
of ambitions which are likely to prove untenable in the long
term; the Web is simply too big for any single organization
or service to catalog, irrespective of whether they use people
or computers to generate their indices.
If there is a solution to the problem of resource discovery
on the Web, it must surely be based on a distributed metadata
catalog model. Ironically, the WWW Virtual Library uses just
such a distributed model; however, the altruistic efforts
of its volunteer curators have proved insufficient to keep
pace with the growth of the Web.
The necessary technical protocols for creating distributed
meshes of resource discovery databases, such as Z39.50 and
WHOIS++, are already available — interoperability at
a technical level is no longer a significant problem.
What is required now is the widespread adoption of standards
for metadata structure, content and authentication that will
allow secure interoperability on the semantic level. However,
before discussing the specifics of the metadata standards
currently available, it will be helpful to consider in more
detail some of the specific applications that metadata can
be used for, and some of the more problematic issues that
arise in the description of networked resources.
Metadata Applications and Issues
Clearly, the information structure and content of Web metadata
records should capture the essence of the Web resources they
describe and facilitate the various tasks for which the metadata
was devised.
Unfortunately, this is the point where real-world complexities
start to intrude; with such a large collection of information
objects to describe, spanning the breadth and depth of human
knowledge and creativity, and with tens of millions of users,
the number of potential applications for Web metadata is limited
only by the imagination. Consequently, consensus on the most
appropriate structure and content for Web metadata remains
elusive, despite significant efforts worldwide; some of the
more significant descriptive standards resulting from this
metadata research are described below, and elsewhere on this
site.
The most common application of Web metadata is generally
referred to as "resource discovery," because the metadata
is intended to help Web users discover the information they
are looking for; the availability of consistent, accurate
and well-structured descriptions of Web resources could enable
much greater search precision and more accurate relevance
ranking of the large result sets typically retrieved by search
engines, for example.
Once potentially useful candidate resources and their locations
have been identified, metadata can also be used to provide
short descriptions or evaluations that can help the user determine
the relevance of the resource, or information about any access
restrictions or rights implications that may prohibit the
intended use of the information. Whether or not these applications
are intrinsic parts of the resource discovery process or are
in fact separate applications of Web metadata remains the
subject of debate.
Metadata is also often used in the management and administration
of digital networked resources; this type of "administrative
metadata" is essential for ensuring that Web resources are
kept up to date, for example, or are free of rights restrictions
that may prohibit their distribution over the Internet.
One of the more interesting consequences of the metadata
research taking place around the globe is that effective cataloging,
historically perceived as an arcane art practised only by
librarians, museum curators and archivists, is now becoming
an issue for a much wider community.
Acceptance of the importance of controlled vocabularies and
formal classification schemes is becoming increasingly widespread
— a fact that most experienced catalogers have taken
for granted for decades (notwithstanding the fact that the
sheer diversity of information on the Web is highlighting
the shortcomings of the existing taxonomies for organizing
the sum of human learning!).
However, the sheer scale of the Web as an information space
will require new applications of the old tools and skills,
such as the use of thesauri by software for automatically
expanding users’ queries to include synonyms or even
translations of the query terms into alternate languages,
or mappings between different classification schemes and terminology
authorities.
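Thesaurus-driven query expansion of the kind described above might look like the following sketch. The thesaurus entries and the function name are hypothetical, and a real thesaurus would also distinguish broader, narrower and related terms.

```python
# Hypothetical thesaurus: each term maps to synonyms and translations.
THESAURUS = {
    "painting": ["picture", "canvas", "peinture"],
    "sculpture": ["statue", "carving"],
}

def expand_query(terms, thesaurus):
    """Return the query terms plus any equivalents, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term.lower(), [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded_terms = expand_query(["painting", "Picasso"], THESAURUS)
print(expanded_terms)
```

Terms with no thesaurus entry, such as the proper name here, pass through unchanged.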
Similarly, the fact that a diverse range of vocabularies
and classification schemes will need to coexist in the same
vast information space means that computers must be able to
identify the source authority for terms or classmarks; consequently,
schema registries will be required in order to define
namespaces and thereby ensure that the labels used
to identify the various authorities are unique and unambiguous.
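The role of a registry can be illustrated with a small sketch in which each authority label is bound to exactly one namespace. The class, its methods and the namespace addresses are all assumptions made for illustration and do not describe any existing registry service.

```python
class SchemeRegistry:
    """Binds each authority label to a single namespace."""

    def __init__(self):
        self._schemes = {}

    def register(self, label, namespace):
        existing = self._schemes.get(label)
        if existing is not None and existing != namespace:
            raise ValueError(f"label {label!r} already identifies {existing}")
        self._schemes[label] = namespace

    def qualify(self, label, term):
        """Return an unambiguous identifier for a term from a given authority."""
        return f"{self._schemes[label]}{term}"

registry = SchemeRegistry()
# Hypothetical namespace addresses for two vocabularies.
registry.register("LCSH", "http://example.org/schemes/lcsh/")
registry.register("AAT", "http://example.org/schemes/aat/")

lcsh_term = registry.qualify("LCSH", "Painting")
aat_term = registry.qualify("AAT", "Painting")
print(lcsh_term)
print(aat_term)
```

The same word drawn from two authorities yields two distinct identifiers, so software can tell which vocabulary a term came from.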
While there are undoubtedly many lessons that can and should
be learned from the traditional custodians of information,
there are also a number of new challenges unique to the pandisciplinary,
transglobal, multilingual and multicultural networked environment
of the Web that will require fresh approaches and new solutions.
For example, deciding upon the most appropriate granularity
for the resource descriptions is another issue that the would-be
Web cataloger must address: How much detail about a Web resource
should a catalog record contain? How many catalog records
should be created for a given Web resource? Increasing user
expectations regarding retrieval capabilities, combined with
the flexibility and diversity of the hypertext information
environment, jointly conspire to render the analogy between
Web cataloging and bibliographic cataloging only partially
valid. No longer content with the traditional "author, title,
keyword" searches offered by library catalogs, users now expect
to be able to search for words or phrases appearing within
the body text of Web resources. A hybrid approach that incorporates
hand-crafted site-level descriptions produced by skilled catalogers
and augments them with automated full-text indexes could
provide the most effective solution, provided that the results
are relevance-ranked accordingly.
Another significant conceptual difficulty arises from the
need to describe the relationships between networked resources
and other objects: What exactly should metadata describe?
Strictly speaking, metadata should describe the properties
of an object which is itself data, for example a Web page,
a digital image or a database — which is analogous to
the librarian’s practice of cataloging "the thing in
hand." For networked resources, however, these properties
are often not very interesting or useful for the purposes
of discovery; for example, if a researcher is interested in
discovering images of famous artworks on the Web, they would
generally search using the properties of the original artworks
(e.g. CREATOR = Picasso, DATE = 1937), not the properties
of the digital copies or "surrogates" of them (e.g. CREATOR
= Scan-U-Like Imaging Labs Inc., DATE = 2000-02-29).
Both the "granularity" and "surrogacy" problems have at their
root the need to describe the relationships between
different objects (not all of which will exist on the Web)
in the various records describing those objects; for example,
a record describing a Web page within a site should indicate
its membership of the site, and a scanned image of a Picasso
painting on the Web should identify the painting from which
it was derived.
Of course, none of the problems described above are new; the
traditional guides to information resources, such as librarians,
museum curators and archivists, have been wrestling with the
seemingly-impossible task of "Modeling the World" in order
to describe information resources for decades. But the urgent
need to catalog the Web has made these fundamentally epistemological
issues significant for a new and much larger community.
Tools for Web Cataloging
Over the last few years, a plethora of tools for cataloging
the Web have appeared — most of them Web-accessible
themselves.
Some tools simply provide basic metadata creation and editing
features, allowing syntactically-correct metadata records
to be created and edited manually without the need to understand
the complexities of the various encoding syntaxes. Other tools
provide more sophisticated features, such as the ability to
convert between different metadata formats or automatically
extract embedded metadata from Web pages; some tools even
attempt to generate metadata automatically by making inferences
from the contents of documents. A detailed list of metadata-related
tools is maintained on the Dublin Core Web site.23
In addition to hosting the Dublin Core Web site, OCLC also
operates a developing service called CORC24
(Cooperative Online Resource Catalog), that provides an insight
into how the various Web cataloging tools can be provided
in an integrated system to support the creation and maintenance
of a collaborative Web resource catalog. CORC provides a suite
of Web cataloging tools that can be used by participating
librarians to add to the central shared CORC database, which
at the time of writing contained 229,075 resource descriptions.
Standards for Metadata on the Web
In order for metadata to be as useful and cost-effective
as possible, it is essential that its structure, semantics
and syntax conform to widely supported standards, so that
it is effective for the widest possible constituency, remains
usable for as long as possible, and can be processed automatically
as far as possible.
Three metadata standards efforts are particularly pertinent
in the Web context: the "keywords" and "description" meta tags
as implemented by the search engines, the Dublin Core Metadata
Initiative, and the Resource Description Framework. These
are discussed below.
Search Engine Meta Tags
The AltaVista search engine originally popularized the use
of two simple metadata elements, "keywords" and "description,"
that can be embedded in Web pages by their authors using the
HTML meta tag. The original intention was that the "keywords"
metadata could be used to provide more effective retrieval
and relevance ranking, whereas the "description" would be
used in the display of search results to provide a more accurate
summary of a Web resource.
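As an illustration, the sketch below embeds the two elements in a small page and reads them back with Python's standard HTML parser. The page content and the class name are invented for the example, and nothing here reflects how any particular search engine processes the tags.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects the name/content pairs of the meta tags in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

# Hypothetical page with the two author-supplied meta tags.
PAGE = """<html><head>
<title>Example Museum</title>
<meta name="keywords" content="museum, paintings, sculpture">
<meta name="description" content="Collections and visitor information for a hypothetical museum.">
</head><body>...</body></html>"""

parser = MetaTagParser()
parser.feed(PAGE)
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]
description = parser.meta["description"]
print(keywords)
print(description)
```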
With the exception of meta tags such as "generator" that are
automatically (and somewhat pointlessly) inserted into Web pages
by authoring tools, "keywords" and "description" are now the
most commonly-used meta tags on the Web. The Lawrence
& Giles 1999 survey25
ascertained that they were used in the homepages of 34% of
sites, for example.
Unfortunately, many of the major search engines have now
stopped using meta tags to improve relevance ranking, and
some have even stopped indexing meta tags, because of the
increase (or at least the perceived increase) in meta tag
spamming or spoofing. Meta tag spamming is the term given
to the deliberate misuse of meta tags in order to boost a
site's ranking in search results, for example by repeating
keywords hundreds of times or by using sexually-explicit keywords.
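One simple way to flag the keyword-repetition form of spamming is to count how often each term appears. The sketch below is an illustrative heuristic only; the function name and the threshold are assumptions, and it is not the method used by any search engine named in this essay.

```python
from collections import Counter

def looks_like_keyword_stuffing(keywords_content, max_repeats=3):
    """Flag a keywords value in which any term repeats more than max_repeats times."""
    terms = [t.strip().lower() for t in keywords_content.split(",") if t.strip()]
    counts = Counter(terms)
    return any(n > max_repeats for n in counts.values())

honest = "museum, paintings, sculpture"
stuffed = ", ".join(["free"] * 200 + ["museum"])

print(looks_like_keyword_stuffing(honest))
print(looks_like_keyword_stuffing(stuffed))
```

A check like this catches only crude repetition; misleading but unrepeated keywords would pass it untouched.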
The following policy statements are from the Web sites of
AltaVista, Excite and Northern Light, respectively:
Why aren’t META tags given preference? Consider
the opportunity for abuse and spamming. [..] Basically,
META tags are a band aid to help you deal with pages that
don’t state what they are about in clear text, right
up front. Do it right to begin with, and you don’t
need META tags at all. You’ll get far better results
in terms of search engine traffic that way.26
Unfortunately, meta tag information is not always reliable.
It may or may not accurately reflect the content of the
site. In general, our spider does not honor metatags. This
means we do not index the content of the meta tag.27
While our crawler does make note of META tags, Northern
Light does not assign any particular relevance to words
contained in META tags, nor do we use them to control descriptions
on our results list.28
According to Search Engine Watch,29
the only search engines that use the "keywords" meta tag to
provide more effective relevance ranking are those based upon
the Inktomi search engine (Inktomi lists America Online, Freeserve.net,
Goto.com, LookSmart, HotBot, MSN and Yahoo! among its customers).
Inktomi claims that their search engine can detect common
spamming techniques, and "penalizes" documents it suspects
of containing inappropriate metadata by ranking them lower.30
The "description" tag is also used by some search engines
(e.g. AltaVista, Inktomi, Excite) to provide more naturalistic
descriptions of sites in results displays, when compared to
the automatically generated summaries from the first few lines
of the document that are generally used otherwise.
Although the search engines all have different approaches
with respect to metadata and relevance ranking, they appear
to have one characteristic in common—they all use the
contents of the HTML <TITLE> tag as the single most
significant factor in the ranking of result sets.
Dublin Core
The Dublin Core Metadata Element Set31
(a.k.a. "Dublin Core" or just "DC") is a set of 15 information
elements that can be used to describe a wide variety of information
resources on the Internet for the purpose of simple cross-disciplinary
resource discovery. The 15 elements (described in more detail
elsewhere on this site) are:
Contributor, Coverage, Creator, Date, Description,
Format, Identifier, Language, Publisher, Relation, Rights,
Source, Subject, Title, and Type.
The 15 elements and their meanings have been developed and
refined by a group of librarians, information professionals,
and subject specialists through an ongoing consensus-building
process that has included seven international workshops to
date and an active mailing list.32
From the outset, the development of the Dublin Core element
set has been underpinned by a number of guiding philosophies:
- The elements must be simple to understand and use, so
that any creator of networked resources would be able to
describe their own work without requiring extensive training.
- Every element is both optional and repeatable.
- The elements should be international and cross-disciplinary
in scope and applicability.
- The element set should be extensible, to allow discipline
or task-specific enhancements.
- The most important strategic application of the element
set would be for embedded descriptions of Web resources,
created by the resource authors, which meant a syntax that
could be accommodated within HTML's <META> tag.
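A description built on these principles could be embedded in a page along the lines of the following sketch. The "DC." name prefix is a convention assumed here for illustration, and the record contents and function name are hypothetical. Each element maps to a list of values, reflecting the rule that every element is both optional and repeatable.

```python
from html import escape

DC_ELEMENTS = (
    "Contributor", "Coverage", "Creator", "Date", "Description",
    "Format", "Identifier", "Language", "Publisher", "Relation",
    "Rights", "Source", "Subject", "Title", "Type",
)

def dublin_core_meta_tags(record):
    """Render a record as one HTML meta tag per element value."""
    tags = []
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        for value in values:
            # Escape the value so quotes cannot break the attribute.
            tags.append(f'<meta name="DC.{element}" content="{escape(value)}">')
    return tags

# Hypothetical record: elements that are not needed are simply left out.
record = {
    "Title": ["A Hypothetical Page"],
    "Creator": ["A. Author", "B. Author"],
    "Date": ["2000-04-01"],
}
meta_tags = dublin_core_meta_tags(record)
print("\n".join(meta_tags))
```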
Early adopters of the Dublin Core soon encountered the types
of problems discussed in the previous section, which have
resulted in a number of additional extensions and refinements
to the simple core element set:
- The Warwick Framework,33
a conceptual container architecture for diverse heterogeneous
metadata packets; prototype SGML and MIME implementations
of the Warwick Framework have been developed, but perhaps
the most important contribution of this work is the formalization
of requirements that led to the development of the Resource
Description Framework (discussed below).
- Interoperability Qualifiers34
that can be used either to refine the semantics of the element
or to provide more information about the encoding scheme
used for an element's value.
- Acknowledgement of the 1:1 Principle, which
states that the most robust solution to the granularity
and surrogacy issues described previously is to use separate
metadata "sets" or "packets" for each discrete object (item
or collection, network resource or otherwise), and to describe
the relationships between them using an enumerated list
of relationship types.
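The 1:1 Principle can be illustrated with the artwork example used earlier: one record for the original painting, one for its digital surrogate, and a relationship linking the two. This is a minimal sketch in which the identifiers and the function name are hypothetical, and only a single relationship type (Source) is modeled.

```python
# Hypothetical identifiers; the element names follow the Dublin Core set.
records = {
    "urn:example:painting": {
        "Creator": ["Picasso"],
        "Date": ["1937"],
        "Type": ["painting"],
    },
    "http://www.hostname.com/images/painting.jpg": {
        "Creator": ["Scan-U-Like Imaging Labs Inc."],
        "Date": ["2000-02-29"],
        "Type": ["image"],
        # Relationship to the object from which the image was derived.
        "Source": ["urn:example:painting"],
    },
}

def find_by_creator(records, creator):
    """Return records by this creator, plus surrogates derived from them."""
    direct = {
        rid for rid, rec in records.items()
        if creator in rec.get("Creator", [])
    }
    derived = {
        rid for rid, rec in records.items()
        if any(src in direct for src in rec.get("Source", []))
    }
    return direct | derived

matches = find_by_creator(records, "Picasso")
print(sorted(matches))
```

A search on the artist's name now reaches the scanned image through its Source relationship, even though the image record itself names the imaging lab as its creator.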
There are now a number of large-scale deployments of Dublin
Core metadata around the globe — the official Dublin
Core Web site lists 15 in North America and Mexico, 27 in Europe
and 12 across Asia and Australia.35
Some of these initiatives are on a national scale, for example
the Australian Government Locator Service36
and the CCTA Government Information Service in the UK, open.gov.uk.37
However, although significant progress in raising awareness
and increasing deployment of the Dublin Core has been made
over the last few years, there is still a long way to go before
it can begin to deliver on its promise of better resource
discovery on the Web. The Lawrence & Giles 1999 survey,38
for example, found that only 0.3% of Web sites contained Dublin
Core metadata. This poor uptake, in global terms at least,
is undoubtedly due in part to the reluctance of the major
search engines to support Dublin Core:
Search engine support is crucial for success, as demonstrated
by the lack of support for the existing Dublin Core meta
tags. [...] Practically no one uses these tags, and the reason
why is because none of the major search engines does anything
with them. They don't index them, nor do they provide a
way to search within the Dublin Core meta tag fields.39
Another factor that has hindered the widespread adoption
of Dublin Core metadata is the length of time it has taken
to reach consensus on approved Interoperability Qualifiers.40
Qualifiers for refining element semantics and identifying
formal encoding schemes were originally proposed as the "Canberra
Qualifiers"41 during
the fourth Dublin Core Workshop in Australia in March 1997,
but the initial set of approved qualifiers was not formally
accepted as part of the Dublin Core "registry" until April
2000, more than three years later.
The lengthy delay in reaching consensus on qualifiers was
certainly not caused by a lack of effort or commitment from
those involved; the Dublin Core Metadata Initiative is a voluntary
international standards effort, and the participants regularly
donate significant time and resources to the cause of improved
Web resource discovery.
Notwithstanding the effort required to reach international
cross-disciplinary consensus on any topic, the intellectual
difficulty in reaching agreement on qualifiers is partly the
result of well-intentioned attempts to apply Dublin Core far
beyond the purpose for which it was originally designed:
simple discovery of "document-like objects" on the World Wide
Web.
CIMI, the Consortium for the Computer Interchange of Museum
Information, conducted a detailed two-phase investigation
into the utility of Dublin Core metadata over a three-year
period. Starting in 1998, Phase I looked at simple unqualified
Dublin Core for museum information resource discovery, whereas
Phase II extended the "testbed" to include the use of qualified
Dublin Core metadata for the interchange of richer descriptions
between museums.
CIMI found that the unqualified implementation of the Dublin
Core Metadata Element Set could be an effective tool for the
coarse-grained discovery of museum information resources in
a cross-disciplinary networked environment, particularly if
the recommendations in the CIMI Guide to Best Practice were
followed.42
However, CIMI also found that Qualified Dublin Core (DCQ)
could not be recommended for information interchange within
the museum community, because it could not support the rich
descriptions that museums need to share. This was due to a
combination of constraints imposed by the underlying data
model of the element set, which was originally designed for
the description of text-based Web resources, and the "dumb-down"
rule for the application of "semantic refinement" qualifiers,
which stipulates that qualifiers can refine but not extend
the semantics of any given element.
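The "dumb-down" rule can be sketched in a few lines of Python. The dotted notation for qualified names (for example "date.created") and the function names are assumptions made for this illustration, not part of any Dublin Core specification. The sketch only shows the principle that a qualified element must collapse cleanly to one of the fifteen unqualified elements.

```python
# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dumb_down(qualified_name):
    """Drop any refinement qualifier and return the bare element name."""
    element = qualified_name.split(".", 1)[0].lower()
    if element not in DC_ELEMENTS:
        # A qualifier may refine an existing element but cannot introduce a new one.
        raise ValueError("not a Dublin Core element: " + qualified_name)
    return element


def dumb_down_record(record):
    """Collapse a list of (qualified name, value) pairs to unqualified Dublin Core."""
    return [(dumb_down(name), value) for name, value in record]


# Hypothetical qualified record.
record = [("date.created", "1998-05-01"), ("title.alternative", "A Working Title")]
print(dumb_down_record(record))
```

A museum-specific property with no parent among the fifteen elements (a hypothetical "dimensions.height", say) is rejected by this sketch, which mirrors the constraint CIMI ran into: richer descriptions have nowhere to go.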
Regardless of whether the Dublin Core in its current guise
is ever widely adopted for resource discovery on the Web,
the Herculean and ongoing effort has resulted in a deliverable
that could prove even more significant in the long term:
international, cross-disciplinary consensus on the key
requirements for effective resource discovery on the Web.
The lessons learnt in the Dublin Core Metadata Initiative
have helped to build the foundations of another metadata standard:
the Resource Description Framework.
Resource Description Framework
The Resource Description Framework,43
produced as part of the World Wide Web Consortium’s
Metadata Activity, is a metadata application of XML,44
the Extensible Markup Language, the successor to HTML and
the future language of the Web. Its development was informed
by previous work such as PICS45
(Platform for Internet Content Selection), the Dublin Core/Warwick
Framework initiative, and the metadata activities of major
software vendors such as Microsoft and Netscape.
The Resource Description Framework is built upon a simple
but robust data model that allows resources to be
described in terms of their properties. The values
of the properties can be either atomic in nature,
such as text strings or numbers, or they can in turn be other
resources, which can have properties of
their own.
This data model is often depicted visually using a type of
diagram called a directed labeled graph, also known
as a node and arc diagram. A generalized example
of an RDF description could take the following form (all of
the examples and diagrams in this section are based heavily
on Eric Miller's excellent examples46):
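In its generalized form, a resource is linked by a labeled property arc to a value, which is either an atomic value or another resource. A minimal Python sketch of that model follows. It is an illustration written for this section rather than a reproduction of Miller's diagram, and the class name Resource, the helper triples, and the example.org URIs are all assumptions.

```python
class Resource:
    """A node in the graph: something identified by a URI that can have properties."""

    def __init__(self, uri):
        self.uri = uri
        self.properties = []  # list of (property_type, value) arcs

    def add(self, property_type, value):
        """Attach a labeled arc; value may be a string, a number, or another Resource."""
        self.properties.append((property_type, value))
        return self


def triples(resource, seen=None):
    """Walk the graph and yield (subject URI, property type, value) statements."""
    seen = set() if seen is None else seen
    if resource.uri in seen:  # guard against cycles in the graph
        return
    seen.add(resource.uri)
    for property_type, value in resource.properties:
        if isinstance(value, Resource):
            # The value is itself a resource: record the arc, then describe it too.
            yield (resource.uri, property_type, value.uri)
            yield from triples(value, seen)
        else:
            yield (resource.uri, property_type, value)


# Hypothetical example: a document whose creator is itself a resource.
author = Resource("http://example.org/people/1").add("name", "A. Author")
document = Resource("http://example.org/doc").add("creator", author)
document.add("title", "An Example Document")

for statement in triples(document):
    print(statement)
```

Each yielded tuple corresponds to one arc in a node and arc diagram: the first item is the node the arc leaves, the second is the arc's label, and the third is the node or literal it points to.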

As the name implies, RDF is a framework for resource
description; it has to be adapted in order to serve specific
communities or applications through the use of RDF Schemas,
which use the XML Namespace mechanism to unambiguously
identify the particular semantics of the property types.47

To illustrate this by example, this essay and its
authorship could feasibly be described using two RDF
Schemas, each based on a different metadata standard with
different semantics; Dublin Core element definitions could
be used for the description of the Web document, whereas the
semantics of the elements in the vCard48
scheme could be used to describe the properties of the author.
In this example, the namespace mechanism is used to specify
that property types prefixed with "DC" refer to Dublin Core
element semantics and those prefixed with "CARD" refer to
vCard semantics:
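A description of this kind can be sketched with Python's standard xml.etree.ElementTree module. The RDF and Dublin Core namespace URIs below are the published ones, but the CARD namespace URI is a placeholder rather than the real vCard namespace. The function name describe, the document URI, and the author details are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
CARD = "http://example.org/vcard#"  # placeholder, NOT the real vCard namespace

# Declare which prefix stands for which vocabulary when the XML is written out.
for prefix, uri in (("RDF", RDF), ("DC", DC), ("CARD", CARD)):
    ET.register_namespace(prefix, uri)


def q(namespace, local_name):
    """Build the '{namespace}name' form that ElementTree uses for qualified names."""
    return "{%s}%s" % (namespace, local_name)


def describe(doc_uri, title, author_name, author_email):
    """Describe a document with Dublin Core and its author with vCard-style properties."""
    root = ET.Element(q(RDF, "RDF"))
    doc = ET.SubElement(root, q(RDF, "Description"), {q(RDF, "about"): doc_uri})
    ET.SubElement(doc, q(DC, "title")).text = title
    # The creator's value is another resource, described with a second vocabulary.
    creator = ET.SubElement(doc, q(DC, "creator"))
    person = ET.SubElement(creator, q(RDF, "Description"))
    ET.SubElement(person, q(CARD, "FN")).text = author_name
    ET.SubElement(person, q(CARD, "EMAIL")).text = author_email
    return root


root = describe(
    "http://example.org/essay",  # hypothetical URI
    "An Example Essay",
    "A. Author",
    "author@example.org",
)
print(ET.tostring(root, encoding="unicode"))
```

Every property name in the output carries a prefix bound to a namespace URI, so a processor can tell a Dublin Core creator from any other vocabulary's creator without guessing.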
Using this highly extensible and robust logical framework,
rich metadata descriptions of networked resources can be created
that draw on a theoretically unlimited set of semantic vocabularies.
Interoperability for automated processing is maintained, however,
because the strict underlying XML syntax requires that each
vocabulary be specifically declared using the namespace mechanism.
In effect, RDF is a practical implementation of the Warwick
Framework, in that it supports the coexistence of heterogeneous
"packets" of metadata, but it could in principle accomplish
much more than the Warwick Framework set out to achieve —
RDF could enable the Web to evolve into a global semantic
network.
Metathreats and Metaopportunities
It is just two years since this essay was first published,
and although much progress has been made in terms of the standards
and tools to support the deployment of metadata on the World
Wide Web, practical solutions to some of the underlying social,
political and economic problems remain elusive.
This should not be too surprising — factors such as
trust, privacy, authenticity, and authority have always been
critically important in the dissemination of information,
and the ease with which the Web allows information to flow
exacerbates the need to address these issues in the networked
environment.
It can no longer be argued that the lack of metadata on the
Web is caused by a lack of standards; a range of usable metadata
standards is now available, from simple search engine "keyword"
and "description" tags, to a comprehensive architecture for
creating interoperable knowledge representations. Nor can
a lack of tools be blamed.
Creating good metadata requires time and money, but there
is little incentive for content creators to expend much of
either on the creation of metadata descriptions, because many
search engines don’t use them. The metadata that does
exist, most of which is created in good faith, is not being
used by search engines because they cannot rely on it to provide
accurate and faithful descriptions. The missing ingredient
is trust, without which the Web's resource discovery
cake has a bitter taste.
Traditionally, publishers who made fraudulent claims or who
published misleading information would end up facing either
legal action or bankruptcy, or possibly both. Most nations
have extensive legal provisions for dealing with libel, theft
of intellectual property, publication of offensive materials,
false advertising etc. in the traditional publishing industries.
In fact, there have been a number of lawsuits over disputed
uses of Web metadata49
in recent years, most notably a series of cases involving
Playboy Enterprises as both the plaintiff and the defendant.
So far, the judgments in these cases appear to have been rational
and just.
However, recourse to traditional legal measures is costly
and time-consuming, particularly across international boundaries,
and the world's judicial systems are ill-equipped to keep
up with the pace of technological change in the networked
environment.
Ultimately, the architects who are responsible for the ongoing
development of the Web are also responsible for enabling the
exchange of trust in the Web environment — governments
and legal systems do not have the right skills or resources
to accomplish this without resorting to restrictive, heavy-handed
measures.
Fortunately, various constituencies within the Web developer
community are fully aware of this responsibility, as evidenced
by the current and emerging technologies to support digital
signatures based on public key infrastructures such as VeriSign's
DigitalID,50 the CREN
Certificate Authority Service,51
and the W3C's XML-Signature initiative.52
The widespread adoption of digital signatures will ultimately
enable metadata descriptions of Web resources to be digitally
signed—the Resource Description Framework has been designed
from the outset to support digitally-signed descriptions,
for example.
Once the authority and authenticity of metadata descriptions
can be easily and reliably established, search engine and
portal providers will be much more willing to use them to
enhance the resource discovery service they provide for
their users.
Some search engine and portal builders may want to produce
their own metadata descriptions, since they can then exercise
editorial control over the style of description, the indexing
techniques and the classification or rating methods. However,
if they are not familiar with cataloging, they will rapidly
discover that there's a lot more to the art of description
than meets the eye!
Museums, libraries and archives, however, have long been
expert in the business of capturing, authenticating, and making
sense of knowledge through the description of objects and
collections, and have been trusted as providers of accurate,
impartial information for centuries. In addition to the vast
repositories of high-quality knowledge they possess, they
are also rich in the less tangible currencies of trust, credibility
and authority.
The availability of a robust, secure, and semantically-powerful
metadata architecture will not only allow "memory institutions"53
such as museums, libraries, and archives to more effectively
meet their own institutional missions in providing access
to their information treasures; it will also empower them
to fulfil a role as trusted, nonpartisan guides to the best
information the Web has to offer, and thereby act as guardians
of our shared cultural record.
Presumably man's spirit should be elevated if he can
better review his shady past and analyze more completely
and objectively his present problems. He has built a civilization
so complex that he needs to mechanize his record more fully
if he is to push his experiment to its logical conclusion
and not merely become bogged down part way there by overtaxing
his limited memory. His excursion may be more enjoyable
if he can reacquire the privilege of forgetting manifold
things he does not need to have immediately to hand, with
some assurance that he can find them again if they prove
important.54
