A paper prepared by Jenny Kena
Assignment 1A Electronic Publishing (LAR5007)
Master of Information Management
Monash University
September 1997
2. Metadata - what is it and what is its purpose?
3. The development of Metadata
4. Implementing Metadata on the Web
This paper will provide a general overview
of developments in the use of metadata to describe networked resources.
There will be an emphasis on looking at the extent to which metadata is
being implemented in the broad Internet community. The evidence suggests
that broad implementation is still a long way off. The possible reasons
for this will be examined as well as expected future developments which
may affect the wider adoption of metadata.
2. Metadata - what is it and what is its purpose?
Metadata is defined in many places as information about information, or data about data. In the context of networked information resources, it is a set of information which describes a networked resource. This information may form part of a resource, e.g. it may be contained in the HEAD section of the HTML code for the resource, or it may be held separately but linked to the resource in some way. Metadata may be created by authors, publishers, librarians or others along the information chain. It may also be created by machines. It can be very simple, e.g. a description and a few keywords, or it can be quite complex, e.g. a full MARC record following detailed rules.
Metadata has been around for years
as a way of describing items in a library collection - a catalogue entry
- so that the items are uniquely identified and able to be discovered and
then physically located. As more and more information becomes available
electronically - housed on various networks and accessible through the
Internet - discovery of specific information resources has become increasingly
difficult. The need for a standard way of describing networked resources
to make discovery easier has led to the development of metadata schemes,
the most widely recognised being the Dublin
Core Metadata System.
3. The development of Metadata
Some of the issues covered in the literature relating to the development of metadata include -
Lorcan Dempsey and Rachel Heery carried out a review of current resource description methods. (Dempsey & Heery, 1997). In this review they mention the need for metadata as a means of managing repositories of information so that contents are consistently disclosed. They describe metadata as typically supporting a number of functions including location, discovery, documentation, evaluation and selection.
Paul Miller (1996), in his article Metadata for the masses, discusses the problems of finding information on the Internet and the huge volume of irrelevant sites found using a common search engine.
Several writers, including Susanne Moir and Andrew Wells (Moir & Wells, 1996), mention the importance of metadata for describing non-textual resources. Textual resources can be discovered on the Internet using browsers but non-textual resources need some associated description for them to be found. As they say, a digital audio recording does not index itself. They discuss other justifications for metadata (which they prefer to describe as 'surrogates'): improved performance of systems (metadata records are smaller than the objects they represent), content (including subject headings and book reviews can enhance discovery), economics (metadata assists potential purchasers in their decision to obtain an item) and intellectual property (controlled access to intellectual property while at the same time representing it).
Using a sample query on a number of search engines, Bipin Desai (Desai, 1997) examines the problems of indexing of Internet resources. He sees one of the main problems with the search engines as lack of context, or the ability to specify context when carrying out a search.
Carl Lagoze (Lagoze, 1996) also refers to the shortcomings of the search engine robots giving the example of carrying out a search for 'Mercury' and the problem of not being able to specify the context.
Dempsey & Heery expect that metadata will be created by authors, repository managers and third party creators. For metadata to be created by authors, there will need to be agreement about the use of the META tags in HTML documents for embedding metadata which will be harvested by programs. It will be important for authors to create metadata, as this is the most cost-effective method. The creation of full, structured metadata is expensive.
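To illustrate what harvesting embedded metadata "by programs" might involve, the following is a minimal sketch in Python using only the standard library. It is not taken from any of the papers reviewed. The sample page, the author name and the "DC." prefix on the element names are assumptions made purely for illustration, since the agreement on META tag usage referred to above had not yet been settled.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collect name/content pairs from META tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.records = []  # (name, content) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.records.append((name, content))


def harvest(html_text):
    """Return the (name, content) pairs embedded in html_text."""
    harvester = MetaHarvester()
    harvester.feed(html_text)
    harvester.close()
    return harvester.records


# Hypothetical page; the "DC." naming convention is an assumption.
SAMPLE_PAGE = """<HTML><HEAD>
<TITLE>Metadata overview</TITLE>
<META NAME="DC.title" CONTENT="Metadata overview">
<META NAME="DC.creator" CONTENT="A. N. Author">
<META NAME="keywords" CONTENT="metadata, Dublin Core">
</HEAD><BODY><P>Body text.</P></BODY></HTML>"""

records = harvest(SAMPLE_PAGE)
for name, content in records:
    print(name, "=", content)
```

A harvester of this kind only collects what authors have supplied. It does nothing to check that the embedded values are accurate.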
Desai suggests that whatever system
is used, it is difficult to follow a traditional centralised approach due
to the enormous number of resources involved and hence the time and cost.
In a distributed system such as the Internet, he suggests it would be more
natural to have the providers of resources prepare the metadata using a
standardised system. For this to be successful, the entry and search systems
have to be supported by easy-to-use graphical interfaces.
Levels of metadata - striking a balance between simple and complex
The tension between simplicity and functionality in metadata is discussed by Dempsey & Heery. They describe three bands of metadata. Band one is unstructured indexes - the data currently created by web crawlers. These can be reasonably effective for finding a known item but less effective for discovery. Band two includes data which contains a full enough description to allow a user to assess the usefulness or interest of a resource without having to retrieve it or connect to it. The Dublin Core is included in this band. Band three includes fuller descriptive formats which may be used for location and discovery but also have a role in documenting objects. They suggest that the trend is for the middle band to become more important as a general-purpose access route.
Paul Miller discusses the problems of complex metadata standards such as MARC which are acceptable in a traditional library with professional staff but not suited to the chaotic online world where new resources appear all the time, often created and maintained by interested individuals rather than centrally funded organisations. Again, he sees the Dublin Core as a compromise between the hit and miss of the search engines and the complexity of schemes such as MARC.
Moir and Wells also discuss the use of MARC as a format for metadata concluding that although it is possible to use it, it is costly to create and maintain.
Desai proposes two index metadata structures for indexing and supporting search and discovery on the Internet - the Dublin Core Elements List and the Semantic Header. The Semantic Header is proposed because the Dublin Core does not provide an abstract.
In discussing the use of the Dublin Core, Lagoze suggests it is descriptive enough to help in finding resources but also easy for authors and publishers to use. His paper related to the Warwick Workshop which was the second of the Dublin Core Metadata Workshops and resulted in the Warwick Framework - a container architecture for diverse sets of metadata. The Warwick Framework was devised to accommodate the linking of various specialist schemes to the basic Dublin Core set.
Each of the Dublin Core Metadata Workshops (of which the fifth is about to be held in Helsinki in early October 1997) appears to have grappled with the issue of keeping it simple versus the need for more standards and qualifiers. Stuart Weibel and others (Weibel, 1997) discuss the tension between the minimalist and structuralist camps at the Workshop held in Canberra in March 1997. A resource description continuum is described which has full-text indexing at one end and richly-structured surrogates at the other. The Dublin Core community is located in the middle of the continuum with the minimally fielded surrogates. They suggest that every point in the continuum involves some compromise relating to cost, ease of creation and maintenance, and utility. They suggest that every decision to employ additional qualifiers should be measured against the question "Will the qualifier improve discovery?". They claim that the most important data in the metadata record is the Element Value itself and suggest that an index of the undifferentiated collection of these values would probably serve for many resource discovery purposes or, in other words, if you just strung all the words from a metadata record one after the other in no structure at all, this would still be a great leap forward for discovery by a search engine. The need for a Web Metadata Architecture is also discussed and the various initiatives underway in this regard.
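The idea of indexing the undifferentiated collection of element values can be sketched in a few lines of Python. This is my own illustration of the principle rather than anything specified in the Workshop report. The record identifiers, element names and values are hypothetical, and the word-splitting rule is a deliberately crude assumption.

```python
import re
from collections import defaultdict


def words_in(text):
    """Split text into lower-cased alphanumeric words (a crude tokeniser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_flat_index(records):
    """Map each word to the set of record ids whose element values contain it.

    records maps a record id to a dict of element name -> element value.
    Element names are deliberately ignored: only the values are indexed.
    """
    index = defaultdict(set)
    for record_id, elements in records.items():
        for value in elements.values():
            for word in words_in(value):
                index[word].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every word in the query."""
    words = words_in(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical metadata records for two unrelated resources.
sample_records = {
    "doc1": {"title": "Mercury in the solar system", "subject": "astronomy planets"},
    "doc2": {"title": "Mercury poisoning", "subject": "toxicology chemistry"},
}

flat_index = build_flat_index(sample_records)
print(search(flat_index, "mercury"))
print(search(flat_index, "mercury planets"))
```

Even without field structure, the extra words contributed by a subject element allow a searcher to narrow a query such as 'Mercury' by adding a second term, which is the kind of improvement the authors have in mind.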
According to Dempsey & Heery, success of metadata will partly depend on the development of more sophisticated automatic extraction techniques i.e. search engines that recognise metadata and give it special meaning and treatment. They describe standards-based resource discovery services as still being in the early stages. The one exception they cite is MARC-like formats which are already used widely around the world and have infrastructure in place to support them. They also mention the difficulty in predicting future scenarios because of the lack of a single driving agency and the need to take into account vested interests, competitive advantage, integration with legacy systems and existing custom and practice.
Some of the implementation issues Miller identifies for the use of Dublin Core metadata are the need for enhancements to the existing definitions of the META tag in HTML, the need for the creation of metadata to be automated and the need for search engine producing companies to develop their software to make full use of Dublin Core-compliant web pages. At that stage, he thought that the latter development could not be far off.
Metadata and PICS are discussed in an article by Chris Armstrong (Armstrong, 1997). PICS (Platform for Internet Content Selection) is a set of technical standards developed by the World Wide Web Consortium (W3C) to enable distribution of descriptions of digital works in a simple, computer-readable form. Although the original use of PICS was for filtering content for children, it was later realised that PICS labels could be used not only to restrict access to certain sites but also to select access to specific sites. Armstrong suggests adding a label to sites which indicates the quality of the information on the site. This could assist in finding desirable materials if this information could be picked up by search engines. Paul Resnick (Resnick, 1997) raises a problem with labelling systems, suggesting that they may tend to stifle noncommercial communication, as many sites of limited interest will probably go unlabelled and therefore not be retrieved if searching for sites with certain labels becomes the default way of searching.
4. Implementing metadata on the Web
The majority of the discussion in the
literature appears to relate to the detail of metadata standards or the
use of metadata for subject-specific projects within organisations or fields
of knowledge with only passing reference to the issues involved in the
widespread implementation of metadata across the whole Internet community.
With reference to the aspects that were raised in the previous material
and some other sources I would now like to look in more detail at some
implementation issues. In looking at these issues, the suggestion is that
the following are important aspects of implementation -
Making it easy for authors/publishers to use metadata
Some of the methods that could be used to encourage authors and publishers to include metadata in the resources they publish on the Internet are templates, inclusion of shortcuts for meta tags in standard web authoring packages, self-regulation or government regulation requiring that metadata be included, promotion of the benefits of metadata, and education for authors and publishers.
There are several templates available on the Internet which assist authors to create metadata. The Nordic Metadata Project provides the Dublin Core Metadata Template. Aimed at the Nordic "Net-publisher" community, the template is an easy-to-use form, which is claimed to result in high quality metadata. To demonstrate the usefulness of providing metadata, the "Nordic Web Index" indexes the information after the form is submitted. The template is still in a test state. Another similar service is the Dublin Core Generator. This service will retrieve a Web page and automatically generate Dublin Core HTML <META> tags for embedding in the HEAD section of Web pages. Other templates are available for particular subject communities e.g. EdNA has a template for entering metadata into HTML files.
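The output of such templates and generators is a set of META tags ready to be pasted into a page. As a rough sketch only, and not the actual behaviour of the Nordic template or the Dublin Core Generator, the Python function below renders a set of element values as META tags. The "DC." prefix, the upper-case tag layout and the sample values are assumptions for illustration.

```python
from html import escape


def dublin_core_meta_tags(elements):
    """Render element/value pairs as META tags for a page's HEAD section.

    elements maps an element name (e.g. "title") to its value. The "DC."
    prefix used here is an assumed naming convention, not a fixed rule.
    """
    lines = []
    for element, value in elements.items():
        # Escape quotes and angle brackets so values cannot break the tag.
        name = escape("DC." + element, quote=True)
        content = escape(value, quote=True)
        lines.append(f'<META NAME="{name}" CONTENT="{content}">')
    return "\n".join(lines)


# Hypothetical values an author might type into a template form.
tags = dublin_core_meta_tags({
    "title": 'Metadata "overview"',
    "creator": "A. N. Author",
})
print(tags)
```

The point of a template is that the author supplies only the values. The syntax, including the escaping of awkward characters, is handled by the tool.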
Generally, commercially available Web authoring packages do not include shortcuts for Meta tags. Netscape Gold makes it easy to include some meta information using Document Properties. Fields for Title, Author, Description, Keyword and Classification are included. There is apparently at least one producer of HTML authoring tools (SoftQuad, Ltd) which is committed to embedding Dublin Core resource description templates in their products when the syntax and guidelines are sufficiently stable (Dempsey & Weibel, 1996).
I am not aware of any moves to make inclusion of metadata a matter of either self-regulation or government regulation for authors or publishers, nor am I aware of promotion of metadata to this community or education for authors and publishers. However, one relevant project in this area is BIBLINK. BIBLINK was launched in April 1996 with funding from the EC and aims to establish a relationship between national bibliographic agencies and publishers of electronic material in order to establish authoritative bibliographic information that will benefit both sectors. The idea is that publishers include bibliographic information in their electronic publications.
Web robots and search engines and metadata
The evidence on the treatment of metadata by Web robots is not encouraging. In his paper on Dublin Core management, Andy Powell (Powell, 1997) states: "Note that none of the big search engines, as far as I'm aware, look for Dublin Core META tags yet! That's not to say that they won't index the words found in Dublin Core META tags, but they don't currently give those words any special significance." Given that metadata has been discussed at high levels in the Internet community for some years now and that search engine companies have apparently been part of these discussions, this situation seems quite extraordinary.
The misuse of metadata has put some search engines off using it altogether. The Help text for Excite states: "Our spider doesn't honor meta tags. We believe our decision protects our users from unreliable information... If the user can't see or use it, we don't bother to index it or search on it." They go on to give examples of misleading information being included in meta tags to attract people to a site, such as a real estate firm using the following statement in a meta tag: "This site offers high quality information about how to buy residential real estate. Our experts can help beginner home buyers save money." If this type of misuse were widespread and all search engines made this type of decision, metadata would become useless in the Internet community.
Gillian Westera (Westera, 1997) did a comparison of search engine user interface capabilities which included the use of meta tags. Of the seven search engines she compared, she found that Alta Vista, HotBot, Infoseek and WebCrawler used the description and keywords meta tags (I assume, to establish relevance).
The Web Robots FAQ associated with WebCrawler states: "Some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tag. We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on..."
There is some evidence that the search
engines and other key companies including Netscape and Microsoft are working
on it. Participants at the Distributed
Indexing/Searching Workshop held in
May 1996 and sponsored by the World Wide Web Consortium included representatives
from search engine and browser companies. Microsoft and Netscape have contributed
proposals to the Resource
Description Framework (RDF) specification
which is designed to provide an infrastructure to support metadata across
many web-based activities including web crawling and distributed authoring.
However, these companies do not appear to be so involved in the detail
of Dublin Core. The 5th
Dublin Core Metadata Workshop to be
held in Helsinki October 6-8 1997 does include Netscape on its list of
participants but not Microsoft or any of the search engine companies.
The 5th Dublin Core Metadata
Workshop will include project presentations relating to the use of metadata
in particular projects. So far, nineteen projects have been submitted.
All projects are associated with either a university or government organisation
(many of them libraries) except for BIBLINK which involves publishers as
well as libraries. The standard information reported for each project includes
the question - Who is creating the metadata? The Math-Net project is the
only one that mentions authors being involved. The others involved professionals
although for some this was in their capacity of supervising graduate students
creating the metadata. There is no evidence here of metadata being created
"at the coalface" by authors, not even authors within academic communities.
Other common themes in the project reports are that the metadata seems
to be mainly aimed at internal use with local search engines being used
for searching. One project is using a customised version of Alta Vista
to search its site locally. Also, each project seems to be developing its
own templates and other authoring tools to make metadata creation easier.
Although it is widely recognised that there is a need to improve resource discovery on the Internet and that the use of metadata could make a difference, there are barriers to its widespread implementation. One of these relates to Internet software. With a few large companies now having such enormous control over Web applications, it may be impossible to implement metadata widely until they settle their commitment to it. In the meantime, metadata is finding some useful applications in the description of specialised network resources.
Armstrong, Chris, 1997, Metadata, PICS and quality, http://www.ariadne.ac.uk/issue9/pics
BIBLINK, 1996, http://www.ukoln.ac.uk/metadata/biblink/
Dempsey, Lorcan, 1996, Meta Detectors http://www.ariadne.ac.uk/issue3/metadata/intro.html
Dempsey, Lorcan and Rachel Heery, 1997, Specifications for resource description methods. Part 1. A review of metadata: a survey of current resource description formats, http://www.ukoln.ac.uk/metadata/desire/overview/
Dempsey, Lorcan and Stuart L Weibel, 1996, The Warwick metadata workshop: a framework for the deployment of resource description, D-Lib Magazine, July/August 1996, http://www.dlib.org/dlib/july96/07weibel.html
Desai, Bipin C, 1997, Supporting discovery in virtual libraries, Journal of the American Society for Information Science, v48(3), Mar 1997, p190-204
Dublin Core Generator, 1997, http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
Dublin Core Homepage, 1997, http://purl.oclc.org/metadata/dublin_core
Dublin Core Metadata Template [Nordic Metadata Project], 1997, http://www.ub2.lu.se/metadata/DC_creator.html
The 5th Dublin Core Metadata Workshop Helsinki, Finland, October 6-8, 1997, http://linnea.helsinki.fi/meta/DC5.html
Getting listed on Excite [Excite Help Text], 1997, http://www.excite.com/Info/listing.html#anchor4877066
Koster, Martijn, 1997, The Web robots FAQ..., http://info.webcrawler.com/mak/projects/robots/faq.html
Lagoze, Carl, 1996, The Warwick Framework: a container architecture for diverse sets of metadata, http://www.dlib.org/dlib/july96/lagoze/07lagoze.html
Metadata on EdNA, 1997, http://www.edna.edu.au/edna/owa/info.getpage?sp=&pagecode=5210
Miller, Paul, 1996, Metadata for the masses, http://www.ariadne.ac.uk/issue5/metadata-masses/intro.html
Moir, Susanne and Andrew Wells, 1996, Descriptive cataloguing and the Internet: recent research, Cataloguing Australia, v22(1-2), Mar-Jun, 1996, p8-16
Powell, Andy, 1997, Dublin Core management, http://www.ariadne.ac.uk/issue10/dublin/intro.html
Report of the Distributed Indexing/Searching Workshop Cambridge, MA, May 19-28, 1996, http://www.w3.org/Search/9605-Indexing-Workshop/
Resnick, Paul, 1997, Filtering information on the Internet, Scientific American, March 1997, http://www.sciam.com/0397issue/0397resnick.html
Resource Description Framework (RDF), 1997, http://www.w3.org/RDF/
Weibel, Stuart, 1995, Metadata: the foundations of resource description, D-Lib Magazine, July 1995, http://www.dlib.org/dlib/July95/07weibel.html
Weibel, Stuart, Renato Iannella and Warwick Cathro, 1997, The 4th Dublin Core Metadata Workshop report, D-Lib Magazine, June 1997, http://www.dlib.org/dlib/june97/metadata/06weibel.html
Westera, Gillian, 1997, Comparison of search engine user interface capabilities, http://www.curtin.edu.au:80/curtin/library/staffpages/gwpersonal/senginestudy/zcompare.htm
Links checked 23rd October 1999