Metadata for your Digital Resource

Content written on: 8th June 2004 by Iain Wallace and Eileen Maitland.
Content updated: 8th June 2004 by Iain Wallace and Eileen Maitland.
Introduction

Documentation is a crucial part of any digitisation project. Careful recording of all aspects of a digital collection and the circumstances surrounding its creation can make the difference between a resource which has limited value beyond the context in which it was originally created, and one whose value extends far beyond this context and may be used extensively by the academic community in perpetuity.

This paper will discuss different forms of documentation, from unstructured information to resource discovery and preservation metadata. The paper should enable anyone embarking on a digitisation project to make informed choices about how to successfully document their digital resources.

Readers of this Information Paper may also be interested in the article on Guidelines for Documenting Data, which discusses less structured forms of documentation at greater length.

What do we mean by Documentation?

At its most general, a resource's documentation should outline the reasons for and circumstances surrounding its creation. As an absolute minimum, the documentation should provide details of the resource's provenance, contents, structure, and the terms and conditions that apply to its use.

Why bother with Documentation?

As creator of the resource, it is natural that your main concern will be with its primary data, and the value this offers to the academic community. However, in order to maximise that value, both now and in the future, it is just as important to take pains to provide adequate accompanying information. Anyone with relevant computer skills but no knowledge of your resource should be able to find it, and then exploit it fully and effectively. In this way, the potential for its future re-use within other contexts and for an audience well beyond its initial target community is significantly enhanced. Creating this documentation should be as integral a part of your project as the research on which it is focused - not an add-on or afterthought when the main body of work has been completed.

Documenting your project

The digital resources you create should be described using a structured metadata schema. An account of how to approach metadata creation follows later in this paper. However, most research projects require or produce some documentation which does not fit within the conventions of a metadata schema. This type of information may be far less structured but forms, nevertheless, a valuable part of the deposit, and can enhance its future use significantly. The vital role played by informal contextual documentation might not immediately appear to merit the priority or consideration given to other areas of the project. However, the absence of such material can in itself render the resource meaningless.

What do we mean by contextual documentation?

Contextual documentation can include any unstructured material which does not comprise part of the resource itself, but which supports and enhances its use. For example, documentation relating to the provenance of a data collection should include how, why, when and by whom the data collection was created and used. This type of information would include: the aims and objectives of the research, the funding arrangements supporting it, its scope and subject matter, related research, strengths and weaknesses, methodologies chosen, rights associated with material used, and any keys, codes, guides, glossaries, abbreviations or encryption schemes needed to understand the resource(s) created.

A data collection's intellectual context should be documented thoroughly enough to enable someone who has not been involved in the project to understand the intellectual framework in which it was created.

Information about contents, structure and terms and conditions can generally be recorded in a much more structured way using metadata.

What exactly is Metadata?

In terms of a traditional library environment, metadata would be described as cataloguing information. Metadata is a more recent term, which relates more specifically to digital resources. It is information relating to and describing other information, or data about data, and has been described as "the sum total of what one can say about any information object at any level of aggregation." (Baca, ed. 2000)

What does metadata do?

Metadata summarises not only the content of a resource, but a whole range of factors associated with its creation, content, context and structure. The metadata can be separate from the resource it describes, or it can actually be embedded within it. This was the case even before the advent of digital resources, as we can see in things like the CIP (Cataloguing in Publication) data on the back of a book's title page, or the information printed on the label of a vinyl record. Even right-clicking on a digital image on a website will immediately provide you with a moderate amount of metadata, telling you what type of image it is (e.g. jpeg, gif etc), its web address, its size (no. of bytes), dimensions (in pixels), and dates of creation and modification.

How is metadata different from other kinds of documentation?

Metadata is distinguishable from other forms of documentation in that it is structured and exhibits consistency. Standardisation in the way that metadata is created means that information about resources can be presented in a meaningful and consistent way, and is crucial for effective resource discovery. It is also vital in facilitating interoperability, which allows integrated access to and searching of a wide range of resources across different systems. Standardised structures for organising and presenting metadata are known as schemas. Dublin Core (described later in this document) is one such schema, which comprises 15 key elements.

Why is metadata important?

Metadata enriches the resource it describes by extending the user's understanding of its content and the factors surrounding its creation. It places the resource in context and provides a background to it, thus enhancing the user's appreciation and understanding of it.

Metadata also extends the usefulness of the resource to the wider research community by facilitating access to it beyond the confines of the individual project or institution in which it was created.

Metadata enhances the value of the resource to researchers by enabling them to locate it, and by allowing them to make informed decisions as to whether or not a particular resource is relevant to their purposes.

Metadata allows the resource to be managed effectively by the party responsible for it.

Recording the resource's physical features and inherent qualities, together with the events and activities to which it has been subjected since its creation, significantly increases its viability as a useful and usable resource over the longer term.

What different kinds of metadata are there?

1. Resource discovery metadata

This type of metadata is primarily related to the content of the resource, and describes the resource in such a way as to allow it to be located in a search, and differentiated from other, similar resources.

2. Preservation metadata

Preservation metadata can be broadly divided into the categories "technical" and "administrative", and basically comprises any information essential to continued use of the resource. Technical metadata documents the resource's history, such as the processes involved in its creation (e.g. file formats, date of digitisation etc.) or any manipulation it has undergone (e.g. colour adjustments to an image). Administrative metadata includes anything related to its management, delivery or distribution, such as rights information.

3. Metadata at different levels

Metadata can describe resources at different levels of aggregation. One record can refer to a whole collection, or could be confined to a single item.

What are Metadata Standards?

Metadata standards are designed to impose structure and consistency on the way metadata is recorded. This consistency ensures accuracy and reliability in information retrieval and allows users to cross-search different disciplines, collections and domains by promoting interoperability. Whatever approach is adopted in terms of developing metadata, it is crucial to use established standards as part of the process.

A metadata standard is a specification that outlines a set of fields or elements, each of which is designed to contain information on a particular aspect of the resource.

The standard defines a meaning for each element, and guidelines as to its application.

Can't I just make it up myself?

Intimate knowledge of your resource is no guarantee that your description of it will make sense and be meaningful and accessible to the wider world. In order to do that, you must be, in a sense, describing it in the same terms as others are describing theirs. It is acceptable to use your own set of fields for recording data about your digital resources, as long as your in-house schema can be mapped to existing metadata standards.

However, inconsistency and a lack of precision in description and data entry can lead to resources being missed in searches (or appearing as irrelevant in a list of results, which is just as bad), and will not enhance their value to the research community - possibly the reverse.

To ensure consistency within descriptive elements, from the way in which a date is expressed, to the use of corporate and personal names, established standards should be adhered to where possible. Even something as simple as spelling can cause problems - simple carelessness means there is always a good chance of finding a bargain on eBay! For example, try searching for "plam pilots".
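
As a minimal sketch of what consistent date entry might look like in practice, the Python fragment below converts a few common ways of writing a date into the yyyy-mm-dd form. The function name `to_iso_date` and the list of accepted input styles are assumptions made for illustration, not part of any standard, and month names are assumed to be in English.

```python
from datetime import datetime

# Hypothetical list of input styles a project might encounter; extend as needed.
# "%B" matches full month names in the current locale (English is assumed here).
KNOWN_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d/%m/%Y", "%B %d, %Y")


def to_iso_date(text):
    """Return the date as yyyy-mm-dd, or raise ValueError if it is not recognised."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known style
    raise ValueError(f"Unrecognised date: {text!r}")


print(to_iso_date("9 August 1973"))
print(to_iso_date("09/08/1973"))  # read as day/month/year, by assumption
```

Rejecting anything unrecognised, rather than guessing, is a deliberate choice here: an ambiguous date that is silently misread is harder to spot later than one that is flagged at data entry.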

How will I know which standard to choose?

There is no such thing as a "one size fits all" standard, and the following considerations will all influence the selection of the most appropriate standard.

Aggregation (will the metadata describe collections/groups of resources or individual items?)

Granularity (will the metadata provide considerable detail or is a broader approach more appropriate/all that is manageable within resources available to the project?)

Context (do the resources fall into a very specialised subject grouping or are they part of a larger collection which covers a number of disciplines?)

Concept (what is the collection for? What kind of metadata will best represent the resource to its users now and also in the future?)

What kinds of standards are out there?

Different authoritative bodies have developed many different Metadata Standards. Some of these are associated with different levels of aggregation, while others relate to material in particular subjects. These are a few examples:

  • EAD (Encoded Archival Description) was developed as a means of marking up the data contained in a finding aid so that it can be structured, displayed and searched online. Basic finding aids include guides, inventories, card catalogues, checklists, shelflists, and indexes. In general, finding aids are highly structured and hierarchical, and relate to a group of materials.
  • The VRA (Visual Resources Association) metadata element set is for the description of visual materials. These might be paintings, buildings or sculpture, but in terms of a repository of information are more likely to be surrogates of those originals such as photographs and slides.
  • SPECTRUM - The UK Museum Documentation Standard
  • The RSLP Collection Description schema is a structured set of metadata attributes, for describing collections in a consistent and machine-readable way.
  • MARC - Machine-Readable Cataloguing - a standard primarily used for library catalogue data
  • The TEI (Text Encoding Initiative) is a set of tags and rules defined in XML, which describe the structure and elements of a type of document. TEI is designed for marking up electronic texts such as novels, plays and poetry.

Don't all these standards just lead to confusion?

Inevitably, with so many different systems and schemas in place, it is often necessary to be able to "translate" or "map" the elements from one system to another. Such mapping systems, which allow metadata created by one community to be used by a group using a different metadata standard are known as crosswalks. The success of any such mapping arrangement depends on the similarity between the two schemes, the granularity of elements in the target scheme compared to that in the source, and the compatibility of the rules on content within each element.
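
A crosswalk can be pictured as a simple lookup table. The Python sketch below maps a hypothetical in-house set of fields onto Dublin Core element names; the in-house field names, the `lens` field and its value are invented for illustration only. Fields with no equivalent in the target scheme are reported rather than silently dropped, since loss of detail is one of the main risks when mapping to a less granular scheme.

```python
# Hypothetical in-house field names mapped to Dublin Core element names.
CROSSWALK = {
    "photographer": "creator",
    "caption": "description",
    "date_taken": "date",
    "catalogue_no": "identifier",
    "venue": "coverage",
}


def to_dublin_core(record):
    """Map an in-house record to Dublin Core; return (mapped, unmapped)."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        target = CROSSWALK.get(field)
        if target is None:
            unmapped[field] = value  # no equivalent element in the target scheme
        else:
            # Dublin Core elements are repeatable, so values are kept in lists.
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


in_house = {
    "photographer": "Donald Cooper",
    "caption": "Vanessa Redgrave as Cleopatra",
    "date_taken": "1973-08-09",
    "catalogue_no": "4150",
    "lens": "85mm",  # invented field with no Dublin Core equivalent
}

mapped, unmapped = to_dublin_core(in_house)
print(mapped)
print("Not mapped:", unmapped)
```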

What is Dublin Core?

Dublin Core began as an initiative to improve discovery of digital resources, primarily on the Web. Dublin Core is an international protocol for resource discovery, which can encompass both digital and non-digital resource formats, and was designed to be used by individuals who do not necessarily have any kind of background in information management. It was not designed for complex resource description: its simplicity is intended to facilitate effective retrieval in a networked environment of the resources it describes, and to accommodate researchers across a range of disciplines, whose perspectives on the resources they create and require access to might be widely different.

Effectively, what Dublin Core offers is a compromise whose beauty lies in its simplicity and breadth of application. In short, Dublin Core can be regarded as metadata's lowest common denominator. Its 15 elements are as follows:

  1. Title - Name of resource
  2. Creator - Party responsible for content of resource
  3. Subject - What the resource is about - usually expressed as keywords/phrases/classification codes; should be drawn from authority list/formal classification scheme
  4. Description - Account of content of resource; could be list of contents/abstract/free text description
  5. Publisher - Person(s)/institution(s) responsible for making the resource available
  6. Contributor - Person(s)/institution(s) responsible for making contributions to the resource
  7. Date - A date associated with the life cycle of the resource; very often the date it was created or made available. Expressed (as defined in ISO 8601 [W3CDTF]) in yyyy-mm-dd format.
  8. Type - Nature or genre of the content of the resource
  9. Format - Physical or digital manifestation of the resource - includes information re associated software or hardware. Best practice - select a value from a controlled vocabulary such as the list of MIME types defining computer media formats.
  10. Identifier - A unique reference to the resource. Examples include a URI (Uniform Resource Identifier), including a URL, a Digital Object Identifier (DOI) and an ISBN.
  11. Source - Reference to a resource from which the present resource is derived. Best practice is to use a string or number conforming to a formal identification system.
  12. Language - Language of intellectual content of resource. Best practice is to use RFC 3066, which, in conjunction with ISO 639, defines two- and three-letter primary language tags with optional subtags. Examples = "en" or "eng" for English, "mar" for Marathi, and "en-GB" for English used in the UK.
  13. Relation - Reference to a related resource. Best practice is to use a string or number conforming to a formal identification system.
  14. Coverage - The extent or scope of the content of a resource. This might be spatial (a place name/geographic coordinates), temporal (period label, date or date range) or jurisdiction (e.g. a named administrative entity). Best practice is to select a value from a controlled vocabulary such as the TGN (Thesaurus of Geographic Names) and, where appropriate, to use named places or time periods in preference to numeric identifiers such as coordinates or date ranges.
  15. Rights - Information about rights held in or over a resource. This could be a rights management statement, or reference to a service providing this information. Rights information often encompasses Intellectual Property Rights, (IPR), copyright and various property rights. If this element is absent, no assumptions should be made.

Here is an example of a record describing a photograph, using Dublin Core:

  • Creator: Donald Cooper
      • Role=Photographer
  • Subject: Shakespeare, William, 1564-1616, Antony and Cleopatra [LC]
  • Description: Vanessa Redgrave as Cleopatra
  • Date: 1973-08-09
  • Type: Image
  • Format: JPEG
  • Identifier: 4150 [catalogue no]
  • Source: negative no 235
  • Relation: Antony and Cleopatra: Thompson/73-8
      • IsPartOf
  • Coverage: Bankside Globe
      • Role=Spatial
  • Rights: Donald Cooper
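
To suggest how a record like the one above might be held in machine-readable form, the following Python sketch writes its main elements out as XML using the Dublin Core element namespace. It is an illustration under stated assumptions rather than a prescribed encoding: the wrapper element `record` is hypothetical, and the qualifiers (Role, IsPartOf) are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Element/value pairs taken from the example record; qualifiers are omitted.
record = [
    ("creator", "Donald Cooper"),
    ("subject", "Shakespeare, William, 1564-1616, Antony and Cleopatra [LC]"),
    ("description", "Vanessa Redgrave as Cleopatra"),
    ("date", "1973-08-09"),
    ("type", "Image"),
    ("format", "JPEG"),
    ("identifier", "4150"),
    ("source", "negative no 235"),
    ("relation", "Antony and Cleopatra: Thompson/73-8"),
    ("coverage", "Bankside Globe"),
    ("rights", "Donald Cooper"),
]


def to_xml(pairs):
    """Serialise (element, value) pairs as namespaced Dublin Core XML."""
    root = ET.Element("record")  # hypothetical wrapper element
    for element, value in pairs:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(record)
print(xml_text)
```

A list of pairs is used instead of a dictionary because Dublin Core elements may be repeated within a record, for example where a resource has more than one Subject.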

What about Controlled Vocabularies and Thesauri?

Controlled vocabulary lists and thesauri offer consistency in terminology for use in elements like Subject (where the metadata creator wants to indicate what the resource is about). The more consistency that can be applied to this procedure, the more fruitful searches will be, both within one set of metadata records and across records held by different organisations. If multiple organisations describe their collections consistently by using terms from a controlled list, the common approach will reap great benefits during searches.

For example, if a group of resources on the history of theatre in India all attach the Subject Heading 'Theatre History, India' (which comes from Library of Congress Subject Headings), as opposed to making up their own headings (e.g. Indian Theatre History), individual records will not slip through the net during a search. It is important also to state which list your term is selected from (e.g. by putting [LC] in brackets after the heading).
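
The practice described above can be sketched in a few lines of Python: a term is accepted only if it appears in the chosen list, and the list's label is appended to it. The one-entry list below is a stand-in that uses the heading quoted in this paper; a real project would load the published vocabulary, and the function name `tag_subject` is an assumption for illustration.

```python
# A tiny stand-in for a controlled list, keyed by the label of its source.
# A real project would load the published vocabulary, not hard-code terms.
CONTROLLED_TERMS = {
    "LC": {"Theatre History, India"},
}


def tag_subject(term, scheme="LC"):
    """Return 'term [scheme]' if the term is in the list, else raise ValueError."""
    if term not in CONTROLLED_TERMS.get(scheme, set()):
        raise ValueError(f"{term!r} is not in the {scheme} list")
    return f"{term} [{scheme}]"


print(tag_subject("Theatre History, India"))

try:
    tag_subject("Indian Theatre History")  # a home-made heading
except ValueError as error:
    print("Rejected:", error)
```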

There are also controlled lists for terms within particular disciplines. These are produced by authoritative bodies, and are often available online. The National Monuments Record (NMR) Thesauri, the Humanities and Social Science Electronic Thesaurus (HASSET), the Art & Architecture Thesaurus (AAT), the Union List of Artists Names (ULAN) and the Thesaurus of Geographic Names (TGN) are all examples of these. The last three were all developed at the Getty Research Institute in California, which promotes innovative scholarship in the arts and humanities.

In addition to these, there are also established standards for expressing elements like Date, Type and Language, such as ISO 639 for language abbreviations and RFC 2045 and 2046 for Media (MIME) types.
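
For the Format element, a MIME type can often be derived from a file's extension instead of being typed by hand. The sketch below uses Python's standard `mimetypes` module; the file name is hypothetical, and the lookup table the module consults can vary a little between platforms, so the result is best treated as a suggestion to be checked.

```python
import mimetypes


def format_for(filename):
    """Guess a MIME type for the Format element from the file extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"No MIME type known for {filename!r}")
    return mime_type


print(format_for("cleopatra.jpg"))  # hypothetical file name
```

Note that this looks only at the extension, not at the file's contents, so a misnamed file will be given the wrong type.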

Further Issues to Consider

Each project will have different metadata requirements. The schema employed should be tailored to the characteristics particular to the project's digital resources and can be mapped later on to standards such as Dublin Core for the purposes of interoperability. These standards should be seen less as "off-the-shelf" commodities and more as reference points to which bespoke schemas can map. Metadata must be "fit for purpose" - think carefully about the level of complexity required to describe your resource. The content and number of fields to be used in a specific record may vary according to the requirements of a particular collection and the nature of the individual digital object. This important flexibility allows the time and attention given to cataloguing to vary according to the size, significance and location of the digital resource being described.

There are now quite a number of software tools (many of them online) available to aid the process of metadata creation. DC.dot (a web-based tool for creating Dublin Core tags for networked resources), The Nordic Metadata Project DC Template, and the RSLP Collection Level Description metadata generator are all examples of these.

It is very important to have consistency and authority in your metadata records. The most important aspect of metadata standardisation is not that all records must contain the same fields, but that where the same field exists in records belonging to different collections then it should be used for the same purpose with the same standards.

So why do we need Preservation Metadata as well?

There are few environments that change as rapidly as the digital environment. The speed with which technical innovations emerge and processes change is such that within a very few years hardware and software on which valuable data has been stored can be rendered obsolete. Fortunately, solutions to meet these challenges are constantly being developed and updated. Nevertheless, whatever the particular circumstances, it is never enough simply to have the resource to hand. If you do not also have the technical and administrative information which provides the background to its creation, delivery, operation and administration, you cannot be said to have full access to it, and it cannot be said to have been preserved. Metadata which informs the user of the technical context for the resource (its file format, file size, associated software, version etc), and other information (e.g. copyright information) crucial to its ongoing management, is as integral a part of its preservation as its physical security. Without this information, some or all of the following vital questions, and others, will remain unanswered:

  • what is the resource?
  • how can it be used?
  • how has it been changed?
  • who has been involved in its creation/alteration?

What purpose does Preservation Metadata serve?

Five key functions of preservation metadata identified by the National Library of Australia are as follows:

  • To store technical information that supports preservation decisions and action
  • To document preservation action taken (e.g. migration or emulation)
  • To record effects of preservation strategies
  • To ensure authenticity of digital resources over time (e.g. by using digital signatures)
  • To note information on Collection and Rights Management
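
As one small illustration of the authenticity function listed above, the Python sketch below records a SHA-256 checksum for a file and later checks the file against it. A checksum is a simpler technique than the digital signatures mentioned in the list: it can show that a file has changed, but not who changed it. The function names are assumptions made for this example.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Compare a file's current checksum with the one stored in its metadata."""
    return sha256_of(path) == recorded_checksum
```

In use, the value returned by `sha256_of` would be stored in the preservation metadata when the resource is deposited, and `is_unchanged` run against it at intervals or after each migration.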

Preservation Metadata Initiatives

As digital resources become an ever more dominant method of recording and disseminating information in academia and beyond, the management of preservation metadata is inevitably an area of increasing concern to those responsible for such resources. There are various initiatives attempting to develop a framework for preservation metadata, including OCLC/RLG, CEDARS (Leeds University), PADI (National Library of Australia) and NEDLIB (based in the Netherlands).

There is as yet no agreed standard, but the general principles governing discussions within these groups do seem to be moving in the same direction, and are in some cases already converging. The "CEDARS Guide to Preservation Metadata", for example, is closely based on the OCLC Preservation Metadata Framework, which is very much at the forefront of developments in Preservation Metadata.

What does Preservation Metadata look like?

This depends on the resource to which it relates. Technical metadata relating to an image, for example, might record features such as its format (e.g. JPEG), bit depth (e.g. 24-bit), and colourspace (e.g. RGB, CMYK, etc.).

A record for an audio file would record the bit rate, no. of channels, sample rate, etc., and one for a video would provide details of its resolution, format of video content (codec used - e.g. DIVX), the format of the video sound (e.g. .wma, .mov, .ra, etc.).
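
Much of this technical information can be read from the file itself. As a rough sketch, the Python function below uses the standard `wave` module to collect the audio properties mentioned above from an uncompressed WAV file. The dictionary keys are names chosen for this example and not taken from any preservation metadata standard; other audio formats would need other tools.

```python
import wave


def wav_technical_metadata(path):
    """Read basic technical metadata from an uncompressed (PCM) WAV file."""
    with wave.open(path, "rb") as audio:
        channels = audio.getnchannels()
        sample_rate = audio.getframerate()         # samples per second
        bits_per_sample = audio.getsampwidth() * 8  # width is given in bytes
        frames = audio.getnframes()
    return {
        "channels": channels,
        "sample_rate_hz": sample_rate,
        "bits_per_sample": bits_per_sample,
        # For uncompressed audio the bit rate follows from the other values.
        "bit_rate_bps": sample_rate * channels * bits_per_sample,
        "duration_seconds": frames / sample_rate,
    }
```

For a CD-quality stereo recording (44,100 samples per second, 16 bits, two channels) the calculated bit rate is 44,100 × 2 × 16 = 1,411,200 bits per second.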

This sounds complicated!

But the good news is that the AHDS can undertake some of this preservation documentation work for you. On receipt of a collection the AHDS will add metadata relating to the future handling of the resource. When such information is added to the preservation metadata developed by a resource creator (recording details such as the original technical specifications of the resource and information on collection and rights management), a significant body of data has been created to help ensure the long-term maintenance of the resource.

The Future

The emergence of the Semantic Web, which concerns the way information is created and organised, makes metadata creation a more crucial activity than ever before.

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Berners-Lee et al, 2001)

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming. In this way metadata can be easily re-used in a number of different ways by intelligent agents and other programs, as well as human users.
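
In RDF, each piece of metadata is a statement with three parts: the resource being described, the property, and the value, with URIs naming the first two. The Python sketch below writes two such statements about a photograph. For brevity it uses the line-based N-Triples notation in place of the XML syntax mentioned above; the resource URI is hypothetical, and only quotation marks and backslashes are escaped.

```python
# Namespace of the Dublin Core elements, used to build property URIs.
DC = "http://purl.org/dc/elements/1.1/"


def n_triple(subject_uri, predicate_uri, literal):
    """Format one RDF statement with a plain literal value as an N-Triples line."""
    # Minimal escaping only: backslashes and double quotes.
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_uri}> <{predicate_uri}> "{escaped}" .'


resource = "http://example.org/photographs/4150"  # hypothetical URI

statements = [
    n_triple(resource, DC + "creator", "Donald Cooper"),
    n_triple(resource, DC + "date", "1973-08-09"),
]

for line in statements:
    print(line)
```

Because every property is identified by a URI, not a local field name, a program that has never seen this collection can still recognise that the first statement names a creator.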

Developments in metadata schemas continue to expand rapidly, as the effort to keep pace with and manage the explosion of digital materials afforded by new technology continues. Every day new groups of users and creators emerge, with their own unique perspectives and requirements - and consequently new schemas are developed or existing ones adapted to meet these.

Summary

The creation of good quality metadata (to be embedded within a digital resource or to accompany it to the repository) should always go hand in hand with that of the object(s) it describes. This, more than any other activity in the digitisation process, will add value to a digital object as a resource for teaching, learning and research, aid its promotion throughout the educational community, and increase its longevity. Metadata is the key to ensuring that your resource is meaningful and accessible both now and in the future, and should be a central consideration from the outset by anyone undertaking a digitisation project, with its development costed in as an integral part of the project plan.

What steps can I take to ensure that the metadata I provide is of a high standard?

Contact the AHDS as early as possible in a project that will result in a deposit. We can provide advice and guidance on creating and submitting metadata which will maximise the usefulness and viability of the material it describes.

Links and Bibliography

Murtha Baca (ed.), 2000, Introduction to Metadata: Pathways to Digital Information

Encoded Archival Description (EAD)

VRA Core Categories

Spectrum

RSLP Collection Description

MARC

Text Encoding Initiative (TEI)

Crosswalks : mapping between schemas

Dublin Core

ISO 8601

RFC 3066

ISO 639

Library of Congress

Art & Architecture Thesaurus (AAT)

National Monuments Record Thesauri

Thesaurus of Geographic Names (TGN)

Humanities & Social Science Electronic Thesaurus (HASSET)

Union List of Artists Names (ULAN)

MIME types

DC.dot

Nordic Metadata Project DC Template

OCLC/RLG

PADI (National Library of Australia)

CEDARS (Leeds University)

NEDLIB

Guidelines for Documenting Data

Tim Berners-Lee, James Hendler, Ora Lassila, 2001, The Semantic Web, Scientific American

Gail Hodge, Metadata made simpler

Daniel Gelaw Alemneh, Samantha Kelly Hastings, and Cathy Nelson Hartman, "A Metadata approach to the preservation of digital resources", First Monday

Cory Doctorow, Metacrap: Putting the torch to seven straw men of the meta-utopia