An Introduction to Metadata

Last updated: 07 January 2010
Published in: Managing your digital resources
Tags: business & community engagement | metadata

Summary

This is the first in a series of advice documents about metadata. In general the documents are aimed at those developing managed and sharable digital collections and are of use to those creating still image, moving image or audio collections. This first document defines metadata and introduces some of the key themes and issues that are dealt with in more depth later on.

Introduction

This first document defines metadata and introduces some of the key themes and issues that are dealt with in more depth later on. The full list of the other advice documents in the series is:

What is metadata?

Metadata is often defined as ‘data about data’ or ‘information about information’. In the digital world, metadata is usually structured textual information that describes something about the creation, content, or context of an individual file or collection of many digital files.

Metadata might take the form of controlled terminology, carefully constructed or chosen from formal lists and entered into pre-established categories. Or it might be simply a free text description or set of keywords used to annotate or ‘tag’ an image. It might describe something objective and straightforward, such as the file size of the digital file; or something much more complex, such as the subject matter of the resource or legal rights associated with its use. Metadata is often held within databases, but it can take other forms - it can just as easily be found embedded within the digital file itself.  In short, metadata provides the means for us to describe our digital resources in a structured way that enables us to share those resources with other people and machines.

Selective

Metadata invariably offers a selective or simplified description of a resource. The Oxford English Dictionary defines metadata as “data that operates at a higher level of abstraction”. If “a picture paints a thousand words” (or more!) it is clear that our text-based descriptions will only ever partially capture the information or meanings held within a digital resource - let alone all the other information that might be associated with it (e.g. the history of its creation, its relationship to other resources, or possible uses to which it might be put). The challenge for those applying metadata to a digital collection is to work out which information is going to be the most important and useful to record.

Structured

Metadata is usually structured in some way. Rather than randomly associating terms with the digital file, it is common to use a set of generic categories (e.g. ‘Creator’, ‘Title’, ‘Subject’) and then assign specific terms within those categories (e.g. ‘Creator: Leonardo da Vinci’, ‘Title: Mona Lisa’, ‘Subject: woman’). The example below relates to still image metadata.

[Image: Leonardo da Vinci's Mona Lisa. Source: Wikipedia Commons]

Metadata schema categories    Metadata vocabulary terms
Creator                       Leonardo da Vinci
Title                         Mona Lisa
Subject                       woman, portrait, Renaissance…
...etc

This approach has several advantages:

  • It makes it easier to create the metadata, since the categories tell the cataloguer which information needs to be collected and recorded
  • It makes it easier to understand the metadata, making it clear to a user, for example, that it is Leonardo who has created this image rather than Mona Lisa!
  • It makes it easier to retrieve the image in a search, since the search query can be much more specific, targeting relevant categories rather than searching across all of the metadata
  • It also makes it easier to share the image and its metadata with other image collections - as long as common categories and terminologies have been used
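
The retrieval advantage can be illustrated with a short Python sketch. The two records and the helper names (search_all, search_field) are invented for illustration only; a real collection would normally query a database rather than a list held in memory.

```python
# Hypothetical records using the categories from the example above.
records = [
    {"Creator": "Leonardo da Vinci", "Title": "Mona Lisa",
     "Subject": ["woman", "portrait", "Renaissance"]},
    {"Creator": "Jane Smith", "Title": "Leonardo da Vinci's studio",
     "Subject": ["building", "Renaissance"]},
]


def as_list(value):
    """Treat single values and lists of terms in the same way."""
    return value if isinstance(value, list) else [value]


def search_all(records, term):
    """Unstructured search: match the term anywhere in the record."""
    term = term.lower()
    return [
        record["Title"]
        for record in records
        if any(term in v.lower()
               for value in record.values()
               for v in as_list(value))
    ]


def search_field(records, field, term):
    """Structured search: match the term only within one category."""
    term = term.lower()
    return [
        record["Title"]
        for record in records
        if any(term in v.lower() for v in as_list(record.get(field, "")))
    ]


# Searching everything also returns the image that is merely *about* Leonardo.
print(search_all(records, "leonardo"))
# Targeting the Creator category returns only the work he created.
print(search_field(records, "Creator", "leonardo"))
```

Searching across all of the metadata returns both records, while the targeted search returns only the Mona Lisa.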

Sometimes metadata categories are referred to as metadata ‘elements’ or ‘units’, and the full set of categories used to describe a resource is called a metadata ‘schema’, ‘data structure’ or ‘format’. Each of these labels can be problematic, since they also have other meanings. We will generally use the phrase ‘metadata schema’ in these advice documents, but the reader should be aware that this phrase is sometimes used more narrowly to refer to a particular way of encoding metadata categories within the XML (Extensible Markup Language) format (we will say more on XML later).

Another phrase you will often find used in these advice documents (and elsewhere) is “controlled vocabularies”. This is used where the specific terminology used within a metadata category has been drawn from a pre-defined list (e.g. a thesaurus) or has been constructed according to a standard set of rules (e.g. “enter the creator name in this form: ‘Surname, Forenames’”). The advice documents on metadata schemas and related standards and metadata vocabularies provide more information on these topics.
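
As a rough illustration, both kinds of control (a pre-defined list and a formatting rule) could be checked along the following lines. The term list, the pattern and the function names are assumptions made for this sketch and are not taken from any published vocabulary or standard.

```python
import re

# Hypothetical controlled list for the 'Subject' category.
SUBJECT_TERMS = {"woman", "portrait", "Renaissance", "landscape"}

# The rule quoted in the text: 'Surname, Forenames'.
# This pattern is a deliberately simple assumption: some text, a comma
# followed by a space, then some more text, with no further commas.
NAME_PATTERN = re.compile(r"^[^,]+, [^,]+$")


def uncontrolled_subjects(terms):
    """Return the terms that do not appear in the controlled list."""
    return [term for term in terms if term not in SUBJECT_TERMS]


def is_valid_creator(name):
    """Check that a creator name follows the 'Surname, Forenames' rule."""
    return bool(NAME_PATTERN.match(name))


print(uncontrolled_subjects(["woman", "lady", "portrait"]))
print(is_valid_creator("Smith, Jane"))
print(is_valid_creator("Jane Smith"))
```

A check like this only confirms the form of an entry; deciding which terms belong in the list remains a matter for the cataloguer.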

Different levels and layers

Metadata might focus on describing different levels of a digital resource. Although we will generally want to describe individual resources (e.g. a photo, a moving image or an audio file), sometimes we may prefer to describe aggregations of resources (e.g. a photo album, an online learning resource or a music album). Or perhaps we might wish to describe just a part of a larger whole (e.g. an illustration found within a published book, a particular scene from a moving image file or a single track of music from an audio file). Those developing metadata standards have approached this challenge in different ways. Some have created separate metadata records to describe individual ‘things’ (e.g. collection, single item, part of an item) and then made links within the metadata record to related files and metadata records, e.g. the Dublin Core (DC) schema. Some have created complex metadata schemas that are capable of describing different levels within a single metadata record, e.g. the SEPIADES schema. Others use different kinds of metadata to describe the various levels of a complex resource and then tie them together using special metadata schemas that are intended to structure and coordinate other metadata, e.g. the METS schema (Metadata Encoding and Transmission Standard).

As well as being focused on different levels, metadata might describe different ‘layers’ of content within the digital resource. Take again the example of Leonardo’s Mona Lisa. In this case there might be (a) an original art work (the painting), (b) a photographic reproduction of that art work (a slide), and (c) a digital representation of that work (a digital file). The table below shows how the metadata might differ according to the different ‘content layer’ being described.

[Images: Leonardo da Vinci's Mona Lisa; slide of Leonardo da Vinci's Mona Lisa; digital file shown as binary digits]

            Original image       Slide image                   Digital image
Creator     Leonardo da Vinci    Jane Smith [Photographer]     John Brown [Scanning Technician]
Format      Painting             Photographic transparency     JPEG image
Location    Louvre Museum        University slide collection   A:\images\0023.jpg
...etc

The challenge for those creating metadata is to decide which layers need to be described, how much detail to go into for each, and how the resulting metadata will be organised. Although, in the example above, Leonardo, Jane and John have all contributed to the creation of the final digital resource, you are likely to mislead users of your collection if you put all of their names in one field in your database. As with the different levels, metadata schemas tackle this problem in different ways. Some will create sub-categories (e.g. Creator_OriginalWork; Creator_SurrogateImage); others, separate categories (e.g. Artist; Photographer; Scanning Technician); others, completely separate records for each layer.
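
The last of these options, a separate record for each layer, might be sketched in Python as follows. The LayerRecord class and its field names are hypothetical and are not drawn from any particular schema; the values come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerRecord:
    """One metadata record per content layer (hypothetical structure)."""
    layer: str
    creator: str
    role: str
    format: str
    location: str
    derived_from: Optional["LayerRecord"] = None  # link to the layer below


original = LayerRecord("Original image", "Leonardo da Vinci", "Artist",
                       "Painting", "Louvre Museum")
slide = LayerRecord("Slide image", "Jane Smith", "Photographer",
                    "Photographic transparency",
                    "University slide collection", derived_from=original)
digital = LayerRecord("Digital image", "John Brown", "Scanning Technician",
                      "JPEG image", r"A:\images\0023.jpg", derived_from=slide)


def creators_by_role(record):
    """Walk back through the layers, keeping each creator with their role."""
    pairs = []
    while record is not None:
        pairs.append((record.role, record.creator))
        record = record.derived_from
    return pairs


for role, creator in creators_by_role(digital):
    print(f"{role}: {creator}")
```

Because each name stays attached to its own layer and role, a user is not led to think that John Brown painted the Mona Lisa.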

Different purposes

Metadata can also serve different purposes. It might be used to help us to find the resource (often termed ‘resource discovery’ metadata), or might tell us what it is (descriptive metadata). It might tell us where the resource has come from, who owns it and how it can be used (provenance and rights metadata). It might describe how the digital resource was created (technical metadata), how it is managed (administrative metadata) and how it can be kept into the future (preservation metadata). Or it might, as mentioned above, help us to relate this digital resource with other resources (structural metadata).

These are not discrete sets of metadata: there is obviously a considerable overlap. For example, descriptive metadata (e.g. subject of image) will also be very important in searching and retrieving the image (resource discovery); while metadata relating to the creation of the resource (e.g. filename and format) will clearly also be vital in managing and preserving it.

However, while there are no clear divisions, it can be convenient to use labels like “descriptive” or “administrative” to characterise the different metadata standards in existence. Some standards tend to be much more focused on resource description (e.g. Dublin Core), while others include a larger proportion of administrative categories (e.g. Categories for the Description of Works of Art). There have also been some attempts to create standards that are focused on particular purposes (e.g. NISO Technical Metadata, PREMIS Preservation metadata).

These distinctions are also useful to keep in mind when you are developing your own metadata framework and delivering your collection. What activities do you need to support? What particular metadata categories will you need to include to support those activities? The broad distinction between “descriptive metadata” and “administrative metadata” is a useful reminder that some of your metadata is going to be particularly aimed at the end users of your digital collection and other metadata will be primarily for your own use and management of the collection. Descriptive metadata is likely to be searched and displayed within a public interface, while much of the administrative metadata will need to be hidden from public display (e.g. location of your master files).

Those who have written about metadata have often differed in the broad categories or types of metadata they identify. In this and other advice documents, we will generally talk about four categories (below), but the reader must always remember that these are to some extent artificial and overlapping:

  • Descriptive metadata - used to find, identify and understand a resource
  • Administrative metadata - used to manage the creation, use and preservation of the resource (includes technical and preservation metadata)
  • Structural metadata - used to record and facilitate relationships between or within digital resources
  • Use metadata - metadata collected from or about the users themselves (e.g. user annotations, number of people accessing a particular resource)
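
One way to picture these four categories, and the earlier point that administrative metadata is usually hidden from public display, is a record grouped by category with a filter applied before display. The field names, the collection name and the choice of which categories are public are assumptions made for this Python sketch.

```python
# A hypothetical record grouped under the four broad categories.
record = {
    "descriptive": {"title": "Mona Lisa", "creator": "Leonardo da Vinci"},
    "administrative": {"master_file": r"A:\images\0023.jpg",
                       "format": "JPEG image"},
    "structural": {"part_of": "Hypothetical Renaissance portraits collection"},
    "use": {"view_count": 0, "user_tags": []},
}

# Assumption for this sketch: only these categories are shown to end users.
PUBLIC_CATEGORIES = ("descriptive", "structural")


def public_view(record):
    """Return a copy of the record containing only the public categories."""
    return {
        category: dict(fields)
        for category, fields in record.items()
        if category in PUBLIC_CATEGORIES
    }


print(public_view(record))
```

In practice the boundary is rarely this tidy, since the categories overlap, and a collection may decide field by field what to display.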

Different communities and users

Metadata does not exist in a vacuum. As the previous paragraphs have indicated, it serves particular purposes and particular groups of users. We’ve just distinguished the end users of the collection from those managing the collection (who are also “users”). But even among these groups there will often be different kinds of users with different needs. In developing a metadata framework for your collection it is important that you identify all of these users and needs. It’s best to ask your users what information they need rather than make assumptions.

Digital collections are often based within particular ‘communities’, for example: libraries, archives, museums, educators. Many of the formal metadata standards currently in use have been developed within such communities. This has advantages and disadvantages. It means that these standards are generally good at supporting the needs of that community. However, they can also incorporate old-fashioned or ‘legacy’ approaches that may have worked well in a non-digital environment, but are not as practical or useful in the digital world. If your collection is firmly based within a particular community, it probably makes sense to adopt the metadata standards commonly used within that community - if they exist.

However, it is important to realise that the approaches and biases that community-based standards encode within their metadata may not be suitable for digital collections that span different kinds of communities or physical collections. In these cases, it will usually be necessary to take a more generic approach, which often means some kind of compromise.

As this section has indicated, metadata can take many different forms. It can focus on different aspects of the digital assets (layers and levels), serve different purposes, users, and communities, and be structured in different ways. Those developing a metadata framework for a digital collection will face several challenges and will need to decide on the most suitable approaches to overcoming them.

Where does metadata come from?

Metadata relating to a digital resource can come from one of two sources: (a) it can be automatically derived from the digital resource itself, or (b) it can be created and associated with a resource by human beings.

The first kind of metadata might be called intrinsic or implicit metadata. Examples of this include file formats, resolution, bit-depth, or frame-rate. File formats typically encode this sort of information within the header (the first section) of the digital file. If an image has been created by a digital camera, it is likely that the camera has also written a certain amount of information about the digital capture into the file header, such as the camera make and model, its settings, and the date the photograph was taken (this makes use of the EXIF standard).

Most implicit metadata is technical in nature and is generally - although not always - of more use to those administering the collection than to those using it. While most implicit metadata is derived from the file itself, a certain amount could also be derived from its context (e.g. its location within directories/folders or on servers). In developing a digital collection it may be useful to extract some implicit metadata and hold it separately within a database for the purposes of retrieval, quality control, or digital preservation. Typically, though, much implicit metadata is left untouched within the file.
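
As a small illustration of extracting implicit metadata, the Python sketch below reads the pixel dimensions and bit depth from the header of a PNG file, plus the file size from the file system. PNG is chosen only because its header is simple to read with the standard library; the function names and the stub file are assumptions for this example, and camera-written EXIF data would need a dedicated tool or library.

```python
import os
import struct
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(path):
    """Return implicit metadata held in the first 26 bytes of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(26)
    if len(header) < 26 or header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the signature: chunk length, chunk type, width, height,
    # bit depth and colour type (all big-endian).
    _length, chunk_type, width, height, bit_depth, colour_type = struct.unpack(
        ">I4sIIBB", header[8:26]
    )
    if chunk_type != b"IHDR":
        raise ValueError("PNG header chunk (IHDR) not found")
    return {
        "format": "PNG",
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "colour_type": colour_type,
        "file_size": os.path.getsize(path),  # derived from context, not content
    }


def write_stub_png(path, width, height):
    """Write only a signature and header fields: enough for this demo,
    but not a complete, viewable image."""
    with open(path, "wb") as f:
        f.write(PNG_SIGNATURE
                + struct.pack(">I4sIIBB", 13, b"IHDR", width, height, 8, 2))


with tempfile.TemporaryDirectory() as folder:
    stub = os.path.join(folder, "0023.png")
    write_stub_png(stub, 640, 480)
    extracted = read_png_header(stub)

print(extracted)
```

The resulting dictionary is the kind of information that might be copied into a database for quality control or preservation purposes.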

The second kind of metadata might be called extrinsic or explicit metadata. Because this is created by humans, it is the most difficult and expensive metadata to create. But it is also usually the most important - especially to the end user. The advice documents in this series are mostly concerned with creating and managing explicit metadata.

Although explicit metadata must be created by humans, it need not all be created by those building and cataloguing a digital collection. It is very likely that there is some pre-existing legacy metadata that can be exploited (even if it is just a scrawled inscription on the back of a photograph, film can or audio cassette). Or it might be possible to get your collection users to add to the metadata in a semi-controlled way (via tags or annotations).

Those developing digital collections will need to make decisions about what implicit metadata should be extracted and what explicit metadata needs to be gathered or created to support the collection and its users. Usually resource limitations will play a role in these decisions. Some collections can afford to spend hours creating metadata for each resource; others less so. Some digital asset management systems are able to automatically extract implicit metadata from a digital file; many cannot and will rely on the cataloguer using other tools to discover this information manually.

Where is metadata kept?

Metadata for digital collections can be held in several different places: (a) within the digital file; (b) within a database; (c) in a separate XML-encoded file; or (d) all of the above at once. These options are briefly described in the next few paragraphs.

As the earlier discussion of implicit metadata has indicated, there is already a certain amount of metadata held within a digital file. Some of this might be extracted for use outside of the digital asset. In addition to extracting metadata, it is possible to embed some metadata within the digital asset. Those embedding metadata into still image files can make use of the well-supported IPTC standard (originally developed to enable photojournalists to “wire” their images) or the Extensible Metadata Platform (XMP) standard.

Most people developing digital collections will make use of a database to hold their metadata. Other advice documents in this section of JISC Digital Media’s Advice documentation provide advice on digital asset management systems.

Increasingly, XML is being used as a way of encoding metadata. XML is related to HTML (the original coding used on the World Wide Web). While the HTML tags are primarily focused on presentation (e.g. <b>Bold</b>), XML tags are used to indicate meaning (e.g. <organisationName>JISC Digital Media</organisationName>). This approach lends itself well to expressing metadata, enabling metadata schema categories to be turned into tags and wrapped around specific terms, as the simplified example below shows.

[Image: Leonardo da Vinci's Mona Lisa. Source: Wikipedia Commons]

<imageRecord>
  <originalWork>
      <format>painting</format>
      <creator>Leonardo da Vinci</creator>
  </originalWork>
  <reproduction>
      <format>photographic transparency</format>
      <creator>Jane Smith</creator>
  </reproduction>
  <reproduction>
      <format>JPEG image</format>
      <creator>John Brown</creator>
  </reproduction>
</imageRecord>

Most digital collections still store their metadata in databases, but might write a program to take the metadata from their database and encode it in an XML format for sharing with others.
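
A program of this kind could be as short as the Python sketch below, which uses an in-memory SQLite database to stand in for a collection's real system. The table layout, the column names and the XML element names are all hypothetical; a collection sharing its data would normally follow an agreed XML encoding of a recognised schema instead.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A throwaway in-memory database standing in for the collection's own system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "id INTEGER PRIMARY KEY, creator TEXT, title TEXT, format TEXT)"
)
conn.execute(
    "INSERT INTO images (creator, title, format) VALUES (?, ?, ?)",
    ("Leonardo da Vinci", "Mona Lisa", "painting"),
)


def export_xml(conn):
    """Turn each database row into an XML record, one tag per category."""
    root = ET.Element("imageRecords")
    rows = conn.execute(
        "SELECT id, creator, title, format FROM images ORDER BY id"
    )
    for row_id, creator, title, fmt in rows:
        record = ET.SubElement(root, "imageRecord", id=str(row_id))
        ET.SubElement(record, "creator").text = creator
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "format").text = fmt
    return ET.tostring(root, encoding="unicode")


xml_text = export_xml(conn)
print(xml_text)
```

Building the output with an XML library rather than by joining strings means that characters such as & or < in the metadata are escaped automatically.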

Increasingly digital collections are making use of all three approaches: writing metadata into their digital files, storing and delivering their metadata via a database system, and encoding their metadata as XML for certain purposes, such as sharing their data with other collections. This trend is likely to continue, especially as the newer file formats provide better support for metadata within the file; systems become more capable of importing and exporting XML; and standard XML encodings are developed for existing metadata standards.

Conclusion

This advice document has provided an introduction to metadata and an overview of some of the issues involved in developing a metadata framework for a digital collection. The other advice documents in this series explore these issues in more depth and provide practical advice.

There are other good overviews of metadata available online. JISC Digital Media would particularly recommend the following:
