Metadata Standards and Interoperability
This advice document aims to provide a comprehensive look at the various choices the developer of multimedia collections has in terms of metadata standards and the principles behind using them. It attempts to provide a synopsis of general metadata trends, a) in usage for audio, moving and still image format types; b) in specific areas of practice such as museums, archives, libraries and education; and, c) in various activities and tasks such as preservation, interoperability and resource discovery. For an overview of the whole series of papers, and an introduction to the metadata issues discussed here, please see An Introduction to Metadata.
- Why use existing standards?
- What exactly do we mean by 'standards'?
- What metadata standards are there?
- Community specific metadata schemas
- Task specific metadata schemas
- Application Profiles
- Metadata vocabularies
- Content standards
- Conceptual models
- Embedded metadata
While you could choose to make up your own metadata schemas and vocabularies from scratch, for various reasons it is generally preferable to use or adapt existing standards. Using an existing standard can offer:
- Cost saving - you won't need to develop the schema, its usage guidelines or vocabularies yourself;
- Access to help and advice - an established standard is likely to have a community of users that has built up over time, which makes it comparatively easy to get help and advice on how best to use the standard;
- Usability - if your users are already familiar with your metadata structure or terminology they can use your collection more quickly and easily;
- Resource discovery - your collection could be more easily opened up to be searched and shared by others;
- Sustainability - your use of common standards would make it easier to pass your collection on to someone else to look after, should you ever need to.
The last two advantages in this list (resource discovery and sustainability) are particularly concerned with 'interoperability' - the ability of your collection to work alongside other collections, either through shared resource discovery services or by contributing your metadata to other collections.
Interoperability depends either on the strict use of common standards or on understanding how your 'non-standard' metadata can be mapped or transformed to common standards. When deciding on metadata for your collection, in addition to thinking about the collection itself and its specific needs, you will need to ask yourself whether it's important for you to interoperate with other analogous collections. If it is, this is likely to influence your choice and usage of standards.
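As a rough illustration of mapping 'non-standard' metadata to a common standard, the following Python sketch applies a simple crosswalk from a local schema to Dublin Core element names. The local field names (`picture_title`, `artist`, `year_made`, `shelf_mark`, `frame_condition`) and the sample record are hypothetical; only the element names on the right-hand side of the mapping come from DC.

```python
# Hypothetical local field names mapped to Dublin Core element names.
CROSSWALK = {
    "picture_title": "Title",
    "artist": "Creator",
    "year_made": "Date",
    "shelf_mark": "Identifier",
}

def to_dublin_core(local_record):
    """Return (dc_record, unmapped) for a record held in the local schema."""
    dc_record = {}
    unmapped = {}
    for field, value in local_record.items():
        element = CROSSWALK.get(field)
        if element is None:
            # No DC equivalent is defined in the crosswalk, so keep it aside.
            unmapped[field] = value
        else:
            # DC elements are repeatable, so values are collected in lists.
            dc_record.setdefault(element, []).append(value)
    return dc_record, unmapped

# Hypothetical sample record in the local schema.
local = {
    "picture_title": "Harbour at Dusk",
    "artist": "A. Painter",
    "year_made": "1889",
    "frame_condition": "good",
}
dc, leftover = to_dublin_core(local)
```

Fields with no entry in the crosswalk (here `frame_condition`) are returned separately rather than discarded, so that a decision about them can be documented.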
The word 'standard' can be problematic, since people use this term in different ways. If you're seeking formal internationally approved metadata standards (often called de jure standards), then you'll find very few. But if you're looking for metadata schemas or vocabularies with reasonably wide endorsement and use, then you'll find many more de facto standards.
Instead of adopting a narrow definition of 'standards', JISC Digital Media suggests that standards can be thought of as: commonly used and consistently applied formats or processes, which are measurable, well documented, and endorsed by someone. Such 'standards' will be found throughout your digital collection and workflows: e.g. file formats, digitisation workflow, Web delivery and - the focus of this paper - your metadata.
JISC Digital Media recommends:
- Where there are clear and obvious standards for your resource type, community, or task, make use of them
- Where the standards are unclear and competing, follow models of 'good practice' within your community
- Where you can find no appropriate standards or models, adapt an existing standard to better fit your needs and document the changes you make very carefully, using the documentation methods and mappings deployed by existing standards as a guide
If you're doing some research into metadata you'll encounter many different standards, often with long names or confusing acronyms. It can sometimes be difficult to know which are the most relevant for your collection. JISC Digital Media's documents on metadata schemas and vocabularies are also intended to provide help in understanding and choosing standards. These should be read alongside this current paper:
- Putting Things in Order - a Directory of Metadata Schemas and Related Standards
- Controlling your Language - a Directory of Metadata Vocabularies
The remainder of this paper introduces several metadata standards of particular relevance to multimedia collections. It concentrates on metadata schemas (standards describing the categories of information you might record). It also briefly considers metadata vocabularies (standards describing specific terms that could be entered within a schema's categories) and two other special kinds of metadata standards: conceptual models and content standards (these are defined below).
The Dublin Core (DC) is one of the most widely used schemas and illustrates many of the issues you'll need to think about when choosing or adapting a metadata schema. If you can't find a more suitable schema for your collection, then DC would be a good starting point.
DC was originally conceived as an attempt to provide a set of 'core' metadata properties able to give a basic description of any type of resource, including individual items such as image, video, sound and text resources and composite resources such as websites. DC is one of the few metadata schemas that has achieved international standardisation (ISO 15836). Because of its generic nature (sometimes referred to as 'lowest common denominator'), DC is often used as a basis that is extended and adapted to better describe a specific resource, and it increasingly provides the means for a basic level of interoperability between and across separate digital collections.
The DC table (PDF) lists the 15 elements or categories of DC, along with their formal definitions and two examples.
In the examples above, we've shown how DC might be used to catalogue a physical object (painting) or a digital object (image file). We might also have included a record describing a whole collection of physical or digital resources.
If you compare the two examples given, you'll notice that while the elements are the same for each record, some of them are filled in very differently depending on what is being described (see, for example, Date and Format).
Some other metadata schemas take a different approach, choosing to include information about an original item and its digital reproductions within the same metadata record (the CDWA and SEPIADES schemas described below do this). DC creates separate records because it adopts a principle known as "one-to-one". This principle states that a metadata record should only refer to one thing. If there are different 'layers' within an image (e.g. an original work and several reproductions of that work) then DC expects you to create a different record for each, as we have above. Similarly, if there are different 'levels' that you need to describe (e.g. collection, group of images, single image) then each of these will need to have their own records.
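The one-to-one principle described above can be pictured with two separate records that share the same elements. This is a minimal Python sketch: the element names are the 15 DC elements, but the record values and identifiers are hypothetical and are not taken from the DC table.

```python
# The 15 elements of simple Dublin Core.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)

# One record per thing described (hypothetical values).
record_a = {  # the original painting
    "Title": "Harbour at Dusk",
    "Creator": "A. Painter",
    "Date": "1889",
    "Format": "oil on canvas",
    "Identifier": "painting-001",
}
record_b = {  # the digital image of the painting
    "Title": "Harbour at Dusk",
    "Creator": "Digitisation Unit",
    "Date": "2006-03-14",
    "Format": "image/tiff",
    "Identifier": "image-001",
}

def differing_elements(a, b):
    """List the DC elements whose values differ between two records."""
    return [e for e in DC_ELEMENTS if a.get(e) != b.get(e)]

differences = differing_elements(record_a, record_b)
```

Both records use the same element set, but elements such as Date and Format hold different values because each record describes a different thing.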
We've presented the simple, 15-element standard version of DC above, but people frequently add other elements (e.g. Audience or Provenance). It is also very common to 'refine' or 'qualify' a DC element in order to provide more information about the data contained within that element.
The DC initiative has suggested some refinements and those building digital collections frequently add others. For example, the Date element might be usefully broken down further (Date - Created, Date - Available, Date - Modified...). The Relation element is another good example. Looking again at the two example records above, it is clear that the two things being described by these records are related to each other. But the nature of the relationship is not stated. The table below shows how we might qualify the Relation element further to make the relationship clearer:
Element and qualifier table
|Element and qualifier||Record A - a painting||Record B - a digital image|
|Relation - Has Format||Record B||-|
|Relation - Is Format of||-||Record A|
This indicates that Record B (the digital image) is a surrogate of Record A (the original painting).
Where possible, it is better to qualify an existing element rather than create a new separate element. Then, if you need to, you can always collapse your richer, qualified metadata back into its core element for the purposes of interoperability (e.g. exchanging records with others). In doing so, you will have lost some refinement, but you won't have lost any data. This method - of being able to collapse your data into a broader category - is sometimes referred to as 'dumbing down' your metadata.
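'Dumbing down' can be sketched in a few lines of Python. The example assumes qualified elements are stored under keys written as 'Element - Qualifier', following the notation used in this paper; the record values and the function name `dumb_down` are hypothetical.

```python
def dumb_down(qualified_record):
    """Collapse 'Element - Qualifier' keys into their core DC element."""
    simple = {}
    for key, values in qualified_record.items():
        # Everything before the first ' - ' is the core element name.
        element = key.split(" - ", 1)[0]
        simple.setdefault(element, []).extend(values)
    return simple

# Hypothetical qualified record for the digital image (Record B).
record_b = {
    "Title": ["Harbour at Dusk"],
    "Date - Created": ["2006-03-14"],
    "Date - Modified": ["2006-04-02"],
    "Relation - Is Format of": ["Record A"],
}
simple_b = dumb_down(record_b)
```

Both dates survive under the plain Date element and the link to Record A survives under Relation; only the qualifiers that distinguished them are lost.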
It is likely, however, that you will need to extend the DC element set with some additional elements. DC focuses more on describing resources than on administering them. So if you're using DC as the basis for your metadata you'll probably choose to add a few administrative or technical metadata elements to help you manage your collection. You might, for example, want to add elements to record which camera was used to capture the digital image (e.g. Capture Device), or any conservation work done to the original painting (Conservation), or perhaps any corrections made to the digital image (Optimisation). Although these non-standard metadata additions are not going to be as interoperable (since they can't be 'dumbed down' to a standard DC category), this is not so critical since this information is unlikely to be used in search and retrieval and probably won't even be shown to the users of your collection.
When adding new elements to DC, it is preferable to draw on other metadata standards where possible. Recently, quite a lot of attention has been paid to technical and preservation-related metadata (see NISO Technical Metadata and PREMIS, below), so you could use some categories from these standards to supplement DC.
Your adaptation of DC will result in a particular version of the standard that suits your needs. If a number of different collections have similar needs and are interested in interoperating in some way, then it makes sense to standardise these adaptations. Sometimes this has led to the development of new standards that are based on (and mapped to) DC, such as the UK e-Government Metadata Standard (e-GMS), or the development of Application Profiles that use DC as a base (see section on Application Profiles below).
Visual Resources Association Core
The Visual Resources Association Core (VRA) is a widely used metadata schema for describing works of visual culture and their associated images. It takes its name from its developing body: the Visual Resources Association, which is a US-based association of visual resource librarians and associated image media professionals. VRA Core can be seen as an extension of DC aimed specifically at visual resources and has also been influenced by other initiatives in the field, such as the Categories for the Description of Works of Art (CDWA) standard (see below).
Like DC, in its early versions VRA Core adopted the one-to-one principle, although it made a distinction between: original works (e.g. a painting, building or born digital work) and digital and analogue surrogate images of works (e.g. slides, prints or digital photographs). VRA Core 3 presented the same 17 categories for both its Work and Image records, and like DC, they were filled in differently depending on what was being described.
The VRA Element table (PDF) lists the 17 elements of VRA Core 3.0, along with their formal definitions and some examples.
VRA Core 3.0 also recommended some qualifiers (e.g. 'Date.Creation' or 'Date.Restoration') and some controlled vocabularies. As with users of DC, those using the VRA Core will frequently find they need to add new qualifiers or extend the schema with additional categories. For example, the African & Asian Visual Artists Archive collection hosted by the Visual Arts Data Service (VADS) decided to refine the VRA's image categories with elements like 'Colour Space', and 'Compression' to accommodate some additional technical metadata.
In 2007 VRA Core 3 was superseded by Core 4. The new version has been influenced by two recent trends in the development of metadata: (a) the use of XML to encode and express the schema; and (b) increased attention to the way data is entered within the categories. Several elements have been modified or restructured to take account of XML encoding, and four new elements have been added. The changes are:
|Core 3.0||Core 4.0||Nature of change|
|RECORD TYPE||WORK, COLLECTION OR IMAGE||Name change and structural change|
|TYPE||WORK TYPE||Name change only|
|CREATOR||AGENT||Name change and structural change|
|ID NUMBER||LOCATION.REFID||Sub-element under LOCATION for IDs associated with repository|
|ID NUMBER||TEXTREF||IDs not associated with repository|
|CULTURE||AGENT.CULTURE||Sub-element under AGENT to denote agent culture or nationality|
|CULTURE||CULTURAL CONTEXT||Describes cultural context|
|-||STATE EDITION||New element|
Please visit VRA Core 4.0 for further information, including the element outline, element descriptions, tagging examples and links to related resources.
CDWA (Categories for the Description of Works of Art)
Like the VRA Core, CDWA is a standard for cataloguing cultural objects, such as those found within museums and galleries. With 512 categories or sub-categories it is much more detailed and extensive than VRA Core. Recognising that such a lengthy standard is too much for many institutions or collections, the CDWA identifies a set of 35 core categories which should be used as a minimum. During 2005-6 a revised version of the CDWA was prepared and released, reflecting the development of the Cataloging Cultural Objects (CCO) content standard (see below). At the same time an XML version of CDWA, called CDWA Lite, was released. As its name suggests, CDWA Lite encodes only a subset of CDWA categories.
The CDWA table (PDF) shows the main elements of CDWA Lite (but excludes its sub-elements).
Note that unlike DC and VRA Core, both CDWA and CDWA Lite include information about the original work and any digitised images of it within the same metadata record. All of CDWA's elements are repeatable, so it is possible to include multiple resource descriptions within the same metadata record to describe different views of an object or different versions of digital images.
CDWA Lite concentrates on those CDWA elements that are covered by the new CCO standard (see previous section) so it bears a close similarity to the VRA Core. CDWA Lite also conforms to an important standard used in interoperability, OAI-PMH (see Interoperability metadata, below). This has the potential to enable collections using CDWA Lite to share their metadata records more easily with others.
SEPIADES (SEPIA Data Element Set)
In 2003 a European-funded project called SEPIA (Safeguarding European Photographic Images for Access) published a set of recommendations for describing photographic collections, which are known as SEPIADES. The SEPIA project focused on photographic archives, so the metadata it recommended was closely modelled on archival metadata, especially the General International Standard for Archival Description (ISAD(G) - see Archival schemas, below). The archival approach to metadata is hierarchical or "multi-level": it creates a single metadata record for a whole collection of items, and then breaks the collection down into groups and, where significant, individual items.
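The hierarchical, multi-level approach can be pictured as a tree of descriptions. The sketch below is plain Python rather than SEPIADES or ISAD(G) syntax; the class name `Unit`, the level names and the sample titles are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """One unit of description at some level of the hierarchy."""
    level: str   # e.g. "collection", "group" or "item"
    title: str
    children: list = field(default_factory=list)

def outline(unit, depth=0):
    """Yield one indented line per unit of description, top level first."""
    yield "  " * depth + f"{unit.level}: {unit.title}"
    for child in unit.children:
        yield from outline(child, depth + 1)

# Hypothetical archive: items are described only where significant.
archive = Unit("collection", "Studio archive", [
    Unit("group", "Portraits", [Unit("item", "Portrait of a sitter")]),
    Unit("group", "Landscapes"),
])
lines = list(outline(archive))
```

Note that the Landscapes group has no item-level records at all, which reflects the archival practice of describing individual items only where they are significant.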
At the Single Item level, SEPIADES makes a distinction between a "Visual Image" (i.e. its visual content) and its "Physical Description" (the particular material form it takes). Physical Description is further divided up into "Photographs" (which includes negatives, slides and prints) and "Digital Photo File" (which includes born-digital or digitised images).
The SEPIADES table (PDF) shows the information SEPIADES recommends recording for a Single Item. The dashes indicate sub-categories.
In addition to its metadata standard, one of the SEPIA partners developed an open-source Java-based cataloguing system to record SEPIADES records. This system provides data entry and search facilities, and it also exports SEPIADES records as DC in OAI-PMH compliant XML (see Interoperability metadata, below).
Although SEPIADES is ideally suited to a photographic archive, its relationship to ISAD(G) means that elements of it could be incorporated within other archival metadata frameworks. The Single Item metadata could also be used independently as the basis of a collection's metadata or could be used to supplement another core schema.
PB Core (or the Public Broadcasting Metadata Dictionary) is intended for use by television, radio and web broadcasters and aims to be a standard way of describing and using multimedia content (including video, audio and still image), allowing it to be more easily retrieved and shared among colleagues, software systems, institutions, community and production partners, private individuals and educators. It can also be used as a guide for an archival or asset management process at an individual station or institution. As with other metadata standards, PB Core can be incorporated to cover multimedia metadata within structures such as a METS record. PB Core is based on DC and organises its fifty-three elements in 'containers', which in turn are organised in four classes:
- PBCoreIntellectualContent (metadata elements describing the actual intellectual content of a media asset or resource)
- PBCoreIntellectualProperty (metadata elements related to the creation, creators, usage, permissions, constraints, and use obligations associated with a media asset or resource)
- PBCoreInstantiation (metadata elements that identify the nature of the media asset as it exists in some form or format in the physical world or digitally)
- PBCoreExtensions (additional descriptions that have been crafted by organisations outside of the PBCore Project. These extensions fulfil the metadata requirements for these outside groups as they identify and describe their own types of media with specialised, custom terminologies unique to their needs and community requirements)
The table of PB Core elements, with their definitions and content classes, had format issues and has been removed from this version of the paper. More information on these elements, along with examples of use, can be viewed at the PB Core website.
You will notice how many of these elements are based on DC. Indeed, the PB Core schema is essentially an application profile for the broadcasting industry whose elements combine DC and others, and provide pointers to specific controlled vocabularies and structured values and can be expressed in XML.
PB Core is intended to be used by a wide range of expert and non-expert users in the public broadcasting domain, and is therefore intended to be 'simple' in the sense that DC is, while still facilitating the useful exchange of information. Also like DC, PB Core is intended to be used as a basic starting point, and users of the schema are encouraged to add their own extensions and element qualifiers where appropriate.
MPEG-7 (Moving Picture Experts Group)
MPEG-7 is a multimedia metadata schema which can be used to provide rich descriptions of digital image, digital video or digital audio content. One key strength of MPEG-7 is the ability to segment time-based media and attribute different metadata to each part. MPEG-7 was designed to take into account aspects of several other schemas, such as the SMPTE (Society of Motion Picture and Television Engineers) Metadata Dictionary, DC, P/Meta and TV-Anytime. MPEG-7 can be used alone or as a metadata schema within models such as METS or MPEG-21. The standard achieved ISO status in 2001 (ISO/IEC 15938).
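The idea of segmenting time-based media and attaching different metadata to each part can be pictured with a small Python sketch. This is not MPEG-7 syntax: the class `Segment`, the function `describe_at`, the field names and the sample values are all hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # seconds from the start of the recording
    end: float          # seconds; the segment stops just before this point
    description: dict   # metadata that applies only to this segment

def describe_at(segments, t):
    """Return the description of the segment covering time t, or None."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.description
    return None

# Hypothetical video divided into two described segments.
video = [
    Segment(0.0, 12.5, {"title": "Opening titles"}),
    Segment(12.5, 95.0, {"title": "Interview", "speaker": "Interviewee A"}),
]
at_thirty_seconds = describe_at(video, 30.0)
```

A search for 'Interviewee A' could then lead a user straight to the relevant part of the recording rather than to the recording as a whole.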
Formally called the Multimedia Content Description Interface, the standard focuses on representing information about the content rather than on encoding the content itself, as the MPEG-1, MPEG-2 and MPEG-4 standards do. MPEG-7's broad aims are to provide a standard for:
- A core set of Descriptors (Ds) that can be used to describe the various features of multimedia content
- Pre-defined structures of Descriptors and their relationships, called Description Schemes (DSs)
- A language to define Description Schemes and Descriptors, called the Description Definition Language (DDL)
- Coded representations of descriptions to enable efficient storage and fast access
The standard also allows for additional information for organising, managing, and accessing the content, such as:
- Information about how objects are related and gathered in collections
- Information to support efficient browsing of content (such as summaries, variations, and transcoding information)
- Information about the interaction of the user with the content (such as user preferences and usage history)
Moreover, in addition to standard archival type descriptors that relate to the content's production processes (e.g. titles, locations, actors etc.), storage formats and copyright, MPEG-7 provides scope to record detailed descriptions of information within the content itself, such as:
- Information regarding the content's spatial, temporal, or spatio-temporal structure (for example, scene cuts, segmentation in regions, and region motion tracking)
- Information about low-level features in the content (for example, colours, textures, sound timbres, and melody descriptions)
- Semantic information captured by the content (for example, objects, events, and interactions between objects)
MPEG-7 is therefore a relatively comprehensive metadata standard (at the time of writing comprising twelve parts). It aims to provide descriptors for multimedia content that are useful across a wide range of domains and applications, and that describe varying levels of abstraction: from low-level descriptions like size and colour, to positional descriptors (i.e. where in a scene a specific object is), to high-level semantic descriptors of what is in the content itself.
MPEG-7's descriptors are not laid out in list form as they are in, for example, DC, but rather form part of a series of methods and tools that can be altered depending on viewpoint. A simple example record describing the MPEG logo can be found in the article 'Introduction to MPEG 7' at MPEG Industry Forum.
The previous sections have highlighted schemas that are directly related to image and multimedia collections or, in the case of DC, are non-prescriptive about the types of resources they can describe. In addition to these, there are also community specific metadata standards that have been developed by the Museum, Library and Archive communities, and emerging standards developed for Educational resource use. Some of these will be used to describe image and multimedia resources, particularly in these communities or in situations where systems developed primarily for these communities are being utilised elsewhere. Web links and further information are provided in JISC Digital Media's Putting Things in Order - a Directory of Metadata Schemas and Related Standards.
Because libraries collect published items that are also held within many other libraries, they were early in developing standard cataloguing formats and have been very active in metadata initiatives such as DC. The main electronic metadata standard used by libraries is MARC (Machine-Readable Cataloguing) which has been in use since the 1960s in several different versions. MARC concentrates on bibliographic items (e.g. books and journals), but has also often been used by libraries to catalogue other types of collections. Those building digital collections within a library context may want to derive their metadata from existing MARC records or develop metadata compatible with the MARC format. Or it may be that they are digitising books and so require a metadata schema suitable for describing bibliographic works.
Because MARC is such an extensive standard, the library community has developed a sub-set of elements taken from MARC called MODS (Metadata Object Description Schema). MODS is intended to be used for a variety of library purposes, uses language-based tags rather than the numeric ones used by MARC, and is expressed in XML. Being expressed in XML gives MODS an advantage when sharing data between different sources and when combining it with other metadata standards. For example, those digitising a book might choose to use MODS to describe the book as a whole, DC to describe the individual page image files, and METS to wrap the various records together (see Structural metadata, below).
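The book example can be sketched with Python's standard `xml.etree.ElementTree` module. This is a minimal, hedged illustration rather than a complete or validated METS document: it builds only two descriptive metadata sections, and a full METS record would need further sections such as a structural map. The section IDs, titles and the helper `dmd_section` are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces of the three standards being combined.
METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
DC = "http://purl.org/dc/elements/1.1/"

def dmd_section(parent, section_id, mdtype):
    """Add a METS descriptive metadata section; return its xmlData element."""
    dmd = ET.SubElement(parent, f"{{{METS}}}dmdSec", {"ID": section_id})
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", {"MDTYPE": mdtype})
    return ET.SubElement(wrap, f"{{{METS}}}xmlData")

mets = ET.Element(f"{{{METS}}}mets")

# MODS record describing the book as a whole (hypothetical title).
book = dmd_section(mets, "dmd-book", "MODS")
mods = ET.SubElement(book, f"{{{MODS}}}mods")
title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
ET.SubElement(title_info, f"{{{MODS}}}title").text = "An Example Book"

# DC record describing one page image file (hypothetical values).
page = dmd_section(mets, "dmd-page-001", "DC")
ET.SubElement(page, f"{{{DC}}}title").text = "Page 1"
ET.SubElement(page, f"{{{DC}}}format").text = "image/tiff"

xml_text = ET.tostring(mets, encoding="unicode")
```

Each schema keeps its own namespace inside the wrapper, which is what allows records in different standards to sit side by side in one file.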
The library community is also currently developing a new content standard to govern the way data is recorded within catalogue records (similar to the CCO standard mentioned below). This will be called RDA (Resource Description and Access) and will replace a much older, pre-digital standard called the Anglo-American Cataloguing Rules (AACR).
Museums and Heritage
For UK museums, a key standard is the MDA's (Museum Documentation Association) SPECTRUM documentation standard. SPECTRUM is more than a metadata schema. It is a guide to documenting all the procedures a museum might need to undertake in managing its collections (e.g. acquisition, cataloguing, auditing, and loans). SPECTRUM recommends several "units of information" that can be recorded to support each of these procedures, some of which are required, others recommended.
In terms of cataloguing museum objects, SPECTRUM suggests that sometimes it will be appropriate to catalogue at a collection level; at other times, at the item level. It suggests that any catalogue record should include at least: an identity number, name of the object, number of items or parts, physical description, and details about its acquisition, location and any associated images. SPECTRUM does not prescribe particular elements for digital reproductions, so those developing museum collection management systems often use SPECTRUM as the basis for the object information and DC to record information about any associated digital images.
While SPECTRUM focuses on the description and management of heritage objects, the MIDAS standard (Monument Inventory Data Standard) concentrates on UK heritage environments. MIDAS is maintained by FISH (Forum on Information Standards in Heritage), who have developed an XML version and other tools for interoperability. It is likely that we will see some convergence or increased compatibility between SPECTRUM and MIDAS in the future.
Because of the hierarchical nature of archival resources (typically large amounts of unique materials arranged in collections and sub-collections) the archival community has adopted a different approach to cataloguing its resources from that of libraries. Instead of creating metadata records for individual items, archives typically create metadata records for a whole collection, breaking it down into series and item levels where these are important and where an archive's limited financial resources allow. We have already seen this multi-level approach to cataloguing in the SEPIADES standard (above). The main standard for archives is ISAD(G).
Like the SPECTRUM standard, ISAD(G) says which units of information should be recorded for a collection, but does not specify a particular data structure or form of encoding. An independent but closely-related standard called EAD (Encoded Archival Description) provides such an encoding, using the XML format. EAD is increasingly being used to enable archives to publish or share their archival records. It is used, for example, within the UK's Archives Hub and the Online Archive of California (OAC). EAD includes some elements for describing digitised versions of archival materials (see its <dao>, or Digital Archival Object, element). Multimedia objects can be described in simple terms within an EAD record, but those using EAD may prefer to link to more detailed records described using another schema.
In 2005 the US archival community published a content standard similar to the art image community's CCO and the library community's RDA standards (see above). Called DACS (Describing Archives: A Content Standard), it is intended to help archivists decide how to select and format the information they put within ISAD(G), EAD, or MARC categories.
The rise in electronic resources being used for specific teaching purposes has led to the requirement for metadata standards that can accompany and describe such resources. A key international standard is IEEE LOM (Learning Object Metadata). LOM was based on older metadata schemas and has been much influenced by DC. A UK application profile of the LOM, called UK LOM Core, is currently available in a draft form.
Because it deals with learning resources, LOM includes categories related to the resource's educational use (e.g. interactivity level, typical age range, typical learning time). Another interesting category is Annotation, which provides space within the metadata record for educators to record their comments about the learning resource.
The previous sections have discussed standards for describing specific types of resources (e.g. art works, photographic archives) or materials held by specific communities. In addition, some specialist schemas or related standards have been developed for particular tasks such as recording technical or preservation-related information, structuring different sets of metadata, or sharing metadata with others. We present the main standards for each below.
Technical metadata for digital still images
A standard has been developed to record technical information about raster (i.e. pixel-based) digital images. This is referred to as Technical Metadata for Digital Still Images or NISO Z39.87, which is the code given to it by the US based National Information Standards Organization (NISO). It takes the form of a data dictionary rather than a formal schema, listing technical elements that an organisation might want to record about a digital image. In order to make it more usable and interoperable, the Library of Congress, MARC Standards Office and the original committee responsible for its creation have developed an XML version called MIX (NISO Metadata for Images in XML), which is currently in its second version. NISO Technical Metadata is an extremely detailed standard which gives scope to record basic image information and extensive image capture and change history elements. Those building digital image collections may want to be selective about the elements they choose from the standard (perhaps via MIX) rather than implement it in its entirety.
Preservation metadata (PREMIS)
As with NISO Technical Metadata (previous section), a data dictionary has been developed listing core metadata elements that can be used to support the preservation of a digital resource. Called PREMIS, this standard was based on an international survey of practice and on previous preservation research. It was particularly influenced by the Open Archival Information System (OAIS), which provides a framework for the long-term preservation of digital (and non-digital) resources.
PREMIS recommends recording information about: (1) the Intellectual Entity (i.e. the "work" itself); (2) the related digital Objects (e.g. their format or encoding); (3) any particular preservation-related Events (e.g. acquisition, conservation); (4) Agents (e.g. details of the preservation repository or rights owner); and (5) any related Rights information (e.g. conditions of use). All five sets of information are important for understanding how a resource has reached its current state and what can be done with it in the future.
The developers of PREMIS felt that the metadata for the Intellectual Entity was best supplied using a relevant descriptive metadata standard (e.g. DC, VRA Core or MODS), so PREMIS only provides elements for Objects, Events, Agents and Rights. It is not very prescriptive about how much of this information should be recorded or in what form it should be encoded, but it does provide XML encoding. Like the NISO Technical Metadata standard, it is likely that those using PREMIS will pick and choose which elements they want to use, and will use the XML version to incorporate PREMIS within their overall metadata framework, perhaps incorporating it within a METS record (see next section).
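As a rough illustration of how the four entity types for which PREMIS provides elements relate to one another, the sketch below models them as plain Python data classes. The class names, field names and sample values are simplified assumptions made for this example. They are not the element names defined in the PREMIS data dictionary, and a real implementation would more likely use the XML encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    # Hypothetical fields: who or what acted on the object
    name: str
    role: str  # e.g. "preservation repository" or "rights owner"


@dataclass
class Event:
    # A preservation-related action, linked to the agent responsible for it
    event_type: str  # e.g. "acquisition" or "migration"
    date: str        # ISO 8601 date, so plain string sorting is chronological
    agent: Agent


@dataclass
class Rights:
    statement: str   # e.g. conditions of use


@dataclass
class DigitalObject:
    identifier: str
    file_format: str
    events: List[Event] = field(default_factory=list)
    rights: List[Rights] = field(default_factory=list)

    def history(self) -> List[str]:
        """Return the object's events in date order as short readable lines."""
        ordered = sorted(self.events, key=lambda e: e.date)
        return [f"{e.date}: {e.event_type} by {e.agent.name}" for e in ordered]


# Invented sample data, for illustration only
repository = Agent(name="Example Repository", role="preservation repository")
image = DigitalObject(identifier="img-0001", file_format="image/tiff")
image.events.append(Event("migration", "2009-06-01", repository))
image.events.append(Event("acquisition", "2008-03-15", repository))
image.rights.append(Rights("Educational use only"))
```

Calling `image.history()` lists the acquisition before the migration, which is the kind of record of "how a resource has reached its current state" that the Events and Agents information is meant to support.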
METS is potentially a very powerful standard and is becoming more useful as more XML-based schemas are developed and used. However there is still a lack of easy-to-use tools for generating or displaying METS records. Custom-written software is usually required to generate a METS record from a collection database and to style it for display via a Web browser. For the time being METS is most likely to be used by larger projects that are digitising complex resources, such as books or archives. This may change if off-the-shelf systems start providing support for the standard.
For information, we have produced two example METS records that package together various standards on:
At the beginning of this paper we said that an important reason for choosing a standard metadata schema is to be able to interoperate with other collections. While this has long been a goal of those building digital collections, it has been quite difficult to achieve in practice. There are several different approaches to interoperability:
- Cross-searching - your metadata and digital objects stay where they are, but are searched alongside other collections.
- Contribution - you physically give your metadata and objects to someone who is building a larger collection.
- Harvesting - your metadata and objects stay where they are, but you make available metadata records in a standard format for others to use in building catalogues which point to your resources.
The harvesting approach is a kind of cross between the other two. It has the advantage of opening up your collection to others while maintaining your ability to manage and maintain your digital objects and their metadata locally. OAI-PMH is increasingly being used to achieve this kind of interoperability.
OAI-PMH requires you to generate your resource discovery metadata in a standard XML format. At minimum the protocol requires simple DC, although any other standard set of metadata can be provided in addition. These records are placed in a public space on a server and made available for others to harvest. The data can then be incorporated into the harvesting bodies' catalogues or directories, thus making the records from your collection cross-searchable with however many other datasets they choose to harvest. If using simple DC, these OAI records may represent a "dumbed-down" version of your richer metadata, but users will often have a link through to your own collection to view the digital resource itself. Once there, they will be exposed to your full metadata and can see the item in its context.
The diagram below shows a simplified top-level view of the protocol in operation, with the contributor's metadata being made available on the contributor's server to be requested and ingested by the harvesting body's catalogue. There is no limit to how many datasets can be harvested in this way, and the OAI protocol can therefore enable end users to cross-search many collections of metadata from one place.
This diagram is a simplified version of the one available from the National Library of Australia's Picture Australia Technical Guide. Indeed, Picture Australia is a very good current example of OAI-PMH in practice, whereby various image collections from around Australia contribute metadata to a central place where their materials can be accessed together.
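To make the "dumbed-down" idea more concrete, the following Python sketch reduces a richer local record to simple DC and serialises it in the `oai_dc` XML format used by OAI-PMH. It also builds the kind of request URL a harvester would send. The local field names, the mapping table, the sample record and the `example.org` addresses are all assumptions invented for this illustration. Only the two namespace URIs and the `ListRecords` and `oai_dc` protocol values come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from local database fields to simple DC elements
LOCAL_TO_DC = {
    "title": "title",
    "photographer": "creator",
    "keywords": "subject",
    "url": "identifier",
}


def dumb_down(local_record: dict) -> dict:
    """Keep only the fields that have a simple DC equivalent."""
    dc_record = {}
    for local_field, value in local_record.items():
        target = LOCAL_TO_DC.get(local_field)
        if target is not None:
            dc_record[target] = value
    return dc_record


def to_simple_dc(dc_record: dict) -> str:
    """Serialise a simple DC record as an oai_dc XML fragment."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, value in dc_record.items():
        values = value if isinstance(value, list) else [value]
        for v in values:  # repeatable elements become repeated tags
            ET.SubElement(root, f"{{{DC_NS}}}{element}").text = v
    return ET.tostring(root, encoding="unicode")


def harvest_url(base_url: str) -> str:
    """Build the request a harvester would send to fetch simple DC records."""
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'})}"


# Invented sample record; "capture_device" has no mapping and is dropped
local_record = {
    "title": "Harbour at dawn",
    "photographer": "A. Photographer",
    "keywords": ["harbours", "boats"],
    "capture_device": "Example Camera 1",
    "url": "https://example.org/images/42",
}
xml_text = to_simple_dc(dumb_down(local_record))
request = harvest_url("https://example.org/oai")
```

The `identifier` element carries the link back to the contributor's own collection, which is where the user would see the full record, including fields such as the capture device that the simple DC version leaves out.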
Other examples can be seen at the OAIster project at the University of Michigan, which has harvested almost 10 million OAI records from 700 institutions:
Application Profiles have been defined as " ... schemas which consist of data elements drawn from one or more namespaces, combined together by implementers, and optimised for a particular local application." In other words, it is possible to 'mix and match' metadata elements from across different schemas and use the resultant 'profile' as a means to provide more precise search and retrieval within a narrow domain, say learning resources, while maintaining DC (and other) mappings to enable basic interoperability with the world at large. Some examples from specific communities include:
- Library Application Profile (DC-Lib)
- Collection Description Application Profile (DC CD AP)
- Education Application Profile (DC Education)
There has also been recent work in the UK to develop DC-based application profiles for specific resource types (text, images and moving images) that can aid interoperability across the growing network of Institutional Repositories. These are:
- Scholarly Works Application Profile (SWAP)
- Images Application Profile
- Time Based Media Application Profile
Application Profiles are often concerned not only with the elements and vocabularies that are chosen, but also with the way the metadata is encoded. They often specify a particular way of tagging and laying out a metadata record using XML encoding. This level of specification is very important if metadata records created using the profile are going to be easily interoperable.
The development of XML-based metadata schemas or application profiles is becoming increasingly common. Early metadata standards were not very prescriptive about how their categories were named or laid out (most people were just using them as fields in their databases). However, as the use of XML has become more common and its potential for interoperability is being realised, XML encodings are being developed for most formal metadata schemas. See for example the recommended XML schemas for the Dublin Core.
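The 'mix and match' principle behind application profiles can be sketched in a few lines of Python. The example below is a minimal illustration rather than any published profile. It declares a profile that draws elements from the Dublin Core namespace and from a second, hypothetical local namespace (`http://example.org/local-terms/`), checks a record against it, and extracts the DC-only view that would be shared for basic interoperability. The element choices, obligations and sample records are all assumptions.

```python
from typing import Dict, List, NamedTuple, Tuple

DC = "http://purl.org/dc/elements/1.1/"
LOCAL = "http://example.org/local-terms/"  # hypothetical local namespace


class ProfileElement(NamedTuple):
    namespace: str
    name: str
    required: bool


# Hypothetical profile for a collection of learning resources
PROFILE = [
    ProfileElement(DC, "title", True),
    ProfileElement(DC, "creator", False),
    ProfileElement(DC, "subject", False),
    ProfileElement(LOCAL, "educationLevel", True),
]

# A record maps (namespace, element name) pairs to values
Record = Dict[Tuple[str, str], str]


def check_record(record: Record, profile: List[ProfileElement]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    allowed = {(e.namespace, e.name) for e in profile}
    for element in profile:
        if element.required and (element.namespace, element.name) not in record:
            problems.append(f"missing required element: {element.name}")
    for namespace, name in record:
        if (namespace, name) not in allowed:
            problems.append(f"element not in profile: {name}")
    return problems


def dc_view(record: Record) -> Dict[str, str]:
    """Keep only the Dublin Core elements, for sharing with the wider world."""
    return {name: value for (namespace, name), value in record.items() if namespace == DC}


# Invented sample records
record = {
    (DC, "title"): "Introduction to glazing",
    (LOCAL, "educationLevel"): "undergraduate",
}
incomplete = {(DC, "title"): "Untitled", (LOCAL, "shelfMark"): "X1"}
```

Here `check_record(record, PROFILE)` reports no problems, while the second record is flagged both for lacking the required local element and for using an element the profile does not include.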
This paper has concentrated on metadata schemas: the categories you use to describe and manage your digital images. However the use of standard categories will not guarantee that your collection can be efficiently searched or understood by your users. Metadata vocabulary standards will assist you in choosing specific terms to enter into those categories.
Metadata schemas frequently come with recommendations for vocabularies. VRA Core, for example, recommends that the Art and Architecture Thesaurus (AAT) is used with several of its categories. JISC Digital Media's metadata vocabularies advice document describes several different kinds of vocabularies and provides links to many examples available via the web.
Of course, some schema categories lend themselves more to formal vocabularies than others. While it makes good sense to draw on a thesaurus or word list to fill in a Subject or Format category, a Title or Description category will require a different approach. For these kinds of categories, data entry guidelines must be devised. Some metadata schemas provide this level of guidance to their users; others don't. Although there are some older precedents, a fairly recent trend is the development of separate content standards (e.g. CCO, RDA and DACS above). Their purpose is to provide detailed guidance on data entry to help ensure consistency and so improve interoperability.
Each metadata standard mentioned above has a conceptual model which underpins it. While not directly related to the primary objective of this paper, the notion of the conceptual model is still worth some consideration. In the context of metadata, and information management in general, the aim of the 'conceptual model' is to provide the means of achieving an efficient structure and flow of information in a specific problem area in a given domain. This is achieved through the modelling of relationships between the things or actions (concepts) that have been identified.
So, for example, a museum (i.e. the domain) may have to provide a system that can catalogue its objects and manage their life-cycle (i.e. the problem area). A typical museum object will have been made in a specific place, by a specific person, may be related to other objects, and may have had various conservation treatments, had images captured of it, been on loan to various places, etc. The resulting conceptual model could involve creating a series of entities developed from the concepts of 'object', 'person', 'place', 'image' and 'event', and then articulating how these entities are best described (i.e. what attributes they have) and how they relate to one another (e.g. a person 'makes' an object, an image 'contains' an object). Conceptual models are developed at a level of abstraction removed from the actual domain or problem area they are addressing, which means that one model can underpin the whole system that eventually manages, in this case, the museum's holdings.
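The museum example can be sketched as a very small entity-and-relationship model in Python. This is only an illustration of the general idea, using the 'makes' and 'contains' relationships mentioned above. The class names, the helper function and the sample catalogue entries are assumptions made for this example and do not follow any particular published model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    kind: str  # one of the concepts, e.g. "object", "person", "image"
    name: str


@dataclass(frozen=True)
class Relationship:
    subject: Entity
    predicate: str  # e.g. "makes" or "contains"
    target: Entity


def related_to(entity: Entity, predicate: str,
               relationships: List[Relationship]) -> List[Entity]:
    """Return every entity that stands in the given relationship to `entity`."""
    return [r.subject for r in relationships
            if r.predicate == predicate and r.target == entity]


# Hypothetical catalogue content, for illustration only
vase = Entity("object", "Glazed vase")
potter = Entity("person", "A. Potter")
photo = Entity("image", "vase_front.tif")

relationships = [
    Relationship(potter, "makes", vase),     # a person 'makes' an object
    Relationship(photo, "contains", vase),   # an image 'contains' an object
]

makers = related_to(vase, "makes", relationships)
images = related_to(vase, "contains", relationships)
```

Because the model only talks about entities and named relationships, the same structure could describe loans, conservation treatments or places without any change to the classes, which is the sense in which one model can underpin a whole system.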
Metadata schemas often acknowledge their conceptual framework (such as DC's one-to-one principle, or the hierarchical relationships implied in ISAD(G)). There are also attempts to provide separate, fully articulated conceptual models, which amount to metadata 'world views' of given domains. For example, the CIDOC CRM (Conceptual Reference Model), which began life in the mid-1990s as an attempt to establish a model for the description of museum and cultural heritage information, is now an ISO standard which is likely to underpin many future information sharing and semantic web initiatives. The Libraries domain has developed the Functional Requirements for Bibliographic Records (FRBR), which provides a model for the relationships between, and access and retrieval of, library holdings, primarily, though not limited to, bibliographic records. FRBR is also being applied in other areas (see current work on application profiles and attempts to harmonise CRM and FRBR). Both these models are detailed in Putting Things in Order - a Directory of Metadata Schemas and Related Standards.
Conceptual models can be useful in: (1) helping make explicit some of the assumptions implicit within metadata schemas; (2) informing the evolution of existing schemas or the development of new schemas; (3) assisting in the task of mapping two different, but related, schemas; and (4) developing database and other data management systems.
As we said in the introductory paper to this series, metadata can be extracted from, or embedded within, the digital file itself.
The first part of an image file - before all the pixel data - is the 'file header'. This contains information about the file itself (e.g. its format and order), but it also has some room for other data. Digital cameras make use of this space within the TIFF and JPEG formats to write EXIF data (information about the camera and its settings). Programs like Adobe Photoshop enable users to embed descriptive metadata according to a schema called IPTC. Adobe have also developed an XML-based metadata schema called XMP, which can potentially embed a wide variety of other metadata standards within image files, including DC. More information about these standards is provided in JISC Digital Media's Putting Things in Order - a Directory of Metadata Schemas and Related Standards.
There is some potential here for those developing digital collections to write their metadata directly into their digital files. This can be particularly useful in: (1) providing identification information, in case your image is separated from its usual context or is renamed; and (2) enabling you to record information about copyright or usage restrictions within the file.
However, there are also some problems with relying on embedded metadata: (1) it is usually very easy for someone to delete this information, either deliberately or inadvertently; (2) these standards are not supported across all common formats or accessible to all image optimisation or image management software; (3) writing data into some file formats (e.g. the popular JPEG format) can, with some software, involve re-compressing the image, which can compromise its quality.
There are many considerations to take into account when choosing the metadata standard(s) that best fit your resource and your users' needs. This paper has outlined some of these, which centre around: the type of resource you are creating; the domain you work in; the particular tasks you want to carry out; and the level to which you want your resource to complement and perhaps be searched alongside other related collections.
At one level, every resource will have its own idiosyncratic metadata needs, users and priorities. However, hopefully this paper has shown that this need not be at the expense of using common standards. Indeed, to ensure a managed resource that can engage users effectively and enjoy a degree of longevity, conformity to formal metadata standards at some level - more often than not adapted to fit your needs - is essential.
To recap, some of the factors likely to influence your decisions are:
- Your users and their needs - what kind of information do they require and expect?
- Your own needs as a collection manager - what information do you require to manage, deliver and preserve your collection?
- Your community's approach to metadata - are there clear standards being used by similar collections?
- Your legacy metadata - what metadata already exists, what form does it take?
- Existing systems - does the metadata need to work well within particular systems (e.g. library catalogues, VLEs)?
- Your resources - how much time can you allocate to cataloguing; can you really afford to fill in dozens of categories or do you need something simpler?
- The level of technical expertise available to you
- Interoperability - how important is it that your collection works alongside other collections?
- The future development of your collection - e.g. do you expect it to grow to include other formats or subjects?
All the tables from this guide:
- ALL tables in a zipped folder (ZIP)
- DC Element table (PDF)
- VRA Element table (PDF)
- CDWA Element table (PDF)
- SEPIADES table (PDF)
- Element and Qualifier Table (PDF)
- METS table (PDF)