Controlling your Language: a Directory of Metadata Vocabularies
This directory provides details of more than 70 vocabulary sources. It groups the vocabularies available to us into thesauri, subject headings, authority lists and classification schemes. Thesauri, subject headings and word lists more generally are used primarily to aid retrieval; classification schemes help us to organise resources; and authority lists help us to standardise the values used in our metadata, such as the way we enter names and dates. Although there are overlaps, each broadly serves a different purpose in controlling the terminology used in our schemas and in aiding the search and retrieval of our resources.
There are many vocabulary sources already available and it makes sense to check these out before inventing your own. Depending on your particular needs you might find yourself:
- Using an existing controlled vocabulary as it is
- Adapting or customising a vocabulary - e.g. deciding to use a classification or thesaurus to a particular level of detail
- Developing your own vocabulary - not recommended, but sometimes the best solution
- Using "uncontrolled" vocabulary - i.e. keywords entered by your cataloguers or, more radically, your users
Of course, you could also combine these approaches. It is quite reasonable to use multiple vocabularies: for example, a formal controlled vocabulary plus additional keywords that the cataloguer thinks will assist retrieval.
In choosing a vocabulary, you should bear in mind:
- Your users - are the terms used going to be meaningful to them?
- Your community - it makes good sense to use vocabularies that similar collections are using
- The nature and extent of your collection - if your collection is small, you're unlikely to need a highly detailed vocabulary
- The skills and available time of your cataloguing staff - some of these vocabularies will require experience or training to use properly
- Copyright issues - you may need to check whether permission or a licence is required to use the vocabulary in the way you wish to
This directory presents a selection of formal vocabularies, most of which are available via the Internet. Brief introductions are given to the different types of vocabularies and their uses.
Thesauri, subject headings and word lists are sources of subject terms and their primary purpose is to aid retrieval.
A thesaurus orders its words hierarchically. If you look up a particular term (e.g. houses), you are likely to find references to Broader Terms (e.g. buildings), Narrower Terms (e.g. cottages), or Related Terms (e.g. palaces - terms which are different, but overlap in meaning). Where there are different words with the same meaning (e.g. houses and dwellings), a thesaurus will also tell you which is the preferred term (e.g. "dwellings, USE houses"). The thesaurus's hierarchical structure is intended to help you find a suitable subject term at the appropriate level of detail.
Typical thesaurus entries:
houses
USE FOR dwellings
BROADER TERM buildings
NARROWER TERM cottages
RELATED TERM palaces
dwellings
USE houses
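The relationships described above can be sketched as a small data structure. This is only an illustration, assuming a plain Python dictionary keyed by preferred term; the field labels ("UF", "BT", "NT", "RT") and the function name `preferred_term` are hypothetical choices, not taken from any particular thesaurus software.

```python
# A minimal sketch of thesaurus entries, using the houses example from the text.
# The keys UF (use for), BT (broader), NT (narrower) and RT (related) are
# labels chosen for this illustration.
THESAURUS = {
    "houses": {
        "UF": ["dwellings"],   # non-preferred synonyms
        "BT": ["buildings"],   # broader terms
        "NT": ["cottages"],    # narrower terms
        "RT": ["palaces"],     # related terms
    },
    "buildings": {"NT": ["houses"]},
    "cottages": {"BT": ["houses"]},
}


def preferred_term(word, thesaurus=THESAURUS):
    """Return the preferred term for `word`, or None if it is not in the thesaurus."""
    word = word.strip().lower()
    if word in thesaurus:
        return word
    # Follow a "dwellings, USE houses" style reference.
    for term, entry in thesaurus.items():
        if word in entry.get("UF", []):
            return term
    return None


print(preferred_term("Dwellings"))
print(THESAURUS["houses"]["BT"])
```

A cataloguing form could call a function like this when a term is entered, so that a non-preferred synonym is replaced by the preferred term before the record is saved.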
Subject headings are often arranged like a thesaurus, so the distinction is not always clear. However, instead of giving you a single term or phrase to use, as a thesaurus does, subject headings often let you link or coordinate terms to produce longer phrases or strings of terms (sometimes referred to as 'pre-coordination'). For example, the Library of Congress Subject Headings (LCSH) bring together the concepts "Art" and "War" to form the heading "Art and war". You can further coordinate this with headings for particular wars, for example "World War, 1939-1945 -- Art and the war" (this latter example, using the '--' notation, is known as 'subdivision': dividing up a main concept with another concept). The published LCSH is very large, including 270,000 pre-formed headings, but because headings can be coordinated and subdivided, the number of potential headings is far greater.
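As a rough illustration of subdivision, a subdivided heading can be treated as a main heading joined to one or more subdivisions by a separator. The function names `subdivide` and `split_heading` are hypothetical, and the separator is left as a parameter because display conventions vary between systems.

```python
def subdivide(main_heading, *subdivisions, sep=" -- "):
    """Join a main heading and its subdivisions into one pre-coordinated string."""
    return sep.join([main_heading, *subdivisions])


def split_heading(heading, sep="--"):
    """Split a subdivided heading back into its component parts."""
    return [part.strip() for part in heading.split(sep)]


heading = subdivide("World War, 1939-1945", "Art and the war")
print(heading)
print(split_heading(heading))
```

Splitting a heading back into its parts is one way a search system might index each component separately while still displaying the full string.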
Sometimes people use thesauri to generate subject headings, for example "buildings - houses - cottages" (from our thesaurus example above). This goes against traditional indexing practice, which insists that you take the thesaurus term at the appropriate level and don't include any of its broader terms, but it can make good sense in the age of digital retrieval. If we only added "cottages" to a record, a search on "buildings" would not retrieve it (unless the search software was quite sophisticated). So in this example, including the broader terms in the hierarchy would greatly improve the search results. Some cataloguing systems now do this automatically: if you choose a term from their thesaurus, they insert all of the broader terms into the catalogue record. This kind of practice is blurring the distinction between thesauri and subject headings.
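The automatic insertion of broader terms could work along the following lines. This is a sketch under stated assumptions, not a description of any real cataloguing system: the `BROADER` mapping and the function name `with_broader_terms` are hypothetical, and the hierarchy is the buildings, houses, cottages example used here.

```python
# Each term maps to its immediate broader term(s). A term may have more than
# one broader term, so the values are lists.
BROADER = {
    "cottages": ["houses"],
    "houses": ["buildings"],
    "buildings": [],
}


def with_broader_terms(term, broader=BROADER):
    """Return `term` followed by every broader term above it, nearest first."""
    result = []
    queue = [term]
    while queue:
        current = queue.pop(0)
        if current in result:
            continue  # already recorded; also guards against loops in the data
        result.append(current)
        queue.extend(broader.get(current, []))
    return result


record_terms = with_broader_terms("cottages")
print(record_terms)
print("buildings" in record_terms)
```

With the expanded list stored in the record, a plain keyword search on "buildings" would match an item indexed only as "cottages".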
We've included the term "word lists" in our heading for this section to catch the simpler lists of words that are not coordinated like subject headings or organised hierarchically like thesauri. These sorts of vocabularies are, typically, simple alphabetical lists of terms or phrases. They're also often created locally, for particular projects or institutions. The IEEE 1998 Keyword List (see below) offers an example of such a word list, although this is probably much longer than any list you would produce 'in-house'.
2.1 General thesauri, subject headings and word lists
- Australian Pictorial Thesaurus (APT)
- Library of Congress Moving Image Genre-Form Guide
- Library of Congress Subject Headings
- Library of Congress Authorities
- National Digital Archive of Datasets (NDAD) Thesaurus (based on the UNESCO Thesaurus)
- SEARS Subject Headings
- Thesaurus for Graphic Materials (TGM) 1: Subject Terms
- Thesaurus for Graphic Materials (TGM) 2: Genre and Physical Characteristics
- Thinkmap Visual Thesaurus
- UK Archival Thesaurus (UKAT) (based on the UNESCO Thesaurus)
- UNESCO Thesaurus
- WordNet (Princeton University)
2.2 Specialist thesauri, subject headings and word lists
2.2.1 Arts and Humanities
- Art and Architecture Thesaurus (AAT)
- ARTLex Art Dictionary
- British Museum Materials Thesaurus
- British Museum Object Names Thesaurus
- Glossary of Technical Theatre Terms
- HASSET - Humanities and Social Science Electronic Thesaurus (UK Data Archive)
- ICOM Vocabulary of Basic Terms for Cataloguing Costume
- International Index to Film Periodicals: Subject Headings
- National Monuments Record Thesauri - set of thesauri relating to monuments and structures
- Understanding Illuminated Manuscripts: A Guide to Technical Terms
- Words of Art
- Agrovoc Thesaurus (UN Food and Agriculture Organisation)
- Alexandria Digital Library Feature Types Thesaurus - categorises features relating to geographic locations
- Biocomplexity Thesaurus
- BIOSIS Controlled Vocabulary - covers life sciences
Covers applied life sciences
- Canadian Thesaurus of Construction Science and Technology
- Connecting Mathematics
- eHealth Thesaurus
- ERIC (Education Resources Information Center) Thesaurus - covers educational topics
- General Multilingual Environmental Thesaurus (GEMET)
Covers geological topics
- IEEE 1998 Keyword List - list of keywords, mostly relating to electronics and computing
- INSPEC Thesaurus (Institution of Electrical Engineers) - covers Information Technology topics
- MeSH Medical Subject Headings (US National Library of Medicine)
- Multilingual Thesaurus of the Geosciences (MULTHES)
- NASA Thesaurus - Vol 1 and Vol 2
- National Agricultural Library's Thesaurus (NALT)
- Terms of the Environment
- Zoological Record Thesaurus
2.2.3 Social Science
- ASIS Thesaurus of Information Science
- British Education Index Thesaurus
- Eurovoc Thesaurus (European Union) - covers topics relating to the EU
- GEM Controlled Vocabularies
- Global Legal Information Network (GLIN) Thesaurus
- International Thesaurus of Refugee Terminology (ITRT)
Classifications are sources of subject categories and their primary purpose is to organise resources.
Traditionally, the main purpose of subject headings and thesaurus terms was retrieval, while classification schemes were more about putting things 'in their place' on a shelf, in a box, into a category, etc. Generally (there have always been exceptions), an item would be assigned many different subject terms, but only one classification. This makes perfect sense in a physical world, but in the virtual world there is no reason why something shouldn't have more than one 'location'. So the distinction between classifications and subject terms is beginning to break down.
Classifications are usually hierarchical: they start with broad subject areas and break them down into increasingly narrow topics. In this way they resemble thesauri, but classifications are generally much more rigid in their structure. While it is entirely feasible for a thesaurus term to have more than one broader term (this is known as 'polyhierarchy'), a classification scheme will break down its subject domain in just one way. Because of this, classifications offer a single 'world view', imposing a structure that is never going to satisfy every user. And, unlike thesaurus terms, classification schemes declare their structural biases openly through the numbers and codes they employ. For example, in the Dewey Decimal Classification, resources on Buddhism are usually classified under "294". These digits are meaningful: the 200s are for "Religion"; the 290s, "Other and comparative religions" (note that most of the numbers from 200-289 are devoted to Christianity); and the 294s, "Religions of Indic Origin". Here the nineteenth-century Western world view upon which the Dewey classification is based becomes apparent.
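To make the point about meaningful digits concrete, the sketch below derives the chain of broader classes from a Dewey-style number. It assumes only the three-digit part of the number matters and ignores anything after a decimal point; the `CAPTIONS` table holds just the three captions quoted above, and the function name `hierarchy` is hypothetical.

```python
# Captions quoted in the text; a real scheme has many more.
CAPTIONS = {
    "200": "Religion",
    "290": "Other and comparative religions",
    "294": "Religions of Indic Origin",
}


def hierarchy(code):
    """Return the three-digit classes from broadest to most specific for `code`."""
    digits = code[:3]
    levels = []
    for n in range(1, len(digits) + 1):
        # Keep the first n digits and pad with zeros: 2 -> 200, 29 -> 290.
        level = digits[:n].ljust(3, "0")
        if level not in levels:
            levels.append(level)
    return levels


for level in hierarchy("294"):
    print(level, CAPTIONS.get(level, "(caption not listed here)"))
```

Reading the output from top to bottom reproduces the path from "Religion" down to "Religions of Indic Origin" that the notation encodes.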
The classification scheme's use of codes or numbers is the other important feature that distinguishes it from other kinds of controlled vocabulary, which are word-based. This coding can be exploited in a digital context, especially if it is based on a decimal system, like Dewey or the UDC (see below). Because such codes are much more "machine-readable" than words, they lend themselves to searching. For example, searching for all the Dewey classifications beginning with "2" would retrieve items relating to religion. They can also be used to generate hierarchical browse interfaces: users might be shown the first 10 subject categories, then choose one of these to view 10 sub-categories, then one of these to look at the next level, and so on. Some of those building online collections are taking advantage of these opportunities.
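Both uses of the notation, prefix searching and level-by-level browsing, can be sketched in a few lines. The records below are invented for illustration, and the function names `search_by_prefix` and `browse` are hypothetical; decimal points are simply removed so that each digit counts as one level.

```python
from collections import Counter

# Made-up catalogue records: (title, class number).
RECORDS = [
    ("Introduction to Buddhism", "294.3"),
    ("Early church history", "270"),
    ("Garden birds", "598"),
    ("Religions of India", "294"),
]


def digits_only(code):
    """Drop the decimal point so each digit is one level of the hierarchy."""
    return code.replace(".", "")


def search_by_prefix(records, prefix):
    """Return titles whose class number begins with `prefix`."""
    return [title for title, code in records
            if digits_only(code).startswith(prefix)]


def browse(records, prefix=""):
    """Count records in each sub-category one digit below `prefix`."""
    depth = len(prefix) + 1
    codes = (digits_only(code) for _, code in records)
    return Counter(code[:depth] for code in codes
                   if code.startswith(prefix) and len(code) >= depth)


print(search_by_prefix(RECORDS, "2"))
print(browse(RECORDS))
print(browse(RECORDS, "2"))
```

Calling `browse` repeatedly with a longer prefix each time gives the drill-down behaviour described above: top-level categories first, then the sub-categories of whichever one the user picks.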
3.1 General classifications
- BLISS Classification
- Book Industry Communication (BIC) Standard Subject Categories
- Dewey Decimal Classification (DDC)
- Library of Congress Classification (LCC)
- Open Directory Project categories - entire category structure can be downloaded in XML
- Universal Decimal Classification (UDC)
3.2 Specialist classifications
3.2.1 Arts and Humanities
Covers art history iconography
- ACM Computing Classification System
- Mammal Species of the World
- Mathematics Classification Scheme
- Physics and Astronomy Classification Scheme
3.2.3 Social Science
- Learning Directory Classification System - course classification system
- International Standard Classification of Occupations (ISCO)
- International Family of Economic and Social Classifications (United Nations)
- JACS educational subject classifications
- Classification of HE disciplines and subjects
- North American Industry Classification System (NAICS)
- PAIS International Broad Topics Classification System
- United Nations Standard Products and Services Code (UNSPSC)
- Integrated Public Sector Vocabulary (IPSV) - covers topics relating to the UK government
Authority lists help you control names.
The other main grouping of controlled vocabularies is "authority lists" or "authority files". These are sources of proper nouns (e.g. people, organisations, places). Names and places could be included in general thesauri or subject headings, but it makes sense to keep them in separate lists or databases.
Some institutions are cataloguing resources related to internationally known figures, such as authors and artists. In these instances, they can draw on common authority lists like the Library of Congress Authorities (see below), which includes the names of nearly 4 million individuals. Other institutions, like archives and museums, have resources relating to people who are not widely known. They cannot draw these names from a common list, but must create their own authority records, using particular rules like the ISAAR(CPF) or the UK National Council on Archives Rules (see below).
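A locally maintained name authority can be pictured as a table that maps every known variant of a name to one authorised form. The sketch below is illustrative only: the person, the variants and the function names are invented, and it does not follow the detail of ISAAR(CPF) or any other rule set.

```python
# Hypothetical authority records: authorised heading -> known variant forms.
AUTHORITY = {
    "Smith, John, 1820-1890": ["John Smith", "J. Smith", "Smith, J."],
}


def normalise(name):
    """Lower-case a name and collapse full stops and extra spaces for matching."""
    return " ".join(name.lower().replace(".", " ").split())


def build_index(authority):
    """Map every normalised heading and variant to its authorised heading."""
    index = {}
    for heading, variants in authority.items():
        for name in [heading, *variants]:
            index[normalise(name)] = heading
    return index


def authorised_form(name, index):
    """Return the authorised heading for `name`, or None if it is unknown."""
    return index.get(normalise(name))


INDEX = build_index(AUTHORITY)
print(authorised_form("j. smith", INDEX))
print(authorised_form("Jane Doe", INDEX))
```

Names that return nothing are the ones a cataloguer would need to check and, where appropriate, add as a new authority record.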
There are several online sources of place names listed below. Other sources could include atlases or official maps. Increasingly, digitisation projects are adding other place references such as postcodes or geospatial coordinates.
4.1 Name authorities
- International Standard Archival Authority Record for Corporate Bodies, Persons, and Families (ISAAR(CPF))(International Council on Archives)
- Library of Congress Authorities
- Rules for the Construction of Personal, Place and Corporate Names (UK National Council on Archives)
- Union List of Artist Names (ULAN)
4.2 Place authorities
- Alexandria Digital Library Gazetteer Service
- Getty Thesaurus of Geographic Names (TGN)
- Global Gazetteer
- Library of Congress Authorities
If chosen and managed carefully, controlled vocabularies can make cataloguing easier and improve the retrieval and presentation of items from your collection. Careful choice and management matter because:
- Vocabularies can improve retrieval, but only if the terms are obvious and meaningful to your users. If you're using a small or specialised vocabulary, it will greatly assist your users if they can call up a full list of the terms you've used. Consider using multiple vocabularies and adding additional keywords (in another field) to aid retrieval.
- Vocabularies can improve cataloguing consistency, but only if everyone cataloguing your resources is using the vocabulary in a consistent way. It is vital that you write clear guidelines on what aspects of the image or object to describe, produce scope notes for the terms used (including examples), and regularly check and compare work.
- Vocabularies can help your collection 'interoperate' with other collections, but only if you're using the same vocabularies and in the same way. If you adapt or customise a vocabulary you must record the changes you've made and you should be aware that you're reducing the chances of fully interoperating with others.
- Some vocabularies will save you time and money; others might add to your costs, especially if they require a lot of expertise or intellectual effort. Where possible, automate the cataloguing process: use thesauri management software, cut and paste rather than retype.
JISC Digital Media has more information about metadata:
- An Introduction to Metadata
- Metadata Standards and Interoperability
- Metadata and Digital Images
- Metadata and Audio Resources
- Metadata and Digital Video
- Putting Things in Order: a Directory of Metadata Schemas and Related Standards
- Approaches to Describing Images
JISC Digital Media also runs a training course:
Other vocabulary information on the Web:
- Taxonomy Warehouse (Synapse) - database of vocabulary sources
- HILT Project (University of Strathclyde) - project looking at mapping different vocabularies for interoperability
Hosts several online thesauri and provides links to others
- Barbara Lute's Web Thesaurus Compendium - links to vocabulary sources