Approaches to Describing Images
The following paper provides background on some of the methods for describing images that may be used by those building image collections to share in support of research, teaching and learning.
While the emphasis is on describing digital still images, those developing collections in other multimedia formats - particularly video - may find some of the methods and research outlined in this paper applicable.
Digital images can be very complex to describe, and before you can say anything about an image you need to be clear about what you are actually focusing on. For example, are you more concerned with the content depicted within the image (e.g. an object, a place or a person), with the analogue image itself (e.g. the painting or photograph), or with both?
A digital image may contain many different layers (e.g. a particular landscape, a drawing of that landscape, a photograph of that drawing, a digital scan of that photograph). Each of these "images within the image" will have its own context (e.g. a geographical location, an art collection, a photo album, a folder on a server). Each will also have its own particular history (e.g. when they were made, and by whom). How much of their context and history do you need to record?
Therefore, in developing metadata for your collection, you will first need to decide what it is you are describing and how best to express the relationships between the various possible 'layers' of your image. You will then need to decide which characteristics or categories need to be recorded for each, and think through the level of detail your metadata requires. To enable effective re-use of your images, do your users require only some comparatively straightforward low-level descriptors such as 'title', 'creator', 'format', 'colour' and 'size'? Or do they need information at a more abstract, higher level of meaning, such as the feelings or emotions an image elicits ('happy', 'sad', 'love', 'anger'), or fuller descriptions of the image content?
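As a rough illustration of the layering discussed above, the sketch below models a digital scan, the photograph it reproduces, and the drawing that photograph depicts as a chain of nested records. All field names here are hypothetical and not drawn from any metadata standard:

```python
# Hypothetical sketch of a layered image record: the digital scan,
# the photograph it reproduces, and the drawing that photograph depicts.
# Field names are illustrative only, not taken from any metadata standard.

digital_scan = {
    "title": "Scan of photograph of landscape drawing",
    "creator": "Digitisation unit",          # low-level descriptor
    "format": "TIFF",
    "colour": "24-bit RGB",
    "size": "4000x3000",
    "subject_terms": ["landscape"],          # content-oriented terms
    "abstract_terms": ["calm"],              # higher-level, subjective terms
    "derived_from": {                        # the 'image within the image'
        "type": "photograph",
        "context": "photo album",
        "derived_from": {
            "type": "drawing",
            "context": "art collection",
        },
    },
}

# Walk the chain of layers to list every context that might need recording.
layer = digital_scan
contexts = []
while layer:
    contexts.append(layer.get("context", "digital file"))
    layer = layer.get("derived_from")
print(contexts)  # ['digital file', 'photo album', 'art collection']
```

Even this toy structure makes the scoping question concrete: each nested layer multiplies the fields a cataloguer could record, so deciding how deep to go is a policy decision, not a technical one.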
Also, bear in mind the constraints inherent in providing text-based descriptions of images. Limited by the knowledge, culture, experience and point of view of the cataloguer, your metadata can only ever partially capture the information, meaning, context and possible uses of an image in a given setting. Much of the work involved in scoping image metadata is therefore deciding on what detail the identifiable users of your images require, and how best to achieve this detail given the likely constraints of time and resources available.
This paper suggests two alternative ways of approaching the task of describing images: (a) getting your users to tell you what they see; (b) drawing on some of the research that's been done into image analysis and retrieval.
Ask your users to tell you what they see in your images
Do not assume you know how your users will view your images. If possible, try gathering together a representative sample of your images and your users. Hand each user a pad and pencil and ask them to have a go at describing what they see. If you are unable to gather your users in the same physical space, you might consider uploading a number of your images to a secure Flickr page and asking your users to 'tag' them.
This kind of exercise is not just good fun; it can be genuinely enlightening. It should give you some clues about (1) your users' focus (the image, its content or its context); (2) the particular categories of information they are interested in knowing; and (3) the kinds of terminology they will understand and use in their searches. The exercise is especially good at helping to identify subject terminology, which tends to be the most difficult to decide on when cataloguing.
Consider some of the research into image analysis and retrieval
Some people have taken a much more systematic approach to describing images, developing complex theories and models. This research often crosses disciplinary boundaries, drawing on theory from art history, psychology, information science and computer science.
In the mid-twentieth century, art historian Erwin Panofsky (1962) published his theories about the iconography of Renaissance art. He suggested that there might be three levels at which an art work could be described: the pre-iconographical (generic things in the image), iconographical (specific things), and iconological (symbolic things). The table below shows an example of this approach.
Image: Wikimedia Commons
|Pre-iconographical||woman|
|Iconographic||Mona Lisa, Italy, three-quarter portrait|
|Iconological||beauty, nature, contentment|
In the 1980s, information scientist Sara Shatford (1986) applied Panofsky's model to indexing images. She relabelled Panofsky's terms as Generic (pre-iconographic), Specific (iconographic) and Abstract (iconological) and extended the model further by breaking each of these three levels into four facets: Who, What, Where and When. Using the example above, a Generic Who might be "woman", while a Specific Who would be "Mona Lisa". A Specific Where could be "Italy", while an Abstract Where might be "nature". The 3x4 matrix Shatford produced is often referred to as the Panofsky/Shatford model and it has been used in several studies to analyse still and moving images.
Shatford developed her theories further, making a distinction between the things an image is Of (objective things, either generic or specific) and the things an image is About (more subjective or abstract meanings). She also moved beyond just describing the visual content of the image (i.e. its Subject) to consider other attributes an image might have, which she termed: Biographical (i.e. the history or 'life' of the image), Exemplified (the type of image it is an example of, e.g. a painting), and Relationship (how this image relates to other images, e.g. other versions or formats) (see Shatford Layne 1994).
Computer scientists Alejandro Jaimes and Shih-Fu Chang (2000) came to the challenge of analysing images from a background in Content-Based Image Retrieval (CBIR, the automatic recognition and retrieval of images by computers based on colour, texture or shape). Jaimes and Chang built on the work of Panofsky, Shatford and others, producing a pyramid model for describing images (see diagram below).
Jaimes and Chang (2000) - reproduced with permission
Layers 5-10 of this pyramid can be seen as incorporating the Panofsky/Shatford model (Generic, Specific, Abstract). Jaimes and Chang classed these layers as semantic or conceptual information. Coming from their CBIR perspective, they stressed that there is also some perceptual information within an image, which they located in layers 2-4 of their pyramid model. Global distribution is the overall colour or texture of the image, Local structure is the colour or texture of particular features within the image, while Global composition describes the arrangement or layout of these features. At the very top of their pyramid they added a layer to describe what sort of image it is (e.g. photograph, print, painting).
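The pyramid can be sketched as an ordered list of layers, from the top (least knowledge needed) to the bottom (most knowledge and interpretation needed). The layer names below follow the description in the text; the exact labels and the split into 'perceptual' and 'semantic' are this sketch's reading of the model, not a verbatim reproduction of the paper:

```python
# A sketch of the Jaimes/Chang pyramid as an ordered list of layers.
# Labels are this sketch's paraphrase of the model described in the text.

pyramid = [
    (1,  "Type/Technique",      "type"),        # what sort of image it is
    (2,  "Global Distribution", "perceptual"),  # overall colour/texture
    (3,  "Local Structure",     "perceptual"),  # colour/texture of features
    (4,  "Global Composition",  "perceptual"),  # arrangement of features
    (5,  "Generic Objects",     "semantic"),
    (6,  "Generic Scene",       "semantic"),
    (7,  "Specific Objects",    "semantic"),
    (8,  "Specific Scene",      "semantic"),
    (9,  "Abstract Objects",    "semantic"),
    (10, "Abstract Scene",      "semantic"),
]

# The layers a CBIR system could plausibly analyse automatically are the
# perceptual ones; the semantic layers need a human cataloguer.
automatic = [name for _, name, kind in pyramid if kind == "perceptual"]
print(automatic)
# ['Global Distribution', 'Local Structure', 'Global Composition']
```

Expressed this way, the model doubles as a checklist: for each layer you can ask whether your collection needs it described at all, and if so, whether a machine or a cataloguer will do the describing.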
You will see that the conceptual layers of the pyramid (5-10) are divided into objects and scenes. Again, drawing on their CBIR background, they are stressing that you might use some terms to describe the image as a whole (a scene) and others that relate to a particular element within the image (an object). In addition to the pyramid, Jaimes and Chang acknowledge that there will be some external or non-visual information that relates to an image, such as its history and context (this corresponds to Shatford's Biographical information).
The arrows along the top of the Jaimes/Chang pyramid are meant to indicate that more knowledge is required as you move down the pyramid to describe its lower layers. The upper (perceptual) layers require very little knowledge and can be safely left to a computer program to analyse. The lower (semantic) layers must be described by humans and require increasing amounts of knowledge (e.g. it takes more knowledge to recognise that the image is of "Mona Lisa" rather than a "woman"). The very lowest layers (Abstract Objects and Abstract Scene, equivalent to Shatford's About) require the cataloguer to interpret the meaning of the image, which is obviously a very difficult and subjective task.
Although none of these theoretical models are cataloguing templates, they can be useful in analysing the different aspects of an image and can aid you in making decisions about your cataloguing practice. Will your subject keywords or description include information About the meaning of the image, or will you try to steer away from subjective interpretations and concentrate on the more objective aspects of the image ( Generic or Specific)? Would it be useful to include some terms relating to the perceptual aspects of the image (e.g. colour or composition), or could you perhaps employ some technology (e.g. CBIR) to help your users search on this sort of information? What other non-visual information might you need to record about the image (Biographical, Relationship)?
This paper has provided a simple account of some very complex theory, so if you're interested in exploring this further, take a look at some of the more recent papers listed below and follow up the references they cite.
Although it's likely you'll want to make use of some of the formal standards described in the other papers in this series, it's useful to first consider the ways your users will approach your images and the different aspects of a visual work. Even if you're sticking very closely to established standards, you will still have some choices to make - especially at the level of vocabulary (the specific terms you apply to your images). Having a clear understanding of your images and the way your users view them will enable you to more critically evaluate potential metadata schemas and vocabularies and to assemble a metadata framework that works well for both your users and your cataloguers.
- Hollink, L. et al. (2004). Classification of user image descriptions. International Journal of Human-Computer Studies, 61(5). [accessed January 2010]
- Jaimes, A. and Chang, S.-F. (2000). A conceptual framework for indexing visual information at multiple levels. In: IS&T/SPIE Internet Imaging, vol. 3964. [accessed January 2010]
- Panofsky, E. (1962). Studies in iconology. Harper & Row, New York.
- Shatford, S. (1986). Analyzing the subject of a picture: a theoretical approach. Cataloging & Classification Quarterly, 6(3):39-62.
- Shatford Layne, S. (1994). Some issues in the indexing of images. Journal of the American Society for Information Science, 45(8):583-588.
- Shatford Layne, S. (2002). Subject Access to Art Images. In: Baca, M. (ed). Art image access: issues, tools, standards, strategies. Getty, Los Angeles. [accessed January 2010]