Optical Character Recognition (OCR)
This advice document looks at the process of converting printed type into a computer-readable format. The first commercial optical character recognition (OCR) applications appeared in the 1950s, but it wasn't until the early 1990s that 'off the shelf' products became available. This document looks at the OCR process and at how the condition of the original document and the capture technique can affect the quality of the digital output.
There is a huge volume of printed text documents which do not have a digital master file. Much of it is ephemeral and of little historic or academic interest, while some of it has a value and may need to be preserved.
Some of these documents may be fragile and need to be captured before they are lost forever; others need to be converted into a format better suited to the modern digital workflow, or made available to a new or wider audience via text-to-speech applications.
Printed text documents can be captured in the same way as we would digitise a photograph or other 2D object. A standard scan of text is purely a record of the tonal values of the document. If we want to 'read' the document as editable text we need to use an OCR application which can convert the digitised type into a readable format.
The OCR process
There are numerous OCR applications available and they all work in more or less the same way:
- The original object is captured using a camera or scanner - this may be carried out directly from within the OCR application or via a separate scanning program
- The image is then opened in the OCR application - it may identify the areas of type automatically, but if the original is in poor condition or the capture technique was unsatisfactory the application may need some help
- The software should detect the areas of the image that contain readable text - it may also identify photographs, graphics, and captions. The user may then be able to select the areas to read and their order in the document before starting the OCR process. The application might also capture the images separately if required
- The OCR application will normally highlight the letters it has 'guessed' or it feels are suspect. In our tests with good quality type and an effective capture technique, these highlighted guesses were normally correct
- Occasionally other non-text elements are 'read' by the application. The image in the Resolution section (below) shows the woman's arm, which has been recognised as the letter 'J'
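The highlighting of suspect characters described in the list above can be thought of as a confidence threshold. The sketch below is illustrative only: the (word, confidence) pairs, the `flag_suspects` helper and the 0.90 threshold are assumptions made for this example and do not describe how any particular OCR application scores its output.

```python
from typing import List, Tuple


def flag_suspects(words: List[Tuple[str, float]], threshold: float = 0.90) -> str:
    """Return the text with low-confidence words wrapped in [? ... ?] markers."""
    marked = []
    for word, confidence in words:
        if confidence < threshold:
            # Below the (assumed) threshold: mark the word for proof reading.
            marked.append(f"[?{word}?]")
        else:
            marked.append(word)
    return " ".join(marked)


# Hypothetical recogniser output: each word with a confidence between 0 and 1.
sample = [("Optical", 0.99), ("Character", 0.97), ("Recogniti0n", 0.62)]
print(flag_suspects(sample))  # prints: Optical Character [?Recogniti0n?]
```

Raising the threshold flags more words for checking, which increases proof reading time but reduces the chance of an error slipping through unnoticed.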
Unedited OCR text is referred to as 'dirty' or 'raw' OCR. While the process of scanning and OCRing may be quite quick, proof reading dirty OCR can be time consuming and therefore expensive. Some collections store the dirty OCR and postpone 'cleaning' it up until they have the time or resources.
OCR application window showing original text on the left, with detected type fields in green boxes. Images or unreadable data appear in red fields. The converted type appears in the right frame with any suspect words or characters highlighted. Note that some of the creases in the paper have also been 'recognised' as type.
Some OCR applications offer automated features which will batch process a number of scanned documents in succession. When used with a scanner fitted with a document feeder a decent OCR application should be able to convert a good quality unbound document with a minimum of operator input.
Cleaning up dirty OCR is a laborious process - if the quality of the capture is improved then the time required to proof read is reduced.
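One way to put a figure on how much cleaning a sample needs is to compare the dirty OCR with a proofread version of the same passage. The following is a minimal sketch, assuming a character-level edit distance (Levenshtein distance) is an acceptable measure; the function names are our own and the accuracy figures quoted elsewhere in this document were not necessarily calculated this way.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def character_accuracy(proofread: str, raw_ocr: str) -> float:
    """Fraction of the proofread text that the raw OCR reproduced correctly."""
    if not proofread:
        return 1.0 if not raw_ocr else 0.0
    errors = edit_distance(proofread, raw_ocr)
    return max(0.0, 1.0 - errors / len(proofread))


# 'rn' misread as 'm' is a classic OCR confusion: two edits in six characters.
print(round(character_accuracy("modern", "modem"), 2))  # prints: 0.67
```

Measuring a short proofread sample in this way can help estimate how long it will take to clean up the rest of a collection.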
There are a few factors that can influence the success rate. Some factors, such as the typeface used in the original, are unavoidable. However, the capture process can have a direct effect on the results:
- If there is a choice of originals choose the best quality example free from annotations or other marks
- Scanners should be free of dust or marks. The original should be as flat as possible on the glass and aligned square with the sides of the scanner
- The camera's sensor and lenses should be as dust free as possible. The image should be taken straight on with a lens that doesn't introduce distortions
- The object should be in focus across the image area; if necessary, a smaller aperture can be used to increase the depth of field
- The object should be evenly lit and captured at the highest possible resolution
- Over-aggressive JPEG compression should be avoided, as should sharpening or other automated 'in camera' treatments
- The camera should be carefully focussed and mounted on a tripod or copy stand if possible to avoid camera shake
There is a longstanding debate on the choice of typeface for optimum human legibility. Like people, OCR applications also prefer certain typefaces. The images below show the results from capturing commonly used computer typefaces as well as a script font and handwriting.
OCR application showing recognition results for Times New Roman, suspect characters and words in right frame
OCR application showing recognition results for Courier, suspect characters and words in right frame
OCR application showing recognition results for script typeface, suspect characters and words in right frame
OCR application showing recognition results for handwriting, suspect characters and words in right frame.
The example using the widely used Times New Roman font shows a handful of suspect characters, though the application interpreted them all correctly. This example is 100% accurate.
The second test uses Courier, a font that closely resembles typewriter text. This again produced a few suspect characters, which differed from those in the Times New Roman example. These suspect words, however, were all correct and so, like Times New Roman, the output was 100% accurate.
The script typeface, which is harder for the human eye to read, also presented a challenge for the OCR application. Most of the text was highlighted or wrongly read.
The last example was handwritten and while the human eye could read it, it couldn't be converted by the OCR application.
The majority of original documents will be in regular typefaces and so, based on these tests, the success rates should be very high. Some documents, however, might have handwritten annotations which may be important. If so, these should be keyed in manually or scanned and stored as PDF image files.
One might assume that recognition accuracy falls as the type gets smaller. However, in our tests we scanned the same text printed in 12, 10 and 8 point Times New Roman and the results showed a single suspect character for each sample size. All of these suspect characters were captured correctly and so the OCR output was 100% accurate.
Paper colour
OCR applications are most effective at reading type against white paper; coloured paper may lack contrast and reduce recognition rates.
OCR application showing a low level of suspect words but failing to recognise half of the title. Scanned at 300ppi
The image above shows that the OCR application has managed to recognise the type in the grey border on the left and the orange bar at the bottom, but it has been unable to read the orange part of the title Intermedia or the JISC logo against the grey masthead.
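The lack of contrast between coloured type and a coloured background can be roughly quantified. The sketch below uses the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG), a measure designed for human readability on screen. It is used here only as a convenient proxy and is not how the OCR application decides what it can read. The orange and grey values are hypothetical and were not measured from the document shown above.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def relative_luminance(colour: RGB) -> float:
    """Relative luminance of an 8-bit sRGB colour, as defined by WCAG 2.x."""
    def linearise(channel: int) -> float:
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearise(value) for value in colour)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(first: RGB, second: RGB) -> float:
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(first), relative_luminance(second)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


BLACK, WHITE = (0, 0, 0), (255, 255, 255)
# Hypothetical colours standing in for an orange title on a grey masthead.
ORANGE, GREY = (255, 140, 0), (128, 128, 128)

print(f"black on white: {contrast_ratio(BLACK, WHITE):.1f}")
print(f"orange on grey: {contrast_ratio(ORANGE, GREY):.1f}")
```

With these assumed colours the orange on grey ratio is a small fraction of the black on white ratio, which is consistent with the difficulty the application had with the coloured part of the title.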
Original document condition
If the only original version of the document is damaged it may still be possible to process it in an OCR application, though the accuracy of the recognition may be reduced. If the original were of value or in very poor condition it would be wise to consult a conservator before proceeding.
In the example below the application has been confused by the creases crossing the body text. The recognition rate is considerably lower than some of the examples above and it has also read some of the creases below the main text area as suspect words and characters. It has also detected the tear at the top of the page as an image element.
OCR application showing results of damaged document
Resolution
The resolution you use to capture the original document will directly affect the accuracy of the recognition, though a higher resolution will slow down the process and occupy more storage space. Resolution is particularly important when capturing smaller type.
OCR application window of document scanned at 100ppi
The above image shows the results from a lower resolution scan. There are considerably more suspect characters and words (highlighted in green) than in the example in the Paper colour section (above), which was scanned at 300ppi. At the lower resolution, the woman's right arm has been read as an orange letter 'J'.
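The trade-off between resolution, type size and storage can be estimated with simple arithmetic. The sketch below assumes an A4 page (8.27 x 11.69 inches) stored as uncompressed 24-bit colour; the page size, the colour depth and the function names are assumptions made for illustration.

```python
def type_height_px(point_size: float, ppi: int) -> float:
    """Height in pixels of the type body at a given resolution.

    One point is 1/72 inch. Individual letters are shorter than the
    full body height, so this is an upper bound on the detail available.
    """
    return point_size * ppi / 72


def uncompressed_bytes(width_in: float, height_in: float, ppi: int,
                       bytes_per_pixel: int = 3) -> int:
    """Size of an uncompressed scan (3 bytes per pixel = 24-bit colour)."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bytes_per_pixel


A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69  # assumed page size

for ppi in (100, 300):
    size_mb = uncompressed_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, ppi) / 1_000_000
    print(f"{ppi}ppi: 8pt type is {type_height_px(8, ppi):.1f}px tall, "
          f"page is about {size_mb:.1f}MB uncompressed")
```

Under these assumptions 8 point type is only about 11 pixels tall at 100ppi compared with about 33 pixels at 300ppi, while the 300ppi file is roughly nine times larger.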
If the original document is in good condition and it has been digitised with care, then the accuracy of the output should be high. OCR applications have improved significantly over the years and are capable of recognising text captured with both scanners and cameras. Even images of documents captured using camera phones can be processed if they were photographed carefully under bright light, though this is a far from ideal method of capture.
OCR applications will deliver greater accuracy if the captured text is sharp against the white paper. When scanning, this simply requires the paper to be held as flat as possible against the glass and captured at a reasonable resolution (see above).
If a camera is used it should be firmly supported on a tripod or copy stand to avoid camera shake and the lens should be correctly focussed. The lighting should also be even to avoid shadows and the image should be exposed correctly. Extreme wide-angle lenses should be avoided as they introduce distortions which may lead to increased recognition problems towards the edge of the frame.
OCR applications are most effective if the text runs horizontally from left to right. In tests we found that the software could correct documents scanned at 5 and 10 degrees from the horizontal. The text was recognised with 100% accuracy. However, the same document captured at 30 degrees could not be read by the application.
Rotating a digital image by anything other than a multiple of 90 degrees will result in a loss of detail. Lower detail will often mean lower quality OCR, and so whenever possible the original should be placed straight in the scanner.
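Automatic skew detection is often explained in terms of a 'projection profile': when the lines of text are level, the dark pixels pile up in a small number of rows, and any tilt smears them across many rows. The sketch below applies that general idea to a synthetic page. We do not know which method the application in our tests uses, and the search range of 15 degrees either way, the whole-degree steps and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def rotate(points: Iterable[Point], degrees: float) -> List[Point]:
    """Rotate (x, y) points about the origin by the given angle."""
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


def profile_score(points: Iterable[Point]) -> int:
    """Sum of squared row counts: high when ink sits in only a few rows."""
    rows = Counter(round(y) for _, y in points)
    return sum(count * count for count in rows.values())


def estimate_skew(points: List[Point], max_degrees: int = 15) -> int:
    """Try each whole-degree skew and return the one whose correction
    gives the sharpest row profile."""
    candidates = range(-max_degrees, max_degrees + 1)
    return max(candidates, key=lambda c: profile_score(rotate(points, -c)))


# Synthetic page: five level text lines of 200 'ink' pixels each.
# Coordinates are kept as floats for simplicity; a real scan has integer pixels.
page = [(float(x), float(y)) for y in range(10, 110, 20) for x in range(200)]
skewed = rotate(page, 5)

print(estimate_skew(skewed))  # prints: 5
```

A search like this only finds skews inside its range, which is one plausible reason why a small tilt can be corrected while a much larger one cannot. Once the angle is known the image still has to be rotated, with the loss of detail described above.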
The final output format will vary between projects. Most OCR applications offer a variety of familiar formats including DOC, XLS, and PPT. If the aim of the project is simply to convert the text into a digital format then a standard word processing format such as DOC might be used. If the original appearance of the document is important but you need embedded searchable text within the file then you should choose the Acrobat format (PDF). A PDF file is considerably larger than the equivalent DOC file and should only be used when you want to record the appearance of the original document.
When used to process well-captured, good quality typed originals, the OCR application should be capable of delivering accurate output requiring minimal proof reading. From our tests we found that OCR applications work well with common computer and typewritten fonts, but are less effective with script typefaces. If handwritten notes are required they should be typed in manually or kept as image data within a searchable PDF file. While OCR applications may still be able to recognise type captured in less than perfect conditions, there is an increase in suspect words and characters, which require more time to clean up after the OCR process.
All screen shots are from ABBYY FineReader and used with kind permission of ABBYY.