The U.S. Postal Service uses optical character recognition (OCR) technology to read the addresses on pieces of mail. For the mail to be readable by an OCR mail sorter, however, the addresses and fonts need to be formatted a certain way. OCR software is useful for converting scanned images of typed or handwritten documents to searchable electronic text, but it has disadvantages that limit its applications.
OCR works best with good-quality typed documents. Handwritten documents cannot be easily read by OCR software. Likewise, typed fonts that resemble handwriting -- as well as non-Latin fonts -- create many errors during the OCR process. If the document has poor contrast, is creased or dirty, or has text and background of similar darkness, OCR may not work well. OCR also has difficulty with documents that combine images and text, and spreadsheets and other tabular layouts tend to produce more errors.
No OCR software is 100 percent accurate. The number of errors depends upon the quality and type of document, including the font used. Errors that occur during OCR include misreading letters, skipping over letters that are unreadable, and mixing together text from adjacent columns or image captions. If high accuracy is required -- as with converting printed books to electronic format -- then a clean-up of the electronic text will be needed.
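To put a number on how accurate an OCR pass was, one common approach is a character error rate: the edit distance between a trusted reference text and the OCR output, divided by the length of the reference. The sketch below is illustrative only; the function names are hypothetical, and it assumes a hand-checked reference transcription is available for at least a sample of the document.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions that turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR text
                current[j - 1] + 1,      # extra character in the OCR text
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of reference characters that the OCR output got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: the OCR output misread a zero as a capital "O".
    print(character_error_rate("Total: 100", "Total: 1O0"))
```

Measuring the rate on a small sample of pages can help decide whether a clean-up pass is worthwhile before processing the whole document.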
OCR has difficulty differentiating between similar-looking characters, such as the number zero and a capital "O." To work around this, a special OCR font can be used, or the ambiguity can be avoided altogether, for example by writing out the word "zero." However, this works only for documents created with OCR in mind, such as questionnaires. When creating questionnaires that will be filled in by hand, researchers also provide a separate box for each letter.
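When the document cannot be redesigned, the zero/"O" confusion is sometimes reduced after the fact with a context rule. The sketch below is a simple heuristic, not a method described in this article: it assumes that an "O" inside an otherwise numeric token is really a zero, and that a lone zero inside an otherwise alphabetic token is really an "O." The function name is hypothetical, and mixed codes such as part numbers are deliberately left untouched.

```python
import re


def fix_zero_o_confusion(text: str) -> str:
    """Heuristically swap 0 and O based on the rest of each token."""

    def fix_token(match: re.Match) -> str:
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        letters = sum(ch.isalpha() for ch in token)
        ambiguous_letters = token.count("O") + token.count("o")
        other_letters = letters - ambiguous_letters
        zeros = token.count("0")

        # Token such as "1O0": digits plus only O/o, so treat O as zero.
        if digits > 0 and other_letters == 0:
            return token.replace("O", "0").replace("o", "0")

        # Token such as "R0OM": the only digits are zeros among real letters.
        if zeros > 0 and digits == zeros and other_letters > 0:
            replacement = "O" if token == token.upper() else "o"
            return token.replace("0", replacement)

        # Anything else (for example "A10") is too ambiguous to change.
        return token

    return re.sub(r"[A-Za-z0-9]+", fix_token, text)


if __name__ == "__main__":
    print(fix_zero_o_confusion("Invoice 1O0 for R0OM 2O5"))
```

A rule like this will make mistakes on identifiers that legitimately mix letters and zeros, so its output still needs human review.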
Even if the scanned image of the original document is high-quality, additional steps are needed to clean up the OCR text. Correcting the errors created by OCR is very labor-intensive, because a person has to manually compare the original document with the electronic text. People also make errors when typing text from a document, but sometimes it is faster to skip the OCR step and type the text directly.
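One way to narrow the manual comparison is to let software point out where two versions of the text disagree, so the proofreader checks only those spots against the original. The sketch below uses Python's standard difflib module and assumes two independent transcriptions exist (for example, the output of two different OCR runs, or OCR output and a typed copy); that setup and the function name are assumptions for illustration, not something this article prescribes.

```python
import difflib


def find_disagreements(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (words_from_a, words_from_b) pairs where the texts differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)

    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((
                " ".join(words_a[a_start:a_end]),
                " ".join(words_b[b_start:b_end]),
            ))
    return disagreements


if __name__ == "__main__":
    typed = "The quick brown fox"
    scanned = "The quick brovvn fox"  # hypothetical OCR misreading of "w"
    for left, right in find_disagreements(typed, scanned):
        print(f"{left!r} vs {right!r}")
```

Places where both versions contain the same mistake will not be flagged, so this reduces the proofreading effort rather than replacing it.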