Experimenting with OCR

In my digital history class we recently discussed optical character recognition (OCR), software that transcribes scanned images of text. I was curious how to use it, so I decided to give it a trial run.

I bought a copy of the Boston Globe one Friday, intending to scan it and run it through OCR. I selected three articles to scan: two local and one national. I decided to focus on one article in particular for a sample transcription. It was about a strike involving the union and top officials of Boston’s public transportation system. It may well be of interest one day to economic historians.

To scan the newspaper, I had to go to my school library. The scanner’s interface was pretty simple. It let me choose the file type for the scans (I chose PDF) and whether to scan in color, greyscale, or black and white (I chose black and white on the advice of my professor). Conveniently, the library computer gave me the option of sending the scans directly as an email attachment, so I was able to access them on my own computer almost immediately.

One of the scanned pages

The complete set of scans, in their designated folder

After installing Tesseract, the OCR software, I moved the scans into a separate folder (labeled “pdfs”) and downloaded a script file that would let me point Tesseract at the scans from the command line. I then opened the terminal, typed “cd,” and dragged the folder from the Finder (I use a Mac) into the terminal window, which completed the command:

With the terminal now pointed at that folder, I could tell Tesseract to transcribe the scans. To do so, I typed “sh” followed by the name of the “director” file and the name of the input file (sh runs the “director” file as a shell script, and the input file name is passed to it as an argument):

All that was left to do was hit “enter,” and Tesseract converted the PDF and transcribed it, with the following result:

The "images" and "texts" folders were created by Tesseract; the "Tesseract-1" file is the "director" file
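
The two folders hint at what a wrapper like the “director” file probably does: turn each PDF page into an image, then hand each image to Tesseract. I have not looked inside the file, so what follows is only a minimal sketch of that idea in Python, not the actual script. It assumes the pdftoppm tool (from Poppler) is installed for the PDF-to-image step, which is my own substitution, and the function names and the “images” and “texts” folder layout are made up to mirror the screenshot.

```python
import subprocess
from pathlib import Path


def pdf_to_images_cmd(pdf: Path, images_dir: Path) -> list[str]:
    """Build a pdftoppm command that renders each PDF page as a 300 dpi PNG."""
    prefix = images_dir / pdf.stem  # pdftoppm appends a page number and .png
    return ["pdftoppm", "-r", "300", "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path, texts_dir: Path) -> list[str]:
    """Build a tesseract command; tesseract adds .txt to the output base name."""
    return ["tesseract", str(image), str(texts_dir / image.stem)]


def transcribe(pdf: Path, workdir: Path) -> None:
    """Convert one PDF to page images, then OCR every page (hypothetical helper)."""
    images_dir = workdir / "images"
    texts_dir = workdir / "texts"
    images_dir.mkdir(parents=True, exist_ok=True)
    texts_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: one PNG per page lands in the images folder.
    subprocess.run(pdf_to_images_cmd(pdf, images_dir), check=True)
    # Step 2: one text file per page lands in the texts folder.
    for image in sorted(images_dir.glob(f"{pdf.stem}*.png")):
        subprocess.run(ocr_cmd(image, texts_dir), check=True)
```

With both tools installed, calling transcribe(Path("pdfs/scan.pdf"), Path(".")) on a file of your own (the file name here is a placeholder) should leave page images in “images” and plain-text transcriptions in “texts.”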

The sample article, transcribed

As the above picture shows, Tesseract did a fairly decent job transcribing the article accurately. It even preserved the newspaper’s text margins and was able to tell separate articles apart, even where their columns sat side by side on the same page. The program did have difficulty in some areas, though. Not surprisingly, the minor text at the top of the page, such as the letter identifying the section of the newspaper and the stock market indexes, was sloppily transcribed:

There were also some peculiar spelling errors:

"MBTA officials"

"Lawsuit"

A potentially more serious problem was that at one point Tesseract “misread” the layout and inserted text from another article into the middle of my sample:

The Original

The (Incorrect) Transcription

Nevertheless, I completed the transcription myself. Pretending that this would one day actually be used by historians, I rearranged the transcription into a more compact format, added the missing bits, and corrected spelling errors.

The Final Copy
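
One way to put a rough number on how much fixing a transcription needed is to compare Tesseract’s raw output with the corrected copy. Below is a minimal sketch using Python’s built-in difflib module. The two sample strings are invented for illustration (they are not the actual errors from my Globe article), and the function names are my own.

```python
import difflib


def similarity(raw: str, corrected: str) -> float:
    """Return a 0-to-1 score for how closely the raw OCR text matches the corrected text."""
    return difflib.SequenceMatcher(None, raw, corrected, autojunk=False).ratio()


def changed_words(raw: str, corrected: str) -> list[tuple[str, str]]:
    """List (OCR wording, corrected wording) pairs wherever the two texts disagree."""
    a, b = raw.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replaced, inserted, or deleted words
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


# Invented sample text, not taken from the Globe article.
raw = "MBTA offlcials filed a lawsult on Friday"
corrected = "MBTA officials filed a lawsuit on Friday"

print(changed_words(raw, corrected))
print(round(similarity(raw, corrected), 2))
```

A list like this only helps once a corrected copy exists, so it does not replace proofreading; it just records what the proofreading changed.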

My results using OCR were mixed, but overall it does expedite the process of transcription, and its errors can fairly easily be caught by a simple review. Frederick Gibbs and Trevor Owens have argued in their essay “The Hermeneutics of Data and Historical Writing” that descriptions of the methods of the digital humanities need to be included in the historical literature, so that potential inaccuracies may be spotted. As far as this very limited example, OCR, is concerned, I am not convinced that historians need to spell out that they used digitally transcribed sources, or the process this method of transcription entails, so long as the transcriptions are diligently checked for accuracy. What might be needed more in this case is simply for the “bugs” in the software to be corrected; since software is not static (there are many versions and updates), OCR seems to have the potential to develop into a very powerful tool.

 
