OCR Practicum

When comparing different applications of OCR on historical documents, it quickly becomes clear that accuracy varies from source to source. Looking at four pages from the Library of Congress's Chronicling America collection and a page from the Pinkerton archives run through Google Drive's OCR software, I found a number of digitization and OCR problems.

Google Drive OCR

While comparing Google Drive's OCR of page 3 of the Pinkerton records, provided by Dr. Robertson, with a classmate, I discovered that our outcomes differed not only because of the manner in which we manipulated our original image sources, but also in the actual text the OCR delivered. We both had fairly inaccurate results after cropping, adjusting the contrast, and rotating the horizontal image to vertical, yet the fact that some of our incorrect results didn't match each other was interesting. I wondered whether this had anything to do with the way Google Search customizes results based on the user's computer, but I don't think that explanation makes much sense here. Given the large number of inaccuracies on just one page, I would weigh the cost-effectiveness of hand-transcribing the text: it would take longer and demand more human effort, but the results would be more accurate. Here is the result of my Google Drive OCR experiment; the image preparation steps we used are sketched below.
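For reference, the kind of image preparation described above can be expressed in a few lines of Python with the Pillow library. This is a minimal sketch, not our exact steps: the file names, crop box, and contrast factor are hypothetical stand-ins, and the resulting image would still be uploaded to Google Drive by hand for OCR.

```python
# Minimal sketch of the pre-OCR image cleanup described above, using Pillow.
# File names, crop box, and contrast factor are hypothetical placeholders.
from PIL import Image, ImageEnhance

img = Image.open("pinkerton_page3.jpg")  # hypothetical scan file

# Rotate the horizontal scan to vertical; expand=True enlarges the canvas
# so the rotated page is not clipped.
img = img.rotate(90, expand=True)

# Crop to the text block: (left, upper, right, lower) in pixels.
img = img.crop((100, 100, 1600, 2400))

# Increase contrast to darken faint print; a factor of 1.0 means no change.
img = ImageEnhance.Contrast(img).enhance(2.0)

img.save("pinkerton_page3_prepped.png")  # upload this to Google Drive for OCR
```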

LOC Papers from Chronicling America

The Library of Congress appears to have better OCR accuracy than Google, although it is far from perfect. In the three front pages that I reviewed, the majority of the article and headline content was correct. However, the OCR clearly struggles with column layout. Even so, I still think OCR serves better here than it did with Google Drive, since hand-transcribing each of these dense pages would be fairly time consuming.

Once I started reading through the OCR text, I noticed a pattern: the text was displayed in one long column, ordered top to bottom and then left to right. After I recognized this, the OCR version became much easier to compare against the original page, even though at first glance I could barely understand it. Still, the three papers I compared produced varying results depending on the quality of their scans; a rough way to put a number on such comparisons is sketched below.
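One simple way to quantify a comparison like this is a character-level similarity score between the OCR output and a hand transcription of the same passage. Here is a minimal sketch using Python's standard difflib module; the two sample strings are hypothetical placeholders, not actual text from the papers.

```python
# Rough character-level similarity between OCR output and a hand
# transcription, using only the standard library. The sample strings
# below are hypothetical placeholders.
from difflib import SequenceMatcher

hand_transcription = "The railroad will extend into Alaska this year."
ocr_output = "The railr0ad wili extend into A1aska this yoar."

# ratio() returns 0.0 for no overlap and 1.0 for identical strings.
score = SequenceMatcher(None, hand_transcription, ocr_output).ratio()
print(f"Character-level similarity: {score:.1%}")
```

A score like this only captures character overlap, not reading order, so a page whose columns come out scrambled can still score deceptively well.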

With the Omaha Daily Bee, the top of the page, where headlines spread across several columns, caused problems with the order of the OCR text; some lines came out completely out of sequence. The actual content of the articles, however, proved mostly accurate. The varying sizes of the front-page sections make the OCR difficult to read, but with enough time and effort you can probably make sense of it. It's not perfect, but it's acceptable if you focus on just the articles.

The articles in the Richmond Times Dispatch were harder to read in the OCR text format, as scan quality appears to be the issue. Many of the words and letters were faint and barely visible, which produced numerous errors and made the text version very difficult to read.

The Washington Times had accuracy similar to the Richmond Times Dispatch: the scan was likewise unclear, resulting in poor OCR readability. Headlines and larger fonts came through more accurately than the faint article text, though the articles themselves were still easier to read than those of the Richmond Times Dispatch.

Finally, I looked at another Chronicling America page, from the Day Book, an advertisement-free Chicago newspaper, covering the railroad moving into Alaska. This document produced the most accurate OCR, although it was also the shortest: aside from a few misspellings most likely caused by printing smudges, the article was essentially error-free.

Overall, this practicum illustrated the many scenarios and factors that come into play when obtaining OCR for historical documents. Not all OCR software is equal, and the condition of the original source at the time of scanning plays a large part in the accuracy of the results.
