Category Archives: digitization

Text Mining and Visualizing History

This week’s readings on the topic of text mining have helped me understand a little more clearly why historians might want to consider this aspect of digital history. Text mining and topic modeling can both help reveal new patterns and themes about events, people, and documents that might otherwise have been overlooked. When information is presented in a visual map, whether it’s a chart, a graph, or even a word cloud, that information takes on a new perspective that researchers can choose to investigate more fully, although it is important to remember the context that surrounds the original data.

I’ve come to realize that I am very much a visual learner, which makes text mining and topic modeling quite interesting to me. Seeing the data maps in Cameron Blevins’s article, produced by a program that counted the instances of geographic locations mentioned in two Houston newspapers during the 1830s/1840s and then later in the 1890s, made the method concrete for me. The idea of “imagined geography” was new to me when I read the article and accompanying website, and I think it is aptly named. At the time the newspapers’ articles, features, railroad schedules, etc. were being written, I highly doubt that anyone was thinking of all of the locations being referenced, or of their sociological and historical impact.
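To get a sense of what “counting the instances of geographic locations” might look like in practice, here is a minimal sketch (not Blevins’s actual pipeline) that tallies place-name mentions across a folder of plain-text newspaper issues. The gazetteer, the folder name, and the file layout are all made up for illustration.

```python
# A minimal sketch (not Blevins's actual pipeline) of counting place-name
# mentions across a folder of plain-text newspaper issues.
# The gazetteer and the "newspaper_texts" folder are hypothetical.
from collections import Counter
from pathlib import Path
import re

gazetteer = ["Houston", "Galveston", "New Orleans", "Chicago", "San Francisco"]
counts = Counter()

for issue in Path("newspaper_texts").glob("*.txt"):  # one file per digitized issue
    text = issue.read_text(encoding="utf-8")
    for place in gazetteer:
        # word-boundary match so "Houston" is not counted inside longer words
        counts[place] += len(re.findall(rf"\b{re.escape(place)}\b", text))

for place, n in counts.most_common():
    print(place, n)
```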

Having read Ted Underwood’s tips for starting your own text mining project before diving into Blevins’s article or Kaufman’s Quantifying Kissinger project, I had a basic understanding of the large amount of text needed for such undertakings. Underwood’s FAQ-style post puts the idea of text mining into easy-to-understand concepts, which I found most helpful. When I saw the Wordle example, I thought, “hey, I’ve done that before!”, which gave this week’s readings a personal connection. Then I tried out Google’s Ngram Viewer, a cool tool for visualizing the usage of a word or phrase over time, but the context is clearly lacking, especially since we can’t see which texts are being searched.
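To make the Ngram Viewer idea concrete, here is a rough sketch of the same notion run on a small local corpus: the relative frequency of one term per year. The texts/&lt;year&gt;/*.txt layout and the term are assumptions for the example, not anything the Google tool exposes.

```python
# A rough, local-corpus version of the Ngram Viewer idea: relative frequency
# of one term per year. The texts/<year>/*.txt layout is an assumption.
from pathlib import Path

term = "railroad"
for year_dir in sorted(Path("texts").iterdir()):
    if not year_dir.is_dir():
        continue
    words = []
    for page in year_dir.glob("*.txt"):
        words.extend(page.read_text(encoding="utf-8").lower().split())
    if words:
        print(year_dir.name, words.count(term) / len(words))
```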

Moving on from last week’s discussion about “buckets of words,” this week’s readings show how we can take those buckets, do our keyword searching, and find out how those words stack up across time, for whatever it’s worth. While text mining definitely has a cool factor because of the data maps it produces, which I really do find helpful, I need to remember that because of poor OCR and lack of context, text mining is just another tool in the grand historian’s toolbox.

Thinking about Searching Databases

This week’s readings all explored how wonderful it is that so many historical documents have been scanned, digitized, run through OCR software, and made available through countless online databases, making lengthy trips to libraries and archives less common. Of course, there are drawbacks to relying on database searching, as the authors point out.

Different databases behave differently when users type in a keyword or search phrase. As Patrick Spedding points out in “The New Machine,” some databases will run OCR to produce a transcription used in the search process but will not make that OCR text file available to users. In Spedding’s example of the Eighteenth Century Collections Online database, the lack of a visible transcription is offset by “coded linkage,” which highlights the keyword in the image of the original document. Other issues arise when alternate spellings and synonyms come into play. How can you be sure that you are finding all of the documents related to your search when using these handy databases? You can’t.
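A toy example of why a single exact keyword can miss relevant documents: searching OCR’d text with a small set of spelling variants instead of one string. The variant list and the sample sentence are invented for illustration.

```python
# A toy illustration of why exact keyword searches miss things: querying
# OCR'd text with a set of spelling variants. The variants and the sample
# sentence are made up for the example.
import re

variants = ["theatre", "theater", "theatrical"]
pattern = re.compile(r"\b(" + "|".join(variants) + r")\b", re.IGNORECASE)

ocr_text = "The Theatre Royal announced a theatrical benefit last evening."
print(pattern.findall(ocr_text))  # ['Theatre', 'theatrical'] -- a plain "theater" search finds neither
```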

This reminds me of Lara Putnam’s example of the Benbow Follies in her working paper from this year. While researching with microfilm, she came across an editorial that referenced “Benbow’s Follies,” and three years later she decided to do some more digging on that serendipitous find. Turning to Google Books, she found more information, but she still wanted to know how Benbow had come to appear in her original research in Costa Rica. By searching digitized newspaper sources, she found advertisements for Caribbean tours by Benbow’s musical troupes, which she had never found in traditional sources such as music reference works.

These examples help illustrate the good and the bad of searching databases. On one hand, you might not be able to find what you want, either because you don’t know what you don’t know or because of wonky search capabilities. On the other hand, the ability to search such a multitude of documents from the comfort of your home can help you tackle a research question that previously might have gone unanswered without the opportunity to travel to libraries and archives across the world. Either way, this “digital turn” is still evolving, and hopefully the future holds more comprehensive and creative search abilities for us researchers.

OCR Practicum

When comparing different applications of OCR on historical documents, it is obvious that accuracy rates vary from source to source. Looking at three pages from the Library of Congress’s Chronicling America collection and a page from the Pinkerton archives run through Google Drive’s OCR software revealed many issues with digitization and OCR.

Google Drive OCR

While comparing Google Drive’s OCR of page 3 of the Pinkerton records, provided by Dr. Robertson, with a classmate, we discovered that our outcomes differed not only because of the ways we had manipulated our original image sources but also in the content the OCR delivered. We both had fairly inaccurate results after cropping, adjusting the contrast, and changing the horizontal image to vertical, yet the fact that some of our incorrect results didn’t match each other was interesting. I wondered if this had anything to do with the way Google Search customizes results based on the user’s computer, but I don’t think that would make much sense. Given the large number of inaccuracies in just one page, I would consider hand-transcribing the text instead; it would take longer and require more human effort, but the results would be more accurate. Here is the result of my Google Drive OCR experiment.
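For anyone who wants to repeat this kind of experiment outside Google Drive, here is a hedged sketch of the same preprocessing steps (grayscale, rotate, crop, boost contrast) using Pillow, with pytesseract standing in as the OCR engine. The file name and crop box are placeholders, and this is not what Google Drive does internally.

```python
# Not what Google Drive does internally -- just a sketch of the same kind of
# preprocessing (grayscale, rotate, crop, contrast) with Pillow, using
# pytesseract as a stand-in OCR engine. File name and crop box are placeholders.
from PIL import Image, ImageEnhance
import pytesseract

img = Image.open("pinkerton_page3.jpg").convert("L")   # load and convert to grayscale
img = img.rotate(-90, expand=True)                     # horizontal scan -> vertical
img = img.crop((100, 100, 1900, 2600))                 # trim margins (placeholder box)
img = ImageEnhance.Contrast(img).enhance(2.0)          # boost contrast

print(pytesseract.image_to_string(img))
```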

LOC papers from Chronicling America:

The Library of Congress appears to have better OCR accuracy than Google, although it is far from perfect. In the three front pages that I reviewed, I noticed that the majority of the content from the articles and headlines was mostly correct. However, the OCR seems to have a problem with the columns. Even so, I still think OCR is better suited here than with Google, since transcribing each page by hand would be fairly time-consuming.

Once I started reading through the OCR text, I noticed a pattern: the text was displayed in a long column organized top to bottom, left to right. Once I discovered this pattern, the OCR text version was much easier to compare against the scanned page. At first glance I could barely understand the OCR, and the three papers I compared had varying results due to the quality of their scans.

With the Omaha Daily Bee, the top part of the page, where the headlines spread across the columns, caused some issues with the order of the OCR text. Some lines were completely out of order. However, the actual content of the articles proved to be mostly accurate. The different sizes of the sections on the front page make the OCR difficult to read, but if you spend enough time and energy trying to make sense of it, you probably can. It’s not perfect, but it’s acceptable if you focus on just the articles.

The articles in the Richmond Times Dispatch were harder to read in the OCR text format, as the scanning quality appears to be an issue. Many of the words and letters were light and barely visible, which produced some erroneous results and made reading the text format very difficult.

The Washington Times had accuracy rates similar to the Richmond Times Dispatch, in that the scanning quality was not clear, resulting in poor readability in the OCR. Headlines and larger fonts were more accurate than the faint article text, though the articles themselves were easier to read than those in the Richmond Times Dispatch.

Finally, I looked at another Chronicling America page, from the Day Book, an advertisement-free Chicago newspaper, about the railroad moving into Alaska. This document was the most accurate, although it was also the shortest. It had just a few spelling errors, most likely due to printing smudges, and was otherwise relatively error-free.

Overall, this practicum has helped illustrate the various scenarios and factors that come into play when obtaining OCR for historical documents. It’s important to realize that not all OCR software is equal, and the condition of the original source when scanned plays a large part in the accuracy of the results.
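One simple, admittedly rough way to put a number on “how accurate was the OCR” is to compare the OCR output against a hand-corrected passage with the standard library’s difflib. The two strings below are invented examples, and real projects typically report character or word error rates instead.

```python
# A rough way to score OCR output against a hand-corrected passage using the
# standard library's difflib. Both strings here are invented examples.
import difflib

ocr_output     = "Tlie rai1road will reach A1aska by spring."
hand_corrected = "The railroad will reach Alaska by spring."

ratio = difflib.SequenceMatcher(None, ocr_output, hand_corrected).ratio()
print(f"similarity: {ratio:.2%}")  # a crude proxy; many projects report character error rate instead
```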


Thoughts on Digitization

This week’s readings focused on the digitization of historical documents and the various considerations that surround this modern approach to preserving the past. While large-scale projects may benefit from outsourcing scanning for cost-effectiveness rather than doing it in-house, the issue of OCR accuracy rates cannot be overlooked. Across the readings, it is evident that not all projects are the same, yet all successful projects must be well planned, organized, and accurate.

The readings were interesting to me because I had a hand in some small-scale digitization projects as a student worker in my college’s archives as an undergrad. In one project I did some of the scanning of glass plate negatives, and in another I helped organize back issues of the school’s student-run newspaper as it was being prepared to be microfilmed and digitized at an outside facility. At the time, I had no real idea about the OCR accuracy rate of the newspapers (or even whether OCR would be used), although I assumed they would be more easily searchable than browsing through the collections in the archives. Clearly, a reading like Cohen and Rosenzweig’s Becoming Digital chapter would have helped me understand the lay of the land of digitization, but it was written two years too late for that project.

I find Ian Milligan’s Illusionary Order article interesting in that the author stresses the importance of transparency in research. His finding that two digitized Canadian newspapers are showing up more frequently in dissertation citations is definitely worth some consideration. On one hand, it’s great that these collections are now searchable online. On the other, there appears to be a growing dependency on the newspapers that have been digitized, leaving behind the still print-only materials in a way that may alter the nature or direction of one’s research. Milligan points out that dissertations have been leaning toward more Toronto-based research because of the scope of the newspapers available in Pages of the Past and Canada’s Heritage since 1844. The fact that the dissertations more often cite just the newspapers and not the online database the authors used to access the articles is something we briefly talked about in class last week, and I’ve been spending a lot of time thinking about it. While citing the original newspaper is technically correct, I agree that for transparency’s sake it’s important to give credit to the databases that house these sources. I know I have failed to do this in my own research, but I will now make a more conscientious effort to do so.