In a previous blog post, I mentioned that I thought it would be useful to consider the value of the Digital Humanities outside the framework of strictly research and interpretation. For my project, however, I decided to work within this framework, but from a different standpoint. From the readings in my introduction to the Digital Humanities, I inferred that its use-value, so to speak, is largely viewed in terms of digital publications, digital archives, and large-scale macroanalysis, at least on the textual level of things. My thought was: what if digital methodology could be applied to textual analysis in a more traditional, “close reading” sense, as Matthew Jockers would say?
My plan was to use OCR software to transcribe PDFs of collections of writings by Ralph Waldo Emerson, an important figure in American intellectual history. I figured that with some background knowledge of Emerson’s life, ideas, and influence, I could use topic modeling on the transcribed texts to identify keywords with which to search the PDFs, thus making the research process more efficient and, foreseeably, allowing the historian using digital tools to consult more sources than previously possible (as by using only one’s eyes, concentration, and caffeine, for example).
Halfway through my project I realized I had naively assumed too much about the ease with which digital sources could be transcribed, modeled, and searched. For one thing, I did not take into account how the text would be transcribed:
I had assumed that the transcription would go smoothly; I had not anticipated that the column breaks would be interpreted literally! This was a particularly egregious error on my part, as I had prior experience with OCR software. Also, many words were improperly transcribed.
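One way to repair this kind of damage after the fact is a small cleanup script. The following is only a minimal sketch, assuming the fragmentation takes the form of words hyphenated at line ends and hard line breaks inside paragraphs; the function name `clean_ocr_text` is hypothetical, and misrecognized characters within words would still need separate correction.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Rejoin words split by end-of-line hyphens and unwrap hard line breaks.

    Assumption: paragraphs are separated by blank lines in the OCR output.
    """
    # Join "tran-\nscription" into "transcription".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Split on blank lines so paragraph boundaries survive the unwrapping.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every remaining run of whitespace (including single newlines).
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "The tran-\nscription went\nsmoothly.\n\nNext para-\ngraph."
print(clean_ocr_text(sample))
```

A caveat with this approach: a genuinely hyphenated compound that happens to break at the end of a line (for example "self-" followed by "reliance") would be merged into a single unhyphenated word, so the output would still need a spot check.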
To my relief, the topic modeling produced some noticeable results, in spite of picking up the fragmented words. However, searching the PDF files for multiple words at a time was fruitless, as the search engine proved to be programmed to search page by page rather than across multiple pages at once.
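A possible workaround for the page-by-page limitation is to search the transcribed text directly rather than the PDF. The sketch below assumes the transcription has already been split into one string per page (the `pages` list and the function name `pages_matching` are hypothetical); it reports every page on which the keywords occur, across the whole document in one pass.

```python
def pages_matching(pages, keywords, require_all=True):
    """Return 1-based page numbers whose text contains the keywords.

    pages: list of strings, one per page (assumed already extracted).
    require_all: if True a page must contain every keyword; if False, any.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        found = [k in lowered for k in wanted]
        if (all(found) if require_all else any(found)):
            hits.append(number)
    return hits


pages = [
    "Nature and the soul.",
    "Self-reliance and society.",
    "The soul in nature and society.",
]
print(pages_matching(pages, ["nature", "soul"]))
```

Matching here is by plain substring, so "soul" would also match "souls"; for a keyword list produced by topic modeling that looseness may be acceptable, but whole-word matching would need a regular expression instead.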
At this point I realized I had approached the project with the underlying assumption that the methods I employed here would be used by a “rogue digital historian,” if you will: one who takes the initiative to download, transcribe, model, search, and read the sources. I concluded that perhaps it would be better in the future to use my methods in the context of a digital project in its own right, as is done with crowdsourcing. While working as a research assistant, I am becoming familiar with programs such as Dedoose, which allow users to annotate and demarcate topics in texts. I think it would be beneficial for all historians if digitally annotated sources were made available, so that they could search quickly and efficiently for the subject matter they need. It might also be a useful way to bridge more traditional approaches to history writing and the digital humanities, whose methods by and large seem to be conceived of as opposed to one another.
On a more practical level, I’d certainly be curious to know whether there is a way to quickly edit transcribed text documents, and whether there is a way to program n-grams into MALLET for topic modeling.
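On the n-gram question, one tool-agnostic approach is to preprocess the text before topic modeling so that frequent two-word phrases are fused into single tokens. This is only a sketch under stated assumptions: the function name `join_frequent_bigrams` and the `min_count` threshold are hypothetical, and whether a given tool keeps the underscore inside a token depends on its tokenizer settings, which would need checking for MALLET.

```python
import re
from collections import Counter


def join_frequent_bigrams(text, min_count=2):
    """Fuse word pairs that occur at least min_count times into one token."""
    # Lowercase and keep only alphabetic words (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count every adjacent pair of tokens.
    counts = Counter(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    out = []
    i = 0
    while i < len(tokens):
        # Greedily merge a frequent pair, then skip past both words.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(join_frequent_bigrams("Over soul speaks and over soul listens"))
```

In practice, common function words would need to be removed first; otherwise pairs such as "of the" would dominate the frequent list and crowd out the meaningful phrases.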