All posts by johnheatwalsh

My DH Project

In a previous blog post, I mentioned that I thought it would be useful to consider the value of the Digital Humanities outside the framework of strictly research and interpretation. For my project, however, I decided to work within this framework, but from a different standpoint. From the readings in my introduction to the Digital Humanities, I gathered that the field’s use-value, so to speak, is largely viewed in terms of digital publications, digital archives, and large-scale macroanalysis, at least where texts are concerned. My thought was: what if digital methodology could be applied to textual analysis in a more traditional, “close reading” sense, as Matthew Jockers would put it?

My plan was to use OCR software to transcribe PDFs of collections of writings by Ralph Waldo Emerson, an important figure in American intellectual history. I figured that with some background knowledge of Emerson’s life, ideas, and influence, I could use topic modeling on the transcribed texts to identify keywords for searching the PDFs, making the research process more efficient and, foreseeably, allowing a historian using digital tools to consult more sources than previously possible (that is, with only one’s eyes, concentration, and caffeine).
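To make the plan concrete, here is a minimal sketch of the topic-modeling step as I imagined it, written in Python with gensim’s LdaModel standing in for whatever tool one actually uses; the folder name, stopword list, and parameters are my own illustration, not the exact setup from class.

```python
# Sketch: derive candidate search keywords from OCR-transcribed text files
# via topic modeling. The "transcriptions/" folder, the tiny stopword list,
# and the parameters below are illustrative assumptions only.
from pathlib import Path
from gensim import corpora, models

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is",
             "it", "as", "for", "with", "his", "be", "not", "which"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords and short fragments."""
    words = [w.strip(".,;:!?\"'()[]").lower() for w in text.split()]
    return [w for w in words if w.isalpha() and len(w) > 2 and w not in STOPWORDS]

docs = [tokenize(p.read_text(errors="ignore"))
        for p in sorted(Path("transcriptions").glob("*.txt"))]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)

# The top words of each topic become candidate keywords for searching the PDFs.
for topic_id, words in lda.show_topics(num_topics=10, num_words=8, formatted=False):
    print(topic_id, [w for w, _ in words])
```

The point is only the shape of the workflow: transcribe, model, and let the topic words suggest what to search for.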

Halfway through my project I realized I had naively assumed too much about the ease with which digital sources could be transcribed, modeled, and searched. For one thing, I did not take into account how the text would be transcribed:

[Screenshot of the OCR output, with the original column breaks carried over literally]

I had assumed that the transcription would go smoothly, not that the column breaks would be interpreted literally! This was a particularly egregious error on my part, as I had prior experience with OCR software. Many words were also improperly transcribed.

To my relief, the topic modeling produced some noticeable results in spite of picking up the fragmented words. Searching the PDF files for multiple words at a time, however, was unfruitful, as the search tool proved to work page by page rather than across the document as a whole.
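In hindsight, the page-by-page limitation is easy to route around if you search the transcribed text files themselves rather than relying on a PDF reader. A minimal sketch, with hypothetical file names and keywords:

```python
# Sketch: check which whole transcriptions contain all of a set of keywords,
# rather than searching a PDF page by page. Paths and keywords are hypothetical.
from pathlib import Path

keywords = ["self-reliance", "nature", "scholar"]   # e.g. terms surfaced by topic modeling

for path in sorted(Path("transcriptions").glob("*.txt")):
    text = path.read_text(errors="ignore").lower()
    missing = [k for k in keywords if k.lower() not in text]
    if not missing:
        print(f"{path.name}: contains all {len(keywords)} keywords")
    else:
        print(f"{path.name}: missing {', '.join(missing)}")
```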

At this point I realized I had approached the project with the underlying assumption that the methods I employed would be used by a “rogue digital historian,” if you will: one who takes the initiative to download, transcribe, model, search, and read the sources alone. I concluded that it might be better in the future to apply these methods within a digital project in its own right, much as crowdsourcing is. While working as a research assistant, I have begun to become familiar with programs such as Dedoose, which allow users to annotate and demarcate topics in texts. I think it would be beneficial to make digitally annotated sources available to all historians, so that they could search quickly and efficiently for the subject matter they need. It might also be a useful way to reduce the perceived opposition between more traditional approaches to history writing and the methods of the Digital Humanities, which by and large seem to be conceived of as at odds with one another.

On a more practical level, I’d certainly be curious to know whether there is a way to quickly clean up transcribed text documents, and whether there’s a way to feed n-grams into MALLET for topic modeling.
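Two partial answers suggest themselves, though I have not tested either against my actual files. For cleanup, a few regular expressions can repair the most common OCR artifacts; for n-grams, one tool-agnostic workaround is to pre-join adjacent word pairs before importing the files, so that the topic modeler treats each bigram as a single token (MALLET may well have options of its own for this that I have not explored). The folder names below are hypothetical.

```python
# Sketch: (1) repair common OCR artifacts in transcribed text files, and
# (2) write a bigram version of each file so a topic modeler such as MALLET
# sees word pairs ("boston_globe") as single tokens. This is a workaround,
# not MALLET's own n-gram support; folder names are hypothetical.
import re
from pathlib import Path

def clean(text):
    # Re-join words hyphenated across line breaks ("exam-\nple" -> "example").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single line breaks inside paragraphs (e.g. column breaks) into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

out_dir = Path("transcriptions_bigrams")
out_dir.mkdir(exist_ok=True)

for path in Path("transcriptions").glob("*.txt"):
    text = clean(path.read_text(errors="ignore"))
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    (out_dir / path.name).write_text(" ".join(bigrams))
```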

Themes of Historical Blog Entries?

This blog caught my attention recently. I think its content is a worthy point of reference for some of the issues I’ve been discussing in my Digital History class. In this brief excursion of a historical study, the author investigates a digitally annotated corpus of London trial records from the 1600s to the 1800s. He looks particularly at words that are closely associated with first names, concluding that most words appearing next to first names that are not found in dictionaries are likely last names. Referencing studies that have shown that justice in England during this period was frequently informal, he concludes that many of the new last names in these records likely belong to recent migrants into the London area. He then traces the appearance of these last names against periods of war, attempting to show the effects of war and peace on migration.
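The author does not publish his code, so the following is only my reconstruction of the heuristic as I understand it, with a made-up name list, wordlist path, and sample sentence: take each token that immediately follows a known first name and, if it does not appear in an ordinary dictionary, count it as a probable surname.

```python
# Sketch of the surname heuristic as I understand it (not the blog author's code).
# The first-name list, dictionary path, and sample text are all hypothetical.
from collections import Counter
import re

FIRST_NAMES = {"john", "william", "mary", "elizabeth", "thomas"}  # illustrative subset

with open("/usr/share/dict/words") as f:   # any plain-text English wordlist will do
    DICTIONARY = {line.strip().lower() for line in f}

def probable_surnames(text):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    surnames = Counter()
    for first, nxt in zip(tokens, tokens[1:]):
        if first in FIRST_NAMES and nxt not in DICTIONARY:
            surnames[nxt] += 1
    return surnames

sample = "John Okeshott and William Varley were indicted for theft of a silver spoon."
print(probable_surnames(sample).most_common())
```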

To be sure, I found this blog entry lacking in certain respects. Considering that the number of words he thought could be last names was astronomical (55,000), it was unclear how he went about identifying names that were “new” and how he could be sure they were appearing for the first time. It seems reasonable to conclude, given what I know about the Digital Humanities, that he did not look over them word by word. Because he did not explicate his methods here, his theoretical foundations (for example, the assumption that most of the names belonged to recent immigrants) are considerably shakier. Still, I think this was a thought-provoking and legitimate examination. As I have argued in class, Digital Humanities projects, just like historical narratives, should be approached critically. Readers should not accept them as mere fact; the point is to weigh the truthfulness of the claim and stimulate thought about society. And this blog does so just fine, in my opinion.

This brings to mind a reading by Gibbs and Owens, “The Hermeneutics of Data and Historical Writing,” that my class covered earlier in the semester. The blog entry seems to me a far better medium than the monograph for explaining digital methodologies in historical inquiry: it is a short but substantial, digestible chunk of information, as opposed to an exasperatingly long-winded text full of facts, interpretation, and methodology.

In fact, I would go so far as to say that the blog entry is well suited to displaying and discussing Digital Humanities research for this very reason (though certainly historical blog entries need not be limited to Digital Humanities research). It also seemed to me that the scope of this project was rather limited compared to that of a traditional monograph. In spite of this, as I have argued, I found it thought-provoking and therefore worthy of publication. In another reading from earlier in the semester, the historian Robert Darnton discusses some of the issues surrounding digital publication and consideration for tenure. Perhaps by thinking about historical blog entries in terms of the breadth of their focus, and how much that breadth actually diminishes the value of the study behind them, historians can tackle some of the issues surrounding the legitimacy of digital publication.

Reflections on the “Our Marathon” Project

At the beginning of the semester, I recall, we were a bit critical of crowdsourced digital collections; in particular, I’m thinking of the September 11th digital archive, whose usefulness as a source of historical data we were skeptical of. This week our class took a look at Northeastern University’s “Our Marathon” project, which is similarly constructed, and having looked at it, my feeling is that these sites do have some usefulness.

To be sure, it doesn’t seem like a historian could use this source to interpret the bombing’s meaning for the history of Boston or of the country as a whole. The project focuses on the stories of everyday people: those who were near where the bombs went off, describing the events of that day as they remembered them unfolding, and police officers involved in the immediate aftermath. There isn’t much about how the federal government responded in terms of policy, or how the national mood changed, that would lend itself to an interpretation of the bigger picture.

While providing an interpretation of historical events may be one of the (if not the) central tasks of the historian, this is not all that is involved in the historian’s line of work. Sometimes it helps to describe historical events in detail, so that there is a deeper connection to the historian’s overall interpretation. And this website provides the historian a tool that can be used to “tell the story,” if not to interpret it.

I should emphasize that a big part of why I think this site works is that the event is both so recent and so specific. The memories are still fresh in the contributors’ minds, and how they felt that day is very evident in their descriptions. It would be more difficult, for example, to try to crowdsource an archive of memories of “McCarthyism.” What would that label encompass? Would we so easily be able to identify common threads among the contributors? Would their memory of how they felt still be as clear sixty years later?

Because this event is recent and (relatively) clearly demarcated, we can worry less about the accuracy and consistency of focus in the responses, and instead identify recurring elements that could help the historian of modern America tell the story of that day in rich, interesting detail, even while keeping the focus on a group of people rather than a few individuals. I noticed, for example, that many people’s initial reaction to the first explosion was to assume an accident had happened (a ceremonial cannon accidentally going off, or a truck hitting a building). It wasn’t until the second explosion shortly after that mass hysteria started to spread. Another aspect of the project that interested me was its preservation of “memes,” short slogans and icons meant to capture a collective feeling, in this case the resilience and pride of the Boston community, unswayed by the malicious intent of the bombers. I thought this was a powerful visual element that captured both the feelings involved that day and the way those feelings are expressed in our increasingly digital world.

To summarize my point, these examples illustrate the effectiveness of crowdsourced digital collections in contributing to the storytelling aspect of the historian’s craft, if not the hermeneutical aspect.

Data Visualizations and the Role of the Historian

In the past week, I was required to read an article entitled “Spatial History” by Richard White, in which he discusses plotting data against the background of a digital map, whether of a country, a city, or any other space specified by the researcher. In this article, White emphasizes the role of these data visualizations as “a means of doing research” (White, 6). He also argues that the visualizations should not be mere illustrations of concepts the researcher has developed through non-digital means. It is appropriate, I think, to emphasize the use of visualizations as an interpretive tool. But it is incorrect to assume that these visualizations should be used solely as a means of interpretation.

Too often, I believe, the historian has been cast as somebody whose prime responsibility is to read and interpret texts of the past. This ignores another, equally important function of the historian. As much a part of the historian’s job as research (even though research might be the most time-consuming aspect of it; to be sure, I am theorizing here, as I am merely training to become a professional historian) is the role of public orator. Historians are not merely researchers; they are also instructors whose skills are intended to help upcoming generations think critically about society and human interaction (it’s hard to imagine a professor who hasn’t taught a class at some point!). I believe that data visualizations have the potential to synthesize data in a way that is more readily and easily interpreted by students, and therefore to make historical data more accessible and interesting (to students whose major isn’t in the social sciences, for example).

For this reason, it was refreshing to read the article “Visualizations and Historical Arguments” by John Theibault and the book “Envisioning Information” by Edward Tufte. Theibault’s article discusses the power of visualizations in communicating “an argument or narrative beyond the meaning of the words.” He discusses the importance of balancing information density with its proper display, a theme also treated heavily in “Envisioning Information.” Tufte, for his part, discusses the importance of making visual data legible by avoiding “chartjunk,” his term for unnecessary visual elements, and by heeding the “1 + 1 = 3” effect, in which carelessly constructed visualizations produce visual noise that distracts the viewer and obscures the meaning of the display.

I think that in discussing the potential of the Digital Humanities, it’s important that we consider it in terms of the historian’s role as an instructor and arbiter of knowledge, not just as a researcher. Conceptualizing humanities computing in this way might be a step toward reducing some of the ambiguity about its potential benefits over traditional methods.

Experimenting with OCR

In my digital history class we recently discussed OCR (optical character recognition), software that transcribes scanned images of text. I was curious about how to use it, so I decided to give it a trial run.

I bought a copy of the Boston Globe one Friday with the intention of scanning it and running it through OCR. I selected three articles to scan: two local and one national. I decided to focus on one article in particular for a sample transcription. It was about a strike involving the union and top officials of Boston’s public transportation system (the MBTA), the sort of story that might one day interest economic historians.

To scan the newspaper, I had to go to my school library. The interface was pretty simple. It let me choose the file format to save the scans in (I chose PDF) and whether to scan them in color, greyscale, or black and white (I chose black and white on the advice of my professor). Conveniently, the library computer gave me the option of scanning the newspaper directly to an email attachment, so I was able to access the scans on my computer almost immediately.

One of the scanned pages

The complete set of scans, in their designated folder

After installing Tesseract, the OCR software, I had to move the scans into an isolated folder (labeled “pdfs”) and download a “director” file, a small script that lets Tesseract find and transcribe the scans. I then typed the command “cd” and dragged the folder from the Finder application (I use a Mac) into the Terminal window, which completed the command: cd followed by the path to the “pdfs” folder.

With this command entered, I could then direct Tesseract to transcribe the scans. To do so, I typed “sh” followed by the name of the “director” file and the name of the input file (sh runs the “director” script, which takes the input file as its argument).
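I cannot reproduce the class’s “director” script here, but for anyone who would rather script the step directly, the following is a rough Python equivalent of what I take it to be doing: render each scanned PDF to images, then run Tesseract over each image. The libraries (pdf2image, pytesseract) and folder names are my own choices for illustration, not what the script actually uses.

```python
# Rough Python stand-in for the transcription step (not the class's "director" script):
# render each scanned PDF to page images, then OCR each image with Tesseract.
# Requires the tesseract binary and poppler to be installed; paths are illustrative.
from pathlib import Path
from pdf2image import convert_from_path   # pip install pdf2image
import pytesseract                        # pip install pytesseract

pdf_dir = Path("pdfs")
text_dir = Path("texts")
text_dir.mkdir(exist_ok=True)

for pdf in sorted(pdf_dir.glob("*.pdf")):
    pages = convert_from_path(str(pdf), dpi=300)      # one PIL image per page
    transcription = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
    (text_dir / (pdf.stem + ".txt")).write_text(transcription)
    print(f"transcribed {pdf.name} ({len(pages)} pages)")
```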

All that was left to do was hit “enter,” and Tesseract converted the PDF and transcribed it, with the following result:

The "images" and "texts" folders were created by Tesseract; the "Tesseract-1" file is the "director" file

The “images” and “texts” folders were created by Tesseract; the “Tesseract-1″ file is the “director” file

The sample article, transcribed

As the picture above shows, Tesseract did a fairly decent job of transcribing the article accurately. It even reflected the newspaper’s text margins and was able to keep separate articles apart, even where their print was horizontally aligned on the same page. The program did have difficulty in some areas, though. Not surprisingly, the minor text at the top of the page, such as the letter identifying the section of the newspaper and the stock market indexes, was sloppily transcribed:

There were also some peculiar spelling errors:

"MBTA officials"

“MBTA officials”

"Lawsuit"

“Lawsuit”

A potentially more serious problem was that at one point Tesseract “misread” the page and inserted the text of another article into the middle of my sample:

The Original

The (Incorrect) Transcription

Nevertheless, I completed the transcription myself. Pretending that this would one day actually be used by historians, I rearranged the transcription into a more compact format, added the missing bits, and corrected spelling errors.

The Final Copy

My results using OCR were mixed, but overall it does expedite the process of transcription, and its errors can fairly easily be caught with a simple review. Frederick Gibbs and Trevor Owens have argued in their essay “The Hermeneutics of Data and Historical Writing” that descriptions of digital humanities methods need to be included in the historical literature so that potential inaccuracies may be spotted. As far as this very limited example, OCR, is concerned, I am not convinced that historians need to explicate the fact that they used digitally transcribed sources, or the process this method of transcription entails, so long as the transcriptions are diligently checked for accuracy. What might be needed more in this case is simply for the “bugs” in the software to be corrected; since software is not static (there are many versions and updates), OCR seems to have the potential to develop into a very powerful tool.
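One way to make that diligent check less tedious, at least as a first pass, is to flag every token that does not appear in an ordinary wordlist, so the reviewer can jump straight to the likely errors. A small sketch; the wordlist path is the default on my Mac, and the file name is hypothetical:

```python
# Sketch: flag OCR tokens that are not in a standard wordlist for manual review.
# The wordlist path is a macOS default; substitute any plain-text dictionary.
import re
from pathlib import Path

with open("/usr/share/dict/words") as f:
    WORDS = {line.strip().lower() for line in f}

def suspicious_tokens(path):
    text = Path(path).read_text(errors="ignore")
    tokens = re.findall(r"[A-Za-z]+", text)
    # Skip capitalized tokens (likely proper nouns); flag the rest if unknown.
    return sorted({t for t in tokens if not t[0].isupper() and t.lower() not in WORDS})

for token in suspicious_tokens("texts/mbta_article.txt"):   # hypothetical file name
    print(token)
```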


Data Visualization

While exploring the world of “humanities computing,” that is, the use of digital methods in research, some common themes I’ve noticed are the visualization of data and text mining. By “visualization” I refer to the use of computers to generate charts and models that show associations between, or elements of, data (usually quantitative in nature). Text mining, on the other hand, is the scanning of digitized texts for word occurrences and word patterns, and can be open-ended or delimited by the researcher according to his or her interest.

Text mining is covered heavily by Matthew Jockers in Macroanalysis. His main concern is to demonstrate the viability of applying text mining to large sets of data, in his case a corpus of principally Irish and British literature. By using computer software, Jockers is able to complete quickly an analytical process that would take weeks if left to human mental faculties. He compiles data about the writing style, themes, and subjects in the corpus of novels to show their relationships with, for example, nationality and gender. He also represents this data visually, using charts and word clusters (a model in which the size of a word or phrase and its position relative to the center of the cluster symbolize significance) to show trends or hidden aspects in the data, whether by year, region, nationality, or the like.
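A toy version of the kind of tabulation Jockers automates might look like the following: group texts by a metadata field such as nationality and compare the relative frequency of chosen words. The corpus, labels, and word list here are invented purely for illustration and have nothing to do with his actual data.

```python
# Toy sketch of macroanalysis-style tabulation: relative frequency of chosen
# words, grouped by a metadata field such as nationality. All data is invented.
from collections import Counter, defaultdict
import re

corpus = [
    {"nationality": "Irish",   "text": "the land and the famine weighed on every page of the novel"},
    {"nationality": "British", "text": "the drawing room and the season filled every page of the novel"},
]
targets = {"land", "famine", "season", "room"}

counts = defaultdict(Counter)   # nationality -> Counter of target-word occurrences
lengths = Counter()             # nationality -> total token count

for doc in corpus:
    tokens = re.findall(r"[a-z]+", doc["text"].lower())
    lengths[doc["nationality"]] += len(tokens)
    counts[doc["nationality"]].update(t for t in tokens if t in targets)

for nationality, counter in counts.items():
    for word, n in counter.items():
        print(f"{nationality}: '{word}' = {n / lengths[nationality]:.2%} of tokens")
```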

This week I’ve encountered some essays which use the visualization technique in a slightly different way. In “The Other Ride of Paul Revere,” by Shin-Kap Han, I was introduced to the use of network modeling. Han uses models to visualize membership data, illustrating connections between Paul Revere, Joseph Warren, and political groups that would otherwise appear disjointed at best. Caroline Winterer, for her part, uses network modeling to map the letter correspondence of Benjamin Franklin and Voltaire. A significant difference between these scholars and Jockers is that they use modeling to represent not aspects of text but more concrete historical social realities.
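Membership data of the kind Han works with lends itself to a bipartite graph, people on one side and organizations on the other, which can then be projected onto a person-to-person network. A minimal sketch using networkx; only Revere and Warren come from the article, and the memberships listed are placeholders rather than the historical record.

```python
# Sketch: bipartite membership network (people x groups) projected onto a
# person-to-person network, in the spirit of Han's analysis. The membership
# list below is a placeholder, not the historical data.
import networkx as nx
from networkx.algorithms import bipartite

memberships = [
    ("Paul Revere",   "Caucus Club"),
    ("Paul Revere",   "Masonic Lodge"),
    ("Joseph Warren", "Caucus Club"),
    ("Joseph Warren", "Masonic Lodge"),
    ("Member X",      "Masonic Lodge"),
]

people = {person for person, _ in memberships}
groups = {group for _, group in memberships}

B = nx.Graph()
B.add_nodes_from(people, bipartite=0)
B.add_nodes_from(groups, bipartite=1)
B.add_edges_from(memberships)

# Two people are linked if they share at least one group; the edge weight
# records how many groups they share.
G = bipartite.weighted_projected_graph(B, people)
for u, v, data in G.edges(data=True):
    print(f"{u} -- {v}: {data['weight']} shared group(s)")
```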

Whatever the end result of modeling and/or text mining, there’s a notable degree of cautious skepticism present among the writers. Jockers suggests the methods of the digital humanities can only accompany, or are even secondary to, the traditional method in humanities scholarship of close reading and interpretation. Winterer also says distinctly that her study attempts to bump against the boundaries of digital mapping and that we have “good reasons to be wary of what digitization and visualization can offer us” (Winterer, 598).

While I think it’s too early for me to put in a good word for text mining, especially after an in-class experiment last week returned some peculiar results, the visualization side of the digital humanities seems, based on my inchoate impressions at least, to be a useful tool for interpreting data. Certainly there are real questions about how to properly interpret the meaning of the data being visualized in its wider historical context, a task that, it could be argued, still depends on the traditional “close reading” approach. But it seems more striking, more eye-catching, to visualize the transfer of letters than to weed out their individual destinations one by one, for example. We should be cautious about how we interpret models, but we can certainly accept them as an aid to our interpretation.

The Potential of Cliometrics

Anyone who reads Robert Fogel and Stanley Engerman’s “Time on the Cross” knows immediately that its claims were bound to cause controversy. A significant part of this controversy, no doubt, stems from the heavy use of numerical data in the analysis of the institution of slavery. The numbers seem cold and barely capture the macabre picture of slavery that we are used to encountering in more traditional, humanistic expositions. Worse still, at times Fogel and Engerman’s language seems to allude to the “Uncle Tom” image of pitiful subservience and obedience when describing the typical slave. Although the authors did not mean to suggest this (they say they admire black achievement under the adversity of white overlordship), one cannot help but conjure the image when they speak, for example, of the supposed motivation of slaves to be appointed to “better” roles on the plantation.

Despite the controversy, I think it’s a shame that this study may have caused cliometrics to fade completely into the background of historical research, because it offers some useful tools for historians. Its capabilities as a tool for comparative studies struck me as particularly strong. One relatively uncontroversial section of “Time on the Cross” is the first chapter, where Fogel and Engerman discuss some of the differences between slavery in the United States and in the Caribbean. They compare slave imports into the Caribbean and the U.S., foreign-born slaves against the rest of the U.S. population, and the growth of the actual slave populations in each region, exposing very real differences between the slave trades of the U.S. and the Caribbean that the numbers make explicit. The reader gets a harrowing portrait of slaves being sent to the Caribbean in droves to replace those who had succumbed to tropical diseases, while in the U.S. the slave population became “naturalized,” creating a potentially different dynamic to be further explored by historical study.

Steven Ruggles’ “The Transformation of American Family Structure” is another example of the use of quantitative comparisons to show intriguing facts. Some scholars claim, we learn, that the traditional family structure never really existed. Yet Ruggles suggests that although these “extended households” may have been a minority, they were still an ideal that directed behavior more often than not. Using life expectancy to calculate the potential percentage of families that could have had the traditional structure of elderly kin living with younger generations, Ruggles shows that a high proportion of the families that could have had this structure actually did. By contrast, life expectancy rose over the 20th century, and yet the traditional family structure is found even less often.
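The shape of that calculation fits in a few lines. The numbers below are entirely invented to show the logic, not Ruggles’ figures: estimate from life expectancy the share of families that could have had an elderly relative alive to co-reside with, then compare the share that actually did against that ceiling.

```python
# Shape of the Ruggles-style comparison, with invented numbers (NOT his data):
# the observed share of extended households measured against the share that
# was demographically feasible given life expectancy.
feasible_share = 0.30   # hypothetical: families with an elderly kin member alive
observed_share = 0.20   # hypothetical: families actually containing elderly kin

# Of the families that COULD have formed the "traditional" structure, how many did?
propensity = observed_share / feasible_share
print(f"{propensity:.0%} of families that could have had the traditional structure did")
```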

Comparative quantitative studies offer one way to make meaning out of numbers that is detailed and exciting, much as first-hand accounts provide meaningful qualitative data. It would be a shame to push them aside completely.

Is the Internet Reliable for Source Material?

Daniel Cohen and Roy Rosenzweig only briefly mention the idea that the internet has potential as a source of historical data. I think this topic begs for more consideration. So many people have access to the internet nowadays, and so many websites give them the ability to “talk” (or should I say, type) about themselves, their lives, and the world around them. The internet could, in theory, be a reservoir of data on popular opinions and reactions to everything from monumental political events, such as the enactment of a new law, to race relations, to a high-grossing film.

I recently came across this blog entry, which discusses an online newspaper that came up with the idea of adding “annotations” to individual paragraphs in its articles. This way, ordinary people can share their own thoughts or experiences in response to the specifics discussed in the article. Imagine, then, that we are reading an article about the career of a politician running for election. Each segment of the article, each discussing a different event in the person’s career, could be commented on by readers, who might respond with contempt, anger, approval, intrigue, and so on. This could be used to pinpoint the various attitudes that affect the outcome of the election, adding a qualitative dimension to measuring the values and opinions of the public.
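In data terms this amounts to nothing more than attaching a list of labeled comments to each paragraph and tallying the labels; a toy sketch, with every identifier and comment invented:

```python
# Toy sketch: paragraph-level annotations as a data structure, tallied by attitude.
# Every identifier and comment here is invented for illustration.
from collections import Counter

annotations = {
    "paragraph-1": [("approval", "Glad this vote is finally getting attention."),
                    ("anger",    "This decision hurt people in my town.")],
    "paragraph-2": [("intrigue", "I had never heard about this part of the record.")],
}

for paragraph_id, comments in annotations.items():
    tally = Counter(label for label, _ in comments)
    print(paragraph_id, dict(tally))
```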

Searching the internet for source material is beset by difficulties, however. Not everybody chooses to post a response online. Furthermore, the internet is highly anonymous: we don’t know whether a certain kind of person is doing the posting, which makes generalizing about the public more difficult. It’s easier to generalize, for example, about the values of Republicans, who represent a specific portion of the voting population, when we have a central figure speaking at a national convention to quote. When we’re dealing with commentary on the internet, we don’t know whether the people commenting are outliers, and we don’t always have access to other social variables such as income or party affiliation; in short, it’s difficult to quantify the data and state what it means for the larger population, or to determine which group of people it represents.

It is difficult to tell how reliable common internet discussions and comments are for historical analysis. Perhaps for this reason the internet has not been thoroughly discussed as a potential source of material. Although comments and “posts” on the internet have questionable merit as data, they do illuminate, to some degree, the lives of ordinary people. A suitable next step might be to ask how the production of such material could be channeled to make its interpretation less problematic for the historian.

Blogs I will Follow

I will be following