Thoughts on the effect of distant reading on research

I have been researching the collaboration of Black organizations in South Africa during the 1920s. It was during these years that Marcus Garvey’s UNIA movement was gaining momentum in Africa, particularly in South Africa and Liberia; the South African Communist Party was founded; the Industrial and Commercial Workers’ Union (ICU), a sister organization to the IWW, was founded; the African National Congress was gaining momentum in South Africa; and religious rebellions were breaking out. While this topic has not attracted a great deal of attention from historians interested in South Africa, there is a wealth of sources, many easily accessible following the Truth and Reconciliation Commission. The challenge has been sorting through all the useful sources. For example, I have more than 50 speeches, articles, and documents from these organizations specifically from the 1920s, with many more that I am aware of. There is also an abundance of newspapers published or circulated by these groups at the time. My project has been to understand the social and political climate of the native community in South Africa during this interim period between South African independence from Britain and the establishment of apartheid.

Digital history has begun to open up new ways of reading the many sources that I have been struggling with, as well as new questions to ask about them. This week I tracked down many of these sources by sifting through online archives and pulling out nearly every source from 1920s South Africa. The majority came from http://africanactivist.msu.edu/index.php, http://www.sahistory.org.za/, and http://www.anc.org.za/index.php. Many of these I had seen before and passed by because I simply could not read it all. While searching this week, though, I was not worried about my ability to read each source and did not choose only a selection that seemed especially interesting to me; instead, I was thinking that by using text analysis to “read” these sources I would actually get a more complete representation of these organizations. We have discussed concerns that digital history methods further distance us from history, making it a less personal practice of statistics, graphs, and text analysis rather than of telling a story, but this week I realized that these tools will allow me to do further research and tell a more complete story.

Visualizing Networks

It’s difficult to live in Boston and not have a preconceived notion of Paul Revere’s contributions to the American Revolution. These notions usually fall into two camps: those who love Longfellow’s poem and believe he was the sole rider and the most important person during the night before Lexington and Concord, and those who believe he is an overrated figure who doesn’t deserve the praise and notoriety brought to him through Longfellow’s work. Having worked as a State House tour guide for a year, I confronted many interpretations of Paul Revere’s ride through my daily interactions with the public. As a public historian, one is often put in this position: how do we navigate national memory, invented traditions, and personal beliefs while conveying some sort of historical truth to our audience?

To me, this network analysis of Paul Revere’s role during the Revolution helps solve a problem where words and other more “traditional” methods of history have thus far failed us. By placing data into these illustrations and charts, it becomes clear what Revere’s (and Warren’s) roles in the Revolution were. We find that the answer lies somewhere in between the two camps that typically form. As Dave states in his post, “these sorts of big data visualizations give us a way to demonstrate large-scale data in a more effective way than just numbers or prose.” We may not be talking about large-scale data when it comes to Paul Revere, but the same idea applies. Visualizations are not only a way to accompany a scholarly work; they can stand alone quite effectively as well. To me this does what Winterer discusses in Where is America in the Republic of Letters?: “like a satellite hovering above the Earth, visualization can help us to see the big picture amid bewildering complexity and to detect new patterns over time and space.”

Winterer’s work differs from that of Shin-Kap Han: Han works with concrete membership data, while Winterer works with letters that are difficult to digitize and categorize. She demonstrates how they can be used to show globalization trends and to examine data across national borders as well as within them. From both articles we see that networks serve as a valuable way to examine the past from a new perspective.
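As a concrete (and admittedly toy) illustration of what Han is doing, here is a minimal sketch in R with the igraph package: a made-up membership matrix is turned into a bipartite graph and then projected so that people are linked by shared memberships. The people and groups below are illustrative stand-ins, not Han’s actual dataset.

```r
# A toy affiliation-network sketch in the spirit of Han's analysis.
# Requires the igraph package; all membership data here is invented.
library(igraph)

# Rows are people, columns are organizations; 1 = member.
membership <- matrix(
  c(1, 1, 0,   # Revere
    1, 1, 1,   # Warren
    0, 1, 0),  # Church (illustrative third person)
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Revere", "Warren", "Church"),
                  c("North Caucus", "Long Room Club", "Tea Party"))
)

# Build the bipartite person-to-group graph, then project it so two
# people are linked whenever they share at least one organization.
g <- graph_from_incidence_matrix(membership)
people <- bipartite_projection(g)$proj1

# Edge weights count shared memberships; plot the co-membership network.
plot(people, edge.width = E(people)$weight)
```

Even at this scale, the projection makes the point of the article: Warren and Revere sit at the center because they bridge more groups than anyone else.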

A Reflection on Data Visualizations

This week I was catching up on recent blog posts in my RSS feed and I stumbled upon a very interesting (and very pretty) visualization of Baltic Sea Traffic (thanks to James Cheshire at http://spatial.ly/ for posting this video!). This visualization got me thinking about the benefits of using visualizations to represent big data, how visualizations can be argumentative, and the consequences of these realizations for how we critically examine such visualizations. But first, let’s take a look at the video:

So what does this visualization show us about the potential of using visualizations to represent data? First, it shows us how big data visualizations help viewers understand the scale and quantity of our data. It reminds me of the quote often attributed to Stalin: “A single death is a tragedy; a million deaths is a statistic.” In the Baltic Sea Traffic visualization, sure, they could just say how many ships travel in and around the Baltic Sea on any given day. But that is just a number, and a really large one at that. If they just said there were–to make up a number–100,000 ships per day, would we really be able to understand how many ships there were? Would we get a sense of their routes? I think the answer is no. When I see a number like 100,000, I understand that it is a large number, but I cannot really picture it. This visualization is so much more effective at getting the viewer to understand the sheer number of ships moving in and out of the Baltic Sea on a given day. By depicting each ship as its own node, we get to see their movement and interaction in real time. It is a truly chaotic picture, made even more effective by showing the number of accidents, collisions, groundings, and illegal spills in the middle of the traffic visualization. These sorts of big data visualizations give us a way to demonstrate large-scale data more effectively than numbers or prose alone.

This brings up another aspect of visualizations that I think is very important. Visualizations are not just evidence that supports a given argument; they are not just data or information. Visualizations can be argumentative. Sure, you might want to add some prose to explain or flesh out the argument, but this three-minute video mostly lets the moving image speak for itself. And I think it is more powerful because of that.

Finally, if visualizations can be argumentative then we, as critical humanists, must evaluate these visualizations as arguments–meaning we need to consider intentionality and purpose when analyzing the strength and value of these visual arguments. Much like a photograph or a work of art, these visualizations have a purpose and a message that affects the way they are constructed or displayed. This intentionality must be critically examined when evaluating these visualizations.

Data Visualization

While exploring the world of “humanities computing” (that is, the use of digital methods in research), I’ve noticed some common themes: the visualization of data and text mining. By “visualization” I refer to the use of computers to generate charts and models that show associations between, or elements of, data (usually quantitative in nature). Text mining, on the other hand, is the scanning of digitized texts for word occurrences and word patterns; it can be open-ended or delimited by the researcher according to his or her interest.

Text mining is covered heavily by Matthew Jockers in Macroanalysis. His main concern is to demonstrate the viability of applying text mining to large sets of data; in his case, a corpus of principally Irish and British literature. By using computer software, Jockers is able to quickly complete an analytical process that would take weeks if left to human mental faculties. He compiles data about the writing style, themes, and subjects in the corpus of novels to show their relationships with, for example, nationality and gender. He also represents this data visually, using charts and word clusters (a model where the size of a word or phrase and its position relative to the center of the cluster symbolize significance) to show trends or hidden aspects in the data, whether by year, region, nationality, etc.
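As a rough illustration of the kind of comparison Jockers performs (a toy sketch, not his actual pipeline), one could tabulate a word’s relative frequency per novel and split it by metadata fields. Every title, group, and number below is invented.

```r
# A toy sketch of metadata-driven comparison: per-novel relative word
# frequencies joined to nationality and gender, then averaged by group.
# All data here is made up for illustration.
novels <- data.frame(
  title       = c("Novel A", "Novel B", "Novel C", "Novel D"),
  nationality = c("Irish", "Irish", "British", "British"),
  gender      = c("female", "male", "female", "male"),
  freq_famine = c(0.0021, 0.0017, 0.0003, 0.0004)  # relative frequency of one theme word
)

# Mean relative frequency of the theme word, split by each metadata field.
aggregate(freq_famine ~ nationality, data = novels, FUN = mean)
aggregate(freq_famine ~ gender,      data = novels, FUN = mean)
```

Scaled up to thousands of novels and thousands of words, this is essentially how stylistic and thematic trends get tied back to nationality, gender, or decade.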

This week I’ve encountered some essays that use the visualization technique in a slightly different way. In Shin-Kap Han’s The Other Ride of Paul Revere, I was introduced to the use of network modeling. Han uses models to provide a visualization of membership data, illustrating the connections between Paul Revere, Joseph Warren, and political groups that were otherwise disjointed at best. Caroline Winterer, for her part, uses network modeling to map the letter correspondence of Benjamin Franklin and Voltaire. A significant difference between these scholars and Jockers is that they use modeling to represent not aspects of text but more concrete historical social realities.

Whatever the end result of modeling and/or text mining, there’s a notable degree of cautious skepticism among these writers. Jockers suggests the methods of the digital humanities can only accompany, or are even secondary to, the traditional method in humanities scholarship of close reading and interpretation. Winterer, likewise, states explicitly that her study attempts to bump against the boundaries of digital mapping and that we have “good reasons to be wary of what digitization and visualization can offer us” (Winterer, 598).

While I think it’s too early for me to put in a good word for text mining, especially after an in-class experiment last week returned some peculiar results, the visualization aspect of the digital humanities seems, based on my inchoate impressions at least, to be a useful tool for interpreting data. Certainly there are real questions about how to properly interpret the meaning of the data being visualized in its wider historical context, an end that, it could be argued, depends on the traditional “close reading” approach. But it seems more striking, more eye-catching, to visualize the transfer of letters than to weed out their individual destinations one by one, for example. We should be cautious about how we interpret models, but we can certainly accept them as an aid to our interpretation.

Identifying Themes

In Macroanalysis, Jockers makes a concerted effort to delineate the reach and the limitations of “big data” with regard to interpretation and meaning. In the foundational opening chapters, he is careful to qualify the products of computer-based text analysis, noting that “an early mistake …was that computers would somehow provide irrefutable conclusions about what a text might mean” (40). He explicitly denies that the macro scale is supposed to supplant analysis in that regard. Given how carefully he maintains this limitation through all the literary and historical conclusions he reaches, I found the way he toes this line in his section on topic modeling particularly interesting.

I am accustomed to thematic material that is derived between the lines of a book’s contents and usually subjectively interpreted, but I now think that is because I conflate the discovery of a theme with its interpretation. This relates back to Jockers’ description of research bias regarding Irish Americans, where the focus on urban environments in Eastern states clouded the understanding of important activities on the Western frontier. The raw material of a piece of literature, consisting of the author’s actual words, is the same no matter who reads it, and specific topics require the use of specific words. Through topic modeling, these words are drawn out to find themes that may be obvious in one text (whaling in Moby Dick) but not so much in others. Done on a macro scale, learning of the existence of these topics in a corpus of works too large for a single person to synthesize clearly reveals trends that open up other avenues for research, probably pursued on a micro scale.

This is where interpretation and meaning become foregrounded considerations: subjectivity is derived from the commentary the author makes with the theme and from the emphasis or connection the reader places on certain attributes of the theme due to their own biases, and an author’s “artistry…comes in the assembly of these resources” (174). The trends highlighted by a macro reading serve to contextualize findings from a micro reading (even though macroanalysis often literally removes words from their surrounding text), because works of literature do not exist in isolation, and neither do historical events and activities. In Jockers’ own efforts mapping influence and discussing intertextuality in literature, he touches upon what seems like a valuable application for historical research. Just as literature synthesizes the (usually) unique voice and insight of an individual author with the common qualities of language, culture, and generation, and requires different methods to access specific data, history engages similar parameters of time and distance, and thus similarly benefits from the interplay of close and distant readings.

Zooming In and Out: Close Reading and Distant Reading

After reading Jockers’s (Macroanalysis, 2013) introductory chapters, one concept jumped out at me that I found very valuable for understanding the role of macroanalysis and “distant reading” in the humanities. One usually hears about the close reading/distant reading debate as something mutually exclusive, where a humanist employs either close reading or distant reading when analyzing sources. Close readings suffer from what Jockers refers to as “anecdotal evidence” (8), where one hypothesizes overarching theories from a very limited sample. Distant readings, on the other hand, may be able to analyze more texts in different ways, but often result in a loss of the contextual information that a close reading can reveal.

Jockers, however, in his third chapter, “Tradition,” uses a word that I think offers a very valuable way of thinking about the close/distant reading debate. He uses the word “zooming” (in and out) to describe how close and distant readings can be complementary. “Zooming” is a useful and interesting way to describe this phenomenon: it implies a spectrum of scale in text analysis, and in digital work in general. Instead of choosing to do a close reading OR a distant reading on a given corpus, one can zoom in and out along this spectrum of textual analysis. Zooming establishes a complementary rather than a combative interplay between close and distant reading.

And this zooming can be employed across different projects or within the same one. One can “zoom in” on a single work or small corpus of sources, employing a traditional close reading, or one can “zoom out” and perform a distant reading of a million books. But this does not mean that once one zooms in, one cannot zoom out in the same project. The scholar analyzing a single work or small corpus of sources can still benefit from a distant reading of both those sources and the larger corpus of digitized works. They can perform a basic text analysis of word usage, word pairings, and structural components to inform their close reading. Moreover, one can zoom out even further and use a broader text analysis of related works from the period to confirm or support broader claims based on close reading. For example, one might postulate broader societal, political, and religious trends of a certain place during a specific time period based on a close reading; a distant reading of the larger corpus of works from the same region around the same time can support or disprove these speculations.

A distant reading, therefore, complements a close reading by acting as a means to sidestep relying only on “anecdotal evidence.” A distant reading can also be supplemented by a close reading of the text: a pure distant reading runs the risk of becoming too abstract or removed from the texts, and a close reading of a sample of works from a larger text analysis can support the broader phenomenological and discursive trends that distant readings attempt to reveal. Consequently, zooming in and out along this spectrum of scale allows close readings to complement distant readings, and large-scale text analysis to support the claims of in-depth study of a limited number of sources. Zooming in and out allows those working in the humanities, digital or traditional (for lack of a better word), to make their arguments with greater precision and efficacy.
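To make this concrete, here is a minimal sketch in base R of what zooming might look like in practice: a distant count of one keyword across a whole corpus, then a close, keyword-in-context look at the single text where it matters most. The corpus folder and the keyword are assumptions invented for the example, not drawn from any of the readings.

```r
# "Zooming" between distant and close reading, base R only.
# Assumes a folder of plain-text files named corpus/ (illustrative).
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)
texts <- vapply(files,
                function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                character(1))

# Lowercase and split each text into words.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))
  words[nchar(words) > 0]
}

# Zoom out: count one keyword across every text in the corpus.
keyword <- "liberty"  # illustrative choice
corpus_counts <- vapply(texts, function(t) sum(tokenize(t) == keyword), numeric(1))
print(sort(corpus_counts, decreasing = TRUE))

# Zoom in: read the keyword in context in the text where it is most frequent.
focus <- tokenize(texts[which.max(corpus_counts)])
for (i in head(which(focus == keyword), 5)) {
  cat(paste(focus[max(1, i - 5):min(length(focus), i + 5)], collapse = " "), "\n")
}
```

The distant pass tells you where to look; the close pass tells you what the pattern actually means, which is exactly the complementary relationship described above.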


Text Analysis

In addition to Jockers’ Macroanalysis, which is a formidable work with a huge number of texts, we’re reading a short blog post by Cameron Blevins about topic modeling Martha Ballard’s diary.

This is an example of the sort of small-scale but potentially helpful intervention that humanists can make quite easily with some existing packages.

We’re going to do some topic modeling in class using R and Mallet; I’ll have some sets of texts built up that you can work with, but it will be extra useful if–like Blevins–you’re able to find a text collection of your own to work with. If you have one, or an idea for one, let me know and we can figure out how to get it ready for class.
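If you want a head start, here is a minimal sketch of that workflow using the mallet package in R. The folder of .txt files and the stopword list en.txt are placeholders; swap in your own collection.

```r
# A minimal LDA topic-modeling sketch with the R mallet package.
# texts/ and en.txt are placeholder paths; 20 topics and 200 training
# iterations are arbitrary starting points, not recommendations.
library(mallet)

files <- list.files("texts", pattern = "\\.txt$", full.names = TRUE)
docs  <- data.frame(
  id   = basename(files),
  text = vapply(files,
                function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                character(1)),
  stringsAsFactors = FALSE
)

# Import the documents, dropping stopwords listed in en.txt.
instances <- mallet.import(docs$id, docs$text, "en.txt",
                           token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Train the model.
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(instances)
topic.model$train(200)

# Inspect the top ten words in the first topic.
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, topic.words[1, ], 10)
```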

Applying Macroanalysis to Individual Research Projects

I identified with this week’s readings as something I could apply directly to my own interests and research topics. While reading Macroanalysis it became clear that I could use these methods myself. Just think of the possibilities of thousands of transcribed cookbooks (much like last week’s assignment) and of applying methods of text analysis and data mining to them. One of the major issues I have grappled with in my study of cookbooks, and in trying to find popular recipes and standard measurements and ingredients, is deciding which cookbooks to look at. In my last paper on the history of baking apple pie, I focused heavily on the Joy of Cooking as the be-all, end-all general cookbook for people interested in making pastry in the United States. By being able to compare hundreds or thousands of cookbooks side by side, looking for keywords such as “granny smith,” “butter,” or “lard,” I would have had a much easier time delineating the most popular recipes and techniques throughout history. Instead, I had to assume that in 20th-century America all women followed the standard piecrust presented in one cookbook, albeit a popular one. It would also be helpful in finding which recipes were most popular during specific time periods, information that could lead to broader analysis of food availability by region or cultural preference, in the same way that Jockers was first able to examine Irish-American authors and later a broader range of texts.
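To make this concrete, here is a hypothetical sketch in base R of such a keyword comparison; the cookbook dates and text snippets are invented stand-ins for full transcriptions.

```r
# A toy keyword comparison across transcribed cookbooks, base R only.
# All years and texts below are invented for illustration.
cookbooks <- data.frame(
  year = c(1896, 1931, 1953, 1975),
  text = c("rub the butter and lard into the flour",
           "cream the butter with the sugar",
           "a flaky crust calls for chilled lard and a little butter",
           "toss sliced granny smith apples with butter and cinnamon"),
  stringsAsFactors = FALSE
)

# Count occurrences of a phrase in each cookbook's text.
count_term <- function(text, term) {
  lengths(regmatches(tolower(text), gregexpr(term, tolower(text), fixed = TRUE)))
}

terms <- c("granny smith", "butter", "lard")
for (term in terms) cookbooks[[term]] <- count_term(cookbooks$text, term)

# Aggregate by decade to watch ingredient mentions shift over time.
cookbooks$decade <- 10 * (cookbooks$year %/% 10)
aggregate(cookbooks[terms], by = list(decade = cookbooks$decade), FUN = sum)
```

With real transcriptions, the same tallies could be broken out by region or publisher rather than decade, which is exactly the kind of availability-and-preference question raised above.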

I was further encouraged in reading about Martha Ballard’s diary and how it was digitized and analyzed. It seems that these methods would give the benefits of close reading while also allowing for a broad study of several hundred primary sources. In Macroanalysis I was most impressed with the layout of the book: the charts and the descriptions made it appear as though Jockers was doing what we read about last week, revealing his methods and the thought process behind his analysis. Doing so made the work feel more transparent, which I believe allowed me to connect it more to my own interests and the resources I am working with.

I should also note that while there are several clear advantages to these methods, and to their presentation by the two digital humanists we have looked at, I can see why those involved in the postcolonial digital humanities discussion would have a hard time jumping on this macroanalysis bandwagon right away. Jockers was able to analyze works by authors from England, Ireland, Scotland, and the U.S. This raises the question of how long it would take, and whether anyone is working, to digitize the writings and primary sources of other nations, outside of this white, mostly well-educated frame. What interests me as well is whether these programs can be developed to handle different languages, even ones that don’t use Latin characters, and whether they would be as effective as the process that made Martha Ballard’s scrawl analyzable. I tried to do some searching on my own, but do we know if scholars are currently developing programs to counteract this apparent English-language dominance?

Digitizing Family Heirlooms & Records

While scrolling through my Twitter feed, I found a link to an article from the Daily Mail in the UK about new Civil War artifacts that are being “discovered” and digitized as part of the sesquicentennial celebrations. Historians and archivists are working to expand the types of primary sources they include in their online databases by reaching out to families and encouraging them to search their attics for relics.

One statement that I found particularly interesting is: “In Virginia, archivists have borrowed from the popular PBS series ‘Antiques Roadshow,’ travelling weekends throughout the state and asking residents to share family collections, which are scanned and added to the already vast collection at the Library of Virginia.” This is a different sort of crowdsourcing from the transcriptions and map rectifying that we discussed in class. These photos and documents are being made public by state libraries and archives; in fact, you can see some of them within the article as well.

From these initiatives we can see the value of opening up the digital humanities to public use. By making these artifacts accessible to the public and by expanding public interest, we can ensure the preservation of our nation’s heritage as well as expand our knowledge of the American Civil War.

Transparency & Selectivity

In light of the issues concerning Time on the Cross and the marked lack of transparency surrounding the authors’ methodology, one aspect of data accessibility stuck with me as an articulation expanding on the qualifications we discussed in class. Bringing methodology to the foreground of historical writing, as described by Gibbs & Owens in “The Hermeneutics of Data and Historical Writing,” integrates the communication of processes with the findings. This serves to incorporate data and quantitative analysis with traditional uses of historical sources in a manner that assigns each a function appropriate to its scope, which the authors describe when they note that “as historical data becomes more ubiquitous, humanists will find it useful to pivot between distant and close readings.” Distant readings call for innovative methodologies and collaborations, and they often serve as the mechanism for framing questions and directing attention to previously imperceptible patterns and trends, which cannot be separated from the findings themselves. Gibbs & Owens also note that “historians need to treat data as text,” which frames the supposed dichotomy between text and data as a false separation: both are already treated similarly, being acquired, manipulated, analyzed, and represented. This broader, more exploratory approach to data also highlights the fundamental difference from purely mathematical hypothesis testing, which ignores the subjectivity of historical information.

Fogel and Engerman certainly generated a fair amount of discourse through their own procedure; however, as Gibbs demonstrates through Owens’ own research, the kind of additive commentary that occurs during a project where the methodology is laid bare allows researchers greater access to critique, commentary, expansion, and inspiration. Where the scope of and access to data are expanding, greater weight is given to a researcher’s methodology than before, and the interaction becomes part of the interpretation. However, as Robert Darnton describes in his article, access remains tied to money and power, and there is an ever-shifting balance between private and public interests. As information is also treated as an asset, this valuation of data seems to guarantee conclusions warped to reflect the selectivity of available material. Darnton references the Enlightenment thinkers to frame the disconnect between accessibility and privilege, where ideals fail to reflect reality. This seems applicable to the discussion Gibbs & Owens initiate as well. Through their emphasis on transparent methodology, there might be a window for corrective forces against the bias of data filtered by private interests. Though Darnton has faith in Google as a benign force, it may be that the vigilance he calls for is most appropriately generated when analyzing these works made denser through the additional scope of methodology, in whatever form that ends up taking.