---
title: "Humanities Data Analysis"
date: 2015-04-01
author: Benjamin MacDonald Schmidt
...


# Introduction

**Slides off**

Thanks to everyone for bringing me here.

I want to talk about the challenges and opportunities of working with data as a primary source in historical research. I'll mostly do this in the context of my major historical project right now, which re-analyzes the data created by the 19th-century American state to better understand the origins of the way we talk about data today.

But before I get there, I want to talk just a little, by way of introduction, about the broader issues at play here. As I'll describe more in the second half of my talk, the American historical profession has long been intricately involved in shaping the reception and the understanding of certain large data sets.

But there's been more attention lately to the engagement that humanists can and should have with large data sets.


1. Our sources are increasingly digitized. Hardly any historians conduct research without relying heavily on keyword search for literature review. Those who manage to stick to the old methods have trouble engaging with their students.
2. We're becoming increasingly interested in understanding how data itself was created in a wide variety of fields *outside the sciences*. 
   a. The most recent major foray of mainstream historians was in the era of social history and the dreaded specter of cliometrics--data offered an opportunity to describe the experiences of neglected populations, through an objective suite of tools proffered by the social sciences.
   b. For better or for worse, there's something of a "mania for management" [@johnson_brute_2014] in the historiography right now. Managers and their bureaucracy produce data; and better ways of understanding that data can help us understand both their ideologies and the shortcomings of their approaches. Johnson is pessimistic about the trend, but I think it's undeniable that there's been a post-2008 move towards reassessing the importance of large structural forms.
3. That push towards a better vocabulary for the critical analysis of data serves obvious needs to understand the role that managerial and algorithmic approaches based on data play in the present day.
   a. Nationally, the analysis and visualization of data is only increasing in public importance.
   b. Historians are all quite aware at the discursive level that data is a historical artifact, constructed by social forces, and hardly the bearer of scientific truth.
   c. But we often remain poor readers of data itself. Too much of an emphasis on "reading," and not enough on "analyzing," means we're left with a limited set of tools for critical analysis of data, which often don't let us really understand what's going on under the surface.

# 1. Digital sources: texts

To help illustrate what's going on under the surface, I want to start off by just talking briefly about texts.

Like many digital humanists, I'm particularly interested in texts. I was trained primarily as an intellectual historian; my dissertation was about changes in discourse in a number of professions in and out of the academy, and focused in part on shifts in language traceable across some very large textual corpora. The digital humanities infrastructural project I'm most involved with is Bookworm, which allows all sorts of visualizations of our complicated digital libraries.

1. The first, opacity, is a **technical** problem. There are obviously enormous possible benefits from the huge collections of texts that libraries and corporations are putting out there: but we tend to understand very imperfectly the actual contents of those archives; what they include, what they omit, and why.

Take, for example, the Google Books collection. It's built up out of all sorts of different libraries, collected at different times with different agendas.

Some scientists at Harvard worked with Google to build a website showing the uses of particular words over time. That's called the Google Ngrams viewer; it lets you plot the occurrence of single words over time. They usually promote it as a way to understand how all of human culture is changing: that's problematic. But what it certainly lets you do is learn something about the history of books. Let me give you an extreme example of that. If you look at the words "thou, thy, and thee" in Google Ngrams, they decline steadily for centuries until suddenly, in 2008, they make a huge resurgence.

	![](https://books.google.com/ngrams/graph?content=thou%2Cthy%2Cthee&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cthou%3B%2Cc0%3B.t1%3B%2Cthy%3B%2Cc0%3B.t1%3B%2Cthee%3B%2Cc0)

A second slide shows the ways that presses change.

	![](https://books.google.com/ngrams/graph?content=BiblioBazaar%2CHarvard+University+Press%2CMacmillan&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CBiblioBazaar%3B%2Cc0%3B.t1%3B%2CHarvard%20University%20Press%3B%2Cc0%3B.t1%3B%2CMacmillan%3B%2Cc0)

But it's also about libraries.

Take the zip code 02138.

*Slide: 02138* 

*Slide: 02138Explanation* 

    ![Google's partner libraries shift in 1900, 1922, and later](http://benschmidt.org/slides/images/libraries2.png)

Often historians conclude from this that you'd be foolish to use a tool like Ngrams to try to prove anything. That's exactly the wrong conclusion to draw.

Although it's fun to blame Google, historians doing keyword searches in ProQuest or JSTOR or Ancestry.com or any of the databases that have become indispensable to research can be incredibly ignorant of where the data they use comes from [@jockers_macroanalysis:_2013].

The only thing that's distinctive about the Google Books collection is that we finally have some tools that let us see what's in our collections. Everyone who uses a search engine in research is hitting against strange restrictions: there are more newspapers in Readex before 1876 because that's the cutoff date that their most important partner library, the American Antiquarian Society, set on their collections. Everyone who uses Google Books will get more medical literature before 1922, because the Harvard medical libraries are better, and more 

That means we can start the work of source criticism to know what sort of texts we're looking at.
For the first time, we're able to search for things that are missing in the record, not just things that are there. And we can also see what the shape of those libraries is, and understand the contours of how they came to be digitized.

# Logbooks

### Seeing like a state.

I've increasingly come to think, though, that the real interest for most historians in information like this should not be as evidence about shipping: it should be as part of understanding how the developing 19th-century state collected information, and how that collection shaped the categories of information that it collected. [@kinnahan_charting_2008; @schulman_geographic_2002]
 
(It's also convenient for me, since I study the history of quantification and measurement and don't know a jib from a foresail).

So let me show you just a little bit of that. I started off by showing this, which was the first chart I made of this data.

    ![Deck 701, US Maury Collection (1789-c.1865)](http://benschmidt.org/slides/images/ghostmaps/701.png)

In January, when it was a year old, it had a brief moment of internet glory in which it was widely circulated. One of the questions I frequently got about it was what a modern version, including the Suez and the Panama Canals, would look like. The answer that people want to see, I think, looks like this: Deck 892, of American shipping between 1980 and 1997.

    ![Deck 892, US shipping 1980-1997](http://benschmidt.org/slides/images/ghostmaps/892.png)

But the US Maury collection is not a transparent view of what America's maritime activity actually looked like. Just like those Google Books, it's a strange amalgamation of many different sources.

    ![ICOADS Deck 720, German weather data, 1876-1914](http://benschmidt.org/slides/images/ghostmaps/720.png)

    ![ICOADS Deck 735, Russian Research Vessel (R/V) Digitisation](http://benschmidt.org/slides/images/ghostmaps/735.png)

    ![Closeup of Deck 735. Soviet Vessels carefully avoiding the coast of South America.](http://benschmidt.org/slides/images/ghostmaps/735-closeup.png)

    ![ICOADS Deck 735, Russian Research Vessel (R/V) Digitisation](http://benschmidt.org/slides/images/ghostmaps/735.png)

    ![German Deep Drifter Data (via ISDM; originally from IfM/Univ. Kiel)](http://benschmidt.org/slides/images/ghostmaps/715.png)


These are useful charts because they show very clearly what different patterns of rigidity in state data collection look like.

## Race and skin in the census
Here's a different way that the state gathered information, though not in a way that fed up through the bureaucracy as cleanly as Maury's data: 

    ![Acushnet Crew Register, 1840--Source: Records of the U.S. Customs Service, RG 36, National Archives at Boston http://blogs.archives.gov/prologue/?p=7587](http://benschmidt.org/slides/images/Shipping/CustomsRegisters.png)

In crew registers of whaling logs, for example, there's a really extraordinary constellation of physical descriptions. But these are plain text data.

*Slide: Crewmember dataset*

This is information about race. This is in a period before 'race' was a category on the census with a number of prescribed categories. (Census forms included "color" until 1900, when they switched to "color or race"). The white-black binary that was fundamental to American society failed in New Bedford, where there were Pacific Islanders, Atlantic Islanders from the Azores, and all sorts of other individuals.

The meanings here are diverse, and represent a set of interpretations quite alien to what we might assign to them.

What does it mean to have "yellow" skin, for example?
If you look at collocations, it becomes clear. People with yellow skin usually have "black" or "wooly" hair; that last is a giveaway that for whatever reason, "yellow" was encoded to refer to people at least partly of African descent.

*Slide: Hair Skin Combinations*
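
For anyone who wants to try this kind of collocation reading on their own transcriptions, here is a minimal sketch in Python. The records, the field names `skin` and `hair`, and the helper functions are all hypothetical, invented for illustration; they are not drawn from the actual crew registers, and this is not necessarily how the slide was produced.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only;
# they are NOT taken from the actual crew registers.
crew = [
    {"skin": "yellow", "hair": "black"},
    {"skin": "yellow", "hair": "wooly"},
    {"skin": "light", "hair": "brown"},
    {"skin": "dark", "hair": "black"},
    {"skin": "light", "hair": "light"},
    {"skin": "Yellow ", "hair": "black"},
]


def hair_by_skin(records):
    """Count hair descriptors separately for each skin descriptor."""
    table = {}
    for rec in records:
        # Normalize case and stray whitespace in the free-text fields.
        skin = rec["skin"].strip().lower()
        hair = rec["hair"].strip().lower()
        table.setdefault(skin, Counter())[hair] += 1
    return table


def hair_shares(records, skin):
    """Share of each hair term among records with the given skin term."""
    counts = hair_by_skin(records).get(skin, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hair: n / total for hair, n in counts.items()}


# Print each skin term with its hair terms, most frequent first.
for skin, counts in sorted(hair_by_skin(crew).items()):
    print(skin, counts.most_common())
```

The point of tabulating rather than reading entry by entry is that the meaning of a term like "yellow" only shows up in the company it keeps.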

> Another example: the 1880 census. Sort it by age, and you get about 10 people over the age of 115. Nine of them are black; one is Native American. This is not simply a matter of record keeping; there were not birth certificates for every octogenarian Yankee who might claim to have been born before the Revolution. Rather, race and age as accounted by state agents--census takers--are categories that are easily bound together.
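
The census check described above is just a sort and a tally. Here is a rough sketch in Python, assuming each person is a row with a recorded `age` and `race`. The rows, field names, and functions are hypothetical and invented for illustration; they are not real census records and carry no evidentiary weight.

```python
from collections import Counter

# Hypothetical toy rows, invented for illustration only;
# they are NOT real 1880 census records.
people = [
    {"age": 34, "race": "white"},
    {"age": 118, "race": "black"},
    {"age": 87, "race": "white"},
    {"age": 121, "race": "black"},
    {"age": 116, "race": "native american"},
    {"age": 115, "race": "white"},
]


def sorted_by_age(rows):
    """Return the rows ordered from oldest recorded age to youngest."""
    return sorted(rows, key=lambda row: row["age"], reverse=True)


def oldest_by_race(rows, min_age=115):
    """Tally recorded race among rows whose recorded age is over min_age."""
    return Counter(row["race"] for row in rows if row["age"] > min_age)


# The oldest recorded ages come first; the tally shows who they belong to.
for row in sorted_by_age(people)[:3]:
    print(row["age"], row["race"])
print(oldest_by_race(people))
```

What comes back is a description of how census takers wrote ages down, not of how long anyone actually lived.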