I'm a graduate student in history at Princeton, and also the visiting graduate fellow at the Harvard Cultural Observatory.

I came into text mining from American intellectual and cultural history, which has two contrary influences that have been tearing me apart for the last few years:

On the one hand) Highly suspicious of quantification (because it simplifies experience; because it reifies problematic categories created by powerful historical actors; because it claims more rigor than it might really have).

On the other) Really fond of talking about change in language, about discursive fields of power, and about the ways that available vocabulary shapes the way we live in the world.

What's changed is that suddenly the data to apply those notions is immediately present; American intellectual history, at least, has shifted without us really noticing from a field that has no data at all to one of the most richly endowed in the academy, because so much of our digital heritage has been scanned. Not everything, of course; but I'd wager that everything someone like Richard Hofstadter read in his life we have now.

2 possible agendas:

1) Make discovery easier. I love saying that text mining is already endemic in History; it's called search, and though humanists use it every day they don't have any idea how it works. One thing I think everyone here has made some great contributions to is making search work for historical ideals, and not for marketing purposes, which is how something like Google Books works.

At the Cultural Observatory, Martin Camacho and I made a website called Bookworm that takes everything that's so fun about Google Ngrams, which also came from our group, and adds in the ability to subset by all the metadata librarians have been collecting for years, to transparently see the sources creating the graph, and even to go edit them. (That's because we use fully public domain data from the Open Library project; like Google Books, it's primarily items from the big university libraries, but it has fully open data and metadata for before 1922. Finding large textual sources is actually much easier than finding ones unencumbered by licensing restrictions; that may be something worth talking about more.)


2) Much more controversial, and I'm fascinated to see if it can have a place in history going forward. That's to use statistical data and aggregate textual results as historical evidence in themselves, and not just as a means of making arguments from traditional sources.

All this suspicion of quantification we learn makes me nervous even to say this. The basic idea is that for over a hundred years, hundreds of thousands of authors have been writing books; you can learn a lot about someone's communities of knowledge by what they talk about, about their intellectual world by what they say about it, and even about their social world by what language choices they make. Language changes over time, sometimes steadily, sometimes suddenly. When you look at the data, the sheer number of things that show strong historical shifts is apparent.

Just to give 3 quick examples of what this looks like when we deal with really big sets of data:

1) There's steady drift, as with the patterns where one word displaces another. I'm not even sure this is historical change.

2) There's discipline-driven change; to take one example from my dissertation, the verbs used to describe what attention does change dramatically over the last 200 years, and you can see how writers think the mind works from the metaphors they use. For example, the construction 'focusing attention' doesn't occur at all until around 1895; and using library metadata, it's possible to quickly pin down the origins of that new discourse in two fields: psychology, which everyone knows to be one of the big drivers, and pedagogy (Library of Congress class LB), which I argue is a field that has had more influence than we realize, since it's not as personality-driven as psychology.

3) And more suggestively, there's geographic data. One example I was just looking at this week is the spelling of Pittsburgh; in 1911, the federal government officially restored the 'h' at the end of the city's name, which it had dropped about twenty years earlier, and within ten years most books used that spelling. If you look at the individual states, there are striking patterns; Washington DC, with all the federal publications, changes the most strongly; and the farther you get from Washington and from Pittsburgh, the later that usage shifts. As an anecdote, that suggests something really interesting about the transmission of information and of practices like spelling across geographic distance, and about the comparative reach and influence of federal power. (Utah?)
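
For anyone who wants to see what that kind of comparison involves, here is a minimal sketch in Python. It is not Bookworm's code: the counts are invented for illustration, and the names COUNTS, spelling_share and first_majority_year are hypothetical. It just tabulates, for each place and year, what share of the mentions use one spelling, and then finds the first year that spelling is in the majority.

```python
from collections import defaultdict

# Hypothetical toy counts: (place, year, spelling, occurrences).
# These numbers are invented for illustration; they are not Bookworm data.
COUNTS = [
    ("DC", 1910, "Pittsburg", 90), ("DC", 1910, "Pittsburgh", 10),
    ("DC", 1912, "Pittsburg", 20), ("DC", 1912, "Pittsburgh", 80),
    ("CA", 1910, "Pittsburg", 50), ("CA", 1910, "Pittsburgh", 5),
    ("CA", 1912, "Pittsburg", 40), ("CA", 1912, "Pittsburgh", 20),
    ("CA", 1918, "Pittsburg", 10), ("CA", 1918, "Pittsburgh", 30),
]


def spelling_share(counts, variant):
    """Return {(place, year): fraction of occurrences using `variant`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for place, year, spelling, n in counts:
        totals[(place, year)] += n
        if spelling == variant:
            hits[(place, year)] += n
    return {key: hits[key] / totals[key] for key in totals if totals[key] > 0}


def first_majority_year(shares, place):
    """Return the earliest year in which the variant exceeds half of all
    occurrences for `place`, or None if it never does."""
    years = sorted(
        year for (p, year), share in shares.items()
        if p == place and share > 0.5
    )
    return years[0] if years else None


shares = spelling_share(COUNTS, "Pittsburgh")
for place in ("DC", "CA"):
    print(place, first_majority_year(shares, place))
```

Comparing the year each place crosses the halfway mark is only one crude way to say "earlier" or "later"; with real data you would want to look at the whole curve, and at how many books sit behind each point.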

Since it's Sunday morning, let me just preach for one second.

I wrote my senior thesis in college about an episode in the life of Theodor Adorno. We rightly think of Adorno as one of the fiercest critics of technocratic reasoning and scientific determinism in the canon of critical theory. I wrote about how he came to the United States in 1937 to work with the quantitative sociologist Paul Lazarsfeld and the research director of CBS (soon to be its president), Frank Stanton. While he was there they created the Lazarsfeld-Stanton program analyzer; Adorno, with good grounds, excoriated it; the collaboration essentially dissolved; it didn't work, and he moved off to LA to write the Dialectic of Enlightenment. But then he got right back together with the quantitative sociologists, who were using factor analysis on survey results for the project that became the Authoritarian Personality, a work that was very influential, although it hasn't necessarily held up. I guess you could conclude from this that even Adorno couldn't do good quantitative cultural studies, but I'd say the opposite: that if we're not at least trying to incorporate all of the evidence and all of the methods at our disposal into understanding patterns of power and structures in the past, we are selling short our intellectual heritage, we're not living up to the ideals

If we