You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.
Feb 10 2019

I periodically write about Google Books here, so I thought I'd point out something I've noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted, to the point of being essentially broken.

Aug 23 2018

I did a slightly deeper dive into data about salaries by college major while working on my new Atlantic article on the humanities crisis. As I say there, the quality of data about salaries by college major has improved dramatically in the last 8 years. I linked to others' analyses of the ACS data rather than run my own, but I did some preliminary exploration of salary data that may be useful to see.

Jul 27 2018

NOTE 8/23: I've written a more thoughtful version of this argument for the Atlantic. They're not the same, but if you only read one piece, you should read that one.

Back in 2013, I wrote a few blog posts arguing that the media was hyperventilating about a crisis in the humanities when, in fact, the long-term trends were not especially alarming. I made two claims then: 1) The biggest drop in humanities degrees relative to other degrees in the last 50 years happened between 1970 and 1985, and numbers were steady from 1985 to 2011; as a proportion of the population, humanities majors exploded. 2) The entirety of the long-term decline from 1950 to 2010 had to do with the changing majors of women, while men's interest in the humanities did not change.

Jul 10 2018

Historians generally acknowledge that both undergraduate and graduate methods training need to teach students how to navigate and understand online searches. See, for example, this recent article in Perspectives. Google Books is the most important online resource for full-text search; we should have some idea what's in it.

Jun 13 2018

Matthew Lincoln recently put up a Twitter bot that walks through chains of historical artwork by vector-space similarity: https://twitter.com/matthewdlincoln/status/1003690836150792192.
The idea comes from a Google project looking at paths that traverse similar paintings.

Sep 15 2017

This is a blog post I've had sitting around in some form for a few years; I wanted to post it today because:

Jul 24 2017

Digging through old census data, I realized that Wikipedia has some really amazing town-level historical population data, particularly for the Northeast, thanks to one editor in particular typing up old census reports by hand. (And also for French communes, but that's neither here nor there.) I'm working on pulling it into shape for the whole country, but this is the most interesting part.

Jul 11 2017

I've been doing a lot of reading about population-density cartography recently. With election-map cartography remaining a major issue, there's been lots of discussion of such maps; the Joy Plot, in particular, is currently getting lots of attention.

Jul 05 2017

Robert Leonard has an op-ed in the Times today that includes the following anecdote:

May 16 2017

The Library of Congress has released MARC records that I'll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted simply to look at the history of how the Library digitized its card catalogs to begin with.

Apr 14 2017

One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but professional historians haven't paid a great deal of attention to it yet. That's changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?

Dec 23 2016

I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.
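One core methodological wrinkle with diachronic embeddings is that two separately trained word2vec models live in arbitrarily rotated spaces, so vectors can't be compared across models until the spaces are aligned, typically with an orthogonal Procrustes rotation. Here is a minimal numpy sketch; the matrices below are random stand-ins for two models' vectors over a shared, identically ordered vocabulary, not any particular trained model:

```python
import numpy as np

def align_embeddings(base, other):
    """Orthogonal Procrustes: rotate `other` into the space of `base`.

    Both arguments are (vocab_size x dims) arrays whose rows correspond
    to the same words in the same order.
    """
    # SVD of the cross-covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation

# Toy check: a rotated copy of a space aligns back almost exactly.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 20))
theta = np.eye(20)
theta[:2, :2] = [[0, -1], [1, 0]]   # rotate the first two dimensions
other = base @ theta
aligned = align_embeddings(base, other)
print(np.allclose(aligned, base, atol=1e-6))
```

Because only a rotation is applied, within-model distances are untouched; the alignment just makes cross-model comparisons meaningful.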

Dec 20 2016

This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.

Dec 01 2016

Like everyone else, I've been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.

Sep 09 2016

I'm pulling this discussion out of the comments thread on Scott Enderle's blog, because it's fun. This is the formal statement of what will forever be known as the "efficient plot hypothesis" for plot arceology. Nobel prize in culturomics, here I come.

Aug 29 2016

Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements of one recent venue where that debate has been taking place: inside computer science.

Jul 20 2016

Debates in the Digital Humanities 2016 is now online, and includes my contribution, "Do Digital Humanists Need to Understand Algorithms?" (as well as a pretty snazzy cover image). In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:

Jul 18 2016

Some scientists came up with a list of the six core story types. On the surface, this is extremely similar to Matt Jockers's work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers used the Fourier transform; the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper, so I thought I'd post my concerns here. These ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers's original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots, and some bigger requests for others working in the field.
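For readers wondering what "extracting ur-shapes with SVD" means mechanically: you stack each book's sentiment trajectory as a row of a matrix and take the right singular vectors, which become the basis curves every plot is expressed as a weighted sum of. A toy sketch with fabricated trajectories (nothing here comes from the paper; the "corpus" is synthetic sine/cosine mixtures, which also shows why harmonic shapes fall out so readily):

```python
import numpy as np

# Toy stand-in for per-book sentiment trajectories: each row is one
# book's sentiment, sampled at 50 evenly spaced points in the text.
rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 50)
# Mix a "rise-fall" and a "double-dip" shape plus noise, as a fake corpus.
books = np.array(
    [a * np.sin(t) + b * np.cos(2 * t) + rng.normal(scale=0.1, size=50)
     for a, b in rng.normal(size=(200, 2))]
)

# SVD of the (books x time) matrix: the right singular vectors are the
# "fundamental shapes" every trajectory is approximated as a sum of.
u, s, vt = np.linalg.svd(books - books.mean(axis=0), full_matrices=False)
top_shapes = vt[:2]          # the two dominant arcs
print(top_shapes.shape)      # (2, 50)
```

The opacity complaint is visible here: the method will always hand back smooth orthogonal curves, whatever the input, so "finding" harmonic arcs is partly an artifact of the decomposition.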

Jul 05 2016

I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing center of population calculations made on the basis of a flat earth.

May 30 2016

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you'll encounter them eventually. But instead let me just get to the point: here's a trite new category of analysis that wouldn't be possible without distant reading techniques, and that produces sometimes charmingly serendipitous results.

Nov 03 2015

A heads-up for those with this blog on their RSS feeds: I've just posted a couple things of potential interest on one of the two other blogs (errm) I'm running on my own site.

Jan 19 2015

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language of the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there, including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz, who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

Dec 30 2014

Far and away the most interesting idea in the new government college ratings emerges toward the end of the report. It doesn't quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, sex, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the form of student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early marriage. I don't know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they're opening the vault just for these college ranking purposes.

Dec 30 2014

Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it's probably better to set that aside and instead point out a couple of looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I'll lay out my constructive solution for salvaging something out of this, which is that rather than use a three-tiered excellent/adequate/needs-improvement scale, everyone would be better served if we switched to a two-tiered Good/Needs Improvement system. Since this is sort of long, I'll break it up into three posts: the first is below.

Dec 18 2014

Sometimes it takes time to make a data visualization, and sometimes they just fall out of the data practically by accident. Probably the most viewed thing I've ever made, of shipping lines as spaghetti strings, is one of the latter. I'm working to build one of the former for my talk at the American Historical Association out of the Newberry Library's remarkable Atlas of Historical County Boundaries. But my second ggplot with the set, which I originally did just to make sure the shapefiles were working, was actually interesting. So I thought I'd post it. Here's the graphic, then the explanation. Click to enlarge.

Dec 16 2014

Note: a somewhat more complete and slightly less colloquial, but eminently more citable, version of this work is in the [Proceedings of the 2015 IEEE International Conference on Big Data](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7363937&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabsall.jsp%3Farnumber%3D7363937). Plus, it was only there that I came around to calling the whole endeavor plot arceology.
It's interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.
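That front-or-end bias of individual words can be measured very simply: average each word's relative position across a script. A minimal sketch, with a four-"minute" script invented purely for illustration:

```python
from collections import defaultdict

# Toy "script": each entry is one minute of dialogue. We measure where in
# the episode each word tends to fall (0 = opening, 1 = final minute).
minutes = [
    "bart school school chalkboard",
    "homer work donut",
    "homer crisis",
    "family hug lesson lesson",
]

positions = defaultdict(list)
for i, minute in enumerate(minutes):
    for word in minute.split():
        positions[word].append(i / (len(minutes) - 1))

# A word's mean relative position reveals its structural bias.
mean_position = {w: sum(p) / len(p) for w, p in positions.items()}
print(mean_position["school"], mean_position["lesson"])  # 0.0 1.0
```

At corpus scale the same statistic, computed over thousands of scripts, separates opening-scene vocabulary from resolution vocabulary.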

Dec 11 2014

The most interesting element of the Bookworm browser for movies I wrote about in my last post is the possibility of delving into the episodic structure of different TV shows by dividing them up by minute. On my website, I previously wrote about story structures in the Simpsons and a topic model of movies I made using the general-purpose Bookworm topic modeling extension. For a description of the corpus or of topic modeling, see those links.

Sep 15 2014

Here's a very fun, and for some purposes, perhaps, a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, encompassing together over 600 million words. (Go follow that link if you want to investigate yourself.)

Sep 11 2014

An FYI, mostly for people following this feed on RSS: I just put up on my home web site a post about applications for the Simpsons Bookworm browser I made. It touches on a bunch of stuff that would usually lead me to post it here. (Really, it hits the Sapping Attention trifecta: a discussion of the best ways of visualizing Dunning log-likelihood; cryptic allusions to critical theory; and overly serious discussions of popular TV shows.) But it's even less proofread and edited than what I usually put here, and I've lately been more and more reluctant to post things on a Google site like this, particularly as Blogger gets folded more and more into Google Plus. That's one of the big reasons I don't post here as much as I used to, honestly. (Another is that I don't want to worry about embedded JavaScript.) So, head over there if you want to read it.

Aug 13 2014

Right now people in data visualization tend to be interested in their field's history, and people in digital humanities tend to be fascinated by data visualization. Doing some research in the National Archives in Washington this summer, I came across an early set of rules for graphic presentation by the Bureau of the Census from February 1915. Given those interests, I thought I'd put that list online.

May 23 2014

People love to talk about how practical different college majors are, and practicality is usually measured in dollars. But those measurements can be very problematic, in ways that might have bad implications for higher education. That's what this post is about.

Apr 03 2014

Here's a little irony I've been meaning to post. Large-scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a book is in ways that make the selection criteria for Ngrams (if it made it into print, it must have _some_ significance) completely meaningless.

Mar 31 2014

A map I put up a year and a half ago went viral this winter; it shows the paths taken by ships in the US Maury collection of the ICOADS database. I've had several requests for higher-quality versions: I had some up already, but I just put up on Flickr a basically comparable high-resolution version. US Maury is Deck 701 in the ICOADS collection; I also put up charts for all of the other decks with fewer than 3,000,000 points. You can page through them below, or download the high-quality versions from Flickr directly. (At the time of posting, you have to click on the three dots to get through to the summaries.)

Jun 27 2013

OK: one last post about enrollments, since the statistic that humanities degrees have dropped in half since 1970 has been all over the news the last two weeks. This is going to be a bit of a data dump: but there's a shortage of data on the topic out there, so forgive me.

Jun 26 2013

A quick addendum to my post on long-term enrollment trends in the humanities. (This topic seems to have legs, and I have lots more numbers sitting around I find useful, but they've got to wait for now.)

Jun 07 2013

There was an article in the Wall Street Journal yesterday about low enrollments in the humanities. The heart of the story is that the humanities resemble the late Roman Empire, teetering on a collapse precipitated by their inability to provide jobs like those computer science can. (Never mind that the news hook is a Harvard report about declining enrollments in the humanities, which makes pretty clear that the real problem is students who are drawn to social sciences, not competition from computer scientists.)

May 24 2013

What are the major turning points in history? One way to think about that is to simply look at the most frequent dates used to start or end dissertation periods.* That gives a good sense of the general shape of time.

May 09 2013

Here's some inside baseball: the trends in periodization in history dissertations since the beginning of the American historical profession. A few months ago, Rob Townsend, who until recently kept everyone extremely well informed about professional trends at the American Historical Association,* sent me the list of all dissertation titles in history the American Historical Association knows about from the last 120 years. (It's incomplete in some interesting ways, but that's a topic for another day.) It's textual data. But sometimes the most interesting textual data to analyze quantitatively are the numbers that show up. Using a Bookworm database, I just pulled out from the titles any years mentioned: that lets us see what periods of the past historians have been most interested in, and what sorts of periods they've described.
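The extraction step is simple enough to sketch: match four-digit years in each title with a regular expression and tally them. The titles below are invented examples, and the real Bookworm pipeline does considerably more, but the core move is just:

```python
import re
from collections import Counter

# Hypothetical titles standing in for the AHA dissertation list.
titles = [
    "Reconstruction and Reaction in Louisiana, 1865-1877",
    "The Port of Boston, 1783-1815",
    "Labor Politics in Chicago, 1919-1939",
    "Merchants of New York, 1783-1812",
]

# Four-digit years between 1400 and 2099, a plausible range for titles.
year_pattern = re.compile(r"\b(1[4-9]\d\d|20\d\d)\b")

counts = Counter(
    int(y) for title in titles for y in year_pattern.findall(title)
)
print(counts[1783])  # start dates like 1783 recur across titles
```

Tallying start years and end years separately (the first vs. second year of each pair) is what surfaces the profession's favorite turning points.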

Apr 12 2013

The new issue of the Journal of Digital Humanities is up as of yesterday: it includes an article of mine, "Words Alone," on the limits of topic modeling. In true JDH practice, it draws on my two previous posts on topic modeling, here and here. If you haven't read those, the JDH article is now the place to go. (Unless you love reading prose chock full of contractions and typos. Then you can stay in the originals.) If you have read them, you might want to know what's new, or why I asked the JDH editors to let me push those two articles together. In the end, the changes ended up being pretty substantial.

Mar 29 2013

The hardest thing about studying massive quantities of digital texts is knowing just what texts you have. This is knowledge that we haven't been particularly good at collecting, or at valuing.

Feb 28 2013

My last post had the aggregate statistics about which parts of the library have more female characters (relatively). But in some ways, it's more interesting to think about the ratio of male and female pronouns in terms of authors whom we already know. So I thought I'd look at the ratios of gendered pronouns in the most-collected authors of the late nineteenth and early twentieth centuries, to see what comes out.

Feb 25 2013

Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn't follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it to Ryan Cordell's class at Northeastern last week, I wanted to think a little bit more about the representation of male and female subjects in late-19th-century texts. Further spurs: Matt Jockers recently posted the pronoun usage in his corpus of novels, and Jeana Jorgensen pointed to recent research by Kathleen Ragan suggesting that editorial and teller effects have a massive influence on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.
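The underlying measurement is nothing fancy: count gendered pronouns per text and compare. A minimal sketch (the pronoun lists and the sample sentence are my own illustration, not the Bookworm implementation):

```python
import re
from collections import Counter

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_counts(text):
    """Return (male, female) gendered-pronoun counts for a text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    male = sum(words[w] for w in MALE)
    female = sum(words[w] for w in FEMALE)
    return male, female

sample = "She opened her book while he read his paper; he smiled."
print(pronoun_counts(sample))  # (3, 2)
```

The caveats in the post live exactly here: "her" is both possessive and objective, "he said" in dialogue tags inflates male counts, and none of this distinguishes characters from narrators.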

Feb 14 2013

I'm cross-posting here a piece from my language-anachronisms blog, Prochronisms.

Feb 06 2013

My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.

Jan 10 2013

Just a quick post to point readers of this blog to my new Atlantic article on anachronisms in Kushner/Spielberg's Lincoln; and to direct Atlantic readers interested in more anachronisms over to my other blog, Prochronisms, which is currently churning on through the new season of Downton Abbey. (And to stick around here; my advanced market research shows you might like some of the posts about mapping historical shipping routes.)

Jan 09 2013

Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although topic modeling can be very useful, there's too little skepticism about the technique, I'm venturing to provide some (even with, I'm sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat topics in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.

Nov 15 2012

A stray idea left over from my whaling series: just how much should digital humanists be flocking to military history? Obviously the field is there a bit already: the Digital Scholarship Lab at Richmond in particular has a number of interesting Civil War projects, and the Valley of the Shadow is one of the archetypal digital history projects. But it's possible someone could get a lot of mileage out of doing a lot more.

Nov 15 2012

[Temporary note, March 2015: those arriving from reddit may also be interested in this post, which has a bit more about the specific image and a few more like it.]

Nov 14 2012

Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.

Nov 02 2012

Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.

Nov 01 2012

Note: this post is part 4 of my series on whaling logs and digital history. For the full overview, click here.

Oct 30 2012

Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.

Oct 18 2012

Here's a special post from the archives of my too-boring-for-prime-time files. I wrote this a few months ago but didn't know if anyone needed it: but now I'll pull it out just for Scott Weingart, since I saw him estimating word counts using the, which is exactly what this post is about. If that sounds boring to you: for heaven's sake, don't read any further.

Oct 18 2012

Note: this post is part III of my series on whaling logs and digital history. For the full overview, click here.

Oct 12 2012

Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.

Sep 25 2012

I've now seen a paragraph about advertising in Jill Lepore's latest New Yorker piece in a few places, including Andrew Sullivan's blog. Digital history blogging should resume soon, but first some advertising history, since something weird is going on here:

Jul 31 2012

I've been thinking more than usual lately about spatially representing the data in the various Bookworm browsers.

Jul 12 2012

A follow-up on my post from yesterday about whether there's more history published in times of revolution. I was saying that I thought the dataset Google uses must be counting documents of historical importance as history, because libraries tend to shelve in a way that conflates things that are about history and things that are history.

Jul 11 2012

A quick post about other people's data, when I should be getting mine in order:

May 08 2012

It's pretty obvious that one of the many problems in studying history by relying on the print record is that writers of books are disproportionately male.

May 07 2012

We just rolled out a new version of Bookworm (now going under the name Bookworm Open Library) that works on the same codebase as the ArXiv Bookworm released last month. The most noticeable changes are a cleaner and more flexible UI (mostly put together for the ArXiv by Neva Cherniavsky and Martin Camacho, and revamped by Neva to work on the OL version), coupled with some behind-the-scenes tweaks that should make it easy to add new Bookworms on other sets of texts in the future. But as a little bonus, there's an additional metadata category in the Open Library Bookworm we're calling "author gender."

Apr 27 2012

[The American Antiquarian Society conference in Worcester last weekend had an interesting rider on the conference invitation: they wanted 500 words from each participant on the prospects for independent research libraries. I'm posting that response here.]

Apr 10 2012

I saw some historians talking on Twitter about a very nice data visualization of shipping routes in the 18th and 19th centuries on Spatial Analysis. (Which is a great blog: looking through their archives, I think I've seen every previous post linked from somewhere else before.)

Apr 04 2012

Apr 02 2012

[The following is a revised version of my talk on the collaboration panel at a conference about Needs and Opportunities for the Research Library in the Digital Age at the American Antiquarian Society in Worcester last week. Thanks to Paul Erickson for the invitation to attend, and everyone there for a fascinating weekend.]

Mar 21 2012

[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

Mar 06 2012

A quick follow-up on this issue of author gender.

Mar 06 2012

I just saw that various digital humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement with most of the comments.

Feb 29 2012

I wanted to try to replicate and slightly expand Ted Underwood's recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn't, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.

Feb 20 2012

[Update: Ive consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

Feb 20 2012
The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It's a rhetorically appealing position: to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there's some mystification involved: conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week: the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.

Feb 13 2012

[Update: I've consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

Feb 02 2012

Though I usually work with the Bookworm database of Open Library texts, I've been playing a bit more with the Google Ngrams data sets lately, which have substantial advantages in size, quality, and time period. Largely I use them to check or search for patterns I can then analyze in detail with text-length data; but there's also a lot more that could be coming out of the Ngrams set than what I've seen in the last year.

Jan 30 2012

Another January, another set of hand-wringing about the humanities job market. So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.

Jan 05 2012

[This is not what I'll be saying at the AHA on Sunday morning, since I'm participating in a panel discussion with Stefan Sinclair, Tim Sherratt, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I'd start with to show how much data we have, and how little things can have different meanings at big scales.]

Dec 16 2011

When data exploration produces Christmas-themed charts, that's a sign it's time to post again. So here's a chart and a problem.

Nov 19 2011

Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning's log-likelihood, which is currently more widely used. I'm having trouble getting M-W running on large numbers of texts as quickly as I'd like, but I'd say that his basic contention, that Dunning log-likelihood is frequently not the best method, is definitely true, and there's a lot to like about rank-ordering tests.
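For readers who haven't met it: the Mann-Whitney statistic compares two samples purely by rank, counting how often a value from one corpus outranks a value from the other, so a handful of extreme documents can't dominate the way they can with raw counts. A minimal sketch, with invented per-document frequencies; a real implementation would use a rank-sum formula (or scipy) rather than this O(n²) loop, which is exactly the speed problem mentioned above:

```python
def mann_whitney_u(xs, ys):
    """U statistic: count of (x, y) pairs where x beats y; ties count half.

    xs, ys: one word's per-document frequency in each of two corpora.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u

# A word's per-document rate in two small hypothetical corpora:
corpus_a = [0.5, 0.7, 0.6, 0.9]
corpus_b = [0.1, 0.2, 0.4, 0.6]
print(mann_whitney_u(corpus_a, corpus_b))  # 14.5 of a possible 16
```

A U near the maximum (here, 16 possible pairs) says the word is consistently more common in the first corpus, document by document, regardless of how lopsided any single document is.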

Nov 14 2011

I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making over the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but it is getting closer.

Nov 10 2011

A few points following up my two posts on corpus comparison using Dunning log-likelihood last month. Just a bit of technique.

Nov 03 2011

Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She's rightly concerned that the THATCamp motto, "more hack, less yack," promotes precisely the wrong understanding of what digital methods offer:

Oct 07 2011

As promised, some quick thoughts broken off my post on Dunning log-likelihood. There, I looked at _big_ corpuses: two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English-department digital humanists tend to rely on small sets of well curated TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote: interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles.)

Oct 06 2011

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.
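For those who want the statistic concretely: Dunning's G2 compares observed counts in the four cells of a word-by-corpus contingency table against the counts expected if the word were used at the same rate in both corpora. A minimal sketch (the counts in the example are invented for illustration):

```python
from math import log

def dunning_g2(a, b, c, d):
    """Dunning's log-likelihood (G2) for one word.

    a, b -- the word's count in corpus 1 and corpus 2
    c, d -- total word count of corpus 1 and corpus 2
    """
    n = c + d
    # Observed and expected counts for the 2x2 contingency table:
    # (this word / all other words) x (corpus 1 / corpus 2).
    cells = [
        (a, c * (a + b) / n),
        (b, d * (a + b) / n),
        (c - a, c * (n - a - b) / n),
        (d - b, d * (n - a - b) / n),
    ]
    return 2 * sum(o * log(o / e) for o, e in cells if o)

# A word appearing 120 times in one million-word corpus but only 30 times
# in another of the same size is strongly overrepresented in the first:
print(round(dunning_g2(120, 30, 1_000_000, 1_000_000), 1))  # 57.8
```

High G2 flags words whose rates differ more than chance would allow; ranking a vocabulary by G2 is the standard way to surface the words most distinctive of one corpus against another.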

Sep 30 2011

We just launched a new website, Bookworm, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let's start with the way it helps you find library books.

Sep 05 2011

We've been working on making a different type of browser using the Open Library books I've been working with to date, and it's raised an interesting question I want to think through here.

Aug 28 2011

Hank wants me to post more, so here's a little problem I'm working on. I think it's a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.

Aug 04 2011

I mentioned earlier that I've been rebuilding my database; I've also been talking to some of the people here at Harvard about various follow-up projects to Ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which include a lot of pattern searching and public outreach, we have some interesting choices about presentation.

Jul 15 2011

Starting this month, I'm moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I'm very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:

Jun 16 2011

Let me get back into the blogging swing with a (too long; this is why I can't handle Twitter, folks) reflection on an offhand comment. Don't worry, there's some data stuff in the pipe, maybe including some long-delayed playing with topic models.

May 10 2011

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn't happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like "outside" more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

Apr 18 2011

A couple weeks ago, I wrote about how ancestry.com structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the Minnesota Population Center, on just that topic:

Apr 13 2011

All the cool kids are talking about shortcomings in digitized text databases. I don't have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests: the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins that we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.

Apr 11 2011

Let's start with two self-evident facts about how print culture changes over time:

Apr 03 2011

Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don't mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process; at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.

Apr 01 2011

When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn't have publication years for works, somewhat to my surprise. (It's remarkable how much metadata holds this sort of work back, rather than data itself.) They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.

Mar 28 2011

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and the fact that Cronon's upcoming AHA presidency will probably have the same pro-digital-history agenda as Grafton's, I don't think this has much to do with DH.) The whole "we are all Bill Cronon" sentiment misses what's actually interesting. Cronon's playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.

Mar 24 2011

Back from Venice (which is plastered with posters for Mapping the Republic of Letters, making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off by looking, for a little while, at a new pool of book data that I think is really interesting.

Mar 02 2011

url: /2011/03/what-historians-dont-know-about.html

Feb 22 2011

Here's an animation of the PCA numbers I've been exploring this last week.

Feb 20 2011

I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclass BF (psychology) blue, and use red for QE (geology), overlaying them on a chart of the first two principal components like I've been using for the last two posts:

Feb 17 2011

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:

Feb 14 2011

One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

Feb 11 2011

I've spent a lot of the last week trying to convince Princeton undergrads it's OK to occasionally disagree with each other, even if they're not sure they're right. So let me make one of my notes on one of the places I've felt a little bit of skepticism as I try to figure out what's going on with the digital humanities.

Feb 02 2011

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I'm interested in for my dissertation by using the Library of Congress classifications for the books. I'm going to start with the difference between psychology and philosophy. I've already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.

Feb 01 2011

I'm changing several things about my data, so I'm going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Jan 31 2011

Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow some better analysis than I can do now, but even the raw data is interesting. (Although, with a single 25 GB text file as the best way to interact with it, it's not always convenient.) While I'm waiting for some indexes to build, it's a good chance to figure out just what's in these digital sources.

Jan 28 2011

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

Jan 21 2011

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 "the year digital history ends" before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

Jan 20 2011

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Jan 18 2011

I'll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.

Jan 11 2011

Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:

Jan 10 2011

More access to the connections between words makes it possible to separate word use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: what was the Civil War fought about?

Jan 06 2011

To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.

Jan 05 2011

How are words linked in their usage? In a way, that's the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.

Dec 30 2010

I've started thinking that there's a useful distinction to be made between two different ways of doing historical textual analysis. First stab, I'd call them:

Dec 27 2010

I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.

Dec 26 2010

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple examples using months and some abstract nouns. Two of the problems I've had with getting useful data out of this approach are:

Dec 23 2010

Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:

Dec 23 2010

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I'm going to try again. This post is largely a test of whether I can explain principal components analysis to people who don't know about it, so: correct me if you already understand PCA, and let me know what's unclear if you don't. (Or, it goes without saying, skip it.)
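The book-as-vector idea above can be condensed into a few lines. This is a toy sketch, not the post's actual pipeline: each row is a book's counts for a handful of words (all invented), the matrix is centered, and the books are projected onto the top two principal components via SVD.

```python
# A toy sketch of PCA on book word-count vectors. All counts invented.
import numpy as np

# rows = books, columns = counts of three words; the first two books
# belong to one genre, the last two to another
counts = np.array([
    [40.0, 2.0, 1.0],
    [35.0, 3.0, 0.0],
    [2.0, 30.0, 5.0],
    [1.0, 28.0, 6.0],
])

centered = counts - counts.mean(axis=0)   # center each word column
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projection = centered @ Vt[:2].T          # (book, PC) coordinates

print(projection.shape)  # (4, 2): one 2-d point per book
```

The two genres end up on opposite sides of the first component, which is what makes the two-dimensional genre plots possible even though each book "really" lives in thousands of word dimensions.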

Dec 19 2010

I wrote yesterday about how well the filters used to remove some books from ngrams increase the quality of year information and OCR compared to Google Books.

Dec 18 2010

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.

Dec 17 2010

(First in a series on yesterday's Google/Harvard paper in Science and its reception.)

Dec 17 2010

Days from when I said Google Trends for historical terms might be worse than nothing to the release of Google ngrams: 12. So: we'll get to see!

Dec 15 2010

We all know that the OCR on our digital resources is pretty bad. I've often wondered if part of the reason Google doesn't share its OCR is simply that it would show so much ugliness. (A common misreading, "tlie" for "the," gets about 4.6m results in Google Books.) So how bad is the Internet Archive OCR, which I'm using? I've started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I'd dump it onto the internet, since there doesn't seem to be that much out there.
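A check like the one described can be as simple as counting a known misreading against its correct form. This sketch uses the "tlie"/"the" pair from above; the sample text and the ratio are invented, and a real check would of course use whole books and more than one error pattern.

```python
# A toy version of an OCR-quality check: count a known misreading
# ("tlie" for "the") against correct occurrences. Sample text invented.
import re

def tlie_rate(text):
    """Fraction of the+tlie tokens that came out as the misreading."""
    words = re.findall(r"[a-z]+", text.lower())
    tlie = words.count("tlie")
    the = words.count("the")
    return tlie / (tlie + the) if (tlie + the) else 0.0

sample = "tlie history of the war and the causes of tlie conflict the end"
print(tlie_rate(sample))  # 2 misreadings out of 5 -> 0.4
```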

Dec 14 2010

Can historical events suppress use of words? Usage of the word "panic" seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I'm pretty confident this is just an artifact of me plugging in a lot of words to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don't have the pattern, the rebound in 1894 is too fast, etc. It's only 1873 that really looks abnormal. What do you think:

Dec 13 2010

I'm interested in the ways different words are tied together. That's sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for "scientific method," but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I'm going to think through this staying on "capitalist" as the word of the day. Fair warning: this post is a rambler.

Dec 09 2010

A commenter asked why I don't improve the metadata instead of doing this clustering stuff, which seems just poorly to reproduce the work of generations of librarians in classifying books. I'd like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I'm going to think through what I know, but I'd love any advice on this, because it's really outside my expertise.

Dec 08 2010

Let me get ahead of myself a little.

Dec 07 2010

Maybe this is just Patricia Cohen's take, but it's interesting to note that she casts both of the text-mining projects she's put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I've seen the phrase "public intellectuals" more times in the four days I've been on Twitter than in the month before.) One of the things the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.

Dec 06 2010

Dan asks for some numbers on "capitalism" and "capitalist" similar to the ones on "Darwinism" and "Darwinist" I ran for Hank earlier. That seems like a nice big question I can use to warm up the new database I set up this week and to get some basic functionality written into it.

Dec 04 2010

This verges on unreflective data-dumping: but because it's easy and I think people might find it interesting, I'm going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen's charts of title word counts. I've tossed in a couple extra words where it seems interesting, including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren't many American books from before, and even the data from the '30s is a little screwy) to 1922 (the date that digital history ends; thank you, Sonny Bono). In some cases (that 1874 peak for "science"), the American and British trends are surprisingly close. Sometimes, they aren't.
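The word-stemming step can be sketched in a few lines. To be clear, this is not the Perl stemmer the post used (likely a Porter-style algorithm); it is just an invented suffix-stripper showing how alternate word-forms get folded together before counting.

```python
# A crude suffix-stripping sketch, not the Porter algorithm: it just
# illustrates folding word-forms ("votes", "voted", "voting") together.
def crude_stem(word):
    for suffix in ("ations", "ation", "ingly", "ings", "ing", "edly",
                   "ed", "es", "s", "ly"):
        # only strip if a reasonable stem (3+ letters) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("votes", "voted", "voting", "scientific"):
    print(w, "->", crude_stem(w))
```

A real stemmer applies its rules in measured steps with many special cases; this toy version is only meant to show why stemmed charts can tell a cleaner story than per-form counts.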

Dec 04 2010

Patricia Cohen's new article about the digital humanities doesn't come with the rafts of crotchety comments the first one did, so unlike last time I'm not in a defensive crouch. To the contrary: I'm thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I'll post my versions of the charts the Times published.

Dec 04 2010

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher's control. I've noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.

Dec 03 2010

Dan Cohen, the hub of all things digital history, in the news and on his blog.

Dec 03 2010

I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I'll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying "attention," but there are all sorts of things we can do now. I'm open to suggestions, but here are some random examples:

Dec 03 2010

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting: how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the Civil War sesquicentennial now, or the Darwin/Lincoln year last year.

Dec 03 2010

I was starting to write about the implicit model of historical change behind loess curves, which I'll probably post soon, when I started to think some more about a great counterexample to the gradual change I'm looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

Dec 02 2010

Jamie's been asking for some thoughts on what it takes to do this: statistics backgrounds, etc. I should say that I'm doing this, for the most part, the hard way, because 1) my database is too large to start out using most tools I know of, including, I think, the R text-mining package, and 2) I want to understand how it works better. I don't think I'm going to do the software-review thing here, but there are what look like a _lot_ of promising leads at an American Studies blog.

Dec 01 2010

I've had "digital humanities" in the blog's subtitle for a while, but it's a terribly off-putting term. I guess it's supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle, and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn't appeal to most humanists of the tweed and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining them.

Dec 01 2010

Jamie asked about assignments for students using digital sources. It's a difficult question.

Dec 01 2010

Mostly a note to myself:

Nov 28 2010

Most intensive text analysis is done on heavily maintained sources. I'm using a mess, by contrast, but a much larger one. Partly, I'm doing this tendentiously: I think it's important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.

Nov 28 2010

In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary-studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven't seen much using words, though: but it works fairly well. I thought it might help answer Hank's question about the difference between evolutionism and Darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
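The distance-to-dendrogram step is mechanical once the distances exist. Here is a minimal sketch using scipy's hierarchical clustering; the four words and the pairwise distances are invented stand-ins, not my actual measurements.

```python
# A sketch of building a word "family tree" from pairwise distances.
# Words and distances are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

words = ["darwinism", "evolutionism", "socialism", "capitalism"]
dist = np.array([   # symmetric distance matrix, zero diagonal
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
])

condensed = squareform(dist)            # flatten to the form scipy expects
tree = linkage(condensed, method="average")
# scipy.cluster.hierarchy.dendrogram(tree, labels=words) would draw it;
# the first merge joins the closest pair (darwinism/evolutionism, 0.2)
print(tree.shape)  # (3, 4): one row per merge
```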

Nov 27 2010

What can we do with this information we've gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I've gathered. Hank asked earlier in the comments about the difference between "Darwinism" and "evolutionism," so:

Nov 26 2010

Now to the final term in my sentence from earlier: "How often, compared to what we would expect, does a given word appear with any other given word?" Let's think about "how much more often." I thought this was more complicated than it is for a while, so this post will be short and not very important.

Nov 26 2010

This is the second post on ways to measure connections (or more precisely, distance) between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I'm trying to create some simple methods that will work well with historical texts to see relations between words: what words are used in similar semantic contexts, what groups of words tend to appear together. First I'll apply them to the isms, and then we'll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence "How often, compared to what we would expect, does a given word appear with any other given word?" into different components. Now let's look at the central, and maybe most important, part of the question: how often do we expect words to appear together?

Nov 25 2010

I'm back from Moscow, and with a lot of blog content from my 23-hour itinerary. I'm going to try to dole it out slowly, though, because a lot of it is dull and somewhat technical, and I think it's best to intermix it with other types of content. I think there are four things I can do here.

Nov 23 2010

Ties between words are one of the most important things computers can tell us about language. I already looked at one way of looking at connections between words in talking about the phrase "scientific method": the percentage of occurrences of a word that occur with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated that I never posted anything from Russia, because I couldn't get the whole system in order in my time here. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:

Nov 18 2010

One more note on that Grafton quote, which I'll post below.

Nov 17 2010

I'm in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.

Nov 17 2010


Nov 15 2010

Hank asked for a couple of charts in the comments, so I thought I'd oblige. Since I'm starting to feel they're better at tracking the permeation of concepts, we'll use appearances per 1,000 books as the y axis:

Nov 15 2010

I'm going to keep looking at the list of isms, because a) they're fun; and b) the methods we use on them can be used on any group of words: for example, ones that we find are highly tied to evolution. So, let's use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?

Nov 14 2010

Here's a fun way of using this dataset to convey a lot of historical information. I took all 414 words that end in "ism" in my database, and plotted them by the year in which they peaked, with the size proportional to their use at peak. I'm going to think about how to make it flashier, but it's pretty interesting as it is. Sample below, and full chart after the break.
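The bookkeeping behind a chart like that is simple: for each word, find the year of its maximum use and how big that maximum was. The yearly series below are invented stand-ins for the real counts.

```python
# A sketch of finding each word's peak year and peak size.
# The per-year counts are invented for illustration.
series = {
    "darwinism": {1860: 1, 1875: 14, 1890: 9},
    "socialism": {1860: 2, 1890: 20, 1910: 35},
    "mesmerism": {1845: 12, 1860: 6, 1875: 2},
}

# for each word, the (year, count) pair with the largest count
peaks = {word: max(counts.items(), key=lambda kv: kv[1])
         for word, counts in series.items()}

# print in peak-year order, the way the chart lays them out
for word, (year, size) in sorted(peaks.items(), key=lambda kv: kv[1][0]):
    print(f"{year}: {word} (peak use {size})")
```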

Nov 14 2010

It's time for another bookkeeping post. Read below if you want to know about changes I'm making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton's supercomputer time, and why I didn't just create a text scatter chart for "evolution" like the one I made for "scientific method." This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Nov 13 2010

Henry asks in the comments whether the decline in evolutionary thought in the 1890s is "the Eclipse of Darwinism, rise or prominence of neo-Lamarckians and saltationism and kooky discussions of hereditary mechanisms?" Let's take a look, with our new and improved data (and better charts, too, compared to earlier in the week; any suggestions on design?). First, three words very closely tied to the theory of natural selection.

Nov 12 2010

All right, let's put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term "scientific method." I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.

Nov 11 2010

I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I'm just going to call them bookcounts; I hope that's a clear enough phrase.)
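The wordcount/bookcount distinction is easy to see in code: a word used many times in a few books and a word used once each in many books can have the same wordcount but very different bookcounts. The three "books" here are invented one-line stand-ins.

```python
# A toy illustration of wordcounts vs. bookcounts. Corpus invented.
from collections import Counter

books = [
    "the telegraph and the telegraph office",
    "the railroad timetable",
    "a telegraph history of the railroad",
]

wordcounts = Counter()
bookcounts = Counter()
for text in books:
    tokens = text.split()
    wordcounts.update(tokens)        # every occurrence counts
    bookcounts.update(set(tokens))   # each word counts once per book

print(wordcounts["telegraph"], bookcounts["telegraph"])  # 3 vs 2
print(wordcounts["the"], bookcounts["the"])              # 4 vs 3
```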

Nov 10 2010

Obviously, I like charts. But I've periodically been presenting data as a number of random samples, as well. It's a technique that can be important for digital humanities analysis. And it's one that can draw more on the skills in humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own; it's just a set of coordinates. Even in the big education datasets I used to work with, the core facts that I was aggregating up from were generally very dull: one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there's real meaning embodied in every point, which we're far better equipped to understand than the computer is. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can't read everything ourselves, but it's good to check up periodically; that's why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.

Nov 10 2010

Here's what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster's); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that's too many, for reasons too technical to go into here. Suffice it to say that I'm asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though; I'll put the only big thought I have about it in another post later tonight.

Nov 09 2010

I can't resist making a few more comments on that technologies graph that I laid out. I'm going to add a few thousand more books to the counts overnight, so I won't make any new charts until tomorrow, but look at this one again.

Nov 08 2010

An anonymous correspondent says:

Nov 08 2010

I've rushed straight into applications here without taking much time to look at the data I'm working with. So let me take a minute to describe the set and how I'm trimming it.

Nov 07 2010

A collection as large as the Internet Archive's OCR database means I have to think through what I want well in advance of doing it. I'm only using a small subset of their 900,000 Google-scanned books, but that's still 16 gigabytes; it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I'll still have to figure out what sort of relations are worth looking for. So what are some?

Nov 07 2010

Let's start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he'd probably like, that "business" became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.

Nov 07 2010

I'm going to start using this blog to work through some issues in finding useful applications for digital history. (Interesting applications? Applications at all?)

Jul 15 2009