Tags
- Data exploration and visualization (22)
- Featured (21)
- Digital Humanities (20)
- Building a Corpus (12)
- Evolution (11)
- Ngrams (10)
- Digital Humanities Now Editors' Choice (10)
- Changes in language over time (10)
- isms (9)
- Ships (8)
- Whaling (7)
- pca (7)
- collocation (7)
- Historical memory (5)
- Gender (5)
- This Blog (5)
- Genres (4)
- Libraries (4)
- authors (4)
- Metadata (4)
- Dunning (3)
- Comparisons (3)
- Resources (3)
- Topic Modelling (2)
- Machine Learning (2)
- TV watch (2)
- HathiTrust (2)
- History (1)
- Theory (1)
- Howells (1)
- Literature (1)
- Bookworm (1)
- Open Library (1)
- Online Databases (1)
- The Profession (1)
- search (1)
- LCC classes (1)
- capitalism (1)
I periodically write about Google Books here, so I thought I’d point out something that I’ve noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken.
I did a slightly deeper dive into data about salaries by college major while working on my new Atlantic article on the humanities crisis. As I say there, the quality of data about salaries by college major has improved dramatically in the last 8 years. I linked to others’ analysis of the ACS data rather than run my own, but I did some preliminary exploration of salary stuff that may be useful to see.
NOTE 8/23: I’ve written a more thoughtful version of this argument for the Atlantic. They’re not the same, but if you only read one piece, you should read that one.
Back in 2013, I wrote a few blog posts arguing that the media was hyperventilating about a “crisis” in the humanities, when, in fact, the long-term trends were not especially alarming. I made two claims then: 1) The biggest drop in humanities degrees relative to other degrees in the last 50 years happened between 1970 and 1985, and numbers were steady from 1985 to 2011; as a proportion of the population, humanities majors exploded. 2) The entirety of the long-term decline from 1950 to 2010 had to do with the changing majors of women, while men’s humanities interest did not change.
Historians generally acknowledge that both undergraduate and graduate methods training need to teach students how to navigate and understand online searches. See, for example, this recent article in Perspectives. Google Books is the most important online resource for full-text search; we should have some idea what’s in it.
Matthew Lincoln recently put up a Twitter bot that walks through chains of historical artwork by vector space similarity. https://twitter.com/matthewdlincoln/status/1003690836150792192.
The idea comes from a Google project looking at paths that traverse similar paintings.
This is a blog post I’ve had sitting around in some form for a few years; I wanted to post it today because:
Digging through old census data, I realized that Wikipedia has some really amazing town-level historical population data, particularly for the Northeast, thanks to one editor in particular typing up old census reports by hand. (And also for French communes, but that’s neither here nor there.) I’m working on pulling it into shape for the whole country, but this is the most interesting part.
I’ve been doing a lot of reading about population density cartography recently. With election-map cartography remaining a major issue, there’s been lots of discussion of them: and the “Joy Plot” is currently getting lots of attention.
Robert Leonard has an op-ed in the Times today that includes the following anecdote:
The Library of Congress has released MARC records that I’ll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.
One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but that “professional” historians haven’t paid a great deal of attention to it yet. That’s changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?
I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.
This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.
Like everyone else, I’ve been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.
I’m pulling this discussion out of the comments thread on Scott Enderle’s blog, because it’s fun. This is the formal statement of what will forever be known as the efficient plot hypothesis for plot arceology. Nobel prize in culturomics, here I come.
Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements of one recent place that debate has been taking place: inside computer science.
Debates in the Digital Humanities 2016 is now online, and includes my contribution, “Do Digital Humanists Need to Understand Algorithms?” (As well as a pretty snazzy cover image…) In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:
Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers’s work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers used the Fourier transform; the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper so thought I’d post my concerns here. These ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers’s original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.
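To make the shape-extraction move concrete, here is a rough sketch in R of the general operation described above: SVD on a matrix of per-book sentiment trajectories. The `sentiment_mat` matrix is a randomly generated stand-in, and this is a schematic of the approach, not the paper’s or Jockers’s actual pipeline.

```r
# Sketch: extract archetypal "story shapes" from per-book sentiment series.
# `sentiment_mat` is a hypothetical matrix, one row per book, 100 columns
# giving smoothed sentiment at each percentile of the text (random stand-in here).
set.seed(1)
sentiment_mat <- matrix(rnorm(500 * 100), nrow = 500)

centered <- scale(sentiment_mat, center = TRUE, scale = FALSE)
decomp   <- svd(centered)

# The first few right singular vectors are the dominant shared shapes; low-order
# components of smoothed curves tend to come out looking harmonic, which is part
# of the worry raised above.
matplot(decomp$v[, 1:3], type = "l", lty = 1,
        xlab = "Percent of narrative", ylab = "Component shape")
```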
I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing “center of population” calculations made on the basis of a flat earth.
I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you’ll encounter them eventually. But instead let me just get to the point: here’s a trite new category of analysis that wouldn’t be possible without distant reading techniques and that produces sometimes charmingly serendipitous results.
A heads-up for those with this blog on their RSS feeds: I’ve just posted a couple things of potential interest on one of the two other blogs (errm) I’m running on my own site.
Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.
Far and away the most interesting idea of the new government college ratings emerges toward the end of the report. It doesn’t quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, sex, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the guise of the student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early-marriage. I don’t know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they’re opening the vault just for these college ranking purposes.
Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it’s probably better to set that aside and instead point out a couple looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I’ll lay out my “constructive” solution for salvaging something out of this, which is that rather than use a three-tiered “excellent” - “adequate” - “needs improvement” system, everyone would be better served if we switched to a two-tiered “Good”/“Needs Improvement” system. Since this is sort of long, I’ll break it up into three posts: the first is below.
Sometimes it takes time to make a data visualization, and sometimes they just fall out of the data practically by accident. Probably the most viewed thing I’ve ever made, of shipping lines as spaghetti strings, is one of the latter. I’m working to build one of the former for my talk at the American Historical Association out of the Newberry Library’s remarkable Atlas of Historical County Boundaries. But my second ggplot with the set, which I originally did just to make sure the shapefiles were working, was actually interesting. So I thought I’d post it. Here’s the graphic: then the explanation. Click to enlarge.
Note: a somewhat more complete and slightly less colloquial, but eminently more citeable, version of this work is in the [Proceedings of the 2015 IEEE International Conference on Big Data](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7363937&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabsall.jsp%3Farnumber%3D7363937). Plus, it was only there that I came around to calling the whole endeavor “plot arceology.”
It’s interesting to look, as I did at my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.
The most interesting element of the Bookworm browser for movies I wrote about in my last post here is the possibility to delve into the episodic structure of different TV shows by dividing them up by minutes. On my website, I previously wrote about story structures in the Simpsons and a topic model of movies I made using the general-purpose bookworm topic modeling extension. For a description of the corpus or of topic modeling, see those links.
Here’s a very fun, and for some purposes, perhaps, a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, together encompassing over 600 million words. (Go follow that link if you want to investigate yourself).
An FYI, mostly for people following this feed on RSS: I just put up on my home web site a post about applications for the Simpsons Bookworm browser I made. It touches on a bunch of stuff that would usually lead me to post it here. (Really, it hits the Sapping Attention trifecta: a discussion of the best ways of visualizing Dunning log-likelihood, cryptic allusions to critical theory, and overly serious discussions of popular TV shows.) But it’s even less proofread and edited than what I usually put here, and I’ve lately been more and more reluctant to post things on a Google site like this, particularly as Blogger gets folded more and more into Google Plus. That’s one of the big reasons I don’t post here as much as I used to, honestly. (Another is that I don’t want to worry about embedded javascript). So, head over there if you want to read it.
Right now people in data visualization tend to be interested in their field’s history, and people in digital humanities tend to be fascinated by data visualization. Doing some research in the National Archives in Washington this summer, I came across an early set of rules for graphic presentation by the Bureau of the Census from February 1915. Given those interests, I thought I’d put that list online.
People love to talk about how “practical” different college majors are: and practicality is usually measured in dollars. But those measurements can be very problematic, in ways that might have bad implications for higher education. That’s what this post is about.
Here’s a little irony I’ve been meaning to post. Large scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a “book” is in ways that makes the selection criteria for Ngrams—if it made it into print, it must have _some_ significance—completely meaningless.
A map I put up a year and a half ago went viral this winter; it shows the paths taken by ships in the US Maury collection of the ICOADS database. I’ve had several requests for higher-quality versions: I had some up already, but I just put up on Flickr a basically comparable high resolution version. US Maury is “Deck 701” in the ICOADS collection: I also put up charts for all of the other decks with fewer than 3,000,000 points. You can page through them below, or download the high quality versions from Flickr directly. (At the time of posting, you have to click on the three dots to get through to the summaries).
OK: one last post about enrollments, since the statistic that humanities degrees have dropped by half since 1970 has been all over the news for the last two weeks. This is going to be a bit of a data dump: but there’s a shortage of data on the topic out there, so forgive me.
A quick addendum to my post on long-term enrollment trends in the humanities. (This topic seems to have legs, and I have lots more numbers sitting around I find useful, but they’ve got to wait for now).
There was an article in the Wall Street Journal about low enrollments in the humanities yesterday. The heart of the story is that the humanities resemble the late Roman Empire, teetering on a collapse precipitated by their inability to get jobs like those computer scientists can provide. (Never mind that the news hook is a Harvard report about declining enrollments in the humanities, which makes pretty clear that the real problem is students who are drawn to social sciences, not competition from computer scientists.)
What are the major turning points in history? One way to think about that is to simply look at the most frequent dates used to start or end dissertation periods.* That gives a good sense of the general shape of time.
Here’s some inside baseball: the trends in periodization in history dissertations since the beginning of the American historical profession. A few months ago, Rob Townsend, who until recently kept everyone extremely well informed about professional trends at the American Historical Association,* sent me the list of all dissertation titles in history the American Historical Association knows about from the last 120 years. (It’s incomplete in some interesting ways, but that’s a topic for another day). It’s textual data. But sometimes the most interesting textual data to analyze quantitatively are the numbers that show up. Using a Bookworm database, I just pulled out from the titles any years mentioned: that lets us see what periods of the past historians have been the most interested in, and what sort of periods they’ve described.
The new issue of the Journal of Digital Humanities is up as of yesterday: it includes an article of mine, “Words Alone,” on the limits of topic modeling. In true JDH practice, it draws on my two previous posts on topic modeling, here and here. If you haven’t read those, the JDH article is now the place to go. (Unless you love reading prose chock full’ve contractions and typos. Then you can stay in the originals.) If you have read them, you might want to know what’s new or why I asked the JDH editors to let me push those two articles together. In the end, the changes ended up being pretty substantial.
The hardest thing about studying massive quantities of digital texts is knowing just what texts you have. This is knowledge that we haven’t been particularly good at collecting, or at valuing.
My last post had the aggregate statistics about which parts of the library have more female characters. (Relatively). But in some ways, it’s more interesting to think about the ratio of male and female pronouns in terms of authors whom we already know. So I thought I’d look for the ratios of gendered pronouns in the most-collected authors of the late 19th and early 20th centuries, to see what comes out.
Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn’t follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it to Ryan Cordell’s class at Northeastern last week, I wanted to think a little bit more about the representation of male and female subjects in late-19th century texts. Further spurs were Matt Jockers’s recent post on pronoun usage in his corpus of novels, and Jeana Jorgensen’s pointer to recent research by Kathleen Ragan suggesting that editorial and teller effects have a massive effect on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.
I’m cross-posting here a piece from my language anachronisms blog, Prochronisms.
My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.
Just a quick post to point readers of this blog to my new Atlantic article on anachronisms in Kushner/Spielberg’s Lincoln; and to direct Atlantic readers interested in more anachronisms over to my other blog, Prochronisms, which is currently churning on through the new season of Downton Abbey. (And to stick around here; my advanced market research shows you might like some of the posts about mapping historical shipping routes.)
Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although topic modeling can be very useful, there’s too little skepticism about the technique, I’m venturing to provide it (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for a much greater attention to the granular words that make up a topic model.
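For what it’s worth, the basic move being questioned here is easy to reproduce. A minimal sketch in R, assuming a hypothetical `doc_topics` table of document-level topic weights (the data below is randomly generated, just to show the shape of the operation):

```r
library(ggplot2)

# Sketch of the usual move: chart topic weights over time.
# `doc_topics` is a hypothetical table with one row per document:
# columns `year`, `topic`, and `weight` (that document's share of the topic).
set.seed(1)
doc_topics <- data.frame(
  year   = sample(1850:1922, 2000, replace = TRUE),
  topic  = sample(paste("topic", 1:5), 2000, replace = TRUE),
  weight = runif(2000)
)

# Average each topic's weight by year, then plot the trend lines.
yearly <- aggregate(weight ~ year + topic, data = doc_topics, FUN = mean)

ggplot(yearly, aes(year, weight, color = topic)) +
  geom_smooth(se = FALSE) +
  labs(y = "Mean topic weight per document")
```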
A stray idea left over from my whaling series: just how much should digital humanists be flocking to military history? Obviously the field is there a bit already: the Digital Scholarship lab at Richmond in particular has a number of interesting Civil War projects, and the Valley of the Shadow is one of the archetypal digital history projects. But it’s possible someone could get a lot of mileage out of doing a lot more.
[Temporary note, March 2015: those arriving from reddit may also be interested in this post, which has a bit more about the specific image and a few more like it.]
Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.
Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.
Note: this post is part 4 of my series on whaling logs and digital history. For the full overview, click here.
Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.
Here’s a special post from the archives of my ‘too-boring for prime time’ files. I wrote this a few months ago but didn’t know if anyone needed it: but now I’ll pull it out just for Scott Weingart, since I saw him estimating word counts using ‘the,’ which is exactly what this post is about. If that sounds boring to you: for heaven’s sake, don’t read any further.
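The trick itself fits in a few lines. A sketch with invented numbers: since “the” makes up a roughly stable share of English running text (the 5–6% figure below is a rule-of-thumb assumption, not a measured constant), counts of “the” alone give a serviceable estimate of total word counts.

```r
# The estimation trick, with invented numbers: "the" makes up a roughly stable
# share of English running text (assumed here), so a count of "the" alone
# gives a rough total word count.
the_share <- 0.057     # assumed share of all tokens that are "the"
the_count <- 182000    # hypothetical count of "the" in some slice of a corpus

round(the_count / the_share)   # roughly 3.2 million words, give or take
```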
Note: this post is part III of my series on whaling logs and digital history. For the full overview, click here.
Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.
I’ve now seen a paragraph about advertising in Jill Lepore’s latest New Yorker piece in a few places, including Andrew Sullivan’s blog. Digital history blogging should resume soon, but first some advertising history, since something weird is going on here:
I’ve been thinking more than usual lately about spatially representing the data in the various Bookworm browsers.
A follow up on my post from yesterday about whether there’s more history published in times of revolution. I was saying that I thought the dataset Google uses must be counting documents of historical importance as history: because libraries tend to shelve in a way that conflates things that are about history and things that are history.
A quick post about other people’s data, when I should be getting mine in order:
It’s pretty obvious that one of the many problems in studying history by relying on the print record is that writers of books are disproportionately male.
We just rolled out a new version of Bookworm (now going under the name “Bookworm Open Library”) that works on the same codebase as the ArXiv Bookworm released last month. The most noticeable changes are a cleaner and more flexible UI (mostly put together for the ArXiv by Neva Cherniavsky and Martin Camacho, and revamped by Neva to work on the OL version), coupled with some behind-the-scenes tweaks that should make it easy to add new Bookworms on other sets of texts in the future. But as a little bonus, there’s an additional metadata category in the Open Library Bookworm we’re calling “author gender.”
[The American Antiquarian Society conference in Worcester last weekend had an interesting rider on the conference invitation–they wanted 500 words from each participant on the prospects for independent research libraries. I’m posting that response here.]
I saw some historians talking on Twitter about a very nice data visualization of shipping routes in the 18th and 19th centuries on Spatial Analysis. (Which is a great blog–looking through their archives, I think I’ve seen every previous post linked from somewhere else before).
[The following is a revised version of my talk on the ‘collaboration’ panel at a conference about “Needs and Opportunities for the Research Library in the Digital Age” at the American Antiquarian Society in Worcester last week. Thanks to Paul Erickson for the invitation to attend, and everyone there for a fascinating weekend.]
[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
A quick follow-up on this issue of author gender.
I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement about most of the comments.
I wanted to try to replicate and slightly expand Ted Underwood’s recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn’t, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.
[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It’s a rhetorically appealing position–to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there’s some mystification involved–conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week–the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.
[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
Though I usually work with the Bookworm database of Open Library texts, I’ve been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there’s also a lot more that could be coming out of the Ngrams set than what I’ve seen in the last year.
Another January, another set of hand-wringing about the humanities job market. So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.
[This is not what I’ll be saying at the AHA on Sunday morning, since I’m participating in a panel discussion with Stefan Sinclair, Tim Sherrat, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I’d start with to show how much data we have, and how little things can have different meanings at big scales…]
When data exploration produces Christmas-themed charts, that’s a sign it’s time to post again. So here’s a chart and a problem.
Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning’s Log-likelihood, which is currently more widely used. I’m having trouble getting M-W running on large numbers of texts as quickly as I’d like, but I’d say that his basic contention–that Dunning log-likelihood is frequently not the best method–is definitely true, and there’s a lot to like about rank-ordering tests.
I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I’ve been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.
Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She’s rightly concerned that the THATcamp motto–“more hack, less yack”–promotes precisely the wrong understanding of what digital methods offer:
As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses–two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well curated, TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote–interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).
Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning’s Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.
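For readers who want the formula rather than the prose: here is a minimal R sketch of Dunning’s log-likelihood for a single word across two corpora. The counts are hypothetical, and the function is just the standard two-way G² calculation, not anything specific to Bookworm.

```r
# Minimal sketch of Dunning's log-likelihood (G2) for one word in two corpora.
# count1, count2: the word's occurrences in corpus 1 and corpus 2;
# total1, total2: total word counts of each corpus. All numbers hypothetical.
dunning_g2 <- function(count1, count2, total1, total2) {
  e1 <- total1 * (count1 + count2) / (total1 + total2)  # expected in corpus 1
  e2 <- total2 * (count1 + count2) / (total1 + total2)  # expected in corpus 2
  ll <- function(obs, expd) if (obs == 0) 0 else obs * log(obs / expd)
  2 * (ll(count1, e1) + ll(count2, e2))
}

# e.g. "railroad": 1,200 hits in 30M words vs. 300 hits in 25M words (made up).
dunning_g2(1200, 300, 30e6, 25e6)
# Bigger values mean the frequencies differ more than chance would suggest;
# the sign of (count1 / total1 - count2 / total2) says which corpus overuses it.
```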
We just launched a new website, Bookworm, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let’s start with the way it helps you find library books.
We’ve been working on making a different type of browser using the Open Library books I’ve been working with to date, and it’s raised an interesting question I want to think through here.
Hank wants me to post more, so here’s a little problem I’m working on. I think it’s a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.
I mentioned earlier I’ve been rebuilding my database; I’ve also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.
Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:
Let me get back into the blogging swing with a (too long—this is why I can’t handle Twitter, folks) reflection on an offhand comment. Don’t worry, there’s some data stuff in the pipe, maybe including some long-delayed playing with topic models.
Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn’t happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like “outside” more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.
A couple weeks ago, I wrote about how ancestry.com structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the Minnesota Population Center, on just that topic:
All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.
Let’s start with two self-evident facts about how print culture changes over time:
Shane Landrum (@cliotropic) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don’t mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.
When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn’t have publication years for works, somewhat to my surprise. (It’s remarkable how much metadata holds this sort of work back, rather than data itself). They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.
Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and that Cronon’s upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton’s, I don’t think this has much to do with DH). The whole “we are all Bill Cronon” sentiment misses what’s actually interesting. Cronon’s playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.
Back from Venice (which is plastered with posters for “Mapping the Republic of Letters,” making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off looking at a new pool for book data for a little while that I think is really interesting.
I wanted to see how well the vector space model of documents I’ve been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you’re sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab’s Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. For example, we can make all the books in LCC subclass “BF” (psychology) blue and those in “QE” (geology) red, overlaying them on the same sort of chart of the first two principal components I’ve been using for the last two posts:
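Schematically, the overlay amounts to something like this in R; the word-share matrix and LCC labels below are random stand-ins for the real data, so this is a sketch of the operation, not a reproduction of the chart.

```r
# Sketch: drop individual books onto a PCA space built from word-share vectors.
# `word_shares` (books x words, relative frequencies) and `lcc` (an LCC subclass
# per book) are random stand-ins for the real data.
set.seed(1)
word_shares <- matrix(runif(200 * 1000), nrow = 200)
lcc <- sample(c("BF", "QE"), 200, replace = TRUE)

pca <- prcomp(word_shares, center = TRUE)

plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(lcc == "BF", "blue", "red"),
     xlab = "PC1", ylab = "PC2",
     main = "Books overlaid on the first two principal components")
legend("topright", legend = c("BF (psychology)", "QE (geology)"),
       col = c("blue", "red"), pch = 1)
```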
I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here’s an improved (using all my data on the 10,000 most common words) version of that plot:
One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.
I’ve spent a lot of the last week trying to convince Princeton undergrads it’s OK to occasionally disagree with each other, even if they’re not sure they’re right. So let me make a note about one of the places I’ve felt a little bit of skepticism as I try to figure out what’s going on with the digital humanities.
Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I’m interested in for my dissertation by using the Library of Congress classifications for the books. I’m going to start with the difference between psychology and philosophy. I’ve already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.
I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Open Library has pretty good metadata. I’m using it to assemble a couple new corpuses that I hope should allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file being the best way to interact with it, it’s not always convenient). While I’m waiting for some indexes to build, that will give a good chance to figure out just what’s in these digital sources.
I’m trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I’ve been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I’ve avoided blogging the really boring stuff, but I’m going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.
In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve said before that 1922 is the year digital history ends; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I’ll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.
I’ll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.
Because of my primitive search engine, I’ve been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don’t get:
More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I’m thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I’ve been working with can help improve this sort of search. I’ll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?
To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.
How are words linked in their usage? In a way, that’s the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.
I’ve started thinking that there’s a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I’d call them:
I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.
Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple examples using months and some abstract nouns. Two of the problems I’ve had with getting useful data out of this approach are:
Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There’s just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there’s no reason to use the ngrams data rather than just downloading the original books, because:
Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I’m going to try again. This post is largely a test of whether I can explain principal components analysis to people who don’t know about it, so: correct me if you already understand PCA, and let me know what’s unclear if you don’t. (Or, it goes without saying, skip it.)
I wrote yesterday about how well the filters applied to remove some books from ngrams work for increasing the quality of year information and OCR compared to Google books.
As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you’re pretty much guaranteed an explosion of theories and methods.
Days from when I said “Google Trends for historical terms might be worse than nothing” to the release of “Google ngrams”: 12. So: we’ll get to see!
We all know that the OCR on our digital resources is pretty bad. I’ve often wondered if part of the reason Google doesn’t share its OCR is simply that it would show so much ugliness. (A common misreading, ‘tlie’ for ‘the’, gets about 4.6m results in Google books). So how bad is the Internet Archive OCR, which I’m using? I’ve started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I’d dump it on to the internet, since there doesn’t seem to be that much out there.
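One crude version of such a check, sketched in R with invented counts: treat a known misreading like “tlie” as a proxy for OCR quality and see what share of “the”s it accounts for. Since it covers only one error type, the number is a floor, not an overall accuracy estimate.

```r
# Rough check with invented counts: use a common misreading ("tlie" for "the")
# as a proxy for OCR quality in a batch of text.
the_count  <- 5200000   # occurrences of "the"
tlie_count <- 31000     # occurrences of "tlie" in the same batch

round(100 * tlie_count / (tlie_count + the_count), 2)  # % of "the"s misread as "tlie"
```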
Can historical events suppress use of words? Usage of the word ‘panic’ seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I’m pretty confident this is just an artifact of me plugging in a lot of words to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don’t have the pattern, the rebound in 1894 is too fast, etc. It’s only 1873 that really looks abnormal. What do you think:
I’m interested in the ways different words are tied together. That’s sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for “scientific method,” but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I’m going to think through this staying on “capitalist” as the word of the day. Fair warning: this post is a rambler.
A commenter asked about why I don’t improve the metadata instead of doing this clustering stuff, which seems to just poorly reproduce the work of generations of librarians in classifying books. I’d like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I’m going to think through what I know, but I’d love any advice on this, because it’s really outside my expertise.
Maybe this is just Patricia Cohen’s take, but it’s interesting to note that she casts both of the text mining projects she’s put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I’ve seen the phrase ‘public intellectuals’ more times in the four days I’ve been on Twitter than in the month before). One of the things that the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.
Dan asks for some numbers on “capitalism” and “capitalist” similar to the ones on “Darwinism” and “Darwinist” I ran for Hank earlier. That seems like a nice big question I can use to get some basic methods to warm up the new database I set up this week and to get some basic functionality written into it.
This verges on unreflective datadumping: but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple extra words where it seems interesting—including some alternate word-forms that tell a story, using a perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends–thank you, Sonny Bono.) In some cases, (that 1874 peak for science), the American and British trends are surprisingly close. Sometimes, they aren’t.
Patricia Cohen’s new article about the digital humanities doesn’t come with the rafts of crotchety comments the first one did, so unlike last time I’m not in a defensive crouch. To the contrary: I’m thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I’ll post my versions of the charts the Times published.
Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher’s control. I’ve noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.
Dan Cohen, the hub of all things digital history, in the news and on his blog.
I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I’ll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying “attention”, but there are all sorts of things we can do now. I’m open to suggestions, but here are some random examples:
So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting–how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the civil war sesquicentennial now, or the Darwin/Lincoln year last year.
I was starting to write about the implicit model of historical change behind loess curves, which I’ll probably post soon, when I started to think some more about a great counterexample to the gradual change I’m looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.
Jamie’s been asking for some thoughts on what it takes to do this–statistics backgrounds, etc. I should say that I’m doing this, for the most part, the hard way, because 1) My database is too large to start out using most tools I know of, including I think the R text-mining package, and 2) I want to understand how it works better. I don’t think I’m going to do the software review thing here, but there are what look like a _lot_ of promising leads at an American Studies blog.
I’ve had “digital humanities” in the blog’s subtitle for a while, but it’s a terribly offputting term. I guess it’s supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn’t appeal to most humanists of the tweed- and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.
Jamie asked about assignments for students using digital sources. It’s a difficult question.
Most intensive text analysis is done on heavily maintained sources. I’m using a mess, by contrast, but a much larger one. Partly, I’m doing this tendentiously–I think it’s important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it’s possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven’t seen it done much with words, though, but it works fairly well. I thought it might help answer Hank’s question about the difference between evolutionism and darwinism, but, as you’ll see, that distinction seems to be a little too fine for now.
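In R, the whole move is about two lines once you have a distance matrix. A sketch, with a small invented `ism_dist` standing in for the real co-occurrence-based distances:

```r
# Sketch: a dendrogram ("family tree") of isms from a pairwise distance matrix.
# `ism_dist` is a hypothetical symmetric matrix of distances between words,
# standing in for the co-occurrence distances discussed in this series.
set.seed(1)
isms <- c("darwinism", "evolutionism", "socialism", "capitalism", "idealism")
m <- matrix(runif(25), 5, 5, dimnames = list(isms, isms))
ism_dist <- (m + t(m)) / 2   # make the random stand-in symmetric
diag(ism_dist) <- 0

clustering <- hclust(as.dist(ism_dist), method = "average")
plot(clustering, main = "Isms clustered by co-occurrence distance")
```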
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between “Darwinism” and evolutionism, so:
Now to the final term in my sentence from earlier—“How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about how much more often. I thought this was more complicated than it is for a while, so this post will be short and not very important.
This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn’t come for a while, so let me remind you of the payoff (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
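The independence baseline is simple arithmetic, so a tiny sketch may help (all counts below are invented): if the two words had nothing to do with each other, the expected number of books containing both is just the total number of books times the product of each word’s share of books.

```r
# Independence baseline, with invented counts: how many books should contain
# both words if they had nothing to do with each other?
expected_cooccurrence <- function(n_a, n_b, n_total) {
  n_total * (n_a / n_total) * (n_b / n_total)
}

# "darwinism" in 800 books, "evolution" in 4,000 books, out of 25,000 total.
expected <- expected_cooccurrence(800, 4000, 25000)
observed <- 430   # invented observed count of books containing both
observed / expected  # a ratio well above 1 means the words travel together
```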
I’m back from Moscow, and with a lot of blog content from my 23-hour itinerary. I’m going to try to dole it out slowly, though, because a lot of it is dull and somewhat technical, and I think it’s best to intermix with other types of content. I think there are four things I can do here.
Ties between words are one of the most important things computers can tell us about language. I already looked at one way of looking at connections between words in talking about the phrase “scientific method”–the percentage of occurrences of a word that occur with another phrase. I’ve been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated–I never posted anything from Russia because I couldn’t get the whole system in order in my time here. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here’s the sentence:
I’m in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.
Hank asked for a couple of charts in the comments, so I thought I’d oblige. Since I’m starting to feel they’re better at tracking the permeation of concepts, we’ll use appearances per 1000 books as the y axis:
I’m going to keep looking at the list of isms, because a) they’re fun; and b) the methods we use on them can be used on any group of words–for example, ones that we find are highly tied to evolution. So, let’s use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?
Here’s a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I’m going to think about how to make it flashier, but it’s pretty interesting as it is. Sample below, and full chart after the break.
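A rough idea of how such a chart could be put together in ggplot2, with a handful of invented rows standing in for the real table of 414 isms and their peaks:

```r
library(ggplot2)

# Sketch of the chart described above: each -ism placed at the year its usage
# peaked, sized by how heavily it was used at that peak. `ism_peaks` is a
# five-row invented stand-in for the real table of 414 words.
ism_peaks <- data.frame(
  word      = c("darwinism", "socialism", "spiritualism", "imperialism", "pragmatism"),
  peak_year = c(1884, 1912, 1870, 1901, 1908),
  peak_rate = c(4.1, 9.3, 3.2, 6.8, 2.5)   # uses per 1,000 books at peak (invented)
)

ggplot(ism_peaks, aes(x = peak_year, y = word)) +
  geom_text(aes(label = word, size = peak_rate)) +
  scale_size_continuous(range = c(3, 8)) +
  labs(x = "Year of peak usage", y = NULL) +
  theme(legend.position = "none")
```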
It’s time for another bookkeeping post. Read below if you want to know about changes I’m making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton’s supercomputer time, and why I didn’t just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.
Henry asks in the comments whether the decline in evolutionary thought in the 1890s is the “‘Eclipse of Darwinism,’ rise or prominence of neo-Lamarckians and saltationism and kooky discussions of hereditary mechanisms?” Let’s take a look, with our new and improved data (and better charts, too, compared to earlier in the week–any suggestions on design?). First, three words very closely tied to the theory of natural selection.
All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of what some of the low-hanging fruit in text analysis are.
I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I’m just going to call them bookcounts–I hope that’s a clear enough phrase).
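The contrast is easiest to see with a toy example (all numbers invented): a word can rack up a big wordcount inside a single obsessive book, while another word with the same total is spread thinly across many books.

```r
# Toy example of the wordcount/bookcount contrast. `counts` is a hypothetical
# words-by-books table of raw occurrence counts (all numbers invented).
counts <- matrix(c(40,  0,  0,  0,    # "whaling": heavy use in a single book
                   10, 10, 10, 10),   # "telephone": the same total, spread out
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("whaling", "telephone"), paste0("book", 1:4)))

wordcounts <- rowSums(counts)        # 40 and 40: identical by raw frequency
bookcounts <- rowSums(counts > 0)    # 1 and 4: very different by spread
cbind(wordcounts, bookcounts)
```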
Obviously, I like charts. But I’ve periodically been presenting data as a number of random samples, as well. It’s a technique that can be important for digital humanities analysis. And it’s one that can draw more on the skills in humanistic training, so might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own–it’s just a set of coordinates. Even in the big education datasets I used to work with, the core facts that I was aggregating up from were generally very dull–one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there’s real meaning embodied in every point, that we’re far better equipped to understand than the computer. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can’t read everything ourselves, but it’s good to check up periodically–that’s why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.
Here’s what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster’s); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that’s too many, for reasons too technical to go into here. Suffice it to say that I’m asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though–I’ll put the only big thought I have about it in another post later tonight.
I can’t resist making a few more comments on that technologies graph that I laid out. I’m going to add a few thousand more books to the counts overnight, so I won’t make any new charts until tomorrow, but look at this one again.
I’ve rushed straight into applications here without taking much time to look at the data I’m working with. So let me take a minute to describe the set and how I’m trimming it.
A collection as large as the Internet Archive’s OCR database means I have to think through what I want well in advance of doing it. I’m only using a small subset of their 900,000 Google-scanned books, but that’s still 16 gigabytes–it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I’ll still have to figure out what sort of relations are worth looking for. So what are some?
Let’s start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he’d probably like, that ‘business’ became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.
I’m going to start using this blog to work through some issues in finding useful applications for digital history. (Interesting applications? Applications at all?)