Feature Reduction on the Underwood-Sellars corpus

This is some real inside baseball; I think only two or three people will be interested in this post. But I’m hoping to get one of them to act out or criticize a quick idea. This started as a comment on Scott Enderle’s blog, but then I realized that Andrew Goldstone doesn’t have comments for the parts pertaining to him… Anyway.

Basically I’m interested in feature reduction for token-based classification tasks. Ted Underwood and Jordan Sellars’ article on the pace of change (hereafter U&S) has inspired a number of replications. They use the 3200 most-common words to classify 720 books of poetry as “high prestige” or “low prestige.”

Shortly after it was published, I made a Bookworm browser designed to visualize U&S’s core model, and asked Underwood about whether similar classification accuracy on a much smaller feature set was possible. My hope was that a smaller set of words might produce a more interpretable model. In January, Andrew Goldstone took a stab at reproducing the model: he does, but then argues that trying to read the model word by word is something of a fool’s errand:

Researchers should be very cautious about moving from good classification performance to interpreting lists of highly-weighted words. I’ve seen quite a bit of this going around, but it seems to me that it’s very easy to lose sight of how many sources of variability there are in those lists. Literary scholars love getting a lot from details, but statistical models are designed to get the overall picture right, usually by averaging away the variability in the detail.

I’m sure that Goldstone is being sage here. Unfortunately for me, he hits on this wisdom before using the lasso instead of ridge regression to greatly reduce the size of the feature set (down to 219 features at 77% success rate, if I’m reading his console output correctly), so I don’t get to see what features a smaller model selects. Scott Enderle took up Goldstone’s challenge, explained the difference between ridge regression and lasso in an elegant way, and actually improved on U&S’s classification accuracy with 400 tokens–an eightfold reduction in size.

So I’m left wondering whether there’s a better route through this mess. For me, the real appeal of feature selection on words would be that it might create models which are intuitively apprehensible for English professors. But if Goldstone is right that this shouldn’t be the goal, I’m unclear why the best classification technique would use words as features at all.

So I have two questions for Goldstone, Enderle, and anyone else interested in this topic:

  1. Is there any redeeming interpretability to the features included in a unigram model? Or is Goldstone right that we shouldn’t do this?
  2. If we don’t want model interpretability, why use tokens as features at all? In particular, wouldn’t the highest classification accuracy be found by using dimensionality reduction techniques across the *entire* set of tokens in the corpus? I’ve been using the U&S corpus to test a dimensionality reduction technique I’m currently writing up. It works about as well as U&S’s features for classification, even though it does nothing to solve the collinearity problems that Goldstone describes in his post. A good feature reduction technique for documents, like latent semantic indexing or independent components analysis, should be able to do much better, I’d think; I would guess classification accuracy over 80% with under a thousand dimensions. Shouldn’t this be the right way to handle this? Does anyone want to take a stab at it? This would be nice to have as a baseline for these sorts of abstract feature-based classification tasks.
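
For anyone who wants to play with the ridge/lasso contrast away from the corpus itself, here's a toy numpy sketch on synthetic word counts (not the U&S data; `lasso_ista` and `ridge_fit` are my own hypothetical helpers, not Goldstone's or Enderle's code). The point is just that an L1 penalty zeroes out most coefficients, yielding a genuinely smaller feature set, while an L2 penalty shrinks but keeps everything:

```python
import numpy as np

def lasso_ista(X, y, lam, steps=500):
    """L1-penalized least squares via ISTA (proximal gradient descent).
    The soft-threshold step sets many coefficients to exactly zero,
    which is what makes the lasso a feature-reduction device."""
    n, p = X.shape
    w = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n           # gradient of the squared loss
        w = w - grad / L
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0)  # soft-threshold
    return w

def ridge_fit(X, y, lam):
    """L2-penalized least squares: closed form, and no coefficient
    ever comes out exactly zero."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))                # 300 "texts", 100 "word" features
beta = np.zeros(100)
beta[:5] = 2.0                                 # only five features actually matter
y = X @ beta + rng.normal(size=300)

w_l1 = lasso_ista(X, y, lam=0.5)
w_l2 = ridge_fit(X, y, lam=0.5)
```

On this fake data the ridge fit keeps all 100 coefficients nonzero, while the lasso zeroes out nearly all of the 95 irrelevant ones: the 3200-words-down-to-a-few-hundred story in miniature.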

Buying a computer for digital humanities work

I’ve gotten a couple e-mails this week from people asking advice about what sort of computers they should buy for digital humanities research. That makes me think there aren’t enough resources online for this, so I’m posting my general advice here. (For some other solid perspectives, see here.) For keyword optimization I’m calling this post “digital humanities.” But, obviously, I really mean the subset that is humanities computing, what I tend to call humanities data analysis. Moreover, the guidelines here are specifically tailored for text analysis; if you are working with images, you’ll have somewhat different needs (in particular, you may need a better graphics card). If you do GIS, god help you. I don’t do any serious social network analysis, but I think the guidelines below should work relatively well with Gephi.


Commodius vici of recirculation: the real problem with Syuzhet

Practically everyone in Digital Humanities has been posting increasingly epistemological reflections on Matt Jockers’ Syuzhet package since Annie Swafford posted a set of critiques of its assumptions. I’ve been drafting and redrafting one myself. One of the major reasons I haven’t posted it is that the obligatory list of links keeps growing. Suffice it to say that this here is not a broad methodological disputation, but rather a single idea crystallized after reading Scott Enderle on “sine waves of sentiment.” I’ll say what this all means for the epistemology of the Digital Humanities in a different post, to the extent that that’s helpful.

Here I want to say something much more specific: that Fourier transforms are the wrong “smoothing function” (insofar as that is the appropriate term to use) to choose for plots, because they assume plot arcs are periodic functions in which the beginning must align with the end. I’m pretty sure I’m right about this, but as usual I’m relying on an intuitive understanding of the techniques under discussion here rather than a deeply mathematical one. So let me know if I’m making a total ass of myself, and I’ll withdraw my statements here.

Even before Swafford posted her critique, I felt like there was something quite wrong about using the Fourier transform as a “smoothing” mechanism. Fourier transforms, in my experience with them, are bad at dealing with humanities data, because they rely on a very precise definition of “signal.” I’ve had to use wavelets instead of the Fourier transform in the past even to extract obviously periodic data from time series, because the assumptions of regularity in the Fourier transform are so strong that some periods are simply missed.

As I was reading Enderle’s post, it occurred to me that we’ve been graphing these Fourier-transformed waves with the x axis reading 1 to 100, as if it were a closed domain. But, in fact, if plot is a sum of sine waves, that domain should actually read from 0 to 2*pi. (Or, if you’re so inclined, from 0 to tau.) The difference being that waveforms are cyclical: this is the fundamental assumption of Fourier transforms, whence come all of the ringing artifacts that Swafford usefully points out. After 100 comes 101: but 2 pi is the same as zero. This assumption is true only for novels whose last sentence is aligned to feed back into their first, a rare breed indeed. (Although ironically, given the primacy that Portrait of the Artist has played in this debate, Joyce wrote one.)

To put that graphically: this cyclicality means that syuzhet imposes an assumption that the start of a plot lines up with the end of a plot. If you generate an artificial plot that starts with sentiment “-5” and ends with sentiment “5”, it looks like this with normal smoothing methods (rolling average or loess).


Screen Shot 2015-04-03 at 11.52.25 AM



But if you try to use syuzhet’s filter, it comes up looking completely different: wavy.

Screen Shot 2015-04-03 at 11.47.38 AM


This holds true on real documents. I ran it on every state of the union address since 1960. I’ve added dashed lines to show the overall sentiment movement in the address. Blue shows loess smoothing from beginning to end, and red shows the Fourier transform. As you can see, loess allows plots to get happier or sadder: Fourier forces them to return almost to their starting place.

All the code for this is online here: you can try it on your own plots as desired.

Screen Shot 2015-04-03 at 11.55.30 AM



I can see no sound reason to do this. Plots can start sad and get happy. But if you look at Jockers’ six “fundamental plots,” all start and end in the same approximate emotional register. This, I think, is an artifact of the assumptions of periodicity built into the Fourier transform, not the underlying plots. There’s no room in this world for Vonnegut’s “From bad to worse,” or for any sort of rags to riches. It treats plot as a zero-sum game.
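
The periodicity assumption is easy to demonstrate in miniature. Here's a toy numpy sketch: I'm treating the syuzhet-style filter as "keep the lowest few Fourier components and invert," which is my reading of Enderle's description rather than a port of the package's actual R code, with an ordinary moving average for comparison.

```python
import numpy as np

def fourier_smooth(values, keep=3):
    """My approximation of a syuzhet-style filter: keep only the lowest
    few frequency components and invert the transform."""
    coefs = np.fft.rfft(values)
    coefs[keep:] = 0                    # zero out the high frequencies
    return np.fft.irfft(coefs, n=len(values))

def rolling_mean(values, window=15):
    """An ordinary moving average, for comparison: no periodicity assumed."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

ramp = np.linspace(-5, 5, 100)          # a plot that just gets happier
smoothed = fourier_smooth(ramp)
rolled = rolling_mean(ramp)
```

The moving average preserves the climb from -5 toward 5; the Fourier version drags the two endpoints back toward each other and oscillates in between.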

If I’m not misunderstanding something here, this should convince Jockers to retire the waveform assumptions in favor of something like Loess smoothing or moving averages, so digital humanists can move on to talking about something other than “ringing artifacts.” I don’t think this is devastating for the Syuzhet package as a whole: it has absolutely nothing to do with the suitability of sentiment analysis for determining plot, which is a much more interesting question others are contributing to. (I am still undecided whether I think my own method of plotting arcs through multidimensional topic spaces, which I originally came up with after misunderstanding something Jockers said to me a year ago about his idea for syuzhet, is better: I do think it adds something to the conversation.) One of the broader points my unfinished post makes is that we shouldn’t be taking failures in one component of a chain to mean the rest is unsound: that’s an oddly out-of-domain application of falsifiability.



Rate My Professor

Just some quick FAQs on my professor evaluations visualization: adding new ones to the front, so start with 1 if you want the important ones.

-3 (addition): The largest and in many ways most interesting confound on this data is the gender of the reviewer. This is not available in the set, and there is strong reason to think that men tend to have more men in their classes and women more women. A lot of this effect is solved by breaking down by discipline, where faculty and student gender breakdowns are probably similar; but even within disciplines, I think the effect exists. (Because more women teach at women’s colleges, because men teach subjects like military history that male students tend to take, etc.) Some results may be entirely due to this phenomenon (for instance, the overuse of “the” in reviews of male professors). But even if it were possible to adjust for this, it would only be partially justified. If women are reviewed differently because a different sort of student takes their courses, the fact of the difference in their evaluations remains.

-2 (addition): This has had no peer review, and I wouldn’t describe this as a “study” in anything other than the most colloquial sense of the word. (It won’t be going on my CV, for instance.) A much more rigorous study of gender bias was recently published out of NCSU. Statistical significance is a somewhat dicey proposition in this set; given that I downloaded all of the ratings I could find, almost any queries that show visual results on the charts are “true” as statements of the form “women are described as x more than men are on rateMyProfessor.com.” But given the many, many peculiarities of that web site, there’s no way to generalize from it to student evaluations as used inside universities. (Unless, God forbid, there’s a school that actually looks at RMP during T&P evaluations.) I would be pleased if it shook loose some further study by people in the field.

-1. (addition): The scores are normalized by gender and field. But some people have reasonably asked what the overall breakdown of the numbers is. Here’s a chart. The largest fields are about 750,000 reviews apiece for female English and male math professors. (Blue is female here and orange male–those are the defaults from alphabetical order, which I switched for the overall visualization). The smallest numbers on the chart, which you should trust the least, are about 25,000 reviews for female engineering and physics professors.

Screen Shot 2015-02-07 at 10.16.38 AM

0. (addition): RateMyProfessor excludes certain words from reviews: including, as far as I can tell, “bitch,” “alcoholic,” “racist,” and “sexist.” (Plus all the four letter words you might expect.) Sometimes you’ll still find those words by typing them into the chart. That’s because RMP’s filters seem to be case-sensitive, so “Sexist” sails through, while “sexist” doesn’t appear once in the database. For anything particularly toxic, check the X axis to make sure it’s used at a reasonable level. For four letter words, students occasionally type asterisks, so you can get some larger numbers by typing, for example, “sh *” instead of “shit.”

1. I’ve been holding it for a while because I’ve been planning to write up a longer analysis for somewhere, and just haven’t got around to it. Hopefully I’ll do this soon: one of the reasons I put it up is to see what other people look for.

2. The reviews were scraped from ratemyprofessor.com slowly over a couple months this spring, in accordance with their robots.txt protocol. I’m not now redistributing any of the underlying text. So unfortunately I don’t feel comfortable sharing it with anyone else in raw form.

3. Gender was auto-assigned using Lincoln Mullen’s gender package. There are plenty of mistakes–probably one in sixty people are tagged with the wrong gender because they’re a man named “Ashley,” or something.

4. 14 million is the number of reviews in the database; it probably overstates the actual number in this visualization. There are a lot of departments outside the top 20 I have here.

5. There are other ways of looking at the data other than this simple visualization: I’ve talked a little bit at conferences and elsewhere about, for example, using Dunning Log-Likelihood to pull out useful comparisons (for instance, here, of negative and positive words in history and comp. sci. reviews) without needing to brainstorm terms.

6. Topic models on this dataset using vanilla settings are remarkably uninformative.

7. People still use RateMyProfessor, though usage has dropped since its peak in 2005. Here’s a chart of reviews by month. (It’s intensely periodic around the end of the semester.)


By Month




8. This includes many different types of schools, but is particularly heavy on masters and community colleges in the most represented schools. Here’s a bar chart of the top 50 or so institutions:


top schools

The Bookworm-Mallet extension

I promised Matt Jockers I’d put together a slightly longer explanation of the weird constraints I’ve imposed on myself for topic models in the Bookworm system, like those I used to look at the breakdown of typical TV show episode structures. So here they are.

The basic strategy of Bookworm at the moment is to have a core suite of tools for combining metadata with full text for any textual corpus. In the case of the movies, the texts are each three-minute chunks of movies or TV shows; a topic model will capture the size of each individual movie. A variety of extensions allow you to port in various other algorithms into the system; so for instance, you can use the geolocation plugin to put in a latitude and longitude for a corpus which has publication places listed in it.

The Bookworm-Mallet extension handles incorporating topic models into Bookworm. The obvious way to topic model is to just feed the text straight into Mallet. This is particularly easy because the Bookworm ingest format is designed to be exactly the same as the Mallet format. But I don’t do that, partly because Bookworm has an insanely complicated (and likely to be altered) set of tokenization rules that would be a pain to re-implement in the package, and partly because we’ve *already* tokenized; why do it again?

So instead of working with the raw text, I load a stopwords list (starting with Jockers’ list of names) directly into the database, and pull out not the tokens but the internal numeric IDs used by Bookworm for each word. This has an additional salutary effect, which is that we can define from the beginning exactly the desired vocabulary size. If we want a vocab size of the most common 2^16-1 tokens in the corpus, it’s trivially easy to do it. That means that the Mallet memory requirements, which many Bookworms bump up against, can be limited. (David Mimno has used tricks like this to speed up Mallet on extremely large builds; I don’t actually know how he does it, but want to keep the options open for later.) And though I don’t yet impose a precise limit, I do drop words that appear fewer than two times from the model to save space and time.

The actual model is run on a file not of words, but of integer IDs. Here are the first ten lines of the movie dataset as I enter it into Mallet.

Each number is a code for a word; they appear not in the original order, but randomly shuffled. Wordid 883 is ‘land,’ 24841 is “Stubborn,” 3714 is “influence,” etc. This file is much shorter for being composed of integers without stopwords than it would be from the full text.
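
A toy sketch of that encoding step, in plain Python rather than against the actual Bookworm database (the function names are mine): cap the vocabulary at the most common 2^16-1 words, drop anything appearing fewer than two times, and emit integer IDs instead of tokens.

```python
from collections import Counter

def build_vocab(docs, max_size=2**16 - 1, min_count=2):
    """Assign small integer IDs to the most common words, capping the
    vocabulary size and dropping words seen fewer than min_count times."""
    counts = Counter(word for doc in docs for word in doc)
    keep = [w for w, c in counts.most_common(max_size) if c >= min_count]
    return {w: i for i, w in enumerate(keep)}

def encode(doc, vocab):
    """Replace tokens with integer IDs; out-of-vocabulary words just drop out."""
    return [vocab[w] for w in doc if w in vocab]

docs = [["the", "dog", "barks"],
        ["the", "cat", "and", "the", "dog"],
        ["cat", "naps"]]
vocab = build_vocab(docs)
encoded = [encode(doc, vocab) for doc in docs]
```

With `min_count=2`, the hapaxes ("barks," "and," "naps") vanish and the remaining tokens come through as small integers, which is why the integer file is so much shorter than the full text.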

Then all the tokens and topic assignments are loaded back into the database, not just as overall distributions but as individual assignments. That makes it possible to look directly at the individual tokens that make up a topic, which I think is potentially quite useful. This gives a much faster, non-memory based access to the data in the topic state file than any other I know of; and it comes with full integration with any other metadata you can cook up.

Jockers’ “Secret sauce” consists, in part, of restricting to only nouns, adjectives, or other semantically useful terms. There is a way of doing that in the Bookworm infrastructure, but it involves not treating the topic model as a one-off job, but fully integrating the POS-tagging into the original tokenization. We would then be able to feed only adjectives into the topic modeling. But the spec for that isn’t fully laid out: and POS-tagging takes so long that I’m in no big hurry to implement it. It has proven somewhat useful in the Google Ngrams corpus, but I’m a little concerned by the ways that it tends to project modern POS uses into the past. (Words only recently verbified get tagged as verbs much longer ago in the 2012 Ngrams release.)

Perhaps more interesting are the ways that the full Bookworm API may expose some additional avenues for topic modeling. Labelled LDA is an obvious choice, since Bookworm instances are frequently defined by a plethora of metadata. Another option would be to change the tokens imported in; using either Bookworm’s lemmatization (removed in 2013 but not forgotten) or even something weirder, like the set of all placenames extracted out in NLP, as the basis for a model. Finally, it’s possible to use metadata to more easily change the definition of a *text*; for something like the new Movie Bookworm, where each text is a three-minute chunk, it would be easy to recalculate with each text instead coming in as an individual film.







Building outlines and slides from Markdown lectures with Pandoc

Just a quick follow-up to my post from last month on using Markdown for writing lectures. The github repository for implementing this strategy is now online.

The goal there was to have one master file for each lecture in a course, and then to have scripts automatically create several things, including a slidedeck and an outline of the lecture (inferred from the headers in the text) to print out for students to follow along in class.

To make this work, I invented my own slightly extended version of the markdown syntax. It has three new conventions:

1. Any phrase in bold is a keyword to be pulled out and included in outlines

2. Anything in a code block is to be used as a slide. Each separate code block is its own slide. Any first-degree header is a full page slide. (The easiest way to do a code block is just to tab-indent a line.) Most of my slides are just a single-element line like this:

> ![Edison electric light](http://scienceblogs.com/retrospectacle/wp-content/blogs.dir/463/files/2012/04/i-3530f86be619cdc7d42c13cdca188088-edison.bmp)

3. As in the previous example, the image format is extended so that labels in slides appear not as alt-text, but in the text above the image: in addition, any image link beginning with the character “>” is treated not as an image but as an iframe, making it easy to embed things like youtube videos or interactive Bookworm charts.

The slide decks are built with reveal.js, which drops everything into a nicely organized batch. Here’s what one looks like.  (This is for a lecture on household technologies in the 20s). My favorite feature is that by hitting escape, you get an overall view of everything in the lecture sorted by header–this is particularly useful when studying for exams, because those headers align exactly with the outlines.


The outlines are produced from the same lecture notes, but in a different way; rather than pull the code blocks, they walk through all the headers in the document and append them (and any bolded terms) to a new document that students can see. For that lecture, it looks like this:



There are a few things I still don’t love about this: image positioning and sizing is not so good as it is in powerpoint. But the thing that’s nice is that it’s extremely portable; if I don’t make it through the end of a lecture, I can just cut out the last few paragraphs, paste them into the next day’s document, and have the outline and slides immediately reflect the switch for both days. This makes a lot of last-minute, before-class changes dramatically easier.

The basic scripts, though not the full course management repo, are up on GitHub. The code is in Haskell, which I’ve never written in before, so I’d love a second set of eyes on it. Some brief reflections on coding for pandoc in Python and Haskell follow.

I thought it would be easy to switch between headers and an outline, but they turn out to have almost nothing in common in the Pandoc type definition; the outline needs to be built up recursively out of component parts. It’s an operation that’s much closer to really basic data structures than anything I’ve done before.

I initially used the pandocfilters Python package for this. That code is here. It basically works, thanks primarily to insight gleaned from an exchange on GitHub (between, I think, Caleb McDaniel and John McFarlane; I’ve lost the link) that you need to scope a global python variable and append to it from a walk function. But it has a tendency to break unexpectedly, and uses an incredibly confusing welter of accessors into the rather ugly pandoc json format. Plus, it’s fundamentally an attempt to write Haskell-esque code in Python, which is about the least pleasant thing I’ve ever seen.
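
For the curious, the pattern in question looks roughly like this. It's a dependency-free sketch that walks pandoc's JSON tree ({'t': type, 'c': contents} nodes) directly, rather than the pandocfilters version linked above:

```python
def stringify(node):
    """Flatten pandoc inline nodes ({'t': type, 'c': contents}) to plain text."""
    parts = []
    def visit(n):
        if isinstance(n, list):
            for item in n:
                visit(item)
        elif isinstance(n, dict):
            if n.get("t") == "Str":
                parts.append(n["c"])
            elif n.get("t") == "Space":
                parts.append(" ")
            else:
                visit(n.get("c"))
    visit(node)
    return "".join(parts)

def collect_keywords(doc):
    """Walk the document tree, appending every bold ('Strong') phrase
    to a list scoped outside the walker: the pattern described above."""
    keywords = []
    def visit(n):
        if isinstance(n, list):
            for item in n:
                visit(item)
        elif isinstance(n, dict):
            if n.get("t") == "Strong":
                keywords.append(stringify(n["c"]))
            else:
                visit(n.get("c"))
    visit(doc)
    return keywords

# A hand-built fragment of pandoc's JSON for "**cotton gin** mattered":
doc = [{"t": "Para",
        "c": [{"t": "Strong", "c": [{"t": "Str", "c": "cotton"},
                                    {"t": "Space"},
                                    {"t": "Str", "c": "gin"}]},
              {"t": "Space"},
              {"t": "Str", "c": "mattered"}]}]
keywords = collect_keywords(doc)
```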

By the time I made that python script work, I had spent a couple hours reading and re-reading the pandoc types definition, and it seemed like it would be simpler to just write the filter in Haskell directly. (I did a few Haskell problem sets for a U Penn course this summer out of curiosity; without that basic understanding of Haskell data types, I doubt I would have been able to understand the Pandoc documentation.) The lecture-to-outline Haskell code, to my surprise, ended up being a bit longer than the Python version (though much of that is type definitions and comments, which doesn’t really count). If anyone out there who knows Haskell can explain to me a better way to avoid some of the stranger elements in there (particularly the reversing and unreversing of lists just to allow pattern matching on them, which is a substantial proportion of what I wrote), I’m all ears.

Programming in Haskell is certainly more interesting than python; I agree with Andrew Goldstone’s comment that “whereas programming normally feels like playing with Legos, programming in Haskell feels more like trying to do a math problem set, with ghc in the role of problem-set grader”. I’m left with a strong temptation to write a TEI-to-Bookworm parser, which I’ve previously sketched in Python, in Haskell instead; both for performance and readability reasons, I think it might work quite well. Stay tuned.


More thoughts on topic models and tokens

I’ve been thinking a little more about how to work with the topic modeling extension I recently built for bookworm. (I’m curious if any of those running installations want to try it on their own corpus.) With the movie corpus, it is most interesting split across genre; but there are definite temporal dimensions as well. As I’ve said before, I have issues with the widespread practice of just plotting trends over time; and indeed, for the movie model I ran, nothing particularly interesting pops out. (I invite you, of course, to tell me how it is interesting.)

So here I’m going to give two different ways of thinking about the relationship between topic labels and the actual assigned topics that underlie them.

One way of thinking about the tension between a topic and the semantic field of the words that make it up is to simply plot the “topic” percentages vs the overall percentages of the actual words. So in this chart, you get all the topics I made on 80,000 movie and TV scripts: in red are the topic percentages, and in blue are the percentages for the top five words in the topic. Sometimes the individual tokens are greater than the topic, as in “Christmas dog dogs little year cat time,” probably mostly because “time” is an incredibly common word that swamps the whole thing; sometimes the topic is larger than the individual words, as in the swearing topic, since there are all sorts of ways of swearing besides its top words.

In some cases, the two lines map very well–this is true for swearing, and true for the OCR error class (“lf,” “lt” spelled with an ell rather than an aye at the front).

In other cases, the topic shows sharper resolution: “ain town horse take men”, the “Western” topic, falls off faster than its component parts.

In other cases the identification error is present: towards the top, “Dad Mom dad mom” takes off after 1970 after holding steady with the component words until then. I’m not sure what’s going on there–perhaps some broader category of sitcom language is folded in?

Topics vs tokens



Another approach is to ask how important those five words are to the topic, and how it changes over time. So rather than take all uses of the tokens in “Christmas dog dogs little year cat time,” I can take only those uses assigned into that full topic: and then look to see how those tokens stack up against the full topic. This line would ideally be flat, indicating that the label characterizes it equally well across all time periods. For the Christmas topic, it substantially is, although there’s perhaps an uptick towards the end.
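
In code, the quantity I'm plotting is something like this (a sketch over made-up token-level assignments, with a hypothetical function name; the real computation runs inside the Bookworm database):

```python
from collections import defaultdict

def label_share_by_year(assignments, topic, label_words):
    """Share of a topic's tokens, per year, that are its label words.
    `assignments` is an iterable of (year, word, topic_id) triples,
    i.e. token-level topic assignments, not just topic distributions."""
    totals, labeled = defaultdict(int), defaultdict(int)
    for year, word, t in assignments:
        if t == topic:
            totals[year] += 1
            if word in label_words:
                labeled[year] += 1
    return {year: labeled[year] / totals[year] for year in totals}
```

A flat line out of this function means the label describes the topic equally well in every period; a drifting line means the topic's content is changing underneath its label.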

But in other topics, that’s not the case. “Okay okay Hey really guys sorry” was steadily about 8% composed of its labels: but after 2000, that declined steadily to about 4%. Something else is being expressed in that later period. “Life money pay work…” is also shifting significantly, towards being more composed of its labels.

On the other hand, this may not be only a bug: the swear topic is slowly becoming more heavily composed of its most common words, which probably reflects the actual practice (and the fall-off of ancillary “damn”s and “hell”s in sweary documents). You can see the rest here.



These aren’t particularly bad results, I’d say, but do suggest a further need for more ways to integrate topic counts in with results. I’ve given two in the past: looking at how an individual word is used across topics:

and slope charts of top topic-words across major metadata splits in the data:


Both of these could be built into Bookworm pretty easily as part of a set of core diagnostic suites to use against topic models.

The slopegraphs are, I think, more compelling; they are also more easily portable across other metadata groupings besides just time. (How does that “Christmas” topic vary when expressed in different genres of film?) Those are questions for later.

Building topic models into Bookworm searches

I’ve been seeing how deeply we could integrate topic models into the underlying Bookworm architecture a bit lately.

My own chief interest in this, because I tend to be a little wary of topic models in general, is in the possibility for Bookworm to act as a diagnostic tool internally for topic models. I don’t think simply plotting topic trends absent any analysis of the underlying token composition of topics is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.

But topics also have a lot to offer token-based searching. Watching links into the Bookworm browser, I recently stumbled on this exchange:



How can I solve this biologist’s problem? (Or, at least, waste more of his time?)

The word-level topic assignments I have on hand are actually real useful for this. (I’m assuming, I should say, that you know both the basics of topic modeling and of the movie bookworm.) I can ask the beta bookworm browser for the top topics associated with each of the words “fly” (top) and “ant” (bottom):



Fly usage by topic



Ant usage by topic


“Fly” is overwhelmingly associated with the topics “boat ship Captain island plane sea water” (airplane flying) and “life day heart eyes world time beautiful” (unclear, but might be superman flying). (It’s even more so than on this chart, since I’ve lopped off the right side: there are about 2200 uses of “fly” in the first topic).

But “ant” is most used in two clearly animal related topics: “water animals years fish time food ice” and “dog cat little boy dogs Hey going.” And both of those topics show up for “fly” as well.

So in theory, at least, we can *restrict searches by topic:* rather than put into a Bookworm *every* usage of the word “fly”, we can get only those that seem, statistically, to be used in an animal-heavy context.
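
As a sketch of what "restrict by topic" means at the token level (made-up data and a hypothetical function name; the real query goes through the Bookworm API):

```python
def per_million(assignments, word, topics):
    """Uses of `word` whose token-level topic assignment falls in `topics`,
    per million tokens in the whole corpus. `assignments` is a sequence
    of (word, topic_id) pairs, one per token."""
    hits = sum(1 for w, t in assignments if w == word and t in topics)
    return 1e6 * hits / len(assignments)
```

Counting only the "fly" tokens assigned to animal topics, rather than every "fly," is the whole trick.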

With an imperfect, 64-topic model on a relatively small corpus like the Movie Bookworm, this is barely worth doing.


Ant in animal topics per million words in all topics


Fly in animal topics per million words in all topics

And given that “flying” is something that plenty of animals do, the “fly” topic here is probably not all Order Diptera.

But with collections the size of the Hathi trust, this could potentially be worth exploring, particularly with substantially larger models. “Evolution” is one of the basic searches in a few bookworms: but it’s hard to use, because “evolution” means something completely different in the context of 1830s mathematics as opposed to 1870s biology. A topic model that could conceivably make a stab at segregating out just biological “evolution,” though, would be immensely useful in tracing out Darwinian changes; one that could disentangle military shooting from the interjection “shoot!” might be good at studying slang.

Above all, this might be good at finding words that migrate meanings in early uses: most new phrases actually emerge out of some early construction, but this would let us try to recover meaning through context.

Hell, it might even have an application in Prochronisms work; given a large, pre-built topic model, any new scripts could be classified against it and their words assigned to topics, and tested for their appropriateness as a topic-word combination.

Technical note: the basics of this are pretty easy with the current system: the only issue with incorporating “topic” as a metadata field on the primary browser right now is that the larger corpus it compares against would also be limited by topic. This could be solved by using the asterisk syntax that no one uses: {"*topic":[3],"*word":["fly"]} will ensure both are dropped, not just one, by just specifying the "compare_limits" field manually.


Searching for structures in the Simpsons and everywhere else.

This is a post about several different things, but maybe it’s got something for everyone. It starts with 1) some thoughts on why we want comparisons between seasons of the Simpsons, hits on 2) some previews of some yet-more-interesting Bookworm browsers out there, then 3) digs into some meaty comparisons about what changes about the Simpsons over time, before finally 4) talking about the internal story structure of the Simpsons and what these tools can tell us about narrative formalism, and maybe why I’d care.

It’s prompted by a simple question. I’ve been getting a lot of media attention for my Simpsons browser. As a result of that, I need some additional sound bytes about what changes in the Simpsons. The Bookworm line charts, which remain all that most people have seen, are great for exploring individual words; but they don’t tell you what words to look for. This is a general problem with tools like Bookworm, Ngrams, and the like: they don’t tell you what’s interesting. (I’d argue, actually, that it’s not really a problem; we really want tools that will useful for addressing specific questions, not tools that generate new questions.)

The platform, though, can handle those sorts of queries (particularly on a small corpus like the Simpsons) with only a bit of tweaking, most of which I’ve already done. To find interesting shifts, you need:

1) To be able to search without specifying words, but to get results back faceted by words;

2) Some metric of “interestingness” to use.

Number 1 is architecturally easy, although mildly expensive. Bookworm's architecture has, for some time, prioritized an approach where "it's all metadata"; that includes word counts. So just as you can group results by the year of publication, you can group them by the word used. Easy peasy; it takes more processing power than grouping by year, but it's still doable.

Metrics of interestingness are a notoriously hard problem; but it's not hard to find a partial solution, which is all we really need. The built-in searches for Bookworm focus on counts of words and counts of texts. The natural (and intended) uses are the built-in metrics like "percentage of texts" and "words per million," but given those figures for two distinct corpora (the search set and the broader comparison set), it's also possible to calculate all sorts of other things. Some are pretty straightforward ("average text length"); but others are actual computational tools in themselves, including TF-IDF and two different forms of Dunning's log-likelihood. (And those are just the cheap metrics; you could even run a full topic model and ship the results back, if that weren't a crazy thing to do.)

So I added in, for the time being at least, a Dunning calculator as an alternate return count type to the Bookworm API. (A fancy new pandas backend makes this a lot easier than the old way.) So I can set two corpora, and compare the results of each to each.
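For reference, the Dunning score is cheap to compute from exactly those four numbers: a word's count in each corpus and each corpus's total size. A minimal sketch (the signed-score convention, positive meaning "leans toward corpus one," is my own choice, as is the function name):

```python
from math import log

def dunning(a, b, c, d):
    """Signed Dunning log-likelihood (G-squared) for a word occurring
    a times in corpus one (c words total) and b times in corpus two
    (d words total)."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus one if no bias
    e2 = d * (a + b) / (c + d)  # expected count in corpus two if no bias
    g2 = 2 * ((a * log(a / e1) if a else 0) + (b * log(b / e2) if b else 0))
    return g2 if a / c > b / d else -g2  # sign encodes direction of bias
```

A word appearing 100 times in one 100,000-word corpus but only 10 times in another of the same size gets a large positive score; swap the corpora and the score flips sign, which is what makes the two-color visualizations below possible.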

To plow through a bunch of different Dunning scores, some kind of visualization is useful.

Last time I looked at the Dunning formula on this blog, I found that Dunning scores are nice to look at in wordclouds. I'm as snooty about word clouds as everyone else in the field. But for representing Dunning scores, I actually think wordclouds are one of the most space-efficient representations possible. (This follows up on how Elijah Meeks uses wordclouds for glancing at topic models, and how the old MONK project used to display Dunning scores.)

There’s aren’t a lot of other options. In the past I’ve made charts for Dunning scores as bar charts: for example, the strongly female and the most strongly male words in negative reviews of history professors on online sites. (This is from a project I haven’t mentioned before online, I don’t think; super interesting stuff, to me at least). So “jerk,” “funny,” and “arrogant” are disproportionately present in bad reviews of men; “feminist,” “work,” and “sweet” are disproportionately present in bad reviews of women.

[Bar chart: gendered words in negative reviews of professors]
This is a nice and precise way to do it, but it's a lot of real estate to take up for a few dozen words. The exact numbers for Dunning scores barely matter, so there's less harm than usual in the oddities of wordclouds (for instance, a longer word seeming more important just because of its length).

We can fit both aspects of this, the words and the directionality, by borrowing an idea I think the old MONK website had: colorizing results by direction of bias. So here's one that I put online recently: a comparison of language in "Deadwood" (black) and "The Wire" (red).

[Wordcloud: Deadwood (black) vs. The Wire (red)]
This is a nice comparison, I think; individual characters pop out (the Doc, Al, and Wu vs. Jimmy and the Mayor); but it also captures the actual way language is used, particularly the curses HBO specializes in. (Deadwood has probably established an all-time high score on some fucking-cocksucker axis forever; but the Wire more than holds its own in the sphere of shit/motherfucker.) There's a forthcoming study of profane multi-dimensional spaces in here somewhere, I guess.

Anyhoo. What can that tell us about the Simpsons?

[Wordcloud: seasons 2-9 (black) vs. seasons 12-19 (red)]

Here’s what the log-likelihood plot looks like. Black are words characteristic of seasons 2-9 (the good ones); red is seasons 12-19. There’s much, much less that’s statistically different about two different 80-hour Simpsons runs than two  roughly 80-hour HBO shows: that’s to be expected. And most the differences we do find are funny things involving punctuation that have to do with how the Bookworm is put together.

But: there are a number of things that are definitely real. First is the falling away of several character names. Smithers, Burns, Itchy and Scratchy (Itchy always stays ahead), Barney, and Mayor Quimby all fall off after about season 9. Some more minor characters (McBain) drop away as well.

Few characters increase (Lou the cop; Duffman; Artie Ziff, though in only two episodes). Lenny peaks right around season 9; but Carl has had his best years ever recently.

[Chart: character mentions by season]

We do get more, though, of some abstract words. Even though one of the first appearances was a Christmas special, “Christmas” goes up. Things are more often “awesome,” and around season 12 kids and spouses suddenly start getting called “sweetie.” (Another project would be to match this up against the writer credits and see if we could tell whether this is one writer’s tic.)

“Gay” starts showing up frequently.

Others are just bizarre: The Simpsons used the word "dumped" only once in the 1990s, and 19 times in the 2000s. This can't mean anything (right?) but seems to be true.

What about story structure? I found myself, somehow, blathering on to one reporter about Joseph Campbell and the hero’s journey. (Full disclosure: I have never read Joseph Campbell, and everything I know about him I learned from Dan Harmon podcasts).

But those things are interesting. Here are the words most distinctive of the first act (black) and the third act (red). (That is, minutes 2-8 for the first act vs. minutes 17-21 for the third.)

[Wordcloud: first-act (black) vs. third-act (red) words]

As I said earlier, school shows up as a first-act word. ("Screeching," here, is clearly from descriptions of the opening credits; but school remains even when you cut the time window back quite a bit, so I don't think it's just credit appearances driving this.) And there are a few more data-integrity issues: elderman is not a Simpsons character but the screen name of someone who edits Simpsons subtitles; "www," "Transcript," and "Synchro" are all unigrams from the subtitle-editing process. I'll fix these for the big movie bookworm, where possible.

That said, we can really learn something about the structural properties of fictional stories here.

Lenny is a first act character, Moe a third act one.

[Chart: "Lenny" and "Moe" by minute]

We begin with “school” and “birthday” “parties;”

[Chart: "school" and "birthday" by minute]


we end with discussions of who “lied” or told the “truth,” what we “learned” (isn’t that just too good?), and, of course with a group “hug.” (Or “Hug”: the bias is so strong that both upper- and lower-case versions managed to get in). And we end with “love.”

[Chart: third-act words by minute]

The hero returns from his journey, having changed.

Two last points.

First, there are no discernibly "middle" words I can find: comparing the middle to the front and back returns only the word "you," which indicates more dialogue but little else.

Second: does it matter? Can we get anything more out of the Simpsons through this kind of reading than by just sitting back to watch? Usually, I'd say that's up to the watcher: but assuming you take television at all seriously, I actually think the answer may be "yes." (Particularly given whose birthday it is today.) TV shows are formulaic. That can be a weakness; but if we accept them as formulaically constructed, seeing how the creators play with the form can help us appreciate the shows better, and better understand how they work and how they make us feel.

Murder mysteries are like this: half the fun of all the ITV British murder mysteries is predicting who will be the victim of murder number 2 about a half hour in; all the fun of Law and Order is guessing which of the four-or-so templates you're in. Wrongful accusation? Unjust acquittal? It was the first guy all along? (And isn't it fun when the cops come back in the second half hour?)

But the conscious play on those structures is often fantastic. The first clip-show episode of Community is basically that: essentially no plot, but instead a weird set of riffs on the conventions the show has set for itself, verging on a deconstruction of them. One could fantasize that we're getting to the point where the standard TV formats are about as widespread, as formulaic, and as malleable as sonata form was for Haydn and Beethoven. What made those two great in particular was their use of the expectations built into the form. Sometimes you don't want to know how the sausage is made; but sometimes, knowing just gets you better sausage.

And it’s just purely interesting. Matt Jockers has been looking recently at novels and their repeating forms; that’s super-exciting work. The (more formulaic?) mass media genres would provide a nice counterpoint to that.

The big browser of 80,000 movies and TV episodes isn't broken down by minute yet: I'm not sure if it will be for the first release. (It would instantly become an 8-million-text version, which makes everything slower.) But I'll definitely be putting something together that makes act-structure queries possible.

Markdown, Historical Writing, and Killer Apps

Like many technically inclined historians (for instance, Caleb McDaniel, Jason Heppler, and Lincoln Mullen) I find that I’ve increasingly been using the plain-text format Markdown for almost all of my writing.

The core idea of Markdown is that rather than use Microsoft Word, Scrivener, or any of the other pretty-looking tools out there, you type in “plain text” using formatting conventions that should be familiar to anyone who’s ever written or read an e-mail. (Click on Mullen’s or Heppler’s name for a better introduction than this, or see the Chronicle’s wrapup of approaches).

The benefits are many, but they’re mostly subtle:

  • A simple format like Markdown creates documents you'll have no trouble reading in twenty years. I've been teaching a survey course this semester and had a hell of a time reading my old notes from generals, which I took using EndNote; with Markdown, any web browser, text editor, or Microsoft Word descendant will have no trouble opening them.
  • It’s very easy to produce content that will look good in multiple media: I can make a course syllabus or personal CV with that formats nicely on a website and produces a clean looking PDF at the same time.
  • It becomes much easier to do things to a bunch of notes at the same time: bundle them into PDFs, search through all of your notes simultaneously, and so forth.

None of these, though, is a particularly strong sell for those who use a computer instrumentally: in reality, your Microsoft Word documents aren't about to disappear, either. And there are disadvantages to giving up Word.

  • Things like footnotes with a citation manager are not very easy, even for the technically competent.1 Even footnotes without a citation manager are fairly clumsy.
  • The best tool for turning your Markdown documents into attractive web pages, Pandoc, is not especially easy to install or configure if you don't use the command line on a regular basis.
  • The core definition of Markdown is a little unclear: particularly in the last week, there have been some conflicts over the definition that will be confusing to newcomers. (Although the proposal that sparked them, "Common Markdown," is likely to be a good thing in the long run.)

The heart of Markdown's appeal is its flexibility: to drive any adoption outside the hard core of technical users, you need a killer app built on it that solves a real problem. In the technology sector, that has been Markdown's ability to easily handle links and snippets of computer code for those writing on two widely used sites, GitHub and Stack Overflow.

Among historians, neither of those is very important. And the footnote problem is big enough that I generally wouldn't recommend that anyone use Markdown right now, unless they enjoy banging their head against the wall.

Lectures and Notes: the killer apps.

There are two places, though, where even historians don’t tend to use footnotes: lectures, and notes. And in both of these, Markdown makes some amazing things possible.

If there’s any reason for historians to use markdown, it’s in these two spheres. The reason I keep using Markdown is that it makes it possible for me to personally solve two problems that have driven me crazy:

  1. Quickly making slide decks to go alongside a lecture, and borrowing and reusing chunks of slides from one talk in another;
  2. Making heads or tails of the thousands of pictures you take on an archival trip.

Markdown and lectures: multimedia and transposability.

First, lectures. With Markdown, I'm able to write my own notes and create a slide deck at the same time. An example will help. Here's a snippet from my lecture notes on the memory of the Civil War:
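The snippet itself seems to have been lost from this copy of the post; judging from the description just below, it would have looked something like this (my reconstruction, with a placeholder URL, not the actual notes):

```markdown
## Abolitionist memory of the war

![Sherman](https://upload.wikimedia.org/.../William_Tecumseh_Sherman.jpg)

- Sherman's march and **Field Order 15**: land for freed slaves
```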

With some ancillary code I wrote, that does two things at once: builds a slide showing the wikimedia copy of Sherman’s grizzled mug, and creates a set of notes for me under the header “Abolitionist memory of the war” to go on the paper notes I’ll read from.

Later on, I’ll write another script that will find pull every phrase in boldface (like “Field Order 15”) from all my notes and put them onto a list of possible IDs for the midterm I can hand out. Another script could strip just the section headers and print out outlines for the lectures to hand out before class.

This is writing documents for multiple uses, and it can be incredibly useful. If, two minutes before class, I decide I want to switch the order I talk about the abolitionist memory of the war and the white supremacist memory of the war, I can just cut and paste the chunks of text, and all the slides associated with each will have their order switched.

Something like this could provide a really useful way to integrate and share resources, and free up some of the tedium with prepping lectures. But:

  • That syntax for including an image as a slide is my own, not standard Markdown. I’ve defined scripts for dropping in YouTube videos, images, captions, and some other predefined formats: but it would take a lot of work to define a set of them that make sense for anyone but me.
  • There are a lot of standards out there for working with HTML slides. None is winning, in part because none is anywhere near as good as Keynote or PowerPoint for the average user. My code works with deck.js, one of the only HTML formats not supported by Pandoc; but there's no obvious other standard to switch to.
  • Constructing slides that are more complicated than a single image with a title, or a numbered list, requires some serious HTML/CSS expertise. My scripts support that, but not in a pretty way.

Modern HTML allows some beautiful things: I can easily imagine a GUI for one of the standards that would make it easy to create slides for re-use in one of the competing platforms. But I think the standards are still evolving too rapidly in this sphere to make the way forward obvious.

Pull out the slide deck, and you still might have a useful tool here: something that generates lecture notes for me, outlines for the students and the course web page, and IDs for test-prep sessions. But I think there's something even more valuable possible for archive notes.

Markdown and the Archives: integrating notes and photos

Markdown is a great language for taking archival notes. Archives are all about hierarchy, and Markdown easily lets you tag multiple levels of headers (Series, Box, Collection, file…). But so is Microsoft Word: and there are plenty of outlining programs out there that are even better.

There are a few things that Markdown notes might do more easily than normal ones. Build a good enough web interface, and you could even click on a photo or quote in your notes and instantly get back a string that ascends the various headers to tell you where it is: Series 3a, Box 13, Folder 4, Letter on 4/18. But the place where there’s really an opportunity lies in Digital Photos.
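That "ascend the headers" lookup, at least, is simple to sketch: walk the Markdown header levels above a given line, keeping the innermost header seen at each level (the helper and the sample notes here are hypothetical):

```python
def breadcrumb(lines, target):
    """Report where line `target` sits in the archival hierarchy by
    tracking the innermost header seen at each level above it."""
    trail = {}
    for line in lines[:target + 1]:
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            trail = {k: v for k, v in trail.items() if k < level}  # drop deeper headers
            trail[level] = line.lstrip("# ").strip()
    return " > ".join(trail[k] for k in sorted(trail))

notes = ["# Series 3a", "## Box 13", "### Folder 4", "Letter on 4/18"]
```

Clicking the letter on the last line would then report "Series 3a > Box 13 > Folder 4".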

Digital cameras have completely changed historians’ relations to archives in the last 15 years. (That is, in the subset of archives where cameras are allowed). We used to take notes: now, a massive part of our archival practice involves taking pictures, which have to be sorted through on our return.

When I’m wading through boxes, I tend to type the name of the box, and then some information about each folder followed by descriptions of the documents: if it’s especially useful or especially visual, I take a picture (or a series of several pictures). I think this is pretty similar to what most people do. It means that I end up with two separate timelines to sort through when I get home. 1) A bunch of textual notes that contain my impressions of the works and the rationales for why I copied them and what they are. 2) A stream of pictures with little context but their order to patch together their origin, sometimes with a close-up of a box or folder label thrown in to help.

The tough question is: how can you insert pictures into your notes? Unless you want to physically pick up your laptop and use the webcam for your pictures, it’s not obvious what the best way would be. And if you try to put more than a couple pictures into a Word document, it will crash right away.

Unlike the systems most historians use for notes, Markdown is plain text and has an easy method for inserting multimedia. That means that you can use it to integrate your archival photos directly into your notes; and that unlike Word, it can handle hundreds of images or thumbnails with aplomb.

The last challenge is knowing which parts of your notes go with which pictures. This is a surprisingly hard thing to solve: but there’s an existing answer in a second technology much beloved by the technology industry: version control.

Version control can get complicated, but in its simplest form it's much like a Wikipedia edit history: not just the current state of a file, but every previous revision, is stored.

So for archival notes, we just need to save the state of your notes every 10 or 15 seconds; match those saved versions against the timestamps of the photos from your digital camera; and insert each picture into the text at just the right place.
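The matching step is then a one-pass merge of two sorted timelines: each photo gets anchored at whatever point in the notes the most recent save touched. A sketch with invented data (in practice the save times would come from the git history and the photo times from the camera's EXIF data):

```python
from bisect import bisect_right

def place_photos(commits, photos):
    """Anchor each photo at the notes position of the last save made
    before the photo was taken."""
    times = [t for t, _ in commits]
    placed = []
    for ptime, fname in photos:
        i = bisect_right(times, ptime) - 1
        anchor = commits[i][1] if i >= 0 else 0  # photo predates all saves: top of file
        placed.append((anchor, fname))
    return placed

# (timestamp, line being edited) per auto-save; (timestamp, filename) per photo.
commits = [(1000, 2), (1015, 7), (1030, 12)]
photos = [(1010, "IMG_0481.jpg"), (1040, "IMG_0482.jpg")]
```

Well-calibrated clocks are doing all the work here, which is why step 2 below insists on them.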

When you want to review your notes, you just open them up in HTML format: thumbnails of every picture will appear in place, and you can click on them to get the full version.

For the technically savvy, I’ve put a set of scripts online that do just this. I use gitit to view the notes themselves so I can interlink between pages. A daemon handles the git commits: but that only works because I have always been a compulsive, several-times-a-minute saver of my documents.

What would a user-friendly platform look like?

My repo might be useful for those who are already comfortable with tools like version control: but those are the people who are already using Markdown anyway.

To make this useful for anyone else, we’d need a system with three easy, non-command line steps:

1. Installation

Puts Pandoc, Git, and a good Markdown editor on your computer at once.

2. Writing (in the archives)

This should resemble existing note-taking as closely as possible: the user will need to make sure their camera's clock is well calibrated, but other than that it should feel like nothing more than using a new text editor.

Whenever you type in the editor, it saves the files and runs git commit at close intervals. (Git experts may find the idea of automatic commits without a clear commit message cringe-inducing. Insofar as they have a point, edits should probably take place on a separate branch that is forked back into the main one periodically.)
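A single tick of that auto-save loop is only a couple of git calls. A sketch (the injectable `run` parameter is just so the logic can be exercised without a real repository; a real daemon would call this every 10-15 seconds):

```python
import subprocess

def commit_once(repo_dir, run=subprocess.run):
    """Stage everything in the notes repo and commit it quietly."""
    run(["git", "add", "-A"], cwd=repo_dir)
    run(["git", "commit", "-q", "-m", "autosave"], cwd=repo_dir)
```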

3. Compilation (loading your pictures)

Imports photos from an SD card or photo library, finds the version-control files and matches photo times against them, and builds an HTML file for each document of notes.

What’s the platform?

Some of the technical components are obvious. I can't imagine using anything other than git for version control; and though I use gitit to view files, I think that standalone HTML files are the only sensible way for most people to view theirs. The scripting language for step three isn't very important, either: I've used Python, but anything with a set of hooks into git would do.

The big question is: what's the text editor to be? I use emacs, and get the impression that most people writing in Markdown are using vim. Both are clearly bad choices for the ordinary historian. For all that Markdown can be written in any editor, the editor here must also support auto-save and auto-git-commit, so anything without a scripting interface is out. SublimeText has its selling points, but free is probably the way to go.

That means, unless I’m missing a central player in the ecosystem, that the natural choice is the new Atom editor from Github. But perhaps there’s a more lightweight alternative?

Platform will also be an issue. The Mac is the obvious choice to capture a majority of historians: but a surprising number of people seem to take their notes with an iPad-and-keyboard setup, which would call the whole stack into question.


So that’s the proposal. Once historians see how great Markdown is for notes, maybe they’ll think about it for lectures; once they use it for lectures, maybe the footnote ecosystem will start to improve, and we’ll finally be able to distribute historical papers as text, making them more portable, more easily structured, and more lasting.

So, anyone want to try?

  1. It took me a few hours of mucking about in Emacs Lisp to make inserting a link to something in my Zotero library almost as easy as it is under Microsoft Word; and if you want to configure the core behavior of Pandoc, it’s best to use Haskell. Even the “programming historian” may not have heard of either of these languages. Both (well, at least Haskell) have their strengths: but suffice it to say that neither has ever been anyone’s answer to the question “If I should only learn one computer language, which should it be?”↩