The Simpsons Bookworm

I thought it would be worth documenting the difficulty (or lack thereof) of building a Bookworm on a small corpus: I’ve been reading too much lately about the Simpsons thanks to the FX marathon, so I figured I’d spend a couple of hours making it possible to check for changing language in the longest-running TV show of all time.

For some thoughts on how to build a bookworm, read “prep”; otherwise, skip to the analysis. Or just head over to the browser.


Step one is getting the texts. This is easy enough here, something I know how to do from all my Prochronisms posts: I can just use the subtitles, which are available a batch at a time. The only challenge is deciding what to do with audio-effects subtitles. I’m deciding to download the files that include them where necessary, but probably disable them by default. I also end up with only 540-something episodes, about ten short of the complete run: rather than try to figure that out at the start, I’m going to let the Bookworm data visualizations themselves be the clue to what I’m missing.

Next up is choosing what a “text” will be. The obvious choice would be for each episode to be a single text: but 550 episodes, while a lot to watch, don’t give many angles for analysis. My second idea is that it might be interesting to look at a really granular level: ideally, we’d be able to compare the first, second, and third acts. That info isn’t in the subtitles, but we can split up by lines of speech: later on, we’ll be able to aggregate the queries to look at just the first hundred lines, or the first third, or whatever. The only downside is that it dramatically increases the number of texts: but that’s not really a huge problem.

That also makes it easy to decide what I’ll display in the search results: the individual line from the script containing the word.

Next step is to parse into Bookworm format. Since these are in SRT format, it’s not as easy as it could be: I’m looking to create indexes that are episode-season-line. To get the season and episode numbers, I write out some regular expressions that match the various different filenames. This is one of the uglier parts, and where I actually spend the most time. The final parsing code uses a whole bunch of regexes to handle the different formats people use: “S04E20”, “[1.3],” and so forth. One batch doesn’t have season numbers at all: I’ll have to fix that later.
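The filename matching can be sketched with a few patterns like these. The specific regexes are my reconstruction rather than the original parsing code, and the “7x06” style is a hypothetical extra format:

```python
import re

# Each pattern captures a season and episode number from one filename
# convention: "S04E20", "[1.3]", and (hypothetically) "7x06".
PATTERNS = [
    re.compile(r"[Ss](?P<season>\d{1,2})[Ee](?P<episode>\d{1,2})"),
    re.compile(r"\[(?P<season>\d{1,2})\.(?P<episode>\d{1,2})\]"),
    re.compile(r"(?P<season>\d{1,2})x(?P<episode>\d{1,2})"),
]

def parse_filename(name):
    """Return (season, episode) parsed from a subtitle filename, or None."""
    for pattern in PATTERNS:
        match = pattern.search(name)
        if match:
            return int(match.group("season")), int(match.group("episode"))
    return None

print(parse_filename("The.Simpsons.S04E20.srt"))  # (4, 20)
print(parse_filename("simpsons [1.3].srt"))       # (1, 3)
```

The batch with no season numbers at all is exactly the case where `parse_filename` returns `None`.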

Next is actually parsing the text, and adding some new information to it about the position of each line. This is usually the hardest part, but SRT parsing is pretty easy as these things go. Plus, nailing down the format leads me to an insight: rather than use line number, I can take the embedded time information in the SRT files and index by the minute and second in the episode at which a subtitle flashes on the screen. Each subtitle block will correspond to a text, and we’ll know the exact moment it appeared. It turns out there are about 200,000 of those in the series, which is a reasonable number of texts to include in a Bookworm. (Though if I were hypothetically to do this for a whole bunch of TV series (more than a couple hundred) at the same time, that might push the system’s limits.) Parsing out the SRT time information works well. We’re left with some straggling sound effects, which I’m just leaving in for the time being. Occasionally character names appear at the front of texts: again, that’s something I’d correct if this were a weekend project rather than a weeknight one.
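Extracting that time index from an SRT timing line is a one-regex job; a minimal sketch (the helper name is mine):

```python
import re

# An SRT block's timing line looks like "00:12:34,567 --> 00:12:36,100".
# We only need the start time, folded down to (minute, second).
TIMING = re.compile(r"(\d+):(\d+):(\d+),\d+")

def block_position(timing_line):
    """Return the (minute, second) at which a subtitle block first appears."""
    hours, minutes, seconds = map(int, TIMING.match(timing_line).groups())
    return hours * 60 + minutes, seconds

print(block_position("00:12:34,567 --> 00:12:36,100"))  # (12, 34)
```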

That means the final scheme will give us, for each subtitle block:
1. Season Number
2. Episode number in the season
3. Episode number in the series (will make some plots easier).
4. Minute in the episode
5. Second in the episode
6. The actual text of the block.

From that information, if we were true Simpsons scholars, we could easily add:

1. Act (roughly: call minutes 0-7 act 1, minutes 8-14 act 2, and minutes 15 to the end act 3)
2. Air date, episode director, and other information easily linkable from IMDB.
3. Whether it’s a finale or what.
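The act assignment in item 1 would be a couple of lines; a sketch of exactly the rough cut described above:

```python
def act(minute):
    """Rough cut from above: minutes 0-7 are act 1, 8-14 act 2, 15+ act 3."""
    if minute <= 7:
        return 1
    if minute <= 14:
        return 2
    return 3

print([act(m) for m in (0, 7, 8, 14, 15, 21)])  # [1, 1, 2, 2, 3, 3]
```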

Once the text is parsed, the file creation is pretty easy, and we’re ready to ingest. The input.txt file is just the text and an ID number constructed from the moment the block appears on screen; the jsoncatalog.txt is just a dump of an object I was using for processing anyway.
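For one subtitle block, the two ingest lines might be built like this. This is a sketch only: the exact field names and the ID format below are my assumptions, not the actual files:

```python
import json

def ingest_lines(season, episode, series_ep, minute, second, text):
    # Hypothetical ID format: it just encodes the moment the block
    # appears on screen. The catalog fields mirror the scheme listed
    # earlier (season, episode, series episode, minute, second, text).
    block_id = "S%02dE%02d_%02d_%02d" % (season, episode, minute, second)
    input_line = "%s\t%s" % (block_id, text)
    catalog_line = json.dumps({
        "filename": block_id,
        "season": season,
        "episode": episode,
        "series_episode": series_ep,
        "minute": minute,
        "second": second,
        "searchstring": text,
    })
    return input_line, catalog_line

inp, cat = ingest_lines(4, 20, 79, 12, 34, "Worst. Episode. Ever.")
print(inp)
print(cat)
```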

I’ve already written a specialized makefile for my Federalist papers bookworm to clone the Bookworm repo and put files in the right place, so that’s easily adapted.

And then we’ve got it! I didn’t designate any fields as “time,” so a first inspection will be easier using the D3 browser.

The first test is to find out about those pesky missing episodes. So I’ll plot a heatmap of the number of words for each episode (x axis) and season (y axis):
Screen Shot 2014-08-29 at 11.35.28 AM

This shows that we’ve got about 25 episodes per season, but: we’ve got a season 0 and no season 1 (that one set of SRTs that didn’t give a season, no doubt); we’ve got no seasons 16 and 17; and, curiously, most season 6 episodes are twice as long as they should be. Probably season 16 was mislabeled as season 6, and we’re actually missing season 17. We’re also missing the first 9 episodes of season 21, and the first two of season 22. Oh well. Something to catch on the next run.


The D3 browser lets us quickly check out some other things, like the number of words (color) by *minute* (y axis) and season (x): you can see commercial creep, as sometime around season 14 we lose most of minute 21.

Screen Shot 2014-08-29 at 11.40.32 AM

OK: let’s check the actual words. Here are uses of each of the central four characters: season on the x axis, unigram on the y axis.

Screen Shot 2014-08-29 at 12.19.33 PM

Nothing too suspicious here: the shift from Bart to Homer looks good, etc.

Just trying some line charts: yep, Maude only gets mentioned much by name around the season she dies:

Screen Shot 2014-08-29 at 12.31.00 PM

But what’s really interesting, maybe, isn’t the season-to-season change but the internal episode structure. For instance, at what minute in the episodes do characters talk about “school?”

Screen Shot 2014-08-29 at 12.33.00 PM

That’s pretty interesting, actually: with pretty much every passing minute, the plots seem to shift away from school.

Likewise, “I’m Kent Brockman” seems to be overwhelmingly a gag from the opening scene:

Screen Shot 2014-08-29 at 12.35.03 PM

OK, that’s enough: here’s the link to the Bookworm, and here’s the source code.

Finding the best ordering for states

Here’s a very technical, but kind of fun, problem: what’s the optimal order for a list of geographical elements, like the states of the USA?

If you’re just here from the future, and don’t care about the details, here’s my favorite answer right now:

But why would you want an ordering at all? Here’s an example. In the baby name bookworm, if you search for a name, you can see the interaction of states and years. Let’s choose “Kevin,” because it played such a role in my anachronism-hunting piece on Lincoln.

Screen Shot 2014-06-04 at 3.00.44 PM

Clearly the name took off around the start of the baby boom. But is there a geographical pattern? It’s very hard to say. It does look like the red names begin around 1955 in much of the country. But in a few, it’s not until the early 1970s. Which ones? Alabama, Georgia, North Carolina, South Carolina. That is, only after substantial reading back and forth to the axis is it clear that most of those are southern states. But this is the sort of insight that should be immediately obvious. And there may be other connections we’re missing out on. The whole point of data visualization over tables is that you can pick out patterns using faster forms of cognition: requiring you to push over to the left to read off the names is a major loss.

Alphabetical order makes it easy to find any individual state (assuming you know its name) but hard to see the way related states move with each other. It means that to trace out regional variations over time, we tend to animate maps: but using playback time as the proxy for historical time makes cross-temporal comparisons much harder to make. As Tufte says, comparisons should be enforced within the eyespan: relying on animation to trace out common names is a big problem. So there’s dramatic interest in seeing different names pop up in (for instance) Reuben Fischer-Baum’s animation of baby names; but you have to watch the whole thing to think through questions like “what regions tend to adopt names early?” or “what’s the name that stays on top the longest?”

Putting it all into X-Y makes these questions easier. But that means we need to map states to X or to Y, and alphabetical order doesn’t place states near others like them.

So how could we make the states usefully arranged? We need some dimensionality reduction.

Linear reductions

One obvious way would be east-to-west or north-to-south: that starts out quite well, with all of New England:

But it quickly falls apart, with Ohio, Florida, Georgia, and Michigan in immediate succession. If we plot the states, we can quickly see why. Rather than list orders, I’m going to show them as paths through a map: here’s what that looks like in this case.



(By the way, you can see that the points are a little arbitrary: I’ve taken the first geonames hit for each state, which is sometimes the capital, sometimes the state centroid, and sometimes the most important city. Ideally I’d be using the population-weighted centroid, but in some ways I kind of like the results that come out of this.)
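The east-to-west sort itself is trivial; here’s a sketch with a few rough, hypothetical longitudes (not the geonames points used above) that produces exactly the Ohio/Florida/Georgia/Michigan pile-up:

```python
# Rough longitudes for a few states (hypothetical points, for
# illustration). Sorting on longitude alone interleaves northern and
# southern states as soon as we leave New England.
points = {
    "Maine": -69.0, "New Hampshire": -71.6, "Massachusetts": -71.8,
    "Florida": -81.5, "Ohio": -82.9, "Georgia": -83.4, "Michigan": -84.5,
}

# East to west: least negative longitude first.
east_to_west = sorted(points, key=points.get, reverse=True)
print(east_to_west)
# New England leads; then Florida, Ohio, Georgia, Michigan in succession.
```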

There are some other possibilities for linear dimensionality reduction (principal components comes to mind) but they’ll have the same fundamental problem. We want a metric that takes proximity more fully into account. Even non-metric multi-dimensional scaling fails: it handles a couple cases better (Jackson and St. Louis are in a more sensible order, for instance), but it still jumps erratically up and down, preventing any larger groups like “the south” from coming into sight:

nonmetric MDS


Hierarchical clustering approaches

One possible approach, suggested to me by Miriam Huntley, is hierarchical clustering: using distances, we can cluster the states by proximity. Here’s the initial result of that:

Initial Dendrogram

The individual groups are quite nice (New England is there, plus New York at the end), and every state is adjacent to an immediate neighbor. And while the groups have geographical coherence, they aren’t exactly the regions we know and love: the “mid-atlantic” runs down to South Carolina, and the midwest includes the gulf coast all the way to Tallahassee. The connections between the groups are scattered. Florida is next to Pennsylvania, and South Carolina to Massachusetts. Seen as a path, the weirdness of this is clear:


Leaf ordering in dendrograms is arbitrary, however, and we can do better than this. Using a method developed by Bar-Joseph et al., and implemented in the “cba” library for R, we can reorder the dendrogram so that the groups stay the same, but the leaves are rotated so that adjacent leaves end up as close together as possible.


Reordered dendrogram


Now, the path looks considerably better:



The clusters remain adjacent, but now the transitions are so smooth that it’s not obvious where one begins and the other ends. Instead, we get a serpentine path through the states that both ensures every step is between two nearby states and keeps the path generally inside a single region.
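The same reordering trick is available outside R: scipy’s hierarchical clustering exposes the Bar-Joseph et al. optimal leaf ordering directly (a swap for the “cba” step, not the code used here). A toy sketch with rough coordinates for six states:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

# Approximate (lat, lon) points for six states, for illustration only.
states = ["ME", "NH", "VT", "MA", "FL", "GA"]
coords = np.array([
    [45.3, -69.0],   # ME
    [43.7, -71.6],   # NH
    [44.1, -72.7],   # VT
    [42.3, -71.8],   # MA
    [28.6, -81.5],   # FL
    [32.7, -83.4],   # GA
])

# optimal_ordering=True rotates subtrees so adjacent leaves are as close
# as possible: the clusters stay intact, but the transitions between
# them get smoothed.
Z = linkage(pdist(coords), method="average", optimal_ordering=True)
order = [states[i] for i in leaves_list(Z)]
print(order)
```

Whatever the orientation of the tree, New England stays contiguous and Florida-Georgia stay adjacent.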

Network approaches

Can we do better? The strategy of plotting these as paths suggests that maybe this is an instance of the traveling salesperson problem, in which we want to travel through all the states minimizing the distance traveled. Why shouldn’t the “best” solution simply be the one where the overall sum of distances is the least?

Inserting a dummy node as start- and end-point lets us pose it that way: using the best method found by the “TSP” package in R (which is not guaranteed to be the optimal solution, since the traveling salesman problem is notoriously difficult to solve exactly), we get quite a different path:



Rather than start in Maine, this route begins in Tennessee! After winding through the Midwest to West Virginia, it leaps to Vermont and then takes a beautifully practical course down the Eastern seaboard through Texas, through the Great Plains, and then takes up a nearly east-to-west ordering through the Mountain and Western time zones. While many of the regional choices here look better to me than the dendrogram solution (particularly the coherence of the South), the distance-optimizing strategy means that there are a few nearby states that have nothing in common: the leap from New Mexico to Montana, for example, and the extremely strange choice to place Washington DC between West Virginia and Vermont, ten nodes removed from either Maryland or Virginia, the closest geographical points. (In fact, I think the route could be improved by heading straight to Vermont from West Virginia and putting DC in its rightful place: but it says something that out of the 7 algorithms in the free version of the TSP package, none was able to improve on this route.)
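At toy scale, the dummy-node trick is easy to see: a dummy city at distance zero from everything turns the shortest *tour* into the shortest open *path*. A brute-force sketch with made-up points (the real problem, of course, needs the TSP package’s heuristics):

```python
from itertools import permutations
from math import dist

# Four made-up "states" on a plane. A dummy node at distance 0 from
# every city contributes nothing to any tour, so the optimal tour
# through cities + dummy equals the optimal open path through the
# cities alone -- which, at this size, we can brute-force directly.
points = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (0, 1)}

def path_length(path):
    return sum(dist(points[a], points[b]) for a, b in zip(path, path[1:]))

best = min(permutations(points), key=path_length)
print(best, path_length(best))
```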

Fractal Curves 

Another option is not to minimize travel distance but to maximize the likelihood that two nearby points will end up next to each other. That suggests filling the geographic region with some kind of fractal curve, and then positioning each state along the curve.

This is an appealing way to think of arranging the country linearly: not as a network, but as an iterable set of points. For just the United States, we could use some already-existing path. The most widespread linear mapping of points is the ZIP code system: Samuel Arbesman has written about this in Wired, and includes a link to Robert Kosara’s ZipScribble maps. Here’s Kosara’s idea with a few minor changes (I use a rainbow spectrum, rather than coloring each state separately, and an Albers projection. And it appears that the zip database I have handy has something weird going on in southwest Georgia.)

ZipManifold

Space-filling curves

The ZIP system isn’t especially logical, but there should be something similar that’s better. My first thought for this problem, the one that prompted the whole post, was to use a Hilbert curve. It turns out that Kosara has mapped that approach onto the ZIP dataset as well.

Using just the state points, it’s possible to draw a Hilbert curve that covers the continental United States, and then visit each state at the point where the curve passes closest to it. The actual path taken can then be simplified down to eliminate the intervening curve segments. Here’s what that looks like, with both the Hilbert curve and the simplified route. I’ve shaded the Hilbert curve using a double rainbow so it’s easier to trace from its origin near the Bahamas (first making shore near South Carolina) to its exit off the coast of Los Angeles.

Hilbert Curve Route
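The core of the approach is the standard conversion from grid cell to distance along the curve; here’s a sketch that bins a few states (rough coordinates, mine) onto a 16×16 grid over a continental bounding box and visits them in Hilbert order:

```python
def xy2d(n, x, y):
    """Distance along a Hilbert curve filling an n-by-n grid (n a power
    of two) -- the standard rotate-and-flip construction."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:          # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Rough (lon, lat) points for four states (illustrative values only),
# binned to a 16x16 grid over an approximate continental bounding box.
states = {"ME": (-69.0, 45.3), "FL": (-81.5, 28.6),
          "CA": (-119.7, 36.8), "WA": (-120.7, 47.4)}
LON0, LON1, LAT0, LAT1 = -125.0, -66.0, 24.0, 50.0
N = 16

def cell(lon, lat):
    x = min(N - 1, int((lon - LON0) / (LON1 - LON0) * N))
    y = min(N - 1, int((lat - LAT0) / (LAT1 - LAT0) * N))
    return x, y

route = sorted(states, key=lambda s: xy2d(N, *cell(*states[s])))
print(route)
```

The curve’s orientation (and therefore where it “enters” and “exits” the country) depends on the construction; this toy version won’t match the double-rainbow figure exactly.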


I’m disappointed by the performance here. While there is some regional coherence (the stretch from Wisconsin to Kansas is well done, and the first jumps through the South are acceptable), the square binning forces some rather strange choices: the odd jag down to North Carolina, the detour to Colorado and Wyoming.

There are other issues as well. Hilbert curves work best in square spaces, and the patches of ocean, Canada, and Mexico that get filled are pretty far off limits. While I don’t show Alaska and Hawaii, for the other algorithms they’ve simply been tacked on at the end in a reasonable manner: here, though, a solution that includes Alaska and Hawaii makes some significant changes to the full arrangement and vastly increases the percentage of empty space, which tends to introduce odd decisions (like interposing Alaska between Oregon and Nevada).

HilbertCurveWithAK


I suspect there are ways of optimizing the Hilbert curve, or some similar fractal path, so that it better maps onto actual geographic spaces. That seems like an interesting avenue, potentially: but the initial results here seem worse, not better, than traveling salesman approximations.

Conclusions and Deus ex Machina

So on this particular set, the best results seem to come from, in descending order,

1) Reordered hierarchical clustering

2) Traveling Salesperson solutions

3) Fractal Curves

4) Quasi-linear dimensionality reduction (east-to-west, multi-dimensional scaling, etc).

For the general problem (European countries, say, or counties in a state) I’d probably start with reordered hierarchical clustering or TSP solutions, at least until I learn how to better fit a fractal curve to an arbitrary space.

But for this particular problem, I’ve got an ace in the hole: there are conventional orderings of states that provide an acid test. In particular, we want something that matches the census regions.

The ordering inside census regions is arbitrary, just like our clustering diagrams. So the best possible solution that includes some knowledge about the intrinsically real regions of the United States (the midwest, the south, etc.) is to combine the census regions with the optimal-dendrogram measures.

Putting phony clusters just from the census regions looks like this:

Phony Cluster

I can just plug those into a dummy distance matrix so that group membership trumps any other sorts of distance: and then allow geographical distance to sort out the spinning of those trees into a more sensible order.
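The dummy matrix can be as blunt as adding a huge constant to every cross-region pair; a sketch with a handful of states and made-up region labels:

```python
from math import dist

# Toy region labels and rough coordinates, for illustration only.
region = {"ME": "NE", "NH": "NE", "FL": "South", "GA": "South"}
coords = {"ME": (45.3, -69.0), "NH": (43.7, -71.6),
          "FL": (28.6, -81.5), "GA": (32.7, -83.4)}
PENALTY = 1000.0  # dwarfs any real geographic distance

def constrained_distance(a, b):
    # Group membership trumps geography: cross-region pairs get the
    # penalty, so clustering keeps regions intact; real distance still
    # sorts states within each group.
    d = dist(coords[a], coords[b])
    if region[a] != region[b]:
        d += PENALTY
    return d

print(round(constrained_distance("ME", "NH"), 2))  # small: same region
print(round(constrained_distance("ME", "FL"), 2))  # huge: crosses regions
```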

So, adding the constraint that census divisions and regions be kept intact, the optimal ordering looks like this: starting in Maine, traveling through the South west to Texas, skipping to the upper Midwest and then taking the same route west through the plains and mountains as the dendrogram:

Census Clustering

Is this the perfect ordering? To my mind, it’s not: but the flaws come straight from the census, not from the algorithm. West Virginia should not be in the coastal South; it should be in the same division as Kentucky. The leap from Oklahoma to Wisconsin is unfortunate, and so is the one from Florida to Kentucky. Still, the census-regions constraint is quite nice to have. And unlike the unguided paths, it preserves all but one of what I intuitively think of as the essential pairings: the Dakotas, the Carolinas, Alabama-Mississippi, Vermont-New Hampshire, Kansas-Nebraska, Colorado-Wyoming.

So, let’s return to the original visualization to see what this new ordering helps us see. Remember, the original version revealed only after some serious axis-reading that the South started using “Kevin” later.



Screen Shot 2014-06-04 at 3.00.44 PM

Here it is with the census-based ordering. (For a version you can hover over, see the browser itself.) The southern states, two-thirds of the way down the page, clearly do begin later: but now it’s also immediately evident which of them don’t lag as much. There are also several patterns that are immediately evident here but remain completely obscure in an alphabetical ordering: usage of “Kevin” is significantly higher around 1990 in the northeast, particularly the mid-Atlantic, than it is in the rest of the country. And while the South waits the longest, a lag in the Arizona-New Mexico pairing is also clear.

Screen Shot 2014-06-05 at 3.52.16 PM


This style of display also makes subtler patterns visible. “Jennifer,” for example, rises a year later in the South than elsewhere. That would be lost as visual noise in an alphabetical ordering, but is completely clear here.

Is a geographical ordering the best? Not always. Take “Madison“: its rise shows striped bands that don’t seem to be regional. Illinois, New Jersey, Washington DC, and New Mexico all avoid the wave. In fact, if you look closer, this is clearly a racial thing: “Madison” was most popular in states with overwhelmingly white populations. (Except Wisconsin, it seems). And aside from the bend through the southwest, there aren’t a whole lot of largely-minority states in any contiguous curve.

Screen Shot 2014-06-05 at 4.04.51 PM


But on another level, that just further points out the usefulness of starting with some sensible ordering.

Bleg 1: String Distance

String distance measurements are useful for cleaning up the sort of messy data that comes from multiple sources. There are a bunch of string distance algorithms, and they usually rely on some form of calculation about the similarity of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalizes changes to the most information-laden parts of the text more heavily than changes to the parts that are filler.

Real-world example: say you’re trying to match two lists of universities to each other. In one you have:

[500 university names…]
Rutgers the State University of New Jersey

and in the other you have:

[499 university names…]
Rutgers University
New Hampshire State University

By most string distance measures, ‘State University’ and ‘New’ will make the long version of Rutgers match New Hampshire State, not Rutgers. But in the context of those 500 other names, that’s not the correct match to make. The phrase “State University” actually conveys very little information (I’d guess fewer than 8 bits), but “R-u-t-g-e-r-s” is a string you should lose lots of points for changing. (Rough guess: 14 bits.)

In practice, I often get around this by changing the string vocabulary by hand (change all occurrences of “University” to “Uni”, etc.). I can imagine a few ways to solve this: e.g., normalized compression distance starting from a file of everything, or calculating a standard string distance metric on a compressed version of the names instead of the English version. But I feel like this must already exist, and my Internet searches just won’t find it.
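Lacking that, here’s a sketch of one version of the idea: weight each token by its information content across the whole list, and score candidates by the total weight of the tokens the two names *don’t* share. The tiny name list below stands in for the 500-name one:

```python
import math
from collections import Counter

# A made-up stand-in for the full university list.
names = [
    "Rutgers University",
    "New Hampshire State University",
    "Michigan State University",
    "Ohio State University",
    "University of New Mexico",
]

counts = Counter(tok for name in names for tok in name.lower().split())
total = sum(counts.values())

def weight(tok):
    # Information content in bits; tokens never seen in the list get
    # the maximum surprise value.
    return -math.log2(counts.get(tok, 0.5) / total)

def weighted_distance(a, b):
    # Total weight of tokens appearing in exactly one of the two names:
    # dropping "Rutgers" is expensive, dropping "State" is cheap.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return sum(weight(t) for t in ta ^ tb)

query = "Rutgers the State University of New Jersey"
best = min(names, key=lambda n: weighted_distance(query, n))
print(best)  # Rutgers University
```

Even in this tiny list, losing the 4-bit “rutgers” (plus the off-list “hampshire”) costs more than losing the common “state” and “new”, so the match comes out right where a character-level measure goes wrong.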