The Simpsons Bookworm

I thought it would be worth documenting the difficulty (or lack thereof) of building a Bookworm on a small corpus: I’ve been reading too much lately about the Simpsons thanks to the FX marathon, so I figured I’d spend a couple of hours making it possible to check for changing language in the longest-running TV show of all time. For some thoughts on how to build a bookworm, read “prep”; otherwise, skip to the analysis.

Finding the best ordering for states

Here’s a very technical, but kind of fun, problem: what’s the optimal order for a list of geographical elements, like the states of the USA? If you’re just here from the future and don’t care about the details, here’s my favorite answer right now: ["HI","AK","WA","OR","CA","AZ","NM","CO","WY","UT","NV","ID","MT","ND","SD","NE","KS","IA","MN","MO","OH","MI","IN","IL","WI","OK","AR","TX","LA","MS","AL","TN","KY","GA","FL","SC","WV","NC","VA","MD","DE","PA","NJ","NY","CT","RI","MA","NH","VT","ME"]. But why would you want an ordering at all? Here’s an example: in the baby name bookworm, if you search for a name, you can see the interaction of states and years.
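One simple way to get an ordering like this is to treat it as a traveling-salesman-style problem and run a greedy nearest-neighbor pass over state centroids. Here’s a minimal sketch of that idea; the centroid coordinates below are rough approximations I’ve supplied for a handful of western states, not data from any particular source, and a real run would use all fifty.

```python
import math

# Rough (lat, lon) centroids for a few western states -- approximate
# values supplied for illustration only.
centroids = {
    "WA": (47.4, -120.5), "OR": (44.0, -120.5), "CA": (37.2, -119.3),
    "NV": (39.3, -116.6), "AZ": (34.3, -111.7), "UT": (39.3, -111.7),
    "ID": (44.4, -114.6),
}

def dist(a, b):
    """Crude planar distance between two state centroids."""
    (la1, lo1), (la2, lo2) = centroids[a], centroids[b]
    return math.hypot(la1 - la2, lo1 - lo2)

def greedy_order(start):
    """Nearest-neighbor tour: always hop to the closest unvisited state."""
    remaining = set(centroids) - {start}
    order = [start]
    while remaining:
        nxt = min(remaining, key=lambda s: dist(order[-1], s))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(greedy_order("WA"))
```

Nearest-neighbor won’t find the globally shortest tour, but for fifty states it gets you most of the way to an ordering where adjacent entries are geographically adjacent, which is what matters for reading a heatmap.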

Bleg 1: String Distance

String distance measurements are useful for cleaning up the sort of messy data that comes from multiple sources. There are a bunch of string distance algorithms, most of which rely on some calculation of the similarity between characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalizes changes to the most information-laden parts of the text more heavily than changes to the filler.
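The classic character-level measure is Levenshtein distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch makes the limitation obvious: every character costs the same, whether it sits in a meaningful word or in filler.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence,
    keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A measure of the kind described above might instead tokenize both strings and weight each edit by the token’s rarity in the corpus, so that changing “Smith” to “Smyth” costs more than dropping a “the”.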