You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Screen time!

Sep 15 2014

Here's a very fun and, for some purposes, perhaps a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, encompassing together over 600 million words. (Go follow that link if you want to investigate yourself.)

I've been thinking about doing this for years, but some of the interest in my recent Simpsons browser and some leaps and bounds in the Bookworm platform have spurred me to finally lay it out. This comes from a very large collection of closed captions/subtitles from the website opensubtitles.org; thanks very much to them for providing a bulk download.

Just as a set of line charts, this provides a nice window into changing language. I've been interested in the "need to"/"ought to" shift since I wrote about it in Mad Men: it's quite clear in the subtitle corpus, and the ratio is much higher as of 2014 than anything Ngrams can show.
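For anyone who wants to check a ratio like this against their own pile of subtitle text, here is a minimal sketch in Python. It is not how Bookworm computes its charts; the function name `phrase_ratio_by_year` and the `(year, text)` input shape are assumptions made for illustration, and it uses raw phrase counts rather than per-million-word rates.

```python
import re
from collections import defaultdict


def phrase_ratio_by_year(docs, numerator="need to", denominator="ought to"):
    """Return {year: numerator_count / denominator_count}.

    docs is an iterable of (year, text) pairs (a hypothetical input shape).
    Years in which the denominator phrase never appears are left out.
    """
    num_re = re.compile(r"\b" + re.escape(numerator) + r"\b", re.IGNORECASE)
    den_re = re.compile(r"\b" + re.escape(denominator) + r"\b", re.IGNORECASE)
    num_counts = defaultdict(int)
    den_counts = defaultdict(int)
    for year, text in docs:
        num_counts[year] += len(num_re.findall(text))
        den_counts[year] += len(den_re.findall(text))
    return {
        year: num_counts[year] / den_counts[year]
        for year in sorted(den_counts)
        if den_counts[year] > 0
    }
```

A rising value over the years would correspond to the "need to" line pulling away from the "ought to" line in the chart.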

On the other hand, you know what's going down? Any discussion of "global warming." This is probably worth a longer treatment just on its own, to trace out exactly why and when this happened. (It's a phenomenon not just true in movies and TV, but also newspapers and TV news; but it's worth thinking about whether it's a culture-wide failing, or just certain sections of the media.)

Like any Bookworm browser, the metadata underneath is just as important as the size of the corpus. Thanks to IMDB.org, there is stellar movie metadata, much more consistent and in-depth than for books. You can see one example even in that chart above, actually: I've limited it to movies that were originally written in English. Movies in other languages don't show the same shift, because in most cases the translations were done after 1990: so there's no past where we translate the characters in Tokyo Story, say, as saying "ought to", even though in some weird sense it might be more accurate. There is, though, an increase after about 1998, because the epoch of DVD subtitle translations is itself long enough that we can see the shift in language underway.

But original language is only scratching the surface, and only really important for linguistic questions.

There's all sorts of other metadata. (And I've only started converting in the IMDB files.) The data has country of filming, so you can compare Hollywood to Bollywood or East Germany to West Germany (although the sample may be a bit small); it has the filming studio, which seems useless to me but might be fun if you really know your lots.

On the biggest level, though, you have movies vs. television; this corpus is about half of each. For a lot of studies, it will make sense to do one or the other. I'll tell you something about "name a word" tools like this: the search logs are often dispiriting, and always deeply profane. It will get worse here, because this is the first Bookworm I've seen that's actually interesting to type them into. The movie-TV swearing curve is actually quite interesting: a steady ascent for films, but a leveling off for TV after HBO's curse-heavy heyday. (One of the reasons for the TV decline may be more and more dross from basic cable showing up, though; yet more to do.)

I haven't added in MPAA ratings yet, but those might generate all sorts of queries: when did "asshole" become OK for PG-13?

Besides medium, I've also ported over a number of other things, including the studio (useful particularly back in the Golden Age, although the list is dominated now by TV production companies). The IMDB textfiles are a strange sort of quasi-relational form (among other things, they don't actually include the IMDB master ID number, which necessitates a lot of workarounds), but if there's anything really useful in them along these lines, I'd love to hear.

This also makes it possible to drill down to the individual shows.

I love Mad Men, but Deadwood is the greatest historical TV show of them all. And it's wildly anachronistic, which is why I've never subjected it to the quantitative Prochronisms treatment. But as it happens, it (and almost every other HBO show) is in the set, and so you can get some nice tidbits, such as that in its second season, Deadwood used the word "fucking" more times than it used the word "of."

People seem to really enjoy the Simpsons browser, so I was a bit tempted to roll out a whole series; one for Seinfeld, one for the Sopranos, until eventually I was putting them out for Everybody Loves Raymond and Gunsmoke, and people started to wonder how much time I have on my hands. But really, any individual TV show is possible just by clicking from the dropdown.*

The movie equivalent of a show is, I guess, the director. So that's here too: behold how Woody Allen abandons "sex" and "death" for "money" starting around 1979. (The all-time low point for "money", in 1977, is Annie Hall.)

And as always, the Bookworm guarantee is that you can get as close to the texts as possible. In this case, that means clicking on the chart will show the movies lying behind the point, and that you can click through to IMDB for more metadata, or to opensubtitles for the original text. (Which I can't redistribute, sorry.)

There are, as in all things like this, various missing elements and omissions.

The SRT files aren't perfect, nor is my parsing of them.

  • Some SRTs list the wrong movie altogether. I've dropped all movies from the silent era because the errors are too many and too confusing back then; I'm sure there are a number from after, as well.

  • There are also problems with my parsing of the files, or text in the subtitles that's not in the movies. In most cases this isn't a big deal, but if you search for "uploaded", "subtitles", or the username of an opensubtitles user, you'll get almost entirely junk. "Internet" itself, a term I know many people will search for immediately, might be slightly infected.

  • Transcriptions are imperfect, may have bad spellings, and may have some eggcorns. In some cases, subtitles not originally in English may have been simply machine-translated.

  • Sometimes character names and onscreen depictions of actions are in the subtitles, and sometimes they aren't.
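To make the junk problem concrete, here is a rough Python sketch of an SRT reader that drops inline markup and skips cues that look like uploader credits. This is not the parser behind the browser; the `JUNK` pattern is a made-up, deliberately short list, and a real one would need tuning by hand against the files.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)
TAG = re.compile(r"<[^>]+>")  # inline markup such as <i>...</i>
# Hypothetical junk markers (an assumption, not the real filter list).
JUNK = re.compile(r"opensubtitles|subtitles by|uploaded by", re.IGNORECASE)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) cues."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        # Locate the timing line; the numeric counter before it is optional.
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line: skip the malformed block
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        body = TAG.sub("", " ".join(lines[i + 1:])).strip()
        if body and not JUNK.search(body):
            cues.append((start, end, body))
    return cues
```

A filter like this only catches credits that announce themselves; wrong-movie files, machine translations, and eggcorns pass straight through.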

The metadata isn't perfect. In particular, about 10,000 episodes/movies aren't linked up to IMDB metadata. These seem, at first glance, to be mostly television shows that don't use episode titles and movies identified by their foreign-language title. At some point I hope to go back and fix this.

Finally (and unfortunately, I think it's necessary to say this explicitly), this is in no way a comprehensive list. It has most of the movies I can think to look for; its coverage of television, particularly before about 2000, is wildly patchy. It has most of the classic shows, and a lot of science fiction shows that aren't classic: but it doesn't have the soaps, for instance, at least that I can see, or many of the most popular '60s and '70s dramas. It's weighted towards what people who know how to capture subtitle files want to watch on DVD. So it's probably particularly bad at answering questions like "when did zombies break into the mainstream," because the mainstreaming of zombies means we have a lot more old zombie movies than we otherwise might.

There's also lots of interesting stuff yet to come.

I haven't yet broken the movies and shows down by time in the episode; that should produce some really interesting stuff when I get an afternoon.
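One way that breakdown might go, sketched in Python: put each subtitle cue into a bin by how far through the running time it starts, then count words per bin. Nothing here exists in the browser yet; `words_by_position`, the `(start, end, text)` cue shape, and the use of the last cue's end time as a stand-in for the running time are all assumptions.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def words_by_position(cues, n_bins=10):
    """Count words per bin of relative running time.

    cues is a list of (start_seconds, end_seconds, text) tuples.
    Returns a list of n_bins Counters; bin 0 is the opening of the episode.
    """
    bins = [Counter() for _ in range(n_bins)]
    if not cues:
        return bins
    # Assumption: the latest cue end time approximates the total running time.
    duration = max(end for _, end, _ in cues)
    if duration <= 0:
        return bins
    for start, _, text in cues:
        # Clamp so a cue starting at the very end still lands in the last bin.
        index = min(int(start / duration * n_bins), n_bins - 1)
        bins[index].update(WORD.findall(text.lower()))
    return bins
```

Using relative rather than absolute time means a 22-minute sitcom and a three-hour epic can be compared on the same axis, at the cost of losing any sense of how long each bin really is.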

There's lots more IMDB metadata to integrate as well.

But for now, let me know if you find anything great. Here's that link again.

* Not any TV show is available, I guess, because I arbitrarily decided to limit it to the top 50. But if there's any missing show you think should be on the list, just let me know.

Comments:

Anonymous - Sep 3, 2014

Fantastic work. Appreciated.

Anonymous - Sep 3, 2014

Great! Very interesting linguistic/philosophical/quantitative :) approach. Thanks

car tv - Oct 3, 2014

Courtesy to DVDs, all the old sitcoms and variety shows can now be watched, remembered and cherished by those that remember when they were the stars of television.

V shows. Care to expand on this?

Anonymous - Mar 5, 2015

Try doing World War 1, World War I, Werld Warre 1, World War 1, Verlde Vorre vun, Global Fight No1, or any other misspelling you can think of. They all come out as a solid million per million words! Same for WWII!

Anonymous - Mar 5, 2015

Uh, neeeever mind, what causes it is three word combinations. Bug much?