Genre Classification from Topic Models

Sep 13 2015

This post is mostly just playing around in code rather than making any particular argument. It shows the outlines of using the features stored in a Bookworm for all sorts of machine learning, by testing how well a logistic regression classifier can predict IMDB genre from the subtitles of television episodes.

Jon “Fitz” Fitzgerald was asking me for an example of training a genre classifier on textual data. To reduce the dimensionality of the model, we have been thinking of using topics from a topic model as the features instead of the tokens themselves. The idea is that classifiers with more than several dozen variables tend to get finicky and hard to interpret, and with more than a few hundred they become completely unmanageable. If you want to classify texts based on their vocabularies, you have two choices:

  1. Only use some of the words as classifiers. This is the normal approach, used from Mosteller and Wallace on the Federalist papers through to Ted Underwood’s work on classifying genre in books.

  2. Aggregate the words somehow.

    You could think of stemming and case-normalization, frequently used when the word set is limited, as a first step of aggregation.

    The best way, from an information-theoretic point of view, is to use the first several principal components of the term-document matrix as your aggregators (sketched just below this list). This is hard, though, because principal component vectors are hard to interpret, and on very large corpora (for which the term-document matrix doesn’t fit in memory) they are somewhat tedious to calculate.
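
Here is roughly what that looks like in R for a corpus small enough to fit in memory; `dtm` is a hypothetical document-term matrix (one row per document, one column per word), not anything from the Bookworm data used below.

  # Keep the first 20 principal components of a (hypothetical) document-term matrix.
  pca = prcomp(dtm, center = TRUE)
  doc_features = predict(pca)[, 1:20]
  # For corpora where dtm won't fit in memory, a truncated sparse SVD
  # (the irlba package, for instance) is the usual workaround.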

Using topic models as classifiers is somewhat appealing. They should be worse at classification than principal components, but they should also be readable like words, to some degree. I haven’t seen it done that much because there are some obvious problems: topic models are time-consuming to fit, and they usually throw out stopwords, which tend to be extremely successful at classification problems. That’s where a system like Bookworm, which will just add in a one-size-fits-all topic model with a single command, can help; it lets you try loading in a pre-computed model to see what works.

So this post just walks through some of the problems with genre classification in a corpus of 44,000 television episodes and a pre-fit topic model. I don’t compare it directly to existing methods, in large part because it quickly becomes clear that “IMDB genre” is such a flexible thing that it’s all but impossible to assess whether a classifier is working on anything but a subjective level. But I do include all of the code for anyone who wants to try fitting something else.

Code, descriptions, and charts

Note: all the code below assumes the libraries dplyr, bookworm, and tidyr are loaded.
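
The `movies` data frame used below comes out of a Bookworm query that isn’t reproduced here; it would look something along these lines (the exact groups and counttype are reconstructed from the columns used later, so treat this as a sketch rather than the real call):

  # Per-episode words per million in each topic of the pre-fit topic model.
  # The asterisk before topic_label asks for counts across every topic while
  # still breaking them out by the episode metadata.
  movies = bookworm(database="screenworm",
                    groups=list("TV_show","primary_genre","season","episode","*topic_label"),
                    search_limits=list(medium="TV show"),
                    counttype="WordsPerMillion",
                    host="benschmidt.org")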

First we make the data wide (columns as topic labels). That gives us 127 topics across 44,258 episodes of television, each tagged with a genre by IMDB.


  wide = movies %>% spread(topic_label,WordsPerMillion,fill=0)

Now we’ll train a model. We’re going to do logistic regression (in R, a glm with family=binomial), but I’ll define a more general function that can take an svm or something more exotic for testing.


  # Our feature set is a matrix without the categorical variables and a junk variable getting introduced somehow.
modeling_matrix = wide %>% select(-TV_show,-primary_genre,-season,-episode,-`0`) %>% as.matrix
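# Random, roughly 50/50 split of episodes into training and test sets.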
training = sample(c(TRUE,FALSE),nrow(modeling_matrix),replace=T)

dim(modeling_matrix)

  ## [1] 44258   127

  training_frame = data.frame(modeling_matrix[training,])
training_frame$match = NA
build_model = function(genre,model_function=glm,...) {
  # genre is a string indicating one of the primary_genre fields;
  # model function is something like "glm" or "svm";
  # ... are any further arguments passed to that function.
  training_frame$match=as.numeric(wide$primary_genre == genre)[training]
  # we model against a matrix: the columns are the topics, which we get by dropping out the other four elements
  model = model_function(match ~ ., training_frame,...)
}
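
Since `model_function` is just an argument, a different classifier can be dropped in the same way; for example, an SVM from the e1071 package (a sketch only, not run anywhere in this post):

  # Requires the e1071 package; probability=TRUE is passed through to svm().
  library(e1071)
  comedy_svm = build_model("Comedy", model_function = svm, probability = TRUE)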

Here’s a plot of the top genres. I’ll model on the first seven, because there’s a nice break before game show, reality, and fantasy.


  library(ggplot2)

wide %>% filter(training) %>% group_by(primary_genre) %>% summarize(episodes=n()) %>% mutate(rank=rank(-episodes)) %>% arrange(rank) %>% ggplot() + geom_bar(aes(y=episodes,x=reorder(primary_genre,episodes),fill=rank<=7),stat="identity") + coord_flip() + labs(title="most common genres, by number of episodes in training set")

Actually building the models: I use lapply to build one model for each of the seven genres; then I run the predictions for each and pull them together into a data frame, even though that’s a little wonky.

The random logical vector “training”, defined above, splits the episodes into a training set and a testing set.


  top_genres = wide %>% group_by(primary_genre) %>% summarize(episodes=n()) %>% mutate(rank=rank(-episodes)) %>% arrange(rank) %>% slice(1:7) %>% select(primary_genre) %>% unlist

top_genres

  ## primary_genre1 primary_genre2 primary_genre3 primary_genre4 primary_genre5 
##       "Comedy"       "Action"        "Drama"        "Crime"    "Animation" 
## primary_genre6 primary_genre7 
##    "Adventure"  "Documentary"

  # Use lapply to create a list of one logit model for each of the seven genres
models = lapply(top_genres,build_model,glm,family=binomial)

  ## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

  # Generate the predictions from each model. type="response" is particular to binomial models
# The predictions are made on the held-out test set (!training).
predictions = lapply(models,predict,newdata = data.frame(modeling_matrix[!training,]),type="response")

# Create a matrix of predictions that we can pull into a data.frame
predictions_frame = do.call(cbind,predictions) %>% as.data.frame
names(predictions_frame) = top_genres
# These align with the first four columns of the wide matrix, so we can toss them into the front.

predictions_frame = cbind(wide %>% select(TV_show,primary_genre,season,episode) %>% filter(!training),predictions_frame)

To see how that’s working, we can inspect the predictions.


  set.seed(1)
predictions_frame %>% sample_n(6)

  ##                         TV_show primary_genre season episode     Comedy
## 11775             Falling Skies        Action      3       3 0.03436976
## 16561     How I Met Your Mother        Comedy      2      10 0.85780612
## 25412 Only Fools and Horses....        Comedy      5       7 0.69376495
## 40202                  The Wire         Crime      1       5 0.17228152
## 9074                   Deadwood         Crime      3       8 0.02565560
## 39762       The Vampire Diaries         Drama      2       5 0.03678272
##            Action        Drama      Crime   Animation   Adventure
## 11775 0.202569167 0.3802495040 0.03092575 0.020564454 0.091366639
## 16561 0.003954963 0.0006445534 0.00815855 0.592317318 0.003896103
## 25412 0.013540870 0.0361088588 0.14748145 0.009966243 0.006062556
## 40202 0.170548665 0.1176822845 0.49326395 0.003152967 0.005204129
## 9074  0.046069422 0.0963970708 0.04870178 0.028939707 0.021441583
## 39762 0.216597790 0.7051110511 0.09713269 0.002394103 0.147156241
##        Documentary
## 11775 2.827572e-05
## 16561 6.103521e-06
## 25412 9.807231e-07
## 40202 2.567087e-06
## 9074  3.039941e-06
## 39762 1.029023e-07

Each of those numbers is a probability. Season 2 episode 10 of “How I Met Your Mother” is 85% likely to be classed as Comedy, but also 59% likely to be Animation; it’s basically zero for anything else. (These numbers don’t sum to 100%, because I’m just running a series of binary classifiers). So for that one the classifier counts as a definite success; the show is a comedy.
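
As a quick check of that, you can look at how far the seven probabilities in each row stray from summing to one:

  # The seven per-genre probabilities in each row need not sum to 1.
  predictions_frame %>% select(Comedy:Documentary) %>% rowSums() %>% summary()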

“Falling Skies,” on the other hand, is listed in IMDB as “Action”; while that is a 20% probability by the classifiers, the models think that Drama (38%) is more likely.

“The Wire” is unambiguously correctly classed as Crime; “Deadwood” is mistakenly called Drama and not Crime. This raises the obvious question of just how “wrong” the classifier can be; “Crime” is not quite so distinct from “Drama” as a classification model would like. Technically the algorithm would fail by calling Ice Road Truckers a documentary and Prison Break a crime show, something it does fairly often. In real life, though, I’d say those tags are better than the “Adventure” and “Action” tags they get in the database.

I should note that I’m using the first genre tag for each episode, but in fact they may have more than one in IMDB. The classifier would probably work better with that extra information; the example case Fitz and I were talking about was with mutually exclusive classifiers, so I’ve stuck to that here.

Aggregate Predictions

Now I’ll use dplyr to look at the aggregate predictions across the whole test set. We’ll first assume that the top probability is the prediction for each episode, and plot the confusion matrix as a bar chart.


  tidied = predictions_frame %>% gather("classified_genre","probability",-TV_show,-primary_genre,-season,-episode)

best_guesses = tidied %>% group_by(TV_show,season,episode) %>% 
  arrange(-probability) %>% slice(1) %>% # (Only take the top probability for each episode)
  mutate(actual_genre=primary_genre)

confusion = best_guesses %>% group_by(actual_genre,classified_genre) %>% summarize(`Number of episodes`=n())

ggplot(confusion) + geom_bar(stat="identity") + aes(x=actual_genre,y=`Number of episodes`,fill=classified_genre) + coord_flip()

What that’s saying: comedy is usually classified correctly. Action is, too, though it’s sometimes mistaken for crime. Drama is tricky; it’s often mistaken for comedy or action. Animation is often mistaken for comedy, which makes sense solely based on dialogue; the distinction between adventure and action is also poorly understood. Adding some interaction terms might help.
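
The same counts can also be read as a straight cross-tabulation of IMDB genre against the classifier’s best guess, using the `confusion` frame built above:

  # Rows are the IMDB genre, columns are the classifier's best guess.
  confusion %>% spread(classified_genre, `Number of episodes`, fill = 0)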

One way to understand this better is to look at individual shows: what do typical misclassifications look like?


  confusion = best_guesses %>% group_by(TV_show) %>% mutate(`Total Episodes`=n()) %>% filter(actual_genre!=classified_genre) %>% group_by(TV_show,actual_genre,classified_genre) %>% summarize(n_misclassified=n(),`Total Episodes`=`Total Episodes`[1]) %>% ungroup %>% arrange(-n_misclassified)

confusion

  ## Source: local data frame [1,611 x 5]
## 
##              TV_show actual_genre classified_genre n_misclassified
## 1       The Simpsons    Animation           Comedy             103
## 2         CSI: Miami       Action            Crime              80
## 3         Doctor Who    Adventure           Action              80
## 4         Family Guy    Animation           Comedy              79
## 5            CSI: NY       Action            Crime              74
## 6         Smallville    Adventure           Action              54
## 7      One Tree Hill        Drama           Comedy              51
## 8      American Dad!    Animation           Comedy              50
## 9  NCIS: Los Angeles        Crime           Action              50
## 10            Castle       Comedy            Crime              48
## ..               ...          ...              ...             ...
## Variables not shown: Total Episodes (int)

These results are a little more reassuring. Half of the episodes of The Simpsons are “misclassified” as Comedy rather than Animation; that’s great. IMDB marks the CSI series as “Action” shows, but most of the time the classifier counts them as “Crime”; I think that’s a much better classification, myself. Doctor Who should be Sci-Fi, not either Action or Adventure; IMDB has Comedy first for Castle, but Wikipedia says the classifier is correct and that it’s a “Crime Drama.”

We can go further and ask the classifier what TV shows, out of those it has more than 100 episodes of, it thinks IMDB has wrong; that is, which shows have the highest percentage of their episodes misclassified.


  confusion = best_guesses %>% group_by(TV_show) %>% mutate(`Total Episodes`=n()) %>% filter(actual_genre!=classified_genre,n()>100) %>% group_by(TV_show,actual_genre,classified_genre) %>% summarize(n_misclassified=n(),`Total Episodes`=`Total Episodes`[1]) %>% ungroup %>% mutate(pairing_frequency = n_misclassified/`Total Episodes`) %>% arrange(-pairing_frequency)

confusion

  ## Source: local data frame [40 x 6]
## 
##                           TV_show actual_genre classified_genre
## 1                      CSI: Miami       Action            Crime
## 2                      Family Guy    Animation           Comedy
## 3                      Doctor Who    Adventure           Action
## 4                      Smallville    Adventure           Action
## 5                    The Simpsons    Animation           Comedy
## 6                      Smallville    Adventure            Drama
## 7                      South Park    Animation           Comedy
## 8                  Criminal Minds        Crime           Action
## 9  CSI: Crime Scene Investigation        Crime           Action
## 10                     Doctor Who    Adventure           Comedy
## ..                            ...          ...              ...
## Variables not shown: n_misclassified (int), Total Episodes (int),
##   pairing_frequency (dbl)

A lot of these are adventure and animation; the others are law shows that IMDB doesn’t classify as crime, where the classifier’s label seems OK to me.

OK, now let’s use the classifiers as a kind of information retrieval to find outliers. Here’s a question: what are some episodes that are dramatically different, in some classification, from the other episodes of their show? We can operationalize that as: how many standard deviations above its show’s average is an episode on each classifier?


  tidied %>% group_by(TV_show) %>% filter(n()>3) %>% group_by(TV_show,classified_genre) %>% mutate(deviation =scale(probability) %>% as.numeric) %>% ungroup %>% arrange(-deviation) %>% filter(probability > .1) 

  ## Source: local data frame [44,523 x 7]
## 
##                                       TV_show primary_genre season episode
## 1                                The Simpsons     Animation     17       3
## 2                                          ER         Drama     15      10
## 3              CSI: Crime Scene Investigation         Crime      8       5
## 4                                The Simpsons     Animation     12       1
## 5  NCIS: Naval Criminal Investigative Service        Action      3      14
## 6                               Stargate SG-1        Action      2       6
## 7                                     CSI: NY        Action      2      11
## 8                     Everybody Loves Raymond        Comedy      1       9
## 9                                The Simpsons     Animation     17       3
## 10                                The X Files         Drama      3      12
## ..                                        ...           ...    ...     ...
## Variables not shown: classified_genre (fctr), probability (dbl), deviation
##   (dbl)

The results are mostly just odd, but I haven’t seen the episodes well enough to be sure. Season 12, episode 1 of The Simpsons is 23% likely to be Adventure, 10 standard deviations above the norm; it’s one of the atypical “Treehouse of Horror” episodes. The abnormally “Crime”-classified episode is about a lawsuit against Mr. Burns; the ER episode classified as Animation is a Christmas episode.

The rest seems mostly like noise: I haven’t checked them all, but there’s little evidence from any of the others that these are particularly genre-bending episodes.

Higher dimensionality reduction

Well, that’s more than enough for now: the point here was just to test logit regression on topic models in Bookworm, and that seems to work reasonably well with enough data.

One last question is whether we can reduce the dimensionality a lot further by running PCA on the topics.


  modeling_matrix_smaller = predict(prcomp(modeling_matrix))[,1:20]

modeling_frame = as.data.frame(modeling_matrix_smaller[training,])

build_model_smaller = function(genre,model_function=glm,...) {
  # genre is a string indicating one of the primary_genre fields;
  # model function is something like "glm" or "svm";
  # ... are any further arguments passed to that function.
  matches_genre_logical = as.numeric(wide$primary_genre == genre)
  # here the feature columns are the first 20 principal components rather than the raw topics
  modeling_frame$match = matches_genre_logical[training]
  model = model_function(match ~ .,modeling_frame,...)
}

models_smaller = lapply(top_genres,build_model_smaller,glm,family=binomial)

# Generate the predictions from each model. type="response" is particular to binomial models
predictions_smaller = lapply(models_smaller,predict,newdata=data.frame(modeling_matrix_smaller[!training,]),type="response")

# Create a matrix of predictions that we can pull into a data.frame
predictions_frame_smaller = do.call(cbind,predictions_smaller) %>% as.data.frame
names(predictions_frame_smaller) = top_genres

# These align with the first four columns of the wide matrix, so we can toss them into the front.

predictions_frame_smaller = cbind(wide %>% filter(!training) %>% select(TV_show,primary_genre,season,episode),predictions_frame_smaller)

tidied_smaller = predictions_frame_smaller %>% gather("classified_genre","probability",-TV_show,-primary_genre,-season,-episode)

best_guesses_smaller = tidied_smaller %>% group_by(TV_show,season,episode) %>% 
  arrange(-probability) %>% slice(1) %>% # (Only take the top probability for each episode)
  mutate(actual_genre=primary_genre)

left = best_guesses_smaller %>% select(TV_show,season,episode,actual_genre,classified_genre)
right = best_guesses %>% select(TV_show,season,episode,actual_genre,classified_genre)

joined = left %>% inner_join(right,by =c("TV_show","season","episode","actual_genre"))
dim(joined)

  ## [1] 22048     6

  joined %>% mutate(different = classified_genre.x != classified_genre.y,smaller_was_right=classified_genre.x==actual_genre,bigger_was_right=classified_genre.y==actual_genre) %>% count(different,smaller_was_right,bigger_was_right)

  ## Source: local data frame [5 x 4]
## Groups: different, smaller_was_right
## 
##   different smaller_was_right bigger_was_right     n
## 1     FALSE             FALSE            FALSE  4882
## 2     FALSE              TRUE             TRUE 12199
## 3      TRUE             FALSE            FALSE  1299
## 4      TRUE             FALSE             TRUE  2703
## 5      TRUE              TRUE            FALSE   965

  joined %>% mutate(different = classified_genre.x != classified_genre.y,smaller_was_right=classified_genre.x==actual_genre,bigger_was_right=classified_genre.y==actual_genre) %>% ungroup() %>% summarize(smaller_hit_rate = sum(smaller_was_right)/n(),bigger_hit_rate = sum(bigger_was_right)/n())

  ## Source: local data frame [1 x 2]
## 
##   smaller_hit_rate bigger_hit_rate
## 1         0.597061        0.675889

The overall accuracy of the full 127-topic model is about 67.6 percent; the overall accuracy of the 20-dimensional model is 59.7 percent.

A 60-dimensional model (not shown in the code) is right about 65.3% of the time.
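
That 60-dimensional run is the same pipeline with a single line changed:

  # Keep 60 principal components instead of 20; everything downstream
  # (build_model_smaller, the predictions, the joins) stays the same.
  modeling_matrix_smaller = predict(prcomp(modeling_matrix))[,1:60]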

Historicizing Genre

To follow some of Ted Underwood’s work on genre in fiction, we could see whether the genre classifier has any historical bias. Are the genres more distinct from each other in the 1980s than in the 2010s?


  metadata = bookworm(database="screenworm",
                  # We only need the per-episode metadata here (especially MovieYear); the WordCount column gets dropped right after.
                  groups=list("TV_show","MovieYear","primary_genre","season","episode"),
                  search_limits=list(medium="TV show"),
                  counttype="WordCount",
                  host="benschmidt.org") %>% select(-WordCount)


withYears = metadata %>% inner_join(tidied)

Here are the misclassification rates for the seven largest genres. “Adventure” is almost always misclassified (usually as Action).


  misclassifieds = best_guesses %>% inner_join(metadata) %>% mutate(period = floor(MovieYear/5)*5) %>% 
  filter(MovieYear >= 1970) %>% group_by(period,actual_genre) %>% summarize(misclassification_rate=sum(actual_genre!=classified_genre)/n(),total_shows=n()) 

  misclassifieds %>% filter(actual_genre %in% unique(best_guesses$classified_genre)) %>% ggplot() + geom_line(aes(x=period,y=misclassification_rate)) + facet_wrap(~actual_genre)

The operational question here is probably: is genre becoming blurred in the era when scripted television is no longer just a network TV phenomenon?

The chart above doesn’t tell us that much, in large part because I think the later periods are those for which the most data exists. Animation becomes easier to classify after 1990 because most later animation that makes it to subtitles is like The Simpsons. But there are some interesting signs in the big four genres: Comedy, Crime, Drama, and Action.

All four are easiest to classify in the 1990s; for each of them, the misclassification rate in 2010 is higher than in 2005, and 2005 is higher than 2000.
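
That claim can be checked directly against the `misclassifieds` table:

  # Period with the lowest misclassification rate for each of the big four genres.
  misclassifieds %>%
    filter(actual_genre %in% c("Comedy","Crime","Drama","Action")) %>%
    group_by(actual_genre) %>%
    slice(which.min(misclassification_rate))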


  misclassifieds %>% filter(actual_genre %in% c("Comedy","Crime","Drama","Action")) %>% ggplot() + geom_line(aes(x=period,y=misclassification_rate)) + facet_wrap(~actual_genre)


  misclassifieds %>% filter(actual_genre %in% c("Comedy","Crime","Drama","Action")) %>% ggplot() + geom_line(aes(x=period,y=total_shows)) + facet_wrap(~actual_genre) + labs(title="Total number of episodes in each genre, 1970-2010,\nby five-year period.")

How are the models working?

But one of the most appealing things about topics as classifiers is that they make it easier to see how the models are actually working. We can plot, for instance, the coefficients for the fifteen most important topics in each classifier we’ve built.


  names(models)

  ## [1] "primary_genre1" "primary_genre2" "primary_genre3" "primary_genre4"
## [5] "primary_genre5" "primary_genre6" "primary_genre7"

  top_predictors = lapply(1:length(top_genres),function(n,return_length=15) {
  comedy_model = models[n][[1]]
  using = (rank((comedy_model$coefficients))<=(return_length/2)) | (rank(-comedy_model$coefficients)<=(return_length/2))
  coefficients = data.frame(genre = top_genres[n],topic=names(comedy_model$coefficients[using]) %>% gsub("modeling_matrix","",.),strength = comedy_model$coefficients[using],row.names = NULL)
  coefficients
}) %>% rbind_all

  ggplot(top_predictors %>% filter(topic!="(Intercept)")) + geom_point(aes(x=strength,y=topic,color=strength>0)) + facet_wrap(~genre,scales="free",ncol=2)

The crummy “Adventure” classifier, for instance, is mostly dominated by negative weights, but places some positive weight on topics about “truth” or “belief.” The comedy scores go up when the conversation turns to coffee and tea, to bathrooms, to hair and teeth.
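
You can also pull the coefficients straight out of a single model; the Comedy classifier, for instance, is the first element of the list (per the names() output above):

  # Largest positive coefficients in the Comedy model.
  models[["primary_genre1"]]$coefficients %>% sort(decreasing = TRUE) %>% head(8)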

Conclusions

This ain’t scientific. I should be testing the classifiers on more rigorously held-out data that isn’t just chosen randomly; it would probably be best to exclude entire shows, so that a word like “Springfield” doesn’t become, in effect, a marker for animation.
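
Something like a show-level split would do it; a minimal sketch, using only the `wide` frame from above:

  # Hold out entire shows instead of random episodes, so show-specific
  # vocabulary can't leak from training into testing.
  set.seed(1)
  held_out_shows = wide %>% distinct(TV_show) %>% sample_frac(0.5) %>% unlist
  training_by_show = !(wide$TV_show %in% held_out_shows)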