In-browser text classification using Bookworm.

This page automatically classifies a snippet of text (pasted into the text area below) against a Bookworm database, so you can see how any given snippet lines up with the metadata you've defined for a collection.

The classification method is about as simple as it gets: for every word, we see how often it's used by each category (here, the categories are authors). Those usage percentages can be thought of as probabilities, and as we churn through the text the probability for each category is updated in real time. In this kind of model any individual word is quite unlikely, but a given word is more unlikely under some categories than under others, and those differences add up over the text. The x axis is expressed in log likelihood per token. The process has been slowed down so you can see the words as they are tested.
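As a rough illustration of the scoring described above, here is a minimal Python sketch. It is not the code this page runs: the function name, the toy word counts, and the add-one smoothing (used so that a word unseen in a category doesn't produce log(0)) are all assumptions made for the example.

```python
import math
from collections import Counter


def running_log_likelihood(tokens, counts_by_category, smoothing=1.0):
    """Yield (token, scores) after each token, where scores maps each
    category to its mean log likelihood per token so far.

    The smoothing term is an assumption of this sketch, not something
    taken from the page's own code.
    """
    vocab = set(tokens)
    for counts in counts_by_category.values():
        vocab.update(counts)
    totals = {c: sum(counts.values()) for c, counts in counts_by_category.items()}
    sums = {c: 0.0 for c in counts_by_category}
    for n, token in enumerate(tokens, start=1):
        for c, counts in counts_by_category.items():
            # Share of the category's words that are this token, smoothed.
            p = (counts.get(token, 0) + smoothing) / (totals[c] + smoothing * len(vocab))
            sums[c] += math.log(p)
        yield token, {c: s / n for c, s in sums.items()}


# Toy counts, invented purely for illustration (not real Federalist data).
counts = {
    "HAMILTON": Counter({"upon": 30, "the": 900, "to": 400, "by": 50}),
    "MADISON": Counter({"upon": 2, "the": 950, "to": 380, "by": 90}),
}
snippet = "the by the to by".split()

history = list(running_log_likelihood(snippet, counts))
final_scores = history[-1][1]
best = max(final_scores, key=final_scores.get)
print(best, final_scores)
```

With these made-up counts, the higher rate of "by" tips the snippet toward MADISON. Every score is negative because each word's probability is below one, so the "most likely" category is simply the one closest to zero.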

For the demonstration here, you're classifying one of the disputed Federalist Papers (no. 57) by author in a dataset of the Federalist Papers. (Thanks to Matt Jockers for the texts, originally created for Jockers, Matthew L. and Daniela M. Witten. “A Comparative Study of Machine Learning Methods for Authorship Attribution.” Literary and Linguistic Computing 25.2 (2010): 215-224.) As each word is checked against the rest of the Federalist Papers, it's removed from the chart. A filter has been set to use only stopwords: the red words are used for classification, and the green words are being skipped. Mosteller and Wallace showed that stopword-based classification works well for the Federalist, and indeed, you'll see that it matches quite well for most of the disputed papers, or paragraphs from them. Most scholars believe that the disputed papers were written by Madison. (Find some other Federalist Papers online to check them.)

If you have a Bookworm installation of your own, you can easily modify the code here to classify by whatever text variables you might have on hand; I've had good luck with books by genre, for instance. By changing the filter function in the code, you could use only long words, or words matching some other condition. You'll also notice that here "disputed" seems the most likely category, because this is itself a disputed paper: for a more serious classification exercise, you'd want to exclude the test subject when designing the query.
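To make the idea of a swappable filter concrete, here is a small Python sketch. It is only an illustration of the approach: the page's actual filter function is not reproduced here, and the names `stopwords_only`, `long_words_only`, and `split_by_filter`, along with the tiny stopword list, are hypothetical.

```python
# A deliberately tiny stopword list, for illustration only.
STOPWORDS = frozenset({"the", "of", "to", "and", "by", "upon", "on", "in", "a", "an"})


def stopwords_only(word):
    """Keep only very common function words."""
    return word.lower() in STOPWORDS


def long_words_only(word, min_length=7):
    """Alternative condition: keep only words of at least min_length letters."""
    return len(word) >= min_length


def split_by_filter(tokens, keep):
    """Return (used, skipped): words passed to the classifier and words ignored."""
    used = [t for t in tokens if keep(t)]
    skipped = [t for t in tokens if not keep(t)]
    return used, skipped


tokens = "The people of America upon reflection".split()

used, skipped = split_by_filter(tokens, stopwords_only)
print(used)     # ['The', 'of', 'upon']
print(skipped)  # ['people', 'America', 'reflection']

long_used, long_skipped = split_by_filter(tokens, long_words_only)
print(long_used)  # ['America', 'reflection']
```

Because the filter is just a function from a word to true or false, switching from stopwords to long words (or any other condition) means changing one argument rather than the classifier itself.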

API call: