Building topic models into Bookworm searches

Lately I’ve been looking at how deeply we could integrate topic models into the underlying Bookworm architecture.

My own chief interest in this, because I tend to be a little wary of topic models in general, is the possibility for Bookworm to act as an internal diagnostic tool for topic models. I don’t think simply plotting a topic by its label, absent any analysis of the underlying token composition of the topic, is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.

But topics also have a lot to offer token-based searching. Watching incoming links to the Bookworm browser, I recently stumbled on this exchange:

[Embedded tweets]

How can I solve this biologist’s problem? (Or, at least, waste more of his time?)

The word-level topic assignments I have on hand are actually really useful for this. (I’m assuming, I should say, that you know both the basics of topic modeling and of the Movie Bookworm.) I can ask the beta Bookworm browser for the top topics associated with each of the words “fly” (top) and “ant” (bottom):
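Concretely, the request behind the charts below looks something like the sketch that follows. (A sketch only: the endpoint path and port are placeholders for a local install, and I’m assuming topic assignments are exposed as an ordinary metadata field named "topic".)

```python
# Rough sketch of the query behind the charts below, assuming the beta
# API exposes word-level topic assignments as a metadata field "topic".
# The URL is a placeholder for a local Bookworm install.
import json
import urllib.parse
import urllib.request

query = {
    "database": "movies",
    "search_limits": {"word": ["fly"]},  # repeat with "ant"
    "groups": ["topic"],                 # group counts by topic assignment
    "counttype": ["WordCount"],          # raw token counts per topic
    "method": "return_json",
}

url = ("http://localhost:8080/cgi-bin/dbbindings.py?query="
       + urllib.parse.quote(json.dumps(query)))
with urllib.request.urlopen(url) as response:
    print(json.load(response))
```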

[Chart: “fly” usage by topic]

[Chart: “ant” usage by topic]

“Fly” is overwhelmingly associated with the topics “boat ship Captain island plane sea water” (airplane flying) and “life day heart eyes world time beautiful” (unclear, but it might be Superman flying). (The skew is even stronger than the chart shows, since I’ve lopped off the right side: there are about 2,200 uses of “fly” in the first topic.)

But “ant” is most used in two clearly animal-related topics: “water animals years fish time food ice” and “dog cat little boy dogs Hey going.” And both of those topics show up for “fly” as well.

So in theory, at least, we can *restrict searches by topic:* rather than put into a Bookworm *every* usage of the word “fly”, we can get only those that seem, statistically, to be used in an animal-heavy context.
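In API terms, that restriction would just be one more search limit. A minimal sketch, with invented topic IDs standing in for whatever the model actually assigns:

```python
# Sketch: restricting a search by topic, treating "topic" as one more
# metadata field. The topic IDs are invented for illustration; the real
# ones would come out of the model (e.g. the two animal-ish topics above).
animal_topics = [17, 42]  # hypothetical

query = {
    "database": "movies",
    "search_limits": {"word": ["fly"], "topic": animal_topics},
    "groups": ["year"],               # a trend over time, say
    "counttype": ["WordsPerMillion"],
}
```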

With an imperfect, 64-topic model on a relatively small corpus like the Movie Bookworm, this is barely worth doing.

[Chart: “ant” in animal topics, per million words in all topics]

[Chart: “fly” in animal topics, per million words in all topics]

And given that “flying” is something that plenty of animals do, the “fly” topic here is probably not all Order Diptera.

But with collections the size of the HathiTrust, this could be worth exploring, particularly with substantially larger models. “Evolution” is one of the basic searches in a few Bookworms, but it’s hard to use, because “evolution” means something completely different in the context of 1830s mathematics as opposed to 1870s biology. A topic model that could make even a stab at segregating out just biological “evolution” would be immensely useful in tracing out Darwinian changes; one that could disentangle military shooting from the interjection “shoot!” might be good for studying slang.

Above all, this might be good at finding words that migrate between meanings in their early uses: most new phrases actually emerge out of some earlier construction, and this would let us try to recover meaning through context.

Hell, it might even have an application in Prochronisms work: given a large, pre-built topic model, any new scripts could be classified against it, their words assigned to topics and tested for their appropriateness as topic-word combinations.

Technical note: the basics of this are pretty easy with the current system. The only issue with incorporating “topic” as a metadata field on the primary browser right now is that the larger corpus it compares against would also be limited by topic. This can be solved with the asterisk syntax that no one uses: {"*topic": [3], "*word": ["fly"]} will ensure both fields are dropped from the comparison corpus, not just one, by specifying the “compare_limits” field manually.
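Spelled out, the two forms of that query might look like this. (A sketch of my reading of the syntax above, not tested against the current codebase.)

```python
# The asterisk form: prefixing both fields ensures both are dropped from
# the comparison corpus, so the denominator is "all words in all topics"
# rather than "all words in topic 3".
query_asterisk = {
    "database": "movies",
    "search_limits": {"*topic": [3], "*word": ["fly"]},
    "groups": ["year"],
    "counttype": ["WordsPerMillion"],
}

# The equivalent explicit form: specify "compare_limits" by hand as an
# empty (unrestricted) set. This is my reading of the note above.
query_explicit = {
    "database": "movies",
    "search_limits": {"topic": [3], "word": ["fly"]},
    "compare_limits": {},
    "groups": ["year"],
    "counttype": ["WordsPerMillion"],
}
```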


8 thoughts on “Building topic models into Bookworm searches”

  1. Ted Underwood

    Ah, it’s like we’re back in the innocent days of 2011 again.

    I think this is a good idea, though a tricky thing to automate on the scale of a tool like Bookworm.

    It’s related to a visualization strategy I developed to address your (important) observation that topic trajectories != the trajectories of component words. Lately I’ve been relying heavily on it. Basically it involves selecting a word or words as a fixed object of inquiry and then mapping their progress through topics, represented as a stacked graph or streamgraph.

    The viz. here is an example:
    https://www.ideals.illinois.edu/bitstream/handle/2142/50034/REP127_05_Underwood.pdf?sequence=2

    There’s also an example using “power” in Andrew Goldstone’s and my article for NLH (page 13):
    https://www.ideals.illinois.edu/bitstream/handle/2142/49323/QuietTransformations.pdf?sequence=2

    Of course in a sense taking a word as the central object of inquiry obviates the point of topic modeling, which is to get beyond individual words and predetermined objects of inquiry. But I find that in practice it’s useful to move back and forth between these approaches. Even when I know what word or concept I’m interested in, an unsupervised strategy like topic modeling can be a valuable way to get a sense of the different contexts where it figures prominently.

    1. ben Post author

      2011 was the day… Having some time/money for Bookworm means I’ll think about this stuff a bit more. (And having this more obscure blog means I’ll post it…) Although this is really more an IR question than a DH one at heart.

      The application you describe is one of the two basic ones I want to support. The other is comparing the token makeup of the same topic across any two arbitrary metadata groupings, probably using the UI components from the main Bookworm site. Sometimes those help debug; other times they’re useful analytically: you can see how battle language is deployed differently by movie, say.

      Automation shouldn’t be hard, except for the art of building a non-lousy model; having some suite of visualizations should actually make that easier. I’ll try to persuade someone else with a Bookworm install to give it a shot. But basically, doing a unigram lookup with a topic filter is exactly the same as doing a bigram lookup.
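      To make that last point concrete, a toy sketch; the schema and names here are invented for illustration, not the actual Bookworm tables:

```python
# Once each token carries a topic assignment, a topic-filtered unigram
# lookup has exactly the shape of a bigram lookup: a filter on two
# columns instead of one. Table and values are illustrative only.
tokens = [
    # (word, topic_assignment, docid)
    ("fly", 12, 1),
    ("fly", 40, 2),
    ("ant", 40, 2),
    ("fly", 40, 3),
]

# Unigram lookup with a topic filter: word = "fly" AND topic = 40 ...
topic_filtered = [t for t in tokens if t[0] == "fly" and t[1] == 40]

# ... which has the same two-column shape as a bigram lookup
# (word1 = "fly" AND word2 = "south"), so the existing machinery applies.
print(len(topic_filtered))  # count of "fly" tokens in topic 40
```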

      1. Ted Underwood

        That second idea — comparing topics across metadata categories — would, indeed, be interesting.

        The main thing that seemed hard to me is, as you say, the “non-lousy model.” Even good topics, in a good model, may not line up with the conceptual categories researchers bring to a problem. And you’ll get significantly different models depending on the generic and chronological scope of the corpus selected. But looking at the examples in the beta browser, it appears that you’ve chosen to create *enough* topics that they are in practice both pretty interpretable and pretty portable across genres. Is that like 1000 topics?

        1. ben Post author

          Longer comment below. But in theory, at least, I suspect the final version of the API would make it possible to build an arbitrary number of models into a single bookworm and switch between them… somehow. (I can’t think of a straightforward way of specifying this under the API, which is generally a sign that it’s not a great idea… but I can think of hacky ways easily).

          That would allow some interesting visualizations in itself that would be helpful for deconstructing topic models. I’ve tried to convince some undergrads at other institutions to do something, for instance, where you take Elijah-Meeks-esque wordclouds of topic distributions, but place them in a force-directed layout (http://bl.ocks.org/donaldh/2920551); and then add a slider to switch from 5 to 6 topics and see which words move from one topic to another, how a single topic might split in two (even better here for hierarchical topic models, obviously), or how futzing with the alpha and beta parameters affects the models.

          You’d need to precompute an enormous amount, since in theory at the very least n, alpha, and beta can all vary; best for each browser to just demo one, probably. But if you can precompute and load it into an instantly loading query model (which Bookworm provides), it would be visually compelling, at least, and possibly instructive.

  2. ben Post author

    Comparisons of words across models may just look like these slopegraphs, although I’m also thinking about doing straightforward comparisons through Dunning or whatever else. (BTW, e-mail me if you want to actually look at the thing).

    This particular model is actually just a 64-topic one trained against the movie corpus: the parameters (which are pretty minimal) are in the Makefile here. My general idea would be a workflow involving tweaking that (including some better defaults, a post-training burn-in, etc.) to produce a custom model for each corpus.

    I suppose it might be possible to create a generalizable English-language books model: but one of the things I really like, actually, is that you tend to get some topics that are indicative of low-quality portions of the corpus.

    64 is probably a little small; with some more, though, there may be categories that at least will be moderately useful. But that’s going to require some David Mimno, probably, because I can only model something about 10x as big as this before MALLET starts running out of memory.

  3. ben Post author

    Why not keep commenting on my own post? Here’s what the topic assignments look like for the Federalist Papers, which is my general test for this sort of thing.

    1. Ted Underwood

      Very cool. If you’re able to create a custom model for each corpus, then Bookworm becomes a remarkably flexible text analysis tool. I think the ability to move back and forth between word-centered and topic-centered analysis is really valuable.

      It’s funny that the movie model looked to me like it had a lot of topics; I think I’m just used to the sort of topics that emerge in print, and dialogue-based topics were so different that they seemed highly specialized.

  4. ben Post author

    FTR, here’s the full movie model, with the number of words and scripts that get dumped into each topic, followed by the history dissertation title corpus. Not perfect, but good as a supplement; for the history one, I literally did nothing but clone the repo and type “make,” which is nice.

