Feature Reduction on the Underwood-Sellars corpus

This is some real inside baseball; I think only two or three people will be interested in this post. But I’m hoping to get one of them to act out or criticize a quick idea. This started as a comment on Scott Enderle’s blog, but then I realized that Andrew Goldstone doesn’t have comments for the parts pertaining to him… Anyway.

Basically I’m interested in feature reduction for token-based classification tasks. Ted Underwood and Jordan Sellars’ article on the pace of change (hereafter U&S) has inspired a number of replications. They use the 3200 most-common words to classify 720 books of poetry as “high prestige” or “low prestige.”

Shortly after it was published, I made a Bookworm browser designed to visualize U&S’s core model, and asked Underwood about whether similar classification accuracy on a much smaller feature set was possible. My hope was that a smaller set of words might produce a more interpretable model. In January, Andrew Goldstone took a stab at reproducing the model: he does, but then argues that trying to read the model word by word is something of a fool’s errand:

Researchers should be very cautious about moving from good classification performance to interpreting lists of highly-weighted words. I’ve seen quite a bit of this going around, but it seems to me that it’s very easy to lose sight of how many sources of variability there are in those lists. Literary scholars love getting a lot from details, but statistical models are designed to get the overall picture right, usually by averaging away the variability in the detail.

I’m sure that Goldstone is being sage here. Unfortunately for me, he hits on this wisdom before using the lasso instead of ridge regression to greatly reduce the size of the feature set (down to 219 features at 77% success rate, if I’m reading his console output correctly), so I don’t get to see what features a smaller model selects. Scott Enderle took up Goldstone’s challenge, explained the difference between ridge regression and lasso in an elegant way, and actually improved on U&S’s classification accuracy with 400 tokens–an eightfold reduction in size.
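To make the ridge/lasso contrast concrete, here is a minimal sketch of the textbook special case where the features are orthonormal. There, ridge (minimizing ½‖y − Xb‖² + (λ/2)‖b‖²) rescales every unpenalized coefficient, while the lasso (minimizing ½‖y − Xb‖² + λ‖b‖₁) soft-thresholds them, so small coefficients become exactly zero. This is not the code Goldstone or Enderle ran; the coefficient values and the penalty `lam` are made up for illustration.

```python
def ridge_shrink(beta, lam):
    """Ridge (L2) solution for orthonormal features: rescale every coefficient toward zero."""
    return [b / (1.0 + lam) for b in beta]


def lasso_shrink(beta, lam):
    """Lasso (L1) solution for orthonormal features: soft-threshold each coefficient."""
    out = []
    for b in beta:
        magnitude = max(abs(b) - lam, 0.0)  # anything smaller than lam is zeroed out
        out.append(magnitude if b >= 0 else -magnitude)
    return out


# Hypothetical unpenalized coefficients for five words.
ols = [2.5, -1.2, 0.4, -0.3, 0.05]
lam = 0.5

ridge = ridge_shrink(ols, lam)
lasso = lasso_shrink(ols, lam)

print("ridge nonzero:", sum(1 for b in ridge if b != 0.0))
print("lasso nonzero:", sum(1 for b in lasso if b != 0.0))
```

Ridge keeps all five words with smaller weights; the lasso keeps only the two largest, which is the sense in which it performs feature selection.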

So I’m left wondering whether there’s a better route through this mess. For me, the real appeal of feature selection on words would be that it might create models which are intuitively apprehensible for English professors. But if Goldstone is right that this shouldn’t be the goal, I’m unclear why the best classification technique would use words as features at all.

So I have two questions for Goldstone, Enderle, and anyone else interested in this topic:

  1. Is there any redeeming interpretability to the features included in a unigram model? Or is Goldstone right that we shouldn’t do this?
  2. If we don’t want model interpretability, why use tokens as features at all? In particular, wouldn’t the highest classification accuracy be found by using dimensionality reduction techniques across the *entire* set of tokens in the corpus? I’ve been using the U&S corpus to test a dimensionality reduction technique I’m currently writing up. It works about as well as U&S’s features for classification, even though it does nothing to solve the collinearity problems that Goldstone describes in his post. A good feature reduction technique for documents, like latent semantic indexing or independent components analysis, should be able to do much better, I’d think–I would guess classification accuracy over 80% with under a thousand dimensions. Shouldn’t this be the right way to handle this? Does anyone want to take a stab at it? This would be nice to have as a baseline for these sorts of abstract feature-based classification tasks.
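One way the baseline in question 2 could be set up, as a rough sketch: weight the full document-term matrix with TF-IDF, reduce it with truncated SVD (the usual way to compute latent semantic indexing), and cross-validate a logistic regression on the reduced dimensions. Everything here is an assumption for illustration: the data are synthetic counts from `make_toy_counts`, not the U&S corpus, and the number of components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def make_toy_counts(n_docs=200, n_vocab=1000, seed=0):
    """Synthetic document-term counts with two classes; NOT the U&S corpus."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_docs // 2)
    base = rng.gamma(shape=1.0, scale=1.0, size=n_vocab)  # per-word base rates
    shift = np.ones(n_vocab)
    shift[:100] = 2.0  # class-1 documents use the first 100 words twice as often
    rates = np.where(labels[:, None] == 1, base * shift, base)
    return rng.poisson(rates).astype(float), labels


def lsi_accuracy(counts, labels, n_components=50, folds=5):
    """Mean cross-validated accuracy of logistic regression on LSI dimensions."""
    model = make_pipeline(
        TfidfTransformer(),
        TruncatedSVD(n_components=n_components, random_state=0),
        LogisticRegression(max_iter=1000),
    )
    # The whole pipeline is refit in each fold, so the SVD never sees test documents.
    return cross_val_score(model, counts, labels, cv=folds).mean()


counts, labels = make_toy_counts()
print(f"cross-validated accuracy: {lsi_accuracy(counts, labels):.2f}")
```

Swapping `TruncatedSVD` for another reducer (for example `FastICA`) would give the independent components variant; whether either beats the word-level models on the real corpus is exactly the open question.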

20 thoughts on “Feature Reduction on the Underwood-Sellars corpus”

  1. Ted Underwood

    Fun question! I may not be the best person to answer it, because I’m (obviously) already entrenched in a particular set of practices here. You’ll get better replies from people who are less entrenched.

    But for what it’s worth, I do think interpretability is still a goal. I just think it’s hard, and is going to require a lot of humanistic flexibility. The top ten words in a list of 3200 words definitely won’t tell you much. You need some way to let broader patterns emerge. Pre-defined lists of words that we treat as proxies for concepts are one approach. I’m finding it useful to export categories that Heuser & LeKhac defined to explain fiction, and see how they work for poetry — e.g., https://twitter.com/Ted_Underwood/status/711212456023093248. Linguists have different sets of words they use; see the categories in Kao & Jurafsky: (https://web.stanford.edu/~jurafsky/pubs/56-151-1-PB.pdf).

    Defining these categories/lists is certainly a challenge; the usual problems of circularity in historical hermeneutics will rear their heads. But my response to that is: yep, it’s going to be messy; we’re humanists.

    In the article, as you know, we don’t even bother explicitly defining lists to support our impressionistic assertions about “concrete” and “dark” diction. Instead, we use close readings of cherry-picked passages to solve the rhetorical problem. It’s not the best possible form of proof. But I feel that the toughest challenge we confront is often rhetorical rather than intellectual. In practice, an audience of literary critics can be convinced by close readings, where word lists would cause them to invent lots of specious objections. ¯\_(ツ)_/¯

    I do also think it would be possible to use completely abstract, dimension-reduced features and make an argument that’s purely about the strength of different boundaries, without interpreting features at all. I don’t think that would be wrong; in some cases it may be the simplest solution, if we prefer to just skip the messy issue of feature interpretation. I’m doing that in a different article, because a) I don’t have space to talk about features and b) the features aren’t surprising; they’re mostly things like “proof” and “suspect” for detective fiction.

  2. Scott

    This is definitely a question on my mind as well. Glad you’re taking this up, and looking forward to hearing about your selection method! I missed your earlier entry into this discussion — I’ll look at it now.

    I have three semi-related thoughts:

    1. I ran a topic model on the poems and looked for words that were frequently pulled out by my feature selector. They seemed to cluster strongly in a few topics. Promising? But I haven’t tried running a LR model on the topics-as-features yet. I think LDA should reduce (but probably not totally eliminate) multicollinearity. Would the highly-rated topic combinations be interpretable? Or would we be in the same boat with topic combinations as we are with word combinations?

    2. What about using neural networks? LR is basically a 2-layer neural network; adding a third layer would group features together in the middle layer, kinda-sorta like a label-aware topic model. Would those groups be interpretable? Neural networks are supposed to be hard to interpret that way, but a three-layer network mightn’t be so very much harder than what we’ve got now.

    Speaking of label-aware topic models, I ran a topic model with “thistextwasreviewed” added as a word in reviewed texts. The results were sure interesting! What did they mean? What’s it mean to posit a Dirichlet prior over reviewedness? I have no idea. It’s harder to reason about generative models when you’re actually modeling reception.

    Also: I want to try restricted Boltzmann machines. Or, equivalently, I _think_: a word embedding, but with an additional “thistextwasreviewed” word in every ngram from a text that was reviewed.

    3. Can we really even interpret _a single word_ as a feature? If you create a LR model using the word “eyes” as your only feature — _just that single word_ — you get 63% accuracy. But… “eyes”? I have some guesses, but I’m not sure how to test them.

    Here’s a 12-word feature set that gets higher than 80% accuracy (picked using the selection method I’m playing with — but trained on all the data, so it’s technically cheating).

    [‘eyes’, ‘what’, ‘i\’ll’, ‘black’, ‘hurrying’, ‘sign’, ‘wondrous’, ‘hollow’, ‘half’, ‘not’, ‘sake’, ‘mission’]

    I mean, it’s suggestive. But again, I am not sure what to make of it. I think the problem of interpretation may call for different tools, and I don’t know what they are. Back to concordances and close reading?

    But this at least makes me confident that there’s something worth interpreting! To me, that’s the most obvious utility of this approach.
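The caveat above about picking the 12 words using all the data can be illustrated with a small sketch. On pure noise, choosing the best words with every label in view and then cross-validating looks impressive, while redoing the selection inside each training fold falls back toward chance. This is not Scott's selection method: the selector (`SelectKBest` with `f_classif`), the sizes, and the random data are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def leaky_vs_honest(n_docs=200, n_words=5000, k=12, seed=0):
    """Compare CV accuracy when feature selection happens outside vs. inside the folds."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_docs, n_words))  # pure-noise "word frequencies"
    y = np.repeat([0, 1], n_docs // 2)      # labels unrelated to X

    # Leaky: choose the k best words using every label, then cross-validate.
    X_top = SelectKBest(f_classif, k=k).fit_transform(X, y)
    leaky = cross_val_score(LogisticRegression(max_iter=1000), X_top, y, cv=5).mean()

    # Honest: the selector is refit on the training portion of each fold.
    pipe = make_pipeline(SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000))
    honest = cross_val_score(pipe, X, y, cv=5).mean()
    return leaky, honest


leaky, honest = leaky_vs_honest()
print(f"selection outside CV: {leaky:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

The gap between the two numbers is a rough measure of how much of a small word list's accuracy could come from the selection step alone.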

    1. Ted Underwood

      I think Scott’s third question here is important. In a traditional statistical model, interpretation works at the level of single features. But I don’t think that’s how it’s going to work best here.

      That’s why the Lasso doesn’t particularly excite me. I’m not convinced yet that reducing the number of features really increases interpretability, or reduces problems of collinearity. Actually, I suspect a push toward parsimony could heighten hidden problems of collinearity, and create the illusion that individual features are more meaningful than they are.

      Topic models and other ways of generating multi-feature lists strike me as more fruitful interpretive strategies.

      1. Scott

        I agree about the risk of illusory significance — but the benefit of sparsity is that it thins out the problem space. “Eyes” is definitely not functioning here as a transparent signifier, but if we can understand the larger cluster of terms and patterns that it’s a proxy for, we might be able to make some headway. I don’t know how to do that second part. But I feel a lot more confident approaching that problem for a handful of words than for 1000 words.

        And I’m not certain that pure dimensionality reduction will help because — well — suppose the model identifies five or six different highly predictive and anti-predictive clusters? Do we assume that the topics — such as they are — straightforwardly correspond to reviewability? Doesn’t that lead us into the same blind alley? I would want the dimensions to be interpretable relative to reviewability in a more obvious way.

        Maybe it would help if there were looooooottts of topics. Or maybe we need a richer quantitative model of reading than we have right now.

        It occurs to me that perhaps the Bookworm PCA visualization (http://bookworm.benschmidt.org/posts/2016-01-19-txtLAB450.html) using some highly predictive words could give some insight here.

        1. ben Post author

          > It occurs to me that perhaps the Bookworm PCA visualization
          > (http://bookworm.benschmidt.org/posts/2016-01-19-txtLAB450.html)
          > using some highly predictive words could give some insight here.

          Yeah. The 2d scatterplot is actually less useful in some ways; with a function that applies logistic weights properly, we could visualize just the model weights easily in one dimension and leave the other free for time, replicating (say) Ted’s plot of classification across time.
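A minimal sketch of the one-dimensional view described above: each book is collapsed to its logistic-regression score (the intercept plus the weighted sum of its word frequencies), which can then be plotted against publication year. The weights, frequencies, and years below are invented for illustration and do not come from any fitted model.

```python
import math


def logit_score(word_freqs, weights, intercept=0.0):
    """Linear predictor: intercept plus weight * frequency for each word in the model."""
    return intercept + sum(w * word_freqs.get(word, 0.0) for word, w in weights.items())


def probability(score):
    """Logistic function: map a score to a predicted probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical model weights and two hypothetical books.
weights = {"eyes": 0.8, "hollow": 0.5, "mission": -0.6}
books = [
    {"year": 1845, "freqs": {"eyes": 2.0, "mission": 1.0}},
    {"year": 1902, "freqs": {"hollow": 1.0, "eyes": 0.5}},
]

# (x, y) pairs for a time-versus-score scatterplot.
points = [(book["year"], logit_score(book["freqs"], weights)) for book in books]
for year, score in points:
    print(year, round(score, 3), round(probability(score), 3))
```

Plotting the raw score rather than the probability keeps the axis linear in the word weights, which may be easier to read alongside the weights themselves.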

          OTOH, I’m not sure how pressing a need this is. I’m probably going to do some more with geography before worrying about it.

      2. Scott

        To put a point on it — I think we definitely need more sparsity somewhere in the flow between feature sets and interpretations. Many small sparse topics (i.e. few probable words) seems promising.

      3. Ted Underwood

        Right. We can’t wade through 3200 features one by one, so there’s got to be some sparsity/simplification somewhere.

        On the other hand, what’s the goal we’re aiming for? If “an interpretation of the model” means “a complete explanation of the model, where all factors are accounted for,” I give up right now. Polysemy and collinearity, plus the sheer complexity of the world, are going to make that very hard. Even if it were theoretically possible to explain everything, literary critics wouldn’t sit still long enough to read the explanation.

        But we don’t have to be aiming for that. My goal is usually “to identify and plausibly describe a few of the salient factors, without claiming to cover everything.” If that’s the goal, I don’t think it’s hard to achieve. Topic modeling is one approach. But we can also just create lists of terms like “colors,” “body parts,” “physical adjectives,” etc, the way they did in LitLab pamphlet #4.

        Lists like that don’t have to be perfect or complete to tell you that something’s up. And there’s something to be said for their portability. E.g., the “hard seeds” Heuser and LeKhac constructed for fiction in pamphlet #4 also do a good job of explaining poetic prestige: https://twitter.com/Ted_Underwood/status/711212456023093248

        1. ben Post author

          I do feel shaky about using standard word lists, particularly on historical texts. But the story from sentiment analysis is certainly that straightforward lists of words do as good a job at discriminating on fuzzy categories as do theoretically much more complicated constructions. No reason to think it’s not a good strategy here.

          The aspect of interpretability I’m particularly interested in, to put it naively, is the bit where English professors and Atlantic readers and the like can somewhat assess a model without worrying about the garden of forking paths and all that. Those people, I’d hope, can think past collinearity, maybe; they intuitively know that “eyes” means all their associations with the word, and not just that poems about optometrists will (won’t?) get reviewed. Although I have a soft spot for the topic models or LSI, it’s true that probably loses them. But maybe you can just do what Piper and So did recently and say “we used the kitchen sink, and still couldn’t discriminate MFA from NYC that well.”

        2. Scott

          I think you’re absolutely right! And I’m definitely not looking for complete explanation. That would be awfully ambitious at this stage. I just worry about whether factors we identify really are the salient factors.

          (“But Scott, what does ‘really are’ mean?” *shrugs*)

        3. Ted Underwood

          “The aspect of interpretability I’m particularly interested in, to put it naively, is the bit where English professors and Atlantic readers and the like can somewhat assess a model without worrying about the garden of forking paths and all that.”

          Exactly. I keep coming back to a feeling that the real bottleneck is rhetorical. Given time to slice the data different ways, it’s possible to achieve a pretty good understanding of a model. Topic modeling, predefined word lists, Lasso, close inspection of examples, running lots of different models — they’re all good; they can all give you some grip on interpretation. And by the time I write an article, I’ve typically tried several of those things. But then I have to sketch an explanation in five pages or less, for an audience of humanists who have limited previous experience, and frankly limited patience for methodological excursions. For me, that’s the hard part.

    2. ben Post author

      (This is in reply to Scott’s top-level comment, parts 1 and 2; part 3 is further explored in the Ted-thread.)

      1. As I said below, I like LDA distributions as a reduced feature set.

      2. I love the idea of just tossing “thistextwasreviewed” in there, although yeah, WTF. At the very least, you have to scale it by number of reviews. It does suggest a whole interesting avenue of incorporating reader responses directly into the text as part of the modeling. This would be both appealing in the sense of behaving trendily for an English department, and marketable in that it might play interestingly with Kindle or iBooks data for the connected few…

      I am curious about neural network architectures for this general problem; it’s been a summer plan for a while. I would think the dream would be not to use feature counts at all, but to have some recurrent neural network that tries to guess what kind of text it’s been in based on the state at the end. Not that I know how to do that right now: and obviously that won’t work with 360 instances, though! And it would be only minimally interpretable.

  3. Andrew Goldstone

    I’m sure you’re being generous when you say I was being sage. But seriously, thanks for taking this up. It would be good to have some baselines about the behavior of these classifying techniques so we know when better-than-chance model performance is actually interesting.

    Obviously I agree with myself on (1). It’s a very general problem with procedural model selection (the textbooks tell me it goes equally for forward/backward selection for linear models, selection by AIC, etc.), and which the p >> n situation makes worse. But equally obviously the thing I really want to resist is an a priori restriction to linguistic features, which opens the door wide to literary scholars’ favorite move of explaining everything as a function of the text.

    In this case I don’t see any reason to go for classification performance without model interpretability. For something like Ted and Jordan’s data, all the interest is in using a model explanatorily. “Prediction” only enters as a way of validating the model and protecting against overfitting. What good would a black-box model be?

    For straight-up classification, I take it you’re suggesting reducing the feature set with PCA or similar and then doing the (regularized or un-?) logistic regression on some small number of principal component weights instead? No idea whether it can play tricks on you statistically, but any feature-reduction technique will be tuned to pick combinations of variables that are good at distinguishing documents, no?

    Quickly and carelessly rerunning the old code from https://github.com/agoldst/us-standards-rep/blob/master/post.Rmd, I get that the best lasso model has these highly positively weighted features

    and these negative:

    which is similar to what you get from the ridge.

    1. ben Post author

      This makes me think I should be clearer about goals. One direction my interest in this stuff takes me–as I think I’ve told you, Andrew–is in juxtaposing the “purely text” features against the “sociological” ones to see what comes up as mattering. Which maybe is a false distinction itself; but I feel more comfortable with it in that form than just jumbling in all the imaginable features together. So a black-box textual model stands as “what literary scholars think matters”; and then you drop something else on top. I suppose multi-level modeling would make sense here.

      For reduction; yes, essentially PCA. My reading of the literature suggests that ICA, LSI, or (if you want to get trendy) paragraph2vec may work better than old-fashioned PCA, though I couldn’t say why. (LSI is basically correspondence analysis, though, which should warm your Bourdeauvian heart, right?).

      And thanks for the features. I do like them. But yes, it’s really impossible to say which is more interpretable.

      1. Ted Underwood

        Re: the juxtaposition of social factors and textual features: I should mention that I don’t understand textual features as causal at all.

        If I were using factors like education or geographic location to explain literary prominence, I would be thinking about them, on some implicit level, as causes. Maybe I couldn’t put it that bluntly, because correlation ≠ causation, etc., but that would be the implicit hypothesis.

        With textual features, it’s not even an implicit hypothesis — at least not for me. The point of modeling is really just to ask “how coherent was the ‘elite’ style in this period?” I’m not at all hypothesizing that “these books got reviewed because they used the word ‘hollow’ a lot.” To put it another way, I’m not looking at style because I think it explains prestige; I’m trying to understand the social differentiation of styles. (How firm was that boundary?) Prestige may be formally, in the model, the thing-predicted, but (for me at least) that’s only a formal gambit.

      2. Andrew Goldstone

        Ben, if you’re interested to look at how linguistic features compete with contextual ones for explaining some outcome, I think I see why you might want to go for a multilevel model, but maybe it would work to do your dimension reduction and then just throw the 6 (where 6 means “a small number”) most important linguistic dimensions into a model with the contextual variables? Could the reduction make the competition “fair”? Provided of course that the linguistic “black box” features aren’t themselves too collinear with the contextual ones.
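The suggestion above could be sketched roughly as follows: reduce the word features to six dimensions, then compare cross-validated accuracy for the text dimensions alone, the contextual variables alone, and both together. All of the data here are synthetic, PCA stands in for whichever reduction is actually used, and the helper name `compare_feature_sets` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_feature_sets(text, context, y, n_dims=6, folds=5):
    """CV accuracy for reduced text dimensions, contextual variables, and both combined."""
    # The reduction is fit once on all documents, without labels, before modeling.
    text_dims = PCA(n_components=n_dims, random_state=0).fit_transform(text)
    candidates = {
        "text": text_dims,
        "context": context,
        "both": np.hstack([text_dims, context]),
    }
    # Standardizing puts text dimensions and contextual variables on a common scale.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return {
        name: cross_val_score(model, X, y, cv=folds).mean()
        for name, X in candidates.items()
    }


rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
text = rng.normal(size=(200, 300)) + 0.3 * y[:, None]    # hypothetical word features
context = rng.normal(size=(200, 2)) + 0.8 * y[:, None]   # hypothetical contextual variables

results = compare_feature_sets(text, context, y)
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

If "both" barely improves on either set alone, that would be one sign of the collinearity between linguistic and contextual features that the comment warns about.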

        Re: my Bourdieu.*an heart, the main thing that worries me about B’s correspondence analyses is that there’s no account of the statistical properties of the reduction. Tony Bennett’s attack on Bourdieu (http://doi.org/10.1353/nlh.2007.0013) basically claims that the latter’s method of interpreting the CA is biased in favor of finding distinctions among individuals rather than ways their characteristics overlap. More broadly, after doing a dimension reduction one can worry over the generalizability of the feature loadings (or whatever the cool advanced version is) to new data. Is there a standard way to talk about this in the dimension-reduction stuff you’ve been looking at?

        Ted, the essay is clear that you and Jordan aren’t giving the textual features a causal role. Anyway, it’s interesting to think about the CV success rate from your models as a kind of measurement of the degree to which “Reviewed” and “Random” are well-separated in feature space. It would be interesting to compare this measure to others. Ben’s proposal would be provocative because it would generate another measure, based on the same inputs, but in a space with a different geometry (maybe).

        1. Ben

          I’m not sure what the statistical characteristics that are missing are.
          Looking at Bennett, his critique seems limited to the argument you made against reading loadings directly, which is that the rotation of a single feature doesn’t mean what you think. (IIRC, the plots in Distinction mostly overlay features with elements? You get “Lawyers” in the space as “Art of the Fugue” in the same manner as a PCA biplot?)
          The analogous complaint here would be that these models might make a reader think that lots of the non-reviewed materials contain the words. (Which, I’m surprised, they actually seem to: 42% of the unreviewed books use the word “mission”, and 22% of the reviewed ones do.) It’s not an argument, I don’t think, that bus drivers might really be just as much like lawyers as they are like train conductors. Plus he argues that CA forces a binary choice, which isn’t the case w/ feature counts.

          With LSI or ICA; usually you’re only using it on a fixed corpus anyway. So the danger is just that the classifier will break on that random Dutch-language text that crept in. If you remained worried–well, my new top-secret dimensionality reduction technique is entirely about generalizability to new data (at the expense of almost everything else).

          Vis-a-vis causation: my personal interest is in cases where we can regard both sociological features and textual content as causal in classificatory decisions, BUT there’s some received model that suggests it should just be text. Women whose evaluations describe them as unengaged teachers get lower scores than men with similar language, or whatever. Obviously there’s collinearity; but there is also some measure of causation. Maybe the feature to use would not be the raw textual data, but the output of a more naive classifier. Or maybe it’s hopeless.

    2. Scott

      Odd, I got different results from sklearn’s L1 model. Maybe I didn’t happen to hit this point in regularization space. I’ll have to take a second look. These are almost identical to the top features produced by the grid search I was doing, which may mean I’ve been doing unnecessary work!

      BTW, I’m briefly hijacking Ben’s thread to tell you that I looked over your Instantaneous Mutual Information code a week or two ago, and it seemed right — FWIW. I wanted to write an actual test before posting but I thought I’d mention it. (The idea behind the check is really elegant, isn’t it?)

  4. Scott

    Just noticed Ted’s comment about generalizable dimension reduction — I too am interested. Generalizable to new data with respect to what other data or assumptions? To put it differently, given two datasets, you learn a DR technique on one set with knowledge of labels, and then repeat it on the second set without having to look at their labels? Just pulling guesses out of a hat; no need to answer that question now.

    Also: this conversation has, after some reflection, made me want to spend some more time thinking about “interpretation,” “explanation,” and “causation.” They’re starting to feel like blunt instruments to me.

