A computational critique of a computational critique of computational critique.

Nan Da has a critique of some subset of digital humanities in Critical Inquiry. She calls it “Computational Literary Studies,” or CLS. I will happily adopt that acronym because 1.) it’s not, as far as I know, associated with any author; and 2.) after this particular article, it’s unlikely anyone will ever use it again. I started taking some notes on it, and rapidly fell into finding so many mistakes that I figured I’d post them.

The “computational” aspect of Da’s case is twofold:

  1. It grants that in other areas–science and industry–computational methods are being deployed perfectly and appropriately, but holds that, sadly, such methods cannot be applied in literary studies.

  2. It asserts that actually existing CLS is ridden with statistical errors that could be easily corrected.

This is an interesting feint. We have moved past anti-positivism into computational NIMBYism. The old anti-digital-humanities polemics straightforwardly attacked the cultural authority of numbers. That has become increasingly difficult both as the hegemony of STEM has increased inside the university, and as humanists have come to realize that there are evildoers in the world even worse than economists. The rhetorical tools you can deploy against scientism are strong, but they risk making it seem–say–that maybe we shouldn’t listen to climate scientists. So the piece posits, a little generously, that everyone else is using numbers right–but also insists that the exercise in replication and methodological analysis (a good thing) proffered here doesn’t actually point the way toward doing CLS better. This risks mystification, by asserting that scientists do something arcane, powerful, and true; and through the deployment of technical vocabulary and replication, it argues that CLS is engaged in a sort of cargo-cult version of data science in which nothing can be learned because tools are applied out of domain. (The core assumption–that tools have to be used in the contexts for which they were designed–is itself problematic.)

As a matter of argument, though, these two goals pull in different directions. If existing pieces are so heavily flawed, then we probably don’t know the limits of the knowable. If, on the other hand, we’re able to tell that CLS will never produce useful results for literature, it would probably only be because the existing articles give us some sense of what’s possible.

A careful replication that interrogates published articles is a good thing. I’ve even published a blog post raising some of the exact same concerns about the original blog posts about the Underwood-Goldstone paper she looks at. (The problems of doing intertemporal analysis of topic models.) They responded, integrating some of the critiques. All of this is stuff that should happen–though the push everyone feels to make a splashy disciplinary argument has probably cut it short. An article that tries to sift out which CLS work plays fast and loose with statistics, as opposed to which treats the limitations of its knowledge responsibly, is needed. There’s an explosion of idiosyncratic metrics and methods, and a slapdash approach to things like stopword lists, that demands some methodological pruning. (Maybe. Or maybe the Cambrian explosion of method is just what we need.)

But this isn’t that article. I’m moved to check in here because so much of this article rests rhetorically on bemoaning the lack of sophistication in CLS work. “In good statistical work, the burden to show difference within naturally occurring differences (‘diff in diff’) is extremely high.” Etc. This blizzard of terminology establishes for the innumerate reader that they finally have an expert who will debunk statistics with greater certainty than they could themselves. But it’s often mumbo jumbo. “Difference in difference” analysis does not generally refer to testing whether two distributions differ beyond statistical noise once you’ve properly applied the Bonferroni correction; it refers specifically to testing the effect of a treatment intervention. In the first specific critique, the article characterizes p-values in the following way: “statistics automatically assumes that 95 percent of the time there is no difference and that only 5 percent of the time there is a difference. That is what it means to look for p-value less than 0.05.” Set aside that p-values are unrelated to the task at hand and ask: what? To look for a p-value under 0.05 is to look for a pattern that would occur only 5% of the time as a result of random variation alone. It’s not a great threshold. But Underwood’s paper does not rely on p-values at all.
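For readers who want the definition rather than the blizzard, here is a minimal sketch of what “a pattern that would occur only 5% of the time as a result of random variation” means in practice: a permutation test on invented word frequencies. Nothing below is drawn from Underwood’s or Da’s code.

```python
# A minimal, hypothetical sketch of what a p-value measures: the share of
# purely random reshufflings that produce a difference at least as large
# as the one observed. The data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Invented per-text frequencies of some word in two groups of texts.
group_a = rng.normal(loc=0.0020, scale=0.0005, size=50)
group_b = rng.normal(loc=0.0023, scale=0.0005, size=50)

observed = abs(group_a.mean() - group_b.mean())

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_shuffles = 10_000
hits = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = abs(pooled[:n_a].mean() - pooled[n_a:].mean())
    if diff >= observed:
        hits += 1

p_value = hits / n_shuffles
print(f"observed difference: {observed:.6f}, permutation p-value: {p_value:.3f}")
# p < 0.05 means: shuffling the group labels at random produces a gap this
# large less than 5% of the time. It does not mean that "statistics assumes"
# there is no difference 95% of the time.
```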

So let’s take a look at how well the statistical claims here hold up: is the debunking useful?

Computational Critique

There’s plenty to complain about. I frequently have large reservations about work that somehow makes it into print in cultural analytics. (Although, I feel the same way about most published research, of which I generally think we have too much.)

But the computational evidence deployed here–the thing that tries to make this piece stand out–is sloppy at best. Perhaps the whole piece is intended as a parody of what can slide into top literary journals nowadays–it is indeed the case that Critical Inquiry will allow you to publish with terribly inadequate code appendices. But it certainly does not show that good statistics can obliterate the bad statistics that are widespread.

The tension between the two goals is evident in the first piece of the set she takes on, a Ted Underwood article on genre classification. She at once proposes a simple correction–

Underwood should train his model on pre-1941 detective fiction (A) as compared to pre-1941 random stew and post-1941 detective fiction (B) as compared to post-1941 random stew, instead of one random stew for both, to rule out the possibility that the difference between A and B is not broadly descriptive of a larger trend (since all literature might be changed after 1941).

and claims that Underwood uses methods that could never find differences between genres.

It is true that Underwood’s methods are inadequate to prove there is no difference in detective fiction pre- and post-1930. (Her use of the year “1941” is a mistake–it seems to stem from confusing the date of one of Underwood’s sources with the year he chose as a testing cutoff.) But that is an absurdly high bar–of course something changed, if only the existence of words like ‘television’ and ‘databases.’ Underwood says as much. The actual article is caught up in a more interesting discussion of the comparative stability of genres. The core argument is not, as Da says, that genres have been “more or less consistent from the 1820s to the present,” but that detective fiction, the gothic, and science fiction–specifically–show different patterns, with detective fiction being a far more coherent genre than the gothic novel. By focusing only on detective fiction, she misses the entire argument of the article.

I don’t know exactly what Underwood used to train on. But if he did allow the ‘random stew’ to contain both pre- and post-1930 work, that would make the performance of his model more remarkable, not less–it would indicate that it was correctly tagging Elmore Leonard (say) novels as detective fiction even though they use words like “fax” or “polaroid” that it had previously seen only in the post-1930 contrast set.
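To make the design question concrete, here is a hedged sketch of the experiment at issue: a bag-of-words classifier trained on pre-cutoff detective fiction against a contrast set, then asked whether it recognizes post-cutoff detective fiction. The column names, the cutoff variable, and the scikit-learn pipeline are mine for illustration, not Underwood’s.

```python
# A sketch of the experimental design at issue, not Underwood's code.
# Assumes a pandas DataFrame with columns 'text', 'is_detective', and 'year';
# the field names and the 1930 cutoff are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def fit_genre_model(texts, labels):
    """Train a bag-of-words classifier: detective fiction vs. 'random stew'."""
    model = make_pipeline(
        CountVectorizer(max_features=10_000),
        LogisticRegression(max_iter=1_000),
    )
    model.fit(texts, labels)
    return model


def cross_period_accuracy(df, cutoff=1930):
    early = df[df.year < cutoff]
    late = df[df.year >= cutoff]

    # Train only on the early period: detective fiction vs. its contrast set.
    model = fit_genre_model(early.text, early.is_detective)

    # Then ask how well that model recognizes post-cutoff detective fiction.
    # High accuracy here suggests the genre's vocabulary signature is stable
    # across the cutoff, even though words like 'fax' never appear in training.
    return model.score(late.text, late.is_detective)
```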


This is the core sin here: Da pulls statistical pronouncements out of thin air and presents them as that which must be done. These claims are often either misinformed or misleading.

I can’t bear to go through all her sections. But as an example, take the analysis of Andrew Piper’s work on Augustine’s Confessions. In a few paragraphs, she makes as many mistakes as she holds him to account for in a full article.

First, she criticizes Piper for performing Principal Components Analysis on unscaled word frequencies, and produces scatterplots that show dramatically different results from his: “The way to properly scale this type of matrix is outlined in G. Casella et al’s Introduction to Statistical Learning… The second step [Z-scaling] is necessary if each word is to be seen as a feature for PCA.” George Casella did not write a book called Introduction to Statistical Learning; she means the 2013 volume by Gareth James et al., published (after Casella’s death) in a Springer series for which he was a general editor. And the chapter she cites certainly does not say that PCA matrices must always be scaled by their standard deviations. It says, rather, that scaling before PCA is a decision the researcher has to make. When units are arbitrary, scaling makes sense–if comparing SAT scores to grade point averages, you don’t want the difference between a 1420 and a 1421 on the test to count the same as the difference between a 2.5 and a 3.5 GPA. But word frequencies are not arbitrary units, and in that case the researcher must decide. To quote from the text: “In certain settings, however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA” (James et al. 2013).
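Since the whole dispute turns on a one-line choice, it is worth showing what that choice looks like in code. This is a toy sketch in scikit-learn, with invented documents, not a reproduction of either Piper’s or Da’s pipeline.

```python
# An illustration of the scaling choice the James et al. chapter describes,
# not a reproduction of Piper's or Da's pipelines; the documents are toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

docs = [
    "confessions of sins and sins again",
    "memory and time and memory",
    "god and grace and grace abounding",
]

counts = CountVectorizer().fit_transform(docs).toarray()
freqs = counts / counts.sum(axis=1, keepdims=True)  # relative word frequencies

# Unscaled PCA: words that vary most in absolute frequency (usually the
# common ones) dominate the components.
unscaled = PCA(n_components=2).fit_transform(freqs)

# Z-scaled PCA: every word is forced to unit variance first, so a word used
# twice in one text counts as much as a word used hundreds of times.
# This is the step Da presents as mandatory.
scaled = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(freqs))
```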

This is a central challenge, familiar to anyone who has tried to grapple with wordcounts. There are so many uncommon words used once or twice in any given text that, when scaling by standard deviation is applied, they can completely swamp the repeated words. A variety of solutions are in common use: TF-IDF scaling drops out the most common words while allowing those of medium frequency to shine through; log transformations of various flavors proliferate. Ideally, conclusions would not be wholly dependent on where you land in this parameter space, but the way the question is phrased matters.
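For concreteness, here is what a few of those standard options look like, again with toy documents rather than anyone’s actual corpus.

```python
# A menu of standard transformations for word-count matrices, sketched with
# toy documents; none of this reproduces Piper's or Da's pipelines.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "in principio erat verbum",
    "et verbum erat apud deum",
    "et deus erat verbum",
]

counts = CountVectorizer().fit_transform(docs).toarray()

# 1. Relative frequencies: normalize for document length only.
freqs = counts / counts.sum(axis=1, keepdims=True)

# 2. TF-IDF: downweight words that appear in nearly every document,
#    so medium-frequency vocabulary carries the analysis.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# 3. Log transform: compress the gap between very common words and hapaxes
#    without forcing every word to identical variance.
log_counts = np.log1p(counts)
```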

Da sidesteps all these complications for her readers by implying that the real difference has to do with a philological failing: that Piper doesn’t stem his Latin text. This is something a literary audience can understand, and it gestures toward a kind of traditional expertise they are prepared to respect.

Comically, her version reproduces many of the same philological failings. She implies that Piper didn’t use a Latin stemmer because the “only Latin stemmer available is the Schinke stemmer,” but that she has taken the trouble to use one. This is incorrect on both fronts. First, there are quite a few Latin stemmers available. (For an in-depth analysis of at least six, see Patrick Burns’s work.)

And second, her own effort seems scattershot at best. It’s hard to tell what code Da actually ran–the online appendices for analyzing Piper’s case only include the PCA code for Chinese, not for the figures included in the appendix. Ordinarily I would be forgiving of this kind of lapse, which is all too common; perhaps the inadequate code appendices are intended as a higher-order critique of computational work. But her failings vis-à-vis replication are far greater than those of, say, Ted Underwood, who generally supplies a single script called replicate.py that you can run yourself inside any of his projects.

Still, from what she has posted online, Da appears to have re-implemented Schinke’s algorithm in both R and Python, with separate rules for nouns and verbs. But then, in her Cross Distance code, she simply applies the noun stemming rules to all words, probably because choosing a part of speech is much harder. This creates problems on two fronts: some verbs are not stemmed at all (‘resurrexit’ remains ‘resurrexit’, even though the verb rules would have it as ‘resurrexi’); and the noun rules are applied to function words as well, with silent NULL results in her code, so that words like ‘que,’ ‘cum,’ ‘te,’ and ‘me’ are deleted from the text altogether. That is: many function words are being dropped entirely because a new implementation was hastily coded rather than using one of the more mature implementations available.
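To see how that failure mode produces the deletions described above, here is a deliberately naive sketch of a Schinke-style noun-rule stemmer with a silent “too short” check. It is an illustration of the bug class, not Da’s actual code.

```python
# Illustrative only: a naive noun-rule stemmer of the Schinke type, showing
# how applying noun rules to every token and treating a too-short result as
# "no output" silently deletes function words. This is NOT Da's code.

NOUN_SUFFIXES = [  # Schinke's noun/adjective suffixes, longest first
    "ibus", "ius", "ae", "am", "as", "em", "es", "ia",
    "is", "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u",
]


def naive_noun_stem(word):
    word = word.lower().replace("j", "i").replace("v", "u")
    if word.endswith("que"):
        word = word[:-3]                     # strip the enclitic unconditionally
    for suffix in NOUN_SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    return word if len(word) >= 2 else None  # the silent-NULL step


tokens = ["resurrexit", "que", "cum", "te", "me", "domine"]
stems = [naive_noun_stem(t) for t in tokens]
# -> ['resurrexit', None, None, None, None, 'domin']
# The verb 'resurrexit' passes through untouched (no noun suffix matches),
# while the function words vanish because their stems fall under two letters.
print([s for s in stems if s is not None])
```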

I wrote all this and then, quickly, checked what difference it actually makes. (Code and edits online here.) I was, honestly, expecting that the scaling factor would be significant and account for the differences between her plots and Piper’s. But actually, what I got looks more or less like Piper’s original.

Reproduction of Piper’s original:

[Figure: a reproduction of the original]

Reproduction using Da’s scaling:

[Figure: a reproduction of a reproduction]

And so on.

I could go on. The debunking of the topic modeling work, for example, uses not the well-established literature on comparing topic models to one another, but some arbitrarily chosen robustness tests (it drops 1 percent of documents), and it is not a replication. Topic models rely on extremely specific assumptions about the distribution of words in texts, and they are fit on word counts: they attempt to reproduce the frequencies in actual documents.

But rather than fitting on word counts, her model, for no apparent reason, uses TF-IDF vectors, which multiply the significance of rare words and decrease the significance of common ones. I have never seen a TF-IDF vectorization fed into LDA as a feature set before–it’s an extremely odd choice that guarantees the results will be different from Underwood and Goldstone’s, and partially explains the incoherent topics in the appendix, such as doulce attractiveness unsatisfying gence dater following mecum wigan cio milieu. (Edit 03-20) I’m wrong about this: Andrew Goldstone points out that there’s an argument to the TF-IDF vectorizer in her code that makes it output raw frequencies. Frequencies might still produce results different from the counts that Underwood and Goldstone used, but this is not a howler. It’s still unreasonable, though, to expect that the topics put out by the online variational Bayes LDA implementation in scikit-learn will be the same as those from the Gibbs-sampling method Underwood and Goldstone use from Mallet. Different methods can produce dramatically different results when the hyperparameters are not properly tuned; and while Goldstone does optimize hyperparameters, there’s nothing in the scikit-learn code that indicates an effort to produce the best model.
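For what it’s worth, the contrast between the two featurizations is easy to see in scikit-learn itself. The settings below are my own guesses for illustration (use_idf=False is one argument that turns the TF-IDF vectorizer into a plain frequency vectorizer); I am not asserting these are the exact flags in Da’s notebooks.

```python
# Sketch of the two featurizations being compared; toy documents only.
# The flag choices are illustrative, not a claim about Da's actual settings.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "whale ship sea captain whale",
    "love letter marriage estate",
    "ship sea voyage storm sea",
    "marriage love courtship letter",
]

# What LDA's generative model actually expects: integer word counts.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF reweights those counts; with use_idf=False the IDF term is dropped
# and the output is (normalized) term frequencies instead.
tfidf = TfidfVectorizer(use_idf=False).fit_transform(docs)

# scikit-learn's LDA uses (online) variational Bayes; even fed the right
# counts, there is no reason its topics should line up one-for-one with
# topics from Mallet's Gibbs sampler without careful tuning.
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                random_state=0)
doc_topics = lda.fit_transform(counts)
```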

In fact, Goldstone and Underwood’s original work on this dealt with this issue very clearly:

On the other hand, to say that two models “look substantially different” isn’t to say that they’re incompatible. A jigsaw puzzle cut into 100 pieces looks different from one with 150 pieces. If you examine them piece by piece, no two pieces are the same — but once you put them together you’re looking at the same picture.
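The jigsaw metaphor has a standard computational counterpart: rather than eyeballing word lists, you align the two models’ topics by the similarity of their word distributions and see how well the matched pairs agree. A sketch of that idea, not the procedure used in either paper:

```python
# One standard way to compare two topic models: optimally match topics on
# the similarity of their topic-word distributions. Illustrative sketch only.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def align_topics(topics_a, topics_b):
    """topics_a, topics_b: (n_topics, n_words) arrays of word probabilities
    over the same vocabulary, in the same column order."""
    dist = cdist(topics_a, topics_b, metric="cosine")  # or Jensen-Shannon
    rows, cols = linear_sum_assignment(dist)           # optimal one-to-one match
    return list(zip(rows, cols)), dist[rows, cols]


# Toy usage with random topic-word distributions over a 5-word vocabulary.
rng = np.random.default_rng(0)
pairs, gaps = align_topics(rng.dirichlet(np.ones(5), size=4),
                           rng.dirichlet(np.ones(5), size=4))
# Small gaps on the matched pairs mean the two models are cutting the same
# picture into differently shaped puzzle pieces; large gaps everywhere
# would be evidence of genuine instability.
```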

There are a variety of other statements that strike me as meaningless that I’m not going to track down. The section on Jockers and Kirilloff’s work on gender in the appendix, for example, obviously mislabels its bins (it claims that her replication found “she killed” and “he wept” to be gender stereotypes, rather than the opposite) and makes some extremely fishy claims, such as: “Overall, the percentage differences between these top most correlated verbs for each gender was very low (0.031% to 0.307%) meaning that while a difference can be found, male/female is not very differentiated from one another if we look at verbs.” I don’t know what that range is supposed to be, but at least for ‘wept,’ Google Ngrams gives the difference in gender usage as 400%.

But to go through all of this is a pain. I’m sure others have written their own analyses. This work is tedious, which is the reason it’s rarely done; and it’s hard to reproduce another person’s workflow even when it’s well documented.
