- bookworm (9)
- Bookworm (8)
- Digital Humanities (8)
- Humanities (5)
- Data Visualization (5)
- Historical Profession (4)
- Programming (4)
- Dimensionality Reduction (4)
- News (3)
- Publishing (3)
- word2vec (3)
- R (3)
- Word_Embeddings (3)
- Digital Archives (2)
- AI (2)
- Apache Arrow (2)
- Higher Education (2)
- SRP (2)
- pandoc (2)
- HathiTrust (2)
- Nonconsumptive (1)
- Ruby (1)
- Jekyll (1)
- Programming Languages (1)
- Modern Web Galleries (1)
- This site (1)
- Sveltekit (1)
- Digitization (1)
- Metadata (1)
- Arrow (1)
- DuckDB (1)
- Majors (1)
- Google (1)
- Google Ngrams (1)
- Teaching (1)
- Historiography (1)
- Degrees (1)
- Machine Learning (1)
- Articles (1)
- fiction (1)
- textLAB (1)
- politics (1)
- UK (1)
- RateMyProfessor (1)
- SOTU (1)
- VMs (1)
- machine learning (1)
- TV (1)
- movies (1)
- subtitles (1)
- screenworm (1)
- Usenet (1)
- Rate My Professor (1)
- Uncategorized (1)
- Narrative (1)
- Data Munging (1)
It’s not very hard to get individual texts in digital form. But working with grad students in the humanities looking for large sets of texts to do analysis across, I find that larger corpora are so hodgepodge as to be almost completely unusable. For humanists and ordinary people to work with large textual collections, they need to be distributed in ways that are actually accessible, not just open access.
I’ve never done the “Day of DH” tradition where people explain what, exactly, it means to have a job in digital humanities. But today looks to be a pretty DH-full day, so I think, in these last days of Twitter, I’ll give it a shot. (thread)
There are programming languages that people use for money, and programming languages people use for love. There are Weekend at Bernie’s/Jeremy Bentham corpses that you prop up for the cash, and there are “Rose for Emily” corpses you sleep with every night for decades because it’s too painful to admit that the best version of your life you ever glimpsed is not going to happen.
I’ve been spending more time in the last year exploring modern web stacks, and have started evangelizing for Svelte-Kit, which is a new-ish entry into the often-mystifying world of web frameworks. As of today, I’ve migrated this personal web site from Hugo, which I’ve been using the last couple years, to svelte-kit. Let me know if you encounter any broken links, unexpected behavior, accessibility issues, etc. I figured here I’d give a brief explanation of why svelte-kit, and how I did a Hugo-to-svelte-kit migration.
Scott Enderle is one of the rare people whose Twitter pages I frequently visit, apropos of nothing, just to read in reverse. A few months ago, I realized he had at some point changed his profile to include the two words “increasingly stealthy.” He had told me he had cancer months earlier, warning that he might occasionally drop out of communication on a project we were working on. I didn’t then parse out all the other details of the page—that he had replaced his Twitter mugshot with a photo of a tree reaching to the sky, that the last retweet was my friend Johanna introducing a journal issue about “interpretive difficulty”—the problems literary scholars, for all their struggles to make sense, simply can’t solve. I only knew—and immediately stuffed down the knowledge—that things must have gotten worse.
This article in the New Yorker about the end of genre prompts me to share a theory I’ve had for a year or so that models at Spotify, Netflix, etc, are most likely not just removing artificial silos that old media companies imposed on us, but actively destroying genre without much pushback. I’m curious what you think.
I’ve been yammering online about the distinctions between different entities in the landscape of digital publishing and access, especially for digital scholarship on text. So I’ve collected everything I’ve learned over the last 10 years into one, handy-to-use, chart on a 10-year-old meme. The big points here are:
I mentioned earlier that I’ve been doing some work on the old Bookworm project as I see that there’s nothing else that occupies quite the same spot in the world of public-facing, nonconsumptive text tools.
I’ve recently been getting pretty far into the weeds about what the future of data programming is going to look like. I use pandas and dplyr in python and R respectively. But I’m starting to see the shape of something that’s interesting coming down the pike. I’ve been working on a project that involves scatterplot visualizations at a massive scale–up to 1 billion points sent to the browser. In doing this, two things have become clear:
I used to blog everything that I did about a project like Bookworm, but have got out of the habit. There are some useful changes coming through the pipeline, so I thought I’d try to keep track of them, partly to update on some of the more widely used installations and partly
I last looked at the H-Net job numbers about a month ago.
Since then, the news isn’t exactly good, but it’s also probably as good as anyone could expect. For most of September and October, history jobs were at about 25% of their average for the 2010s; this was slightly worse than we’re seeing in the approximate numbers in–for instance–science jobs, where new job openings are at about 30% of their normal levels (Thanks to Dylan Ruediger at the AHA for passing along that link.)
Out of a train-wreck curiosity about what’s been happening to the historical profession, I’ve been watching the numbers on tenure-track hiring as posted on H-Net, one of the major venues for listing history jobs.
[Update 10-2: switching to US and Canada only. An earlier version of this included other countries, even though I said it didn’t.]
Every year, I run the numbers to see how college degrees are changing. The Department of Education released this summer the figures for 2019; these and next year’s are probably the least important that we’ll ever see, since they capture the weird period as the 2008 recession’s shakeout was wrapping up but before COVID-19 upended everything once again. But for completism, it’s worth seeing how things changed.
Ranking Graduate Programs
While I was choosing graduate programs back in 2005, I decided to come up with my own ranking system. I had been reading about the Google PageRank algorithm, which essentially imagines the web as a bunch of random browsing sessions that rank pages based on the likelihood that you–after clicking around at random for a few years–will end up on any given page. It occurred to me that you could model graduate school rankings the same way. It’s essentially a four-step process:
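The random-browsing idea itself is easy to sketch in code. This is a generic power-iteration PageRank, not the four-step ranking procedure itself; the `hires` data, the meaning I give the edges (department A hiring a graduate of program B), and the damping value of 0.85 are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping node -> list of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a baseline share from "random jumps".
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical data: an edge A -> B means department A hired a graduate of program B.
hires = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(hires)
for program, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(program, round(score, 3))
```

In this toy graph, program C ends up on top because every other department links to it, and D lands at the bottom because nobody links to D at all.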
As I often do, I’m going to pull away from various forms of Internet reading/engagement through Lent. This year, this brings to mind one of my favorite stray observations about digital libraries that I’ve never posted anywhere.
As part of the 2016 Republican Primary, Jeb! Bush released a website enabling exploration of e-mails related to his official accounts as governor of Florida in the early 2000s. This whole sentence has an antiquity to it; the idea of pre-emptive disclosure (in large part to contrast with his presumed general election opponent, Hillary Clinton) seems hopelessly antique. And at the time, it was criticized for accidentally disclosing all sorts of personal information, both stories and Social Security Numbers. It did not make Jeb! president. Anyhow, back then I downloaded Jeb!’s e-mails–and Hillary’s–to think about what sort of stuff historians will do with these records in the future.
(This is a talk from a January 2019 panel at the annual meeting of the American Historical Association. You probably need to know, to read it, that the MLA conference was simultaneously taking place about 20 blocks north.)
Since 2010, I’ve done most of my web hosting the way that the Internet was built to facilitate: from a computer under the desk in my office. This worked extremely well for me, and made it possible to rapidly prototype a lot of websites serving large amounts of data which could then stay up indefinitely; I have a curmudgeonly resistance to cloud servers, although I have used them a bit in the last few years (mostly for course websites where I wanted to keep student information separate from the big stew.)
Some news: in September, I’ll be starting a new job as Director of Digital Humanities at NYU. There’s a wide variety of exciting work going on across the Faculty of Arts and Sciences, which is where my work will be based; and the university as a whole has an amazing array of programs that might be called “Digital Humanities” at another university, as well as an exciting new center for Data Science. I’ll be helping the humanities better use all the advantages offered in this landscape. I’ll also be teaching as a clinical associate professor in the history department.
Critical Inquiry has posted an article by Nan Da offering a critique of some subset of digital humanities that she calls “Computational Literary Studies,” or CLS. The premise of the article is to demonstrate the poverty of the field by showing that the new structure of CLS is easily dismantled by the master’s own tools. It appears to have succeeded enough at gaining attention that it clearly does some kind of work far outsize to the merits of the article itself.
I wrote this year’s report on history majors for the American Historical Association’s magazine, Perspectives on History; it takes a medium-term view of the significant hit the history major has taken since the 2008 financial crisis. You can read it here.
As part of the Creating Data project, I’ve been doing a lot of work lately with interactive scatterplots. The most interesting of them is this one about the full Hathi collection. But I’ve posted a few more I want to link to from here:
I have a new article on dimensionality reduction on massive digital libraries this month. Because it’s a technique with applications beyond the specific tasks outlined there, I want to link to a few things here.
The article in Cultural Analytics.
Instructions for best using those features for your own projects in Creating Data.
I’m switching this site over from Wordpress to Hugo, which makes it easier for me to maintain.
It may also confuse the RSS feed a bit. This should hopefully be a one-time occurrence.
I have a new article in the Atlantic about declining numbers for humanities majors.
I put up a new post at Sapping Attention about . In short, it’s been bad enough to make me recant earlier statements of mine about the long-term health of the humanities discipline.
This is some real inside baseball; I think only two or three people will be interested in this post. But I’m hoping to get one of them to act out or criticize a quick idea. This started as a comment on Scott Enderle’s blog, but then I realized that Andrew Goldstone doesn’t have comments for the parts pertaining to him… Anyway.
Andrew Piper announced yesterday that the McGill text lab is releasing their corpus of modern novels in three languages. One of my first thoughts with any corpus is: what existing Bookworm methods might add some value here? It only took about ten minutes to write the code to import it into a bookworm; the challenge is figuring out how methods developed for millions of books can be useful on a set of just 450.
A first pass at understanding the potential of the Hansard corpus through a Bookworm browser.
I’ve divided up the native XML by using the intrinsic speaker tag into a variety of individual speeches.
A “speech” can be very short; on average, each one in the Hansard corpus is 225 words.
My post on ‘rejecting the gender binary’ showed a way to use word2vec models (or, in theory, any word embedding model) to find paired gendered words–that is, words that mean the same thing except that one is used in the context of women, and the other in the context of men.
My last post provided a general introduction to the new word embedding models of language (WEMs), and introduced an R package for easily performing basic operations on them. It was geared mostly towards people in the Digital Humanities community. This post looks more closely at a single word2vec model I’ve trained, on about 14 million reviews of faculty members from ratemyprofessors.com,
To be precise: it is a 500-dimensional skip-gram model with window of about 12 on lowercased, punctuation-free text using the original word2vec C code. I’ve then heavily culled the vocabulary to remove words that usually appear uppercased, on the assumption that they are proper nouns.
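The culling step can be sketched roughly like this. It is only an illustration of the idea, not the code used for the model: the function name, the 0.5 threshold, and the sample sentence are all assumptions, and a real pass would also want to discount sentence-initial capitals, which this version ignores.

```python
from collections import Counter


def usually_capitalized(tokens, threshold=0.5):
    """Return lowercased words whose share of capitalized occurrences exceeds threshold."""
    total = Counter()
    capitalized = Counter()
    for token in tokens:
        key = token.lower()
        total[key] += 1
        if token[:1].isupper():
            capitalized[key] += 1
    return {word for word in total if capitalized[word] / total[word] > threshold}


# Hypothetical sample text, counted BEFORE lowercasing so that case is still visible.
tokens = "Professor Smith was great but Smith grades hard and the professor was late".split()
proper_nouns = usually_capitalized(tokens)

# Drop the presumed proper nouns from the (lowercased) model vocabulary.
vocabulary = {token.lower() for token in tokens}
culled_vocabulary = vocabulary - proper_nouns
print(sorted(proper_nouns))
```

Here "smith" is dropped because it is always capitalized, while "professor" survives because it is capitalized only half the time.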
Recent advances in vector-space representations of vocabularies have created an extremely interesting set of opportunities for digital humanists. These models, known collectively as word embedding models, may hold nearly as many possibilities for digital humanists modeling texts as do topic models. Yet although they’re gaining some headway, they remain far less used than other methods (such as modeling a text as a network of words based on co-occurrence) that have considerably less flexibility. “As useful as topic modeling” is a large claim, given that topic models are used so widely. DHers use topic models because it seems at least possible that each individual topic can offer a useful operationalization of some basic and real element of humanities vocabulary: topics (Blei), themes (Jockers), or discourses (Underwood/Rhody).
Or, more tongue in cheek, trade routes (Schmidt)
The convoluted language is because there are two major methods, and not a single algorithm that unites them. Word2vec uses neural networks, while the GloVe method maximizes a function across a word-word matrix. The differences in methods between them aren’t worth going into in an introduction. Suffice it to say that Word2vec was first, GloVe is more clearly theorized, but they have various tradeoffs in performance and efficacy in building a model. My general take on the literature so far is that whatever differences there are in quality of the final models tend to be swamped by the differences set by choices of hyperparameters.
There’s no full description of the D3 bookworm package yet, because it’s still something of a moving target.
But Abby Mullen wanted to know what the different possibilities were for charts through the API, so I thought it was time to give a quick tour.
Core chart types
Bookworm 0.4 is now released on github. It contains a number of improvements to the code from over the summer. It makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collections of texts based on the experience of many using it so far. All the stages: installation, configuration, and testing are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using vagrant.)
This post is just kind of playing around in code, rather than any particular argument. It shows the outlines of using the features stored in a Bookworm for all sorts of machine learning, by testing how well a logistic regression classifier can predict IMDB genre based on the subtitles of television episodes.
I just saw Matt Wilkens’ talk at the Digital Humanities conference on places mentioned in books; I wanted to put up, mostly for him, a quick stab at some of the raw data running the equivalents on my movie bookworm.
I’ve gotten a couple e-mails this week from people asking advice about what sort of computers they should buy for digital humanities research. That makes me think there aren’t enough resources online for this, so I’m posting my general advice here. (For some solid other perspectives, see here). For keyword optimization I’m calling this post “digital humanities.” But, obviously, I really mean the subset that is humanities computing, what I tend to call humanities data analysis. [Edit: To be clear, ] Moreover, the guidelines here are specifically tailored for text analysis; if you are working with images, you’ll have somewhat different needs (in particular, you may need a better graphics card). If you do GIS, god help you. I don’t do any serious social network analysis, but I think the guidelines below should work relatively well with Gephi.
This is a quick post to share some ideas for interacting with the data underlying the recent article by Ted Underwood and Jordan Sellers on the pace of change in literary standards for poetry.
Here are some interactives I’ve made in preparation for my talk at the Literary Lab at Stanford on Tuesday on plot arcs in television shows based on underlying language.
This is sort of in lieu of a handout for the talk, so some elements may not make much sense if you aren’t in the room.
Even if you think you don’t know Usenet, you probably do. It’s the Cambrian explosion of the modern Internet, among the first places that an online culture emerged, but modern enough that it can seamlessly blend into the contemporary web. (I was recently trying to work out through Google where I might buy a clavichord in Boston; my hopes were briefly raised about one particular seller until I realized that the modern-looking Google Groups page I was reading was actually a presentation of a discussion from the Usenet archives in 1992.)
Just a day after launching this blog (RSS feed, by the way, is now up here) I came across a perfect little example question to look at. The Guardian ran an article about appearance on teaching evaluations that touches on some issues that my Rate My Professor Bookworm can answer, with a few new interactive charts.
Though more and more outside groups are starting to adopt Bookworm for their own projects, I haven’t yet written quite as much as I’d like about how it should work. This blog is an attempt to rectify that, and begin to explain how a combination of blogging software, interactive textual visualizations, and an exploratory data analysis API for bag-of-words models can make it possible to quickly and usefully share texts through a Bookworm installation.
Practically everyone in Digital Humanities has been posting increasingly epistemological reflections on Matt Jockers’ Syuzhet package since Annie Swafford posted a set of critiques of its assumptions. I’ve been drafting and redrafting one myself. One of the major reasons I haven’t is that the obligatory list of links keeps growing. Suffice it to say that this here is not a broad methodological disputation, but rather a single idea crystallized after reading Scott Enderle on “sine waves of sentiment.” I’ll say what this all means for the epistemology of the Digital Humanities in a different post, to the extent that that’s helpful.
Just some quick FAQs on my professor evaluations visualization: adding new ones to the front, so start with 1 if you want the important ones.
-3 (addition): The largest and in many ways most interesting confound on this data is the gender of the reviewer. This is not available in the set, and there is strong reason to think that men tend to have more men in their classes and women more women. A lot of this effect is solved by breaking down by discipline, where faculty and student gender breakdowns are probably similar; but even within disciplines, I think the effect exists. (Because more women teach at women’s colleges, because men teach subjects like military history that male students tend to over-take, etc). Some results may be entirely due to this phenomenon (for instance, the overuse of “the” in reviews of male professors). But even if it were possible to adjust for this, it would only be partially justified. If women are reviewed differently because a different sort of student takes their courses, the fact of the difference in their evaluations remains.
I promised Matt Jockers I’d put together a slightly longer explanation of the weird constraints I’ve imposed on myself for topic models in the Bookworm system, like those I used to look at the breakdown of typical TV show episode structures. So here they are.
Just a quick follow-up to my post from last month on using Markdown for writing lectures. The github repository for implementing this strategy is now online.
I’ve been thinking a little more about how to work with the topic modeling extension I recently built for bookworm. (I’m curious if any of those running installations want to try it on their own corpus.) With the movie corpus, it is most interesting split across genre; but there are definite temporal dimensions as well. As I’ve said before, I have issues with the widespread practice of just plotting trends over time; and indeed, for the movie model I ran, nothing particularly interesting pops out. (I invite you, of course, to tell me how it is interesting.)
I’ve been seeing how deeply we could integrate topic models into the underlying Bookworm architecture a bit lately.
My own chief interest in this, because I tend to be a little wary of topic models in general, is in the possibility for Bookworm to act as a diagnostic tool internally for topic models. I don’t think simply plotting description absent any analysis of the underlying token composition of topics is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.
This is a post about several different things, but maybe it’s got something for everyone. It starts with 1) some thoughts on why we want comparisons between seasons of the Simpsons, hits on 2) some previews of some yet-more-interesting Bookworm browsers out there, then 3) digs into some meaty comparisons about what changes about the Simpsons over time, before finally 4) talking about the internal story structure of the Simpsons and what these tools can tell us about narrative formalism, and maybe why I’d care.
I thought it would be worth documenting the difficulty (or lack thereof) in building a Bookworm on a small corpus: I’ve been reading too much lately about the Simpsons thanks to the FX marathon, so figured I’d spend a couple hours making it possible to check for changing language in the longest running TV show of all time.
Here’s a very technical, but kind of fun, problem: what’s the optimal order for a list of geographical elements, like the states of the USA?
If you’re just here from the future, and don’t care about the details, here’s my favorite answer right now:
String distance measurements are useful for cleaning up the sort of messy data that comes from multiple sources.
There are a bunch of string distance algorithms, which usually rely on some form of calculations about the similarities of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalizes changes to the most information-laden parts of the text more heavily than to the parts that are filler.
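One simple way to get that behavior, as a sketch rather than the measure this post goes on to develop, is to compare whole tokens instead of characters and to weight each token by how rare it is in the collection. The version below uses a smoothed inverse document frequency with a weighted Jaccard distance; the function names and the list of publisher names are hypothetical.

```python
import math
from collections import Counter


def idf_weights(strings):
    """Weight each token by its rarity across the collection (smoothed IDF)."""
    doc_freq = Counter()
    for s in strings:
        doc_freq.update(set(s.lower().split()))
    n = len(strings)
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def weighted_distance(a, b, weights, default=1.0):
    """Weighted Jaccard distance over tokens: 0.0 for identical token sets, 1.0 for disjoint ones."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    shared = sum(weights.get(tok, default) for tok in tokens_a & tokens_b)
    total = sum(weights.get(tok, default) for tok in union)
    return 1.0 - shared / total


# Hypothetical messy publisher names pooled from several sources.
names = [
    "Harper and Brothers",
    "Scribner and Sons",
    "Macmillan and Company",
    "Little Brown and Company",
    "Harper Brothers",
]
weights = idf_weights(names)

# Losing the filler word "and" should cost less than losing the informative word "Harper".
drop_filler = weighted_distance("Harper and Brothers", "Harper Brothers", weights)
drop_name = weighted_distance("Harper and Brothers", "and Brothers", weights)
print(round(drop_filler, 3), round(drop_name, 3))
```

Because "and" shows up in most of the names, it carries little weight, so dropping it moves two strings apart less than dropping "Harper" does.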