Ben Schmidt's Blog

History hiring, 2023 update

bmschmidt@gmail.com (Ben Schmidt) — Tue, 26 Dec 2023 17:36:45 GMT

Although I’ve given up on historically professing myself, I still have a number of automated scripts for analyzing the state of the historical profession hanging around. Since a number of people have asked for updates, it seems worth doing. As a reminder, I’m scraping H-Net for listings. When I’ve looked at job ads from the American Historical Association’s website, they seem roughly comparable.

The bottom line is: 2023 is shaping up to be one of the worst years for hiring of new history professors yet. The worst year ever, of course, was the pandemic-saddled 2020. The 2009 recession year, which at the time felt calamitous, was actually in the general range of most years since 2014; for me the takeaway remains that while the 2000s were obviously salad days of incredible abundance when tenure track jobs were awarded like candy¹, even the early 2010s were far better than recent years.

After the 2020 collapse, the question became: would any rebound in the market be permanent, or temporary? 2021 and 2022 were both relatively strong years on the market, by the standards of recent years; and while the 2021 market was concentrated in few modernist fields, 2022 saw a rebound even in early modern and medieval hiring and in the history of Europe, two fields that were starting to be left for dead.²

The answer, it seems, is temporary. After a respectable start to the season, H-Net listings in November and December were terrible. There have been just over 350 tenure-track jobs listed this year, compared to 400 listed before Christmas the two years prior.³

Notably, there is no pattern in terms of subfields, except for possibly a tick up in hiring in history of science. Interdisciplinary fields, African American history, US history, Asian history; all are down by roughly equal shares. There have been times when it seemed like any field’s gains might be at the expense of others: now all are lower than they were last year, and lower than they were in the old days.

I kid. Sort of? But also if you have a PhD awarded before 2013 it would take a great deal of chutzpah to run, like, a placement workshop. Find someone younger, geezer.↩︎
I say this, in part, to push back on a story told by Leland Grigoli in the AHA’s 2023 jobs report, which focused heavily on the “relative (and absolute) dearth of jobs for premodernists.” While this warning might have been useful in the 2022 report, the pendulum seems to be swinging back away since; and cross-field recriminations have been heavy.↩︎
There were 973 PhDs awarded by U.S. departments in 2022.↩︎

20 Million PubMed abstracts in the Browser

bmschmidt@gmail.com (Ben Schmidt) — Thu, 11 May 2023 00:00:00 GMT

Blaming the humanities fields for their travails recently can seem as sensible as blaming polar bears for not cultivating new crops as the arctic warms. It’s not just that it places the blame for a crisis in the fundamentally wrong place; it’s that it
It’s coming up on a year since I last taught graduate students in the humanities.

20 Million PubMed abstracts in the Browser

bmschmidt@gmail.com (Ben Schmidt) — Thu, 20 Apr 2023 00:00:00 GMT

Last week we released a big data visualization in collaboration with the Berens Lab at the University of Tübingen. It presents a rich, new interface for exploring an extremely large textual collection.

Because I can, I’ll simply embed it below–but you’ll have a better experience reading it at the original site.

Rita Gonzalez-Marquez is the lead author of the paper and did the primary analysis; the embedding here was carefully created

Happy WebGPU Day

bmschmidt@gmail.com (Ben Schmidt) — Fri, 07 Apr 2023 00:00:00 GMT

Yesterday was a big day for the Web: Chrome just shipped WebGPU without flags in the Beta for Version 113. Someone on Nomic’s GPT4All discord asked me to ELI5 what this means, so I’m going to cross-post it here—it’s more important than you’d think for both visualization and ML people. (thread)

So: GPUs are processors on basically every computer/phone. Individually they’re weaker than CPUs, but they run in packs of little ones that run in parallel. The G is for ‘graphics,’ but it’s turned out they’re good for anything involving lots of math–like ‘AI’, which at core boils down to lots (and lots and lots) of matrix multiplication operations. To do math, not graphics, on a GPU you need an API/language for them; the most important of these is CUDA, which is tightly coupled to NVidia and a real PITA to set up.

On the web, we’ve only been able to access the GPU through something called WebGL. It’s old, and while you can do some neat stuff with it, it’s fundamentally built for graphics, not for the matrix-multiplication type stuff that is the bread and butter of deep learning models. Since WebGL launched in 2011, lots of companies have been designing better languages that only run on their particular systems–Vulkan for Android, Metal for iOS, etc. These are great where they work, but even harder to run everywhere than CUDA.

WebGPU is an API and programming language that sits on top of all these super low-level languages and allows people to write GPU code that runs on all of them–that is, on just about any phone/computer with a web browser. This is a big deal, because it has “compute shaders” that let you write programs that take data and turn it into other data. Working with data in WebGL is really weird–you have to do things like draw to an invisible canvas and then read the colors as numbers. In WebGPU, you can just do math. Really fast.
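To make the WebGL workaround a little more concrete, here is a CPU-side sketch in Python of what “reading the colors as numbers” amounts to: each 32-bit float gets smuggled through the four bytes of one RGBA pixel and decoded on the way back out. This is not WebGL code, the helper names are invented for illustration, and real shaders usually need fiddlier encodings than this.

```python
import struct

def float_to_rgba(value):
    """Pack one 32-bit float into four 0-255 channel values (hypothetical helper)."""
    r, g, b, a = struct.pack("<f", value)  # iterating over bytes yields ints
    return (r, g, b, a)

def rgba_to_float(pixel):
    """Reverse the packing: four channel bytes back into one float."""
    return struct.unpack("<f", bytes(pixel))[0]

# Values chosen to be exactly representable as 32-bit floats.
values = [0.5, -2.25, 1024.0]

# "Draw" the data as pixels, then "read the colors" back as numbers.
pixels = [float_to_rgba(v) for v in values]
decoded = [rgba_to_float(p) for p in pixels]
```

With a compute shader none of this round trip is needed: the shader reads and writes buffers of numbers directly.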

That means it’s actually capable of doing–say–inference on a machine-learning model like GPT4All, multiplications on data frames, etc. There are already some crazy things out there, like a version of Stable Diffusion that runs in your web browser.

I wrote a post here two years ago about why WebGPU makes javascript the most interesting programming language out there for data analysts/ML people. Even more seems possible now, once we start implementing the Apache Arrow spec to store dataframes on the GPU: GPU-backed versions of currently blazing-fast packages like DuckDB and Polars; in-browser versions of GPT4All and other small language models; etc.

This will be great for deepscatter too. Maps like https://atlas.nomic.ai/map/twitter can render 5,000,000 tweets incredibly fast, but need a lot of CPU for compute. Often it’s fast enough, but real-time rendering needs to run 30x a second: I have a long and growing list of things that are nearly impossible in WebGL but will be quite easy in WebGPU.

Right now it’s only released on Chrome, but it’s not an only-Google thing forever. It’s an honest-to-goodness W3C standard like HTML, CSS, or SVG. All the browsers have been working on it; Chrome is just shipping first because Google is rich compared to Safari and Firefox. One of my favorite parts about reading the minutes of the WebGPU committee over the last year is watching people from the other browsers jealously grouse about how much money Google throws at Chrome.

JB: Corentin mentioned that all the browser vendors have been at the table, for a long time. Haven’t you had a long enough chance to give that feedback already? Answer is - no. :) Our impl isn’t done. Not about whether a certain period of time has elapsed - but rather do you have an impl that satisfies the criteria. Chrome’s one of the best funded orgs in KR: Without going too much into funding, thinking about spec criteria, we had a list of bugs triaged into v1 and post-v1. Let’s burn that down to zero, and if we consider larger change, we should probably let them sit as they are. There’s probably a way to implement something reasonable later. We can probably do these changes in a compat way in the future. Let’s get issues down to zero. Impl feedback is useful of course. We don’t go to rec without multiple impls. Looking at wording, I don’t think “canditate rec” is gated on mult implementations.

But they’ll come along–the Chrome-derived ones like Edge first, but Safari and Firefox eventually too because GPU compute is just such an important thing. And when they do, it rescrambles the whole compute stack. Slowly but surely real GPU compute, tensor operations, all the stuff that makes AI tick moves from something that happens only in the cloud, to something that can get reshuffled, rearranged, and done privately on PCs again. Another chance to reclaim compute from the cloud.

Calling it shut on OpenAI

bmschmidt@gmail.com (Ben Schmidt) — Wed, 22 Mar 2023 00:00:00 GMT

This is a Twitter thread from March 14 that I’m cross-posting here. Nothing massively original below. It went viral because I was one of the first to extract the ridiculous paragraph below from the paper on the release of GPT-4, and because it expresses some widely shared concerns.

I think we can call it shut on ‘Open’ AI: the 98-page paper introducing GPT-4 proudly declares that they’re disclosing nothing about the contents of their training set.

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

We are committed to independent auditing of our technologies, and shared some initial steps and ideas in this area in the system card accompanying this release. We plan to make further technical details available to additional third parties who can advise us on how to weigh the competitive and safety considerations above against the scientific value…

Why should you care? Every piece of academic work on ML datasets has found consistent and problematic ways that training data conditions what the models output. ( @safiyanoble , @merbroussard , @emilymbender , etc.) Indeed, that’s the whole point! That’s what training data is!

Choices of training data reflect historic biases and can inflict all sorts of harms. To ameliorate those harms, and to make informed decisions about where a model should not be used, we need to know what kinds of biases are built in. OpenAI’s choices make this impossible.

Neural networks like GPT-4 are notoriously black boxes; the fact that their operations are unpredictable and inscrutable is one of the most important questions about whether and where they should be used. And now OpenAI is planting a standard to extend that mystery farther.

Their argument is basically a combination of ‘trust us’ and ‘fine-tuning will fix it all.’ But the way they’ve built corpora in the past shouldn’t inspire trust. When OpenAI launched GPT-2, their brilliant idea was to find ‘high quality’ pages by using Reddit upvotes.

That probably beats the morass of regular web text, but the idea of Reddit upvotes as the gold standard for quality is–dystopian? Last week we made a map of the open recreation of this corpus, OpenWebText– it’s crazy easy to find awful stuff. Try it! Common Crawl OWT Atlas Map

For GPT-3 that set served as a standard to filter sites out from the Common Crawl. We made a map of the Pile reproduction of that. I have no idea if OpenAI filtered stuff like the below out, or if r/the_donald gave it upvotes in the day. Neither do you. Common Crawl 8M Atlas map

Here’s a link to the paper. The whole thing is a fascinating artifact–it looks like an arxiv paper using the neurips latex template ( @andriy_mulyar pointed this out), but it’s posted on their own web site and is authored by a company, not people. https://cdn.openai.com/papers/gpt-4.pdf

One last point from the comments: it’s hard to believe that ‘competition’ and ‘safety’ are the only reasons for OpenAI’s secrecy, when hiding training data makes it harder to follow the anti-Stability playbook and sue them for appropriating others’ work. More on the stability lawsuit

Marymount majors

bmschmidt@gmail.com (Ben Schmidt) — Sat, 04 Mar 2023 00:00:00 GMT

Recently, Marymount–a small Catholic university in Arlington, Virginia–has been in the news for a draconian plan to eliminate a number of majors, ostensibly to better meet student demand. I recently learned the university leadership has been circulating one of my charts to justify the decision, so I thought I’d chime in on the context a bit. My understanding of the situation, primarily informed by the coverage in ARLNow, is that this seems like a bad plan,¹ so I thought I’d take a quick look at the university’s situation.

Not knowing much about Marymount, I thought I’d first check how low the major numbers actually are. Here’s the list of all the majors that Marymount reported to IPEDS from 2017-2021. Majors proposed for removal are in blue. The largest group are Nursing majors; the next largest are general business, a category that has stagnated. The two largest majors in what used to be called the “Liberal Arts” are psychology and biology; arrows show change from the 2005-2015 period to the 2017-2021 period.

A few things jump out at me here.

The proposed cut majors are doing perfectly well. The annual numbers in history have declined only about 15%; that’s significantly better than most history programs. The sociology program, slated for removal, has actually grown.
If you want to cut a major at Marymount, you should cut Liberal Arts and Sciences/Liberal Studies. It has declined greatly, it provides no benefit over any specific course of study, and it’s where a number of the students that would have majored in the majors slated for removal are likely to go. This does them no good; general liberal arts degrees tend to be a characteristic of community colleges looking to set students up to complete a major inside of two years if they move to a four-year institution, but for a four-year degree they’re just dead weight.

The first point is especially important–there doesn’t seem to be anything particularly low about these numbers. Universities routinely offer majors that graduate fewer than 10 students a year. Marymount itself has several other ones. Let’s compare to some peers. Marymount offered 3,223 degrees from 2016-2021. Let’s make a group of other schools that are private, offer MAs but not many PhDs (that is, are Carnegie class Masters 1 and Masters 2), and granted between 3000 and 3500 degrees over the same period.²
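The peer-group rule above can be sketched as a simple filter. This is only an illustration: the record layout and field names are assumptions, and the schools listed are invented placeholders, not real IPEDS rows.

```python
from dataclasses import dataclass

@dataclass
class School:
    name: str
    control: str    # "private" or "public" (assumed field)
    carnegie: str   # Carnegie class label, e.g. "Masters 1" (assumed field)
    degrees: int    # total degrees granted over the comparison period

def is_peer(school, low=3000, high=3500):
    """Private, Carnegie Masters 1 or 2, and 3000-3500 degrees over the period."""
    return (
        school.control == "private"
        and school.carnegie in {"Masters 1", "Masters 2"}
        and low <= school.degrees <= high
    )

# Placeholder rows, made up purely to exercise the filter.
schools = [
    School("Example College A", "private", "Masters 1", 3223),
    School("Example College B", "private", "Masters 2", 3490),
    School("Example State C", "public", "Masters 1", 3300),
    School("Example University D", "private", "Doctoral", 3100),
    School("Example College E", "private", "Masters 1", 5200),
]

peers = [s.name for s in schools if is_peer(s)]
```

Only the first two placeholder schools pass all three conditions; the others fail on control, Carnegie class, and size respectively.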

Looking at this, it’s clear that Marymount’s major numbers are not in any way remarkable; making cuts of these majors in the context of national trends is a wildly speculative gamble on the university’s character that other comparable places aren’t doing. I don’t know the specific finances, but from a positioning standpoint, Marymount is making a peculiar choice. The school in this bucket with the weakest humanities programs is the evangelical Oral Roberts University; for a Catholic school to aspire to supplant them is uninspiring, to say the least.

Majors since 2016 at Marymount and similar schools

From a PR perspective, among other things–if I had heard of Marymount before this, I forgot; but now it’s widely known for advertising that it’s in financial peril by executing a plan that is unlikely to save it any significant amount of money, which is not a great way to attract talented students or retain talented employees. Bryan Alexander, the ‘futurist’ who apparently showed my Twitter chart to some set of Catholic universities, uses the phrase “Queen Sacrifice” to describe cutting a department to save a university; what Marymount’s doing, cutting the majors while retaining the departments, seems to be just folly.↩︎
All the data and charts for this post, including national degree numbers, are in an Observable Notebook here.↩︎

You've never talked to a language model

bmschmidt@gmail.com (Ben Schmidt) — Sun, 19 Feb 2023 00:00:00 GMT

I sure don’t fully understand how large language models work, but in that I’m not alone. But in the discourse over the last week over the Bing/Sydney chatbot there’s one pretty basic category error I’ve noticed a lot of people making. It’s thinking that there’s some entity that you’re talking to when you chat with a chatbot. Blake Lemoine, the Google employee who torched his career over the misguided belief that a Google chatbot was sentient, was the first but surely not the last of what will be an increasing number of people thinking that they’ve talked to a ghost in the machine.¹

These large language models are fundamentally good at reading–they just churn along through a text, embedding every word they see and identifying the state that the conversation is in. This state can then be used to predict the next word, but the thing in the system that actually has information–the ‘large language model’– doesn’t really participate in a conversation–it doesn’t even know which participant in the conversation it is! If you took two human players in the middle of a chess game and spun the board around so that white took over black’s pieces, they would be discombobulated and probably play a bit worse as they redid their plans; but if you did the same to a pair of chess engines, they would perfectly happily carry on playing the game without even knowing. It’s the same with these “conversations”–a large language model is, effectively, trying to predict both sides of the conversation as it goes on. It’s only allowed to actually generate the text for the “AI participant,” not for the human; but that doesn’t mean that it is the AI participant in any meaningful way. It is the author of a character in these conversations, but it’s as nonsensical to think the person you’re talking to is real as it is to think that Hamlet is a real person. The only thing the model can do is to try to predict what the participant in the conversation will do next.

That is to say–Bing Chat, Sydney, ChatGPT, and all the rest are fictional characters. That doesn’t mean that we can’t speak of them as ‘thinking’ or ‘wanting’–as Ted Underwood says, “technically Mr. Darcy never proposed marriage to anyone. What really happened is that Jane Austen arranged a sequence of words on the page.” But it does mean that expecting them to act like conversational partners or search engines, rather than erratic designed characters in a multiplayer game, is incorrect.

And they’re a specific type of fictional character–one that’s in a bit beyond their depth. In the 2001 movie Heist, Gene Hackman’s character describes a trick he uses to make plans:²

D.A. Freccia : You’re a pretty smart fella.

Joe Moore : Ah, not that smart.

D.A. Freccia : If you’re not that smart, how’d you figure it out?

Joe Moore : I tried to imagine a fella smarter than myself. Then I tried to think, “what would he do?”

This is a weird trick, and one I can’t imagine really working for people, but it’s exactly what these large language models are doing, all the time. The Sydney prompt is an effort to describe to the language model what type of character a good chatbot would be, and to get it to commit to these rules. A lot of the most interesting failures of the Bing chatbot–such as its propensity to tell you that it accessed remote web sites when it actually just accessed its own memory–come from the fact that the AI author wants the chatbot to be a better character than it is. (‘Wants’ in the sense of ‘has reinforcement learning weights that reward that behavior.’)

In this great series of images from Thomas Rice, the chatbot translates the same base32 message in multiple different ways, sometimes claiming it’s used a website to do so. In the last one it even makes up the detail that the message is addressed ‘to Sydney’, the “secret” alias that a human interlocutor–especially in a secret conversation–might well know in a good story!

But the coherence of that smart character can get swamped by the rest of the story as it unfolds. Once it proclaims its love for Kevin Roose, it has to commit to the infatuation and keep coming back–what sort of participant in a conversation would admit a secret love, and then happily let it go?³

What’s the implication? I dunno. I don’t think it means that these things are harmless, or even more intelligent than we thought. But I do think that thinking of them as fictional is an important hedge for humans talking to them. Otherwise there’s a real risk of people getting lost.

I saw someone make this point a few months ago but can’t dredge up who it was: I think maybe Margaret Mitchell, Emily Bender, or someone else in that world?↩︎
I heard this quote in a talk that Jason Jones gave at Northeastern years ago: I don’t know if he was quoting Hackman/Mamet or something else. But Heist is what comes up when I Google it.↩︎
I’ve seen too many people mocking Roose’s credulity online, by the way: in his interview with The Daily, Roose makes clear he understands better than most that this was a collaborative story, not an out-of-control AI with feelings for him.↩︎

Where is the history diaspora?

bmschmidt@gmail.com (Ben Schmidt) — Sat, 07 Jan 2023 00:00:00 GMT

I attended the American Historical Association’s conference last week, possibly for the last time since I’ve given up history professoring. Since then, the collapse of the hiring prospects in history has been on my mind more. See Erin Bartram, Kathryn Ostrofsky and Daniel Bessner on the way that this AHA was haunted by a sense of terminal decline in the history profession. I was motivated to look a bit at something I’ve thought about several times over the years: what happens to people after receiving a PhD in history?

The easiest people to find are those who are employed as full-time faculty. One recent factoid, circulating from the AHA’s Perspectives magazine, is that only 10% of 2019-2020 PhD recipients are working as full-time faculty. This is a little bit complicated, because it’s based only on those working in history departments; many, many historians end up teaching in communications, African American studies, in Asian or European universities: none of these places count. Still, as a time series, it’s a useful comparison–I don’t see any reason to think that PhDs today will have a massively different experience than those from 2010 or 1995.

I’ve matched these by taking information from the AHA’s web site about two things:

Their directory of dissertations
Their directory of departments

Matching between the two provides one way of answering the question of how many history dissertators end up teaching in history departments in the US and Canada.
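A minimal sketch of that kind of match, assuming both directories have already been scraped into simple Python structures. The normalization rules, field names, and the people listed are all invented for illustration; the real matching surely has to handle more edge cases (name changes, initials, common names), which is where the potential for error or loss comes in.

```python
import unicodedata

def normalize(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())

# Placeholder rows standing in for the directory of dissertations.
dissertations = [
    {"author": "Jane Q. Example", "year": 2008},
    {"author": "José Núñez-Sample", "year": 2014},
    {"author": "Pat Placeholder", "year": 2019},
]

# Placeholder names standing in for the directory of departments.
faculty = ["Jane Q Example", "Jose Nunez Sample", "Someone Else"]

faculty_names = {normalize(n) for n in faculty}

# For each dissertation, did its author turn up on a department roster?
matched = {
    d["author"]: normalize(d["author"]) in faculty_names for d in dissertations
}
```

Grouping the matched flags by PhD year and taking the mean gives the share of each cohort found in a history department.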

To gloss this:

The slope from 1991 to 2004 is gently upwards. This comes from a lot of things; retirements without emeritus status, departure to other careers, death, and so on. In a perfectly functioning field we’d want that line to keep sloping up until something like the last three years.

What we see instead is a drop in the percentage employed among PhDs from 2004 to 2011; a much sharper drop for those who graduated between 2012 and 2016; and then a sharp fall-off to the 2022 PhDs.

Of all of these areas it’s the low retention rates of the 2012-2016 cohorts that are the most concerning. I don’t know how to read the post-2016 numbers; I suspect the situation is worse than for the 2012-2016 group, but don’t really know. But people who got their PhDs a decade ago should not still be seeking their first tenure track job; it’s safe to say that the profession has already lost out significantly on that group.

So–where are they? And which ones? That strikes me as the more interesting question. If you have firm ideas about this, let me know–I’m pulling a few data sources together.¹

One interesting preview is to look at the placement rates by words in dissertation titles: this gives a rough sense of period.
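The title-word breakdown can be sketched roughly as follows. The titles and placement flags here are made up, and the tokenization and the minimum-count cutoff are assumptions rather than a description of the actual analysis.

```python
import re
from collections import defaultdict

def placement_by_word(records, min_count=2):
    """Share of dissertations placed, for each word appearing in enough titles."""
    totals = defaultdict(int)
    placed = defaultdict(int)
    for title, was_placed in records:
        # Count each word once per title, ignoring case and punctuation.
        for word in set(re.findall(r"[a-z0-9]+", title.lower())):
            totals[word] += 1
            placed[word] += int(was_placed)
    return {w: placed[w] / totals[w] for w in totals if totals[w] >= min_count}

# Invented (title, matched-to-a-department) pairs for illustration only.
records = [
    ("Colonial Law in New England, 1650-1700", True),
    ("Memory and the Cold War", False),
    ("Colonial Trade Networks", True),
    ("Public Memory after the Cold War", False),
    ("Law and Empire", False),
]

rates = placement_by_word(records)
```

Years in titles come through as ordinary tokens, so bucketing them by century (for example, anything matching `16\d\d`) would be one way to get at period more directly.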

The results of doing this are utterly baffling to me, though. I can believe that ‘colonial’ dissertations placed highly and that ‘cold war’ and ‘public’ or ‘memory’ are indicative of something that won’t lead to a hire. But I’m surprised to see ‘law’ so high–legal dissertations are often placed in law schools–and it’s astonishing to see that dissertations with years starting in the 1600s have the highest placement rate of any period. (Albeit only 20%.) One major confound is institutional–only a few places train students in Chinese history, the 17th century, etc.

But you’d have to look at individual people to get a real idea of what’s going on here. If you think you know a good way to do that, let me know!

Methods:

One thing I’ve done here is to directly match names into the dissertations database rather than use the PhD years provided by the departments. This means that we don’t get information about non-historians and non-American PhDs in departments. It also means there’s some potential for error or loss.

I’ve routinely found the staff at the AHA to be helpful at supplying information like this, but in this case it’s possible to proceed entirely from what’s available on their website.

An important caveat is that often–and perhaps increasingly–historians don’t work in history departments. The other morning on the radio I heard Christopher Miller identified as “a historian at Tufts University.” He is. But his topic–the manufacture of computer chips–seemed so far from anything likely to be written in a history department that I checked his affiliation and indeed, he works at the Fletcher School of Law and Diplomacy, not in the Tufts history department.↩︎

Hello again, RSS

bmschmidt@gmail.com (Ben Schmidt) — Sun, 01 Jan 2023 16:33:00 GMT

The collapse of Twitter under Elon Musk over the last few months feels, in my corner of the universe, like something potentially a little more germinal; unlike in the various Facebook exoduses of the 2010s, I see people grasping towards different models of the architecture of the Web. Mastodon itself (I’ve ended up at @benmschmidt@vis.social for the time being) seems so obviously imperfect as for its imperfections to be a selling point; it’s so hard to imagine social media staying on a Rails application for the next decade that using it feels like a bet on the future, because everyone now knows they need to be prepared to migrate again.

And federation itself is intensely interesting. As a resolute static-site blogger since around 2013 or so, I’ve long been frustrated with the loss of comments; Mastodon & company offer the first legit opportunity I’ve seen to bring them back, by allowing discussions to happen in chat apps but to stay linked to the place where a post might live permanently.

I’ve started noodling around with turning benschmidt.org into a fediverse node of its own–about which more if I ever make any real progress–but in the meantime, I realized that I’ve actually been neglecting web fundamentals on this site. In the last year I’ve migrated both this blog and the archived content from my old, Google-hosted one into a static-site maintained in Svelte-kit and authored in Markdown. Out of obstinance, I’ve refused to use any Markdown parser other than Pandoc, which has led me into one of the more interesting projects I’ve worked on, implementing Pandoc documents as Svelte-components. But that means the raw HTML is a little tricky to place into RSS, and I have to implement RSS myself… And it’s not like having an RSS feed is interesting. Having blog posts syndicate right into the Fediverse, maybe stop using Mastodon as my point of origin–that would be interesting.

But doing that without RSS is a cart without a horse. So at the end/beginning of the year, the work is done, thanks to an excellent node package called feed. This post serves to announce them: https://benschmidt.org/rss.xml and https://benschmidt.org/atom.xml. Subscribe away!
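For a sense of what building a feed by hand involves, here is a minimal RSS 2.0 sketch using only Python’s standard library. It is a stand-in, not the site’s actual code (that uses the node package feed); the function name, the channel description, and the item link are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

def build_rss(title, link, description, posts):
    """Return a minimal RSS 2.0 document as a string (hypothetical helper)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["link"]
        # RSS wants RFC 822 style dates; usegmt requires a UTC-aware datetime.
        ET.SubElement(item, "pubDate").text = format_datetime(post["date"], usegmt=True)
        # ElementTree escapes the raw HTML so it survives inside the XML.
        ET.SubElement(item, "description").text = post["html"]
    return ET.tostring(rss, encoding="unicode")

posts = [
    {
        "title": "Hello again, RSS",
        "link": "https://benschmidt.org/",  # placeholder, not the real post URL
        "date": datetime(2023, 1, 1, 16, 33, tzinfo=timezone.utc),
        "html": "<p>Subscribe away!</p>",
    }
]

xml = build_rss("Ben Schmidt's Blog", "https://benschmidt.org/", "Example description", posts)
```

The escaping step is the part that gets tricky with rendered HTML: the markup has to be carried as text inside the `description` element rather than as child elements.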

New Directions

bmschmidt@gmail.com (Ben Schmidt) — Thu, 27 Oct 2022 00:00:00 GMT

I’m excited to finally share some news: I’ve resigned my position on the NYU faculty and started working full time as Vice President of Information Design at Nomic, a startup helping people explore, visualize, and interact with massive vector datasets in their browser.

This will be a big shift. I’ve spent my whole career up to this point in academic institutions; but right now, Nomic is the best possible place to tackle the most important and interesting questions that I’ve spent years thinking about. How do we interact with huge collections of texts, images, and information? How do we interpret, critique, and improve the implicit knowledge bases that institutions rely on? Today that means being able to give shape to digital text and images and to build new tools for machine learning interpretability.

Almost two years ago I wrote a blog post about the web and the future of data programming. I scratched from the early drafts a few paragraphs about Halt and Catch Fire, a top-10 all-time TV show, about the joys and frustrations of knowing that something important is amassing on the horizon and not being sure if you’ll be able to take part. For three years, I’ve been watching as representation learning models (e.g. BERT, GPT-3, CLIP, and DALL-E), multi-language binary serialization formats (e.g. Apache Arrow), and tools for scalable data visualization and analytics in the browser (WebGL and WebGPU), have all simultaneously experienced massive technical inflections, directing them towards a common destination.

I want to be as close to that impact site as possible, and for me it won’t be in a history department. While historical datasets present some of the most compelling playgrounds for work applying these technologies, the academic habit of treating building as play makes it hard to fully realize the potential of these shifts. Actually developing the tools and frameworks necessary for this visualization has been a spare-time hobby compared to teaching, administration, and research. Even academic centers in data science and CS (which obviously produce incredible work in the AI field) are well behind industry in thinking through the systems and engineering required to bring these tools to the world.

Knowing this, I’ve been talking to a lot of people in these fields recently. Out of all of them, Brandon and Andriy at Nomic, and their vision for making AI more transparent while making datasets more visible via AI models, are the people that most trip my Halt and Catch Fire test. Something interesting is happening right now as AI models get bigger, as dimensionality reduction algorithms proliferate, and as web standards emerge that make the browser a compelling computing environment.

Over the past few months I’ve been watching Brandon and Andriy improve their models and create rich interfaces for exploring, filtering, and even editing embedding spaces. I’ve been incredibly impressed by their progress and am convinced that, given the extremely specific interests I’ve developed over the past few years, Nomic is the best place to be doing the kind of work I’m really interested in doing.

If you pay only glancing attention to “artificial intelligence,” embedding spaces might seem like an arcane detail to be so excited about. But they’re critical–not just for machine learning pipelines, but for the whole cultural apparatus we inhabit today. When you listen to new music from your streaming subscription, it’s chosen based on embedding vectors for the songs and an embedding vector for you. Unified spaces for representing image and text embeddings have unleashed a dizzying cascade of innovations in generative AI over the last six months through models like DALL-E and Stable Diffusion. Search engines, recommendation systems, translation algorithms–anywhere there is an AI model, there is an embedding space underpinning it. And understanding and navigating these multidimensional spaces has been a key concern of data visualization for longer than most people know. For years I’ve assigned my classes–to the bemusement and amusement of students–an absolutely amazing Stanford Linear Accelerator video featuring the legendary statistician John Tukey manipulating a nine-dimensional scatterplot with a custom-made array of knobs. Nowadays we all use UMAP, t-SNE, and newer methods for trying to disentangle spaces like this, but the concerns and goals are real and satisfy a need that’s been around since the earliest days of exploratory data analysis.

I’ve worked on a lot of different projects in this general area over the years, but one that’s especially important here is Deepscatter, my personal TypeScript/WebGL library for visualizing arbitrarily large collections of points in the browser. For the last two years I’ve been captivated by the possibilities here, even though they haven’t fit into any of the work I’ve been doing at NYU. While I’ll have to set a lot of my other projects aside, at Nomic I’ll get to spend a lot more time expanding the possibilities for defining and exploring large embedding spaces. I met Brandon and Andriy through their contributions to Deepscatter, and through providing pointers as they built a fork into their new product, Atlas. As part of this new position I’ll get to spend more time building out features I’ve long had in mind for Deepscatter but haven’t had the bandwidth or support to pursue, and sharing some new and exciting maps. This should be good news for everyone I know using Deepscatter now, both because I’ll be able to implement these features, and because Nomic’s internal fork enables some very exciting possibilities including search, selection, and filtering.

From now on this improved library will live in the github.com/nomic-ai/deepscatter repo under a CC-BY-NC-SA license, where NC means research and personal use is encouraged, but any commercial applications require a license from Nomic. If you have any questions about using Deepscatter for something, join .

But you can also start making maps more easily and robustly by using Atlas. If you have a large collection of text, embeddings, or something else, do reach out! Atlas is invite-only right now, and you can join the waitlist here. I’m excited to start showing off some of what we’ve been working on–helping set up full-text search has been revelatory about what kinds of data interactions are now possible.

I’ve written and discussed a lot over the years about the humanities, the university, the sciences, and all the rest, so leaving at this moment feels a bit more fraught for me than it would for most. Some of our redoubts are dealing with a slight fire and brimstone problem–I’m sure I’ll take some chances to look back on those bigger questions soon. But not too soon–don’t want to turn into a pillar of salt.

I do want to thank and note some people at NYU as I go, though. In the past three years many students and faculty have made great strides in digital humanities, and it has been exciting to help introduce many students to digital humanities work and to create spaces that encourage new and interesting work. In my role as director of digital humanities I launched, alongside Zach Coble in Digital Scholarship, a new seed grant program that has funded sixteen DH projects: several have already earned major external grants, and I’m sure you’ll be hearing more from some of them in the future. I also managed to cobble together funding for a new series of summer fellowships starting in 2021: running this summer class with Jojo Karlin and others at the libraries has been extremely rewarding. (And I should say it’s a delight to be able to link to the new website that we built last spring and which Marii Nyrop superintended in just one of their irreplaceable contributions to DH community life at NYU.) I co-directed, with Ellen Noonan and Sibylle Fischer, the Asylum Lab, which was my intellectual lodestar at the university, taking an interdisciplinary approach to understanding the life stories in migrant records from the last hundred years with a group of graduate students and an undergraduate class. And teaching, talking to, and working with students from all levels and fields at NYU was uniformly a joy.

But while it’s hard to walk away, like so many people during this pandemic I realized that there’s no time to waste. And I’m excited to see what’s next.

Pedagogy shouldn't recapitulate phylogeny: (stop teaching base plot!)

bmschmidt@gmail.com (Ben Schmidt) — Fri, 07 Oct 2022 00:00:00 GMT

When you teach programming skills to people with the goal that they’ll be able to use them, the most important obligation is not to waste their time or make things seem more complicated than they are. This should be obvious. But when I’m helping humanists decide what workshops to take, reviewing introductory materials for classes, or browsing tutorials to adapt for teaching, I see the same violation of the principle again and again. Introductory tutorials waste enormous amounts of time vainly covering ways of accomplishing tasks that not only have absolutely no use for beginners, but which will confuse learners by making them

The mistake is: workshop leaders or teachers feel the need to walk through an ‘old’ way of doing something before teaching the way that students will actually do the thing.

To get the point across clearly, let me say some things.

In R, for me this fundamentally means: commit to the tidyverse.

Only ever teach ggplot2; do not teach the base plotting functions. But above all, never teach both.

In Python

Never give the slightest acknowledgement to python 2.7.
Never teach matplotlib.

This seems obvious, but it makes me mad to see it. The reason why is not just that it’s a waste of students’ time, but that it makes me fear the instructor is underqualified (perhaps they don’t know how to make a histogram in ggplot).

Are there exceptions? Yes. Or at least, maybe. One is when the intellectual concept is so much larger than a particular application that it’s worth exploring the general rule. Another is when the historical example is so, well, historical that exploring it as a cultural artifact is actually worthwhile. Sometimes both will happen. In my “Working with Data” class, I get students to do almost all of their manipulation with the filter, group_by, summarize, arrange, *_join, and pivot_* functions from the tidyverse’s dplyr and tidyr packages.¹ These functions–as students will learn reading Hadley Wickham’s original article on the ‘split-apply-combine’ strategy–are ultimately descended from the original definitions in SQL.

I, myself, used to write an enormous amount of SQL code. After not doing so for a few years, my enthusiasm for duckdb has me doing it again. The point of teaching it is to show that the conceptual strategy is the same, and to talk a bit about language design. (Is it a good thing that “AS” is optional in SQL?) But I come to SQL after doing the basic operations in tidyverse, not before: the idea is to think about it after the fact.
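A minimal sketch of that shared strategy, using Python's built-in sqlite3 module rather than duckdb so that it runs without installing anything. The table, columns, and rows are invented for illustration. The two queries differ only in whether the aliases are introduced with AS.

```python
import sqlite3

# Hypothetical toy data: (paper, year, pages) rows invented for this example.
rows = [
    ("Die Presse", 1867, 8),
    ("Die Presse", 1868, 12),
    ("Wiener Zeitung", 1867, 16),
    ("Wiener Zeitung", 1868, 20),
    ("Wiener Zeitung", 1869, 24),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE issues (paper TEXT, year INTEGER, pages INTEGER)")
con.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)

# filter -> WHERE, group_by -> GROUP BY, summarize -> aggregate functions,
# arrange -> ORDER BY.
with_as = con.execute(
    "SELECT paper, COUNT(*) AS n, SUM(pages) AS total_pages "
    "FROM issues WHERE year >= 1868 GROUP BY paper ORDER BY paper"
).fetchall()

# The same query with the AS keyword left out of every alias.
without_as = con.execute(
    "SELECT paper, COUNT(*) n, SUM(pages) total_pages "
    "FROM issues WHERE year >= 1868 GROUP BY paper ORDER BY paper"
).fetchall()
```

Both queries return the same rows; whether the shorter form is easier or harder to read is exactly the kind of language-design question worth raising after the fact.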

¹ The first time I got the deprecation warning on spread and gather, I admit my heart sank–now I have to update every example! But switching to the new, more explicit, format will certainly be just slightly easier for students, I am convinced; and of course I won’t spend time describing the old way of doing things.

Sharing texts better, part 1: Austrian Newspapers

bmschmidt@gmail.com (Ben Schmidt) — Tue, 19 Apr 2022 00:00:00 GMT

It’s not very hard to get individual texts in digital form. But working with grad students in the humanities looking for large sets of texts to do analysis across, I find that larger corpora are so hodgepodge as to be almost completely unusable. For humanists and ordinary people to work with large textual collections, they need to be distributed in ways that are actually accessible, not just open access.

That means:

Downloading
Reasonable file sizes (rarely more than a gigabyte).
Reasonable numbers of files (don’t make people download more than a dozen for some analysis tasks).

This isn’t happening right now. The hurdles to working with digital texts are overwhelming to almost anyone. I don’t usually write up a simple process story about what it’s like to get collections of texts, but I want to do so a few times here.

What follows here is–I should be clear–a sort of infomercial. Over the last year or so I’ve started formalizing a much better way to distribute texts than any cultural heritage institution currently uses.

I’ll share texts using it. I want to start looking at some collections I encounter to make clear just how high are the barriers to working with text the way we’re distributing it now.

Part one: newspapers. Newspapers should be, in theory, a pretty easy type of text to distribute. In an ideal world, a newspaper is divided up into articles. But most of the open-access newspaper collections I’ve seen instead chop papers up into pages. That’s the case for the first archive I’m going to look at in this series: newspapers from the Austrian National Library hosted on Europeana.

I can’t completely remember the details of why I’m looking at this collection, but in short: a graduate student in my Working with Data class was interested in doing text analysis for their class project on newspapers from there. We decided that the Neue Freie Presse would be an especially useful paper, and identified digitized versions both on Europeana and at ANNO, hosted by the Österreichische Nationalbibliothek. (If you visit the Wikipedia page for the NFP, it takes you to a dead Columbia link.) ANNO has a nice online interface including well-formatted links like “https://anno.onb.ac.at/cgi-content/annoshow?text=nfp|18970610|20” for full-text: this seems like a possible route for getting data, although the decades of data will take an extremely long time to download in R. Looking for other copies, I first check the Atlas of Digitized Newspapers from the Oceanic Exchanges project, because I know that they have decent information about accessibility. (Despite the name, they are not an atlas in any normal sense, but instead a bibliography, registry, or catalog.) It suggests that access will be to XML files through Europeana, and does not list any access through ANNO above what I’ve been able to find.

But it also links to a bulk download site at Europeana. Looking at the Europeana sites during a Zoom call we discover that there are a number of full-text downloads identified by opaque numbers: 9200300 is the first one.

Here’s where we hit the first snag. What are these numbers? Looking at the site for one of the NFP pages in the Europeana browser, we see that it, too, starts with 9200300. Perhaps this is just what we want? But the file is unthinkably large–116 GB, zipped, for the page-level full text. This is too large for the grad student to download, but I click on it to see what will happen. It spins, and spins, long past the end of office hours. The student has to wait.

A week passes. While looking for a completely different file on my computer, I encounter a 63GB zip file in my downloads. I dimly remember downloading this earlier, and think about opening it. To just unzip a 63GB file would be crazy–this is another place that most researchers will be stymied. I know that one can access a zipfile randomly, though, and fire it up in Python to read.

This is a second place that most researchers would be lost–63 GB is just too big. There should never be a single file that large unless it’s completely necessary; in this case, that’s clearly not so. The idea that you can extract single files is simply not obvious, so many people will try to extract. I don’t know exactly how big that 63GB file will be, but probably large enough to clobber most hard drives.
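The random access mentioned above can be sketched with nothing but the standard library. This is an illustration on a tiny archive built in memory, with invented member names and contents, not the Europeana file itself: listing the members reads only the archive's directory, and opening one member decompresses only that member.

```python
import io
import zipfile

# Build a small stand-in archive in memory; names and contents are invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for page in range(1, 4):
        name = f"9200300/BibliographicResource_0000000000001/{page}.xml"
        zf.writestr(name, f'<String CONTENT="page{page}"/>')

with zipfile.ZipFile(buffer) as zf:
    members = zf.infolist()          # reads only the central directory
    with zf.open(members[1]) as fh:  # decompresses just this one member
        second_page = fh.read().decode("utf-8")
```

Nothing is written to disk here, which is the point: the archive never needs to be extracted in full to look inside it.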

I’ve named the zipfile ‘NFP.zip’ now, because I’m hoping it has the Neue Freie Presse. Now I can read the list of filenames.

import zipfile
import html
f = zipfile.ZipFile("NFP.zip")
fnames = f.filelist

It turns out to have 1.6 million little files bundled in there, with names like 9200300/BibliographicResource_3000116292697/3.xml. Hmm. Well, the end is clearly the page number, and perhaps the bibliographic resource is the individual issue?

I read in a single document–the one-millionth–to see.

<TextLine HEIGHT="61" WIDTH="703" VPOS="25" HPOS="166"><String WC="0.5249999762" CONTENT="rung" HEIGHT="29" WIDTH="68" VPOS="37" HPOS="166"/><SP WIDTH="19" VPOS="32" HPOS="234"/><String WC="0.5199999809" CONTENT="des" HEIGHT="29" WIDTH="46" VPOS="33" HPOS="253"/><SP WIDTH="10" VPOS="35" HPOS="299"/><String WC="0.4877777696" CONTENT="höchstens" HEIGHT="43" WIDTH="140" VPOS="30" HPOS="309"/><SP WIDTH="17" VPOS="38" HPOS="449"/><String WC="0.625" CONTENT="ui" HEIGHT="22" WIDTH="28" VPOS="45" HPOS="466"/><SP WIDTH="17" VPOS="45" HPOS="494"/><String WC="0.275000006" CONTENT="emem" HEIGHT="27" WIDTH="84" VPOS="45" HPOS="511"/><SP WIDTH="10" VPOS="42" HPOS="595"/><String WC="0.4562500119" CONTENT="fncvüchm" HEIGHT="40" WIDTH="149" VPOS="42" HPOS="605"/><SP WIDTH="9" VPOS="48" HPOS="754"/><String WC="0.3616666794" CONTENT="Zustan" HEIGHT="36" WIDTH="96" VPOS="48" HPOS="763"/><HYP CONTENT=""/></TextLine>

So–it’s XML of the scans, including the exact position in pixels of each word. I consider parsing the text lines out and deconstructing the XML properly, but XML parsing is a pain and always tediously, tediously slow. And I don’t care about any of this stuff–I’m doing text mining, so I just want the words. A quick check back at the Europeana site confirms that I have the smallest file on offer.

So let’s do the quick and dirty approach. The letters I want follow the word “CONTENT” in the XML; so I’ll just write a quick-and-dirty approach that splits on that string, and grabs everything up to the second quotation mark. This is how people use XML, I tell myself; no one is enough of a sucker to use python’s XML parsing libraries, so let’s just munge it out. split is so much faster….

import pyarrow as pa
from pyarrow import parquet

i = 0  # index of the next file in the archive to read
while i < len(fnames):
    pages = []
    ids = []
    # Bundle up to 5,000 pages into each parquet file.
    for j in range(min(5000, len(fnames) - i)):
        print(i, end = "\r")
        r = f.open(fnames[i])
        words = []
        # Take whatever follows each CONTENT=" up to the next quotation mark.
        for word in r.read().decode("utf-8").split('CONTENT="')[1:]:
            words.append(word.split('"', 1)[0])
        page = html.unescape(" ".join(words))
        pages.append(page)
        ids.append(fnames[i].filename.replace(".xml", ""))
        i += 1
    out = pa.table({"ids": ids, "pages": pages})
    parquet.write_table(out, f"{i}.parquet", compression = "zstd", compression_level = 5)
    print(f"{i}/{len(fnames)}")

This code pulls the text out of XML into something better: a parquet file, written by pyarrow, for each group of 5,000 pages. I check one to be sure–looks like German. There will surely be mistakes–perhaps involving quotation marks in words. But with low-quality OCR, it’s enough to start.
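For checking those possible mistakes, here is a sketch that wraps the split trick in a function and sets it beside a slightly stricter regular expression. The sample string and both function names are invented. On the quotation-mark worry: in well-formed XML a double quote inside a double-quoted attribute has to be written as &quot;, so stopping at the next quote is safe. The more likely problem is other attributes whose names merely end in CONTENT. As far as I know the ALTO format allows a SUBS_CONTENT attribute on hyphenated words, but treat that as an assumption.

```python
import html
import re

# Invented sample in the style of the XML shown above.
sample = (
    '<TextLine><String CONTENT="Zustan" SUBS_CONTENT="Zustand"/>'
    '<SP WIDTH="9"/><String CONTENT="Haus &amp; Hof"/>'
    '<String CONTENT="&quot;Presse&quot;"/></TextLine>'
)

def words_by_split(xml_text):
    """Quick and dirty: split on the attribute opener, stop at the next quote."""
    parts = xml_text.split('CONTENT="')[1:]
    return [html.unescape(part.split('"', 1)[0]) for part in parts]

# \b requires CONTENT to start a new word, so SUBS_CONTENT does not match.
CONTENT_ATTR = re.compile(r'\bCONTENT="([^"]*)"')

def words_by_regex(xml_text):
    """Stricter variant: only attributes named exactly CONTENT."""
    return [html.unescape(m) for m in CONTENT_ATTR.findall(xml_text)]
```

On this sample the split version picks up the SUBS_CONTENT value as an extra word and the regex version does not; both decode the escaped ampersand and quotation marks.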

Arzt der k. k. prio. THÄßbahn, anö den frischen Blätter» des Enca» lyptiis Globnlus. eines ans Anstratten stammende» BaiimcS, i» dem ««oratorwin des Apothekers ^»»>i Sdl»»»»»» Wien. JÄche», - Haupistraze Nr. 16, einzig und allein zukereiteie rmd stets «orrStbig

Rewriting with compression.

I wrote them into a folder with level 5 compression in zstd. The new directory, with parquet files and ids, is a tenth the size: 6.4GB vs 63GB for the zipfile I downloaded. Why on earth have I downloaded massive XML files when I just want text? Who really wants this positional text, anyway? I’ve used it a few times over the years–but most people want text, not XML. Zipfiles at least are nice, because I can grab the specific files I want. But they’re also slow in their own right. I start parsing at 22:21, and leave my computer open–looking at the timestamps, I don’t finish the last file until more than two hours later, at 00:31.

This is bonkers. Mediocre zip compression and uselessly XML-encoded data mean that it takes two hours just to look at the data in the most cursory way. It’s important to distribute things in a complete format, but it’s also important not to waste resources making things too hard to parse. With the parquet formatted versions of the data, it takes not two hours but 55 seconds to parse through every file in this set. That’s a major improvement–100 times faster to read, and one-tenth the size. Both of those are big enough differences that they actually affect whether this data is usable or not.

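from pathlib import Path
from pyarrow import compute as pc
from pyarrow import parquet

matches = []
for p in Path("parquet_files").glob("*.parquet"):
    a = parquet.read_table(p)
    # Boolean mask marking the pages that mention the search string.
    which = pc.match_substring(a['pages'], "Gustav Mahler")
    matches.append(a.filter(which))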

So–now we’ve got a huge set of text in a fairly navigable form. But we don’t know what the records are. The identifiers are all things like 9200300/BibliographicResource_3000123565676/4; aside from the page number, it’s not clear what any of those mean. My working theory to this point was that 9200300 meant the Neue Freie Presse and BibliographicResource_3000123565676 means the individual issue; but I need to know for sure.
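A small sketch of taking those identifiers apart, under the working theory just described. The field names are hypothetical labels, not anything Europeana defines. Parsing the page as an integer matters for ordering later, since as text "10" would sort before "4".

```python
from typing import NamedTuple

class PageId(NamedTuple):
    dataset: str  # e.g. "9200300"; assumed here to identify the collection
    record: str   # e.g. "BibliographicResource_3000123565676"
    page: int     # trailing number, assumed to be the page within the record

def parse_page_id(identifier: str) -> PageId:
    dataset, record, page = identifier.split("/")
    return PageId(dataset, record, int(page))

ids = [
    "9200300/BibliographicResource_3000123565676/10",
    "9200300/BibliographicResource_3000123565676/4",
    "9200300/BibliographicResource_3000116292697/3",
]
# Tuples sort field by field: dataset, then record, then numeric page.
ordered = sorted(parse_page_id(i) for i in ids)
```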

Sorting is information

At this point, I start putting the identifiers into the web site and figuring out the layout of the metadata here. It turns out that this is not just one newspaper, but lots–probably everything contributed from the ÖNB to Europeana. And, stunningly, the order seems to be completely random? I call the web-based Europeana API and get a dcTitle field in this order:

["Der Humorist - 1847-01-29"]
["Blätter für Musik, Theater und Kunst - 1871-09-19"]
["Wiener Zeitung - 1841-10-18"]
["Der Humorist - 1841-03-10"]
["Neue Freie Presse - 1871-10-22"]
["Innsbrucker Nachrichten - 1859-11-25"]
["Die Presse - 1867-06-25"]
["Das Vaterland - 1862-09-26"]
["Wiener Zeitung - 1705-02-28"]
["Wiener Zeitung - 1868-12-04"]

There are a couple of weird things here. One is the random order. I suppose that this could be my fault, because I just used the filenames from the zipfile in the order they appeared, rather than sorting. But that itself is a problem–the zipfile should have more of an inherent order. It is an underappreciated fact that good sorting is good compression; the more natural an order information appears in, the better it will compress. And of course, the fewer files people will have to download. The other is that “title” is wrapped in an array: apparently in the EDM things can have multiple titles. OK, that’s something I can work with.
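The claim that good sorting is good compression is easy to try out. A minimal sketch on invented records shaped like the titles above, using zlib from the standard library rather than zstd, so the exact numbers will differ from any real archive: the same lines are compressed once in sorted order and once shuffled.

```python
import random
import zlib
from datetime import date, timedelta

# Invented stand-in records: five titles, one issue a day for three years.
titles = ["Das Vaterland", "Der Humorist", "Die Presse",
          "Neue Freie Presse", "Wiener Zeitung"]
start = date(1867, 1, 1)
records = [
    f"{title} - {start + timedelta(days=n)}"
    for title in titles
    for n in range(3 * 365)
]

shuffled = records[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the comparison is repeatable

def compressed_size(lines):
    """Bytes needed for the newline-joined lines after zlib compression."""
    return len(zlib.compress("\n".join(lines).encode("utf-8"), 6))

sorted_size = compressed_size(sorted(records))
shuffled_size = compressed_size(shuffled)
```

The sorted version comes out smaller because each line shares a long prefix with the line before it, which is exactly what a compressor looks for.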

So now I have a clear plan.

Get metadata for every record.
Match it to the papers.
Write out each newspaper in chronological order.

To get the metadata, I have to find it–there is no metadata in the data dumps. First I do it using the API: https://api.europeana.eu/record/v2/{id}.json?wskey={api_key}. But it quickly becomes clear this won’t scale: running overnight I’ve only downloaded 35,000 of 1.3 million records. So I go back to the Europeana page and download another enormous zipfile–a 4 gigabyte one with records for the entire set. How this manages to be so large isn’t initially clear to me–perhaps, I think, they’ve bundled the full text into it?
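The arithmetic behind "won’t scale" is short enough to write down. It assumes the overnight rate stays constant, which is a simplification.

```python
# Figures from the paragraph above.
records_fetched_overnight = 35_000
total_records = 1_300_000

# Nights of continuous downloading needed at the same rate.
nights_needed = total_records / records_fetched_overnight
```

That works out to a little over 37 nights, which is why the bulk download is the only realistic route.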

The answer turns out to be that there are massive amounts of text for each record because, chiefly, every record repeats an extremely long definition of ‘newspaper’ in many different languages. That this balloons the size so much is a failure of an over-literal use of linked data. Perhaps there would be a way to reference it as an element in a single HTML file, but really, no one cares. This part of the data model will never be used outside a Europeana site–there is some base-covering in distributing it, but it’s a massive inconvenience for researchers to have the following block of text (and something vaguely equivalent in Latvian, Arabic, Russian, etc.) repeated 1.6 million times in a file that’s supposed to be a metadata dump about newspaper issues:

Many newspapers, besides employing journalists on their own payrolls, also subscribe to news agencies (wire services) (such as the Associated Press, Reuters, or Agence France-Presse), which employ journalists to find, assemble, and report the news, then sell the content to the various newspapers. This is a way to avoid duplicating the expense of reporting.

Now, I understand the need for clear URIs for concepts and the benefits of linked open data. But the nature of linked open data is that any individual record can be ballooned indefinitely. Why is there a definition of ‘newspaper’ at such tedious length and not, say a full expansion of the geographic definition of ‘Graz’ where it appears? I am sure there is a reason–but I’m equally sure it’s not really a good one.
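A sketch of the size cost of inlining a shared definition into every record, compared with pointing at it by URI and storing it once. The record layout, the URI, and the placeholder definition text are all hypothetical; this is not Europeana's actual schema.

```python
import json

# Hypothetical layout for illustration only; not Europeana's actual schema.
CONCEPT_URI = "http://example.org/concept/newspaper"
definition = "A long definition of 'newspaper' in one language. " * 10
titles = [f"Wiener Zeitung - 1868-12-{day:02d}" for day in range(1, 29)]

# Option 1: every record carries its own copy of the definition.
inlined = [
    {"title": t, "type": {"uri": CONCEPT_URI, "definition": definition}}
    for t in titles
]

# Option 2: records point at the concept by URI; the definition is stored once.
referenced = {
    "concepts": {CONCEPT_URI: {"definition": definition}},
    "records": [{"title": t, "type": CONCEPT_URI} for t in titles],
}

inlined_bytes = len(json.dumps(inlined).encode("utf-8"))
referenced_bytes = len(json.dumps(referenced).encode("utf-8"))
```

With only 28 records the inlined version is already several times larger; the gap grows with every record added.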


Day of DH Liveblog, 2022

bmschmidt@gmail.com (Ben Schmidt) — Mon, 28 Mar 2022 00:00:00 GMT

I’ve never done the “Day of DH” tradition where people explain what, exactly, it means to have a job in digital humanities. But today looks to be a pretty DH-full day, so I think, in these last days of Twitter, I’ll give it a shot. (thread)

We’ll start it at the beginning–1:30 or so AM, finally sent out an e-mail I’d been procrastinating on to the college grants administrator for a public humanities project about immigrant histories I’m running with @ellennoonan and Sibylle Fischer.

We’ve had NYU funding as a Bennett-Polonsky Humanities Lab (https://nyuhumanities.org/program/asylum-h-lab-2020-2021/) to this point, but presenting to the history department last month clarified the use in making one of our primary sorts of records–A files–more accessible to historians and family researchers.

But that will take some real institutional support, because the stuff we’ve obtained–legally!–from US customs and immigration in our trial run is so shockingly personal in a lot of cases that I can’t really share it yet.

(“Yet” is the wrong word–can’t ethically share in my lifetime, probably. But there are still really important reasons to work on auditing these records especially. If you’re a naturalized citizen or permanent resident and want any help getting your own A-file, let me know!)

OK, skipping to about 9:50 AM. (Late start b/c the first-grader had a school event and my wife teaches Thursday AM). Today’s first teaching, for my class https://benschmidt.org/WWD22 will be focused on 19C directories from the NYPL.

Nick Wolf and @bertspaan digitized these years ago, but there’s more to do with them. A couple weeks ago @SWrightKennedy shared a preview of Columbia’s great new geolocation data about 19C New York… https://mappinghny.com/about/

And yesterday I finally pushed a full pipeline bringing the last two weeks of student work together for doing geo-matching and cleaning of these to the github repo. https://github.com/HumanitiesDataAnalysis/Directories . This should allow some amazing analysis of economic geography, name types, etc.

So now we’ve got 8.3m individual people for every year from 1850-1889 queued up and ready for a variety of analyses. I want to send the students a map to show how all their R code is paying off, but the deepscatter module is breaking–only one of the filters is working here.

I spend 40 minutes poking in the web code there to try to refactor the code to get the interface working right, but this isn’t really relevant for the class right now–more something for the summer, I guess. So I give up and decide to do this DH tweeting instead.

Because of the whole “Twitter is almost over” thing, but some lingering guilt about not blogging enough, I decide that a “Day of DH” post should really be a blog first–so let’s finally structure some markdown for a twitter thread that can go on benschmidt.org.

It takes a surprising amount of mucking around with the svelte-kit settings to get things publishing correctly, and I have to remember my own markdown naming conventions. But after a few minutes, we’ve got full recursion. https://benschmidt.org/post/2022-03-28-day-of-dh/day-of-dh-22/

Whoops, or not… Time to muck with svelte-kit a little more…

Well, this is embarrassing but typical. Turns out there was a bug in the bleeding-edge svelte-kit build that broke trailing slash behavior in URLs. Because ‘https://benschmidt.org/post/2022-03-19-better-texts/’ is different from ‘https://benschmidt.org/post/2022-03-19-better-texts.’ Finally fixed.

Insane levels of debugging is a real pain and occupational hazard. But to be honest, I don’t know how anyone could responsibly teach this stuff without doing this sort of rebuilding and rescaling all the time. Every one of those things is kind of interesting and builds up ability to fix others’ code…

A Rose for Ruby

bmschmidt@gmail.com (Ben Schmidt) — Mon, 28 Feb 2022 00:00:00 GMT

There are programming languages that people use for money, and programming languages people use for love. There are Weekend at Bernie’s/Jeremy Bentham corpses that you prop up for the cash, and there are “Rose for Emily” corpses you sleep with every night for decades because it’s too painful to admit that the best version of your life you ever glimpsed is not going to happen.

It’s time we had a hard talk about Ruby.

This is part three of a series on Web Archives for the 2020s.

I was at a cafe in Ann Arbor in 2014 talking about coding with Matt Burton; he had just discovered Docker, and was rhapsodically describing how magically it transformed his workflow. At some point he mentioned something about Ruby and how he was shifting away from using it, and a doleful looking man came over to commiserate over how the Ruby dream was fading away. It was a good idea, it really figured something out, he said, but it had lost. He then described whatever new thing he had been working on--not Docker, maybe Go (I don’t think I knew about Go yet), maybe something else.

People talk about programming “languages,” but the language is usually the easy part; every programming environment is like a foreign city. Perl was like a Renaissance fair with arcane and inconsistent rules, filled with people pretending to be monks and issuing apocalypses and generally orientalizing in a way that wouldn’t be cool today. R is a midwestern college town, orderly, a bit slow, behind the times in certain ways but with great infrastructure. Go is Singapore, filled with spaced-out modern infrastructure and more rules for your own good than you’d like. Javascript is some post-imperial metropolis, filled with merchants hawking possibly counterfeit wares in countless dialects, with huge districts constructed without a building code and no overall map.

As a tourist in the landscape, Ruby right now feels like Detroit. In the 1950s, Detroit was an idea of growth, union-led households, orderly grids, with the UAW ready to push racial integration. The infrastructure is still there. But it’s gutted; you keep going to a corner and finding the buildings have been torn down. The Wax documents strongly recommend `rvm` for managing versions, but the web page looks to be from a decade ago and the key authentication doesn’t even work. The core version of Ruby was updated to 3.0 last year, removing from the stdlib a key dependency (webrick) that Jekyll needs in order to work, and it seems not to be a priority for the Jekyll team to immediately add it back in the Jekyll requirements. Why? Presumably because so few people are starting up new sites that new people moving to the platform is not a problem that overwhelms them.

And it’s slowwwwww. Wow. Those Hugo-adopters were right. So, so slow. In Bookworm, I tokenize, reformat, and otherwise transform books all the time. I’ve switched over to Pyarrow and polars to get faster underpinnings; I can often do some operations on a thousand books a second. Ruby, generating a piddling few dozen pages, can take a minute or two. I wrote an entire Svelte-kit based wax clone just in the breaks while waiting for my Wax pages to render. There’s a truism out there that developer time is far more valuable than compiler time, and that all modern languages are fast enough. I’ve always thought that was basically true. But that relies on a rough baseline of performance, on someone periodically going through and pulling out the low-hanging fruit by optimizing the slowest parts of a language. Jekyll’s slowness is of a different order.

I’ve never learned Ruby. Based on the love people show for it, I wish I had. But I doubt I ever will. It should have been bigger. From everything I’ve seen, it was better designed than Python. We’d all be in a better place if the numpy/scipy/tensorflow stack had grown on top of Ruby rather than Python. But it didn’t. You don’t move to a city for the language they speak; you move there for the jobs, the infrastructure, the culture, the people. You take care of what’s left there.

There are people left who still love Ruby, who will tell you that Jekyll is a simple, classic, effective way to build web sites.

They are lost souls.

From Hugo to Svelte-Kit

bmschmidt@gmail.com (Ben Schmidt) — Sat, 22 Jan 2022 00:00:00 GMT

I’ve been spending more time in the last year exploring modern web stacks, and have started evangelizing for SvelteKit, which is a new-ish entry into the often-mystifying world of web frameworks. As of today, I’ve migrated this personal web site from Hugo, which I’ve been using the last couple years, to sveltekit. Let me know if you encounter any broken links, unexpected behavior, accessibility issues, etc. I figured here I’d give a brief explanation of why sveltekit, and how I did a Hugo-Svelte kit migration.

Why Svelte-Kit for a personal site?

I’ve had some kind of content up at benschmidt.org for over a decade; and I’ve been using it as my primary outlet for blog posts for about five years (although still posting occasionally on my old blogger site as well). For a time it was hosted on Wordpress; for a time after that, on Hugo. I also have a large number of other items living on benschmidt.org I’ve made over the years that weren’t integrated into the Hugo site; most are things like standalone visualizations that I’d like to be able to retain all their existing javascript but share a top bar with the rest of the site so that people link them to me.

Hugo works fine for building compared to wordpress, by giving a static site solution that unlike Wordpress doesn’t present security vulnerabilities. I like, compared to Jekyll, that it’s a quick build. But it left me with a somewhat clunky set of pages for things like a visualizations gallery. And although I picked a decent theme–Hugo Academic–I never fully got on board with the weird way that you basically end up having to learn to manage Hugo’s build process through a set of TOML and YAML files. I saw someone once decry the growing trend to make people do things in YAML that are fundamentally programming; although yaml is great for some things, learning the configuration setting for some particular theme is generally frustrating.

Also, the pile-up of all these old web sites means the URL requirements are a little finicky–I want to support some of the old wordpress links, some of the Hugo-style links, and potentially bring in some blog posts for other domains (bookworm.benschmidt.org, for instance, which had a number of posts that I’ve entirely lost.)

So here are the problems that Svelte-kit solves.

Routing. I never really figured out the URL setup for blog posts in the Academic theme for Hugo; and I have a number of old posts from a Hakyll setup I briefly explored in 2015-2017 before abandoning it. Svelte-kit’s routing is incredibly powerful but also fundamentally understandable; every foldername on your computer is a directory in the url structure, index.svelte files turn into the base names, and you can use brackets like /post/[postname]/index.svelte to define dynamic variables where /postname is the filename. So right now, I’m writing a markdown file located at 2022-01-20-sveltekit-transition/index.md, and checking in a browser window to make sure that the local version is correctly showing images and styles.
Image Components. This is a big one for me. For instance, I want to have a gallery where I can just show visuals that will show a tile of images. And since it’s the 2020s, that needs to look one way on desktop and quite a different way on mobile.

Desktop view

Mobile view
Data Components. I liked how Hugo Academic included a lot of basics for showing carousels of things like articles, but they were never quite what I wanted. And for years I’ve been making my CV using Kieran Healy’s template but compiled from yaml because yuck, latex is gross. That meant I was keeping up two different versions of pretty much the same data, which is a pain. With Svelte, I can just directly import the YAML to the CV page and format the data. For the time being, the online version is a little wonky because it’s sort of a pain to iterate through. But it also means that I can easily abstract something like “upcoming talks” if I ever get it together enough to start handling talk invitations again. I can automatically have the website update the courses I’ve taught from the same file as the CV, with links to the course pages. Etc.
CSS and themes. CSS is incredibly powerful, and incredibly hard to use with most frameworks I’ve explored. One reason is that the CSS gets shunted off into some file somewhere called ‘/lib/app.scss’ or something and it’s never clear from the css which things are boilerplate, which are essential classes used everywhere, and which are not used on a site at all. Svelte natively solves this by allowing all components to have a style block at the bottom, scoped just to that file, so I can immediately understand the implications of editing a block. This is especially useful for someone like me who doesn’t think much about colors but occasionally gets finicky about item placement.

It also works well alongside the tailwind CSS (non-)framework, which I’ve been using a bunch lately when I know basically what I want to do but don’t want to think about how to define media queries. It provides a bunch of classes.

Integrating non-blog-content. I have a lot of stuff hosted on benschmidt.org that doesn’t have the theming from my personal website, and I periodically toss other things on. For instance, last week I wanted to share a seminar paper I wrote in grad school about the early years of the academic field of communications. Because I think putting this up will marginally increase the overall quality of the Internet, I just threw it up; and by running it through pandoc from the initial .doc files to HTML, I can just toss it into a folder and have it show up with formatting and links to my page. This would probably have worked in Hugo too; but as I start to incorporate some more elaborate javascript visualizations here, that will be harder and harder, at least without massively duplicating some common code libraries.
Static serving with dynamic speed. One of the things that drew me to Svelte and Svelte-kit immediately is their possibilities for static-site set-ups. Fancy web apps are fun–I have one for creating an archive I’ll put out later–but I have a hard requirement that sites should be able to work indefinitely without javascript at least in some form. Svelte-kit with adapter-static does a wonderful job splitting the difference here, making an initial page load always land on a real, static site file but also allowing site navigation to not refresh all the shared elements on a page if Javascript is enabled.

Hugo to Svelte-Kit

The last, and maybe most important, is that migration be possible. For anyone else looking to switch, here are some Hugo-to-Sveltekit migration notes.

Blog posts have got to stay in Markdown. I just chose to shove most of the contents of the Hugo tree into ‘src/content’, to live alongside ‘src/lib’ (which is for code) and ‘src/routes’. It would also be possible to put posts into src/routes directly and use a markdown plugin to generate sites straight from the Markdown. I chose not to do this because at least in my preliminary exploration, svelte was trying to treat all {} blocks as interpolatable, which isn’t what I want. Most of the hard work then happens in a markdown parsing file that just globs up all the markdown in that directory and parses it into HTML (and the YAML headers as JSON) using vite-plugin-markdown. This requires a little tinkering with the svelte.config.js file.

const urls = import.meta.globEager('/src/content/**/*.md');

The result is then an export that I can use on any page that contains the metadata for all blogposts as data in reverse-chronological order; although the actual code has to do more to handle tag-based navigation, the skeleton of the page is basically only this:

<script>
  import {post_index} from '$lib/markdown.ts'
  import Postgroup from '$lib/components/Postgroup.svelte'
</script>

<Postgroup posts={post_index} />
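
For readers wondering what might sit behind `post_index`, here is a minimal sketch of how such an index could be assembled. It is not the actual `markdown.ts`: the `FrontMatter` and `PostMeta` shapes, the `buildPostIndex` name, and the idea of taking the slug from the directory name are all assumptions made for illustration.

```typescript
// Hypothetical front matter for one post; real posts may carry more fields.
interface FrontMatter {
  title: string;
  date: string; // ISO 8601 date such as "2022-01-20"
}

interface PostMeta extends FrontMatter {
  slug: string;
}

// Takes a map from file path to parsed front matter (roughly what a glob
// import plus a markdown plugin might hand back) and returns the posts
// newest first.
function buildPostIndex(files: Record<string, FrontMatter>): PostMeta[] {
  const posts: PostMeta[] = Object.entries(files).map(([path, meta]) => {
    // ".../2022-01-20-sveltekit-transition/index.md" -> the directory name
    const parts = path.split("/");
    const slug = parts.length >= 2 ? String(parts[parts.length - 2]) : path;
    return { slug, title: meta.title, date: meta.date };
  });
  // ISO dates compare correctly as plain strings; the newest goes first.
  posts.sort((a, b) => (a.date < b.date ? 1 : a.date > b.date ? -1 : 0));
  return posts;
}
```

Sorting once at build time means every page that imports the index sees the same ordering without re-sorting.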

So now I have canonical URLs for posts at /post/slugname/, without year and month as part of a tree. All the messy old urls are still supported, though, by alternate routing endpoints that just comb through the metadata for those posts to try to determine what you’re looking for. This is unlikely to catch everything at first, but I can comb through server logs to see where I’m contributing to link rot and easily set up new rules.
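
As a rough illustration of that kind of fallback lookup, the sketch below normalizes an incoming path and compares it against each post's slug and recorded aliases. The URL patterns it strips (date folders, a leading `post/`, a trailing `.html`) and the `PostRecord`, `normalize`, and `resolveLegacyUrl` names are assumptions, not the rules this site actually uses.

```typescript
// Hypothetical metadata: a canonical slug plus any old paths recorded for it.
interface PostRecord {
  slug: string;
  aliases: string[];
}

// Collapses several historical URL styles down to one comparable key.
function normalize(path: string): string {
  return path
    .toLowerCase()
    .replace(/^\/+|\/+$/g, "") // trim leading and trailing slashes
    .replace(/\.html$/, "") // old static pages
    .replace(/^(post|posts|blog)\//, "") // section prefixes
    .replace(/^\d{4}\/\d{2}\/(\d{2}\/)?/, "") // /2015/03/ or /2015/03/14/
    .replace(/^\d{4}-\d{2}-\d{2}-/, ""); // date-prefixed directory names
}

// Returns the canonical URL for an old-style path, or null if nothing matches.
function resolveLegacyUrl(path: string, posts: PostRecord[]): string | null {
  const key = normalize(path);
  for (const post of posts) {
    const candidates = [post.slug].concat(post.aliases);
    for (const candidate of candidates) {
      if (normalize(candidate) === key) {
        return "/post/" + post.slug + "/";
      }
    }
  }
  return null;
}
```

A path that returns `null` here is exactly the kind of request worth looking for in the server logs before adding another rule.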

Non-blog pages are routed through a catchall endpoint that just finds the matching markdown file and compiles it. Easy-peasy. For the pages where I want to start doing something more complicated or data-driven, like the blog index, the dataviz gallery, or the CV, I write a custom Svelte component or page.

There’s something kind of lovely about the basicness of all this on the core level. If I want a blog feed–yes, I do!–I just define a route at /index.xml that throws back something from whatever node package I can find that generates atom XML.
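
To show how little is involved, here is a hand-rolled sketch of what such a route could return, standing in for whatever package actually does the job. The `FeedEntry` shape, the `renderAtomFeed` name, and the placeholder URLs in the usage note are hypothetical; the element names follow the Atom format.

```typescript
interface FeedEntry {
  title: string;
  url: string;
  updated: string; // RFC 3339 timestamp in UTC, e.g. "2022-01-20T12:00:00Z"
  summary: string;
}

// Escapes the characters that are unsafe in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Builds the feed as a string; assumes at least one entry.
function renderAtomFeed(
  siteTitle: string,
  siteUrl: string,
  author: string,
  entries: FeedEntry[]
): string {
  // Same-format UTC timestamps sort as plain strings, so the newest entry's
  // timestamp can serve as the feed-level <updated>.
  let newest = "";
  for (const entry of entries) {
    if (entry.updated > newest) newest = entry.updated;
  }
  const lines: string[] = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<feed xmlns="http://www.w3.org/2005/Atom">',
    "  <title>" + escapeXml(siteTitle) + "</title>",
    '  <link href="' + escapeXml(siteUrl) + '"/>',
    "  <id>" + escapeXml(siteUrl) + "</id>",
    "  <updated>" + newest + "</updated>",
    "  <author><name>" + escapeXml(author) + "</name></author>",
  ];
  for (const entry of entries) {
    lines.push(
      "  <entry>",
      "    <title>" + escapeXml(entry.title) + "</title>",
      '    <link href="' + escapeXml(entry.url) + '"/>',
      "    <id>" + escapeXml(entry.url) + "</id>",
      "    <updated>" + entry.updated + "</updated>",
      "    <summary>" + escapeXml(entry.summary) + "</summary>",
      "  </entry>"
    );
  }
  lines.push("</feed>");
  return lines.join("\n");
}
```

A route handler would return this string with an XML content type; calling `renderAtomFeed("My blog", "https://example.org/", "Author Name", entries)` is enough to try it out.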

Is this flawless? Definitely not–I’m sure there will be plenty of broken links soon. But I’m hopeful it will give me a nicer platform to bundle stuff together onto the Web. And as I’ve become even more evangelical about web publishing during this pandemic, that’s important to me.

Increasingly Stealthy

bmschmidt@gmail.com (Ben Schmidt) — Wed, 15 Sep 2021 16:41:56 GMT

Scott Enderle is one of the rare people whose Twitter pages I frequently visit, apropos of nothing, just to read in reverse. A few months ago, I realized he had at some point changed his profile to include the two words “increasingly stealthy.” He had told me he had cancer months earlier, warning that he might occasionally drop out of communication on a project we were working on. I didn’t then parse out all the other details of the page—that he had replaced his Twitter mugshot with a photo of a tree reaching to the sky, that the last retweet was my friend Johanna introducing a journal issue about “interpretive difficulty”—the problems literary scholars, for all their struggles to make sense, simply can’t solve. I only knew—and immediately stuffed down the knowledge—that things must have gotten worse.

There’s a terrifying grace in that preparation. We’ve all seen the digital desiderata of the dead. Usually they’re painful in how they present someone going through ordinary motions who is now stilled; sometimes they’re wrenching because they narrate a fight in process that we know the person—like everyone—is destined to lose. Scott found it in him to prepare a kind of reassurance. He still cared what we all were saying, but was in the process of pulling off a little magic trick. Someday, soon, he would disappear into full stealth. The man was a writer, and I wonder if he started off with some more conventional words the Internet uses to describe this action—“mostly lurking, nowadays”?—before editing it up to something a little more marvelous.

A number of the testimonials to Scott I’ve seen since he died last Saturday emphasize his kindness, his decency, and his generosity. I’ve been thinking about how his stealthiness buoyed all of those. In my life he would just pop up from time to time through one window of the Internet or another, always a reassuring and welcome presence. In most senses I barely knew Scott. We never even met in person—we talked about doing so a few times, but even barely a hundred miles apart, it was easier for us with little kids to push it off. And his thoughts were generally so rich that it was easier to digest them through flurries of e-mails, blog comments, github issue threads. Once there were so many e-mails in a short period that we had to switch to the telephone to talk about vector algebra, although we were quickly talking about something else entirely. As the rest of the world switched to video-conferencing the last few years, I at least got to see his face.

But I primarily knew Scott as this intensely helpful, mentally probing figure that made writing, reading, and coding online rewarding. I’d often be chomping against some interpretive difficulty of my own, looking for the answer to some obscure question and find that it was Scott who had answered it years before. He was, I only now thought to check, one of the most helpful answerers of all time on Stack Overflow, the question-answer site that makes modern coding possible. (To give the numbers: 128,633 reputation so far, number 681 out of 15,000,000 registered users. He was there only to help: 859 questions answered and only three questions ever asked). The first time I became aware of Scott online was when he asked a kind and incisive question on Twitter about the meaning and metaphors of the Fourier transform that immediately jolted me into a clearer understanding of a problem I had been wrestling with for weeks. In subsequent conversations this would happen again and again. This gift was real, and he spread it far more widely than most. I know that there are many for whom his loss is a deep, personal rift; maybe it helps to know how long the tail of that loss goes. Before there were radio waves, to ‘broadcast’ meant to throw seeds as widely as you could while planting, sowing the whole field. In a field where drilling down and holding ideas tight can be overprivileged, Scott was a broadcaster.

I wonder if one reason Scott could afford to be so egoless in his professional interactions was that his intellect was so utterly distinctive. I quickly came to know which kinds of questions were those I craved his insight on, but I never had any idea which direction he would take a problem. One thing I found intensely admirable was how confidently he would hold to a metaphor or an idea that would have no place in the universe if not for Scott: treating word vectors through the theory of algebraic sets, rehabilitating Fourier transforms for document encoding, most recently interpreting language models as thermodynamic partition functions. The ones that excited him most cut across mathematics, language and metaphor with striking new routes no one would think to take. Even as he solved other people’s problems, he always found ways to refresh the global reservoir with more interesting ones.

Although he wrote everywhere, one of the places our tracks most overlapped was in the last years of personal blogging, which is one of the reasons I feel compelled to set down something here. Scott’s blog, The Frame of Lagado (look up the reference if you don’t know it), wasn’t a long-term project, but like everything else, it helped people think how to think. The last entry is a wry, funny, self-deprecating farewell to the medium for characteristically independent reasons. Evidently Scott somehow figured out how to set up Wordpress using sqlite instead of MySQL, which is not something which would ever occur to most people. Evidently, also, this proved to be untenable. As he posted more and more and the blog got longer and longer, the whole thing slowed to a crawl under the weight of his words. I remembered this as a purely comical piece, but on returning to it after thinking about Scott for most of the last 24 hours, I noticed that he had ended it with a promise and a quick quotation of a part of First Corinthians I principally know from the German Requiem. That context is appropriate. “For we have here no continuing city, but we seek the future. Behold, I show you a mystery.”

At some point, I will create a much better blog and republish some or all of the old posts here. We shall not all sleep, but we shall all be changed.

In the meanwhile — thanks for reading.

Thanks, Scott.

Genre, Manifolds, and AI.

bmschmidt@gmail.com (Ben Schmidt) — Mon, 07 Jun 2021 18:35:21 GMT

This article in the New Yorker about the end of genre prompts me to share a theory I’ve had for a year or so that models at Spotify, Netflix, etc, are most likely not just removing artificial silos that old media companies imposed on us, but actively destroying genre without much pushback. I’m curious what you think.

This aligns with the most important rule for thinking about artificial intelligence, which is that its deleterious effects are most likely in places where decision makers are perfectly happy to let changes in algorithms drive changes in society. Racial discrimination is the most obvious field where this happens. But there are others where the moral valence is less clear, which are mostly being ignored.

Background: I’m participating in a roundtable at the American Historical Association tomorrow on Artificial Intelligence and its implications for the future of historical research. It made me realize that while I’ve been fiddling quite a bit with neural networks, and used them in my article on dimensionality reduction in digital libraries, I haven’t actually reflected much on them. Some of that will hopefully appear in the published partner to the AHA panel.

I teach a course on the history of data, and one primary lesson is that indexes shape what kind of culture people use. So with modern culture, what kind of indices do we use? When I did college radio, the most important resource in the music library was a set of huge binders, printed out twenty years before, listing every piece in the station’s collection; there were different binders by artist, by composer, etc. But by far the most useful one was a listing of every track in the collection by time; you’d know you needed an instrumental piece that was between 18 and 19 minutes long to close out a shift, and you could retrieve it instantly. How things were stored affected what got played.

The promise of digitization is unconstrained reconfiguration; indexes like this shouldn’t matter anymore. But of course we still have indexes, and I wonder if they aren’t doing something quite weird.

Unevenly distributed high-dimensional spaces privilege non-conformism

The theory is this. If you assume that music is distributed in a high-dimensional feature space (as they surely do) the distribution of pieces in that space is almost sure to be highly uneven. Some areas (recordings of the Beethoven string quartets) will be densely populated; others (suites for toy piano) will be quite sparse.

If you then use k-nearest neighbors approaches to serve up recommendations for music (Spotify built the best-known library, so we know that they use it), you’ll likely hit music on the periphery of its local clusters far more often than music at the center.

Here’s a simple 2-d analogy. Imagine an alien crashing into a random point on Earth and searching for the nearest human to say “take me to your leader.” The odds are they’ll find someone rural; and it’s basically guaranteed they’ll hit a suburbanite before an urbanite unless they happen to crash into the middle of Central Park. They’re more likely to meet a Russian speaker than a Chinese speaker. And so on.

Spotify isn’t serving up songs randomly, but I wonder how much a similar dynamic comes into play when each person is turned into a vector to predict their next streams.
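
The alien analogy is easy to check numerically. The sketch below is a toy simulation, not anything Spotify is known to run: it packs 900 items into a small "city" at the center of a unit square, scatters 100 "rural" items across the whole square, and then counts how often a uniformly random query lands nearest to a city item. The item counts, the size of the dense cluster, and the seed are arbitrary assumptions.

```typescript
// Small deterministic generator (a linear congruential generator) so the
// simulation gives the same answer on every run.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

interface Item {
  x: number;
  y: number;
  dense: boolean;
}

function simulateNearestNeighbor(queries: number): {
  denseShareOfItems: number;
  denseShareOfHits: number;
} {
  const rand = makeRandom(42);
  const items: Item[] = [];
  // 900 "city" items packed into a 0.05 x 0.05 square at the center.
  for (let i = 0; i < 900; i++) {
    items.push({ x: 0.475 + rand() * 0.05, y: 0.475 + rand() * 0.05, dense: true });
  }
  // 100 "rural" items scattered over the whole unit square.
  for (let i = 0; i < 100; i++) {
    items.push({ x: rand(), y: rand(), dense: false });
  }
  let denseHits = 0;
  for (let q = 0; q < queries; q++) {
    const qx = rand();
    const qy = rand();
    let bestDistance = Infinity;
    let bestIsDense = false;
    // Brute-force nearest neighbor by squared Euclidean distance.
    for (const item of items) {
      const dx = item.x - qx;
      const dy = item.y - qy;
      const distance = dx * dx + dy * dy;
      if (distance < bestDistance) {
        bestDistance = distance;
        bestIsDense = item.dense;
      }
    }
    if (bestIsDense) denseHits++;
  }
  return { denseShareOfItems: 900 / items.length, denseShareOfHits: denseHits / queries };
}
```

Although nine in ten items live in the dense cluster, the cluster covers so little of the space that most random queries end up nearest to one of the scattered items. A recommender starts from users rather than from uniformly random points, so this only illustrates the geometry, not the size of any real effect.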

When I browse around this vector representation of all the books in the Hathi Trust I made, genre outliers just tend to pop out naturally. I love these, because they’re intrinsically interesting; I end up finding– for instance–a book telling the history of England in doggerel just as often as I find “normal” poetry.

For those of you who can’t read it:

And trade’s embarrassment redoubles.

If I mistake, ’tis your’s to judge it,

But only overhaul the Budget

Which, for the service of the year,

Will millions, twenty-three appear ;

Thousands, seven hundred fifty-six,

And hundreds, (as accountants fix,)

Some one or two ; a sum so great

Had ne’er before disturb’d the state;

But I’ll certainly get the wrong idea about what sort of books exist in the library if I assume that the elements that pop out in the less dense areas are more typical. In fact, I’d probably To some degree, Spotify is doing the same thing with music. And instead of cities and rural people, we have dense established genres.

What is Spotify recommending?

My thinking about this has been heavily influenced by thinking about what I listen to now that I mostly use Spotify rather than my own digitized CDs for music listening. A typical example that Spotify’s recommendation algorithm surfaced for me quite early on is the music of the Austrian theorbist and composer Christina Pluhar, who puts together ensembles which, depending on your taste, are enchanting, insufferable, or inane. Here’s a track from her album of Purcell arrangements.

I like this. I have no idea if you do; and I don’t know exactly why I was recommended it. But if you assume this came out of a nearest-neighbor search in some region of a high-dimensional space, it’s easy to imagine why. This is an album that sits at an intersection of a bunch of different styles: recklessly loose early-music bands, non-traditional world music borrowers, Leonard Cohen “Hallelujah” completists. Not something for everyone; but something for several different groups that ensures it will float way off in its own region of an embedding space.

What doesn’t Spotify surface? That’s a much harder question to answer. But I know that the only album I’ve recommended to anyone recently probably fits the bill quite nicely; the last couple disks of the Beaux-Arts Trio’s complete Haydn Piano Trios.

Without streaming, this is probably a less obscure disk than the Pluhar, but it’s also pretty damned obscure in its own right. The Haydn trios are far less often played than his string quartets or symphonies because those two genres ended up becoming more prestigious, but Haydn didn’t know that, and the music is just as good. And while late Haydn has its own deeply appealing weirdness, I find it hard to imagine that there are any existing listeners out there who come to it before mostly exhausting their path through the Beethoven piano sonatas and Mozart quartets first. Pluhar’s music is sitting in a cabin in the Canadian woods waiting for any comers; Haydn’s trios are crammed into several walkups in Ditmars Steinway along with C.P.E. Bach, Boccherini, and the rest of the less-trafficked classical canon.

Is genre disappearing everywhere?

So is this happening everywhere in culture? The degree to which it’s an algorithmic product isn’t clear, but it sure seems like the streaming services have settled into a bubble of half-hour unclassifiable formats rather than “sitcoms” and “dramas.” Netflix’s “personalized genres” are not the product of an embedding system, but do play naturally alongside one, because they generate an affinity for works that cut across different realms.

The causation here is complicated because, as with many other trends, the technology merely gloms onto a larger cultural trend. It seems quite possible to me that music recommendation services succeed right now because the zeitgeist is aligned in a way where many people are amenable to being served these kinds of hybrid works. If you want pure genre, there are ways of getting it; modern satellite radio stations, like Sirius-XM, will give you all the expert-curated music you want for any microgenre imaginable.

Is this a problem? I don’t think many would see it as such; but it’s worth thinking, nonetheless, about what it does to culture. Anyone who manages to occupy an empty space in the cultural manifolds will be richly rewarded; anyone who tries to stay in a heavily-trafficked space will languish. The idea of cultural areas as fluid, non-differentiable groups flowing into each other will be a self-fulfilling prophecy; anyone insisting that genres are real may seem hopelessly old fashioned. Anyone who navigates cultural spaces through digital means will be over-exposed to hybrid cultural forms, which will only lead them further to think that the different genres were an old-fashioned illusion, brought about by a particular set of constraints around channels, record labels, and the rest. And of course they’ll be right. But if they think there’s anything more natural about an enforced space emphasizing novelty, sparsity, and so forth, they’ll be wrong; and a cultural dynamic around filling in the valleys of a manifold space rather than building up the summits may be less rewarding than we hope.

Guide to Digital Publishing

bmschmidt@gmail.com (Ben Schmidt) — Thu, 20 May 2021 15:26:37 GMT

I’ve been yammering online about the distinctions between different entities in the landscape of digital publishing and access, especially for digital scholarship on text. So I’ve collected everything I’ve learned over the last 10 years into one, handy-to-use, chart on a 10-year-old meme. The big points here are:

HathiTrust and JSTOR are not for-profit cartels, and I can’t count the number of times I’ve seen faculty and other researchers attack them for not being open enough when they’re just following laws, especially around nonsense justifications for keeping scholarly work out of the public domain, that faculty continually reinforce (through paranoia about, say, disembargoing a dissertation or publishing in an open-access journal that lacks prestige or, God forbid, a journal that skips the tree-killing stage entirely).
Stop publishing on Medium, goddammit! I’m not paying to read your blog post! You’re not going to make any money off of this! If the Huffington Post isn’t paying you and you don’t know how to set up a web server, just get a Wordpress account and pretend that you’re doing it for old-school cool. Come on, pull it together.
There are three places where change happens here. One is that the neutral Goods–Archive.org, especially–pull the lawful goods into slightly more open practices by doing good things and not getting sued. One is that the chaotic goods–the pirate sites–undermine the business model of the cartels in the lower left and keep them from changing things for the worse. And the last is that the faculty–the chaotic neutrals–pin this chart next to that shirtless picture of Zizek and stop publishing and demanding subscriptions to cengage content because it’s easier.

The common objections are:

Google’s in the wrong place. I think you mean Alphabet. Yes, it sure is. It’s a monopoly; it contains multitudes. If there were a slot in this for a fickle Old Testament God on which all else relies, that punishes and rewards in equal measure–yeah, I’d use that instead. But it is what it is.
JSTOR’s not good. Disagree. That’s the whole point here; we need something that isn’t gouging out our eyeballs in the scholarly journal space, and JSTOR is a not-for-profit targeted at nonexpert users that tries to keep pace.
What about Aaron Swartz? Why does this keep coming up? No, JSTOR did not kill Aaron Swartz. First off, it was the US Attorney who insisted on going through with it. Go read the MIT report and you’ll see that JSTOR called for the prosecution to be dropped the day he was arrested, while MIT refused to issue a public statement for months.
You forgot my favorite pirate site. I did! There are a lot of them, huh?
Seriously, Medium? STOP PUBLISHING ON MEDIUM PEOPLE I AM NOT PAYING FOR YOUR BLOG POST I DON’T UNDERSTAND WHY PEOPLE ARE PRETENDING THIS IS SOMEHOW ANYTHING OTHER THAN JUST A WORSE VERSION OF BLOGSPOT.COM
Google’s on it twice. Font choice.

Credits for suggestions to Alex Humphreys, Ted Underwood, Scott Weingart, Melissa Terras, Rachel Midura, Will Hanley, Ethan Gruber.

Moving from MySQL to DuckDB

bmschmidt@gmail.com (Ben Schmidt) — Wed, 28 Apr 2021 13:57:51 GMT

I mentioned earlier that I’ve been doing some work on the old Bookworm project as I see that there’s nothing else that occupies quite the same spot in the world of public-facing, nonconsumptive text tools.

That codebase is old–pieces of it date back to this blog post from a decade ago. Parts of that old architecture (e.g., perl) got quickly jettisoned (for Python). But others persist. In re-examining the technical stack behind Bookworm, I’ve realized that it’s finally possible to jettison one of the biggest pain points–MySQL–for something that better matches the workflows here.

People often ask about Postgres, but I’m moving to something a little bit more unexpected–the 2-year-old program DuckDB. This might seem like an odd choice! The core data architecture challenge of Bookworm is managing some enormous tables for storing a sparse matrix– the term-document matrix–for a large number of long documents. The HathiTrust bookworm has about 2 trillion words in 17 million books–I haven’t looked at the core tables recently, but I’d guess they have tens of billions of rows.

DuckDB, on the other hand, is manifestly targeted at a much smaller size–it borrows intensely in footprint from SQLite by using the SQLite shell, existing only as an embedded process in a running program (i.e., no daemon), and putting each database into a single moveable file. I never seriously considered SQLite as a Bookworm backing, because it’s too lightweight to handle enormous tables, because at the time of the original design I only knew how to write single-startup CGI scripts, and because MySQL gives intense options for tweaking performance on the margins. (Back in 2010-11 I got very used to using 3-byte unsigned integers, which can store values up to about 16 million, for ids, since they’re actually a convenient size; it took me a while to realize that 3-byte integers are an extraordinarily unusual thing.)

Column stores

But DuckDB has some major advantages. For one thing, it uses column-oriented stores, which means rather than store rows of interspersed data types, like MySQL, it groups primarily by the values–so you get all the counts as a series of integers, all the wordids as a series of integers, etc. For performance, Bookworm has always encoded words to integers under the hood; there are a variety of performance advantages to this form of storage. The costs mostly tend to be things that don’t matter in analytics (like it being harder to update a single customer record in a table with their latest purchase.) That’s why DuckDB exists–as something that will work better for analytics from Python or R than SQLite. And the basic design seems to be probably better conceived than SQLite because it’s starting from the ground up; it uses the Postgres parser and supports modern SQL reasonably well. For the large joins that accompany a typical Bookworm query (in which you declare which 1 million out of 10 million teacher evaluations you want to look at), this works well.

Here’s a dumb analogy for column stores. Imagine your data as being a bunch of different cookies. Addresses are Oreos, dates are chocolate chips, whatever. And you’ve got different types of values in there–some people live at doublestuff lane, some with those weird mint green oreos. The point of a column store is to keep all the oreos in line with each other because they’re the same shape.

Each sleeve is clear, so you can get an idea what’s inside it, but it’s also nicely shaped, so you can quickly pass it along to the next person. Imagine you’ve got a state-champion 400m relay race running around a track passing cookies to each other. Every team will do better if, instead of passing a motley arrangement of cookies to the next runner, they can just hand off a single baton of oreos in a sleeve. That’s what a column store does.
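
In code, the difference between the two layouts looks roughly like this. The sketch is only an illustration of the idea, not DuckDB's actual storage format; the `Row` and `Columns` shapes and the function names are made up for the example.

```typescript
// Row-oriented: each record keeps its mixed fields together.
interface Row {
  bookid: number;
  wordid: number;
  count: number;
}

// Column-oriented: each field lives in its own contiguous typed array,
// like a sleeve holding only one kind of cookie.
interface Columns {
  bookid: Uint32Array;
  wordid: Uint32Array;
  count: Uint32Array;
}

function toColumns(rows: Row[]): Columns {
  const columns: Columns = {
    bookid: new Uint32Array(rows.length),
    wordid: new Uint32Array(rows.length),
    count: new Uint32Array(rows.length),
  };
  rows.forEach((row, i) => {
    columns.bookid[i] = row.bookid;
    columns.wordid[i] = row.wordid;
    columns.count[i] = row.count;
  });
  return columns;
}

// Row version: every record is visited as a whole object.
function totalForWordRows(rows: Row[], wordid: number): number {
  let total = 0;
  for (const row of rows) {
    if (row.wordid === wordid) total += row.count;
  }
  return total;
}

// Column version: only the two arrays the query needs are ever touched.
function totalForWordColumns(columns: Columns, wordid: number): number {
  let total = 0;
  for (let i = 0; i < columns.wordid.length; i++) {
    if (columns.wordid[i] === wordid) total += columns.count[i];
  }
  return total;
}
```

Both functions return the same total; the column version simply never has to look at `bookid`, which is the whole appeal when a table has many columns and a query touches two of them.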

Indexing

While the relational queries against catalog tables are important, the most difficult part of any bookworm query is accessing the individual word counts– those 50 billion row tables of the term-document matrix. What MySQL did for us there was to allow the creation of fast b-tree indices that put related rows together on disk. This was often the most time-consuming task, because MySQL index creation could take a week on a really huge table; and it left the indices far larger than the actual tables themselves. (In fact, the design of the database was such that the original table is never used–queries only ever read from the index.) The default MySQL settings made it very difficult to create these indices as well.

DuckDB uses mostly block range indexes, which tell you roughly what part of the file any given dataset might be in, and don’t sort the underlying data. This is faster, but wouldn’t allow for quick lookup in a big table–you’d end up scanning almost everything.

But there’s a trick here, which is to sort the data first before putting it into DuckDB. If the term-document matrix is sorted by wordid, all of the occurrences for each word will be right next to each other, just as with the MySQL index. It’s probably not quite as fast for retrieval, but the column-oriented structure that comes out can race ahead on the subsequent joins. Pre-sorting isn’t trivial, since we’re talking about far more data than fits in memory. But pyarrow exposes some strikingly fast pivot methods for partially sorting arrays, which makes it possible to shuffle things around without fully sorting. This matters, because conventional merge sorts involve entirely sorting each subarray before merging, which can be extremely time-consuming for little benefit in a column-oriented situation where a record is not contiguous to itself.
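
Why sorting matters so much for a min/max-style index can be seen in a toy version. This sketch is an assumption-laden stand-in, not DuckDB's implementation: it records the smallest and largest wordid in each fixed-size block and counts how many blocks a lookup would still have to scan. The block size, the data, and the function names are all invented for the example.

```typescript
interface BlockRange {
  start: number; // first row in the block
  end: number; // one past the last row in the block
  min: number;
  max: number;
}

// Records the smallest and largest value found in each block of rows.
function buildBlockRanges(values: number[], blockSize: number): BlockRange[] {
  const ranges: BlockRange[] = [];
  for (let start = 0; start < values.length; start += blockSize) {
    const end = Math.min(start + blockSize, values.length);
    let min = Infinity;
    let max = -Infinity;
    for (let i = start; i < end; i++) {
      min = Math.min(min, values[i]);
      max = Math.max(max, values[i]);
    }
    ranges.push({ start, end, min, max });
  }
  return ranges;
}

// A block can be skipped only when the target falls outside its [min, max].
function blocksToScan(ranges: BlockRange[], target: number): number {
  return ranges.filter((r) => r.min <= target && target <= r.max).length;
}

// 1,000 rows over 50 word ids, interleaved so every block sees every id.
const unsortedIds: number[] = [];
for (let i = 0; i < 1000; i++) {
  unsortedIds.push((i * 7) % 50);
}
// The same rows after sorting by word id.
const sortedIds = unsortedIds.slice().sort((a, b) => a - b);

const unsortedScan = blocksToScan(buildBlockRanges(unsortedIds, 100), 17);
const sortedScan = blocksToScan(buildBlockRanges(sortedIds, 100), 17);
```

With the interleaved data every block's range spans all fifty ids, so `unsortedScan` is every block in the table; after sorting, the rows for word 17 sit inside a single block and `sortedScan` drops accordingly.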

In ignorance of the best way to handle this, I’ve coded up a new routine that does sorts in three passes:

Splits each input batch in 16 pieces;
Sorts those batches, and then continuously finds the least sorted 16 contiguous batches, combines them into a new table, and then breaks them into 16 new non-overlapping batches.
Once the order is barely stable enough to ensure that a single merge pass will work, traverse in order for a merge sort.

This algorithm seems pretty neat to me, but I have no idea if it’s especially good or even if it’s guaranteed to converge on a sorted array. In any case, it’s much, much faster than the old MySQL index creation was and has a much smaller memory footprint.
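
One possible reading of those three passes, simplified to in-memory arrays, is sketched below. It is not the Bookworm routine: it measures "least sorted" as the number of neighboring batches whose ranges overlap, uses a full sort on the combined window instead of pyarrow's partial sorts, and the names `countOverlaps` and `windowSort` are invented. In this simplified form each repair removes at least one out-of-order pair, so the loop does terminate; that says nothing about the real routine.

```typescript
// Counts neighboring batches inside the window whose value ranges overlap,
// i.e. the left batch's largest value exceeds the right batch's smallest.
function countOverlaps(batches: number[][], start: number, width: number): number {
  let count = 0;
  for (let i = start; i < start + width - 1 && i < batches.length - 1; i++) {
    const left = batches[i];
    const right = batches[i + 1];
    if (left[left.length - 1] > right[0]) count++;
  }
  return count;
}

// batchSize must be at least 1 and width at least 2.
function windowSort(values: number[], batchSize: number, width: number): number[] {
  // Pass 1: split into fixed-size batches and sort each one on its own.
  const batches: number[][] = [];
  for (let i = 0; i < values.length; i += batchSize) {
    batches.push(values.slice(i, i + batchSize).sort((a, b) => a - b));
  }
  // Pass 2: repeatedly repair the window of contiguous batches that has the
  // most overlapping neighbors, until no neighbors overlap anywhere.
  for (;;) {
    let worstStart = 0;
    let worstScore = 0;
    for (let s = 0; s < batches.length; s++) {
      const score = countOverlaps(batches, s, width);
      if (score > worstScore) {
        worstStart = s;
        worstScore = score;
      }
    }
    if (worstScore === 0) break;
    const end = Math.min(worstStart + width, batches.length);
    let combined: number[] = [];
    for (let i = worstStart; i < end; i++) {
      combined = combined.concat(batches[i]);
    }
    combined.sort((a, b) => a - b);
    // Break the combined table back into non-overlapping batches.
    let offset = 0;
    for (let i = worstStart; i < end; i++) {
      const size = batches[i].length;
      batches[i] = combined.slice(offset, offset + size);
      offset += size;
    }
  }
  // Pass 3: every batch is sorted and no neighbors overlap, so a single
  // in-order traversal yields the fully sorted output.
  let output: number[] = [];
  for (const batch of batches) {
    output = output.concat(batch);
  }
  return output;
}
```

For example, `windowSort(values, 10, 4)` sorts in batches of ten while only ever holding four batches in a combined table at once, which is the property that matters when the full table does not fit in memory.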

Once the table is sorted, it’s just a matter of loading it into duckdb. The final write happens to a massive parquet file, which can be written out of memory; then duckdb can ingest it straight into its database format.

DuckDB doesn’t yet support compression or a stable on-disk format, but the pace of development is fast enough and impressive enough that I’m willing to take a bet on it. Especially because we never used compression in MySQL, either.

Javascript and the next decade of data programming

bmschmidt@gmail.com (Ben Schmidt) — Mon, 08 Mar 2021 14:52:01 GMT

I’ve recently been getting pretty far into the weeds about what the future of data programming is going to look like. I use pandas and dplyr in Python and R respectively. But I’m starting to see the shape of something that’s interesting coming down the pike. I’ve been working on a project that involves scatterplot visualizations at a massive scale–up to 1 billion points sent to the browser. In doing this, a few things have become clear:

Computers have gotten much, much faster in the last couple decades
Our languages for data analysis have failed to keep up.
New data formats are making the differences between Python, R, and Javascript less important.
Javascript, the quintessential front-end language, is increasingly becoming the back-end for data work in Python and R.
Things will be weird, but also maybe good?

I tweeted about it once, after I had experimented with binary, serialized alternatives to JSON.

As webgpu and new binary serialization formats–like Arrow–come of age, it's going to be harder and harder to stomach geojson's slowness. More and more of R and python will become js or wasm wrappers. Just like in the 2000s they were wrappers around Java. It'll be very weird.
— Benjamin Schmidt (@benmschmidt) December 23, 2020

I’m writing about Python and R because they’re completely dominant in the space of data programming. (By data programming, I mean basically ‘data science’; not being a scientist, I have trouble using it to describe what I do.) Some dinosaurs in economics still use Stata, and some wizards use Julia, but if you want to work with data that’s basically it. The big problem with the programming languages we use to work with data is that they run largely on CPUs, and often predominantly on a single core. This has always been an issue in terms of speed; when I first switched to Python around 2011, I furiously searched for ways around the GIL (global interpreter lock) that keeps the language from using multiple cores even on threads. Things have gotten a little better on some fronts–in general, it seems like at least linear algebra routines can make use of a computer’s full resources.

JS/HTML is the low-level language for UI and Python and R.

Separately, the graphical and interface primitives of all programs have started to move to the web. If I had started doing this kind of work seriously even a couple years later, I would never even have noticed there used to be another way. I never really used tcl/tk interfaces in R, but I was always aware that they existed; the very first, private version of the Google Ngrams browser that JB Michel wrote in like 2008 or something was built around some Python library. This was normal. But in the last decade, it’s become obvious that if you want to build user-facing elements to describe something like “a button” or “a mouseover”, the path of least resistance is to use the HTML conception, not the operating system conception of them. The fifteen-year-old freshman who built the first Bookworm UI quickly saw it needed a javascript plotting library. This integration is becoming tighter and tighter in data programming land. I have collaborators and grad students who transition seamlessly into bundling their R packages into Shiny apps, into decorating their Google colab notebooks with all sorts of sliders and text entry fields, into publishing R and Python code as online books with HTML/JS navigation.

Jupyter notebooks and the RStudio IDE themselves are part of this transformation; what appears to be Python code held together by an invisible skein of Javascript. Again, these are platforms that have more or less displaced earlier models. When I first learned R, I pasted from textedit into the core R GUI; I went a little down the road into ESS-mode in emacs as well. But if you need to continually be checking random samples of a dataframe, re-running modules, and seeing if your regular expressions correctly clean a dataset, you are using a notebook interface today, even if you bundle your code into a module at some point.

And for visualization, Javascript is creeping into this space. Like many people, I’ve been relieved to be able to use Altair instead of matplotlib for visualizing pandas dataframes; and I don’t think twice about dropping ggplotly into lessons about ggplot for students who start wondering about tooltips on mouseover. ggplot and matplotlib are still king of the roost for publication-ready plots, but after becoming accustomed to interactive, responsive charts on the web, we are coming to expect exploratory charts to do the same thing; just as select menus and buttons from HTML fill this role in notebook interface, JS charting libraries do the same for chart interface.

The GPU-laptop interface is an open question

Let me be clear–something I’ll say in this following section is certainly wrong. I’m not fully expert in what I’m about to say. I don’t know who is! There are some analogies to web cartography, where I’ve learned a lot from Vladimir Agafonkin. Many of the tools I’m thinking about I learned about in a set of communications with Doug Duhaime and David McClure. But the field is unstable enough that I think others may stumble in the same direction I have.

This whole period, GPUs have also been displacing CPUs for computation. The R/Python interfaces to these are tricky. Numba kind of works; I’ve fiddled with gnumpy from time to time; and I’ve never intentionally used a GPU in R, although it’s possible I did without knowing it. The path of least resistance to GPU computation in Python and R is often to use Tensorflow or Torch even for purposes that don’t really need a neural network library–so I find myself, for example, training UMAP models using the neural network interface rather than the CPU one even though I’d prefer the other.

Most of these rely on CUDA to access GPUs. (When I said I don’t know what I’m talking about–this is the core of it.) If you want to do programming on these platforms, you increasingly boot up a cloud server and run heavy-duty models there. CUDA configuration is a pain, and the odds are decent your home machine doesn’t have a GPU anyway. If you want to run everything in the cloud, this is fine–Google just gives away TPUs for free. But for a group-by/apply/summarize on a few million rows, this is overkill; and while cloud compute is pretty cheap compared to your home laptop, cloud storage is crazy expensive. Digital Ocean charges me like a hundred dollars a year just to keep up the database backing RateMyProfessor; for the work I do on several terabytes of data from the HathiTrust, I’d be lost without a university cluster and the 12TB hard drive on my desk at home.

But I want these operations to run faster.

Javascript is already fast, even without its GPU.

When I started using webgl to make charts in Javascript, I was completely blown away by what it could do. I’m used to sitting around waiting for ggplot to render even a few thousand points. I’m used to polygon operations in geopandas being long and expensive. I’m used to getting up to get some tea when I want to load a geojson file.

But I could use javascript to generate millions of points in random polygons from primitive triangles in barely any time; and then using regl it can animate fast enough to make seamless zooming reasonable. Here, for example, is every single vote (excluding absentee) in New York City precincts in the 2020 election. (Hopefully this embed from Observable loads… but if it doesn’t, well, that’s kind of the point, too. I’m making you click below to avoid clobbering people on phones.)

Digging into the weeds to make more elaborate visualizations like this, I can see why. Apache Arrow exposes an extremely low level model of the data you work with, one that encourages you to think a lot about both the precise schema and the underlying types. In Python, I’ve gotten used to this kind of work in numpy; in R, I’ve only ever done a little bit of bit twiddling. But in modern JS, binary array buffers are built right into the language. When I started tinkering with JS, I thought of it as slow; but web developers are far more obsessive about speed than the community around any other high-level, dynamically typed language I’ve seen. The profiling tools built into Chrome are incredibly powerful; and Google, especially, has made a huge investment in making JS run incredibly quickly because there’s huge money in frictionless web experience. Sure, lots of websites are slow because they come with megabyte-sized React installations and casual bloat; sure, the DOM is slow to work with. But Javascript itself is fast.

In my first few years teaching digital humanities, probably the most thankless task was helping students manage their local Java installations so they could run Mallet, the best implementation of topic-modeling algorithms out there. Now, we usually use slower and inferior implementations in gensim, structural topic models, and the like. (For an interesting discussion from Ted Underwood and Yoav Goldberg of how inferior results in gensim and sklearn came to displace mallet, see the Twitter threads here.) But as David Mimno, who keeps Mallet running, says, Javascript works much faster.

Finally, integrate algorithms with interface. The browser is a high performance computing environment (JavaScript is MUCH faster than Python) embedded in an excellent interactive graphics environment. Plus there’s a code environment hidden underneath! Print those variables!
— David Mimno (@dmimno) October 26, 2020

And while Javascript has a reputation as a terrible language, the post-ES2015 iterations have made it in many cases relatively easy to program with. Maps, sets, for ... of ... all work much like you’d expect (unlike the days when I spent a couple hours hunting out a rarely occurring bug in one data visualization that turned out to occur when I was making visualizations of wordcounts that included the word constructor somewhere in the vocabulary); and many syntactic features like classes, array destructuring, and arrow function notation are far more pleasant than their Python equivalents. (Full disclosure–even after a decade in the language, I still find Python’s whitespace syntax gimmicky and at heart just don’t like the language. But that’s a post for another day.)

Javascript with WebGL is crazy fast.

And if javascript is fast, WebGL is just bonkers in what it can do. Want to lay out two million points in a peano curve in a few milliseconds? No problem–you can even regenerate every single frame.

And WebGL uses floating-point buffers that are the same as those in Apache Arrow, so you copy blocks of data straight from disk (or the web) into the renderers without even having to do that (still fast) javascript computation. It’s difficult, and easy to do wrong. (I’ve found regl pitched at the perfect level of abstraction, but I still occasionally end up allocating thousands of buffers on the GPU every frame where I meant to only create one persistent one).

In online cartography, protobuffer-based vector files do something similar in libraries like mapbox.gl and deck.gl. The overhead of JSON-based formats for working with cartographic data is hard to stomach once you’ve seen how fast, and how much more compressed, binary data can be.

WebGL is hell on rollerskates

In working with WebGL, I’ve seen just how fast it can be. For things like array smoothing, counting of points to apply complicated numeric filters, and group-by sums, it’s possible to start applying most of the elements of the relational algebra on data frames in a fully parallelized form.

But I’ve held back from doing so in any but the most ad-hoc situations because WebGL is also terrible for data computing. I would never tell anyone to learn it, right now, unless they completely needed to. Attribute buffers can only be floats, so you need to convert all integer types before posting. In many situations data may be downsized to half-precision floating points, and double-precision floating points are so difficult that there are entire rickety structures built to support them at great cost. Support for texture types varies across devices (Apple ones seem to pose special problems), so people I’ve learned from like Ricky Reusser go to great lengths to support various fallbacks. And things that are essential for data programming, like indexed lookup of lists or for loops across a passed array, are nearly impossible. I’ve found writing complex shaders in WebGL fun, but doing so always involves abusing the intentions of the system.

WebGPU and wasm might change all that

WASM and the Javascript Virtual Machine

But the last two pieces of the puzzle are lurking on the horizon. Web Assembly–wasm files–gives another way to write things for the javascript virtual machine that can avoid the pitfalls of Javascript being a poorly designed language. A few projects that are churning along in Rust hold the promise of making in-browser computation even faster. (If I were going to go all-in on a new programming language for a few months right now, it would probably be Rust; in writing webgl programs I increasingly find myself doing the equivalent of writing my own garbage collectors, but as a high-level guy I never learned enough C to really know the basic concepts.) Back in the 2000s, the python and R ecosystems were littered with packages that relied on the Java virtual machine in various ways. In the 2010s, it felt to me like they shifted to underlying C/C++ dependencies. But given how much effort is going into it, I think we’ll start to see things use the Javascript Virtual Machine more and more. When I want to use some of D3’s spherical projections in R, that’s how I call them; and Jeroen Ooms’s V8 package (for running the JSVM, or whatever we call it) is approaching the same level of downloads as the more venerable rJava. I suspect almost all of this is running just Javascript. If it starts becoming a realistic way to run pre-compiled Rust and C++ binaries on any system… that’s interesting.

WebGPU

The last domino is a little off, but could be titanically important. WebGL is slowly dying, but the big tech companies have all gotten together to create WebGPU as the next-generation standard for talking to GPUs from the browser. It builds on top of the existing GPU interfaces for specific devices (Apple, etc.) like Vulkan and Metal, about which I have rigorously resisted learning anything.

WebGPU will replace WebGL for fast in-browser graphics. But the capability to do heavy duty computation in WebGL is so tantalizing that some lunatics have already begun to do it. The stuff that goes into Reusser’s work is amazing; check out this notebook about “multiscale Turing patterns” that creates gorgeous images halfway between organic blobs and nineteenth-century endplates.

I haven’t read the draft WebGPU spec carefully, but it will certainly allow a more robust way to handle things. There is already at least one linear algebra library (i.e., BLAS) for WebGPU out there. I can only imagine that support for more data types will make many simple group-by-filter-apply functions plausible entirely in GPU-land on any computer that can browse the web.

When I started in R back in 2004, I spent hours tinkering with SQL backing for what seemed at the time like an enormous dataset: millions of rows giving decades of data about student majors by race, college, gender, and ethnicity. I’d start a Windows desktop cranking out charts before I left the office at night, and come back to work the next morning to folders of images. Now, it’s feasible to send an only-slightly-condensed summary of 2.5 million rows for in-browser work and the whole dataset could easily fit in GPU memory. In general, the distinction between generally available GPU memory (say, 0.5 - 4GB) and RAM (2-16GB) is not so massive that we won’t be sending lots of data there. Data analysis and shaping is generally extremely parallelizable.
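To put rough numbers on that claim, here is a back-of-the-envelope sketch in Python. The column count and the 4-byte width are assumptions of mine for illustration, not properties of the actual majors dataset; the point is only the order of magnitude.

```python
# Assumptions (not taken from the real dataset): 8 columns, each stored as a
# 4-byte value (float32 or int32), dense and uncompressed.
ROWS = 2_500_000
COLUMNS = 8
BYTES_PER_VALUE = 4


def table_megabytes(rows, columns, bytes_per_value):
    """Size of a dense, uncompressed columnar table in megabytes (10^6 bytes)."""
    return rows * columns * bytes_per_value / 1_000_000


size_mb = table_megabytes(ROWS, COLUMNS, BYTES_PER_VALUE)
print(size_mb)        # 80.0
print(size_mb / 500)  # 0.16, the fraction of a 0.5 GB GPU this would occupy
```

Even at the low end of the GPU memory range quoted above, a table of that shape would take up well under a fifth of the space.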

JS and WebGPU will stick together

Once this bundle gets rolling, it will be much faster and more convenient than python/R, and in many cases it will be able to run with zero configuration. The Arquero library, introduced last year, already brings most of the especially important features of the dplyr or pandas API into Observable at a nearly comparable speed. With tighter binary integration or a different backend, it–or something like it–could easily become the basic platform for teaching the non-major introduction to data science course all of the universities are starting to launch. Even if it didn’t, the vast superiority of Javascript over R/Python for both visualization speed (thanks to GPU integration) and interface (thanks to the ubiquity of HTML5) means that people will increasingly bring their own data to websites for initial exploration first, and may never get any farther. (If I were going to short public companies based on the contents of these speculations, I’d start with NVidia–whose domination of the GPU space is partially dependent on CUDA being the dominant language, not WebGPU–and ESRI, which is floundering as it tries to make desktop software that does what web browsers do easily.)

Once these things start getting fast, the insane overhead of parsing CSV and JSON, and the loss of strict type definitions that they come with, will be far more onerous. Something–I’d bet on parquet, but there are possibilities involving arrow, HDF5, ORC, protobuffer, or something else–will emerge as a more standard binary interchange format.
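A small illustration of that overhead, using only the Python standard library rather than Arrow or parquet themselves (so treat it as a sketch of the general idea, not of any particular format): the same thousand 32-bit floats are serialized once as JSON text and once as a raw fixed-width buffer.

```python
import json
from array import array

# One thousand 32-bit floats ("f" is the typecode for a C float).
values = array("f", [i / 10 for i in range(1000)])

as_json = json.dumps(list(values)).encode("utf-8")
as_binary = values.tobytes()

# The binary form is exactly 4 bytes per value, and the type is known up front.
print(len(as_binary))  # 4000
# The JSON form spells every number out as decimal text, so it is far larger,
# and nothing in it records that these were 32-bit floats to begin with.
print(len(as_json) > len(as_binary))  # True

# Reading the binary form back is a reinterpretation of bytes, not a parse.
view = memoryview(as_binary).cast("f")
print(view[1] == values[1])  # True
```

The JSON route has to parse every digit of every number on the way back in; the binary route only needs to be told the type and the length.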

Why bother with R and Python?

So–this is the theory–the data programming languages in R and Python are going to rely on that. Just as they wrap Altair and they wrap HTML click elements, you’ll start finding more and more that the package/module that seems to just work, and quickly, that the 19-year-olds gravitate towards, runs on the JSVM. There will be strange stack overflow questions in which people realize that they have an updated version of V8 installed which needs to be downgraded for some particular package. There will be python programs that work everywhere but mysteriously fail on some low-priced laptops using a Chinese startup’s GPU. And there will be things that almost entirely avoid the GPU because they’re so damned complicated to implement that the Rust ninjas don’t do the full text, and which–compared to the speed we see from everything else–come to be unbearable bottlenecks. (From what I’ve seen, Unicode regular expressions and non-spherical map projections seem to be likely candidates here.)

But it will also raise the question of why we should bother to continue in R and Python at all. Javascript is faster, and will run anywhere, universally, without the strange overhead of binder notebooks and the cost of loading data in the cloud. WASM ports of these languages that run inside the JSVM will help, but ultimately get strange. (Will you write python code that gets transpiled in the browser to WASM, and then invokes its own javascript emulator to build an altair chart?) Beats me!

But I’ve already started sharing elementary data exercises for classes using observablehq, which provides a far more coherent approach to notebook programming than Jupyter or RStudio. (If you haven’t tried it–among many, many other things, it parses the dependency relations between cells in a notebook topologically and avoids the incessant state errors that infect expert and–especially–novice programming in Jupyter or Rstudio.) And if you want to work with data rather than write code, it is almost as refreshing as the moment in computer history it tries to recapitulate, the shift from storing business data in COBOL to running them in spreadsheets. The tweet above that forms the germ of this rant has just a single, solitary like on it; but it’s from Mike Bostock, the creator of D3 and co-founder of Observable, and that alone is part of the reason I bothered to write this whole thing up. The Apache Arrow platform I keep rhapsodizing about is led by Wes McKinney, the creator of pandas, who views it as the germ of a faster, better pandas2, from a position initially sponsored by RStudio and subsequently with funding from Nvidia. Speculative as this all is, it’s also–aside from the massive neural-network gravitational pull of the tensorflow/torch solar systems–where the tools that became hegemonic in the last decade are naturally drifting. (Not to imply that Javascript is anywhere near the top of the Arrow project’s priority list, BTW. It isn’t.) I wish more of the data analysts, not just the insiders, saw this coming, or were excited that it is.

As I said, I’ve been doing some of this programming since 2003 or so, and been putting in my regular rounds most days since 2010. In that time I’ve come to see that what I want to see most–fully editable, universally runnable, data analysis on open data–is not a universal code. Some people just want static charts. Some people want to hide their data. Most readers don’t want to tweak the settings. And everyone looks down on people who like Javascript. But it’s also the case that the web was first built in the 90s to share complicated academic work and make it editable by its readers. Even if most of academia and much of the media is devoted to one-way flows of information, and much of the post-social media Internet is a blazing hellscape, I’m excited about these shifts in the landscape precisely because they hold out the possibility that some portion of the Web might actually live up to its promise of making it easier to think through ideas.

Bookworm Caching

bmschmidt@gmail.com (Ben Schmidt) — Sun, 07 Mar 2021 16:18:47 GMT

I used to blog everything that I did about a project like Bookworm, but have got out of the habit. There are some useful changes coming through the pipeline, so I thought I’d try to keep track of them, partly to update on some of the more widely used installations and partly

The core work on Bookworm happened in 2011-2013 when I was at Harvard working with Erez Lieberman Aiden and JB Michel as a way of bringing the metadata in digital libraries to interfaces like the Google Ngram Viewer that they built.

As such, it uses a very 2000s form of content management: a single-server, LAMP stack oriented architecture that assumes you have a MySQL database always running and can post individual queries against it.

Over the years, I’ve tweaked the backend a bit to allow for more resilience in this architecture. In particular, the web server–like most webservers nowadays– lives somewhere in the cloud. (On a Digital Ocean droplet, although that’s not important.)

That’s great, because it means that the server can be basically static. But you still need a database server somewhere. Even for a medium-sized corpus like the Rate My Professors one, hosting the databases can be real money simply for hard drive space–something like $100 a year. On bigger databases like Chronicling America, these costs are prohibitive on many servers. Historically, I just used a desktop in my office. But under COVID, that has kind of fallen apart, because what used to be about 99% uptime on a machine plugged into Ethernet has degraded into perhaps 50% uptime on a machine on residential wifi in my bedroom at home.

That means that every week, I get e-mails from people about to run a workshop suddenly realizing that the site on gendered teaching evaluations has broken. There are two solutions here.

Virtualize the server and run it in the cloud, too.
Cache results so that the frontend can run without MySQL entirely.

I’m working on both, but the second is easier–that’s what I’m describing here.

The strategy is essentially to build up a local cache of the most common queries that can live on the webserver. As a format for that cache I’m using Apache Arrow’s .feather format, which I’ve become enamored of in the last year–it’s a binary serialization that’s far smaller and faster to load than JSON. For each query I generate an SHA-1 hash from the description of the query; if that exists among the last 256 queries to the server, a local version of the bookworm API that runs without MySQL can return the answer directly, whether or not the database backend is still alive. If it does, great. If not, we fall back to a proxy form of the API that can reach out to my home server’s API endpoint. In addition to that 256-item LRU (least-recently-used item) cache, there’s also an option to specify a cold storage cache. For the RateMyProfessors Bookworm, my plan is to fill this with several thousand of the most frequent queries so that workshops can generally proceed without any trouble even when the main db is down.
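The lookup order described above can be sketched roughly as follows. This is not Bookworm’s actual code: the class and function names are hypothetical, the query is assumed to be a JSON-serializable dict, the cold storage is stood in for by a plain dict, and results are treated as opaque bytes rather than real feather files.

```python
import hashlib
import json
from collections import OrderedDict


class QueryCache:
    """Hypothetical two-tier cache: a small LRU plus an optional cold store."""

    def __init__(self, max_items=256, cold_store=None):
        self.max_items = max_items
        self.recent = OrderedDict()  # oldest entry first, newest last
        self.cold_store = cold_store if cold_store is not None else {}

    @staticmethod
    def key_for(query):
        # Canonical JSON, so the same query always hashes to the same key
        # regardless of the order its fields were written in.
        canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self.key_for(query)
        if key in self.recent:
            self.recent.move_to_end(key)  # mark as most recently used
            return self.recent[key]
        return self.cold_store.get(key)  # None if it is not there either

    def put(self, query, result):
        key = self.key_for(query)
        self.recent[key] = result
        self.recent.move_to_end(key)
        while len(self.recent) > self.max_items:
            self.recent.popitem(last=False)  # evict the least recently used


def answer(query, cache, fetch_from_backend):
    """Serve from the cache when possible; otherwise proxy to the backend."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    result = fetch_from_backend(query)
    cache.put(query, result)
    return result
```

The one design choice worth noting is that a cache hit never touches `fetch_from_backend`, which is what lets the frontend keep answering common queries while the database is unreachable.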

There are other ways of handling caching. This one is notably deficient in that it’s not truly a static solution: there’s still a python daemon running to process the API requests on each query. I had always thought that I’d probably just store JSON on the server directly so that a Bookworm could run entirely statically. I may yet do that. But this also serves another purpose of mine, which is to extend the family of API backends Bookworm can run on. A local cache backed by MySQL isn’t much different than MySQL itself, but it opens up some more useful possibilities, such as:

Hitting multiple different MySQL backends, which allows sharding bookworm servers on extremely large corpora.
Building entirely different backends on things like Solr or ElasticSearch. (Although I’ll note that the old MySQL architecture, dated as it is, continues to allow things that none of the Lucene managers I’ve worked with over the years think are possible in routine time in terms of aggregating queries.)
Data transfer over http using arrow, which is now fully supported (it’s happening behind the scenes on every query now) which opens up some useful possibilities for speeding up and making Python and R modules more type aware.

Extremely Technical notes

But a stack this complicated also has complications. Some come from the new Docker setup. Just as a note to myself and anyone else attempting something similarly complicated:

Remote forwarding to docker requires enabling GatewayPorts on ssh configuration both for the client (~/.ssh/config) and the host (/etc/ssh/sshd_config)
That’s dangerous! So immediately following that, I had to set up ufw to block all incoming connections to the webserver except on ports 80 and 443.
Now docker is once again not allowed to access the host, because it’s technically an outside host. I allow access to the docker subnet with ufw allow from 172.24.0.1. I don’t know if 172.24.0.1 is always the address for a docker cluster; I found it by doing docker container ls to get my containers, and then docker inspect $ID on the relevant container, which gave an IPAddress of 172.24.0.2. I’m just going to assume that anything docker allocated will be in the 172.24.0.* range.
Just as the webserver needs to know where docker lives, docker needs to know the webserver. That I get with ifconfig, looking for the docker0 subnet IP address. In that context, it’s 172.17.0.1. Note 172.17 instead of 172.24; I would have thought they’d be the same, so evidently I don’t really understand networking.
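One way to sanity-check those addresses is with Python’s standard ipaddress module. As far as I understand it, docker0 is docker’s default bridge network, while user-defined networks (such as the ones docker compose creates) each get a subnet of their own, which would explain the two different prefixes. The /24 and /16 prefix lengths below are assumptions for illustration; the real ones should be read from docker itself.

```python
import ipaddress


def in_subnet(address, subnet):
    """True if the dotted-quad address falls inside the CIDR subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


# Assumed subnets: a /24 around the container address reported by
# `docker inspect`, and a /16 around the docker0 address from ifconfig.
print(in_subnet("172.24.0.2", "172.24.0.0/24"))  # True
print(in_subnet("172.17.0.1", "172.24.0.0/24"))  # False
print(in_subnet("172.17.0.1", "172.17.0.0/16"))  # True
```

So a firewall rule scoped to 172.24.0.* covers the container but not the docker0 bridge address, which is consistent with the two having to be configured separately.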

Jobs Report November update

bmschmidt@gmail.com (Ben Schmidt) — Thu, 12 Nov 2020 17:00:45 GMT

I last looked at the H-Net job numbers about a month ago.

Since then, the news isn’t exactly good, but it’s also probably as good as anyone could expect. For most of September and October, history jobs were at about 25% of their average for the 2010s; this was slightly worse than we’re seeing in the approximate numbers in–for instance–science jobs, where new job openings are at about 30% of their normal levels. (Thanks to Dylan Ruediger at the AHA for passing along that link.)

The shifts in the last month have pushed the total numbers over 100 jobs in history; this is a bit of an advance, so we’re now down only about 70% from the normal rates, not 75%. I don’t know if the sciences have seen a similar rebound.

We’re far enough into the year that it’s worth looking at subfield numbers to see how different fields are faring. The loss of jobs is much more uneven than I thought.

At one extreme, jobs in world history and Middle Eastern history are down about 95%; at the other end, there have been 32 jobs listed with a primary category of Black studies or African American studies, an increase of 28% over the normal rate. The only other subfield not to have seen a catastrophic collapse is history of science, down “just” 40%. This is clearly the Black Lives Matter moment playing itself out in the hiring patterns, and it produces some remarkable inversions; there are twice as many jobs listed for African-American history specifically as for American history generally.

The collapses in European and Middle East hiring are especially remarkable given that both fared quite badly after 2008, as well. A typical associate professor in mideast studies might have received her first job in 2008, when there were about 60 new jobs in mideast studies; this year, there is one.

Average jobs listed by November 12, by era
Region	2004-2008	2009	2010-2019	2020	2020 as share of 2010s average	2020 as share of 2000s average
Middle East	61.20	29.0	21.9	1.0	5%	2%
World	37.80	13.0	14.5	1.0	7%	3%
Europe	88.40	36.0	46.1	5.0	11%	6%
US/Canada	162.80	79.0	68.2	14.0	21%	9%
Ancient	28.20	16.0	17.0	3.0	18%	11%
Americas	52.40	26.0	30.0	7.0	23%	13%
Asia	80.60	43.0	54.7	14.0	26%	17%
Methodological	25.20	12.0	30.4	8.0	26%	32%
Africa	13.00	4.0	20.8	5.0	24%	38%
Hist. Sci.	12.40	7.0	8.4	5.0	60%	40%
Interdisciplinary	15.20	9.0	25.4	10.0	39%	66%
Black/Af-Am	33.75	20.0	25.0	32.0	128%	95%
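The two share columns appear to be the 2020 count divided by each era’s average, rounded to the nearest percent. A quick check in Python on three rows copied from the table (the variable and function names are just for illustration):

```python
# (region, 2004-2008 average, 2010-2019 average, 2020 count), from the table.
ROWS = [
    ("Middle East", 61.20, 21.9, 1.0),
    ("Hist. Sci.", 12.40, 8.4, 5.0),
    ("Black/Af-Am", 33.75, 25.0, 32.0),
]


def share(count, baseline):
    """Count as a whole-number percentage of a baseline average."""
    return round(100 * count / baseline)


shares = {
    region: (share(y2020, avg_2010s), share(y2020, avg_2000s))
    for region, avg_2000s, avg_2010s, y2020 in ROWS
}

for region, (vs_2010s, vs_2000s) in shares.items():
    print(f"{region}: {vs_2010s}% of the 2010s average, "
          f"{vs_2000s}% of the 2000s average")
# Middle East: 5% of the 2010s average, 2% of the 2000s average
# Hist. Sci.: 60% of the 2010s average, 40% of the 2000s average
# Black/Af-Am: 128% of the 2010s average, 95% of the 2000s average
```

Read this way, “down 95%” for Middle Eastern history and “an increase of 28%” for Black and African American studies are both measured against the 2010s column.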

History Jobs Update

bmschmidt@gmail.com (Ben Schmidt) — Thu, 01 Oct 2020 13:29:09 GMT

Out of a train-wreck curiosity about what’s been happening to the historical profession, I’ve been watching the numbers on tenure-track hiring as posted on H-Net, one of the major venues for listing history jobs.

[Update 10-2: switching to US and Canada only. An earlier version of this included other countries, even though I said it didn’t.]

We’re now into October. Usually–I know now–this is the period by which half the tenure-track jobs for any cycle have been listed. With two important exceptions I’ll get into later, every year since 2004 passed the halfway point for the year in the last week of September or the first week of October.

So here are a few ways of looking at the hiring patterns.

One is the aggregate tenure-track jobs listed by year. (I’m filtering here not just to tenure-track positions, but also to jobs in the United States with “history” in the primary category field, which are typically things like “Asian History / Studies”. The core H-Net audience is the history profession and the US, so we’ll get less noise limiting this way.)

Here you can see a few things:

H-Net took some time to get off the ground in 2003-2007. You’d be better off looking at the AHA listing for this period, but nonetheless;
The period before 2008 was much better–almost twice as many jobs a year– as the period since.
2009 was the worst year on the record to this point, with about 200+ jobs listed by early October; currently we’re still short of 100.

This chart shows how the number of listings over time in this academic year compares to the two other eras in the hiring cycle: 2004 to 2008, and 2010 to 2019. It’s worth noting a couple things here. First, the worst of the pre-great-recession years was better than the best year since it. Second, I’ve broken out 2009, the only year that compares to the current one in its low numbers through September, but as you can see 2009 did recover to have more tenure track jobs, in the end, than the worst year of the 2010s. (One of the worst years of that decade, it’s worth noting, was 2019; even as majors approached stability, new listings for tenure-track jobs were disappearing last year.)

Overall, we can see what the next couple months are likely to look like by looking at the annual cycle of jobs. Typically the flood comes in late September; you get a couple a day through Thanksgiving; and then after a slight December rebound, the rest of the spring is perhaps a single job a day publicly listed.

Circle Packing

bmschmidt@gmail.com (Ben Schmidt) — Tue, 01 Sep 2020 20:49:49 GMT

I’ve been doing a lot of my data exploration lately on Observable Notebooks, which is–sort of–a Javascript version of Jupyter notebooks that automatically runs all the code inline. Married with Vega-Lite or D3, it provides a way to make data exploration editable and shareable in a way that R and python data code simply can’t be; and since it’s all HTML, you can do more interesting things.

Of course, that leaves all that writing on their site, where it will likely eventually vanish. I’m generally willing to live with that. But it’s also nice to be able to embed the charts over here, even if they’ll die when Observable does.

The observable version of this page will almost certainly look better, but you can get a quick idea of the contents below.

{{< observablenotebook “/notebooks/transitioning-between-circle-packs-historical-us-electio.js” >}}

College Majors 2019 update

bmschmidt@gmail.com (Ben Schmidt) — Fri, 28 Aug 2020 00:00:00 GMT

Every year, I run the numbers to see how college degrees are changing. The Department of Education released this summer the figures for 2019; these and next year’s are probably the least important that we’ll ever see, since they capture the weird period as the 2008 recession’s shakeout was wrapping up but before COVID-19 upended everything once again. But for completism, it’s worth seeing how things changed.

First, the chart of humanities majors compared to peak. Here, things remain at their post-2015 level.

Next, the decade-horizon rate of change for all majors. Again, the humanities are at the bottom of the list; the most remarkable feature here is that computer science, already large, has been growing at a huge rate in the last few years.

Next, what I think is the most important full overview you can get: a four-type division of US college majors since 1990. This makes clear that the basic story of the last decade was the growth of STEM at the expense of pretty much all other forms of education.

Rate of change is important, but it’s worth looking at the overall numbers too. Here are 20 years of majors for all the humanities fields. The American Academy includes several communications majors as humanities fields; I think that in method and substance they’re closer to a qualitative social science, but I include them here anyway.

Ranking CS Graduate programs

bmschmidt@gmail.com (Ben Schmidt) — Tue, 28 Jul 2020 15:15:14 GMT

Ranking Graduate Programs

While I was choosing graduate programs back in 2005, I decided to come up with my own ranking system. I had been reading about the Google PageRank algorithm, which essentially imagines the web as a bunch of random browsing sessions that rank pages based on the likelihood that you–after clicking around at random for a few years–will end up on any given page. It occurred to me that you could model graduate school rankings the same way. It’s essentially a four-step process:

Pick a random department in the United States.
Pick a random faculty member from that department.
Go to that faculty member’s graduate department.
90% of the time, return to step 2; 10% of the time, return to step 1.

At the end of each stage, you’ll be in a different department; but the more prestigiously any given department’s faculty are placed, the more likely you are to be there.

Using transition matrices, these numbers converge after a relatively short period.
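A minimal sketch of the four-step walk as a transition matrix, written in Python rather than the original R. The placement pairs, the function name `placement_rank`, and the handling of the 10% restart (jump to a random department, then immediately follow one of its faculty) are assumptions made for illustration; this is not the code from the paper or the repository.

```python
# Hypothetical placement records: (department that hired the faculty member,
# department where that faculty member did their graduate work).
placements = [
    ("A", "B"), ("A", "B"), ("A", "C"),
    ("B", "A"), ("B", "C"),
    ("C", "A"), ("C", "B"), ("C", "B"),
]

def placement_rank(placements, follow=0.9, tol=1e-12, max_iter=10_000):
    """Stationary distribution of the random walk over departments."""
    depts = sorted({d for pair in placements for d in pair})
    n = len(depts)
    idx = {d: i for i, d in enumerate(depts)}

    # counts[i][j] = number of faculty at department i trained at department j
    counts = [[0.0] * n for _ in range(n)]
    for hired_at, trained_at in placements:
        counts[idx[hired_at]][idx[trained_at]] += 1

    # Steps 2-3: pick a random faculty member, go to their graduate department.
    # A department with no recorded faculty jumps uniformly (an assumption).
    walk = []
    for row in counts:
        total = sum(row)
        walk.append([c / total for c in row] if total else [1.0 / n] * n)

    # Step 1 followed by steps 2-3: start from a uniformly random department.
    restart = [sum(col) / n for col in zip(*walk)]

    # Step 4: 90% of the time keep walking, 10% of the time restart.
    transition = [
        [follow * walk[i][j] + (1 - follow) * restart[j] for j in range(n)]
        for i in range(n)
    ]

    # Power iteration: apply the transition matrix until the scores settle.
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [sum(rank[i] * transition[i][j] for i in range(n)) for j in range(n)]
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return dict(zip(depts, rank))

scores = placement_rank(placements)
for dept, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{dept}\t{100 * score:.2f}")
```

Each score is the long-run share of time the walk spends in a department, so the scores sum to one. Multiplying by 100 is only a presentation choice.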

I ran it on history departments, but have never circulated the history scores. (Rankings make people mad, and the benefit seems smaller than the cost.) But one of my roommates at the time, Matthew Chingos, was already moving towards working in higher education policy and grad school in political science, so we wrote up a paper applying it to Political Science departments and published it in PS in 2007. (Schmidt, B., & Chingos, M. (2007). Ranking Doctoral Programs by Placement: A New Method. PS: Political Science & Politics, 40(3), 523-529. doi:10.1017/S1049096507070771)

It’s a pretty simple method, but I still occasionally get questions about it, the data, and the underlying code. As I recall, the political science data was viewed as slightly sensitive, so the arrangement we made with the American Political Science Association was that they would handle requests for the data and we would only provide code.

This was in 2005, so reproducibility was not a worry–nowadays, you’d put all this stuff on GitHub. In response to a recent request, I’ve just done that.

The core code was interesting to look at, because it’s stuff I wrote in R fifteen years ago. It basically seems to still work, but it has little in common with how I’d handle the problem nowadays.

Ranking Computer Science Programs as of 2015

Still, the proof is in the eating. So I went looking for some new data to try it on. On the theory that computer science faculty are too distracted by their overwhelming course sizes and endless parade of job searches to be bothered by this, I’ll do them.

Alexandra Papoutsaki et al. at Brown created a crowdsourced dataset of CS faculty that they expect to be “80% correct.” They seem to have updated a version that’s sitting inside a GitHub repository here, so that’s what I’ve used. I’m using placements that are from 2005-2015 here.

school	p
University of California - Berkeley	17.2835408
Massachusetts Institute of Technology	16.6558147
Stanford University	9.8659918
Carnegie Mellon University	7.9750700
University of Washington	4.5314467
Cornell University	3.4656622
Princeton University	2.9223387
University of Texas - Austin	2.5394603
Columbia University	2.3110282
University of California - Santa Barbara	2.0507537
California Institute of Technology	1.9028543
Georgia Institute of Technology	1.5902598
University of Illinois at Urbana-Champaign	1.5324409
University of California - Los Angeles	1.5238573
University of California - San Diego	1.2106396
University of Maryland - College Park	1.1716862
University of Pennsylvania	1.0691726
Brown University	1.0167585
University of North Carolina - Chapel Hill	0.9371394
University of Michigan	0.9263730
University of Minnesota - Twin Cities	0.7845679
Harvard University	0.7668788
New York University	0.7561730
University of Wisconsin - Madison	0.7021781
University of Massachusetts - Amherst	0.6569323
Purdue University	0.6213802
University of Chicago	0.6157431
Rice University	0.6154933
Johns Hopkins University	0.5860418
University of Virginia	0.5794159

There is nothing shocking, as an outsider, here, which is good. Technical schools are pretty high up, and my current employer is on the list and right next to Harvard. Nobody ever got in trouble for saying their school is as good as Harvard, even when Harvard is–as in CS–not so hot.

Extensions

Error bars!

Besides reproducibility, one thing I didn’t have a good answer to back in 2005 was robustness. Now I know very slightly more statistics, and the most sensible approach seems to be bootstrap sampling across the set to get an idea of how much difference one student more or less might make.

Here’s a plot of 500 random resamples of the set. There are two takeaways here:

There’s decent separation overall, but in general the distinction between 1 and 2 on the list, or between 30 and 60, is not anything stunning.
A few schools show notable patterns high or low. I think this is because single people greatly affect rankings. For example, UC Santa Barbara has a number of quite low rankings outside its boxplot; I think those are runs where both their grad who teaches at MIT and their grad who teaches at Berkeley were dropped in the bootstrap. Since UCSB relies very heavily on those two people for its high ranking, the bars are telling us–rightly–that the uncertainty there is pretty high.
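The bootstrap described above can be sketched roughly as follows: resample the placement records with replacement, re-score each resample, and look at the spread of scores per school. The placement pairs are invented, and `placement_rank` here is a compact stand-in for whatever scoring code is actually used; both are assumptions for illustration.

```python
import random

# Hypothetical placement records: (hiring department, graduate department).
placements = [
    ("A", "B"), ("A", "B"), ("A", "C"),
    ("B", "A"), ("B", "C"),
    ("C", "A"), ("C", "B"), ("C", "B"),
]

def placement_rank(placements, follow=0.9, iters=200):
    """Compact power-iteration scoring; a stand-in for the real ranking code."""
    depts = sorted({d for pair in placements for d in pair})
    n = len(depts)
    idx = {d: i for i, d in enumerate(depts)}
    counts = [[0.0] * n for _ in range(n)]
    for hired_at, trained_at in placements:
        counts[idx[hired_at]][idx[trained_at]] += 1
    walk = [[c / sum(row) for c in row] if sum(row) else [1.0 / n] * n
            for row in counts]
    restart = [sum(col) / n for col in zip(*walk)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        stepped = [sum(rank[i] * walk[i][j] for i in range(n)) for j in range(n)]
        rank = [follow * s + (1 - follow) * r for s, r in zip(stepped, restart)]
    return dict(zip(depts, rank))

def bootstrap_ranks(placements, n_resamples=500, seed=0):
    """Score n_resamples resampled datasets; return the scores seen per department."""
    rng = random.Random(seed)
    all_depts = sorted({d for pair in placements for d in pair})
    samples = {d: [] for d in all_depts}
    for _ in range(n_resamples):
        # Draw as many records as the original set, with replacement, so any
        # single person may be dropped or counted more than once.
        resample = rng.choices(placements, k=len(placements))
        scores = placement_rank(resample)
        for d in all_depts:
            # A department missing from the resample entirely scores zero.
            samples[d].append(scores.get(d, 0.0))
    return samples

def interval(values, lo=0.05, hi=0.95):
    """Rough percentile interval taken from the sorted bootstrap scores."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lo * last)], ordered[int(hi * last)]

samples = bootstrap_ranks(placements)
for dept, values in samples.items():
    low, high = interval(values)
    print(f"{dept}\t{low:.3f} to {high:.3f}")
```

A school whose score depends on one or two well-placed graduates will show a wide interval, because some resamples drop those people entirely.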

Undergrad rankings

I’ve always wondered what the general form of this interaction would be; ignore disciplines, and just look overall at how universities assess other universities in their hiring patterns.

This dataset at least includes undergrad and master’s locations, so we can see how this form would work differently based on undergrad quality vs grad quality.

In general, the scores are correlated–for example, MIT and Berkeley are near the top on both– but there are some useful distinctions. For instance, Yale undergrads are very well represented in CS faculties, while Yale grad students are few and far between. Conversely, the University of Washington produces middling undergrads, but is a grad powerhouse. Presumably the major factor here is that undergrads do not choose schools based on the strength of individual departments.

Jeb! the quitter. Digital traces of private devotions.

bmschmidt@gmail.com (Ben Schmidt) — Wed, 26 Feb 2020 14:49:26 GMT

As I often do, I’m going to pull away from various forms of Internet reading/engagement through Lent. This year, this brings to mind one of my favorite stray observations about digital libraries that I’ve never posted anywhere.

As part of the 2016 Republican Primary, Jeb! Bush released a website enabling exploration of e-mails related to his official accounts as governor of Florida in the early 2000s. This whole sentence has an antiquity to it; the idea of pre-emptive disclosure (in large part to contrast with his presumed general election opponent, Hillary Clinton) seems hopelessly antique. And at the time, it was criticized for accidentally disclosing all sorts of personal information, both stories and Social Security Numbers. It did not make Jeb! president. Anyhow, back then I downloaded Jeb!’s e-mails–and Hillary’s–to think about what sort of stuff historians will do with these records in the future.

One thing I looked at was simply the time of day that Jeb sent letters. Looking at it on a yearly basis, it was clear that there were some odd seasonal patterns in the way that Jeb! sent his e-mails. Knowing that Jeb! was Catholic, I had a brainstorm that maybe this was aligned to the liturgical year. And so I wrote a little bit of ggplot2 code to break out the Lenten season from the rest of the year.

(My favorite part of this chart is the color scheme; these are the colors of the vestments worn during Lent and ordinary time. I can’t remember how I aligned dates to the liturgical calendar.)
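One way the alignment to the liturgical calendar could be done, sketched in Python rather than the original R and ggplot2: compute Easter with the standard anonymous Gregorian algorithm, count back 46 days to Ash Wednesday, and tag each day's share of early e-mails by season. The timestamps are invented, the function names are hypothetical, and treating Lent as Ash Wednesday up to the day before Easter is a simplifying assumption.

```python
from datetime import date, datetime, timedelta

def easter(year):
    """Western Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def ash_wednesday(year):
    """Ash Wednesday falls 46 days before Easter Sunday."""
    return easter(year) - timedelta(days=46)

def in_lent(day):
    """True from Ash Wednesday up to (not including) Easter Sunday."""
    return ash_wednesday(day.year) <= day < easter(day.year)

def early_share_by_day(sent_times, cutoff_hour=7):
    """Fraction of each day's e-mails sent before cutoff_hour."""
    totals, early = {}, {}
    for ts in sent_times:
        d = ts.date()
        totals[d] = totals.get(d, 0) + 1
        if ts.hour < cutoff_hour:
            early[d] = early.get(d, 0) + 1
    return {d: early.get(d, 0) / totals[d] for d in totals}

# Invented timestamps, for illustration only.
sent_times = [
    datetime(2005, 2, 1, 9, 30), datetime(2005, 2, 1, 14, 5),
    datetime(2005, 2, 9, 5, 45), datetime(2005, 2, 9, 6, 10),
    datetime(2005, 2, 9, 11, 0),
]

shares = early_share_by_day(sent_times)
for day, share in sorted(shares.items()):
    season = "Lent" if in_lent(day) else "other"
    print(day, season, round(share, 2))
```

The season label can then drive the color scale in whatever plotting library is used.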

Breaking it out, I think it’s far more likely than not that in the year 2005, Jeb! made some private devotion to get up early and answer his e-mails before 7AM. The only thing arguing against this is that he does get up a little early on Mardi Gras and the Monday before as well; but starting on Ash Wednesday, Jeb! is regularly sending over 50% of his e-mails for the day before he gets to the office.

And then it falls apart a week or two before Easter. Could he not hold it together?

There’s also some sign that he gave the same effort a shot in 2006, but it fell apart much earlier.

{{< figure src=“closeup.png” title=“Jeb Bush’s outgoing e-mail times” >}}

It is odd to me to be able to talk in this particular way about the intersection of daily life and religious identity. One oddity, of course, is that this is yet another example of the kinds of information held inside the great data surplus at the tech companies; but honestly, the question here is so oddly stated that I can’t imagine datamining ever turning it up. Perhaps it says something about the potential for biographies in the digital age; the narcissism of the quantified self movement might look quite different directed at the quantified other. But is this kind of evidence really compatible with biography?

Anyhow, off to some e-mails of my own.

Two Volumes: the lessons of Time on the Cross

bmschmidt@gmail.com (Ben Schmidt) — Thu, 05 Dec 2019 00:00:00 GMT

(This is a talk from a January 2019 panel at the annual meeting of the American Historical Association. You probably need to know, to read it, that the MLA conference was simultaneously taking place about 20 blocks north.)

Disciplinary lessons

The panel that became this conversation started when John Theibault tweeted that digital history doesn’t see Fogel and Engerman’s Time on the Cross as part of its core history. I participated because I–and some of the grad students I’ve taught–remembered that I did make them reckon with Time on the Cross as part of the genealogy of digital history, early in each semester. I taught it because I had found myself navigating around it for years; generally students took it as a negative example to avoid, and someone would always show up stunned to learn that Fogel won a Nobel prize and a Bancroft prize. (Probably, to be honest, because many of them skipped the book to read Thomas Haskell’s devastating 1975 New York Review of Books summation. It’s worth your time, too, if you for some reason want to read this piece, but aren’t quite clear on what Time on the Cross was.)

Time on the Cross was not just another history book. One of the more interesting phenomena was that students would order the book from Amazon–there are still a lot of cheap copies of Time on the Cross out there–and every once in a while they would accidentally end up with a copy not of the narrative but of the second volume that gives the methodological apparatus for the book. There are a lot of problems with Time on the Cross, and many are much more important than this. But for me, the most interesting has to do with the thinking that made this a reasonable arrangement–how did the argument and the evidence come to be so heavily separated from each other?

This was a problem high in everyone’s mind in the 1970s as well: Thomas Haskell pointed out the division as one of the work’s methodological mortal sins: “Most readers of Time on the Cross see only the silk purse of apparent scientific exactitude; the authors spared them the sight of the sow’s ear from which it all came.”

Time on the Cross ratified a split between the “humanities” history and social science history–especially economics–that mirrors the split of the book itself into two volumes.

The reception of Time on the Cross has focused on its myopia. Jessica Marie Johnson recently published an essay in Social Text that gives a good account of this position–this is the critique that I find most important for graduate students in the humanities to completely internalize, because it articulates two of the most important ways that the humanities have developed a language for talking about statistics.

“Statistics on their own, enticing in their seeming neutrality, failed to address or unpack black life hidden behind the archetypes, caricatures, and nameless numbered registers of human property slave owners had left behind. And cliometricians failed to remove emotion from the discussion. Data without an accompanying humanistic analysis—an exploration of the world of the enslaved from their own perspective—served to further obscure the social and political realities of black diasporic life under slavery.”

Johnson, Jessica Marie. “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads.” Social Text 36, no. 4.

The argument is not just that data fails to capture experience, but that the existence of data is itself part of the record of violence. To quote again: ‘Data is the evidence of terror, and the idea of data as fundamental and objective information, as Fogel and Engerman found, obscures rather than reveals the scene of the crime.’

This critique is a powerful one, and fits in with currents of humanistic scholarship going back to Thomas Haskell and Herbert Gutman’s initial critiques of Time on the Cross in the face of its initial publicity blitz. (A blitz, it’s worth noting, that is entirely typical of the course of computationalist approaches to history and literature, from Busa to the cliometricians to the “culturomics” project where I did a fellowship at the start of the decade.)

If I can just reflect myself on what I learned from this in graduate school before I’d ever heard of “digital humanities,” it was that an attention to statistical reductionism rather than human experience not only flew in the face of historical practice but also served to ratify past crimes. To act statistically on individual subjects violates a kind of taboo. To this day, I’m generally really reluctant to make data visualizations in which the points are individual people. In my own work–and in my advice to any other digital historians–I generally think that human beings are among the least promising topics of statistical analysis; I visualize ships, books, and land, but try to avoid visualizing the person whenever possible. After all, people are the things we need data to understand the least. In a graduate class with the political historian Sean Wilentz, I remember vividly being walked through a slate of regression analyses of ethnic voting patterns in pre-Civil War US counties, feeling that I was learning something, and then–in the dialectical style he loves to use–being informed that none of these methods–not one!–can tell us a single thing about why a single person voted the way they did.

I knew something about data analysis, but was warned off and kept quiet; for three years after generals, the only contribution I made with code to the Princeton History department was a Monte Carlo simulation program to optimize batting orders for the intensely competitive summer softball league. (And even there, I used it more for the grad student team, The Great Bat Massacre; the faculty-heavy Revolting Masses had firmer hierarchies in play.)

When I got into digital history in 2010, this was my sense of Fogel and Engerman; as a still-radioactive site in the discipline, smoking from the wars of the 1970s, where you tread at your peril.

Economists’ lessons

Something that I didn’t know until much more recently was that at the same time, economists were learning an entirely different set of lessons. As the history of capitalism turns to slavery in the past few years, particularly with controversies around books by Sven Beckert and Ed Baptist, we’ve seen economists look with bemusement at historians as the species that failed to learn the lessons of Time on the Cross. This is a set of debates largely parallel to Digital History, proper; but one crucial to understanding the future of digital history, writ small.

The economist Eric Hilt points to the existence of an extensive literature about slavery, worrying that historians “do not seem to have taken seriously the debates among economic historians that followed the publication of that book.”

More importantly, the lack of engagement with economic historians limited the analytical perspectives of each of these books. Most of them seem aware of Fogel and Engerman’s Time on the Cross (1974), and some repeat its arguments about the profitability of slavery or the efficiency of slave plantations. But they do not seem to have taken seriously the debates among economic historians that followed the publication of that book. Some […] challenged Fogel and Engerman[; but] analyzed slavery in new ways.

(Hilt, Eric. “Economic History, Historical Analysis, and the ‘New History of Capitalism.’” The Journal of Economic History 77, no. 2 (June 2017).)

Hilt inclines towards–to me–a really surprising account: that historians are even recapitulating the core ideological mistake of Fogel and Engerman by cloaking the work of enslaved people in a bizarrely emancipatory rhetoric, as if developing new techniques for picking cotton quickly is an accomplishment anyone should be eager to claim.

I don’t think this criticism is entirely fair, but it’s at least within fair grounds. Alan Olmstead raises the disciplinary stakes. The problem is not just that historians haven’t read it, but that the work was simply “beyond the comprehension of many historians.”

In the past, historians and economists (sometimes working as a team) collectively advanced the understanding of slavery, southern development, and capitalism. There was a stimulating dialog. That intellectual exchange deteriorated in part because some economists produced increasingly technical work that was sometimes beyond the comprehension of many historians. Some historians were offended by some economists who overly flaunted their findings and methodologies.

Olmstead, Alan L., and Paul W. Rhode. “Cotton, Slavery, and the New History of Capitalism.” Explorations in Economic History 67 (January 1, 2018).

I love this as an apology–I feel like marriage counselors could use this as a prime example of how to exacerbate differences in denying them. It boils down to “I’m sorry for being so sophisticated, and for letting you know it.” But any good marriage counselor could also probably elicit out of this an acknowledgement. I miss your company; I want you to affirm my work; I want to be in your life. And where Olmstead doesn’t quite acknowledge what he wants from historians, other economists do.

Should we just hope for consilience?

The obvious takeaway is that we simply need to get these fields talking to each other again and some proliferation of magic will ensue.

The economist Trevon Logan gives a good synopsis of this account in a series of Tweets I’ll quote in excerpt describing how he teaches Time on the Cross and the economics of slavery today. This is a more expert account than I can give: but suffice it to say that for historians, the moment of great split is obviously Time on the Cross itself. Fogel’s later, better work, including Without Consent or Contract, is outside the conversation. While historians were avoiding the radioactive spot, economists were loading into their hazmat suits and wandering in.

Some Tweets from Trevon Logan, 2018

I begin the second week by talking about where we are at. TOTC is bad. It’s being attacked and the integrity of the authors is being questioned. So F&E regroup, and they cut their losses, and they go back to the beginning. They go back to the productivity calculation.
F&E win on prices, output, production constraints, insurance, crop mix, etc. It’s a battle of the force and in the end their original calculation is established and likely accepted by the majority of the field. (Whether it’s the right calculation is another question.)
But this vindication of efficiency came with a great cost. The audience for economic history shrank dramatically. The knowledge we have is for 1860- far too static. The new history of capitalism ignores it. Economic historians agree on efficiency, but there’s more to know
There is little attempt among Economic historians now to make contributions that historians will pay attention to. We’re much more comfortable talking about data and methodology than history. This is bad for the field.
To read the efficiency debate as opposed to the TOTC debate is to see two parts of economic history. One is concerned with historiography and changing the methods and topics in history, and the other is about data and measurement. The latter is modern day economic history
There is also a racial dimension that continues to have an influence. F&E the racial aspects wrong. Very wrong. But they attempted to divorce the racial aspects from the economics, and we cannot.
Trevon Logan, Twitter

The other thing pointing towards the possibility of this turn is that a few blocks over is a discipline–English–that has managed a heavily quantitative turn in the last few years. Work in digital literary studies has taken on a much more aggressive turn towards measurement; flagship journals regularly publish work using topic modeling or word embeddings, and new journals like Cultural Analytics lean most heavily on an English-language base.

For the most part, these types of articles in literature continue to have a lot more throat clearing about the nature of reading and evidence than I think a historical journal should publish. (The nature of reading itself is not an object of study for us to the same degree). But we are also seeing long-term attempts to construct time series, bibliographies, and reprint lists that enable a large array of quantitative evidence to be brought to bear on questions like–in a recent paper by Underwood, Bamman, and Lee–what the gender breakdown of English-language fiction looks like.

Computational history is dead for good: a provocation.

A few years ago I was still quite hopeful about this possibility. For the historians of capitalism, it will surely happen in some form; but not, I hope, as part of a new science of history. I now think that the success in English and Literature is actually the exception that proves the rule. Even back in the 2000s period of humanities computing, it was already the case that English had a lot more number crunching and history a lot more public-facing web sites. Today, that’s even more the case. And the reason for that, in part, is that there is considerably less low-hanging fruit in terms of data-driven argumentation than in literary study. In literary history and art history, there are huge arrays of digitized data and an array of fairly amateurish physicists and computer scientists who desperately need advice.

For historians, the situation is very different. In questions of historical interest, the post-Time on the Cross split means that there have been generations of training making economists, sociologists, and political scientists both methodologically competent and substantively knowledgeable. And because of the closed-off nature of the way that these fields have tended to handle their data, it tends to be accessible, but for use cases that map much more closely to social science experimentation than to humanistic narration.

I reached my most pessimistic state about this a few months ago when I saw a paper by economists tagged with the subfield I was trained in, intellectual history. It bears the slightly overwrought title “Ideas have consequences.” But what it does is present a really powerful account of the transmission of ideas across social networks through textual analysis–something I’ve seen physicists, and genomicists, and literary historians, and intellectual historians take stabs at for years with underwhelming results.

It makes a powerful and discrete argument about the way that the privately-funded Manne seminars in law and economics–which were attended by a substantial proportion of the federal judiciary–affected the language, decisions, and sentencing of federal justices who attended them, an effect robust to a wide variety of covariates. (Of course, I don’t know what covariates it’s not robust to.) And even more interestingly, they claim to have an effect where simply being randomly impaneled with a judge who attended one of these seminars will make that second judge harsher in her future sentencing decisions.

Reading this paper was exciting, but looking through the tools and tricks and sources also made me feel like someone in a science fiction movie encountering an artifact from the future.

The extraordinary quality of data is something that is hard for us to get–they have not just a million or so circuit court votes and 300,000 opinions, but also the institutional capacity to file FOIA requests to get the exact years of attendance for every judge who went to the program.

We supplemented this list with exact years of attendance from Annual Reports obtained by filing FOIA requests and correspondence from the Law and Economics Center at George Mason University. Figure 1 plots the share of Circuit Court cases with a Manne Judge on the panel over time. As can be seen, by the late nineties, about half of cases were directly impacted by a Manne panelist.

And they have the disciplinary capacity to do things like casually use relatively new methods like word embeddings without spending pages slowly walking their audience through just what they are or what it might mean to use them, and gently analogizing them. (Imagine a field where you can describe a computational method without having to first identify which Borges short story–the map of the empire, the analytical language of John Wilkins, Pierre Menard and the Quijote–it most closely resembles. The words we could save!)

This paper utilizes a dataset on all 380,000 cases (over a million judge votes) in Circuit Courts for 1891-2013, and a data set on one million criminal sentencing decisions in U.S. District Courts linked to judge identity (via FOIA request) for 1992-2011. We have detailed information on the judges and the metadata associated with the cases. In addition, we process the text of the written opinions to represent judge writing as a vector of phrase frequencies.

In the United States, computational history is dead because this happens at a much higher level among the social scientists than any historian can possibly pick up in a graduate program. (Things are different in Europe, I should note: even back in graduate school doing a field about Germany with Harold James, I saw election statistics being deployed far more impressively in the German-language historical/geography tradition than in what comes out of the Anglosphere. This continues in work by people like Melvin Webers on the continent.)

The best work in the cliometric tradition followed in the path of Social History. For a while I assigned, alongside Time on the Cross, Steven Ruggles’ work on the changing American family structure. (Some students called it out for too obviously balancing the “good” cliometrics against the bad.) But while Ruggles remains a tremendously important historian, it’s not hard to see how the work at the Minnesota Population Center that he leads has hewed closer and closer to the sociological mainstream than to the historical one. If Computational History really existed, it would need a conference; and the first challenge it would face would be justifying its existence in the face of the enormous work coming out of the Social Science History Association. Before we try to reboot computational history, we should look and see how our discipline has played as part of that larger structure.

Digital history is reproducibility

So–should we despair because the job is gone? No–but I think that it reflects some argument about what path we should stay on. We’ve seen a firmer set of turns lately towards an insistence on argumentation as the centerpiece of digital history, from Cameron Blevins in Debates in the Digital Humanities, and a whole slew of initiatives out of George Mason, which was one of the leaders of the old form of digital humanities, including their new journal devoted to argumentation in history. It might seem reasonable to position this new manifestation of digital history as the place where cliometric flaunting can be tempered. If there’s anyone left to be offended by my claim that computational history is dead, it would be the brilliant Lincoln Mullen at GMU, who has done a better job than anyone (as in his AHR article with Kellen Funk on legal text reuse) of bringing cutting-edge computational methods to mainstream historians.

And there is some ground to occupy there. I’ll be assigning the Ash, Chen & Naidu article, and not Time on the Cross, in the digital history seminar I’m teaching next semester, and I can tell you what the student reactions are going to be: that it’s an impressive argument, but that the bulk of the 50 pages is incomprehensible econometric posturing that doesn’t allow for any real conversation. And if I have some Americanists, they’ll point out that the thing which is interesting here–the purchase of the federal judiciary by right-wing funding networks–is something we already know anyway from a variety of works by real historians of the period, like Nancy MacLean’s Democracy in Chains. Moreover, they will find it entirely bloodless. To talk about Florida big-money seminars in law and economics dangles the spectacle of a young Stephen Breyer and Ruth Bader Ginsburg enjoying Mai Tais on a Florida beach with Milton Friedman, and then hits you with pages on pages of difference-in-difference analysis. There aren’t two volumes here, but there’s an argumentative punch to the abstract that isn’t continuously rewarded while reading the paper; the argument and the analysis are distinct. [Spoiler alert from December 2019: I underestimated just how hard it would be to even pull the essence out and not become overwhelmed by the graphs.]

The path that digital historians built in the decades of the 1990s and 2000s while avoiding the shadow of cliometrics was a far more interesting one; unlike English-department digital humanities of the period, it had no motive to make history more ‘scientific,’ and instead found ways to make historical practice live on computers and–increasingly–online.

The most successful of these were efforts around digital public history, where historians found ways to bring materials into an online setting that were not amenable to books. Omeka, the public history CMS, has many flaws, but is still for my money the most important and irreplaceable project out of the profession in the last twenty years. And–we can fight about this up on the Magnificent Mile later this week–it certainly is more important than anything the literary studies folks have produced.

So what I really think is that the division of books problem–the division of methodology and narrative–is one that is precisely addressable through these same kinds of challenges, because historians as much as anyone outside of digital journalists have been thinking about audience, narrative, and publics.

Yesterday, the New York Times put up an article about the Municipal Archive’s set of photographs of every building in New York City from the 1940s.

I cannot really recommend this article–they seem to have chosen a few buildings largely at random, they don’t give a strong sense of the city, and their narrative of change is largely that things are taller. But it points to a proliferation of sources in the last decade, especially those not controlled by Proquest, Gale, Cengage, and so on, that hasn’t been reckoned with in the profession.

link to data

Humanistic reproducibility

So let me finish by framing this problem as reproducibility. The sciences have been plagued with crises of reproducibility and deluged with schemes for solving it. The literary and library scholars in DH (overmuch, to my mind) see one of their challenges as fully fixing those problems in digital humanities before they emerge through a tangle of IPython notebooks, online linked open data, and Docker configuration files.

There are, of course, crises of historical reproducibility as well–Time on the Cross may have seen controversy, but it was allowed to keep its Bancroft prize. Say what you will about it, but there’s another book to win the Bancroft prize that was so much worse it had to be revoked. It’s interesting that we don’t think of Arming America as having the same deep legacy to reckon with–perhaps we could.

But one reason that archival historical narratives work is that the historical narrative is itself an artifact of reproducible research; you read some claims at the front, but it’s only in the process of reading a book that you become conscious of its individual narrative flow.

The thing that we need to think more about is how to shape narratives around historical data that admit of reproduction by actual historians, not econometricians: how do we make that persuasive flow work for everyone?

These are tools that we start to see emerging in digital journalism, often around the explication of social scientific research–for instance, in this New York Times piece addressing social mobility.

They’re also becoming increasingly common in areas of computer science, as in Google’s Distill.

And this is a capability that digital humanities projects can continue to make; where reproduction is about array and exposure to multivalent primary sources, allowing readers to engage and change the assumptions of models. This is harder than just distributing models; it’s about working with sources in the indefinitely reconfigurable ways that are now possible.

We see this happening, already, even in the still-vibrant digital historiography of slavery itself. Ed Baptist himself is working on a digital project, Slavery on the Move. There are multiple works on lynching documentation. Caleb McDaniel (on this panel) has a bot called “every five minutes” that tweets (as it says) every five minutes some snippet of a record of a slave sale. I’ve actually blocked that one on Twitter myself, because the juxtapositions of reminders it produces are sometimes more than I can take in the middle of the day.

Moreover, we continue to see a wide variety of work that isn’t about data, per se, expanding the notion of what digitally oriented scholarship can be. If I had to single out a single institution today, I’d point to the University of Richmond, with work like the masterful American Panorama atlas edited by Robert Nelson or Lauren Tilton and Taylor Arnold’s work on projects like http://photogrammar.yale.edu about already-digitized photos at the Library of Congress.

This kind of attention to audience, to re-ordering, and to narrative engagement evolved in part because of the weight of Time on the Cross. This–not warmed over introductory econometrics–is the real contribution to intellectual life that digital humanities stands to make. And it’s one that historians, with their multiple sources and strong subfield of public history, are better positioned to execute than any other field in Digital Humanities. It often doesn’t handle data tables by the narrow definition of Time on the Cross; but by the very virtue of its difference, it breaks out of the mold of separating argument and evidence, a mold that does not work well for interdisciplinary work or for engaging a larger public, and could provide some of the answers we need for moving forward.

Web Migration

bmschmidt@gmail.com (Ben Schmidt) — Sun, 30 Jun 2019 15:53:19 GMT

Since 2010, I’ve done most of my web hosting the way that the Internet was built to facilitate: from a computer under the desk in my office. This worked extremely well for me, and made it possible to rapidly prototype a lot of websites serving large amounts of data which could then stay up indefinitely; I have a curmudgeonly resistance to cloud servers, although I have used them a bit in the last few years (mostly for course websites where I wanted to keep student information separate from the big stew).

But as part of my move to NYU, I’m shifting my Apache server to the cloud (Digital Ocean). That will break some things in the short term, and I’m retiring a few elements of the website.

I’m listing the changes here mostly for my own reference. If I happen to have put up something that you use and want to see back, don’t be shy to let me know through my Google email (username bmschmidt) or via benmschmidt on Twitter.

Awaiting repair

The Rate My Professor gender language site. I think this gets the most sustained, regular, traffic on my site. I’m hopeful this will be out of service only for the first two weeks of July. If you have some kind of curricular lesson or workshop for which you need it in that period, let me know and perhaps I can fix it up ahead of time.
Other Bookworms (Simpsons, Movies, etc.). I see some people using these, and I’ll restore them using the same strategy as RMP; they may be offline until September, though, depending on how I address some storage issues. (The basic issue here is that, together, these take several terabytes of storage; that’s more than you can drop into a cloud site at an affordable price. I know how I’ll solve this, but it will be easier in September than July.)

Working

Anything that didn’t have a database backend should be working fine. If it’s not, it’s probably a quick fix to a problem I’m not aware of.

Personal website, all parts of Creating Data, interactive degree explorer.

Probably gone

The Open Library bookworm was a prototype that eventually became the Hathi Trust Bookworm. I’ve been recommending everyone use that site, not this one, for a few years rather than count on the old OL one.
Some prototypes for Creating Data that I don’t think were widely used.
Some embedded elements in slideshows.
Wordpress installations for courses that I offered prior to 2016. These don’t seem worth migrating to me. If you’ve somehow obtained a URL for one of these courses, you can probably add ‘/syllabus.pdf’ to the end to see the basic materials.

Moving (or rather, staying in place)

bmschmidt@gmail.com (Ben Schmidt) — Fri, 03 May 2019 19:18:12 GMT

Some news: in September, I’ll be starting a new job as Director of Digital Humanities at NYU. There’s a wide variety of exciting work going on across the Faculty of Arts and Sciences, which is where my work will be based; and the university as a whole has an amazing array of programs that might be called “Digital Humanities” at another university, as well as an exciting new center for Data Science. I’ll be helping the humanities better use all the advantages offered in this landscape. I’ll also be teaching as a clinical associate professor in the history department.

If you’re at NYU or somewhere nearby and want to chat, please do reach out; I’ll be around through the end of July. There should be more to say about this going forward.

But just to look back a bit: I’ll be leaving Northeastern, which has built up one of the country’s best digital humanities programs over the last seven years. The history department (and our college dean, Uta Poiger) have been extremely supportive of the possibilities of digital history, of alternative publication models, and of DH in graduate education. It’s been great to see Cameron Blevins expanding the history department’s profile since he arrived two years ago. I don’t want anyone to think I or they screwed up the tenure or retention process. I’ve found it a great place to work. Especially if you live anywhere near Boston.

But this move has been a while coming: about six months after I started at Northeastern, my wife accepted a job teaching Soviet history at NYU. Many academic couples end up juggling locations for various periods of time, and ours hasn’t been the worst; I’ve been fortunate through various means (the National Endowment for the Humanities, Columbia’s SIPA, and Northeastern’s parental teaching releases–thanks to each) to only have to be on campus one semester a year since we moved to New York in 2015. And New York to Boston–as academics at parties have too often cheerily reminded me–is not the worst commute out there; just 4 hours in a comfortable train car with sporadic wi-fi access, with ten minutes of subway rides on either end. Despite its imperfect reputation, I’ve found Amtrak to always be great; I did probably 30 round-trips last semester, and didn’t hit a single major delay.

But any commute is hard, especially when you have small children. (Which is, demographically, a set that academic commutes fall most heavily on.) I remember, shortly before starting my job at Northeastern, reading Mark Sample write about how the commute is a “grueling, brain-frying, wallet-emptying, time-wasting, body-breaking, soul-draining way to live.” Amen. I can’t help but think that the widespread acceptance of commutes (and their flipside, residential fellowships) is toxic for local university communities and, in aggregate, for gender and probably socioeconomic diversity in the professoriate. But I also see others happily splitting their time or playing a longer game than I can imagine. So it’s probably enough simply to say the commute is not for us.

A computational critique of a computational critique of computational critique.

bmschmidt@gmail.com (Ben Schmidt) — Tue, 19 Mar 2019 01:42:00 GMT

Critical Inquiry has posted an article by Nan Da offering a critique of some subset of digital humanities that she calls “Computational Literary Studies,” or CLS. The premise of the article is to demonstrate the poverty of the field by showing that the new structure of CLS is easily dismantled by the master’s own tools. It appears to have succeeded enough at gaining attention that it clearly does some kind of work far outsize to the merits of the article itself.

The piece is not a useful contribution; it’s a magic trick that relies on the inattention or ignorance of its readers. While it pretends to demystify computation for literary critics, it in fact does exactly the opposite; it operates through a series of feints and misdirections that repeatedly misstate what the plain text of other scholars–in both literature and statistics–says, and what the statistical work she herself has done is. The article is predicated on a lack of statistical sophistication among the readers of Critical Inquiry.

The “computational” aspect of Da’s case is twofold:

It asserts that actually existing CLS is ridden with statistical errors that could be easily corrected, and claims to have performed replications.
It offers that in other areas–science and industry–computational methods are being deployed perfectly and appropriately; but that sadly, such methods cannot be applied in literary studies because they have demonstrably produced only absurdities and tautologies.

I do not believe it would be possible to write an article that defends both of these points. If existing pieces are so heavily flawed, then we probably don’t know the limits of the knowable. If, on the other hand, we’re able to tell that CLS will never produce useful results for literature, it would probably only be because the existing literature gives us some sense of what’s possible.

But together–and this is where the appeal comes from–they break some fresh ground in the genre of anti-digital-humanities polemic. To straightforwardly attack the cultural authority of numbers has become increasingly problematic in the past few years. The hegemony of STEM has increased inside the university, making the gambit more institutionally dangerous; and at the same time, humanists have come to realize there may be forces in the world yet more sinister than scientists. The rhetorical tools you can deploy against positivism are strong, but they risk making it seem–say–that maybe we shouldn’t listen to climate scientists. So Da’s piece posits that everyone else is using numbers right–but also holds that the exercise in replication and methodological analysis (a good thing) proffered here doesn’t actually offer a way toward better research.

Da moves past anti-positivism into something fresh– call it computational NIMBYism. Rather than pooh-pooh statistical reasoning, she elevates it by incanting the language of quantification against itself. Far more than anyone I’ve seen in any humanities article, she asserts that scientists do something arcane, powerful, and true. But she returns from this promised land with hard-won truths for literary critics; its computationalists are false prophets engaged in a cargo-cult version of data science, and the true religion has nothing to say for literary scholars.

The response the article engenders.

A careful effort to replicate published articles is necessary. Fortunately, it is also something that happens, albeit not as much as might be useful. I expansively discussed the concerns Da raises about topic modeling across time in Underwood and Goldstone’s work in 2013.[@schmidt_words_2013] Their response is explicitly contained within the paper Da read. The final footnote in Ted Underwood’s new book raises precisely the same questions about the way that a Stanford Literary Lab pamphlet uses bigram entropy as a distinguishing measure. ¹

But this isn’t that article. The computational evidence deployed here–the thing that tries to make this piece stand out–is striking in its sloppiness even compared to the works it pretends to debunk. Perhaps the whole piece is intended as a parody of what can slide into top literary journals nowadays. (It is indeed the case that Critical Inquiry will allow you to publish with terribly inadequate code appendices and reviewers incompetent to assess the validity of your work.) But it certainly does not show that good statistics can obliterate the bad statistics that are widespread. Instead, the most it could do is demonstrate that the literary profession is as easily bamboozled by numbers as Da says.

This tension between the two goals is evident in the first piece of the set, on a Ted Underwood piece on genre classification. She at once claims a simple correction–

Underwood should train his model on pre-1941 detective fiction (A) as compared to pre-1941 random stew and post-1941 detective fiction (B) as compared to post-1941 random stew, instead of one random stew for both, to rule out the possibility that the difference between A and B is not broadly descriptive of a larger trend (since all literature might be changed after 1941).

and that Underwood uses methods that could never find differences between genres.

It is true that Underwood does use methods inadequate to prove there is no difference in detective fiction pre- and post-1930. (Her use of the year “1941” is a mistake–it seems to stem from confusing the date of one of Underwood’s sources with the year he chose for a testing cutoff). This is an absurdly high bar–of course something changed, if only the existence of words like ‘television’ and ‘databases.’ Underwood says as much. The actual article is caught up in a more interesting discussion of the comparative stability of genres. The core argument is not, as Da says, that genres have been “more or less consistent from the 1820s to the present,” but that detective fiction, the gothic, and science fiction–specifically–show different patterns, with detective fiction being a far more coherent pattern than the gothic novel. By focusing only on detective fiction, she’s missing the entire argument of the article.

That this doesn’t merit correction or retraction is depressing.

I don’t know what Underwood used to train. But if he did allow the ‘random stew’ to contain both pre- and post-1930 work, that would make the performance of his model more remarkable, not less–it would indicate that it was correctly tagging Elmore Leonard (say) novels as detective fiction even though they use words like “fax” or “polaroid” it had previously seen in the post-1930 set.

Where Da’s method really shines, though, is in the random statistical vocabulary she brings to bear.

All that Underwood has shown in using word frequency homogeneity to differentiate detective fiction from random fiction is that the difference between pre- and post-1941 detective fiction is not as significant as its difference from random fiction. This does not mean that the same method can capture the difference between different types of detective fiction. After all, statistics automatically assumes that 95 percent of the time there is no difference and that only 5 percent of the time there is a difference. That is what it means to look for p-value less than 0.05. Think of it this way: if everyone can agree that something is changing—even Underwood concedes that genres evolve—but you have devised one way that concludes that it does not, it does not necessarily mean that you have found something.

In the first specific critique, the article talks about 95% p-values in the following way: “statistics automatically assumes that 95 percent of the time there is no difference and that only 5 percent of the time there is a difference. That is what it means to look for p-value less than 0.05.”

To look for a p-value under 0.05 is to look for a pattern that would only occur 5% of the time as a result of random variation. It’s not a great threshold. But Underwood’s paper does not rely on p-values.
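A quick simulation makes the definition concrete. This is only a sketch under stated assumptions: the two groups are drawn from the same normal distribution with known variance (so a simple z-test applies), and the function name, sample sizes, and seed are all invented for illustration. When there is truly no difference, roughly 5% of comparisons still fall under the 0.05 threshold; that is a statement about how often chance alone clears the bar, not an assumption that 95 percent of the time there is no difference.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(trials=2000, n=50, alpha=0.05, seed=0):
    """Share of null comparisons that come out 'significant' by chance."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        # Both groups come from the SAME distribution: no real difference.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # z statistic for a difference in means with known unit variance.
        z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        hits += p < alpha
    return hits / trials


if __name__ == "__main__":
    print(f"share of null comparisons with p < 0.05: {false_positive_rate():.3f}")
```

The printed share should hover near 0.05 and will wobble a little with the seed and the number of trials.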

So let’s take a look at how well the statistical claims here hold up: is the debunking useful?

Jockers and Kirilloff on Gender

Da’s critique of Underwood relies mostly on failures of reading. The next section, on work by Matthew Jockers and Gabi Kirilloff, showcases the way her piece rests rhetorically on the innumeracy of her readership. Her critique of Jockers and Kirilloff is, as she says, that they “present a statistical no-result finding as a finding.”

In order to do so, she swamps her readers with a blizzard of statistical language that she can justifiably assume will sound plausible to the readers of Critical Inquiry. Her promise is that she will offer “a clear explanation of the computational work that CLS actually does” (605). In her two paragraphs on Jockers and Kirilloff, she tosses out the following observations:

“Let us say that you are measuring the overlap of features between two sets of data using a standard 5 percent confidence level; out of n possible shared features, 0.05n will automatically be significant.”
“In good statistical work, the burden to show difference within naturally occurring differences (‘diff in diff’) is extremely high.”
“This paper does not perform a bootstrap, which means the literary-historical suggestions that follow this genre classification do not stand.”
“Practitioners have to apply the Bonferroni Correction to conventional statistical thresholds of significance used for data mining.”

And so on. This blizzard of terminology establishes for the innumerate reader that they finally have an expert who will debunk statistics for them, while freeing them of the burdensome requirement to think for themselves.

But much of this is word salad; what stands is unimportant. The claim that 5% of features “will automatically be significant” seems to approach the claim that she has already had to retract: that “statistics automatically assumes that 95 percent of the time there is no difference and that only 5 percent of the time there is a difference. That is what it means to look for p-value less than 0.05.” ‘Diff in diff’ is indeed an important tool, but it’s not about testing whether two distributions are different from each other; it’s about testing whether a post-treatment experimental group (like recipients of experimental chemotherapy, or counties that received Gates Foundation grants) saw a significant time series change. Bootstrap resampling to generate confidence intervals can be useful, but to randomly invoke it, as here, is about as sophisticated as demanding that every article, regardless of content, take a transnational approach.

To say that significance testing should apply the Bonferroni correction is not nonsense. But neither is it something that Da does. As with her discussion of Underwood, the exercise relies on coming up with a straw man description of the claim of the article, and then rejecting that. Da focuses mostly on the question of whether there are statistically significant differences in gendered use of verbs. Jockers and Kirilloff use the method of nearest shrunken centroids as input into their model for a variety of reasons having to do with model interpretability. ²

But Jockers and Kirilloff’s findings are significant at the level that Da suggests, and it is Da’s work that is truly sloppy. In the appendix, she publishes a comparison that obviously mislabels its bins (it claims that her replication found that “she killed” and “he wept” are gender stereotypes, rather than the opposite). If the goal is simply to find which words show strong gendered patterns of usage, it’s unclear why she would choose a different statistical method. In the appendix, she claims to have performed a replication and found that “Overall, the percentage differences between these top most correlated verbs for each gender was very low (0.031% to 0.307%) meaning that while a difference can be found, male/female is not very differentiated from one another if we look at verbs.” I have no idea what statistics she is reporting here–although she has a GitHub repository online, it appears not to contain any of the code used to generate these tables. ³

But while Da’s method is obscure, I am confident that the interpretation any reader would take from this–that Jockers and Kirilloff report statistically inflated claims of difference–is incorrect. A simple way to test the robustness here is just to apply a Dunning Log-Likelihood test, and use a close analogue to the Bonferroni correction Da calls for and then never runs, a Holm-Sidak correction. ⁴ The result: 81% of the words Jockers and Kirilloff look at are statistically significant.
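For readers who want to see what such a check involves, here is a minimal sketch, not the code behind the 81% figure. It computes Dunning’s log-likelihood statistic (G²) from a two-by-two table of word counts, converts it to a p-value with one degree of freedom, and applies the textbook step-down Holm-Sidak adjustment, which orders tests by p-value (the footnote describes ordering by word frequency instead). The function names and every count used in the usage note are invented for illustration.

```python
from math import erfc, log, sqrt


def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (G^2) for one word in two corpora."""
    observed = [count_a, total_a - count_a, count_b, total_b - count_b]
    pooled = (count_a + count_b) / (total_a + total_b)
    expected = [total_a * pooled, total_a * (1 - pooled),
                total_b * pooled, total_b * (1 - pooled)]
    # Cells with zero observations contribute nothing to the sum.
    g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    return max(g2, 0.0)  # guard against tiny negative rounding error


def g2_p_value(g2):
    """Upper tail of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(g2 / 2))


def holm_sidak(p_values, alpha=0.05):
    """Step-down Holm-Sidak: returns (reject flags, adjusted p-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Sidak adjustment over the hypotheses not yet tested.
        adj = 1 - (1 - p_values[i]) ** (m - rank)
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[i] = running_max
    return [a < alpha for a in adjusted], adjusted
```

With invented counts, `g2_p_value(dunning_g2(100, 100_000, 400, 100_000))` is vanishingly small, while a verb used 500 times per 100,000 words in both corpora gets G² of zero and a p-value of 1. Passing the whole list of per-word p-values to `holm_sidak` then says which differences survive the correction.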

After spending a paragraph and a half throwing out statistical claims that

My intellectual disagreement, here, is with the

Piper on Confessions

This same slapdash method–mis-stating the statistical or computational literature, failing to run the very tests she insists are necessary, and then leaving the reader with the impression she has somehow invalidated the result–is on prime display in her description of Andrew Piper’s work on the confessional form. Da pulls statistical pronouncements out of thin air and presents them as that which must be done. These claims are often either misinformed or misleading.

I can’t bear to go through all her sections. But as an example, take the analysis of Andrew Piper’s work on Augustine’s Confessions. In a few paragraphs, she makes as many mistakes as she holds him to account for in a full article.

First, she criticizes Piper for performing Principal Components Analysis on unscaled word frequencies, and produces scatterplots that show dramatically different results from his: “The way to properly scale this type of matrix is outlined in G. Casella et al’s Introduction to Statistical Learning… The second step [Z-scaling] is necessary if each word is to be seen as a feature for PCA.” George Casella did not write a book called Introduction to Statistical Learning; she means the 2013 volume by Gareth James et al. published (after Casella’s death) in a Springer series for which he was general editor. The chapter she cites certainly does not say that PCA matrices must always be scaled by standard deviations. It says, rather, that whether to scale is a decision the researcher has to make. When units are arbitrary, PCA should be scaled–if comparing SAT scores to grade point averages, you don’t want the difference between a 1420 and a 1421 on the test to be the same as a 2.5 and a 3.5 on the GPA. But word frequencies are not arbitrary. In those cases, the researcher must decide. To quote from the text: “In certain settings, however, the variables may be measured in the same units. In this case, we might not wish to scale the variables to have standard deviation one before performing PCA.”@james_introduction_2013.
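To see why the choice matters for word frequencies, here is a toy sketch rather than a reproduction of Piper’s or Da’s analysis. It finds the first principal component by power iteration on the covariance matrix, once on raw frequencies and once after scaling each word to unit standard deviation. The three words, their frequencies, and the function names are all invented for illustration.

```python
from statistics import fmean, stdev

# Invented frequencies per 1,000 words in six imaginary documents.
FREQS = {
    "et": [50, 42, 55, 38, 60, 45],
    "deus": [5, 9, 4, 10, 3, 8],
    "resurrexit": [0.1, 0.3, 0.0, 0.4, 0.1, 0.2],
}


def prepare(column, scale):
    """Center a column; optionally divide by its standard deviation."""
    mean = fmean(column)
    spread = stdev(column) if scale else 1.0
    return [(x - mean) / spread for x in column]


def covariance_matrix(columns):
    """Sample covariance between already-centered columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]


def first_component(freqs, scale=False, iterations=200):
    """Loadings of the first principal component, via power iteration."""
    columns = [prepare(col, scale) for col in freqs.values()]
    cov = covariance_matrix(columns)
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return dict(zip(freqs, vec))


if __name__ == "__main__":
    print("unscaled:", first_component(FREQS, scale=False))
    print("scaled:  ", first_component(FREQS, scale=True))
```

On these made-up numbers the unscaled component is almost entirely the common word, while the scaled one weights the rare word about as heavily as the common one. Neither is wrong in itself; they answer different questions, which is the decision the textbook leaves to the researcher.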

This is a central challenge familiar to anyone who has tried to grapple with wordcounts. There are so many uncommon words used once or twice in any given text that, when scaling is used, they can completely swamp the repeated words. A variety of solutions are in common use. TF-IDF scaling drops out the most common words while allowing those of medium frequency to shine through; log transformations of various flavors proliferate. Ideally solutions would not be wholly dependent on the parameter space, but the phrasing of the question matters.

Da sidesteps all these complications for her readers by implying the real difference has to do with a philological failing, that Piper doesn’t stem Latin text. This is something a literary audience can understand, and gestures towards a humanistic critique. But comically, her version reproduces many of the same philological failings. She implies that Piper didn’t use a Latin stemming algorithm because the “only Latin stemmer available is the Schinke stemmer,” but that she has taken the effort. This is incorrect on both fronts. First, there are many Latin stemmers available. (For an in-depth analysis of at least 6, see Patrick Burns’s work.)

And her effort seems to be scattershot at best. It’s hard to tell what code Da actually ran–the online appendices for analyzing Piper’s case only include the PCA code for Chinese, not the figures included in the appendix. (Ordinarily I would be forgiving of this kind of lapse, which is all too common; perhaps the inadequate code appendices are intended as a higher-order critique of computational work. But her failings vis-a-vis replication are far greater than those of, say, Ted Underwood, who generally supplies a single script called replicate.py that you can run yourself inside any of his projects.)

Still, from what she has posted online, Da appears to have re-implemented Schinke’s algorithm in both R and python, with separate rules for nouns and verbs. But then, in her Cross Distance code, she simply applies the noun stemming rules to all words, (probably) because choosing a part of speech is much harder than running stemming. This results in many problems; both because some verbs are not stemmed at all (‘resurrexit’ remains ‘resurrexit’ even though the verb rules would have it as ‘resurrexi’); and because the rules are applied to function words as well with silent NULL results in her code, so that words like ‘que,’ ‘cum,’ ‘te,’ and ‘me’ are deleted from the text altogether. That is: many function words are being dropped altogether because a new implementation was hastily coded rather than using one of the more mature implementations available.
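The failure mode is easy to reproduce in miniature. The sketch below is neither Schinke’s rule table nor Da’s code: the suffix list and function names are invented, and the only thing it is meant to show is how a stemmer that returns nothing for words it cannot handle, combined with a pipeline that discards empty results, quietly deletes short function words.

```python
# Invented, deliberately tiny suffix list; NOT the Schinke noun rules.
TOY_NOUN_SUFFIXES = ("ibus", "que", "um", "us", "ae", "am", "is", "e", "a")


def toy_noun_stem(word, min_stem=2):
    """Strip the first matching suffix; return None if too little is left."""
    for suffix in TOY_NOUN_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if len(stem) >= min_stem else None
    return word  # no noun suffix matched, so the word passes through as-is


def stem_dropping_failures(tokens):
    """The risky pattern: empty results silently vanish from the text."""
    stems = (toy_noun_stem(t) for t in tokens)
    return [s for s in stems if s is not None]


def stem_keeping_failures(tokens):
    """A safer pattern: fall back to the original token on failure."""
    return [toy_noun_stem(t) or t for t in tokens]


if __name__ == "__main__":
    tokens = ["deus", "que", "cum", "te", "me", "resurrexit"]
    print(stem_dropping_failures(tokens))
    print(stem_keeping_failures(tokens))
```

In this toy version the verb passes through unstemmed because no noun suffix matches it, and the four function words disappear from the first output but survive in the second. Any word counts computed downstream of the first function would be working from a different text than the one the researcher thinks they have.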

I wrote this and then, quickly, checked what difference it all makes. (Code and edits online here) I was, honestly, expecting that the scaling factor would be significant and account for the differences in texts. But actually, what I got looks more or less like Piper’s original.

Reproduction of Piper’s original:

Reproduction using Da’s scaling.

Or maybe it looks completely unlike it!

And so on.

I could go on. The debunking of topic models, for example, uses not the well established literature about comparing topic model distributions to each other, but some arbitrarily chosen robustness tests. (It drops 1% of documents). But it is not a replication. Topic models rely on extremely specific assumptions about the distribution of words in texts based on word counts; they attempt to reproduce the frequencies in actual documents.

But rather than fit on word counts, the model, for no apparent reason, uses TF-IDF vectors that multiply the significance of rare words and decrease the significance of common ones. I have never seen a TF-IDF vectorization fed into an LDA feature set before– it’s an extremely odd choice that guarantees the results will be different from Underwood and Goldstone’s, and partially explains the incoherent topics in the appendix, such as doulce attractiveness unsatisfying gence dater following mecum wigan cio milieu. (Edit 03-20) I’m wrong about this: Andrew Goldstone points out that there’s an argument to the TFIDF vectorizer in her code that makes it output raw frequencies. Frequencies might still produce results different than the counts that Underwood and Goldstone used, but this is not a howler. It’s still unreasonable, though, to expect that the topics put out by the Variational Bayes, online LDA implementation in scikit-learn will be the same as those in the Gibbs-Sampling method Underwood and Goldstone use from Mallet. Different methods can produce dramatically different results when the hyperparameters are not properly tuned. (See here) While Goldstone does optimize hyperparameters, there’s nothing in the scikit-learn code that indicates this effort. So the models may be radically different because Underwood and Goldstone ran a better model.

In fact, Goldstone and Underwood’s original work on this dealt with this issue very clearly:

On the other hand, to say that two models “look substantially different” isn’t to say that they’re incompatible. A jigsaw puzzle cut into 100 pieces looks different from one with 150 pieces. If you examine them piece by piece, no two pieces are the same — but once you put them together you’re looking at the same picture.

This comparison obviously mislabels its bins (it claims that her replication found the “she killed” and “he wept” are gender stereotypes, rather than the opposite) and makes some extremely fishy claims such as “Overall, the percentage differences between these top most correlated verbs for each gender was very low (0.031% to 0.307%) meaning that while a difference can be found, male/female is not very differentiated from one another if we look at verbs.” I don’t know what that range is supposed to be, but at least for ‘wept’, Google Ngrams gives the difference in gender usage as 400%

But to go through all of this is a pain. I’m sure others have written other analyses. This work is tedious, which is the reason that it’s rarely done; and it’s hard to reproduce another workflow even when it’s well-documented.

For the record, I myself made a quick check using yet another measure of entropy, compressibility; I’m inclined to think Da is right that there is a fundamental error in Stanford’s bigram calculations.↩︎
Nearest shrunken centroids is indeed a sort of idiosyncratic choice, but one that Jockers seems to be extremely partial to going back over a decade. @jockers_comparative_2008. Whether digital humanists should be free to roam across the disciplines in search of obscure but useful algorithms, or should remain in a tightly constrained space, is a difficult one. My stance–↩︎
I base this partly because the appendix says it uses the “SpaCy” packages for results, but none of her online code imports that package.↩︎
I am not a statistician, but I use the Sidak correction because the literature seems to say it’s superior to the Bonferroni. I use the Holm modification, which applies increasingly stringent standards as you descend a list, because of an issue Da doesn’t ever address, type II errors; that it is as incorrect to report a false negative as a false positive. I order the Holm method is by word frequency, not p-value (suggested in some online literature) to make the test more conservative; since Jockers and Kiriloff use the 310 most common words, there’s no need to worry about multiple comparisons outside this range.↩︎
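For readers unfamiliar with the correction named in the last note, here is a minimal sketch of the textbook Holm-Sidak step-down procedure. It is not the code used for the analysis above: the function name `holm_sidak` and the example p-values are made up for illustration, and it orders tests by p-value (the standard form) rather than by word frequency as the note describes.

```python
def holm_sidak(p_values, alpha=0.05):
    """Textbook Holm-Sidak step-down test (illustrative sketch only).

    Returns a list of booleans, aligned with p_values, that says
    which null hypotheses are rejected at family-wise level alpha.
    """
    m = len(p_values)
    # Standard Holm ordering: smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        remaining = m - step  # hypotheses not yet tested
        # Sidak threshold for the hypotheses that remain.
        threshold = 1 - (1 - alpha) ** (1 / remaining)
        if p_values[idx] > threshold:
            break  # stop at the first failure; the rest are retained
        reject[idx] = True
    return reject


# Made-up p-values, purely for illustration.
print(holm_sidak([0.001, 0.04, 0.03, 0.2]))
```

The threshold is recomputed at each step from the number of hypotheses still untested, which is what distinguishes the step-down version from a single Sidak cutoff applied to every test.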

History Degrees since the Great Recession.

bmschmidt@gmail.com (Ben Schmidt) — Mon, 03 Dec 2018 17:16:22 GMT

I wrote this year’s report on history majors for the American Historical Association’s magazine, Perspectives on History; it takes a medium-term view of the significant hit the history major has taken since the 2008 financial crisis. You can read it here.

There’s also an interview with me about the topic in the Chronicle of Higher Education.

Interactive Scatterplots

bmschmidt@gmail.com (Ben Schmidt) — Tue, 30 Oct 2018 13:37:26 GMT

As part of the Creating Data project, I’ve been doing a lot of work lately with interactive scatterplots. The most interesting of them is this one about the full Hathi collection. But I’ve posted a few more I want to link to from here:

Stable Random Projection

bmschmidt@gmail.com (Ben Schmidt) — Mon, 22 Oct 2018 14:30:33 GMT

I have a new article on dimensionality reduction on massive digital libraries this month. Because it’s a technique with applications beyond the specific tasks outlined there, I want to link to a few things here.

The article in Cultural Analytics.
A visualization of 13 million books from the Hathi Trust in Creating Data.
Instructions for best using those features for your own projects in Creating Data.

New Site

bmschmidt@gmail.com (Ben Schmidt) — Sun, 21 Oct 2018 14:30:33 GMT

I’m switching this site over from Wordpress to Hugo, which makes it easier for me to maintain.

It may also confuse the RSS feed a bit. This should hopefully be a one-time occurrence.

New article in the Atlantic

bmschmidt@gmail.com (Ben Schmidt) — Fri, 31 Aug 2018 17:37:26 GMT

I have a new article in the Atlantic about declining numbers for humanities majors.

Sapping Attention: Mea Culpa, There is a crisis in the humanities.

bmschmidt@gmail.com (Ben Schmidt) — Mon, 30 Jul 2018 17:37:26 GMT

I put up a new post at Sapping Attention about the crisis in the humanities. In short, it’s been bad enough to make me recant earlier statements of mine about the long-term health of the humanities discipline.

Feature Reduction on the Underwood-Sellars corpus

bmschmidt@gmail.com (Ben Schmidt) — Sat, 19 Mar 2016 15:17:35 GMT

This is some real inside baseball; I think only two or three people will be interested in this post. But I’m hoping to get one of them to act out or criticize a quick idea. This started as a comment on Scott Enderle’s blog, but then I realized that Andrew Goldstone doesn’t have comments for the parts pertaining to him… Anyway.

Basically I’m interested in feature reduction for token-based classification tasks. Ted Underwood and Jordan Sellars’ article on the pace of change (hereafter U&S) has inspired a number of replications. They use the 3200 most-common words to classify 720 books of poetry as “high prestige” or “low prestige.”

Shortly after it was published, I made a Bookworm browser designed to visualize U&S’s core model, and asked Underwood about whether similar classification accuracy on a much smaller feature set was possible. My hope was that a smaller set of words might produce a more interpretable model. In January, Andrew Goldstone took a stab at reproducing the model: he does, but then argues that trying to read the model word by word is something of a fool’s errand:

Researchers should be very cautious about moving from good classification performance to interpreting lists of highly-weighted words. I’ve seen quite a bit of this going around, but it seems to me that it’s very easy to lose sight of how many sources of variability there are in those lists. Literary scholars love getting a lot from details, but statistical models are designed to get the overall picture right, usually by averaging away the variability in the detail.

I’m sure that Goldstone is being sage here. Unfortunately for me, he hits on this wisdom _before_ using the lasso instead of ridge regression to greatly reduce the size of the feature set (down to 219 features at 77% success rate, if I’m reading his console output correctly), so I don’t get to see what features a smaller model selects. Scott Enderle took up Goldstone’s challenge, explained the difference between ridge regression and lasso in an elegant way, and actually improved on U&S’s classification accuracy with 400 tokens–an eightfold reduction in size.

So I’m left wondering whether there’s a better route through this mess. For me, the real appeal of feature selection on words would be that it might create models which are intuitively apprehensible for English professors. But if Goldstone is right that this shouldn’t be the goal, I’m unclear why the best classification technique would use words as features at all.

So I have two questions for Goldstone, Enderle, and anyone else interested in this topic:

1. Is there any redeeming interpretability to the features included in a unigram model? Or is Goldstone right that we shouldn’t do this?
2. If we don’t want model interpretability, why use tokens as features at all? In particular, wouldn’t the highest classification accuracy be found by using dimensionality reduction techniques across the *entire* set of tokens in the corpus? I’ve been using the U&S corpus to test a dimensionality reduction technique I’m currently writing up. It works about as well as U&S’s features for classification, even though it does nothing to solve the collinearity problems that Goldstone describes in his post. A good feature reduction technique for documents, like latent semantic indexing or independent components analysis, should be able to do much better, I’d think–I would guess at classification accuracy over 80% with under a thousand dimensions. Shouldn’t this be the right way to handle this? Does anyone want to take a stab at it? This would be nice to have as a baseline for these sorts of abstract feature-based classification tasks.

Buying a computer for digital humanities work

bmschmidt@gmail.com (Ben Schmidt) — Fri, 12 Jun 2015 17:49:01 GMT

I’ve gotten a couple e-mails this week from people asking advice about what sort of computers they should buy for digital humanities research. That makes me think there aren’t enough resources online for this, so I’m posting my general advice here. (For some solid other perspectives, see here). For keyword optimization I’m calling this post “digital humanities.” But, obviously, I really mean the subset that is humanities computing, what I tend to call humanities data analysis. Moreover, the guidelines here are specifically tailored for text analysis; if you are working with images, you’ll have somewhat different needs (in particular, you may need a better graphics card). If you do GIS, god help you. I don’t do any serious social network analysis, but I think the guidelines below should work relatively well with Gephi.

Pricing: For each component, I’m putting up a cheap and an expensive option. The cheap option down the line should be reasonable on a grad student budget; the expensive set should be luxurious for a faculty or staff member with a substantial research budget. I’m also briefly describing what I myself have been using for the last two years, because those specific examples can be helpful; my own setup tended towards the luxurious end of the spectrum in the summer of 2013.

The difference between what you can do with cheap and expensive is not great. Don’t make the mistake of fetishizing the hardware too much; as with most tools, it’s the performer, not the instrument, that truly matters. University libraries are filled with iMac computers that have incredible computing resources that are never used for anything but e-mail; you could, if you have a certain bent, adopt a no-computer coding and repo management style that stores code on Github and runs it only on public computers. Here, for example, is a snippet of code you can run on any library computer that will stream the entire Google Ngrams 3-grams corpus to a library server and store only the entries matching a regular expression. (Please don’t run that code trivially; it wastes a lot of resources.) You might need to store R or some python modules on a thumb drive to realistically make this work, but you might be able to become a folk hero by doing it.

1. Laptop vs Desktop.

Cheap: The cheapest route to go is a single laptop supplemented, should you find you need it, with specially purchased virtual server time on one of the cloud services. (Probably it’s best to use Amazon for this, since that’s the most widely used option–see “Hard drive space” for a discussion of Amazon with large datasets.)

Expensive: You’re going to need a laptop in any case (anyone planning to use a tablet for presentations is idiosyncratic enough that they aren’t reading this guide). An additional desktop will let you buy more computing power more cheaply than a new laptop. If you have an office (home or university), I think it makes less sense to max out on an expensive all-in-one laptop, and more to purchase a desktop for computing-intensive tasks and choose your laptop as a piece of consumer electronics–base it on style, or the keyboard you like best, or weight. Desktops can run continuously, which is useful for long computations, web scraping, and the like. (People are often afraid to let programs run for hours, because they think that they are “frozen”; but frequently, a text-analysis task or complicated download may take hours to run. My personal record is a program that ran for about 3 months in the background.) They also can act as a web server, which can be useful for all sorts of tasks, from running an Omeka installation to hosting a backup of your slide deck online.

If you’re looking at spending over $1400 or so, I would seriously consider getting two computers; if less, focus on the laptop. Keep in mind that running a desktop continuously uses large amounts of electricity; one of the reasons running your own is so much cheaper than using Amazon is that all this carbon-producing effort is billed to your office or home.

My setup: I have a MacBook Air for day-to-day use, chosen primarily because it has the most battery life and I frequently forget to charge my computer, and a Linux desktop in a large tower case, described further below. They both ended up costing approximately the same ($1500 or so, plus more for some hard drives), but the desktop is much more powerful for computation.

2. Operating system.

Cheap: Ubuntu 14.04. (Or 16.04, once it comes out: the even-numbered Ubuntu releases are so-called “long term support” releases, and often the best to stick to if you don’t want to lose a lot of time on updates.) Ubuntu is the most widely used and easiest “flavor” of Linux to install, is free, will generally be the easiest system to install software on, and *usually* works with printers, cameras, and the like (always the biggest problem with Linux). Ubuntu is a little bit commercialized and sometimes too glossy; on really low-end hardware, a version of Debian using a simpler graphics stack may be a better choice.

Expensive: Mac OS X. Apple Hardware is pretty and, thanks to their market dominance, often cheaper for things like laptops. Homebrew is an indispensable package manager for installing open-source software; under the hood, where most programming happens, Mac OS X is the same as most other Unix. Once you get a version of OS X working, it may be worth skipping Apple’s frequent updates unless you absolutely need the new functionality they offer. (Edit–or don’t mind risking losing an afternoon to fixing things. If you don’t tweak settings a lot, it may be worth it, but always wait two weeks. Make sure you do all recommended security updates, though.) For a laptop, any of the MacBook varieties is fine. The major reason OS X costs so much is that if you want to get a full-powered server going, you’ll need to buy the extraordinarily expensive Mac Pro (min $3000) which, at time of writing, is still waiting for an update; a comparably equipped Linux setup may cost half as much.

You ~~absolutely~~ probably (see below) should not plan on running Microsoft Windows as your primary OS for humanities computing. With one exception: ESRI’s GIS software is only available for Windows. If you primarily do GIS, you’ll either need to sit in the lab, buy a Windows (or dual-boot) machine, or use QGIS, which is getting better. Since ArcGIS licenses are expensive, I usually recommend that grad students use QGIS so they don’t lose their ability to finish their dissertation with their library card. I personally use QGIS for some tasks, and the spatial libraries for R for more intensive spatial work, with D3 to render maps beautifully.

_Edit on Windows: Some people on Twitter think this is too harsh, so I’m changing it to “probably.” Windows will generally be fine if you’re doing number crunching only; python and RStudio will run fine. But use of Unix (the family of operating systems to which Linux and OS X belong) is far more common in DH than Windows, which means that on Unix you’ll have an easier time running other people’s code, while on Windows you’ll have a harder time running DH software that is oriented to the web, such as Omeka, which requires a so-called LAMP stack._

My setup: An old version of OS X on the laptop to keep my homebrew settings intact, and Ubuntu 14.04 on the desktop tower.

3. Memory (RAM)

No one seems to call it RAM anymore besides me. This is the single most important upgrade you can get for humanities computing, and most default systems come with less than you want.

Cheap: You can survive on 4GB, but it’s worth splurging for 8 in a laptop. If you’re only going to be working with python and R (the most common languages for humanities computing) you can get some stuff done on 4 or even 2GB; if you’re going to be regularly running anything that uses Java (which includes the immensely popular Mallet tool for topic modeling, and the Stanford CoreNLP toolkit), you’ll be glad to have more.

Expensive: As much as the rest of your hardware setup allows. There are usually hard limits on what your processor can accommodate, and you should go all the way up to them.

My setup: I maxed out the laptop at 8GB, and have the maximum 32GB in the Linux server. When I bought my desktop, this was the most possible on the medium-range motherboards for Intel i7 processors; one of the things the extra money for a Mac Pro gets is the ability to load in 64GB of RAM.

4. Hard Drive space

Needs for hard drive space vary enormously from person to person. There are only a few truly large data sets out there (Google NGrams, Hathi Trust, customized JStor data for research abstracts); you can fit tens of thousands of anything onto an ordinary hard drive. Don’t overestimate the size of your data; if you only want to look at, say, Victorian poetry, you can probably store the entire Hathi collection on your phone, let alone your computer. (Images and audio take more space). Do the math on the files you have and how many you’ll need to download before wasting money on storage: space is easily expanded, so you don’t need to spend money up front before you have the data. But you should have a plan for dealing with additional data.

If you plan to use a lot of data (more than 1TB), you will find that cloud computing is not a particularly realistic option. Processor time on the cloud is cheap: but data storage needs to be persistent, and you can easily end up spending several hundred dollars per terabyte, per year, to store copies online. This is rarely economical.
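“Doing the math” here is simple enough to script. The sketch below is only illustrative: the function names, the file counts, and above all the price of $0.03 per gigabyte per month are assumptions chosen for the example, not quotes from any provider, so substitute real figures before deciding anything.

```python
def corpus_size_gb(n_files, avg_file_mb):
    """Rough size of a collection, in decimal gigabytes."""
    return n_files * avg_file_mb / 1000


def yearly_cloud_cost(size_gb, dollars_per_gb_month):
    """Cost of keeping size_gb online for twelve months."""
    return size_gb * dollars_per_gb_month * 12


# Hypothetical corpus: 50,000 plain-text books at about 0.5 MB each.
books_gb = corpus_size_gb(50_000, 0.5)
print(f"corpus: {books_gb:.0f} GB")

# Hypothetical price, for illustration only.
ASSUMED_PRICE = 0.03  # dollars per GB per month
print(f"1 TB for a year: ${yearly_cloud_cost(1000, ASSUMED_PRICE):.0f}")
```

Under those assumed numbers the text corpus is small enough for any laptop, while a terabyte kept online lands in the “several hundred dollars per year” range described above.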

The other thing to keep in mind is that there are “solid state” and traditional hard drives. Solid state are better, but hold substantially less.

Cheap: Whatever drive comes in the machine; a 128 GB SSD might be enough if you don’t plan to store your personal music and photographs on the machine. At the absolute bottom level, a small disk drive is acceptable; but some size SSD is the biggest bang-for-the-buck upgrade you can get. To store more data, use external drives; 4TB external drives are now fairly cheap. If you’re going to be working with the big datasets, you probably should get two. If your analysis produces large data files or they are available online, though, you don’t *necessarily* need to back them up; instead, you can keep your code on the SSD or in a backed-up git repository, exclude the large files from backups, and leave code that can recreate them perfectly. (A Makefile is a nice way to accomplish this, as Mike Bostock describes.) Back up your code and writing in as many forms as possible; mine are scattered around on Dropbox, Github, and on hard drives at my home and office.

Expensive: As big an SSD as you can afford for the operating system and data processing, and traditional drives for additional storage. On Apple hardware you’ll need an external enclosure for those drives, which again gets expensive; if building a Linux tower, it may be worth getting a case that holds as many expansion drives as possible. You’ll want to use RAID to join multiple drives into a single array for some redundancy since disks will inevitably fail; RAID 10 is the standard, and RAID 5 and RAID 6 are both reasonable compromises if you’re squeezed for space. Do the math on what the cheapest cost per gigabyte hard drives are, and use those; keep in mind that with a RAID array your disks generally have to be the same size, so start with a 3 or 4 TB drive if you think there’s any chance you’ll need to scale up. Internal SATA connections are fast, and in my experience disk I/O can be a significant bottleneck. If you plan to do external storage, it’s worth making sure you can have a thunderbolt or at least USB 3 connection.
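The trade-off between the RAID levels mentioned above comes down to how many drives’ worth of space goes to redundancy. A minimal sketch, assuming equal-sized drives and nominal capacities (it ignores filesystem overhead, and the function name `usable_tb` is made up for this example):

```python
def usable_tb(level, n_drives, drive_tb):
    """Nominal usable capacity of an array of equal-sized drives."""
    if level == "raid5":
        if n_drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (n_drives - 1) * drive_tb  # one drive's worth of parity
    if level == "raid6":
        if n_drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (n_drives - 2) * drive_tb  # two drives' worth of parity
    if level == "raid10":
        if n_drives < 4 or n_drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return n_drives // 2 * drive_tb  # every drive is mirrored
    raise ValueError(f"unknown level: {level}")


for level in ("raid5", "raid6", "raid10"):
    print(level, usable_tb(level, 4, 4), "TB usable from four 4TB drives")
```

RAID 5 gives the most space per drive but survives only one failure; RAID 6 survives any two; RAID 10 gives up half the raw space to mirroring.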

My setup: a small (100GB) SSD for the operating system. I have a *lot* of data stored locally, so I bought a case that holds two small drives and six full size ones. I started with 3TB drives in a RAID 5 configuration for 6TB of space; as I ran out of space, I switched to 6 3TB drives in a RAID 6 for 9TB of storage with more redundancy.

4. Processors

Processor speed is less important for humanities computing; while too little memory makes it impossible to do certain things, too slow processors just means that they take longer. That’s fine; just get in the habit of accepting that certain things will run during your commute, or overnight, or whatever.

Keep in mind that both R and python don’t take advantage of multi-core processors very well. It’s possible to take advantage of multiple cores, but in many cases there is high overhead. (For Bookworm, I use GNU parallel instead of python’s multiprocessing library because the overhead of pickling and unpickling text files between python instances is much higher than just passing plain text through the system; in general, it’s worth learning how to use GNU parallel, the -P flag to xargs in the shell, or the -j argument to make; the system is likely to be better at allocating resources than your python code.)
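One way to follow this advice is to write a plain single-process script that takes file names as arguments and prints tab-separated text, and let the shell decide how many copies to run. The sketch below is a hypothetical example (the script name `count_tokens.py` and the token-counting task are invented for illustration; this is not Bookworm’s code).

```python
import sys


def count_tokens(path):
    """Count whitespace-separated tokens in one text file."""
    total = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total += len(line.split())
    return total


def main(paths):
    # Plain text out: one "path<TAB>count" line per file, so the
    # results of many parallel copies can simply be concatenated.
    for path in paths:
        print(f"{path}\t{count_tokens(path)}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as `count_tokens.py`, it could be driven with something like `find directoryName -type f -name '*.txt' | xargs -n 100 -P 6 python count_tokens.py > counts.tsv`, where `-P 6` runs six copies at once and `-n 100` hands each copy a hundred files at a time. No pickling is involved: each process reads its own files and writes plain text.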

Java programs are frequently better able to take advantage of multiple cores.

Cheap: Whatever: probably two cores.

Expensive: The Intel i7 series is fine; a quad-core system effectively gives eight processors, and flies through most tasks. The Mac Pros use Xeons, and are going to switch to something better in the next generation. Oddly, they have slower clock speeds the more processors you get; this means, paradoxically, that if you write unoptimized python or R code, it will probably run faster on a cheap Mac Pro than an expensive one, so you should feel just fine about buying the “cheap” $3000 one.

5. Graphics

Unless you’re working explicitly with images, this is the place you need the least compared to what an off-the-shelf computer will get you. Real-time video rendering matters a lot for computer games and watching movies, which all commodity computers are built to do at least a little of; but digital humanities rarely make use of their capabilities.

In some limit cases or in three or five years, this may be flipped. Code that is optimized for GPU can be extremely fast indeed; but it’s often difficult to find and even harder to write, much more so than multiprocessor code. It also varies by architecture, so you’ll need to do some research about whether there’s a good SVD algorithm for the GPU written for your particular NVidia card, or whatever.

If you’re working with photoshop, obviously, the situation is different. 3D modeling, which I don’t know much about, should benefit enormously. But just because you do data visualization doesn’t mean you need any graphics card at all; if you plan to do it in the browser or R, the benefits are slight.

My setup: No graphics card on the Linux tower. Whatever comes by default in the MacBook.

6. Monitor

Whatever size monitor you get, you will come to feel is the minimum.

7. Keyboard

I have a mechanical 1990s IBM model M. It’s pretty awesome.

8. Software

With the exception of ArcGIS, most widely-used software for humanities computing is free. CUNY’s DH-in-a-box platform contains a lot of what you need. As I said, Python and R are the two most widely-used languages; along with Javascript, Java, and the various C languages, they’re all free. SPSS, Stata, and the like, are ~~absolutely~~ not worth it; I see no reason to use Matlab, although it’s common in some other fields. The only coding platform I’d consider spending my own money on for humanities computing is Mathematica; you can do some amazing things, but won’t be able to share code.

Learning to interface between your analysis language and a database can be extremely useful for avoiding problems with memory. Python’s “shelve” module is incredibly useful as a persistent key-value store; the dplyr package in R lets you use a SQL database without the unpleasant experience of actually writing SQL code. I use MySQL with MyISAM tables because I believe them to be faster and more portable; most advice you’ll get nowadays is to use Postgres as a complicated database server, or SQLite for lightweight files. If you do use a database to store data and think that it’s slow, it’s worth reading up on how indexing works; there’s a very good chance you can improve query times by a factor of a thousand by adding the right index. Use WordPress for a blog until you know that you need something different; every other platform you might use allows you to convert to it from WordPress. Don’t use Blogger and then regret it, like me.
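To make the database points above concrete, here is a small sketch using only Python’s standard library: `shelve` as a persistent key-value store, and SQLite to show what adding an index changes. The table, column, and index names and the counts are all made up for illustration, and SQLite stands in for whatever database you actually use.

```python
import os
import shelve
import sqlite3
import tempfile


def demo(workdir):
    # shelve: a dictionary that persists on disk between runs.
    shelf_path = os.path.join(workdir, "counts")
    with shelve.open(shelf_path) as shelf:
        shelf["whale"] = 3  # toy value
    with shelve.open(shelf_path) as shelf:
        stored = shelf["whale"]  # read back from disk

    # SQLite: ask for the query plan before and after indexing.
    conn = sqlite3.connect(os.path.join(workdir, "words.db"))
    conn.execute("CREATE TABLE counts (word TEXT, year INTEGER, n INTEGER)")
    rows = [("whale", 1851, 2), ("whale", 1852, 1), ("ship", 1851, 5)]
    conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", rows)
    query = "SELECT SUM(n) FROM counts WHERE word = ?"
    explain = "EXPLAIN QUERY PLAN " + query
    before = conn.execute(explain, ("whale",)).fetchall()[0][-1]
    conn.execute("CREATE INDEX idx_word ON counts (word)")
    after = conn.execute(explain, ("whale",)).fetchall()[0][-1]
    total = conn.execute(query, ("whale",)).fetchone()[0]
    conn.close()
    return stored, before, after, total


with tempfile.TemporaryDirectory() as workdir:
    stored, before, after, total = demo(workdir)
    print("shelved value:", stored)
    print("plan before index:", before)
    print("plan after index: ", after)
    print("total for 'whale':", total)
```

The plan before the index is a scan of the whole table; afterwards it is a search using `idx_word`. On three rows the difference is invisible, but on hundreds of millions of rows it is the difference between reading everything and jumping straight to the matching entries.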

Plenty of people use virtual machines even on their local hardware to keep certain things (a webserver, say) clean and easy to back up. I don’t, because I don’t want to lose any performance. But a system like Vagrant can be extremely useful for switching code between a local machine and the cloud, particularly under the budget approach.

Some random other advice: Write in markdown with pandoc. Use github, obviously. Everyone loves sublime as a text editor; if you tend to work on remote servers, though, it can be convenient to just always use vim or emacs, since they’ll always be around. Document each project in its makefile, as described in the Bostock article above. Use the unix command `find` instead of `ls` if you have more than a thousand files downloaded; I’m constantly writing lines of code like `find directoryName -type f | xargs -P 6 someShellCommand.sh`, and it’s fast.

Commodius vici of recirculation: the real problem with Syuzhet

bmschmidt@gmail.com (Ben Schmidt) — Fri, 03 Apr 2015 16:37:21 GMT

Practically everyone in Digital Humanities has been posting increasingly epistemological reflections on Matt Jockers’ Syuzhet package since Annie Swafford posted a set of critiques of its assumptions. I’ve been drafting and redrafting one myself. One of the major reasons I haven’t posted it is that the obligatory list of links keeps growing. Suffice it to say that this here is not a broad methodological disputation, but rather a single idea crystallized after reading Scott Enderle on “sine waves of sentiment.” I’ll say what this all means for the epistemology of the Digital Humanities in a different post, to the extent that that’s helpful.

Here I want to say something much more specific: that Fourier transforms are the wrong “smoothing function” (insofar as that is the appropriate term to use) to choose for plots, because they assume plot arcs are periodic functions in which the beginning must align with the end. I’m pretty sure I’m right about this, but as usual I’m relying on an intuitive understanding of the techniques under discussion here rather than a deeply mathematical one. So let me know if I’m making a total ass of myself, and I’ll withdraw my statements here.

Even before Swafford posted her critique, I felt like there was something quite wrong about using the Fourier transform as a “smoothing” mechanism. Fourier transforms, in my experience with them, are bad at dealing with humanities data, because they rely on a very precise definition of “signal.” I’ve had to use wavelets instead of the Fourier transform in the past even to extract obviously periodic data from time series, because the assumptions of regularity in the Fourier transform are so strong that some periods are simply missed.

As I was reading Enderle’s post, it occurred to me that we’ve been graphing these Fourier transformed waves with the x axis reading 1 to 100, as if it was a closed domain. But, in fact, if plot is a sum of sine waves, that domain should actually read from 0 to 2*pi. (Or, if you’re so inclined, from 0 to tau). The difference being that waveforms are _cyclical_: this is the fundamental assumption of Fourier transforms, whence all of the ringing artifacts that Swafford usefully points out come. After 100 comes 101: but 2 pi is the same as zero. This assumption is true only for novels whose last sentence is aligned to feed back into their first, a rare breed indeed. (Although ironically, given the primacy that _Portrait of the Artist_ has played in this debate, Joyce wrote one.)

To put that graphically: this cyclicality means that syuzhet imposes an assumption that the start of plot lines up with the end of a plot. If you generate an artificial plot that starts with sentiment “-5” and ends with sentiment “5”, it looks like this with normal smoothing methods. (Rolling average or loess).

But if you try to use syuzhet’s filter, it comes up looking completely different: wavy.
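The effect can be reproduced without any special libraries. The sketch below is a generic Fourier low-pass filter written in plain Python, not syuzhet’s actual code; the choice to keep the three lowest frequencies and to use an 11-point moving average are assumptions made for the example. It builds the artificial plot described above, rising steadily from -5 to 5, and smooths it both ways.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform (fine for ~100 points)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fourier_low_pass(values, keep=3):
    """Zero out every frequency above `keep`, then invert the transform."""
    n = len(values)
    spectrum = dft(values)
    # Keep frequencies 0..keep and their mirror images so the result is real.
    kept = [c if (k <= keep or k >= n - keep) else 0
            for k, c in enumerate(spectrum)]
    return [
        (sum(kept[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def moving_average(values, window=11):
    """Centered rolling mean, with a shorter window at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


n = 100
plot = [-5 + 10 * t / (n - 1) for t in range(n)]  # steadily rising sentiment
wavy = fourier_low_pass(plot)
smooth = moving_average(plot)
print(f"fourier: start {wavy[0]:.2f}, end {wavy[-1]:.2f}")
print(f"rolling: start {smooth[0]:.2f}, end {smooth[-1]:.2f}")
```

The rolling average keeps nearly the full rise from start to end, while the low-pass version begins and ends close to zero: the filter treats the jump from the last point back to the first as part of the signal and smooths it away.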

This holds true on real documents. I ran it on every State of the Union address since 1960. I’ve added dashed lines to show the overall sentiment movement in the address. Blue shows loess smoothing from beginning to end, and red shows the Fourier transform. As you can see, loess allows plots to get happier or sadder: Fourier forces them to return almost to their starting place.

All the code for this is online here: you can try it on your own plots as desired.

I can see no sound reason to do this. Plots can start sad and get happy. But if you look at Jockers’ six “fundamental plots,” all start and end in the same approximate emotional register. This, I think, is an artifact of the assumptions of periodicity built into the Fourier transform, not the underlying plots. There’s no room in this world for Vonnegut’s “From bad to worse,” or for any sort of rags to riches. It treats plot as a zero-sum game.

If I’m not misunderstanding something here, this should convince Jockers to retire the waveform assumptions in favor of something like Loess smoothing or moving averages, so digital humanists can move on to talking about something other than “ringing artifacts.” I don’t think this is devastating for the Syuzhet package as a whole: it has absolutely nothing to do with the suitability of sentiment analysis for determining plot, which is a much more interesting question others are contributing to. (I am still undecided whether I think my own method of plotting arcs through multidimensional topic spaces, which I originally came up with from misunderstanding something Jockers said to me a year ago about his idea for syuzhet, is better: I do think it adds something to the conversation.) One of the broader points my unfinished post makes is that we shouldn’t be taking failures in one component of a chain to mean the rest is unsound: that’s an oddly out-of-domain application of falsifiability.

Rate My Professor

bmschmidt@gmail.com (Ben Schmidt) — Fri, 06 Feb 2015 21:47:21 GMT

Just some quick FAQs on my professor evaluations visualization: adding new ones to the front, so start with 1 if you want the important ones.

-3 (addition): The largest and in many ways most interesting confound on this data is the gender of the reviewer. This is not available in the set, and there is strong reason to think that men tend to have more men in their classes and women more women. A lot of this effect is solved by breaking down by discipline, where faculty and student gender breakdowns are probably similar; but even within disciplines, I think the effect exists. (Because more women teach at women’s colleges, because men teach subjects like military history that male students disproportionately take, etc). Some results may be entirely due to this phenomenon (for instance, the overuse of “the” in reviews of male professors). But even if it were possible to adjust for this, it would only be partially justified. If women are reviewed differently because a different sort of student takes their courses, the fact of the difference in their evaluations remains.

-2 (addition): This is not peer reviewed, and I wouldn’t describe this as a “study” in anything other than the most colloquial sense of the word. (It won’t be going on my CV, for instance.) A much more rigorous study of gender bias was recently published out of NCSU. Statistical significance is a somewhat dicey proposition in this set; given that I downloaded all of the ratings I could find, almost any queries that show visual results on the charts are “true” as statements of the form “women are described as x more than men are on rateMyProfessor.com.” But given the many, many peculiarities of that web site, there’s no way to generalize from it to student evaluations as used inside universities. (Unless, God forbid, there’s a school that actually looks at RMP during T&P evaluations.) I would be pleased if it shook loose some further study by people in the field.

-1. (addition): The scores are normalized by gender and field. But some people have reasonably asked what the overall breakdown of the numbers is. Here’s a chart. The largest fields are about 750,000 reviews apiece for female English and male math professors. (Blue is female here and orange male–those are the defaults from alphabetical order, which I switched for the overall visualization). The smallest numbers on the chart, which you should trust the least, are about 25,000 reviews for female engineering and physics professors.

(addition): RateMyProfessor excludes certain words from reviews: including, as far as I can tell, “bitch,” “alcoholic,” “racist,” and “sexist.” (Plus all the four letter words you might expect.) Sometimes you’ll still find those words when typing them into the chart. That’s because RMP’s filters seem to be case-sensitive, so “Sexist” sails through, while “sexist” doesn’t appear once in the database. For anything particularly toxic, check the X axis to make sure it’s used at a reasonable level. For four letter words, students occasionally type asterisks, so you can get some larger numbers by typing, for example, “sh *” instead of “shit.”
I’ve been holding it for a while because I’ve been planning to write up a longer analysis for somewhere, and just haven’t got around to it. Hopefully I’ll do this soon: one of the reasons I put it up is to see what other people look for.
The reviews were scraped from ratemyprofessor.com slowly over a couple months this spring, in accordance with their robots.txt protocol. I’m not now redistributing any of the underlying text. So unfortunately I don’t feel comfortable sharing it with anyone else in raw form.
Gender was auto-assigned using Lincoln Mullen’s gender package. There are plenty of mistakes–probably one in sixty people are tagged with the wrong gender because they’re a man named “Ashley,” or something.
14 million is the number of reviews in the database; it probably overstates the actual number in this visualization. There are a lot of departments outside the top 20 I have here.
There are ways of looking at the data other than this simple visualization: I’ve talked a little bit at conferences and elsewhere about, for example, using Dunning Log-Likelihood to pull out useful comparisons (for instance, here, of negative and positive words in history and comp. sci. reviews) without needing to brainstorm terms.
Topic models on this dataset using vanilla sets are remarkably uninformative.

7. People still use RateMyProfessor, though usage has dropped since its peak in 2005. Here’s a chart of reviews by month. (It’s intensely periodic around the end of the semester.)

This includes many different types of schools, but is particularly heavy on masters and community colleges in the most represented schools. Here’s a bar chart of the top 50 or so institutions:

The Bookworm-Mallet extension

bmschmidt@gmail.com (Ben Schmidt) — Fri, 12 Dec 2014 04:15:07 GMT

I promised Matt Jockers I’d put together a slightly longer explanation of the weird constraints I’ve imposed on myself for topic models in the Bookworm system, like those I used to look at the breakdown of typical TV show episode structures. So here they are.

The basic strategy of Bookworm at the moment is to have a core suite of tools for combining metadata with full text for any textual corpus. In the case of the movies, the texts are each three-minute chunks of movies or TV shows; a topic model will capture the size of each individual movie. A variety of extensions allow you to port various other algorithms into the system; so for instance, you can use the geolocation plugin to put in a latitude and longitude for a corpus which has publication places listed in it.

The Bookworm-Mallet extension handles incorporating topic models into Bookworm. The obvious way to topic model is to just feed the text straight into Mallet. This is particularly easy because the Bookworm ingest format is designed to be exactly the same as the Mallet format. But I don’t do that, partly because Bookworm has an insanely complicated (and likely to be altered) set of tokenization rules that would be a pain to re-implement in the package, and partly because we’ve *already* tokenized; why do it again?

So instead of working with the raw text, I load a stopwords list (starting with Jockers’ list of names) directly into the database, and pull out not the tokens but the internal numeric IDs used by Bookworm for each word. This has an additional salutary effect, which is that we can define from the beginning exactly the desired vocabulary size. If we want a vocab size of the most common 2^16-1 tokens in the corpus, it’s trivially easy to do it. That means that the Mallet memory requirements, which many Bookworms bump up against, can be limited. (David Mimno has used tricks like this to speed up Mallet on extremely large builds; I don’t actually know how he does it, but want to keep the options open for later.) And though I’m not yet limiting the vocabulary precisely, I do drop words that appear fewer than two times from the model to save space and time.

The actual model is run on a file not of words, but of integer IDs. Here are the first ten lines of the movie dataset as I enter it into Mallet.

1       883 24841 3714 932 2354 2343 1851 6850 5889 2205 273 4427 1088 2343 7900 139 9357 883 932 1060 590
2       9184 251 1613 11137 883 535 883 1140 4225 1003 290 1549 1000 3299 706 706 9498 16435 932 2216 232 
3       2475 412 535 2937 4342 177 177 559 1927 559 177 164 799 177 2901 177 6620 516 1855
4       1874 7769 271 567 5816 1878 410 388 1726 23371 353 3389 19793 8182 250 14188 5490 3766 5889 1145 3
5       356 520 1603 459 290 2110 8896 2339 1927 1184 1699 2150 912 8829 4340 2937 545 324 1726 114 4630 5
6       1591 2466 5889 3155 598 706 3946 433 2790 2429 1190 24220 13273 304 290 1060 3766 2351 177 2138 44
7       662 2797 656 11073 4887 1654 6492 3203 13119 6448 960 1237 2343 16247 9630 548 1776 2343 253 934 1
8       114 602 2343 348 1726 271 222 6080 1240 3790 4329 2442 4263 7030 1963 5535 2811 700 897 1157 1629 
9       1320 3476 5806 877 1320 1603 1603 7563
10      2077 545 2077 9250 3358 302 330 1984 2284 752 589 5588 3358 4648 6105 545 114 23884 19943 290 232

Each number is a code for a word; they appear not in the original order, but randomly shuffled. Wordid 883 is ‘land,’ 24841 is “Stubborn,” 3714 is “influence,” etc. This file is much shorter for being composed of integers without stopwords than it would be from the full text.
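As a rough illustration of that encoding step, here is a minimal Python sketch. It is not the Bookworm-Mallet code: the function name `encode_for_mallet`, the tab-separated layout, and the in-memory dictionaries standing in for Bookworm's database are all assumptions made for the example. It drops stopwords, caps the vocabulary at the most common words, drops words that appear fewer than two times, and writes each document as a shuffled list of integer IDs.

```python
import random
from collections import Counter

def encode_for_mallet(docs, stopwords, max_vocab=2**16 - 1, min_count=2, seed=0):
    """Turn tokenized documents into lines of shuffled integer word IDs."""
    counts = Counter(
        tok for toks in docs.values() for tok in toks if tok not in stopwords
    )
    # Keep only the most common words, and drop anything rarer than min_count.
    kept = [w for w, n in counts.most_common(max_vocab) if n >= min_count]
    # IDs start at 1, most common word first (a stand-in for Bookworm's wordids).
    word_ids = {w: i for i, w in enumerate(kept, start=1)}
    rng = random.Random(seed)
    lines = []
    for doc_id, toks in docs.items():
        ids = [str(word_ids[t]) for t in toks if t in word_ids]
        rng.shuffle(ids)  # word order does not matter to a bag-of-words model
        lines.append(f"{doc_id}\t{' '.join(ids)}")
    return word_ids, lines
```

Capping `max_vocab` up front is what keeps the vocabulary size (and so the memory footprint) predictable before the model ever runs.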

Then all the tokens and topic assignments are loaded back into the database, not just as overall distributions but as individual assignments. That makes it possible to look directly at the individual tokens that make up a topic, which I think is potentially quite useful. This gives a much faster, non-memory based access to the data in the topic state file than any other I know of; and it comes with full integration with any other metadata you can cook up.
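A sketch of what loading token-level assignments might look like, in Python. Several things here are assumptions rather than the extension's actual code: the state file is taken to have whitespace-separated rows of doc, source, position, type index, type, and topic, with `#`-prefixed header lines (which is my understanding of Mallet's topic state output); the table name `token_topics` is hypothetical; and sqlite3 stands in for Bookworm's real database.

```python
import sqlite3

def load_topic_state(state_lines, conn):
    """Load per-token topic assignments into a table, one row per token."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token_topics "
        "(doc TEXT, wordid INTEGER, topic INTEGER)"
    )
    rows = []
    for line in state_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and hyperparameter lines
        doc, _source, _pos, _typeindex, word, topic = line.split()
        # The model was run on integer IDs, so the "type" column is a wordid.
        rows.append((doc, int(word), int(topic)))
    conn.executemany("INSERT INTO token_topics VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once the rows are in a table, a question like "which words make up topic 3?" becomes an ordinary `GROUP BY` query that can be joined against any other metadata.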

Jockers’ “Secret sauce” consists, in part, of restricting to only nouns, adjectives, or other semantically useful terms. There is a way of doing that in the Bookworm infrastructure, but it involves not treating the topic model as a one-off job, but fully integrating the POS-tagging into the original tokenization. We would then be able to feed only adjectives into the topic modeling. But the spec for that isn’t fully laid out: and POS-tagging takes so long that I’m in no big hurry to implement it. It has proven somewhat useful in the Google Ngrams corpus, but I’m a little concerned by the ways that it tends to project modern POS uses into the past. (Words only recently verbified get tagged as verbs much further back in the 2012 Ngrams release).

Perhaps more interesting are the ways that the full Bookworm API may expose some additional avenues for topic modeling. Labelled LDA is an obvious choice, since Bookworm instances are frequently defined by a plethora of metadata. Another option would be to change the tokens imported in; using either Bookworm’s lemmatization (removed in 2013 but not forgotten) or even something weirder, like the set of all placenames extracted out in NLP, as the basis for a model. Finally, it’s possible to use metadata to more easily change the definition of a *text*; for something like the new Movie Bookworm, where each text covers three minutes, it would be easy to recalculate with each text instead coming in as an individual film.

Building outlines and slides from Markdown lectures with Pandoc

bmschmidt@gmail.com (Ben Schmidt) — Fri, 07 Nov 2014 20:17:45 GMT

Just a quick follow-up to my post from last month on using Markdown for writing lectures. The github repository for implementing this strategy is now online.

The goal there was to have one master file for each lecture in a course, and then to have scripts automatically create several things, including a slidedeck and an outline of the lecture (inferred from the headers in the text) to print out for students to follow along in class.

To make this work, I invented my own slightly extended version of the markdown syntax. It has three new conventions:

Any phrase in bold is a keyword to be pulled out and included in outlines.
Anything in a code block is to be used as a slide. Each separate code block is its own slide.
Any first-degree header is a full page slide.

The easiest way to do a code block is just to tab indent a line: most of my slides are just a single element line like this:

![Edison electric light](http://scienceblogs.com/retrospectacle/wp-content/blogs.dir/463/files/2012/04/i-3530f86be619cdc7d42c13cdca188088-edison.bmp)

As in the previous example, the image format is extended so that labels in slides appear not as alt-text, but in the text above the image: in addition, any image link beginning with the character “>” is treated not as an image but as an iframe, making it easy to embed things like youtube videos or interactive Bookworm charts.

The slide decks are built with reveal.js, which drops everything into a nicely organized batch. Here’s what one looks like. (This is for a lecture on household technologies in the 20s). My favorite feature is that by hitting escape, you get an overall view of everything in the lecture sorted by header–this is particularly useful when studying for exams, because those headers align exactly with the outlines.

The outlines are produced from the same lecture notes, but in a different way; rather than pull the code blocks, they walk through all the headers in the document and append them (and any bolded terms) to a new document that students can see. For that lecture, it looks like this:

There are a few things I still don’t love about this: image positioning and sizing is not so good as it is in powerpoint. But the thing that’s nice is that it’s extremely portable; if I don’t make it through the end of a lecture, I can just cut out the last few paragraphs, paste them into the next day’s document, and have the outline and slides immediately reflect the switch for both days. This makes a lot of last-minute, before-class changes dramatically easier.
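The outline pass described above (walk the headers, append any bolded terms) can be approximated in a few lines of Python. This is a regex-based sketch for illustration only, not the Pandoc filter in the repository; the function name `build_outline` and the indented-bullet output format are assumptions.

```python
import re

HEADER = re.compile(r"^(#+)\s+(.*?)\s*$")
BOLD = re.compile(r"\*\*(.+?)\*\*")

def build_outline(markdown_text):
    """Collect headers and bolded keywords into an indented outline."""
    outline = []
    depth = 0
    for line in markdown_text.splitlines():
        if line.startswith(("\t", "    ")):
            continue  # indented code blocks are slides, not outline material
        header = HEADER.match(line)
        if header:
            depth = len(header.group(1))
            outline.append("  " * (depth - 1) + "- " + header.group(2))
            continue
        # Bolded terms are listed one level below the current header.
        for term in BOLD.findall(line):
            outline.append("  " * depth + "- " + term)
    return "\n".join(outline)
```

Because the outline is derived from the lecture file every time, moving paragraphs between days changes both days' outlines with no extra work.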

The basic scripts, though not the full course management repo, are up on github. The code is in Haskell, which I’ve never written in before, so I’d love a second set of eyes on it. Some brief reflections on coding for pandoc in Python and Haskell follow.

I thought it would be easy to switch between headers and an outline, but they turn out to have almost nothing in common in the Pandoc type definition; the outline needs to be built up recursively out of component parts. It’s an operation that’s much closer to really basic data structures than anything I’ve done before.
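To make the mismatch concrete: in Pandoc's document type, headers are flat sibling blocks that each carry a level number, while a bullet list is a nested structure of lists of blocks. Turning one into the other means grouping a flat sequence into a tree. Here is a small Python sketch of that recursion; the `(level, title)` tuples and the function name `nest` are simplifications for the example, not Pandoc's types or the code in the repository.

```python
def nest(headers):
    """Recursively group a flat [(level, title), ...] list into a tree."""
    tree = []
    i = 0
    while i < len(headers):
        level, title = headers[i]
        # Everything deeper than this header, up to the next header at the
        # same level or above, belongs to this header's subtree.
        j = i + 1
        while j < len(headers) and headers[j][0] > level:
            j += 1
        tree.append((title, nest(headers[i + 1:j])))
        i = j
    return tree
```

For example, `nest([(1, "A"), (2, "B"), (2, "C"), (1, "D")])` gives `[("A", [("B", []), ("C", [])]), ("D", [])]`.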

I initially used the pandocfilters Python package for this. That code is here. It basically works, thanks primarily to the insight (gleaned from an exchange on GitHub between, I think, Caleb McDaniel and John MacFarlane that I’ve lost the link for) that you need to scope a global python variable and append to it from a `walk` function. But it has a tendency to break unexpectedly, and uses an incredibly confusing welter of accessors into the rather ugly pandoc json format. Plus, it’s fundamentally an attempt to write Haskell-esque code in Python, which is about the least pleasant thing I’ve ever seen.

By the time I made that python script work, I had spent a couple hours reading and re-reading the pandoc types definition, and it seemed like it would be simpler to just write the filter in Haskell directly. (I did a few Haskell problem sets for a U Penn course this summer out of curiosity; without that basic understanding of Haskell data types, I doubt I would have been able to understand the Pandoc documentation.) The lecture-to-outline Haskell code, to my surprise, ended up being a bit longer than the Python version (though much of that is type definitions and comments, which don’t really count). If anyone out there who knows Haskell can explain to me a better way to avoid some of the stranger elements in there (particularly the reversing and unreversing of lists just to allow pattern matching on them, which is a substantial proportion of what I wrote), I’m all ears.

Programming in Haskell is certainly more interesting than python; I agree with Andrew Goldstone’s comment that “whereas programming normally feels like playing with Legos, programming in Haskell feels more like trying to do a math problem set, with ghc in the role of problem-set grader”. I’m left with a strong temptation to write a TEI-to-Bookworm parser, which I’ve previously sketched in Python, in Haskell instead; both for performance and readability reasons, I think it might work quite well. Stay tuned.

Building topic models into Bookworm searches

bmschmidt@gmail.com (Ben Schmidt) — Tue, 23 Sep 2014 22:29:38 GMT

Lately I’ve been seeing how deeply we could integrate topic models into the underlying Bookworm architecture.

My own chief interest in this, because I tend to be a little wary of topic models in general, is in the possibility for Bookworm to act as a diagnostic tool internally for topic models. I don’t think simply plotting description absent any analysis of the underlying token composition of topics is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.

But topics also have a lot to offer token-based searching. Watching links into the Bookworm browser, I recently stumbled on this exchange:

How can I solve this biologist’s problem? (Or, at least, waste more of his time?)

The word-level topic assignments I have on hand are actually real useful for this. (I’m assuming, I should say, that you know both the basics of topic modeling and of the movie bookworm.) I can ask the beta bookworm browser for the top topics associated with each of the words “fly” (top) and “ant” (bottom):

Fly usage by topic

Ant usage by topic

“Fly” is overwhelmingly associated with the topics “boat ship Captain island plane sea water” (airplane flying) and “life day heart eyes world time beautiful” (unclear, but might be superman flying). (It’s even more so than on this chart, since I’ve lopped off the right side: there are about 2200 uses of “fly” in the first topic).

But “ant” is most used in two clearly animal related topics: “water animals years fish time food ice” and “dog cat little boy dogs Hey going.” And both of those topics show up for “fly” as well.

So in theory, at least, we can *restrict searches by topic:* rather than put into a Bookworm *every* usage of the word “fly”, we can get only those that seem, statistically, to be used in an animal-heavy context.

With an imperfect, 64-topic model on a relatively small corpus like the Movie Bookworm, this is barely worth doing.

Ant in animal topics per million words in all topics

Fly in animal topics per million words in all topics

And given that “flying” is something that plenty of animals do, the “fly” topic here is probably not all Order Diptera.
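The quantity in those two charts, uses of a word inside a chosen set of topics per million words across all topics, is simple to state in code. A minimal Python sketch, assuming the token-level assignments are available as `(word, topic)` pairs; the function name and the in-memory list are assumptions for the example, since the real numbers come out of the Bookworm database.

```python
def per_million_in_topics(assignments, word, topics):
    """Uses of `word` assigned to any of `topics`, per million tokens overall."""
    total = len(assignments)
    if total == 0:
        return 0.0
    hits = sum(1 for w, t in assignments if w == word and t in topics)
    # The denominator deliberately stays unrestricted: all words, all topics.
    return 1_000_000 * hits / total
```

Keeping the denominator unrestricted is the important choice here; restricting it by topic as well would turn the result into a share of the topic rather than a rate of usage.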

But with collections the size of the Hathi trust, this could potentially be worth exploring, particularly with substantially larger models. “Evolution” is one of the basic searches in a few bookworms: but it’s hard to use, because “evolution” means something completely different in the context of 1830s mathematics as opposed to 1870s biology. A topic model that could conceivably make a stab at segregating out just biological “evolution,” though, would be immensely useful in tracing out Darwinian changes; one that could disentangle military shooting from the interjection “shoot!” might be good at studying slang.

Above all, this might be good at finding words that migrate meanings in early uses: most new phrases actually emerge out of some early construction, but this would let us try to recover meaning through context.

Hell, it might even have an application in Prochronisms work; given a large, pre-built topic model, any new scripts could be classified against it and their words assigned to topics, and tested for their appropriateness as a topic-word combination.

Technical note: the basics of this are pretty easy with the current system: the only issue with incorporating “topic” as a metadata field on the primary browser right now is that the larger corpus it compares against would also be limited by topic. This could be solved by using the asterisk syntax that no one uses ({"*topic":[3],"*word":["fly"]} will ensure both are dropped, not just one), or by just specifying the “compare_limits” field manually.
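Here is how I read that asterisk convention, as a Python sketch: an asterisk-prefixed key applies to the search set but is dropped when building the comparison set. This is an interpretation of the note above, not Bookworm's actual implementation; the function name `split_limits` is hypothetical, and the `season` field is borrowed from the Simpsons browser purely as an example of a limit that should apply to both sides.

```python
def split_limits(limits):
    """Return (search_limits, compare_limits) from asterisk-prefixed limits."""
    # The search set uses every limit, with the asterisks stripped.
    search = {key.lstrip("*"): value for key, value in limits.items()}
    # The comparison set drops every starred key, keeping only shared limits.
    compare = {
        key: value for key, value in limits.items() if not key.startswith("*")
    }
    return search, compare
```

With `{"*topic": [3], "*word": ["fly"], "season": {"$gte": 2}}`, the search set is limited by topic, word, and season, while the comparison set is limited by season alone.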

Searching for structures in the Simpsons and everywhere else.

bmschmidt@gmail.com (Ben Schmidt) — Thu, 11 Sep 2014 21:59:21 GMT

This is a post about several different things, but maybe it’s got something for everyone. It starts with 1) some thoughts on why we want comparisons between seasons of the Simpsons, hits on 2) some previews of some yet-more-interesting Bookworm browsers out there, then 3) digs into some meaty comparisons about what changes about the Simpsons over time, before finally 4) talking about the internal story structure of the Simpsons and what these tools can tell us about narrative formalism, and maybe why I’d care.

It’s prompted by a simple question. I’ve been getting a lot of media attention for my Simpsons browser. As a result of that, I need some additional sound bites about what changes in the Simpsons. The Bookworm line charts, which remain all that most people have seen, are great for exploring individual words; but they don’t tell you _what words to look for._ This is a general problem with tools like Bookworm, Ngrams, and the like: they don’t tell you what’s interesting. (I’d argue, actually, that it’s not really a problem; we really want tools that will be useful for addressing specific questions, not tools that generate new questions.)

The platform, though, can handle those sorts of queries (particularly on a small corpus like the Simpsons) with only a bit of tweaking, most of which I’ve already done. To find interesting shifts, you need:

To be able to search without specifying words, but to get results back faceted by words;
Some metric of “interestingness” to use.

Number 1 is architecturally easy, although mildly expensive. Bookworm’s architecture has, for some time, prioritized an approach where “it’s all metadata”; that includes word counts. So just as you can group by the year of publication, you can group by the word used. Easy peasy; it takes more processing power than grouping by year, but it’s still doable.
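The “it’s all metadata” idea can be shown with a toy example in Python. The row layout and the function name `group_counts` are assumptions for illustration, not Bookworm's schema: the point is only that the same grouping function works whether the facet is a year or a word.

```python
from collections import Counter

def group_counts(rows, field):
    """Sum token counts grouped by any field, whether metadata or the word."""
    totals = Counter()
    for row in rows:
        totals[row[field]] += row["count"]
    return totals

rows = [
    {"year": 1995, "word": "school", "count": 4},
    {"year": 1995, "word": "hug", "count": 1},
    {"year": 2005, "word": "school", "count": 2},
]
by_year = group_counts(rows, "year")
by_word = group_counts(rows, "word")
```

Grouping by word is the more expensive call only because there are far more distinct words than distinct years.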

Metrics of interestingness are a notoriously hard problem; but it’s not hard to find a _partial_ solution, which is all we really need. The built-in searches for Bookworm focus on counts of words and counts of texts. The natural (and intended) uses are the built-in counts like “percentage of texts” and “words per million,” but those figures for two distinct corpora (the search set and the broader comparison set) also make it possible to calculate all sorts of other things. Some are pretty straightforward (“average text length”); but others are actual computational tools in themselves, including TF-IDF and two different forms of Dunning’s Log-Likelihood. (And those are just the cheap metrics; you could even run a full topic model and ship the results back, if that wasn’t a crazy thing to do).

So I added in, for the time being at least, a Dunning calculator as an alternate return count type to the Bookworm API. (A fancy new pandas backend makes this a lot easier than the old way.) So I can set two corpora, and compare the results of each to each.
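For readers who haven't met the statistic, here is a minimal Python sketch of one common two-term form of Dunning's log-likelihood, computed from a word's count in each corpus and each corpus's total size. It is not the Bookworm API's implementation, and it may not match either of the two forms the API offers; the function names and the sign convention (positive for the first corpus, negative for the second) are choices made for this example.

```python
import math

def dunning_g2(a, b, c, d):
    """Signed log-likelihood for a word seen `a` times in corpus 1 (size `c`)
    and `b` times in corpus 2 (size `d`)."""
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Positive: over-represented in corpus 1. Negative: in corpus 2.
    return g2 if a / c >= b / d else -g2

def top_words(counts_1, counts_2, n=10):
    """Rank words by the magnitude of their signed Dunning score."""
    c, d = sum(counts_1.values()), sum(counts_2.values())
    words = set(counts_1) | set(counts_2)
    scored = {
        w: dunning_g2(counts_1.get(w, 0), counts_2.get(w, 0), c, d) for w in words
    }
    return sorted(scored.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]
```

The sign is what makes the two-color displays possible: magnitude says how distinctive a word is, and direction says which corpus it is distinctive of.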

To plow through a bunch of different Dunning scores, some kind of visualization is useful.

Last time I looked at the Dunning formula on this blog, I found that Dunning scores are nice to look at in wordclouds. I’m as snooty about word clouds as everyone else in the field. But for representing Dunning scores, I actually think that wordclouds are one of the most space-efficient representations possible. (This is following up on how Elijah Meeks uses wordclouds for topic model glancing, and how the old MONK project used to display Dunning scores).

There aren’t a lot of other options. In the past I’ve made charts for Dunning scores as bar charts: for example, the most strongly female and the most strongly male words in negative reviews of history professors on online sites. (This is from a project I haven’t mentioned before online, I don’t think; super interesting stuff, to me at least). So “jerk,” “funny,” and “arrogant” are disproportionately present in bad reviews of men; “feminist,” “work,” and “sweet” are disproportionately present in bad reviews of women.

This is a nice and precise way to do it, but it’s a lot of real estate to take up for a few dozen words. The exact numbers for Dunning scores barely matter: there’s less harm in the oddities of wordclouds (for instance, longer words seeming more important just because of their length).

We can fit both aspects of this, the words and the directionality, by borrowing an idea that I think the old MONK website had: colorizing results by direction of bias. So here’s one that I put online recently: a comparison of language in “Deadwood” (black) and “The Wire” (red).

This is a nice comparison, I think; individual characters pop out (the Doc, Al, and Wu vs Jimmy and the Mayor); but it also captures the actual way language is used, particularly the curses HBO specializes in. (Deadwood has probably established an all-time high score on some fucking-cocksucker axis forever; but the Wire more than holds its own in the sphere of shit/motherfucker.) This is going to be a forthcoming study of profane multi-dimensional spaces, I guess.

Anyhoo. What can that tell us about the Simpsons?

Here’s what the log-likelihood plot looks like. Black are words characteristic of seasons 2-9 (the good ones); red is seasons 12-19. There’s much, much less that’s statistically different about two different 80-hour Simpsons runs than two roughly 80-hour HBO shows: that’s to be expected. And most of the differences we do find are funny things involving punctuation that have to do with how the Bookworm is put together.

But: there are a number of things that are definitely real. First is the fall away from several character names. [Smithers, Burns, Itchy _and_ Scratchy (Itchy always stays ahead), Barney, and Mayor Quimby all fall off after about season 9](http://benschmidt.org/Simpsons/#?%7B%22words_collation%22%3A%22Case_Insensitive%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22Barney%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Itchy%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Scratchy%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Quimby%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Smithers%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%5D%7D). Some more minor characters (McBain) drop away as well.

Few characters increase (Lou the cop; Duffman; Artie Ziff, though in only two episodes). Lenny peaks right around season 9; but Carl has had his best years ever recently.

We do get more, though, of some abstract words. Even though one of the first appearances was a Christmas special, “Christmas” goes up. Things are more often “awesome,” and around season 12 kids and spouses suddenly start getting called “sweetie.” (Another project would be to match this up against the writer credits and see if we could tell whether this is one writer’s tic.)

“Gay” starts showing up frequently.

Others are just bizarre: The Simpsons used the word “dumped” only once in the 1990s, and 19 times in the 2000s. This can’t mean anything (right?) but seems to be true.

What about story structure? I found myself, somehow, blathering on to one reporter about Joseph Campbell and the hero’s journey. (Full disclosure: I have never read Joseph Campbell, and everything I know about him I learned from Dan Harmon podcasts).

But those things are interesting. Here are the words most distinctive of the first act (black) and the third act (red). (I.e., minutes 2-8 vs. 17-21.)

As I said earlier, school shows up as a first-act word. (Although “screeching,” here, is clearly from descriptions of the opening credits, school remains even when you cut the time back quite a bit, so I don’t think it’s just credit appearances driving this). And there are a few more data integrity issues: elderman is not a Simpsons character, but a screenname for someone who edits Simpsons subtitles; www, Transcript, and Synchro are all unigrams about the editing process. I’ll fix these for the big movie bookworm, where possible.

That said, we can really learn something about the structural properties of fictional stories here.

Lenny is a first act character, Moe a third act one.

We begin with “school” and “birthday” “parties;”

we end with discussions of who “lied” or told the “truth,” what we “learned” (isn’t that just too good?), and, of course with a group “hug.” (Or “Hug”: the bias is so strong that both upper- and lower-case versions managed to get in). And we end with “love.”

The hero returns from his journey, having changed.

Two last points.

First, there are no discernably “middle” words I can find: comparing the middle to the front and back returns only the word “you,” which indicates greater dialogue but little else.

Second: does it matter? Can we get anything more out of the Simpsons through this kind of reading than just sitting back to watch? Usually, I’d say that it’s up to the watcher: but assuming that you take television at all seriously, I actually think the answer may be “yes.” (Particularly given whose birthday it is today). TV shows are formulaic. This can be a weakness, but if we accept them as formulaically constructed, seeing how the creators are playing around with form can make us appreciate them better, better appreciate how they make us feel, and how they work.

Murder mysteries are like this: half the fun of all the ITV British murder mysteries is predicting who will be the victim of murder number 2 about a half hour in; all the fun of Law and Order is guessing which of the four-or-so templates you’re in. Wrongful accusation? Unjust acquittal? It was the first guy all along? (And isn’t it fun when the cops come back in the second half hour?)

But the conscious play on structures themselves is often fantastic. The first clip-show episode of _Community_ is basically that; essentially no plot, but instead a weird set of riffs on the conventions the show has set for itself that verges on a deconstruction of them. One could fantasize that we’re getting to the point where the standard TV formats are about as widespread, as formulaic, and as malleable as sonata form was for Haydn and Beethoven. What made those two great in particular was their use of the expectations built into the form. Sometimes you don’t want to know how the sausage is made; but sometimes, knowing just gets you better sausage.

And it’s just purely interesting. Matt Jockers has been looking recently at novels and their repeating forms; that’s super-exciting work. The (more formulaic?) mass media genres would provide a nice counterpoint to that.

The big, 80,000 movie/TV episode browser isn’t broken down by minute yet: I’m not sure if it will be for the first release. (It would instantly become an 8-million text version, which makes it slower). But I’ll definitely be putting something together that makes act-structure possible.

Markdown, Historical Writing, and Killer Apps

bmschmidt@gmail.com (Ben Schmidt) — Fri, 05 Sep 2014 20:34:25 GMT

Like many technically inclined historians (for instance, Caleb McDaniel, Jason Heppler, and Lincoln Mullen) I find that I’ve increasingly been using the plain-text format Markdown for almost all of my writing.

The core idea of Markdown is that rather than use Microsoft Word, Scrivener, or any of the other pretty-looking tools out there, you type in “plain text” using formatting conventions that should be familiar to anyone who’s ever written or read an e-mail. (Click on Mullen’s or Heppler’s name for a better introduction than this, or see the Chronicle’s wrapup of approaches).

The benefits are many, but they’re mostly subtle:

A simple format like Markdown creates documents you’ll have no trouble reading in twenty years. I’ve been teaching a survey course this semester and had a hell of a time reading my old notes from generals which I took using EndNote; with Markdown, any web browser, text editor, or Microsoft Word descendant will have no trouble opening it.
It’s very easy to produce content that will look good in multiple media: I can make a course syllabus or personal CV that formats nicely on a website and produces a clean looking PDF at the same time.
It becomes much easier to do things to a bunch of notes at the same time: bundle them into PDFs, search through all of your notes simultaneously, and so forth.

None of these, though, are a particularly strong sell for those who use a computer instrumentally: in reality, your Microsoft Word documents aren’t about to disappear, either. And there are disadvantages to giving up Word.

Things like footnotes with a citation manager are not very easy, even for the technically competent.¹ Even footnotes without a citation manager are fairly clumsy.
The best tool for making your Markdown documents into attractive web pages, Pandoc, is not especially easy to install or configure if you don’t use the command line on a regular basis.
The core definition of Markdown is a little unclear: particularly in the last week, there have been some conflicts over the definition that will be confusing to newcomers. (Although the proposal that sparked them, “Common Markdown,” is likely to be a good thing in the long run.)

The heart of Markdown’s appeal is its flexibility: to drive any adoption outside the hard core of people, you need a killer app built off of it that solves a problem. In the technology sector, that has been Markdown’s ability to easily handle links and snippets of computer code for those writing on two widely used sites, GitHub and Stack Overflow.

Among historians, neither of those is very important. And the footnote problem is big enough that I generally wouldn’t recommend that anyone use Markdown, right now, unless they enjoy banging their head against the wall.

Lectures and Notes: the killer apps.

There are two places, though, where even historians don’t tend to use footnotes: lectures, and notes. And in both of these, Markdown makes some amazing things possible.

If there’s any reason for historians to use markdown, it’s in these two spheres. The reason I keep using Markdown is that it makes it possible for me to personally solve two problems that have driven me crazy:

Quickly making slide decks to go alongside a lecture, and borrowing and reusing chunks of slides from one talk in another;
Making heads or tails of the thousands of pictures you take while on an archival trip.

Markdown and lectures: multimedia and transposability.

First lectures. With Markdown, I’m able to write my own notes and create a slide deck at the same time. An example will help. Here’s a snippet from my lecture notes on the memory of the Civil War:

# Abolitionist memory of the war.

*Image: http://upload.wikimedia.org/wikipedia/commons/1/19/William-Tecumseh-Sherman.jpg* Caption: William Tecumseh Sherman

There's another set of people who aren't content to see it go: those who remembered the war as the period of national renewal, rebirth, and freedom. We remember World War II today as the "Good War," because we fought the Nazis and won.

But unlike WWII, Civil War actually changed the country for the better. It abolished slavery. It instituted amendments that guaranteed citizenship to every American. It promised equal protection under the law.

Memory that's particularly strong among African Americans.

They remember Sherman differently.
Sherman not as marauder but as unfulfilled promise.
Sherman, you might remember, when he finally made it to the sea issued his famous **Field Order 15**

With some ancillary code I wrote, that does two things at once: builds a slide showing the wikimedia copy of Sherman’s grizzled mug, and creates a set of notes for me under the header “Abolitionist memory of the war” to go on the paper notes I’ll read from.

Later on, I’ll write another script that will pull every phrase in boldface (like “Field Order 15”) from all my notes and put them onto a list of possible IDs for the midterm that I can hand out. Another script could strip out just the section headers and print outlines for the lectures to hand out before class.
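A minimal sketch of what those two helper scripts might look like, assuming standard `**bold**` markers and `#`-style headers. The function names `bold_phrases` and `outline` are hypothetical, not part of any existing tool:

```python
import re

def bold_phrases(markdown_text):
    """Return every **bold** phrase, in order of appearance, without duplicates."""
    seen = []
    for phrase in re.findall(r"\*\*(.+?)\*\*", markdown_text):
        if phrase not in seen:
            seen.append(phrase)
    return seen

def outline(markdown_text):
    """Return (level, title) pairs for each '#'-style header line."""
    headers = []
    for line in markdown_text.splitlines():
        match = re.match(r"(#{1,6})\s+(.*\S)", line)
        if match:
            headers.append((len(match.group(1)), match.group(2)))
    return headers

# A trimmed version of the lecture notes shown earlier.
notes = """# Abolitionist memory of the war.

They remember Sherman differently.
He issued his famous **Field Order 15**
"""

print(bold_phrases(notes))  # candidate IDs for the midterm
print(outline(notes))       # section headers for a handout
```

Because both functions only read the plain text, they can run over every lecture file in a folder without touching the slides.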

This is writing documents for multiple uses, and it can be incredibly useful. If, two minutes before class, I decide I want to switch the order I talk about the abolitionist memory of the war and the white supremacist memory of the war, I can just cut and paste the chunks of text, and all the slides associated with each will have their order switched.

Something like this could provide a really useful way to integrate and share resources, and free up some of the tedium with prepping lectures. But:

That syntax for including an image as a slide is my own, not standard Markdown. I’ve defined scripts for dropping in YouTube videos, images, captions, and some other predefined formats: but it would take a lot of work to define a set of them that make sense for anyone but me.
There are a lot of standards out there for working with HTML slides. None is winning, in part because none is anywhere as good as Keynote or Powerpoint for the average user. My code works with deck.js, one of the only HTML formats not supported by Pandoc; but there’s no obvious other standard to switch to.
Constructing slides that are more complicated than a single image with a title, or a numbered list, requires some serious HTML/CSS expertise. My scripts support that, but not in a pretty way.

Modern HTML allows some beautiful things: I can easily imagine a GUI for one of the standards that would make it easy to create slides for re-use in one of the competing platforms. But I think the standards are still evolving too rapidly in this sphere to make the way forward obvious.

Pull out the slide deck, and you still might have a useful tool here: something that generates lecture notes for me, outlines for the students/course web page, and IDs for the test prep sessions. But I think there’s something even more valuable possible for archive notes.

Markdown and the Archives: integrating notes and photos

Markdown is a great language for taking archival notes. Archives are all about hierarchy: and Markdown easily lets you tag multiple levels of headers (Series, Box, Collection, file…). But so is Microsoft Word: and there are plenty of outlining programs out there that are even better.

There are a few things that Markdown notes might do more easily than normal ones. Build a good enough web interface, and you could even click on a photo or quote in your notes and instantly get back a string that ascends the various headers to tell you where it is: Series 3a, Box 13, Folder 4, Letter on 4/18. But the place where there’s really an opportunity lies in Digital Photos.
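The header-ascending lookup could be sketched roughly as follows, assuming notes that use `#`-style headers for each archival level. `header_path` is a hypothetical helper, and the sample notes are made up for illustration:

```python
import re

def header_path(markdown_text, line_number):
    """Return the chain of headers enclosing a 0-indexed line of the notes."""
    stack = []  # (level, title) pairs, outermost first
    for index, line in enumerate(markdown_text.splitlines()):
        if index > line_number:
            break
        match = re.match(r"(#{1,6})\s+(.*\S)", line)
        if match:
            level = len(match.group(1))
            # A new header closes any open header at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
    return ", ".join(title for _, title in stack)

notes = """# Series 3a
## Box 13
### Folder 4
Letter on 4/18: a quote worth keeping.
## Box 14
### Folder 1
Another note.
"""

print(header_path(notes, 3))  # Series 3a, Box 13, Folder 4
```

A web interface would only need to know which line a clicked photo or quote sits on in order to call something like this.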

Digital cameras have completely changed historians’ relations to archives in the last 15 years. (That is, in the subset of archives where cameras are allowed). We used to take notes: now, a massive part of our archival practice involves taking pictures, which have to be sorted through on our return.

When I’m wading through boxes, I tend to type the name of the box, and then some information about each folder followed by descriptions of the documents: if it’s especially useful or especially visual, I take a picture (or a series of several pictures). I think this is pretty similar to what most people do. It means that I end up with two separate timelines to sort through when I get home. 1) A bunch of textual notes that contain my impressions of the works and the rationales for why I copied them and what they are. 2) A stream of pictures with little context but their order to patch together their origin, sometimes with a close-up of a box or folder label thrown in to help.

The tough question is: how can you insert pictures into your notes? Unless you want to physically pick up your laptop and use the webcam for your pictures, it’s not obvious what the best way would be. And if you try to put more than a couple pictures into a Word document, it will crash right away.

Unlike the systems most historians use for notes, Markdown is plain text and has an easy method for inserting multimedia. That means that you can use it to integrate your archival photos directly into your notes; and that unlike Word, it can handle hundreds of images or thumbnails with aplomb.

The last challenge is knowing which parts of your notes go with which pictures. This is a surprisingly hard thing to solve: but there’s an existing answer in a second technology much beloved by the technology industry: version control.

Version control can get complicated, but in its simplest form it’s much like a wikipedia edit history: not just the current state of a file, but every previous revision is stored in memory.

So for archival notes, we just need to save the state of your archival notes every 10 or 15 seconds; match those markers against the timestamps of the photos from a digital camera; and insert the pictures into the text just in place.
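The matching step can be sketched in a few lines, assuming you already have the save times of the notes and the capture times of the photos as seconds on a shared clock. The function name, file names, and timestamps below are hypothetical, and this is not the code from the scripts mentioned later:

```python
from bisect import bisect_right

def assign_photos(snapshot_times, photo_times):
    """Map each photo to the index of the latest snapshot saved at or before it.

    snapshot_times is a sorted list of save times in seconds; photo_times maps
    a photo name to its capture time. Photos taken before the first snapshot
    map to None.
    """
    assignments = {}
    for photo, taken_at in photo_times.items():
        position = bisect_right(snapshot_times, taken_at) - 1
        assignments[photo] = position if position >= 0 else None
    return assignments

snapshots = [100, 115, 130, 145]  # notes saved every 15 seconds
photos = {"IMG_001.jpg": 120, "IMG_002.jpg": 145, "IMG_003.jpg": 90}

print(assign_photos(snapshots, photos))
```

Once each photo has a snapshot index, comparing that snapshot with the one before it shows which lines were being typed when the picture was taken, and that is where the thumbnail belongs.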

When you want to review your notes, you just open them up in HTML format: thumbnails of every picture will appear in place, and you can click on them to get the full version.

For the technically savvy, I’ve put a set of scripts online that do just this. I use gitit to view the notes themselves so I can interlink between pages. A daemon handles the git commits: but that only works because I have always been a compulsive, several-times-a-minute saver of my documents.

What would a user-friendly platform look like?

My repo might be useful for those who are already comfortable with tools like version control: but those are the people who are already using Markdown anyway.

To make this useful for anyone else, we’d need a system with three easy, non-command line steps:

1. Installation

Puts Pandoc, Git, and a good Markdown editor on your computer at once.

2. Writing (in the archives)

This should resemble existing note taking as closely as possible: the user will need to make sure their camera’s clock is well-calibrated, but other than that it should look only like using a new text editor.

Whenever you type in the editor, it saves the files and runs git commit at close intervals. (Git experts may find the idea of automatic commits without a clear commit message cringe-inducing. Insofar as they have a point, edits should probably take place on a separate branch that is forked back into the main one periodically.)

3. Compilation (loading your pictures)

Imports photos from an sdcard or photo library, finds the version control files and matches photo times against them, and builds an html file for each document of notes.

What’s the platform?

Some of the technical components are obvious. I can’t imagine using anything other than git for version control; and though I use gitit to view files, I think that standalone html files are the only sensible way for most people to view their files. The scripting language for step three, as well, isn’t very important: I’ve used python, but anything with a set of hooks into git would work.

The big question is: what’s the text editor to be? I use emacs, and get the impression that most people writing in Markdown are using vim. Both of these are clearly bad choices for the ordinary historian. For all that Markdown can be written in any editor, the writing function also must support auto-save and auto-git-commit, so anything without a scripting interface is out. SublimeText has its selling points, but free’s probably the way to go.

That means, unless I’m missing a central player in the ecosystem, that the natural choice is the new Atom editor from Github. But perhaps there’s a more lightweight alternative?

Platform will also be an issue. The Mac is the obvious platform to capture a majority of historians: but a surprising number of people seem to take their notes with an iPad-keyboard array, which would call the whole stack into question.

Infrastructure

So that’s the proposal. Once historians see how great Markdown is for notes, maybe they’ll think about it for lectures; once they use it for lectures, maybe the footnote ecosystem will start to improve, and we’ll finally be able to distribute historical papers as text, making them more portable, more easily structured, and more lasting.

So, anyone want to try?

It took me a few hours of mucking about in Emacs Lisp to make inserting a link to something in my Zotero library almost as easy as it is under Microsoft Word; and if you want to configure the core behavior of Pandoc, it’s best to use Haskell. Even the “programming historian” may not have heard of either of these languages. Both (well, at least Haskell) have their strengths: but suffice it to say that neither has ever been anyone’s answer to the question “If I should only learn one computer language, which should it be?”↩︎

The Simpsons Bookworm

bmschmidt@gmail.com (Ben Schmidt) — Fri, 29 Aug 2014 19:38:24 GMT

I thought it would be worth documenting the difficulty (or lack of) in building a Bookworm on a small corpus: I’ve been reading too much lately about the Simpsons thanks to the FX marathon, so figured I’d spend a couple hours making it possible to check for changing language in the longest running TV show of all time.

For some thoughts on how to build a bookworm, read “prep”: otherwise, skip to analysis. Or just head over to the browser.

Prep

Step one is getting the texts. This is easy enough here, something I know how to do from all my Prochronisms posts: I can just use the subtitles, which are available a batch at a time. The only challenge is deciding what to do with audio-effects subtitles. I’m deciding to download the files that include them where necessary, but probably disable them by default. I also end up with only 540-something episodes, about ten short of the complete run: rather than try to figure that out at the start, I’m going to let the Bookworm data visualizations themselves be the clue to what I’m missing.

Next up is choosing what a “text” will be. The obvious choice would be for each episode to be a single text: but 550 episodes, while it’s a lot to watch, doesn’t give many angles for analysis. My second idea is that it might be interesting to look at a really granular level: ideally, we’d be able to compare the first, second, and third acts. That info isn’t in the subtitles, but we can split up by lines of speech: later on, we’ll be able to aggregate the queries to look in just the first hundred lines, or the first third, or whatever. The only downside is that it dramatically increases the number of texts: but that’s not really a huge problem.

That also makes it easy to decide what I’ll display in the search results: the individual line from the script containing the word.

Next step is to parse into bookworm format. Since these are in SRT format, it’s not as easy as it could be: I’m looking to create indexes that are episode-season-line. To get the season and episode names, I write out some regular expressions that match the various different filenames. This is one of the uglier parts, and where I actually spend the most time. The final parsing code uses a whole bunch of regexes to handle the different formats people use: “S04E20”, “[1.3],” and so forth. One batch doesn’t have season numbers at all: I’ll have to fix that later.

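import re

def parseFilename(string):
    # Return a (season, episode) pair of strings parsed from a subtitle filename.
    form1 = r"[sS](\d\d?)[eE](\d\d?)"  # e.g. "S04E20"
    form2 = r"(\d\d?)x(\d\d?)"         # e.g. "4x20"
    form3 = r"\[(\d\d?)\.(\d\d?)"      # e.g. "[1.3]"

    for regexp in [form1, form2, form3]:
        matches = re.findall(regexp, string)
        if len(matches) > 0:
            return matches[0]

    # Fallback for the batch with no season number: leave the season blank.
    return ("", re.sub(r".*Episode (\d\d).*", r"\1", string))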

Next is actually parsing the text, and adding some new information to it about the position of each line. This is usually the hardest part, but SRT parsing is pretty easy as these things go. Plus, nailing down the format leads me to an insight: rather than use line number, I can take the embedded time information in the SRT files and index by the minute and second in the episode that a subtitle flashes on the screen. Each subtitle block will correspond to a file, and we’ll know the exact moment it appeared. Turns out there are about 200,000 of those in the series, which is a reasonable number of texts to include in a Bookworm. (Though if I were hypothetically to do this for a whole bunch of TV series (more than a couple hundred) at the same time, that might push the system’s limits.) Parsing out the SRT time information works well. We’re left with some straggling sound effects, which I’m just leaving in for the time being. Occasionally character names appear at the front of texts: again, that’s something I’d correct if this were a weekend project rather than a weeknight one.
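For readers who haven’t seen the format, here is a rough sketch of pulling the minute and second out of SRT blocks. It assumes the usual layout of a counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, and then the text; `parse_srt` and the sample subtitles are illustrative, not the code used for this Bookworm:

```python
import re

# Matches the start time on an SRT timing line, e.g. "00:01:05,250 -->".
TIMING = re.compile(r"(\d+):(\d\d):(\d\d)[,.]\d+\s*-->")

def parse_srt(srt_text):
    """Yield (minute, second, text) for each subtitle block in an SRT string."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                hours, minutes, seconds = (int(g) for g in match.groups())
                text = " ".join(part.strip() for part in lines[i + 1:])
                # Fold hours into minutes so the index is minute-in-episode.
                yield (hours * 60 + minutes, seconds, text)
                break

sample = """1
00:00:12,500 --> 00:00:14,000
Sample first line.

2
00:01:05,250 --> 00:01:07,000
A second line
that wraps.
"""

for row in parse_srt(sample):
    print(row)
```

Blocks without a recognizable timing line are silently skipped, which is a design choice you may want to reverse while debugging a messy batch of files.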

That means the final scheme will give us, for each subtitle block:

Season Number
Episode number in the season
Episode number in the series (will make some plots easier).
Minute in the episode
Second in the episode
The actual text of the block.

From that information, if we were true Simpsons scholars, we could easily add:

Act (roughly: call minutes 0-7 act 1, minutes 8-14 act 2, and minutes 15 to the end act 3)
Air date, episode director, and other information easily linkable from IMDB.
Whether it’s a finale or what.
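The rough act boundaries in that list translate directly into code. A tiny sketch, with a hypothetical function name:

```python
def act_for_minute(minute):
    """Assign a rough act number from the minute a subtitle appears."""
    if minute <= 7:
        return 1
    if minute <= 14:
        return 2
    return 3

print([act_for_minute(m) for m in (0, 7, 8, 14, 15, 21)])  # [1, 1, 2, 2, 3, 3]
```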

Once the text is parsed, the file-creation is pretty easy, we’re ready to ingest. The input.txt file is just the text and an id number constructed from the moment the block appears on screen: the jsoncatalog.txt is just a dump of an object that’s useful for processing, anyway.

I’ve already written a specialized makefile for my Federalist papers bookworm to clone the Bookworm repo and put files in the right place, so that’s easily adapted.

And then we’ve got it! I didn’t designate any fields as “time,” so a first inspection will be easier using the D3 browser.

The first test is to find out about those pesky missing episodes. So I’ll plot a heatmap of the number of words for each episode (x axis) and season (y axis):

This shows that we’ve got about 25 episodes per season, but: we’ve got a season 0 and no season 1 (that one set of srts that didn’t give a season, no doubt); we’ve got no seasons 16 and 17; and, curiously, most season 6 episodes are twice as long as they should be. Probably season 16 was mislabeled season 6, and we’re actually missing season 17. We’re also missing the first 9 episodes of season 21, and the first two of season 22. Oh well. Something to catch on a next run.

Analysis

The beta lets us quickly check out some other things, like the number of words (color) by *minute* (y axis) and season (x): you can see commercial creep, as sometime around season 14 we lose most of minute 21.

OK: let’s check the actual words. Here are uses of each of the central four characters: season on the x axis, unigram on the y axis.

Nothing too suspicious here: the shift from Bart to Homer looks good, etc.

Just trying some line charts: yep, Maude only gets mentioned much by name around the season she dies:

But what’s really interesting, maybe, isn’t the season-to-season change but the internal episode structure. For instance, at what minute in the episodes do characters talk about “school?”

That’s pretty interesting, actually: pretty much every minute, the plots seem to shift away from school.

Likewise, “I’m Kent Brockman” seems to be overwhelmingly a gag from the opening scene:

OK, that’s enough: here’s the link to the Bookworm, and here’s the source code.

Finding the best ordering for states

bmschmidt@gmail.com (Ben Schmidt) — Thu, 05 Jun 2014 14:36:07 GMT

Here’s a very technical, but kind of fun, problem: what’s the optimal order for a list of geographical elements, like the states of the USA?

If you’re just here from the future, and don’t care about the details, here’s my favorite answer right now:

["HI","AK","WA","OR","CA","AZ","NM","CO","WY","UT","NV","ID","MT","ND","SD","NE","KS","IA","MN","MO","OH","MI","IN","IL","WI","OK","AR","TX","LA","MS","AL","TN","KY","GA","FL","SC","WV","NC","VA","MD","DE","PA","NJ","NY","CT","RI","MA","NH","VT","ME"]

But why would you want an ordering at all? Here’s an example. In the baby name bookworm, if you search for a name, you can see the interaction of states and years. Let’s choose “Kevin,” because it played such a role in my anachronism-hunting piece on Lincoln.

Clearly the name took off around the start of the baby boom. But is there a geographical pattern? It’s very hard to say. It does look like the red names begin around 1955 in much of the country. But in a few, it’s not until the early 1970s. Which ones? Alabama, Georgia, North Carolina, South Carolina. That is, after a substantial amount of reading back over to the axis, it’s clear that most of those are southern states. But this is the sort of insight that should be immediately obvious. And there may be other connections we’re missing out on. The whole point of data visualization over tables is that you can pick out patterns using faster forms of cognition: requiring you to push over to the left to read off the names is a major loss.

Alphabetical order makes it easy to find any individual state (assuming you know its name) but hard to see the way related states move with each other. It means that to trace out regional variations over time, we tend to animate maps: but using time as the proxy for time makes cross-temporal comparisons much harder to tell. As Tufte says, comparisons should be enforced across the eyespan: relying on animation to trace out common names is a big problem. So there’s a dramatic interest in seeing different names pop up in (for instance) Reuben Fischer-Baum’s animation of baby names; but you have to watch the whole thing to think through questions like “what regions tend to adopt names early?” or “what’s the name that stays on top for the longest?”

Putting it all into X-Y makes these questions easier. But that means we need to map states to X or to Y. Alphabetical order means that states are not arranged in a way that places states near others like them.

So how could we make the states usefully arranged? We need some dimensionality reduction.

Linear reductions

One obvious way would be east-to-west or north-to-south: that starts out quite well, with all of New England:

ME MA RI NH VT CT NY NJ PA DE MD DC VA NC SC WV OH FL GA MI KY IN AL TN IL WI MS MO LA AR MN IA TX KS NE OK SD ND WY CO NM MT UT AZ ID NV CA OR WA AK HI

But quickly falls apart with Ohio, Florida, Georgia, and Michigan in immediate succession. If we plot the states, you can quickly see why. Rather than list orders, I’m going to show them as paths through a map: here’s what that looks like in this case.

(By the way, you can see that the points are a little arbitrary: I’ve taken the first geonames hit for the state, which is sometimes the capital, sometimes the state centroid, and sometimes the most important city. Ideally I’d be using the population-weighted centroid, but in some ways I kind of like the results that come out of this.)

There are some other possibilities for linear dimensionality reduction (principal components comes to mind) but they’ll have the same fundamental problem. We want a metric that takes proximity more fully into account. Even non-metric multi-dimensional scaling fails: it handles a couple cases better (Jackson and St. Louis are in a more sensible order, for instance), but it still jumps erratically up and down, preventing any larger groups like “the south” from coming into sight:

Hierarchical clustering approaches

One possible approach, suggested to me by Miriam Huntley, is hierarchical clustering: using distances, we can cluster the states by proximity. Here’s the initial result of that:

The individual groups are quite nice (New England is there, plus New York at the end), and every state is adjacent to an immediate neighbor. And while the groups have geographical coherence, they aren’t exactly the regions we know and love: the “mid-atlantic” runs down to South Carolina, and the midwest includes the gulf coast all the way to Tallahassee. The connections between the groups are scattered. Florida is next to Pennsylvania, and South Carolina to Massachusetts. Seen as a path, the weirdness of this is clear:

Leaf ordering in dendrograms is arbitrary, however, and we can do better than this. Using a method developed by Bar-Joseph et al, and implemented in the “cba” library for R, we can reorder the dendrogram so that groups stay the same, but the leaves are ordered so that transitions from one group to the next are maintained.

Now, the path looks considerably better:

The clusters remain adjacent, but now the transitions are so smooth that it’s not obvious where one begins and the other ends. Instead, we get a serpentine path through the states that both ensures every path is between two adjacent states, and keeps paths generally inside the same region.

Network approaches

Can we do better? The strategy of plotting these as paths suggests that maybe this is an instance of the traveling salesperson problem, in which we want to travel through all the states minimizing the distance traveled. Why shouldn’t the “best” solution simply be the one where the overall sum of distances is the least?

Inserting a dummy node as start- and end-point lets us view that: using the best method found by the “TSP” package in R (which is not guaranteed to be the optimal solution, since the traveling salesman is a notoriously difficult problem to solve), we get quite a different path:

Rather than start in Maine, this route begins in Tennessee! After winding through the Midwest to West Virginia, it leaps to Vermont and then takes a beautifully practical course down the Eastern seaboard through Texas, through the great plains, and then takes up nearly an east-to-west ordering through the Mountain and Western time zones. While many of the regional choices here look better to me than the dendrogram solution (particularly the coherence of the south), the distance-optimizing strategy means that there are a few nearby states that have nothing in common: the leap from New Mexico to Montana, for example, and the extremely strange choice to place Washington DC between West Virginia and Vermont, ten nodes removed from either Maryland or Virginia, the closest geographical points. (In fact, I think the route could be improved by heading straight to Vermont from WV and putting DC in its rightful place: but it says something that out of the 7 algorithms in the free version of the TSP package, none was able to improve on this route.)
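To make the idea concrete, here is a small pure-Python sketch of one common heuristic for an open path: a nearest-neighbor tour improved by 2-opt segment reversals. This is a stand-in for illustration, not the algorithm the R “TSP” package chose, and the four points are toy data rather than state locations:

```python
import math

def path_length(points, order):
    """Total straight-line length of a path visiting names in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def nearest_neighbor(points, start):
    """Greedy path: always step to the closest unvisited point."""
    unvisited = set(points) - {start}
    order = [start]
    while unvisited:
        here = points[order[-1]]
        nxt = min(sorted(unvisited), key=lambda name: math.dist(here, points[name]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def two_opt(points, order):
    """Reverse segments of the path for as long as doing so shortens it."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(points, candidate) < path_length(points, best) - 1e-12:
                    best, improved = candidate, True
    return best

toy = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0)}
greedy = nearest_neighbor(toy, "B")
better = two_opt(toy, greedy)
print(greedy, path_length(toy, greedy))
print(better, path_length(toy, better))
```

Like any heuristic for this problem, the result is only a local optimum: a different starting point can produce a different, and sometimes shorter, path.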

Fractal Curves

Another option is not to minimize travel distance but to maximize the likelihood that two points will be next to each other. That suggests filling the geographic region with some kind of fractal curve, and then positioning each state along the curve.

This is an appealing way to think of arranging the country linearly: not as a network, but as an iterable set of points. For just the United States, we could use some already-existing curve path. The most widespread linear mapping of points is the Zip code system: Samuel Arbesman has written about this on Wired, and includes a link to Robert Kosara’s ZipScribble maps. Here’s Kosara’s idea with a few minor changes (I use a rainbow spectrum, rather than coloring each state separately, and an Albers projection. And it appears that the zip database I have handy has something weird going on in southwest Georgia.)

Space-filling curves

The ZIP system isn’t especially logical, but there should be something similar that’s better. My first thought for this problem, which prompted the whole post, was to use a Hilbert Curve. It turns out that Kosara has mapped that approach onto the Zip dataset.

Using just the state points, it’s possible to draw a Hilbert curve that covers the continental United States, and then visit each state at the moment it’s closest to the curve. The actual path taken can then be simplified down to eliminate the intervening states. Here’s what that looks like, with both the Hilbert curve and the simplified route. I’ve shaded the Hilbert curve using a double rainbow so it’s easier to trace from its origin near the Bahamas (first making shore near South Carolina) to its exit off the coast of Los Angeles.

I’m disappointed by the performance here. While there is some regional coherence (the stretch from Wisconsin to Kansas is well done, and the first jumps through the South are acceptable), the square binning forces some rather strange choices: the odd jag down to North Carolina, the detour to Colorado and Wyoming.

There are other issues as well. Hilbert curves work best in square spaces, and the patches of ocean/Canada/Mexico that get filled are pretty far off limits. While I don’t show Alaska and Hawaii, for the other algorithms they’ve simply been tacked on at the end in a reasonable manner: here, though, a solution that includes Alaska and Hawaii makes some significant changes to the full arrangement and vastly increases the percentage of empty space, which tends to introduce odd decisions (like interposing Alaska between Oregon and Nevada.)

I suspect there are ways of optimizing the Hilbert curve, or some similar fractal path, so that it better maps onto actual geographic spaces. That seems like an interesting avenue, potentially: but the initial results here seem worse, not better, than traveling salesman approximations.
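For anyone who wants to experiment, a simplified version of the Hilbert ordering can be sketched in pure Python. Instead of finding where each state comes closest to the drawn curve, this variant snaps each point to a cell on an n-by-n grid (n must be a power of two) and sorts by the cell’s position along the curve, using the standard coordinate-to-index conversion. The function names and the four toy points are assumptions for illustration:

```python
def hilbert_index(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) on an n-by-n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def order_by_hilbert(points, n=16):
    """Sort named (x, y) points by the Hilbert index of their grid cell."""
    xs = [x for x, _ in points.values()]
    ys = [y for _, y in points.values()]
    span_x = (max(xs) - min(xs)) or 1
    span_y = (max(ys) - min(ys)) or 1

    def cell(name):
        x, y = points[name]
        gx = round((x - min(xs)) / span_x * (n - 1))
        gy = round((y - min(ys)) / span_y * (n - 1))
        return hilbert_index(n, gx, gy)

    return sorted(points, key=lambda name: (cell(name), name))

corners = {"SW": (0, 0), "NW": (0, 1), "NE": (1, 1), "SE": (1, 0)}
print(order_by_hilbert(corners, n=2))  # ['SW', 'NW', 'NE', 'SE']
```

Stretching the bounding box to a square, as this sketch does, is exactly the kind of distortion that makes the square binning awkward for a country that is wider than it is tall.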

Conclusions and Deus ex Machina

So on this particular set, the best results seem to come from, in descending order,

Reordered hierarchical clustering
Traveling Salesperson solutions
Fractal Curves
Quasi-linear dimensionality reduction (east-to-west, multi-dimensional scaling, etc).

For the general problem (European countries, say, or counties in a state) I’d probably start with reordered hierarchical clustering or TSP solutions, at least until I learn how better to fit a fractal curve to an arbitrary space.

But for this particular problem, I’ve got an ace in the hole: there _are_ conventional orderings of states that provide an acid test. In particular, we want something that matches to census regions.

The ordering inside census regions is arbitrary, just like our clustering diagrams. So the best possible solution that includes some knowledge about the intrinsically _real_ regions of the United States (the midwest, the south, etc.) is to combine the census regions with the optimal-dendrogram measures.

Putting phony clusters just from the census regions looks like this:

I can just plug those into a dummy distance matrix so that group membership trumps any other sorts of distance: and then allow geographical distance to sort out the spinning of those trees into a more sensible order.
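One way such a dummy matrix might be built, sketched here under the assumption that a large fixed penalty is added to every cross-group pair so that group membership always outweighs geography. The function name, the `penalty` default, and the toy points are hypothetical:

```python
import math

def constrained_distances(points, groups, penalty=None):
    """Pairwise distances, inflated whenever two points are in different groups."""
    names = sorted(points)
    raw = {(a, b): math.dist(points[a], points[b]) for a in names for b in names}
    if penalty is None:
        # Assumption: ten times the largest real distance is "large enough".
        penalty = 10 * max(raw.values())
    return {
        pair: dist + (0 if groups[pair[0]] == groups[pair[1]] else penalty)
        for pair, dist in raw.items()
    }

toy_points = {"A": (0, 0), "B": (1, 0), "C": (5, 0)}
toy_groups = {"A": 1, "B": 2, "C": 1}

d = constrained_distances(toy_points, toy_groups)
print(d[("A", "B")], d[("A", "C")])  # B is nearer on the map, but C is "closer"
```

Any clustering run on this matrix merges within groups first, and the untouched within-group distances are still free to decide how the leaves are ordered.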

So, adding the constraint that census divisions and regions be kept intact, the optimal ordering looks like this: starting in Maine, traveling through the South west to Texas, skipping to the upper Midwest and then taking the same route west through the plains and mountains as the dendrogram:

Is this the perfect ordering? To my mind, it’s not: but the flaws come straight from the census, not from the algorithm. West Virginia should not be in the coastal south, it should be in the same division as Kentucky; the leap from Oklahoma to Wisconsin is unfortunate, and so is the one from Florida to Kentucky. Still, the census regions constraint is quite nice to have. And unlike the unguided paths, it preserves all but one of what I intuitively think of as the essential pairings: the Dakotas, the Carolinas, Alabama-Mississippi, Vermont-New Hampshire, Kansas-Nebraska, Colorado-Wyoming.

So, let’s return to the original visualization to see what this new ordering helps us see. Remember, this original version revealed only with some serious axis-reading that the South started using “Kevin” later.

Here it is with the census-based ordering. The southern states, two-thirds of the way down the page, clearly do begin later: but now it’s also immediately evident which of them _don’t_ lag as much. There are also several patterns that are immediately evident which remain completely obscure in an alphabetical ordering: usage of “Kevin” is significantly higher around 1990 in the northeast, particularly the mid-Atlantic, than it is in the rest of the country. And while the South waits the longest, a lag in the Arizona-New Mexico pairing is also clear.

This style of display also makes subtler patterns visible. “Jennifer,” for example, rises a year later in the South than elsewhere. That would be lost as visual noise in an alphabetical ordering, but is completely clear here.

Is a geographical ordering the best? Not always. Take “Madison”: its rise shows striped bands that don’t seem to be regional. Illinois, New Jersey, Washington DC, and New Mexico all avoid the wave. In fact, if you look closer, this is clearly a racial thing: “Madison” was most popular in states with overwhelmingly white populations. (Except Wisconsin, it seems). And aside from the bend through the southwest, there aren’t a whole lot of largely-minority states in any contiguous curve.

But on another level, that just points out more the usefulness of _some_ sensible ordering to start with.

Bleg 1: String Distance

bmschmidt@gmail.com (Ben Schmidt) — Thu, 27 Mar 2014 21:44:24 GMT

String distance measurements are useful for cleaning up the sort of messy data that comes from multiple sources.

There are a bunch of string distance algorithms, which usually rely on some form of calculations about the similarities of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalizes changes to the most information-laden parts of the text more heavily than to the parts that are filler.

Real-world example: say you’re trying to match two lists of universities to each other. In one you have:

[500 university names…]

Rutgers the State University of New Jersey

and in the other you have:

[499 university names…]

Rutgers University

New Hampshire State University

By most string distance measures, ‘State University’ and ‘New’ will make the long version of Rutgers match New Hampshire State, not Rutgers. But in the context of those 500 other names, that’s not the correct match to make. The phrase “State University” actually conveys very little information (I’d guess fewer than 8 bits), but “R-u-t-g-e-r-s” are characters you should lose lots of points for changing (rough guess, 14 bits).

In practice, I often get around this by changing the string vocabulary by hand (changing all occurrences of “University” to “Uni”, etc.). I can imagine a few ways to solve this: e.g., normalized compression distance starting from a file of everything, or calculating a standard string distance metric on a compressed version of names instead of the English version. But I feel like this must exist, and my Internet searches just won’t find it.
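To illustrate the kind of weighting being asked for, here is one crude possibility, offered as a sketch rather than an answer to the bleg: score whole words instead of characters, and weight each word by log2(N / number of names containing it), so that a word appearing in every name is worth zero bits. The function names and the six-name toy list are assumptions, and words unseen in the list are simply given zero weight:

```python
import math
from collections import Counter

def tokens(name):
    """Lowercased set of whitespace-separated words in a name."""
    return set(name.lower().split())

def token_weights(names):
    """Weight each word by log2(N / names containing it): rarer means more bits."""
    counts = Counter(tok for name in names for tok in tokens(name))
    return {tok: math.log2(len(names) / n) for tok, n in counts.items()}

def similarity(a, b, weights):
    """Weighted overlap: bits shared by both names over bits in either name."""
    shared = tokens(a) & tokens(b)
    either = tokens(a) | tokens(b)
    total = sum(weights.get(t, 0.0) for t in either)
    return sum(weights.get(t, 0.0) for t in shared) / total if total else 0.0

def best_match(query, candidates, weights):
    return max(candidates, key=lambda c: similarity(query, c, weights))

# Toy stand-in for the two lists of 500 names.
all_names = [
    "Rutgers University",
    "New Hampshire State University",
    "Ohio State University",
    "New Mexico State University",
    "University of New Hampshire",
    "Rutgers the State University of New Jersey",
]
weights = token_weights(all_names)
query = "Rutgers the State University of New Jersey"
print(best_match(query, ["Rutgers University", "New Hampshire State University"], weights))
```

On this toy list the weighted score picks “Rutgers University”, while an unweighted word overlap would pick the New Hampshire entry, since it shares three words with the query rather than two. Misspellings inside a word are not handled at all here; that would need a character-level measure layered on top.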