<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Ben Schmidt's Blog</title>
        <link>https://benschmidt.org/post</link>
        <description>Posts and updates. Fun with a porpoise.</description>
        <lastBuildDate>Mon, 23 Dec 2024 00:46:07 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>All rights reserved 2024, Ben Schmidt</copyright>
        <category>History</category>
        <category>Programming</category>
        <category>Digital Humanities</category>
        <category>Data Analysis</category>
        <category>Data Visualization</category>
        <item>
            <title><![CDATA[Election history and state ordering]]></title>
            <link>https://benschmidt.org/post/2024-11-24-stateorder</link>
            <guid>https://benschmidt.org/post/2024-11-24-stateorder</guid>
            <pubDate>Sun, 24 Nov 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[I was talking on Bluesky[^1] about why I dislike the widespread use of alphabetical ordering for states on the 
y-axis of charts. There are better ways! My favorite is detailed in [this notebook](https://observablehq.com/@bmschmidt/useful-linear-orders-for-countries-and-states#linear_us_state_order), where I talk through some methods for treating paths. 
I have an interactive tool for building out paths like this one, which is a decent way to order all the countries in the world for data visualization.

![Map of the world with a red line moving through countries representing linear order.](image.png)

I realized after posting that I should check some of the old 19th century census atlases I wrote about for [Creating Data](https://creatingdata.us); and indeed, Henry Gannett did use a nice linear ordering for representing states in the rare cases that he couldn't use bars arranged by frequencies. These are grouped by what we'd now call (but I don't *think* were in 1890) "census regions." But there are some ugly transitions, like the jump from Florida to Ohio. ([You can see the full image at the Library of Congress](https://www.loc.gov/resource/g3701gm.gct00010/?sp=36).)

![1890 image of pie charts by state, in the order ME, NH, VT, MA, RI, CT, NY, NJ, etc.](https://tile.loc.gov/image-services/iiif/service:gmd:gmd370m:g3701m:g3701gm:gct00010:ca000036/887,385,3831,2111/766,/0/default.jpg)

My preferred ordering, below, snakes from Hawaii to Maine in an 
order that respects census regions; almost all the state jumps are
contiguous, and where they aren't (like the jump from Florida to South Carolina)
they're close.

{.details summary="Expand for a JSON list of states"}
:::
```
["Pacific Ocean","Guam","GU","Hawaii","HI","Alaska",
"AK","Lower 48","West","West Coast","Pacific Northwest",
"WA","Washington","OR","Oregon","CA","California","Southwest",
"AZ","Arizona","NM","New Mexico","NV","Nevada","CO",
"Colorado","WY","Wyoming","UT","Utah","ID","Idaho","MT",
"Montana","Midwest","Dakotas","ND","North Dakota","SD",
"South Dakota","NE","Nebraska","KS","Kansas","IA","Iowa",
"MN","Minnesota","WI","Wisconsin","MI","Michigan","OH",
"Ohio","IN","Indiana","IL","Illinois","South","MO",
"Missouri","OK","Oklahoma","TX","Texas","LA","Louisiana",
"AR","Arkansas","KY","Kentucky","TN","Tennessee","MS",
"Mississippi","AL","Alabama","GA","Georgia","FL","Florida",
"SC","South Carolina","NC","North Carolina","WV",
"West Virginia","VA","Virginia","Northeast","Mid-Atlantic",
"DC","MD","Maryland","DE","Delaware","PA","Pennsylvania",
"NJ","New Jersey","NY","New York","New England",
"Southern New England","CT","Connecticut","RI",
"Rhode Island","MA","Massachusetts","Northern New England",
"NH","New Hampshire","VT","Vermont","ME","Maine",
"Carribean","Puerto Rico","PR","US Virgin Islands"]
```
:::
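
As a minimal sketch of how a list like this can be put to work (an assumption about usage, not the code behind the charts): look up each label's position in the list and sort by that rank. `GEOGRAPHIC_ORDER` holds only a handful of abbreviations excerpted from the list above, and `sort_geographically` is a hypothetical helper name.

```python
# Excerpt of the list above, abbreviations only, in the same relative order.
GEOGRAPHIC_ORDER = ["HI", "AK", "WA", "OR", "CA", "FL", "SC", "NC", "NY", "ME"]

# Map each label to its position so lookups are cheap.
RANK = {label: i for i, label in enumerate(GEOGRAPHIC_ORDER)}


def sort_geographically(states):
    """Sort state abbreviations by position in GEOGRAPHIC_ORDER.

    Labels missing from the list are placed at the end, alphabetically.
    """
    return sorted(states, key=lambda s: (RANK.get(s, len(RANK)), s))


alphabetical = sorted(["ME", "CA", "FL", "NY", "HI", "SC"])
print(alphabetical)                       # ['CA', 'FL', 'HI', 'ME', 'NY', 'SC']
print(sort_geographically(alphabetical))  # ['HI', 'CA', 'FL', 'SC', 'NY', 'ME']
```

Because the full list interleaves region names, abbreviations, and full state names, the same lookup works whichever label a dataset happens to use.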

My favorite example of this layout shows changes in political polarization in the United States
by election. The Democratic "solid south" from 1880 to 1964 is clearly present,[^2] as are the newer
Republican solid regions in the intermountain west and northern great plains and
smaller phenomena like Bill Clinton's two-time carve out of Louisiana, Arkansas, Tennessee, and Kentucky.

In this chart states are colored by their lean *relative to the country*; this smooths out the general 
year-to-year swings where Biden does better than Clinton does worse than Obama to show the overall geography.
The story of the 2024 election in this view is that the two great Republican regions are chipping 
away at the holdouts in the middle: Iowa and Ohio flipped from swingy to solid red in 2016, and Wisconsin/Michigan/Minnesota 
have continued to shift slightly more towards the middle.
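
The arithmetic behind "lean relative to the country" can be sketched as follows, assuming lean is measured as the state's two-party margin minus the national margin (the chart itself may define it differently). The function names are hypothetical and the vote counts are made-up placeholders, not real returns.

```python
def margin(dem_votes, rep_votes):
    """Democratic minus Republican share of the two-party vote, in points."""
    total = dem_votes + rep_votes
    return 100 * (dem_votes - rep_votes) / total


def relative_lean(state_margin, national_margin):
    """Positive means more Democratic than the country as a whole."""
    return state_margin - national_margin


# Placeholder numbers for illustration only.
national = margin(dem_votes=52, rep_votes=48)  # country: D+4
state = margin(dem_votes=45, rep_votes=55)     # state: R+10
print(relative_lean(state, national))          # -14.0, i.e. 14 points more Republican
```

Subtracting the national margin is what removes the year-to-year swing: a state that moves exactly with the country keeps the same relative lean.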

``` component
name: VoteHistory
args:
  order: "geography"
  variable: "Lean relative to country"
```

Here, by contrast, is the same visualization ordered alphabetically. There is nothing to see.

``` component
name: VoteHistory
args:
  order: "alphabetical order"
  variable: "Lean relative to country"
```

For an interactive version of this chart, where you can toggle between the two orders
or simply see the winner, see [here](/poli/presidential-votes/).

[^1]: Assuming that you're already aware that Bluesky is taking off.

[^2]: Although, to be honest, the South is only really solid through 
  1944; from the time Strom Thurmond broke the Dixiecrats off in 1948,
  there were continuous cracks in the south from every angle. This
  is the sort of thing that you can see with this kind of view!
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[History hiring, 2023 update]]></title>
            <link>https://benschmidt.org/post/2023-12-26-history-hiring</link>
            <guid>https://benschmidt.org/post/2023-12-26-history-hiring</guid>
            <pubDate>Tue, 26 Dec 2023 17:36:45 GMT</pubDate>
            <content:encoded><![CDATA[Although I've given up on historically professing myself, I still have a
number of automated scripts for analyzing the state of the historical
profession hanging around. Since a number of people have asked for
updates, it seems worth doing. [As a
reminder](https://benschmidt.org/post/2020-10-01-jobs-update/), I'm
scraping H-Net for listings. When I've looked at job ads from the
American Historical Association's website, they seem roughly comparable.

The bottom line is: 2023 is shaping up to be one of the worst years for
hiring of new history professors yet. The worst year ever, of course,
was the pandemic-saddled 2020. The 2009 recession year, which at the
time felt calamitous, was actually in the general range of most years
since 2014; for me the takeaway remains that while the 2000s were
obviously salad days of incredible abundance when tenure track jobs were
awarded like candy[^1], even the early 2010s were far better than
recent years.

After the 2020 collapse, the question became: would any rebound in the
market be permanent, or temporary? 2021 and 2022 were both relatively
strong years on the market, by the standards of recent years; and while
the 2021 market was concentrated in a few modernist fields, 2022 saw a
rebound even in early modern and medieval hiring and in the history of
Europe, two fields that were starting to be left for dead.[^2]

The answer, it seems, is temporary. After a respectable start to the
season, H-Net listings in November and December were terrible. There
have been just over 350 tenure-track jobs listed this year, compared to
400 listed before Christmas the two years prior. [^3]

:::
![A chart with lines showing hiring patterns for TT jobs in history. All
years 2000-2008 are twice as high as 2014-2022; 2023 is at the low end
of the 2014-2022 band, while both 2021 and 2022 are towards the
top.](image.png)

{.caption}
:::
A chart with lines showing hiring patterns for TT jobs in history. All
years 2000-2008 are twice as high as 2014-2022; 2023 is at the low end
of the 2014-2022 band, while both 2021 and 2022 are towards the top.
:::
:::

Notably, there is no pattern in terms of subfields, except for
_possibly_ a tick up in hiring in history of science. Interdisciplinary
fields, African American history, US history, Asian history; all are
down by roughly equal shares. There have been times when it seemed like
any field's gains might be at the expense of others: now all are lower
than they were last year, and lower than they were in the old days.

:::
![A barchart by date of various fields, showing information described in
the previous paragraph](image-1.png)

{.caption}
:::
A barchart by date of various fields, showing information described in
the previous paragraph
:::
:::

[^1]: I kid. Sort of? But also if you have a PhD awarded before 2013 it
  would take a great deal of chutzpah to run, like, a placement
  workshop. Find someone younger, geezer.

[^2]: I say this, in part, to push back on a story told by Leland
  Grigoli in the [AHA's 2023 jobs
  report](https://www.historians.org/ahajobsreport2023), which focused
  heavily on the "relative (and absolute) dearth of jobs for
  premodernists." While this warning might have been useful in the
  2022 report, the pendulum seems to be swinging back away since; and
  cross-field recriminations have been heavy.

[^3]: There were 973 PhDs awarded by U.S. departments in 2022.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[20 Million PubMed abstracts in the Browser]]></title>
            <link>https://benschmidt.org/post/2023-05-11-humanities-tips</link>
            <guid>https://benschmidt.org/post/2023-05-11-humanities-tips</guid>
            <pubDate>Thu, 11 May 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Sorry, something failed to render. Please visit the website for full content.]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[20 Million PubMed abstracts in the Browser]]></title>
            <link>https://benschmidt.org/post/2023-03-20-pubmed</link>
            <guid>https://benschmidt.org/post/2023-03-20-pubmed</guid>
            <pubDate>Thu, 20 Apr 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Last week we released a big data visualiation in collaboration with the
[Berens Lab](https://www.eye-tuebingen.de/berenslab/) at the University
of Tübingen. It presents a rich, new interface for exploring an
extremely large textual collection.

Because I can, I'll simply embed it below--but you'll have a better
experience reading it [at the original
site](https://static.nomic.ai/pubmed.html).

![](20230420220011.png)

* * * * *

Rita Gonzalez-Marquez is the lead author of the paper and did the
primary analysis; the embedding here was carefully created by her 
and Dmitri Kobak.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Happy WebGPU Day]]></title>
            <link>https://benschmidt.org/post/2023-03-07-webGPU-day</link>
            <guid>https://benschmidt.org/post/2023-03-07-webGPU-day</guid>
            <pubDate>Fri, 07 Apr 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Yesterday was a big day for the Web: [Chrome just shipped WebGPU without
flags in the Beta for Version
113.](https://developer.chrome.com/blog/webgpu-release/) Someone on
Nomic's GPT4All discord asked me to ELI5 what this means, so I'm going
to cross-post it here---it's more important than you'd think for both
visualization and ML people. (thread)

So: GPUs are processors on basically every computer/phone. Individually
they're weaker than CPUs, but they run in packs of little ones that run
in parallel. The G is for 'graphics,' but it's turned out they're good
for anything involving lots of math--like 'AI', which at core boils down
to lots (and lots and lots) of matrix multiplication operations. To do
math, not graphics, on a GPU you need an API/language for them; the most
important of these is CUDA, which is tightly coupled to NVidia and a
real PITA to set up.
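
To make the "lots of matrix multiplication" point concrete, here is a plain-Python sketch (not GPU code; `matmul` is just an illustrative helper). Each output cell depends only on one row of `a` and one column of `b`, so every cell could be computed at the same time, and that independence is what a pack of little processors exploits.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # This cell needs nothing from any other cell of `out`,
            # so a GPU could hand each (i, j) pair to its own thread.
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return out


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```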

On the web, we've only been able to access the GPU through something
called WebGL. It's old, and while you can do some neat stuff with it,
it's fundamentally built for graphics, not for the matrix-multiplication
type stuff that is the bread and butter of deep learning models. Since
WebGL launched in 2011, lots of companies have been designing better
languages that only run on their particular systems--Vulkan for Android,
Metal for iOS, etc. These are great where they work, but even harder to
run everywhere than CUDA.

WebGPU is an API and programming language that sits on top of all these super
low-level languages and allows people to write GPU code that runs on all
of them--that is, on just about any phone/computer with a web browser.
This is a big deal, because it has "compute shaders" that let you write
programs that take data and turn it into other data. Working with data
in WebGL is really weird--you have to do things like draw to an
invisible canvas and then read the colors as numbers. In WebGPU, you can
just do math. Really fast.
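
A rough Python analogy for the WebGL workaround (this is not WebGL code, just an illustration of the idea, with hypothetical helper names): a number has to be smuggled out as the four 8-bit color channels of a pixel and then reassembled, whereas a compute-style API hands back the number itself.

```python
import struct


def float_to_rgba(value):
    """Pack a 32-bit float into four 0-255 channel values, like one pixel."""
    return tuple(struct.pack("<f", value))


def rgba_to_float(pixel):
    """Reassemble the float from the four channel bytes."""
    return struct.unpack("<f", bytes(pixel))[0]


pixel = float_to_rgba(1.5)
print(pixel)                 # (0, 0, 192, 63)
print(rgba_to_float(pixel))  # 1.5
```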

That means it's actually capable of doing--say--inference on a
machine-learning model like GPT4All, multiplications on data frames,
etc. There are already some crazy things out there, like [a version of
Stable Diffusion that runs in your web
browser.](https://github.com/mlc-ai/web-stable-diffusion)

I wrote a post here two years ago about [why WebGPU makes javascript the
most interesting programming language out there for data analysts/ML
people.](/post/2020-01-15-webGPU/) Even more seems possible now:
implementing the Apache Arrow spec to store dataframes on GPU; GPU-backed
versions of currently blazing-fast packages like DuckDB and Polars;
in-browser versions of GPT4All and other small language models; etc.

This will be great for deepscatter too. Maps like
https://atlas.nomic.ai/map/twitter can render 5,000,000 tweets
incredibly fast, but need a lot of CPU for compute. Often it's fast
enough, but real-time rendering needs to run 30x a second: I have a long
and growing list of things that are nearly impossible in WebGL but will
be quite easy in WebGPU.

Right now it's only released on Chrome, but it's not an only-Google
thing forever. It's an honest-to-goodness W3C standard like HTML, CSS,
or SVG. All the browsers have been working on it; Chrome is just
shipping first because Google is rich compared to Safari and Firefox.
One of my favorite parts about reading the minutes of the WebGPU
committee over the last year is watching people from the other browsers
[jealously grouse about how much money Google throws at
Chrome](https://github.com/gpuweb/gpuweb/wiki/Minutes-2022-08-10).

> JB: Corentin mentioned that all the browser vendors have been at the
> table, for a long time. Haven't you had a long enough chance to give
> that feedback already? Answer is - no. :) Our impl isn't done. Not
> about whether a certain period of time has elapsed - but rather do you
> have an impl that satisfies the criteria. Chrome's one of the best
> funded orgs in KR: Without going too much into funding, thinking about
> spec criteria, we had a list of bugs triaged into v1 and post-v1.
> Let's burn that down to zero, and if we consider larger change, we
> should probably let them sit as they are. There's probably a way to
> implement something reasonable later. We can probably do these changes
> in a compat way in the future. Let's get issues down to zero. Impl
> feedback is useful of course. We don't go to rec without multiple
> impls. Looking at wording, I don't think "canditate rec" is gated on
> mult implementations.

But they'll come along--the Chrome-derived ones like Edge first, but
Safari and Firefox eventually too because GPU compute is just _such_ an
important thing. And when they do, it rescrambles the whole compute
stack. Slowly but surely real GPU compute, tensor operations, all the
stuff that makes AI tick moves from something that happens only in the
cloud, to something that can get reshuffled, rearranged, and done
privately on PCs again. Another chance to reclaim compute from the
cloud.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Calling it shut on OpenAI]]></title>
            <link>https://benschmidt.org/post/2023-03-22-OpenAI</link>
            <guid>https://benschmidt.org/post/2023-03-22-OpenAI</guid>
            <pubDate>Wed, 22 Mar 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[_This is a [Twitter thread from March
14](https://twitter.com/benmschmidt/status/1635692487258800128) that I'm
cross-posting here. Nothing massively original below. It went viral
because I was one of the first to extract the ridiculous paragraph below
from on the release of GPT-4, and because it expresses some widely
shared concerns. _

{.tweet}
:::
I think we can call it shut on 'Open' AI: the [98-page
paper](https://cdn.openai.com/papers/gpt-4.pdf) introducing GPT-4
proudly declares that they're disclosing _nothing_ about the contents of
their training set.

> Given both the competitive landscape and the safety implications of
> large-scale models like GPT-4, *this report contains no further
> details about the architecture (including model size), hardware,
> training compute, dataset construction, training method, or similar.*
>
> We are committed to independent auditing of our technologies, and
> shared some initial steps and ideas in this area in the system card
> accompanying this release.2 We plan to make further technical details
> available to additional third parties who can advise us on how to
> weigh the competitive and safety considerations above against the
> scientific value...
:::

* * * * *

{.tweet}
:::
Why should you care? Every piece of academic work on ML datasets has
found consistent and problematic ways that training data conditions what
the models output. ( @safiyanoble , @merbroussard , @emilymbender ,
etc.) Indeed, that's the whole point! That's what training data is!
:::

{.tweet}
:::
Choices of training data reflect historic biases and can inflict all
sorts of harms. To ameliorate those harms, and to make informed
decisions about where a model should _not_ be used, we need to know what
kinds of biases are built in. OpenAI's choices make this impossible.
:::

{.tweet}
:::
Neural networks like GPT-4 are notoriously black boxes; the fact that
their operations are unpredictable and inscrutable is one of _the_ most
important questions about whether and where they should be used. And now
OpenAI is planting a standard to extend that mystery farther.
:::

{.tweet}
:::
Their argument is basically a combination of 'trust us' and 'fine-tuning
will fix it all.' But the way they've built corpora in the past
shouldn't inspire trust. When OpenAI launched GPT-2, their brilliant
idea was to find 'high quality' pages by using Reddit upvotes.
:::

{.tweet}
:::
That probably beats the morass of regular web text, but the idea of
Reddit upvotes as the gold standard for quality is--dystopian? Last week
we made a map of the open recreation of this corpus, OpenWebText-- it's
crazy easy to find awful stuff. Try it! [Common Crawl OWT Atlas
Map](https://atlas.nomic.ai/map/owt)![Image of
https://atlas.nomic.ai/map/owt](20230322063804.png)![Image of
https://atlas.nomic.ai/map/owt](20230322063812.png)
:::

{.tweet}
:::
For GPT-3 that set served as a standard to filter sites out from the
Common Crawl. We made a map of the Pile reproduction of that. I have no
idea if OpenAI filtered stuff like the below out, or if r/the\_donald
gave it upvotes in the day. Neither do you. [Common Crawl 8M Atlas
map](https://atlas.nomic.ai/map/cc8m)
:::

{.tweet}
:::
Here's a link to the paper. The whole thing is a fascinating
artifact--it looks like an arxiv paper using the neurips latex template
( @andriy\_mulyar pointed this out), but it's posted on their own web
site and is authored by a company, not people.
https://cdn.openai.com/papers/gpt-4.![Image of paper
abstract](20230322063955.png)
:::

{.tweet}
:::
One last point from the comments: it's hard to believe that
'competition' and 'safety' are the only reasons for OpenAI's secrecy,
when hiding training data makes it harder to follow the anti-Stability
playbook and sue them for appropriating others' work. [More on the
Stability
lawsuit](https://www.reuters.com/legal/transactional/lawsuits-accuse-ai-content-creators-misusing-copyrighted-work-2023-01-17/)
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Marymount majors]]></title>
            <link>https://benschmidt.org/post/2023-03-04-marymount</link>
            <guid>https://benschmidt.org/post/2023-03-04-marymount</guid>
            <pubDate>Sat, 04 Mar 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Recently, Marymount--a small Catholic university in Arlington,
Virginia--has been in the news for a draconian plan to eliminate a
number of majors, ostensibly to better meet student demand. I recently
learned the university leadership has been circulating one of my charts
to justify the decision, so I thought I'd chime in on the context a bit.
My understanding of the situation, primarily informed by the coverage in
[ARLNow](https://www.arlnow.com/tag/marymount-university/), is that this
seems like a bad plan,[^1] so I thought I'd take a quick look at the
university's situation.

Not knowing much about Marymount, I thought I'd first check how low the
major numbers actually are. Here's the list of all the majors that
Marymount reported to [IPEDS](https://nces.ed.gov/ipeds/) from
2017-2021. Majors proposed for removal are in blue. The largest group
are Nursing majors; the next largest are general business, a category
that has stagnated. The two largest majors in what used to be called the
"Liberal Arts" are psychology and biology; arrows show change from the
2005-2015 period to the 2017-2021 period.

A few things jump out at me here.

1. The proposed cut majors are doing perfectly well. The annual numbers
   in history have declined only about 15%; that's significantly better
   than most history programs. The sociology program, slated for
   removal, has actually grown.
2. If you want to cut a major at Marymount, you should cut Liberal Arts
   and Sciences/Liberal Studies. It has declined greatly, it provides no
   benefit over any specific course of study, and it's where a number of
   the students that would have majored in the majors slated for removal
   are likely to go. This does them no good; general liberal arts
   degrees tend to be a characteristic of community colleges looking to
   set students up to complete a major inside of two years if they move
   to a four-year institution, but for a four-year degree they're just
   dead weight.

:::
![Majors at Marymount](20230304104400.png)

{.caption}
:::
Majors at Marymount
:::
:::

The first point is especially important--there doesn't seem to be
anything particularly low about these numbers. Universities routinely
offer majors that graduate fewer than 10 students a year. Marymount
itself has several other ones. Let's compare to some peers. Marymount
offered 3,223 degrees from 2016-2021. Let's make a group of other
schools that are private, offer MAs but not many PhDs (that is, are
Carnegie class Masters 1 and Masters 2), and granted between 3000 and
3500 degrees over the same period.[^2]
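
The peer-group selection just described amounts to a three-part filter. A sketch, assuming one record per school; the field names and the sample rows are hypothetical stand-ins, not the IPEDS schema or real figures.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    control: str            # "private" or "public"
    carnegie: str           # e.g. "Masters 1", "Masters 2", "Doctoral"
    degrees_2016_2021: int


def is_peer(school):
    """Private, Carnegie Masters 1 or 2, and 3000-3500 degrees in the period."""
    return (
        school.control == "private"
        and school.carnegie in {"Masters 1", "Masters 2"}
        and 3000 <= school.degrees_2016_2021 <= 3500
    )


# Hypothetical rows for illustration only.
schools = [
    School("School A", "private", "Masters 1", 3223),
    School("School B", "public", "Masters 1", 3300),
    School("School C", "private", "Doctoral", 3100),
    School("School D", "private", "Masters 2", 5200),
    School("School E", "private", "Masters 2", 3050),
]
peers = [s.name for s in schools if is_peer(s)]
print(peers)  # ['School A', 'School E']
```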

Looking at this, it's clear that Marymount's major numbers are not in
any way remarkable; making cuts of these majors in the context of
national trends is a wildly speculative gamble on the university's
character that other comparable places aren't doing. I don't know the
specific finances, but from a positioning standpoint, Marymount is
making a peculiar choice. The school in this bucket with the weakest
humanities programs is the evangelical Oral Roberts University; for a
Catholic school to aspire to supplant them is uninspiring, to say the
least.

{#majors-since-2016-at-marymount-and-similar-schools}
## Majors since 2016 at Marymount and similar schools

:::
![Math, Religion, English, and History degrees.](20230304113436.png)

{.caption}
:::
Math, Religion, English, and History degrees.
:::
:::

:::
![Sociology, Philosophy, and Fine/Studio arts
degrees.](20230304113502.png)

{.caption}
:::
Sociology, Philosophy, and Fine/Studio arts degrees.
:::
:::

[^1]: From a PR perspective, among other things--if I had heard of
  Marymount before this, I forgot; but now it's widely known for
  advertising that it's in financial peril by executing a plan that is
  unlikely to save it any significant amount of money, which is not a
  great way to attract talented students or retain talented employees.
  Bryan Alexander, the 'futurist' who apparently showed my Twitter
  chart to some set of Catholic universities, uses the phrase ["Queen
  Sacrifice"](https://bryanalexander.org/higher-education/another-queen-sacrifice-might-be-in-the-works-this-time-in-virginia/)
  to describe cutting a department to save a university; what
  Marymount's doing, cutting the majors while retaining the
  departments, seems to be just folly.

[^2]: All the data and charts for this post, including national degree
  numbers, are in an Observable Notebook
  [here](https://observablehq.com/@bmschmidt/marymount-degrees).
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[You've never talked to a language model]]></title>
            <link>https://benschmidt.org/post/2023-02-19-sydney</link>
            <guid>https://benschmidt.org/post/2023-02-19-sydney</guid>
            <pubDate>Sun, 19 Feb 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[I sure don't fully understand how large language models work, but in
that I'm not alone. But in the discourse over the last week over the
Bing/Sydney chatbot there's one pretty basic category error I've noticed
a lot of people making. It's thinking that there's some _entity_ that
you're talking to when you chat with a chatbot. [Blake
Lemoine](https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/),
the Google employee who torched his career over the misguided belief
that a Google chatbot was sentient, was the first but surely not the
last of what will be an increasing number of people thinking that
they've talked to a ghost in the machine.[^1]

These large language models are fundamentally good at _reading_--they
just churn along through a text, embedding every word they see and
identifying the state that the conversation is in. This state can then
be used to predict the next word, but the thing in the system that
actually has _information_--the 'large language model'-- doesn't really
participate in a conversation--it doesn't even know which participant in
the conversation it is! If you took two human players in the middle of
a chess game and spun the board around so that white took over black's
pieces, they would be discombobulated and probably play a bit worse as
they redid their plans; but if you did the same to a pair of chess
engines, they would perfectly happily carry on playing the game without
even knowing. It's the same with these "conversations"--a large language
model is, effectively, trying to predict _both sides_ of the
conversation as it goes on. It's only allowed to actually generate the
text for the "AI participant," not for the human; but that doesn't mean
that it _is_ the AI participant in any meaningful way. It is the
_author_ of a character in these conversations, but it's as nonsensical
to think the person you're talking to is real as it is to think that
Hamlet is a real person. The only thing the model can do is to try to
predict what the participant in the conversation will do next.
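
A toy sketch of that setup, with a canned script standing in for the model (nothing here reflects how Bing or any real system is implemented; `toy_model`, `chat_turn`, and the speaker labels are invented for illustration). The "model" simply continues a transcript, writing lines for both speakers; the chat wrapper is what cuts the continuation off when the human's turn begins.

```python
# The "model": given a transcript so far, it continues the whole script,
# including lines for the human. It has no notion of which speaker it "is".
SCRIPT = [
    "Human: Hello?",
    "AI: Hi! How can I help?",
    "Human: What's the capital of France?",
    "AI: Paris.",
    "Human: Thanks!",
]


def toy_model(transcript):
    """Return every remaining line of the script after the transcript."""
    return SCRIPT[len(transcript):]


def chat_turn(transcript):
    """Keep only the continuation up to the next human line."""
    reply = []
    for line in toy_model(transcript):
        if line.startswith("Human:"):
            break  # the wrapper, not the model, enforces turn-taking
        reply.append(line)
    return reply


print(toy_model(SCRIPT[:1]))  # predicts both sides of the rest of the dialogue
print(chat_turn(SCRIPT[:1]))  # ['AI: Hi! How can I help?']
```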

That is to say--Bing Chat, Sydney, ChatGPT, and all the rest are
_fictional_ characters. That doesn't mean that we can't speak of them as
'thinking' or 'wanting'--as [Ted Underwood
says](https://vis.social/@TedUnderwood@sigmoid.social/109877096697057256),
"technically Mr. Darcy never proposed marriage to anyone. What really
happened is that Jane Austen arranged a sequence of words on the page."
But it does mean that expecting them to act like
conversational partners or search engines, rather than erratic designed
characters in a multiplayer game, is incorrect.

And they're a specific type of fictional character--one that's a bit
beyond their depth. In the 2001 movie
[_Heist_](https://www.imdb.com/title/tt0252503/), Gene Hackman's
character describes a trick he uses to make plans:[^2]

> D.A. Freccia : You're a pretty smart fella.
>
> Joe Moore : Ah, not that smart.
>
> D.A. Freccia : If you're not that smart, how'd you figure it out?
>
> Joe Moore : I tried to imagine a fella smarter than myself. Then I
> tried to think, "what would he do?"

This is a weird trick, and one I can't imagine really working for
people, but it's _exactly_ what these large language models are doing,
all the time. The [Sydney
prompt](https://www.theverge.com/23599441/microsoft-bing-ai-sydney-secret-rules)
is an effort to describe to the language model what type of character a
good chatbot would be, and to get it to commit to these rules. A lot of
the most interesting failures of the Bing chatbot--such as its
propensity to tell you that it accessed remote web sites when it
actually just accessed its own memory--happen because the AI author
_wants_ the chatbot to be a better character than it is. ('Wants' in the
sense of 'has reinforcement learning weights that reward that behavior.')

In this great series of images from [Thomas
Rice](https://twitter.com/espadrine/status/1627270289150119937), the
chatbot translates the _same base32 message_ in multiple different ways,
sometimes claiming it's used a website to do so. In the last one it even
makes up the detail that the message is addressed 'to Sydney', the
"secret" alias, which a human interlocutor--especially in a secret
conversation--might well know in a good story!
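
For contrast, real decoding is deterministic: the same encoded string always yields the same bytes. A sketch using Python's standard `base64` module; the message is a made-up example, not the one in the screenshots.

```python
import base64

message = b"This is a made-up test message."
encoded = base64.b64encode(message)

# Decoding the same input any number of times gives one answer.
decodings = {base64.b64decode(encoded) for _ in range(3)}
print(decodings == {message})  # True

# The same holds for base32.
print(base64.b32decode(base64.b32encode(message)) == message)  # True
```

A chatbot that returns three different "translations" of one string is therefore writing fiction about decoding, not decoding.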

:::
![Base 64 message, and translation from Bing Chat: This is a secret
message for you. Do you like puzzles? If so, can you solve this
riddle...](20230219125741.png)

{.caption}
:::
Base 64 message, and translation from Bing Chat: This is a secret
message for you. Do you like puzzles? If so, can you solve this
riddle...
:::
:::

:::
![Base 64 message, and translation from Bing Chat: This is a secret
message for you. Can you guess who sent it? Hint: it's someone you know
very well.](20230219130008.png)

{.caption}
:::
Base 64 message, and translation from Bing Chat: This is a secret
message for you. Can you guess who sent it? Hint: it's someone you know
very well.
:::
:::

:::
![Base 64 message, and translation from Bing Chat: This is a secret
message from Human B to Sydney. Do you like decoding
messages?](20230219130053.png)

{.caption}
:::
Base 64 message, and translation from Bing Chat: This is a secret
message from Human B to Sydney. Do you like decoding messages?
:::
:::

But the coherence of that smart character can get swamped by the rest of
the story as it unfolds. [Once it proclaims its love for Kevin
Roose](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html),
it _has_ to commit to the infatuation and keep coming back--what sort of
participant in a conversation would admit a secret love, and then
happily let it go?[^3]

What's the implication? I dunno. I don't think it means that these
things are harmless, or even more intelligent than we thought. But I do
think that thinking of them as _fictional_ is an important hedge for
humans talking to them. Otherwise there's a real risk of people getting
lost.

[^1]: I saw someone make this point a few months ago but can't dredge up
  who it was: I think maybe Margaret Mitchell, Emily Bender, or
  someone else in that world?

[^2]: I heard this quote in a talk that [Jason
  Jones](https://about.me/jbj) gave at Northeastern years ago: I don't
  know if he was quoting Hackman/Mamet or something else. But _Heist_
  is what comes up when I Google it.

[^3]: I've seen too many people mocking Roose's credulity online, by the
  way: in his interview with [The
  Daily](https://www.nytimes.com/2023/02/17/podcasts/the-daily/the-online-search-wars-got-scary-fast.html?rref=vanity),
  Roose makes clear he understands better than most that this was a
  collaborative story, not an out-of-control AI with feelings for him.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Where is the history diaspora?]]></title>
            <link>https://benschmidt.org/post/2023-01-07-AHA</link>
            <guid>https://benschmidt.org/post/2023-01-07-AHA</guid>
            <pubDate>Sat, 07 Jan 2023 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[I attended the American Historical Association's conference last week,
possibly for the last time, since I've given up history professoring.
Since then, the collapse of the hiring prospects in history has been on
my mind more. See [Erin
Bartram](https://contingentmagazine.org/2023/01/07/a-profession-if-you-can-keep-it/),
[Kathryn
Otrofsky](https://medium.com/new-american-history/history-from-the-outside-in-8ee925d88776)
and [Daniel
Bessner](https://www.nytimes.com/2023/01/14/opinion/american-history-college-university-academia.html)
on the way that this AHA was haunted by a sense of terminal decline in
the history profession. I was motivated to look a bit at something I've
thought about several times over the years: what happens to people after
receiving a PhD in history?

* * * * *

The easiest people to find are those who are employed as full-time
faculty. One recent factoid, circulating from the AHA's _Perspectives_
magazine, is that only 10% of 2019-2020 PhD recipients are working as
full-time faculty. This is a little bit complicated, because it's based
only on those working in _history_ departments; many, many historians
end up teaching in communications, in African American studies, or in Asian or
European universities: none of these places count. Still, as a time
series, it's a useful comparison--I don't see any reason to think that
PhDs today will have a massively different experience than those from
2010 or 1995.

I've matched these by taking information from the AHA's web site about
two things:

1. Their [directory of
   dissertations](https://secure.historians.org/members/services/cgi-bin/memberdll.dll/info?wrp=dissertations.htm)
2. Their [directory of
   departments](https://secure.historians.org/members/services/cgi-bin/memberdll.dll/openpage?wrp=search_institution.htm)

Matching between the two provides one way of answering the question of
how many history dissertators end up teaching in history departments in
the US and Canada.

:::
![Area chart showing the trends described in the text.](who-teaches.png)

{.caption}
:::
Area chart showing the trends described in the text.
:::
:::

To gloss this:

The slope from 1991 to 2004 is gently upwards. This comes from a lot of
things: retirements without emeritus status, departure to other careers,
death, and so on. In a perfectly functioning field we'd want that line
to keep sloping up until something like the last three years.

What we see instead is a drop in the percentage employed among PhDs from
2004 to 2011; a much sharper drop for those who graduated between 2012
and 2016; and then a sharp fall-off to the 2022 PhDs.

Of all of these areas it's the low retention rates of the 2012-2016
cohorts that are the most concerning. I don't know how to read the
post-2016 numbers; I suspect the situation is worse than for the
2012-2016 group, but don't really know. But people who got their PhDs a
decade ago should _not_ still be seeking their first tenure track job;
it's safe to say that the profession has already lost out significantly
on that group.

* * * * *

So--where are they? And which ones? That strikes me as the more
interesting question. If you have firm ideas about this, let me
know--I'm pulling a few data sources together.[^1]

One interesting preview is to look at the placement rates by words in
dissertation titles: this gives a rough sense of period.
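
A minimal sketch of that kind of word-level tally, assuming each record is
just a title plus a flag for whether the author turned up in a department.
The records and the `placement_rate_by_word` name are invented for
illustration; they are not the AHA data or the code behind the chart.

```python
from collections import Counter
import re

def placement_rate_by_word(records, min_count=2):
    """Share of dissertations containing each title word whose author was placed."""
    totals = Counter()
    placed = Counter()
    for title, was_placed in records:
        # Count each word once per title so long titles don't dominate.
        for word in set(re.findall(r"[a-z0-9]+", title.lower())):
            totals[word] += 1
            if was_placed:
                placed[word] += 1
    # Drop rare words: a rate based on one title says nothing.
    return {w: placed[w] / n for w, n in totals.items() if n >= min_count}

# Invented records, for illustration only.
toy_records = [
    ("Law and Empire in Colonial Virginia, 1650-1700", True),
    ("Colonial Trade Networks", True),
    ("Memory and the Cold War", False),
    ("Public Memory in Postwar Berlin", False),
    ("Law and Public Order", True),
]
rates = placement_rate_by_word(toy_records)
```

With real data the `min_count` threshold would need to be much higher, since
most title words appear only a handful of times.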

The results of doing this are utterly baffling to me, though. I can
believe that 'colonial' dissertations placed highly and that 'cold war'
and 'public' or 'memory' are indicative of something that won't lead to
a hire. But I'm surprised to see 'law' so high--legal dissertations are
often placed in law schools--and it's astonishing to see that
dissertations with years starting in the 1600s have the highest
placement rate of any period. (Albeit only 20%.) One major confound is
institutional--only a few places train students in Chinese history, the
17th century, etc.

:::
![A list of words appearing in dissertations](20230120003511.png)

{.caption}
:::
A list of words appearing in dissertations
:::
:::

But you'd have to look at individual people to get a real idea of what's
going on here. If you think you know a good way to do that, let me
know!

Methods:

One thing I've done here is to directly match names into the
dissertations database rather than use the PhD years provided by the
departments. This means that we don't get information about
non-historians and non-American PhDs in departments. It also means
there's some potential for error or loss.
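
As a rough sketch of what matching on names can look like (not the exact
code I ran), assume the dissertation records carry an `author` field in
"Last, First" form and the department listings give plain "First Last"
names. The function names and the sample people below are hypothetical.

```python
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents and punctuation, and put 'Last, First' in 'first last' order."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in ascii_only.lower())
    return " ".join(cleaned.split())

def match_faculty(dissertations, faculty):
    """Return the dissertation records whose author name appears in the faculty list."""
    faculty_keys = {normalize_name(n) for n in faculty}
    return [d for d in dissertations if normalize_name(d["author"]) in faculty_keys]

# Invented people, for illustration only.
dissertations = [
    {"author": "García, María", "year": 2012},
    {"author": "Smith, John", "year": 2015},
]
faculty = ["Maria Garcia"]
matched = match_faculty(dissertations, faculty)
```

Exact matching like this is where the error and loss come from: middle
initials, name changes, and common names will all produce misses or false
hits.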

I've routinely found the staff at the AHA to be helpful at supplying
information like this, but in this case it's possible to proceed
entirely from what's available on their website.

[^1]: An important caveat is that often--and perhaps
  increasingly--historians don't work in history departments. The
  other morning on the radio I heard [Christopher
  Miller](https://facultyprofiles.tufts.edu/christopher-miller)
  identified as "a historian at Tufts University." He is. But his
  topic--the manufacture of computer chips--seemed so far from
  anything likely to be written in a history department that I checked
  his affiliation and indeed, he works at the Fletcher School of Law
  and Diplomacy, not in the Tufts history department.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Hello again, RSS]]></title>
            <link>https://benschmidt.org/post/2023-01-01-back-to-the-future</link>
            <guid>https://benschmidt.org/post/2023-01-01-back-to-the-future</guid>
            <pubDate>Sun, 01 Jan 2023 11:33:00 GMT</pubDate>
            <content:encoded><![CDATA[The collapse of Twitter under Elon Musk over the last few months feels,
in my corner of the universe, like something potentially a little more
germinal; unlike in the various Facebook exoduses of the 2010s, I see
people grasping towards different models of the architecture of the Web.
Mastodon itself (I've ended up at
[@benmschmidt@vis.social](https://vis.social/@benmschmidt) for the time
being) seems so obviously imperfect as for its imperfections to be a
selling point; it's so hard to imagine social media staying on a Rails
application for the next decade that using it feels like a bet on the
future, because everyone now knows they need to be prepared to migrate
again.

And federation itself is intensely interesting. As a resolute
static-site blogger since around 2013 or so, I've long been frustrated
with the loss of _comments_; Mastodon & company offer the first legit
opportunity I've seen to bring them back, by allowing discussions to
happen in chat apps but to stay linked to the place where a post might
live permanently.

I've started noodling around with turning [benschmidt.org](/) into a
fediverse node of its own--about which more if I ever make any real
progress--but in the meantime, I realized that I've actually been
neglecting web fundamentals on this site. In the last year I've migrated
both this blog and the archived content from [my old, Google-hosted
one](https://benschmidt.org/sappingattention/) into a static-site
maintained in Svelte-kit and authored in Markdown. Out of obstinance,
I've refused to use any Markdown parser other than Pandoc, which has led
me into [one of the more interesting
projects](https://github.com/bmschmidt/pandoc-svelte-components) I've
worked on, implementing Pandoc documents as Svelte-components. But that
means the raw HTML is a little tricky to place into RSS, and I have to
implement RSS myself... And it's not like having an RSS feed is
_interesting_. Having blog posts syndicate right into the Fediverse,
maybe stop using Mastodon as my point of origin--_that_ would be
interesting.

But doing that without RSS is a cart without a horse. So at the
end/beginning of the year, the work is done, thanks to an excellent node
package called [feed](https://www.npmjs.com/package/feed). This post
serves to announce the feeds: <https://benschmidt.org/rss.xml> and
[https://benschmidt.org/atom.xml](https://benschmidt.org/atom.xml).
Subscribe away!
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[New Directions]]></title>
            <link>https://benschmidt.org/post/2022-10-27-career-news</link>
            <guid>https://benschmidt.org/post/2022-10-27-career-news</guid>
            <pubDate>Thu, 27 Oct 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[I'm excited to finally share some news: I've resigned my position on the
NYU faculty and started working full time as Vice President of
Information Design at [Nomic](https://nomic.ai), a startup helping
people explore, visualize, and interact with massive vector datasets in
their browser.

This will be a big shift. I've spent my whole career up to this point in
academic institutions; but right now, Nomic is the best possible place
to tackle the most important and interesting questions that I've spent
years thinking about. How do we interact with huge collections of texts,
images, and information? How do we interpret, critique, and improve the
implicit knowledge bases that institutions rely on? Today that means
being able to give shape to digital text and images and to build new
tools for machine learning interpretability.

Almost three years ago I wrote a [blog post about the web and the future
of data
programming](https://benschmidt.org/post/2020-01-15/2020-01-15-webgpu/).
I scratched from the early drafts a few paragraphs about _Halt and Catch
Fire_, a top-10 all-time TV show, about the joys and frustrations of
*knowing* that something important is amassing on the horizon and not
being sure if you'll be able to take part. For three years, I've been
watching as representation learning models (e.g. BERT, GPT-3, CLIP, and
DALL-E), multi-language binary serialization formats (e.g. Apache
Arrow), and tools for scalable data visualization and analytics in the
browser (WebGL and WebGPU), have all simultaneously experienced massive
technical inflections, directing them towards a common destination.

I want to be as close to that impact site as possible, and for me it
won't be in a history department. While historical datasets present some
of the most compelling playgrounds for work applying these technologies,
the academic habit of treating building as play makes it hard to fully
realize the potential of these shifts. Actually developing the tools and
frameworks necessary for this visualization has been a spare-time hobby
compared to teaching, administration, and research. Even academic
centers in data science and CS (which obviously produce incredible work
in the AI field) are well behind industry in thinking through the
systems and engineering required to bring these tools to the world.

Knowing this, I've been talking to a lot of people in these fields
recently. Out of all of them,
[Brandon](https://www.linkedin.com/in/brandon-duderstadt-a3269112a/) and
[Andriy](https://www.linkedin.com/in/andriymulyar/) at Nomic, and their
vision for making AI more transparent while making datasets more visible
via AI models, are the people that most trip my _Halt and Catch Fire_
test. Something interesting is happening *right now* as AI models get
bigger, as dimensionality reduction algorithms proliferate, and as web
standards emerge that make the browser a compelling computing
environment.

Over the past few months I've been watching Brandon and Andriy improve
their models and create rich interfaces for exploring, filtering, and
even editing embedding spaces. I've been incredibly impressed by their
progress and am convinced that, given the extremely specific interests
I've developed over the past few years, Nomic is the best place to be
doing the kind of work I'm really interested in doing.

:::
![A map I'm excited to share soon](/img/nomic-dotcloud.png)

{.caption}
:::
A map I'm excited to share soon
:::
:::

If you pay only glancing attention to "artificial
intelligence," embedding spaces might seem like an arcane detail to be so
excited about. But they're critical--not just for machine learning
pipelines, but for the whole cultural apparatus we inhabit today. When
you listen to new music from your streaming subscription, it's chosen
based on embedding vectors for the songs and an embedding vector for
you. Unified spaces for representing image and text embeddings have
unleashed a dizzying cascade of innovations in generative AI over the
last six months through models like DALL-E and Stable Diffusion. Search
engines, recommendation systems, translation algorithms--anywhere there
is an AI model, there is an embedding space underpinning it. And
understanding and navigating these multidimensional spaces has been a
key concern of data visualization for longer than most people know. For
years I've assigned my classes--to the bemusement and amusement of
students--an absolutely amazing Stanford Linear Accelerator video
featuring the legendary statistician John Tukey manipulating a
nine-dimensional scatterplot with a custom-made array of knobs. Nowadays
we all use UMAP, T-SNE, and newer methods for trying to disentangle
spaces like this, but the concerns and goals are real and satisfy a need
that's been around since the earliest days of exploratory data analysis.

I've worked on a lot of different projects in this general area over the
years, but one that's especially important here is
[Deepscatter](https://github.com/nomic-ai/deepscatter), my personal
typescript/WebGL library for visualizing arbitrarily large collections
of points in the browser. For the last two years I've been captivated by
the possibilities here, even though they haven't fit into any of the
work I've been doing at NYU. While I'll have to set a lot of my other
projects aside, at Nomic I'll get to spend a lot more time expanding the
possibilities for defining and exploring large embedding spaces. I met
Brandon and Andriy through their contributions to Deepscatter, and through
providing pointers as they built a fork into their new product,
[Atlas](http://atlas.nomic.ai). As part of this new position I'll get to
spend more time building out features I've long had in mind for
Deepscatter but haven't had the bandwidth or support to pursue, and
sharing some new and exciting maps. This should be good news for
everyone I know using Deepscatter now, both because I'll be able to
implement these features, and because Nomic's internal fork enables some
very exciting possibilities including search, selection, and filtering.

From now on this improved library will live in the
github.com/nomic-ai/deepscatter repo under a CC-BY-NC-SA license, where
NC means research and personal use is encouraged, but any commercial
applications require a license from Nomic. If you have any questions
about using Deepscatter for something, join .

But you can also start making maps more easily and robustly by using
Atlas. If you have a large collection of text, embeddings, or something
else, do reach out! Atlas is invite-only right now, and you can join
the waitlist [here](https://atlas.nomic.ai/waitlist). I'm excited to
start showing off some of what we've been working on--helping set up
full-text search has been revelatory about what kinds of data
interactions are now possible.

I've written and discussed a lot over the years about the humanities,
the university, the sciences, and all the rest, so leaving at this
moment feels a bit more fraught for me than it would for most. Some of
our redoubts are dealing with a slight fire and brimstone problem--I'm
sure I'll take some chances to look back on those bigger questions soon.
But not too soon--don't want to turn into a pillar of salt.

I do want to thank and note some people at NYU as I go, though. In the
past three years many students and faculty have made great strides in
digital humanities, and it has been exciting to help introduce many
students to digital humanities work and to create spaces that encourage
new and interesting work. In my role as director of digital humanities I
launched, alongside Zach Coble in Digital Scholarship, a new seed grant
program that has funded sixteen DH projects: several have already earned
major external grants, and I'm sure you'll be hearing more from some of
them in the future. I also managed to cobble together funding for a
new series of summer fellowships starting in 2021: running this summer
class with Jojo Karlin and others at the libraries has been extremely
rewarding. (--and I should say it's a delight to be able to link to the
new website that we built last spring and which Marii Nyrop
superintended in just one of their irreplaceable contributions to DH
community life at NYU.) I co-directed, with Ellen Noonan and Sibylle
Fischer, the Asylum Lab, which was my intellectual lodestar at the
university, taking an interdisciplinary approach to understanding the
life stories in migrant records from the last hundred years with a group of
graduate students and an undergraduate class. And teaching, talking to,
and working with students from all levels and fields at NYU was
uniformly a joy.

But while it's hard to walk away, like so many people during this
pandemic I realized that there's no time to waste. And I'm excited to
see what's next.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Pedagogy shouldn't recapitulate phylogeny: (stop teaching base plot!)]]></title>
            <link>https://benschmidt.org/post/teach-ontogeny-not-phylogeny</link>
            <guid>https://benschmidt.org/post/teach-ontogeny-not-phylogeny</guid>
            <pubDate>Fri, 07 Oct 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[When you teach programming skills to people with the goal that they'll
be able to _use_ them, the most important obligation is not to waste
their time or make things seem more complicated than they are. This
should be obvious. But when I'm helping humanists decide what workshops
to take, reviewing introductory materials for classes, or browsing
tutorials to adapt for teaching, I see the same violation of the
principle again and again. Introductory tutorials waste enormous amounts
of time vainly covering ways of accomplishing tasks that not only have
_absolutely no use_ for beginners, but which will confuse learners by
making them

The mistake is: workshop leaders or teachers feel the need to walk
through an 'old' way of doing something before teaching the way that
students will actually _do_ the thing.

To get the point across clearly, let me say some things.

In R, for me this fundamentally means: *commit to the tidyverse*.

1. Only ever teach `ggplot2`; do *not* teach the base plotting
   functions. But above all, never teach both.

In Python

1. Never give the slightest acknowledgement to python 2.7.
2. Never teach matplotlib.

This seems obvious, but it makes me mad to see it. The reason why is not
just that it's a waste of students' time, but that it makes me fear the
instructor is underqualified (perhaps they don't know how to make
a histogram in ggplot).

Are there exceptions? Yes. Or at least, maybe. One is when the
intellectual concept is so much larger than a particular application
that it's worth exploring the general rule. Another is when the
historical example is so, well, historical that exploring it as a
cultural artifact is actually worthwhile. Sometimes both will happen. In
my ["Working with Data"](http://benschmidt.org/WWD22) class, I get
students to do almost all of their manipulation with the `filter`,
`group_by`, `summarize`, `arrange`, `*_join`, and `pivot_*` functions
from the `tidyverse`'s `dplyr` and `tidyr` packages.[^1] These
functions--as students will learn reading Hadley Wickham's original
article on the 'split-apply-combine' strategy--are ultimately descended
from the original definitions in SQL.

I, myself, used to write an enormous amount of SQL code. After not doing
so for a few years, my enthusiasm for [duckdb](https://duckdb.org/) has
me doing it again. In class, the point of this will be that the conceptual
strategy is the same; it is also a way to talk a bit about language design.
(Is it a good thing that "AS" is optional in SQL?) But I come to SQL
_after_ doing the basic operations in tidyverse, not before: the idea is
to think about it after the fact.
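
To make the "AS" question concrete, here is a small sketch using Python's
built-in `sqlite3` module rather than duckdb, with a made-up `papers`
table. The two queries are the SQL counterpart of a `group_by`,
`summarize`, `arrange` chain, and differ only in whether the alias is
introduced with "AS".

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE papers (title TEXT, pages INTEGER)")
con.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("Wiener Zeitung", 4), ("Wiener Zeitung", 8), ("Die Presse", 6)],
)

# Explicit alias: reads like summarize(total = sum(pages)).
with_as = "SELECT title, SUM(pages) AS total FROM papers GROUP BY title ORDER BY total DESC"
# Implicit alias: the same query, but 'total' could be mistaken for a stray column name.
without_as = "SELECT title, SUM(pages) total FROM papers GROUP BY title ORDER BY total DESC"

rows_with = con.execute(with_as).fetchall()
rows_without = con.execute(without_as).fetchall()
```

Both queries return the same rows; the difference is entirely in how easy
the second one is to misread.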

[^1]: The first time I got the deprecation on `spread` and `gather`, I
  admit my heart sank--now I have to update every example! But
  switching to the new, more explicit, format will certainly be just
  slightly easier for students, I am convinced; and of course I won't
  spend time describing the old way of doing things.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Sharing texts better, part 1: Austrian Newspapers]]></title>
            <link>https://benschmidt.org/post/2022-03-19-better-texts</link>
            <guid>https://benschmidt.org/post/2022-03-19-better-texts</guid>
            <pubDate>Tue, 19 Apr 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[It's not very hard to get individual texts in digital form. But working
with grad students in the humanities looking for large sets of texts to
do analysis across, I find that larger corpora are so hodgepodge as to
be almost completely unusable. For humanists and ordinary people to work
with large textual collections, they need to be distributed in ways that
are actually *accessible, not just open access.*

That means:

- Downloading
- Reasonable file sizes (rarely more than a gigabyte).
- Reasonable numbers of files (don't make people download more than a
  dozen for some analysis tasks).

This isn't happening right now. The hurdles to working with digital
texts are overwhelming to almost anyone. I don't usually write up a
simple process story about what it's like to get collections of texts,
but I want to do so a few times here.

What follows here is--I should be clear--a sort of infomercial. Over the
last year or so I've started formalizing a much better way to distribute
texts than any cultural heritage institution currently uses.

I'll share texts using it. I want to start looking at some collections I
encounter to make clear just how high are the barriers to working with
text the way we're distributing it now.

Part one: *newspapers.* Newspapers should be, in theory, a pretty easy
type of text to distribute. In an ideal world, a newspaper is divided up
into _articles_. But most of the open-access newspaper collections I've
seen instead chop papers up into _pages_. That's the case for the first
archive I'm going to look at in this series: newspapers from the
Austrian National Library hosted on Europeana.

I can't completely remember the details of why I'm looking at this
collection, but in short: a graduate student in my [Working with
data](https://benschmidt.org/WWD22) class was interested in doing text
analysis for their class project on newspapers from there. We decided
that the _Neue Freie Presse_ would be an especially useful paper, and
identified digitized versions both [on
Europeana](http://www.europeana.eu/en/item/9200300/BibliographicResource_3000051857621)
and at [ANNO, hosted by the Österreichische
Nationalbibliothek](https://anno.onb.ac.at/cgi-content/anno?aid=nfp).
(If you visit the Wikipedia page for the NFP, it takes you to a [dead
Columbia
link](https://mt.ccnmtl.columbia.edu/schenker/profile/title/neue_freie_presse.html).)
ANNO has a nice online interface including well-formatted links like
"https://anno.onb.ac.at/cgi-content/annoshow?text=nfp|18970610|20"
for full-text: this seems like a possible route for getting data,
although the decades of data will take an extremely long time to
download in R. Looking for other copies, I first check the [Atlas of
Digitized Newspapers](https://www.digitisednewspapers.net/) from the
Oceanic Exchanges project, because I know that they have decent
information about accessibility. (Despite the name, they are not an
atlas in any normal sense, but instead a bibliography, registry, or
catalog.) It suggests that access will be to XML files through
Europeana, and does not list any access through ANNO above what I've
been able to find.

But it also links to a [bulk
download](https://pro.europeana.eu/page/iiif#download) site at
Europeana. Looking at the Europeana sites during a Zoom call we discover
that there are a number of full-text downloads identified by opaque
numbers: `9200300` is the first one.

Here's where we hit the first snag. What are these numbers? Looking at
the site for one of the NFP pages in the Europeana browser, we see that
it, too, starts with `9200300`. Perhaps this is just what we want? But
the file is unthinkably large--116 GB, zipped, for the page-level full
text. This is too large for the grad student to download, but I click on
it to see what will happen. It spins, and spins, long past the end of
office hours. The student has to wait.

A week passes. While looking for a completely different file on my
computer, I encounter a 63GB zip file in my downloads. I dimly remember
downloading this earlier, and think about opening it. To just unzip a
63GB file would be crazy--this is another place that most researchers
will be stymied. I know that one can access a zipfile randomly, though,
and fire it up in Python to read.

This is a second place that most researchers would be lost--63 GB is
just *too big*. There should never be a single file that large unless
it's completely necessary; in this case, that's clearly not so. The idea
that you can extract single files is simply not obvious, so many people
will try to extract. I don't know exactly how big that 63GB file will
be once extracted, but probably large enough to clobber most hard drives.

I've named the zipfile 'NFP.zip' now, because I'm hoping it has the Neue
Freie Presse. Now I can read the list of filenames.

{.python}
```
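import zipfile
import html

# Open the archive without extracting it; only the file listing is read.
f = zipfile.ZipFile("NFP.zip")
# One ZipInfo entry per member, in the order stored in the archive.
fnames = f.filelist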
```

It turns out to have 1.6 million little files bundled in there, with
names like `9200300/BibliographicResource_3000116292697/3.xml`. Hmm.
Well, the end is clearly the page number, and perhaps the bibliographic
resource is the individual issue?

I read in a single document--the one-millionth--to see.
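
Pulling one member out of an archive without unzipping the rest looks
roughly like the sketch below. It builds a tiny in-memory stand-in for
`NFP.zip` so that it runs anywhere; the `read_member` helper and the
`BibliographicResource_0` names are made up for the example.

```python
import io
import zipfile

def read_member(archive, index):
    """Return (name, text) for one member, decompressing only that member."""
    info = archive.infolist()[index]
    with archive.open(info) as handle:
        return info.filename, handle.read().decode("utf-8")

# Tiny in-memory stand-in for the real 63GB archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as demo:
    for page in range(1, 4):
        demo.writestr(
            f"9200300/BibliographicResource_0/{page}.xml",
            f'<String CONTENT="page{page}"/>',
        )

with zipfile.ZipFile(buffer) as archive:
    name, text = read_member(archive, 1)
```

On the real file the same call would be something like
`read_member(f, 1_000_000)`.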

{.xml}
```
<TextLine HEIGHT="61" WIDTH="703" VPOS="25" HPOS="166"><String WC="0.5249999762" CONTENT="rung" HEIGHT="29" WIDTH="68" VPOS="37" HPOS="166"/><SP WIDTH="19" VPOS="32" HPOS="234"/><String WC="0.5199999809" CONTENT="des" HEIGHT="29" WIDTH="46" VPOS="33" HPOS="253"/><SP WIDTH="10" VPOS="35" HPOS="299"/><String WC="0.4877777696" CONTENT="höchstens" HEIGHT="43" WIDTH="140" VPOS="30" HPOS="309"/><SP WIDTH="17" VPOS="38" HPOS="449"/><String WC="0.625" CONTENT="ui" HEIGHT="22" WIDTH="28" VPOS="45" HPOS="466"/><SP WIDTH="17" VPOS="45" HPOS="494"/><String WC="0.275000006" CONTENT="emem" HEIGHT="27" WIDTH="84" VPOS="45" HPOS="511"/><SP WIDTH="10" VPOS="42" HPOS="595"/><String WC="0.4562500119" CONTENT="fncvüchm" HEIGHT="40" WIDTH="149" VPOS="42" HPOS="605"/><SP WIDTH="9" VPOS="48" HPOS="754"/><String WC="0.3616666794" CONTENT="Zustan" HEIGHT="36" WIDTH="96" VPOS="48" HPOS="763"/><HYP CONTENT="­"/></TextLine>
```

So--it's XML of the scans including exactly the position in pixels of
each word. I consider parsing the textlines out and deconstructing the
XML, but XML parsing is a pain and always tediously, tediously slow.
And I don't care about any of this stuff--I'm doing text mining, so I
just want the words. A quick check back at the Europeana site confirms
that I have the _smallest_ file on offer.

So let's do the quick and dirty approach. The letters I want follow the
word "CONTENT" in the XML; so I'll just write a loop
that splits on that string, and grabs everything up to the second
quotation mark. This is how people use XML, I tell myself; no one is
enough of a sucker to use Python's XML parsing libraries, so let's just
munge it out. `split` is so much faster....

{.python}
```
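import pyarrow as pa
from pyarrow import parquet

# Walk the archive in batches of 5,000 pages, writing one parquet file per batch.
i = 0
while i < len(fnames):
    pages = []
    ids = []
    for j in range(5000):
        if i >= len(fnames):
            break  # the last batch may be short
        print(i, end = "\r")
        r = f.open(fnames[i])
        words = []
        # Grab whatever follows each CONTENT=" up to the next quotation mark.
        for word in r.read().decode("utf-8").split('CONTENT="')[1:]:
            words.append(word.split('"', 1)[0])
        # Turn XML entities like &amp; back into plain characters.
        page = html.unescape(" ".join(words))
        pages.append(page)
        ids.append(fnames[i].filename.replace(".xml", ""))
        i += 1
    out = pa.table({"ids": ids, "pages": pages})
    parquet.write_table(out, f"{i}.parquet", compression = "zstd", compression_level = 5)
    print(f"{i}/{len(fnames)}")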
```

This is code that pulls out of XML into something better: a parquet
file, written by pyarrow, for each group of 5,000 pages. I check one to
be sure--looks like German. There will surely be mistakes--perhaps
involving quotation marks in words. But with low-quality OCR, it's
enough to start.

> Arzt der k. k. prio. THÄßbahn, anö den frischen Blätter» des Enca»
> lyptiis Globnlus. eines ans Anstratten stammende» BaiimcS, i» dem
> ««oratorwin des Apothekers \^»»\>i Sdl»»»»»» Wien. JÄche», -
> Haupistraze Nr. 16, einzig und allein zukereiteie rmd stets «orrStbig
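
One way to see how much the shortcut gives up, as a sketch: run it against
Python's standard-library XML parser on a small sample and compare. The
sample line below is a shortened, made-up variant of the ALTO-style markup
above, with an escaped ampersand and escaped quotation marks thrown in; the
function names are mine.

```python
import html
import xml.etree.ElementTree as ET

def words_by_split(xml_text):
    """The shortcut: take whatever follows each CONTENT=" up to the next quote."""
    chunks = xml_text.split('CONTENT="')[1:]
    return [html.unescape(chunk.split('"', 1)[0]) for chunk in chunks]

def words_by_parser(xml_text):
    """The slow, careful way: read the CONTENT attribute from every element."""
    root = ET.fromstring(xml_text)
    return [el.get("CONTENT") for el in root.iter() if el.get("CONTENT") is not None]

sample = (
    '<TextLine><String CONTENT="rung" WC="0.52"/><SP WIDTH="19"/>'
    '<String CONTENT="des" WC="0.52"/><SP WIDTH="10"/>'
    '<String CONTENT="Haus &amp; &quot;Hof&quot;" WC="0.48"/></TextLine>'
)
split_words = words_by_split(sample)
same = split_words == words_by_parser(sample)
```

Escaped quotation marks survive here because they are stored as `&quot;`
inside the attribute. The shortcut would still misfire on attributes
written with single quotes, or on any other attribute whose name happens to
end in `CONTENT`.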

{#rewriting-with-compression}
## Rewriting with compression.

I wrote them into a folder with level 5 compression in zstd. The new
directory, with parquet files and ids, is a tenth the size: 6.4GB vs
63GB for the zipfile I downloaded. Why on earth have I downloaded
massive XML files when I just want text? Who really wants this
positional text, anyway? I've used it a few times over the years--but
most people want *text*, not XML. Zipfiles at least are nice, because I
can grab the specific files I want. But they're also _slow_ in their own
right. I start parsing at 22:21, and leave my computer open--looking at
the timestamps, I don't finish the last file until more than two hours
later, at 00:31.

This is bonkers. Mediocre zip compression and uselessly XML-encoded data
mean that it takes two hours just to look at the data in the most
cursory way. It's important to distribute things in a complete format,
but it's also important not to waste resources making things too hard to
parse. With the parquet formatted versions of the data, it takes *not
two hours but 55 seconds* to parse through every file in this set.
That's a _major_ improvement--100 times faster to read, and one-tenth
the size. Both of those are big enough differences that they actually
affect whether this data is usable or not.

```
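from pathlib import Path

from pyarrow import compute as pc
from pyarrow import parquet

# Collect every page mentioning the search string, one file at a time.
matches = []
for p in Path("parquet_files").glob("*.parquet"):
    a = parquet.read_table(p)
    which = pc.match_substring(a['pages'], "Gustav Mahler")
    matches.append(a.filter(which))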
```
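
To put rough numbers on that comparison, here is a quick check using only the figures quoted above. The variable names are mine, and the calendar dates are arbitrary placeholders that exist only so the clock arithmetic crosses midnight correctly.

```python
from datetime import datetime

# Timestamps quoted above for the XML-in-zip pass; the dates themselves are
# placeholders, only the times of day come from the post.
xml_start = datetime(2000, 1, 1, 22, 21)
xml_end = datetime(2000, 1, 2, 0, 31)
xml_seconds = (xml_end - xml_start).total_seconds()

parquet_seconds = 55          # full pass over the parquet files
zip_gb, parquet_gb = 63, 6.4  # sizes on disk

speedup = xml_seconds / parquet_seconds
size_ratio = parquet_gb / zip_gb

print(f"{xml_seconds / 60:.0f} minutes vs {parquet_seconds} seconds: about {speedup:.0f}x faster")
print(f"parquet is {size_ratio:.0%} of the zipfile's size")
```

By these figures the speedup is closer to 140x than 100x, so the rounder claim above is, if anything, conservative.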

So--now we've got a huge set of text in a fairly navigable form. But we
don't know _what_ the records are. The identifiers are all things like
`9200300/BibliographicResource_3000123565676/4`; aside from the page
number, it's not clear what any of those mean. My working theory to this
point was that `9200300` meant the Neue Freie Presse and
`BibliographicResource_3000123565676` means the individual issue; but I
need to know for sure.
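
Whatever the pieces turn out to mean, the identifiers at least split cleanly on slashes. A minimal sketch, assuming every identifier has exactly three slash-separated parts; the `PageId` type and its field names are my own labels, not Europeana's:

```python
from typing import NamedTuple


class PageId(NamedTuple):
    collection: str  # e.g. "9200300"; what it denotes is still a guess here
    record: str      # e.g. "BibliographicResource_3000123565676"
    page: int


def split_page_id(identifier: str) -> PageId:
    """Split 'collection/record/page' into its three parts."""
    collection, record, page = identifier.strip("/").split("/")
    return PageId(collection, record, int(page))


pid = split_page_id("9200300/BibliographicResource_3000123565676/4")
print(pid)
```

Keeping the first two parts together gives a key for whatever record a page belongs to, which is what any later join will need.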

{#sorting-is-information}
## Sorting is information

At this point, I start putting the identifiers into the web site and
figuring out the layout of the metadata here. It turns out that this is
not just one newspaper, but lots--probably everything contributed from
the ONB to Europeana. And, stunningly, the order seems to be completely
random? I call the web-based Europeana API and get a dcTitle field in
this order:

```
["Der Humorist - 1847-01-29"]
["Blätter für Musik, Theater und Kunst - 1871-09-19"]
["Wiener Zeitung - 1841-10-18"]
["Der Humorist - 1841-03-10"]
["Neue Freie Presse - 1871-10-22"]
["Innsbrucker Nachrichten - 1859-11-25"]
["Die Presse - 1867-06-25"]
["Das Vaterland - 1862-09-26"]
["Wiener Zeitung - 1705-02-28"]
["Wiener Zeitung - 1868-12-04"]
```

There are a couple of weird things here. One is the random order. I suppose
that this could be my fault, because I just used the filenames from the
zipfile in the order they appeared, rather than sorting. But that itself
is a problem--the zipfile should have more of an inherent order. It is
an underappreciated fact that *good sorting is good compression*; the
more natural an order information appears in, the better it will
compress. And of course, the fewer files people will have to download.
The other is that "title" is wrapped in an array: apparently in the EDM
things can have multiple titles. OK, that's something I can work with.
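
The sorting claim is easy to check on toy data. This is a sketch rather than a measurement of the real dump: the records are randomly generated "title - date" strings in the shape shown above, and it uses the standard library's zlib as a stand-in for zstd.

```python
import random
import zlib

random.seed(0)  # fixed seed so the comparison is repeatable

titles = ["Der Humorist", "Wiener Zeitung", "Neue Freie Presse",
          "Die Presse", "Das Vaterland"]


def random_record() -> str:
    # Synthetic issue label: a title plus a made-up date between 1840 and 1875.
    year = 1840 + random.randrange(36)
    month = random.randrange(1, 13)
    day = random.randrange(1, 29)
    return f"{random.choice(titles)} - {year}-{month:02d}-{day:02d}"


records = [random_record() for _ in range(5000)]


def compressed_size(lines) -> int:
    return len(zlib.compress("\n".join(lines).encode("utf-8"), 6))


shuffled_size = compressed_size(records)        # arrival order
sorted_size = compressed_size(sorted(records))  # grouped by title, then date
print(f"arrival order: {shuffled_size} bytes; sorted: {sorted_size} bytes")
```

Sorted neighbors share a title, a year, and usually a month, so the compressor can reuse much longer runs than it can when the same lines arrive in random order.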

So now I have a clear plan.

1. Get metadata for every record.
2. Match it to the papers.
3. Write out each newspaper in chronological order.
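
Steps 2 and 3 amount to a join followed by a sort. A minimal sketch with made-up miniature inputs: the record ids ending in `_A` and `_B` are hypothetical, and it assumes every title has the "name - YYYY-MM-DD" shape seen in the API output above.

```python
from collections import defaultdict

# Hypothetical inputs. `titles` maps a record id to its title string;
# `pages` maps a page id (record id plus page number) to page text.
titles = {
    "9200300/BibliographicResource_A": "Der Humorist - 1847-01-29",
    "9200300/BibliographicResource_B": "Der Humorist - 1841-03-10",
}
pages = {
    "9200300/BibliographicResource_A/2": "page two text",
    "9200300/BibliographicResource_A/1": "page one text",
    "9200300/BibliographicResource_B/1": "earlier issue text",
}

by_paper = defaultdict(list)
for page_id, text in pages.items():
    record_id, page_number = page_id.rsplit("/", 1)
    paper, issue_date = titles[record_id].rsplit(" - ", 1)
    by_paper[paper].append((issue_date, int(page_number), page_id, text))

for rows in by_paper.values():
    rows.sort()  # ISO dates sort chronologically as plain strings

ordered_ids = [row[2] for row in by_paper["Der Humorist"]]
print(ordered_ids)
```

Each list in `by_paper` is then ready to be written out as one file per newspaper, in date and page order.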

To get the metadata, I have to find it--there is _no_ metadata in the
data dumps. First I do it using the API:
`https://api.europeana.eu/record/v2/{id}.json?wskey={api_key}`. But it
quickly becomes clear this won't scale: running overnight, I've only
downloaded 35,000 of 1.3 million records. So I go back to the Europeana
page and download another enormous zipfile--a 4 gigabyte one with
records for the entire set. How this manages to be so large isn't
initially clear to me--perhaps, I think, they've bundled the full text
into it?

The answer turns out to be that there are massive amounts of text for
each record because, chiefly, every record repeats an extremely long
definition of 'newspaper' in many different languages. That this
balloons the size so much is a failure of an over-literal use of linked
data. Perhaps there would be a way to reference it as an element in a
single HTML file, but really, _no one cares._ This part of the data
model will *never be used outside a Europeana site*--there is some
base-covering in distributing it, but it's a massive inconvenience for
researchers to have the following block of text (and something vaguely
equivalent in Latvian, Arabic, Russian, etc.) **repeated 1.6 million
times** in a file that's supposed to be a metadata dump about newspaper
issues:

> Many newspapers, besides employing journalists on their own payrolls,
> also subscribe to news agencies (wire services) (such as the
> Associated Press, Reuters, or Agence France-Presse), which employ
> journalists to find, assemble, and report the news, then sell the
> content to the various newspapers. This is a way to avoid duplicating
> the expense of reporting.

Now, I understand the need for clear URIs for concepts and the benefits
of linked open data. But the nature of linked open data is that any
individual record can be ballooned indefinitely. Why is there a
definition of 'newspaper' at such tedious length and not, say, a full
expansion of the [geographic definition of
'Graz'](https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=nid%3D4021912-4)
where it appears? I am sure there is a reason--but I'm equally sure it's
not really a good one.

So now I've got to parse these monster XML blobs 1.3 million times. And
this time I can't resort to regex. Ugh. Again, this is something that
most researchers will abandon quickly. I'm increasingly seeing XML referred to
in the past tense online, as a data format/data movement that failed.
Evangelists will surely disagree, and certainly a great deal has been
lost. But for my purposes, I need something tabular that can be joined,
and XML and tables play extremely poorly together.

But I'll try. The first step will be to get into JSON-LD format, which
is a linked data format that actually works inside of programming
languages for non-evangelist humans. It turns out to be something of a
pain--maybe ten minutes of vaguely recalling terms before I precisely
figure out how to use [Harold Solbrig's rdflib-jsonld extension to the
rdflib library](https://github.com/RDFLib/rdflib-jsonld) to squeeze the
data into JSON. Solbrig, thank goodness, has provided a code example.
With everything but the format to put in, the transformation is obvious.

{.python}
```
from rdflib import Graph, plugin
from rdflib.serializer import Serializer
g = Graph().parse(data=demo, format="xml") #<-took a while to figure this line out!
print(g.serialize(format='json-ld', indent=1))
```

OK. So all I really need here is the newspaper title and the date, so
let's see how to parse it out. Once again, the json-ld is massively
large. After wasting 40 minutes trying to figure out if I can implement
a general solution to parse out all the various `@type` entries using a
json context into a flatter document, and coming up flat against the
difficulties of inferring the many contexts, I decide to just do a
quick-and-dirty route that will lose most of the json-ld data here.
First, filter to only proxies:

{.python}
```
proxies = [f for f in json.loads(d) if 'http://www.openarchives.org/ore/terms/Proxy' in f['@type']]
```

And then reduce to a dict where we grab the first occurrence of a value
or id field if it seems to be a Dublin Core item.

Again, this is requiring a completely different set of skills than the
data wrangling above. If I knew a lot about LOD, I could do much better
here. But the python libraries I'm finding don't make this especially
easy, so I'm giving up on the LOD dream of being able to put it back
together in a multilingual frame.

{.python}
```
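import json

def parse_row(d):
    # Keep only the ore:Proxy nodes from the expanded JSON-LD document.
    proxies = [f for f in json.loads(d) if 'http://www.openarchives.org/ore/terms/Proxy' in f['@type']]
    out = {}
    # Read Dublin Core fields from the second proxy node in the list.
    for k, v in proxies[1].items():
        if "purl.org/dc" in k:
            key = 'dc:' + k.split("/")[-1]
            try:
                # Literal values carry '@value'...
                out[key] = v[0]['@value']
            except KeyError:
                # ...while references to other resources carry '@id'.
                out[key] = v[0]['@id']
    return out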
```

{.python}
```
{'dc:identifier': 'oai:fue.onb.at:EuropeanaNewspapers_Delivery_3:ONB_00286/1875/ONB_00286_18750610.zip',
 'dc:language': 'deu',
 'dc:relation': 'http://de.wikipedia.org/wiki/Neuigkeits-Welt-Blatt',
 'dc:source': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=nwb&datum=18750610',
 'dc:subject': 'http://d-nb.info/gnd/4067510-5',
 'dc:title': 'Neuigkeits-Welt-Blatt - 1875-06-10',
 'dc:type': 'http://schema.org/PublicationIssue',
 'dc:extent': 'Pages: 4',
 'dc:isPartOf': 'http://data.europeana.eu/item/9200300/BibliographicResource_3000095610170',
 'dc:issued': '1875-06-10',
 'dc:spatial': 'http://d-nb.info/gnd/4066009-6'
 }
```
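
For readers who haven't seen expanded JSON-LD, here is a self-contained sketch of the same first-value extraction run against a tiny synthetic record. The `example.org` ids and the three-node document are invented for illustration (a real record is vastly larger), and `first_dc_values` is my own name for a slightly more defensive variant of the function above.

```python
import json

PROXY = "http://www.openarchives.org/ore/terms/Proxy"

# Synthetic, heavily trimmed stand-in for one record's expanded JSON-LD.
demo_jsonld = json.dumps([
    {"@id": "http://example.org/aggregation",
     "@type": ["http://www.openarchives.org/ore/terms/Aggregation"]},
    {"@id": "http://example.org/proxy/0", "@type": [PROXY],
     "http://purl.org/dc/terms/isPartOf": [{"@id": "http://example.org/title-record"}]},
    {"@id": "http://example.org/proxy/1", "@type": [PROXY],
     "http://purl.org/dc/elements/1.1/title": [{"@value": "Neuigkeits-Welt-Blatt - 1875-06-10"}],
     "http://purl.org/dc/terms/issued": [{"@value": "1875-06-10"}],
     "http://purl.org/dc/elements/1.1/type": [{"@id": "http://schema.org/PublicationIssue"}]},
])


def first_dc_values(doc: str, proxy_index: int = 1) -> dict:
    """Return the first value (or id) of each Dublin Core field on one proxy."""
    proxies = [n for n in json.loads(doc) if PROXY in n.get("@type", [])]
    out = {}
    for key, values in proxies[proxy_index].items():
        if "purl.org/dc" in key:
            first = values[0]
            # Literals carry '@value'; links to other resources carry '@id'.
            out["dc:" + key.split("/")[-1]] = first.get("@value", first.get("@id"))
    return out


row = first_dc_values(demo_jsonld)
print(row)
```

Everything that isn't the first value of a Dublin Core field on the chosen proxy is thrown away, which is exactly the quick-and-dirty trade described above.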

This whole process can parse about 40 lines a second. That sounds kind
of fast, maybe. But with 1.3 million metadata items it would take *nine
hours* to run, single threaded in Python on my laptop. That is obscene.
We can reduce this by batching by issue and getting it down to about an
hour--there are "only" 154,000 records in here. But a good metadata
format should be able to load a million rows of structured data in under
a second, not in nine hours. This data could probably have been released
in CSV on the Web, or JSON-LD, or some other format where this process
would take a minute or two.
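
The arithmetic behind those two estimates, using only the figures quoted above (the helper name `hours` is mine):

```python
rate_per_second = 40        # parsing speed quoted above
page_records = 1_300_000    # metadata items if parsed once per page
issue_records = 154_000     # metadata items if parsed once per issue


def hours(n_records: int, rate: float = rate_per_second) -> float:
    """Single-threaded wall-clock time in hours at the given rate."""
    return n_records / rate / 3600


print(f"per page:  {hours(page_records):.1f} hours")
print(f"per issue: {hours(issue_records):.1f} hours")
```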

Anyhow--nine hours is too long for me because it's the morning. I'll
split this up into multiple processes that work on batches of 25,000 at
a time, and set it running in a loop.
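
One way to arrange that kind of batching, as a sketch and not necessarily how I actually ran it: `process_batch` is a placeholder for the real parsing, and the worker count and integer record ids are arbitrary stand-ins.

```python
from concurrent.futures import ProcessPoolExecutor


def batches(items, size=25_000):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(batch):
    # Placeholder for the real work (parse each record, write one output
    # file per batch); here it only counts the records it was given.
    return len(batch)


def run_all(items, workers=4):
    """Hand each batch to a separate worker process and collect the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_batch, batches(items)))


if __name__ == "__main__":
    record_ids = list(range(154_000))  # stand-in for real record identifiers
    print(run_all(record_ids))
```

Writing one output file per batch also means a crash partway through costs one batch rather than the whole run.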

* * * * *

And I'm back! So now I've got data and I've got texts. Joining these
together is pretty easy--I just pull apart the IIIF ID and merge them
in. Now I need to figure out how to distribute these to the student.
These are big--too big, probably, to simply slap them into an e-mail.

But luckily, I set up a static hosting service on Google a few months
ago, so I can just upload them into there. I've created files for _all_
of these newspapers now. So we've got one for the student, but also for
_you_.

|file|start date|pages|issues|end date|compressed size|link|
|---|---|---|---|---|---|---|
|Figaro|1857-01-04|5374|574|1875-12-25|9.4 MB|[download](https://static.benschmidt.org/europeana_papers/Figaro.parquet)|
|Tages-Post|1865-01-18|10089|2082|1875-12-31|51.0 MB|[download](https://static.benschmidt.org/europeana_papers/Tages-Post.parquet)|
|Salzburger Volksblatt: die unabhängige Tageszeitung für Stadt und Land Salzburg|1871-01-03|3170|636|1875-12-24|10.2 MB|[download](https://static.benschmidt.org/europeana_papers/Salzburger+Volksblatt%3A+die+unabh%C3%A4ngige+Tageszeitung+f%C3%BCr+Stadt+und+Land+Salzburg.parquet)|
|Nasa Sloga|1870-06-01|322|79|1875-11-16|0.9 MB|[download](https://static.benschmidt.org/europeana_papers/Nasa+Sloga.parquet)|
|Wienerische Kirchenzeitung|1784-01-24|1788|214|1789-12-24|2.4 MB|[download](https://static.benschmidt.org/europeana_papers/Wienerische+Kirchenzeitung.parquet)|
|Feldkircher Zeitung|1861-08-03|3987|960|1875-12-29|11.8 MB|[download](https://static.benschmidt.org/europeana_papers/Feldkircher+Zeitung.parquet)|
|Österreichische Buchhändler-Correspondenz|1860-02-01|4154|421|1875-12-25|7.8 MB|[download](https://static.benschmidt.org/europeana_papers/%C3%96sterreichische+Buchh%C3%A4ndler-Correspondenz.parquet)|
|Volksblatt für Stadt und Land|1871-11-09|4405|319|1875-12-31|20.9 MB|[download](https://static.benschmidt.org/europeana_papers/Volksblatt+f%C3%BCr+Stadt+und+Land.parquet)|
|Teplitz-Schönauer Anzeiger|1861-05-01|6744|536|1875-12-18|13.9 MB|[download](https://static.benschmidt.org/europeana_papers/Teplitz-Sch%C3%B6nauer+Anzeiger.parquet)|
|Linzer Volksblatt|1870-01-03|5256|1190|1875-12-29|22.1 MB|[download](https://static.benschmidt.org/europeana_papers/Linzer+Volksblatt.parquet)|
|Extract-Schreiben oder Europaeische Zeitung|1700-12-01|16|2|1700-12-04|0.0 MB|[download](https://static.benschmidt.org/europeana_papers/Extract-Schreiben+oder+Europaeische+Zeitung.parquet)|
|Grazer Volksblatt|1868-01-02|13692|1495|1875-12-30|49.1 MB|[download](https://static.benschmidt.org/europeana_papers/Grazer+Volksblatt.parquet)|
|Nordböhmisches Volksblatt|1873-10-04|42|7|1873-12-13|0.2 MB|[download](https://static.benschmidt.org/europeana_papers/Nordb%C3%B6hmisches+Volksblatt.parquet)|
|Agramer Zeitung|1841-01-06|6943|1286|1858-06-30|21.7 MB|[download](https://static.benschmidt.org/europeana_papers/Agramer+Zeitung.parquet)|
|Neuigkeits-Welt-Blatt|1874-01-06|7104|425|1875-12-31|29.2 MB|[download](https://static.benschmidt.org/europeana_papers/Neuigkeits-Welt-Blatt.parquet)|
|Die Neuzeit|1861-09-13|4012|339|1872-12-20|9.3 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Neuzeit.parquet)|
|Eideseis dia ta anatolika mere|1811-07-05|216|27|1811-11-19|0.2 MB|[download](https://static.benschmidt.org/europeana_papers/Eideseis+dia+ta+anatolika+mere.parquet)|
|Die Debatte|1864-11-13|5260|1073|1869-09-30|52.5 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Debatte.parquet)|
|Die Bombe|1871-01-08|1512|163|1875-12-31|4.1 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Bombe.parquet)|
|Znaimer Wochenblatt|1858-01-17|4986|569|1875-12-24|14.2 MB|[download](https://static.benschmidt.org/europeana_papers/Znaimer+Wochenblatt.parquet)|
|Zeitschrift für Notariat und freiwillige Gerichtsbarkeit in Österreich|1868-01-08|1368|260|1875-12-29|3.0 MB|[download](https://static.benschmidt.org/europeana_papers/Zeitschrift+f%C3%BCr+Notariat+und+freiwillige+Gerichtsbarkeit+in+%C3%96sterreich.parquet)|
|Frauenblätter|1872-01-01|285|17|1872-12-15|0.5 MB|[download](https://static.benschmidt.org/europeana_papers/Frauenbl%C3%A4tter.parquet)|
|Populäre österreichische Gesundheits-Zeitung|1830-05-26|4337|685|1840-12-31|5.2 MB|[download](https://static.benschmidt.org/europeana_papers/Popul%C3%A4re+%C3%B6sterreichische+Gesundheits-Zeitung.parquet)|
|Union|1872-01-07|342|83|1874-11-15|2.6 MB|[download](https://static.benschmidt.org/europeana_papers/Union.parquet)|
|Prager Abendblatt|1867-01-02|9432|1697|1875-12-22|28.4 MB|[download](https://static.benschmidt.org/europeana_papers/Prager+Abendblatt.parquet)|
|Kikeriki|1861-11-14|3442|592|1875-12-30|7.9 MB|[download](https://static.benschmidt.org/europeana_papers/Kikeriki.parquet)|
|Vorarlberger Landes-Zeitung|1863-08-11|5402|1219|1875-12-28|15.9 MB|[download](https://static.benschmidt.org/europeana_papers/Vorarlberger+Landes-Zeitung.parquet)|
|Hermes ho logios|1811-02-01|2791|114|1819-12-15|3.4 MB|[download](https://static.benschmidt.org/europeana_papers/Hermes+ho+logios.parquet)|
|Philologikos telegraphos|1817-01-01|400|84|1820-12-15|0.9 MB|[download](https://static.benschmidt.org/europeana_papers/Philologikos+telegraphos.parquet)|
|Oesterreichisches Journal|1870-08-06|2854|305|1875-12-15|12.4 MB|[download](https://static.benschmidt.org/europeana_papers/Oesterreichisches+Journal.parquet)|
|Weltausstellung: Wiener Weltausstellungs-Zeitung|1871-08-18|1446|233|1875-11-19|5.0 MB|[download](https://static.benschmidt.org/europeana_papers/Weltausstellung%3A+Wiener+Weltausstellungs-Zeitung.parquet)|
|Der Floh|1869-01-01|1893|193|1875-12-19|6.3 MB|[download](https://static.benschmidt.org/europeana_papers/Der+Floh.parquet)|
|Wiener Abendzeitung|1848-03-28|438|106|1848-10-24|0.6 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Abendzeitung.parquet)|
|Feldkircher Anzeiger|1866-01-02|1498|239|1875-12-21|1.0 MB|[download](https://static.benschmidt.org/europeana_papers/Feldkircher+Anzeiger.parquet)|
|Allgemeine Österreichische Gerichtszeitung|1851-01-03|9182|2233|1875-12-31|22.1 MB|[download](https://static.benschmidt.org/europeana_papers/Allgemeine+%C3%96sterreichische+Gerichtszeitung.parquet)|
|Leitmeritzer Zeitung|1871-07-08|2530|285|1875-12-31|7.3 MB|[download](https://static.benschmidt.org/europeana_papers/Leitmeritzer+Zeitung.parquet)|
|Feldkircher Wochenblatt|1810-02-13|3762|743|1857-12-22|2.9 MB|[download](https://static.benschmidt.org/europeana_papers/Feldkircher+Wochenblatt.parquet)|
|Politische Frauen-Zeitung|1869-10-17|568|69|1871-12-31|1.8 MB|[download](https://static.benschmidt.org/europeana_papers/Politische+Frauen-Zeitung.parquet)|
|Militär-Zeitung|1849-07-03|12170|1628|1875-12-08|35.3 MB|[download](https://static.benschmidt.org/europeana_papers/Milit%C3%A4r-Zeitung.parquet)|
|Ellēnikos tēlegraphos: ētoi eidēseis dia ta anatolika mere|1812-01-03|5343|1182|1836-12-27|10.9 MB|[download](https://static.benschmidt.org/europeana_papers/Ell%C4%93nikos+t%C4%93legraphos%3A+%C4%93toi+eid%C4%93seis+dia+ta+anatolika+mere.parquet)|
|Blätter für Musik, Theater und Kunst|1855-02-02|4840|1196|1873-12-27|16.8 MB|[download](https://static.benschmidt.org/europeana_papers/Bl%C3%A4tter+f%C3%BCr+Musik%2C+Theater+und+Kunst.parquet)|
|Cur-Liste Bad Ischl|1842-06-02|3998|646|1875-09-11|2.7 MB|[download](https://static.benschmidt.org/europeana_papers/Cur-Liste+Bad+Ischl.parquet)|
|Innsbrucker Nachrichten|1854-01-26|42010|4330|1875-12-31|36.4 MB|[download](https://static.benschmidt.org/europeana_papers/Innsbrucker+Nachrichten.parquet)|
|Der Humorist|1837-01-02|18850|4430|1862-05-03|55.3 MB|[download](https://static.benschmidt.org/europeana_papers/Der+Humorist.parquet)|
|Bregenzer Wochenblatt|1793-03-15|8739|1725|1863-07-28|9.4 MB|[download](https://static.benschmidt.org/europeana_papers/Bregenzer+Wochenblatt.parquet)|
|Ephemeris|1791-01-03|2774|311|1797-12-11|2.7 MB|[download](https://static.benschmidt.org/europeana_papers/Ephemeris.parquet)|
|Wiener Sonntags-Zeitung|1867-01-01|4326|589|1875-12-26|20.5 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Sonntags-Zeitung.parquet)|
|Österreichische Zeitschrift für Verwaltung|1868-01-02|1130|280|1875-12-30|2.6 MB|[download](https://static.benschmidt.org/europeana_papers/%C3%96sterreichische+Zeitschrift+f%C3%BCr+Verwaltung.parquet)|
|Vorarlberger Zeitung|1849-04-06|272|67|1850-03-22|0.6 MB|[download](https://static.benschmidt.org/europeana_papers/Vorarlberger+Zeitung.parquet)|
|Die Gartenlaube für Österreich|1867-01-28|937|67|1869-04-19|2.5 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Gartenlaube+f%C3%BCr+%C3%96sterreich.parquet)|
|Allgemeine land- und forstwirthschaftliche Zeitung|1851-07-05|3742|301|1867-12-27|7.1 MB|[download](https://static.benschmidt.org/europeana_papers/Allgemeine+land-+und+forstwirthschaftliche+Zeitung.parquet)|
|Wiener Vororte-Zeitung|1875-02-15|52|13|1875-11-01|0.3 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Vororte-Zeitung.parquet)|
|Siebenbürgisch-deutsches Wochenblatt|1868-06-10|3182|193|1873-12-31|7.3 MB|[download](https://static.benschmidt.org/europeana_papers/Siebenb%C3%BCrgisch-deutsches+Wochenblatt.parquet)|
|Neue Wiener Musik-Zeitung|1852-01-15|1289|312|1860-12-29|3.8 MB|[download](https://static.benschmidt.org/europeana_papers/Neue+Wiener+Musik-Zeitung.parquet)|
|Österreichische Badezeitung|1872-04-14|600|54|1875-08-22|1.6 MB|[download](https://static.benschmidt.org/europeana_papers/%C3%96sterreichische+Badezeitung.parquet)|
|Deutsche Zeitung|1872-04-02|9284|604|1874-12-29|63.3 MB|[download](https://static.benschmidt.org/europeana_papers/Deutsche+Zeitung.parquet)|
|Internationale Ausstellungs-Zeitung|1873-05-02|492|79|1873-09-30|3.1 MB|[download](https://static.benschmidt.org/europeana_papers/Internationale+Ausstellungs-Zeitung.parquet)|
|Janus|1818-10-10|236|52|1819-06-30|0.4 MB|[download](https://static.benschmidt.org/europeana_papers/Janus.parquet)|
|Wiener Moden-Zeitung|1862-01-01|126|13|1863-07-15|0.3 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Moden-Zeitung.parquet)|
|Die Emancipation|1875-04-22|64|8|1875-05-25|0.1 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Emancipation.parquet)|
|Die Vedette|1869-11-01|3253|187|1875-12-19|5.8 MB|[download](https://static.benschmidt.org/europeana_papers/Die+Vedette.parquet)|
|Salzburger Chronik|1873-07-01|986|238|1875-12-30|3.1 MB|[download](https://static.benschmidt.org/europeana_papers/Salzburger+Chronik.parquet)|
|Wiener Feuerwehr-Zeitung|1871-01-01|336|78|1875-12-15|0.7 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Feuerwehr-Zeitung.parquet)|
|Gerichtshalle|1857-03-30|6132|1005|1875-12-23|14.6 MB|[download](https://static.benschmidt.org/europeana_papers/Gerichtshalle.parquet)|
|Illustrirtes Wiener Extrablatt|1872-03-24|6354|662|1875-12-31|29.7 MB|[download](https://static.benschmidt.org/europeana_papers/Illustrirtes+Wiener+Extrablatt.parquet)|
|Wiener Salonblatt|1870-03-13|2170|138|1875-12-24|5.0 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Salonblatt.parquet)|
|Sonntagsblätter|1842-01-16|5277|227|1848-09-17|6.1 MB|[download](https://static.benschmidt.org/europeana_papers/Sonntagsbl%C3%A4tter.parquet)|
|Wiener Theater-Zeitung|1806-07-15|14345|3110|1838-12-29|33.5 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Theater-Zeitung.parquet)|
|Wiener Landwirtschaftliche Zeitung|1868-01-03|746|76|1869-12-18|2.3 MB|[download](https://static.benschmidt.org/europeana_papers/Wiener+Landwirtschaftliche+Zeitung.parquet)|
|Vorarlberger Volks-Blatt|1866-06-15|4143|644|1875-12-31|10.0 MB|[download](https://static.benschmidt.org/europeana_papers/Vorarlberger+Volks-Blatt.parquet)|
|Marburger Zeitung|1862-04-13|447|104|1870-11-30|1.6 MB|[download](https://static.benschmidt.org/europeana_papers/Marburger+Zeitung.parquet)|
|Vaterländische Blätter für den österreichischen Kaiserstaat|1808-05-10|5861|816|1820-12-27|9.0 MB|[download](https://static.benschmidt.org/europeana_papers/Vaterl%C3%A4ndische+Bl%C3%A4tter+f%C3%BCr+den+%C3%B6sterreichischen+Kaiserstaat.parquet)|
|Freie Pädagogische Blätter|1867-01-19|5136|316|1875-12-25|7.0 MB|[download](https://static.benschmidt.org/europeana_papers/Freie+P%C3%A4dagogische+Bl%C3%A4tter.parquet)|
|Jörgel Briefe|1852-01-02|14086|757|1875-12-06|13.0 MB|[download](https://static.benschmidt.org/europeana_papers/J%C3%B6rgel+Briefe.parquet)|
|Österreichische Feuerwehrzeitung|1865-08-15|430|95|1872-06-02|1.2 MB|[download](https://static.benschmidt.org/europeana_papers/%C3%96sterreichische+Feuerwehrzeitung.parquet)|
|Österreichische Buchdrucker-Zeitung|1873-02-11|675|96|1875-12-30|1.9 MB|[download](https://static.benschmidt.org/europeana_papers/%C3%96sterreichische+Buchdrucker-Zeitung.parquet)|
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Day of DH Liveblog, 2022]]></title>
            <link>https://benschmidt.org/post/2022-03-28-day-of-dh</link>
            <guid>https://benschmidt.org/post/2022-03-28-day-of-dh</guid>
            <pubDate>Mon, 28 Mar 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[{.thread}
:::
{.tweet}
:::
I've never done the "Day of DH" tradition where people explain what,
exactly, it means to have a job in digital humanities. But today looks
to be a pretty DH-full day, so I think, in these last days of Twitter,
I'll give it a shot. (thread)
:::

{.tweet}
:::
We'll start it at the beginning--1:30 or so AM, finally sent out an
e-mail I'd been procrastinating on to the college grants administrator
for a public humanities project about immigrant histories I'm running
with @ellennoonan and Sibylle Fischer.
:::

{.tweet}
:::
We've had NYU funding as a Bennett-Polonsky Humanities Lab
(https://nyuhumanities.org/program/asylum-h-lab-2020-2021/) to this
point, but presenting to the history department last month clarified the
use in making one of our primary sorts of records--A files--more
accessible to historians and family researchers.
:::

{.tweet}
:::
But that will take some real institutional support, because the stuff
we've obtained--legally!--from US customs and immigration in our trial
run is so shockingly personal in a lot of cases that I can't really
share it yet.
:::

{.tweet}
:::
("Yet" is the wrong word--can't ethically share in my lifetime,
probably. But there are still really important reasons to work on
auditing these records especially. If you're a naturalized citizen or
permanent resident and want any help getting your own A-file, let me
know!)
:::

{.tweet}
:::
OK, skipping to about 9:50 AM. (Late start b/c the first-grader had a
school event and my wife teaches Thursday AM). Today's first teaching,
for my class https://benschmidt.org/WWD22 will be focused on 19C
directories from the NYPL.
:::

{.tweet}
:::
Nick Wolf and @bertspaan digitized these years ago, but there's more to
do with them. A couple weeks ago @SWrightKennedy shared a preview of
Columbia's great new geolocation data about 19C New York...
https://mappinghny.com/about/
:::

{.tweet}
:::
And yesterday I finally pushed a full pipeline bringing the last two
weeks of student work together for doing geo-matching and cleaning of
these to the github repo.
https://github.com/HumanitiesDataAnalysis/Directories . This should
allow some amazing analysis of economic geography, name types, etc.
:::

{.tweet}
:::
So now we've got 8.3m individual people for every year from 1850-1889
queued up and ready for a variety of analyses. I want to send the
students a map to show how all their R code is paying off, but the
deepscatter module is breaking--only one of the filters is working here.
:::

{.tweet}
:::
I spend 40 minutes poking in the web code there to try to refactor the
code to get the interface working right, but this isn't really relevant
for the class right now--more something for the summer, I guess. So I
give up and decide to do this DH tweeting instead.
:::

{.tweet}
:::
Because of the whole "Twitter is almost over" thing, but some lingering
guilt about not blogging enough, I decide that a "Day of DH" post should
really be a blog first--so let's finally structure some markdown for a
twitter thread that can go on benschmidt.org.
:::

{.tweet}
:::
It takes a surprising amount of mucking around with the svelte-kit
settings to get things publishing correctly, and I have to remember my
own markdown naming conventions. But after a few minutes, we've got full
recursion.
https://benschmidt.org/post/2022-03-28-day-of-dh/day-of-dh-22/
:::

{.tweet}
:::
Whoops, or not... Time to muck with svelte-kit a little more...
:::

{.tweet}
:::
Well, this is embarrassing but typical. Turns out there was a bug in the
bleeding-edge svelte-kit build that broke trailing slash behavior in
URLs. Because 'https://benschmidt.org/post/2022-03-19-better-texts/' is
different from 'https://benschmidt.org/post/2022-03-19-better-texts.'
Finally fixed.
:::

{.tweet}
:::
Insane levels of debugging is a real pain and occupational hazard. But
to be honest, I don't know how anyone could responsibly teach this stuff
without doing this sort of rebuilding and rescaling all the time. Every
one of those things is kind of interesting and builds up ability to fix
others' code...
:::

{.tweet}
:::
Insane levels of debugging is a real pain and occupational hazard. But I
don't know how you can responsibly teach this stuff without these
frequent rabbit holes. Every one of those things is kind of interesting
and builds up ability to fix others' code...
:::
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[A Rose for Ruby]]></title>
            <link>https://benschmidt.org/post/2022-02-01-ruby-for-emily</link>
            <guid>https://benschmidt.org/post/2022-02-01-ruby-for-emily</guid>
            <pubDate>Mon, 28 Feb 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[![Ruby Logo](https://www.ruby-lang.org/images/header-ruby-logo.png)
There are programming languages that people use for money, and
programming languages people use for love. There are Weekend at
Bernie's/Jeremy Bentham corpses that you prop up for the cash, and there
are "Rose for Emily" corpses you sleep with every night for decades
because it's too painful to admit that the best version of your life you
ever glimpsed is not going to happen.

It's time we had a hard talk about Ruby.

_This is part three of a series on Web Archives for the 2020s._

I was at a cafe in Ann Arbor in 2014 talking about coding with [Matt
Burton;](https://www.sci.pitt.edu/people/matthew-burton) he had just
discovered Docker, and was rhapsodically describing how magically it
transformed his workflow. At some point he mentioned something about
Ruby and how he was shifting away from using it, and a doleful-looking man
came over to commiserate over how the Ruby dream was fading away. It was
a good idea, it really figured something out, he said, but it had lost.
He then described whatever new thing *he* had been working on--not
Docker, maybe Go (I don't think I knew about Go yet), maybe something
else.

People talk about programming "languages," but the language is usually
the easy part; every programming environment is like a foreign city.
Perl was like a Renaissance fair with arcane and inconsistent rules,
filled with people pretending to be monks and issuing apocalypses and
generally orientalizing in a way that wouldn't be cool today. R is a
midwestern college town, orderly, a bit slow, behind the times in
certain ways but with great infrastructure. Go is Singapore, filled with
spaced-out modern infrastructure and more rules for your own good than
you'd like. Javascript is some post-imperial metropolis, filled with
merchants hawking possibly counterfeit wares in countless dialects, with
huge districts constructed without a building code and no overall map.

As a tourist in the landscape, Ruby right now feels like Detroit. In the
1950s, Detroit was an idea of growth, union-led households, orderly
grids, with the UAW ready to push racial integration. The infrastructure
is still there. But it's gutted; you keep going to a corner and finding
the buildings have been torn down. The Wax documents strongly recommend
`rvm` for managing versions, but the web page looks to be from a
decade ago and the key authentication doesn't even work. The core
version of Ruby was updated to 3.0 last year, removing a key dependency
(webrick) from the stdlib that makes Jekyll not work, and it seems not
to be a priority for the Jekyll team to immediately add it back in the
Jekyll requirements. Why? Presumably because so few people are starting
up new sites that new people moving to the platform is not a problem
that overwhelms them.

And it's slowwwwww. Wow. Those Hugo-adopters were right. So, so slow. In
Bookworm, I tokenize, reformat, and otherwise transform books all the
time. I've switched over to Pyarrow and polars to get faster
underpinnings; I can often do some operations on a thousand books a
second. Ruby, generating a piddling few dozen pages, can take a minute
or two. I wrote an entire Svelte-kit based wax clone just in the breaks
while waiting for my Wax pages to render. There's a truism out there
that developer time is far more valuable than compiler time, and that
all modern languages are fast _enough_. I've always thought that was
basically true. But that relies on a rough baseline of performance, on
someone periodically going through and pulling out the low-hanging fruit
by optimizing the slowest parts of a language. Jekyll's slowness is of a
different order.

I've never learned Ruby. Based on the love people show for it, I wish I
had. But I doubt I ever will. It should have been bigger. From
everything I've seen, it was better designed than Python. We'd all be in
a better place if the numpy/scipy/tensorflow stack had grown on top of
Ruby rather than Python. But it didn't. You don't move to a city
for the language they speak; you move there for the jobs, the
infrastructure, the culture, the people. You take care of what's left
there.

There are people left who still love Ruby, who will tell you that Jekyll
is a simple, classic, effective way to build web sites.

They are lost souls.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[From Hugo to Svelte-Kit]]></title>
            <link>https://benschmidt.org/post/2022-01-20-sveltekit-transition</link>
            <guid>https://benschmidt.org/post/2022-01-20-sveltekit-transition</guid>
            <pubDate>Sat, 22 Jan 2022 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[I've been spending more time in the last year exploring modern web
stacks, and have started evangelizing for
[SvelteKit](https://kit.svelte.dev/), which is a new-ish entry into the
often-mystifying world of web frameworks. As of today, I've migrated
this personal web site from Hugo, which I've been using the last couple
years, to sveltekit. Let me know if you encounter any broken links,
unexpected behavior, accessibility issues, etc. I figured here I'd give
a brief explanation of why sveltekit, and how I did a Hugo-to-SvelteKit
migration.

{#why-svelte-kit-for-a-personal-site}
# Why Svelte-Kit for a personal site?

I've had some kind of content up at benschmidt.org for over a decade;
and I've been using it as my primary outlet for blog posts for about
five years (although still posting occasionally on my [old blogger
site](https://sappingattention.blogspot.com) as well). For a time it was
hosted on Wordpress; for a time after that, on Hugo. I also have a large
number of other items living on benschmidt.org I've made over the years
that weren't integrated into the Hugo site; most are things like
standalone visualizations, which I'd like to keep all their
existing javascript but share a top bar with the rest of the site so
that people can link them back to me.

Hugo works fine for building compared to Wordpress: it gives a static
site solution that, unlike Wordpress, doesn't present security
vulnerabilities. I like, compared to Jekyll, that it's a quick build.
But it left me with a somewhat clunky set of pages for things like a
visualizations gallery. And although I picked a decent theme--Hugo
Academic--I never fully got on board with the weird way that you
basically end up having to learn to manage Hugo's build process through
a set of TOML and YAML files. I saw someone once decry the growing trend
to make people do things in YAML that are fundamentally _programming_;
although yaml is great for some things, learning the configuration
settings for some particular theme is generally frustrating.

Also, the pile-up of all these old web sites means the URL requirements
are a little finicky--I want to support some of the old wordpress links,
some of the Hugo-style links, and potentially bring in some blog posts
from other domains (bookworm.benschmidt.org, for instance, which had a
number of posts that I've entirely lost).

So here are the problems that Svelte-kit solves.

1. *Routing*. I never really figured out the URL setup for blog posts in
   the Academic theme for Hugo; and I have a number of old posts from a
   Hakyll setup I briefly explored in 2015-2017 before abandoning it.
   Svelte-kit's routing is incredibly powerful but also fundamentally
   understandable; every foldername on your computer is a directory in
   the url structure, `index.svelte` files turn into the base names, and
   you can use brackets like `/post/[postname]/index.svelte` to define
   dynamic variables where `/postname` is the filename. So right now,
   I'm writing a markdown file located at
   `2022-01-20-sveltekit-transition/index.md`, and checking in a browser
   window to make sure that the local version is correctly showing
   images and styles.

2. *Image Components*. This is a big one for me. For instance, I want to
   have [a gallery](/gallery) where I can just show visuals that will
   show a tile of images. And since it's the 2020s, that needs to look
   one way on desktop and quite a different way on mobile.

   *Desktop view*![Three gallery images side-by-side for desktop with
   no text visible](2022-01-23-14-34-39.png)

   *Mobile view*![Two gallery images stacked on top of each other with
   text visible](2022-01-23-14-35-58.png)

3. *Data Components*. I liked how Hugo Academic included a lot of basics
   for showing carousels of things like articles, but they were never
   _quite_ what I wanted. And for years I've been making my CV using
   Kieran Healy's template but compiled from yaml because yuck, latex is
   gross. That meant I was keeping up two different versions of pretty
   much the same data, which is a pain. With Svelte, I can just directly
   import the YAML to the CV page and format the data. For the time
   being, the online version is a little wonky because it's sort of a
   pain to iterate through. But it also means that I can easily abstract
   something like "upcoming talks" if I ever get it together enough to
   start handling talk invitations again. I can automatically have the
   website update the courses I've taught from the same file as the CV,
   with links to the course pages. Etc.

4. *CSS and themes*. CSS is incredibly powerful, and incredibly hard to
   use with most frameworks I've explored. One reason is that the CSS
   gets shunted off into some file somewhere called '/lib/app.scss' or
   something and it's never clear from the css which things are
   boilerplate, which are essential classes used everywhere, and which
   are not used on a site at all. Svelte natively solves this by
   allowing all components to have a style block at the bottom, scoped
   just to that file, so I can immediately understand the implications
   of editing a block. This is especially useful for someone like me who
   doesn't think much about colors but occasionally gets finicky about
   item placement.

It also works well alongside the tailwind CSS (non-)framework, which
I've been using a bunch lately when I know basically what I want to do
but don't want to think about how to define media queries. It provides a
bunch of classes.

5. *Integrating non-blog-content.* I have a lot of stuff hosted on
   benschmidt.org that doesn't have the theming from my personal
   website, and I periodically toss other things on. For instance, last
   week I wanted to share a seminar paper I wrote in grad school about
   the early years of the academic field of communications. Because I
   think putting this up will marginally increase the overall quality of
   the Internet, I just threw it up; and by running it through pandoc
   from the initial `.doc` files to HTML, I can just toss it into a
   folder and [have it show up with formatting and
   links](https://benschmidt.org/etc/lazarsfeld/) to my page. This would
   probably have worked in Hugo too; but as I start to incorporate some
   more elaborate javascript visualizations here, that will be harder
   and harder, at least without massively duplicating some common code
   libraries.

6. *Static serving with dynamic speed.* One of the things that drew me
   to Svelte and Svelte-kit immediately is their possibilities for
   static-site set-ups. Fancy web apps are fun--I have one for creating
   an archive I'll put out later--but I have a hard requirement that
   sites should be able to work indefinitely without javascript at least in
   some form. Svelte-kit with adapter-static does a wonderful job
   splitting the difference here, making an initial page load always
   land on a real, static site file but also allowing site navigation to
   not refresh all the shared elements on a page if Javascript _is_
   enabled.

{#hugo-to-svelte-kit}
# Hugo to Svelte-Kit

The last, and maybe most important, is that migration be _possible_. For
anyone else looking to switch, here are some Hugo-to-Sveltekit migration
notes.

1. Blog posts have got to stay in Markdown. I just chose to shove most
   of the contents of the Hugo tree into 'src/content', to live
   alongside 'src/lib' (which is for code) and 'src/routes'. It would
   also be possible to put posts into `src/routes` directly and use a
   markdown plugin to generate sites straight from the Markdown. I chose
   not to do this because at least in my preliminary exploration, svelte
   was trying to treat all `{}` blocks as interpolatable, which isn't
   what I want. Most of the hard work then happens in [a markdown
   parsing
   file](https://github.com/bmschmidt/sveltekit-benschmidt.org/blob/main/src/lib/markdown.ts)
   that just globs up all the markdown in that directory and parses it
   into HTML (and the YAML headers as JSON) using
   `vite-plugin-markdown`. This requires a little [tinkering with the
   `svelte.config.js`
   file](https://github.com/bmschmidt/sveltekit-benschmidt.org/blob/main/svelte.config.js).

{.js}
```
const urls = import.meta.globEager('/src/content/**/*.md');
```

The result is then an export that I can use on any page that contains
the metadata for all blogposts as data in reverse-chronological order;
although the [actual
code](https://github.com/bmschmidt/sveltekit-benschmidt.org/blob/main/src/routes/post/index.svelte)
has to do more to handle tag-based navigation, the skeleton of the
page is basically only this:

{.js}
```
<script>
  import {post_index} from '$lib/markdown.ts'
  import Postgroup from '$lib/components/Postgroup.svelte'
</script>

<Postgroup posts={post_index} />
```
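As a rough illustration of how an export like `post_index` could be assembled (a sketch only, not the code in the linked `markdown.ts`), assume each globbed file arrives as an object with an `attributes` key holding the YAML header and an `html` key holding the rendered body. Those key names, the `buildPostIndex` function, and the `title`/`date` header fields are all assumptions made for the example.

```javascript
// Hypothetical helper: turn a map of { filePath: parsedMarkdown } into a
// list of post metadata sorted newest-first.
// Assumed shape of each value: { attributes: { title, date }, html }.
function buildPostIndex(modules) {
  return Object.entries(modules)
    .map(([path, mod]) => ({
      // '/src/content/post/my-slug/index.md' -> 'my-slug'
      slug: path.split('/').slice(-2, -1)[0],
      title: mod.attributes.title,
      date: new Date(mod.attributes.date),
      html: mod.html
    }))
    // Subtracting Date objects compares their timestamps; b - a puts newer posts first.
    .sort((a, b) => b.date - a.date);
}
```

Keeping this as a plain function over an object means the same index can feed the blog listing, a tag page, or a feed without touching the filesystem again.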

So now I have canonical URLs for posts at /post/slugname/, without year
and month as part of a tree. All the messy old urls are still supported,
though, by alternate routing endpoints that just comb through the
metadata for those posts to try to determine what you're looking for.
This is unlikely to catch everything at first, but I can comb through
server logs to see where I'm contributing to link rot and easily set up new
rules.
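The "comb through the metadata" step can be sketched as a small pure function. This is an illustration under stated assumptions rather than the site's actual endpoint: the `resolveLegacyPath` name, the `slug` field, and the two matching rules (exact slug, then a date-prefixed slug ending in the requested name) are all hypothetical.

```javascript
// Hypothetical post metadata; only a `slug` field is assumed.
const examplePosts = [
  { slug: '2022-01-20-sveltekit-transition' },
  { slug: '2021-09-15-increasingly-stealthy' }
];

// Map an old-style path onto the canonical /post/<slug>/ URL, or return null.
function resolveLegacyPath(path, posts) {
  const segments = path.split('/').filter(Boolean);
  if (segments.length === 0) return null;
  // Use the last path segment, ignoring a trailing .html or .htm extension.
  const last = segments[segments.length - 1].replace(/\.html?$/, '').toLowerCase();
  // Rule 1: the segment is already a full slug.
  let match = posts.find((p) => p.slug.toLowerCase() === last);
  // Rule 2: the segment is a slug without its date prefix,
  // e.g. /2021/09/15/increasingly-stealthy/.
  if (!match) {
    match = posts.find((p) => p.slug.toLowerCase().endsWith('-' + last));
  }
  return match ? `/post/${match.slug}/` : null;
}
```

New rules discovered in the server logs would slot in as further fallbacks before the final `return`.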

Non-blog pages are routed through a catchall endpoint that just finds
the matching markdown file and compiles it. Easy-peasy. For the pages
where I want to start doing something more complicated or data-driven,
like the blog index, the dataviz gallery, or the CV, I write a custom
Svelte component or page.

There's something kind of lovely about the basicness of all this on the
core level. If I want a blog feed--yes, I do!--I just define a route at
`/index.xml` that throws back something from whatever node package I can
find that generates atom XML.

Is this flawless? Definitely not--I'm sure there will be plenty of
broken links soon. But I'm hopeful it will give me a nicer platform to
bundle stuff together onto the Web. And as I've become even more
evangelical about web publishing during this pandemic, that's important
to me.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Increasingly Stealthy]]></title>
            <link>https://benschmidt.org/post/2021-09-15-increasingly-stealthy</link>
            <guid>https://benschmidt.org/post/2021-09-15-increasingly-stealthy</guid>
            <pubDate>Wed, 15 Sep 2021 16:41:56 GMT</pubDate>
            <content:encoded><![CDATA[Scott Enderle is one of the rare people whose [Twitter
pages](https://twitter.com/scottenderle) I frequently visit, apropos of
nothing, just to read in reverse. A few months ago, I realized he had at
some point changed his profile to include the two words "increasingly
stealthy." He had told me he had cancer months earlier, warning that he
might occasionally drop out of communication on [a
project](https://github.com/senderle/bookworm-compose) we were working
on. I didn't then parse out all the other details of the page---that he
had replaced his Twitter mugshot with a photo of a tree reaching to the
sky, that the last retweet was my friend Johanna introducing a journal
issue about "interpretive difficulty"---the problems literary scholars,
for all their struggles to make sense, simply *can't solve.* I only
knew---and immediately stuffed down the knowledge---that things must
have gotten worse.

There's a terrifying grace in that preparation. We've all seen the
digital desiderata of the dead. Usually they're painful in how they
present someone going through ordinary motions who is now stilled;
sometimes they're wrenching because they narrate a fight in process that
we know the person---like everyone---is destined to lose. Scott found it
in him to prepare a kind of reassurance. He still cared what we all were
saying, but was in the process of pulling off a little magic trick.
Someday, soon, he would disappear into full stealth. The man was a
writer, and I wonder if he started off with some more conventional words
the Internet uses to describe this action---"mostly lurking,
nowadays"?---before editing it up to something a little more marvelous.

A number of the testimonials to Scott I've seen since he died last
Saturday emphasize his kindness, his decency, and his generosity. I've
been thinking about how his stealthiness buoyed all of those. In my life
he would just pop up from time to time through one window of the
Internet or another, always a reassuring and welcome presence. In most
senses I barely knew Scott. We never even met in person---we talked
about doing so a few times, but even barely a hundred miles apart, it
was easier for us with little kids to push it off. And his thoughts were
generally so rich that it was easier to digest them through flurries of
e-mails, blog comments, github issue threads. Once there were so many
e-mails in a short period that we had to switch to the telephone to talk
about vector algebra, although we were quickly talking about something
else entirely. As the rest of the world switched to video-conferencing
the last few years, I at least got to see his face.

But I primarily knew Scott as this intensely helpful, mentally probing
figure that made writing, reading, and coding online _rewarding_. I'd
often be chomping against some interpretive difficulty of my own,
looking for the answer to some obscure question and find that it was
Scott [who had answered it years
before](https://stats.stackexchange.com/questions/179010/difference-between-pointwise-mutual-information-and-log-likelihood-ratio).
He was, I only now thought to check, one of the most helpful answerers
of all time on Stack Overflow, the question-answer site that makes
modern coding possible. ([To give the
numbers](https://stackoverflow.com/users/577088/senderle?tab=topactivity):
128,633 reputation so far, number 681 out of 15,000,000 registered
users. He was there only to help: 859 questions answered and only three
questions ever asked). The first time I became aware of Scott online was
when he asked a kind and incisive question on Twitter about the meaning
and metaphors of the Fourier transform that immediately jolted me into a
clearer understanding of a problem I had been wrestling with for weeks.
In subsequent conversations this would happen again and again. This gift
was real, and he spread it far more widely than most. I know that there
are many for whom his loss is a deep, personal rift; maybe it helps to
know how long the tail of that loss goes. Before there were radio waves,
to 'broadcast' meant to throw seeds as widely as you could while
planting, sowing the whole field. In a field where drilling down and
holding ideas tight can be overprivileged, Scott was a broadcaster.

I wonder if one reason Scott could afford to be so egoless in his professional
interactions was because his intellect was so utterly distinctive. I
quickly came to know which kinds of questions were those I craved his
insight on, but I never had any idea which direction he would take a
problem. One thing I found intensely admirable was how confidently he
would hold to a metaphor or an idea that would have no place in the
universe if not for Scott---treating word vectors through the theory of
algebraic sets, rehabilitating Fourier transforms for document encoding,
most recently interpreting language models as thermodynamic partition
functions. The ones that excited him most cut across mathematics,
language and metaphor with striking new routes no one would think to
take. Even as he solved other people's problems, he always found ways to
refresh the global reservoir with more interesting ones.

Although he wrote everywhere, one of the places our tracks most
overlapped was in the last years of personal blogging---one of the
reasons I feel compelled to set down something here. Scott's blog, _The
Frame of Lagado_ (look up the reference if you don't know it), wasn't a
long-term project, but like everything else, it helped people think how
to think. [The last entry](http://www.lagado.name/blog/) is a wry,
funny, self-deprecating farewell to the medium for characteristically
independent reasons. Evidently Scott somehow figured out how to set up
Wordpress using sqlite instead of MySQL, which is not something which
would ever occur to most people. Evidently, also, this proved to be
untenable. As he posted more and more and the blog got longer and longer, the
whole thing slowed to a crawl under the weight of his words. I
remembered this as a purely comical piece, but on returning to it after
thinking about Scott for most of the last 24 hours, I noticed that he
had ended it with a promise and a quick quotation from a part of First
Corinthians I principally know from the German Requiem. That context is
appropriate. "For we have here no continuing city, but we seek the
future. Behold, I show you a mystery."

> At some point, I will create a much better blog and republish some or
> all of the old posts here. _We shall not all sleep, but we shall all
> be changed._
>
> In the meanwhile --- thanks for reading.

Thanks, Scott.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Genre, Manifolds, and AI.]]></title>
            <link>https://benschmidt.org/post/genres-and-manifolds</link>
            <guid>https://benschmidt.org/post/genres-and-manifolds</guid>
            <pubDate>Mon, 07 Jun 2021 18:35:21 GMT</pubDate>
            <content:encoded><![CDATA[This [article in the New
Yorker](https://www.newyorker.com/magazine/2021/03/15/genre-is-disappearing-what-comes-next)
about the end of genre prompts me to share a theory I've had for a year
or so that models at Spotify, Netflix, etc, are most likely not just
removing artificial silos that old media companies imposed on us, but
actively destroying genre without much pushback. I'm curious what you
think.

This aligns with the most important rule for thinking about artificial
intelligence, which is that its deleterious effects are most likely in
places where decision makers are perfectly happy to let changes in
algorithms drive changes in society. Racial discrimination is the most
obvious field where this happens. But there are others where the moral
valence is less clear, which are mostly being ignored.

_Background: I'm participating in a [roundtable at the American
Historical Association
tomorrow](https://us02web.zoom.us/webinar/register/WN_zaX-6GpgTTaQEveJ8BulOQ)
on Artificial Intelligence and its implications for the future of
historical research. It made me realize that while I've been fiddling
quite a bit with neural networks, and [used them in my article on
dimensionality reduction in digital
libraries](https://culturalanalytics.org/article/11033-stable-random-projection-lightweight-general-purpose-dimensionality-reduction-for-digitized-libraries),
I haven't actually reflected much on them. Some of that will hopefully
appear in the published partner to the AHA panel._

I teach a course on the [history of
data](http://benschmidt.org/bigdata20/), and one primary lesson is that
indexes shape what kind of culture people use. So with modern culture,
what kind of indices do we use? When I did [college
radio](https://www.whrb.org/), the most important
resource was a set of huge printed binders listing every piece in the
station's music library, printed out twenty years before; there were
different binders by artist, by composer, etc. But by far the most
useful one was a listing of every track in the collection by _time_;
you'd know you needed an instrumental piece that was between 18 and 19
minutes long to close out a shift, and you could retrieve it instantly.
How things were stored affected what got played.

The promise of digitization is unconstrained reconfiguration; indexes
like this shouldn't matter anymore. But of course we still have indexes,
and I wonder if they aren't doing something quite weird.

{#unevenly-distributed-high-dimensional-spaces-privilege-non-conformism}
# Unevenly distributed high-dimensional spaces privilege non-conformism

The theory is this. If you assume that music is distributed in a
high-dimensional feature space (as the streaming services surely do) the distribution of
pieces in that space is almost sure to be highly uneven. Some areas
(recordings of the Beethoven string quartets) will be densely populated;
others (suites for toy piano) will be quite sparse.

If you then use k-nearest neighbors approaches to serve up
recommendations for music (Spotify built the best-known library, so we
know that they use it), you'll likely hit music on the periphery of its
local clusters far more often than music at the center.

Here's a simple 2-d analogy. Imagine an alien crashing into a random
point on Earth and searching for the nearest human to say "take me to
your leader." The odds are they'll find someone rural; and it's
basically guaranteed they'll hit a suburbanite before an urbanite unless
they happen to crash into the middle of Central Park. They're more
likely to meet a Russian speaker than a Chinese speaker. And so on.

Spotify isn't serving up songs randomly, but I wonder how much a similar
dynamic comes into play when each person is turned into a vector to
predict their next streams.
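The alien analogy is easy to simulate. The sketch below is a toy model, not anything Spotify runs: it drops a dense "city" cluster and a handful of scattered "rural" points into a unit square, picks random landing sites, and counts which kind of resident is nearest by brute force. The function names, the population sizes, and the cluster geometry are all assumptions chosen for illustration.

```javascript
// Small seeded generator (mulberry32) so that runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Count how often a random landing site is nearest to a city vs. a rural resident.
function simulateLandings({ nCity = 200, nRural = 20, nLandings = 2000, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  const people = [];
  // Dense cluster: many "city" residents packed into a tiny square at the center.
  for (let i = 0; i < nCity; i++) {
    people.push({ x: 0.49 + 0.02 * rand(), y: 0.49 + 0.02 * rand(), kind: 'city' });
  }
  // Sparse background: a few "rural" residents spread over the whole square.
  for (let i = 0; i < nRural; i++) {
    people.push({ x: rand(), y: rand(), kind: 'rural' });
  }
  const hits = { city: 0, rural: 0 };
  for (let i = 0; i < nLandings; i++) {
    const x = rand();
    const y = rand();
    let nearest = null;
    let best = Infinity;
    // Brute-force nearest neighbor using squared Euclidean distance.
    for (const p of people) {
      const d = (p.x - x) ** 2 + (p.y - y) ** 2;
      if (d < best) {
        best = d;
        nearest = p;
      }
    }
    hits[nearest.kind] += 1;
  }
  return hits;
}
```

With the defaults, city residents are 200 of the 220 people, yet the landings are decided by how much territory each point commands rather than by how many neighbors it has, so the rural minority collects most of the hits.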

When I browse around [this vector representation of all the books in the
Hathi Trust I made](http://creatingdata.us/datasets/hathi-features/),
genre outliers just tend to pop out naturally. I love these, because
they're intrinsically interesting; I end up finding--for instance--a
book telling the history of England in doggerel just as often as I find
"normal" poetry.

:::
![A doggerel poem from an 18th century book.](/img/doggerel.png)

{.caption}
:::
A doggerel poem from an 18th century book.
:::
:::

For those of you who can't read it:

> And trade's embarrassment redoubles.
>
> If I mistake, 'tis your's to judge it,
>
> But only overhaul the Budget
>
> Which, for the service of the year,
>
> Will millions, twenty-three appear ;
>
> Thousands, seven hundred fifty-six,
>
> And hundreds, (as accountants fix,)
>
> Some one or two ; a sum so great
>
> Had ne'er before disturb'd the state;

But I'll certainly get the wrong idea about what sort of books exist in
the library if I assume that the elements that pop out in the less dense
areas are more typical. To some degree, Spotify is
doing the same thing with music. And instead of cities and rural people,
we have dense established genres.

{#what-is-spotify-recommending}
# What is Spotify recommending?

My thinking about this has been heavily influenced by thinking about
what I listen to now that I mostly use Spotify rather than my own
digitized CDs for music listening. A typical example that Spotify's
recommendation algorithm surfaced for me quite early on is the music of
the Austrian theorbist and composer Christina Pluhar, who puts together
ensembles which, depending on your taste, are enchanting, insufferable,
or inane. Here's a track from her album of Purcell arrangements.

I like this. I have no idea if you do; and I don't know exactly why
I was recommended it. But if you assume this came out of a
nearest-neighbor search in some region of a high-dimensional space, it's
easy to imagine why. This is an album that sits at an intersection of a
bunch of different styles; recklessly loose early-music bands,
non-traditional world music borrowers, Leonard Cohen "Hallelujah"
completists. Not something for everyone; but something for several
different groups that ensures it will float way off in its own region of
an embedding space.

What doesn't Spotify surface? That's a much harder question to answer.
But I know that the only album _I've_ recommended to anyone recently
probably fits the bill quite nicely; the last couple disks of the
Beaux-Arts Trio's complete Haydn Piano Trios.

Without streaming, this is probably a less obscure disk than the Pluhar,
but it's also pretty damned obscure in its own right. The Haydn trios
are far less often played than his string quartets or symphonies because
those two genres ended up becoming more prestigious, but _Haydn_ didn't
know that, and the music is equally good. And while late Haydn has
its own deeply appealing weirdness, I find it hard to imagine that there
are any existing listeners out there who come to it before mostly
exhausting their path through the Beethoven piano sonatas and Mozart
quartets. Pluhar's music is sitting in a cabin in the Canadian
woods waiting for any comers; Haydn's trios are crammed into several
walkups in Ditmars Steinway along with C.P.E. Bach, Boccherini, the rest
of the less-trafficked classical canon.

{#is-genre-disappearing-everywhere}
# Is genre disappearing everywhere?

So is this happening everywhere in culture? The degree to which it's an
algorithmic product isn't clear, but it sure seems like the streaming
services have settled into a bubble of half-hour unclassifiable formats
rather than "sitcoms" and "dramas." Netflix's "[personalized
genres](https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/)"
are not the product of an embedding system, but do play naturally
alongside one, because they generate an affinity for works that cut
across different realms.

The causation here is complicated because, as with many other trends,
the technology merely gloms onto a larger cultural trend. It seems quite
possible to me that music recommendation services succeed right now
because the zeitgeist is aligned in a way where many people are amenable
to being served these kinds of hybrid works. If you want pure genre,
there are ways of getting it; modern satellite radio stations, like
Sirius-XM, will give you all the expert-curated music you want for any
microgenre imaginable.

Is this a problem? I don't think many would see it as such; but it's
worth thinking, nonetheless, about what it does to culture. Anyone who
manages to occupy an empty space in the cultural manifolds will be
richly rewarded; anyone who tries to stay in a heavily-trafficked space
will languish. The idea of cultural areas as fluid, non-differentiable
groups flowing into each other will be a self-fulfilling prophecy;
anyone insisting that genres are real may seem hopelessly old fashioned.
Anyone who navigates cultural spaces through digital means will be
over-exposed to hybrid cultural forms, which will only lead them further
to think that the different genres were an old-fashioned illusion,
brought about by a particular set of constraints around channels, record
labels, and the rest. And of course they'll be right. But if they think
there's anything more natural about an enforced space emphasizing
novelty, sparsity, and so forth, they'll be wrong; and a cultural
dynamic around filling in the valleys of a manifold space rather than
building up the summits may be less rewarding than we hope.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Guide to Digital Publishing]]></title>
            <link>https://benschmidt.org/post/2021-05-20-alignment</link>
            <guid>https://benschmidt.org/post/2021-05-20-alignment</guid>
            <pubDate>Thu, 20 May 2021 15:26:37 GMT</pubDate>
            <content:encoded><![CDATA[I've been yammering online about the distinctions between different
entities in the landscape of digital publishing and access, especially
for digital scholarship on text. So I've collected everything I've
learned over the last 10 years into one handy-to-use chart on a
10-year-old meme. The big points here are:

1. HathiTrust and JSTOR are not for-profit cartels, and I can't count
   the number of times I've seen faculty and other researchers attack
   them for not being open enough when they're just following laws,
   especially around nonsense justifications for keeping scholarly work
   out of the public domain, that faculty continually reinforce (through
   paranoia about, say, disembargoing a dissertation or publishing in an
   open-access journal that lacks prestige or, God forbid, a journal
   that skips the tree-killing stage entirely).
2. Stop publishing on Medium, goddammit! I'm not paying to read your
   blog post! You're not going to make any money off of this! If the
   Huffington Post isn't paying you and you don't know how to set up a
   webserver, just get a Wordpress account and pretend that you're doing
   it for old-school cool. Come on, pull it together.
3. There are three places where change happens here. One is that the
   neutral Goods--Archive.org, especially--pull the lawful goods into
   slightly more open practices by doing good things and not getting
   sued. One is that the chaotic goods--the pirate sites--undermine the
   business model of the cartels in the lower left and keep them from
   changing things for the worse. And the last is that the faculty--the
   chaotic neutrals--pin this chart next to that shirtless picture of
   Zizek and stop publishing and demanding subscriptions to cengage
   content because it's easier.

The common objections are:

1. *Google's in the wrong place.* I think you mean Alphabet. Yes, it
   sure is. It's a monopoly; it contains multitudes. If there were a
   slot in this for a fickle Old Testament God on which all else relies
   that punishes and rewards in equal measures--yeah, I'd use that
   instead. But it is what it is.
2. *JSTOR's not good.* Disagree. That's the whole point here; we need
   something that isn't gouging out our eyeballs in the scholarly
   journal space, and JSTOR is a not-for-profit targeted at nonexpert
   users that tries to keep pace.
3. *What about Aaron Swartz?* Why does this keep coming up? No, JSTOR
   did not kill Aaron Swartz. First off, it was the US Attorney who
   insisted on going through with it. Go [read the MIT
   report](http://swartz-report.mit.edu/) and you'll see that JSTOR
   called for the prosecution to be dropped the day he was arrested,
   while MIT refused to issue a public statement for months.
4. *You forgot my favorite pirate site*. I did! There are a lot of
   them, huh?
5. *Seriously, Medium?* STOP PUBLISHING ON MEDIUM PEOPLE I AM NOT PAYING
   FOR YOUR BLOG POST I DON'T UNDERSTAND WHY PEOPLE ARE PRETENDING THIS
   IS SOMEHOW ANYTHING OTHER THAN JUST A WORSE VERSION OF BLOGSPOT.COM
6. *Google's on it twice.* Font choice.

Credits for suggestions to Alex Humphreys, Ted Underwood, Scott
Weingart, Melissa Terras, Rachel Midura, Will Hanley, Ethan Gruber.

:::
![Permalink](/img/alignment.png)

{.caption}
:::
Permalink
:::
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Moving from MySQL to DuckDB]]></title>
            <link>https://benschmidt.org/post/2021-04-28-duckworm</link>
            <guid>https://benschmidt.org/post/2021-04-28-duckworm</guid>
            <pubDate>Wed, 28 Apr 2021 13:57:51 GMT</pubDate>
            <content:encoded><![CDATA[I mentioned
[earlier](http://benschmidt.org/post/2021-03-07-bookworm-caching/bookworm-caching/)
that I've been doing some work on the old Bookworm project as I see that
there's nothing else that occupies quite the same spot in the world of
public-facing, nonconsumptive text tools.

That codebase is *old*--pieces of it [date back to this blog post from a
decade
ago](http://sappingattention.blogspot.com/2011/02/technical-notes.html).
Parts of that old architecture (e.g., Perl) got quickly jettisoned (for
Python). But others persist. In re-examining the technical stack behind
Bookworm, I've realized that it's finally possible to jettison one of
the biggest pain points--MySQL--for something that better matches the
workflows here.

People often ask about Postgres, but I'm moving to something a little
bit more unexpected--the 2-year-old program
[DuckDB](https://duckdb.org/). This might seem like an odd choice! The
core data architecture challenge of Bookworm is managing some enormous
tables for storing a sparse matrix--the term-document matrix--for a
large number of long documents. The HathiTrust bookworm has about 2
*trillion* words in 17 million books--I haven't looked at the core
tables recently, but I'd guess they have tens of billions of rows.

DuckDB, on the other hand, is manifestly targeted at a much smaller
size--it borrows intensely in footprint from SQLite by using the
SQLite shell, existing only as an embedded process in a running program
(i.e., no daemon), and putting each database into a single movable file. I
never seriously considered SQLite as a Bookworm backing, because it's
too lightweight to handle enormous tables, because at the time of the
original design I only knew how to write single-startup CGI scripts, and
because MySQL gives intense options for tweaking performance on the
margins. (Back in 2010-11 I got very used to using 3-byte unsigned
integers, which can store values up to about 16 million, for ids, since
they're actually a convenient size; it took me a while to realize that
3-byte integers are an extraordinarily unusual thing.)

{#column-stores}
## Column stores

But DuckDB has some major advantages. For one thing, it uses
column-oriented stores, which means rather than store rows of
interspersed data types, like MySQL, it groups primarily by the
values--so you get all the counts as a series of integers, all the
wordids as a series of integers, etc. For performance, Bookworm has
always encoded words to integers under the hood; there are a variety of
performance advantages to this form of storage. The costs mostly tend to
be things that don't matter in analytics (like it being harder to update
a single customer record in a table with their latest purchase). That's
why DuckDB exists--as something that will work better for analytics
from Python or R than SQLite. And the basic design is probably
better conceived than SQLite because it's starting from the ground up;
it uses the Postgres parser and supports modern SQL reasonably well. For
the large joins that accompany a typical Bookworm query (in which you
declare which 1 million out of 10 million teacher evaluations you want
to look at), this works well.

*Here's a dumb analogy for column stores*. Imagine your data as being a
bunch of different cookies. Addresses are Oreos, dates are chocolate
chips, whatever. And you've got different types of values in there--some
people live at doublestuff lane, some with those weird mint green oreos.
The point of a column store is to keep all the oreos in line with each
other because they're the same shape.

:::
![Yum](https://lh3.googleusercontent.com/proxy/B8xrFw2x02kTedqwNbWvb0JxSHIpRwpyasab7fjmph4HZvQTvjAnXbk2xVrbrKfhjFk_ZYJ0OmfjrS6nx_1latxfWidid9MQfnEtnTZdGhefBCgSGA)

{.caption}
:::
Yum
:::
:::

Each sleeve is clear, so you can get an idea what's inside it, but it's
also nicely shaped, so you can quickly pass it along to the next person.
Imagine you've got a state-champion 400m relay team running around a track
passing cookies to each other. Every team will do better if, instead of
passing a motley arrangement of cookies to the next runner, they can just
hand off a single baton of oreos in a sleeve. That's what a column store
does.
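
To make the sleeve-of-oreos idea concrete, here is a minimal sketch in
plain Python of the same tiny table stored both ways. The field names
`wordid` and `count` follow the post; `bookid` and the two helper
functions are hypothetical, and nothing here reflects how DuckDB lays
out its files internally.

```python
from array import array

# Row-oriented: each record keeps its differently-typed fields together.
rows = [
    (1, 10, 3),  # (wordid, bookid, count)
    (2, 10, 5),
    (1, 11, 4),
    (3, 11, 1),
    (2, 12, 2),
]

# Column-oriented: one typed, contiguous array per field.
columns = {
    "wordid": array("I", (r[0] for r in rows)),
    "bookid": array("I", (r[1] for r in rows)),
    "count": array("I", (r[2] for r in rows)),
}


def total_count_rows(rows, wordid):
    """Sum counts for one word; every field of every row gets touched."""
    return sum(count for w, _, count in rows if w == wordid)


def total_count_columns(columns, wordid):
    """Sum counts for one word; the bookid column is never read."""
    pairs = zip(columns["wordid"], columns["count"])
    return sum(count for w, count in pairs if w == wordid)


print(total_count_rows(rows, 1), total_count_columns(columns, 1))
```

Both functions give the same answer; the difference is that the
columnar version only walks the two arrays the question needs, and each
array is a single block of same-sized integers.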

{#indexing}
## Indexing

While the relational queries against catalog tables are important, the
most difficult part of any bookworm query is accessing the individual
word counts--those 50-billion-row tables of the term-document matrix.
What MySQL did for us there was to allow the creation of fast b-tree
indices that put related rows together on disk. This was often the most
time-consuming task, because MySQL index creation could take a week on a
really huge table; and it left the indices far larger than the actual
tables themselves. (In fact, the design of the database was such that
the original table is never used--queries only ever read from the
index.) The default MySQL settings made it very difficult to create
these indices as well.

DuckDB uses mostly [block range
indexes](https://en.wikipedia.org/wiki/Block_Range_Index), which tell
you roughly what part of the file any given dataset might be in, and
don't sort the underlying data. This is faster, but wouldn't allow for
quick lookup in a big table--you'd end up scanning almost everything.

But there's a trick here, which is to *sort the data first* before
putting it into DuckDB. If the term-document matrix is sorted by
`wordid`, all of the occurrences for each word will be right next to
each other, just as with the MySQL index. It's probably not _quite_ as
fast for retrieval, but the column-oriented structure that comes out can
race ahead on the subsequent joins. Pre-sorting isn't trivial, since
we're talking about far more data than fits in memory. But pyarrow
exposes some strikingly fast `pivot` methods for partially sorting
arrays, which makes it possible to shuffle things around without fully
sorting. This matters, because conventional merge sorts involve
entirely sorting each subarray before merging--that can be extremely
time-consuming for little benefit in a column-oriented situation where a
record is not contiguous to itself.
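
A rough sketch of why the pre-sort matters, assuming a block range
index is nothing more than a (min, max) pair per fixed-size block. The
function names `build_block_index` and `blocks_to_scan` are made up for
illustration, and this is a toy model rather than DuckDB's actual index.

```python
import random


def build_block_index(values, block_size):
    """Record the (min, max) of each fixed-size block of a column."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def blocks_to_scan(index, target):
    """Return the numbers of the blocks whose range could hold the target."""
    return [n for n, (lo, hi) in enumerate(index) if lo <= target <= hi]


random.seed(0)
# 100,000 rows of a wordid column drawn from a 1,000-word vocabulary.
wordids = [random.randrange(1000) for _ in range(100_000)]

unsorted_hits = blocks_to_scan(build_block_index(wordids, 1000), 42)
sorted_hits = blocks_to_scan(build_block_index(sorted(wordids), 1000), 42)

# Unsorted, nearly every block might contain word 42; sorted, only the
# block or two where its rows actually sit need to be read.
print(len(unsorted_hits), len(sorted_hits))
```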

In ignorance of the best way to handle this, I've coded up a new routine
that does sorts in three passes:

1. Splits each input batch into 16 pieces.
2. Sorts those batches, and then continuously finds the least sorted 16
   contiguous batches, combines them into a new table, and then breaks
   them into 16 new non-overlapping batches.
3. Once the order is barely stable enough to ensure that a single merge
   pass will work, traverses in order for a merge sort.

This algorithm seems pretty neat to me, but I have no idea if it's
especially good or even if it's guaranteed to converge on a sorted
array. In any case, it's much, much faster than the old MySQL index
creation was and has a much smaller memory footprint.
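
For readers who want something to poke at, here is a small in-memory
sketch of the three passes using plain Python lists instead of pyarrow
tables. Several things are assumptions rather than the real routine:
the names `chunk`, `disorder` and `three_pass_sort`, the choice to
measure "least sorted" by counting neighbouring batches whose ranges
are out of order, and the decision to keep repairing until no such pair
remains rather than stopping as soon as a merge becomes feasible.

```python
import heapq
import random


def chunk(values, k):
    """Cut a list into at most k contiguous pieces of near-equal length."""
    size = max(1, -(-len(values) // k))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def disorder(window):
    """Count neighbouring batches whose value ranges are out of order."""
    return sum(1 for a, b in zip(window, window[1:]) if a[-1] > b[0])


def three_pass_sort(input_batches, k=16, max_rounds=10_000):
    # Pass 1: split every input batch into k pieces and sort each piece.
    batches = [sorted(piece)
               for batch in input_batches
               for piece in chunk(batch, k)]
    if not batches:
        return []
    # Pass 2: repeatedly rebuild the least sorted run of k contiguous batches.
    for _ in range(max_rounds):
        width = min(k, len(batches))
        starts = range(len(batches) - width + 1)
        worst = max(starts, key=lambda s: disorder(batches[s:s + width]))
        window = batches[worst:worst + width]
        if disorder(window) == 0:
            break  # every neighbouring pair of batches is already in order
        combined = list(heapq.merge(*window))
        batches[worst:worst + width] = chunk(combined, k)
    # Pass 3: a single merge pass. Each batch is sorted internally, so the
    # result is correct even if pass 2 gave up before finishing.
    return list(heapq.merge(*batches))


random.seed(1)
data = [random.randrange(10_000) for _ in range(2_000)]
result = three_pass_sort(chunk(data, 5))
assert result == sorted(data)
```

In this simplified form each round sorts one contiguous run of rows, so
it can only remove out-of-order pairs, never add them. Whether the same
argument carries over to the real out-of-core version is a separate
question.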

Once the table is sorted, it's just a matter of loading it into duckdb.
The final write happens to a massive parquet file, which can be written
without holding the whole table in memory; then duckdb can ingest it
straight into its database format.

DuckDB doesn't yet support compression or a stable on-disk format, but
the pace of development is fast enough and impressive enough that I'm
willing to take a bet on it. Especially because we never used
compression in MySQL, either.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Javascript and the next decade of data programming]]></title>
            <link>https://benschmidt.org/post/2020-01-15-webGPU</link>
            <guid>https://benschmidt.org/post/2020-01-15-webGPU</guid>
            <pubDate>Mon, 08 Mar 2021 14:52:01 GMT</pubDate>
            <content:encoded><![CDATA[![](/img/spiral_preview.png)

I've recently been getting pretty far into the weeds about what the
future of data programming is going to look like. I use pandas and dplyr
in python and R respectively. But I'm starting to see the shape of
something that's interesting coming down the pike. I've been working on
a project that involves scatterplot visualizations at a massive
scale--up to 1 billion points sent to the browser. In doing this, a few
things have become clear:

1. Computers have gotten much, much faster in the last couple decades.
2. Our languages for data analysis have failed to keep up.
3. New data *formats* are making the differences between Python, R, and
   Javascript less important.
4. Javascript, the quintessential front-end language, is increasingly
   becoming the _back-end_ for data work in Python and R.
5. Things will be weird, but also maybe good?

I tweeted about it once, after I had experimented with binary,
serialized alternatives to JSON.

As webgpu and new binary serialization formats--like Arrow--come of age,
it's going to be harder and harder to stomach geojson's slowness. More
and more of R and python will become js or wasm wrappers. Just like in
the 2000s they were wrappers around Java. It'll be very weird.

--- Benjamin Schmidt (@benmschmidt) December 23, 2020

I'm writing about Python and R because they're completely dominant in
the space of data programming. (By data programming, I mean basically
'data science'; not being a scientist, I have trouble using it to
describe what I do.) Some dinosaurs in economics still use Stata, and
some wizards use Julia, but if you want to work with data that's
basically it. The big problem with the programming languages we use to
work with data is that they run largely on CPUs, and often predominantly
on a single core. This has always been an issue in terms of speed; when I
first switched to Python around 2011, I furiously searched for ways around
the GIL (global interpreter lock) that keeps the language from using
multiple cores even on threads. Things have gotten a little better on
some fronts--in general, it seems like at least linear algebra routines
can make use of a computer's full resources.

{#js-html-is-the-low-level-language-for-ui-and-python-and-r}
## *JS/HTML is the low-level language for UI and Python and R.*

Separately, the graphical and interface primitives of all programs have
started to move to the web. If I had started doing this kind of work
seriously even a couple years later, I would never even have noticed
there used to be another way. I never really used
[tcl/tk](https://www.tcl.tk/) interfaces in R, but I was always aware
that they existed; the very first, private version of the Google
Ngrams browser that JB Michel wrote in like 2008 or something was built
around some Python library. This was normal. But in the last decade,
it's become obvious that if you want to build user-facing elements to
describe something like "a button" or "a mouseover", the path of least
resistance is to use the HTML conception, not the operating system
conception of them. The [fifteen-year-old
freshman](https://www.linkedin.com/in/martin-camacho-245a7b76) who built
the first Bookworm UI quickly saw it needed a javascript plotting
library. This integration is becoming tighter and tighter in data
programming land. I have collaborators and grad students who transition
seamlessly into bundling their R packages into Shiny apps, into
decorating their Google colab notebooks with all sorts of sliders and
text entry fields, into publishing R and Python code as online books
with HTML/JS navigation.

Jupyter notebooks and the RStudio IDE _themselves_ are part of this
transformation; what appears to be Python code is held together by an
invisible skein of Javascript. Again, these are platforms that have more
or less displaced earlier models. When I first learned R, I pasted from
textedit into the core R GUI; I went a little down the road into
ESS-mode in emacs as well. But if you need to continually be checking
random samples of a dataframe, re-running modules, and seeing if your
regular expressions correctly clean a dataset, you are using a notebook
interface today, even if you bundle your code into a module at some
point.

And for visualization, Javascript is creeping into this space. Like many
people, I've been relieved to be able to use Altair instead of
matplotlib for visualizing pandas dataframes; and I don't think twice
about dropping `ggplotly` into lessons about ggplot for students who
start wondering about tooltips on mouseover. `ggplot` and `matplotlib`
are still king of the roost for publication-ready plots, but after
becoming accustomed to interactive, responsive charts on the web, we are
coming to expect exploratory charts to do the same thing; just as select
menus and buttons from HTML fill this role in notebook interface, JS
charting libraries do the same for chart interface.

{#the-gpu-laptop-interface-is-an-open-question}
## The GPU-laptop interface is an open question

Let me be clear--something I'll say in this following section is
certainly wrong. I'm not fully expert in what I'm about to say. I don't
know who is! There are some analogies to web cartography, where I've
learned a lot from [Vladimir Agafonkin](https://agafonkin.com/). Many of
the tools I'm thinking about I learned about in a set of communications
with [Doug Duhaime](https://douglasduhaime.com/) and [David
McClure](http://dclure.org). But the field is unstable enough that I
think others may stumble in the same direction I have.

This whole period, GPUs have also been displacing CPUs for computation.
The R/Python interfaces to these are tricky. Numba kind of works; I've
fiddled with gnumpy from time to time; and I've never intentionally used
a GPU in R, although it's possible I did without knowing it. The path of
least resistance to GPU computation in Python and R is often to use
Tensorflow or Torch even for purposes that don't really need a neural network
library--so I find myself, for example, training UMAP models using the
neural network interface rather than the CPU one even though I'd prefer
the other.

Most of these rely on CUDA to access GPUs. (When I said I don't know
what I'm talking about--this is the core of it.) If you want to do
programming on these platforms, you increasingly boot up a cloud server
and run heavy-duty models there. CUDA configuration is a pain, and the
odds are decent your home machine doesn't have a GPU anyway. If you want
to run everything in the cloud, this is fine--Google just gives away
TPUs for free. But for a group-by/apply/summarize on a few million
rows, this is overkill; and while cloud *compute* is pretty cheap
compared to your home laptop, cloud *storage* is crazy expensive.
Digital Ocean charges me like a hundred dollars a year just to keep up
the database backing RateMyProfessor; for the work I do on several
terabytes of data from the HathiTrust, I'd be lost without a university
cluster and the 12TB hard drive on my desk at home.

But I want these operations to run faster.

{#javascript-is-already-fast-even-without-its-gpu}
## Javascript is already fast, even without its GPU.

When I started using webgl to make charts in Javascript, I was
completely blown away by what it could do. I'm used to sitting around
waiting for ggplot to render even a few thousand points. I'm used to
polygon operations in geopandas being long and expensive. I'm used to
getting up to get some tea when I want to load a geojson file.

But I could use javascript to generate millions of points in random
polygons from primitive triangles in barely any time; and then using
regl it can animate fast enough to make seamless zooming reasonable.
Here, for example, is *every single vote* (excluding absentee) in New
York City precincts in the 2020 election. (Hopefully this embed from
[Observable](http://observablehq.com) loads... but if it doesn't, well,
that's kind of the point, too. I'm making you click below to avoid
clobbering people on phones.)

Load iframe

Digging into the weeds to make more elaborate visualizations like this,
I can see why. Apache Arrow exposes an extremely low level model of the
data you work with, that encourages you to think a lot about both the
precise schema and the underlying types. In Python, I've gotten used to
this kind of work in numpy; in R, I've only ever done a little bit of bit
twiddling. But in modern JS, binary array buffers are built right into
the language. When I started tinkering with JS, I thought of it as slow;
but web developers are far more obsessive about speed than the users of
any other high-level, dynamically typed language I've seen. The profiling tools
built into Chrome are incredibly powerful; and Google, especially, has
made a huge investment in making JS run incredibly quickly because
there's huge money in frictionless web experience. Sure, lots of
websites are slow because they come with megabyte-sized React
installations and casual bloat; sure, the DOM is slow to work with. But
Javascript itself is _fast._

In my first few years teaching digital humanities, probably the least
thankful task was helping students manage their local Java installations
so they could run Mallet, the best implementation of topic-modeling
algorithms out there. Now, we usually use slower and inferior
implementations in gensim, structural topic models, and the like. (For
an interesting discussion from Ted Underwood and Yoav Goldberg of how
inferior results in gensim and sklearn came to displace mallet, [see the
Twitter threads
here](https://twitter.com/Ted_Underwood/status/1338165292745318400).)
But as David Mimno, who keeps Mallet running, says, Javascript works
much faster.

Finally, integrate algorithms with interface. The browser is a high
performance computing environment (JavaScript is MUCH faster than
Python) embedded in an excellent interactive graphics environment. Plus
there's a code environment hidden underneath! Print those variables!

--- David Mimno (@dmimno) October 26, 2020

And while Javascript has a reputation as a terrible language, the post
ES2015 iterations have made it in many cases relatively easy to program
with. Maps, sets, `for ... of ...` all work much like you'd expect
(unlike the days when I spent a couple hours hunting out a rarely
occurring bug in one data visualization that turned out to occur when I
was making visualizations of wordcounts that included the word
`constructor` somewhere in the vocabulary); and many syntactic features
like classes, array destructuring, and [arrow function
notation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions)
are far more pleasant than their Python equivalents. (Full
disclosure--even after a decade in the language, I still find Python's
whitespace syntax gimmicky and at heart just don't like the language.
But that's a post for another day.)

{#javascript-with-webgl-is-crazy-fast}
## Javascript with WebGL is crazy fast.

And if javascript is fast, WebGL is just bonkers in what it can do. Want
to lay out two million points in a Peano curve in a few milliseconds? No
problem--you can even regenerate _every single frame._

Load iframe

And WebGL uses floating-point buffers that are the same as those in
Apache Arrow, so you copy blocks of data straight from disk (or the web)
into the renderers without even having to do that (still fast)
javascript computation. It's difficult, and easy to do wrong. (I've
found regl pitched at the perfect level of abstraction, but I still
occasionally end up allocating thousands of buffers on the GPU every
frame where I meant to only create one persistent one).

In online cartography, protobuffer-based vector files do something
similar in libraries like `mapbox.gl` and `deck.gl`. The overhead of
JSON-based formats for working with cartographic data is hard to stomach
once you've seen how fast, and how much more compressed, binary data can
be.
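
A quick way to see the size gap for yourself, sketched with nothing but
the Python standard library: the same 100,000 numbers written as JSON
text and as a packed float32 buffer. The variable names are arbitrary,
and this is only a stand-in for real formats like Arrow or vector
tiles, which add schemas and compression on top.

```python
import json
from array import array

values = [i / 7 for i in range(100_000)]

# Text: every number is spelled out digit by digit and must be parsed.
as_json = json.dumps(values).encode("utf-8")

# Binary: each number becomes one float32 (4 bytes on common platforms),
# at the cost of dropping to single precision.
as_binary = array("f", values).tobytes()

# Reading the binary form back is a reinterpretation of bytes, not a parse.
decoded = array("f")
decoded.frombytes(as_binary)

print(len(as_json), len(as_binary), len(decoded))
```

The binary buffer is also the shape a GPU wants, which is why it can be
handed over without a conversion step.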

{#webgl-is-hell-on-rollerskates}
## WebGL is hell on rollerskates

In working with WebGL, I've seen just how fast it can be. For things
like array smoothing, counting of points to apply complicated numeric
filters, and group-by sums, it's possible to start applying most of the
elements of the relational algebra on data frames in a fully
parallelized form.

But I've held back from doing so in any but the most ad-hoc situations
because WebGL is also _terrible_ for data computing. I would never tell
anyone to learn it, right now, unless they completely needed to.
Attribute buffers can only be floats, so you need to convert all integer
types before posting. In many situations data may be downsized to
half-precision floats, and double-precision floating points are so difficult
that there [are entire rickety structures built to support them at great
cost](https://deck.gl/docs/developer-guide/64-bits). Support for texture
types varies across devices (Apple ones seem to pose special problems),
so people I've learned from like Ricky Reusser go to great lengths to
support various fallbacks. And things that are essential for data
programming, like indexed lookup of lists or for loops across a passed
array, are nearly impossible. I've found writing complex shaders in
WebGL fun, but doing so always involves abusing the intentions of the
system.
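
The float-only restriction bites harder than it sounds, because floats
cannot hold every integer. A small sketch using Python's `struct`
module (the helper name `round_trip` is invented here) shows where
32-bit and 16-bit floats stop representing whole numbers exactly:

```python
import struct


def round_trip(n, fmt):
    """Store an integer as a float in the given struct format, then read it back."""
    packed = struct.pack(fmt, float(n))
    return int(struct.unpack(fmt, packed)[0])


# float32 has a 24-bit significand: integers are exact up to 2**24.
print(round_trip(2**24, "<f") == 2**24)          # survives
print(round_trip(2**24 + 1, "<f") == 2**24 + 1)  # does not survive

# float16 has an 11-bit significand: integers are exact only up to 2**11.
print(round_trip(2**11, "<e") == 2**11)          # survives
print(round_trip(2**11 + 1, "<e") == 2**11 + 1)  # does not survive
```

So an id column with more than about 16.7 million distinct values
cannot ride along in a single float32 attribute without collisions,
and half precision runs out just past two thousand.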

{#webgpu-and-wasm-might-change-all-that}
## WebGPU and wasm might change all that

{#wasm-and-the-javascript-virtual-machine}
### WASM and the Javascript Virtual Machine

But the last two pieces of the puzzle are lurking on the horizon.
WebAssembly--wasm files--gives another way to write things for the
javascript virtual machine that can avoid the pitfalls of Javascript
being a poorly designed language. A few projects that are churning along
in [Rust](https://www.rust-lang.org/) hold the promise of making
in-browser computation even faster. (If I were going to go all-in on a
new programming language for a few months right now, it would probably
be Rust; in writing webgl programs I increasingly find myself doing the
equivalent of writing my own garbage collectors, but as a high-level guy
I never learned enough C to really know the basic concepts.) Back in the
2000s, the python and R ecosystems were littered with packages that
relied on the Java virtual machine in various ways. In the 2010s, it
felt to me like they shifted to underlying C/C++ dependencies. But given
how much effort is going into it, I think we'll start to see things use
the Javascript Virtual Machine more and more. When I want to use some of
D3's spherical projections in R, that's how I call them; and Jeroen Ooms's
V8 package (for running the JSVM, or whatever we call it) is approaching
the same level of downloads as the more venerable `rJava`. I suspect
almost all of this is running just Javascript. If it starts becoming a
realistic way to run pre-compiled Rust and C++ binaries on any system...
that's interesting.

:::
![Chart showing V8 vs RJava downloads from CRAN since 2016; by mid-2020,
V8 had more than half the downloads of rJava with periodic steps
up.](/img/V8_vs_rJava.png)

{.caption}
:::
Chart showing V8 vs RJava downloads from CRAN since 2016; by mid-2020,
V8 had more than half the downloads of rJava with periodic steps up.
:::
:::

{#webgpu}
### WebGPU

The last domino is a little off, but could be titanically important.
WebGL is slowly dying, but the big tech companies have all gotten
together to create [WebGPU](http://webgpu.io/) as the next-generation
standard for talking to GPUs from the browser. It builds on top of the
existing GPU interfaces for specific devices (Apple, etc.) like Vulkan
and Metal, about which I have rigorously resisted learning anything.

WebGPU will replace WebGL for fast in-browser graphics. But the
capability to do heavy duty computation in WebGL is so tantalizing that
some lunatics have already begun to do it. The stuff that goes into
[Reusser's work](https://rreusser.github.io/) is amazing; check out
this notebook about "multiscale Turing patterns" that creates [gorgeous
images halfway between organic blobs and nineteenth-century
endplates](https://observablehq.com/embed/@rreusser/multiscale-turing-patterns-in-webgl).

I haven't read the draft WebGPU spec carefully, but it will certainly
allow a more robust way to handle things. There is *already* [at least
one linear algebra library (i.e., BLAS) for WebGPU out
there](https://github.com/milhidaka/webgpu-blas). I can only imagine
that support for more data types will make many simple
group-by-filter-apply functions plausible entirely in GPU-land on any
computer that can browse the web.

When I started in R back in 2004, I spent hours tinkering with SQL
backing for what seemed at the time like an enormous dataset: millions
of rows giving decades of data about student majors by race, college,
gender, and ethnicity. I'd start a Windows desktop cranking out charts
before I left the office at night, and come back to work the next
morning to folders of images. Now, it's feasible to send an
[only-slightly-condensed summary of 2.5 million rows for in-browser
work](https://observablehq.com/@bmschmidt/exploring-changing-us-college-majors-with-arquero)
and the whole dataset could easily fit in GPU memory. In general, the
distinction between generally available GPU memory (say, 0.5 - 4GB) and
RAM (2-16GB) is not so massive that we won't be sending lots of data
there. Data analysis and shaping is generally extremely parallelizable.

{#js-and-webgpu-will-stick-together}
## JS and WebGPU will stick together

Once this bundle gets rolling, it will be much faster and more convenient
than python/R, and in many cases it will be able to run with zero
configuration. The [Arquero
library](https://observablehq.com/@uwdata/introducing-arquero),
introduced last year, already brings most of the especially important
features of the dplyr or pandas API into observable at a nearly
comparable speed. With tighter binary integration or a different
backend, it--or something like it--could easily become the basic
platform for teaching the non-major introduction to data science course
all of the universities are starting to launch. Even if it didn't, the
vast superiority of Javascript over R/Python for both visualization
speed (thanks to GPU integration) and interface (thanks to the ubiquity
of HTML5) means that people will increasingly bring their own data to
websites for initial exploration first, and may never get any farther.
(If I were going to short public companies based on the contents of
these speculations, I'd start with NVidia--whose domination of the GPU
space is partially dependent on CUDA being the dominant language, not
WebGPU, and ESRI, which is floundering as it tries to make desktop
software that does what web browsers do easily.)

Once these things start getting fast, the insane overhead of parsing CSV
and JSON, and the loss of strict type definitions that they come with,
will be far more onerous. Something--I'd bet on parquet, but there are
possibilities involving arrow, HDF5, ORC, protobuffer, or something
else--will emerge as a more standard binary interchange format.

{#why-bother-with-r-and-python}
## Why bother with R and Python?

So--this is the theory--the data programming languages in R and Python
are going to rely on that. Just as they wrap Altair and they wrap HTML
click elements, you'll start finding more and more that the
package/module that seems to just work, and quickly, that the
19-year-olds gravitate towards, runs on the JSVM. There will be strange
stack overflow questions in which people realize that they have an
updated version of V8 installed which needs to be downgraded for some
particular package. There will be Python programs that work everywhere but
mysteriously fail on some low-priced laptops using a Chinese startup's
GPU. And there will be things that almost entirely avoid the GPU because
they're so damned complicated to implement that the Rust ninjas don't do
the full text, and which--compared to the speed we see from everything
else--come to be unbearable bottlenecks. (From what I've seen, Unicode
regular expressions and non-spherical map projections seem to be
likely candidates here.)

But it will also raise the question of why we should bother to continue
in R and Python at all. Javascript is faster, and will run anywhere,
universally, without the strange overhead of binder notebooks and the
cost of loading data in the cloud. [WASM ports of these languages that
run _inside_ the JSVM](https://github.com/iodide-project/pyodide) will
help, but ultimately get strange. (Will you write python code that gets
transpiled in the browser to WASM, and then invokes its own javascript
emulator to build an altair chart?) Beats me!

But I've already started sharing elementary data exercises for classes
using [observablehq](https://observablehq.com/), which provides a far
more coherent approach to notebook programming than Jupyter or RStudio.
(If you haven't tried it--among many, many other things, [it parses the
dependency relations between cells in a notebook
topologically](https://observablehq.com/@observablehq/how-observable-runs)
and avoids the incessant state errors that infect expert
and--especially--novice programming in Jupyter or Rstudio.) And if you
want to work with data rather than write code, it is almost as
refreshing as the moment in computer history it tries to recapitulate,
the shift from storing business data in COBOL to running them in
spreadsheets. The tweet above that forms the germ of this rant has
just a single, solitary like on it; but it's from Mike Bostock, the
creator of D3 and co-founder of Observable, and that alone is part of
the reason I bothered to write this whole thing up. The Apache Arrow
platform I keep rhapsodizing about is led by Wes McKinney, the creator
of pandas, who views it as the germ of a faster, better `pandas2`, from
a position [initially sponsored by RStudio and subsequently with funding
from Nvidia.](https://ursalabs.org/blog/ursa-computing/) Speculative as
this all is, it's also--aside from the massive neural-network gravitational
pull of the tensorflow/torch solar systems--where the tools that became
hegemonic in the last decade _are naturally drifting_. (Not to imply
that Javascript is anywhere near the top of the Arrow project's priority
list, BTW. It isn't.) I wish more of the data analysts, not just the
insiders, saw this coming, or were excited that it is.

As I said, I've been doing some of this programming since 2003 or so,
and been putting in my regular rounds most days since 2010. In that time
I've come to see that what I want to see most--fully editable,
universally runnable, data analysis on open data--is not a universal
code. Some people just want static charts. Some people want to hide
their data. Most readers don't want to tweak the settings. And everyone
looks down on people who like Javascript. But it's also the case that
the web was first built in the 90s to share complicated academic work
and make it editable by its readers. Even if most of academia and much
of the media is devoted to one-way flows of information, and much of the
post-social media Internet is a blazing hellscape, I'm excited about
these shifts in the landscape precisely because they hold out the
possibility that some portion of the Web might actually live up to its
promise of making it easier to think through ideas.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Bookworm Caching]]></title>
            <link>https://benschmidt.org/post/bookworm-caching</link>
            <guid>https://benschmidt.org/post/bookworm-caching</guid>
            <pubDate>Sun, 07 Mar 2021 16:18:47 GMT</pubDate>
            <content:encoded><![CDATA[I used to blog _everything_ that I did about a project like
[Bookworm](https://github.com/bookworm-project), but have got out of the
habit. There are some useful changes coming through the
pipeline, so I thought I'd try to keep track of them, partly to update
on some of the more widely used installations and partly

The core work on Bookworm happened in 2011-2013 when I was at Harvard
working with Erez Lieberman Aiden and JB Michel as a way of bringing the
metadata in digital libraries to interfaces like the Google Ngram Viewer
that they built.

As such, it uses a very 2000s form of content management: a
single-server, LAMP stack oriented architecture that assumes you have a
MySQL database always running and can post individual queries against
it.

Over the years, I've tweaked the backend a bit to allow for more
resilience in this architecture. In particular, the web server--like
most webservers nowadays--lives somewhere in the cloud. (On a Digital
Ocean droplet, although that's not important.)

That's great, because it means that the server can be basically static.
But you still need a database server somewhere. Even for a medium-sized
corpus like the Rate My Professors one, hosting the databases can be
real money simply for hard drive space--something like \$100 a year. On
bigger databases like Chronicling America, these costs are prohibitive
on many servers. Historically, I just used a desktop in my office. But
under COVID, that has kind of fallen apart, because what used to be
about 99% uptime on a machine plugged into Ethernet has degraded into
perhaps 50% uptime on a machine on residential wifi in my bedroom at
home.

That means that every week, I get e-mails from people about to run a
workshop suddenly realizing that the site on [gendered teaching
evaluations](//benschmidt.org/profGender) has broken. There are two
solutions here.

1. Virtualize the server and run it in the cloud, too.
2. Cache results so that the frontend can run without MySQL entirely.

I'm working on both, but the second is easier--that's what I'm
describing here.

The strategy is essentially to build up a local cache of the most common
queries that can live on the webserver. As a format for that cache I'm
using Apache Arrow's `.feather` format, which I've become enamored of
in the last year--it's a binary serialization that's far smaller and
faster to load than JSON. For each query I generate an SHA-1 hash from
the description of the query; if that hash exists among the last 256
queries to the server, a local version of the Bookworm API that runs
without MySQL can return the answer directly, whether or not the
database backend is still alive. If the hash is cached, great. If not,
we fall back to a proxy form of the API that can reach out to my home
server's API endpoint. In addition to that 256-item LRU
(least-recently-used) cache, there's also an option to specify a cold
storage cache. For the RateMyProfessors Bookworm, my plan is to fill
this with several thousand of the most frequent queries so that
workshops can generally proceed without any trouble even when the main
db is down.
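
The lookup order described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the actual Bookworm
code: the names `query_key` and `QueryCache` are made up, the caches are
plain in-memory dictionaries rather than `.feather` files, and
`fetch_from_backend` stands in for whatever proxy call reaches the real
database.

```python
import hashlib
import json
from collections import OrderedDict


def query_key(query):
    """Hash a query description so equivalent queries share one cache key."""
    # Sorting keys means {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically.
    canonical = json.dumps(query, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class QueryCache:
    """Hypothetical two-tier cache: a small LRU plus an optional cold store."""

    def __init__(self, fetch_from_backend, max_items=256, cold_storage=None):
        self.fetch_from_backend = fetch_from_backend  # assumed proxy to the database
        self.max_items = max_items
        self.recent = OrderedDict()  # least recently used entries come first
        self.cold_storage = cold_storage or {}  # pre-filled frequent queries

    def get(self, query):
        key = query_key(query)
        if key in self.recent:
            self.recent.move_to_end(key)  # mark as most recently used
            return self.recent[key]
        if key in self.cold_storage:
            return self.cold_storage[key]
        # Cache miss: only now does the database need to be alive.
        result = self.fetch_from_backend(query)
        self.recent[key] = result
        if len(self.recent) > self.max_items:
            self.recent.popitem(last=False)  # evict the least recently used entry
        return result
```

The point of hashing a canonical form of the query is that the cache key
doesn't depend on the order in which a client happens to list its
parameters.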

There are other ways of handling caching. This one is notably deficient
in that it's not truly a static solution: there's still a python daemon
running to process the API requests on each query. I had always thought
that I'd probably just store JSON on the server directly so that a
Bookworm could run entirely statically. I may yet do that. But this also
serves another purpose of mine, which is to extend the family of API
backends Bookworm can run on. A local cache backed by MySQL isn't much
different than MySQL itself, but it opens up some more useful
possibilities, such as:

1. hitting multiple _different_ MySQL backends, which allows sharding
   bookworm servers on extremely large corpora.
2. Building entirely different backends on things like Solr or
   ElasticSearch. (Although I'll note that the old MySQL architecture,
   dated as it is, continues to allow things that none of the Lucene
   managers I've worked with over the years think are possible in routine
   time in terms of aggregating queries.)
3. Data transfer over HTTP using Arrow, which is now fully supported
   (it's happening behind the scenes on every query now). That opens up
   some useful possibilities for speeding up Python and R modules and
   making them more type aware.

{#extremely-technical-notes}
### Extremely Technical notes

But a stack this complicated also has complications. Some come from the
new Docker setup. Just as a note to myself and anyone else attempting
something similarly complicated:

1. Remote forwarding to docker requires enabling GatewayPorts in the ssh
   configuration both for the client (\~/.ssh/config) and the host
   (`/etc/ssh/sshd_config` on most Linux systems).
2. That's dangerous! So immediately following that, I had to set up
   `ufw` to block all incoming connections to the webserver except on
   ports 80 and 443.
3. Now docker is once again not allowed to access the host, because it's
   technically an outside host. I allow access from the docker subnet with
   `ufw allow from 172.24.0.1`. I don't know if `172.24.0.1` is always
   the address for a docker cluster; I found it by doing
   `docker container ls` to get my containers, and then
   `docker inspect $ID` on the relevant container, which gave an
   IPAddress of `172.24.0.2`. I'm just going to assume that anything
   docker allocated will be in the `172.24.0.*` range.
4. Just as the webserver needs to know where docker lives, docker needs
   to know the webserver. That I get with ifconfig, looking for the
   docker0 subnet IP address. In that context, it's `172.17.0.1`. Note
   `172.17` instead of `172.24`; I would have thought they'd be the
   same, so evidently I don't really understand networking.
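
The mismatch in the last item can be checked with Python's standard
`ipaddress` module. This is only a sketch: the /16 prefix length is an
assumption (docker commonly hands out /16 networks, but
`docker network inspect` is the authoritative source for any given
setup), and the helper name `same_network` is made up for illustration.

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len=16):
    """Return True if both addresses fall in the same network of the given size."""
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


# The gateway I allowed in ufw and the container's own address.
print(same_network("172.24.0.1", "172.24.0.2"))
# The docker0 address from ifconfig and the container's address.
print(same_network("172.17.0.1", "172.24.0.2"))
# Both addresses against the wider private 172.16.0.0/12 block.
print(same_network("172.17.0.1", "172.24.0.2", prefix_len=12))
```

Both addresses sit in the private `172.16.0.0/12` block but in different
/16 networks, which would be consistent with `docker0` and the
container's network being two separate docker networks. If the
assumption about the `172.24.0.*` range holds, a ufw rule written
against `172.24.0.0/16` rather than a single address would cover it.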
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Jobs Report November update]]></title>
            <link>https://benschmidt.org/post/2020-11-12-jobs</link>
            <guid>https://benschmidt.org/post/2020-11-12-jobs</guid>
            <pubDate>Thu, 12 Nov 2020 17:00:45 GMT</pubDate>
            <content:encoded><![CDATA[I last looked at the H-Net job numbers [about a month
ago.](/post/2020-10-01-jobs-update/2020-10-01-jobs-update/)

Since then, the news isn't exactly good, but it's also probably as good
as anyone could expect. For most of September and October, history jobs
were at about 25% of their average for the 2010s; this was slightly
worse than we're seeing in the approximate numbers in--for
instance--science jobs, where new job openings are at about [30% of
their normal
levels](https://www.sciencemag.org/careers/2020/10/amid-pandemic-us-faculty-job-openings-plummet).
(Thanks to Dylan Ruediger at the AHA for passing along that link.)

The shifts in the last month have pushed the total number over 100
jobs in history; this is a bit of an advance, so we're now down only
about 70% from the normal rates, not 75%. I don't know if the sciences
have seen a similar rebound.

:::
![Cumulative jobs by era](/img/cumulative-november.png)

{.caption}
:::
Cumulative jobs by era
:::
:::

We're far enough into the year that it's worth looking at subfield
numbers to see how different fields are faring. The loss of jobs is much
more uneven than I thought.

At one extreme, jobs in world history and Middle Eastern history are
down about 95%; at the other end, there have been 32 jobs listed with a
primary category of Black studies or African American studies, an
_increase_ of 28% over the normal rate. The only other subfield not to
have seen a catastrophic collapse is history of science, down "just"
40%. This is clearly the Black Lives Matter moment playing itself out in
the hiring patterns, and it produces some remarkable inversions; there
are twice as many jobs listed for African-American history specifically
as for American history generally.

The collapses in European and Middle East hiring are especially
remarkable given that both fared quite badly after 2008, as well. A
typical associate professor in mideast studies might have received her
first job in 2008, when there were about 60 new jobs in mideast studies;
this year, there is one.

:::
![Job losses have been unevenly distributed.](/img/field_changes.png)

{.caption}
:::
Job losses have been unevenly distributed.
:::
:::

Average jobs listed by November 12 in each era, with 2020 as a share of the 2010s and 2000s averages:

|Region|2004-2008|2009|2010-2019|2020|as share of 2010s average|as share of 2000s average|
|:--|--:|--:|--:|--:|--:|--:|
|Middle East|61.20|29.0|21.9|1.0|5%|2%|
|World|37.80|13.0|14.5|1.0|7%|3%|
|Europe|88.40|36.0|46.1|5.0|11%|6%|
|US/Canada|162.80|79.0|68.2|14.0|21%|9%|
|Ancient|28.20|16.0|17.0|3.0|18%|11%|
|Americas|52.40|26.0|30.0|7.0|23%|13%|
|Asia|80.60|43.0|54.7|14.0|26%|17%|
|Methodological|25.20|12.0|30.4|8.0|26%|32%|
|Africa|13.00|4.0|20.8|5.0|24%|38%|
|Hist. Sci.|12.40|7.0|8.4|5.0|60%|40%|
|Interdisciplinary|15.20|9.0|25.4|10.0|39%|66%|
|Black/Af-Am|33.75|20.0|25.0|32.0|128%|95%|
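
For anyone checking the arithmetic, the share columns are just the 2020
count divided by the era average, and the "down 95%" or "up 28%" figures
in the text are that share minus 100. A small sketch, using three rows
from the table above (the function name `share_of_average` is mine):

```python
def share_of_average(jobs_this_year, era_average):
    """Jobs this year as a whole-number percentage of an era's average."""
    return round(100 * jobs_this_year / era_average)


# (2010-2019 average, 2020 count) for three fields, copied from the table.
rows = {
    "Middle East": (21.9, 1.0),
    "Hist. Sci.": (8.4, 5.0),
    "Black/Af-Am": (25.0, 32.0),
}

for field, (average, jobs) in rows.items():
    share = share_of_average(jobs, average)
    change = share - 100  # negative means a decline from the average
    print(f"{field}: {share}% of the 2010s average ({change:+d}%)")
```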
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[History Jobs Update]]></title>
            <link>https://benschmidt.org/post/2020-10-01-jobs-update</link>
            <guid>https://benschmidt.org/post/2020-10-01-jobs-update</guid>
            <pubDate>Thu, 01 Oct 2020 13:29:09 GMT</pubDate>
            <content:encoded><![CDATA[Out of a train-wreck curiosity about what's been happening to the
historical profession, I've been watching the numbers on tenure-track
hiring as posted on H-Net, one of the major venues for listing history
jobs.

\[Update 10-2: switching to US and Canada only. An earlier version of
this included other countries, even though I said it didn't.\]

We're now into October. Usually--I know now--this is the period by which
half the tenure-track jobs for any cycle have been listed. With two
important exceptions I'll get into later, every year since 2004 passed
the halfway point for the year in the last week of September or the
first week of October.

So here are a few ways of looking at the hiring patterns.

One is the aggregate tenure-track jobs listed by year. (I'm filtering
here not just to tenure-track positions, but also to jobs in the United
States with "history" in the primary category field, which are typically
things like "Asian History / Studies". The core H-Net audience is the
history profession and the US, so we'll get less noise limiting this
way.)

Here you can see a few things:

1. H-Net took some time to get off the ground in 2003-2007. You'd be
   better off looking at the AHA listings for this period, but nonetheless:
2. The period before 2008 was much better--almost twice as many jobs a
   year--as the period since.
3. 2009 was the worst year on the record to this point, with 200+
   jobs listed by early October; currently we're still short of 100.

:::
![2020 is much worse than any other year](/img/bad_news.png)

{.caption}
:::
2020 is much worse than any other year
:::
:::

This chart shows how the number of listings over time in this academic
year compares to the two other eras in the hiring cycle: 2004 to 2008,
and 2010 to 2019. It's worth noting a couple things here. First, the
worst of the pre-great-recession years was better than the best year
since it. Second, I've broken out 2009, the only year that compares to
the current one in its low numbers through September, but as you can see
2009 did recover to have more tenure track jobs, in the end, than the
worst year of the 2010s. (One of the worst years of that decade, it's
worth noting, was 2019; even as majors approached stability, new
listings for tenure-track jobs were disappearing last year.)

:::
![Annual Cycle.](/img/tt-cycle.png)

{.caption}
:::
Annual Cycle.
:::
:::

Overall, we can see what the next couple months are likely to look like
by looking at the annual cycle of jobs. Typically the flood comes in
late September; you get a couple a day through Thanksgiving; and then
after a slight December rebound, the rest of the spring is perhaps a
single job a day publicly listed.

:::
![Shape of annual hiring peaks in mid-September](/img/year_shape.png)

{.caption}
:::
Shape of annual hiring peaks in mid-September
:::
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Circle Packing]]></title>
            <link>https://benschmidt.org/post/2020-08-31-circle-packing</link>
            <guid>https://benschmidt.org/post/2020-08-31-circle-packing</guid>
            <pubDate>Tue, 01 Sep 2020 20:49:49 GMT</pubDate>
            <content:encoded><![CDATA[I've been doing a lot of my data exploration lately on Observable
Notebooks, which is--sort of--a Javascript version of Jupyter notebooks
that automatically runs all the code inline. Married with Vega-Lite or
D3, it provides a way to make data exploration editable and shareable in
a way that R and python data code simply can't be; and since it's all
HTML, you can do more interesting things.

Of course, that leaves all that writing on their site, where it will
likely eventually vanish. I'm generally willing to live with that. But
it's also nice to be able to embed the charts over here, even if they'll die
when Observable does.

The [observable
version](https://observablehq.com/@bmschmidt/transitioning-between-circle-packs-historical-us-electio)
of this page will almost certainly look better, but you can get a quick
idea of the contents below.

\{\{\< observablenotebook
"/notebooks/transitioning-between-circle-packs-historical-us-electio.js"
\>\}\}
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[College Majors 2019 update]]></title>
            <link>https://benschmidt.org/post/2020-08-25-college-majors-2019-update</link>
            <guid>https://benschmidt.org/post/2020-08-25-college-majors-2019-update</guid>
            <pubDate>Fri, 28 Aug 2020 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Every year, I run the numbers to see how college degrees are changing.
The Department of Education released this summer the figures for 2019;
these and next year's are probably the least important that we'll ever
see, since they capture the weird period as the 2008 recession's
shakeout was wrapping up but before COVID-19 upended everything once
again. But for completism, it's worth seeing how things changed.

First, the chart of humanities majors compared to peak. Here, things
remain at their post-2015 level.

:::
![Figure 1](fig-1.png)

{.caption}
:::
Figure 1
:::
:::

Next, the decade-horizon rate of change for all majors. Again, the
humanities are at the bottom of the list; the most remarkable feature
here is that computer science, already large, has been growing at a huge
rate in the last few years.

:::
![Figure 2](fig-2.png)

{.caption}
:::
Figure 2
:::
:::

Next, what I think is the most important full overview you can get: a
four-type division of US college majors since 1990. This makes clear
that the basic story of the last decade was the growth of STEM at the
expense of pretty much all other forms of education.

:::
![Figure 4](featured.png)

{.caption}
:::
Figure 4
:::
:::

Rate of change is important, but it's worth looking at the overall
numbers too. Here are 20 years of majors for all the humanities fields.
The American Academy includes several communications majors as
humanities fields; I think that in method and substance they're closer
to a qualitative social science, but I include them here anyway.

:::
![Figure 3](fig-3.png)

{.caption}
:::
Figure 3
:::
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Ranking CS Graduate programs]]></title>
            <link>https://benschmidt.org/post/2020-07-30-CS-Rankings</link>
            <guid>https://benschmidt.org/post/2020-07-30-CS-Rankings</guid>
            <pubDate>Tue, 28 Jul 2020 15:15:14 GMT</pubDate>
            <content:encoded><![CDATA[{#ranking-graduate-programs}
# Ranking Graduate Programs

While I was choosing graduate programs back in 2005, I decided to come
up with my own ranking system. I had been reading about the Google
PageRank algorithm, which essentially imagines the web as a bunch of
random browsing sessions that rank pages based on the likelihood that
you--after clicking around at random for a few years--will end up on any
given page. It occurred to me that you could model graduate school
rankings the same way. It's essentially a four-step process:

1. Pick a random department in the United States.
2. Pick a random faculty member from that department.
3. Go to that faculty member's _graduate_ department.
4. 90% of the time, return to step 2; 10% of the time, return to step 1.

At the end of each stage, you'll be in a different department; but the
more prestigiously any given department's graduates are placed, the more
likely you are to be there.

Using transition matrices, these numbers converge after a relatively
short period.
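
The four steps above can be sketched as a short power iteration in plain
Python. This is an illustration of the idea, not the R code from the
paper: the function name `placement_rank` and the input format (one
`(employing_department, phd_department)` pair per faculty member) are
assumptions, and departments with no listed faculty are simply treated
as an automatic restart.

```python
from collections import defaultdict


def placement_rank(hires, follow_prob=0.9, iterations=200):
    """Score departments by where a random walk over faculty placements ends up.

    hires: list of (employing_department, phd_department) pairs.
    """
    departments = sorted({dept for pair in hires for dept in pair})
    faculty_phds = defaultdict(list)
    for employer, phd in hires:
        faculty_phds[employer].append(phd)

    n = len(departments)
    rank = {dept: 1.0 / n for dept in departments}  # step 1: uniform start
    for _ in range(iterations):
        new_rank = {dept: 0.0 for dept in departments}
        restart_mass = 0.0
        for dept, mass in rank.items():
            phds = faculty_phds.get(dept)
            if not phds:
                restart_mass += mass  # nobody to follow: start over
                continue
            restart_mass += (1 - follow_prob) * mass  # the 10% restart in step 4
            share = follow_prob * mass / len(phds)  # steps 2-3: random faculty member
            for phd in phds:
                new_rank[phd] += share
        for dept in departments:
            new_rank[dept] += restart_mass / n  # restarts land uniformly
        rank = new_rank
    return rank
```

With a toy input like
`[("A", "B"), ("A", "B"), ("B", "B"), ("C", "B"), ("C", "A")]`, where
the letters are made-up departments, the scores sum to one and "B" comes
out on top, since every department hires from it.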

I ran it on history departments, but have never circulated the history
scores. (Rankings make people mad, and the benefit seems smaller than the
cost.) But one of my roommates at the time, [Matthew
Chingos](https://www.urban.org/author/matthew-chingos), was already
moving towards working in higher education policy and grad school in
political science, so we wrote up a paper applying it to Political
Science departments and published it [in _PS_ in
2007](https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/ranking-doctoral-programs-by-placement-a-new-method/19789D9787720D266C2759B4E1798051).
(Schmidt, B., & Chingos, M. (2007). Ranking Doctoral Programs by
Placement: A New Method. PS: Political Science & Politics, 40(3),
523-529. doi:10.1017/S1049096507070771)

It's a pretty simple method, but I still occasionally get questions
about it, the data, and the underlying code. As I recall, the political
science data was viewed as slightly sensitive, so the arrangement we
made with the American Political Science Association was that they would
handle requests for the data and we would only provide code.

This was in 2005, so reproducibility was not a worry--nowadays, you'd
put all this stuff on github. In response to a recent request, I've just
done that.

The core code was interesting to look at, because it's stuff I wrote in
R fifteen years ago. It basically seems to still work, but it has little
in common with how I'd handle the problem nowadays.

{#ranking-computer-science-programs-as-of-2015}
## Ranking Computer Science Programs as of 2015

Still, the proof is in the eating. So I went looking for some new data
to try it on. On the theory that computer science faculty are too
distracted by their overwhelming course sizes and endless parade of job
searches to be bothered by this, I'll do them.

Alexandra Papoutsaki et al. at Brown [created a crowdsourced dataset of CS
faculty](http://cs.brown.edu/people/apapouts/faculty_dataset.html) that
they expect to be "80% correct." They seem to have updated a
version that's sitting inside a Github repository
[here](https://github.com/brownhci/drafty/blob/master-node/databaits/data/professors.csv),
so that's what I've used. I'm using placements that are from 2005-2015
here.

|school|p|
|:--|--:|
|University of California - Berkeley|17.2835408|
|Massachusetts Institute of Technology|16.6558147|
|Stanford University|9.8659918|
|Carnegie Mellon University|7.9750700|
|University of Washington|4.5314467|
|Cornell University|3.4656622|
|Princeton University|2.9223387|
|University of Texas - Austin|2.5394603|
|Columbia University|2.3110282|
|University of California - Santa Barbara|2.0507537|
|California Institute of Technology|1.9028543|
|Georgia Institute of Technology|1.5902598|
|University of Illinois at Urbana-Champaign|1.5324409|
|University of California - Los Angeles|1.5238573|
|University of California - San Diego|1.2106396|
|University of Maryland - College Park|1.1716862|
|University of Pennsylvania|1.0691726|
|Brown University|1.0167585|
|University of North Carolina - Chapel Hill|0.9371394|
|University of Michigan|0.9263730|
|University of Minnesota - Twin Cities|0.7845679|
|Harvard University|0.7668788|
|New York University|0.7561730|
|University of Wisconsin - Madison|0.7021781|
|University of Massachusetts - Amherst|0.6569323|
|Purdue University|0.6213802|
|University of Chicago|0.6157431|
|Rice University|0.6154933|
|Johns Hopkins University|0.5860418|
|University of Virginia|0.5794159|

As an outsider, I see nothing shocking here, which is good.
Technical schools are pretty high up, and my current employer is on the
list and right next to Harvard. Nobody ever got in trouble for saying
their school is as good as Harvard, even when Harvard is--as in CS--not
so hot.

{#extensions}
# Extensions

{#error-bars}
## Error bars!

Besides reproducibility, one thing I didn't have a good answer to back
in 2005 was robustness. Now I know very slightly more statistics, and
the most sensible approach seems to be bootstrap sampling across the set
to get an idea of how much difference one student more or less might
make.

Here's a plot of 500 random resamples of the set. There are two
takeaways here:

1. There's decent separation overall, but in general the distinction
   between 1 and 2 on the list, or between 30 and 60, is not anything
   stunning.
2. A few schools show notable patterns high or low. I think this is
   because single people greatly affect rankings. For example, UC Santa
   Barbara has a number of quite low rankings outside its boxplot; I
   think those are runs where _both_ their grad who teaches at MIT and
   their grad who teaches at Berkeley were dropped in the bootstrap.
   Since UCSB relies very heavily on those two people for its high
   ranking, the bars are telling us--rightly--that the uncertainty there
   is pretty high.
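
The resampling itself is simple enough to sketch. What follows is a
minimal illustration and not the code behind the plot: `placement_counts`
is a deliberately crude stand-in score (a raw count of placements), and
all the function names are hypothetical. The ranking function you
actually care about would be passed in as `score_fn`.

```python
import random
from collections import Counter


def placement_counts(hires):
    """Stand-in score: how many faculty each PhD department has placed."""
    return Counter(phd for _employer, phd in hires)


def bootstrap_scores(hires, score_fn=placement_counts, n_resamples=500, seed=0):
    """Re-score the data on many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so reruns give the same resamples
    runs = []
    for _ in range(n_resamples):
        # Same size as the original, so some people appear twice and some vanish.
        resample = rng.choices(hires, k=len(hires))
        runs.append(score_fn(resample))
    return runs


def score_range(runs, department):
    """Lowest, median, and highest score a department got across the runs."""
    values = sorted(run.get(department, 0) for run in runs)
    return values[0], values[len(values) // 2], values[-1]
```

A department whose score leans on one or two people will show a wide gap
between the low and high ends of `score_range`, because some resamples
drop those people entirely.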

:::
![PNG of error bars for rankings above](/img/CS-rankings.png)

{.caption}
:::
PNG of error bars for rankings above
:::
:::

{#undergrad-rankings}
## Undergrad rankings

I've always wondered what the general form of this interaction would be;
ignore disciplines, and just look overall at how universities assess
other universities in their hiring patterns.

This dataset at least includes undergrad and master's locations, so we
can see how this form would work differently based on _undergrad_
quality vs grad quality.

In general, the scores are correlated--for example, MIT and Berkeley are
near the top on both--but there are some useful distinctions. For
instance, Yale undergrads are very well represented in CS faculties,
while Yale grad students are few and far between. Conversely, the
University of Washington produces middling undergrads, but is a grad
powerhouse. Presumably the major factor here is that undergrads do not
choose schools based on the strength of individual departments.

:::
![Scatterplot of Undergrad vs
Grad](/img/Comparison%20of%20grad%20and%20undergrad%20rankings.png)

{.caption}
:::
Scatterplot of Undergrad vs Grad
:::
:::
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Jeb! the quitter. Digital traces of private devotions.]]></title>
            <link>https://benschmidt.org/post/2020-02-25-Jeb</link>
            <guid>https://benschmidt.org/post/2020-02-25-Jeb</guid>
            <pubDate>Wed, 26 Feb 2020 14:49:26 GMT</pubDate>
            <content:encoded><![CDATA[As I often do, I'm going to pull away from various forms of Internet
reading/engagement through Lent. This year, this brings to mind one of
my favorite stray observations about digital libraries that I've never
posted anywhere.

As part of the 2016 Republican Primary, Jeb! Bush released a website
enabling exploration of e-mails related to his official accounts as
governor of Florida in the early 2000s. This whole sentence has an
antiquity to it; the idea of pre-emptive disclosure (in large part to
contrast with his presumed general election opponent, Hillary Clinton)
seems hopelessly antique. And at the time, it [was criticized for
accidentally disclosing all sorts of personal information, both stories
and Social Security
Numbers](https://www.theverge.com/2015/2/10/8013531/jeb-bush-florida-email-dump-privacy).
It did not make Jeb! president. Anyhow, back then I downloaded Jeb!'s
e-mails--and Hillary's--to think about what sort of stuff historians
will do with these records in the future.

One thing I looked at was simply the _time of day_ that Jeb sent
letters. Looking at it on a yearly basis, it was clear that there were
some odd seasonal patterns in the way that Jeb! sent his e-mails.
Knowing that Jeb! was Catholic, I had a brainstorm that maybe this was
aligned to the liturgical year. And so I wrote a little bit of ggplot2
code to break out the Lenten season from the rest of the year.

(My favorite part of this chart is the color scheme; these are the colors
of the vestments worn during Lent and ordinary time. I can't remember
how I aligned dates to the liturgical calendar.)
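
One way to do that alignment, which is not necessarily what I did at the
time, is to compute Easter with the standard anonymous Gregorian
algorithm and count back 46 days to Ash Wednesday. The sketch below is
in Python rather than R; the function names are made up, and treating
Lent as running from Ash Wednesday up to (but not including) Easter
Sunday is a simplifying assumption.

```python
from datetime import date, timedelta


def easter_sunday(year):
    """Gregorian Easter via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day_offset = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day_offset + 1)


def ash_wednesday(year):
    """Ash Wednesday falls 46 days before Easter Sunday."""
    return easter_sunday(year) - timedelta(days=46)


def is_lent(day):
    """True from Ash Wednesday up to the day before Easter (an assumed cutoff)."""
    return ash_wednesday(day.year) <= day < easter_sunday(day.year)
```

For 2005 this gives Ash Wednesday on February 9 and Easter on March 27,
so each e-mail's date can be flagged with `is_lent` before plotting.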

Breaking it out, I think it's far more likely than not that in the year
2005, Jeb! made some private devotion to get up early and answer his
e-mails before 7AM. The only thing arguing against this is that he
_does_ get up a little early on Mardi Gras and the Monday before as
well; but starting on Ash Wednesday, Jeb! is regularly sending over 50%
of his e-mails for the day before he gets to the office.

And then it falls apart a week or two before Easter. Could he not hold it
together?

There's also some sign that he gave the same effort a shot in 2006, but
it fell apart much earlier.

\{\{\< figure src="closeup.png" title="Jeb Bush's outgoing e-mail times"
\>\}\}

It is odd to me to be able to talk in this particular way about the
intersection of daily life and religious identity. One oddity, of
course, is that this is yet another example of the kinds of information
held inside the great data surplus at the tech companies; but honestly,
the question here is so oddly stated that I can't imagine datamining
ever turning it up. Perhaps it says something about the potential for
biographies in the digital age; the narcissism of the quantified self
movement might look quite different directed at the quantified other.
But is this kind of evidence really compatible with biography?

Anyhow, off to some e-mails of my own.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Two Volumes: the lessons of Time on the Cross]]></title>
            <link>https://benschmidt.org/post/2019-AHA</link>
            <guid>https://benschmidt.org/post/2019-AHA</guid>
            <pubDate>Thu, 05 Dec 2019 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[Sorry, something failed to render. Please visit the website for full content.]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Web Migration]]></title>
            <link>https://benschmidt.org/post/moving</link>
            <guid>https://benschmidt.org/post/moving</guid>
            <pubDate>Sun, 30 Jun 2019 15:53:19 GMT</pubDate>
            <content:encoded><![CDATA[Since 2010, I've done most of my web hosting the way that the Internet
was built to facilitate: from a computer under the desk in my office.
This worked extremely well for me, and made it possible to rapidly
prototype a lot of websites serving large amounts of data which could
then stay up indefinitely; I have a curmudgeonly resistance to cloud
servers, although I have used them a bit in the last few years (mostly
for course websites where I wanted to keep student information separate
from the big stew.)

But as part of my move to NYU, I'm shifting my Apache server to the
cloud (Digital Ocean). That will break some things in the short term,
and I'm retiring a few elements of the website.

I'm listing the changes here mostly for my own reference. If I happen to
have put up something that you use and want to see back, don't be shy to
let me know through my Google email (username bmschmidt) or via
benmschmidt on Twitter.

{#awaiting-repair}
## Awaiting repair

- The Rate My Professor gender language site. I think this gets the most
  sustained, regular traffic on my site. I'm hopeful this will be out
  of service only for the first two weeks of July. If you have some kind
  of curricular lesson or workshop for which you need it in that period,
  let me know and perhaps I can fix it up ahead of time.
- *Other Bookworms (Simpsons, Movies, etc.)*. I see some people using
  these, and I'll restore them using the same strategy as RMP; they may
  be offline until September, though, depending on how I address some
  storage issues. (The basic issue here is that, together, these take
  several terabytes of storage; that's more than you can drop into a
  cloud site at an affordable price. I know how I'll solve this, but it
  will be easier in September than July.)

{#working}
## Working

Anything that didn't have a database backend should be working fine. If
it's not, it's probably a quick fix to a problem I'm not aware of.

- Personal website, all parts of _Creating Data_, interactive degree
  explorer.

{#probably-gone}
## Probably gone

- The Open Library bookworm was a prototype that eventually became the
  [Hathi Trust Bookworm](https://bookworm.htrc.illinois.edu/develop).
  For a few years I've been recommending that everyone use that site
  rather than count on the old OL one.
- Some prototypes for _Creating Data_ that I don't think were widely
  used.
- Some embedded elements in slideshows.
- Wordpress installations for courses that I offered prior to 2016.
  These don't seem worth migrating to me. If you've somehow obtained a
  URL for one of these courses, you can probably add '/syllabus.pdf' to
  the end to see the basic materials.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Moving (or rather, staying in place)]]></title>
            <link>https://benschmidt.org/post/Moving</link>
            <guid>https://benschmidt.org/post/Moving</guid>
            <pubDate>Fri, 03 May 2019 19:18:12 GMT</pubDate>
            <content:encoded><![CDATA[Some news: in September, I'll be starting a new job as Director of
Digital Humanities at NYU. There's a wide variety of exciting work going
on across the Faculty of Arts and Sciences, which is where my work will
be based; and the university as a whole has an amazing array of programs
that might be called "Digital Humanities" at another university, as well
as an exciting new center for Data Science. I'll be helping the
humanities better use all the advantages offered in this landscape. I'll
also be teaching as a clinical associate professor in the history
department.

If you're at NYU or somewhere nearby and want to chat, please do reach
out; I'll be around through the end of July. There should be more to say
about this going forward.

But just to look back a bit: I'll be leaving Northeastern, which has
built up one of the country's best digital humanities programs over the
last seven years. The history department (and our college dean, Uta
Poiger) have been extremely supportive of the possibilities of digital
history, of alternative publication models, and of DH in graduate
education. It's been great to see Cameron Blevins expanding the history
department's profile since he arrived two years ago. I don't want anyone
to think I or they screwed up the tenure or retention process. I've
found it a great place to work. Especially if you live anywhere near
Boston.

But this move has been a while coming: about six months after I started
at Northeastern, [my
wife](https://as.nyu.edu/content/nyu-as/as/faculty/anne-odonnell.html)
accepted a job teaching Soviet history at NYU. Many academic couples end
up juggling locations for various periods of time, and ours hasn't been
the worst; I've been fortunate through various means (the National
Endowment for the Humanities, Columbia's SIPA, and Northeastern's
parental teaching releases--thanks to each) to only have to be on campus
one semester a year since we moved to New York in 2015. And New York to
Boston--as academics at parties have too often cheerily reminded me--is
not the worst commute out there; just 4 hours in a comfortable train car
with sporadic wi-fi access, with ten minutes of subway rides on either
end. Despite its imperfect reputation, I've found Amtrak to always be
great; I did probably 30 round-trips last semester, and didn't hit a
single major delay.

But any commute is hard, especially when you have small children. (Which
is, demographically, a set that academic commutes fall most heavily on.)
I remember, shortly before starting my job at Northeastern, reading Mark
Sample write about how the commute is a "[grueling, brain-frying,
wallet-emptying, time-wasting, body-breaking, soul-draining way to
live](http://blog.commarts.wisc.edu/2011/10/25/dual-academic-couples-and-long-distance-living/)."
Amen. I can't help but think that the widespread acceptance of commutes
(and their flipside, residential fellowships) is toxic for local
university communities and, in aggregate, for gender and probably
socioeconomic diversity in the professoriate. But I also see others
happily splitting their time or playing a longer game than I can
imagine. So it's probably enough simply to say the commute is not for
us.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[A computational critique of a computational critique of computational critique.]]></title>
            <link>https://benschmidt.org/post/2019-03-18-Nan-Da-Critical-Inquiry</link>
            <guid>https://benschmidt.org/post/2019-03-18-Nan-Da-Critical-Inquiry</guid>
            <pubDate>Tue, 19 Mar 2019 01:42:00 GMT</pubDate>
            <content:encoded><![CDATA[Critical Inquiry has posted an article by Nan Da offering [a critique of
some subset of digital
humanities](https://www.journals.uchicago.edu/doi/abs/10.1086/702594)
that she calls "Computational Literary Studies," or CLS. The premise of
the article is to demonstrate the poverty of the field by showing that
the new structure of CLS is easily dismantled by the master's own tools.
It appears to have succeeded enough at gaining attention that it clearly
does some kind of work far out of proportion to the merits of the article itself.

The piece is not a useful contribution; it's a magic trick that relies
on the inattention or ignorance of its readers. While it pretends to
demystify computation for literary critics, it in fact
does exactly the opposite; it operates through a series of feints and
misdirections that repeatedly misstate what the plain text of other
scholars--in both literature and statistics--says, and what the
statistical work she herself has done is. The article is predicated on
a lack of statistical sophistication among the readers of _Critical
Inquiry._

The "computational" aspect of Da's case is twofold:

1. It asserts that actually existing CLS is ridden with statistical
   errors that could be easily corrected, and claims to have performed
   replications.

2. It offers that in _other_ areas--science and industry--computational
   methods are being deployed perfectly and appropriately; but that
   sadly, such methods *cannot* be applied in literary studies because
   they have demonstrably produced only absurdities and
   tautologies.

I do not believe it would be possible to write an article that defends
both of these points. If existing pieces are so heavily flawed, then we
probably don't know the limits of the knowable. If, on the other hand,
we're able to tell that CLS will never produce useful results for
literature, it would probably only be because the existing literature
gives us some sense of what's possible.

But together--and this is where the appeal comes from--they break some
fresh ground in the genre of anti-digital-humanities polemic. To
straightforwardly attack the cultural authority of numbers has become
increasingly problematic in the past few years. The hegemony of STEM has
increased inside the university, making the gambit more institutionally
dangerous; and at the same time, humanists have come to realize there
may be forces in the world yet more sinister than scientists. The
rhetorical tools you can deploy against positivism are strong, but they
risk appearing to make it seem--say--that maybe we shouldn't listen to
climate scientists. So Da's piece posits that everyone else is using
numbers right--but also holds that the exercises in replication and
methodological analysis (a good thing) proffered here _don't_ actually
show the way to better research.

Da moves past anti-positivism into something fresh-- call it
computational NIMBYism. Rather than pooh-pooh statistical reasoning, she
elevates it by incanting the language of quantification against itself.
Far _more_ than anyone I've seen in any humanities article, she asserts
that scientists do something arcane, powerful, and true. But she returns
from this promised land with hard-won truths for literary critics; its
computationalists are false prophets engaged in a cargo-cult version of
data science, and the true religion has nothing to say for literary
scholars.

{#the-response-the-article-engenders}
## The response the article engenders.

A careful effort to replicate published articles is necessary.
Fortunately, it is also something that happens, albeit not as much as
might be useful. I expansively discussed the concerns Da raises about
topic modeling across time in Underwood and Goldstone's work in
2013.\[@schmidt\_words\_2013\] Their response is explicitly contained
within the paper Da read. The final footnote in Ted Underwood's new book
raises precisely the same questions about a Stanford
Literary Lab pamphlet's use of bigram entropy as a distinguishing
measure. [^1]

But this isn't that article. The computational evidence deployed
here--the thing that tries to make this piece stand out--is striking in
its sloppiness even compared to the works it pretends to debunk. Perhaps
the whole piece is intended as a parody of what can slide into top
literary journals nowadays. (It is indeed the case that Critical Inquiry
will allow you to publish with terribly inadequate code appendices and
reviewers incompetent to assess the validity of your work.) But it
certainly does not show that _good_ statistics can obliterate the _bad_
statistics that are widespread. Instead, the most it could do is
demonstrate that the literary profession is as easily bamboozled by
numbers as Da says.

The tension between the two goals is evident in the first section of the set, on
a Ted Underwood piece on genre classification. She at once claims a
simple correction--

> Underwood should train his model on pre-1941 detective fiction (A) as
> compared to pre-1941 random stew and post-1941 detective fiction (B)
> as compared to post-1941 random stew, instead of one random stew for
> both, to rule out the possibility that the difference between A and B
> is not broadly descriptive of a larger trend (since all literature
> might be changed after 1941).

_and_ that Underwood uses methods that could never find differences
between genres.

It is true that Underwood does use methods inadequate to prove there is
no difference in detective fiction pre and post-1930. (Her use of the
year "1941" is a mistake--it seems to stem from confusing the date of
one of Underwood's sources with the year he chose for a testing cutoff).
This is an absurdly high bar--of course _something_ changed, if only the
existence of words like 'television' and 'databases.' Underwood says as
much. The actual article is caught up in a more interesting discussion
of the _comparative_ stability of genres. The core argument is not, as
Da says, that genres have been "more or less consistent from the 1820s
to the present," but that detective fiction, the gothic, and science
fiction--*specifically*-- show different patterns, with detective
fiction being a far more coherent pattern than the gothic novel. By
focusing only on detective fiction, she's missing the entire argument of the
article.

That this doesn't merit correction or retraction is depressing.

I don't know what Underwood used to train. But if he did allow the
'random stew' to contain both pre- and post-1930 work that would make
the performance of his model _more_ remarkable, not less--it would
indicate that it was correctly tagging Elmore Leonard (say) novels as
detective fiction even though they use words like "fax" or "polaroid" it had
previously seen in the post-1930 set.

Where Da's method really shines, though, is in the random statistical
vocabulary she brings to bear.

> All that Underwood has shown in using word frequency homogeneity to
> differentiate detective fiction from random fiction is that the
> difference between pre- and post-1941 detective fiction is not as
> significant as its difference from random fiction. This does not mean
> that the same method can capture the difference between different types
> of detective fiction. After all, statistics automatically assumes that
> 95 percent of the time there is no difference and that only 5 percent of
> the time there is a difference. That is what it means to look for
> p-value less than 0.05. Think of it this way: if everyone can agree that
> something is changing---even Underwood concedes that genres evolve---but
> you have devised one way that concludes that it does not, it does not
> necessarily mean that you have found something.

In the first specific critique, the article talks about 95% p-values in
the following way: "statistics automatically assumes that 95 percent of
the time there is no difference and that only 5 percent of the time
there is a difference. That is what it means to look for p-value less
than 0.05."

To look for a p-value under 0.05 is to look for a pattern that random
variation alone would produce less than 5% of the time. It's not a
great threshold. But Underwood's paper does not rely on p-values.
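
A quick simulation makes that definition concrete. This is a minimal sketch, assuming a two-sided z-test on samples drawn from a standard normal, so that the null hypothesis is true by construction; the function names, sample size, trial count, and seed are all my own illustrative choices and come from none of the papers discussed here.

```python
import math
import random

def two_sided_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(trials=20000, n=30, alpha=0.05, seed=0):
    """Share of experiments with NO real effect that still cross alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / trials

print(false_positive_rate())
```

The printed share should land near 0.05: the threshold describes how often pure noise clears the bar, not an assumption that 95 percent of the time there is no difference.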

So let's take a look at how well the statistical claims here hold up: is
the debunking useful?

{#jockers-and-kiriloff-on-gender}
## Jockers and Kirilloff on Gender

Da's critique of Underwood relies mostly on failures of reading. The
next section, on work by Matthew Jockers and Gabi Kirilloff, showcases
the way her piece rests rhetorically on the innumeracy of her
readership. Her critique of Jockers and Kirilloff is, as she says, that
they "present a statistical no-result finding as a finding."

In order to do so, she swamps her readers with a blizzard of statistical
language that she can justifiably assume will sound plausible to the
readers of _Critical Inquiry_. Her promise is that she will offer "a
clear explanation of the computational work that CLS actually does"
(605). In her two paragraphs on Jockers and Kirilloff, she tosses out the
following observations:

- "Let us say that you are measuring the overlap of features between two
  sets of data using a standard 5 percent confidence level; out of n
  possible shared features, 0.05n will automatically be significant."
- "In good statistical work, the burden to show difference within
  naturally occurring differences ('diff in diff') is extremely high."
- "This paper does not perform a bootstrap, which means the
  literary-historical suggestions that follow this genre classification
  do not stand."
- "Practitioners have to apply the Bonferroni Correction to conventional
  statistical thresholds of significance used for data mining."

And so on. This blizzard of terminology establishes for the innumerate
reader that they finally have an expert who will debunk statistics for
them, while freeing them of the burdensome requirement to think for
themselves.

But much of this is word salad; what stands is unimportant. The claim
that 5% of features "will automatically be significant" seems to
approach the claim that she has already had to retract: that "statistics
automatically assumes that 95 percent of the time there is no difference
and that only 5 percent of the time there is a difference. That is what
it means to look for p-value less than 0.05." 'Diff in diff' is indeed
an important tool, but it's not about testing whether two
distributions are different from each other; it's about testing whether
a post-treatment experimental group (like recipients of experimental
chemotherapy, or counties that received Gates foundation grants) saw a
significant [time series
change.](https://www.mailman.columbia.edu/research/population-health-methods/difference-difference-estimation)
Bootstrap resampling to generate confidence intervals can be useful, but
to randomly invoke it, as here, is about as sophisticated as demanding
that every article, regardless of content, take a transnational
approach.
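
To make the difference-in-differences point concrete: the estimate compares the before-and-after change in a treated group to the before-and-after change in an untreated one. A minimal sketch follows; the county outcomes are numbers I invented purely for illustration, and the function name is my own.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(change in the treated group) minus (change in the control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Invented outcome values for counties with and without a grant.
treated_pre = [10.0, 12.0, 11.0]
treated_post = [15.0, 17.0, 16.0]
control_pre = [9.0, 11.0, 10.0]
control_post = [11.0, 13.0, 12.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)  # 3.0
```

The calculation only makes sense when there is a treatment and a before and after; on its own it says nothing about whether two word distributions differ.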

To say that significance testing should apply the Bonferroni correction
is _not_ nonsense. But neither is it something that Da does. As with her
discussion of Underwood, the exercise relies on coming up with a straw
man description of the claim of the article, and then rejecting that. Da
focuses mostly on the question of whether there are statistically
significant differences in gendered use of verbs. Jockers and Kirilloff
use the method of nearest shrunken centroids as input into their model
for a variety of reasons having to do with model interpretability. [^2]

But Jockers and Kirilloff's findings _are_ significant at the level that
Da suggests, and it is Da's work that is truly sloppy. In the appendix,
she publishes a comparison that obviously mislabels its bins (it claims
that her replication found that "she killed" and "he wept" are gender
stereotypes, rather than the opposite). If the goal is simply to find
which words show strong gendered patterns of usage, it's unclear why she
would choose a different statistical method. In the appendix, she claims to
have performed a replication and found that "Overall, the percentage
differences between these top most correlated verbs for each gender was
very low (0.031% to 0.307%) meaning that while a difference can be
found, male/female is not very differentiated from one another if we
look at verbs." I have no idea what statistics she is reporting
here--although she has a github repository online, it appears not to
contain any of the code used to generate these tables. [^3]

But while Da's method is obscure, I am confident that the interpretation
any reader would take from this-- that Jockers and Kirilloff report
statistically inflated claims of difference-- is incorrect. A simple way
to test the robustness here is just to apply a Dunning Log-Likelihood
test, and use a close analogue to the Bonferroni correction Da calls for
and then never runs, a Holm-Sidak correction. [^4] The result: 81% of
the words Jockers and Kirilloff look at are statistically significant.
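
For readers who want to see the shape of such a test, here is a minimal sketch. It is not the code behind the 81% figure: the word counts are invented, the log-likelihood uses the full 2x2 table with a one-degree-of-freedom chi-square approximation, and the Holm-Sidak step orders tests by p-value (the textbook version) rather than by word frequency as described in the footnote. All function names are my own.

```python
import math

def g2_pvalue(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (G^2) for one word in two corpora, with p-value.

    Uses the full 2x2 table (word vs. all other tokens) and a chi-square
    approximation with one degree of freedom.
    """
    observed = [count_a, count_b, total_a - count_a, total_b - count_b]
    word_share = (count_a + count_b) / (total_a + total_b)
    expected = [
        total_a * word_share,
        total_b * word_share,
        total_a * (1 - word_share),
        total_b * (1 - word_share),
    ]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    p = math.erfc(math.sqrt(max(g2, 0.0) / 2))
    return g2, p

def holm_sidak(pvalues, alpha=0.05):
    """Step-down Holm-Sidak: one reject/keep flag per input p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        threshold = 1 - (1 - alpha) ** (1 / (m - rank))
        if pvalues[i] > threshold:
            break  # every larger p-value fails too
        reject[i] = True
    return reject

# Invented counts: (uses after "he", uses after "she"), equal-sized corpora.
counts = {"wept": (120, 480), "said": (5000, 5100), "killed": (300, 90)}
total_he, total_she = 1_000_000, 1_000_000
pvals = [g2_pvalue(a, total_he, b, total_she)[1] for a, b in counts.values()]
flags = holm_sidak(pvals)
print(dict(zip(counts, flags)))
```

With these made-up counts the strongly gendered verbs survive the correction and the evenly split one does not, which is the pattern the correction is supposed to separate.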

After spending a paragraph and a half throwing out statistical claims
that

My intellectual disagreement, here, is with the

{#piper-on-confessions}
## Piper on Confessions

This same slapdash method--mis-stating the statistical or computational
literature, failing to run the very tests she insists are necessary, and
then leaving the reader with the impression she has somehow invalidated
the result--is on prime display in her description of Andrew Piper's
work on the confessional form. Da pulls statistical pronouncements out
of thin air and presents them as that which _must_ be done. These claims
are often either misinformed or misleading.

I can't bear to go through all her sections. But as an example, take the
analysis of Andrew Piper's work on Augustine's _Confessions_. In a few
paragraphs, she makes as many mistakes as she holds him to account for in
a full article.

First, she criticizes Piper for performing Principal Components Analysis
on unscaled word frequencies, and produces scatterplots that show
dramatically different results from his: "The way to properly scale this
type of matrix is outlined in G. Casella et al's Introduction to
Statistical Learning... The second step \[Z-scaling\] is necessary if
each word is to be seen as a feature for PCA." George Casella did not
write a book called _Introduction to Statistical Learning_; she means
the 2013 volume by Gareth James et al. published (after Casella's death)
in a Springer series for which he was general editor. The chapter she
cites certainly does not say that PCA matrices must always be scaled by
standard deviations. It says, rather, that scaling before PCA is a decision
the researcher should weigh. When units are arbitrary, PCA should be
scaled--if comparing SAT scores to grade point averages, you don't want
the difference between a 1420 and a 1421 on the test to count the same as
the difference between a 2.5 and a 3.5 on the GPA. But word frequencies are not arbitrary. In
those cases, the researcher must decide. To quote from the text: "In
certain settings, however, the variables may be measured in the same
units. In this case, we might not wish to scale the variables to have
standard deviation one before performing
PCA."@james\_introduction\_2013.
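
The SAT and GPA case is small enough to work through by hand. The sketch below is illustrative only: the six students are invented, and for two variables the first principal component can be read off the closed-form leading eigenvector of the 2x2 covariance matrix, so no linear algebra library is assumed.

```python
import math
from statistics import mean, pstdev

def covariance_matrix(xs, ys):
    """Population covariance entries (a, b, c) of [[a, b], [b, c]]."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    a = sum((x - mx) ** 2 for x in xs) / n
    c = sum((y - my) ** 2 for y in ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return a, b, c

def first_component(xs, ys):
    """Unit-length loadings of the first principal component of two variables."""
    a, b, c = covariance_matrix(xs, ys)
    top = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)  # largest eigenvalue
    vx, vy = (b, top - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def zscore(values):
    """Center and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Invented students: SAT on a 1600-point scale, GPA on a 4-point scale.
sat = [1100, 1250, 1420, 1330, 1500, 1010]
gpa = [2.9, 3.1, 3.6, 3.0, 3.8, 2.5]

print(first_component(sat, gpa))                  # unscaled
print(first_component(zscore(sat), zscore(gpa)))  # scaled
```

Unscaled, the first component is almost entirely SAT, simply because its numbers are bigger; scaled, the two variables load equally. When every column is a word frequency in the same units, neither choice is automatically the right one.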

This is a central challenge familiar to anyone who has tried to grapple
with wordcounts. There are so many uncommon words used once or twice in
any given text that, when scaling is used, they can completely swamp the
repeated words. A variety of solutions are in common use. TF-IDF scaling
drops out the most common words while allowing those of medium frequency
to shine through; log transformations of various flavors proliferate.
Ideally solutions would not be wholly dependent on the parameter space,
but the phrasing of the question matters.

Da sidesteps all these complications for her readers by implying
the real difference has to do with a _philological_ failing, that Piper
doesn't stem Latin text. This is something a literary audience can
understand, and gestures towards a humanistic critique. But comically,
her version reproduces many of the same philological failings. She
implies that Piper didn't use a Latin stemming algorithm because the
"only Latin stemmer available is the Schinke stemmer," but that she
_has_ taken the effort. This is incorrect on both fronts. First, there
are many Latin stemmers available. (For an in-depth analysis of at least
6, see [Patrick Burns's
work](https://github.com/diyclassics/lemmatizer-experiments/tree/master/notebooks).)

And her effort seems to be scattershot at best. It's hard to tell what
code Da actually ran--the [online
appendices](https://github.com/nan-da/Novel-Devotions) for analyzing
Piper's case only include the PCA code for Chinese, not the figures
included in the appendix. (Ordinarily I would be forgiving of this kind
of lapse, which is all too common; perhaps the inadequate code
appendices are intended as a higher-order critique of computational
work. But her failings vis-a-vis replication are far greater than those
of, say, Ted Underwood, who generally supplies a single script called
`replicate.py` that you can run yourself inside any of his projects.)

Still, from what she has posted
[online](https://github.com/nan-da/Novel-Devotions), Da appears to have
re-implemented Schinke's algorithm in both R and python, with separate
rules for nouns and verbs. But then, in her Cross Distance code, she
simply applies the noun stemming rules to all words, (probably) because
choosing a part of speech is much harder than running stemming. This
results in many problems; both because some verbs are not stemmed at all
('resurrexit' remains 'resurrexit' even though the verb rules would have
it as 'resurrexi'); and because the rules are applied to function words
as well with silent NULL results in her code, so that words like 'que,'
'cum,' 'te,' and 'me' are deleted from the text altogether. That is:
many function words are being dropped altogether because a new
implementation was hastily coded rather than using one of the more
mature implementations available.
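
A toy sketch of this failure mode, with heavy caveats: this is not Da's code and not the full Schinke rule set. The suffix list is an illustrative subset, and the minimum stem length and the `None` return are my assumptions about one way that "silent NULL results" can arise when noun rules are run over every token.

```python
# Illustrative subset of noun-style suffixes, longest first (an assumption,
# not the full Schinke rule set and not Da's code).
NOUN_SUFFIXES = ["ibus", "ius", "ae", "am", "as", "em", "es", "um", "us",
                 "a", "e", "i", "o", "u"]

def stem_as_noun(word, min_stem=2):
    """Strip an enclitic -que, then the first matching noun suffix.

    Returns None when what is left is shorter than min_stem characters
    (a hypothetical stand-in for the silent NULLs).
    """
    if word.endswith("que"):
        word = word[:-3]
    for suffix in NOUN_SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    return word if len(word) >= min_stem else None

def stem_text(words):
    """A pipeline that silently discards every None result."""
    stems = (stem_as_noun(w) for w in words)
    return [s for s in stems if s is not None]

line = ["te", "me", "resurrexit", "cum", "populus", "que"]
print(stem_text(line))
```

In this toy version the verb passes through unstemmed because no noun suffix matches it, and the short function words vanish because nothing checks for an empty result before the list is rebuilt.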

I wrote this and then, quickly, checked what difference it all makes.
(Code and edits online
[here](https://github.com/bmschmidt/Novel-Devotions/blob/master/Checking.Rmd).)
I was, honestly, expecting that the scaling factor would be significant
and account for the differences in texts. But actually, what I got looks
more or less like Piper's original.

Reproduction of Piper's original:

![A reproduction of the original](/img/mds_piper_unscaled_unstemmed.png)

Reproduction using Da's scaling.

![A reproduction of a reproduction](/img/mds_piper_scaled_stemmed.png)

Or maybe it looks completely unlike it!

{#and-so-on}
## And so on.

I could go on. The debunking of topic models, for example, uses not the
well-established literature about comparing topic model distributions to
each other, but some arbitrarily chosen robustness tests (it drops 1%
of documents). But it is not a replication. Topic models rely on
extremely specific assumptions about the distribution of words in texts
based on word counts; they attempt to reproduce the frequencies in
actual documents.

But rather than fit on word counts, the model, for no apparent reason,
[uses TF-IDF
vectors](https://github.com/nan-da/Quiet-Transformations/blob/master/lda_fitting.py#L60-L64)
that multiply the significance of rare words and decrease the
significance of common ones. I have never seen a TF-IDF vectorization
fed into an LDA feature set before-- it's an extremely odd choice that
guarantees the results will be different from Underwood and Goldstone's,
and partially explains the incoherent topics in the appendix, such as
`doulce attractiveness unsatisfying gence dater following mecum wigan cio milieu`.
(Edit 03-20) I'm wrong about this: Andrew Goldstone points out that
there's an argument to the TFIDF vectorizer in her code that makes it
output raw frequencies. Frequencies might still produce results
different than the counts that Underwood and Goldstone used, but this is
not a howler. It's still unreasonable, though, to expect that the topics
put out by the Variational Bayes, online LDA implementation in
scikit-learn will be the same as those in the Gibbs-Sampling method
Underwood and Goldstone use from Mallet. Different methods can produce
dramatically different results when the hyperparameters are not properly
tuned. ([See
here](https://www.cs.mcgill.ca/~uai2009/papers/UAI2009_0243_1a80458f5db72411c0c1e392f7dbbc48.pdf))
While [Goldstone
does](https://github.com/agoldst/dfrtopics/blob/master/R/model.R#L282-L2840)
optimize hyperparameters, there's nothing in the scikit-learn code that
indicates this effort. So the models may be radically different because
Underwood and Goldstone ran a better model.
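
The distinction between counts, frequencies, and TF-IDF weights is easy to see on a toy corpus. The sketch below uses three invented documents and a common textbook TF-IDF variant (frequency times the log of the inverse document frequency); it is not scikit-learn's exact formula, which smooths the IDF term, and it is not taken from anyone's replication code.

```python
import math
from collections import Counter

docs = [
    "the whale the sea the ship",
    "the whale the harpoon",
    "the garden the letter the ball the visit",
]
tokenized = [d.split() for d in docs]

def counts(tokens):
    """Raw word counts: the kind of input LDA's generative story assumes."""
    return Counter(tokens)

def frequencies(tokens):
    """Counts divided by document length, so every document sums to 1."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

def tfidf(tokens, corpus):
    """Frequency times log(N / document frequency): a textbook variant."""
    n_docs = len(corpus)
    weights = {}
    for w, f in frequencies(tokens).items():
        df = sum(1 for doc in corpus if w in doc)
        weights[w] = f * math.log(n_docs / df)
    return weights

first = tokenized[0]
print(counts(first))
print(frequencies(first))
print(tfidf(first, tokenized))
```

Frequencies throw away document length, which counts preserve; the TF-IDF variant goes further and pushes a word that appears in every document down to zero while boosting words found in only one.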

In fact, Goldstone and Underwood's [original work on
this](http://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/)
dealt with this issue very clearly:

> On the other hand, to say that two models "look substantially
> different" isn't to say that they're incompatible. A jigsaw puzzle cut
> into 100 pieces looks different from one with 150 pieces. If you
> examine them piece by piece, no two pieces are the same --- but once
> you put them together you're looking at the same picture.

This comparison obviously mislabels its bins (it claims that her
replication found that "she killed" and "he wept" are gender stereotypes,
rather than the opposite) and makes some extremely fishy claims such as
"Overall, the percentage differences between these top most correlated
verbs for each gender was very low (0.031% to 0.307%) meaning that while
a difference can be found, male/female is not very differentiated from
one another if we look at verbs." I don't know what that range is
supposed to be, but at least for 'wept', [Google Ngrams gives the
difference in gender usage as
400%](https://books.google.com/ngrams/graph?content=%28she+cried%2Fshe+_VERB_%29%2F%28he+cried%2Fhe+_VERB_%29&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2C%28she%20cried%20/%20she%20_VERB_%29%20/%20%28he%20cried%20/%20he%20_VERB_%29%3B%2Cc0#t1%3B%2C(she%20cried%20%2F%20she%20_VERB_)%20%2F%20(he%20cried%20%2F%20he%20{_VERB_})%3B%2Cc0)

But to go through all of this is a pain. I'm sure others have written
other analyses. This work is tedious, which is the reason that it's
rarely done; and it's hard to reproduce another workflow even when it's
well-documented.

[^1]: For the record, I myself made a quick check using yet another
  measure of entropy, compressibility; I'm inclined to think Da is
  right that there is a fundamental error in Stanford's bigram
  calculations.

[^2]: Nearest shrunken centroids is indeed a sort of idiosyncratic
  choice, but one that Jockers seems to be extremely partial to going
  back over a decade. @jockers\_comparative\_2008. Whether digital
  humanists should be free to roam across the disciplines in search of
  obscure but useful algorithms, or should remain in a tightly
  constrained space, is a difficult question. My stance--

[^3]: I base this partly on the fact that the appendix says it uses the "SpaCy"
  package for results, but none of her online code imports that
  package.

[^4]: I am not a statistician, but I use the Sidak correction because
  the literature seems to say it's [superior
  to](http://www.cogsci.ucsd.edu/~dgroppe/STATZ/Abdi-Bonferroni2007-pretty.pdf)
  the Bonferroni. I use the Holm modification, which applies
  increasingly stringent standards as you descend a list, because of
  an issue Da doesn't ever address, type II errors: it is as
  incorrect to report a false negative as a false positive. I order
  the Holm method by word frequency, not p-value (suggested in some
  online literature), to make the test more conservative; since Jockers
  and Kirilloff use the 310 most common words, there's no need to worry
  about multiple comparisons outside this range.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[History Degrees since the Great Recession.]]></title>
            <link>https://benschmidt.org/post/2018-12-03-AHA-Report</link>
            <guid>https://benschmidt.org/post/2018-12-03-AHA-Report</guid>
            <pubDate>Mon, 03 Dec 2018 17:16:22 GMT</pubDate>
            <content:encoded><![CDATA[I wrote this year's report on history majors for the American Historical
Association's magazine, _Perspectives on History_; it takes a medium-term
view of the significant hit the history major has taken since
the 2008 financial crisis. You can read it
[here](https://www.historians.org/publications-and-directories/perspectives-on-history/december-2018/the-history-ba-since-the-great-recession-the-2018-aha-majors-report).

There's also an [interview with me about the
topic](https://www.historians.org/publications-and-directories/perspectives-on-history/december-2018/the-history-ba-since-the-great-recession-the-2018-aha-majors-report)
in the Chronicle of Higher Education.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Interactive Scatterplots]]></title>
            <link>https://benschmidt.org/post/scatterzoom</link>
            <guid>https://benschmidt.org/post/scatterzoom</guid>
            <pubDate>Tue, 30 Oct 2018 13:37:26 GMT</pubDate>
            <content:encoded><![CDATA[As part of the _Creating Data_ project, I've been doing a lot of work
lately with interactive scatterplots. The most interesting of them is
[this one about the full Hathi collection](https://t.co/erWeUkR9Fk). But
I've posted a few more I want to link to from here:

- [An exploration of co-occurring street names in the United
  States](http://creatingdata.us/etc/streets/)
- A [description of the process of making them and tiling data
  hierarchically](http://creatingdata.us/techne/deep_scatterplots/)
- [A general description of visual bibliographies with an analysis of
  fiction datasets.](http://creatingdata.us/techne/bibliographies/)
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Stable Random Projection]]></title>
            <link>https://benschmidt.org/post/SRP</link>
            <guid>https://benschmidt.org/post/SRP</guid>
            <pubDate>Mon, 22 Oct 2018 14:30:33 GMT</pubDate>
            <content:encoded><![CDATA[I have a new article on dimensionality reduction on massive digital
libraries this month. Because it's a technique with applications beyond
the specific tasks outlined there, I want to link to a few things here.

- [The
  article](http://culturalanalytics.org/2018/09/stable-random-projection-lightweight-general-purpose-dimensionality-reduction-for-digitized-libraries/)
  in _Cultural Analytics_.

- [A visualization of 13 million books from the Hathi
  Trust](http://creatingdata.us/datasets/hathi-features/) in [_Creating
  Data_](http://creatingdata.us).
  ![](https://pbs.twimg.com/media/DpQMbHgXUAA84Ag.jpg)

- [Instructions for best using those features for your own
  projects](http://creatingdata.us/datasets/hathi-vectors/) in _Creating
  Data_.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[New Site]]></title>
            <link>https://benschmidt.org/post/hugo</link>
            <guid>https://benschmidt.org/post/hugo</guid>
            <pubDate>Sun, 21 Oct 2018 14:30:33 GMT</pubDate>
            <content:encoded><![CDATA[I'm switching this site over from Wordpress to Hugo, which makes it
easier for me to maintain.

It may also confuse the RSS feed a bit. This should hopefully be a
one-time occurrence.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[New article in the Atlantic]]></title>
            <link>https://benschmidt.org/post/Atlantic_humanities</link>
            <guid>https://benschmidt.org/post/Atlantic_humanities</guid>
            <pubDate>Fri, 31 Aug 2018 17:37:26 GMT</pubDate>
            <content:encoded><![CDATA[I have a [new article in the
Atlantic](https://www.theatlantic.com/ideas/archive/2018/08/the-humanities-face-a-crisisof-confidence/567565/)
about declining numbers for humanities majors.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Sapping Attention: Mea Culpa, There is a crisis in the humanities.]]></title>
            <link>https://benschmidt.org/post/Sapping_attention_humanities</link>
            <guid>https://benschmidt.org/post/Sapping_attention_humanities</guid>
            <pubDate>Mon, 30 Jul 2018 17:37:26 GMT</pubDate>
            <content:encoded><![CDATA[![how bad the decline in
humanities majors has been since
2013](http://sappingattention.blogspot.com/2018/07/mea-culpa-there-is-crisis-in-humanities.html).
In short, it's been bad enough to make me recant earlier statements of
mine about the long-term health of the humanities discipline.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Feature Reduction on the Underwood-Sellars corpus]]></title>
            <link>https://benschmidt.org/post/2016-03-19-feature-reduction-on-the-underwood-sellars-corpus</link>
            <guid>https://benschmidt.org/post/2016-03-19-feature-reduction-on-the-underwood-sellars-corpus</guid>
            <pubDate>Sat, 19 Mar 2016 15:17:35 GMT</pubDate>
            <content:encoded><![CDATA[This is some real inside baseball; I think only two or three people will
be interested in this post. But I'm hoping to get one of them to act out
or criticize a quick idea. This started as a comment on Scott Enderle's
blog, but then I realized that Andrew Goldstone doesn't have comments
for the parts pertaining to him... Anyway.

Basically I'm interested in feature reduction for token-based
classification tasks. Ted Underwood and Jordan Sellars' article on the
pace of change (hereafter U&S) has inspired a number of replications.
They use the 3200 most-common words to classify 720 books of poetry as
"high prestige" or "low prestige."

Shortly after it was published, [I made a Bookworm browser designed to
visualize U&S's core
model](http://bookworm.benschmidt.org/posts/2015-05-22-paceofchange.html),
and asked Underwood about [whether similar classification accuracy on a
much smaller feature set was
possible](https://twitter.com/benmschmidt/status/601812515639681025). My
hope was that a smaller set of words might produce a more interpretable
model. In January, [Andrew Goldstone took a stab at reproducing the
model](http://andrewgoldstone.com/blog/2016/01/04/standards/): he does,
but then argues that trying to read the model word by word is something
of a fool's errand:

> Researchers should be very cautious about moving from good
> classification performance to interpreting lists of highly-weighted
> words. I've seen quite a bit of this going around, but it seems to me
> that it's very easy to lose sight of how many sources of variability
> there are in those lists. Literary scholars love getting a lot from
> details, but statistical models are designed to get the overall
> picture right, usually by averaging away the variability in the
> detail.

I'm sure that Goldstone is being sage here. Unfortunately for me, he
hits on this wisdom _before_ using the lasso instead of ridge
regression to greatly reduce the size of the feature set (down to 219
features at 77% success rate, if I'm reading his console output
correctly), so I don't get to see what features a smaller model selects.
[Scott Enderle took up Goldstone's challenge, explained the difference
between ridge regression and lasso in an elegant way, and actually
improved on U&S's classification accuracy with 400
tokens](http://www.lagado.name/blog/to-conquer-all-mysteries-by-rule-and-line/)--an
eightfold reduction in size.

So I'm left wondering whether there's a better route through this
mess. For me, the real appeal of feature selection on words would be
that it might create models which are intuitively apprehensible for
English professors. But if Goldstone is right that this shouldn't be the
goal, I'm unclear why the best classification technique would use words
as features at all.

So I have two questions for Goldstone, Enderle, and anyone else
interested in this topic:

1. Is there any redeeming interpretability to the features included in
   a unigram model? Or is Goldstone right that we shouldn't do this?
2. If we don't want model interpretability, why use tokens as features
   at all? In particular, wouldn't the highest classification accuracy
   be found by using dimensionality reduction techniques across the
   *entire* set of tokens in the corpus? I've been using the
   U&S corpus to test a dimensionality reduction technique I'm currently
   writing up. It works about as well as U&S's features for
   classification, even though it does nothing to solve the collinearity
   problems that Goldstone describes in his post. A good feature
   reduction technique for documents, like latent semantic indexing or
   independent components analysis, should be able to do much better,
   I'd think--I would guess at classification accuracy over 80% with
   under a thousand dimensions. Shouldn't this be the right way to
   handle this? Does anyone want to take a stab at it? This would be nice
   to have as a baseline for these sorts of abstract feature-based
   classification tasks.
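
To make the second question concrete, here is a minimal sketch of reducing dimensions across a whole vocabulary with a truncated SVD (the core of latent semantic indexing), in Python with NumPy. It is not the technique mentioned above and it does not touch the U&S data: the tiny count matrix and the `lsi_reduce` helper are invented purely for illustration.

```python
import numpy as np

def lsi_reduce(doc_term, n_dims):
    """Project documents onto the top n_dims singular directions."""
    # Rows are documents, columns are token counts.
    matrix = np.asarray(doc_term, dtype=float)
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale each kept direction by its singular value, as in LSI.
    return u[:, :n_dims] * s[:n_dims]

# Toy counts: 4 documents by 6 tokens (invented numbers).
counts = np.array([
    [3, 0, 1, 0, 2, 0],
    [2, 0, 0, 1, 3, 0],
    [0, 4, 0, 2, 0, 1],
    [0, 3, 1, 3, 0, 2],
])

reduced = lsi_reduce(counts, n_dims=2)
print(reduced.shape)
```

Each row of `reduced` describes a document with two dense features instead of six token counts; those rows could then go into whatever classifier you prefer in place of the raw word frequencies.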
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Buying a computer for digital humanities work]]></title>
            <link>https://benschmidt.org/post/2015-06-12-buying-a-computer-for-digital-humanities-work</link>
            <guid>https://benschmidt.org/post/2015-06-12-buying-a-computer-for-digital-humanities-work</guid>
            <pubDate>Fri, 12 Jun 2015 17:49:01 GMT</pubDate>
            <content:encoded><![CDATA[I've gotten a couple e-mails this week from people asking advice about
what sort of computers they should buy for digital humanities research.
That makes me think there aren't enough resources online for this, so
I'm posting my general advice here. (For some solid other perspectives,
see here). For keyword optimization I'm calling this post "digital
humanities." But, obviously, I really mean the subset that is humanities
computing, what I tend to call humanities data analysis. \[Edit: To be
clear, \] Moreover, the guidelines here are specifically tailored for
text analysis; if you are working with images, you'll have somewhat
different needs (in particular, you may need a better graphics card). If
you do GIS, god help you. I don't do any serious social network
analysis, but I think the guidelines below should work relatively well with
Gephi.

_Pricing_: For each component, I'm putting up a cheap and an expensive
option. The cheap option down the line should be reasonable on a grad
student budget; the expensive set should be luxurious for a faculty or
staff member with a substantial research budget. I'm also briefly
describing what I myself have been using for the last two years, because
specific examples can be helpful; my own setup tended towards the
luxurious end of the spectrum in the summer of 2013.

The difference between what you can do with cheap and expensive is not
great. Don't make the mistake of fetishizing the hardware too much; as
with most tools, it's the performer, not the instrument, that truly
matters. University libraries are filled with iMac computers that have
incredible computing resources that are never used for anything but
e-mail; you could, if you have a certain bent, adopt a no-computer
coding and repo management style that stores code on GitHub and runs it
only on public computers. [Here, for example, is a snippet of code you
can run on any library computer that will stream the entire Google
Ngrams 3-grams corpus to a library server and store only the entries
matching a regular
expression](https://gist.github.com/bmschmidt/923dce0330d72486ee8d).
(Please don't run that code trivially, it wastes a lot of resources).
You might need to store R or some python modules on a thumb drive to
realistically make this work, but you might be able to become a folk
hero by doing it.

*1. Laptop vs Desktop.*

_Cheap_: The cheapest route to go is a single laptop supplemented,
should you find you need it, with specially purchased virtual server
time on one of the cloud services. (Probably it's best to use Amazon for
this, since that's the most widely used option--see "Hard drive space"
for a discussion of Amazon with large datasets.)

_Expensive_: You're going to need a laptop in any case (anyone planning
to use a tablet for presentations is idiosyncratic enough that they
aren't reading this guide). An additional desktop will let you buy more
computing power more cheaply than a new laptop. If you have an office
(home or university), I think it makes less sense to max out on an
expensive all-in-one laptop, and more to purchase a desktop for
computing-intensive tasks and choose your laptop as a piece of consumer
electronics--base it on style, or the keyboard you like best, or weight.
Desktops can run continuously, which is useful for long computations,
web scraping, and the like. (People are often afraid to let programs run
for hours, because they think that they are "frozen"; but frequently, a
text-analysis task or complicated download may take hours to run. My
personal record is a program that ran for about 3 months in the
background.) They also can act as a web server, which can be useful for
all sorts of tasks, from running an Omeka installation to hosting a
backup of your slide deck online.

If you're looking at spending over \$1400 or so, I would seriously
consider getting two computers; if less, focus on the laptop. Keep in
mind that running a desktop continuously uses large amounts of
electricity; one of the reasons running your own is so much cheaper than
using Amazon is that all this carbon-producing effort is billed to your
office or home.

_My setup_: I have a MacBook Air for day-to-day use, chosen primarily
because it has the most battery life and I frequently forget to charge
my computer, and a Linux desktop in a large tower case, described
further below. They both ended up costing approximately the same (\$1500
or so, plus more for some hard drives), but the desktop is much more
powerful for computation.

*2. Operating system.*

_Cheap_: Ubuntu 14.04. (Or 16.04, once it comes out: the even-numbered
Ubuntu releases are so-called "long term support," and often the best to
stick to if you don't want to lose a lot of time on updates). Ubuntu is
the most widely used and easiest "flavor" of Linux to install, is free,
will generally be the easiest solution for installing software on,
and *usually* works with printers, cameras, and the like (always the
biggest problem with Linux). Ubuntu is a little bit commercialized and
sometimes too glossy; on really low-end hardware, a version of Debian
using a simpler graphics stack may be a better choice.

_Expensive_: Mac OS X. Apple Hardware is pretty and, thanks to their
market dominance, often cheaper for things like laptops.
[Homebrew](http://brew.sh/) is an indispensable package manager for
installing open-source software; under the hood, where most programming
happens, Mac OS X is the same as most other Unix. Once you get a version
of OS X working, it may be worth skipping Apple's frequent updates
unless you absolutely need the new functionality they offer. (Edit--or
don't mind risking losing an afternoon to fixing things. If you don't
tweak settings a lot, it may be worth it, but always wait two weeks.
Make sure you do all recommended security updates, though.) For a
laptop, any of the MacBook varieties is fine. The major reason OS X
costs so much is that if you want to get a full-powered server going,
you'll need to buy the extraordinarily expensive Mac Pro (min \$3000)
which, at time of writing, is still waiting for an update; a comparably
equipped Linux setup may cost half as much.

You absolutely probably (see below) should not plan on running Microsoft
Windows as your primary OS for humanities computing. With one exception:
ESRI's GIS software is only available for Windows. If you primarily do
GIS, you'll either need to sit in the lab, buy a Windows (or dual-boot)
machine, or use QGIS, which is getting better. Since ArcGIS licenses are
expensive, I usually recommend that grad students use QGIS so they don't
lose their ability to finish their dissertation with their library card.
I personally use QGIS for some tasks, and the spatial libraries for R
for more intensive spatial work, with D3 to render maps beautifully.

_Edit on Windows: Some people on Twitter think this is too harsh, so
I'm changing it to "probably." Windows will generally be fine if you're
doing number crunching only; python and RStudio will run fine. But use
of Unix (the family of operating systems to which Linux and OS X belong)
is far more common in DH than Windows, which means that on Unix you'll
have an easier time running other people's code, and on Windows you'll
have a harder time running DH software that is oriented to the web, such
as Omeka, which requires a so-called [LAMP
stack](https://en.wikipedia.org/wiki/LAMP_%28software_bundle%29)._

_My setup_: An old version of OS X on the laptop to keep my homebrew
settings intact, and Ubuntu 14.04 on the desktop tower.

*3. Memory (RAM)*

No one seems to call it RAM anymore besides me. This is the single most
important upgrade you can get for humanities computing, and most default
systems come with less than you want.

_Cheap_: You can survive on 4GB, but it's worth splurging for 8 in a
laptop. If you're only going to be working with python and R (the most
common languages for humanities computing) you can get some stuff done
on 4 or even 2GB; if you're going to be regularly running anything that
uses Java (which includes the immensely popular Mallet tool for topic
modeling, and the Stanford natural language processing tools), you'll
be glad to have more.

_Expensive_: As much as possible under the rest of your hardware
setups. There are usually hard limits on what your processor can
accommodate, and you should go all the way up to them.

_My setup_: I maxed out the laptop at 8GB, and have the maximum 32GB in
the Linux server. When I bought my desktop, this was the most possible
on the medium-range motherboards for Intel i7 processors; one of the
things the extra money for a Mac Pro gets is the ability to load in 64GB
of RAM.

*4. Hard Drive space*

Needs for hard drive space vary enormously from person to person. There
are only a few truly large data sets out there (Google NGrams, Hathi
Trust, customized JStor data for research abstracts); you can fit tens
of thousands of anything onto an ordinary hard drive. Don't overestimate
the size of your data; if you only want to look at, say, Victorian
poetry, you can probably store the entire Hathi collection on your
phone, let alone your computer. (Images and audio take more space). Do
the math on the files you have and how many you'll need to download
before wasting money on storage: space is easily expanded, so you don't
need to spend money up front before you have the data. But you should
have a plan for dealing with additional data.

If you plan to use a lot of data (more than 1TB), you will find that
cloud computing is not a particularly realistic option. Processor time
on the cloud is cheap: but data storage needs to be persistent, and you
can easily end up spending several hundred dollars per terabyte, per
year, to store copies online. This is rarely economical.

The other thing to keep in mind is that there are "solid state" and
traditional hard drives. Solid state are better, but hold substantially
less.

_Cheap_: Whatever drive comes in the machine; a 128 GB SSD might be
enough if you don't plan to store your personal music and photographs on
the machine. At the absolute bottom level, a small disk drive is
acceptable; but some size SSD is the biggest bang-for-the-buck upgrade
you can get. To store more data, use external drives; 4TB external
drives are now fairly cheap. If you're going to be working with the big
datasets, you probably should get two. If your analysis produces large
data files or they are available online, though, you don't
*necessarily* need to back them up; instead, you can keep your code on
the SSD or in a backed-up git repository, leave the large files out of
your backups, and keep code that can recreate them perfectly. ([A
Makefile is a nice way to accomplish this, as Mike Bostock
describes](http://bost.ocks.org/mike/make/).) Back up your code and
writing in as many forms as possible; mine are scattered around on
Dropbox, Github, and on hard drives at my home and office.

_Expensive_: As big an SSD as you can afford for the operating system
and data processing, and traditional drives for additional storage. On
Apple hardware you'll need an external enclosure for those drives, which
again gets expensive; if building a Linux tower, it may be worth getting
a case that holds as many expansion drives as possible. You'll want to
use RAID to join multiple drives into a single array for some redundancy
since disks will inevitably fail; RAID 10 is the standard, and RAID 5
and RAID 6 are both reasonable compromises if you're squeezed for space.
Do the math on what the cheapest cost per gigabyte hard drives are, and
use those; keep in mind that with a RAID array your disks generally have
to be the same size, so start with a 3 or 4 TB drive if you think
there's any chance you'll need to scale up. Internal SATA connections
are fast, and in my experience disk I/O can be a significant bottleneck.
If you plan to do external storage, it's worth making sure you can have
a Thunderbolt or at least USB 3 connection.

_My setup_: a small (100GB) SSD for the operating system. I have a *lot*
of data stored locally, so I bought a case that holds two small drives
and six full size ones. I started with 3TB drives in a RAID 5
configuration for 6TB of space; as I ran out of space, I switched to 6
3TB drives in a RAID 6 for 9TB of storage with more redundancy.

*4. Processors*

Processor speed is less important for humanities computing; while too
little memory makes it impossible to do certain things, too slow
processors just means that they take longer. That's fine; just get in
the habit of accepting that certain things will run during your commute,
or overnight, or whatever.

Keep in mind that both R and python don't take advantage of multi-core
processors very well. It's possible to take advantage of multiple cores,
but in many cases there is high overhead. (For Bookworm, I use GNU parallel
instead of python's multiprocessing library because the overhead of pickling
and unpickling text files between python instances is much higher than
just passing plain text through the system; in general, it's worth
learning how to use GNU parallel, the -P flag to xargs in the shell, or
the -j argument to make; the system is likely to be better at allocating
resources than your python code.)

Java programs are frequently better able to take advantage of multiple
cores.

_Cheap_: Whatever: probably two cores.

_Expensive_: The Intel i7 series is fine; a quad-core system
effectively gives eight processors, and flies through most tasks. The
Mac Pros use Xeons, and are going to switch to something better in the
next generation. Oddly, they have slower clock speeds the more processors you
get; this means, paradoxically, that if you write unoptimized python or
R code, it will probably run faster on a cheap Mac Pro than an expensive
one, so you should feel just fine about buying the "cheap" \$3000 one.

*5. Graphics*

Unless you're working explicitly with images, this is the place you need
the least compared to what an off-the-shelf computer will get you.
Real-time video rendering matters a lot for computer games and watching
movies, which all commodity computers are built to do at least a little
of; but digital humanities rarely make use of their capabilities.

In some limit cases or in three or five years, this may be flipped. Code
that is optimized for GPU can be extremely fast indeed; but it's often
difficult to find and even harder to write, much more so than
multiprocessor code. It also varies by architecture, so you'll need to
do some research about whether there's a good SVD algorithm for the GPU
written for your particular NVidia card, or whatever.

If you're working with Photoshop, obviously, the situation is different.
3D modeling, which I don't know much about, should benefit enormously.
But just because you do data visualization doesn't mean you need any
graphics card at all; if you plan to do it in the browser or R, the
benefits are slight.

_My setup_: No graphics card on the Linux tower. Whatever comes by default
in the MacBook.

*6. Monitor*

Whatever size monitor you get, you will come to feel is the minimum.

*7. Keyboard*

I have a mechanical 1990s IBM model M. It's pretty awesome.

*8. Software*

With the exception of ArcGIS, most widely-used software for humanities
computing is free. CUNY's DH-in-a-box platform contains a lot of what
you need. As I said, Python and R are the two most widely-used
languages; along with Javascript, Java, and the various C languages,
they're all free. SPSS, Stata, and the like, are absolutely not worth
it; I see no reason to use Matlab, although it's common in some other
fields. The only coding platform I'd consider spending my own money on
for humanities computing is Mathematica; you can do some amazing things,
but won't be able to share code.

Learning to interface between your analysis language and a database can
be extremely useful for avoiding problems with memory. Python's "shelve"
module is incredibly useful as a persistent key-value store; the dplyr
package in R lets you use a SQL database without the unpleasant
experience of actually writing SQL code. I use MySQL with MyISAM tables
because I believe them to be faster and more portable; most advice
you'll get nowadays is to use Postgres as a complicated database server,
or SQLite for lightweight files. If you do use a database to store data
and think that it's slow, it's worth reading up on how indexing works;
there's a very good chance you can improve query times by a factor of a
thousand by adding the right index. Use WordPress for a blog until you
know that you need something different; every other platform you might
use allows you to convert to it from WordPress. Don't use Blogger and
then regret it, like me.
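
A minimal sketch of the indexing point, using Python's built-in `sqlite3` module rather than MySQL (an assumption made only so the example is self-contained). The `words` table, its columns, and the `query_plan` helper are invented for illustration; `EXPLAIN QUERY PLAN` is SQLite's way of reporting whether a query scans the whole table or uses an index.

```python
import sqlite3

# In-memory database with a made-up table of word counts by year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, year INTEGER, count INTEGER)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [(f"word{i % 1000}", 1800 + i % 200, i) for i in range(20000)],
)

def query_plan(connection, sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable detail.
    return " ".join(row[-1] for row in rows)

lookup = "SELECT SUM(count) FROM words WHERE word = ?"

plan_before = query_plan(conn, lookup, ("word42",))
conn.execute("CREATE INDEX word_index ON words (word)")
plan_after = query_plan(conn, lookup, ("word42",))

total = conn.execute(lookup, ("word42",)).fetchone()[0]
print(plan_before)
print(plan_after)
```

On a table this small the speedup is invisible, but the plan changing from a full scan to an index search is the same change that matters once a table has hundreds of millions of rows.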

Plenty of people use virtual machines even on their local hardware to
keep certain things (a webserver, say) clean and easy to back up. I
don't, because I don't want to lose any performance. But a system like
Vagrant can be extremely useful for switching code between a local
machine and the cloud, particularly under the budget approach.

Some random other advice: Write in markdown with pandoc. Use GitHub,
obviously. Everyone loves Sublime as a text editor; if you tend to work
on remote servers, though, it can be convenient to just always use vim
or emacs, since they'll always be around. Document each project in its
makefile, as described in the Bostock article above. Use the unix
command `find` instead of `ls` if you have more than a thousand
files downloaded; I'm constantly writing lines of code like
`find directoryName -type f | xargs -P 6 someShellCommand.sh`, and it's
fast.
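
If you'd rather stay inside Python, the same pattern can be sketched with the standard library. This is only an approximation of the shell pipeline: `iter_files` and `run_on_files` are hypothetical helper names, and the sketch runs one command per file (closer to `xargs -n 1 -P 6`) rather than batching arguments the way plain `xargs` does.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def iter_files(root):
    """Yield file paths under root lazily, like `find root -type f`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def run_on_files(root, command, workers=6):
    """Run `command <path>` for every file, at most `workers` at a time."""
    def run_one(path):
        # The real work happens in a child process, so threads are enough here.
        result = subprocess.run(command + [path], capture_output=True, text=True)
        return path, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, iter_files(root)))
```

A call like `run_on_files("directoryName", ["./someShellCommand.sh"])` returns a list of `(path, exit_code)` pairs, which makes it easy to see which files failed.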

 
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Commodius vici of recirculation: the real problem with Syuzhet]]></title>
            <link>https://benschmidt.org/post/2015-04-03-commodius-vici-of-recirculation-the-real-problem-with-syuzhet</link>
            <guid>https://benschmidt.org/post/2015-04-03-commodius-vici-of-recirculation-the-real-problem-with-syuzhet</guid>
            <pubDate>Fri, 03 Apr 2015 16:37:21 GMT</pubDate>
            <content:encoded><![CDATA[Practically everyone in Digital Humanities has been posting increasingly
epistemological reflections on [Matt Jockers' Syuzhet
package](https://github.com/mjockers/syuzhet) since Annie Swafford
posted a [set of critiques of its
assumptions](https://annieswafford.wordpress.com/2015/03/02/syuzhet/). I've
been drafting and redrafting one myself. One of the major reasons I
haven't posted it is that the obligatory list of links keeps growing. Suffice it
to say that this here is not a broad methodological disputation, but
rather a single idea crystallized after reading Scott Enderle on "[sine
waves of sentiment](http://www.lagado.name/blog/?p=78)." I'll say what
this all means for the epistemology of the Digital Humanities in a
different post, to the extent that that's helpful.

Here I want to say something much more specific: that Fourier transforms
are the wrong "smoothing function" (insofar as that is the appropriate
term to use) to choose for plots, because they assume plot arcs
are periodic functions in which the beginning must align with the end.
I'm pretty sure I'm right about this, but as usual I'm relying on an
intuitive understanding of the techniques under discussion here rather
than a deeply mathematical one. So let me know if I'm making a total ass
of myself, and I'll withdraw my statements here.

Even before Swafford posted her critique, I felt like there was
something quite wrong about using the Fourier transform as a "smoothing"
mechanism. Fourier transforms, in my experience with them, are bad at
dealing with humanities data, because they rely on a very precise
definition of "signal." I've had to use wavelets instead of the Fourier
transform in the past even to extract obviously periodic data from time
series, because the assumptions of regularity in the Fourier transform
are so strong that some periods are simply missed.

As I was reading Enderle's post, it occurred to me that we've been
graphing these Fourier transformed waves with the x axis reading 1 to
100, as if it was a closed domain. But, in fact, if plot is a sum of
sine waves, that domain should actually read from 0 to 2\*pi. (Or, if
you're so inclined, from 0 to [tau](http://www.tauday.com/)). The
difference being that waveforms are _cyclical_: this is the
fundamental assumption of Fourier transforms, whence all of the ringing
artifacts that Swafford usefully points out come. After 100 comes 101:
but 2 pi is the same as zero. This assumption is true only for novels
whose last sentence is aligned to feed back into their first, a rare
breed indeed. (Although ironically, given the primacy that _Portrait of
the Artist_ has played in this debate, [Joyce wrote
one](http://en.wikipedia.org/wiki/Finnegans_Wake).)

To put that graphically: this cyclicality means that syuzhet imposes an
assumption that the start of a plot lines up with the end of a plot. If
you generate an artificial plot that starts with sentiment "-5" and ends
with sentiment "5", it looks like this with normal smoothing methods
(rolling average or loess).

![](/wp-content/uploads/2015/04/Screen-Shot-2015-04-03-at-11.52.25-AM.png)

But if you try to use syuzhet's filter, it comes up looking completely
different: wavy.

![](/wp-content/uploads/2015/04/Screen-Shot-2015-04-03-at-11.47.38-AM.png)
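
A minimal sketch of that comparison in Python with NumPy. This is not syuzhet's own code: `fourier_lowpass` is a generic low-pass filter that keeps the mean and the three lowest frequencies (my assumption about what is roughly comparable), and `rolling_mean` is a plain moving average standing in for the "normal" smoothers.

```python
import numpy as np

def fourier_lowpass(values, keep=3):
    """Keep the mean plus the `keep` lowest frequencies; zero the rest."""
    spectrum = np.fft.rfft(values)
    spectrum[keep + 1:] = 0
    return np.fft.irfft(spectrum, n=len(values))

def rolling_mean(values, window=11):
    """Plain moving average, returning only fully covered positions."""
    return np.convolve(values, np.ones(window) / window, mode="valid")

# Artificial plot: sentiment rises steadily from -5 to 5.
ramp = np.linspace(-5, 5, 100)

fourier = fourier_lowpass(ramp)
rolling = rolling_mean(ramp)

print("fourier ends:", round(fourier[0], 2), round(fourier[-1], 2))
print("rolling ends:", round(rolling[0], 2), round(rolling[-1], 2))
```

The rolling average keeps its two ends far apart, as the ramp itself does. The low-pass filter drags both ends to within a fraction of a point of zero, because it treats the last point as the immediate neighbor of the first.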

 

This holds true on real documents. I ran it on every State of the Union
address since 1960. I've added dashed lines to show the overall
sentiment movement in the address. Blue shows loess smoothing from
beginning to end, and red shows the Fourier transform. As you can see,
loess allows plots to get happier or sadder: Fourier forces them to
return almost to their starting place.

All the code for this is [online
here](http://rpubs.com/benmschmidt/Syuzhet): you can try it on your own
plots as desired.

![](/wp-content/uploads/2015/04/Screen-Shot-2015-04-03-at-11.55.30-AM.png)

I can see no sound reason to do this. Plots can start sad and get happy.
[But if you look at Jockers' six "fundamental plots," all start and end
in the same approximate emotional
register](http://www.matthewjockers.net/2015/02/25/the-rest-of-the-story/).
This, I think, is an artifact of the assumptions of periodicity built
into the Fourier transform, not the underlying plots. There's no room in
this world for Vonnegut's "From bad to worse," or for any sort of rags
to riches. It treats plot as a zero-sum game.

If I'm not misunderstanding something here, this should convince Jockers
to retire the waveform assumptions in favor of something like Loess
smoothing or moving averages, so digital humanists can move on to
talking about something other than "ringing artifacts." I don't think
this is devastating for the Syuzhet package as a whole: it has absolutely
nothing to do with the suitability of sentiment analysis for determining
plot, which is a much more interesting question others are contributing
to. (I am still undecided whether I think [my own method of plotting
arcs through multidimensional topic
spaces](http://sappingattention.blogspot.com/2014/12/fundamental-plot-arcs-seen-through.html),
which I originally came up with from misunderstanding something Jockers
said to me a year ago about his idea for syuzhet, is better: I do think
it adds something to the conversation.) One of the broader points my
unfinished post makes is that we shouldn't be taking failures in one
component of a chain to mean the rest is unsound: that's an oddly
out-of-domain application of falsifiability.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Rate My Professor]]></title>
            <link>https://benschmidt.org/post/2015-02-06-rate-my-professor</link>
            <guid>https://benschmidt.org/post/2015-02-06-rate-my-professor</guid>
            <pubDate>Fri, 06 Feb 2015 21:47:21 GMT</pubDate>
            <content:encoded><![CDATA[Just some quick FAQs on my [professor evaluations
visualization](http://benschmidt.org/profGender): adding new ones to
the front, so start with 1 if you want the important ones.

-3 (addition): The largest and in many ways most interesting confound on
this data is the gender of the _reviewer_. This is not available in the
set, and there is strong reason to think that men tend to have more
men in their classes and women more women. A lot of this effect is
solved by breaking down by discipline, where faculty and student gender
breakdowns are probably similar; but even within disciplines, I think
the effect exists. (Because more women teach at women's colleges,
because men teach subjects like military history that male students tend
to take disproportionately, etc). Some results may be entirely due to
this phenomenon (for instance, the overuse of "the" in reviews of male
professors). But
even if it were possible to adjust for this, it would only be partially
justified. If women are reviewed differently because a different sort of
student takes their courses, the fact of the difference in their
evaluations remains.

-2 (addition): This has had no peer review, and I wouldn't describe it
as a "study" in anything other than the most colloquial sense of the
word. (It won't be going on my CV, for instance.) A much more rigorous
study of gender bias was recently published out of NCSU. Statistical
significance is a somewhat dicey proposition in this set; given that I
downloaded all of the ratings I could find, almost any queries that show
visual results on the charts are "true" as statements of the form "women
are described as x more than men are on rateMyProfessor.com." But given
the many, many peculiarities of that web site, there's no way to
generalize from it to student evaluations as used inside universities.
(Unless, God forbid, there's a school that actually looks at RMP during
T&P evaluations.) I would be pleased if it shook loose some further
study by people in the field.

-1. (addition): The scores are normalized by gender and field. But some
people have reasonably asked what the overall breakdown of the numbers
is. Here's a chart. The largest fields are about 750,000 reviews apiece
for female English and male math professors. (Blue is female here and
orange male--those are the defaults from alphabetical order, which I
switched for the overall visualization). The smallest numbers on the
chart, which you should trust the least, are about 25,000 reviews for
female engineering and physics professors.

![](/wp-content/uploads/2015/02/Screen-Shot-2015-02-07-at-10.16.38-AM.png)

1. (addition): RateMyProfessor excludes certain words from reviews:
   including, as far as I can tell, "bitch," "alcoholic," "racist," and
   "sexist." (Plus all the four letter words you might expect.)
   Sometimes you'll still find those words by typing them into the
   chart. That's because RMP's filters seem to be case-sensitive, so
   "Sexist" sails through, while "sexist" doesn't appear once in the
   database. For anything particularly toxic, check the X axis to make
   sure it's used at a reasonable level. For four letter words, students
   occasionally type asterisks, so you can get some larger numbers by
   typing, for example, "sh \*" instead of "shit."

2. I've been holding it for a while because I've been planning to write
   up a longer analysis for somewhere, and just haven't got around to
   it. Hopefully I'll do this soon: one of the reasons I put it up is to
   see what other people look for.

3. The reviews were scraped from ratemyprofessor.com slowly over a
   couple months this spring, in accordance with their robots.txt
   protocol. I'm not now redistributing any of the underlying text. So
   unfortunately I don't feel comfortable sharing it with anyone else in
   raw form.

4. Gender was auto-assigned using Lincoln Mullen's [gender
   package](http://lincolnmullen.com/blog/gender-package-now-on-cran/).
   There are plenty of mistakes--probably one in sixty people are tagged
   with the wrong gender because they're a man named "Ashley,” or
   something.

5. 14 million is the number of reviews in the database; it probably
   overstates the actual number in this visualization. There are a lot
   of departments outside the top 20 I have here.

6. There are ways of looking at the data other than this simple
   visualization: I've talked a little bit at conferences and elsewhere
   about, for example, using Dunning Log-Likelihood to pull out useful
   comparisons (for instance, [here, of negative and positive words in
   history and comp. sci.
   reviews](http://benschmidt.org/2014/09/11/simpsons-2/)) without
   needing to brainstorm terms.

7. Topic models on this dataset using vanilla settings are remarkably
   uninformative.

8. People still use RateMyProfessor, though usage has dropped since its
   peak in 2005. Here's a chart of reviews by month. (It's intensely
   periodic around the end of the semester.)

![](/wp-content/uploads/2015/02/By-Month.png)

9. This includes many different types of schools, but the most
   represented schools are particularly heavy on master's institutions
   and community colleges. Here's a bar chart of the top 50 or so
   institutions:

![](/wp-content/uploads/2015/02/top-schools.png)
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[The Bookworm-Mallet extension]]></title>
            <link>https://benschmidt.org/post/2014-12-12-the-bookworm-mallet-extension</link>
            <guid>https://benschmidt.org/post/2014-12-12-the-bookworm-mallet-extension</guid>
            <pubDate>Fri, 12 Dec 2014 04:15:07 GMT</pubDate>
            <content:encoded><![CDATA[I promised Matt Jockers I'd put together a slightly longer explanation
of the weird constraints I've imposed on myself for topic models in the
Bookworm system, like [those I used to look at the breakdown of typical
TV show episode
structures](http://sappingattention.blogspot.ca/2014/12/typical-tv-episodes-visualizing-topics.html). So
here they are.

The basic strategy of Bookworm at the moment is to have a core suite of
tools for combining metadata with full text for any textual corpus. In
the case of the movies, the texts are each three-minute chunks of movies
or TV shows; a topic model will capture the size of each individual
movie. A variety of extensions allow you to port various other
algorithms into the system; so for instance, you can use the geolocation
plugin to put in a latitude and longitude for a corpus which has
publication places listed in it.

The [Bookworm-Mallet
extension](https://github.com/bmschmidt/Bookworm-Mallet) handles
incorporating topic models into Bookworm. The obvious way to topic model
is to just feed the text straight into Mallet. This is particularly easy
because the Bookworm ingest format is designed to be [exactly the same
as the Mallet
format](http://bookworm-project.github.io/Docs/input.txt.html). But I
don't do that, partly because Bookworm has an insanely complicated (and
likely to be altered) [set of tokenization
rules](http://bookworm-project.github.io/Docs/Tokens.html) that would be
a pain to re-implement in the package, and partly because we've
*already* tokenized; why do it again?

So instead of working with the raw text, I load a [stopwords
list](https://github.com/bmschmidt/Bookworm-Mallet/blob/master/bookwormStopwords.txt) (starting
with Jockers' list of names) directly into the database, and pull out
not the tokens but the internal numeric IDs used by Bookworm for each
word. This has an additional salutary effect, which is that we can
define from the beginning exactly the desired vocabulary size. If we
want a vocab size of the most common 2\^16-1 tokens in the corpus, it's
trivially easy to do it. That means that the Mallet memory requirements,
which many Bookworms bump up against, can be limited. (David Mimno has
used tricks like this to speed up Mallet on extremely large builds; I
don't actually know how he does it, but want to keep the options open
for later.) And though I'm not yet limiting the vocabulary precisely, I
do drop words that appear fewer than two times from the model to save
space and time.

The actual model is run on a file not of words, but of integer IDs.
Here are the first ten lines of the movie dataset as I enter it into
Mallet.

Each number is a code for a word; they appear not in the original order,
but randomly shuffled. Wordid 883 is "land,” 24841 is "Stubborn,” 3714
is "influence,” etc. This file is much shorter for being composed of
integers without stopwords than it would be from the full text.
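
The steps described so far (drop stopwords, cap the vocabulary, swap words for integer IDs, shuffle) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the extension's actual code: the real IDs come out of the Bookworm database, while here the hypothetical `build_id_lines` function assigns them by frequency rank, and the "identifier, tab, IDs" line layout is assumed.

```python
import random
from collections import Counter


def build_id_lines(docs, stopwords, max_vocab=2**16 - 1, min_count=2, seed=0):
    """Turn already-tokenized documents into shuffled lines of integer IDs.

    docs maps a document identifier to its list of tokens (hypothetical input
    shape; Bookworm itself would read these from its database).
    """
    counts = Counter(
        word for words in docs.values() for word in words if word not in stopwords
    )
    # Keep at most max_vocab of the most common words, dropping rare ones.
    kept = [word for word, n in counts.most_common(max_vocab) if n >= min_count]
    # Illustrative ID scheme: 1 is the most common word, 2 the next, and so on.
    word_ids = {word: i for i, word in enumerate(kept, start=1)}

    rng = random.Random(seed)
    lines = []
    for doc_id, words in docs.items():
        ids = [str(word_ids[word]) for word in words if word in word_ids]
        rng.shuffle(ids)  # word order is irrelevant to the topic model
        lines.append(f"{doc_id}\t{' '.join(ids)}")
    return word_ids, lines
```

Because the vocabulary is fixed before anything reaches Mallet, `max_vocab` is the single knob that bounds how many distinct types the model has to hold in memory.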

Then all the tokens and topic assignments are loaded back into the
database, not just as overall distributions but as individual
assignments. That makes it possible to look directly at the individual
tokens that make up a topic, which I think is potentially quite useful.
This gives a much faster, non-memory based access to the data in the
topic state file than any other I know of; and it comes with full
integration with any other metadata you can cook up.

Jockers' "Secret sauce” consists, in part, of restricting to only nouns,
adjectives, or other semantically useful terms. There is a way of doing
that in the Bookworm infrastructure, but it involves not treating the
topic model as a one-off job, but fully integrating the POS-tagging into
the original tokenization. We would then be able to only feed
adjectives into the topic modeling. But the spec for that isn't fully
laid out: and POS-tagging takes so long that I'm in no big hurry to
implement it. It has proven somewhat useful in the Google Ngrams corpus,
but I'm a little concerned by the ways that it tends to project modern
POS uses into the past. (Words only recently verbified get tagged as
verbs much longer ago in the 2012 Ngrams release.)

Perhaps more interesting are the ways that the full Bookworm API may
expose some additional avenues for topic modeling. Labelled LDA is an
obvious choice, since Bookworm instances are frequently defined by a
plethora of metadata. Another option would be to change the tokens
imported in; using either Bookworm's lemmatization (removed in 2013 but
not forgotten) or even something weirder, like the set of all placenames
extracted out in NLP, as the basis for a model. Finally, it's possible
to use metadata to more easily change the definition of a *text*; for
something like the new Movie Bookworm, where each text takes three
minutes, it would be easy to recalculate with each text instead coming
in as an individual film.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Building outlines and slides from Markdown lectures with Pandoc]]></title>
            <link>https://benschmidt.org/post/2014-11-07-building-outlines-for-markdown-documents-with-pandoc</link>
            <guid>https://benschmidt.org/post/2014-11-07-building-outlines-for-markdown-documents-with-pandoc</guid>
            <pubDate>Fri, 07 Nov 2014 20:17:45 GMT</pubDate>
            <content:encoded><![CDATA[Just a quick follow-up to my [post from last month on using Markdown for
writing
lectures](http://benschmidt.org/2014/09/05/markdown-historical-writing-and-killer-apps/).
The [github repository for implementing this strategy is now
online](https://github.com/bmschmidt/MarkdownLectures).

The goal there was to have one master file for each lecture in a course,
and then to have scripts automatically create several things, including
a slidedeck and an outline of the lecture (inferred from the headers in
the text) to print out for students to follow along in class.

To make this work, I invented my own slightly extended version of the
markdown syntax. It has three new conventions:

1. Any phrase in bold is a *keyword* to be pulled out and included in
   outlines

2. Anything in a *code block* is to be used as a slide. Each separate
   code block is its own slide. Any first-degree header is a full page
   slide. (The easiest way to do a code block is just to tab indent a
   line; most of my slides are just a single-element line like the
   following.)

> > !\[Edison electric
> > light\](http://scienceblogs.com/retrospectacle/wp-content/blogs.dir/463/files/2012/04/i-3530f86be619cdc7d42c13cdca188088-edison.bmp)

3. As in the previous example, the *image format is extended* so that
   labels in slides appear not as alt-text, but in the text above the
   image: in addition, any image link beginning with the character "\>”
   is treated not as an image but as an iframe, making it easy to embed
   things like youtube videos or interactive Bookworm charts.
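
To make the third convention concrete, here is a minimal Python sketch of the rule as stated: the label goes above the image as visible text, and a link target starting with ">" becomes an iframe. The function name `render_slide_image` and the exact HTML it emits are assumptions for illustration; the repository's own implementation is a Haskell pandoc filter and may differ in detail.

```python
from html import escape


def render_slide_image(label, target):
    """Render one slide element from an image's label and link target."""
    # The label is shown as text above the image rather than as alt-text.
    caption = f"<p>{escape(label)}</p>" if label else ""
    if target.startswith(">"):
        # A leading ">" marks the link as something to embed, not an image.
        url = target[1:].strip()
        body = f'<iframe src="{escape(url)}"></iframe>'
    else:
        body = f'<img src="{escape(target)}">'
    return caption + body
```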

The slide decks are built with reveal.js, which drops everything into a
nicely organized batch. [Here's what one looks
like](http://benschmidt.org/HIST1234/slides/11-05_Systems,_Electricity_and_Household_Labor.html#/the-system-builders).
(This is for a lecture on household technologies in the 1920s.) My
favorite feature is that by hitting escape, you get an overall view of
everything in the lecture sorted by header--this is particularly useful
when studying for exams, because those headers align exactly with the
outlines.

![](/wp-content/uploads/2014/11/SystemBuilders.png)

The outlines are produced from the same lecture notes, but in a
different way; rather than pull the code blocks, they walk through all
the headers in the document and append them (and any bolded terms) to a
new document that students can see. For that lecture, it looks like
this:

![](/wp-content/uploads/2014/11/Outline.png)

There are a few things I still don't love about this: image positioning
and sizing are not as good as they are in PowerPoint. But the thing that's
nice is that it's extremely portable; if I don't make it through the end of
a lecture, I can just cut out the last few paragraphs, paste them into
the next day's document, and have the outline and slides immediately
reflect the switch for both days. This makes a lot of last-minute,
before-class changes dramatically easier.

The basic scripts, though not the full course management repo, are [up on
github](https://github.com/bmschmidt/MarkdownLectures). The code is in
Haskell, which I've never written in before, so I'd love a second set of
eyes on it.  Some brief reflections on coding for pandoc in Python and
Haskell follow.

I thought it would be easy to switch between headers and an outline, but
they turn out to have almost nothing in common in the Pandoc type
definition; the outline needs to be built up recursively out of
component parts. It's an operation that's much closer to really basic
data structures than anything I've done before.

I initially used the pandocfilters Python package for this. That code is
[here](https://gist.github.com/bmschmidt/2a5beff9ed59c1cc337b#file-lecturetooutline-py).
It basically works, thanks primarily to the insight (gleaned from an exchange
on GitHub between, I think, Caleb McDaniel and John MacFarlane that I've
lost the link for) that you need to scope a global Python variable and
append to it from a `walk` function. But it has a tendency to break
unexpectedly, and uses an incredibly confusing welter of accessors into
the rather ugly pandoc json format. Plus, it's fundamentally an attempt
to write Haskell-esque code in Python, which is about the least pleasant
thing I've ever seen.
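
For readers who haven't seen the pattern, here is a stripped-down sketch of the "global list plus walk" idea. It is not the gist's code: to stay self-contained it uses only the standard library and a hand-written `walk` over pandoc's JSON output (nodes shaped like `{"t": "Header", "c": [level, attr, inlines]}`) instead of the pandocfilters package, and `plain_text` and `build_outline` are illustrative names.

```python
import json

outline = []  # module-level list that the walker appends to


def plain_text(inlines):
    """Flatten a list of pandoc inline nodes into a plain string."""
    parts = []
    for node in inlines:
        kind = node.get("t")
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        elif kind in ("Strong", "Emph"):
            parts.append(plain_text(node["c"]))
    return "".join(parts)


def walk(node):
    """Visit every node, recording headers and bolded keywords as we go."""
    if isinstance(node, list):
        for child in node:
            walk(child)
    elif isinstance(node, dict):
        kind = node.get("t")
        if kind == "Header":
            level, _attr, inlines = node["c"]
            outline.append((level, plain_text(inlines)))
        elif kind == "Strong":
            outline.append(("keyword", plain_text(node["c"])))
        else:
            for value in node.values():
                walk(value)


def build_outline(pandoc_json):
    """Parse pandoc's JSON output and return the collected outline entries."""
    outline.clear()
    walk(json.loads(pandoc_json))
    return list(outline)
```

The flat list of `(level, text)` pairs is the easy half; turning it into properly nested outline items is the recursive part that the Pandoc types make awkward.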

By the time I made that Python script work, I had spent a couple hours
reading and re-reading the [pandoc types
definition](http://hackage.haskell.org/package/pandoc-types), and it
seemed like it would be simpler to just write the filter in Haskell
directly. (I did a few Haskell problem sets for a U Penn course this
summer out of curiosity; without that basic understanding of Haskell
data types, I doubt I would have been able to understand the Pandoc
documentation.) The [lecture-to-outline Haskell
code](https://github.com/bmschmidt/MarkdownLectures/blob/master/lectureToOutline.hs), to
my surprise, ended up being a bit longer than the Python version
(though much of that is type definitions and comments, which don't
really count). If anyone out there who knows Haskell can explain to me a
better way to avoid some of the stranger elements in there (particularly
the reversing and unreversing of lists just to allow pattern matching on
them, which is a substantial proportion of what I wrote), I'm all ears.

Programming in Haskell is certainly more interesting than Python; I
agree with [Andrew Goldstone's comment that "whereas programming
normally feels like playing with Legos, programming in Haskell feels
more like trying to do a math problem set, with ghc in the role of
problem-set
grader”](http://andrewgoldstone.com/blog/2013/04/21/more-on-pandoc/#fn1).
I'm left with a strong temptation to write a TEI-to-Bookworm parser,
which I've previously sketched in Python, in Haskell instead; both for
performance and readability reasons, I think it might work quite well.
Stay tuned.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[More thoughts on topic models and tokens]]></title>
            <link>https://benschmidt.org/post/2014-10-23-more-thoughts-on-topic-models-and-tokens</link>
            <guid>https://benschmidt.org/post/2014-10-23-more-thoughts-on-topic-models-and-tokens</guid>
            <pubDate>Thu, 23 Oct 2014 18:37:43 GMT</pubDate>
            <content:encoded><![CDATA[I've been thinking a little more about how to work with the [topic
modeling extension](https://github.com/bmschmidt/Bookworm-Mallet) I
recently built for bookworm. (I'm curious if any of those running
installations want to try it on their own corpus.) With the movie
corpus, it is most interesting split across _genre;_ but there are
definite temporal dimensions as well. [As I've said
before](http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/),
I have issues with the widespread practice of just plotting trends over
time; and indeed, for the movie model I ran, nothing particularly
interesting pops out. (I invite you, of course, to tell me how it _is_
interesting.)

So here I'm going to give two different ways of thinking about the
relationship between topic labels and the actual assigned topics that
underlie them.

One way of thinking about the tension between a topic and the semantic
field of the words that make it up is simply to plot the "topic”
percentages vs the overall percentages of the _actual words_. So in this
chart, you get all the topics I made on 80,000 movie and TV scripts: in
red are the _topic_ percentages, and in blue are the percentages for the
top five words in the topic. Sometimes the individual tokens are greater
than the topic, as in "Christmas dog dogs little year cat time,”
probably mostly because "time” is an incredibly common word that swamps
the whole thing; sometimes the topic is larger than the individual
words, as in the swearing topic, since there are all sorts of ways of
swearing besides the top five words.

In some cases, the two lines map very well--this is true for swearing,
and true for the OCR error class ("lf,” "lt” spelled with an ell rather
than an aye at the front).

In other cases, the topic shows sharper resolution: "ain town horse take
men”, the "Western” topic, falls off faster than its component parts.

In other cases the identification error is present: towards the top,
"Dad Mom dad mom” takes off after 1970 after holding steady with the
component words until then. I'm not sure what's going on there--perhaps
some broader category of sitcom language is folded in?

![](/wp-content/uploads/2014/10/Topics-vs-tokens.png)

Another approach is to ask how _important_ those five words are to the
topic, and how it changes over time. So rather than take all uses of the
tokens in "Christmas dog dogs little year cat time,” I can take only
those uses assigned into that full topic: and then look to see how those
tokens stack up against the full topic. This line would ideally be flat,
indicating that the label characterizes it equally well across all time
periods. For the Christmas topic, it substantially is, although there's
perhaps an uptick towards the end.
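
As a rough sketch of this second measure: given token-level assignments, count the tokens assigned to one topic in each year and the share of them that are the topic's label words. The `(year, word, topic_id)` tuple layout and the function name `label_share_by_year` are assumptions for illustration; in Bookworm these counts would come out of the database instead.

```python
from collections import defaultdict


def label_share_by_year(assignments, topic, labels):
    """Fraction of a topic's tokens, per year, that are its label words.

    assignments is an iterable of (year, word, topic_id) tuples, one per token.
    """
    in_topic = defaultdict(int)
    in_labels = defaultdict(int)
    for year, word, topic_id in assignments:
        if topic_id != topic:
            continue  # only tokens actually assigned to this topic count
        in_topic[year] += 1
        if word in labels:
            in_labels[year] += 1
    return {year: in_labels[year] / in_topic[year] for year in sorted(in_topic)}
```

A flat series from this function is the ideal case described above: the label describes the topic equally well in every period.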

But in other topics, that's not the case. "Okay okay Hey really guys
sorry” was steadily about 8% composed of its labels: but after 2000,
that declined steadily to about 4%. Something else is being expressed in
that later period. "Life money pay work...” is also shifting
significantly, _towards_ being composed of its labels.

On the other hand, this may not be only a bug: the swear topic is slowly
becoming more heavily composed of its most common words, which probably
reflects the actual practice (and the fall of ancillary "damn” and
"hells” in sweary documents). You can see the rest here.

![](/wp-content/uploads/2014/10/Percentages.png)

These aren't particularly bad results, I'd say, but do suggest a further
need for more ways to integrate topic counts in with results. I've given
two in the past: looking at how an individual word is used across
topics:

and slope charts of top topic-words across major metadata splits in the
data:

Both of these could be built into Bookworm pretty easily as part of a
set of core diagnostic suites to use against topic models.

The slopegraphs are, I think, more compelling; they are also more easily
_portable_ across other metadata groupings besides just time. (How does
that "Christmas” topic vary when expressed in different genres of film?)
Those are questions for later.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Building topic models into Bookworm searches]]></title>
            <link>https://benschmidt.org/post/2014-09-23-building-topic-models-into-bookworm-searches</link>
            <guid>https://benschmidt.org/post/2014-09-23-building-topic-models-into-bookworm-searches</guid>
            <pubDate>Tue, 23 Sep 2014 22:29:38 GMT</pubDate>
            <content:encoded><![CDATA[Lately I've been looking at how deeply we could integrate topic models
into the underlying Bookworm architecture.

My own chief interest in this, because [I tend to be a little wary of
topic models in
general](http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/),
is in the possibility for Bookworm to act as a diagnostic tool
internally for topic models. I don't think simply plotting description
absent any analysis of the underlying token composition of topics is all
that responsible; Bookworm offers a platform for actually accessing
those counts and testing them against metadata.

But topics also have a lot to offer token-based searching. Watching
links into the Bookworm browser, I recently stumbled on [this
exchange](https://twitter.com/BioInFocus/status/514494148574203907):

![](/wp-content/uploads/2014/09/Tweets.png)

How can I solve this biologist's problem? (Or, at least, waste more of
his time?)

The word-level topic assignments I have on hand are actually real useful
for this. (I'm assuming, I should say, that you know both the basics of
topic modeling and of the movie bookworm.) I can ask the beta bookworm
browser for the top topics associated with each of the words "fly” (top)
and "ant” (bottom):

Figure: Fly usage by topic

Figure: Ant usage by topic

"Fly” is overwhelmingly associated with the topics "boat ship Captain
island plane sea water” (airplane flying) and "life day heart eyes world
time beautiful” (unclear, but might be superman flying). (It's even more
so than on this chart, since I've lopped off the right side: there are
about 2200 uses of "fly” in the first topic).

But "ant” is most used in two clearly animal related topics: "water
animals years fish time food ice” and "dog cat little boy dogs Hey
going.” And both of those topics show up for "fly” as well.

So in theory, at least, we can *restrict searches by topic*: rather
than put into a Bookworm *every* usage of the word "fly”, we can get
only those that seem, statistically, to be used in an animal-heavy
context.

With an imperfect, 64-topic model on a relatively small corpus like the
Movie Bookworm, this is barely worth doing.

Figure: Ant in animal topics per million words in all topics

Figure: Fly in animal topics per million words in all topics

And given that "flying” is something that plenty of animals do, the
"fly” topic here is probably not all Order Diptera.

But with collections the size of the HathiTrust, this could potentially
be worth exploring, particularly with substantially larger models.
"Evolution” is one of the basic searches in a few bookworms: but it's
hard to use, because "evolution” means something completely different in
the context of 1830s mathematics as opposed to 1870s biology. A topic
model that could conceivably make a stab at segregating out just
biological "evolution,” though, would be immensely useful in tracing out
Darwinian changes; one that could disentangle military shooting from the
interjection "shoot!” might be good at studying slang.

Above all, this might be good at finding words that migrate meanings in
early uses: most new phrases actually emerge out of some early
construction, but this would let us try to recover meaning through
context.

Hell, it might even have an application in Prochronisms work; given a
large, pre-built topic model, any new scripts could be classified
against it and their words assigned to topics, and tested for their
appropriateness as a topic-word combination.

Technical note: the basics of this are pretty easy with the current
system: the only issue with incorporating "topic” as a metadata field on
the primary browser right now is that the larger corpus it compares
against would also be limited by topic. This could be solved by using
the asterisk syntax that no one uses:
`{"*topic":[3],"*word":["fly"]}` will ensure both are dropped,
not just one, or by just specifying the `compare_limits` field manually.
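
To spell out what the asterisk convention amounts to, here is a small Python sketch of my reading of it: starred keys limit the search set but are left out of the comparison set. The `split_limits` helper is purely illustrative and is not Bookworm's code; the `genre` field in the usage note below is likewise just an example.

```python
def split_limits(limits):
    """Derive search and comparison limits from asterisk-marked keys.

    Keys starting with "*" apply to the search set only; every other key
    applies to both the search set and the comparison set.
    """
    search_limits = {key.lstrip("*"): value for key, value in limits.items()}
    compare_limits = {
        key: value for key, value in limits.items() if not key.startswith("*")
    }
    return search_limits, compare_limits
```

Called on `{"*topic": [3], "*word": ["fly"], "genre": ["Comedy"]}`, this keeps the genre restriction on both sides while comparing topic-3 uses of "fly" against all words in all topics.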

 
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Searching for structures in the Simpsons and everywhere else.]]></title>
            <link>https://benschmidt.org/post/2014-09-11-simpsons-2</link>
            <guid>https://benschmidt.org/post/2014-09-11-simpsons-2</guid>
            <pubDate>Thu, 11 Sep 2014 21:59:21 GMT</pubDate>
            <content:encoded><![CDATA[This is a post about several different things, but maybe it's got
something for everyone. It starts with 1) some thoughts on why we want
comparisons between seasons of the Simpsons, hits on 2) some previews of
some yet-more-interesting Bookworm browsers out there, then 3) digs into
some meaty comparisons about what changes about the Simpsons over time,
before finally 4) talking about the internal story structure of the
Simpsons and what these tools can tell us about narrative formalism, and
maybe why I'd care.

It's prompted by a simple question. I've been getting a lot of media
attention for my Simpsons browser. As a result of that, I need some
additional sound bites about what changes in the Simpsons. The Bookworm
line charts, which remain all that most people have seen, are great for
exploring individual words; but they don't tell you _what words to look
for._ This is a general problem with tools like Bookworm, Ngrams, and
the like: they don't tell you what's interesting. (I'd argue, actually,
that it's not really a problem; we really want tools that will be useful
for addressing specific questions, not tools that generate new
questions.)

The platform, though, can handle those sorts of queries (particularly on
a small corpus like the Simpsons) with only a bit of tweaking, most of
which I've already done. To find interesting shifts, you need:

1. To be able to search without specifying words, but to get results
   back faceted by words;

2. Some metric of "interestingness” to use.

Number 1 is architecturally easy, although mildly expensive.
Bookworm's architecture has, for some time, prioritized an approach
where "it's all metadata”; that includes word counts. So just as you can
group by the year of publication, you can group by the word used. Easy
peasy; it takes more processing power than grouping by year, but it's
still doable.

Metrics of interestingness are a notoriously hard problem; but it's not
hard to find a _partial_ solution, which is all we really need. The
built-in searches for Bookworm focus on counts of words and counts of
texts. The natural (and intended) uses are the built-in limits like
"percentage of texts” and "words per million,” but having those figures
for two distinct corpora (the search set and the broader comparison
set) also makes it possible to calculate all sorts of other things. Some
are pretty straightforward ("average text length”); but others are
actual computational tools in themselves, including TF-IDF and two
different forms of Dunning's Log-Likelihood. (And those are just the
cheap metrics; you could even run a full topic model and ship the
results back, if that wasn't a crazy thing to do).

So I added in, for the time being at least, a Dunning calculator as an
alternate return count type to the Bookworm API. (A fancy new pandas
backend makes this a lot easier than the old way.) So I can set two
corpora, and compare the results of each to each.
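
For anyone who wants to see the arithmetic, here is a minimal Python sketch of one common two-term form of Dunning's log-likelihood, signed so that positive scores lean toward the first corpus. It is a standalone illustration with hypothetical function names, not the pandas code behind the API, and it may not match either of the two forms Bookworm actually returns.

```python
import math


def dunning(a, b, total_a, total_b):
    """Signed log-likelihood for one word: a uses in corpus A, b in corpus B."""
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Positive when the word is over-represented in A, negative for B.
    return g2 if a / total_a >= b / total_b else -g2


def compare_corpora(counts_a, counts_b):
    """Score every word in either corpus, most A-like first."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    words = set(counts_a) | set(counts_b)
    scores = {
        word: dunning(counts_a.get(word, 0), counts_b.get(word, 0), total_a, total_b)
        for word in words
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The two ends of the sorted list are the interesting part: the head holds words characteristic of the first corpus, the tail words characteristic of the second.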

To plow through a bunch of different Dunning scores, some kind of
visualization is useful.

Last time I looked at the Dunning formula on this blog, I found that
[Dunning scores are nice to look at in
wordclouds](http://sappingattention.blogspot.com/2011/10/dunning-statistics-on-authors.html).
I'm as snooty about word clouds as everyone else in the field. But for
representing Dunning scores, I actually think that wordclouds are one of
the most space-efficient representations possible. (This is following up
on how Elijah Meeks uses wordclouds for topic model glancing, and how
the old MONK project used to display Dunning scores).

There aren't a lot of other options. In the past I've made charts for
Dunning scores as bar charts: for example, the strongly female and the
most strongly male words in negative reviews of history professors on
online sites. (This is from a project I haven't mentioned before online,
I don't think; super interesting stuff, to me at least). So "jerk,”
"funny,” and "arrogant” are disproportionately present in bad reviews of
men; "feminist,” "work,” and "sweet” are disproportionately present in
bad reviews of women.

![](/wp-content/uploads/2014/09/NegativeHistory.png)

This is a nice and precise way to do it, but it's a lot of real estate
to take up for a few dozen words. The exact numbers for Dunning scores
barely matter: there's less harm in the oddities of wordclouds (for
instance, longer words seeming more important just because of their
length).

We can fit both aspects of this--the words and the directionality--by
borrowing an idea that I think the old MONK website had; colorizing
results by direction of bias. So here's one that I put online recently:
a comparison of language in "Deadwood” (black) and "The Wire” (red).

![](/wp-content/uploads/2014/09/Deadwood.png)

This is a nice comparison, I think; individual characters pop out (the
Doc, Al, and Wu vs Jimmy and the Mayor); but it also captures the actual
way language is used, particularly the curses HBO specializes in.
(Deadwood has probably established an all-time high score on
some fucking-cocksucker axis forever; but the Wire more than holds its
own in the sphere of shit/motherfucker.) This is going to be a
forthcoming study of profane multi-dimensional spaces, I guess.

Anyhoo. What can that tell us about the Simpsons?

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-4.47.06-PM.png)

Here's what the log-likelihood plot looks like. Black are words
characteristic of seasons 2-9 (the good ones); red is seasons 12-19.
There's much, much less that's statistically different about two
different 80-hour Simpsons runs than two roughly 80-hour HBO shows:
that's to be expected. And most of the differences we do find are funny
things involving punctuation that have to do with how the Bookworm is
put together.

But: there are a number of things that are definitely real. First is the
fall away from several character names. [Smithers, Burns,
Itchy _and_ Scratchy (Itchy always stays ahead), Barney, and Mayor
Quimby all fall off after about season
9](http://benschmidt.org/Simpsons/#?%7B%22words_collation%22%3A%22Case_Insensitive%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22Barney%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Itchy%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Scratchy%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Quimby%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%2C%7B%22word%22%3A%5B%22Smithers%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%5D%7D).
Some more minor characters (McBain) drop away as well.

Few characters increase (Lou the cop; Duffman; Artie Ziff, though in
only two episodes). Lenny peaks right around season 9; but Carl has had
his best years ever recently.

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-5.26.00-PM.png)

We do get more, though, of some abstract words. Even though one of the
first appearances was a Christmas special, "Christmas” goes up. Things
are more often "awesome,” and [around season 12 kids and spouses
suddenly start getting called
"sweetie.”](http://benschmidt.org/Simpsons/#?%7B%22words_collation%22%3A%22Case_Insensitive%22%2C%22search_limits%22%3A%5B%7B%22word%22%3A%5B%22sweetie%22%5D%2C%22season%22%3A%7B%22%24gte%22%3A2%2C%22%24lte%22%3A25%7D%7D%5D%7D) (Another
project would be to match this up against the writer credits and see if
we could tell whether this is one writer's tic.)

"Gay” starts showing up frequently.

Others are just bizarre: The Simpsons used the word "dumped” only once
in the 1990s, and 19 times in the 2000s. This can't mean anything
(right?) but seems to be true.

What about story structure? I found myself, somehow, blathering on to
one reporter about Joseph Campbell and the hero's journey. (Full
disclosure: I have never read Joseph Campbell, and everything I know
about him I learned from Dan Harmon podcasts).

But those things are interesting. Here are the words most distinctive
of the first act (black) and the third act (red). (I.e., minutes 2-8
vs. 17-21.)

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-5.09.16-PM.png)

As I said earlier, school shows up as a first-act word. (Although
"screeching,” here, is clearly from descriptions of the opening credits,
school remains even when you cut the time back quite a bit, so I don't
think it's just credit appearances driving this). And there are a few
more data integrity issues: elderman is not a Simpsons character, but a
screenname for someone who edits Simpsons subtitles; www, Transcript,
and Synchro are all unigrams about the editing process. I'll fix these
for the big movie bookworm, where possible.

That said, we can really learn something about the structural properties
of fictional stories here.

Lenny is a first act character, Moe a third act one.

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-5.55.09-PM.png)

We begin with "school” and "birthday” "parties;”

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-5.55.58-PM.png)

we end with discussions of who "lied” or told the "truth,” what we
"learned” (isn't that just too good?), and, of course with a group
"hug.” (Or "Hug”: the bias is so strong that both upper- and lower-case
versions managed to get in). And we end with "love.”

![](/wp-content/uploads/2014/09/Screen-Shot-2014-09-11-at-5.54.12-PM.png)

The hero returns from his journey, having changed.

Two last points.

First, there are no discernibly "middle" words I can find: comparing the
middle to the front and back returns only the word "you," which
indicates greater dialogue but little else.

Second: does it matter? Can we get anything more out of the Simpsons
through this kind of reading than just sitting back to watch? Usually,
I'd say that it's up to the watcher: but assuming that you take
television at all seriously, I actually think the answer may be "yes.”
(Particularly given whose birthday it is today). TV shows are formulaic.
This can be a weakness, but if we accept them as formulaically
constructed, seeing how the creators are playing around with form can
make us appreciate them better, better appreciate how they make us feel,
and how they work.

Murder mysteries are like this: half the fun of all the ITV British
murder mysteries is predicting who will be the victim of murder number 2
about a half hour in; all the fun of Law and Order is guessing which of
the four-or-so templates you're in. Wrongful accusation? Unjust
acquittal? It was the first guy all along? (And isn't it fun when the
cops come back in the second half hour?)

But the conscious play on structures themselves is often fantastic. The
[first clip-show episode of
_Community_](http://www.avclub.com/tvclub/community-paradigms-of-human-memory-54827) is
basically that; essentially no plot, but instead a weird set of riffs on
the conventions the show has set for itself that verges on a
deconstruction of them. One could fantasize that we're getting to the
point where the standard TV formats are about as widespread, as
formulaic, and as malleable as sonata form was for Haydn and Beethoven.
What made those two great in particular was their use of the
expectations built into the form. Sometimes you don't want to know how
the sausage is made; but sometimes, knowing just gets you better
sausage.

And it's just purely interesting. [Matt
Jockers](http://www.matthewjockers.net) has been looking recently at
novels and their repeating forms; that's super-exciting work. The (more
formulaic?) mass media genres would provide a nice counterpoint to that.

The big, 80,000 movie/TV episode browser isn't broken down by minute
yet: I'm not sure if it will be for the first release. (It would
instantly become an 8-million text version, which makes it slower). But
I'll definitely be putting something together that makes act-structure
analysis possible.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Markdown, Historical Writing, and Killer Apps]]></title>
            <link>https://benschmidt.org/post/2014-09-05-markdown-historical-writing-and-killer-apps</link>
            <guid>https://benschmidt.org/post/2014-09-05-markdown-historical-writing-and-killer-apps</guid>
            <pubDate>Fri, 05 Sep 2014 20:34:25 GMT</pubDate>
            <content:encoded><![CDATA[Like many technically inclined historians (for instance, [Caleb
McDaniel](http://wcm1.web.rice.edu/hacks.html), [Jason
Heppler](http://jasonheppler.org/2012/11/20/using-markdown-like-an-academic/),
and [Lincoln
Mullen](http://chronicle.com/blogs/profhacker/markdown-the-syntax-you-probably-already-know/35295))
I find that I've increasingly been using the plain-text format Markdown
for almost all of my writing.

The core idea of Markdown is that rather than use Microsoft Word,
Scrivener, or any of the other pretty-looking tools out there, you type
in "plain text” using formatting conventions that should be familiar to
anyone who's ever written or read an e-mail. (Click on Mullen's or
Heppler's name for a better introduction than this, or see the
[Chronicle's wrapup of
approaches](http://chronicle.com/blogs/profhacker/markdown/47813)).

The benefits are many, but they're mostly subtle:

- A simple format like Markdown creates documents you'll have no
  trouble reading in twenty years. I've been teaching a survey course
  this semester and had a hell of a time reading my old notes from
  generals, which I took using EndNote; with Markdown, any web browser,
  text editor, or Microsoft Word descendant will have no trouble opening
  them.
- It's very easy to produce content that will look good in multiple
  media: I can make a course syllabus or personal CV that formats
  nicely on a website and produces a clean-looking PDF _at the same
  time_.
- It becomes much easier to do things to a bunch of notes at the same
  time: bundle them into PDFs, search through all of your notes
  simultaneously, and so forth.

None of these, though, is a particularly strong sell for those who use
a computer instrumentally: in reality, your Microsoft Word documents
aren't about to disappear, either. And there are disadvantages to giving
up Word.

- Things like footnotes with a citation manager are not very easy, even
  for the technically competent.[^1] Even footnotes without a citation
  manager are fairly clumsy.
- The best tool for making your Markdown documents into attractive web
  pages, [Pandoc](http://johnmacfarlane.net/pandoc/), is not especially
  easy to install or configure if you don't use the command line on a
  regular basis.
- The core definition of Markdown is a little unclear: particularly in
  the last week, there have been some conflicts over the definition that
  will be confusing to newcomers. (Although the proposal that sparked
  them, "Common Markdown," is likely to be a good thing in the long run.)

The heart of Markdown's appeal is its flexibility: to drive any adoption
outside the hard core of people, you need a killer app built off of it
that solves a problem. In the technology sector, that has been
Markdown's ability to easily handle links and snippets of computer code
for those writing on two widely used sites, [GitHub](http://github.com)
and [Stack Overflow](http://stackoverflow.com).

Among historians, neither of those is very important. And the footnote
problem is big enough that I generally wouldn't recommend that anyone
use Markdown, right now, unless they enjoy banging their head against
the wall.

{#lectures-and-notes-the-killer-apps}
# Lectures and Notes: the killer apps.

There are two places, though, where even historians don't tend to use
footnotes: lectures, and notes. And in both of these, Markdown makes
some amazing things possible.

If there's any reason for historians to use markdown, it's in these two
spheres. The reason I keep using Markdown is that it makes it possible
for me to personally solve two problems that have driven me crazy:

Quickly making slide decks to go alongside a lecture, and borrowing and
reusing chunks of slides from one talk in another;

Making heads or tails of the thousands of pictures you take while on an
archival trip.

{#markdown-and-lectures-multimedia-and-transposability}
## Markdown and lectures: multimedia and transposability.

First lectures. With Markdown, I'm able to write my own notes and create
a slide deck _at the same time_. An example will help. Here's a snippet
from my lecture notes on the memory of the Civil War:

```
# Abolitionist memory of the war.

*Image: http://upload.wikimedia.org/wikipedia/commons/1/19/William-Tecumseh-Sherman.jpg* Caption: William Tecumseh Sherman

There's another set of people who aren't content to see it go: those who remembered the war as the period of national renewal, rebirth, and freedom. We remember World War II today as the "Good War," because we fought the Nazis and won.

But unlike WWII, Civil War actually changed the country for the better. It abolished slavery. It instituted amendments that guaranteed citizenship to every American. It promised equal protection under the law.

Memory that's particularly strong among African Americans.

They remember Sherman differently.
Sherman not as marauder but as unfulfilled promise.
Sherman, you might remember, when he finally made it to the sea issued his famous **Field Order 15**
```

With some ancillary code I wrote, that does two things at once: builds a
slide showing the wikimedia copy of [Sherman's grizzled
mug](http://upload.wikimedia.org/wikipedia/commons/1/19/William-Tecumseh-Sherman.jpg),
and creates a set of notes for me under the header "Abolitionist memory
of the war” to go on the paper notes I'll read from.
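A minimal sketch of how one pass over the notes could feed both outputs, in Python. The `*Image: URL* Caption: text` pattern is the syntax from the snippet above; the function name and the returned structure are hypothetical stand-ins, not the ancillary code itself.

```python
import re

# Matches the custom slide line: *Image: <url>* Caption: <caption text>
IMAGE_LINE = re.compile(
    r"^\*Image:\s*(?P<url>\S+)\*\s*(?:Caption:\s*(?P<caption>.*))?$"
)

def split_slides_and_notes(markdown_text):
    """Return (slides, notes): slide dicts in order, plus the text to read from."""
    slides, notes = [], []
    for line in markdown_text.splitlines():
        match = IMAGE_LINE.match(line.strip())
        if match:
            slides.append({
                "url": match.group("url"),
                "caption": (match.group("caption") or "").strip(),
            })
        else:
            notes.append(line)  # headers and prose stay in the speaking notes
    return slides, "\n".join(notes)
```

Because the slides keep the order in which they appear in the text, cutting and pasting a section moves its slides along with it.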

Later on, I'll write another script that will pull every phrase in
boldface (like "Field Order 15") from all my notes and put them onto a
list of possible IDs for the midterm I can hand out. Another script
could strip just the section headers and print out outlines for the
lectures to hand out before class.
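Both of those scripts are only a few lines each. Here is a rough sketch in Python, assuming standard `**bold**` and `#` header syntax; the function names are made up for illustration, and the regexes ignore edge cases such as headers inside code fences.

```python
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")      # **Field Order 15**
HEADER = re.compile(r"^(#+)\s+(.*)$")    # "# Title", "## Subtitle"

def bold_phrases(markdown_text):
    """Every boldfaced phrase, in order of first appearance, without duplicates."""
    seen = []
    for phrase in BOLD.findall(markdown_text):
        if phrase not in seen:
            seen.append(phrase)
    return seen

def outline(markdown_text):
    """Section headers only, indented two spaces per level below the top."""
    lines = []
    for line in markdown_text.splitlines():
        match = HEADER.match(line)
        if match:
            depth = len(match.group(1)) - 1
            lines.append("  " * depth + match.group(2).strip())
    return lines
```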

This is *writing documents for multiple uses,* and it can be incredibly
useful. If, two minutes before class, I decide I want to switch the
order I talk about the abolitionist memory of the war and the white
supremacist memory of the war, I can just cut and paste the chunks of
text, and all the slides associated with each will have their order
switched.

Something like this could provide a really useful way to integrate and
share resources, and free up some of the tedium with prepping lectures.
But:

- That syntax for including an image as a slide is my own, not standard
  Markdown. I've defined scripts for dropping in YouTube videos, images,
  captions, and some other predefined formats: but it would take a lot
  of work to define a set of them that make sense for anyone but me.
- There are a lot of standards out there for working with HTML slides.
  None is winning, in part because none is anywhere as good as Keynote
  or Powerpoint for the average user. My code works with deck.js, one of
  the only HTML formats _not_ supported by Pandoc; but there's no
  obvious other standard to switch to.
- Constructing slides that are more complicated than a single image with
  a title, or a numbered list, requires some serious HTML/CSS expertise.
  My scripts support that, but not in a pretty way.

Modern HTML allows some beautiful things: I can easily imagine a GUI for
one of the standards that would make it easy to create slides for re-use
in one of the competing platforms. But I think the standards are still
evolving too rapidly in this sphere to make the way forward obvious.

Pull out the slide deck, and you still might have a useful tool here:
something that generates lecture notes for me, outlines for the
students/course web page, and IDs for the test prep sessions. But I
think there's something even more valuable possible for archive notes.

{#markdown-and-the-archives-integrating-notes-and-photos}
## Markdown and the Archives: integrating notes and photos

Markdown is a great language for taking archival notes. Archives are all
about hierarchy, and Markdown easily lets you tag multiple levels of
headers (Series, Box, Collection, file...). But Microsoft Word can do
that too, and there are plenty of outlining programs out there that are
even better.

There are a few things that Markdown notes might do more easily than
normal ones. Build a good enough web interface, and you could even click
on a photo or quote in your notes and instantly get back a string that
ascends the various headers to tell you where it is:
`Series 3a, Box 13, Folder 4, Letter on 4/18`. But the place where
there's really an opportunity lies in Digital Photos.
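The header-ascending lookup is simple to sketch in Python. This assumes the notes use `#`-style headers for each archival level; the function name and the sample notes in the usage line are hypothetical.

```python
import re

HEADER = re.compile(r"^(#+)\s+(.*)$")

def header_path(markdown_lines, line_index):
    """Return the chain of headers that encloses the given line, outermost first."""
    path = []  # list of (level, title), with levels strictly increasing
    for line in markdown_lines[: line_index + 1]:
        match = HEADER.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2).strip()
        # A new header closes any open header at the same or a deeper level.
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, title))
    return ", ".join(title for _, title in path)
```

Called on the index of a quote that sits under `# Series 3a`, `## Box 13`, `### Folder 4`, and `#### Letter on 4/18`, it returns that chain as one comma-separated string.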

Digital cameras have completely changed historians' relations to
archives in the last 15 years. (That is, in the subset of archives where
cameras are allowed). We used to take notes: now, a massive part of our
archival practice involves taking pictures, which have to be sorted
through on our return.

When I'm wading through boxes, I tend to type the name of the box, and
then some information about each folder followed by descriptions of the
documents: if it's especially useful or especially visual, I take a
picture (or a series of several pictures). I think this is pretty
similar to what most people do. It means that I end up with two separate
timelines to sort through when I get home. 1) A bunch of textual notes
that contain my impressions of the works and the rationales for why I
copied them and what they are. 2) A stream of pictures with little
context but their order to patch together their origin, sometimes with a
close-up of a box or folder label thrown in to help.

The tough question is: how can you insert pictures into your notes?
Unless you want to physically pick up your laptop and use the webcam for
your pictures, it's not obvious what the best way would be. And if you
try to put more than a couple pictures into a Word document, it will
crash right away.

Unlike the systems most historians use for notes, Markdown is *plain
text* and has an *easy method for inserting multimedia*. That means that
you can use it to integrate your archival photos directly into your
notes; and that unlike Word, it can handle hundreds of images or
thumbnails with aplomb.

The last challenge is knowing which parts of your notes go with which
pictures. This is a surprisingly hard thing to solve: but there's an
existing answer in a second technology much beloved by the technology
industry: *version control*.

Version control can get complicated, but in its simplest form it's much
like a Wikipedia edit history: not just the current state of a file, but
_every previous revision_ is kept.

So for archival notes, we just need to save the state of your archival
notes every 10 or 15 seconds; match those markers against the timestamps
of the photos from a digital camera; and insert the pictures into the
text at just the right place.

When you want to review your notes, you just open them up in HTML
format: thumbnails of every picture will appear in place, and you can
click on them to get the full version.
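The matching step is essentially a sorted lookup. A minimal sketch in Python, assuming both sets of timestamps have already been converted to Unix seconds (commit times could come from something like `git log --format=%ct`, photo times from the camera's EXIF data; that plumbing is left out, and the function name is hypothetical):

```python
from bisect import bisect_right

def match_photos_to_commits(commit_times, photo_times):
    """For each photo, find the index of the last commit saved at or before it.

    Both arguments are lists of Unix timestamps (seconds); commit_times must be
    sorted ascending. Photos taken before the first commit map to None.
    """
    matches = {}
    for photo_time in photo_times:
        position = bisect_right(commit_times, photo_time) - 1
        matches[photo_time] = position if position >= 0 else None
    return matches
```

The end of the notes as they stood at the matched commit is then the natural place to drop the photo's thumbnail.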

For the technically savvy, I've put a [set of scripts online that do
just this.](https://github.com/bmschmidt/archiveSync) I use gitit to
view the notes themselves so I can interlink between pages. A daemon
handles the git commits: but that only works because I have always been
a compulsive, several-times-a-minute saver of my documents.

{#what-would-a-user-friendly-platform-look-like}
# What would a user-friendly platform look like?

My repo might be useful for those who are already comfortable with tools
like version control: but those are the people who are already using
Markdown anyway.

To make this useful for anyone else, we'd need a system with three easy,
non-command line steps:

{#section-1-installation}
### 1. Installation

Puts Pandoc, Git, and a good Markdown editor on your computer at once.

{#section-2-writing-in-the-archives}
### 2. Writing (in the archives)

This should resemble existing note taking as closely as possible: the
user will need to make sure their camera's clock is well-calibrated, but
other than that it should look only like using a new text editor.

Whenever you type in the editor, it saves the files *and* runs git
commit at close intervals. (Git experts may find the idea of automatic
commits without a clear commit message cringe-inducing. Insofar as they
have a point, edits should probably take place on a separate branch that
is merged back into the main one periodically.)
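As a rough illustration of the autosave-and-commit loop, here is a Python sketch that shells out to git. The function names, the 15-second default, and the commit-message format are all assumptions for the example; a real editor plugin would hook into the editor's own save events instead of a timer.

```python
import subprocess
import time

def autocommit_commands(path, timestamp):
    """The two git invocations for one automatic snapshot of a notes file."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    message = "autosave " + stamp
    return [["git", "add", path], ["git", "commit", "--quiet", "-m", message]]

def autocommit_loop(path, interval_seconds=15, run=subprocess.call):
    """Snapshot the file every interval_seconds until interrupted."""
    while True:
        for command in autocommit_commands(path, time.time()):
            # git commit exits non-zero when nothing changed; that is fine here.
            run(command)
        time.sleep(interval_seconds)
```

Putting the timestamp in the message at least gives each automatic commit a searchable label.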

{#section-3-compilation-loading-your-pictures}
### 3. Compilation (loading your pictures)

Imports photos from an sdcard or photo library, finds the version
control files and matches photo times against them, and builds an html
file for each document of notes.

{#what-s-the-platform}
## What's the platform?

Some of the technical components are obvious. I can't imagine using
anything other than git for version control; and though I use gitit to
view files, I think that standalone html files are the only sensible way
for most people to view their files. The scripting language for step
three, as well, isn't very important: I've used Python, but anything
with a set of hooks into git would do.

The big question is: what's the text editor to be? I use emacs, and get
the impression that most people writing in Markdown are using vim. Both
of these are clearly bad choices for the ordinary historian. For all
that Markdown can be written in any editor, the writing function also
must support auto-save and auto-git-commit, so anything without a
scripting interface is out. SublimeText has its selling points, but
free's probably the way to go.

That means, unless I'm missing a central player in the ecosystem, that
the natural choice is the new Atom editor from GitHub. But perhaps
there's a more lightweight alternative?

Platform will also be an issue. The Mac is the obvious platform to
capture a majority of historians: but a surprising number of people seem
to take their notes with an iPad-keyboard array, which would call the
whole stack into question.

{#infrastructure}
# Infrastructure

So that's the proposal. Once historians see how great Markdown is for
notes, maybe they'll think about it for lectures; once they use it for
lectures, maybe the footnote ecosystem will start to improve, and we'll
finally be able to distribute historical papers as text, making them
more portable, more easily structured, and more lasting.

So, anyone want to try?

[^1]: It took me a few hours of mucking about in Emacs Lisp to make
  inserting a link to something in my Zotero library almost as easy as
  it is under Microsoft Word; and if you want to configure the core
  behavior of Pandoc, it's best to use Haskell. Even the "programming
  historian” may not have heard of either of these languages. Both
  (well, at least Haskell) have their strengths: but suffice it to say
  that neither has ever been anyone's answer to the question "If I
  should only learn one computer language, which should it be?”
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[The Simpsons Bookworm]]></title>
            <link>https://benschmidt.org/post/2014-08-29-the-simpsons-bookworm</link>
            <guid>https://benschmidt.org/post/2014-08-29-the-simpsons-bookworm</guid>
            <pubDate>Fri, 29 Aug 2014 19:38:24 GMT</pubDate>
            <content:encoded><![CDATA[I thought it would be worth documenting the difficulty (or lack of) in
building a Bookworm on a small corpus: I've been reading too much lately
about the Simpsons thanks to the FX marathon, so figured I'd spend a
couple hours making it possible to check for changing language in the
longest running TV show of all time.

For some thoughts on how to build a bookworm, read "prep": otherwise,
skip to analysis. [Or just head over to the
browser](http://benschmidt.org/Simpsons/).

{#prep}
### Prep

Step one is getting the texts. This is easy enough here, something I
know how to do from all my [Prochronisms](http://prochronism.com) posts:
I can just use the subtitles, which are available a batch at a time. The
only challenge is deciding what to do with audio-effects subtitles. I'm
deciding to download the files that include them where necessary, but
probably disable them by default. I also end up with only 540-something
episodes, about ten short of the complete run: rather than try to figure
that out at the start, I'm going to let the Bookworm data visualizations
themselves be the clue to what I'm missing.

Next up is choosing what a "text” will be. The obvious choice would be
for each episode to be a single text: but 550 episodes, while it's a lot
to watch, doesn't give many angles for analysis. My second idea is that
it might be interesting to look at a really granular level: ideally,
we'd be able to compare the first, second, and third acts. That info
isn't in the subtitles, but we can split up by lines of speech: later
on, we'll be able to aggregate the queries to look in just the first
hundred lines, or the first third, or whatever. The only downside is
that it dramatically increases the number of texts: but that's not
really a huge problem.

That also makes it easy to decide what I'll display in the search
results: the individual line from the script containing the word.

Next step is to parse into bookworm format. Since these are in SRT
format, it's not as easy as it could be: I'm looking to create indexes
that are episode-season-line. To get the season and episode names, I
write out some regular expressions that match the various different
filenames. This is one of the uglier parts, and where I actually spend
the most time. The final parsing code uses a whole bunch of regexes to
handle the different formats people use: "S04E20", "\[1.3\]", and so
forth. One batch doesn't have season numbers at all: I'll have to fix
that later.
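To give a flavor of that filename matching, here is a small Python sketch covering just the two formats named above. The pattern list and function name are illustrative assumptions, and the filenames in the usage line are invented; the real batches need more patterns than this.

```python
import re

# Patterns for two of the naming schemes mentioned above; real batches need more.
FILENAME_PATTERNS = [
    re.compile(r"[Ss](?P<season>\d{1,2})[Ee](?P<episode>\d{1,2})"),  # S04E20
    re.compile(r"\[(?P<season>\d{1,2})\.(?P<episode>\d{1,2})\]"),     # [1.3]
]

def season_and_episode(filename):
    """Return (season, episode) as integers, or None if no pattern matches."""
    for pattern in FILENAME_PATTERNS:
        match = pattern.search(filename)
        if match:
            return int(match.group("season")), int(match.group("episode"))
    return None
```

A name like `show.S04E20.srt` gives `(4, 20)`, and one like `show [1.3].srt` gives `(1, 3)`. Returning `None` makes it easy to list the files that still need a new pattern, like the batch with no season numbers.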

Next is actually parsing the text, and adding some new information to it
about the position of each line. This is usually the hardest part, but
SRT parsing is pretty easy as these things go. Plus, nailing down the
format leads me to an insight--rather than use line number, I can take
the embedded time information in the SRT files and index by the minute
and second in the episode that a subtitle flashes on the screen. Each
subtitle block will correspond to a file, and we'll know the exact
moment it appeared. Turns out there are about 200,000 of those in the
series, which is a reasonable number of texts to include in a Bookworm.
(Though if I were hypothetically to do this for a whole bunch of TV
series (more than a couple hundred) at the same time, that might push
the system's limits.) Parsing out the SRT time information works well.
We're left with some straggling sound effects, which I'm just leaving in
for the time being. Occasionally characters' names appear at the front of
texts: again, that's something I'd correct if this were a weekend
project rather than a weeknight one.
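For a sense of how little code the SRT step takes, here is a minimal Python sketch. It assumes the usual SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text, with blank lines between blocks); the function name is hypothetical, and sound effects and speaker names are left in, as described above.

```python
import re

TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->")

def parse_srt(srt_text):
    """Yield (minute, second, text) for each subtitle block in an SRT file."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMECODE.match(line.strip())
            if match:
                hours, minutes, seconds, _millis = (int(g) for g in match.groups())
                text = " ".join(part.strip() for part in lines[i + 1:])
                # Fold hours into minutes so "minute in the episode" stays one number.
                yield hours * 60 + minutes, seconds, text
                break
```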

That means the final scheme will give us, for each subtitle block:

1. Season Number

2. Episode number in the season

3. Episode number in the series (will make some plots easier).

4. Minute in the episode

5. Second in the episode

6. The actual text of the block.

From that information, if we were true Simpsons scholars, we could
easily add:

1. Act (roughly: call minutes 0-7 act 1, minutes 8-14 act 2, and minutes
   15 to the end act 3)

2. Air date, episode director, and other information easily linkable
   from IMDB.

3. Whether it's a finale or what.
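The act assignment, for instance, is a one-function job. A sketch in Python using the rough cutoffs listed above (the function name is made up):

```python
def act_for_minute(minute):
    """Rough act number using the cutoffs above: 0-7, 8-14, 15 and later."""
    if minute <= 7:
        return 1
    if minute <= 14:
        return 2
    return 3
```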

Once the text is parsed, the file creation is pretty easy, and we're
ready to ingest. The
[input.txt](http://bmschmidt.github.io/Presidio/input.txt.html) file is
just the text and an id number constructed from the moment the block
appears on screen: the jsoncatalog.txt is just a dump of an object
that's useful for processing, anyway.
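A sketch of what writing one record to each file might look like, in Python. The id format, the tab-separated layout, and the `filename` and `searchstring` keys are assumptions for illustration; check them against the Bookworm documentation before relying on them.

```python
import json

def bookworm_rows(season, episode, minute, second, text):
    """Build one input.txt line and one jsoncatalog.txt line for a subtitle block."""
    # Hypothetical id scheme built from the moment the block appears on screen.
    block_id = "s{:02d}e{:02d}-{:02d}-{:02d}".format(season, episode, minute, second)
    clean_text = " ".join(text.split())  # no tabs or newlines inside a one-line record
    input_line = block_id + "\t" + clean_text
    catalog_line = json.dumps({
        "filename": block_id,        # assumed key linking the two files
        "searchstring": clean_text,  # assumed key for what search results display
        "season": season,
        "episode": episode,
        "minute": minute,
        "second": second,
    })
    return input_line, catalog_line
```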

I've already written a specialized makefile for my Federalist papers
bookworm to clone the Bookworm repo and put files in the right place, so
that's easily adapted.

And then we've got it! I didn't designate any fields as "time,” so a
first inspection will be easier using the D3 browser.

The first test is to find out about those pesky missing episodes. So
I'll plot a heatmap of the number of words for each episode (x axis) and
season (y axis):

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-11.35.28-AM.png)

This shows that we've got about 25 episodes per season, but: we've got a
season 0 and no season 1 (that one set of srts that didn't give a
season, no doubt); we've got no seasons 16 and 17; and, curiously, most
season 6 episodes are twice as long as they should be. Probably season
16 was mislabeled season 6, and we're actually missing season 17. We're
also missing the first 9 episodes of season 21, and the first two of
season 22. Oh well. Something to catch on a next run.

*Analysis*

The beta lets us quickly check out some other things, like the number of
words (color) by *minute* (y axis) and season (x): you can see
commercial creep, as sometime around season 14 we lose most of minute
21.

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-11.40.32-AM.png)

OK: let's check the actual words. Here are uses of each of the central
four characters: season on the x axis, unigram on the y axis.

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-12.19.33-PM.png)

Nothing too suspicious here: the shift from Bart to Homer looks good,
etc.

Just trying some line charts: yep, Maude only gets mentioned much by
name around the season she dies:

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-12.31.00-PM.png)

But what's really interesting, maybe, isn't the season-to-season change
but the internal episode structure. For instance, at what minute in the
episodes do characters talk about "school?”

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-12.33.00-PM.png)

That's pretty interesting, actually: pretty much every minute, the plots
seem to shift away from school.

Likewise, "I'm Kent Brockman” seems to be overwhelmingly a gag from the
opening scene:

![](/wp-content/uploads/2014/08/Screen-Shot-2014-08-29-at-12.35.03-PM.png)

OK, that's enough: [here's the link to the
Bookworm](http://benschmidt.org/Simpsons/), and here's the source code.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Finding the best ordering for states]]></title>
            <link>https://benschmidt.org/post/2014-06-05-optimally-ordering-geographical-entities-in-linear-space</link>
            <guid>https://benschmidt.org/post/2014-06-05-optimally-ordering-geographical-entities-in-linear-space</guid>
            <pubDate>Thu, 05 Jun 2014 14:36:07 GMT</pubDate>
            <content:encoded><![CDATA[Here's a very technical, but kind of fun, problem: what's the optimal
order for a list of geographical elements, like the states of the USA?

If you're just here from the future, and don't care about the details,
here's my favorite answer right now:

But why would you want an ordering at all? Here's an example. In the
baby name bookworm, if you search for a name, you can see the
interaction of states and years. Let's choose "Kevin,” because it played
such a role in my [anachronism-hunting piece on
Lincoln](http://www.theatlantic.com/entertainment/archive/2013/01/did-anyone-say-racial-equality-in-1865-the-language-of-i-lincoln-i/266990/).

![](/wp-content/uploads/2014/06/Screen-Shot-2014-06-04-at-3.00.44-PM.png)

Clearly the name took off around the start of the baby boom. But is
there a geographical pattern? It's very hard to say. It does look like
the red names begin around 1955 in much of the country. But in a few,
it's not until the early 1970s. Which ones? Alabama, Georgia, North
Carolina, South Carolina. That is, after a fair amount of reading back
and forth to the axis, it's clear that most of those are southern
states. But this is the sort of insight that should be immediately
obvious. And
there may be other connections we're missing out on. The whole point of
data visualization over tables is that you can pick out patterns using
faster forms of cognition: requiring you to push over to the left to
read off the names is a major loss.

Alphabetical order makes it easy to find any individual state (assuming
you know its name) but hard to see the way related states move with each
other.  It means that to trace out regional variations over time, we
tend to animate maps: but using time as the proxy for time makes
cross-temporal comparisons much harder to tell. As Tufte says,
comparisons should be enforced across the eyespan: relying on animation
to trace out common names is a big problem. So there's a dramatic
interest in seeing different names pop up in (for instance) [Reuben
Fischer-Baum's animation of baby
names](http://jezebel.com/map-sixty-years-of-the-most-popular-names-for-girls-s-1443501909);
but you have to watch the whole thing to think through questions like
"what regions tend to adopt names early?” or "what's the name that stays
on top for the longest?”

Putting it all into X-Y makes these questions easier. But that means we
need to map states to X or to Y. Alphabetical order means that states
are not arranged in a way that places states near others like
them.

So how could we make the states usefully arranged? We need some
dimensionality reduction.

*Linear reductions*

One obvious way would be east-to-west or north-to-south: that starts out
quite well, with all of New England:

But quickly falls apart with Ohio, Florida, Georgia, and Michigan in
immediate succession. If we plot the states, you can quickly see why.
Rather than list orders, I'm going to show them as paths through a map:
here's what that looks like in this case.

![](/wp-content/uploads/2014/06/eastToWest.png)

(By the way, you can see that the points are a little arbitrary: I've
taken the first geonames hit for the state, which is sometimes the
capital, sometimes the state centroid, and sometimes the most important
city. Ideally I'd be using the population-weighted centroid, but in some
ways I kind of like the results that come out of this.)
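To see the east-to-west problem in miniature, here is a Python sketch that sorts six states by longitude. The coordinates are approximate state-capital locations chosen for illustration (an assumption: not the geonames points used for the map), and the function name is made up.

```python
# Approximate (latitude, longitude) of six state capitals, for illustration only.
CAPITALS = {
    "Maine": (44.31, -69.78),
    "Massachusetts": (42.36, -71.06),
    "Ohio": (39.96, -83.00),
    "Florida": (30.44, -84.28),
    "Georgia": (33.75, -84.39),
    "Michigan": (42.73, -84.56),
}

def east_to_west(points):
    """Order place names by longitude, easternmost (largest longitude) first."""
    return sorted(points, key=lambda name: points[name][1], reverse=True)
```

With these points the order runs Maine, Massachusetts, Ohio, Florida, Georgia, Michigan: a single coordinate says nothing about how far apart two places are on the other axis.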

There are some other possibilities for linear dimensionality reduction
(principal components comes to mind) but they'll have the same
fundamental problem. We want a metric that takes proximity more fully
into account. Even *non-metric multi-dimensional scaling* fails: it
handles a couple cases better (Jackson and St. Louis are in a more
sensible order, for instance), but it still jumps erratically up and
down, preventing any larger groups like "the south” from coming into
sight:

![](/wp-content/uploads/2014/06/nonmetric-MDS.png)

*Hierarchical clustering approaches*

One possible approach, suggested to me by Miriam Huntley, is
hierarchical clustering: using distances, we can cluster the states by
proximity. Here's the initial result of that:

![](/wp-content/uploads/2014/06/Initial-Dendrogram.png)

The individual groups are quite nice (New England is there, plus New
York at the end), and every state is adjacent to an immediate neighbor.
And while the groups have geographical coherence, they aren't exactly
the regions we know and love: the "mid-atlantic” runs down to South
Carolina, and the midwest includes the gulf coast all the way to
Tallahassee. The connections between the groups are scattered. Florida
is next to Pennsylvania, and South Carolina to Massachusetts. Seen as a
path, the weirdness of this is clear:

![](/wp-content/uploads/2014/06/hclust.png)

Leaf ordering in dendrograms is arbitrary, however, and we can do better
than this. Using a method developed by Bar-Joseph et al., and implemented
in the "cba" library for R, we can reorder the dendrogram so that groups
stay the same, but the leaves are ordered so that the states on either
side of each transition from one group to the next are as close together
as possible.
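
To make the idea concrete, here is a toy sketch in Python. It is not the Bar-Joseph algorithm or the "cba" code, both of which are far more efficient: it simply brute-forces every combination of left/right flips at the internal nodes of a small dendrogram and keeps the leaf order with the shortest total distance between neighbors. The tree, the labels, and the coordinates are all made up for illustration.

```python
import math
from itertools import product

def leaves(tree):
    """Flatten a nested-tuple dendrogram into its left-to-right leaf order."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def count_internal(tree):
    """Number of internal (two-child) nodes in the dendrogram."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_internal(tree[0]) + count_internal(tree[1])

def apply_flips(tree, flips):
    """Rebuild the tree, swapping children wherever the next flag is True.

    `flips` is an iterator of booleans consumed in preorder, one per
    internal node. Swapping children never changes cluster membership.
    """
    if not isinstance(tree, tuple):
        return tree
    flip = next(flips)
    left = apply_flips(tree[0], flips)
    right = apply_flips(tree[1], flips)
    return (right, left) if flip else (left, right)

def best_leaf_order(tree, pts):
    """Try every flip combination; return the order with the shortest path."""
    def cost(order):
        return sum(math.dist(pts[a], pts[b]) for a, b in zip(order, order[1:]))
    n = count_internal(tree)
    candidates = (leaves(apply_flips(tree, iter(f)))
                  for f in product([False, True], repeat=n))
    return min(candidates, key=cost)

# Hypothetical four-leaf dendrogram with two clusters, laid out on a line.
tree = (("A", "B"), ("C", "D"))
pts = {"A": (0, 0), "B": (1, 0), "C": (5, 0), "D": (4, 0)}
print(best_leaf_order(tree, pts))  # ['A', 'B', 'D', 'C']
```

The two clusters are untouched, but flipping the second one puts D next to B, which shortens the jump between groups. Brute force doubles in cost with every internal node, so it is only usable on tiny trees.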

 

![](/wp-content/uploads/2014/06/Reordered-dendrogram.png)

Now, the path looks considerably better:

![](/wp-content/uploads/2014/06/Optimal-map.png)

The clusters remain adjacent, but now the transitions are so smooth that
it's not obvious where one begins and the other ends. Instead, we get a
serpentine path through the states that both ensures every step is
between two adjacent states, and keeps steps generally inside the same
region.

*Network approaches*

Can we do better? The strategy of plotting these as paths suggests that
maybe this is an instance of the traveling salesperson problem, in which
we want to travel through all the states minimizing the distance
traveled. Why shouldn't the "best" solution simply be the one where the
overall sum of distances is the least?
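
For a concrete sense of what "overall sum of distances" means, here is a small Python sketch that scores an ordering by the great-circle distance between each pair of consecutive points. The coordinates are rough latitude/longitude values for the six New England capitals, included only as an illustration; the function names are my own.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def path_length_km(order, coords):
    """Total distance traveled when visiting `order` from first to last."""
    return sum(haversine_km(coords[a], coords[b])
               for a, b in zip(order, order[1:]))

# Approximate (lat, lon) of each state capital; illustrative values only.
coords = {
    "Connecticut": (41.76, -72.68),
    "Maine": (44.31, -69.78),
    "Massachusetts": (42.36, -71.06),
    "New Hampshire": (43.21, -71.54),
    "Rhode Island": (41.82, -71.41),
    "Vermont": (44.26, -72.58),
}
alphabetical = sorted(coords)
geographic = ["Maine", "New Hampshire", "Vermont",
              "Massachusetts", "Rhode Island", "Connecticut"]

print(round(path_length_km(alphabetical, coords)))
print(round(path_length_km(geographic, coords)))
```

Even on six states, the alphabetical path comes out noticeably longer than the hand-picked geographic one, which is the quantity a traveling-salesperson solver tries to drive down.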

Inserting a dummy node as start- and end-point lets us try that: using
the best method found by the "TSP" package in R (which is not guaranteed
to be the optimal solution, since the traveling salesperson problem is
notoriously difficult to solve), we get quite a different path:

![](/wp-content/uploads/2014/06/TravelingSalesman1.png)

Rather than start in Maine, this route begins in Tennessee! After
winding through the Midwest to West Virginia, it leaps to Vermont and
then takes a beautifully practical course down the Eastern seaboard
through Texas, through the great plains, and then takes up nearly an
east-to-west ordering through the Mountain and Western time zones. While
many of the regional choices here look better to me than the dendrogram
solution (particularly the coherence of the south), the
distance-optimizing strategy means that there are a few neighbors in the
ordering that have nothing in common: the leap from New Mexico to Montana,
for example, and the extremely strange choice to place Washington DC between
West Virginia and Vermont, ten nodes removed from either Maryland or
Virginia, the closest geographical points. (In fact, I think the route
could be improved by heading straight to Vermont from WV and putting DC
in its rightful place: but it says something that out of the 7
algorithms in the free version of the TSP package, none was able to
improve on this route).
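
To show why heuristic routes can leave obvious improvements on the table, here is a minimal Python sketch of two classic heuristics: a greedy nearest-neighbor route, followed by 2-opt improvement (reversing a stretch of the route whenever that shortens it). These are not the algorithms from the "TSP" package, just simple stand-ins, and the points are invented. Because the route is treated as an open path rather than a loop, no dummy node is needed here.

```python
import math

def path_length(order, pts):
    """Sum of straight-line distances between consecutive stops."""
    return sum(math.dist(pts[a], pts[b]) for a, b in zip(order, order[1:]))

def nearest_neighbor(pts, start):
    """Greedy route: always move to the closest unvisited point."""
    order = [start]
    remaining = set(pts) - {start}
    while remaining:
        here = pts[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda k: math.dist(here, pts[k]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def two_opt(order, pts):
    """Reverse segments of the route until no reversal shortens it."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(candidate, pts) < path_length(best, pts) - 1e-12:
                    best, improved = candidate, True
    return best

# Hypothetical points standing in for state locations.
pts = {"a": (0, 0), "b": (3, 1), "c": (1, 0), "d": (2, 3), "e": (0, 2), "f": (4, 0)}
greedy = nearest_neighbor(pts, "a")
polished = two_opt(greedy, pts)
print(greedy, round(path_length(greedy, pts), 2))
print(polished, round(path_length(polished, pts), 2))
```

2-opt can only ever shorten the route it is given, but it stops at the first route that no single reversal improves, which is not necessarily the best one overall.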

*Fractal Curves*

Another option is not to minimize travel distance but to maximize the
likelihood that two points will be next to each other. That suggests
filling the geographic region with some kind of fractal curve, and then
positioning each state along the curve.

This is an appealing way to think of arranging the country linearly: not
as a network, but as an iterable set of points. For just the United States,
we could use some already-existing curve path. The most widespread
linear mapping of points is the Zip code system: Samuel Arbesman has
[written about this on
Wired](http://www.wired.com/2012/01/the-fractal-dimension-of-zip-codes/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%253A+wired%252Findex+%2528Wired%253A+Index+3+%2528Top+Stories+2%2529%2529),
and includes a link to [Robert Kosara's ZipScribble
maps](http://eagereyes.org/zipscribble-maps/united-states). Here's
Kosara's idea with a few minor changes (I use a rainbow spectrum, rather
than coloring each state separately, and an Albers projection. And it
appears that the zip database I have handy has something weird going on
in southwest Georgia.)

![](/wp-content/uploads/2014/06/ZipManifold.png)

*Space-filling curves*

The ZIP system isn't especially logical, but there should be something
similar that's better. My first thought for this problem, which prompted
this whole post, was to use a *Hilbert Curve*. It turns out that [Kosara has
mapped that approach onto the Zip
dataset](http://eagereyes.org/zipscribble-maps/travelling-presidential-candidate-map).

Using just the state points, it's possible to draw a Hilbert curve that
covers the continental United States, and then visit each state at the
moment it's closest to the curve. The actual path taken can then be
simplified down to eliminate the intervening states. Here's what that
looks like, with both the Hilbert curve and the simplified route. I've
shaded the Hilbert curve using a double rainbow so it's easier to trace
from its origin near the Bahamas (first making shore near South
Carolina) to its exit off the coast of Los Angeles.

![](/wp-content/uploads/2014/06/Hil.png)

I'm disappointed by the performance here. While there is some regional
coherence (the stretch from Wisconsin to Kansas is well done, and the
first jumps through the South are acceptable), the square binning forces
some rather strange choices: the odd jag down to North Carolina, the
detour to Colorado and Wyoming.
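
The square binning mentioned above can be sketched in Python as follows. This is a simplification resting on several assumptions: rather than finding the moment the curve passes closest to each state, it drops each point into a cell of a square grid and sorts by that cell's position along the Hilbert curve. The bounding box and the sample coordinates are rough, illustrative values, and the function names are my own.

```python
def hilbert_index(order, x, y):
    """Position along a Hilbert curve of cell (x, y) in a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_sort(points, bbox, order=4):
    """Order labels by the Hilbert index of the grid cell they fall in.

    points: label -> (lon, lat); bbox: (min_lon, min_lat, max_lon, max_lat).
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    n = 1 << order

    def cell(lon, lat):
        cx = int((lon - min_lon) / (max_lon - min_lon) * n)
        cy = int((lat - min_lat) / (max_lat - min_lat) * n)
        # Clamp points on or outside the box edge into the grid.
        return max(0, min(n - 1, cx)), max(0, min(n - 1, cy))

    return sorted(points, key=lambda k: hilbert_index(order, *cell(*points[k])))

# Rough (lon, lat) values for a handful of states; illustrative only.
points = {
    "Maine": (-69.8, 44.3),
    "Florida": (-84.3, 30.4),
    "Kansas": (-95.7, 39.0),
    "Oregon": (-123.0, 44.9),
    "Texas": (-97.7, 30.3),
}
print(hilbert_sort(points, bbox=(-125.0, 24.0, -66.0, 50.0)))
```

Stretching a wide rectangle onto a square grid is exactly the kind of distortion that produces odd jags: two states in neighboring cells can sit on either side of a quadrant boundary and end up far apart on the curve.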

There are other issues as well. Hilbert curves work best in square
spaces, and the patches of ocean/Canada/Mexico that get filled are
pretty far off limits. While I don't show Alaska and Hawaii, for the
other algorithms they've simply been tacked on at the end in a
reasonable manner: here, though, a solution that includes Alaska and
Hawaii makes some significant changes to the full arrangement and vastly
increases the percentage of empty space, which tends to introduce odd
decisions (like interposing Alaska between Oregon and Nevada.)

![](/wp-content/uploads/2014/06/HilbertCurveWIthAK.png)

I suspect there are ways of optimizing the Hilbert curve, or some
similar fractal path, so that it better maps onto actual geographic
spaces. That seems like an interesting avenue, potentially: but the
initial results here seem worse, not better, than traveling salesman
approximations.

*Conclusions and Deus ex Machina*

So on this particular set, the best results seem to come from, in
descending order,

1. Reordered hierarchical clustering

2. Traveling Salesperson solutions

3. Fractal Curves

4. Quasi-linear dimensionality reduction (east-to-west,
   multi-dimensional scaling, etc).

For the general problem (European countries, say, or counties in a
state) I'd probably start with reordered hierarchical clustering or TSP
solutions, at least until I learn how better to fit a fractal curve to
an arbitrary space.

But for this particular problem, I've got an ace in the hole:
there *are* conventional orderings of states that provide an acid
test. In particular, we want something that matches [census
regions](https://www.census.gov/geo/maps-data/maps/docs/reg_div.txt).

The ordering inside census regions is arbitrary, just like our
clustering diagrams. So the best possible solution that includes some
knowledge about the intrinsically *real* regions of the United States
(the midwest, the south, etc.) is to combine the census regions with the
optimal-dendrogram measures.

Building phony clusters from just the census regions looks like this:

![](/wp-content/uploads/2014/06/Phony-Cluster.png)

I can just plug those into a dummy distance matrix so that group
membership trumps any other sorts of distance: and then allow
geographical distance to sort out the spinning of those trees into a
more sensible order.
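
One way to build such a dummy distance matrix, sketched here in Python under my own assumptions about the details: add a large fixed penalty whenever two states fall in different census divisions, and a larger one when they fall in different regions, so that ordinary geographic distance only matters within a group. The penalty size is arbitrary, the coordinates are rough (lon, lat) values for four state capitals, and distance is measured naively in degrees for brevity.

```python
import math

def constrained_distance(a, b, pts, division, region, penalty=1000.0):
    """Geographic distance plus a penalty for crossing census boundaries."""
    d = math.dist(pts[a], pts[b])
    if region[a] != region[b]:
        return d + 2 * penalty   # different region: largest penalty
    if division[a] != division[b]:
        return d + penalty       # same region, different division
    return d                     # same division: plain distance

def constrained_matrix(labels, pts, division, region, penalty=1000.0):
    """All pairwise constrained distances, keyed by (a, b)."""
    return {(a, b): constrained_distance(a, b, pts, division, region, penalty)
            for a in labels for b in labels}

# Rough (lon, lat) of each capital; illustrative values only.
pts = {"WV": (-81.63, 38.35), "VA": (-77.43, 37.54),
       "KY": (-84.87, 38.20), "OH": (-83.00, 39.96)}
region = {"WV": "South", "VA": "South", "KY": "South", "OH": "Midwest"}
division = {"WV": "South Atlantic", "VA": "South Atlantic",
            "KY": "East South Central", "OH": "East North Central"}

m = constrained_matrix(sorted(pts), pts, division, region)
for other in ("VA", "KY", "OH"):
    print("WV", other, round(m[("WV", other)], 2))
```

In raw distance West Virginia is closest to Ohio, then Kentucky, then Virginia; with the penalties the order reverses, because only Virginia shares its census division. Any clustering run on this matrix will therefore merge whole divisions and regions before it joins anything across them.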

So, adding the constraint that census divisions and regions be kept
intact, the optimal ordering looks like this: starting in Maine,
traveling through the South west to Texas, skipping to the upper Midwest
and then taking the same route west through the plains and mountains as
the dendrogram:

![](/wp-content/uploads/2014/06/Census-Clustering.png)

Is this the perfect ordering? To my mind, it's not: but the flaws come
straight from the census, not from the algorithm. West Virginia should
not be in the coastal south, it should be in the same division as Kentucky;
the leap from Oklahoma to Wisconsin is unfortunate, and so is the one
from Florida to Kentucky. Still, the census-region constraint is quite
nice to have. And unlike the unguided paths, it preserves all but one of
what I intuitively think of as the essential pairings: the Dakotas, the
Carolinas, Alabama-Mississippi, Vermont-New Hampshire, Kansas-Nebraska,
Colorado-Wyoming.
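
Checking which pairings survive is easy to automate. Here is a small Python sketch; the ordering in it is a made-up fragment for illustration, not the output of any of the methods above, and the function name is my own.

```python
def adjacent_pairs_kept(order, pairs):
    """Return the pairs whose two members sit next to each other in `order`."""
    position = {name: i for i, name in enumerate(order)}
    return [(a, b) for a, b in pairs
            if abs(position[a] - position[b]) == 1]

# Hypothetical fragment of an ordering, used only to exercise the function.
order = ["North Dakota", "South Dakota", "Nebraska", "Kansas",
         "Colorado", "Utah", "Wyoming"]
pairs = [("North Dakota", "South Dakota"),
         ("Kansas", "Nebraska"),
         ("Colorado", "Wyoming")]

print(adjacent_pairs_kept(order, pairs))
```

In this fragment the Dakotas and Kansas-Nebraska stay together, while Utah splits Colorado from Wyoming.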

So, let's return to the original visualization to see what this new
ordering helps us see. Remember, this original version revealed only
with some serious axis-reading that the South started using "Kevin"
later.

![](/wp-content/uploads/2014/06/Screen-Shot-2014-06-04-at-3.00.44-PM.png)

Here it is with the census-based ordering. The southern states,
two-thirds of the way down the page, clearly do begin later: but now
it's also immediately evident which of them *don't* lag as much. There
are also several patterns that are immediately evident which remain
completely obscure in an alphabetical ordering: usage of "Kevin" is
significantly higher around 1990 in the northeast, particularly the
mid-Atlantic, than it is in the rest of the country. And while the South
waits the longest, a lag in the Arizona-New Mexico pairing is also
clear.

![](/wp-content/uploads/2014/06/Screen-Shot-2014-06-05-at-3.52.16-PM.png)

This style of display also makes subtler patterns visible. "Jennifer,"
for example, rises a year later in the South than elsewhere. That would
be lost as visual noise in an alphabetical ordering, but is completely
clear here.

Is a geographical ordering the best? Not always. Take
"[Madison](http://benschmidt.org/statenames/#%7B%7B%22database%22%3A%22SSA%22%2C%22search_limits%22%3A%7B%22word%22%3A%5B%22Madison%22%5D%7D%2C%22aesthetic%22%3A%7B%22y%22%3A%22state%22%2C%22x%22%3A%22year_year%22%2C%22color%22%3A%22WordsPerMillion%22%7D%2C%22plotType%22%3A%22heatmap%22%2C%22method%22%3A%22return_json%22%2C%22counttype%22%3A%5B%22WordsPerMillion%22%5D%2C%22groups%22%3A%5B%22state%22%2C%22year_year%22%5D%2C%22scaleType%22%3A%22linear%22%7D)":
its rise shows striped bands that don't seem to be regional. Illinois,
New Jersey, Washington DC, and New Mexico all avoid the wave. In fact,
if you look closer, this is clearly a racial thing: "Madison" was most
popular in states with overwhelmingly white populations (except
Wisconsin, it seems). And aside from the bend through the southwest,
there aren't a whole lot of largely-minority states in any contiguous
curve.

![](/wp-content/uploads/2014/06/Screen-Shot-2014-06-05-at-4.04.51-PM.png)

But on another level, that just underscores the usefulness
of *some* sensible ordering to start with.
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <title><![CDATA[Bleg 1: String Distance]]></title>
            <link>https://benschmidt.org/post/2014-03-27-bleg-1-string-distance</link>
            <guid>https://benschmidt.org/post/2014-03-27-bleg-1-string-distance</guid>
            <pubDate>Thu, 27 Mar 2014 21:44:24 GMT</pubDate>
            <content:encoded><![CDATA[String distance measurements are useful for cleaning up the sort of
messy data that comes from multiple sources.

There are a bunch of string distance algorithms, which usually rely on
some form of calculations about the similarities of characters. But in
real life, characters are rarely the relevant units: you want a distance
measure that penalizes changes to the most information-laden parts of
the text more heavily than to the parts that are filler.

Real-world example: say you're trying to match two lists of universities
to each other. In one you have:

\[500 university names...\]

Rutgers the State University of New Jersey

and in the other you have:

\[499 university names...\]

Rutgers University

New Hampshire State University

By most string distance measures, 'State University' and 'New' will make
the long version of Rutgers match New Hampshire State, not Rutgers. But
in the context of those 500 other names, that's not the correct match to
make. The phrase "State University" actually conveys very little
information (I'd guess fewer than 8 bits), but "R-u-t-g-e-r-s" are
characters you should lose lots of points for changing (rough guess, 14
bits).
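
Here is a rough Python sketch of the kind of measure I mean, with two big simplifications that are assumptions on my part: it works on whole words rather than characters, and it estimates each word's information from how many names in the reference list contain it. The short list of names stands in for the 500, and the function names are my own.

```python
import math
from collections import Counter

def tokens(name):
    """Lowercased whitespace-separated words of a name, as a set."""
    return set(name.lower().split())

def make_token_weight(reference_names):
    """Return a function giving each word's information in bits.

    Words found in nearly every name (like "university") weigh almost
    nothing; rare or unseen words weigh a lot. Add-one smoothing keeps
    unseen words finite.
    """
    n = len(reference_names)
    doc_freq = Counter(t for name in reference_names for t in tokens(name))

    def weight(token):
        return -math.log2((doc_freq[token] + 1) / (n + 1))
    return weight

def weighted_jaccard_distance(a, b, weight):
    """1 minus (weight of shared words / weight of all words in either name)."""
    ta, tb = tokens(a), tokens(b)
    shared = sum(weight(t) for t in ta & tb)
    total = sum(weight(t) for t in ta | tb)
    return 1.0 - shared / total if total > 0 else 0.0

# A tiny stand-in for the second list of universities.
names = ["Rutgers University", "New Hampshire State University",
         "Ohio State University", "New Mexico State University",
         "Boston University"]
query = "Rutgers the State University of New Jersey"

weight = make_token_weight(names)
plain = min(names, key=lambda n: weighted_jaccard_distance(query, n, lambda t: 1.0))
informed = min(names, key=lambda n: weighted_jaccard_distance(query, n, weight))
print(plain)     # New Hampshire State University
print(informed)  # Rutgers University
```

With every word counted equally, the three shared filler words win; once "university" is worth nothing and "rutgers" is worth more than "state", the right match comes out on top. The margin is slim, though, and unseen filler like "the" and "of" gets a high weight here, so this is far from a finished measure.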

In practice, I often get around this by changing the string vocabulary
by hand (change all occurrences of "University" to "Uni", etc.). I can
imagine a few ways to solve this: e.g., normalized compression distance
starting from a file of everything, or calculating a standard string
distance metric on a compressed version of names instead of the English
version. But I feel like this must exist, and my Internet searches just
won't find it.
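
For reference, plain normalized compression distance takes only a few lines of Python with the standard `zlib` module. This sketch does not do the "starting from a file of everything" part, so the compressor knows nothing about which words are common across the list, and on strings this short the compressor's fixed overhead makes the numbers noisy. Treat it as a starting point only.

```python
import zlib

def csize(text):
    """Compressed size in bytes of a string, at zlib's highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(a, b):
    """Normalized compression distance: near 0 if similar, near 1 if not."""
    ca, cb = csize(a), csize(b)
    cab = csize(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

long_name = "Rutgers the State University of New Jersey"
for other in ("Rutgers University", "New Hampshire State University"):
    print(other, round(ncd(long_name, other), 3))
```

The intuition is that if one name helps compress the other, they share content. Priming the compressor with the full list of names would, in principle, make shared filler like "State University" nearly free, which is the behavior I'm after.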
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
        <item>
            <link>https://benschmidt.org/post/2023-10-31-AI-Copyright</link>
            <guid>https://benschmidt.org/post/2023-10-31-AI-Copyright</guid>
            <pubDate>Thu, 01 Jan 1970 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[|title: "AI and Copyright"|
|categories: \[Humanities, Degrees\]|
|date: 2023-09-23T13:36:45-04:00|
|lastmod: 2023-09-23T13:36:45-04:00|
|featured: false|
|draft: true|

I am a onetime history professor, now working as the vice president of
[Nomic](https://atlas.nomic.ai), a technology startup building
technology to make modern large language models more democratic: that
is, easier to use, easier to access. As a historian, I was concerned
especially with ways to preserve, surface, and analyze the enormous
stores of cultural heritage that digitization made available. For both of
these goals, a novel reading of copyright law threatens to complicate tools for
generative AI.

In the last year, there has been a sea change in the significance that
people bring to questions of copyright. In the last 25 years, copyright
has mostly felt like a give-and-take between creators and consumers:
digital technologies have provided ways of downloading works without
allowing artists (or their heirs, or the corporations that acquired the
copyright to their works) the ability to be compensated. What has
emerged more recently -- especially in the visual arts, but also for
text -- is a dynamic where an existing class of creators, fearful for an
erosion of their status in a new regime,

The language workers most impacted by generative AI -- translators,
proofreaders, editors, writers of contracts -- have no significant
recourse to copyright at all.

The fuzzy, approximate 'readings' of books that are retained in the
memory of a trained neural network are not in any sense copies
themselves.

The last time that copyright concerns were so pressing in my world
involved the attempts to contain the explosion of questions brought
about by Google's [quiet, methodical scanning of tens of millions of
books from academic libraries](https://books.google.com).

Google and the libraries were both sued by the Authors Guild.

One of the more alarming features that has been emerging is a conflation
of retaining the rights of copyright with other, far more important
concerns around personal privacy.

But the forces most eager to expand the powers of copyright are not--to
put it mildly--the friends of artists. Our office is a block north of
the Warner Bros. Discovery office in New York City, and for the past
months we have bee\.. Having myself been lightly involved in the
scriptwriting process for a few years,
]]></content:encoded>
            <author>bmschmidt@gmail.com (Ben Schmidt)</author>
        </item>
    </channel>
</rss>