This is some real inside baseball; I think only two or three people will be interested in this post. But I’m hoping to get one of them to act out or criticize a quick idea. This started as a comment on Scott Enderle’s blog, but then I realized that Andrew Goldstone doesn’t have comments for the parts pertaining to him… Anyway.
Basically I’m interested in feature reduction for token-based classification tasks.
I’ve gotten a couple e-mails this week from people asking advice about what sort of computers they should buy for digital humanities research. That makes me think there aren’t enough resources online for this, so I’m posting my general advice here. (For some solid other perspectives, see here). For keyword optimization I’m calling this post “digital humanities.” But, obviously, I really mean the subset that is humanities computing, what I tend to call humanities data analysis. [Edit: To be clear, ] Moreover, the guidelines here are specifically tailored for text analysis; if you are working with images, you’ll have somewhat different needs (in particular, you may need a better graphics card). If you do GIS, god help you. I don’t do any serious social network analysis, but I think the guidelines below should work relatively with Gephi.
Practically everyone in Digital Humanities has been posting increasingly epistemological reflections on Matt Jockers’ Syuzhet package since Annie Swafford posted a set of critiques of its assumptions. I’ve been drafting and redrafting one myself. One of the major reasons I haven’t is that the obligatory list of links keeps growing. Suffice it to say that this here is not a broad methodological disputation, but rather a single idea crystallized after reading Scott Enderle on “sine waves of sentiment.
Just some quick FAQs on myprofessor evaluations visualization: adding new ones to the front, so start with 1 if you want the important ones.
-3 (addition): The largest and in many ways most interesting confound on this data is the gender of the reviewer. This is not available in the set, and there is strong reason to think that men tend to have more men in their classes and women more women.
I promised Matt Jockers I’d put together a slightly longer explanation of the weird constraints I’ve imposed on myself for topic models in the Bookworm system, like those I used to look at the breakdown of typical TV show episode structures. So here they are.
The basic strategy of Bookworm at the moment is to have a core suite of tools for combining metadata with full text for any textual corpus.
Just a quick follow-up to my post from last month on using Markdown for writing lectures. The github repository for implementing this strategy is now online.
The goal there was to have one master file for each lecture in a course, and then to have scripts automatically create several things, including a slidedeck and an outline of the lecture (inferred from the headers in the text) to print out for students to follow along in class.
I’ve been thinking a little more about how to work with the topic modeling extension I recently built for bookworm. (I’m curious if any of those running installations want to try it on their own corpus.) With the movie corpus, it is most interesting split across genre; but there are definite temporal dimensions as well. As I’ve said before, I have issues with the widespread practice of just plotting trends over time; and indeed, for the movie model I ran, nothing particularly interesting pops out.
I’ve been seeing how deeply we could integrate topic models into the underlying Bookworm architecture a bit lately.
My own chief interest in this, because I tend to be a little wary of topic models in general, is in the possibility for Bookworm to act as a diagnostic tool internally for topic models. I don’t think simply plotting description absent any analysis of the underlying token composition of topics is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.
This is a post about several different things, but maybe it’s got something for everyone. It starts with 1) some thoughts on why we want comparisons between seasons of the Simpsons, hits on 2) some previews of some yet-more-interesting Bookworm browsers out there, then 3) digs into some meaty comparisons about what changes about the Simpsons over time, before finally 4) talking about the internal story structure of the Simpsons and what these tools can tell us about narrative formalism, and maybe why I’d care.
Like many technically inclined historians (for instance, Caleb McDaniel, Jason Heppler, and Lincoln Mullen) I find that I’ve increasingly been using the plain-text format Markdown for almost all of my writing.
The core idea of Markdown is that rather than use Microsoft Word, Scrivener, or any of the other pretty-looking tools out there, you type in “plain text” using formatting conventions that should be familiar to anyone who’s ever written or read an e-mail.