For the second unit as well, I want to try a kind of group workset. First, we’ll divide into three or four groups of four or so around a specific corpus and problem; then we’ll talk about it.
I also want to emphasize again: this will not necessarily be easy, and much of it may be completely impossible if you don’t come to office hours! Please do! Don’t wait for your group to meet before you come see me!
The goal here is threefold:
Instead of e-mailing the workset over the next week, we’ll take forty minutes at the start of class on October 30th (in two weeks) to go over each group’s work. So this will be a sort of grouped sprint[1] towards a preliminary mini works-in-progress conference.
The deliverables at the end will be:
Possible corpora include:
Each group should choose a different method unless you’ve got a great justification
Topic Modeling You should be quite familiar with it at this point. Actually running the model can be a bit of a pain, but there are detailed step-by-step instructions in a variety of places online, most notably the Programming Historian. Does not require a large corpus.
Word Vectors This is similar to topic modeling in some ways: a hot topic from machine learning that identifies structural categories of language use in texts. But it principally focuses on finding new elements of vocabulary that are similar to other words. You could use it to define a topic in the texts, to uncover patterns in language use, or to find words that show a certain orientation: it provides a route towards answering weird questions like “where do foods fall on the salty–sweet spectrum in 19th-century newspapers?” or SAT-style analogies like man:king::???:queen. Requires a pretty large corpus.
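To get a feel for the arithmetic behind those analogies, here is a toy sketch. The vectors below are invented three-dimensional toys for illustration only; real embeddings have hundreds of dimensions and are trained from a large corpus, which is why this method needs one.

```python
import math

# Invented toy "word vectors" -- real ones come from training on a corpus.
vectors = {
    "man":   [1.0, 0.2, 0.1],
    "woman": [1.0, 0.9, 0.1],
    "king":  [0.2, 0.2, 0.9],
    "queen": [0.2, 0.9, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def analogy(a, b, c):
    """Solve a:b :: c:? by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # with these toy vectors: queen
```

The same vector arithmetic, run over real trained embeddings, is what answers the salty–sweet question: project each food word onto the line between the “salty” and “sweet” vectors.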
Text Reuse Just like Mullen and Funk. If you can think of another type of text where it might work well, this could be an easy chance to re-use Mullen’s code, which tends to be clear and well-documented.
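Mullen’s actual code does something more sophisticated, but the core idea of text reuse detection can be sketched in a few lines: break each document into overlapping word n-grams (“shingles”) and compare the sets. The sentences below are made up for illustration.

```python
def shingles(text, n=5):
    """The set of overlapping n-word sequences in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Set similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "we hold these truths to be self evident that all men are created equal"
doc2 = "we hold these truths to be self evident and obvious to everyone"
doc3 = "four score and seven years ago our fathers brought forth a new nation"

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print(jaccard(s1, s2) > jaccard(s1, s3))  # the borrowing pair scores higher
```

A pair of documents with a high Jaccard score is a candidate for reuse; the real tools then align the matching passages.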
Named-entity extraction Like Blevins in a couple of weeks, you can use the Stanford Named Entity Recognizer to extract lists of names and places from corpora of texts. Once you have the list, someone in the group can plot the locations on a map as well. From a distance this will seem easier than it actually is.
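To see why it’s harder than it looks, here is a crude stand-in for a real tagger: grab runs of capitalized words (the sentence is invented). A trained tagger like Stanford’s also labels each entity as a person, location, etc., and avoids the false positives this naive version produces at sentence starts and titles.

```python
import re
from collections import Counter

TEXT = ("Mr. Lincoln traveled from Springfield to Washington, "
        "where Lincoln met with Seward.")

# Naive "NER": every maximal run of capitalized words is a candidate entity.
candidates = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", TEXT)
counts = Counter(candidates)
print(counts.most_common(3))  # "Mr" shows up too -- hence the real tagger
```

Even after a real tagger gives you clean place names, the mapping step still needs geocoding, which is its own can of worms.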
Sentiment analysis This remains a weirdly hot topic in English-department structural analysis of texts. The idea is to track the presence of positive or negative language in a corpus. There are solid tools for sentiment analysis out there: one good one is Matt Jockers’s R package syuzhet. Use it if you can think of a dataset where sentiment matters. A surprisingly open question in digital humanities research right now is: are there differences in the sentiment trajectories between different genres?
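The basic mechanism behind syuzhet-style scoring is just lexicon lookup, which is easy to sketch. The word lists below are invented for illustration; syuzhet ships real, much larger lexicons, and this Python stand-in is only to show the shape of the computation.

```python
# Invented toy lexicons -- real packages use curated lists of thousands
# of words, often with weights rather than simple membership.
POSITIVE = {"good", "great", "triumph", "prosperity", "hope"}
NEGATIVE = {"bad", "ruin", "fear", "corruption", "defeat"}

def sentiment(text):
    """Score a passage: positive-word count minus negative-word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("great hope and prosperity"))  # 3
print(sentiment("fear corruption and ruin"))   # -3
```

Scoring successive chunks of a novel this way, then plotting the scores in order, is exactly what a “sentiment trajectory” is.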
Other Get in touch with me if you have a corpus that you think might benefit from some other method.
Write a regular expression that captures the 100 words around every use of the names of presidential candidates in the elections between 1892 and 1920. Sentiment-analyze those passages. Are there patterns in when positive, negative, or non-emotional language is used relative to the life-cycle of a campaign?
Topic model a different field from JStor data than Underwood and Goldstone did.
Figure out how to load the data from the Hall, Jurafsky, and Manning article and trace how a few different words change their context. (This might be hard, so you’ll have to come by!)
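For the candidate-context idea above, one workable shape is to use a regex only for tokenizing and then slice a window of words around each hit. The candidate list below is a placeholder; the real project would include every major-party nominee from 1892 through 1920.

```python
import re

# Placeholder names -- fill in the full slate of nominees for 1892-1920.
CANDIDATES = {"bryan", "mckinley", "taft", "wilson"}
WINDOW = 50  # words on each side, i.e. roughly 100 words of context

def contexts(text):
    """Yield (candidate, context window) for every mention in a text."""
    words = re.findall(r"\w+", text.lower())
    for i, w in enumerate(words):
        if w in CANDIDATES:
            yield w, " ".join(words[max(0, i - WINDOW): i + WINDOW + 1])

for name, window in contexts("The crowd roared as Bryan denounced gold."):
    print(name, "->", window)
```

You could instead write one giant pattern like `(\S+\s+){0,50}(Bryan|…)(\s+\S+){0,50}`, but it is slower, harder to debug, and silently swallows nearby mentions; the tokenize-then-slice version hands each window straight to whatever sentiment scorer the group settles on.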
You’ll have to figure out how to divide the work up for this, which is why the workset is not due for two weeks.
But some potential roles are:
[1] In real life, the only actual grouped sprint in the world is the three-legged race. Ideally you’ll look like a 1-mile relay team handing off perfectly, but don’t worry if your whole group ends up face down in the mud, as long as you can explain how you got there.