
I’m updating this from the 2015 syllabus, but this field has changed a lot, and so will the syllabus before we start. I’ve also stolen a lot from Ryan Cordell’s 2017 offering of this course.

Notes: in this course we’ll mostly be reading articles that are available online. A few books are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.

In week 1, you’ll read my spiel about what humanists need to understand when they read CS. My answer, in general, is that you need to know what they did, but not how they did it. I’ve put some CS papers in this syllabus to expand your thinking about what’s possible. You should absolutely, positively, not aim to understand the process. As a rule, if you see a fancy equation in an article not written by a humanist, you can probably skip the whole section for the time being.

Required books

  • Ramsay Reading Machines.
  • TBD. One of Andrew Piper or Ted Underwood’s new books in literary history, probably.

Unit 1: Defining and transforming data

Thursday, January 10

Week 1: Introduction. What is (could be) Humanities Data Analysis?

Reading

  • (optionally) Debates in Digital Humanities 2016, forum on text analysis.

Practicum: regular expressions

  • Software: Please come to class having installed:
    1. The programs R and RStudio
    • RStudio is a wrapper program around the R language that we’ll be using for almost every assignment in this class, save this first week.

Problem set: Regex practice. Note: Regular expressions embody pretty much everything that is miserable, ugly, and inelegant about computer programming. But they’re basically indispensable for actually manipulating data in the real world. So we baptize by fire!
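To give a flavor of what a regular expression does before the problem set, here is a minimal sketch. It is written in Python only so that it stands alone (the class itself works in R), and the catalog-style strings and the `first_year` helper are invented for illustration, not taken from the assignment.

```python
import re

# Hypothetical messy imprint strings of the kind found in library catalogs.
records = ["Boston, 1851.", "London [1799?]", "n.d.", "New York: 1923, c1922"]

# \b is a word boundary; 1[5-9]\d\d matches a four-digit year from 1500 to 1999.
year_pattern = re.compile(r"\b1[5-9]\d\d\b")

def first_year(text):
    """Return the first plausible year in text as an int, or None if there is none."""
    match = year_pattern.search(text)
    return int(match.group()) if match else None

years = [first_year(record) for record in records]
print(years)
```

The pattern is ugly to read, which is the point of the warning above, but one line of it does work that would otherwise take a loop and several special cases.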

Week 2: turning Information %>% Data

Thursday, January 17

Reading

  • Unsworth: Knowledge Representation in Humanities Computing. http://people.brandeis.edu/~unsworth/KR/KRinHC.html
  • Rosenberg, “Data before the Fact”, in Gitelman "Raw Data" Is an Oxymoron.
  • “Becoming Digital,” from Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web.
  • Katie Rawson and Trevor Muñoz, “Against Cleaning”, Curating Menus, 7 July, 2016.
  • (Much more practical) Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2014

Practicum

  • In class: introduction to R: data types, data.frames, and manipulation; using functions. The pipe.

Problem set:

  • Census data manipulation with read_table and tidyr
  • Optional reach: use tidyr to manipulate a data set into a form for network representations.

Week 3: Exploratory Data Analysis = split %>% apply %>% combine

Thursday, January 24

A huge amount of work is just finding interesting things to count. Often, sophisticated work can just be figuring out how to count something new. Here we look a little bit at how you can simply count something.

Reading

  • Behrens “Principles and Procedures of Exploratory Data Analysis.”
  • Too long, too old, and too out of print to assign: but be aware of the granddaddy of them all: Tukey Exploratory Data Analysis.
  • Logan, T., & Parman, J. (2017). The National Rise in Residential Segregation. The Journal of Economic History, 77(1), 127-170. doi:10.1017/S0022050717000079. (We’re reading this for the general findings, not the methodology.)
  • Underwood, Bamman, and Lee “The Transformation of Gender in English-Language Fiction.”

Problem Set: Split/Apply/Combine

  • Skill: the split/apply/combine strategy.
  • Summary statistics and exploratory data analysis on the census files.
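The split/apply/combine strategy named above can be sketched in a few lines. This is an illustration under assumptions, not the problem set itself: it uses plain Python rather than the R tools we use in class, and the rows are made-up stand-ins for a census extract.

```python
from collections import defaultdict

# Hypothetical rows standing in for a census extract: (state, occupation, age).
rows = [
    ("MA", "farmer", 40),
    ("MA", "weaver", 30),
    ("NY", "farmer", 50),
    ("NY", "clerk", 20),
    ("NY", "farmer", 26),
]

def mean_age_by_state(rows):
    # Split: collect the ages belonging to each state.
    groups = defaultdict(list)
    for state, _occupation, age in rows:
        groups[state].append(age)
    # Apply: take the mean of each group.
    # Combine: gather the per-group results into one summary table.
    return {state: sum(ages) / len(ages) for state, ages in groups.items()}

summary = mean_age_by_state(rows)
print(summary)
```

Changing what you split on (occupation instead of state) or what you apply (a count instead of a mean) gives a different question with the same three-step shape.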

Doing

Do a bit of research to try to find some tabular data that you can bring to class about something you’re interested in.

Good data for these purposes should:

  1. Be multivariate or have some sensible way to make it have multiple columns.
  2. Be fairly big. (More than 100 things, for sure. More than 1,000 would be good).

Exception/special case: is there textual data you can work with?

Week 4: data %>% visualization

Thursday, January 31

  • Scan http://docs.ggplot2.org/current/, paying particular attention to the first section; the different “geoms”.
  • Find Microsoft Excel: compare this to their different “chart types.”
  • selfiecity.net
  • Klein “The Image of Absence.”
  • Drucker, “Humanities Approaches to Graphical Display” (Digital Humanities Quarterly, 2011)
  • Klein and D’Ignazio, Feminist Data Visualization. (Read the short paper; browse the reviewable book at MIT Press if it’s still online.)

Problem set: Visualization

  • R package: ggplot2
  • Multiple geoms in ggplot.

Some documentation is available at Wickham “Ggplot2.”

Penumbral:

  • Daston and Galison Objectivity, Chapter 7 (On the Sciences, today.)

Week 5: History Dept/NuLab event with “Uncivil” Hosts.

Reading postponed–keep working on the visualization task with the HathiTrust books.

Week 6: text %>% data

Thursday, February 14.

Week 7: data %>% the Embedding Strategy.

Thursday, February 21

Modern machine learning requires data, but that data doesn’t just look like an XML or TEI representation. Instead, a particular trick for turning items into strings of numbers, the embedding strategy, has emerged as the dominant way for computers to represent information to themselves. So we’ll talk about that strategy, and how to get things in and out of it.

  • Michael Gavin, Word Space article.
  • Schmidt “Stable Random Projection.” (Skip the first section about the algorithm, read starting with the overview visualization of Hathi Trust.)
  • Tukey, Friedman, and Fisherkeller: Introduction to Prim-9. This is on YouTube, or I have a local copy.
  • Allison et al. Quantitative Formalism.
  • Ryan Heuser, something, maybe just visualizations.

Problem Set.

  • Vector Space Models
  • Principal Components Analysis
  • Cosine similarity
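As a small illustration of the vector space idea behind this problem set, the sketch below treats each document as a list of word counts and compares documents by cosine similarity. It is plain Python rather than the R we use in class, and the vocabulary and counts are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the vocabulary: whale, ship, love, marriage.
doc_a = [10, 5, 1, 0]   # a short sea story
doc_b = [20, 10, 2, 0]  # a sea story twice as long, same proportions
doc_c = [0, 1, 8, 6]    # a courtship novel

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Because cosine similarity looks only at direction, the two sea stories come out essentially identical even though one is twice as long; that insensitivity to document length is a large part of why the measure is so common for text.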

Supplemental readings (technical explanations):

  • James An Introduction to Statistical Learning with Applications in R, chapters 10-10.2.

Week 8: (image, text, data) %>% embeddings

Thursday, February 28

  • Tanya Clement and Stephen McLaughlin, “Measured Applause: Toward a Cultural Analysis of Audio Collections,” CA: Journal of Cultural Analytics (23 May 2016)
  • Melvin Wevers, Tomas Smits. “The visual digital turn: Using neural networks to study historical images”. https://doi.org/10.1093/llc/fqy085. 18 January 2019
  • PixPlot, Yale DH Lab: http://dhlab.yale.edu/projects/pixplot/
  • CS paper: read for a sense of possibilities, not method. The Shape of Art History in the Eyes of the Machine: Ahmed Elgammal, Marian Mazzone, Bingchen Liu, Diana Kim, Mohamed Elhoseiny. https://arxiv.org/abs/1801.07729

(For background, you could read this generally useful, slightly hyperventilating introduction to neural networks from the NY Times)

Week 8: Supervised Learning and predictive models

Thursday, March 14

Problem sets:

  • Naive Bayes
  • Logit classifiers
  • K-nearest-neighbor classification.

Methods:

  • James An Introduction to Statistical Learning with Applications in R, Chapter 4.
  • Chapter 8, Decision trees, particularly 8.1: optional.
  • Reach: Support Vector Machines, James Chapter 9. Only for the foolhardy.

Week 9: Unsupervised clustering

Thursday, March 21

  • Blei, “Probabilistic Topic Models.”
  • Goldstone and Underwood “The Quiet Transformations of Literary Studies.” and online supplement
  • Ted Underwood, “Topic Modeling Made Just Simple Enough,” April 7, 2012
  • Rhody “Topic Modeling and Figurative Language.”

Problem sets and methods:

  • K-means Clustering.
  • Hierarchical Clustering.
  • Topic Modeling using the R-Mallet package.

Week 10: space %>% data

Thursday, March 28

  • Richard White “What is Spatial History?” http://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
  • Wilkens “The Geographic Imagination of Civil War-Era American Fiction.”
  • Blevins “Space, Nation, and the Triumph of Region.”
  • Present: Colleen on Philip Ethington’s “Placing the past: ‘Groundwork’ for a spatial theory of history.”

Problem Set: Geographic binning and visualization.

Week 11:

Thursday, April 4

  • Elliott Ash, Daniel L. Chen, Suresh Naidu. “Ideas Have Consequences: The Impact of Law and Economics on American Justice.” Working paper: http://elliottash.com/wp-content/uploads/2018/08/ash-chen-naidu-2018-07-15.pdf
  • Choose 4 articles from the Current Research in Digital History forum/journal. Read them, and come prepared to talk about 2 in particular. How do they work? What do they succeed at?

Week 12:

Thursday, April 11

Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1). Stanford: Stanford Literary Lab, n.d.

Behrens, John T. “Principles and Procedures of Exploratory Data Analysis.” Psychological Methods 2, no. 2 (1997): 131. http://psycnet.apa.org/journals/met/2/2/131/.

Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal of American History 101, no. 1 (2014): 122–147. doi:10.1093/jahist/jau184.

Daston, Lorraine, and Peter Galison. Objectivity. New York: Zone Books; distributed by the MIT Press, 2007.

Gitelman, Lisa. "Raw Data" Is an Oxymoron. Infrastructures Series. Cambridge, Massachusetts: The MIT Press, 2013.

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45, no. 3 (2014): 359–384. doi:10.1353/nlh.2014.0025.

James, Gareth. An Introduction to Statistical Learning with Applications in R, 2013. http://dx.doi.org/10.1007/978-1-4614-7138-7.

Jockers, Matt. Text Analysis with R for Students of Literature. Springer, 2014. http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7.

Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 85, no. 4: 661–688. Accessed January 14, 2015. doi:10.1215/00029831-2367310.

Mosteller, Frederick, and David L. Wallace. “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers.” Journal of the American Statistical Association 58, no. 302 (1963): 275–309. http://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500849.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Topics in the Digital Humanities. Urbana: University of Illinois Press, 2011.

Rhody, Lisa M. “Topic Modeling and Figurative Language.” Journal of Digital Humanities 2, no. 1 (April 7, 2013). http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/.

Schmidt, Benjamin. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics (2018). doi:10.22148/16.025.

Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics (2018). doi:10.22148/16.019.

Wickham, Hadley. “Ggplot2.” Wiley Interdisciplinary Reviews: Computational Statistics 3, no. 2 (2011): 180–185. doi:10.1002/wics.147.

Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History 25, no. 4: 803–840. Accessed January 15, 2015. doi:10.1093/alh/ajt045.

Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.


  1. The algorithms we will discuss in the second half of the semester are covered at greater length in this text. If you wish to come to a more mathematical understanding, this provides a relatively gentle introduction in machine-learning terms, also based in the R language, but with some levels of math we’ll gloss over in this class. All chapters are available for download, for free, from the Northeastern library; download any now that you find helpful.

  2. For those interested solely in text analysis and not census, bibliographic, or other forms of “humanities data,” this may be valuable. But be aware it uses a different set of libraries and data models for visualization and analysis than the ones we are using in this class, so the code is unlikely to work immediately.