I’m updating this from the 2015 syllabus, but things have changed a lot in this field, and so will the syllabus before we start. I’ve also stolen a lot from Ryan Cordell’s 2017 offering of this course.

Notes: in this course we’ll mostly be reading articles that are available online. A few books are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.

In week 1, you’ll read my spiel about what humanists need to understand when they read CS. My answer, in general: you need to know what they did, but not how they did it. I’ve put some CS papers in this syllabus to expand your thinking about what’s possible. You should absolutely, positively not aim to understand the process. As a rule, if you see a fancy equation in an article not written by a humanist, you can probably skip the whole section for the time being.

Required books

Ramsay Reading Machines.
TBD. One of Andrew Piper or Ted Underwood’s new books in literary history, probably.

Recommended supplementary materials (free online)

James An Introduction to Statistical Learning with Applications in R.¹

Jockers Text Analysis with R for Students of Literature.²

Unit 1: Defining and transforming data

Thursday, January 10

Week 1: Introduction. What is (could be) Humanities Data Analysis?

Reading

(optionally) Debates in Digital Humanities 2016, forum on text analysis.

Practicum: regular expressions

Software: Please come to class having installed:
1. The programs R and RStudio
- RStudio is a wrapper program around the R language, which we’ll be using for almost every assignment in this class (save this first week).

Problem set: Regex practice. Note: Regular expressions embody pretty much everything that is miserable, ugly, and inelegant about computer programming. But they’re basically indispensable for actually manipulating data in the real world. So we baptize by fire!
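To give a taste of what that practice looks like, here is a minimal sketch. It is written in Python only so that it stands alone before R is set up; the sample sentence and both patterns are invented for illustration and are not taken from the problem set.

```python
import re

# Hypothetical sample text, made up for this example.
text = "Lincoln spoke in 1861 and again in 1865; Grant followed in 1869."

# \b is a word boundary; 18\d{2} is "18" followed by exactly two digits.
year_pattern = re.compile(r"\b18\d{2}\b")
years = year_pattern.findall(text)
print(years)

# Parentheses capture pieces of a match: here, a capitalized word and the
# first "in <year>" that follows it within the same clause.
pairs = re.findall(r"([A-Z][a-z]+)[^;.]*?in (18\d{2})", text)
print(pairs)
```

The second pattern is already hard to read, which is rather the point: short regular expressions do a lot of work, and they get ugly quickly.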

Week 2: turning Information %>% Data

Thursday, January 17

Reading

Unsworth: Knowledge Representation in Humanities Computing. http://people.brandeis.edu/~unsworth/KR/KRinHC.html
Rosenberg, “Data before the Fact”, in Gitelman "Raw Data" Is an Oxymoron.
“Becoming Digital,” from Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web.
Katie Rawson and Trevor Muñoz, “Against Cleaning”, Curating Menus, 7 July, 2016.
(Much more practical) Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2014.

Practicum

In class: introduction to R: data types, data.frames, and manipulation; using functions. The pipe.

Problem set:

Census data manipulation with read_table and tidyr
Optional reach: use tidyr to manipulate a data set into a form for network representations.

Week 3: Exploratory Data Analysis = split `%>%` apply `%>%` combine

Thursday, January 24

A huge amount of work is just finding interesting things to count. Often, sophisticated work can simply be figuring out how to count something new. Here we look a little bit at how you can simply count something.

Reading

Behrens “Principles and Procedures of Exploratory Data Analysis.”
Too long, too old, and too out of print to assign: but be aware of the granddaddy of them all: Tukey Exploratory Data Analysis.
Logan, T., & Parman, J. (2017). The National Rise in Residential Segregation. The Journal of Economic History, 77(1), 127-170. doi:10.1017/S0022050717000079. (We’re reading this for the general findings, not the methodology.)
Underwood, Bamman, and Lee “The Transformation of Gender in English-Language Fiction.”

Problem Set: Split/Apply/Combine

Skill: the split/apply/combine strategy.
Summary statistics and exploratory data analysis on the census files.
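The three steps of split/apply/combine can be sketched in a few lines. In class we’ll do this in R with the pipe; the version below is plain Python so that each step is spelled out, and the rows are invented toy values, not real census figures.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for a census extract (all values invented).
rows = [
    {"region": "North", "town": "A", "population": 100},
    {"region": "North", "town": "B", "population": 300},
    {"region": "South", "town": "C", "population": 50},
    {"region": "South", "town": "D", "population": 150},
    {"region": "South", "town": "E", "population": 700},
]

# Split: bucket the rows by the grouping variable.
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row["population"])

# Apply: summarize each bucket on its own.
# Combine: collect the per-group summaries back into a single table.
summary = [
    {"region": region, "n": len(pops), "mean_population": mean(pops)}
    for region, pops in sorted(groups.items())
]

for line in summary:
    print(line)
```

Most of the interesting decisions are in the split: choosing what to group by is choosing what to count.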

Doing

Do a bit of research to try to find some tabular data that you can bring to class about something you’re interested in.

Good data for these purposes should:

Be multivariate, or have some sensible way to give it multiple columns.
Be fairly big. (More than 100 things, for sure. More than 1,000 would be good.)

Exception/special case: is there textual data you can work with?

Week 4: data `%>%` visualization

Thursday, January 31

Scan http://docs.ggplot2.org/current/, paying particular attention to the first section: the different “geoms.”
Find Microsoft Excel and compare the geoms to its different “chart types.”
selfiecity.net
Klein “The Image of Absence.”
Drucker, “Humanities Approaches to Graphical Display” (Digital Humanities Quarterly, 2011)
Klein and D’Ignazio, Feminist Data Visualization. (Read the short paper; browse the reviewable book at MIT Press if it’s still online.)

Problem set: Visualization

R package: ggplot2
Multiple geoms in ggplot.

Some documentation is available at Wickham “Ggplot2.”

Penumbral:

Daston and Galison, Objectivity, Chapter 7 (on the sciences today).

Week 5: History Dept/NuLab event with “Uncivil” Hosts.

Reading postponed; keep working on the visualization task with the HathiTrust books.

Week 6: text `%>%` data

Thursday, February 14.

Daniel Shore, “Shakespeare’s Constructicon,” Shakespeare Quarterly, 2015.
Andrew Piper, Enumerations, 2018. Introduction and Chapter 3.
Julia Silge and David Robinson, Text Mining with R: A Tidy Approach. Read online at https://www.tidytextmining.com. Preface and Chapters 1-4. This is technical, so think about using it with some corpus of texts of interest to you.
Stephen Ramsay, “The Hermeneutics of Screwing Around,” 2010. https://libraries.uh.edu/wp-content/uploads/Ramsay-The-Hermeneutics-of-Screwing-Around.pdf.
Reread because we didn’t talk about it enough: Witmore “Text.”

Week 7: data `%>%` the Embedding Strategy.

Thursday, February 21

Modern machine learning requires data, but that data doesn’t just look like an XML or TEI representation. Instead, a particular trick for turning items into strings of numbers, the embedding strategy, has emerged as the dominant way for computers to represent information to themselves. So we’ll talk about that strategy, and how to get things in and out of it.

Michael Gavin, Word Space article.
Schmidt “Stable Random Projection.” (Skip the first section about the algorithm, read starting with the overview visualization of Hathi Trust.)
Tukey, Friedman, and Fisherkeller: Introduction to Prim-9. This is on YouTube, or I have a local copy.
Allison et al. Quantitative Formalism.
Ryan Heuser, something, maybe just visualizations.

Problem Set.

Vector Space Models
Principal Components Analysis
Cosine similarity
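As a rough sketch of how the first and last of these fit together: a vector space model turns each text into a list of numbers (here, simple word counts), and cosine similarity measures the angle between two such lists. The problem set itself is in R; this Python version uses only the standard library, and the three tiny “documents” are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, invented for illustration.
docs = {
    "doc1": "the whale and the sea",
    "doc2": "the sea and the whale",
    "doc3": "a garden party",
}

# The "embedding" here is just a bag of words: one dimension per word.
vectors = {name: Counter(text.split()) for name, text in docs.items()}

sim_12 = cosine_similarity(vectors["doc1"], vectors["doc2"])
sim_13 = cosine_similarity(vectors["doc1"], vectors["doc3"])
print(sim_12, sim_13)
```

Because word order is thrown away, the first two documents come out as essentially identical, while the third shares no words with them and scores zero. That loss of order is a design choice of the bag-of-words representation, not of cosine similarity itself.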

Supplemental readings (technical explanations):

James An Introduction to Statistical Learning with Applications in R, chapters 10-10.2.

Week 8: (image, text, data) %>% embeddings

Thursday, February 28

Tanya Clement and Stephen McLaughlin, “Measured Applause: Toward a Cultural Analysis of Audio Collections,” CA: Journal of Cultural Analytics (23 May 2016)
Melvin Wevers, Tomas Smits. “The visual digital turn: Using neural networks to study historical images”. https://doi.org/10.1093/llc/fqy085. 18 January 2019
PixPlot, Yale DH Lab: http://dhlab.yale.edu/projects/pixplot/
CS paper: read for a sense of possibilities, not method. The Shape of Art History in the Eyes of the Machine: Ahmed Elgammal, Marian Mazzone, Bingchen Liu, Diana Kim, Mohamed Elhoseiny. https://arxiv.org/abs/1801.07729

(For background, you could read this generally useful, slightly hyperventilating introduction to neural networks from the NY Times)

Week 8: Supervised Learning and predictive models

Thursday, March 14

Mosteller and Wallace “Inference in an Authorship Problem.”
Lena Hettinger, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho “Genre Classification on German Novels” (2015)
Ted Underwood, “Distant reading and the blurry edges of genre”
Ted Underwood, Understanding Genre in a Collection of a Million Volumes, Interim report
Maybe instead Ted Underwood, Distant Horizons; not sure how the publication date will align here.

Problem sets:

Naive Bayes
Logit classifiers
K-nearest-neighbor classification.

Methods:

James An Introduction to Statistical Learning with Applications in R, Chapter 4.
Chapter 8, Decision trees, particularly 8.1: optional.
Reach: Support Vector Machines, James Chapter 9. Only for the foolhardy.

Week 9: Unsupervised clustering

Thursday, March 21

Blei, “Probabilistic Topic Models.”
Goldstone and Underwood “The Quiet Transformations of Literary Studies.” and online supplement
Ted Underwood, “Topic Modeling Made Just Simple Enough,” April 7, 2012
Rhody “Topic Modeling and Figurative Language.”

Problem sets and methods:

K-means Clustering.
Hierarchical Clustering.
Topic Modeling using the R-Mallet package.

Week 10: space %>% data

Thursday, March 28

Richard White “What is Spatial History?” http://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
Wilkens “The Geographic Imagination of Civil War-Era American Fiction.”
Blevins “Space, Nation, and the Triumph of Region.”
Present: Colleen on Philip Ethington’s “Placing the past: ‘Groundwork’ for a spatial theory of history.”

Problem Set: Geographic binning and visualization.

Week 11:

Thursday, April 4

Elliott Ash, Daniel L. Chen, Suresh Naidu. “Ideas Have Consequences: The Impact of Law and Economics on American Justice.” Working paper: http://elliottash.com/wp-content/uploads/2018/08/ash-chen-naidu-2018-07-15.pdf
Choose 4 articles from the Current Research in Digital History forum/journal. Read them, and come prepared to talk about 2 in particular. How do they work? What do they succeed at?

Week 12:

Thursday, April 11

Roopika Risam, New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy (Northwestern University Press, 2019). Chapter 5. Available online: https://www-jstor-org.ezproxy.neu.edu/stable/j.ctv7tq4hg.9
Lauren Tilton and Taylor Arnold, “Distant Viewing,” Digital Scholarship in the Humanities, 2019. https://www.distantviewing.org/pdf/distant-viewing.pdf
Daniel Rodgers, Age of Fracture, Chapter 1: Losing the words of the Cold War. Available online: https://ebookcentral.proquest.com/lib/northeastern-ebooks/detail.action?docID=3300914 With this, think about all the ways we’ve looked at the language in State of the Union addresses, and think, as specifically as possible, about what digital-textual methods could bring to this problem.

Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1). Stanford: Stanford Literary Lab, n.d.

Behrens, John T. “Principles and Procedures of Exploratory Data Analysis.” Psychological Methods 2, no. 2 (1997): 131. http://psycnet.apa.org/journals/met/2/2/131/.

Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal of American History 101, no. 1 (2014): 122–147. doi:10.1093/jahist/jau184.

Daston, Lorraine, and Peter Galison. Objectivity. New York: Zone Books ; Distributed by the MIT Press, 2007.

Gitelman, Lisa. "Raw Data" Is an Oxymoron. Infrastructures Series. Cambridge, Massachusetts: The MIT Press, 2013.

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45, no. 3 (2014): 359–384. doi:10.1353/nlh.2014.0025.

James, Gareth. An Introduction to Statistical Learning with Applications in R, 2013. http://dx.doi.org/10.1007/978-1-4614-7138-7.

Jockers, Matt. Text Analysis with R for Students of Literature. Springer, 2014. http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7.

Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 85, no. 4: 661–688. Accessed January 14, 2015. doi:10.1215/00029831-2367310.

Mosteller, Frederick, and David L. Wallace. “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers.” Journal of the American Statistical Association 58, no. 302 (1963): 275–309. http://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500849.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Topics in the Digital Humanities. Urbana: University of Illinois Press, 2011.

Rhody, Lisa M. “Topic Modeling and Figurative Language.” Journal of Digital Humanities 2, no. 1 (April 7, 2013). http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/.

Schmidt, Benjamin. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics (2018). doi:10.22148/16.025.

Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics (2018). doi:10.22148/16.019.

Wickham, Hadley. “Ggplot2.” Wiley Interdisciplinary Reviews: Computational Statistics 3, no. 2 (2011): 180–185. doi:10.1002/wics.147.

Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History 25, no. 4: 803–840. Accessed January 15, 2015. doi:10.1093/alh/ajt045.

Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.

¹ The algorithms we will discuss in the second half of the semester are discussed at greater length in this text. If you wish to come to a more mathematical understanding, it provides a relatively gentle introduction in machine-learning terms, also based in the R language, though with some levels of math we’ll gloss over in this class. All chapters are available for download, for free, from the Northeastern library; download any now that you find helpful.
² For those interested solely in text analysis and not census, bibliographic, or other forms of “humanities data,” this may be valuable. But be aware that it uses a different set of libraries and data models for visualization and analysis than the ones we are using in this class, so the code is unlikely to work immediately.
