Schedule

Most of the readings in this course are articles available online. None are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.

In week 1, you’ll read my spiel about what humanists need to understand when they read CS. My answer, in general, is that you need to know what they did, but not how they did it. I’ve put some CS papers in this syllabus to expand your thinking about what’s possible. You should absolutely, positively, not aim to understand the process in a CS paper, or worry about all the notation in the economics or political science papers. As a rule, if you see a fancy equation in an article not written by a humanist, you can probably skip the whole section for the time being.

There are two required texts, both online. One is my text for this course at http://hdf.benschmidt.org/R, which includes technical materials and programming code; the other is Data Feminism, by Catherine D’Ignazio and Lauren Klein (MIT Press, 2019). It’s available open-access online at https://data-feminism.mitpress.mit.edu/, but you can also purchase a copy if you like.

The textbook sections include programming exercises, which should be submitted as Word, PDF, or HTML files by the Thursday after class.

Following Nick Montfort’s practice in Exploratory Programming for the Arts and Humanities, each of the problems ends with a “free exercise.” I encourage you to post at least 6 of these to the course Slack over the semester; it would also make sense to do these cumulatively as a blog series, Twitter thread, etc.

Defining and transforming

Mon, Jan 24

Introductions

Due Fri, Jan 28: Install R on your computer or, if you want to use a different programming language, meet with me to discuss.

The programs R and RStudio: RStudio is a wrapper program around the R language that we’ll be using for almost every assignment.

Mon, Jan 31

What is (could be) Humanities Data Analysis?

Readings

Jean-Baptiste Michel et al. “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science (New York, N.Y.) 331, no. 6014 (January 14, 2011): 176–82, https://doi.org/10.1126/science.1199644.
Johanna Drucker, “Humanities Approaches to Graphical Display,” Digital Humanities Quarterly 5, no. 1 (2011), http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.
Gergely Baics, Wright Kennedy, Rebecca Kobrin, Laura Kurgan, Leah Meisterlin, Dan Miller, Mae Ngai. Mapping Historical New York: A Digital Atlas. New York, NY: Columbia University. 2021. https://mappinghny.com

Online text

Introduction/The Gift of Data (https://hdf.benschmidt.org/R/introduction#the-gift-of-data)
Working in a Programming Language
Introduction to Data. Merely skim the parts about regular expressions; this is something to come back to when you need it.

Due Mon, Jan 31: Choose two datasets to discuss in class that are relevant to your research interests, so far as you’re able to find them.

One should be something that you can actually download, almost certainly in the form of a CSV or Excel File.

The other should be something that you know exists, but that you might not be fully able to work with yet.

For both of them, fill out the online spreadsheet. (That should be editable from your .nyu address.) The goal here is to reduce this to a tabular dataset. Describe what each of the fields in this dataset would be. (No more than 10 fields per dataset).

Do not describe the dataset as a whole aside from the columns–see if you can capture it in the individual elements.

Due Mon, Jan 31: Install the course package inside your own copy of RStudio, as described at the end of the “Working in a Programming Language” chapter. We’ll talk through any problems in class.

Due Thu, Feb 03: Finish the exercises for “Introduction to Data” after installing R and RStudio. You will be done when you can ‘knit’ your file to a Microsoft Word document, HTML page, or PDF. (Note that PDFs can be a bit harder.)

Mon, Feb 07

Information %>% Data

guest: Nicholas Wolf, NYU Libraries

Readings

Klein and D’Ignazio, chapters 1 and 2.
(Much more practical) Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2015
Browse online: New York City Directories.

Online text

The Data Table
Counting Things
Please report issues, and use the “Troubleshooting Guide.”

agenda: Class agenda

Mon, Feb 14

Data Visualization

Readings

Jacques Bertin, Semiology of Graphics: Diagrams, Networks, Maps, selections. (Shared over Slack)
D’Ignazio and Klein, Data Feminism, Chapter 3: On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints.
Read https://ggplot2.tidyverse.org/reference/, paying particular attention to the first section: the different “geoms.”
Browse the Vega Gallery; this is the equivalent of ggplot2 in JavaScript (and also widely used in Python).
“Streetscapes: Mozart, Marx, and a Dictator.” Die Zeit online, 2018-02-13.

Online text

Visualizing Data

Mon, Feb 21

No class: President’s Day

Due Thu, Feb 24: Visualizing data, exercises.

Mon, Feb 28

Counting, grouping, and accounting for how only things that get counted count.

description: A huge amount of work is just about finding interesting things to count. Often, sophisticated work can just be figuring out how to count something new. Here we look a little at how you can simply count something.

Readings

Trevon D. Logan and John M. Parman, “The National Rise in Residential Segregation,” The Journal of Economic History 77, no. 1 (March 2017): 127–70, https://doi.org/10.1017/S0022050717000079. As with all econ in this class, read for the general findings and data, not the methodology.
Ted Underwood, David Bamman, and Sabrina Lee “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019. (This is basically chapter 4 of the other Underwood, but I like the order in the article version better for this week.)
D’Ignazio and Klein, Data Feminism, Chapter 4. ‘What Gets Counted Counts’

Online text

Cleaning Data

Mon, Mar 07

Data modeling and data merging.

Readings

Melanie Walsh and Maria Antoniak, “The Goodreads ‘Classics’: A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism,” Cultural Analytics, 2021.
Tim Berners-Lee, “The Semantic Web,” Scientific American 2001.

Online text

Combining datasets: Merges, joins, and standards.

Due Thu, Mar 10: Problems for Combining datasets: Merges, joins, and standards.

Due Fri, Mar 11: From practicum: place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy with one and serious with the other. You may use text fields if you want.

Mon, Mar 14

No class: Spring Break

Texts, maps, and data

Mon, Mar 21

Text as Data, 1

Readings

Gentzkow et al., “Text as Data,” Journal of Economic Literature, https://doi.org/10.1257/jel.20181020 (Skim only.)
Klein and D’Ignazio, Unicorns, Janitors, Ninjas, Wizards, and Rock Stars.
Julia Silge and David Robinson, Text Mining with R: A Tidy Approach. Read online at https://www.tidytextmining.com. Preface and Chapters 1-4. If you have no interest in analyzing text, you can skip this.

practicum for next class: “Texts as Data, exercises.”

Due Fri, Mar 25: Place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy with one and serious with the other. You may use text fields for the join.

Mon, Mar 28

Text as Data, 2

description: Once text is data, you can explore and reconfigure it. There are a variety of ways to do this.

Readings

Stephen Ramsay, “The Hermeneutics of Screwing Around,” 2010. https://libraries.uh.edu/wp-content/uploads/Ramsay-The-Hermeneutics-of-Screwing-Around.pdf.
Michael Witmore “Text: A Massively Addressable Object,” December 31, 2010, http://winedarksea.org/?p=926.

Online text for this class session

Chapter 9.1, The Variable Document Model
Chapter 9.2, Getting Data.
Chapter 10.1, Three Metrics

agenda: Class agenda

Due Thu, Mar 31: Chapter 9, Exercises. (Not chapter 10, which is due in three weeks.)

Mon, Apr 04

Space as Data

Readings

Klein and D’Ignazio, The Numbers Don’t Speak for Themselves
C. Blevins, “Space, Nation, and the Triumph of Region: A View of the World from Houston,” Journal of American History 101, no. 1 (2014): 122–47, https://doi.org/10.1093/jahist/jau184.
Anbinder et al, Networks and Opportunities: A Digital History of Ireland’s Great Famine Refugees in New York. American Historical Review, 2019. Be sure to spend a good amount of time in the online map as well as the printed article.
Review the New York City Atlas.
Pleiades project site. Browse in general, read https://pleiades.stoa.org/help/data-structure, and think about types of uncertainty/ambiguity in geovisualization.

Online text for this class

Space as Data (complete).
If you are interested in mapping, browse the TOC for Lovelace et al, Geocomputation in R. Note that Lovelace uses the tmap package for mapping while we stick to ggplot2 with the spatial geometries function geom_sf. If you really want to make–say–a zoomable map, you may want to explore tmap on your own.

The algorithmic toolkit for exploring humanities datasets.

Mon, Apr 11

Thinking statistically

Readings

Klein and D’Ignazio, Chapter 6. https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/3?readingCollection=0cd867ef

Due Mon, Apr 11: Identify data/datasets you’ll be working with for the rest of the class

Due Thu, Apr 14: Problems for thinking statistically and Dunning Log-Likelihood comparisons.

Due Fri, Apr 15: Go back to a problem from a problem set that you have already done, and use bootstrap sampling to estimate the uncertainty.

Mon, Apr 18

Supervised Learning and Predictive Models

note: From this point on, the weekly readings and topics are about specific applications of algorithms to different types of problems. To this point, everything we’ve done has been foundational–from here on out, it’s more about specific applications that you can do if you want, but don’t necessarily need to.

Readings

Klein and D’Ignazio, Chapter 7. https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/3?readingCollection=0cd867ef
Underwood, Predicting Genre.

online text: Classification

Mon, Apr 25

Clustering, topic modeling, and unsupervised approaches

Readings

Sarah Allison et al., “Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1)” (Stanford: Stanford Literary Lab, January 15, 2011).

agenda: Class agenda

Mon, May 02

The Embedding Strategy and representation learning.

description: Modern machine learning requires data, but that data doesn’t just look like an XML or TEI representation. Instead, a particular trick for turning items into strings of numbers, the embedding strategy, has emerged as the dominant way for computers to represent information to themselves.

Readings

Tukey, Friedman, and Fisherkeller: Introduction to Prim-9. This is on YouTube, or I have a local copy.
Ryan Heuser, Abstraction: A Literary History
Sandeep Soni, Lauren F. Klein, and Jacob Eisenstein. ‘Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers’. Cultural Analytics January 18, 2021.
If you want to exist in this modern world, you should know something about deep learning. The single central text is Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep Learning,” Nature 521, no. 7553 (May 2015): 436–44, https://doi.org/10.1038/nature14539. It’s hard to understand, but we’ll take a stab. Come with questions, not understanding.
PixPlot, Yale DH Lab: http://dhlab.yale.edu/projects/pixplot/

Assignment for this class

Submit a draft piece about New York City or your own dataset for peer review.

Online text

Vector Space Models, Principal Components Analysis, and similarity.

Mon, May 09

Going deep

Readings

Lauren Tilton and Taylor Arnold, “Distant Viewing,” Digital Scholarship in the Humanities, 2019. https://www.distantviewing.org/pdf/distant-viewing.pdf
Melvin Wevers, Tomas Smits. “The visual digital turn: Using neural networks to study historical images.” https://doi.org/10.1093/llc/fqy085. 18 January 2019
Bengio, LeCun, and Hinton, “Deep Learning”
Online Text, “Basic Neural Networks with Keras”

agenda: Class agenda

Agenda Notes

Notes for Mon, Feb 07

Rstudio installation and debug issues. What are packages, etc.
Any Python holdouts?
It’s fine to use Jupyter instead of RStudio if you prefer, but you will need to install locally, because there are too many dependencies to re-download to Google Colab each time.
Drucker and Michel.
Polar opposites, so I find it helpful to find out which one you all find more amenable.
The question of where data comes from. Google Ngrams.
Issues of representation and the gift of data.
- Wild ways of thinking about datasets.
Your datasets
A new section: Ontologies are formal languages for particular domains.
Categorical fields.
Introduction to Counting.

Notes for Mon, Mar 28

“Collaboration?”
Finding Texts–pushing mostly to Wednesday
- Tokenization alternatives.
“Discuss Shore and Ramsay–can we have fun?”
“Discuss Witmore and free discussion of problems that can be approached as different sets of documents.”

Notes for Mon, Apr 25

Talk about Allison and about Underwood. Distinguishing clustering and classification.
Walk through of the vector space model concept in R.
Pointing towards how to do classification in R.
Walk through of basic clustering strategies.

Notes for Mon, May 09

General check-in
What can Deep Learning do?
What can Deep Learning do for you?
What about Word Embeddings?
What about different forms of storytelling?

Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. “Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1).” Stanford: Stanford Literary Lab, January 15, 2011.

Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal of American History 101, no. 1 (2014): 122–47. https://doi.org/10.1093/jahist/jau184.

Daston, Lorraine, and Peter Galison. Objectivity. New York; Cambridge, Mass.: Zone Books ; Distributed by the MIT Press, 2007.

Drucker, Johanna. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5, no. 1 (2011). http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning.” Nature 521, no. 7553 (May 2015): 436–44. https://doi.org/10.1038/nature14539.

Logan, Trevon D., and John M. Parman. “The National Rise in Residential Segregation.” The Journal of Economic History 77, no. 1 (March 2017): 127–70. https://doi.org/10.1017/S0022050717000079.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (New York, N.Y.) 331, no. 6014 (January 14, 2011): 176–82. https://doi.org/10.1126/science.1199644.

Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.

Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.