Most of the readings in this course are articles available online. None are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.
In week 1, you’ll read my spiel about what humanists need to understand when they read CS. My answer is, in general: you need to know what they did, but not how they did it. I’ve put some CS papers in this syllabus to expand your thinking about what’s possible. You should absolutely, positively, not aim to understand the process in a CS paper, or worry about all the notation in the economics or political science papers. As a rule, if you see a fancy equation in an article not written by a humanist, you can probably skip the whole section for the time being.
There are two required texts, both online. One is my text for this course at http://hdf.benschmidt.org/R, which includes technical materials and programming code; the other is Data Feminism, by Catherine D’Ignazio and Lauren Klein, MIT Press 2019. It’s available open-access online at https://data-feminism.mitpress.mit.edu/, but you can also purchase a copy if you like.
The textbook sections include programming exercises, which should be submitted as a Word, PDF, or HTML file on the Thursday after class.
Following Nick Montfort’s practice in Exploratory Programming for the Arts and Humanities, each of the problems ends with a “free exercise.” I encourage you to post at least 6 of these to the course Slack over the semester; it would also make sense to do these cumulatively as a blog series, Twitter thread, etc.
Introductions
Due Fri, Jan 28: Install R on your computer, or, if you want to use a different programming language, meet with me to discuss.
The programs R and RStudio: RStudio is a wrapper program around the R language that we’ll be using for almost every assignment.
What is (could be) Humanities Data Analysis?
Readings
Online text
Due Mon, Jan 31: Choose two datasets to discuss in class that are relevant to your research interests, so far as you’re able to find them.
One should be something that you can actually download, almost certainly in the form of a CSV or Excel File.
The other should be something that you know exists, but that you might not be fully able to work with yet.
For both of them, fill out the online spreadsheet. (That should be editable from your .nyu address.) The goal here is to reduce each one to a tabular dataset. Describe what each of the fields in this dataset would be. (No more than 10 fields per dataset.)
Do not describe the dataset as a whole aside from the columns; see if you can capture it in the individual elements.
Due Mon, Jan 31: Install the course package inside your own copy of RStudio, as described at the end of the “Working in a Programming Language” chapter. We’ll talk through any problems in class.
Due Thu, Feb 03: Finish the exercises for “Introduction to Data,” which depend on having R and RStudio installed. You are done when you can ‘knit’ your file to a Microsoft Word document, HTML page, or PDF. (Note that PDFs can be a bit harder.)
Information %>% Data
guest: Nicholas Wolf, NYU Libraries
Readings
Online text
agenda: Class agenda
Data Visualization
Readings
Online text
Related texts not to read
Due Mon, Feb 14: ON PAPER, draw a speculative visualization of the New York City directories data and try to describe it in terms of the geoms and marks in Bertin and the ggplot docs. Bring it to class.
Due Wed, Feb 16: Counting Things, exercises
No class: President’s Day
Due Thu, Feb 24: Visualizing data, exercises.
Counting, grouping, and accounting for how only things that get counted count.
description: A huge amount of work is just about finding interesting things to count. Often, sophisticated work can consist simply of figuring out how to count something new. Here we look a little bit at how you can simply count something.
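To make “simply counting something” concrete, here is a minimal sketch of grouping records by a field and counting each group. The course itself works in R; this version uses plain Python so that it runs with no packages installed. The records and the `count_by` helper are invented for illustration and are not part of the course materials.

```python
from collections import Counter

# Hypothetical records, invented for illustration: (occupation, borough) pairs
# of the kind one might extract from a city directory.
records = [
    ("clerk", "Manhattan"),
    ("baker", "Brooklyn"),
    ("clerk", "Brooklyn"),
    ("tailor", "Manhattan"),
    ("clerk", "Manhattan"),
]


def count_by(rows, field_index):
    """Group rows by the value in one field and count the rows in each group."""
    return Counter(row[field_index] for row in rows)


occupation_counts = count_by(records, 0)   # counts per occupation
borough_counts = count_by(records, 1)      # counts per borough
pair_counts = Counter(records)             # counts per (occupation, borough) pair

# List occupations from most to least frequent.
print(occupation_counts.most_common())
```

The interesting decision is rarely the counting itself but the choice of what to group by: the same five rows give different stories when counted by occupation, by borough, or by the pair.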
Readings
Online text
Related texts not to read
Practicum for next class
Data modeling and data merging.
Readings
Melanie Walsh and Maria Antoniak, The Goodreads “Classics”: A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism, Cultural Analytics, 2021.
Tim Berners-Lee, “The Semantic Web,” Scientific American 2001.
Online text
Due Thu, Mar 10: Problems for Combining datasets: Merges, joins, and standards.
Due Fri, Mar 11: From practicum: place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy on one and serious with the other. You may use text fields if you want.
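For anyone unsure what a join between two datasets involves, the following is a rough sketch of an inner join on a shared key. The course does this in R; the sketch is in plain Python only so that it is self-contained. The two tables, their field names, and the `inner_join` helper are all hypothetical.

```python
# Hypothetical tables, invented for illustration.
books = [
    {"title": "Emma", "author_id": 1},
    {"title": "Persuasion", "author_id": 1},
    {"title": "Middlemarch", "author_id": 2},
    {"title": "Anonymous Pamphlet", "author_id": 99},  # no matching author
]
authors = [
    {"author_id": 1, "name": "Jane Austen", "born": 1775},
    {"author_id": 2, "name": "George Eliot", "born": 1819},
]


def inner_join(left, right, key):
    """Return one merged row for every pair of rows that share a key value."""
    # Index the right-hand table by key so each lookup is fast.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        # Rows with no match in the right-hand table are dropped (inner join).
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


joined = inner_join(books, authors, "author_id")
for row in joined:
    print(row["title"], "-", row["name"], row["born"])
```

Note that the pamphlet silently disappears from the result. Checking what a join drops is often as informative as what it keeps.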
No class: Spring Break
Text as Data, 1
Readings
practicum for next class: “Texts as Data, exercises.”
Due Fri, Mar 25: Place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy on one and serious with the other. You may use text fields for the join.
Text as Data, 2
description: Once text is data, you can explore and reconfigure it. There are a variety of ways to do this.
Readings
Online text for this class session
agenda: Class agenda
Due Thu, Mar 31: Chapter 9, exercises. (Not Chapter 10, which is due in three weeks.)
Space as Data
Readings
Online text for this class
Thinking statistically - Klein and D’Ignazio, Chapter 6. https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/3?readingCollection=0cd867ef
Due Mon, Apr 11: Identify data/datasets you’ll be working with for the rest of the class
Due Thu, Apr 14: Problems for thinking statistically and Dunning Log-Likelihood comparisons.
Due Fri, Apr 15: Go back to a problem set you have already done, and use bootstrap sampling to estimate the uncertainty of one of its results.
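As a reminder of the idea, bootstrap sampling means resampling your own data with replacement many times, recomputing the statistic each time, and reading the spread of those recomputed values as an estimate of uncertainty. The sketch below is a minimal percentile bootstrap in plain Python (the course itself uses R). The `bootstrap_interval` function, its parameters, and the sample numbers are illustrative assumptions, not the method from the online text.

```python
import random
import statistics


def bootstrap_interval(values, statistic=statistics.mean, n_resamples=1000,
                       lower=0.025, upper=0.975, seed=0):
    """Percentile bootstrap interval for a statistic of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Each resample draws n items WITH replacement from the original data.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the sorted bootstrap estimates.
    low_index = int(lower * (n_resamples - 1))
    high_index = int(upper * (n_resamples - 1))
    return estimates[low_index], estimates[high_index]


# Hypothetical data: counts of some word in ten documents.
word_counts = [12, 15, 9, 22, 17, 14, 30, 11, 16, 13]
low, high = bootstrap_interval(word_counts)
print("mean:", statistics.mean(word_counts), "interval:", (low, high))
```

Any function of a list can be passed as `statistic`, for example `statistics.median`, which is part of the appeal: the same resampling loop works for statistics that have no convenient formula for their standard error.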
Supervised Learning and Predictive Models
note: From this point on, the weekly readings and topics are about specific applications of algorithms to different types of problems. To this point, everything we’ve done has been foundational; from here on out, it’s more about specific applications that you can do if you want, but don’t necessarily need to.
Readings
online text: Classification
Clustering, topic modeling, and unsupervised approaches
Readings
agenda: Class agenda
Due Mon, Apr 25: due
Due Mon, Apr 25: text
The Embedding Strategy and representation learning.
description: Modern machine learning requires data, but that data doesn’t just look like an XML or TEI representation. Instead, a particular trick for turning items into strings of numbers, the embedding strategy, has emerged as the dominant way for computers to represent information to themselves.
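A toy sketch may help make “turning items into strings of numbers” concrete. Each item below is represented by a short list of numbers, and similarity between items becomes a geometric comparison of those lists. The vectors are invented by hand for illustration; real embeddings are learned from data and typically have many more dimensions. The code is plain Python rather than the R used in the course, and the `nearest` helper is hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings, made up by hand for illustration.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, table):
    """Return the other item whose vector is most similar to `word`'s."""
    return max(
        (other for other in table if other != word),
        key=lambda other: cosine_similarity(table[word], table[other]),
    )


print(nearest("king", embeddings))
print(round(cosine_similarity(embeddings["king"], embeddings["banana"]), 3))
```

Once items are vectors, questions such as “what is most like this?” reduce to arithmetic, which is why the same strategy applies to words, images, or books alike.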
Readings
Assignment for this class
Online text
Going deep
Readings
agenda: Class agenda