Schedule

Notes: we’ll mostly be reading articles in this course that are available online. None are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible. In week 1, you’ll read my spiel about what humanists need to understand when they read CS. My answer, in general, is that you need to know what they did, but not how they did it. I’ve put some CS papers in this syllabus to expand your thinking about what’s possible. You should absolutely, positively, not aim to understand the process in a CS paper. As a rule, if you see a fancy equation in an article not written by a humanist, you can probably skip the whole section for the time being.

Defining and transforming

Mon, Jan 24

Introductions

Due Mon, Jan 24: Install the programs R and RStudio on your computer. RStudio is a wrapper program around the R language that we’ll be using for almost every assignment.

Mon, Jan 31

What is (could be) Humanities Data Analysis?

Readings

Online text

Due Mon, Jan 31: Choose two datasets to discuss in class that are relevant to your research interests, so far as you’re able to find them.

One should be something that you can actually download, almost certainly in the form of a CSV or Excel File.

The other should be something that you know exists, but that you might not be fully able to work with yet.

For both of them, fill out the online spreadsheet. The goal here is to reduce this to a tabular dataset. Describe what each of the columns in this dataset would be.

Do not describe the dataset as a whole aside from the columns–see if you can capture it in the individual elements.

Due Mon, Jan 31: Try to finish the exercises for “Working in a Programming Language,” installing R and Rstudio.

Mon, Feb 07

Information %>% Data

Readings

  • John Unsworth “Knowledge Representation in Humanities Computing,” 2001, http://www.people.virginia.edu/~jmu2m/KR/KRinHC.html.
  • Daniel Rosenberg “Data Before the Fact,” in Raw Data Is an Oxymoron, ed. Lisa Gitelman (Cambridge: MIT Press, 2013).
  • (Much more practical) Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2014.

Online text

agenda: Class agenda

Mon, Feb 14

Data Visualization

Readings

  • Jacques Bertin, Semiology of Graphics: Diagrams, Networks, Maps, selections.
  • Scan http://docs.ggplot2.org/current/, paying particular attention to the first section, on the different “geoms.”
  • “Streetscapes: Mozart, Marx, and a Dictator.” Die Zeit online, 2018-02-13. link

Online text

Practicum for next class

  • R package: ggplot2
  • Multiple geoms in ggplot.

Related texts not to read

  • Lorraine Daston and Peter Galison, Objectivity (New York; Cambridge, Mass.: Zone Books; distributed by the MIT Press, 2007), Chapter 7 (on the sciences today).

Due Wed, Feb 16: Counting things

Mon, Feb 21

No class: President’s Day

Mon, Feb 28

Counting, grouping, and accounting for how only things that get counted count.

description: A huge amount of work is just about finding interesting things to count. Often, sophisticated work can just be figuring out how to count something new. Here we look a little bit at how you can simply count something.
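
To make "simply counting something" concrete, here is a minimal sketch in base R. The `titles` data frame, its column names, and its values are all hypothetical stand-ins, not the course's actual book-title dataset; the sketch uses only base R so it runs without extra packages.

```r
# Hypothetical toy data: one row per book, with a decade and a genre label.
titles <- data.frame(
  title  = c("Title A", "Title B", "Title C", "Title D", "Title E"),
  decade = c(1850, 1850, 1860, 1860, 1860),
  genre  = c("novel", "poetry", "novel", "novel", "poetry"),
  stringsAsFactors = FALSE
)

# Count books in each decade/genre group.
# table() cross-tabulates; as.data.frame() turns the result back into
# a tidy table with one row per group and a count column named "n".
counts <- as.data.frame(
  table(decade = titles$decade, genre = titles$genre),
  responseName = "n"
)

# Keep only the groups that actually occur.
counts <- counts[counts$n > 0, ]
print(counts)
```

The choice of grouping columns is where the interpretive work happens: anything not recorded as a column (here, only decade and genre) cannot be counted at all.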

Readings

  • Trevon D. Logan and John M. Parman “The National Rise in Residential Segregation,” The Journal of Economic History 77, no. 1 (March 2017): 127–70, https://doi.org/10.1017/S0022050717000079. As with all econ in this class, read for the general findings, and data, not the methodology.
  • Ted Underwood, David Bamman, and Sabrina Lee “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019. (This is basically chapter 4 of the other Underwood, but I like the order in the article version better for this week.)
  • “What Gets Counted Counts,” from D’Ignazio and Klein, Data Feminism (MIT Press, 2020).

Online text

Related texts not to read

  • Too long, too old, and too out of print to assign: but be aware of the granddaddy of them all: John W. Tukey Exploratory Data Analysis, Addison-Wesley Series in Behavioral Science (Reading, Mass: Addison-Wesley Pub. Co, 1977).

Practicum for next class

  • Circle back to the analysis set. Do something more with the collection of book titles.
  • If you successfully finished much of the last problem set:
  • Take a stab at the problems for Cleaning Data and tidying data

Mon, Mar 07

Making data work together

Readings

  • TBD. Topics including the Text Encoding Initiative, Codd normal forms, and maybe Underwood.

Online text

  • Combining datasets: Merges, joins, and standards.

Practicum for next class

  • Combining datasets: Merges, joins, and standards.

Mon, Mar 14

No class: Spring Break

Texts, maps, and data

Mon, Mar 21

Text as Data, 1

Readings

  • Gentzkow et al, Journal of Economic Literature, https://doi.org/10.1257/jel.20181020
  • Andrew Piper, Enumerations, 2018. Introduction and Chapter 3.
  • Julia Silge and David Robinson, Text Mining with R: A Tidy Approach. Read online at https://www.tidytextmining.com. Preface and Chapters 1-4. This is technical–think about using it with some corpus of texts of interest to you.

Practicum for next class: “Texts as Data,” exercises.

Due Fri, Mar 25: Place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy on one and serious with the other. You may use text fields if you want.

Mon, Mar 28

Text as Data, 2

Readings

Online text for this class session

agenda: Class agenda

Due Fri, Apr 01: Place on the course Slack two ggplot visualizations that result from a join between two different datasets. Try to be goofy on one and serious with the other. You may use text fields if you want.

Mon, Apr 04

Space as Data

Readings

  • C. Blevins “Space, Nation, and the Triumph of Region: A View of the World from Houston,” Journal of American History 101, no. 1 (2014): 122–47, https://doi.org/10.1093/jahist/jau184.
  • Anbinder et al, Networks and Opportunities: A Digital History of Ireland’s Great Famine Refugees in New York. American Historical Review, 2019. Be sure to spend a good amount of time in the online map as well as the printed article.
  • Pleiades project site. Browse in general: and read https://pleiades.stoa.org/help/data-structure, and think about types of uncertainty/ambiguity in geovisualization.

Online text for this class

  • Space as Data (complete).
  • For full reference information, see Lovelace et al., Geocomputation with R. Note that Lovelace uses the tmap package for mapping while we stick to ggplot2 with the spatial geometries function geom_sf. If you really want to make, say, a zoomable map, you may want to explore tmap on your own.

Due Wed, Apr 06: Free exercise: use a bag-of-words approach on texts of your own choosing and explore comparisons between subsets using PMI or Dunning. These can be full-text, XML, or, if you prefer, wordcounts for books from the HathiTrust as described in the online text. Post as images or tables to the Slack channel #getting-text-files.
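
As a possible starting point for the PMI comparison, here is a minimal sketch in base R. The two-group toy corpus, the column names, and the simple letters-only tokenizer are all assumptions for illustration; the online text may use different tools. It computes pointwise mutual information, PMI(word, group) = log2( p(word, group) / (p(word) p(group)) ), and does not cover Dunning's log-likelihood.

```r
# Hypothetical toy corpus: four short "documents" split into two subsets.
docs <- data.frame(
  group = c("a", "a", "b", "b"),
  text  = c("the dog barks", "the dog runs", "the cat sleeps", "the cat runs"),
  stringsAsFactors = FALSE
)

# Bag of words: lowercase, split on anything that is not a letter.
tokens <- strsplit(tolower(docs$text), "[^a-z]+")
long <- data.frame(
  group = rep(docs$group, lengths(tokens)),
  word  = unlist(tokens),
  stringsAsFactors = FALSE
)
long <- long[long$word != "", ]  # drop empty strings from the split

# Word-by-group count matrix.
counts <- unclass(table(long$word, long$group))
total  <- sum(counts)

# Joint and marginal probabilities.
p_joint <- counts / total
p_word  <- rowSums(counts) / total
p_group <- colSums(counts) / total

# PMI for every word/group pair; words absent from a group get -Inf.
pmi <- log2(p_joint / outer(p_word, p_group))

# Words most over-represented in group "a".
print(sort(pmi[, "a"], decreasing = TRUE))
```

A positive score means the word appears in that subset more often than its overall frequency would predict; a score near zero (as for "the" here) means it is evenly spread. On real texts, rare words produce extreme PMI values, so a minimum-count filter is usually worth adding.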

Mon, Apr 11

Dogs as Data

description: I think we need a little reboot, so we’ll focus on dogs for a little bit. Claim a possible question in the Slack as described there. It’s OK if you can’t fully realize what you want to do, but you must try something and post your questions and your broken code.

Readings

Due Mon, Apr 11: Download a shapefile or geojson from the Internet, read it into R, and make a map that you are confident no one has made before. Post in Slack.

Due Mon, Apr 11: Identify data/datasets you’ll be working with for the rest of the class

The algorithmic toolkit for exploring humanities datasets.

Mon, Apr 18

Supervised Learning and Predictive Models

note: From this point on, the weekly readings and topics are about specific applications of algorithms to different types of problems. To this point, everything we’ve done has been foundational–from here on out, it’s more about specific applications that you can do if you want, but don’t necessarily need to.

Class agenda

  • For the four people who didn’t bring a description to class, please talk through what you’re thinking about investigating.
  • Discussion of Noble.
    • Black Feminist Technology Studies

Readings

  • Safiya Noble, Algorithms of Oppression. This is long, and we won’t be able to give it the time it deserves, so try to give it just a bit.

online text: Classification.

Mon, Apr 25

Clustering, topic modeling, and unsupervised approaches

Readings

  • Sarah Allison et al. “Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1)” (Stanford: Stanford Literary Lab, January 15, 2011). Link

In class agenda

  • Talk about Allison and about Underwood. Distinguishing clustering and classification.
  • Walk through of the vector space model concept in R.
  • Pointing towards how to do classification in R.
  • Walk through of basic clustering strategies.

Mon, May 02

The Embedding Strategy and representation learning.

description: Modern machine learning requires data, but that data doesn’t just look like an XML or TEI representation. Instead, a particular trick for turning items into strings of numbers, the embedding strategy, has emerged as the dominant way for computers to represent information to themselves.

Readings

Assignment for this class

  • Submit a draft of your dog article

Online text

  • Let’s try this again: online text on clustering.
  • Vector Space Models, Principal Components Analysis, and similarity.

Mon, May 09

Going deep

Readings

Class agenda

  • General check-in
  • What can Deep Learning do?
  • What can Deep Learning do for you?
  • What about Word Embeddings?
  • What about different forms of storytelling?

Agenda Notes

Notes for Mon, Feb 07

  1. Rstudio installation and debug issues. What are packages, etc.
  2. Any python holdouts?
  3. It’s OK to use Jupyter instead of RStudio if you prefer, but you will need to install locally, because there are too many dependencies to re-download to Google Colab each time.
  4. Drucker and Michel.
  5. Polar opposites, so I find it helpful to find out which one you all find more amenable.
  6. The question of where data comes from. Google Ngrams.
  7. Issues of representation and the gift of data.
    • Wild ways of thinking about datasets.
  8. Your datasets
  9. A new section: Ontologies are formal languages for particular domains.
  10. Categorical fields.
  11. Introduction to Counting.

Notes for Mon, Mar 28

  • “Collaboration?”
  • Finding Texts–pushing mostly to Wednesday
    • Tokenization alternatives.
  • “Discuss Shore and Ramsay–can we have fun?”
  • “Discuss Witmore and free discussion of problems that can be approached as different sets of documents.”

Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. “Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1).” Stanford: Stanford Literary Lab, January 15, 2011.

Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal of American History 101, no. 1 (2014): 122–47. https://doi.org/10.1093/jahist/jau184.

Daston, Lorraine, and Peter Galison. Objectivity. New York; Cambridge, Mass.: Zone Books; distributed by the MIT Press, 2007.

Drucker, Johanna. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5, no. 1 (2011). http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning.” Nature 521, no. 7553 (May 2015): 436–44. https://doi.org/10.1038/nature14539.

Logan, Trevon D., and John M. Parman. “The National Rise in Residential Segregation.” The Journal of Economic History 77, no. 1 (March 2017): 127–70. https://doi.org/10.1017/S0022050717000079.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (New York, N.Y.) 331, no. 6014 (January 14, 2011): 176–82. https://doi.org/10.1126/science.1199644.

Rosenberg, Daniel. “Data Before the Fact.” In Raw Data Is an Oxymoron, edited by Lisa Gitelman. Cambridge: MIT Press, 2013.

Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.

Unsworth, John. “Knowledge Representation in Humanities Computing,” 2001. http://www.people.virginia.edu/~jmu2m/KR/KRinHC.html.

Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.