Most of the readings for this course are articles available online. A few books are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.
- Drucker Graphesis.
- Ramsay Reading Machines.
- Tufte Envisioning Information.
- Jockers Macroanalysis.
Recommended supplementary materials (free online)
- James An Introduction to Statistical Learning with Applications in R. [1]
- Jockers Text Analysis with R for Students of Literature. [2]
Unit 1: Defining and transforming data
Week 1, January 15: Introduction. What is (could be) Humanities Data Analysis?
- Ramsay Reading Machines, Chapters 1 to 2
- Michel et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.”
Practicum: regular expressions
- Software: Please come to class having installed a text editor with full regular expression capabilities.
- For Macs, good options are TextWrangler or Atom.
- For Windows, NotePad Deluxe. (Not Notepad, which you most likely have already.)
- For Linux, gedit or Atom. (If you’re a glutton for punishment, I heartily endorse either emacs or vi.)
Problem set: Regex practice.
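As a warm-up for the regex practice, here is a minimal sketch of three common moves: collapsing whitespace, matching four-digit years, and rewriting with capture groups. It is written in Python rather than in a text editor's find-and-replace box, the sample line is invented for illustration, and replacement syntax (`\1` here) differs between tools, so treat it as an analogy rather than a recipe for your editor.

```python
import re

# Hypothetical sample line, loosely modeled on a messy bibliographic record.
line = "Style,  Inc.   (British Novels, 1740 to 1850); reprinted 2009"

# Collapse any run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", line)

# Find four-digit years from 1500 through 2099.
# (?:...) groups the alternatives without capturing them.
years = re.findall(r"\b(?:1[5-9]|20)\d{2}\b", cleaned)

# Rewrite "A to B" year ranges as "A-B" using two capture groups.
ranged = re.sub(r"\b(\d{4}) to (\d{4})\b", r"\1-\2", cleaned)

print(cleaned)
print(years)
print(ranged)
```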
Week 2, January 22: Cleaning and Metadata: a brief pass at networks
- Unsworth: Knowledge Representation in Humanities Computing. http://people.brandeis.edu/~unsworth/KR/KRinHC.html
- Rosenberg, “Data before the Fact,” in Gitelman, "Raw Data" Is an Oxymoron.
- “Becoming Digital,” from Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web.
- Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2014
- Witmore “Text.”
- So and Long, “Network Analysis and the Sociology of Modernism.”
- Before class: Install RStudio, or get an account on my server. (Note that on some systems, this also requires you to install R itself.)
- In class: introduction to R: data types, data.frames, and manipulation; using functions.
- Census data manipulation with
- Optional reach: use tidyr to manipulate a data set into a form for network representations.
Week 3, January 29: Exploratory Data Analysis and the Split-Apply-Combine strategy for counting.
- Behrens “Principles and Procedures of Exploratory Data Analysis.”
- Too long, too old, and too out of print to assign, but be aware of the granddaddy of them all: Tukey Exploratory Data Analysis.
- Ruggles “The Transformation of American Family Structure.”
- Moretti “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).”
Problem Set: Split/Apply/Combine
- Before class: install git
- Skill: the split/apply/combine strategy.
- Summary statistics and exploratory data analysis on the census files.
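The split/apply/combine strategy can be sketched in a few lines. The example below is in Python rather than R, purely for illustration, and the records are invented toy rows standing in for census data. The helper name `split_apply_combine` is hypothetical, not a function from any package we use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records standing in for census rows (not real data).
records = [
    {"state": "MA", "household_size": 5},
    {"state": "MA", "household_size": 3},
    {"state": "NY", "household_size": 4},
    {"state": "NY", "household_size": 6},
    {"state": "NY", "household_size": 2},
]

def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, then summarize each group's `value` with `func`."""
    # Split: bucket the values by the grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # Apply and combine: one summary per group, gathered into a dict.
    return {group: func(values) for group, values in groups.items()}

mean_size = split_apply_combine(records, "state", "household_size", mean)
counts = split_apply_combine(records, "state", "household_size", len)

print(mean_size)
print(counts)
```

The same three steps apply whatever the summary function is: counting, averaging, or taking a maximum only changes the "apply" step.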
Week 4, February 5: Grammatical and critical visualization, week 1
- Scan http://docs.ggplot2.org/current/, paying particular attention to the first section: the different “geoms.”
- Open Microsoft Excel and compare the geoms to its different “chart types.”
- Tufte Envisioning Information.
- Manovich, Lev. “What Is Visualisation?” Visual Studies.
Problem set: Visualization
- R package:
- Multiple geoms in ggplot.
Some documentation is available in Wickham “Ggplot2.”
Week 5, February 12: Grammatical and critical visualization, week 2
- Klein “The Image of Absence.”
- Drucker Graphesis.
Problem set: Dig into exploratory data analysis of the bibliographic data set.
Optional supplemental readings:
- Daston and Galison Objectivity, Chapter 7 (On the Sciences, today.)
Week 6, February 19: Chance, deformance, and non-quantitative digital analysis
- Ramsay Reading Machines. (In earnest, all the way through).
- Mark Sample, Hacking the Academy and Notes toward a deformed humanities
- Daniel Shore, Shakespeare’s Constructicon. Preprint, to be published summer 2015
Introduction to digital text. Randomness and probabilities.
- Elementary Probability.
- Markov chains and random distributions.
February 26: Class Cancelled
Problem set: Continue working with bibliographic data.
- Build a Markov chain generator, an Oulipian substitution machine, a Twitter bot, or anything else of the kind.
- Investigate Caleb McDaniel’s Every Three Minutes bot; build something similar that generates lists of slaves and owners.
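For the Markov chain option, here is a minimal word-level sketch, written in Python rather than R as an illustration only. The function names `build_chain` and `generate` are hypothetical, and the sample text is a lowercased, unpunctuated fragment of the opening of A Tale of Two Cities. Each state is the previous `order` words, and the next word is drawn at random from the words observed after that state.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)

sample_text = (
    "it was the best of times it was the worst of times "
    "it was the age of wisdom it was the age of foolishness"
)
chain = build_chain(sample_text, order=1)
print(generate(chain, length=12, seed=1))
```

Because repeated followers are stored as repeated list entries, a uniform random choice from the list reproduces the observed transition frequencies without any explicit probability table.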
March 5: Text and metadata
- Jockers Macroanalysis.
No problem set over break: review and complete parts that you’ve previously missed.
But also, begin firmly pulling together (if you have not already) your own data. This should be either something that can be read in as a data.frame, a number of text files like the Dickens we built in class, or text to work with in the R tm package (in which case, you’ll need to read Jockers Text Analysis with R and talk to me about how to prepare data for the remainder of the course).
March 12: Spring Break
Unit 2: Algorithms
Week 9, March 19: Vectors and spaces
- McCarty “Knowing ….”
- Tukey, Friedman, and Fisherkeller: Introduction to Prim-9. This is on YouTube, or I have a local copy.
- Allison et al. Quantitative Formalism. http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf
(Both of these are experiments with visualizing multidimensional spaces, but also note that both are extremely odd publication models. Why?)
- Vector Space Models
- Principal Components Analysis
- Cosine similarity
Supplemental readings (technical explanations):
- James An Introduction to Statistical Learning with Applications in R, chapter 10 through section 10.2.
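To make the vector space idea concrete, here is a minimal sketch of cosine similarity between two documents represented as word-count vectors. It is in Python rather than R, for illustration only. The helper names `term_vector` and `cosine_similarity` are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is a deliberate simplification.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a bag of words: each word mapped to its count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = term_vector("the whale the sea")
doc2 = term_vector("the sea the sea the whale")
doc3 = term_vector("pride prejudice")

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Because the measure depends on the angle between vectors rather than their length, a long document and a short one with the same word proportions score as identical.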
Week 10, March 26: Classifying and Predicting
- Mosteller and Wallace “Inference in an Authorship Problem.”
From Ted Underwood’s classification project.
- Ted Underwood, “Distant reading and the blurry edges of genre”
- Ted Underwood, Understanding Genre in a Collection of a Million Volumes, Interim report
- Naive Bayes
- Logit classifiers
- K-nearest-neighbor classification.
- James An Introduction to Statistical Learning with Applications in R, Chapter 4.
- Chapter 8, Decision trees, particularly 8.1: optional.
- Reach: Support Vector Machines, James Chapter 9. Only for the foolhardy.
Week 11, April 2: Clustering, Exploring, and Topic Modeling
- Blei, “Probabilistic Topic Models.”
- Goldstone and Underwood “The Quiet Transformations of Literary Studies” and online supplement
- Ted Underwood, “Topic Modeling Made Just Simple Enough,” April 7, 2012
- Rhody “Topic Modeling and Figurative Language,” Journal of Digital Humanities.
Problem sets and methods:
- K-means Clustering.
- Hierarchical Clustering.
- Topic Modeling using the R-Mallet package.
Topic model something inappropriate.
Week 12, April 9: Integrating Place
- Richard White “What is Spatial History?” http://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
- Wilkens “The Geographic Imagination of Civil War-Era American Fiction.”
- Blevins “Space, Nation, and the Triumph of Region.”
Week 13, April 16: Debates in Humanities Data Analysis
- Alan Liu, “Where is the cultural criticism in the digital humanities?”
The Syuzhet Affair. Note: as our course has unfolded, there has been extensive DH blogging about this topic that far exceeds the long list of short posts here. If you find something else that you want to bring in (perhaps from the extensive two-part Twitter storification that Eileen Clancy put together), you are more than welcome to. (Along with, of course, your own understanding of the issues from the problem set.)
- Core posts: read all
- Jockers, Matthew. “Revealing Sentiment and Plot Arcs with the Syuzhet Package”
- Jockers. “The Rest of the Story”
- Annie Swafford. “Problems with the Syuzhet Package.”
- Jockers. “Some Thoughts on Annie’s Thoughts . . . about Syuzhet”
- Swafford. “Continuing the Syuzhet Discussion.”
- Other recent approaches to plot arc-eology: skim for content.
- Reiter, N., A. Frank, and O. Hellwig. “An NLP-Based Cross-Document Approach to Narrative Structure Discovery.” Literary and Linguistic Computing 29, no. 4 (December 1, 2014): 583–605. doi:10.1093/llc/fqu055.
- Benjamin Schmidt, Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts
- Andrew Piper, “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel,” New Literary History. (Preprint and synopsis available here)
- The question of Sentiment Analysis
- The question of smoothing methods
- Jockers. “A Ringing Endorsement of Smoothing.”
- Swafford. “Why Syuzhet Doesn’t Work and How We Know.”
- Scott Enderle, “What’s a sine wave of sentiment?”
- Benjamin Schmidt, Commodius vici of recirculation: the real problem with Syuzhet
- Jockers, “Requiem for a low-pass filter”
Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1). Stanford: Stanford Literary Lab, 2011.
Behrens, John T. “Principles and Procedures of Exploratory Data Analysis.” Psychological Methods 2, no. 2 (1997): 131. http://psycnet.apa.org/journals/met/2/2/131/.
Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal Of American History 101, no. 1 (2014): 122–147. doi:10.1093/jahist/jau184.
Daston, Lorraine, and Peter Galison. Objectivity. New York: Zone Books; distributed by the MIT Press, 2007.
Drucker, Johanna. Graphesis: Visual Forms of Knowledge Production, 2014.
Gitelman, Lisa, ed. "Raw Data" Is an Oxymoron. Infrastructures Series. Cambridge, Massachusetts: The MIT Press, 2013.
Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45, no. 3 (2014): 359–384. doi:10.1353/nlh.2014.0025.
James, Gareth. An Introduction to Statistical Learning with Applications in R, 2013. http://dx.doi.org/10.1007/978-1-4614-7138-7.
Jockers, Matt. Text Analysis with R for Students of Literature. Springer, 2014. http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7.
Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.
Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 85, no. 4: 661–688. doi:10.1215/00029831-2367310.
McCarty, Willard. “Knowing … : Modeling in Literary Studies.” In Companion to Digital Literary Studies, edited by Susan Schreibman and Ray Siemens. Blackwell Companions to Literature and Culture. Oxford: Blackwell Publishing Professional, 2008. http://www.digitalhumanities.org/companionDLS/.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (New York, N.Y.) 331, no. 6014 (January 14, 2011): 176–182. doi:10.1126/science.1199644.
Moretti, Franco. “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry 36, no. 1 (2009): 134–158.
Mosteller, Frederick, and David L. Wallace. “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers.” Journal of the American Statistical Association 58, no. 302 (1963): 275–309. http://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500849.
Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Topics in the Digital Humanities. Urbana: University of Illinois Press, 2011.
Rhody, Lisa M. “Topic Modeling and Figurative Language.” Journal of Digital Humanities, April 7, 2013. http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/.
Ruggles, Steven. “The Transformation of American Family Structure.” The American Historical Review 99, no. 1 (February 1994): 103. doi:10.2307/2166164.
Tufte, Edward R. Envisioning Information. Cheshire, Conn. (P.O. Box 430, Cheshire 06410): Graphics Press, 1990.
Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.
Wickham, Hadley. “Ggplot2.” Wiley Interdisciplinary Reviews: Computational Statistics 3, no. 2 (2011): 180–185. doi:10.1002/wics.147.
Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History 25, no. 4: 803–840. doi:10.1093/alh/ajt045.
Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.
[1] The algorithms we will discuss in the second half of the semester are covered at greater length in this text. If you wish to come to a more mathematical understanding, this provides a relatively gentle introduction in machine-learning terms, although something more than you , also based in the R language. All chapters are available for download, for free, from the Northeastern site: you may not benefit from reading them now, but it is worth downloading them while you have the chance.
[2] For those interested solely in text analysis and not census, bibliographic, or other forms of “humanities data,” this will be an invaluable resource. Be aware that it uses a different set of libraries and data models for visualization and analysis than the ones we are using in this class, so the work will .