Required Texts
Most of the reading for this course consists of articles available online. A few books are required for purchase. If you have difficulty obtaining any texts, please let me know as soon as possible.
Required books
 Drucker, Graphesis.
 Ramsay, Reading Machines.
 Tufte, Envisioning Information.
 Jockers, Macroanalysis.
Recommended supplementary materials (free online)

James, An Introduction to Statistical Learning with Applications in R. [1]

Jockers, Text Analysis with R for Students of Literature. [2]
Unit 1: Defining and transforming data
Week 1, January 15: Introduction. What is (could be) Humanities Data Analysis?
Reading
 Ramsay, Reading Machines, Chapters 1 to 2
 Michel et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.”
Practicum: regular expressions
 Software: Please come to class having installed a text editor with full regular expression capabilities.
 For Macs, good options are TextWrangler or Atom.
 For Windows, NotePad Deluxe. (Not Notepad, which you most likely have already.)
 For Linux, gedit or Atom. (If you’re a glutton for punishment, I heartily endorse either Emacs or vi.)
Problem set: Regex practice.
Week 2, January 22: Cleaning and Metadata: a brief pass at networks
Reading
 Unsworth: Knowledge Representation in Humanities Computing. http://people.brandeis.edu/~unsworth/KR/KRinHC.html
 Rosenberg, “Data before the Fact,” in Gitelman, ed., "Raw Data" Is an Oxymoron.
 “Becoming Digital,” from Daniel J. Cohen and Roy Rosenzweig, Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web.
 Hadley Wickham, “Tidy Data,” Journal of Statistical Software, 2014
 Witmore, “Text.”
 So and Long, “Network Analysis and the Sociology of Modernism.”
Practicum
 Before class: install RStudio, or get an account on my server. (Note that on some systems, installing RStudio also requires you to install R itself.)
 In class: introduction to R: data types, data.frames, and manipulation; using functions.
Problem set:
 Census data manipulation with read.table and tidyr
 Optional reach: use tidyr to manipulate a data set into a form for network representations.
Week 3, January 29: Exploratory Data Analysis and the Split-Apply-Combine strategy for counting.
Reading
 Behrens, “Principles and Procedures of Exploratory Data Analysis.”
 Too long, too old, and too out of print to assign, but be aware of the granddaddy of them all: Tukey, Exploratory Data Analysis.
 Ruggles, “The Transformation of American Family Structure.”
 Moretti, “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).”
Problem Set: Split/Apply/Combine
 Before class: install git
 Skill: the split/apply/combine strategy.
 Summary statistics and exploratory data analysis on the census files.
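As a rough orientation for the split/apply/combine skill, here is a minimal sketch in base R. The `households` data frame and its columns are hypothetical stand-ins, not the census files we use in class, and the problem set itself may call for different tools.

```r
# Hypothetical toy data standing in for the census files used in class.
households <- data.frame(
  state = c("MA", "MA", "NY", "NY", "NY"),
  size  = c(3, 5, 2, 4, 6)
)

# Split: one vector of household sizes per state.
by_state <- split(households$size, households$state)

# Apply: compute a summary statistic for each piece.
mean_size <- sapply(by_state, mean)

# Combine: gather the per-state results back into a single data frame.
result <- data.frame(state = names(mean_size), mean_size = unname(mean_size))
print(result)
```

The same three steps apply whatever the grouping variable or summary function is; only the arguments to `split` and `sapply` change.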
Week 4, February 5: Grammatical and critical visualization, week 1
 Scan http://docs.ggplot2.org/current/, paying particular attention to the first section: the different “geoms.”
 Find Microsoft Excel and compare the ggplot2 geoms to its different “chart types.”
 Tufte Envisioning Information.
 Manovich, Lev. “What Is Visualisation?” Visual Studies.
 selfiecity.net
Problem set: Visualization
 R package: ggplot2
 Multiple geoms in ggplot.
Some documentation is available in Wickham, “Ggplot2.”
Week 5, February 12: Grammatical and critical visualization, week 2
 Klein, “The Image of Absence.”
 Drucker, Graphesis.
Problem set: Dig into exploratory data analysis of the bibliographic data set.
Optional supplemental readings:
 Daston and Galison, Objectivity, Chapter 7 (on the sciences today).
Week 6, February 19: Chance, deformance, and nonquantitative digital analysis
 Ramsay, Reading Machines (in earnest, all the way through).
 Mark Sample, Hacking the Academy and Notes toward a deformed humanities
 Daniel Shore, Shakespeare’s Constructicon. Preprint, to be published summer 2015
In class:
Introduction to digital text. Randomness and probabilities.
Problem set:
 Elementary Probability.
 Markov chains and random distributions.
February 26: Class Cancelled
Problem set: Continue working with bibliographic data.
Reach possibilities:
 Build a Markov chain generator, an Oulipian substitution machine, a Twitter bot, or anything else of the kind.
 Investigate Caleb McDaniel’s Every Three Minutes bot; build something similar that generates lists of slaves and owners.
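For the Markov chain option, one possible starting point is sketched below in base R. The sample sentence, the `successors` table, and the `generate` function are all illustrative assumptions rather than a required design; a real project would read in a much larger text.

```r
# Hypothetical sample text; substitute any corpus you like.
text <- "the cat sat on the mat and the cat ran off"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# For each word, collect every word that immediately follows it in the text.
successors <- split(words[-1], words[-length(words)])

# Walk the chain: repeatedly pick a random successor of the current word.
generate <- function(start, n_words) {
  out <- start
  current <- start
  while (length(out) < n_words) {
    candidates <- successors[[current]]
    if (is.null(candidates)) break  # dead end: this word never has a follower
    current <- candidates[sample.int(length(candidates), 1)]
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

set.seed(42)  # fixed seed so the random walk is repeatable
cat(generate("the", 8), "\n")
```

Because words that follow a given word more often appear more often in its `successors` entry, sampling uniformly from that entry reproduces the observed transition frequencies.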
March 5: Text and metadata
 Jockers, Macroanalysis.
No problem set over break: review and complete parts that you’ve previously missed.
But also, begin firmly pulling together (if you have not already) your own data. This should be either something that can be read in as a data.frame, a number of text files like the Dickens we built in class, or text to work with in the R tm package (in which case, you’ll need to read Jockers, Text Analysis with R, and talk to me about how to prepare data for the remainder of the course).
March 12: Spring Break
Unit 2: Algorithms
Week 9, March 19: Vectors and spaces
 McCarty, “Knowing ….”
 Tukey, Friedman, and Fisherkeller: Introduction to Prim9. This is on YouTube, or I have a local copy.
 Allison et al., Quantitative Formalism (http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf)
(Both of these are experiments with visualizing multidimensional spaces, but also note that both are extremely odd publication models. Why?)
Problem set:
 Vector Space Models
 Principal Components Analysis
 Cosine similarity
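Cosine similarity compares the direction of two vectors rather than their length: it is the dot product of the vectors divided by the product of their norms. A minimal base R sketch follows; the three "documents" and their word counts are made up for illustration.

```r
# Cosine similarity: dot product divided by the product of the vector lengths.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical word-count vectors over the same three-word vocabulary.
doc1 <- c(whale = 4, ship = 2, love = 0)
doc2 <- c(whale = 2, ship = 1, love = 0)  # same proportions as doc1, half the length
doc3 <- c(whale = 0, ship = 0, love = 5)  # shares no vocabulary with doc1

print(cosine_similarity(doc1, doc2))  # proportional vectors: similarity near 1
print(cosine_similarity(doc1, doc3))  # no shared words: similarity 0
```

Note that `doc2` is half as long as `doc1` but scores as nearly identical, which is why the measure is useful for comparing texts of different lengths.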
Supplemental readings (technical explanations):
 James, An Introduction to Statistical Learning with Applications in R, chapters 10 to 10.2.
Week 10, March 26: Classifying and Predicting
 Mosteller and Wallace, “Inference in an Authorship Problem.”
From Ted Underwood’s classification project.
 Ted Underwood, “Distant reading and the blurry edges of genre”
 Ted Underwood, Understanding Genre in a Collection of a Million Volumes, Interim report
Problem sets:
 Naive Bayes
 Logit classifiers
 K-nearest-neighbor classification.
Methods:
 James, An Introduction to Statistical Learning with Applications in R, Chapter 4.
 Chapter 8, Decision trees, particularly 8.1: optional.
 Reach: Support Vector Machines, James, Chapter 9. Only for the foolhardy.
Week 11, April 2: Clustering, Exploring, and Topic Modeling
 Blei, “Probabilistic Topic Models.”
 Goldstone and Underwood, “The Quiet Transformations of Literary Studies,” and online supplement
 Ted Underwood, “Topic Modeling Made Just Simple Enough,” April 7, 2012
 Rhody, “Topic Modeling and Figurative Language,” Journal of Digital Humanities.
 ???
Problem sets and methods:
 K-means Clustering.
 Hierarchical Clustering.
 Topic Modeling using the RMallet package.
Topic model something inappropriate.
Week 12, April 9: Integrating Place
 Richard White, “What is Spatial History?” http://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
 Wilkens, “The Geographic Imagination of Civil War-Era American Fiction.”
 Blevins, “Space, Nation, and the Triumph of Region.”
Week 13, April 16: Debates in Humanities Data Analysis
Critical backgrounds
 Alan Liu, “Where is the cultural criticism in the digital humanities?”
The Syuzhet Affair. Note: as our course has unfolded, there has been extensive DH blogging about this topic that far exceeds the long list of short posts here. If you find something else that you want to bring in (perhaps from the extensive two-part Twitter Storify that Eileen Clancy put together), you are more than welcome to, along with, of course, your own understanding of the issues from the problem set.
 Core posts: read all
 Jockers, Matthew. “Revealing Sentiment and Plot Arcs with the Syuzhet Package”
 Jockers. “The Rest of the Story”
 Annie Swafford. “Problems with the Syuzhet Package.”
 Jockers. “Some Thoughts on Annie’s Thoughts . . . about Syuzhet”
 Swafford. “Continuing the Syuzhet Discussion.”
 Other recent approaches to plot arceology: skim for content.
 Reiter, N., A. Frank, and O. Hellwig. “An NLP-Based Cross-Document Approach to Narrative Structure Discovery.” Literary and Linguistic Computing 29, no. 4 (December 1, 2014): 583–605. doi:10.1093/llc/fqu055.
 Benjamin Schmidt, Fundamental plot arcs, seen through multidimensional analysis of thousands of TV and movie scripts
 Andrew Piper, “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel,” New Literary History. (Preprint and synopsis available here)
 The question of Sentiment Analysis
 Andrew Piper, “Validation and Subjective Computing.”
 Scott Weingart, Not Enough Perspectives, Pt. 1
 David Bamman, Validity
 The question of smoothing methods
 Jockers. “A Ringing Endorsement of Smoothing.”
 Swafford. “Why Syuzhet Doesn’t Work and How We Know.”
 Scott Enderle, “What’s a sine wave of sentiment?”
 Benjamin Schmidt, Commodius vici of recirculation: the real problem with Syuzhet
 Jockers, “Requiem for a lowpass filter”
Allison, Sarah, Ryan Heuser, Matthew L. Jockers, Franco Moretti, and Michael Witmore. Quantitative Formalism: An Experiment (Stanford Literary Lab, Pamphlet 1). Stanford: Stanford Literary Lab, 2011.
Behrens, John T. “Principles and Procedures of Exploratory Data Analysis.” Psychological Methods 2, no. 2 (1997): 131. http://psycnet.apa.org/journals/met/2/2/131/.
Blevins, C. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal Of American History 101, no. 1 (2014): 122–147. doi:10.1093/jahist/jau184.
Daston, Lorraine, and Peter Galison. Objectivity. New York: Zone Books ; Distributed by the MIT Press, 2007.
Drucker, Johanna. Graphesis: Visual Forms of Knowledge Production, 2014.
Gitelman, Lisa, ed. "Raw Data" Is an Oxymoron. Infrastructures Series. Cambridge, Massachusetts: The MIT Press, 2013.
Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45, no. 3 (2014): 359–384. doi:10.1353/nlh.2014.0025.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R, 2013. http://dx.doi.org/10.1007/978-1-4614-7138-7.
Jockers, Matt. Text Analysis with R for Students of Literature. Springer, 2014. http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7.
Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.
Klein, Lauren F. “The Image of Absence: Archival Silence, Data Visualization, and James Hemings.” American Literature 85, no. 4: 661–688. doi:10.1215/00029831-2367310.
McCarty, Willard. “Knowing … : Modeling in Literary Studies.” In Companion to Digital Literary Studies, edited by Susan Schreibman and Ray Siemens. Blackwell Companions to Literature and Culture. Oxford: Blackwell Publishing Professional, 2008. http://www.digitalhumanities.org/companionDLS/.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science (New York, N.Y.) 331, no. 6014 (January 14, 2011): 176–182. doi:10.1126/science.1199644.
Moretti, Franco. “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry 36, no. 1 (2009): 134–158.
Mosteller, Frederick, and David L. Wallace. “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers.” Journal of the American Statistical Association 58, no. 302 (1963): 275–309. http://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500849.
Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Topics in the Digital Humanities. Urbana: University of Illinois Press, 2011.
Rhody, Lisa M. “Topic Modeling and Figurative Language.” Journal of Digital Humanities, April 7, 2013. http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/.
Ruggles, Steven. “The Transformation of American Family Structure.” The American Historical Review 99, no. 1 (February 1994): 103. doi:10.2307/2166164.
Tufte, Edward R. Envisioning Information. Cheshire, Conn. (P.O. Box 430, Cheshire 06410): Graphics Press, 1990.
Tukey, John W. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co, 1977.
Wickham, Hadley. “Ggplot2.” Wiley Interdisciplinary Reviews: Computational Statistics 3, no. 2 (2011): 180–185. doi:10.1002/wics.147.
Wilkens, Matthew. “The Geographic Imagination of Civil War-Era American Fiction.” American Literary History 25, no. 4: 803–840. doi:10.1093/alh/ajt045.
Witmore, Michael. “Text: A Massively Addressable Object,” December 31, 2010. http://winedarksea.org/?p=926.

[1] The algorithms we will discuss in the second half of the semester are covered at greater length in this text. If you wish to come to a more mathematical understanding, this provides a relatively gentle introduction in machine-learning terms, although something more than you , also based in the R language. All chapters are available for download, for free, from the Northeastern site: you may not benefit from reading them now, but it is worth downloading them while you have the chance.

[2] For those interested solely in text analysis and not census, bibliographic, or other forms of “humanities data,” this will be an invaluable resource. Be aware that it uses a different set of libraries and data models for visualization and analysis than the ones we are using in this class, so the work will .