Week 9: Classifying
February 29, 2019
Preface
This is a text for a course in humanities data analysis; it uses the modern R language to introduce the major challenges and features that exist for data analysis in the humanities.
This is a text that teaches the practice and principles of data analysis as encountered in traditionally non-quantitative fields. It is especially targeted at graduate students in history and literature departments.
I have developed it over the years as a resource for classes aimed at graduate students in the humanities who are new to programming but interested in working with digitized sources of a variety of sorts, whether texts, maps, networks, or images. It deals especially heavily with textual data, which is widespread and (relatively) straightforward to work with. But it deals with principles, algorithms, and approaches that can be applied to a variety of sources.
It is organized as a set of week-by-week themes that build up a core curriculum of key concepts in the manipulation and presentation of data. The concepts here are drawn from what is likely to be useful to people in the humanities. The first half is largely occupied with technologies of counting.
The sort of data you’ll work with here may seem, at first, excessively limited. There are basically only two data types that this text deals with. The first is a table with rows representing observations and columns representing values. (This is a form almost everyone has encountered in spreadsheets or databases.) The second is the related but more abstract representation of observations as points in an arbitrary, multidimensional space. I don’t, as a matter of principle, talk about how to visualize or analyze nested hierarchies, network relations, or sentence trees.
If you work through this full book, I hope you’ll see that such a constraint can ultimately be generative, not limiting. While we won’t directly visualize XML documents, for example, we will consider how best to work with and manipulate them as tabular data, with their tag hierarchies represented as columns. This may seem weird. But it also captures one of the most interesting things about data analysis: the tools you might learn for analyzing the distribution of words in a document can be just as valuable and valid for analyzing the distribution of people in a city or photographic features in an archive. For any single analysis task, you can probably save time at first by loading the data into some online tool or downloadable Java application; but in doing so you lose the ability to see the shared representational layers below. An absolutely fundamental skill for data manipulation is the ability to recast data into different forms; by doing visualization and statistical analysis on just two of them, you will see how to shape a variety of forms of information.
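To make the idea of a tag hierarchy stored as columns concrete, here is a minimal sketch in R. It assumes the dplyr and tibble packages are installed. The XML fragment in the comment, the table name play_lines, and every value in it are invented for illustration and do not come from the course package. The table is typed in by hand rather than parsed from a file, so that the shape of the data is the only thing on display.

```r
library(dplyr)
library(tibble)

# Hypothetical XML fragment that the table below stands in for:
#   <play>
#     <act n="1">
#       <scene n="1"> <line>...</line> <line>...</line> </scene>
#       <scene n="2"> <line>...</line> </scene>
#     </act>
#     <act n="2">
#       <scene n="1"> <line>...</line> <line>...</line> <line>...</line> </scene>
#     </act>
#   </play>

# One row per <line> element. Each level of the tag hierarchy
# (act, scene) becomes an ordinary column. All values are invented.
play_lines <- tribble(
  ~act, ~scene, ~text,
  1,    1,      "first invented line",
  1,    1,      "second invented line",
  1,    2,      "third invented line",
  2,    1,      "fourth invented line",
  2,    1,      "fifth invented line",
  2,    1,      "sixth invented line"
)

# Once the hierarchy lives in columns, counting lines per scene is
# the same operation as counting words per document.
lines_per_scene <- play_lines %>%
  count(act, scene, name = "n_lines")

print(lines_per_scene)
```

The same count() call would work unchanged on a table of words per document or people per city, which is the shared representational layer the paragraph above describes.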
There are, at this point, plenty of textbooks out there that aim to offer some guidelines for humanities students dipping their toes into R. Among the ones I have used the most are those by Jockers, Tilton and Arnold, and the Programming Historian. I felt the need to create this one in my courses for a few reasons.
It uses purely modern R, by which I mean the so-called tidyverse family of packages (including, notably, the tidytext and tidygraph packages). As I explain in the [second chapter], these packages offer not just a set of tools that can accomplish arbitrary tasks, but a unified philosophy and clear separation of the different aspects of data analysis. R is at an interesting point where it seems conceivable that it could eventually be two different languages entirely; for pedagogical purposes, I find the tidyverse packages to promote useful thinking about data, as well as simply allowing you to get stuff done.
As part of this, it keeps the number of functions, libraries, and concepts as low as possible. There are many different ways, even within the tidyverse, to do any given task; rather than forcing you to go back and look them up, I try to limit the vocabulary as far as possible and only judiciously introduce new concepts. This means that some of the most interesting and often useful elements from the tidyverse are omitted (quosures and quasiquotation; XXX) as well as some features of base R that most courses generally introduce out of reflex (there is extremely little use in this text of things like for- and while-loops, the $ accessor for data frames, and the factor data type).
Influenced by the tidyverse philosophy, it tries to step back from simply describing the fastest way to do something and instead emphasizes the basic units of data analysis that can be shared across sources. There are three major languages used in the digital humanities: Python, R, and JavaScript. While each has its own local syntax, they share the same core principles of data manipulation. If a student wants to complete this course in Python rather than R, doing so is entirely possible; the proof that they’ve done so is in their ability to complete the problems at the end of each chapter.
It presents an opinionated reduction of the world of statistical operations down to a few essentials. The statistics that humanists use are quite different from those needed in the social sciences, where causal inference is king; although I do go into some detail about Dunning log-likelihood (known as the G-test outside of computational linguistics) and the bootstrap (as a general-purpose tool), I aim to help you produce visualizations or tables as endpoints more often than statistical tests.
On the other hand, it goes much farther into lessons from the world of information retrieval, like TF-IDF and cosine similarity, than a typical programming text does, since it is so important for students in the humanities to understand the basics of search engines even if they never code again.
Finally, because I know that students in the humanities have frequently not dealt directly with math since high school, I try to be careful to dwell a bit on the purpose and nature of even high-school concepts like logarithms.
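As a small illustration of the information-retrieval measures named above, and of why a logarithm turns up in them, here is a minimal sketch in R. It assumes the dplyr, tidyr, and tibble packages are installed. The word counts and the names counts, weighted, and cosine are invented for this example. The weighting shown is one common variant of TF-IDF (term share of the document, times the natural log of the inverse document frequency) and is not necessarily the one later chapters adopt.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented word counts for three tiny "documents" (hypothetical data).
counts <- tribble(
  ~doc, ~word,    ~n,
  "a",  "whale",   4,
  "a",  "sea",     2,
  "a",  "the",    10,
  "b",  "whale",   1,
  "b",  "ship",    3,
  "b",  "the",     8,
  "c",  "garden",  5,
  "c",  "the",     9
)

n_docs <- n_distinct(pull(counts, doc))

weighted <- counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                      # share of each document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>%  # log(1) = 0 for words in every document
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# Recast the long table as one row per document: each document is now
# a point in a space with one dimension per word.
wide <- weighted %>%
  select(doc, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

m <- as.matrix(select(wide, -doc))
rownames(m) <- pull(wide, doc)

# Cosine similarity: dot products divided by the product of vector lengths.
dot <- m %*% t(m)
norms <- sqrt(diag(dot))
cosine <- dot / (norms %o% norms)

print(weighted)
print(round(cosine, 3))
```

Because "the" appears in all three documents, its inverse document frequency is log(3/3) = 0, so it contributes nothing to any comparison. Documents a and c share no other word, so their cosine similarity is 0, while a and b share "whale" and score above 0.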
Because this is fundamentally a pedagogical text, each chapter generally ends with a number of exercises. The point of these is to allow students to work out some of the concepts introduced in the text in their own brains and fingers. (Or however they prefer to work.) In my classes, these are generally ungraded assignments that students must hand in each week; I encourage collaborative work and don’t penalize dead ends.
It (for now) uses the R language as the primary area of application. In some places, it includes elements using the Python pandas package (and altair for data visualization), which offer the same fundamental approach as R does.
This text is intended to be used alongside an R package, installed through GitHub, at https://github.com/HumanitiesDataAnalysis/HumanitiesDataAnalysis.