
Coding and Scripting in R

This course will have you writing some code in the R language. There is an extensive debate about whether digital humanists need to learn to code, which we’re not going to engage in. The fact of the matter is simply that if you want to do data analysis in the humanities, coding will often be the only way to realize your personal vision; and if you want to build resources in the humanities that others might want to analyze, you’ll need to know what sophisticated users want to do with your tools to make them work for them.

I have no expectation that anyone will come out of this a full-fledged developer. In fact, I hope that by doing some actual scripting, you’ll come to see that these debates over learning to code brush over a lot of intermediate stages. We’ll be focusing in particular on developing skills less in full-fledged “programming” than in “scripting.” That means instructing a computer in every stage of your workflow, using a language rather than a Graphical User Interface (GUI), which may describe almost all the programs you’ve used before. This takes more time at first, but has some extraordinary advantages over working in a GUI:

  1. Your work is saved and open for inspection.
  2. If you discover an error, you can correct it without losing the work done after.
  3. If you want to amend your process (analyze a hundred books instead of ten, for instance) but do the same analysis, you can alter the code only slightly.
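
As a minimal sketch of the third point, here is what “altering the code only slightly” might look like. The toy corpus, the `count_words` helper, and the `n_books` setting are all invented for illustration; in real work the texts would be read from files.

```r
# Hypothetical toy corpus: three very short "books" stored as strings.
books <- c(
  "the whale swam and the ship sailed",
  "a ship is a ship",
  "the garden was quiet"
)

# One reusable step: count the words in a single text.
count_words <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]  # split on runs of whitespace
  length(words)
}

# Changing this one value reruns the same analysis on more (or fewer) books.
n_books <- 2

# Apply the same step to the first n_books texts.
word_counts <- vapply(
  books[seq_len(n_books)],
  count_words,
  integer(1),
  USE.NAMES = FALSE
)

print(word_counts)
```

Because the whole process is written down, moving from two books to three (or from ten to a hundred) only means changing `n_books`; nothing has to be redone by hand.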

Why R?

In this class, you will have to do some coding as well as just thinking about data analysis in the humanities. If you’ve never coded before, this will be frustrating from time to time. (In fact, if you’ve done a lot of coding before, it will still be frustrating!)

We’ll be working entirely in the “R” language, developed specifically for statistical computing. This has three main advantages for the sort of work that historians do:

  1. It is easy to download and install, through the program RStudio. This makes it easy to do “scripting,” rather than true programming, where you can test your results step by step. It also means that R takes the least time to get from raw data to pretty plots of anything this side of Excel. RStudio also offers a number of features that make it easier to explore data interactively.

  2. It has a set of packages we’ll be using for data analysis. These packages, whose names you will see scattered through this text, are ggplot2, tidyr, dplyr, and the like. These are not core R libraries, but they are widely used and offer the most intellectually coherent approach to data analysis and presentation of any computing framework in existence. That means that even if you don’t use these particular tools in the future, working with them should help you develop a more coherent way of thinking about what data is from the computational side, and what you as a humanist might be able to do with it. These tools are rooted in a long line of software based on making it easy for individuals to manipulate data: read the optional source on the history of database populism to see more. The ways of thinking you get from this will serve you well in thinking about relational databases, structured data for archives, and a welter of other sources.

  3. It is free: both “free as in beer,” and “free as in speech,” in the mantra of the Free Software Foundation. That means that it–like the rest of the peripheral tools we’ll talk about–won’t suddenly become inaccessible if you lose a university affiliation.

R vs. Python vs. JavaScript: which is the best language for humanities computing?

Different computer languages serve different purposes. If you have ever taken an introductory computer science course, you might have learned a different language, like Python, Java, C, or Lisp.

Although computing languages are equivalent in a certain, abstract sense, they each channel you towards thinking in particular ways. So when we

Which of these languages is best? It depends, obviously, on what you want to do. If you only learn a single language, there’s a strong argument that it should be Python, which is a widespread, Swiss-army-knife type language that can frequently run quite quickly. But Python generally promotes a kind of thinking about how you can get a problem done.

What R, especially tidyverse R, does best is let you abstract back from thinking about programming to thinking about data. It is built for exploratory data analysis, which operates on a particular base class, the “data frame.” We’ll talk about this more in Chapter 3; but the point is that it provides a coherent, basic language for describing any data set in terms of groupings, summary statistics, and visualization.
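
To make “groupings” and “summary statistics” concrete, here is a small sketch using dplyr. The data are invented for illustration (five imaginary books with a genre and a page count), and the column names are assumptions rather than anything used later in the course.

```r
library(dplyr)

# Hypothetical data frame: one row per (imaginary) book.
books <- tibble(
  genre = c("novel", "novel", "poetry", "poetry", "poetry"),
  pages = c(300, 500, 80, 120, 100)
)

# Describe the data in terms of groups and summaries:
# how many books are in each genre, and how long are they on average?
summary_by_genre <- books %>%
  group_by(genre) %>%
  summarize(
    n_books = n(),
    mean_pages = mean(pages)
  )

print(summary_by_genre)
```

The code reads almost as a description of the question: take the books, group them by genre, and summarize each group. Very little of it is about loops, indexes, or other programming machinery.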

The closest analogues to these in other languages are less elegant and less well thought out. Python has a widely used tool called pandas for analyzing data that is fast, powerful, and effective. But it is also more challenging for beginners than it need be. If you Google problems you’ll be confronted with a variety of problems.1

The place of pre-packaged software.

One thing you can’t do in this course, though, is rely on the out-of-the-box approaches prevalent in many DH programs. ArcGIS or QGIS may be the best way to make maps, and Gephi the best way to do network analysis. But as this is a course in data analysis, I want you to think about the fundamental operations of cartography and network analysis as simply subsets of a broader field, which is hard to see from within the confines of a single tool. All of these things are possible in R. And unlike graphical tools, working in a language saves your workflow. If you make a map with laboriously positioned points in ArcGIS, your operations aren’t open for inspection. In R, though, every step you take and every move you make can be preserved. This is called reproducible research, and it is among the most important contributions you can make when working collaboratively.

  1. Ten years ago, experts tended to pontificate that Python was better than R because it had a small standard library, cleaner syntax, and promoted a single way to do things effectively. One of the great ironies of modern data science is that, for programming with data, the situation has almost completely reversed; the pandas library presents a