Coding and Scripting in R
This course will have you writing some code in the R language. There is an extensive debate about whether digital humanists need to learn to code, which we’re not going to engage in. The fact of the matter is simply this: if you want to do data analysis in the humanities, coding will often be the only way to realize your personal vision; and if you want to build resources in the humanities that others might want to analyze, you’ll need to know what sophisticated users want to do with your tools to make them work for them.
I have no expectation that anyone will come out of this a full-fledged developer. In fact, I hope that by doing some actual scripting, you’ll come to see that these debates over learning to code brush over a lot of intermediate stages. We’ll be focusing in particular on developing skills not in full-fledged “programming” but in “scripting.” That means instructing a computer in every stage of your workflow, using a language rather than a Graphical User Interface (GUI), which may be how you’ve used almost every program before. This takes more time at first, but has extraordinary advantages over working in a GUI:
- Your work is saved and open for inspection.
- If you discover an error, you can correct it without losing the work done after.
- If you want to amend your process (analyze a hundred books instead of ten, for instance) but do the same analysis, you can alter the code only slightly.
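As a minimal sketch of that last point, here is a short script in base R that runs one analysis over a handful of texts. The `books` vector and the `count_words` helper are hypothetical stand-ins invented for illustration, not course materials; in real work the texts would be read from files. The thing to notice is that going from three books to a hundred would mean changing only the lines that define the input.

```r
# Hypothetical stand-in data: in real work these strings would be read from files.
books <- c(
  emma  = "Emma Woodhouse, handsome, clever, and rich",
  moby  = "Call me Ishmael",
  bleak = "London. Michaelmas term lately over"
)

# Hypothetical helper: count the whitespace-separated words in one text.
count_words <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  length(words[words != ""])
}

# The same analysis is applied to every book, however many there are.
word_counts <- vapply(books, count_words, integer(1))

# Collect the results in a data frame that can be inspected or saved.
results <- data.frame(
  book  = names(word_counts),
  words = unname(word_counts)
)
print(results)
```

Because every step is written down, fixing a mistake in `count_words` and rerunning the script redoes everything that depends on it.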
In this class, you will have to do some coding as well as just thinking about data analysis in the humanities. If you’ve never coded before, this will be frustrating from time to time. (In fact, if you’ve done a lot of coding before, it will still be frustrating!)
We’ll be working entirely in the “R” language, developed specifically for statistical computing. This has three main advantages for the sort of work that historians do:
It is easy to download and install, through the program RStudio. This makes it easy to do “scripting” rather than true programming: you can test your results step by step. It also means that R takes the least time to get from raw data to pretty plots of anything this side of Excel. RStudio also offers a number of features that make it easier to explore data interactively.
It has a set of packages we’ll be using for data analysis, including ggplot2. These are not core R libraries, but they are widely used and offer the most intellectually coherent approach to data analysis and presentation of any computing framework in existence. That means that even if you don’t use these particular tools in the future, working with them should help you develop a more coherent way of thinking about what data is from the computational side, and what you as a humanist might be able to do with it. These tools are rooted in a long line of software based on making it easy for individuals to manipulate data: read the optional source on the history of database populism to see more. The ways of thinking you get from this will serve you well in thinking about relational databases, structured data for archives, and a welter of other sources.
It is free: both “free as in beer” and “free as in speech,” in the mantra of the Free Software Foundation. That means that it, like the rest of the peripheral tools we’ll talk about, won’t suddenly become inaccessible if you lose a university affiliation.
The place of other languages.
For those reasons, I’ll expect you to do most run-of-the-mill assignments using R. But if for some reason you have a strong desire to use a different language for parts of your analysis, talk to me about when it might be appropriate. Many scientists now use Julia for high-powered machine learning. A certain distinction will attach to the first person in the humanities to use it for their work; maybe it will be you! You may have learned Stata if you took undergrad Econ classes, and I myself find Python indispensable for working with texts, although I prefer R for the data analysis tasks we’re doing here. If you think one of these or another language you know will be appropriate for parts of the class, let me know.
The place of pre-packaged software.
One thing you can’t do in this course, though, is rely on the out-of-the-box approaches prevalent in many DH programs. ArcGIS or QGIS may be the best way to make maps, and Gephi the best way to do network analysis. But as this is a course in data analysis, I want you to think about the fundamental operations of cartography and network analysis as simply subsets of a broader field, which is hard to see from the confines of a single-purpose tool. All of these things are possible in R. And unlike graphical tools, working in a language saves your workflow. If you make a map with laboriously positioned points in ArcGIS, your operations aren’t open for inspection. In R, though, every step you take and every move you make can be preserved. This is called reproducible research, and it is among the most important contributions you can make when working collaboratively.
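To make the idea of network analysis as ordinary data analysis a little more concrete, here is a minimal sketch in base R. The `letters_sent` table and the names in it are invented for illustration. The point is only that a network can be stored as a plain table of edges, at which point a basic measure like degree (how many connections each person has) becomes a counting operation.

```r
# Hypothetical edge list: each row is one letter sent between two correspondents.
letters_sent <- data.frame(
  from = c("Ada", "Ada", "Ben", "Cleo"),
  to   = c("Ben", "Cleo", "Cleo", "Ada"),
  stringsAsFactors = FALSE
)

# Degree is a count of how often each name appears at either end of an edge.
degree <- table(c(letters_sent$from, letters_sent$to))

# Sort so the most-connected correspondents come first.
degree <- sort(degree, decreasing = TRUE)
print(degree)
```

Since the whole analysis lives in the script, a collaborator can rerun it, check each step, or swap in a different edge list without redoing anything by hand.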