Chapter 2 Working in a Programming Language

2.1 Different languages and humanities computing.

Different computer languages serve different purposes. If you have ever taken an introductory computer science course, you might have learned a different language, like Python, Java, C, or Lisp.

Although computing languages are equivalent in a certain, abstract sense, they each channel you towards thinking in particular ways. As I say in chapter 2, computers offer a variety of formal languages for describing things; each of these languages emphasizes a different thing.

Which of these languages is best? It depends on what you want to do. For creating rich, user-oriented experiences, Javascript and the open web are best.

What R–especially tidyverse R–does well is encourage you to move from thinking about programming to thinking about data. Exploratory data analysis in the tidyverse operates on a particular base class, the ‘dataframe’ or, in its tidyverse form, the ‘tibble.’ We’ll talk about this more in Chapter 3; but a dataframe represents a structured collection of data, much like an Excel spreadsheet or database table. This gives a coherent, basic framework for describing any data set. The things that you can do with a dataframe

If you only learn a single language, there’s a strong argument that it should be Python, which is a widespread language that can do anything and frequently runs quite quickly. If you want to learn to create code, Python is a better language.

But Python generally promotes a specific kind of thinking about how to solve a problem, one that revolves around thinking like a computer.

The closest analogues to the tidyverse in other languages are less elegant and less well thought out. Python has a widely used tool called pandas for analyzing data that is fast, powerful, and effective. But it is also more challenging for beginners than it need be. If you Google a problem, you’ll be confronted with a variety of different ways to solve it. Ten years ago, one big advantage of Python over R was that it had a small standard library, cleaner syntax, and promoted a single way to do things effectively. One of the great ironies of modern data science is that, for programming with data, the situation has almost completely reversed; pandas gives you a bewildering number of different ways to join data frames, to access their rows or columns, or to walk through the rows. The tidyverse does a better job enforcing a particular approach.

If you want to learn programming, there’s a good argument for learning Python, although if you just want to get things done, there’s an equally strong case for Javascript; and if you really want to understand computers, you should take a month learning to write in Haskell, or Lisp, or C.

2.2 The case for R.

In this class, you will have to do some coding as well as just thinking about data analysis in the humanities. If you’ve never coded before, this will be frustrating from time to time. (In fact, if you’ve done a lot of coding before, it will still be frustrating!)

We’ll be working entirely in the “R” language, developed specifically for statistical computing. This has three main advantages for the sort of work that historians do:

  1. It is easy to download and install, through the program RStudio. This makes it easy to do “scripting,” where you can test your results step by step, rather than true programming. It also means that R takes the least time to get from raw data to pretty plots of anything this side of Excel. RStudio also offers a number of features that make it easier to explore data interactively.

  2. It has a set of packages–the tidyverse (Wickham et al. 2019)–that are especially designed for teaching and introductory exploration. These packages, whose names you will see scattered through this text, are ggplot2, tidyr, dplyr, and the like. These are not core R libraries, but they are widely used and offer the most intellectually coherent approach to data analysis and presentation of any computing framework in existence. That means that even if you don’t use these particular tools in the future, working with them should help you develop a more coherent way of thinking about what data is from the computational side, and what you as a humanist might be able to do with it. These tools are rooted in a long line of software based on making it easy for individuals to manipulate data: read the optional source on the history of database populism to see more. The ways of thinking you get from this will serve you well in thinking about relational databases, structured data for archives, and a welter of other sources.

  3. It is free: both “free as in beer,” and “free as in speech,” in the mantra of the Free Software Foundation. That means that it–like the rest of the peripheral tools we’ll talk about–won’t suddenly become inaccessible if you lose a university affiliation.

Every computer language is an accretion of cultural history; knowing a little bit about R’s history will help you to understand what’s happening in it.

R, the programming language, has roots that date back to the 1970s. During the heyday of Bell Labs in that decade, researchers built a variety of tools for working with different computer systems, including the language C, which has influenced most low-level program design since, and the operating system Unix, which provides the foundation for many modern computing systems from Apple laptops to Amazon servers to Android phones.

John Chambers developed a language called “S” at Bell Labs with several goals that continue to influence the language’s design. One was to provide a way to use, in a more human notation, the blazingly fast linear algebra routines that undergird all sorts of work in math, statistics, and visualization. Another was to facilitate more sophisticated, exploratory data visualization.

In the 1990s, two statisticians in New Zealand, Ross Ihaka and Robert Gentleman, created an open source version of S called “R” that could be freely distributed without worrying about AT&T’s old patents. By virtue of being free, that language has slowly displaced Stata and SPSS, the other major statistical computing environments of the 1980s and 1990s.

2.3 The case for Python

If you plan to learn only one computer language in your life, there’s a good case right now for having it be Python. Python is a clean, general-purpose language with wide support. You’ll find more online courses, more answers on forums, and more neighbors down the hall who have some experience in Python than in almost any other language.

It is also a fairly well-designed language: it was invented in the early 1990s as the problems with an earlier generation of scripting languages started to become clear, and its designer, Guido van Rossum, made efforts to make it both easily usable (unlike faster languages such as C++ or Java) and relatively robust against programming errors (unlike other “easy to use” languages like Perl and Javascript).

It was originally built as a systems language, and its capacity for high-powered data processing has largely arisen since about 2005. The Python data ecosystem is large, but at its heart is a scientific computing module called numpy, which makes it quite fast for large scale mathematical operations. The pandas package, built on top of numpy, allows for a variety of fast operations on tabular data. pandas offers many of the same functionalities as the tidyverse packages in R, and was originally designed to replicate the functionality of a similar R package from the late 2000s. Because numpy has more granular control over low-level datatypes, it can be easier to use Python if you care deeply about–say–making sure that a number that will never exceed 255 in one of your fields only takes a single byte of storage. (In R, it might easily take 8.) While the pandas API is not especially intuitive, it’s easy to find examples of most queries online.
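The storage point above (one byte versus eight) can be checked from the R side. The sketch below is an illustration rather than a benchmark: it stores the same one million small values in three of R’s built-in vector types and compares their size in memory with object.size(). The variable names are made up for the example.

```r
# One million values that all fit between 0 and 255.
n <- 1e6
values <- rep(c(0, 127, 255), length.out = n)

as_double  <- as.numeric(values)  # R's default numeric type: 8 bytes per value
as_integer <- as.integer(values)  # integer vectors: 4 bytes per value
as_raw     <- as.raw(values)      # raw vectors: 1 byte per value

# Approximate bytes per value (each vector also carries a small fixed header).
bytes_per_value <- c(
  double  = as.numeric(object.size(as_double)) / n,
  integer = as.numeric(object.size(as_integer)) / n,
  raw     = as.numeric(object.size(as_raw)) / n
)
print(round(bytes_per_value))
```

In everyday R you would rarely reach for raw vectors; the point is only that the default numeric type spends eight bytes on each value.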

Data visualization in python has historically been dominated by the mediocre matplotlib package, but a number of grammar of graphics implementations have recently been released; the one I recommend is called altair, and it generates clean HTML charts that you can copy into a presentation or embed, live, into the web. Python has an excellent geocomputation library called ‘geopandas.’

Python has been expanding its reach for much of the last decade. While many statisticians, journalists, and political scientists use R, Python is more dominant in newer fields. The most important of these is machine learning; while it is possible to invoke advanced neural network models from R, there is far more support, and there are far more introductory tutorials, for doing so from Python.

If you work heavily with raw data from APIs, Python’s native datatypes are closer to the JSON that most organizations provide, which can make it easier to parse the data yourself.

Just as R has R Markdown notebooks in RStudio, Python has a widely used notebook format: Jupyter notebooks. In my experience, beginners find Jupyter notebooks slightly easier to use than R Markdown notebooks; but on the other hand, they are slightly less amenable to being turned into print documents.

2.4 The case for Javascript

If Python is the language that the most people know, Javascript is the one that the most people can use. If you want to write code that can be deployed anywhere, Javascript is the native language of the web; while Python and R both have strategies for running in emulation or containerized systems that make it possible to edit notebooks online, neither really compares to JS.

2.5 The place of GUIs

One thing you can’t do in this course, though, is rely on out-of-the-box tools, where a single tool fits each problem. ArcGIS or QGIS may be the best way to make maps, and Gephi the best way to do network analysis. But as this is a course in data analysis, you should think about the fundamental operations of cartography and network analysis as simply subsets of a broader field, which is hard to see from the confines of any one tool. All of these things are possible in R, and by seeing them as facets of a broader activity, you’ll develop transferable skills and insights.

Also unlike graphical tools, working in a language saves your workflow. If you make a map with laboriously positioned points in ArcGIS, you may have a beautiful final project, but you can’t reproduce exactly how it happened. In R, though, every step you take and every move you make can be preserved. This is called reproducible research, and it is among the most important contributions you can make when working collaboratively.

2.6 Packages

R is a modular piece of software. The base language allows you to apply most standard statistical methods from the 1990s, but in general the power of the language comes from extensions that others have written. This means that even after installing R, you’ll frequently have to install additional “packages.” The ability to do so is contained inside R itself; you can think of packages as living inside your local copy of R.

There is also one program worth installing that lives outside, called “RStudio.” It provides you with an environment to work in. Some features it provides that are especially useful for students are:

  1. The ability to see loaded data on the right of the screen, and to click to view it.
  2. An interface that keeps plots alongside code.
  3. Tools for automatically checking your code, such as tab-completion that can often guess what you’re trying to type after a few letters.

But the most important things RStudio does for you are to make it easier to work in projects and to use literate programming.

2.6.1 Projects

Projects in RStudio are basically folders devoted to a specific task. What they do is enforce a discipline on you that can be hard to stick to otherwise, which is that you should always keep all data related to a specific data analysis project in the same place unless you have a very good reason not to. I would recommend having a few different projects for this class. One should be called “Problem Sets;” you can use the download_problem_sets() function in the course package to fill it with sets, and then use the Files tab in RStudio to open the latest ones. Others will focus on more specific tasks.

Projects also automatically save all of your data from session to session so that you can quit R and have all your data there when you open it back up again a week later. This can be useful, but it can also be something

You should always be in a project. Make a folder on your computer and store related items inside of it. Put all the data associated with a project in the same folder. Make sure you know where it is.
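One practical payoff of keeping everything in a single folder is that code can refer to files by short relative paths. The sketch below assumes a made-up project folder and file name, created in a temporary directory so that it does not touch your real files; in RStudio, opening a project sets the working directory for you, so the setwd() calls here only stand in for that.

```r
# Create a throwaway "project" folder with a data subfolder (names are hypothetical).
project_dir <- file.path(tempdir(), "example_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Save a tiny data set inside the project.
write.csv(
  data.frame(id = 1:3, word = c("alpha", "beta", "gamma")),
  file.path(project_dir, "data", "example.csv"),
  row.names = FALSE
)

# Stand-in for opening the project in RStudio: make it the working directory.
old_dir <- setwd(project_dir)

# Because the data lives inside the project, a relative path is enough.
example <- read.csv(file.path("data", "example.csv"))
print(example)

# Go back to wherever we started.
setwd(old_dir)
```

If the folder is later moved or shared, the relative path still works, which is not true of a path that begins at the top of your own hard drive.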

2.6.2 Literate Programming

R is designed to be interactive. There is a prompt (marked ‘console’) at the bottom of your screen into which you can always enter any expression. While some languages expect you to write a program in a text file, in R it is very normal to work back and forth with your data, entering one command at a time.
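For instance, each of the following lines could be typed at the console one at a time, with R responding after each. The numbers and names are invented purely for illustration.

```r
# Store a few values under a name.
page_counts <- c(212, 340, 198, 455)

# Ask a question about them; R prints the answer immediately.
mean(page_counts)

# Build on what is already in memory and ask again.
longest <- max(page_counts)
longest
```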

While you can type directly into the prompt, any good data analysis should use a file so you can correct and save your work. R makes heavy use of a paradigm called “literate programming,” in which code and full text are intermixed into so-called ‘notebooks.’

These are files that end with the suffix ‘.Rmd,’ which stands for “R Markdown.” Markdown is a simple way of typing text that allows for minimal amounts of formatting (such as italics, numbered lists, and so forth) using a style derived from the way people often type plain-text e-mails. The standard represents a word in **bold** with two asterisks on either side, a block quotation with each line prefaced by a >, and so on.

Markdown allows for “code blocks” surrounded by three backticks. In R markdown documents, everything is text unless you explicitly mark it as code; but when you do, there are a variety of ways to run it.

```{R}
analyze(data)
```

2.6.3 The Tidyverse

One set of packages bears particular emphasis: the tidyverse. It is overseen by Hadley Wickham, who was for a time a statistics professor at Rice University and more recently has become the chief data scientist at RStudio. The tidyverse provides a different syntax for R.
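To give a rough sense of what “a different syntax” means, the sketch below answers the same question twice, once in base R and once with the dplyr package from the tidyverse. It assumes dplyr is already installed; the small table of novels is typed in by hand for the example and is not one of the course data sets.

```r
library(dplyr)

# A small hand-typed table of novels and their first publication years.
novels <- tibble(
  title  = c("Sense and Sensibility", "Pride and Prejudice", "Typee", "Moby-Dick"),
  author = c("Austen", "Austen", "Melville", "Melville"),
  year   = c(1811, 1813, 1846, 1851)
)

# Base R: earliest year for each author, using a formula and a function argument.
base_result <- aggregate(year ~ author, data = novels, FUN = min)

# Tidyverse: the same question as a chain of verbs, read top to bottom.
tidy_result <- novels %>%
  group_by(author) %>%
  summarize(first_year = min(year))

print(base_result)
print(tidy_result)
```

Both versions produce one row per author; the difference is in how the request is phrased, with the tidyverse version naming each step (group, then summarize) in the order it happens.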

2.6.4 Installing from CRAN

R packages can come from two sources. The first, easier one is internal to the R ecosystem and called “CRAN” (The ‘Comprehensive R Archive Network’). CRAN sets the highest bar on what packages are available.

To install from CRAN, use the function install.packages or the “Packages” pane in RStudio.

install.packages("tidyverse")
install.packages("remotes")
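Installing a package copies it onto your computer once; you still need to load it with library() in each new session. A common pattern, sketched below, is to install only when a package is missing. The helper name ensure_package is made up for this example and is not part of R or of the course package.

```r
# Hypothetical helper: install a package from CRAN only if it is not already there.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available now.
  available <- requireNamespace(pkg, quietly = TRUE)
  available
}

# "stats" ships with R, so this prints TRUE without installing anything.
print(ensure_package("stats"))

# In a real session you might then write, for example:
# ensure_package("tidyverse")
# library(tidyverse)
```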

2.6.5 Installing from github

But it can be a great deal of work to make a package fit into CRAN–the maintainers are famously fastidious about the standards a package must meet to be included. (The digital humanist Matt Lincoln has a blog post about how an obscure feature of certain operating systems nearly broke not just his clipboard package, but all sorts of other packages that depend on it; see Lincoln (2019).) Frequently you’ll want to install packages from outside sources; the most common is Github, a website owned by Microsoft that distributes code using the open source ‘git’ standard.

2.6.6 The course package

This course itself uses an R package to manage information. You can install it using the following lines of code.

The second line will also reinstall the package, which we’ll probably do periodically in the semester.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("HumanitiesDataAnalysis/HumanitiesDataAnalysis")

Once installed, you can also update by typing update_HDA() at the R prompt.

The course package contains four things:

  1. Sample data sets we’ll be working with
  2. Code to make it easier to work with the class by, for example, downloading problem sets to your computer.
  3. Code to streamline approaches that we’ve already learned that aren’t easily expressed in other packages.
  4. A list of ‘dependencies’ that will automatically install other packages you need.

2.7 Troubleshooting Guide

When you have trouble running code, there are a few questions you should ask first.

  1. Does RStudio know you’re writing code?
    • Do you have a code block that’s grey with white text on either side?
    • Is there a ‘run’ button on your cell?
    • If not, you probably have a formatting problem. Make sure the chunk has three backticks at the top and bottom; consider going up to the Insert button and inserting a fresh chunk.
  2. Read the error messages.
    • Is there something missing? For instance, have you run library(tidyverse) at the top of your code?
  3. Are you spelling everything right and closing all your punctuation?
    • It’s easy to lose track of how many open parentheses you have.
  4. Can you restart and start over from the beginning? Sometimes you’ll be relying on some changed piece of code you’ve forgotten about.
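Two of the problems above, a missing package and unclosed punctuation, produce recognizable error messages. The sketch below triggers each one on purpose and catches the message with tryCatch() so that you can see what to look for; the function name is deliberately one that does not exist.

```r
# 1. Calling a function that R cannot find, as happens when its package
#    has not been loaded with library().
missing_function_message <- tryCatch(
  a_function_that_does_not_exist(1),
  error = function(e) conditionMessage(e)
)
print(missing_function_message)

# 2. Code with an unclosed parenthesis. parse() reads the text as R code
#    without running it, and fails because the expression never ends.
unclosed_message <- tryCatch(
  parse(text = "mean(c(1, 2, 3)"),
  error = function(e) conditionMessage(e)
)
print(unclosed_message)
```

The first message names the function R could not find, which usually points to a missing library() call or a misspelling; the second complains that the input ended unexpectedly, which usually points to a parenthesis or quotation mark left open.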

2.8 Exercises: Creating your first project

Getting started is the hardest thing, because it requires understanding–to some degree–this entire software ‘stack.’ Here’s what you should do once RStudio is running.

  1. Type the following into the prompt to install the latest version of this package. These updates are important.
remotes::install_github("HumanitiesDataAnalysis/HumanitiesDataAnalysis", upgrade = FALSE)
?remotes::install_github
  2. Type library(HumanitiesDataAnalysis) to actually load the package.

  3. Create a new project for problem sets in a folder on your computer.

  4. Type download_problem_sets() into the console prompt to download the sets.

  5. Start editing the code in the first problem set and run it using the green arrow buttons.