Chapter 6 Cleaning Data

This chapter talks about cleaning data. It also introduces one fundamental element of R that we have mostly avoided to this point: assignment.

New tibble functions:

  • mutate
  • pull

New column functions:

  • as.numeric, as.character, str_extract, str_replace,

New concepts: Variable assignment in the general R space; variable assignment inside mutate calls.

6.1 Overview

For humanities data analysis (and most other types, for that matter), a huge proportion of you effort will go into cleaning rather than analyzing data. Cleaning can be itself be an important act. In Trevor Munoz and Katie Rawson’s essay Against Cleaning, they give a compelling rationale for why you must attend to the ways that regarding cleaning as easy or straightforward can give you less of an understanding of your information.

That said, you will invariably have to do something to your data to productively analyze, visualize, or understand it.

Most of the work of cleaning data is necessarily irregular. Different types of data all have their own problems. Often, the problem of ‘cleaning’ is really the problem of tracing a set of errors back to their sources. Here are some examples of data sanitation that I’ve encountered and had to clean over the years:

  1. A government dataset about asylum that spells “torture” as “toture” in one of the fields.
  2. A large number of ship’s logs records that give dates of February 30, September 31, and occasionally well into the future.
  3. A government dataset of statistics from the 1970s that was entirely unreadable until it became clear it used an almost-extinct character encoding scheme for IBM mainframes called EBCDIC.

There is no more a general set of rules for data cleaning than a general set of rules for editing manuscripts. Part of cleaning is understanding your own dataset. Another part is understanding your own analytic goals.

6.1.1 Common data cleaning issues.

6.1.1.1 NA values

Sometimes, data will be missing. In the crews data set, we see this for many fields. Inside R, missing data is represented with with a special indication as <NA> (for “not available”).

How missing data is encoded may change. Some sets use empty strings; others use the letters “NA”; others use words like “None” or “Null.” And there are different types of unknowns; sometimes you’ll find a dataset where someone has written in “unknown” or “unclear,” which is not the same as missing data in some cases but is the same in others. By default (as you can see by typing ?read_csv into R or looking in the help pane), read_csv uses a value of c("", "NA"), which means that values which are empty ("") or which are literally the string NA will turned into <NA> values. Sometimes this will be wrong–for instance, NA is a perfectly valuable Library of Congress Classification. And sometimes you may want to specify that things are missing by adding new strings to the list of missing values.

It’s also sometimes necessary to work around <NA> values. You’ll see this a number of times in the code in this book. For instance, if you want to take the mean height in the crews dataset, you get “NA” because some heights are not present. There are two ways to deal with this.

One is to apply a filter that uses the function is.na. We can check each element to ensure that the function is.na says “No!” – technically, FALSE– and keep them in the filter if they do. There’s a slightly faster way, which is to use an exclamation point at the beginning of that line. Just as != means “does not equal,” ! here too is a shorthand for “not.” So you read this simply as Not, is missing, height.

crews %>% filter(is.na(height) == FALSE)
# (Equivalently, but faster to type)
crews %>% filter(!is.na(height))

Another way is use specific functions tactics for handling NA values. For instance, when taking an average, we can say na.rm = TRUE to remove NA values. In practice, this is a little bit harder, I think.

crews %>% summarize(height = mean(height, na.rm=TRUE))

6.1.1.2 Character encoding

If you’re working with new data, it’s probably in the Unicode standard. But if it’s older, characters from outside (things like curly quotes created on a Windows machine in 2003) may cause real problems. If you find yourself working with a file in a different character encoding, R generally allows you to read the file in by specifying the name of the standard. The precise instructions for doing so involve ‘locales.’ Extensive discussion of this is available in the readr documentation.

6.1.1.3 Record formats

Even when you’re given perfectly formatted data, you may experience some trouble reading it in.

In R, most problems will arise from conventions around commenting and quoting files. When you use read_table and encounter errors, most often you want to adjust the arguments to comment and quote characters, which instructs the parser to ignore the “#” sign and quotation marks. Note that some files do use quotation marks as a “quote” character, such as the New Bedford whaling data we looked at in class.

Even the simplest of characters can cause problems. Sometimes, you may find that the end-of-line files aren’t behaving as you would expect. This has to do with the so-called “carriage return” character (called \r in regular expressions, as opposed to \n, the standard newline.

History moment: the reason for the peculiar behavior of the carriage return and newline has to do with the typewriter operations underlying modern computing: the \r symbol was supposed to move the typewriter head to the beginning of the line, while the \n newline character pushes the typewriter head down. In practice, \n is sufficient to do both; there’s actually a third character, the formfeed (\f), that usually takes on the job of dropping the cursor down a line. You will almost never see it in use, but it can be handy to insert from time to time if you need your own record breaks.

6.1.1.4 Inconsistent category labels

This is the biggest one, and one that you’ll encounter in the whaling log data. It gets to the distinction between datasets of strings and datasets of categories. Carefully curated datasets will always use exactly the same labels for categorical labels; but real-world sets often will not. Labels may be inconsistently capitalized–the whaling set uses “Rig” and “rig” for the same type of boat. Labels may occasionally be mispelled.

6.1.1.5 Date-time and other “type” formats

What is the data type for a year? You might reasonably expect it to be a number. But in fact, you’ll often find that dates are strings.

In converting messy dates, you’ll frequently have to use regular expressions to perform these coercsions.

When creating data yourself, you should use ISO 8601, which uses a “year-month-day” format. Countries around the world differ about whether Pearl Harbor happened on 12/7/41 or 7/12/41; the only sound response if you are creating data is to use 1945-12-07 as your format. The full ISO format for times includes ways of representing dates for which only years or months exist.

6.2 Data cleaning

6.2.1 Reading files.

Depending on how the data you wish to read in is structured, you will typically use one of three functions in R to read it in.

How do you read in data? Generally, the information you want will be either on your hard drive (as when you format it) or on the Internet. For our first example, we’ll be looking at a well-formatted CSV online.

There’s some good descriptive data about people, which suggests a chance for something about bodies–measurements, physical descriptions, and ages all have interesting interplays. That will be particularly valuable if we can tie it in to some other sorts of information. Before I get into that, there are couple variables that I just want to see fuller counts on: table() in R gives the best way to do that. I’m interested in names because I could link them up to census information and because they provide some clues to ethnicity.

6.2.1.1 Reading tables and constructive failure.

Often, if there is something even slightly askew about your input data, read_csv will fail. This may be frustrating. Try to be grateful, though, instead. This failure is the first aspect of something we’ll encounter many times in this class that is a general feature of data analysis: programs that don’t receive exactly the input they expect will simply fail to work, usually “throwing” an error message of some sort.

Here, we’ll look a bit more at the original version of the crewlist dataset we explored two weeks ago.

You can read it on your computer by typing the following code.

library("tidyverse")

read_csv("http://www.whalingmuseum.org/online_exhibits/crewlist/crewlist.csv")

If you run this, you will see some alarming text in red. You can read it or try to understand; but the basic problem is that there are thousands of parsing failures. The reason has to do with the data types we’ve been talking about. read_csv will automatically guess at data types; but it turns out that it’s getting some of these wrong because:

  1. It doesn’t see any eye colors until thousands of entries in;
  2. A few ship IDs have letters in them, but it initially learns to read them as numbers.

The first part of our cleaning is just to work around this by telling it to look farther in. As with any function, ?read_csv at the command prompt will give you instructions.

crews = read_csv("http://www.whalingmuseum.org/online_exhibits/crewlist/crewlist.csv", guess_max = 1000000)

Now we can inspect it. Going to the “Environment” pane in the upper right-hand corner, you can see what’s in here.

Where did this data come from? Did they do a good job transcribing? Here’s an example of the original source

R is composed of functions: each of these apply on an object. Each is, essentially, a little program of its own: you can run it on data, and see what happens.

In tidyverse R, the most common data structure is a tibble; it’s essentially a table where the rows correspond to observations, and the columns refer to variables. (Tidyverse R borrows this from the incredibly similarly named ‘data.frame’ object that R has had since its birth. There are basically interchangeable, but using the modern versions helps you avoid a few pitfalls.) It resembles a spreadsheet or database table, but every datapoint has a type.

In this data set, as you’ll see, each row corresponds to an individual crew member, and the columns give information about him, such as the ship he sailed on his, his name, his rank, and so forth.

To look at it, we can start with the simplest program: simply looking at the thing. Type its name and press the green button.

crews

The first thing to do with a new data source is often to run summary, which figures out what the different columns in your database are and gives appropriate descriptions of the types of data in each. For numbers, it gives averages; for categorical data (called ‘factors’) in R, it lists the most common elements.

summary(crews)

6.2.1.2 Other formats

There are many different libraries out there for reading data. R is blessed with many packages for importing them. You can read excel files using ‘readxl,’ for example. But for many purposes, it may be simpler go into Excel and click “save as CSV” to write to a more standard form.

6.2.2 Cleaning Data.

There are some obvious problems with the “crews” data we need to fix for analysis.

For example. For example, in the raw version of this dataset:

  1. ‘Height’ is represented as a string like 5' 2 1/2", with feet and inches broken out;
  2. ‘Age’ has been read as a string, not a number;
  3. ‘Date’ has been read as a string, so we can’t plot–for example–years.

An example: cleaning years, and assignment. Suppose we want to extract the years and turn them into integers.

6.2.2.1 Using mutate and str_replace to clean data.

Note that the date doesn’t follow a standard form. We’re going to use regular expressions again to clean it up. Notice the steps in the chain here. We’re creating a new column called “year.” Then, we feed the output into the function str_replace, which lets us use a regular expression.

Remember the function str_replace: it is among the most useful tools for data cleaning in R you will encounter. It lets you use the full power of regular expressions for find-replace operations. This regular expression is complicated–you may have to refer back to your sheet to see what’s going on. Note in particular that the parenthesis are performing a grouping operation. str_replace is a substitution function, so you have to tell it what to replace it with. In this case, the escaped phrase \\1 tells R to substitute with the first matched group. (That’s what \\1 means: \\2 will match the second matched group, and so forth.)

So: suppose we want to pull ‘year’ out as a number.

The tidyverse has a function mutate that changes a dataframe: it either adds a new column, or changes an existing one.

This gets a little complicated, so let’s break it down.

pull extracts a column so you can look at it more closely. Let’s do that with our ApproximateDeparture column.

crews %>%
  pull(ApproximateDeparture) %>%
  head(10)

Let’s build up a tiny program that extracts the year from this. The first step is to use regular expressions to get to just the year part of the expression. A simple way to do that here is to use the .* regex operator to replace everything up to and including a slash with nothing (the empty string, "").

crews %>%
  pull(ApproximateDeparture) %>%
  str_replace(".*/", "") %>%
  head(10)

We’re not done yet, though, because these are strings and years should be integers. The function as.numeric does this for us. (There’s an inverse function, as.character, that can turn a number into a string.)

crews %>%
  pull(ApproximateDeparture) %>%
  str_replace(".*/", "") %>%
  as.numeric() %>%
  head(10)

Success! But while this program runs, it doesn’t actually change our original frame. To do so, we need to do assignment.

6.3 Assignment in R.

Assignment is a fundamental feature of all languages. To this point, we’ve basically been working with

Advanced note: in the early history of R, you could not in fact assign a variable by using the = sign. Instead, you would build an arrow out of two characters: president <- "Washington" This form of assignment is still used in R. If you’re the sort of writer who sometimes starts sentences without having figured out the final clause, it can even be handy, because they can point in both directions; you can end a long program by saying something like. "Washington" -> president Know to recognize this code when it appears. But most computer languages use = for assignment and == for equality, and so R now follows suit. Inside tibble calls, you use = to assign; I find it simpler to just always use = when assignming. Super-esoteric note: There is actually a third assignment operator, which introductory programmers should never use: the <<- assigner, which assigns to a variable created outside of a function.7

The actual assignment takes two steps. One says, inside a mutate call, to set year equal to the output of a string function; the other reassigns the variable crews to equal this new frame, with its new column.

crews = crews %>% mutate(
  year = ApproximateDeparture %>% str_replace(".*/", "") %>% as.numeric()
)

6.3.1 Cleaning using other functions

In practice, you often do not need to get so far into the nitty-gritty. I do not know of an R function that would translate feet and inches, but for age and date, we can do this all much more simply.

To turn a string into a number is quite simple; it just takes the function as.numeric, which converts between types. Dates are more complicated. A date is a base data type in R, just like strings and numbers, but there are many different ways of specifying them. So we use the function parse_date with a special string that describes the format: month-day-year.

This is another formal language, of course, a tiny little one that specifies data formats. “%m/%d/%Y” means month-day-year (12/7/1945); “%Y-%m-%d” means (1945-12-7). Are you supposed to memorize all these letters? Of course not. Instead, you just need to know that there’s a function for date parsing, and then use the ‘help’ documentation in R studio to read about how a particular conversion function works.

crews = crews %>% mutate(
  Age = Age %>% as.numeric(),
  date = ApproximateDeparture %>% parse_date("%m/%d/%Y")
)

Getting Initials.

library(HumanitiesDataAnalysis)
library(tidyverse)
crews %>% 
  mutate(
    first_initial = str_extract(FirstName, ".."),
    second_initial = str_extract(LastName, ".."),
    initials = str_c(first_initial, " ", second_initial)
  ) %>%
  count(initials) %>%
  arrange(-n)

6.4 Exercises: Cleaning and Tidying data

Download these problems

6.4.1 Assigning

  1. The crews database has columns for feet and inches, but no column for ‘height.’ Make one, where height is measured in feet, inches, or something else.
crews %>% mutate(
  KEEP_GOING
) 
  1. The above code prints to the screen, but doesn’t change the ‘crews.’ Make it so that that assignment happens to a new dataframe.

6.4.2 Cleaning Code

The following code chunks are all broken. You may have to run individual lines. Rather than copying and pasting, remember that the easiest way to run short sets of code in R is to highlight and then hit “control-enter” to force it to run.

To give three at the end:

x == 2
x + 1

Fix the following line of code so it gives (2,3,4)

c((1, 2, 3) + 1

Fix this line so that ‘count’ equals three.

count = 1 + "2"

6.4.3

  1. Is the data you read in under shiptypes tidy?

  2. (Hard, skippable) In the crews dataset bundled with the package, I’ve added ‘feet’ and ‘inches.’ Here’s the code that I used to do it. Two hard questions:

    • Can you describe what is going on in each line of this?
    • Can you find any examples of heights that this fails to detect? (One way would be using an ‘is.na()’ filter)
crews = crews %>% mutate(
  feet = Height %>%
    str_replace(" ?ft.? ", "'") %>%
    str_replace("[',] ?.*", "") %>%
    as.numeric(),
  inches = Height %>%
    str_replace(" ?ft.? ", "'") %>%
    str_replace("[0-9]'? ?(1?[0-9]).*", "\\1") %>%
    as.numeric()
)
  1. Using the pivot_wider functions from the tidyr package, create a new data.frame that has rows corresponding to ship types and columns corresponding to years. (That is, the exact inverse of the data.frame you saw before.)
crews %>%
  mutate(year = lubridate::year(date)) %>%
  count(Rig, year) %>%
  KEEP_GOING()
  1. There’s another dataset. It was created by the Center for Spatial and Textual Analysis at Stanford. Read it in. We’ll be working with this data set more next week.
CESTA %>% head()
  1. This data is clearly not tidy. (Why not?) Use pivot_longer to turn it into tidy data, and save it as a data.frame called “tidied.” (Hint: the final frame will have a column called “year.” You may have to use str_replace to remove Xs from the year name.)
  1. Use the ? command or the help pane to read about the function write_csv. Save the result to disk: we’ll be exploring it more next week.
tidied = CESTA %>%
  pivot_longer(`1790`:`2010`, names_to = 'year', values_to = 'population') %>%
  write_csv("tidied.csv")