Chapter 4 Counting things
Summary
This chapter talks about how to count things, individually and in aggregate, using a pipeline strategy where dataframes are (optionally) grouped, then summarized, then combined. This is an essential strategy of many data analysis platforms. In R it is represented by the following functions, all of which you should know:
(On files)

- `read_csv`

(On data frames)

- `filter`
- `arrange`
- `summarize`
- `count`
- `group_by`
- `mutate`
- `sample_n` and `sample_frac`
- `drop_na`

(Inside summarize/mutate calls)

- `sum`
- `n()`
- `mean`
4.1 The power of counting
Now that we know what data is, we are finally ready to analyze it.
Data “analysis,” of course, can mean many different things. That’s the whole point of this class! The analyses we’ll conduct to begin with are extremely simple; they mostly involve counting, grouping, and summarizing. Later in this course we’ll come to more complicated operations. But in fact, simple counting algorithms have major advantages:

- You can understand them. There is a “black box” quality to many of the more advanced tools we’ll be looking at later in the course; summary statistics, by contrast, are much easier to correct and change on the fly. They use, for the most part, math you learned in elementary school.
- Your readers can understand them. It is a burden to have to spend five minutes explaining your algorithms at every talk, and when your work makes a field-specific contribution, you want scholars to focus on your argument and evidence, not on whether they trust your algorithm.
4.2 The pipeline strategy for exploratory data analysis.
Our core strategy will be a pipeline strategy. The idea is that a single table object is passed through a series of saved steps until it gives you useful results. You will sometimes want to save the results of your work, but most of the time you will just want to look at the output.
dplyr
exposes a number of functions that make this easy. But although they are more coherently bundled in this package, they are shared by a wide variety of data manipulation software. In particular, they closely resemble the operations of SQL, the most important data access language of all time.4
I’ll be introducing these statements one by one, but first let me describe the basic idea.
4.2.0.1 The “pipe” operator and data flow.
We’re making heavy, heavy use of the so-called “pipe” operator, %>%
. (The name of the package containing it is “magrittr - Ceci n’est pas un pipe,” but for our purposes it’s fine to call it a pipe.)
The idea of the pipe is to represent each transformation of data you make as a chain.
Remember how we mentioned in the data types chapter that multiplication and addition operators
are a different sort of notation from the rest of R? Although it’s highly unnatural, a much
more computer-friendly method would be to use functions called add
, divide_by
, and so on.
These are available in the magrittr package.
If you wanted to define a process where you took a number, multiplied it by 6, then added 2, and finally divided it by ten, you could express that as a series of functions whose results you assign along the way.
(The <-
operator below assigns values; we will not use it again this chapter.)
start <- 1
x <- multiply_by(start, 6)
y <- add(x, 2)
z <- divide_by(y, 10)
z
## [1] 0.8
But you could also nest them inside each other. This requires you to read the functions from the inside-out: it’s awkward, and doesn’t reflect the order in which you will usually work.
divide_by(add(multiply_by(1, 6), 2), 10)
## [1] 0.8
Using pipes, you can read the operation simply from left to right. The idea here is to think of data as flowing through a set of operations. First you take the thing and multiply it; next you take the same thing and divide it.
1 %>%
multiply_by(6) %>%
add(2) %>%
divide_by(10)
## [1] 0.8
These expressions can get quite long. Formatting your code will make it substantially more readable. There are a few guidelines for this:
- Each line except the last must end with the pipe-forward operator: if R hits a linebreak after a syntactically complete expression, it will evaluate it and print the results.
- If you pipe an operation to a function that has no arguments, you may omit the parentheses after the function. (Don’t worry about what this means yet if it doesn’t make sense.)
So you might prefer to write the above in the following form. (Note that RStudio will automatically indent your code for you.)
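Following those guidelines, and with RStudio's automatic indentation, the computation from above looks like this (the same pipeline as before, just laid out one step per line):

1 %>%
  multiply_by(6) %>%
  add(2) %>%
  divide_by(10)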
4.3 Real Data
Let’s look at some real data. You have already downloaded, with the materials, a dataset called ‘crews.’ You can open it like this:
data(crews)
The first thing to do is simply look at it. Going to the “Environment” pane in the upper right-hand corner, you can see what’s present. Click on it.
Where did this data come from? Did they do a good job transcribing? Here’s an example of the original source.
Once you have a dataset in R, there are a few ways to look at it.
The most basic is to highlight the variable name and hit control-enter
in the RStudio code window. This prints a representation
of the dataset.
Clicking on the item in your ‘environment’ panel loads it for you in a separate tab. If you look at the console area of R,
you’ll see that this simply calls a function: View(crews)
. Unlike most R functions, View
is capitalized; this is a sign that
you shouldn’t use it in permanent code. Whether you want to click on the variable or type View
is up to you.
4.3.0.1 Filtering
“Filtering” is the operation where you make a dataset smaller based on some standards.
4.3.0.1.1 The ==, >, and < operators
The easiest way to filter is when you know exactly which value you’re looking for.
Passing data through the filter operator reduces it down to only those entries that match your criteria. In the previous section, you may have noticed that the summary
function on crews showed a single 82-year-old. By filtering to only persons aged 82, we can see exactly who that was.
crews %>% filter(Age == 82)
If we wished to cast a slightly wider net, we could filter for sailors over the age of 65:
crews %>% filter(Age > 65)
4.3.0.1.1.0.1 == and =
Warning: one of the most frequent forms of errors you will encounter in data analysis is confusing the ==
operator with the =
one. For the most part, the warnings you get should help clarify this problem. But keep in mind that ‘==’ is for comparison (it asks the question: ‘are x and y equal?’) while ‘=,’ as discussed later, is for assignment.
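A quick illustration, reusing patterns from this chapter’s crews examples:

crews %>% filter(Age == 82)       # comparison: keep rows where Age equals 82
crews %>% summarize(count = n())  # assignment: name the new summary column "count"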
4.3.0.1.1.1 The %in% operator
Almost as useful for humanities computing as == is the special operator %in%. == tests if two values are the same: %in% tests whether the left-hand side is an element of the right-hand side. This expression tells you whether, for example, “New York” is in a list of cities:

"New York" %in% c("Boston", "New York", "Philadelphia")

It evaluates to TRUE because “New York” appears in the list on the right.
On a longer list, this can be useful to find any combination of names of interest. Here, for example, we can check if anyone on a list might be a Beatle.
Printing all of the matches might be too long, so instead we use a special function called sample_n that takes some number of rows at random.
You can use head to see the beginning of a dataset, but sample_n can be better because it gives you a different set of results each time, not biased towards the beginning of your data. (A related function to sample_n is sample_frac, which takes a fraction of the data; so crews %>% sample_frac(.33) would get 33% of the dataset.)
crews %>%
  filter(LastName %in% c("McCartney", "Lennon", "Harrison", "Starr")) %>%
  sample_n(10)
Compare that to a regular-expression based approach, below. What’s the difference?
crews %>%
  filter(LastName %>% str_detect("McCartney|Lennon|Harrison|Starr")) %>%
  sample_n(10)
The tidyverse also exposes a function, count
, that makes it possible to see how many times each element in a list occurs.
We can also count the occurrences of each of these last names to see which is most common.
crews %>%
  filter(LastName %in% c("McCartney", "Lennon", "Harrison", "Starr")) %>%
  count(LastName)
:::R language-note
You may notice a similarity between %in% and %>% in the way they use percentage signs. This is because they both
are infix operators: basically, A %in% B is the same as if there were a function that takes two arguments, in(A, B).
The pipe operator, %>%, also takes two arguments: a data frame A on the left, and a function call B on the right. It calls B with A as its first argument. This pipe has historically been part of the tidyverse rather than base R; base R now offers a similar operator with a different spelling, |>, and JavaScript has considered a |> operator as well.
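As a small illustration of what “infix” means, you can call %in% with ordinary function syntax by wrapping its name in backticks; both lines below return TRUE:

"New York" %in% c("Boston", "New York")
`%in%`("New York", c("Boston", "New York"))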
:::
4.3.0.2 Arranging
Frequently useful with filter
is the function arrange
. It sorts data. Using the head
function function from last week, we can, for example, sort by age and show the ten youngest individuals in the set.
crews %>%
  arrange(Age) %>%
  head(10)
If you want to sort in descending order, to see the oldest, you can use the dplyr
function desc
to reverse the variable: but usually it’s easiest to just put a negative sign in front of the variable you want sorted.
crews %>%
  arrange(Age %>% desc()) %>%
  head()
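Equivalently, using the negative sign mentioned above (this works here because Age is numeric):

crews %>%
  arrange(-Age) %>%
  head()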
It is also useful to look at a completely unordered chunk of the data. When you read in something like the crews set, the first few records you see are not necessarily representative; sample_n pulls rows from anywhere in the table.

crews %>% sample_n(5)
4.3.0.3 Summarizing
Looking at your individual data is sometimes sufficient: but usually, you want to know some aggregate conditions about it.
dplyr
provides the summarize
function to do this. Unlike the summary
function we saw earlier, summarize
doesn’t do anything on its own; instead, it lets you specify the kinds of commands you want to run.
In dplyr, you pipe your results through to a summarize function and then run a different function call inside the parentheses. The simplest is n(): it says how many rows there are. summarize then reduces the whole dataset down to the summaries you request.
crews %>% summarize(n = n())
But you can use any of the variables inside the frame as part of your function call: for instance, to find the average age in the set, you could run the following.
crews %>% summarize(mean_age = mean(Age))
This produces a missing value (NA), because some of the ages are missing. R provides a variety of ways of dealing with missing data: in dplyr, we can simply put a filter operation in first.
Since NA
is a special value, it requires a special function in the filter: is.na.
And we don’t want to
filter to only <NA>
fields; we want to filter to not <NA>
fields. By common programming convention,
!
means “not.”
crews %>%
  filter(!is.na(Age)) %>%
  summarize(mean_age = mean(Age))
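Another common approach, shown here as a sketch, is the na.rm argument that base R summary functions such as mean accept; it tells the function to ignore missing values rather than requiring you to filter them out first.

crews %>%
  summarize(mean_age = mean(Age, na.rm = TRUE))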
This is a bit confusing, so in the tidyverse, there is a function to do this directly: drop_na.
crews %>%
  drop_na(Age) %>%
  summarize(mean_age = mean(Age))
This may look like a single value, but it’s actually a tibble like all of those we’ve been working with. By using = to assign values inside summarize, we can compute several statistics at once: the average, median, or oldest age.
crews %>%
  drop_na(Age) %>%
  summarize(
    average_age = mean(Age),
    median_age = median(Age),
    oldest_age = max(Age)
  )
4.3.0.4 Finding functions for your task.
There are a bunch of functions you maybe haven’t seen before here: mean
, median
, and max
. From their names it should be clear what they do.
Some languages (Python, for example) work to keep the set of built-in functions small. R is not like these: it has so many functions that even experienced users sometimes stumble across ones they have not seen before. And libraries like dplyr provide still more.
A few functions from base R that may be useful with summarize include:

- `n()` - count the number of rows.
- `rank(col)` - give each row a rank based on the value of `col`.
- `min(col)`, `max(col)`, `median(col)`, `mean(col)` - return the minimum, maximum, median, or mean value of `col`.
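As a rough sketch of a few of these in use (column names as in the crews examples above):

# min, max, and n() collapse the rows down to a single summary row.
crews %>%
  drop_na(Age) %>%
  summarize(youngest = min(Age), oldest = max(Age), rows = n())

# rank works row by row, so it usually belongs in mutate rather than summarize.
crews %>%
  drop_na(Age) %>%
  mutate(age_rank = rank(Age)) %>%
  head()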
It’s also important to know about type conversion.

- `as.character` - turn anything into a character.
- `as.numeric` - turn anything into a number. Letters will produce `NA`.
- `as.Date` - turn an ISO-formatted field (e.g., “1940-03-15”) into a date object. For more complicated date operations, you will want to investigate the `lubridate` package in the tidyverse, which exposes functions like `mdy` for dates written like “3/14/92.”
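A few quick illustrations of these conversions (results shown as comments; the last line assumes lubridate is installed):

as.numeric("82")           # 82
as.numeric("eighty-two")   # NA, with a warning
as.Date("1940-03-15")      # a Date object for March 15, 1940
lubridate::mdy("3/14/92")  # parses American-style month/day/year dates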
So how do you find additional functions? The best place to start is the search box in the help pane: type the phrase you’re looking for. Especially if it’s a statistically meaningful concept (e.g., “standard deviation”), you should be able to find it.
The other way is by searching online. One of R’s major flaws is that the name is so generic that it’s hard to Google. The website Stack Overflow contains some of the most valuable information, and you can restrict a search to R questions by prefixing your search term with the letter [R] in brackets. But Google will often turn up good answers as well.
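From the R console, the equivalent searches look like this:

?mean                   # open the help page for a function you know by name
??"standard deviation"  # search installed documentation for a phrase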
4.3.1 group_by and summarize
This sorting and arranging is all stuff that you can do quite easily in Excel. The reason to
use a tabular package like this is that it allows you to perform these operations many times
across different facets of data. The last element which enables all sorts of amazing analysis is grouping.
Now that you know how to filter and summarize, you’re ready for the most important operation in dplyr
: group_by
.
Unlike filter
and summarize
, group_by
doesn’t actually change the data you see.
Instead, it does something much more subtle; as it says, it groups the data for future operations.
In other words, it sets the units that you’ll be working with. In Michael Witmore’s terms, it changes the level of address for the text [@witmore_text_2010].
The most basic idea is to use a grouping and then the n
function to count the number of items in each bucket.
crews %>% group_by(Skin) %>% summarize(count = n())
This is actually so common that there’s a function, “count,” that does it directly and stores the result in a column called n. But I recommend sticking with the more basic format for now, because it gets you in the habit of actually thinking about what happens.
crews %>% count(Skin)
You can also group by or count multiple items at once. Each of these is like adding an additional index to your search.
crews %>%
  group_by(Skin, Hair) %>%
  summarize(count = n()) %>%
  arrange(-count) %>%
  head(10)
Here is a little bit about the social construction of race, entirely in code. Try to understand what this code does, and what the data says.
Note that there’s a new function in here, mutate. Like summarize, it adds a new column (or changes an existing one); unlike summarize, it doesn’t change the number of rows.
crews %>%
  group_by(Skin) %>%
  drop_na(Skin, Hair) %>%
  mutate(skin_total = n()) %>%
  filter(skin_total > 100) %>% ## Why?
  group_by(Skin, Hair, skin_total) %>% ## Why is `skin_total` here?
  summarize(combo_count = n()) %>%
  mutate(share = combo_count / skin_total) %>%
  arrange(-share) %>%
  filter(share > .1) %>% # Remove uncommon entries
  head(10)
## `summarise()` has grouped output by 'Skin', 'Hair'. You can override using the `.groups` argument.
Read the line of code below. What does it do? Can you come up with any patterns or explanations for the results that you see here?
crews %>%
  drop_na(Age) %>%
  group_by(Skin) %>%
  summarize(meanAge = mean(Age), count = n()) %>%
  filter(count > 100) %>%
  arrange(meanAge) %>%
  head(10)
4.4 Exercises
4.4.1 Summarizing and filters
- Start by just editing some code. The code below finds the first date that appears in this collection. Edit it to find the minimum age in the set.
crews %>%
  drop_na(date) %>%
  summarize(min = min(date))
- Use `filter` to determine: what is the name of that youngest person? When did he or she sail?
- How many sailors left on ‘Barks’ between 1850 and 1880? Chain together `filter` and `summarize` with the special `n()` function. Note that this requires a number of different conditions in the filter statement. You could build several filters in a row, but you can also include multiple conditions in one filter by separating them with commas. For instance, `filter(school == "NYU", year == 2020)` might be a valid filter on some dataset (though not this one).
- Question 3 told you how many sailors left on barks in those years. How many distinct voyages left? The variable `Voyage.number` identifies distinct voyages in this set. (This may require reading some documentation: reach out to me or a classmate if you can’t figure it out. There are at least two ways: one involves using the `dplyr` function `distinct` before summarizing, and the second involves using the functions `length` and `unique` in your call to `summarize`.)
4.4.2 Grouping
Now let’s move to the group_by operation. Once data is grouped, a summarize operation reports results for each group separately. For example, try counting the number of people by residence.
crews %>%
  group_by(Residence) %>%
  summarize(count = n()) %>%
  arrange(-count) %>%
  head(10)
4.4.3 Working with Text
To think about working with text, let’s look at a bibliographic dataset. I’ve taken all the books in the Library of Congress’s catalog dump that have summaries in the bibliographic record, and then further reduced that down to “just” about 150,000 books that have dates of birth for their authors. (This is about 1% of the full Library of Congress catalog of books.)
The tidyverse packages include some functions for exploring strings that we can use inside verbs like filter and mutate.
str_detect
finds a regular expression; it returns TRUE or FALSE, so works well with filter
.
str_extract pulls out the portion of a string that matches a regular expression: it returns a new string, so it works well with mutate.
You may not remember everything about regexes: but for this, try to recall at least that the vertical bar (|
)
means “or.” So "France|French"
will be true if the string contains either “France” or “French.”
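Here is a small sketch of how the two functions behave on a single string (the example string is made up):

library(stringr)

str_detect("A history of the French navy", "France|French")   # TRUE
str_extract("A history of the French navy", "France|French")  # "French"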
- Try to get a sense of what is in this set based on some keyword searches. Can you get a sense of what the biases of this subset of the catalog are?
Here are a couple examples having to do with geographic terms in subjects; you’d probably do better to explore some other kind of resource.
books %>%
  filter(subjects %>% str_detect("France|French")) %>%
  select(year, summary) %>%
  sample_n(10)
Consider counting some individual words, as well. Using ‘str_extract,’ we can create a new column–‘word’–which is only the part of the subject that matches a search. Then we can count those individual terms.
Again, look for something other than the three countries here.
books %>%
  mutate(word = subjects %>% str_extract("(France|Germany|England)")) %>%
  drop_na(word) %>%
  group_by(word) %>%
  summarize(count = n())
4.4.4 Your data
I don’t know if this will work. But try getting read_csv
to work on your own dataset, and do three of the following:
- Find an outlier; who is the oldest person? The youngest?
- Count a categorical variable and arrange by decreasing count. What are the most common labels?
- Use `str_sub` or `str_extract` to create a better categorical variable than the ones you have.
- Count some combination of variables and see if you can identify things that tend to occur a lot together or apart.
- Use `min`, `mean`, or `max` inside groups to see which groups have higher or lower values.
- Describe the extent of your data.
Note: if you have a slow computer or an extremely large data set, you can do all of these operations on data on disk. I recommend using SQLite as the on-disk data storage engine: read the documentation at `?src_sqlite` for an explanation of how to read it in.