Chapter 5 Visualizing Data

Goals

This chapter covers how to create plots of tabular data using the Grammar of Graphics.

You should be able to map data variables to a variety of aesthetic codings, to know how to use different scales, and to produce charts using different encoding schemes (such as bars, circles, points, and text.)

Functions we talk about:

Fundamental to charts:

  • ggplot()
  • aes

Geometries and scales:

  • geom_point, geom_line, geom_histogram, geom_text, geom_smooth, stat_summary
  • scale_y_continuous, scale_x_discrete

To make charts prettier or better constructed.

  • labs, coord_flip

5.1 Why Visualize?

Graphic presentation is an increasingly important element of humanities scholarship. For exploratory data analysis, visualization has long been seen as the most useful way of working with data. As we saw with the information in “Anscombe’s quartet,” it is quite frequent that patterns which can be quite difficult to describe mathematically are immediately obvious when plotted with the proper specifications.

Visualization is also an important element of scholarly communication. As scholars like John Theibault, etc. have recently argued, visualizations can themselves be an element of historical arguments. (As we’ll read next week, there are reasons to be concerned that visualizations may interpose scientisitc forms of reading).

This work is the heir to a long history of non-lexical representations of history. Graphic visualization is occasionally described as an offshoot of the sciences. In fact, the tradition is probably more closely connected to the historical and social sciences than to any of the “hard” sciences: the early history of data visualization and thematic cartography are mostly attempts to bring visual order to social and historical systems.5

The question of whether an image has “an argument” is a thorny one; the question of whether an image can provide evidence should be much less so.

5.2 The Grammar of Graphics in ggplot

In dplyr, we have already begun to learn a “grammar of data analysis.” It provides a variety of functions for describing transformations you wish to make to data; to combine them, to aggregate, to filter, and so forth. The content of those transformations is left up to the user; this is why Wickham uses the metaphor of a “generative grammar” to describe it. A grammar gives rules about construction that make it possible to describe operations much more easily than might be otherwise possible.

Historical note: this idea of a generative grammar as a live influence in general academic culture goes back to linguistics and particularly Noam Chomsky, who made his first career describing the underlying rules of language. (Chomsky, in addition to his current role as a critic of modern capitalism, continues to defend his theories inside linguistics; his 2011 exchanges with Peter Norvig, the chief research scientist at Google, over the validity of the knowledge derived from machine learning techniques are well worth reading.) The insights were useful outside linguistics: the conductor Leonard Bernstein gave the Charles Eliot Norton lectures on aesthetics at Harvard in the early 1970s by applying Chomskyian techniques to musical formats.

This idea of a grammar stems from Jacques Bertin’s 1967 text The Semiology of Graphics. It was most influentially described in the statistical literature in 1999 by Leland Wilkinson in his book The Grammar of Graphics. We’ll be exploring data visualization in the context of an implementation of that grammar also by Hadley Wickham.

The name of this package is ggplot2. (gg for “Grammar of graphics”: 2 because it is version 2 of the software.) I’ll be referring to it as simply ggplot, because that is the name of the function used to create the core object.

ggplot uses a philosophy of chaining that should be familiar from your work with dplyr. But while dplyr is a chain in which order is paramount, ggplot is a little more flexible. You have to create a plot using the ggplot() function: but then you can add elements to it in an order that’s not always strictly proscribed. But in creating a plot, you’ll certainly have to do specify data, geometry, and aesthetic encoding.

5.2.1 Data, Geometry, and Aesthetics

Every ggplot chart begins with a single function ggplot, which generally generally takes as an argument a single tibble. For example: if we want to plot the changing urban population of the United States, we can use functions from dplyr to collect it. Then we can create an object base_plot that contains the beginnings of the plot.

Look at the code below. It uses some of the data cleaning and tidying principles from the next section that you shouldn’t understand yet to make the dataset more easily plottable.

populations = CESTA %>%
  pivot_longer(`1790`:`2010`, "year", values_to = "population") %>%
  mutate(year = year %>% as.numeric())

totalpopulations = populations %>%
  group_by(year) %>%
  drop_na(population) %>%
  summarize(USpop = sum(population))

The next step is choose a particular way to plot the data. This is called a “geometry” in ggplot, and it corresponds roughly to the type of plot that you’ll be using. There are many different types of plots that make sense: line charts, bar charts, maps, histograms, and so forth.

As Lev Manovich says, plots work by reducing real elements to abstractions. The “aesthetic” component of a ggplot graph is the list of abstraction you want to enforce. Each particular geom will require or use certain different types of data. A line chart plots one thing on the x axis, and another on the y.

We set this in ggplot using the function aes. aes specifies the aesthetic mapping. So make a linechart of population over time, we simply tell it that y should be population, and x should be time.

For this example, we’ll make a simple line chart. This requires three elements. It would be nice if you could simply writebase_plot + geom_line(), but every chart requires at least three elements.

Chart component Description Example ggplot function
Data A dataset: specifically, a ‘tibble’ with columns corresponding to the variables you would plot ggplot(populations)
Geometry A type of mark to put on the paper. geom_line(), geom_point()
Aesthetic Mapping How columns in the data should be represented through encoding channels. Takes multiple arguments. aes(x=year, y = count); aes(color=country, size=count)
Scale A strategy for turning raw data into chart information scale_x_continuous(), scale_color_brewer

These are combined together using the + sign. (This is a kind of a pain; it would be simpler if it used the %>% operator the same way as everything else in the tidyverse. But ggplot2 is the oldest piece of the tidyverse.)

Note that you are not restricted to just one mark type per chart. In the image below, there is both a set of points and a set of lines.

ggplot(totalpopulations) +
  aes(x = year, y = USpop) +
  geom_line() +
  geom_point()

5.2.2 Changing Scales

Scales are some of the most important elements of any chart.

In ggplot, scales are generated automatically from the aesthetic, but often you will want to tweak the settings. Your variable names may not appear right, you may not like the colors ggplot choices, or you may want to change the outline.

For humanities data, in most cases a logarithmic distribution fits the data better than a straightforward linear one. (Log distributions assume that rather each tick of the axis indicating an increase of a fixed amount.)

ggplot(totalpopulations) +
  geom_line() +
  geom_point() +
  aes(x = year, y = USpop) +
  scale_y_continuous("Total Population", trans = "log10")

Other sorts of scales will take other options. For example, a color scale might benefit from manually specifying the colors that you’re interested in.

5.2.2.1 Conventions around X and Y axes.

For at least a hundred years, the making of charts has been surrounded by a set of pronouncements about how to make charts. People have extremely strong opinions, often not that well informed, about the right way to display information.

Some conventions you probably know intuitively. The most fundamental is that the X axis is supposed is supposed the represent the dependent variable, and the Y axis is supposed to represent the independent variable. This assumption is built into ggplot in a variety of ways. A line chart connects points along the X axis, not the Y axis; histograms automatically assign Y to be the count variable. The language of ‘dependent’/‘independent’ may be alien to humanists, so another way to the think about it is that the X axis should usually represent the more fundamental concept, while the y axis shows the thing you are counting, measuring, or describing. Time should almost always appear on the X axis. In some cases when you want to swap the two axes; this is OK, but often you will have to use coord_flip() to do so (see below).

A less well-founded rule that you’re likely to encounter is the proscription that a Y-axis should always include the number zero. People who say this as an ironclad rule are wrong, but often zealous in their wrongness–there are many cases in which it is nonsensical to insist that zero is necessary. A chart of CO2 parts per million should not show zero as the baseline, because the more important baseline is the range between 180 and 300 ppm predominant in the last million years. A chart of temperature should not include zero as a baseline because the temperature 0–in Farenheit or Celsius–is generally important. A chart with a logarithmic scale cannot include zero as a baseline because to do so is mathematically impossible. And so on. It is a generally useful rule of thumb to try to include a zero baseline if creating a bar chart, but even this rule can be broken if you make appropriate note of it.

5.2.3 Facets

The “small multiple” is a plot strategy where different data is plotted acording to the same principles in several different plots. In ggplot, this is possible through the facet functions: facet_grid (when you have both x and y aspects to facet off of) and facet_wrap (when you only want to facet off a single thing.)

Our chart of US urban population can accomplish the same thing for each individual states by first grouping by state in the creation field, and then by faceting by state. Try adding “color” to the aesthetic on one of the variables to make it more attractive.

populations %>%
  filter(ST %in% c("NY", "MA", "CA", "TX", "VA", "NV")) %>%
  group_by(ST, year) %>%
  filter(population > 0) %>%
  summarize(pop = sum(population)) %>%
  ggplot() +
  geom_line() +
  aes(x = year, y = pop) +
  scale_y_log10() +
  facet_wrap(~ST)
## `summarise()` has grouped output by 'ST'. You can override using the `.groups` argument.

5.2.4 Other settings

5.2.4.1 Coordinate systems

For the most part, you’ll probably be plotting in cartesian space–that is to say, your coordinates will look like graph paper.

Cartesian is not the only coordinate system, though. Two other particularly important ones built into ggplot are coord_polar and coord_map.

  • coord_polar uses polar coordinates: instead of positioning an element by x and y, it positions it by angle and distance. Some quite famous charts use it, such as Florence Nightingale’s Coxcomb charts.

  • coord_map allows you, with the mapproj package, to make a wide variety of maps in R that project latitude and longitude in ways other than the standard projection system.

Because it will form the basis for further work, here is code to plot cities by longitude and latitude. This is not, exactly, a map: but you can see the outlines of the states.

cities = populations %>% filter(year == 1950)

cities %>%
  filter(LON < 0) %>%
  ggplot() +
  aes(x = LON, y = LAT, size = population, color = ST) +
  geom_point()

5.2.4.2 Labels

Charts should have explanations

The labs function lets you specify those.

cities = populations %>% filter(year == 1950)

cities %>%
  filter(LON < 0, ST != "AK", ST != "HI") %>%
  ggplot() +
  aes(x = LON, y = LAT, size = population, color = ST) +
  geom_point() +
  labs(title = ("The eastern half of the United States is more populated citywise."))

5.2.4.3 Themes

You can specify the appearance of grid lines, legend position, and all sorts of other elements through the theme argument. Online documentation may be necessary for the time being.

5.3 Exploring Data with ggplot

We’re going to explore these geoms with a cleaned version of the crews dataset.

5.3.1 Histograms of quantitative variables

For statisticians, the most basic chart is a histogram, which shows how frequent a single variable is at different levels. We can plot the height of our sailors by adding a new layer to the plot: that consists of the function geom_histogram to build a histogram, and the aesthetic aes(x=height) to tell it we want a histogram about the distribution of height.

library(HumanitiesDataAnalysis)
library(tidyverse)

crews %>%
  mutate(height = feet + inches/12) %>%
  count(height) %>%
ggplot() +
  geom_point() + #binwidth = 1 / 12) +
  aes(x = height, y = n)
## Warning: Removed 1 rows containing missing values (geom_point).

That might seem like a lot of work, but the advantage is that once a plot is created, we can simply swap out the aesthetic to plot against–say–date or age as well.

ggplot(crews) +
  geom_histogram() +
  aes(x = date)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 258 rows containing non-finite values (stat_bin).

ggplot(crews) +
  geom_histogram() +
  aes(x = Age)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 71838 rows containing non-finite values (stat_bin).

5.3.2 Barplots of categorical variables

When you use a histogram with a categorical variable, you ask for a barplot, as when we look at the types of ships in the sample.

ggplot(crews) +
  geom_bar() +
  aes(x = Rig)

A barplot is different, though, because we might want to add some more variables in. For example, we can add another aesthetic for ‘fill’ (which gives the color of the bars):

ggplot(crews) +
  geom_bar() +
  aes(x = Rig, fill = Rig)

That’s a nice chart. But we could also change it so the x-axis contains to information by just setting it to an empty string, and the bars will appear stacked on top of each other.

ggplot(crews) +
  geom_bar() +
  aes(x = "", fill = Rig)

That’s not as good a way of visualizing the information: you have to compare the size of chunks against each other. But if we make one more tweak—setting the y axis to a polar coordinate system—suddenly it becomes very familiar.

ggplot(crews) +
  aes(x = "", fill = Rig) + 
  geom_bar() +
  coord_polar("y")

It’s a pie chart! A pie chart is the same as a bar chart except for the coordinate system! Since Edward Tufte, pie charts are universally reviled. Unlike bar charts, they are not a fundamental element in the grammar of graphics; instead, it is describing them here as “a stacked bar chart plotted in a polar coordinate system.”

5.3.3 Scatterplots

For exploratory data analysis, the scatterplot is to multivariate data what the histogram is to single-variable data.

Summary stats are useful, but sometimes you want to compare two types of charts to each other. The next basic chart to use is a scatterplot: in ggplot, you can get this by using “geom_point.” Suppose, for example, we want to compare height against age.

ggplot(crews) +
  geom_point(alpha = 0.1) +
  aes(x = Age, y = feet + inches / 12)
## Warning: Removed 92128 rows containing missing values (geom_point).

ggplot(crews) +
  geom_point() +
  aes(x = Age, y = feet + inches / 12, color = Rig)
## Warning: Removed 92128 rows containing missing values (geom_point).

5.3.4 Density Plots

Here are some survey plots on the data that we have.

ggplot(crews) +
  geom_point(size = 1, alpha = .02, position = position_jitter(h = 1 / 12, w = 1), color = "red") +
  labs(title = "Height in feet") +
  aes(x = Age, y = feet + inches / 12) +
  geom_smooth(span = 1 / 30) + 
  stat_summary(fun.y = "mean", geom = "point") 
## Warning: `fun.y` is deprecated. Use `fun` instead.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 92128 rows containing non-finite values (stat_smooth).
## Warning: Removed 92128 rows containing non-finite values (stat_summary).
## Warning: Removed 92128 rows containing missing values (geom_point).

crews %>% ggplot() +
  aes(y = feet + inches / 12, x = as.Date(date)) +
  geom_point(size = 2, alpha = .05) +
  labs(title = "Height in feet") +
  geom_smooth() +
  xlim(as.Date("1860-1-1"), as.Date("1930-1-1"))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 92021 rows containing non-finite values (stat_smooth).
## Warning: Removed 92021 rows containing missing values (geom_point).

ggplot(crews %>% filter(!is.na(Hair))) +
  geom_bar() +
  aes(x = Hair) +
  coord_flip()

Sometimes when scatterplots get too full, it is useful to bin by squares or hexagons. (Hexagons are popular because they are the roundest shape that can snap together evenly. That’s why honeybees use them to organize their hives.)

ggplot(crews) +
  aes(y = feet + inches / 12, x = Age) +
  labs(title = "Height in feet") +
  geom_hex() +
  scale_fill_gradientn("Number of sailors", trans = "log10", colours = heat.colors(10))
## Warning: Removed 92128 rows containing non-finite values (stat_binhex).

5.3.5 Text Scatterplots

For humanities data analysis, text scatterplots are particularly important; points are often individual items we care about, not just interchangeable elements.

For them, you can use the geom_text function. It works just like geom_point, but requires an additional aesthetic: label, which maps to a character field.

What does this chart show? What variables (if any) do you think are driving this chart?

crews %>%
  group_by(Residence) %>%
  filter(is.na(Residence) == FALSE) %>%
  # This is a 'mutate' call.
  mutate(count = n()) %>%
  filter(count > 50) %>%
  summarize(
    averageHeight = mean(feet + inches / 12, na.rm = T),
    averageYear = mean(date, na.rm = T), count = n()
  ) %>%
  arrange(-count) %>%
  ggplot() +
  geom_point(aes(size = count), alpha = 0.25) +
  geom_text(check_overlap = TRUE) +
  aes(y = averageHeight, x = averageYear, label = Residence)

5.4 Colors

Every element of a data visualization requires as much care as every sentence in an essay. But color–if you choose to use it–is one of the most complicated things to show, because our intuitions are often wrong. The following explains the very basics of this space, but the primary takeaway should not be that you (or I!) know enough color theory to adequately choose our own colors; it’s that color defaults from even a decade ago are probably worth shying away from in favor of pre-packaged solutions.

5.4.1 RGB

Colors in computers are usually expressed the “RGB” format of three numbers corresponding to the strength of red, green, and blue light coming through. The color “green” can be expressed either as three numbers between 0 and 1 (0, 1, 0), three numbers between 0 and 255 (0, 255, 0), or–as you often see in HTML–in hexadecimal format like #00FF00. Each of these is fundamentally the same; they mean to maximize the green channel of a visualization (the green pixels on your screen) and do nothing with the others. #00FF00 is hexadecimal notation for means 00–0 in regular numbers for red, FF (255, the highest two-digit number in hexacimal) for green, and 00 (0) for blue. If you want to tweak colors you pull from online directly, you can edit them directly.

There are also a number of conventional color names that you can use like “green,” “blue,” “red,” and “black.”

RGB colors do not work well for scales because they do match the way the human eye works. An early attempt to fix this problem are a set of related colorspaces based around “hue” with names like HSV, HCL, and HSL. They try to represent the colors of a rainbow as a three dimensional wheel, with “brightness” (“Value”) as one dimension and “saturation” as the other. ggplot’s default colorscheme, created in the late 2000s, uses this space, selected from evenly around the HCL color wheel.

6

These hue-based methods are better than RGB colors but suffer many subtle problems of their own, including one big one–the fact that almost one in ten men, and about one in two hundred women, have some form of colorblindness. This means that when picking a color scheme, you cannot trust your own eyes because your eyes may not work like everyone else’s.

Instead, you should use someone else’s color schemes. One set of color schemes that sees extremely wide use are those developed by geographer Cynthia Brewer for the 2000 US census and browsable online at [colorbrewer.org]. They are built into ggplot into the functions scale_color_brewer() (for discrete values) and scale_color_distiller(). If you are trying to represent categorical values, you write scale_color_brewer(palette = "Dark2").

The different palettes are useful in different cases–I am partial to PuOr for diverging data when there is no qualitative difference between the fields, RdBu for plotting US political parties, and PiYG when the scale is bad to good. (Red to green is a tempting but unallowable scheme, since red-green colorblindness is very common.)

For plotting absolute quantities, three related schemes are your best bet. Viridis is a green to yellow colorscheme, magma, and inferno black to red.

populations %>% 
  group_by(ST, year) %>%
  summarize(pop = sum(`population`)) %>%
  group_by(year) %>%
  mutate(share = pop/sum(pop, na.rm=TRUE)) %>%
  ggplot() + 
    aes(x = year, y = ST, fill = share) +
    geom_tile() + 
    scale_fill_viridis_b(breaks=c(.01, .04, .08, .12, .2), labels = scales::percent)
## `summarise()` has grouped output by 'ST'. You can override using the `.groups` argument.

A common beginner’s mistake is to use a rainbow color palette to “get more separation.” In general, color is not capable of the many things people often want it to do, and you should consider using a different encoding dimension like shape, faceting, or size instead.

5.5 Exercises: Visualizing Data

Download these problems

Below is something new: a chain that start not with data, but with a period. This is a way to create your own functions; the one below cleans our book data a bit more by turning the author’s birth and death dates, and the publication dates, into numeric values for plotting.

clean_books = . %>%
  mutate(
    year = as.numeric(year),
    birth = author_dates %>% str_extract("[0-9]{4}") %>% as.numeric(),
    death = author_dates %>% str_extract("-[0-9]{4}") %>% str_replace("-", "") %>% as.numeric()) %>%
  filter(year > 1520, year < 2021)

Here’s a plot of when books were written that uses this cleaning function.

books %>%
  clean_books() %>%
  ggplot() +
  geom_histogram() +
  aes(x = as.numeric(year))
## Warning: NAs introduced by coercion
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Modify this so that instead of looking at the dates authors are born, you’re seeing the dates in which they died.
books %>%
  clean_books() %>%
  ggplot() +
  KEEP_GOING()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

  1. Make a scatterplot using this data and the grammar of graphics.

It should:

A. Have the x-axis represent birth year. B. Have the y-axis represent publication year. C. Use the geom_point function to make a point chart.

books %>%
  clean_books() %>%
  ggplot() +
  KEEP_GOING()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

The code below adds a new variable to the dataset called ‘alive’ that shows if the author was still alive when the book was written. Make this into the ‘color’ variable of your scatterplot.

books %>% clean_books %>%
  mutate(alive = year < death) %>%
  KEEP_GOING

Now let’s look at some plots where we perform data transformations first.

Translate the following set of instructions into code:

  1. Clean the books set;
  2. Count the number of authors;
  3. Limit the list to the most authors
  4. Draw a bar chart (geom_col) of the counts data you have made.
books %>% 
  clean_books() %>%
  group_by(author) %>%
  KEEP_GOING
  • Go back to the population dataset. Recall this code from the text that tidies the data so that each year is a population.
populations = CESTA %>%
  pivot_longer(`1790`:`2010`, "year", values_to = "population") %>%
  mutate(year = year %>% as.numeric()) %>% select(City, ST, year, population, LAT, LON)

The chart below shows the combined population of California and Michigan. Make it instead show them separately, represented by either the color or lty aesthetic. (You’ll need to change the data analysis code before the ggplot function, and the plotting code after.)

populations %>% filter(ST %in% c("CA", "MI")) %>%
  group_by(year) %>% 
  summarize(population = sum(population)) %>%
  ggplot() + 
  geom_line() + 
  aes(x=year, y = population)

Here’s New York City’s population. I’ve added a column to the dataset that represents “total population.” Plot not New York’s population, but its share of the national urban population.

In what year was NYC the maximum share of the national urban population?

populations %>%
  group_by(year) %>%
  mutate(total_population = sum(population, na.rm = TRUE)) %>%
  filter(City == "New York City" & ST == "NY") %>%
  KEEP_GOING

There’s something funny going on in that chart between 1880 and 1920. Can you figure out what it is? If so, make a chart that explains why the shape of the line in the previous chart isn’t to be believed.

Make a map of the United States in which each city is colorized by the year that it had its maximum population according to the census data. (Hard, very skippable.)

  1. Using the rank function inside a function and facet, make a geom_text plot that ranks the top ten cities for each census. (Hint: You will need to group by year.)
KEEP_GOING()

5.5.1 Your dataset

At this point, you should be able to find some kind of tabular dataset that’s a little bit interesting to you. Load it in, and start to explore it through a number of plots.

In previous chapters, we’ve laid out a number of functions for you to work with data frame.

(On data frames)

  1. filter
  2. group_by
  3. summarize
  4. mutate
  5. count
  6. arrange
  7. mutate
  8. pivot_longer

(Inside summary/mutate calls.)

  1. sum
  2. n()
  3. mean
  4. str_extract (works with mutate)

Use at least 7 of these functions to create some views of your dataset, and then plot them.


  1. Grafton and Rosenberg; Playfair; Friendly; DuBourg.↩︎

  2. I’ve adapted these charts from code at http://sape.inf.usi.ch/quick-reference/ggplot2/colour.↩︎