4 Visualisation and communication

Just as there are many types of data, e.g. tables and audio files, there are correspondingly many types of data visualisation, e.g. statistical plots and maps. Likewise, the visualisations we make for exploration may differ from those we make for communication. We therefore narrow the scope here to introduce some fundamentals of visualisation, presentations and reports.

By the end of this chapter you will be able to:

  • Use the ggplot2 package to create exploratory plots and customise them
  • Use geometrical objects and aesthetic mappings to create box plots, bar plots and time series plots
  • Use facets to split plots into sub-plots according to variables
  • Transform plots using statistical transformations and categorical variables (factors)
  • Transform data positions and coordinates
  • Use ggplot2 themes and the scales package to customise plots
  • Export figures for use in documents and presentations

These skills will enable you to start creating visualisations, reports and presentations using R, but importantly it is up to you to determine what sort of visualisation is appropriate for the task at hand, and to use your critical judgement when assessing a visualisation, presentation or report.

4.1 Visualisation overview

For a deeper understanding of the art of data visualisation, a good place to start is with the work of Alberto Cairo, a journalist and academic specialising in data visualisation who has written, and continues to write, extensively on the subject.

Betsy Mason has written an article that summarises many of the key ideas, Why scientists need to be better at data visualization, including the issues around how we perceive shade and hue, which can make plots such as heatmaps problematic.

Rafael Irizarry has also written about Why dynamite plots must die. These are very common plots in biomedical science, but they are far from optimal.

The point here is not to tell you explicitly what plots to make, or that heatmaps are bad and so on, but to encourage you to think about what type of plot is best for the task at hand. Choosing to plot your data in a certain way because “that’s what everyone else does” is unlikely to be the best reason.

To begin with, two questions one might immediately have are:

  1. Why visualise our data in the first place?
  2. What do we know about what makes an effective visualisation?

4.1.1 Making comparisons

Starting with the second question, a helpful set of principles for statistical visualisations comes from Cleveland and McGill’s work Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.

What they established is a hierarchy of perception; that is to say, there is an order in which our brains find it easier or harder to make comparisons using visual information.

The order of comparisons from easiest to hardest is as follows, but it is best illustrated by plotting some example data.

  1. Positions on a common scale
  2. Positions on the same but non-aligned scales
  3. Lengths
  4. Angles, slopes
  5. Area
  6. Volume, colour saturation
  7. Colour hue

Figure 4.1 plots the same percentages for five countries in seven ways, corresponding to Cleveland and McGill’s hierarchy.


Figure 4.1: The same percentages for five countries A-E are plotted seven ways to illustrate the differences in ease of making comparisons depending on the plot type. The percentages are A:32%, B:29%, C:34%, D:25% and E:22% in all plots.

The message here is that when choosing a visualisation for comparisons between sets of observations, the best place to start (though not necessarily to finish) is plotting the observations on a common scale.

4.2 ggplot2

In his paper A layered grammar of graphics Hadley Wickham discusses the theory behind the ggplot2 package. As with the tidyverse, the aim was to create a consistent system for building plots.

The key components of a plot are:

  • data with aesthetic mappings e.g. position on an axis, colour or size.
  • geometric objects e.g. points, lines.

To this other components or layers can be added, such as facets or statistical transformations to make increasingly complex plots.

As ggplot2 was first released in 2007 it predates the magrittr package (2014), from which we get the pipe %>%, so it has one syntax difference that is sure to catch you out: whilst we can pipe data into the ggplot() function, we build up plots using the plus sign + to add each new layer.

Likewise, forgetting to use the aesthetic function aes() within a layer is a common mistake I make. Aesthetics and geometries are related, so for example you can’t set the shape aesthetic of a line. I usually make this mistake with colour and fill aesthetics, where I try to use one when I need the other.

The basic form for creating plots with the ggplot2 package is as follows:
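The original template is not reproduced here, so the following is a minimal sketch. The angle-bracket placeholders (the same convention used in the recap below) stand in for your own data, geom and aesthetic mappings, followed by a concrete example using the built-in mpg data set:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# For example, a scatter plot of engine size against fuel efficiency
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))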

To recap:

  • We provide a data frame as an argument to the ggplot() function
  • Variables in the data frame are mapped to visual properties using the aesthetics function aes()
  • The aesthetics are mapped to a geometric object, such as points, using a layer with a geom function, e.g. geom_point().
  • If we want to plot several geoms on the same plot and map the same aesthetics to each of them, we can pass the mapping argument to ggplot() directly: ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)).
  • We can also pipe data into ggplot() as we would for other functions.

4.2.1 Datasaurus dozen

Returning to the first question: Why visualise data in the first place?

Inspired by Anscombe’s Quartet and Alberto Cairo’s Datasaurus, Justin Matejka and George Fitzmaurice created the Datasaurus dozen to illustrate a problem with relying only on summary statistics to understand data.
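The code producing the output below isn’t shown, but a minimal sketch would look something like this; the file path and the data frame name datasaurus are assumptions (the same data also ships as datasaurus_dozen in the datasauRus package):

library(tidyverse)

# Read the Datasaurus dozen and inspect its structure
# ("data/datasaurus.csv" is an assumed path)
datasaurus <- read_csv("data/datasaurus.csv")
glimpse(datasaurus)

# List the thirteen data sets it contains
datasaurus %>%
  distinct(dataset)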

## Parsed with column specification:
## cols(
##   dataset = col_character(),
##   x = col_double(),
##   y = col_double()
## )
## Observations: 1,846
## Variables: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino…
## $ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410…
## $ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718…
## # A tibble: 13 x 1
##    dataset   
##    <chr>     
##  1 dino      
##  2 away      
##  3 h_lines   
##  4 v_lines   
##  5 x_shape   
##  6 star      
##  7 high_lines
##  8 dots      
##  9 circle    
## 10 bullseye  
## 11 slant_up  
## 12 slant_down
## 13 wide_lines

We have a baker’s dozen of data sets, together comprising 1,846 observations of the variables x and y. Let’s look at the mean and standard deviation of each variable for each data set using a grouped summary with the mean() and sd() functions.
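A sketch of that grouped summary, assuming the combined data frame is called datasaurus as above:

datasaurus %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            sd_x   = sd(x),
            sd_y   = sd(y))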

## # A tibble: 13 x 5
##    dataset    mean_x mean_y  sd_x  sd_y
##    <chr>       <dbl>  <dbl> <dbl> <dbl>
##  1 away         54.3   47.8  16.8  26.9
##  2 bullseye     54.3   47.8  16.8  26.9
##  3 circle       54.3   47.8  16.8  26.9
##  4 dino         54.3   47.8  16.8  26.9
##  5 dots         54.3   47.8  16.8  26.9
##  6 h_lines      54.3   47.8  16.8  26.9
##  7 high_lines   54.3   47.8  16.8  26.9
##  8 slant_down   54.3   47.8  16.8  26.9
##  9 slant_up     54.3   47.8  16.8  26.9
## 10 star         54.3   47.8  16.8  26.9
## 11 v_lines      54.3   47.8  16.8  26.9
## 12 wide_lines   54.3   47.8  16.8  26.9
## 13 x_shape      54.3   47.8  16.8  26.9

Here we use the basic form to plot the datasaurus, but with the addition of facet_wrap(), whose first argument is a formula (introduced by the tilde ~) naming the variable we wish to create individual plots for. In words, read ~ dataset as “the facets depend upon the dataset”:
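As a sketch, again assuming the data frame is called datasaurus:

ggplot(datasaurus, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset)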

The message here is that plotting the data reveals structure that is not apparent from summary statistics. Therefore if possible: plot early, and plot often.

4.2.2 Exploratory plots

As the name implies, the point of an exploratory plot is to explore our data. This can happen before, during or after tidying/transforming data, and, as discussed in The Art of Data Science, it serves two key purposes that we saw with the Datasaurus:

  1. Creating expectations
  2. Checking for deviations from our expectations.

In Chapter 2 we transformed the Portal rodent surveys data into a table summarising the rodent observations per three-month period per plot type (control or exclosure). The original paper explored the hypothesis that kangaroo rat and granivore populations were in competition for resources in their habitat.

We plotted the average number of captures over time as a line and point plot, coloured according to plot type and faceted according to animal type.

Here we are exploring the effect of the exclosure on the animal populations over time, and the variation in populations over time in both plot types.
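The plotting code isn’t shown here; a sketch of what it might look like follows, where the data frame and column names (a summary table with date, captures, plot_type and rodent_type columns) are assumptions based on the Chapter 2 summary:

ggplot(by_month_species, aes(x = date, y = captures, colour = plot_type)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ rodent_type)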


Figure 4.2: The average number of captures over time as a line and point plot, coloured according to plot type and faceted according to animal type.

The expectations created by this plot are, for example, that the populations fluctuate quite a lot over time, and that the exclosure appears to work well, with few kangaroo rats observed in those plots.

What is less clear is what effect, if any, the exclosure is having on the granivore population, suggesting we might wish to try a different type of plot.

4.2.3 Statistical transformations

Often we want to transform the data as part of the plotting process, for example to create plots that reveal statistical information such as distributions or averages. For these we can use geoms which are statistical in nature or to which we provide statistical arguments.

4.2.3.1 Barplots

These apparently simple plots actually reveal some subtleties. In the following code we take the by_month_species subset and create a bar chart of species_id, filling the bars with colour according to rodent_type.
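A sketch of that code:

ggplot(by_month_species, aes(x = species_id, fill = rodent_type)) +
  geom_bar()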

We see that the code has automatically created a count variable on the y-axis for each species_id plotted as a bar on the x-axis.

This is because the geom_bar() algorithm automatically performs a statistical transformation of the mapped variable. That is to say, it counts and bins the data according to species_id. This means we have a chart where the bar height corresponds to the number of rows for each species_id.

But this isn’t a count of the captures; it’s how many rows of by_month_species each species_id appears in. Remember how we summarised this data to calculate captures per month.

Often when people talk about bar charts they are referring to charts where the height of the bar is a variable present in the data; here that would be the captures column. In this situation we need to map the y-axis too, and change the default geom_bar() statistical transformation to stat = "identity" so that it uses the values in the captures column to determine the height of each bar.

To reiterate: if you are trying to create a bar chart using x and y variables contained in your data set, you need to set geom_bar(stat = "identity").
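As a sketch, mapping captures to the y-axis (note that geom_col() is a shorthand for geom_bar(stat = "identity")):

ggplot(by_month_species, aes(x = species_id, y = captures, fill = rodent_type)) +
  geom_bar(stat = "identity")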

Let’s check this by looking at PE in the control plots, comparing a count of the rows with the sum of the number of captures. Note we need to ungroup() before doing another summary. There are 132 rows, but only around 11 captures.
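A sketch of that check; the values "PE" and "Control" are assumptions about how the species and plot type are coded:

# Count the rows for species PE in the control plots
by_month_species %>%
  ungroup() %>%
  filter(species_id == "PE", plot_type == "Control") %>%
  count()

# Sum the captures for the same subset
by_month_species %>%
  ungroup() %>%
  filter(species_id == "PE", plot_type == "Control") %>%
  summarise(total = sum(captures))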

## # A tibble: 1 x 1
##       n
##   <int>
## 1   132
## # A tibble: 1 x 1
##   total
##   <dbl>
## 1  11.1

4.2.3.2 Boxplots

The previous exploratory plot showed us changes with time, but averages, spread or other statistical summaries of the surveys data might be more informative for understanding the effect of exclosure on granivore populations.

Box-plots are a standard way to plot the distribution of a data set. Plotting the data this way means that we drop the dynamic information, the time dimension, but gain a compact summarised view of the variability of the number of captures, and hopefully evidence of any changes in granivore population following exclosure.

In the following code we use geom_boxplot(), and this time swap the mapping such that colour maps to the rodent type, and the plots are faceted according to the plot type.
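A sketch of that code, where mapping rodent_type to the x-axis as well is an assumption:

ggplot(by_month_species, aes(x = rodent_type, y = captures, colour = rodent_type)) +
  geom_boxplot() +
  facet_wrap(~ plot_type)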

Comparing the median bar and the interquartile range, we can see evidence suggesting a modest increase in the granivore population when kangaroo rats are excluded.

4.2.4 Position adjustments

Another common adjustment we might make to our plots is to adjust the position of the data.

Recall the mpg data set for cars and our plot showing the fuel efficiency versus the size of the engine. Many of the points are actually overplotted such that we can’t see them.

By adding a bit of random noise, called jitter, to the position of each data point, we can spread the points out and see them more clearly. This is shown in the second plot below, where the position = "jitter" argument has been provided to the geom_point() function.
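As a sketch, using the displ and hwy variables from mpg:

# Default positions: many points overlap exactly
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Jittered positions: a little random noise spreads the points out
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")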

It’s a little counterintuitive to add noise to a plot, but when we’re trying to explore data for patterns it can be very useful.

Let’s add points to our box-plot and try comparing the plot to one where the points are jittered.
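A sketch, building on the box-plot above; geom_jitter() is equivalent to geom_point(position = "jitter"), and the width argument controls how far the points are spread:

ggplot(by_month_species, aes(x = rodent_type, y = captures, colour = rodent_type)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  facet_wrap(~ plot_type)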

We can also apply positional adjustments to other plots such as barplots. Combining the aesthetic fill = species_id with position = "fill" in geom_bar(), and putting plot_type on the x-axis, plots bars of length 1 for by_month_species, filled with the proportion of captures for each species.
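A sketch of that plot (stat = "identity" because captures is already in the data):

ggplot(by_month_species, aes(x = plot_type, y = captures, fill = species_id)) +
  geom_bar(stat = "identity", position = "fill")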

Perhaps not that useful given what we know about making comparisons.

For more positional adjustments, check out the help files ?position_jitter, ?position_dodge, ?position_fill, ?position_identity, and ?position_stack.

4.2.5 Coordinate adjustments

Previously we plotted by_month_species, and plots can be tricky to read when there are lots of labels on the x-axis, so let’s flip the plot round using coord_flip() to improve things:
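A sketch, assuming we are flipping the bar chart of captures per species from above:

ggplot(by_month_species, aes(x = species_id, y = captures, fill = rodent_type)) +
  geom_bar(stat = "identity") +
  coord_flip()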

4.2.5.1 Factors

Finally for this plot, it would be nice to put the bars in order of size. We can do this by converting species_id to a factor.

Recall that factors are how R represents categorical variables, variables with a limited number of values, such as genus.

Also recall that factors look like strings, but behave like integers. That is to say, they are strings of characters with associated values called levels that can be used to order categories.

This is why factors are useful, especially when we want to place things in non-alphabetical order. We can take advantage of factor levels to order things as we wish.

Check out the forcats tidyverse package to explore the power of factors, but here, to give you a taste of what is possible, we’re going to use the forcats function fct_reorder() to convert species_id into a factor and, in doing so, set the levels according to the number of captures. In this way we can order the bars in our plot according to the number of captures and see the pattern more clearly.

In fct_reorder(), the first argument is the variable we wish to make into a factor, species_id, and the second argument is the variable we wish to use to create the order. Here we want to order according to the number of captures, so captures is the second argument.

Note we still keep the x and y mappings, but x is now a function of both the species_id and captures columns.
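A sketch of the reordered plot; faceting by plot_type is an assumption, included so the control and exclosure plots can be compared as described below:

ggplot(by_month_species,
       aes(x = fct_reorder(species_id, captures), y = captures, fill = rodent_type)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ plot_type)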

This gives us a nice bar chart of species ordered by the number of captures during the experimental period.

We can clearly see that kangaroo rat exclosure works, and that this corresponds with an increase in granivore captures in these plots.

4.3 Themes and customisations

When we’re exploring our data we generally don’t mind if the labels, colours and so on are a bit messy. But when we want to publish or communicate our findings to others we want everything to be just so.

This is where the code can become quite complicated and it’s hard to provide a general case. However, the underlying template remains the same:
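The template itself isn’t reproduced here; the layered version given in R for data science (linked below) looks like this, with the angle-bracket placeholders standing in for your own choices:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping  = aes(<MAPPINGS>),
    stat     = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>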

It’s unlikely that you’ll need statistical transformations, positional adjustments, coordinate adjustments and facets all in the same plot, but this provides an idea of what is possible. Remember the aim is clarity, not to make something complicated just because you can.

We can use this template to add all the additional functions and arguments needed to make the plot exactly how we want it. It takes time, but remember code is reusable. Solve it once and you’ve solved it forever.

For all the possible customisations check out the ggplot2 documentation, and this chapter of R for data science: Graphics for communication

But here are a few common changes.

4.3.1 Changing colours, labels, themes and adding titles

Recall our box-plot with points overlaid. Following the facet function, we’re going to add a succession of functions to change the colours with scale_colour_brewer(), change the axis labels and add a title using labs(), and change the theme.

Themes deal with non-data elements e.g. the grid, the axes etc. There are a range of built-in themes and packages of themes such as ggthemes. Here we’ll use the built-in theme, theme_bw() that gives a simple layout.
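A sketch of the customised box-plot; the palette, labels and title are illustrative choices rather than necessarily those used in the original:

ggplot(by_month_species, aes(x = rodent_type, y = captures, colour = rodent_type)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  facet_wrap(~ plot_type) +
  scale_colour_brewer(palette = "Dark2") +
  labs(title = "Rodent captures by plot type",
       x = "Rodent type",
       y = "Average captures per month") +
  theme_bw()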

Check out ?scale_colour_brewer and ?scale_fill_brewer for more information on colour schemes for discrete data. And ?scale_color_continuous for creating colour gradients for continuous data.