3 Data wrangling II

Following on from chapter 2 this lesson deals with some common tidying problems with data using the tidyr package . By the end of this chapter the learner will:

  • have learnt how to transform rows and columns to reorganise data
  • have learnt some ways to deal with missing values
  • have learnt how to join data contained in separate tables into a single table

3.1 Reshaping data with pivots

As you will recall tidy data means that in a table:

  1. Every variable has its own column.
  2. Every observation has its own row.
  3. Each value has its own cell.

Two common problems making your data untidy are that:

  1. A single variable is spread across multiple colummns.
  2. A single observation is distributed across multiple rows.

In the first case the column names are actually the variable we are interested in, making our table wide.

In the second case, the observation is contained in several rows instead of a single row, making our table long.

Hence tidyr has functions that pivot (as in turning) a table from wide-to-long by reducing the number of columns and increasing the number of rows. Or from long-to-wide by reducing the number of rows and increasing the number of columns.

Usually, the hard part is identifying in a dataset our variables and our observations, as it is not always obvious which is which.

3.1.1 tidyr::pivot_longer

First we’ll consider the case when our variable has been recorded as column names by returning to a version of the Portal surveys data.

A table containing the mean weight of 10 rodent species on each plot from the rodent survey data can be read directly into a surveys_spread object using the code below.

Feel free to explore the data, and snapshot is shown in Figure 3.1.

Imagine that we recorded the mean weight of each rodent species on a plot in the field. It makes sense to put the species as column headings, along with the plot id and then record the values in each cell.

However, really our variables of interest are the rodent species and our observational units, the rows, should contain the mean weight for a rodent species in a plot. Hence we need to reduce the number of columns and create a longer table.

To do this we pivot_longer() by using names_to = "genus" to create a new genus variable for the exisiting column heading, and values_to = "mean_weight" to create a variable mean_weight for the values. We put a minus sign before the variable -plot_id to tell the function not to use these values in the new variable column.

tidyr::pivot_longer The genus of the rodents have been used as column headings for the weights of each animal. By creating new variables genus which takes the column names, and another variable mean_weight which takes the cell values, and but not the plot_id column we pivot from a wide table to a long table.

Figure 3.1: tidyr::pivot_longer The genus of the rodents have been used as column headings for the weights of each animal. By creating new variables genus which takes the column names, and another variable mean_weight which takes the cell values, and but not the plot_id column we pivot from a wide table to a long table.

## # A tibble: 240 x 3
##    plot_id genus           mean_weight
##      <dbl> <chr>                 <dbl>
##  1       1 Baiomys                7   
##  2       1 Chaetodipus           22.2 
##  3       1 Dipodomys             60.2 
##  4       1 Neotoma              156.  
##  5       1 Onychomys             27.7 
##  6       1 Perognathus            9.62
##  7       1 Peromyscus            22.2 
##  8       1 Reithrodontomys       11.4 
##  9       1 Sigmodon              NA   
## 10       1 Spermophilus          NA   
## # … with 230 more rows

3.1.2 tidyr::pivot_wider

If we wanted to go the other way, from wide to long, then we need to take values from a single column to become variables names of multiple columns, populated with values from an existing variable.

A version of the wide table in Figure 3.2 can be downloaded into our environment and assigned as surveys_gw as before.

This time we use pivot_wider() with the arguments names_from for the column we want to use to create variables, and values_from for the column whose values we want populate our new variable columns with.

tidyr::pivot_wider From long table surveys_gw the genus column contains the values that become variables using names_from and the mean_weight column contains the values that fill the new columns using values_from in the pivot from long to wide.

Figure 3.2: tidyr::pivot_wider From long table surveys_gw the genus column contains the values that become variables using names_from and the mean_weight column contains the values that fill the new columns using values_from in the pivot from long to wide.

## # A tibble: 24 x 11
##    plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus
##      <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
##  1       1    7           22.2      60.2    156.      27.7        9.62
##  2       2    6           25.1      55.7    169.      26.9        6.95
##  3       3    8.61        24.6      52.0    158.      26.0        7.51
##  4       5    7.75        18.0      51.1    190.      27.0        8.66
##  5      18    9.5         26.8      61.4    149.      26.6        8.62
##  6      19    9.53        26.4      43.3    120       23.8        8.09
##  7      20    6           25.1      65.9    155.      25.2        8.14
##  8      21    6.67        28.2      42.7    138.      24.6        9.19
##  9       4   NA           23.0      57.5    164.      28.1        7.82
## 10       6   NA           24.9      58.6    180.      25.9        7.81
## # … with 14 more rows, and 4 more variables: Peromyscus <dbl>,
## #   Reithrodontomys <dbl>, Sigmodon <dbl>, Spermophilus <dbl>

3.2 Missing values

In R missing values are represented as NA (or NaN for an undefined mathematical operation such as dividing by zero).

As discussed in R4DS values can be missing in two ways:

  • Explicitly, i.e. flagged with NA.
  • Implicitly, i.e. simply not present in the data.

If we had a subset of the surveys data that looked like this table:

## # A tibble: 7 x 3
##    year   qtr mean_weight
##   <dbl> <dbl>       <dbl>
## 1  1991     1        3.75
## 2  1991     2        2.5 
## 3  1991     3       NA   
## 4  1991     4        8.5 
## 5  1992     1        7.5 
## 6  1992     2        2.25
## 7  1992     4        2.5

Then we can see that in the third quarter of 1991, no data was recorded by the explicit NA value. However the third quarter of 1992 is missing altogether, hence it is implicitly missing.

Another issue is that explicit missing values are often recorded in real data in various ways e.g. an empty cell or as a dash. We can’t cover all cases, but the general advice is to do some manual inspection of your data in addition to using R to understand how missing values have been recorded if there are any. In an ideal world data comes with a “code book” that explains the data, but this often doesn’t happen. Functions such as read_csv() allow you to supply a vector of values representing NA which will convert these values in your input to NAs.

Once you have identified missing values, in general there two ways to deal with them:

  1. Drop incomplete observations
  2. Complete the missing observations: fill or impute values.

See also tidyr functions for missing values.

3.2.1 Checking for explicit missing values using R

In addition to manual inspection, one way to check for explicit missing values coded as NA in your data frame is to use the is.na() function which returns a logical TRUE or FALSE vector of values.

Using this to check a single variable could be done by combining it with select() for example using surveys_ms:

##      mean_weight
## [1,]       FALSE
## [2,]       FALSE
## [3,]        TRUE
## [4,]       FALSE
## [5,]       FALSE
## [6,]       FALSE
## [7,]       FALSE

Or with filter() to return the row where there is a missing value for mean_weight.

## # A tibble: 1 x 3
##    year   qtr mean_weight
##   <dbl> <dbl>       <dbl>
## 1  1991     3          NA

A more complicated use would be to combine is.na() with another function such sum() to provide the total number of missing values for a variable, or to combine sum() with complete.cases() to check each row for for missing values.

For example, to find the number of observations (rows) in surveys with no missing values:

## [1] 30676

And for a single variable, we could extend our use of select:

## [1] 1

Extending this find the missing values per variable in a whole data frame requires introducing more syntax and the use of a map function from the purr package. (Base R can also do this, but we’re staying in the tidyverse where possible).

The idea here is that we want to do the same calculation mapped across each variable. The calculation here being count the number of missing values in each column.

We pipe surveys to the map_dfr() function where dfr means return a data frame. The ~ represents formula, meaning “depends upon”, such that sum(is.na(.)) where . represents the input variable, depends upon the variables in surveys. The map function then does this for each variable in the surveys dataframe and returns the output as a data frame. I’ve piped the output to glimpse() for readability.

In words what this syntax is saying is “take the surveys dataframe and for each column check if each value is NA, then sum the number of NAs in each column and return the totals for all the columns as a dataframe.”

This is pretty complicated if you’ve never seen this before, but hopefully you can follow the idea.

This could be combined with filter() or select() prior to the map function if you wanted to subset the data further for example.

See R4DS map functions for more details.

## Observations: 1
## Variables: 13
## $ record_id       <int> 0
## $ month           <int> 0
## $ day             <int> 0
## $ year            <int> 0
## $ plot_id         <int> 0
## $ species_id      <int> 0
## $ sex             <int> 1748
## $ hindfoot_length <int> 3348
## $ weight          <int> 2503
## $ genus           <int> 0
## $ species         <int> 0
## $ taxa            <int> 0
## $ plot_type       <int> 0

As we might expect the missing values are all for variables that record measurements for the rodents.

3.2.2 Dropping missing values

The simplest solution to observations with missing values is to drop those from the data set. If it makes sense to do this, then tidyr::drop_na() makes it easy to do this. Note: This will create implicit missing values.

In Section 3.2.1 we found 30,676 rows with no missing values by using complete.cases() on the surveys data of 34,786 observations.

Therefore we expect passing surveys to drop_na() will return a dataframe of 30,676 observations:

## Observations: 30,676
## Variables: 13
## $ record_id       <dbl> 845, 1164, 1261, 1756, 1818, 1882, 2133, 2184, 2406, …
## $ month           <dbl> 5, 8, 9, 4, 5, 7, 10, 11, 1, 5, 5, 7, 10, 11, 1, 2, 3…
## $ day             <dbl> 6, 5, 4, 29, 30, 4, 25, 17, 16, 18, 18, 8, 1, 23, 25,…
## $ year            <dbl> 1978, 1978, 1978, 1979, 1979, 1979, 1979, 1979, 1980,…
## $ plot_id         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id      <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL",…
## $ sex             <chr> "M", "M", "M", "M", "M", "M", "F", "F", "F", "F", "F"…
## $ hindfoot_length <dbl> 32, 34, 32, 33, 32, 32, 33, 30, 33, 31, 33, 30, 34, 3…
## $ weight          <dbl> 204, 199, 197, 166, 184, 206, 274, 186, 184, 87, 174,…
## $ genus           <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotoma"…
## $ species         <chr> "albigula", "albigula", "albigula", "albigula", "albi…
## $ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Ro…
## $ plot_type       <chr> "Control", "Control", "Control", "Control", "Control"…

drop_na() can accept variables arguments, meaning only the observations with missing values in those columns will be dropped.

For example, here we drop only missing weight observations, so the rows which have missing values for hindfoot_length or sex are kept.

## Observations: 32,283
## Variables: 13
## $ record_id       <dbl> 588, 845, 990, 1164, 1261, 1453, 1756, 1818, 1882, 21…
## $ month           <dbl> 2, 5, 6, 8, 9, 11, 4, 5, 7, 10, 11, 1, 5, 5, 7, 10, 1…
## $ day             <dbl> 18, 6, 9, 5, 4, 5, 29, 30, 4, 25, 17, 16, 18, 18, 8, …
## $ year            <dbl> 1978, 1978, 1978, 1978, 1978, 1978, 1979, 1979, 1979,…
## $ plot_id         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id      <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL",…
## $ sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "F", "F"…
## $ hindfoot_length <dbl> NA, 32, NA, 34, 32, NA, 33, 32, 32, 33, 30, 33, 31, 3…
## $ weight          <dbl> 218, 204, 200, 199, 197, 218, 166, 184, 206, 274, 186…
## $ genus           <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotoma"…
## $ species         <chr> "albigula", "albigula", "albigula", "albigula", "albi…
## $ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Ro…
## $ plot_type       <chr> "Control", "Control", "Control", "Control", "Control"…

3.2.3 Completing missing values

There are numerous ways to complete missing values, but what they generally have in common is the approach of guessing the missing value based on existing information, otherwise known as imputation. Here we’ll cover a couple of ways to do it with dplyr and tidyr functions.

First, let’s deal with implicit missing values using tidyr::complete().

Imagine we have the table surveys_ms which is missing the value for the mean_weight for the third quarter of 1991 and all observations for the third quarter of 1992:

## # A tibble: 7 x 3
##    year   qtr mean_weight
##   <dbl> <dbl>       <dbl>
## 1  1991     1        3.75
## 2  1991     2        2.5 
## 3  1991     3       NA   
## 4  1991     4        8.5 
## 5  1992     1        7.5 
## 6  1992     2        2.25
## 7  1992     4        2.5

To add the row for the third quarter of 1992 we use complete() with the year and qtr variables as arguments. The function finds all the unique combinations of year and qtr and then adds any that are missing to create a complete set of observations for 1991 and 1992 with explicit missing values.

## # A tibble: 8 x 3
##    year   qtr mean_weight
##   <dbl> <dbl>       <dbl>
## 1  1991     1        3.75
## 2  1991     2        2.5 
## 3  1991     3       NA   
## 4  1991     4        8.5 
## 5  1992     1        7.5 
## 6  1992     2        2.25
## 7  1992     3       NA   
## 8  1992     4        2.5

Next let’s consider a complete table surveys_ic with an explicit missing value for the species_id.

## # A tibble: 6 x 2
##   species_id mean_weight
##   <chr>            <dbl>
## 1 DM                3.75
## 2 DM                2.5 
## 3 <NA>              8.5 
## 4 DS                7.5 
## 5 DS                5.5 
## 6 DS                2.5

In this case we might reasonably assume that the missing value is DM as we have six observations on two species.

We can use tidyr::fill() to replace NA with the last non-missing value for species_id. This is also known as last observation carried forward.

Passing species_id to fill() completes the table:

## # A tibble: 6 x 2
##   species_id mean_weight
##   <chr>            <dbl>
## 1 DM                3.75
## 2 DM                2.5 
## 3 DM                8.5 
## 4 DS                7.5 
## 5 DS                5.5 
## 6 DS                2.5

Another common strategy is to impute missing values by using the mean or the median of exisiting values for the same variable.

We can do that using dplyr::coalesce() which is a function that finds the first non-missing value in a variable and then replaces it using the second argument.

For example, if we return to surveys_ms we used complete() to complete the observations, but we have two NAs in the mean_weight column.

Let’s impute the missing values bt using the median value for mean_weight to replace the NA values.

We can mutate() to overwrite the mean_weight variable (be careful when you do this!) using coalesce() with mean_weight as the first argument to look for missing values, and then the median() function is used to replace the missing values with the median mean_weight, remembering to remove NA.

## # A tibble: 8 x 3
##    year   qtr mean_weight
##   <dbl> <dbl>       <dbl>
## 1  1991     1        3.75
## 2  1991     2        2.5 
## 3  1991     3        3.12
## 4  1991     4        8.5 
## 5  1992     1        7.5 
## 6  1992     2        2.25
## 7  1992     3        3.12
## 8  1992     4        2.5

This gives us a complete table with . Whether it makes sense to do this is another question, and you should think carefully about your missing value strategy as it will influence your final output and conclusions.

Let’s repeat part of the analysis we did to find the mean_weight, and now mean_hindfoot, for the full surveys dataset, but this time we’ll complete the table by imputing values and compare the result to the result when we drop the observations with missing values.

For simplicity, let’s just look at kangaroo rats. Here I introduce the %in% operator with filter and vector containing the species_id for the kangaroo rats.

This leaves 16,127 observations.

## Observations: 16,127
## Variables: 13
## $ record_id       <dbl> 3, 226, 233, 245, 251, 257, 259, 268, 346, 350, 354, …
## $ month           <dbl> 7, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, …
## $ day             <dbl> 16, 13, 13, 16, 16, 16, 16, 16, 12, 12, 12, 12, 10, 1…
## $ year            <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
## $ plot_id         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id      <chr> "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM",…
## $ sex             <chr> "F", "M", "M", "M", "M", "M", "M", "F", "F", "M", "M"…
## $ hindfoot_length <dbl> 37, 37, 25, 37, 36, 37, 36, 36, 37, 37, 38, 36, 37, 3…
## $ weight          <dbl> NA, 51, 44, 39, 49, 47, 41, 55, 36, 47, 44, 40, 42, 5…
## $ genus           <chr> "Dipodomys", "Dipodomys", "Dipodomys", "Dipodomys", "…
## $ species         <chr> "merriami", "merriami", "merriami", "merriami", "merr…
## $ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Ro…
## $ plot_type       <chr> "Control", "Control", "Control", "Control", "Control"…

How many missing values are there in krats and for which variables?

## Observations: 1
## Variables: 13
## $ record_id       <int> 0
## $ month           <int> 0
## $ day             <int> 0
## $ year            <int> 0
## $ plot_id         <int> 0
## $ species_id      <int> 0
## $ sex             <int> 131
## $ hindfoot_length <int> 1136
## $ weight          <int> 617
## $ genus           <int> 0
## $ species         <int> 0
## $ taxa            <int> 0
## $ plot_type       <int> 0

Grouping by species_id and sex, find the mean_weight and mean_hindfoot when observations with missing values are dropped:

## # A tibble: 6 x 4
## # Groups:   species_id [3]
##   species_id sex   mean_weight mean_hindfoot
##   <chr>      <chr>       <dbl>         <dbl>
## 1 DM         F            41.6          35.7
## 2 DM         M            44.3          36.2
## 3 DO         F            48.5          35.5
## 4 DO         M            49.1          35.7
## 5 DS         F           117.           49.6
## 6 DS         M           123.           50.4

Now we complete the table, first using fill() to impute the missing sex of each kangaroo rat based on the last non-missing sex observation. Then we group the kangaroo rats according to species and sex. Then we impute the missing weight and hindfoot_length values with mutate() and coalesce() from each groups median value. Finally we summarise again to find the mean_weight and mean_hindfoot for each group.

## # A tibble: 6 x 4
## # Groups:   species_id [3]
##   species_id sex   mean_weight mean_hindfoot
##   <chr>      <chr>       <dbl>         <dbl>
## 1 DM         F            41.6          35.7
## 2 DM         M            44.4          36.2
## 3 DO         F            48.5          35.5
## 4 DO         M            49.2          35.7
## 5 DS         F           118.           49.6
## 6 DS         M           123.           50.3

The results are very similar.

3.3 Joining tables

See also R4DS Relational data chapter

Previously we’ve used data from the Portal project where everything we needed was already contained within a single table, but often we have related information spread across multiples tables that we want to analyse. In these situations we need to join pairs of tables to explore relationships of interest.

We’ll use a dummy dataset here that recreates part of the data the Portal project collects. As well as the animal census they separately record weather information.

Imagine we had two small tables, one called airtemp that contains the air temperature in Celsius for three plots for one timepoint on one day. And another called krat that records the observations for three plots on the same day for kangaroo rat speciecs DM, DS and RM.

## Observations: 3
## Variables: 3
## $ date    <date> 1990-01-07, 1990-01-07, 1990-01-07
## $ plot    <dbl> 4, 8, 1
## $ airtemp <dbl> 7.3, 9.5, 11.0
## Observations: 3
## Variables: 3
## $ date    <date> 1990-01-07, 1990-01-07, 1990-01-07
## $ plot    <dbl> 4, 4, 6
## $ species <chr> "DM", "DS", "RM"

We can think of three types of join, two of which correspond with dplyr verbs:

  1. Joins that mutate the table. These are joins that create a new variable in one table using observations from another.
  2. Joins that filter the table. These are joins that keep only a subset of observations from one table based on whether they match the observations in another.
  3. Joins that perform set operations. These are joins corresponding with the mathematical operation of intersection \(\cap\), union \(\cup\), and difference \(-\).

Note Joins always use two tables. To add a third relationship it would require joining the table created from the first join to another table. And so on.

3.3.1 Keys

A key is a variable or set of variables that uniquely identifies an observation in a table. In a simple case only one variable is sufficient, but often several variables are required.

  • Primary key: a key that uniquely identifies the observation in its own table e.g. the combination of date and plot is unique to each set of observations in the airtemp table.
  • Foreign key: a key that uniquely identifies the observation in another table. e.g. the combination of date, plot and species is unique to each set of observations in the krat table.
## # A tibble: 3 x 3
##   date        plot     n
##   <date>     <dbl> <int>
## 1 1990-01-07     1     1
## 2 1990-01-07     4     1
## 3 1990-01-07     8     1
## # A tibble: 3 x 4
##   date        plot species     n
##   <date>     <dbl> <chr>   <int>
## 1 1990-01-07     4 DM          1
## 2 1990-01-07     4 DS          1
## 3 1990-01-07     6 RM          1

So joining airtemp and krat uses the airtemp primary key of date and plot.

Keys can be both primary and foreign, as they might be primary for one table and foreign in another or vice versa.

It’s a good idea to verify a key is primary i.e. uniquely identifies an observation. See the examples in R4DS keys

If you discover your table lacks a primary key you can add one with mutate().

Keys are called explicity using by = followed by the keys as a character vector. For example, here we call by = c("plot","date).

## # A tibble: 2 x 4
##   date        plot airtemp species
##   <date>     <dbl>   <dbl> <chr>  
## 1 1990-01-07     4     7.3 DM     
## 2 1990-01-07     4     7.3 DS

If we implicitly let the join function choose the keys, we get an output telling us what was used.

If we had a key that had a different name in both tables, we tell the join function that they are the same key by saying one variable is equal to another. For example if we had plot in one table and it was called plot_id in another, we could use by = c("plot" = "plot_id") to tell the join function that this is the key.

3.3.2 Mutating joins

These are joins that created new columns, hence they act like dplyr::mutate.

From dplyr we have:

  • inner_join() : keeps only the observations with matching keys in both tables
  • left_join() : keeps all the observations in the left hand table
  • right_join() : keeps all the observations in the right hand tables
  • full_join() : keeps all the observations in both tables

3.3.2.1 Inner join

inner_join() : keeps only the observations with matching keys in both tables

An inner join between airtemp and krat keeps only the observations which have match in both tables using the primary key of date and plot as shown in Figure 3.3.

Inner join

Figure 3.3: Inner join

## Joining, by = c("date", "plot")
## # A tibble: 2 x 4
##   date        plot airtemp species
##   <date>     <dbl>   <dbl> <chr>  
## 1 1990-01-07     4     7.3 DM     
## 2 1990-01-07     4     7.3 DS

3.3.2.2 Left join

left_join() : keeps all the observations in the left hand table

An left join between airtemp and krat keeps all the observations in airtemp and fills the missing observations in species with NA as shown in Figure 3.4.

Left join

Figure 3.4: Left join

## Joining, by = c("date", "plot")
## # A tibble: 4 x 4
##   date        plot airtemp species
##   <date>     <dbl>   <dbl> <chr>  
## 1 1990-01-07     4     7.3 DM     
## 2 1990-01-07     4     7.3 DS     
## 3 1990-01-07     8     9.5 <NA>   
## 4 1990-01-07     1    11   <NA>

3.3.2.3 Right join

right_join() : keeps all the observations in the right hand tables

An right join between airtemp and krat keeps all the observations in krat and fills the missing observations in airtemp with NA as shown in Figure 3.5.

Right join

Figure 3.5: Right join

## Joining, by = c("date", "plot")
## # A tibble: 3 x 4
##   date        plot airtemp species
##   <date>     <dbl>   <dbl> <chr>  
## 1 1990-01-07     4     7.3 DM     
## 2 1990-01-07     4     7.3 DS     
## 3 1990-01-07     6    NA   RM

3.3.2.4 Full join

full_join() : keeps all the observations in both tables

An full join between airtemp and krat keeps all the observations in both tables and fills the missing observations in airtemp and species with NA as shown in Figure 3.6.

Full join

Figure 3.6: Full join

## Joining, by = c("date", "plot")
## # A tibble: 5 x 4
##   date        plot airtemp species
##   <date>     <dbl>   <dbl> <chr>  
## 1 1990-01-07     4     7.3 DM     
## 2 1990-01-07     4     7.3 DS     
## 3 1990-01-07     8     9.5 <NA>   
## 4 1990-01-07     1    11   <NA>   
## 5 1990-01-07     6    NA   RM

3.3.3 Filtering joins

Filtering joins act like the filter function keeping only observations that match the join without adding or removing columns.

semi_join() and anti_join()

3.3.3.1 Semi join

semi_join() keeps only the observations in the left hand table that have a match in the right hand table.

A semi-join between between airtemp and krat keeps only the observation in airtemp that matches in krat as showin in Figure 3.7.

Semi join

Figure 3.7: Semi join

## Joining, by = c("date", "plot")
## # A tibble: 1 x 3
##   date        plot airtemp
##   <date>     <dbl>   <dbl>
## 1 1990-01-07     4     7.3

3.3.3.2 Anti join

anti_join() drops the observations in the left hand table that have a match in the right hand table.

A anti-join between between airtemp and krat drops the observations in airtemp that match in krat as shown in Figure 3.8.

Anti join

Figure 3.8: Anti join

## Joining, by = c("date", "plot")
## # A tibble: 2 x 3
##   date        plot airtemp
##   <date>     <dbl>   <dbl>
## 1 1990-01-07     8     9.5
## 2 1990-01-07     1    11

3.3.4 Set operations

These are joins corresponding with the mathematical operation of intersection \(\cap\), union \(\cup\), and difference \(-\) as shown in Figure 3.9. These are intersect(), union() and setdiff().

See R4DS set operations for examples of usage.

Set operations intersect(A,B), union(A,B) and setdiff(A,B).

Figure 3.9: Set operations intersect(A,B), union(A,B) and setdiff(A,B).

3.3.5 Duplicate keys

If we have duplicate keys in each tables which aren’t used as part of the join, then the join will create columns for both varaibles. For example, if we join airtemp with krat only using plot as a key, but date is in both, we reuturn date.x and date.y for airtemp and krat respecitively:

## # A tibble: 5 x 5
##   date.x      plot airtemp date.y     species
##   <date>     <dbl>   <dbl> <date>     <chr>  
## 1 1990-01-07     4     7.3 1990-01-07 DM     
## 2 1990-01-07     4     7.3 1990-01-07 DS     
## 3 1990-01-07     8     9.5 NA         <NA>   
## 4 1990-01-07     1    11   NA         <NA>   
## 5 NA             6    NA   1990-01-07 RM

3.3.6 Binding rows and columns

We can also join tables by binding rows or columns. Binding rows is useful when we have tables with different sets of observations for the same variables, such as when we have multiple replicates of an experiment. Binding columns can be done when we want to add a column from one table that corresponds with the observations in another.

See dplyr bind :

“When row-binding, columns are matched by name, and any missing columns will be filled with NA.”

When column-binding, rows are matched by position, so all data frames must have the same number of rows."

These functions are bind_rows and bind_cols.