Chapter 4 dplyr verbs and piping

A core package in the tidyverse is dplyr, which is used for transforming data. It is often used in conjunction with the magrittr package, which provides the pipe for chaining multiple operations together.

The R4DS chapters on dplyr and on magrittr cover both packages in more detail.

The figures in this chapter were made for use with an ecological dataset on rodent surveys, but the principles they illustrate are generic: each shows the use of a function with and without a pipe.

From R4DS:

"All dplyr verbs work similarly:

1. The first argument is a data frame.

2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

3. The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result."

4.1 Pipes

A pipe in R looks like this: %>%. It allows us to send the output of one operation into the next, which saves time and typing and can make our code easier to read.
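
As a minimal sketch (assuming the tidyverse, and with it the pipe, is already loaded), the two lines below compute the same thing; the piped version reads left to right, in the order the operations happen:

sum(sqrt(c(1, 4, 9)))            # nested call: read from the inside out
c(1, 4, 9) %>% sqrt() %>% sum()  # piped: square root, then sum; same result (6)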

For example, we can pipe the output of calling the dat object into the glimpse function like so:

dat %>% glimpse()
## Observations: 7,702
## Variables: 8
## $ protein_accession   <chr> "VATA_HUMAN_P38606", "RL35A_HUMAN_P18077",...
## $ protein_description <chr> "V-type proton ATPase catalytic subunit A ...
## $ control_1           <dbl> 0.8114, 0.3672, 2.9815, 0.1424, 1.0748, 0....
## $ control_2           <dbl> 0.8575, 0.3853, 4.6176, 0.2238, 0.9451, 0....
## $ control_3           <dbl> 1.0381, 0.4091, 2.8709, 0.1281, 0.8032, 0....
## $ treatment_1         <dbl> 0.6448, 0.4109, 7.1670, 0.1643, 0.7884, 0....
## $ treatment_2         <dbl> 0.7190, 0.4634, 2.0052, 0.2466, 0.8798, 1....
## $ treatment_3         <dbl> 0.4805, 0.3561, 0.8995, 0.1268, 0.7631, 0....

This becomes even more useful when we combine pipes with dplyr functions.

4.2 Filter rows

The filter function enables us to filter the rows of a data frame according to a logical test (one that is TRUE or FALSE). In the figure, it filters rows of the surveys data where the year variable is greater than or equal to 1985.

Let’s try this with dat, filtering for the rows where the observations in both the control_1 and control_2 experiments are greater than 20:

dat %>% filter(control_1 > 20, control_2 > 20)
## # A tibble: 2 x 8
##   protein_accessi~ protein_descrip~ control_1 control_2 control_3
##   <chr>            <chr>                <dbl>     <dbl>     <dbl>
## 1 MYH9_HUMAN_P355~ Myosin-9 OS=Hom~      29.2      31.7      24.6
## 2 A0A087WWY3_HUMA~ Filamin-A OS=Ho~      31.9      27.8      31.3
## # ... with 3 more variables: treatment_1 <dbl>, treatment_2 <dbl>,
## #   treatment_3 <dbl>

Filtering is done with the comparison operators >, <, >=, <=, != (not equal) and == (equal). Note the double equals sign for testing equality: == rather than =.
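
The comma in the example above acts as an "and": every condition must be TRUE for a row to be kept. As a sketch using the same dat columns, we can also combine conditions with | ("or") or test for equality:

# Keep rows where either control replicate exceeds 20:
dat %>% filter(control_1 > 20 | control_2 > 20)

# Keep a single protein by name; note == rather than =:
dat %>% filter(protein_accession == "VATA_HUMAN_P38606")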

4.3 Arrange rows

The arrange function is similar to filter except that, instead of choosing rows, it changes their order, sorting by the supplied columns in ascending order. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.

In the figure, we arrange the surveys data according to the record identification number.

To try that with dat, let’s arrange the data according to control_1:

dat %>% arrange(control_1)
## # A tibble: 7,702 x 8
##    protein_accessi~ protein_descrip~ control_1 control_2 control_3
##    <chr>            <chr>                <dbl>     <dbl>     <dbl>
##  1 PAL4G_HUMAN_P0D~ Peptidyl-prolyl~   0.001      0.0177    NA    
##  2 E5RGV5_HUMAN_E5~ Nucleolysin TIA~   0.0011    NA          0.093
##  3 E5RJP4_HUMAN_E5~ Glutamine--fruc~   0.002     NA         NA    
##  4 I3L3U1_HUMAN_I3~ Myosin light ch~   0.00240   NA         NA    
##  5 ENPLL_HUMAN_Q58~ Putative endopl~   0.0026    NA         NA    
##  6 K1C15_HUMAN_P19~ Keratin_ type I~   0.00290    0.0615     0.122
##  7 B5ME44_HUMAN_B5~ Outer dense fib~   0.00290   NA         NA    
##  8 PANK3_HUMAN_Q9H~ Pantothenate ki~   0.0033    NA         NA    
##  9 RRS1_HUMAN_Q150~ Ribosome biogen~   0.0035    NA         NA    
## 10 NFL_HUMAN_P07196 Neurofilament l~   0.0035     0.315      0.564
## # ... with 7,692 more rows, and 3 more variables: treatment_1 <dbl>,
## #   treatment_2 <dbl>, treatment_3 <dbl>
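
By default arrange sorts in ascending order; to reverse this we can wrap the column in dplyr's desc() helper. A minimal sketch that puts the largest control_1 values first:

dat %>% arrange(desc(control_1))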

4.4 Select columns

select is the verb we use to keep only the columns of interest in the data. In the figure, we select only the year and plot_type columns and discard the rest.

Let’s use select with dat, using negative indexing to drop the protein description and the control experiment columns while keeping everything else:

dat %>% select(-protein_description,-(control_1:control_3))
## # A tibble: 7,702 x 4
##    protein_accession  treatment_1 treatment_2 treatment_3
##    <chr>                    <dbl>       <dbl>       <dbl>
##  1 VATA_HUMAN_P38606        0.645       0.719       0.480
##  2 RL35A_HUMAN_P18077       0.411       0.463       0.356
##  3 MYH10_HUMAN_P35580       7.17        2.01        0.900
##  4 RHOG_HUMAN_P84095        0.164       0.247       0.127
##  5 PSA1_HUMAN_P25786        0.788       0.880       0.763
##  6 PRDX5_HUMAN_P30044       0.545       1.69        0.821
##  7 ACLY_HUMAN_P53396        4.67        5.01        3.57 
##  8 VDAC2_HUMAN_P45880       1.01        1.04        0.904
##  9 LRC47_HUMAN_Q8N1G4       1.22        1.01        0.593
## 10 CH60_HUMAN_P10809        8.31        8.31        5.73 
## # ... with 7,692 more rows
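
select also accepts tidyselect helpers such as starts_with(), ends_with() and contains(), which save typing when columns share a common prefix. A sketch that keeps the protein accession together with all three treatment columns:

dat %>% select(protein_accession, starts_with("treatment"))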

4.5 Create new variables

Creating new variables uses the mutate verb. In the figure, I create a new variable called rodent_type: a new column containing the type of rodent observed in each row.

Let’s create a new variable for dat called prot_id that uses the str_extract function from the stringr package to take the last six characters of the protein_accession variable. The ".{6}$" part is a regular expression; it keeps just the UniProt id part of the string.

Afterwards, via another pipe, we’ll use select to keep only the protein accession and the new prot_id variable.
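
To see what the regular expression does on its own: the . matches any character, {6} repeats it six times, and $ anchors the match to the end of the string. A minimal sketch on a single accession (assuming stringr is loaded, for example via library(tidyverse)):

str_extract("VATA_HUMAN_P38606", ".{6}$")
## [1] "P38606"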

dat %>% 
  mutate(prot_id = str_extract(protein_accession,".{6}$")) %>% 
  select(protein_accession, prot_id)
## # A tibble: 7,702 x 2
##    protein_accession  prot_id
##    <chr>              <chr>  
##  1 VATA_HUMAN_P38606  P38606 
##  2 RL35A_HUMAN_P18077 P18077 
##  3 MYH10_HUMAN_P35580 P35580 
##  4 RHOG_HUMAN_P84095  P84095 
##  5 PSA1_HUMAN_P25786  P25786 
##  6 PRDX5_HUMAN_P30044 P30044 
##  7 ACLY_HUMAN_P53396  P53396 
##  8 VDAC2_HUMAN_P45880 P45880 
##  9 LRC47_HUMAN_Q8N1G4 Q8N1G4 
## 10 CH60_HUMAN_P10809  P10809 
## # ... with 7,692 more rows
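
mutate can also build numeric variables from existing columns. As a sketch (the name control_mean is just illustrative), we could average the three control replicates for each protein and keep only the accession and the new column:

dat %>% 
  mutate(control_mean = (control_1 + control_2 + control_3) / 3) %>% 
  select(protein_accession, control_mean)
# Note: any protein with an NA in one of the controls gets an NA mean here.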

4.6 Create grouped summaries

The last key verb is summarise, which collapses a data frame into a single row.

For example, in the figure we use it to find the average weight of all the animals surveyed in the surveys data using mean(). (The na.rm = TRUE argument is given to remove missing values from the data; otherwise R would return NA when trying to average.)
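
The same idea works with dat. This sketch collapses the whole table into a single row holding the mean of control_1, with na.rm = TRUE dropping the missing values we saw when arranging:

dat %>% summarise(mean_control_1 = mean(control_1, na.rm = TRUE))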

summarise is most useful when paired with group_by, which defines the variables we operate upon.

In the figure, if we group by species_id and rodent_type together and then use summarise without any arguments, we return just these two variables, with one row per unique combination.

We’ll use the mpg dataset again to illustrate a grouped summary. Here I group according to fuel type fl (c = compressed natural gas, d = diesel, e = ethanol, p = premium and r = regular), then use summarise to calculate the mean highway (hwy) miles per gallon and the mean urban (cty) miles per gallon. The table is collapsed from 234 rows to five, one for each fuel type, with two columns for the mean mpg values. This illustrates how grouped summaries provide a very concise way of exploring data: we can immediately see the relative fuel efficiency of each fuel type under two driving conditions.

# fl is fuel type. c = compressed natural gas ,d = diesel, 
# e = ethanol, p = premium and r = regular.
mpg %>% 
  group_by(fl) %>% 
  # Create summaries mean_hwy and mean_cty using the mean function, 
  # dropping any missing values.
  summarise(mean_hwy = mean(hwy, na.rm = TRUE), mean_cty = mean(cty, na.rm = TRUE))
## # A tibble: 5 x 3
##   fl    mean_hwy mean_cty
##   <chr>    <dbl>    <dbl>
## 1 c         36      24   
## 2 d         33.6    25.6 
## 3 e         13.2     9.75
## 4 p         25.2    17.4 
## 5 r         23.0    16.7

We’ll use dplyr and pipes in Chapter 5.