6  Data wrangling II

Summary

This chapter covers further tasks for working with tabular data using the tidyverse packages dplyr, tidyr and stringr. These tasks are joining tables and working with strings to transform and tidy data.

6.1 Joins

We often want to join tables together based upon a variable or variables common to each table, called a key or keys, such as the digital object identifier (DOI) that is present in tables downloaded from both Scopus and Web of Science (Section 7.2) in order to further transform the data or answer a question that cannot easily be answered from each table alone.

R4DS joins gives some detailed examples of joining tables, but here we’ll look at some simpler examples to try build our mental model about how to join tables.

There are two types of join we often encounter:

  1. Joins that create new variables in one table by matching observations in another: these are mutating joins as per tidyverse verb mutate Section 5.3.4.
  2. Joins that filter the observations from one table against another, as per tidyverse verb filter Section 5.3.1.

dplyr has six join functions.

Mutating joins:

  1. Left Join: Returns all rows from the left table and the matched rows from the right table. left_join()
  2. Right Join: Similar to left join, but returns all rows from the right table and the matched rows from the left table. right_join()
  3. Full Join: Returns all rows from both tables, with null values in the columns where there are no matches. full_join()

Filtering joins:

  1. Inner Join: Returns only the rows that have matches in both tables. inner_join()
  2. Semi Join: Returns only the rows from the left table that have matches in the right table. semi_join()
  3. Anti Join: Returns only the rows from the left table that do not have matches in the right table. anti_join()

6.2 Mutating joins

To demonstrate a mutating join using the penguins data, I first use mutate to create a unique identifier variable for each penguin from the row number called penguin_id, and assign this to a new dataframe numbered_penguins.

Then I use numbered_penguins to create two new dataframes using select (Listing 6.1) :

  • penguins_physical is a dataframe containing physical characteristics of penguins (e.g., bill length, flipper length, body mass).
  • penguins_info is a dataframe containing other information about penguins (e.g., species, island, year, sex).
Listing 6.1: Create unique id for each penguin and split the penguins data into two tables, one with physical measurements, the other with the remaining variables.
# Create unique id for each penguin
numbered_penguins <- penguins |> 
  mutate(penguin_id = row_number())

# Split the dataset into two tables
penguins_physical <- numbered_penguins |>
  select(penguin_id, bill_length_mm, 
         bill_depth_mm, flipper_length_mm, body_mass_g)

penguins_info <- numbered_penguins |>
  select(penguin_id, species, island, year, sex)

The common key that we can use to join these tables back together is the penguin_id variable. The mutation part is the creation of new variables when two tables are joined to create a single table.

Here we’ll use the left_join() function to combine the two dataframes penguins_physical and penguins_info based on the penguin_number column, which is common to both dataframes (Listing 6.3).

We assign the result to a new dataframe, penguins_joined, which contains all the columns from penguins_physical keyed to penguin_id from penguins_info.

The left_join function returns all records from the left dataframe (penguins_physical) and the matching records from the right dataframe (penguins_info). If there were no matches in the right dataframe, the result will contain null values for the right dataframe’s columns.

As we didn’t explicitly tell the function which variables to use, we got a message tells us it joined by = join_by(penguin_id).

Listing 6.2: Join the two penguins back together with a mutating left join
penguins_joined <- penguins_physical |>
  left_join(penguins_info)
Joining with `by = join_by(penguin_id)`
Listing 6.3: Join the two penguins back together with a mutating left join
# Print the joined dataset
glimpse(penguins_joined)
Rows: 344
Columns: 9
$ penguin_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ sex               <fct> male, female, female, NA, female, male, female, male…

6.3 Filtering join

Here we’ll do a filtering join, a semi-join.

First I’ll filter the penguins_info I created just for the data from the year 2008:

Listing 6.4: Create a dataframe for penguins only for the year 2008
penguins_info_filt <- penguins_info |>
  filter(year == 2008)

Then we do a semi-join between the penguins_physical table and the filtered penguins_info_filt table. We still use the penguin_id as the key.

A semi-join returns only the rows from penguins_physical that have a match in penguins_info_filt, effectively filtering the physical characteristics of penguins to only include those from 2008. This joins data from 114 rows of the 344 rows in the penguins_physical (Listing 6.6).

As we didn’t explicitly tell the function which variables to use, we got a message tells us it joined by = join_by(penguin_id)

I’ve assigned the resulting table to a new dataframe: penguins_filtered.

Listing 6.5: Perform a filtering semi join that returns only the rows from penguins_physical that match the rows in penguins_info_filt
penguins_filtered <- penguins_physical |>
  semi_join(penguins_info_filt)
Joining with `by = join_by(penguin_id)`
Listing 6.6: Perform a filtering semi join that returns only the rows from penguins_physical that match the rows in penguins_info_filt
# Print the filtered dataset
glimpse(penguins_filtered)
Rows: 114
Columns: 5
$ penguin_id        <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, …
$ bill_length_mm    <dbl> 39.6, 40.1, 35.0, 42.0, 34.5, 41.4, 39.0, 40.6, 36.5…
$ bill_depth_mm     <dbl> 17.7, 18.9, 17.9, 19.5, 18.1, 18.6, 17.5, 18.8, 16.6…
$ flipper_length_mm <int> 186, 188, 190, 200, 187, 191, 186, 193, 181, 194, 18…
$ body_mass_g       <int> 3500, 4300, 3450, 4050, 2900, 3700, 3550, 3800, 2850…

6.4 Strings and characters

The terms strings and characters are often used interchangeably, but a character is technically a single symbol or glyph, such as & or a, or even 🙀. And a sequence of characters and/or symbols form a string.

In R to create character vector, we enclose the characters forming the string in quotes. For example, "hello" is a string, and each individual character within it (h, e, l, l, o) is a character. "hello" is a character vector of length 1 as there is one string of five characters.

"hello world" is also a character vector of length 1 as the white space also counts as a character.

However using combine function c() to create c("hello", "world") would create a character vector of length 2 as we have now created "hello" and "world" as separate strings.

Single or double quotes?

Both single ' or double " are allowed, but for consistent style I’d recommend double, unless your string itself contains multiple " as per the tidyverse style guide.

Frequently we are dealing with data in the form of character vectors or strings which we need to match or extract, such as the affiliation of an author for a journal article in Scopus data. stringr is a tidyverse package of functions specifically for these tasks, and more generally we use regular expressions to match strings.

6.4.1 Regular expressions

A regular expression (regex) is a pattern used to match and manipulate strings. It’s like a really powerful “find and replace” tool, but has completely unintuitive syntax!

You can use regex to specify a pattern, and then use that pattern to search for matches in a string. Once matched, we can extract or manipulate the strings in conjunction with other functions.

stringr is the tidyverse package for working with strings and regex. Many of the stringr functions allow us to avoid explicitly worrying about writing a regex, but we can always create a regex if needed.

Regex patterns consist of special characters, such as ., *, +, ?, {, }, [, ], (, ), and \. These characters have special meanings, like “any character” ( . ), “zero or more” ( * ), or “start of string” ( ^ ). You can combine these characters to create complex patterns to match specific strings.

Escaping regex special characters with backslash

Because characters such dot . or asterisk * have special meaning in regex, if you want to treat them literally as a dot or asterisk, we have to escape their meta meaning.

To escape we use a backslash \ before the special character such that to treat a dot as a dot we would write \. in our regex.

6.4.2 Getting started with stringr

To explore stringr we’ll create a couple of character vectors:

  1. ding_words: a character vector of 20 words, all ending with ding using the c() function (Listing 6.7).
Listing 6.7: Create a character vector of twenty words ending in ‘ding’
ding_words <- c(
  "abounding",
  "astounding",
  "confounding",
  "resounding",
  "surrounding",
  "grounding",
  "founding",
  "pounding",
  "rounding",
  "sounding",
  "bounding",
  "hounding",
  "mounding",
  "wounding",
  "compounding",
  "expounding",
  "propounding",
  "rebounding",
  "unbounding",
  "floundering"
)
  1. cat_names: a character vector of 20 strings of cat names using the c() function (Listing 6.8)
Listing 6.8: Create a character vector of twenty cat names
cat_names <-
  c(
    "Whiskers",
    "Fluffy",
    "Mittens",
    "Socks",
    "Tiger",
    "Smokey",
    "Gizmo",
    "Oreo",
    "Luna",
    "Oliver",
    "Leo",
    "Milo",
    "Charlie",
    "Simba",
    "Nala",
    "Felix",
    "Garfield",
    "Tigger",
    "Chloe",
    "Bella"
  )

6.4.2.1 Pattern Matching and Viewing

str_view is a function from the stringr package to search for patterns in the words vector. It returns a visual representation of the matches.

str_view(words, "ding"): searches for words containing the substring “ding” (Listing 6.9).

Listing 6.9: View the string ‘ding’ in ding_words
str_view(ding_words,"ding")
 [1] │ aboun<ding>
 [2] │ astoun<ding>
 [3] │ confoun<ding>
 [4] │ resoun<ding>
 [5] │ surroun<ding>
 [6] │ groun<ding>
 [7] │ foun<ding>
 [8] │ poun<ding>
 [9] │ roun<ding>
[10] │ soun<ding>
[11] │ boun<ding>
[12] │ houn<ding>
[13] │ moun<ding>
[14] │ woun<ding>
[15] │ compoun<ding>
[16] │ expoun<ding>
[17] │ propoun<ding>
[18] │ reboun<ding>
[19] │ unboun<ding>

str_view(ding_words, "bou|rr"): searches for words containing the substrings “bou” or “rr”. The vertical bar | here is a special regex character that represent the Boolean OR operator (Listing 6.10).

Listing 6.10: View the string ‘bou’ or ‘rr’ in ding_words
str_view(ding_words,"bou|rr")
 [1] │ a<bou>nding
 [5] │ su<rr>ounding
[11] │ <bou>nding
[18] │ re<bou>nding
[19] │ un<bou>nding

6.4.2.2 Pattern Matching and Counting

str_detect is a function is used to identify elements in the cat_names vector that contain the pattern “[isk]” (i.e., the characters “i”, “s”, or “k”).

str_detect(cat_names, "[isk]"): returns a logical vector (TRUE or FALSE)indicating which names contain the pattern (Listing 6.11):

Listing 6.11: Find whether cat names contain i,s or k is TRUE or FALSE
str_detect(cat_names,"(isk)") 
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

str_count(cat_names, "[isk]"): counts the number of occurrences of the pattern in each name (Listing 6.12):

Listing 6.12: Count how many occurrences of i, s or k in each cat name
str_count(cat_names,"[isk]")
 [1] 4 0 2 2 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 0

6.5 String Replacement

Having briefly looked at some of the ways we match strings, we can then replace them using the str_replace and str_replace_all functions.

Here we use to replace str_replace and str_replace_all to match the pattern “[isk]” in the cat_names vector and replace it with a dash (-):

str_replace(cat_names, "[isk]", "-"): replaces the first occurrence of the pattern in each name, that is the first occurrence of either i,s or k (Listing 6.13):

Listing 6.13: Replace the first occurrence of i, s, or k in each name with a dash
str_replace(cat_names,"[isk]","-") 
 [1] "Wh-skers" "Fluffy"   "M-ttens"  "Soc-s"    "T-ger"    "Smo-ey"  
 [7] "G-zmo"    "Oreo"     "Luna"     "Ol-ver"   "Leo"      "M-lo"    
[13] "Charl-e"  "S-mba"    "Nala"     "Fel-x"    "Garf-eld" "T-gger"  
[19] "Chloe"    "Bella"   

str_replace_all(cat_names, "[isk]", "-"): replaces all occurrences of the pattern in each name, that is all occurances of either i,s or k (Listing 6.14):

Listing 6.14: Replaces all occurances of either i,s or k in each cat name
str_replace_all(cat_names,"[isk]","-")
 [1] "Wh---er-" "Fluffy"   "M-tten-"  "Soc--"    "T-ger"    "Smo-ey"  
 [7] "G-zmo"    "Oreo"     "Luna"     "Ol-ver"   "Leo"      "M-lo"    
[13] "Charl-e"  "S-mba"    "Nala"     "Fel-x"    "Garf-eld" "T-gger"  
[19] "Chloe"    "Bella"   

6.6 Creating the cat_tibble

To demonstrate using stringr and regular expressions to match a pattern in order to split a single column variable into two column variables as we will for the bibliometrics data let’s make a new cat based table called cat_tibble.

The code in Listing 6.15 creates cat_tibble a tibble dataframe with a single column cat_id.

Here we use the str_c function to concatenate strings. We’re creating a string that looks like <name>-sex_age>, e.g. <Whiskers>-M_5 where:

  • < is the literal character
  • cat_names is the vector of cat names
  • > is the literal character
  • sample(c("-M_","-F_"), ...) generates a random sex indicator (-M_ for male or -F_ for female) for each cat
  • sample(3:11, ...) generates a random age (between 3 and 11) for each cat
  • We pass the arguments size = length(cat_names), replace = TRUE to the sample() function to indicate we want the same number of values as the length of the cat_names vector, which is 20 names long.
Listing 6.15: Creating the cat tibble
cat_tibble <- tibble(
  cat_id = str_c("<", cat_names, ">",
                 sample(c("-M_","-F_"),
                        size = length(cat_names),
                      replace = TRUE),
               sample(3:15, size = length(cat_names),
                      replace = TRUE))
)

Let’s inspect the our new table Listing 6.16 :

Listing 6.16: View the cat_tibble
cat_tibble
# A tibble: 20 × 1
   cat_id        
   <chr>         
 1 <Whiskers>-F_7
 2 <Fluffy>-M_15 
 3 <Mittens>-M_9 
 4 <Socks>-F_11  
 5 <Tiger>-M_5   
 6 <Smokey>-F_13 
 7 <Gizmo>-F_13  
 8 <Oreo>-M_10   
 9 <Luna>-M_7    
10 <Oliver>-F_12 
11 <Leo>-M_3     
12 <Milo>-M_7    
13 <Charlie>-M_10
14 <Simba>-F_9   
15 <Nala>-F_5    
16 <Felix>-F_6   
17 <Garfield>-F_8
18 <Tigger>-F_10 
19 <Chloe>-F_3   
20 <Bella>-F_13  

7 Separating the cat_id column into multiple columns

Having created a table with a single variable that combines name, sex and age into a single variable cat_id (Listing 6.16), it would be much more useful to have these as three separate variables: name, sex, and age.

We can do this using the separate_wider_regex() function from the tidyr package to split the cat_id column into separate columns based on regular expression patterns (Listing 7.1).

Listing 7.1: Separate cat_tibble wider by regex.
cat_tibble |>
  separate_wider_regex(
    cat_id,
    patterns = c("<",
      name = "[A-Za-z]+", ">-",
      sex = ".","_",
      age = "[0-9]+")
  )
# A tibble: 20 × 3
   name     sex   age  
   <chr>    <chr> <chr>
 1 Whiskers F     7    
 2 Fluffy   M     15   
 3 Mittens  M     9    
 4 Socks    F     11   
 5 Tiger    M     5    
 6 Smokey   F     13   
 7 Gizmo    F     13   
 8 Oreo     M     10   
 9 Luna     M     7    
10 Oliver   F     12   
11 Leo      M     3    
12 Milo     M     7    
13 Charlie  M     10   
14 Simba    F     9    
15 Nala     F     5    
16 Felix    F     6    
17 Garfield F     8    
18 Tigger   F     10   
19 Chloe    F     3    
20 Bella    F     13   

The patterns argument is a named character vector created with the c() function defining the regular expression patterns to match and new variables to created by extracting strings from the cat_id column. We must match the whole pattern, including strings we don’t want.

Specifically:

  • < matches the literal < character, we don’t want this, so we don’t name it
  • name = "[A-Za-z]+" extracts one or more alphabetic characters (the cat’s name) to a new variable name.
  • ">-" matches the literal >- characters, we don’t want this, so we don’t name it.
  • sex = ".",extracts the next single character (.) to a new variable called sex. Here it’s M or F to indicate the sex of the cat.
  • "_" matches the literal underscore character (_), we don’t want this, so we don’t name it.
  • age = "[0-9]+" extracts one or more digits to a new variable called age.

With these tools we can now tackle the bibliometrics data.