6 Data wrangling II
6.1 Joins
We often want to join tables together based upon a variable or variables common to each table, called a key or keys, such as the digital object identifier (DOI) that is present in tables downloaded from both Scopus and Web of Science (Section 7.2) in order to further transform the data or answer a question that cannot easily be answered from each table alone.
R4DS joins gives some detailed examples of joining tables, but here we’ll look at some simpler examples to try build our mental model about how to join tables.
There are two types of join we often encounter:
- Joins that create new variables in one table by matching observations in another: these are mutating joins as per tidyverse verb
mutate
Section 5.3.4. - Joins that filter the observations from one table against another, as per tidyverse verb
filter
Section 5.3.1.
dplyr
has six join functions.
Mutating joins:
-
Left Join: Returns all rows from the left table and the matched rows from the right table.
left_join()
-
Right Join: Similar to left join, but returns all rows from the right table and the matched rows from the left table.
right_join()
-
Full Join: Returns all rows from both tables, with null values in the columns where there are no matches.
full_join()
Filtering joins:
-
Inner Join: Returns only the rows that have matches in both tables.
inner_join()
-
Semi Join: Returns only the rows from the left table that have matches in the right table.
semi_join()
-
Anti Join: Returns only the rows from the left table that do not have matches in the right table.
anti_join()
6.2 Mutating joins
To demonstrate a mutating join using the penguins
data, I first use mutate
to create a unique identifier variable for each penguin from the row number called penguin_id
, and assign this to a new dataframe numbered_penguins
.
Then I use numbered_penguins
to create two new dataframes using select
(Listing 6.1) :
-
penguins_physical
is a dataframe containing physical characteristics of penguins (e.g., bill length, flipper length, body mass). -
penguins_info
is a dataframe containing other information about penguins (e.g., species, island, year, sex).
The common key that we can use to join these tables back together is the penguin_id
variable. The mutation part is the creation of new variables when two tables are joined to create a single table.
Here we’ll use the left_join()
function to combine the two dataframes penguins_physical
and penguins_info
based on the penguin_number
column, which is common to both dataframes (Listing 6.3).
We assign the result to a new dataframe, penguins_joined
, which contains all the columns from penguins_physical
keyed to penguin_id
from penguins_info
.
The left_join
function returns all records from the left dataframe (penguins_physical
) and the matching records from the right dataframe (penguins_info
). If there were no matches in the right dataframe, the result will contain null values for the right dataframe’s columns.
As we didn’t explicitly tell the function which variables to use, we got a message tells us it joined by = join_by(penguin_id)
.
Joining with `by = join_by(penguin_id)`
Rows: 344
Columns: 9
$ penguin_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ sex <fct> male, female, female, NA, female, male, female, male…
6.3 Filtering join
Here we’ll do a filtering join, a semi-join.
First I’ll filter the penguins_info
I created just for the data from the year 2008:
Then we do a semi-join between the penguins_physical
table and the filtered penguins_info_filt
table. We still use the penguin_id
as the key.
A semi-join returns only the rows from penguins_physical
that have a match in penguins_info_filt
, effectively filtering the physical characteristics of penguins to only include those from 2008. This joins data from 114 rows of the 344 rows in the penguins_physical
(Listing 6.6).
As we didn’t explicitly tell the function which variables to use, we got a message tells us it joined by = join_by(penguin_id)
I’ve assigned the resulting table to a new dataframe: penguins_filtered
.
Joining with `by = join_by(penguin_id)`
Rows: 114
Columns: 5
$ penguin_id <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, …
$ bill_length_mm <dbl> 39.6, 40.1, 35.0, 42.0, 34.5, 41.4, 39.0, 40.6, 36.5…
$ bill_depth_mm <dbl> 17.7, 18.9, 17.9, 19.5, 18.1, 18.6, 17.5, 18.8, 16.6…
$ flipper_length_mm <int> 186, 188, 190, 200, 187, 191, 186, 193, 181, 194, 18…
$ body_mass_g <int> 3500, 4300, 3450, 4050, 2900, 3700, 3550, 3800, 2850…
6.4 Strings and characters
The terms strings and characters are often used interchangeably, but a character is technically a single symbol or glyph, such as & or a, or even 🙀. And a sequence of characters and/or symbols form a string.
In R to create character vector, we enclose the characters forming the string in quotes. For example, "hello"
is a string, and each individual character within it (h
, e
, l
, l
, o
) is a character. "hello"
is a character vector of length 1 as there is one string of five characters.
"hello world"
is also a character vector of length 1 as the white space also counts as a character.
However using combine function c()
to create c("hello", "world")
would create a character vector of length 2 as we have now created "hello"
and "world"
as separate strings.
Both single '
or double "
are allowed, but for consistent style I’d recommend double, unless your string itself contains multiple "
as per the tidyverse style guide.
Frequently we are dealing with data in the form of character vectors or strings which we need to match or extract, such as the affiliation of an author for a journal article in Scopus data. stringr
is a tidyverse package of functions specifically for these tasks, and more generally we use regular expressions to match strings.
6.4.1 Regular expressions
A regular expression (regex) is a pattern used to match and manipulate strings. It’s like a really powerful “find and replace” tool, but has completely unintuitive syntax!
You can use regex to specify a pattern, and then use that pattern to search for matches in a string. Once matched, we can extract or manipulate the strings in conjunction with other functions.
stringr
is the tidyverse package for working with strings and regex. Many of the stringr
functions allow us to avoid explicitly worrying about writing a regex, but we can always create a regex if needed.
Regex patterns consist of special characters, such as .
, *
, +
, ?
, {
, }
, [
, ]
, (
, )
, and \
. These characters have special meanings, like “any character” ( .
), “zero or more” ( *
), or “start of string” ( ^
). You can combine these characters to create complex patterns to match specific strings.
Because characters such dot .
or asterisk *
have special meaning in regex, if you want to treat them literally as a dot or asterisk, we have to escape their meta meaning.
To escape we use a backslash \
before the special character such that to treat a dot as a dot we would write \.
in our regex.
6.4.2 Getting started with stringr
To explore stringr
we’ll create a couple of character vectors:
-
ding_words
: a character vector of 20 words, all ending withding
using thec()
function (Listing 6.7).
-
cat_names
: a character vector of 20 strings of cat names using thec()
function (Listing 6.8)
6.4.2.1 Pattern Matching and Viewing
str_view
is a function from the stringr
package to search for patterns in the words
vector. It returns a visual representation of the matches.
str_view(words, "ding")
: searches for words containing the substring “ding” (Listing 6.9).
[1] │ aboun<ding>
[2] │ astoun<ding>
[3] │ confoun<ding>
[4] │ resoun<ding>
[5] │ surroun<ding>
[6] │ groun<ding>
[7] │ foun<ding>
[8] │ poun<ding>
[9] │ roun<ding>
[10] │ soun<ding>
[11] │ boun<ding>
[12] │ houn<ding>
[13] │ moun<ding>
[14] │ woun<ding>
[15] │ compoun<ding>
[16] │ expoun<ding>
[17] │ propoun<ding>
[18] │ reboun<ding>
[19] │ unboun<ding>
str_view(ding_words, "bou|rr")
: searches for words containing the substrings “bou” or “rr”. The vertical bar |
here is a special regex character that represent the Boolean OR operator (Listing 6.10).
[1] │ a<bou>nding
[5] │ su<rr>ounding
[11] │ <bou>nding
[18] │ re<bou>nding
[19] │ un<bou>nding
6.4.2.2 Pattern Matching and Counting
str_detect
is a function is used to identify elements in the cat_names
vector that contain the pattern “[isk]” (i.e., the characters “i”, “s”, or “k”).
str_detect(cat_names, "[isk]")
: returns a logical vector (TRUE
or FALSE
)indicating which names contain the pattern (Listing 6.11):
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
str_count(cat_names, "[isk]")
: counts the number of occurrences of the pattern in each name (Listing 6.12):
[1] 4 0 2 2 1 1 1 0 0 1 0 1 1 1 0 1 1 1 0 0
6.5 String Replacement
Having briefly looked at some of the ways we match strings, we can then replace them using the str_replace
and str_replace_all
functions.
Here we use to replace str_replace
and str_replace_all
to match the pattern “[isk]” in the cat_names
vector and replace it with a dash (-):
str_replace(cat_names, "[isk]", "-")
: replaces the first occurrence of the pattern in each name, that is the first occurrence of either i
,s
or k
(Listing 6.13):
[1] "Wh-skers" "Fluffy" "M-ttens" "Soc-s" "T-ger" "Smo-ey"
[7] "G-zmo" "Oreo" "Luna" "Ol-ver" "Leo" "M-lo"
[13] "Charl-e" "S-mba" "Nala" "Fel-x" "Garf-eld" "T-gger"
[19] "Chloe" "Bella"
str_replace_all(cat_names, "[isk]", "-")
: replaces all occurrences of the pattern in each name, that is all occurances of either i
,s
or k
(Listing 6.14):
[1] "Wh---er-" "Fluffy" "M-tten-" "Soc--" "T-ger" "Smo-ey"
[7] "G-zmo" "Oreo" "Luna" "Ol-ver" "Leo" "M-lo"
[13] "Charl-e" "S-mba" "Nala" "Fel-x" "Garf-eld" "T-gger"
[19] "Chloe" "Bella"
6.6 Creating the cat_tibble
To demonstrate using stringr
and regular expressions to match a pattern in order to split a single column variable into two column variables as we will for the bibliometrics data let’s make a new cat based table called cat_tibble
.
The code in Listing 6.15 creates cat_tibble
a tibble dataframe with a single column cat_id
.
Here we use the str_c
function to concatenate strings. We’re creating a string that looks like <name>-sex_age>
, e.g. <Whiskers>-M_5
where:
-
<
is the literal character -
cat_names
is the vector of cat names -
>
is the literal character -
sample(c("-M_","-F_"), ...)
generates a random sex indicator (-M_
for male or-F_
for female) for each cat -
sample(3:11, ...)
generates a random age (between 3 and 11) for each cat - We pass the arguments
size = length(cat_names), replace = TRUE
to thesample()
function to indicate we want the same number of values as the length of thecat_names
vector, which is 20 names long.
Let’s inspect the our new table Listing 6.16 :
# A tibble: 20 × 1
cat_id
<chr>
1 <Whiskers>-F_7
2 <Fluffy>-M_15
3 <Mittens>-M_9
4 <Socks>-F_11
5 <Tiger>-M_5
6 <Smokey>-F_13
7 <Gizmo>-F_13
8 <Oreo>-M_10
9 <Luna>-M_7
10 <Oliver>-F_12
11 <Leo>-M_3
12 <Milo>-M_7
13 <Charlie>-M_10
14 <Simba>-F_9
15 <Nala>-F_5
16 <Felix>-F_6
17 <Garfield>-F_8
18 <Tigger>-F_10
19 <Chloe>-F_3
20 <Bella>-F_13
7 Separating the cat_id
column into multiple columns
Having created a table with a single variable that combines name, sex and age into a single variable cat_id
(Listing 6.16), it would be much more useful to have these as three separate variables: name
, sex
, and age
.
We can do this using the separate_wider_regex()
function from the tidyr
package to split the cat_id
column into separate columns based on regular expression patterns (Listing 7.1).
# A tibble: 20 × 3
name sex age
<chr> <chr> <chr>
1 Whiskers F 7
2 Fluffy M 15
3 Mittens M 9
4 Socks F 11
5 Tiger M 5
6 Smokey F 13
7 Gizmo F 13
8 Oreo M 10
9 Luna M 7
10 Oliver F 12
11 Leo M 3
12 Milo M 7
13 Charlie M 10
14 Simba F 9
15 Nala F 5
16 Felix F 6
17 Garfield F 8
18 Tigger F 10
19 Chloe F 3
20 Bella F 13
The patterns
argument is a named character vector created with the c()
function defining the regular expression patterns to match and new variables to created by extracting strings from the cat_id
column. We must match the whole pattern, including strings we don’t want.
Specifically:
-
<
matches the literal<
character, we don’t want this, so we don’t name it -
name = "[A-Za-z]+"
extracts one or more alphabetic characters (the cat’s name) to a new variablename
. -
">-"
matches the literal>-
characters, we don’t want this, so we don’t name it. -
sex = ".",
extracts the next single character (.
) to a new variable calledsex
. Here it’sM
orF
to indicate the sex of the cat. -
"_"
matches the literal underscore character (_
), we don’t want this, so we don’t name it. -
age = "[0-9]+"
extracts one or more digits to a new variable calledage
.
With these tools we can now tackle the bibliometrics data.