9-10th October 2018

About me

Imposter syndrome

Hadley Wickham: Practitioner/Programmer

Reproducible Science

"Reproducibility involves being able to recalculate the exact numbers in a data analysis using the code and raw data provided by the analyst…Reproducibility should not be confused with “correctness” of a data analysis. A data analysis can be fully reproducible and recreate all numbers in an analysis and still be misleading or incorrect."

Jeff Leek, The Elements of Data Analytic Style

Official lesson materials

All the official Software Carpentry lesson materials can be found here

The official materials mostly use base R; we used a mixture of base R and the tidyverse.

Syllabus

  • Intro to R and RStudio
  • Importing data
  • Transforming data with dplyr
  • Functions in R
  • Visualising data with ggplot

R is 25 years old

Ross Ihaka and Robert Gentleman.

R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996

Writing code is frustrating

“There are only two kinds of languages: the ones people complain about and the ones nobody uses”

Bjarne Stroustrup, C++ creator and developer

Reproducible R

Don't save your workspace, save your code

Reproducible R

Work in projects, let RStudio help manage your files

Getting help

  • Google: 'Typically adding “R” to a query is enough to restrict it to relevant results'. Hadley Wickham
  • Check out the help pages using ?function_name e.g. ?mean
  • Join RStudio Community
  • Learn how to make a reproducible example

Assigning objects

names are labels bound to objects
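
A minimal sketch of what binding a name means in practice. The names `x` and `y` and their values are illustrative only and are not taken from the workshop.

```r
# Bind the name x to a numeric vector
x <- c(1, 2, 3)

# Bind a second name, y, to the same values
y <- x

# Changing y leaves x untouched: R copies the vector when it is modified
y[1] <- 10

x
y
```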

Atomic vectors

One-dimensional groups, the key building blocks of R objects

Indexing atomic vectors

# Make a character vector using the combine function, c()
cards <- c("ace", "king", "queen", "jack", "ten")

# Return the values of cards
cards
## [1] "ace"   "king"  "queen" "jack"  "ten"
# Return the third value of cards
# Indexing starts at 1
cards[3]
## [1] "queen"
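
Beyond a single position, an index can itself be a vector. The following sketch reuses the `cards` vector from the slide to show a few common forms; the result names are illustrative.

```r
cards <- c("ace", "king", "queen", "jack", "ten")

# Several positions at once
first_and_third <- cards[c(1, 3)]

# A range of positions
middle_three <- cards[2:4]

# A negative index drops that position
without_ace <- cards[-1]

# A logical test keeps the values where the test is TRUE
not_ten <- cards[cards != "ten"]

first_and_third
middle_three
without_ace
not_ten
```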

Type of atomic vectors

# Make a character vector using the combine function, c()
cards <- c("ace", "king", "queen", "jack", "ten")
# Make an integer vector using seq()
my_sequence <- seq(1, 7)

# Check the type of each vector
typeof(cards)
## [1] "character"
typeof(my_sequence)
## [1] "integer"
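
For comparison, a short sketch of the other common atomic types, and of what happens when types are mixed in one vector. The variable names are illustrative.

```r
# Numbers with a decimal part are stored as doubles
typeof(c(1.5, 2, 3))      # "double"

# The colon operator produces integers
typeof(1:3)               # "integer"

# TRUE and FALSE are logical values
typeof(c(TRUE, FALSE))    # "logical"

# An atomic vector holds one type only, so mixed input is coerced
mixed <- c(1, "a", TRUE)
mixed                     # "1" "a" "TRUE"
typeof(mixed)             # "character"
```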

Factors

Factors are R's way of storing categorical information such as eye colour or car type.

A factor can only take certain values, called levels. It can be ordered (such as low, medium, high) or unordered (such as types of fruit).

Factors look like strings, but behave like integers

See this SWC lesson
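
A small sketch of both kinds of factor, using the examples named above. The values are made up for illustration.

```r
# An ordered factor: the levels argument sets the order
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

levels(sizes)      # "low" "medium" "high"
typeof(sizes)      # "integer": stored as integer codes
as.integer(sizes)  # 1 3 2 1
sizes < "high"     # TRUE FALSE TRUE TRUE

# An unordered factor: levels default to alphabetical order
fruit <- factor(c("pear", "apple", "pear"))
levels(fruit)      # "apple" "pear"
```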

All vectors

Lists group objects rather than individual values, for example several atomic vectors of different lengths. NULL represents the absence of a vector.

Lists

# Create a list with three objects, a sequence, a character vector, a list
list1 <- list(100:130, "R", list(TRUE, FALSE))
# Return the values of the list
list1
## [[1]]
##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [18] 117 118 119 120 121 122 123 124 125 126 127 128 129 130
## 
## [[2]]
## [1] "R"
## 
## [[3]]
## [[3]][[1]]
## [1] TRUE
## 
## [[3]][[2]]
## [1] FALSE
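
The double brackets in the printed output hint at how lists are indexed. As a sketch, using the same `list1` (the result names are illustrative):

```r
list1 <- list(100:130, "R", list(TRUE, FALSE))

# Single brackets return a smaller list
first_as_list <- list1[1]

# Double brackets return the object stored at that position
second_value <- list1[[2]]

# Double brackets can be chained to reach into a nested list
nested_value <- list1[[3]][[2]]

is.list(first_as_list)  # TRUE
second_value            # "R"
nested_value            # FALSE
```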

Matrices and arrays

Matrices store values in a two-dimensional array, whereas arrays can have n dimensions.

See this SWC lesson
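
A brief sketch of the difference, with illustrative names and values:

```r
# A matrix has two dimensions; values fill column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)      # 2 3
m[2, 3]     # row 2, column 3: 6

# An array can have more; here 2 rows, 3 columns and 4 layers
a <- array(1:24, dim = c(2, 3, 4))
dim(a)      # 2 3 4
a[1, 2, 3]  # row 1, column 2, layer 3: 15
```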

Data frames

Two-dimensional versions of lists, in which each column is a vector of the same length.
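
As a sketch, a small data frame built from two vectors of equal length. The column names and values are made up for illustration.

```r
df <- data.frame(
  card  = c("ace", "king", "queen"),
  value = c(1, 13, 12),
  stringsAsFactors = FALSE
)

is.list(df)    # TRUE: a data frame is a list of columns
dim(df)        # 3 rows, 2 columns
df$value       # one column, returned as an atomic vector
df[2, "card"]  # row 2 of the card column: "king"
```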

Tidy data

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each observational unit forms a table
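
To make the three rules concrete, here is a hedged sketch that reshapes a small untidy table, where the year variable is spread across column names, into tidy form. It assumes a recent version of the tidyr package (for `pivot_longer()`), which was not necessarily what the workshop used, and the values are made up.

```r
library(tidyr)

# Untidy: the years 1952 and 2007 are column names, not values
wide <- data.frame(
  country = c("A", "B"),
  `1952`  = c(60, 40),
  `2007`  = c(80, 70),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Tidy: one column per variable, one row per observation
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "lifeExp")
long$year <- as.integer(long$year)

long
```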

Gapminder data

Gapminder Foundation data with information about countries around the world from 1952 to 2007.

The workshop data is already tidy.

We load it using the readr function read_csv().
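
A minimal sketch of the loading step. The workshop file path is not given here, so this example writes a one-row CSV with made-up values to a temporary file and reads it back. The name `csv_path` is hypothetical.

```r
library(readr)

# Write a tiny stand-in file; the real workshop file would replace this
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,continent,year,lifeExp,pop,gdpPercap",
             "Exampleland,Europe,1952,60.1,1000000,5000"),
           csv_path)

# read_csv() guesses the column types and returns a tibble
gapminder <- read_csv(csv_path)
gapminder
```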

Gapminder data

An excerpt of data for teaching containing observations of six variables:

  • country
  • continent
  • year
  • lifeExp - life expectancy at birth
  • pop - total population
  • gdpPercap - per-capita GDP (Gross domestic product)

magrittr

The pipe operator in R: %>%

magrittr is part of the tidyverse.

"The magrittr package offers a set of operators which make your code more readable"

The pipe operator %>% pipes the left-hand side values forward into expressions that appear on the right-hand side.
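
As a sketch of that forwarding, the same calculation is written twice below, first as nested calls and then as a pipeline. The vector `x` is illustrative.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The pipe passes each result on as the first argument of the next call
piped <- x %>% sqrt() %>% mean() %>% round(1)

nested  # 2.5
piped   # 2.5
```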

dplyr

dplyr is part of the tidyverse

"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."

dplyr::filter()

dplyr::select()

dplyr::mutate()

dplyr::summarise()
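
The four verbs above can be sketched on a small data frame shaped like the gapminder data. The rows are made up for illustration and are not real gapminder values.

```r
library(dplyr)

gap <- data.frame(
  country   = c("A", "A", "B", "B"),
  continent = c("Europe", "Europe", "Asia", "Asia"),
  year      = c(1952, 2007, 1952, 2007),
  lifeExp   = c(60, 80, 40, 70),
  pop       = c(1e6, 2e6, 5e6, 9e6),
  gdpPercap = c(5000, 30000, 800, 6000),
  stringsAsFactors = FALSE
)

# filter() keeps the rows that match a condition
recent <- gap %>% filter(year == 2007)

# select() keeps the named columns
slim <- gap %>% select(country, year, pop)

# mutate() adds a column computed from existing ones
with_gdp <- gap %>% mutate(gdp = pop * gdpPercap)

# summarise() collapses each group to a single row
by_continent <- gap %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp))

by_continent
```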

Functions

Garrett Grolemund, Hands-On Programming with R

Three components: name, body and a set of arguments

# Roll two dice function
roll <- function(){
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}
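
The function above has a name and a body but an empty argument list. As a hedged sketch of the third component, here is a hypothetical variant, `roll_n()`, with two arguments that have default values. The name and arguments are illustrative and do not come from the workshop.

```r
# Roll n dice with a given number of sides and return the total
roll_n <- function(n = 2, sides = 6) {
  die <- seq_len(sides)
  dice <- sample(die, size = n, replace = TRUE)
  sum(dice)
}

# Inspect the components of the function
names(formals(roll_n))  # the arguments: "n" "sides"
body(roll_n)            # the code that runs when the function is called

# Call it with the defaults, then with the arguments overridden
set.seed(2018)          # fix the random seed so reruns give the same rolls
total <- roll_n()
roll_n(n = 3, sides = 20)
```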

Functions

Gapminder GDP calculator

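# Takes a dataset and multiplies the population column
# by the GDP per capita column.
# Requires dplyr for %>%, filter() and mutate().
library(dplyr)

calcGDP <- function(dat, yr = NULL, ctry = NULL) {
  # If a year was supplied, keep only the rows for that year
  if (!is.null(yr)) {
    dat <- dat %>% filter(year == yr)
  }
  # If a country was supplied, keep only the rows for that country
  if (!is.null(ctry)) {
    dat <- dat %>% filter(country == ctry)
  }
  # Create new GDP column
  new <- dat %>% mutate(gdp = pop * gdpPercap)
  return(new)
}

# Example call, assuming the gapminder data frame has been loaded:
# calcGDP(gapminder, yr = 2007, ctry = "Australia")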

ggplot2

ggplot2 is part of the tidyverse

"ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.

You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details."

ggplot2

# In most cases plotting follows this form using the pipe
dat %>% ggplot(aes(<variables to be aesthetically mapped>)) +
  geom_...(<arguments to geometric layer>)

# Or without the pipe: the data is the first argument to ggplot()
ggplot(dat, aes(<variables to be aesthetically mapped>)) +
  geom_...(<arguments to geometric layer>)

ggplot2

# Use mpg data to create a scatter plot of engine size versus fuel
# efficiency, colouring the points according to car type
library(tidyverse)  # provides ggplot2, the mpg data and the pipe

mpg %>% ggplot(aes(displ, hwy, colour = class)) +
  geom_point()
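
The same plot can be sketched without the pipe, storing it in an object and adding a labelling layer. The label text is an illustrative choice based on the `mpg` documentation (`displ` is engine displacement in litres, `hwy` is highway miles per gallon).

```r
library(ggplot2)

# Data first, then the aesthetic mapping, then layers added with +
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       colour = "Car type")

# Printing the object draws the plot
p
```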

Further resources

A few R people to follow; there are many others

  • Hadley Wickham @hadleywickham
  • Garrett Grolemund @StatGarrett
  • Jenny Bryan @JennyBryan
  • Mara Averick @datandme
  • Thomas Lin Pedersen @thomasp85
  • Julia Silge @juliasilge
  • David Robinson @drob
  • Maëlle Salmon @ma_salmon
  • Jesse Mostipak @kierisi
  • Yihui Xie @xieyihui
  • Jim Hester @jimhester_