- Alistair Bailey
- website: ab604.uk
- email: ab604@soton.ac.uk
- twitter: @alistair604
9-10th October 2018
"Reproducibility involves being able to recalculate the exact numbers in a data analysis using the code and raw data provided by the analyst…Reproducibility should not be confused with “correctness” of a data analysis. A data analysis can be fully reproducible and recreate all numbers in an analysis and still be misleading or incorrect."
Jeff Leek, The Elements of Data Analytic Style
All the official software carpentry lesson materials can be found here
The official materials mostly use base R; we used a mixture of base R and the tidyverse.
Ross Ihaka and Robert Gentleman.
R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996
“There are only two kinds of languages: the ones people complain about and the ones nobody uses”
Bjarne Stroustrup, C++ creator and developer
?function_name
e.g. ?mean
# Make a character vector using the combine function, c()
cards <- c("ace", "king", "queen", "jack", "ten")
# Return the values of cards
cards
## [1] "ace" "king" "queen" "jack" "ten"
# Return the third value of cards
# Indexing starts at 1
cards[3]
## [1] "queen"
# Make a character vector using the combine function, c()
cards <- c("ace", "king", "queen", "jack", "ten")
# Make a numeric vector using seq()
my_sequence <- seq(1, 7)
# Check the type of each vector
typeof(cards)
## [1] "character"
typeof(my_sequence)
## [1] "integer"
Factors are R's way of storing categorical information such as eye colour or car type.
A factor can only take certain values, and can be ordered (such as low, medium, high) or unordered (such as types of fruit).
Factors look like strings, but behave like integers
See this SWC lesson
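A minimal sketch of the "look like strings, behave like integers" point. The `sizes` vector and its levels are made up for illustration and are not part of the workshop data.

```r
# Hypothetical categorical data, recorded as strings
sizes <- c("low", "high", "medium", "low")

# Unordered factor: levels default to alphabetical order
sizes_unordered <- factor(sizes)
levels(sizes_unordered)
## [1] "high"   "low"    "medium"

# Ordered factor: state the order of the levels explicitly
sizes_ordered <- factor(sizes,
                        levels = c("low", "medium", "high"),
                        ordered = TRUE)

# Underneath, each value is an integer code pointing at a level
as.integer(sizes_ordered)
## [1] 1 3 2 1

# Ordered factors support comparisons between levels
sizes_ordered < "high"
## [1]  TRUE FALSE  TRUE  TRUE
```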
# Create a list with three objects, a sequence, a character vector, a list
list1 <- list(100:130, "R", list(TRUE, FALSE))
# Return the values of the list
list1
## [[1]]
##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
## [18] 117 118 119 120 121 122 123 124 125 126 127 128 129 130
##
## [[2]]
## [1] "R"
##
## [[3]]
## [[3]][[1]]
## [1] TRUE
##
## [[3]][[2]]
## [1] FALSE
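As a short follow-up sketch, the elements of a list like this one can be pulled out with single or double square brackets. The list is recreated here so the example runs on its own.

```r
# Recreate the list from the slide
list1 <- list(100:130, "R", list(TRUE, FALSE))

# Single brackets return a smaller list
class(list1[2])
## [1] "list"

# Double brackets return the element itself
list1[[2]]
## [1] "R"

# Chain double brackets to reach into the nested list
list1[[3]][[2]]
## [1] FALSE

# Elements can differ in type and length
length(list1[[1]])
## [1] 31
```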
Matrices store values in a two-dimensional array, whilst arrays can have n dimensions.
See this SWC lesson
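A small base R sketch of the difference. The object names `m` and `a` are arbitrary choices for this example.

```r
# A matrix has two dimensions; values fill column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
## [1] 2 3

# Index with [row, column]
m[2, 3]
## [1] 6

# An array generalises this to n dimensions, here three
a <- array(1:24, dim = c(2, 3, 4))
dim(a)
## [1] 2 3 4

# Index with one position per dimension
a[1, 2, 3]
## [1] 15
```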
Gapminder Foundation data with information about countries around the world from 1952 to 2007.
The workshop data is already tidy.
We load it using the readr package function read_csv().
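A hedged sketch of what loading a file with read_csv() looks like. The workshop file path is not given here, so this example writes a tiny stand-in file first. The column names match the workshop data, but the rows are placeholder values, not real Gapminder figures.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (placeholder values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "country,continent,year,lifeExp,pop,gdpPercap",
  "Country A,Continent X,1952,50.0,1000000,1000.0",
  "Country A,Continent X,1957,52.5,1100000,1200.5"
), tmp)

# read_csv() returns a tibble and guesses a type for each column
gapminder_toy <- read_csv(tmp, show_col_types = FALSE)
gapminder_toy
```

In the workshop itself the call would point at the downloaded Gapminder file rather than a temporary one.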
An excerpt of data for teaching containing observations of six variables:
- country
- continent
- year
- lifeExp: life expectancy at birth
- pop: total population
- gdpPercap: per-capita GDP (gross domestic product)
%>%
magrittr is part of the tidyverse.
"The magrittr package offers a set of operators which make your code more readable"
The pipe operator %>% pipes the left-hand side values forward into expressions that appear on the right-hand side.
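To make that concrete, here is a minimal sketch comparing nested calls with the piped equivalent. The vector `x` is an arbitrary example.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the pipe, the same steps read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

piped
## [1] 2.5

# Both forms give the same result
identical(nested, piped)
## [1] TRUE
```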
dplyr is part of the tidyverse
"dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."
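A brief sketch of a few of those verbs in use. The `toy` tibble below is made-up illustrative data that borrows the workshop column names; it is not the Gapminder dataset.

```r
library(dplyr)

# Made-up data with the same column names as the workshop data
toy <- tibble(
  country = c("Country A", "Country A", "Country B", "Country B"),
  year    = c(1952, 2007, 1952, 2007),
  lifeExp = c(50, 60, 40, 80)
)

# filter() keeps rows, select() keeps columns, arrange() sorts
latest <- toy %>%
  filter(year == 2007) %>%
  select(country, lifeExp) %>%
  arrange(desc(lifeExp))

# group_by() plus summarise() collapses each group to one row
toy_summary <- toy %>%
  group_by(country) %>%
  summarise(mean_lifeExp = mean(lifeExp))
```

Here `latest` has two rows with Country B first, and `toy_summary` has one mean per country (55 and 60).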
Three components: name, body and a set of arguments
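A minimal sketch labelling those three components. The function `add_two` is a hypothetical example, and `formals()` and `body()` are base R helpers for inspecting a function.

```r
# Name: add_two
# Arguments: x and y (y has a default value of 2)
# Body: the code between the braces
add_two <- function(x, y = 2) {
  x + y
}

# Inspect the arguments and the body
names(formals(add_two))
## [1] "x" "y"
body(add_two)

# Call it with the default, then override the default
add_two(3)
## [1] 5
add_two(3, y = 10)
## [1] 13
```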
# Roll two dice function
roll <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}
# Load dplyr for %>%, filter() and mutate()
library(dplyr)

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat, yr = NULL, ctry = NULL) {
  # Is there a year argument?
  if (!is.null(yr)) {
    dat <- dat %>% filter(year == yr)
  }
  # Is there a country argument?
  if (!is.null(ctry)) {
    dat <- dat %>% filter(country == ctry)
  }
  # Create new GDP column
  new <- dat %>% mutate(gdp = pop * gdpPercap)
  return(new)
}
ggplot2 is part of the tidyverse
"ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details."
# In most cases plotting follows this form using the pipe
dat %>%
  ggplot(aes(<variables to be aesthetically mapped>)) +
  geom_...(<arguments to geometric layer>)

# Or without the pipe, passing the data as the first argument to ggplot()
ggplot(dat, aes(<variables to be aesthetically mapped>)) +
  geom_...(<arguments to geometric layer>)
# Use mpg data to create a scatter plot of engine size versus fuel
# efficiency, colouring the points according to car type
mpg %>%
  ggplot(aes(displ, hwy, colour = class)) +
  geom_point()