Data, like the world, can seem chaotic. To investigate it, we have to transform the data into useful structures that both we and the computer can work with.

In this workshop, we will be using the tidyverse library, a collection of R packages that adds an extra layer of interaction between base R and the user without a significant impact on performance. If you haven’t installed it yet, copy the following line into the Console panel after the > prompt:

install.packages("tidyverse")

Hit Enter and the download and installation process should start. When it has finished, load the library by executing:

library("tidyverse")
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## -- Attaching packages -------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The output above lists the tidyverse packages that were loaded. Some other packages that might be useful to install are:

install.packages(c("readxl", "psych", "skimr"))

Please download the following cheat sheets as guides:

1 Importing data into R

1.1 The source of the data

It is recommended to take care of your folder structure by organising your project into at least three folders: one for your scripts, one for your data and one for your results. To avoid problems with paths:

  1. Create an empty file with the .R extension that will contain your script.
  2. Execute the script using RStudio.
  3. To find out where your current R session is running, use getwd().

Download the following files (right-click > Save link as…): Number of deaths by cause, and Healthcare expenditure as percentage of GDP.

Data extracted from Our World in Data.

1.2 The path to the data

Folders on a computer are nested, one inside another. To navigate the folder structure, we need to know that:

  • ./ is the current location
  • ../ is the parent of the current location. It can be stacked, e.g. ../../
  • / is the root (usually where the important system files are located)
  • ~/ is the home directory, where you can #hygge

You can change the working directory by giving either the full path from the root or a path relative to the current folder, using setwd("./directions/tofolder/inside").
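As a minimal sketch of the navigation rules above, the following creates a hypothetical project layout (the folder names my_project, scripts, data and results are assumptions for illustration) inside a temporary folder, so none of your real folders are touched:

```r
# Hypothetical project layout, created inside R's temporary folder
project <- file.path(tempdir(), "my_project")
for (folder in c("scripts", "data", "results")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# setwd() moves the session and (invisibly) returns the previous location
old_wd <- setwd(file.path(project, "scripts"))
getwd()                      # the session now runs inside my_project/scripts

# "../data" means: go one level up, then into the data folder
data_dir <- normalizePath(file.path("..", "data"))
basename(data_dir)

# List what lives one level above the current folder
folders_above <- list.files("..")
folders_above

# Go back to where the session started
setwd(old_wd)
```

Building paths with file.path() instead of pasting strings together is a design choice that tends to keep scripts portable between operating systems.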

1.3 The nature of the data

Data can come in multiple formats. Look at the file extension of your data file, or open it in a text editor to see how it is formatted. Look at the middle column of the first page of the Data Import Cheat Sheet. Load the data into a variable such as my_data.
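A minimal sketch of loading a comma-separated file could look like the following. The file name example_deaths.csv and its contents are invented for illustration only; replace the path with the location of the file you downloaded.

```r
library(readr)

# Hypothetical example file with made-up values, written to a temporary folder
csv_path <- file.path(tempdir(), "example_deaths.csv")
writeLines(c("country,year,deaths",
             "Denmark,2016,100",
             "Mexico,2016,250"),
           csv_path)

# Comma-separated values are read with read_csv().
# Other readr readers: read_tsv() for tabs, read_csv2() for semicolons,
# and read_delim() for any other delimiter.
my_data <- read_csv(csv_path)
my_data
```

Excel files are not covered by readr; the readxl package installed earlier provides read_excel() for that case.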

1.4 Data frames vs Tibbles

Tables in base R are stored as data.frame objects. Tibbles are an improved version of the data.frame. When files are imported with the read_ functions, the result is formatted as a tibble. Look at the difference by running the commands as.data.frame(my_data) and as_tibble(my_data).
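To make the comparison concrete, here is a small sketch using a made-up table (the column names sample and value are assumptions; your own my_data works the same way):

```r
library(tibble)

# Small made-up table, for illustration only
my_data <- tibble(sample = c("a", "b", "c"),
                  value  = c(1.5, 2.3, 3.6))

df  <- as.data.frame(my_data)
tbl <- as_tibble(df)

class(df)    # a plain data.frame
class(tbl)   # a tibble is still a data.frame, with extra classes on top

# Subsetting a single column behaves differently:
df[, "value"]    # the data.frame drops to a plain vector
tbl[, "value"]   # the tibble stays a one-column tibble
```

Printing also differs: a tibble shows only the first rows together with the column types, whereas a data.frame prints every row.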

2 Column names and formats

When tables are imported, the data type of each column is guessed, but it can also be specified. You can explore your dataset interactively using view() (a new tab opens). Take a glimpse() at the imported dataset and identify the data type of each column:

Type  Description                    Example
int   integers                       1, 2, 3, 4
dbl   doubles or real numbers        1.0, 2.3, 3.623, 4.78
chr   characters or strings (text)   “Hello”, “wild-type”, “1”
dttm  date-times                     “2018-06-09 16:45:40”
lgl   logical                        TRUE / FALSE
fct   factors                        1, 1, 2, 3, 4, 4 Levels: 1, 2, 3, 4
date  dates                          “2018-06-09”

Column types can be reformatted at any time.

Try to avoid spaces in your column names
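As a sketch of both points, the made-up table below has a column name containing a space and numbers stored as text; the names sample id and genotype are assumptions for illustration:

```r
library(tibble)
library(dplyr)

# Made-up table: a column name with a space, and numbers stored as text
my_data <- tibble(`sample id` = c("1", "2", "3"),
                  genotype    = c("wild-type", "mutant", "mutant"))

clean_data <- my_data %>%
  # Names with spaces must be wrapped in backticks, so rename them early
  rename(sample_id = `sample id`) %>%
  # Reformat the column types: chr -> int and chr -> factor
  mutate(sample_id = as.integer(sample_id),
         genotype  = factor(genotype))

glimpse(clean_data)
```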

3 Long vs wide datasets

Tables can be mainly found in two designs:

  • In wide format, all observations for each sample or subject are contained in one row, spread across multiple columns.
  • In long (or tidy) format, each observation is in its own row and each variable in its own column. There are multiple rows for each sample or subject.

The middle column of page 2 of the Data Import Cheat Sheet shows how to tidy your data into the suitable format. In short:

  • gather() to go from wide to long format
  • spread() to go from long to wide format
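The two functions can be sketched on a made-up table as follows (country names and values are invented for illustration). Note that tidyr 1.0.0 and later also offer pivot_longer() and pivot_wider(), which supersede gather() and spread(); the older functions are used here to match the text.

```r
library(tibble)
library(tidyr)

# Made-up wide table: one row per country, one column per year
wide <- tibble(country = c("A", "B"),
               `2015`  = c(10, 20),
               `2016`  = c(11, 21))

# Wide -> long: the old column names go into `year`,
# the cell values go into `value`; `country` is left untouched
long <- gather(wide, key = "year", value = "value", -country)
long

# Long -> wide: the reverse operation
wide_again <- spread(long, key = year, value = value)
wide_again
```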

4 Pipes

One of the best enhancements in R is the pipe. Pipes chain commands together using %>%, passing the result of one function as the first argument of the next function. In RStudio on Windows, the pipe can also be inserted with Ctrl + Shift + M. Example:

myresults <- mydata %>%
  select(column1, column2, 3:10, -column9) %>%
  filter(column1 < 0.05)
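The example above uses placeholder column names. A runnable sketch on the built-in ToothGrowth dataset (the threshold of 30 is an arbitrary choice) shows that a pipe is only a different way of writing nested function calls:

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- filter(select(ToothGrowth, len, dose), len > 30)

# Same steps with pipes: read from top to bottom
piped <- ToothGrowth %>%
  select(len, dose) %>%
  filter(len > 30)

# Both versions give the same table
identical(nested, piped)
```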

5 Explore your data

There are many things you can do with your dataset. A suggested way of operating would be:

  1. Clarify your questions or idea of what you want to see
  2. Check the Data Transformation Cheat Sheet to find the functions you need to answer your questions
  3. If the cheat sheet is not enough, look up your problem on the internet by adding tidyverse, dplyr or R at the end of your query

A very brief summary of things you can do:

  • select() columns
  • filter() values in columns
  • arrange() your data in ascending order, or arrange(desc()) in descending order
  • mutate() to create new columns or overwrite existing ones
  • pull() a specific column as a vector
  • rename() columns
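The functions listed above can be combined in a single pipe. The following is a sketch on the built-in ToothGrowth dataset; the new column name len_per_dose and the choice of filter are assumptions made for illustration:

```r
library(dplyr)

result <- ToothGrowth %>%
  select(len, supp, dose) %>%           # keep these columns
  filter(supp == "OJ") %>%              # keep rows where supp is "OJ"
  mutate(len_per_dose = len / dose) %>% # create a new column
  rename(length = len) %>%              # new_name = old_name
  arrange(desc(length))                 # largest values first

head(result, 3)

# pull() returns the column as a plain vector instead of a table
lengths <- result %>% pull(length)
max(lengths)
```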

5.1 Group your data

Provided that your data is in long format, you can group your observations based on a specific column using group_by(column_name). This allows you to perform operations and run functions per group instead of on the whole dataset. Check page 1 of the Data Transformation Cheat Sheet.
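As a sketch of a grouped operation, the following computes the mean length and the number of observations for every combination of supp and dose in the built-in ToothGrowth dataset (the names mean_len and n are arbitrary choices):

```r
library(dplyr)

group_means <- ToothGrowth %>%
  group_by(supp, dose) %>%           # one group per supp/dose combination
  summarise(mean_len = mean(len),    # computed separately within each group
            n        = n()) %>%
  ungroup()                          # drop the grouping when you are done

group_means
```

Calling ungroup() at the end is a precaution: otherwise later steps in the pipe would still be applied per group.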

5.2 Combine datasets

dplyr                          Description                               SQL
inner_join(x, y, by = "col")   Keeps only common rows between x and y    SELECT * FROM x INNER JOIN y USING (col)
left_join(x, y, by = "col")    Keeps all rows in x                       SELECT * FROM x LEFT OUTER JOIN y USING (col)
right_join(x, y, by = "col")   Keeps all rows in y                       SELECT * FROM x RIGHT OUTER JOIN y USING (col)
full_join(x, y, by = "col")    Keeps all rows in x or y                  SELECT * FROM x FULL OUTER JOIN y USING (col)
Extracted from R for Data Science


When merging tables, columns with the same name in both tables are used automatically. To specify the common column to join on, add by = "column_with_same_name". If the column names don’t match, use by = c("col_in_x" = "col_in_y"). Preferably, change the column names with rename() beforehand to avoid issues. Check page 2 of the Data Transformation Cheat Sheet.
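The joins can be sketched with two small made-up tables. All names and values below (deaths, spending, country, Entity, health_pct_gdp) are invented for illustration and do not come from the workshop files:

```r
library(tibble)
library(dplyr)

# Made-up tables whose key columns have different names
deaths   <- tibble(country = c("A", "B", "C"),
                   deaths  = c(100, 250, 75))
spending <- tibble(Entity         = c("A", "B", "D"),
                   health_pct_gdp = c(10.1, 5.5, 8.0))

# Column names differ, so map them explicitly: col_in_x = col_in_y
inner <- inner_join(deaths, spending, by = c("country" = "Entity"))  # A, B
left  <- left_join(deaths, spending,  by = c("country" = "Entity"))  # A, B, C
right <- right_join(deaths, spending, by = c("country" = "Entity"))  # A, B, D
full  <- full_join(deaths, spending,  by = c("country" = "Entity"))  # A, B, C, D

# Alternative: rename first, then join on the shared name
renamed <- spending %>%
  rename(country = Entity) %>%
  inner_join(deaths, by = "country")
```

Rows without a match in the other table are filled with NA, for example health_pct_gdp for country C in the left join.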

5.3 Summarise your data

Custom summary reports can be created using summarise(). Although flexible, this requires a detailed specification of the summaries we want to see, such as mean, median, maximum values, etc. Packages like skimr or psych provide a set of out-of-the-box summary statistics for your data. Examples based on the built-in dataset ToothGrowth:

library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter

# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
ToothGrowth %>% skim() %>% print()
## Skim summary statistics
##  n obs: 60 
##  n variables: 3 
## 
## -- Variable type:factor -------------------------------------------------------------------------------------------------------------
##  variable missing complete  n n_unique            top_counts ordered
##      supp       0       60 60        2 OJ: 30, VC: 30, NA: 0   FALSE
## 
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
##  variable missing complete  n  mean   sd  p0   p25   p50   p75 p100
##      dose       0       60 60  1.17 0.63 0.5  0.5   1     2     2  
##       len       0       60 60 18.81 7.65 4.2 13.07 19.25 25.27 33.9
##      hist
##  ▇▁▇▁▁▁▁▇
##  ▃▅▃▅▃▇▂▂
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

ToothGrowth %>% describe()
## # A tibble: 3 x 13
##    vars     n  mean    sd median trimmed   mad   min   max range   skew
##   <int> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1     1    60 18.8  7.65    19.2   18.9  9.04    4.2  33.9  29.7 -0.143
## 2     2    60  1.5  0.504    1.5    1.5  0.741   1     2     1    0    
## 3     3    60  1.17 0.629    1      1.15 0.741   0.5   2     1.5  0.372
## # ... with 2 more variables: kurtosis <dbl>, se <dbl>
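For comparison with the ready-made summaries above, a custom summarise() call on the same dataset could be sketched as follows; the choice of statistics and their names is arbitrary:

```r
library(dplyr)

custom_summary <- ToothGrowth %>%
  summarise(n          = n(),
            mean_len   = mean(len),
            median_len = median(len),
            sd_len     = sd(len),
            min_len    = min(len),
            max_len    = max(len))

custom_summary
```

Every statistic has to be requested explicitly, which is more typing than skim() or describe() but gives full control over what ends up in the report.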

6 Basic data visualization

Although it is not the topic of this workshop, you can visualise your data with ggplot2 by adding a few extra lines to your pipe. To do so, consider:

  • The Data Visualization Cheat Sheet is one of the main tools to find which kind of plot you want to make and how to do it.
  • Start by adding %>% ggplot(aes( and define x = your_column_x, y = your_column_y, and either color = your_column_color to color lines or fill = your_column_color to fill shapes with color.
  • After the ggplot(aes()) call, layers are added with + instead of %>%

An example could be:

# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
ToothGrowth %>%
  # Turn numeric dose into categories
  mutate(dose = factor(dose)) %>%
  ggplot(aes(x = dose, y = len, fill = supp)) + 
  # Non-stacked barplots
  geom_col(position = position_dodge()) +
  # change visual template and colors
  theme_minimal() + scale_fill_brewer(palette="Set1")

Have fun and play around =D



