Data, like the world, can seem chaotic. To make sense of it, we have to transform the data into useful structures that both we and the computer can interact with.
In this workshop, we will be using the tidyverse library, a collection of R packages that acts as an extra layer of interaction between base R and the user without a significant impact on performance. If you haven’t installed it, do so by copying the following line into the Console panel after the > prompt:
install.packages("tidyverse")
Hit Enter and the download and installation process should start. When it has finished, load the library by executing:
library("tidyverse")
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## -- Attaching packages -------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The output above lists the core tidyverse packages that are attached when the library is loaded. Some other packages that might be useful to install are:
install.packages(c("readxl", "psych", "skimr"))
Please download the cheat sheets used as guides in this workshop, including the Data Import Cheat Sheet and the Data Transformation Cheat Sheet.
It is recommended to take care of your folder structure by organising your project into at least three folders: one for your scripts, one for your data and one for your results. To avoid problems with paths:
- Save your scripts with .R at the end of the file name.
- Check your current working directory with getwd().
Download the following files (right-click > Save link as…): Number of deaths by cause, and Healthcare expenditure as percentage of GDP.
Data extracted from Our World in Data.
Computer locations are structured as layers, one contained in another. To navigate the folder structure we have to know that:
- ./ is the current location.
- ../ moves out of the current location. It can be stacked, e.g. ../../
- / is the root (usually where the important files for the system are located).
- ~/ is the home directory, where you can #hygge.
You can change the current location by giving full directions from the root, or directions relative to the current folder, using setwd("./directions/tofolder/inside")
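As a rough illustration of these path rules, the sketch below builds a throwaway project inside R's temporary folder and moves around it with setwd(). The project name my_project and the three folder names are assumptions made for the example, not files from the workshop.

```r
# Build a throwaway project in a temporary folder (my_project is a made-up name)
project <- file.path(tempdir(), "my_project")
for (folder in c("scripts", "data", "results")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

start_dir <- getwd()            # remember where we started
setwd(project)                  # full path to the project folder
folders <- list.files(".")      # "./" is the current location
setwd("./data")                 # relative path: step into the data folder
setwd("../scripts")             # "../" steps out of data, then into scripts
current <- basename(getwd())    # name of the folder we ended up in
setwd(start_dir)                # go back to where we started

folders
current
```

Using file.path() to build paths avoids typing the separators by hand, which tends to make scripts easier to move between operating systems.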
Data can come in multiple formats. Look at the file extension of your data file, or open it in a text editor to see how it is formatted. Then look at the middle column of the first page of the Data Import Cheat Sheet to find the matching import function. Load the data into a variable such as my_data.
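A minimal sketch of an import, assuming a comma-separated file. To keep it runnable without the workshop downloads, it first writes a tiny made-up CSV (the file name example_data.csv and its values are invented) and then reads it with read_csv().

```r
library(tidyverse)

# Write a tiny made-up CSV so the example runs without downloading anything
csv_path <- file.path(tempdir(), "example_data.csv")
writeLines(
  c("country,year,deaths",
    "Denmark,2016,100",
    "Sweden,2016,120"),
  csv_path
)

# read_csv() guesses the column types and returns a tibble
my_data <- read_csv(csv_path)
my_data
```

For other formats the same pattern should apply with a different reader, for example read_tsv() for tab-separated files, read_delim() for other separators, or read_excel() from the readxl package for Excel files.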
Tables in base R are stored as data.frame objects. Tibbles are an improved version of the data.frame; files imported with the read_ functions (such as read_csv()) are returned as tibbles. See the difference by running the commands as.data.frame(my_data) and as_tibble(my_data).
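To see the difference without importing anything, the comparison can be sketched with the built-in ToothGrowth dataset standing in for my_data (that substitution is an assumption made for the example).

```r
library(tidyverse)

# ToothGrowth stands in for your own imported table
my_data <- as_tibble(ToothGrowth)
as_df   <- as.data.frame(my_data)

class(my_data)   # a tibble is still a data.frame, with extra classes on top
class(as_df)     # a plain data.frame

print(my_data)   # tibbles print only the first rows, plus the column types
head(as_df)      # a data.frame prints every row unless you limit it yourself
```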
When importing tables, the type of data in each column is guessed, but it can also be specified. You can explore your dataset interactively using view() (a new tab opens). Use glimpse() on the imported dataset to recognise the data type of each column:
Type | Description | Example |
---|---|---|
int | integers | 1, 2, 3, 4 |
dbl | doubles or real numbers | 1.0, 2.3, 3.623, 4.78 |
chr | characters or strings (text) | “Hello”, “wild-type”, “1” |
dttm | date-times | “2018-06-09 16:45:40” |
lgl | logical | TRUE / FALSE |
fct | factors | 1, 1, 2, 3, 4, 4 Levels: 1, 2, 3, 4 |
date | dates | “2018-06-09” |
Column types can be reformatted at any time.
Try to avoid spaces in your column names; names that contain spaces have to be wrapped in backticks every time you use them.
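Both ways of controlling column types can be sketched as follows: setting them at import with the col_types argument, or converting them afterwards with mutate(). The file typed_example.csv and its contents are made up for the example.

```r
library(tidyverse)

# A tiny made-up file to import
csv_path <- file.path(tempdir(), "typed_example.csv")
writeLines(c("sample,dose,passed", "A,1,TRUE", "B,2,FALSE"), csv_path)

# Specify the column types at import instead of letting readr guess
typed <- read_csv(
  csv_path,
  col_types = cols(
    sample = col_character(),
    dose   = col_integer(),
    passed = col_logical()
  )
)
glimpse(typed)

# Reformat the column types afterwards
converted <- typed %>%
  mutate(
    sample = factor(sample),   # text to factor
    dose   = as.double(dose)   # integer to double
  )
glimpse(converted)
```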
Tables mainly come in two designs: wide format and long format.
In the middle column of page 2 of the Data Import Cheat Sheet you can find how to tidy your data into a suitable format. In short:
- gather() to go from wide to long format
- spread() to go from long to wide format
One of the best enhancements in R is pipes. They chain commands together using %>%. This passes the result of one function as the first argument of the next function. On Windows, a pipe can also be inserted with Ctrl + Shift + M. Example:
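# mydata and the column names are placeholders for your own table
myresults <- mydata %>%
  select(column1, column2, 3:10, -column9) %>%  # keep these columns, drop column9
  filter(column1 < 0.05)                        # keep rows where column1 is below 0.05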
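The reshaping functions and the pipe described above can be sketched together on a small table. The table, its values and the column name cases are invented for the example.

```r
library(tidyverse)

# Made-up wide table: one column per year
wide <- tibble(
  country = c("Denmark", "Sweden"),
  `2015`  = c(10, 20),
  `2016`  = c(11, 21)
)

# Wide to long: the year columns become a year/cases pair of columns
long <- wide %>%
  gather(key = "year", value = "cases", -country)
long

# Long to wide: spread the year values back out into columns
wide_again <- long %>%
  spread(key = year, value = cases)
wide_again
```

In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are offered as replacements for gather() and spread(), which remain available.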
There are many things you can do with your dataset. A suggested way of operating would be:
- Search online for what you want to do, adding tidyverse, dplyr or R at the end of your query
A very brief summary of things you can do:
- select() columns
- filter() values in columns
- arrange() your data in ascending order, or arrange(desc()) in descending order
- mutate() to create new columns or overwrite existing ones
- pull() a specific column as a vector
- rename() columns
Provided your data is in long format, you can group your observations based on a specific column using group_by(column_name). This allows you to perform operations and run functions per group instead of on the whole dataset. Check page 1 of the Data Transformation Cheat Sheet.
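The verbs listed above might be combined as in the sketch below, which uses the built-in ToothGrowth dataset. The new names supplement and len_minus_group_mean are invented for the example.

```r
library(tidyverse)

teeth <- as_tibble(ToothGrowth)

high_dose <- teeth %>%
  select(supp, dose, len) %>%       # choose and reorder columns
  filter(dose >= 1) %>%             # keep rows with a dose of at least 1
  mutate(dose = factor(dose)) %>%   # overwrite an existing column
  rename(supplement = supp) %>%     # new_name = old_name
  arrange(desc(len))                # longest teeth first
high_dose

# pull() returns one column as a plain vector
lengths <- high_dose %>% pull(len)

# group_by() makes mutate() work per group: here mean(len) is computed
# separately for each supplement type
centred <- teeth %>%
  group_by(supp) %>%
  mutate(len_minus_group_mean = len - mean(len)) %>%
  ungroup()
centred
```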
dplyr | Description | SQL |
---|---|---|
inner_join(x, y, by = "col") | Keeps only rows common to x and y | SELECT * FROM x INNER JOIN y USING (col) |
left_join(x, y, by = "col") | Keeps all rows in x | SELECT * FROM x LEFT OUTER JOIN y USING (col) |
right_join(x, y, by = "col") | Keeps all rows in y | SELECT * FROM x RIGHT OUTER JOIN y USING (col) |
full_join(x, y, by = "col") | Keeps all rows in x or y | SELECT * FROM x FULL OUTER JOIN y USING (col) |
When merging tables, columns that have the same name in both tables are used automatically. To specify which common column to join on, add by = "column_with_same_name". If the column names don’t match, use by = c("col_in_x" = "col_in_y"). Preferably, change the column names with rename() beforehand to avoid issues. Check page 2 of the Data Transformation Cheat Sheet.
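A small sketch of the joins, using two made-up tables whose key columns have different names (country and nation). All table names, column names and values are invented for the example.

```r
library(tidyverse)

# Two made-up tables sharing some, but not all, countries
deaths <- tibble(
  country = c("Denmark", "Sweden", "Norway"),
  deaths  = c(10, 20, 30)
)
spending <- tibble(
  nation  = c("Denmark", "Sweden", "Finland"),
  pct_gdp = c(1, 2, 3)
)

# Key columns have different names: map them with a named vector
inner <- inner_join(deaths, spending, by = c("country" = "nation"))
left  <- left_join(deaths, spending, by = c("country" = "nation"))
full  <- full_join(deaths, spending, by = c("country" = "nation"))

# Alternative: rename first, then join on the shared name
spending_renamed <- spending %>% rename(country = nation)
right <- right_join(deaths, spending_renamed, by = "country")

inner   # only Denmark and Sweden appear in both tables
left    # all rows of deaths; Norway gets NA for pct_gdp
right   # all rows of spending; Finland gets NA for deaths
full    # every country from either table
```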
Custom summary reports can be created using summarise(). Although flexible, this requires you to specify in detail which summaries you want to see, such as the mean, median or maximum values. Packages like skimr or psych provide a set of out-of-the-box summary statistics for your data. Examples based on the built-in dataset ToothGrowth:
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
##
## filter
# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
ToothGrowth %>% skim() %>% print()
## Skim summary statistics
## n obs: 60
## n variables: 3
##
## -- Variable type:factor -------------------------------------------------------------------------------------------------------------
## variable missing complete n n_unique top_counts ordered
## supp 0 60 60 2 OJ: 30, VC: 30, NA: 0 FALSE
##
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0 p25 p50 p75 p100
## dose 0 60 60 1.17 0.63 0.5 0.5 1 2 2
## len 0 60 60 18.81 7.65 4.2 13.07 19.25 25.27 33.9
## hist
## ▇▁▇▁▁▁▁▇
## ▃▅▃▅▃▇▂▂
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
ToothGrowth %>% describe()
## # A tibble: 3 x 13
## vars n mean sd median trimmed mad min max range skew
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 60 18.8 7.65 19.2 18.9 9.04 4.2 33.9 29.7 -0.143
## 2 2 60 1.5 0.504 1.5 1.5 0.741 1 2 1 0
## 3 3 60 1.17 0.629 1 1.15 0.741 0.5 2 1.5 0.372
## # ... with 2 more variables: kurtosis <dbl>, se <dbl>
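For comparison with the out-of-the-box summaries above, a custom report built with group_by() and summarise() might look like the sketch below. The choice of statistics and the output column names are assumptions made for the example.

```r
library(tidyverse)

# One row per combination of supplement type and dose
custom_summary <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(
    n          = n(),          # observations per group
    mean_len   = mean(len),
    median_len = median(len),
    max_len    = max(len)
  ) %>%
  ungroup()

custom_summary
```

Each statistic has to be named explicitly, which is the extra specification work that skimr and psych take care of for you.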
Although it is not the topic of this workshop, visualising your data with ggplot2 can be achieved by adding some extra lines to your pipe. To do this, consider:
- Pipe your data into %>% ggplot(aes( and define x = your_column_x, y = your_column_y, and color = your_column_color to color lines or fill = your_column_color to fill shapes with color.
- After the ggplot(aes()) function, layers are added with + instead of %>%
An example could be:
# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
ToothGrowth %>%
  # Turn numeric dose into categories
  mutate(dose = factor(dose)) %>%
  ggplot(aes(x = dose, y = len, fill = supp)) +
  # Non-stacked barplots
  geom_col(position = position_dodge()) +
  # Change visual template and colors
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")
Have fun and play around =D