Plots are the fastest way to show a result or to tell a story…

Based on the WSJ graphics work on vaccination and The inevitability of data visualization criticism

Or to mislead and promote fake news…

And, of course, memes…

Traditionally, scientists have focused mainly on data and neglected proper visualization practices. Nowadays, with bigger datasets, efficient visualization is needed not only to understand your data but also to condense results into a digestible snapshot.

Recommended reading: Why scientists need to be better at data visualization

From Data Science for Docs

1 Libraries and datasets

In this workshop, apart from the tidyverse library, which includes ggplot2, we will have a look at heatmaps and correlation plots. If you haven’t installed the packages yet, do so by copying the following line into the Console panel after the >:

install.packages(c("tidyverse", "corrplot"))

Hit Enter and the download and installation process should start. When it has finished, load the tidyverse by executing:

library("tidyverse")

1.1 Material to download…

Download the following files (right-click > Save link as…):

Data Visualization Cheat Sheet
Hepatitis dataset. Data adapted from the UCI Machine Learning repository

2 Basic data visualization

Although it is not the main topic of this workshop, you can visualise your data with ggplot2 by adding a few extra lines to your pipe. To do so, consider the following:

The Data Visualization Cheat Sheet is one of the main tools for finding which kind of plot you want to make and how to make it.
After the variable that contains your table, add %>% ggplot(aes()) and define the mappings inside aes(): x = your_column_x, y = your_column_y, color = your_column_color to color lines, fill = your_column_color to fill shapes with color, or size = your_column_size.
After the ggplot(aes()) call, layers are added with + instead of %>%.
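The steps above can be sketched as follows. This is only a minimal illustration, assuming the tidyverse is installed; the built-in mtcars dataset and the columns chosen here are not part of the workshop material.

```r
library(tidyverse)

# mtcars is a built-in dataset, used here only for illustration
p <- as_tibble(mtcars) %>%
  # Map columns to aesthetics inside aes()
  ggplot(aes(x = wt, y = mpg, color = factor(cyl))) +
  # From here on, layers are added with + instead of %>%
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders") +
  theme_minimal()

# Printing the object draws the plot
print(p)
```

Storing the plot in an object such as p is optional; piping straight into ggplot() without assignment draws it immediately.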

2.1 An example

Use the ToothGrowth dataset (remember you can use as_tibble() to see the data frame):

# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
?ToothGrowth
as_tibble(ToothGrowth)

Manipulate your data here and there to have a nice format for plotting. Group the observations and calculate summary statistics such as counts, mean and standard deviation per group:

# To refresh some data wrangling steps from the previous workshop
TG_summarised <- ToothGrowth %>% 
  # Group the data
  group_by(supp, dose = factor(dose)) %>% 
  # Calculate the mean, sd and number of samples per group
  summarise(len_mean = mean(len), sd = sd(len), n = n())

Unleash the ggplot magic:

# An example of barplot in ggplot
TG_summarised %>% 
  ggplot(aes(x = dose, y = len_mean, fill = supp)) + 
  # Bars with heights taken directly from len_mean (stat = "identity")
  geom_bar(stat = "identity") +
  # Add values inside the bars
  geom_text(aes(label = len_mean), color = "white", vjust = 1, size = 4) +
  # Add the error bars
  geom_errorbar(aes(ymin = len_mean - sd, ymax = len_mean + sd), 
                position = position_dodge(.9), width = 0.3) +
  # Split barplots by supp and remove the legend
  facet_wrap( ~ supp) + guides(fill = "none") +
  # change visual template and colors
  theme_minimal() + scale_fill_brewer(palette = "Set1")

Try it yourself: make, for example, a boxplot or a violin plot from the original ToothGrowth dataset, which is already available in R.

Tip: As seen in the previous code you might have to turn some continuous variables into factors.
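One possible approach to the exercise is sketched below. It is not the only solution; it simply assumes that one box per dose, colored by supp, is the plot you are after.

```r
library(tidyverse)

p_box <- ToothGrowth %>%
  # dose is numeric, so turn it into a factor to get one box per dose
  mutate(dose = factor(dose)) %>%
  ggplot(aes(x = dose, y = len, fill = supp)) +
  geom_boxplot() +
  # For a violin plot, swap geom_boxplot() for geom_violin()
  theme_minimal() + scale_fill_brewer(palette = "Set1")

print(p_box)
```

Unlike the barplot, this works on the raw observations, so no summary step is needed beforehand.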

2.2 Forest plot

If you would like to visualize the estimates of your regression and show off your very significant odds ratios, hazard ratios, etc., you can do that with ggplot. As an example, let’s fit a very wrong but functional logistic model on the previous data to predict the class supp.

model <- glm(supp ~ len + dose, family = "binomial", data = ToothGrowth)
summary(model)
## 
## Call:
## glm(formula = supp ~ len + dose, family = "binomial", data = ToothGrowth)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.59843  -0.96252   0.04771   1.04486   1.84253  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   1.5377     0.7860   1.956  0.05044 . 
## len          -0.2127     0.0728  -2.921  0.00348 **
## dose          2.0886     0.8497   2.458  0.01397 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.178  on 59  degrees of freedom
## Residual deviance: 72.330  on 57  degrees of freedom
## AIC: 78.33
## 
## Number of Fisher Scoring iterations: 4

Then we can turn the coefficients and confidence intervals into a table:

# Extract the estimates and the confidence intervals into a tibble
toforest <- cbind("Beta" = coef(model), confint(model)) %>%
  # rownames = names the column in which the row names will be stored
  as_tibble(., rownames = "Variable")
## Waiting for profiling to be done...

Finally, let’s feed ggplot some nice data:

toforest %>% ggplot(aes(y = Variable, x = Beta, xmin = `2.5 %`, xmax = `97.5 %`)) +
  # Add points for the estimates
  geom_point(color = 'black') +
  # Add horizontal errorbar using the xmin and xmax specified 
  geom_errorbarh(height = .05) +
  # Change the lower and upper limits of the plot
  scale_x_continuous(limits=c(-5,5), name='Estimate') +
  ylab('Variable') +
  # Add vertical line
  geom_vline(xintercept=0, color='black', linetype='dashed') +
  theme_minimal()
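The plot above shows the coefficients on the log-odds scale. If you prefer odds ratios, one option is to exponentiate the estimates and their confidence limits before plotting. The sketch below assumes the same illustrative model; the names or_table and p_or are introduced here and are not part of the workshop code.

```r
library(tidyverse)

# Same illustrative model as in the text
model <- glm(supp ~ len + dose, family = "binomial", data = ToothGrowth)

# Exponentiate the log-odds estimates and their confidence limits
or_table <- exp(cbind("OR" = coef(model), confint(model))) %>%
  as_tibble(rownames = "Variable") %>%
  # The intercept is not an odds ratio, so drop it
  filter(Variable != "(Intercept)")

p_or <- or_table %>%
  ggplot(aes(y = Variable, x = OR, xmin = `2.5 %`, xmax = `97.5 %`)) +
  geom_point() +
  geom_errorbarh(height = .05) +
  # Odds ratios are multiplicative, so a log scale is the natural axis
  scale_x_log10(name = "Odds ratio (log scale)") +
  # The reference line moves from 0 to 1 on the odds-ratio scale
  geom_vline(xintercept = 1, linetype = "dashed") +
  theme_minimal()

print(p_or)
```

An interval that crosses the dashed line at 1 is compatible with no association, just as an interval crossing 0 is on the log-odds scale.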

3 Many options

Check the interactive version with links to R code at https://www.data-to-viz.com/

4 Beyond ggplot2

Computers like numbers; characters or other types of data can confuse some libraries. This means that sometimes we have to turn our data into a numeric matrix. If we have character columns, these have to be transformed into a meaningful value or factorised. Once factorised, they can be converted into a numeric type.
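The conversion from characters to factors to numbers can be sketched on a toy vector. The values below are made up for illustration and do not come from the Hepatitis dataset.

```r
# A toy character vector (hypothetical values)
status <- c("no", "yes", "no", "yes", "yes")

# Factorise: levels are sorted alphabetically by default ("no" before "yes")
status_factor <- factor(status)
levels(status_factor)

# Convert to the underlying integer codes: "no" becomes 1, "yes" becomes 2
status_numeric <- as.numeric(status_factor)
status_numeric
```

Keep in mind that the codes only reflect the order of the levels, so they are meaningful as numbers only if that order is.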

Load the Hepatitis dataset and convert the columns of character type into factors:

clean_hepatitis <- read_csv("./material/clean_hepatitis.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   AGE = col_double(),
##   BILIRUBIN = col_double(),
##   ALK_PHOSPHATE = col_double(),
##   SGOT = col_double(),
##   ALBUMIN = col_double(),
##   PROTIME = col_double()
## )
## See spec(...) for full column specifications.

clean_hepatitis <- clean_hepatitis %>% mutate_if(is.character, as.factor)

Let’s make two matrices, one with only the categorical variables and another with the continuous values, to facilitate comparisons. This can be done by selecting the columns of the type of interest and coercing them into a matrix with as.matrix(). Columns in factor format must all be converted with as.numeric first:

# Select the factor columns and convert them into their numeric codes
hep_mat_factors <- clean_hepatitis %>% 
                               select_if(is.factor) %>% 
                               mutate_all(as.numeric) %>% as.matrix()

hep_mat_numeric <- clean_hepatitis %>% select_if(is.numeric) %>% as.matrix()

If your row names are contained in another column these can be added to the matrix using rownames(mymatrix) <- my_dataframe$column_with_IDs
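A minimal sketch of that assignment on a toy data frame follows. The data frame and its column names (patient_id, AGE, BILIRUBIN values) are hypothetical and only illustrate the pattern.

```r
# Toy data frame; the values and the ID column are made up
my_dataframe <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  AGE        = c(30, 50, 78),
  BILIRUBIN  = c(1.0, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns in the matrix
mymatrix <- as.matrix(my_dataframe[, c("AGE", "BILIRUBIN")])

# Attach the IDs as row names
rownames(mymatrix) <- my_dataframe$patient_id
mymatrix
```

Leaving the ID column out of the matrix matters: a single character column would turn the whole matrix into character type.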

5 Heatmaps

When the values are comparable and all of the same type, heatmaps can give us a good view of our numeric data. R has a built-in function (heatmap) that covers basic heatmaps. Let’s make a heatmap based on the categorical variables. Check ?heatmap for help.

# The scale is set to "none" since the values are already comparable
heatmap(hep_mat_factors, scale="none")

If the values are numeric but not comparable, these can be normalised by patient or by variable. Let’s make a heatmap of the age and biochemistry values:

heatmap(hep_mat_numeric, scale = "column")
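Scaling by column means that each variable is centred and scaled before the colors are assigned. The sketch below reproduces that idea by hand on a toy matrix with made-up values, as an illustration of the concept rather than a copy of what heatmap does internally.

```r
# Toy matrix with two columns on very different scales (made-up values)
m <- cbind(AGE = c(30, 50, 78, 31), SGOT = c(18, 42, 32, 52) * 10)

# Centre each column on its mean and divide by its standard deviation
m_scaled <- sweep(m, 2, colMeans(m))
m_scaled <- sweep(m_scaled, 2, apply(m, 2, sd), "/")

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(m_scaled), 10)
apply(m_scaled, 2, sd)
```

Without this step, the column with the largest values would dominate the color range and the other variables would look flat.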

There are many other heatmap libraries. A well-documented library for more complex heatmaps is, guess the name… ComplexHeatmap

6 Correlation is not causation… but it looks pretty nice

Correlation plots are a special case of heatmap that summarises, in a similar format, the correlation between columns or rows. First we have to compute a correlation matrix from a numeric matrix. This can be done with the cor() function, here using the Pearson correlation, which assumes linearity.

cormat <- cor(hep_mat_factors, use = "pairwise.complete.obs", method = "pearson")
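To see what the use and method arguments change, here is a small sketch on a toy matrix with one missing value. The numbers and the names toy, toy_default, toy_pairwise and toy_spearman are made up for illustration.

```r
# Toy matrix with one missing value (made-up numbers)
toy <- cbind(a = c(1, 2, 3, 4, 5),
             b = c(2, 4, 6, 8, NA),
             c = c(5, 3, 4, 1, 2))

# Default behaviour: a missing value in a column yields NA for its pairs
toy_default <- cor(toy)

# Pairwise deletion: each pair uses the rows where both values are present
toy_pairwise <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Rank-based alternative that does not assume linearity
toy_spearman <- cor(toy, use = "pairwise.complete.obs", method = "spearman")

toy_default
toy_pairwise
```

The same kind of call on hep_mat_factors produced cormat above, which is the matrix used in the next step.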

With this correlation matrix we can draw our correlation plot using the corrplot library:

library(corrplot)
## corrplot 0.84 loaded
corrplot(cormat, type = "lower", order = "AOE")

More help and options can be found in the vignettes for the corrplot package.

7 Other advanced plots

Here are some guides to reproduce other types of plots that could be useful:

UpSet plots: A new type of Venn Diagram

Principal Component Analysis (PCA) plots
Visualize gene sets with clusterProfiler

7.1 Bibliography

Thanks to the support of:

Data visualization in R

by Adrian G. Zucco