Plots are the fastest way to show a result or to tell a story…
Or to mislead and promote fake news…
From Calling Bullshit
Make art in R… From Data to art in R
And, of course, memes…
Traditionally, scientists have mainly focused on data and ignored proper visualization practices. Nowadays, with bigger datasets, efficient visualization is needed not only to understand your data but to also convey results into a digestible snapshot.
Recommended reading: Why scientists need to be better at data visualization
In this workshop, apart from the tidyverse library, which includes ggplot2, we will have a look at heatmaps and correlation plots. If you haven’t installed these libraries, do so by copying the following line into the Console panel after the > prompt:
install.packages(c("tidyverse", "corrplot"))
Hit Enter and the download and installation process should start. When finished, load the libraries by executing:
library("tidyverse")
And download the following files (right-click > Save link as…)
Although it is not the main topic of this workshop, you can visualise your data with ggplot2 by adding some extra lines to your pipe. To do so, add %>% ggplot(aes( and define your aesthetics: x = your_column_x, y = your_column_y, color = your_column_color to color lines, fill = your_column_color to fill shapes with color, or size = your_column_size. After the ggplot(aes()) function, layers are added by piping with + instead of %>%.
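As a minimal sketch of this pattern, assuming the built-in ToothGrowth dataset and an arbitrary choice of columns (the scatter plot is only for illustration), the data are piped into ggplot() with %>% and the layers are then added with +:

```r
library(dplyr)    # provides the %>% pipe
library(ggplot2)

# Pipe the data into ggplot() with %>%, then add layers with +
p <- ToothGrowth %>%
  ggplot(aes(x = dose, y = len, color = supp)) +
  geom_point() +
  theme_minimal()

p
```

Using %>% after ggplot() instead of + is a common mistake and results in an error.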
Use the ToothGrowth dataset (remember you can use as_tibble() to see the data frame):
# Built-in dataset: The Effect of Vitamin C on Tooth Growth in Guinea Pigs
?ToothGrowth
as_tibble(ToothGrowth)
Reshape your data into a convenient format for plotting. Group the observations and calculate summary statistics such as counts, mean and standard deviation per group:
# To refresh some data wrangling steps from the previous workshop
TG_summarised <- ToothGrowth %>%
# Group the data by supplement and dose (dose converted to a factor)
group_by(supp, dose = factor(dose)) %>%
# Calculate mean, sd and number of samples per group
summarise(len_mean = mean(len), sd = sd(len), n = n())
Unleash the ggplot magic:
# An example of barplot in ggplot
TG_summarised %>%
ggplot(aes(x = dose, y = len_mean, fill = supp)) +
# Non-stacked barplots
geom_bar(stat = "identity") +
# Add values inside the bars
geom_text(aes(label = len_mean), color = "white", vjust = 1, size = 4) +
# Add the error bars
geom_errorbar(aes(ymin = len_mean - sd, ymax = len_mean + sd),
position = position_dodge(.9), width = 0.3) +
# Split barplots by supp and remove the legend
facet_wrap( ~ supp) + guides(fill = "none") +
# change visual template and colors
theme_minimal() + scale_fill_brewer(palette = "Set1")
Try it yourself by making, for example, a boxplot or violin plot based on the original ToothGrowth dataset, which is already available in R.
Tip: As seen in the previous code you might have to turn some continuous variables into factors.
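One possible starting point, as a sketch rather than the only solution, is shown below. The choice of aesthetics and the name p_box are assumptions made for illustration:

```r
library(dplyr)
library(ggplot2)

# dose is numeric in ToothGrowth, so convert it to a factor to get one box per dose
p_box <- ToothGrowth %>%
  mutate(dose = factor(dose)) %>%
  ggplot(aes(x = dose, y = len, fill = supp)) +
  geom_boxplot() +
  theme_minimal()

p_box
```

For a violin plot, swap geom_boxplot() for geom_violin().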
If you would like to visualize the estimates of your regression and see your very significant odds ratios, hazard ratios, etc., you can do that with ggplot. As an example, let’s fit a very wrong but functional logistic model on the previous data to predict the class supp.
model <- glm(supp ~ len + dose, family = "binomial", data = ToothGrowth)
summary(model)
##
## Call:
## glm(formula = supp ~ len + dose, family = "binomial", data = ToothGrowth)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.59843 -0.96252 0.04771 1.04486 1.84253
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5377 0.7860 1.956 0.05044 .
## len -0.2127 0.0728 -2.921 0.00348 **
## dose 2.0886 0.8497 2.458 0.01397 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.178 on 59 degrees of freedom
## Residual deviance: 72.330 on 57 degrees of freedom
## AIC: 78.33
##
## Number of Fisher Scoring iterations: 4
Then we can turn the coefficients and confidence intervals into a table:
# Extract the estimates and the confidence intervals into a tibble
toforest <- cbind("Beta" = coef(model), confint(model)) %>%
# Rownames to specify in which column the row names will be stored
as_tibble(., rownames = "Variable")
## Waiting for profiling to be done...
Finally, let’s feed ggplot some nice data:
toforest %>% ggplot(aes(y = Variable, x = Beta, xmin = `2.5 %`, xmax = `97.5 %`)) +
# Add points for the estimates
geom_point(color = 'black') +
# Add horizontal errorbar using the xmin and xmax specified
geom_errorbarh(height = .05) +
# Change the lower and upper limits of the plot
scale_x_continuous(limits=c(-5,5), name='Estimate') +
ylab('Variable') +
# Add vertical line
geom_vline(xintercept=0, color='black', linetype='dashed') +
theme_minimal()
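Because the model is logistic, the estimates above are on the log-odds scale. If odds ratios are preferred, one possible sketch is to exponentiate the estimates and the interval limits before plotting. Wald intervals from confint.default() are used here as a simplifying assumption (they differ slightly from the profile intervals returned by confint()), and the name or_table is only illustrative:

```r
model <- glm(supp ~ len + dose, family = "binomial", data = ToothGrowth)

# Wald confidence intervals on the log-odds scale (an assumption of this sketch)
wald_ci <- confint.default(model)

# Exponentiate the estimates and interval limits to obtain odds ratios
or_table <- data.frame(
  Variable  = names(coef(model)),
  OR        = exp(coef(model)),
  lower     = exp(wald_ci[, 1]),
  upper     = exp(wald_ci[, 2]),
  row.names = NULL
)

or_table
```

On the odds ratio scale the reference line of "no effect" sits at 1 instead of 0, and a logarithmic x axis usually reads better.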
Computers like numbers; characters or other types of data can be confusing for some libraries. This means that sometimes we have to turn our data into a numeric matrix. If we have character columns, these have to be transformed into meaningful values or converted into factors. Once they are factors, they can be converted into a numeric type.
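A small sketch of that conversion on a made-up character vector (the values are hypothetical and unrelated to the workshop data):

```r
# Hypothetical character column
answers <- c("yes", "no", "yes", "no")

# Factor levels are sorted alphabetically by default: "no" = 1, "yes" = 2
answers_factor <- factor(answers)
answers_numeric <- as.numeric(answers_factor)

levels(answers_factor)
answers_numeric
```

Keep in mind that the resulting numbers are only level codes, so their order reflects the order of the factor levels and nothing else.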
Load the Hepatitis dataset and convert the columns of character type into factors:
clean_hepatitis <- read_csv("./material/clean_hepatitis.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## AGE = col_double(),
## BILIRUBIN = col_double(),
## ALK_PHOSPHATE = col_double(),
## SGOT = col_double(),
## ALBUMIN = col_double(),
## PROTIME = col_double()
## )
## See spec(...) for full column specifications.
clean_hepatitis <- clean_hepatitis %>% mutate_if(is.character, as.factor)
Let’s make two matrices, one with only the categorical variables and another with the continuous values, to facilitate comparisons. This can be done by selecting the columns of the type of interest and coercing them into a matrix with as.matrix(). Columns in factor format all have to be converted with as.numeric first:
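# Select the factor columns, convert them to numeric codes and coerce to a matrix
hep_mat_factors <- clean_hepatitis %>%
select_if(is.factor) %>%
mutate_all(as.numeric) %>% as.matrix()
# Select the numeric columns and coerce them to a matrix
hep_mat_numeric <- clean_hepatitis %>% select_if(is.numeric) %>% as.matrix()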
If your row names are contained in another column these can be added to the matrix using rownames(mymatrix) <- my_dataframe$column_with_IDs
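For instance, with a made-up data frame (the column names patient_id, age and albumin are hypothetical), this could look like:

```r
# Hypothetical data frame with an ID column and two numeric columns
my_dataframe <- data.frame(
  patient_id = c("P1", "P2", "P3"),
  age        = c(30, 45, 52),
  albumin    = c(4.0, 3.5, 2.9)
)

# Build the matrix from the numeric columns only, then attach the IDs as row names
mymatrix <- as.matrix(my_dataframe[, c("age", "albumin")])
rownames(mymatrix) <- my_dataframe$patient_id

mymatrix
```

Leaving the ID column out of the matrix keeps it numeric; including a character column would turn every value into text.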
When the values are comparable and all of the same type, heatmaps can give us a good view of our numeric data. R has a built-in heatmap function (heatmap) that covers basic heatmaps. Let’s make a heatmap based on the categorical variables. Check ?heatmap for help.
# The scale is set to "none" since the values are already comparable
heatmap(hep_mat_factors, scale="none")
If the values are numeric but not comparable, these can be normalised by patient or by variable. Let’s make a heatmap of the age and biochemistry values:
# Scale by column (variable) since the variables are measured in different units
heatmap(hep_mat_numeric, scale = "column")
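To get a feeling for what scaling by column means, base R's scale() can be applied to a small made-up matrix: each column is centred on its mean and divided by its standard deviation, which should correspond to the standardisation heatmap() applies per column. The values below are invented for illustration:

```r
# Made-up matrix with two variables on very different scales
toy <- cbind(AGE = c(30, 40, 50, 60), SGOT = c(20, 200, 80, 500))

# Centre each column on its mean and divide by its standard deviation
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After scaling, the colors reflect how far a value is from the average of its own variable, so a variable with large raw numbers no longer dominates the plot.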
There are many other heatmap libraries. A well-documented library for more complex heatmaps is, guess the name… ComplexHeatmap
Correlation plots are a special case of heatmap that summarises, in a similar format, the correlation between columns or rows. First, using a numeric matrix, we have to compute a correlation matrix. This can be done with the cor() function, here measuring Pearson correlation, which assumes linear relationships.
cormat <- cor(hep_mat_factors, use = "pairwise.complete.obs", method = "pearson")
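The option use = "pairwise.complete.obs" matters when the data contain missing values. A small sketch on a made-up matrix (invented values, unrelated to the hepatitis data) shows the idea:

```r
# Made-up matrix with missing values
toy <- cbind(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, NA, 3, 2, 1)
)

# Each pair of columns uses only the rows where both values are present
toy_cor <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

toy_cor
```

With the default use = "everything", any pair involving a column that contains NA would return NA instead of a correlation.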
With the correlation matrix cormat we can make our correlation plot using the corrplot library:
library(corrplot)
## corrplot 0.84 loaded
corrplot(cormat, type = "lower", order = "AOE")
More help and options can be found in the vignettes for the corrplot package.
Here are some guides to reproduce other types of plots that could be useful:
Visualize gene sets with clusterProfiler