Packages
Main points to think about in the process
- Clarify the message of your visualization, e.g. do
you want to show:
- A comparison / contrast
- A pattern
- A relationship/connection between
- A change in
- Variability of
- How can I (re)structure the composition to
communicate that message?
- How can I simplify/declutter?
- Is there an opportunity to remove unnecessary elements?
- Is there an opportunity to minimize the elements included?
- Is there an opportunity to streamline repetitive elements?
- Is there an opportunity to combine/layer elements for clarity?
- Think about color palettes: contrast, meaning, consistency
- Don’t forget to look at the text style and readability of text in
the figures.
Things to avoid
- Misleading axes
- Overcrowding
- Poor color/contrast
- Poorly readable or missing labels/units/legends
- Redundancy
Keep it simple! Keep it consistent!
Exercise 1: Improve the following visualization
In a retrospective cohort where multiple sclerosis patients were
scanned (brain + spinal cord), 340 patients had radiological disease
activity on their MRI (Granella et al. 2019).
Message: A considerable part of MS patients in this
study (12%) had disease activity solely based on spinal cord MRI (no
symptoms, no new lesion on brain MRI).
# Import the data; 340 spinal cord MRIs and whether there is concomitant brain activity and whether patient was symptomatic or asymptomatic at the time
ex1.df = read.csv('data/ex1.csv', colClasses=c("symptomatic" = "factor", "concomitant_brain_activity" = "factor"))
Always take a look at the imported data!
head(ex1.df, 10)
## concomitant_brain_activity symptomatic
## 1 0 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 0 1
## 7 1 0
## 8 0 1
## 9 0 0
## 10 1 1
And use the describe function to gain insight into your
variables (type, number of observations, missing values, etc.).
describe(ex1.df)
## ex1.df
##
## 2 Variables 340 Observations
## --------------------------------------------------------------------------------
## concomitant_brain_activity
## n missing distinct
## 340 0 2
##
## Value 0 1
## Frequency 125 215
## Proportion 0.368 0.632
## --------------------------------------------------------------------------------
## symptomatic
## n missing distinct
## 340 0 2
##
## Value 0 1
## Frequency 106 234
## Proportion 0.312 0.688
## --------------------------------------------------------------------------------
Let’s create a cross-tabulation table using table, and
then divide it by the total number of rows to see the proportions of the
categories we would like to show.
table(ex1.df$symptomatic, ex1.df$concomitant_brain_activity, dnn = c("Symptomatic", "Brain activity")) / nrow(ex1.df) * 100
## Brain activity
## Symptomatic 0 1
## 0 12.05882 19.11765
## 1 24.70588 44.11765
ex1.plot <- ggplot(data = ex1.df, # Select the dataset
mapping = aes(x = concomitant_brain_activity, fill = symptomatic) # Mapping of the variables to the axes/fill/color in the figure, passed within an aesthetics object (aes)
)
ex1.plot + geom_bar() # geom_ is a modifier to the plot object to choose the plot type (see https://ggplot2.tidyverse.org/reference/index.html#geoms)

# Now we expand on the original ggplot object with the variable mappings in it
ex1.plot_pretty <- ex1.plot +
# We add the modifier for a bar plot again
# You can check with ?geom_bar, what attributes can be defined. With 'position = stack' we define a stacked barplot. And 'width' the width of the bars.
geom_bar(position = "stack", width = 0.3)
ex1.plot_pretty

ex1.plot_pretty <- ex1.plot_pretty +
# Define the plot title and labels
labs(
x = "Concomitant activity on brain MRI",
y = "Number of active spinal cord MRIs",
title = paste("Spinal cord MRIs with new cord lesions (n = ", nrow(ex1.df), ")", sep="")
) +
# Define the labels, colors etc. of your scales, by adding + scale_<channel to control>_<way of control>.
# First, for the x-axis, which is a discrete variable, we define what the 0 and 1 mean
scale_x_discrete(labels = c(
"0" = "Absent",
"1" = "Present"
)) +
# Next, lets control the fill channel (to which the symptomatic variable is mapped). With scale_fill_manual you could manually define the colours, but you can also use RColorBrower to automatically map it to a color palette.
scale_fill_brewer(palette = "Set2", labels = c(
"0" = "Asymptomatic",
"1" = "Symptomatic"
)) + theme_minimal()
ex1.plot_pretty

ex1.plot_pretty <- ex1.plot_pretty +
# With the theme object we can control all kinds of aesthetics of the figure
theme(
# Add margins to the plot
plot.margin = unit(rep(0.5, 4), "cm"),
# General text style
text = element_text(size = 10),
# Plot title styling - in the middle, bold, add margin
plot.title = element_text(hjust = 0.5, face = "bold", margin = margin(b = 9)),
# Style the legend - position top, white background, remove legend title
legend.position = "top",
legend.background = element_rect(color = "#ffffff", fill = "#ffffff"),
legend.title = element_blank(),
legend.text = element_text(size = 6),
legend.margin = margin(b = -9),
# Style the axes labels
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
# Style the axes lines
axis.line = element_line(colour = "black"),
# Remove the grids and background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()
)
ex1.plot_pretty

# Some values to know where to add the brackets for highlighting asymptomatic SC-MRI lesions that occured indepently from brain MRI activity
ymin = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0 & ex1.df$symptomatic == 1,])
ymax = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0,])
ex1.plot_pretty_annotated <- ex1.plot_pretty +
# Lets add the counts + percentages to the bars
# We add a layer with a geom_ modifier, in this case geom_text because we want to add text.
geom_text(stat="count", aes(label = paste(after_stat(count), "\n(", after_stat(round(count/sum(count)*100,1)), "%)", sep="")), position = position_stack(vjust = 0.5), size = 3, color = "#000000") +
# We are going to make a bracket to emphasize the asymptomatic SC-MRI lesions that occured indepently from brain MRI activity
# First, the text.
geom_text(
data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
aes(x = as.numeric(concomitant_brain_activity) * 0.8 - 0.2,
y = (ymax - (ymax-ymin)/2), label="Isolated asymptomatic\nSC-MRI activity"), size = 2.1, fontface = "plain"
) +
# Now, the bracket
geom_segment(
data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
aes(x = as.numeric(concomitant_brain_activity) * 0.8,
xend = as.numeric(concomitant_brain_activity) * 0.8,
y = ymax,
yend = ymin),
linewidth = 0.6
) +
geom_segment(
data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
aes(x = as.numeric(concomitant_brain_activity) * 0.8,
xend = as.numeric(concomitant_brain_activity) * 0.8 + 0.02,
y = ymax,
yend = ymax),
linewidth = 0.6
) +
geom_segment(
data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
aes(x = as.numeric(concomitant_brain_activity) * 0.8,
xend = as.numeric(concomitant_brain_activity) * 0.8 + 0.02,
y = ymin,
yend = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0 & ex1.df$symptomatic == 1,])),
linewidth = 0.6
)
ex1.plot_pretty_annotated

ggsave("ex1_plot_pretty.png", plot=ex1.plot_pretty_annotated, width=15, height=11, units="cm", dpi = 900)
Exercise 2: Survival data
Here we will plot survival analysis curves using example data from
the lung dataset from the survival package.
The data contain subjects with advanced lung cancer from the North
Central Cancer Treatment Group. We will focus on the following
variable:
`time`: Observed survival time in days
`status`: Censoring status, censored (1) or dead (2)
`sex`: Male (1) or female (2)
`age`: in years
ex2.data = lung %>%
mutate(
status = recode(status, `1` = 0, `2` = 1), # Recode status to 0 and 1 to make it easier to work with
sex = recode(sex, `1` = "Male", `2` = "Female"),
id = seq.int(nrow(lung))
) %>% select(id, time, status, sex, age)
head(ex2.data, 5)
## id time status sex age
## 1 1 306 1 Male 74
## 2 2 455 1 Male 68
## 3 3 1010 0 Male 56
## 4 4 210 1 Male 57
## 5 5 883 1 Male 60
describe(ex2.data)
## ex2.data
##
## 5 Variables 228 Observations
## --------------------------------------------------------------------------------
## id
## n missing distinct Info Mean pMedian Gmd .05
## 228 0 228 1 114.5 114.5 76.33 12.35
## .10 .25 .50 .75 .90 .95
## 23.70 57.75 114.50 171.25 205.30 216.65
##
## lowest : 1 2 3 4 5, highest: 224 225 226 227 228
## --------------------------------------------------------------------------------
## time
## n missing distinct Info Mean pMedian Gmd .05
## 228 0 186 1 305.2 279 228 53.0
## .10 .25 .50 .75 .90 .95
## 80.4 166.8 255.5 396.5 616.3 733.6
##
## lowest : 5 11 12 13 15, highest: 840 883 965 1010 1022
## --------------------------------------------------------------------------------
## status
## n missing distinct Info Sum Mean
## 228 0 2 0.6 165 0.7237
##
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 228 0 2
##
## Value Female Male
## Frequency 90 138
## Proportion 0.395 0.605
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean pMedian Gmd .05
## 228 0 42 0.999 62.45 63 10.31 45.35
## .10 .25 .50 .75 .90 .95
## 50.00 56.00 63.00 69.00 74.00 75.00
##
## lowest : 39 40 41 42 43, highest: 76 77 80 81 82
## --------------------------------------------------------------------------------
# Here we create a dumbbell plot to gain some insight into the duration of follow-up and the outcome
plot_palette = brewer.pal(n=8, name="Set1")
ggplot(ex2.data, aes(y = id,
x = 0,
xend = time)) +
geom_point(aes(x = time, color=factor(status))) +
geom_dumbbell(colour_xend = plot_palette[ex2.data$status+1], size_xend = 2, color="#ededed") + theme_minimal() +
labs(y = "Patients", x = "Days") + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) + scale_color_manual(labels = c(
"0" = "Censored",
"1" = "Died"
), values = plot_palette[1:2])
## Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(data=ex2.data, aes(y = after_stat(count),
x = time, fill=factor(status))) + geom_density() + theme_minimal() + scale_fill_manual(labels = c(
"0" = "Censored",
"1" = "Died"
), values = alpha(plot_palette[1:2], 0.6)) + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) + labs(
x = "Days",
y = "Count"
)

plot_palette = brewer.pal(n=8, name="Dark2")
ggplot(data=ex2.data, aes(y = after_stat(count),
x = time, fill=factor(status))) + geom_histogram(alpha=0.5, position="identity", bins=25, colour="#333333") + theme_minimal() + scale_fill_manual(labels = c(
"0" = "Censored",
"1" = "Died"
), values = rev(plot_palette[1:2])) + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) + labs(
x = "Days",
y = "Count"
)

ex2.survival_model.by_sex <- survfit2(Surv(time, status) ~ sex, data = ex2.data)
ex2.survival_fig.by_sex <- ggsurvfit(ex2.survival_model.by_sex, linewidth=1.25) +
labs(
x = "Days",
y = "Overall survival probability",
title = "Survival stratified by sex",
) +
add_confidence_interval()
ex2.survival_fig.by_sex

ex2.survival_fig_pretty.by_sex <- ex2.survival_fig.by_sex +
scale_color_manual(values=plot_palette[1:2]) +
scale_fill_manual(values=plot_palette[1:2]) +
theme(
panel.grid.major = element_blank(),
panel.background = element_blank()
)
ex2.survival_fig_pretty.by_sex

# Lets create another survival plot where we stratify by age above and under 65 years old
ex2.data <- ex2.data %>% mutate(
age_senior = ifelse(age > 65, TRUE, FALSE)
)
ex2.survival_model.by_age <- survfit2(Surv(time, status) ~ age_senior, data = ex2.data)
ex2.survival_fig_pretty.by_age <- ggsurvfit(ex2.survival_model.by_age, linewidth=1.25) +
labs(
x = "Days",
y = "Overall survival probability",
title = "Survival stratified by age",
) +
add_confidence_interval() +
scale_color_manual(values=plot_palette[1:2], labels = c("< 65", "> 65")) +
scale_fill_manual(values=plot_palette[1:2], labels = c("< 65", "> 65")) +
theme(
panel.grid.major = element_blank(),
#panel.grid.minor = element_blank(),
#panel.border = element_blank(),
panel.background = element_blank()
)
ex2.survival_fig_pretty.by_age

# Combine plots by using the plot_grid command from the cowplot library. For the ggsurvfit plots it is necessary to build them beforehand, hence they are wrappend in the ggsurvfit_build() functions.
cowplot::plot_grid(ggsurvfit_build(ex2.survival_fig_pretty.by_sex), ggsurvfit_build(ex2.survival_fig_pretty.by_age), ncol=2)

ex2.survival_fig_cow.by_age <- ggsurvfit(ex2.survival_model.by_age, linewidth=1.25) +
labs(
x = "Days",
y = "Overall survival probability",
title = "Survival stratified by age",
) +
add_confidence_interval() +
scale_color_manual(values=plot_palette[3:4], labels = c("< 65", "> 65")) +
scale_fill_manual(values=plot_palette[3:4], labels = c("< 65", "> 65")) +
theme(
panel.grid.major = element_blank(),
#panel.grid.minor = element_blank(),
#panel.border = element_blank(),
panel.background = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()
)
ex2.survival_fig_cow.by_age

# Now lets re-run the cowplot command with the new survival plot for age without the y-axis labels.
cowplot::plot_grid(ggsurvfit_build(ex2.survival_fig_pretty.by_sex), ggsurvfit_build(ex2.survival_fig_cow.by_age), ncol=2)
# References
Granella, Franco, Elena Tsantes, Stefania Graziuso, Veronica Bazzurri,
Girolamo Crisi, and Erica Curti. 2019.
“Spinal Cord Lesions Are
Frequently Asymptomatic in Relapsingremitting Multiple
Sclerosis: A Retrospective MRI Survey.” Journal of
Neurology 266 (12): 3031–37.
https://doi.org/10.1007/s00415-019-09526-3.
Healy, Kieran. 2024. Data Visualization: A Practical
Introduction. Princeton University Press.