Workshop data visualization in R

Useful resources

Packages

Main points to think about in the process

Clarify the message of your visualization, e.g. do you want to show:
- A comparison / contrast
- A pattern
- A relationship/connection between
- A change in
- Variability of
How can I (re)structure the composition to communicate that message?
How can I simplify/declutter?
- Is there an opportunity to remove unnecessary elements?
- Is there an opportunity to minimize the elements included?
- Is there an opportunity to streamline repetitive elements?
- Is there an opportunity to combine/layer elements for clarity?
Think about color palettes: contrast, meaning, consistency
Don’t forget to look at the text style and readability of text in the figures.

Things to avoid

Misleading axes
Overcrowding
Poor color/contrast
Poorly readable or missing labels/units/legends
Redundancy

Keep it simple! Keep it consistent!

Basics of ggplot

Figure adapted from (Healy 2024)

Exercise 1: Improve the following visualization

In a retrospective cohort where multiple sclerosis patients were scanned (brain + spinal cord), 340 patients had radiological disease activity on their MRI (Granella et al. 2019).

Message: A considerable part of MS patients in this study (12%) had disease activity solely based on spinal cord MRI (no symptoms, no new lesion on brain MRI).

# Import the data; 340 spinal cord MRIs and whether there is concomitant brain activity and whether patient was symptomatic or asymptomatic at the time

ex1.df = read.csv('data/ex1.csv', colClasses=c("symptomatic" = "factor", "concomitant_brain_activity" = "factor"))

Always take a look at the imported data!

head(ex1.df, 10)

##    concomitant_brain_activity symptomatic
## 1                           0           1
## 2                           1           1
## 3                           1           1
## 4                           1           1
## 5                           1           1
## 6                           0           1
## 7                           1           0
## 8                           0           1
## 9                           0           0
## 10                          1           1

And use the describe function to gain insight into your variables (type, number of observations, missing values, etc.).

describe(ex1.df)

## ex1.df 
## 
##  2  Variables      340  Observations
## --------------------------------------------------------------------------------
## concomitant_brain_activity 
##        n  missing distinct 
##      340        0        2 
##                       
## Value          0     1
## Frequency    125   215
## Proportion 0.368 0.632
## --------------------------------------------------------------------------------
## symptomatic 
##        n  missing distinct 
##      340        0        2 
##                       
## Value          0     1
## Frequency    106   234
## Proportion 0.312 0.688
## --------------------------------------------------------------------------------

Let’s create a cross-tabulation table using table, and then divide it by the total number of rows to see the proportions of the categories we would like to show.

table(ex1.df$symptomatic, ex1.df$concomitant_brain_activity, dnn = c("Symptomatic", "Brain activity")) / nrow(ex1.df) * 100

##            Brain activity
## Symptomatic        0        1
##           0 12.05882 19.11765
##           1 24.70588 44.11765

ex1.plot <- ggplot(data = ex1.df, # Select the dataset
       mapping = aes(x = concomitant_brain_activity, fill = symptomatic) # Mapping of the variables to the axes/fill/color in the figure, passed within an aesthetics object (aes)
       ) 

ex1.plot + geom_bar() # geom_ is a modifier to the plot object to choose the plot type (see https://ggplot2.tidyverse.org/reference/index.html#geoms)

# Now we expand on the original ggplot object with the variable mappings in it
ex1.plot_pretty <- ex1.plot +
  
  # We add the modifier for a bar plot again
  # You can check with ?geom_bar, what attributes can be defined. With 'position = stack' we define a stacked barplot. And 'width' the width of the bars.
  geom_bar(position = "stack", width = 0.3)

ex1.plot_pretty

ex1.plot_pretty <- ex1.plot_pretty +
  # Define the plot title and labels
  labs(
    x = "Concomitant activity on brain MRI", 
    y = "Number of active spinal cord MRIs",
    title  = paste("Spinal cord MRIs with new cord lesions (n = ", nrow(ex1.df), ")", sep="")
    ) +
  
  # Define the labels, colors etc. of your scales, by adding + scale_<channel to control>_<way of control>.
  # First, for the x-axis, which is a discrete variable, we define what the 0 and 1 mean
  scale_x_discrete(labels = c(
    "0" = "Absent", 
    "1" = "Present"
  )) +
  
  # Next, lets control the fill channel (to which the symptomatic variable is mapped). With scale_fill_manual you could manually define the colours, but you can also use RColorBrower to automatically map it to a color palette.
  scale_fill_brewer(palette = "Set2", labels = c(
    "0" = "Asymptomatic", 
    "1" = "Symptomatic"
  )) + theme_minimal()

ex1.plot_pretty

ex1.plot_pretty <- ex1.plot_pretty +
  
  # With the theme object we can control all kinds of aesthetics of the figure
  theme(
    # Add margins to the plot
    plot.margin = unit(rep(0.5, 4), "cm"),
    # General text style
    text = element_text(size = 10),
    # Plot title styling - in the middle, bold, add margin
    plot.title = element_text(hjust = 0.5, face = "bold", margin = margin(b = 9)),
    # Style the legend - position top, white background, remove legend title
    legend.position = "top",
    legend.background = element_rect(color = "#ffffff", fill = "#ffffff"),
    legend.title = element_blank(),
    legend.text = element_text(size = 6),
    legend.margin = margin(b = -9),
    # Style the axes labels
    axis.title.x = element_text(margin = margin(t = 10)), 
    axis.title.y = element_text(margin = margin(r = 10)),
    # Style the axes lines
    axis.line = element_line(colour = "black"),
    # Remove the grids and background
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()
    )

ex1.plot_pretty

# Some values to know where to add the brackets for highlighting asymptomatic SC-MRI lesions that occured indepently from brain MRI activity
ymin = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0 & ex1.df$symptomatic == 1,])
ymax = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0,])

ex1.plot_pretty_annotated <- ex1.plot_pretty +
  
  # Lets add the counts + percentages to the bars
  # We add a layer with a geom_ modifier, in this case geom_text because we want to add text.
    geom_text(stat="count", aes(label = paste(after_stat(count), "\n(", after_stat(round(count/sum(count)*100,1)), "%)", sep="")), position = position_stack(vjust = 0.5), size = 3, color = "#000000") +
  
    # We are going to make a bracket to emphasize the asymptomatic SC-MRI lesions that occured indepently from brain MRI activity
  # First, the text.
  geom_text(
        data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
    aes(x = as.numeric(concomitant_brain_activity) * 0.8 - 0.2,
    y = (ymax - (ymax-ymin)/2), label="Isolated asymptomatic\nSC-MRI activity"), size = 2.1, fontface = "plain"
  ) + 
  
  # Now, the bracket
  geom_segment(
    data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
    aes(x = as.numeric(concomitant_brain_activity) * 0.8,
        xend = as.numeric(concomitant_brain_activity) * 0.8,
        y = ymax,
        yend = ymin),
    linewidth = 0.6
  ) + 
  geom_segment(
        data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
    aes(x = as.numeric(concomitant_brain_activity) * 0.8,
        xend = as.numeric(concomitant_brain_activity) * 0.8 + 0.02,
        y = ymax,
        yend = ymax),
    linewidth = 0.6
  ) + 
  geom_segment(
        data = subset(ex1.df, symptomatic == 0 & concomitant_brain_activity == 0),
    aes(x = as.numeric(concomitant_brain_activity) * 0.8,
        xend = as.numeric(concomitant_brain_activity) * 0.8 + 0.02,
        y = ymin,
        yend = nrow(ex1.df[ex1.df$concomitant_brain_activity == 0 & ex1.df$symptomatic == 1,])),
    linewidth = 0.6
  )

ex1.plot_pretty_annotated

ggsave("ex1_plot_pretty.png", plot=ex1.plot_pretty_annotated, width=15, height=11, units="cm", dpi = 900)

Exercise 2: Survival data

Here we will plot survival analysis curves using example data from the lung dataset from the survival package. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. We will focus on the following variable:

`time`: Observed survival time in days
`status`: Censoring status, censored (1) or dead (2)
`sex`: Male (1) or female (2)
`age`: in years

ex2.data = lung %>%
  mutate(
    status = recode(status, `1` = 0, `2` = 1), # Recode status to 0 and 1 to make it easier to work with
    sex = recode(sex, `1` = "Male", `2` = "Female"),
    id = seq.int(nrow(lung))
  ) %>% select(id, time, status, sex, age)

head(ex2.data, 5)

##   id time status  sex age
## 1  1  306      1 Male  74
## 2  2  455      1 Male  68
## 3  3 1010      0 Male  56
## 4  4  210      1 Male  57
## 5  5  883      1 Male  60

describe(ex2.data)

## ex2.data 
## 
##  5  Variables      228  Observations
## --------------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      228        0      228        1    114.5    114.5    76.33    12.35 
##      .10      .25      .50      .75      .90      .95 
##    23.70    57.75   114.50   171.25   205.30   216.65 
## 
## lowest :   1   2   3   4   5, highest: 224 225 226 227 228
## --------------------------------------------------------------------------------
## time 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      228        0      186        1    305.2      279      228     53.0 
##      .10      .25      .50      .75      .90      .95 
##     80.4    166.8    255.5    396.5    616.3    733.6 
## 
## lowest :    5   11   12   13   15, highest:  840  883  965 1010 1022
## --------------------------------------------------------------------------------
## status 
##        n  missing distinct     Info      Sum     Mean 
##      228        0        2      0.6      165   0.7237 
## 
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      228        0        2 
##                         
## Value      Female   Male
## Frequency      90    138
## Proportion  0.395  0.605
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      228        0       42    0.999    62.45       63    10.31    45.35 
##      .10      .25      .50      .75      .90      .95 
##    50.00    56.00    63.00    69.00    74.00    75.00 
## 
## lowest : 39 40 41 42 43, highest: 76 77 80 81 82
## --------------------------------------------------------------------------------

# Here we create a dumbbell plot to gain some insight into the duration of follow-up and the outcome
plot_palette = brewer.pal(n=8, name="Set1")
ggplot(ex2.data, aes(y = id,
                          x = 0,
                          xend = time)) +  
  geom_point(aes(x = time, color=factor(status))) +
  geom_dumbbell(colour_xend = plot_palette[ex2.data$status+1], size_xend = 2, color="#ededed") + theme_minimal() +
labs(y = "Patients", x = "Days") + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()) + scale_color_manual(labels = c(
    "0" = "Censored", 
    "1" = "Died"
  ), values = plot_palette[1:2])

## Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(data=ex2.data, aes(y = after_stat(count),
                          x = time, fill=factor(status))) + geom_density() + theme_minimal() + scale_fill_manual(labels = c(
    "0" = "Censored", 
    "1" = "Died"
  ), values = alpha(plot_palette[1:2], 0.6)) + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()) + labs(
      x = "Days",
      y = "Count"
    )

plot_palette = brewer.pal(n=8, name="Dark2")
ggplot(data=ex2.data, aes(y = after_stat(count),
                          x = time, fill=factor(status))) + geom_histogram(alpha=0.5, position="identity", bins=25, colour="#333333") + theme_minimal() + scale_fill_manual(labels = c(
    "0" = "Censored", 
    "1" = "Died"
  ), values = rev(plot_palette[1:2])) + theme(legend.position = "top", legend.title = element_blank(), legend.margin = margin(b = -5), # Remove the grids and background
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()) + labs(
      x = "Days",
      y = "Count"
    )

ex2.survival_model.by_sex <- survfit2(Surv(time, status) ~ sex, data = ex2.data)
ex2.survival_fig.by_sex <- ggsurvfit(ex2.survival_model.by_sex, linewidth=1.25) +
  labs(
    x = "Days",
    y = "Overall survival probability",
    title = "Survival stratified by sex",
  ) + 
  add_confidence_interval()

ex2.survival_fig.by_sex

ex2.survival_fig_pretty.by_sex <- ex2.survival_fig.by_sex +
  scale_color_manual(values=plot_palette[1:2]) +
  scale_fill_manual(values=plot_palette[1:2]) + 
  theme(
    panel.grid.major = element_blank(),
    panel.background = element_blank()
  ) 

ex2.survival_fig_pretty.by_sex

# Lets create another survival plot where we stratify by age above and under 65 years old
ex2.data <- ex2.data %>% mutate(
  age_senior = ifelse(age > 65, TRUE, FALSE)
)

ex2.survival_model.by_age <- survfit2(Surv(time, status) ~ age_senior, data = ex2.data)
ex2.survival_fig_pretty.by_age <- ggsurvfit(ex2.survival_model.by_age, linewidth=1.25) +
  labs(
    x = "Days",
    y = "Overall survival probability",
    title = "Survival stratified by age",
  ) + 
  add_confidence_interval() +
  scale_color_manual(values=plot_palette[1:2], labels = c("< 65", "> 65")) +
  scale_fill_manual(values=plot_palette[1:2], labels = c("< 65", "> 65")) + 
  theme(
    panel.grid.major = element_blank(),
    #panel.grid.minor = element_blank(),
    #panel.border = element_blank(),
    panel.background = element_blank()
  )

ex2.survival_fig_pretty.by_age

# Combine plots by using the plot_grid command from the cowplot library. For the ggsurvfit plots it is necessary to build them beforehand, hence they are wrappend in the ggsurvfit_build() functions.
cowplot::plot_grid(ggsurvfit_build(ex2.survival_fig_pretty.by_sex), ggsurvfit_build(ex2.survival_fig_pretty.by_age), ncol=2)

ex2.survival_fig_cow.by_age <- ggsurvfit(ex2.survival_model.by_age, linewidth=1.25) +
  labs(
    x = "Days",
    y = "Overall survival probability",
    title = "Survival stratified by age",
  ) + 
  add_confidence_interval() +
  scale_color_manual(values=plot_palette[3:4], labels = c("< 65", "> 65")) +
  scale_fill_manual(values=plot_palette[3:4], labels = c("< 65", "> 65")) + 
  theme(
    panel.grid.major = element_blank(),
    #panel.grid.minor = element_blank(),
    #panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank()
  )

ex2.survival_fig_cow.by_age

# Now lets re-run the cowplot command with the new survival plot for age without the y-axis labels.
cowplot::plot_grid(ggsurvfit_build(ex2.survival_fig_pretty.by_sex), ggsurvfit_build(ex2.survival_fig_cow.by_age), ncol=2)

# References

Granella, Franco, Elena Tsantes, Stefania Graziuso, Veronica Bazzurri, Girolamo Crisi, and Erica Curti. 2019. “Spinal Cord Lesions Are Frequently Asymptomatic in Relapsingremitting Multiple Sclerosis: A Retrospective MRI Survey.” Journal of Neurology 266 (12): 3031–37. https://doi.org/10.1007/s00415-019-09526-3.

Healy, Kieran. 2024. Data Visualization: A Practical Introduction. Princeton University Press.