Getting started using R for data analysis and visualization

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park in our lab for introducing me to R and preempting my confusion about the difference between tables in Excel and R—a lot of this guide was verbally passed on to me from Danny when I first joined the lab; I just wrote it down.

Thank you to Lizzie Wolkovich for teaching me everything I know about modelling.

Contents

This document should take a veritable eternity to work through—but you don’t need to work through all or even most of it to do what you want to do. Take the pieces you need and when you need more pieces take those too.

Introduction

Our lab is split pretty evenly between people who use Python and people who use R for data analysis and visualization. I strongly prefer R because it was made specifically for data analysis, but there is no wrong answer. R is especially powerful because of the added functionality of its libraries, such as dplyr and ggplot2.

If R is about to become your first programming language (!), I recommend starting off with the first part of codecademy’s Learn R course.

Getting started

Using R within Terra

If you use Terra (our lab does), you can write and run R scripts directly in Terra, on the same data you analyze there. You can watch the Terra team’s 15-minute Terra Notebooks Quickstart video to learn how:

Using R on your computer

Download and install R here.

Download and install RStudio here.

RStudio allows you to write and run your code as a full script in its own file, run subsets of your code by highlighting the lines you are interested in and hitting Command-Enter, or write and run throwaway code in RStudio’s console.

Here is a short video introduction to RStudio:

You can also follow University of British Columbia STAT 545’s guide: Install R and RStudio.

Packages/libraries

Package and library are terms used pretty much interchangeably in R. You can install a package, in this example a package named plyr, by entering the following command in RStudio’s console with your own package of interest:

install.packages("plyr")

You only need to install a package once. After you have installed it, you can load a package by adding the following line to your script anywhere before the code that will use the package, again using the package plyr as an example:

library(plyr)

You can learn more about packages/libraries in this tutorial.

I have found the following packages helpful (a short install-and-load sketch follows the list):

  • tidyverse: a collection of helpful packages all packaged together—including dplyr, ggplot2, tidyr, and stringr below
  • data visualization:
    • ggplot2: the main package used for data visualization in R
    • ggpubr: customization add-ons for ggplot2
    • scales: customize axes and legends
    • patchwork: put multiple plots together in an aligned arrangement
    • grid and gridExtra: another option for arranging multiple plots into a layout
  • data manipulation and summarization:
    • dplyr: the main package for filtering, joining, grouping, and summarizing data tables
    • tidyr: reshape data tables, for example between wide and long formats
    • stringr: work with text values
  • statistical analysis:
    • binom: binomial confidence intervals
    • broom: summarizes statistical analysis outputs in more readable formats
  • beepr: beeps when your code is done running
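
To get set up, here is a minimal sketch of installing and loading a few of these packages (install once; load in every script that uses them):

  # install once (only the packages you need)
  install.packages(c("tidyverse", "ggpubr", "scales", "patchwork",
    "binom", "broom", "beepr"))

  # load at the top of your script
  library(tidyverse)   # loads dplyr, ggplot2, tidyr, stringr, and friends
  library(beepr)
  beep()               # plays a sound; handy at the end of a long script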

Data tables in R

Differences between tables in R and Excel

The thing that most confused me when I started coding in R is that data tables in R are different from the tables I was used to working with in Excel.

If you’re used to working in Excel, you might be accustomed to tables that look like this one. In this table, each row is a patient. The patient name appears in the first column of each row. The rest of the columns belong to species. Each cell contains the number of reads that were mapped to the species (its column) from the patient (its row).

For input into R, this table would look quite different. Every readcount cell in the Excel version of the table gets its own row in the R version of the table.

The R data table format allows you to add more columns to each patient-species pair, which in the Excel table format would require a whole new table. Notice that some of the columns refer to the patient (patient, is_food_handler, collection_date, Ct), while other columns refer to the species (species, is_virus), and still others refer to the patient-species pair (readcount, genome_covered).

Each column is a variable, and the two terms can be used interchangeably when talking about data tables.

(This data is clearly from an experiment mixing different sample types from each patient into a single sample.)
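
If you already have an Excel-style wide table, you do not have to reshape it by hand. Here is a minimal sketch using pivot_longer from the tidyr package; the wide table, its column names, and its values are made up just for illustration:

  library(tidyr)

  # a made-up Excel-style wide table: one row per patient, one column per species
  wide_table <- data.frame(patient = c("Patient A", "Patient B"),
                           HIV = c(12, 865),
                           Influenza_A = c(0, 4363))

  # reshape into the long, one-row-per-patient-species-pair format described above
  long_table <- pivot_longer(wide_table, cols = -patient,
                             names_to = "species", values_to = "readcount")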

My example code will use this table. If you would like to follow along, download it here.

Reading in and saving data tables

To read in a comma-separated table:
  input_table_path <- "/Users/lakras/my_input_file.csv"
  my_table <- read.csv(input_table_path)

or
  my_table <- read.table(input_table_path, sep=",", header=TRUE)

To read in a tab-separated table:
  input_table_path <- "/Users/lakras/my_input_file.tsv"
  my_table <- read.delim(input_table_path)

or
  my_table <- read.table(input_table_path, sep="\t", header=TRUE)

To save a table as a comma-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.csv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep=',',
    row.names=FALSE)

To save a table as a tab-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.tsv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep='\t',
    row.names=FALSE)

Viewing and manipulating data tables

Exploring your data table and its columns

You can view or access the full data table by simply entering its name:
  my_table

View a summary of the variables (columns) in your data table, their data types (int, num, chr, logi, factor), and the first few values using the str function:
  str(my_table)

You can view or access the individual variables (columns) of your data table by entering the table name followed by $ followed by the column name:
  my_table$patient
  my_table$species
  my_table$readcount
  my_table$genome_covered
  my_table$is_virus
  my_table$Ct
  my_table$is_food_handler
  my_table$collection_date

You can view all values of a variable (column) and the number of rows with each value using the table function:
  table(my_table$species)

You can view a subset of variables (columns) using the select function from the dplyr package:
  select(my_table, patient, species, genome_covered)

You can view the first few rows of a data table using the head function or the last few rows using the tail function:
  head(my_table)
  tail(my_table)

Subsetting your data table

You can use the subset function to view a subset of your data table. Perhaps we want to view the subset of our rows related to Hepatitis C:
  subset(my_table, species == "Hepatitis C")

You can save the subset, or set the data table to be a subset of itself. Perhaps we are only interested in keeping the subset of our rows with readcount ≥100:
  my_table <- subset(my_table, readcount >= 100)

Creating new columns

To make a new column with all values set to TRUE:
  my_table$is_human <- TRUE

To make a new column by duplicating another column:
  my_table$readcount_2 <- my_table$readcount

To make a new column by multiplying another column by a scalar:
  my_table$percent_genome_covered <- my_table$genome_covered * 100

Conditional viewing and manipulating column values

You can view values of a column including only those rows that meet a certain condition. For example, you can view all readcounts ≥100:
  my_table$readcount[my_table$readcount >= 100]

You can also view all Ct values from rows with readcount ≥100:
  my_table$Ct[my_table$readcount >= 100]

In addition to viewing column values, you can also modify them. For example, you can set all readcounts <100 to NA:
  my_table$readcount[my_table$readcount < 100] <- NA

You can also modify a column that is different from the column you are testing in your condition, and the column you modify does not have to already exist. You can create a new variable with values conditional on another column’s values:
  my_table$passes_threshold[my_table$readcount >= 100] <- TRUE
  my_table$passes_threshold[my_table$readcount < 100] <- FALSE
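
If you prefer, the same new column can be created in one line with base R’s ifelse function; rows where readcount is NA will get NA in the new column:

  my_table$passes_threshold <- ifelse(my_table$readcount >= 100, TRUE, FALSE)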

Data types

Here are some of the data types you will encounter and use:

  • chr: a text value
  • factor: a category
  • int: an integer (no decimal places)
  • num: a number (with decimal places)
  • logi: TRUE or FALSE (logi is short for logical)
  • Date: a date

The chr data type is a text value. This data type might be useful for text values that are unique to each row, for example patient name, sample name, or hospital address. In some cases, text values (or other kinds of values) might reappear from row to row and be used to group rows together, for example state of residency, presence of coughing, sample type, or patient gender. These values should be factors.

You can modify the data type of your variable.

To make a variable’s values into text values:
  my_table$patient <- as.character(my_table$patient)

To make a variable’s values into categories:
  my_table$species <- as.factor(my_table$species)

To make a variable’s values into integers (no decimal places):
  my_table$readcount <- as.integer(my_table$readcount)

To make a variable’s values into numerical values (with decimal places):
  my_table$Ct <- as.numeric(my_table$Ct)
  my_table$genome_covered <- as.numeric(my_table$genome_covered)

To make a variable’s values into TRUE and FALSE values:
  my_table$is_food_handler <- as.logical(my_table$is_food_handler)
  my_table$is_virus <- as.logical(my_table$is_virus)

To make a variable’s values into dates:
  my_table$collection_date <- as.Date(my_table$collection_date)

Special values

Some special values are NA (value not available), NaN (not a number), Inf (infinity), and -Inf (negative infinity). You will need to keep these values in mind and check for them in many analyses.

You can check if a value (my_value in this case) is one of these values:
  my_value <- 5
  is.na(my_value)
  is.nan(my_value)
  is.infinite(my_value)
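
For a whole column, two useful patterns (shown here with the example table’s readcount column):

  # count how many readcounts are missing
  sum(is.na(my_table$readcount))

  # keep only the rows where readcount is not missing
  subset(my_table, !is.na(readcount))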

Statistical analysis

Summary statistics

You can view basic summary statistics for a variable.

Minimum, maximum, and range:
  min(my_table$genome_covered, na.rm=TRUE)
  max(my_table$genome_covered, na.rm=TRUE)
  range(my_table$genome_covered, na.rm=TRUE)

Mean and standard deviation:
  mean(my_table$genome_covered, na.rm=TRUE)
  sd(my_table$genome_covered, na.rm=TRUE)

Median, first quartile (the 25th percentile of your data), third quartile (the 75th percentile of your data), and interquartile range (the middle 50% of your data, or the distance between the 25th percentile and the 75th percentile):
  median(my_table$genome_covered, na.rm=TRUE)
  quantile(my_table$genome_covered, 1/4, na.rm=TRUE)
  quantile(my_table$genome_covered, 3/4, na.rm=TRUE)
  IQR(my_table$genome_covered, na.rm=TRUE)

na.rm=TRUE excludes NA (and NaN) values from your calculations of the mean, median, min, max, etc.; without it, a single missing value makes the whole result NA.

You can generate a table of summary statistics grouped by the values of a variable using the ddply function from the plyr library. For example, we might calculate the minimum, maximum, and mean of readcounts for each species separately, saving them in a new data table we name readcount_summary:

  readcount_summary <- ddply(my_table, ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount))

The output looks like this:

    species      min_readcount  max_readcount  mean_readcount
  1 E. coli                  0           4324        620.4286
  2 Hepatitis C              0           6457        931.7143
  3 HIV                      0            865        217.0000
  4 Influenza A              0           4363        677.1429
  5 SARS-CoV-2              76           4532       1289.5714

Or we might calculate the minimum, maximum, and mean of both readcounts and Cts, grouping rows not only by species, but also by whether or not the patient is a food handler and whether or not at least half of the genome is covered:

  readcount_summary <- ddply(my_table,
    genome_covered >= 0.50 ~ is_food_handler ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount),
    min_Ct=min(Ct),
    max_Ct=max(Ct),
    mean_Ct=mean(Ct))

That output looks like this:

     genome_covered >= 0.5  is_food_handler  species      min_readcount  max_readcount  mean_readcount  min_Ct  max_Ct   mean_Ct
  1                  FALSE            FALSE  E. coli                  0           4324          868.60    16.8    27.9  21.64000
  2                  FALSE            FALSE  Hepatitis C              0             65           16.25    16.8    27.9  21.72500
  3                  FALSE            FALSE  HIV                      0            865          303.80    16.8    27.9  21.64000
  4                  FALSE            FALSE  Influenza A              0              0            0.00    16.8    27.9  22.70000
  5                  FALSE            FALSE  SARS-CoV-2              87            876          546.00    21.3    27.9  24.66667
  6                  FALSE             TRUE  E. coli                  0              0            0.00    21.5    23.9  22.70000
  7                  FALSE             TRUE  Hepatitis C              0              0            0.00    21.5    23.9  22.70000
  8                  FALSE             TRUE  HIV                      0              0            0.00    21.5    23.9  22.70000
  9                  FALSE             TRUE  Influenza A             65            312          188.50    21.5    23.9  22.70000
  10                 FALSE             TRUE  SARS-CoV-2              76            435          255.50    21.5    23.9  22.70000
  11                  TRUE            FALSE  Hepatitis C           6457           6457         6457.00    21.3    21.3  21.30000
  12                  TRUE            FALSE  Influenza A           4363           4363         4363.00    17.4    17.4  17.40000
  13                  TRUE            FALSE  SARS-CoV-2            2346           4532         3439.00    16.8    17.4  17.10000
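
If you prefer dplyr’s own syntax, the same kind of summary can be produced with group_by and summarise. Here is a minimal sketch that should reproduce the first summary table above (note that plyr and dplyr both define a summarise function, so if you use both, load plyr before dplyr):

  library(dplyr)

  readcount_summary <- my_table %>%
    group_by(species) %>%
    summarise(min_readcount = min(readcount),
              max_readcount = max(readcount),
              mean_readcount = mean(readcount))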

dplyr is extremely powerful, and can do a lot more than summarize. You can learn more about dplyr by exploring these tutorials:

Statistical tests

R has many statistical tests built in. Identify the one that is most relevant to your data, determine how to do it in R, and interpret your p-value or other preferred statistical test output. Here is a guide from r-statistics.co.
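
As one example of the pattern, here is what a two-sample comparison of Ct values between food handlers and non-food handlers in the example table might look like; whether a t-test (or its non-parametric cousin, the Wilcoxon rank-sum test) is actually appropriate for your data is exactly the judgment call the guide above helps with:

  # compare Ct values between food handlers and non-food handlers
  t.test(Ct ~ is_food_handler, data = my_table)

  # non-parametric alternative
  wilcox.test(Ct ~ is_food_handler, data = my_table)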

Modelling

Modelling allows you to estimate the strength of a correlation and to compare the individual impacts of the variables affecting an outcome variable.

First, you should get an introduction to selecting the appropriate kind of model and writing the code to generate it. datacamp.com has an interactive course on modelling in R in which you write the code yourself and get feedback on it. If you prefer just reading, r-statistics.co has a guide (a small code sketch of what model fitting looks like follows the list):

  1. Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
  2. Logistic Regression (for binary outcome variables)
  3. Multinomial Regression (for categorical outcome variables)
  4. Other regression models, if you need them
  5. Model Selection Approaches
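
To give a flavor of what the model-fitting code itself looks like, here is a minimal sketch using the example table; the choice of predictors is arbitrary and just for illustration:

  # linear regression: model Ct as a function of readcount and food handler status
  ct_model <- lm(Ct ~ readcount + is_food_handler, data = my_table)
  summary(ct_model)

  # logistic regression: model a binary outcome (is_food_handler) instead
  food_handler_model <- glm(is_food_handler ~ Ct + readcount,
                            data = my_table, family = binomial)
  summary(food_handler_model)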

Next, you need to understand what your model is telling you. I learned everything I know about modelling from Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. This book is thorough, clear, correct, and readable, unlike most things. The pdf is (legally) available for free online. The following are the guides to interpreting your models that I have bookmarked (most of these sections are around 3-5 pages):

The whole book is excellent. Once you are comfortable selecting and interpreting models, I recommend reading the following chapters:

Data visualization

Scatterplots, bar plots, histograms, and other standard figures

Follow this tutorial from r-statistics.co to learn how to use ggplot2 for data visualization. It takes you through making a scatterplot, then extends what you’ve learned to other kinds of plots:

  1. The Complete ggplot2 Tutorial – Part1 | Introduction To ggplot2
  2. The Complete ggplot2 Tutorial – Part 2 | How To Customize ggplot2
  3. Top 50 ggplot2 Visualizations – The Master List

Additionally, I think you should learn about grouped vs. stacked vs. percent bars in bar plots or histograms, which the above tutorial does not cover.

I have found that in my own work, I most frequently use and encounter scatterplots, histograms, barplots, line plots, and violin plots. 
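
For example, here is a minimal sketch of a scatterplot of the example table with ggplot2 (column names as above; tweak to taste):

  library(ggplot2)

  ggplot(my_table, aes(x = Ct, y = readcount, color = species)) +
    geom_point() +
    labs(x = "Ct", y = "Number of reads")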

Here are some references with different kinds of plots you might like to make, explanations, and example code to copy-paste and modify:

Once you have mastered the general principles, you can google any kind of plot or modification to your plot that you would like to make. I have found that almost anything I want to do has at some point been asked about on StackOverflow.

Maps

Here are some resources on plotting colorful maps in R:

R in particular is the best solution for drawing a map with curved directional arrows illustrating transmission events (doing it in Python seems to be much more challenging and the result doesn’t look as nice). There does not seem to be a good guide for making this kind of figure, but my colleague Gage Moreno found a way to do it using curvedarrow in the diagram package.
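
If you just want a base map of the US to build on, here is a minimal sketch using ggplot2 together with the separate maps package (which you would need to install first); the curved-arrow layer described above is not included:

  library(ggplot2)
  # requires the maps package: install.packages("maps")

  us_states <- map_data("state")
  ggplot(us_states, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "white", color = "grey50") +
    coord_quickmap()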

Conditionals, loops, and functions

Conditional statements

Conditional statements allow you to automatically run different lines of code depending on what your data looks like. In this example, we check if our minimum collection date is before May 9th, 2022; if it is, we subset our data to collection dates after May 10th, 2022:

  if(min(my_table$collection_date) < as.Date("2022-05-09"))
  {
    my_table <- subset(my_table, collection_date > as.Date("2022-05-10"))
  }

The above example only executes the code inside the { and } brackets if the condition boils down to TRUE; if the condition is FALSE, your script does nothing. If you’d like, you can add an else statement to make it do something else if the condition is FALSE. In the following example, if our mean readcount is <100, we replace any readcounts less than 100 with NA. If not, we instead move on to the else and replace any readcounts less than 200 with NA:

  if(mean(my_table$readcount) < 100)
  {
    my_table$readcount[my_table$readcount < 100] <- NA
  } else {
    my_table$readcount[my_table$readcount < 200] <- NA
  }

You can chain these conditions together. In this example, we check if the minimum proportion of any genome covered in our table is less than 50%:

  • If it is, we subset our table to rows with is_virus set to TRUE and (&) readcount at least 50 and we exit the chain.
  • If it isn’t, we follow the else to the next statement: we check if the minimum proportion of any genome covered in our table is greater than 70%.
    • If it is, we subset our table to rows with is_virus set to FALSE and (&) readcount at least 100 and we exit the chain.
    • If it isn’t, we follow the else to our next and final statement: we subset our table to rows with readcount at least 100 or (|) at least 50% of the genome covered, and then we set any Ct values greater than 25 to 25.

  if(min(my_table$genome_covered) < 0.50)
  {
    my_table <- subset(my_table, is_virus & readcount >= 50)
  } else if(min(my_table$genome_covered) > 0.70)
  {
    my_table <- subset(my_table, !is_virus & readcount >= 100)
  } else
  {
    my_table <- subset(my_table,
      readcount >= 100 | genome_covered >= 0.50)
    my_table$Ct[my_table$Ct > 25] <- 25
  }

Scaling up your analyses with loops

Looping through a predetermined list of values

If you would like to automatically do the same analysis or generate the same plot for multiple values, you can use a for-loop. In the following example, we subset the table to the rows for Hepatitis C, then HIV, then Influenza A, and save each subset to an output file named after the species. We do that by using the variable my_species inside the { and } brackets and setting my_species to Hepatitis C, HIV, and then Influenza A. With each iteration of the loop, my_species takes on the next value and the code inside the { and } brackets is executed with my_species set to that value.

  for(my_species in c("Hepatitis C", "HIV", "Influenza A"))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt",
      sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Instead of using a for-loop, you could instead copy-paste the code inside the { and } brackets, changing my_species to your species of choice each time:

  my_table_subset <- subset(my_table, species == "Hepatitis C")
  output_table_path <- paste(input_table_path, "_", "Hepatitis C",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "HIV")
  output_table_path <- paste(input_table_path, "_", "HIV",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "Influenza A")
  output_table_path <- paste(input_table_path, "_", "Influenza A",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

This is a lot longer and harder to read—and you’re only copy-pasting three lines of code. If you wanted to add another species of interest (or ten), you would need to copy-paste again and again instead of just adding another species name to your list. If you wanted to change something, maybe the names of the output files, you would need to change it for each copy-pasted version. Not only is that a lot of work, it also makes it more likely that you will make a mistake and, because your code is less readable, less likely that you will catch it.

Looping through a non-predetermined list of values

Another benefit of for-loops is that you do not need to decide what values you are cycling through ahead of time. Perhaps we want to write out subsets of our table for every species appearing in the table, whatever they are, without knowing them ahead of time. This example does the same thing as our previous two examples, but for all unique species in our table.

  for(my_species in as.list(unique(my_table$species)))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt",
      sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Maintaining data types in loops

Note that as.list is absolutely necessary in the above for-loop; I learned this the hard way. When you iterate directly over a factor in a for-loop, R strips the values of their factor type and hands you plain character values. Try it yourself. Here is the list the for-loop is iterating over; it is of type factor:

  unique(my_table$species)
  class(unique(my_table$species))

Here is an iteration through the list with as.list; the elements of the list are, as before, of type factor:

  for(my_species in as.list(unique(my_table$species)))
  {
    print(my_species)
    print(class(my_species))
  }

Here is an iteration through the list without as.list—the elements of the list are now of type character:

  for(my_species in unique(my_table$species))
  {
    print(my_species)
    print(class(my_species))
  }

Looping through integers

You can also cycle through integer values. This chunk of code, for example, sets my_value to each integer from 2 through 10 and prints it each time.

  for(my_value in 2:10)
  {
    print(my_value)
  }

Reusing your analyses with functions

When you find yourself copy-pasting your code and changing a few things, you should consider using a function. Here is a function I wrote to calculate a 95% confidence interval given a proportion and a total:

  confidence_interval_from_proportion_and_total <- function(proportion, total)
  {
    confidence_interval_plus_minus <- 1.96 *
      sqrt(proportion * (1 - proportion) / total)
    confidence_interval_min <- proportion - confidence_interval_plus_minus
    confidence_interval_max <- proportion + confidence_interval_plus_minus
    return(c(min = confidence_interval_min, proportion = proportion,
      max = confidence_interval_max))
  }

After defining the function, I call it whenever I need to calculate a confidence interval:

  confidence_interval_from_proportion_and_total(0.50, 500)
  confidence_interval_from_proportion_and_total(0.10, 10)
  confidence_interval_from_proportion_and_total(0.99, 200)

Instead of copy-pasting my code each time, I can call the function in one line, making my code shorter and a lot easier to read. If I decide to change how I calculate a confidence interval, perhaps to bound the confidence interval between 0 and 1, I can simply edit the code inside the function rather than needing to locate and edit many lines of code. Some functions can get pretty long, so the benefit adds up.
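
For example, bounding the interval between 0 and 1 would just mean editing the two lines inside the function body, something like:

  confidence_interval_min <- max(0, proportion - confidence_interval_plus_minus)
  confidence_interval_max <- min(1, proportion + confidence_interval_plus_minus)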

Avoiding (some) bugs

Put the following line at the very top of your script:

  rm(list=ls())

This line deletes whatever data is already loaded, so that previously defined variables do not linger and creep into your analyses when you reuse a name because of a typo or forget to assign it a new value as you intended. You should run this line whenever you start over analyzing your data or open and run a new script; the top of the script is the best place for it.

Resources

What I’ve Learned About Making Scientific Posters

I was lucky to get to present our now published recent work, learning from last summer’s SARS-CoV-2 Delta outbreak in Ptown, at last winter’s Broad retreat. My goal was entirely to do justice to the work and the massive number of people who did that work, but to my surprise and delight we won a poster prize in the Computational Biology and Data Science category, which as you might imagine is one of the largest and most interesting categories at the Broad. I think our poster won primarily because of the project, not the poster itself, but I think the poster itself had to have also been pretty okay, at the very least clear and effective. I learned a lot putting it together, and I also learned a lot putting together other, not as astoundingly successful posters in past years and collecting my colleagues’ good advice.

Here is how the poster turned out:


The first thing I did was add the title, stretched in very large (size 85) font to cover the entire top of the poster, so that people could easily read it from far away. Our poster title was the title of our paper: Transmission from vaccinated individuals in a large SARS-CoV-2 Delta variant outbreak.

I divided the poster itself into two columns, expecting that after the title the reader would read the left column from top to bottom and then the right column from top to bottom. The left column starts with the author list and the graphical abstract side by side. The author list, which is huge, is in small-ish (size 18) font, with affiliations in even smaller font (size 12) below it. The graphical abstract is gorgeous and effective and was made by the Broad Pattern team in a process that was both extremely impressive and also very collaborative and iterative, a lot like the process of putting together the eventual journal cover with Thought Café. This project was my first time working with professional scientific illustrators, more scientific and detailed in the case of the graphical abstract and more abstract (a pun! yay!) in the case of the cover.

The rest of the poster dives into the science. Our abstract does a great job summarizing the work and our findings (the purpose of the abstract), so I organized the bulk of the poster around the abstract, in its entirety, divided into five parts and shown in large (size 37) text.

The first sentence of our abstract covers background on the outbreak and gives context for the questions we ask in the paper. I put this sentence first after the author list and graphical abstract, on its own in the middle of the left column:

An outbreak of over one thousand COVID-19 cases in Provincetown, Massachusetts, in July 2021—the first large outbreak mostly in vaccinated individuals in the US—prompted a comprehensive public health response, motivating changes to national masking recommendations and raising questions about infection and transmission among vaccinated individuals.

The rest of the abstract presents the four takeaways of the paper, each in one sentence:

  1. To address these questions, we combined genomic and epidemiological data from 467 individuals, including 40% of known outbreak-associated cases.
  2. The Delta variant accounted for 99% of outbreak-associated cases in this dataset; it was introduced from at least 40 sources, but 83% of cases derived from a single source, likely through transmission across multiple settings over a short time rather than a single event.
  3. Genomic and epidemiological data supported multiple transmissions of Delta from and between fully vaccinated individuals.
  4. Despite its magnitude, the outbreak had limited onward impact in MA and the US, likely due to high vaccination rates and a robust public health response.

I gave each of these points its own results section on the poster, in order, in whatever way I was able to get them to fit in the two columns. The first point ended up at the bottom of the left column and the other three divided the right column more or less equally.

Each results section starts with its sentence from the abstract, in large font as a header for the section, padded by at least some whitespace above and below. The hope is that a person with not a lot of time to dedicate to the poster can grasp the main takeaways simply by reading everything that is in the largest font. In all cases but one the section’s header sentence is at the top of the section; in one case the content of the section was tall so the large-font sentence is in the top left, which is still easy to find and I think adds some pleasant variety to the layout.

Next, I pulled out the figure panels that contributed to the message of each results section. The figure panels in a section were not always from the same figures and were not only from the main figures. Here are the figure panels that ended up included in each section of the poster:

  • Section 1: Figure 1 in its entirety and Figure S1 in its entirety
  • Section 2: Figure 2A and Figure S4A
  • Section 3: Figure 3 in its entirety and Figure S5C
  • Section 4: Figure 2C and Figure S4B

I added the figure legends around the figures in very small (size 14) text, so that someone who really wants to can read them and someone who does not want to read them can easily ignore them. In a few cases I was able to fit the entire figure legend for a panel under that panel, but in most cases I had to clump the figure legends together with labels like Left, Right, Middle, Top, Left Top, and Left Inset. Like a game of Twister but for the eyes.

I did not include any figure and panel numbers or references to other figure panels, because I wanted the figures to be at home in the poster and not just visiting from the paper.

Finally, I added three QR codes, because I had seen QR codes on other posters and found them absolutely delightful and very novel. They are probably neither delightful nor novel at this point (and probably weren’t delightful or novel at that point either) but I think they were useful.

One QR code leads to our interactive tree, where you can examine the temporal and genetic relationships between our SARS-CoV-2 sequences in the context of other Delta sequences. This tree is most relevant to the second results section of the poster so I placed it there, with “Interactive tree” and a little arrow to the QR code.

The other two QR codes are relevant to the paper as a whole. One points to the preprint (the paper was not yet published in a peer-reviewed journal) and the other points to the tweetorial I wrote summarizing the paper both for people who are in our relatively small subfield of biology and hopefully also for people who are not. I put these two QR codes at the very bottom left of the poster, because that is the corner that had space. In addition to squishing them into a corner I also put a little box around them to signify that they are separate from the results section they are next to. I also labeled them: “Preprint” and “Tweetorial” with little arrows to the QR codes. (Really the arrows should point from the QR codes to the things the QR codes point to, and not the other way around, but that’s too complicated and not a hill I was willing to kill this poster on.)


And that brings us to the poster’s final form:


Here are the five lessons I learned:

  1. Information should be presented in order along the path you expect the eye to follow.
  2. The more QR codes the better. This is half-joking, but three was clearly not too many.
  3. The most important takeaways should be in the biggest font, with font size corresponding to importance.
  4. The poster should be skimmable to the reader’s preference of skimmability. I used to think there should be minimal text, and I still do, but now I think it is also important to give an interested reader something to look at more closely. But any text that is not an important takeaway should be unintimidating and easy to avoid reading if a reader does not have time or spoons to read it.
  5. I’ve seen a lot of posters that show the background, methods, results, and conclusions, each in its own section, and in most cases I personally much prefer when a poster focuses on and is organized around the big takeaways of the work. I don’t remember a lot long-term; it helps to be directed to the important parts. (Of course, it is entirely possible that the next project I make a poster for will be perfect for the background-methods-results-conclusions format and I will take this all back.)

Here is the poster as a zoomable pdf:

Here is a video of the poster printing (!). It’s silly, but it was exciting for me:

I couldn’t get my poster to send itself to the poster printer (the morning of the poster session, because as hard as I try that is the timeline I apparently work on) and a fellow Broadie helped me because that is just what the Broad culture is like, a cozy and collaborative (and absurdly smart) home.

Here is the poster on its poster board, at poster spot 2, poster spot 2 itself being a great honor:


Because of its layout, the poster was very easy to present. The poster session was in person, which was also very fun. I was able to simply gesture to and summarize the main points and describe the takeaways from the figures that led to the main points. When a new person joined I was able to quickly interrupt myself to summarize the project overall, I think along with a quick list of the main points while gesturing to them, and then jump back into the point I was presenting. I cycled through the main points as people came and went.

I was lucky that I was very much on that day; it was absolutely one of my better presentation days, and the project itself had filled my entire brain for months. I really enjoyed presenting to the smaller groups that visited our poster; it felt more like a conversation than a presentation. I liked that people were comfortable interrupting me to ask questions and I liked that I was able to think about the questions for longer than I usually feel comfortable stopping during a talk. I have noticed over the past year that I really like being able to make eye contact with people and feel like we are in conversation. The more conversational and less formal the situation feels, the better I am at presenting the work, a little correlated (R2 being maybe about 0.65) with the number of people I am talking with. I have not recently given a talk in person, but the pattern seems to hold at least over zoom and in this one in-person poster session. I’d like to learn how to bring the same energy to all my presentations, including in contexts that are more challenging or high stakes. Something to strive for and remember very fondly. This poster session was amazing. I want all my communication about my work to feel like it felt.