Getting started using R for data analysis and visualization

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park in our lab for introducing me to R and preempting my confusion about the difference between tables in Excel and R—a lot of this guide was verbally passed on to me from Danny when I first joined the lab; I just wrote it down.

Thank you to Lizzie Wolkovich for teaching me everything I know about modelling.

Contents

This document should take a veritable eternity to work through—but you don’t need to work through all or even most of it to do what you want to do. Take the pieces you need, and when you need more pieces, take those too.

Introduction

Our lab is split pretty evenly between people who use Python for data analysis and visualization and people who use R for data analysis and visualization. I strongly prefer R, because it was made specifically for data analysis, but there is no wrong answer. R is especially powerful because of the added functionality of its libraries, such as dplyr and ggplot2.

If R is about to become your first programming language (!), I recommend starting off with the first part of codecademy’s Learn R course.

Getting started

Using R within Terra

If you use Terra (our lab uses Terra), you can write and run R scripts directly in Terra, and run your R scripts directly on the data you analyze in Terra. You can watch the Terra team’s 15-minute Terra Notebooks Quickstart video to learn how:

Using R on your computer

Download and install R here.

Download and install RStudio here.

RStudio allows you to write and run your code as a full script in its own file, run subsets of your code by simply highlighting the individual lines you are interested in and hitting command-enter, or write and run throwaway code in RStudio’s console.

Here is a short video introduction to RStudio:

You can also follow University of British Columbia STAT 545’s guide: Install R and RStudio.

Packages/libraries

Package and library are terms used pretty much interchangeably in R. You can install a package, in this example a package named plyr, by entering the following command in RStudio’s console with your own package of interest:

install.packages("plyr")

You only need to install a package once. After you have installed it, you can load a package by adding the following line to your script anywhere before the code that will use the package, again using the package plyr as an example:

library(plyr)

You can learn more about packages/libraries in this tutorial.

I have found the following packages helpful:

  • tidyverse: a collection of helpful packages all packaged together—including dplyr, ggplot2, tidyr, and stringr
  • data visualization:
    • ggplot2: the main package used for data visualization in R
    • ggpubr: customization add-ons for ggplot2
    • scales: customize axes and legends
    • patchwork: put multiple plots together in an aligned arrangement
    • grid and gridExtra: another option for arranging multiple plots into a layout
  • data manipulation and summarization:
    • dplyr: the main package for manipulating and summarizing data tables
  • statistical analysis:
    • binom: binomial confidence intervals
    • broom: summarizes statistical analysis outputs in more readable formats
  • beepr: beeps when your code is done running

Data tables in R

Differences between tables in R and Excel

The thing that most confused me when I started coding in R is that data tables in R are different from the tables I was used to working with in Excel.

If you’re used to working in Excel, you might be accustomed to tables that look like this one. In this table, each row is a patient. The patient name appears in the first column of each row. The rest of the columns belong to species. Each cell contains the number of reads that were mapped to the species (its column) from the patient (its row).

For input into R, this table would look quite different. Every readcount cell in the Excel version of the table gets its own row in the R version of the table.

The R data table format allows you to add more columns to each patient-species pair, which in the Excel table format would require you to make a whole new table. Notice that some of the columns refer to the patient (patient, is_food_handler, collection_date, Ct) while other columns refer to the species (species, is_virus) and other columns refer to the patient-species pair (readcount, genome_covered):

Each column is a variable, and the two terms can be used interchangeably when talking about data tables.

(This data is clearly from an experiment mixing different sample types from each patient into a single sample.)
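If your data starts out in the wide, Excel-style layout, you do not have to reshape it by hand. As a sketch, the pivot_longer function from the tidyr package (part of the tidyverse listed above) converts a wide table into the long format; the patients, species, and readcounts here are made up:

  library(tidyr)

  # A wide, Excel-style table: one row per patient, one column per species
  wide_table <- data.frame(
    patient = c("patient_A", "patient_B"),
    `E. coli` = c(0, 120),
    `Hepatitis C` = c(6457, 0),
    check.names = FALSE)

  # Reshape into the long, R-style format: one row per patient-species pair
  long_table <- pivot_longer(wide_table, cols = -patient,
    names_to = "species", values_to = "readcount")
  long_table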

My example code will use this table. If you would like to follow along, download it here.

Reading in and saving data tables

To read in a comma-separated table:
  input_table_path <- "/Users/lakras/my_input_file.csv"
  my_table <- read.csv(input_table_path)

or
  my_table <- read.table(input_table_path, sep=",", header=TRUE)

To read in a tab-separated table:
  input_table_path <- "/Users/lakras/my_input_file.tsv"
  my_table <- read.delim(input_table_path)

or
  my_table <- read.table(input_table_path, sep="\t", header=TRUE)

To save a table as a comma-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.csv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep=',',
    row.names=FALSE)

To save a table as a tab-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.tsv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep='\t',
    row.names=FALSE)

Viewing and manipulating data tables

Exploring your data table and its columns

You can view or access the full data table by simply entering its name:
  my_table

View a summary of the variables (columns) in your data table, their data types (int, num, chr, logi, factor), and the first few values using the str function:
  str(my_table)

You can view or access the individual variables (columns) of your data table by entering the table name followed by $ followed by the column name:
  my_table$patient
  my_table$species
  my_table$readcount
  my_table$genome_covered
  my_table$is_virus
  my_table$Ct
  my_table$is_food_handler
  my_table$collection_date

You can view all values of a variable (column) and the number of rows with each value using the table function:
  table(my_table$species)

You can view a subset of variables (columns) using the select function from the dplyr package:
  select(my_table, patient, species, genome_covered)

You can view the first few rows of a data table using the head function or the last few rows using the tail function:
  head(my_table)
  tail(my_table)

Subsetting your data table

You can use the subset function to view a subset of your data table. Perhaps we want to view the subset of our rows related to Hepatitis C:
  subset(my_table, species == "Hepatitis C")

You can save the subset, or set the data table to be a subset of itself. Perhaps we are only interested in keeping the subset of our rows with readcount ≥100:
  my_table <- subset(my_table, readcount >= 100)

Creating new columns

To make a new column with all values set to TRUE:
  my_table$is_human <- TRUE

To make a new column by duplicating another column:
  my_table$readcount_2 <- my_table$readcount

To make a new column by multiplying another column by a scalar:
  my_table$percent_genome_covered <- my_table$genome_covered * 100

Conditional viewing and manipulating column values

You can view values of a column including only those rows that meet a certain condition. For example, you can view all readcounts ≥100:
  my_table$readcount[my_table$readcount >= 100]

You can also view all Ct values from rows with readcount ≥100:
  my_table$Ct[my_table$readcount >= 100]

In addition to viewing column values, you can also modify them. For example, you can set all readcounts <100 to NA:
  my_table$readcount[my_table$readcount < 100] <- NA

You can also modify a column that is different from the column you are testing in your condition, and the column you modify does not have to already exist. You can create a new variable with values conditional on another column’s values:
  my_table$passes_threshold[my_table$readcount >= 100] <- TRUE
  my_table$passes_threshold[my_table$readcount < 100] <- FALSE
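Because a comparison like my_table$readcount >= 100 already produces TRUE and FALSE values, the same new column can also be created in a single step (a sketch with made-up readcounts and no NA values):

  # Made-up stand-in for my_table
  my_table <- data.frame(readcount = c(4324, 50, 865))

  # The comparison itself returns TRUE/FALSE for each row
  my_table$passes_threshold <- my_table$readcount >= 100
  my_table$passes_threshold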

Data types

Here are some of the data types you will encounter and use:

  • chr: a text value
  • factor: a category
  • int: an integer (no decimal places)
  • num: a number (with decimal places)
  • logi: TRUE or FALSE (logi is short for logical)
  • Date: a date

The chr data type is a text value. This data type might be useful for text values that are unique to each row, for example patient name, sample name, or hospital address. In some cases, text values (or other kinds of values) might reappear from row to row and be used to group rows together, for example state of residence, presence of coughing, sample type, or patient gender. These values should be factors.

You can modify the data type of your variable.

To make a variable’s values into text values:
  my_table$patient <- as.character(my_table$patient)

To make a variable’s values into categories:
  my_table$species <- as.factor(my_table$species)

To make a variable’s values into integers (no decimal places):
  my_table$readcount <- as.integer(my_table$readcount)

To make a variable’s values into numerical values (with decimal places):
  my_table$Ct <- as.numeric(my_table$Ct)
  my_table$genome_covered <- as.numeric(my_table$genome_covered)

To make a variable’s values into TRUE and FALSE values:
  my_table$is_food_handler <- as.logical(my_table$is_food_handler)
  my_table$is_virus <- as.logical(my_table$is_virus)

To make a variable’s values into dates:
  my_table$collection_date <- as.Date(my_table$collection_date)

Special values

Some special values are NA (value not available), NaN (not a number), Inf (infinity), and -Inf (negative infinity). You will need to keep these values in mind and check for them in many analyses.

You can check if a value (my_value in this case) is one of these values:
  my_value <- 5
  is.na(my_value)
  is.nan(my_value)
  is.infinite(my_value)
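One detail worth knowing: is.na returns TRUE for NaN values as well, while is.nan returns FALSE for NA. A quick demonstration:

  0 / 0                # NaN
  1 / 0                # Inf
  is.na(NaN)           # TRUE: the NA check catches NaN too
  is.nan(NA)           # FALSE
  is.infinite(-1 / 0)  # TRUE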

Statistical analysis

Summary statistics

You can view basic summary statistics for a variable.

Minimum, maximum, and range:
  min(my_table$genome_covered, na.rm=TRUE)
  max(my_table$genome_covered, na.rm=TRUE)
  range(my_table$genome_covered, na.rm=TRUE)

Mean and standard deviation:
  mean(my_table$genome_covered, na.rm=TRUE)
  sd(my_table$genome_covered, na.rm=TRUE)

Median, first quartile (the 25th percentile of your data), third quartile (the 75th percentile of your data), and interquartile range (the middle 50% of your data, or the distance between the 25th percentile and the 75th percentile):
  median(my_table$genome_covered, na.rm=TRUE)
  quantile(my_table$genome_covered, 1/4, na.rm=TRUE)
  quantile(my_table$genome_covered, 3/4, na.rm=TRUE)
  IQR(my_table$genome_covered, na.rm=TRUE)

na.rm=TRUE excludes NA and NaN values from your calculations of the mean, median, min, max, etc. (There is no separate nan.rm argument; na.rm removes NaN values as well.)
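A quick demonstration of what na.rm does, with made-up values:

  my_values <- c(1, 2, NA)
  mean(my_values)              # NA
  mean(my_values, na.rm=TRUE)  # 1.5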

You can generate a table of summary statistics grouped by values of a variable using ddply from the plyr library. For example, we might calculate the minimum, maximum, and mean of readcounts for each species separately, saving them in a new data table we name readcount_summary:

  readcount_summary <- ddply(my_table, ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount))

The output looks like this:

  species      min_readcount  max_readcount  mean_readcount
1 E. coli                  0           4324        620.4286
2 Hepatitis C              0           6457        931.7143
3 HIV                      0            865        217.0000
4 Influenza A              0           4363        677.1429
5 SARS-CoV-2              76           4532       1289.5714

Or we might calculate the minimum, maximum, and mean of both readcounts and Cts, grouping rows not only by species, but also by whether or not the patient is a food handler and whether or not at least half of the genome is covered:

  readcount_summary <- ddply(my_table,
    genome_covered >= 0.50 ~ is_food_handler ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount),
    min_Ct=min(Ct),
    max_Ct=max(Ct),
    mean_Ct=mean(Ct))

That output looks like this:

   genome_covered >= 0.5  is_food_handler  species      min_readcount  max_readcount  mean_readcount  min_Ct  max_Ct   mean_Ct
1  FALSE                  FALSE            E. coli                  0           4324          868.60    16.8    27.9  21.64000
2  FALSE                  FALSE            Hepatitis C              0             65           16.25    16.8    27.9  21.72500
3  FALSE                  FALSE            HIV                      0            865          303.80    16.8    27.9  21.64000
4  FALSE                  FALSE            Influenza A              0              0            0.00    16.8    27.9  22.70000
5  FALSE                  FALSE            SARS-CoV-2              87            876          546.00    21.3    27.9  24.66667
6  FALSE                  TRUE             E. coli                  0              0            0.00    21.5    23.9  22.70000
7  FALSE                  TRUE             Hepatitis C              0              0            0.00    21.5    23.9  22.70000
8  FALSE                  TRUE             HIV                      0              0            0.00    21.5    23.9  22.70000
9  FALSE                  TRUE             Influenza A             65            312          188.50    21.5    23.9  22.70000
10 FALSE                  TRUE             SARS-CoV-2              76            435          255.50    21.5    23.9  22.70000
11 TRUE                   FALSE            Hepatitis C           6457           6457         6457.00    21.3    21.3  21.30000
12 TRUE                   FALSE            Influenza A           4363           4363         4363.00    17.4    17.4  17.40000
13 TRUE                   FALSE            SARS-CoV-2            2346           4532         3439.00    16.8    17.4  17.10000

dplyr is extremely powerful, and can do a lot more than summarize. You can learn more about dplyr by exploring these tutorials:
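As a sketch of the dplyr approach, here is the same per-species summary as the ddply example above, written with dplyr’s group_by and summarise verbs (the table here is a small made-up stand-in for my_table):

  library(dplyr)

  my_table <- data.frame(
    species = c("HIV", "HIV", "SARS-CoV-2", "SARS-CoV-2"),
    readcount = c(0, 865, 76, 4532))
  readcount_summary <- my_table %>%
    group_by(species) %>%
    summarise(min_readcount=min(readcount),
      max_readcount=max(readcount),
      mean_readcount=mean(readcount))
  readcount_summary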

Statistical tests

R has many statistical tests built in. Identify the one that is most relevant to your data, determine how to do it in R, and interpret your p-value or other preferred statistical test output. Here is a guide from r-statistics.co.
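As one example (not necessarily the right test for your data), a two-sample t-test comparing Ct values between food handlers and non-food handlers might look like this, with made-up values; t.test is built into R:

  my_table <- data.frame(
    Ct = c(16.8, 21.3, 27.9, 17.4, 21.5, 23.9),
    is_food_handler = factor(c("no", "no", "no", "yes", "yes", "yes")))
  t.test(Ct ~ is_food_handler, data = my_table)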

Modelling

Modelling allows you to estimate the strength of a correlation and to compare the individual impacts of variables affecting an outcome variable.

First, you should get an introduction to selecting the appropriate kind of model and writing the code to generate that model. datacamp.com has an interactive course on modelling in R, in which you practice writing the code yourself and receive automated feedback. If you prefer just reading, r-statistics.co has a guide:

  1. Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
  2. Logistic Regression (for binary outcome variables)
  3. Multinomial Regression (for categorical outcome variables)
  4. Other regression models, if you need them
  5. Model Selection Approaches
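As a sketch of what fitting a model looks like once you have chosen one, here is a simple linear regression of readcount on Ct using base R’s lm function, with made-up values:

  my_table <- data.frame(
    readcount = c(4324, 865, 6457, 0, 4532, 76),
    Ct = c(16.8, 21.3, 17.4, 27.9, 16.9, 23.9))
  my_model <- lm(readcount ~ Ct, data = my_table)
  summary(my_model)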

Next, you need to understand what your model is telling you. I learned everything I know about modelling from Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. This book is thorough, clear, correct, and readable, unlike most things. The pdf is (legally) available for free online. The following are the guides to interpreting your models that I have bookmarked (most of these sections are around 3-5 pages):

The whole book is excellent. Once you are comfortable selecting and interpreting models, I recommend reading the following chapters:

Data visualization

Scatterplots, bar plots, histograms, and other standard figures

Follow this tutorial from r-statistics.co to learn how to use ggplot2 for data visualization. It takes you through making a scatterplot, then extends what you’ve learned to other kinds of plots:

  1. The Complete ggplot2 Tutorial – Part1 | Introduction To ggplot2
  2. The Complete ggplot2 Tutorial – Part 2 | How To Customize ggplot2
  3. Top 50 ggplot2 Visualizations – The Master List

Additionally, I think you should learn about grouped vs. stacked vs. percent bars in bar plots or histograms, which the above tutorial does not cover.

I have found that in my own work, I most frequently use and encounter scatterplots, histograms, barplots, line plots, and violin plots. 
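As a minimal sketch of the ggplot2 pattern the tutorials above build on, a plot has three parts: the data table, an aes mapping of columns to x, y, and color, and a geom layer. The values here are made up:

  library(ggplot2)

  my_table <- data.frame(
    Ct = c(16.8, 21.3, 17.4, 27.9, 16.9, 23.9),
    readcount = c(4324, 865, 6457, 0, 4532, 76),
    is_virus = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE))
  ggplot(my_table, aes(x=Ct, y=readcount, color=is_virus)) +
    geom_point()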

Here are some references with different kinds of plots you might like to make, explanations, and example code to copy-paste and modify:

Once you have mastered the general principles, you can google any kind of plot or modification to your plot that you would like to make. I have found that almost anything I want to do has at some point been asked about on StackOverflow.

Maps

Here are some resources on plotting colorful maps in R:

R in particular is the best solution for drawing a map with curved directional arrows illustrating transmission events (doing it in Python seems to be much more challenging and doesn’t end up looking as nice). There does not seem to be a good guide for making this kind of figure, but my colleague Gage Moreno found a way to do it using curvedarrow in the diagram package.

Conditionals, loops, and functions

Conditional statements

Conditional statements allow you to automatically run different lines of code depending on what your data looks like. In this example, we check if your minimum collection date is before May 9th; if it is, we subset our data to collection dates after May 10th:

  if(min(my_table$collection_date) < as.Date("2022-05-09"))
  {
    my_table <- subset(my_table, collection_date > as.Date("2022-05-10"))
  }

The above example only executes the code inside the { and } brackets if the condition boils down to TRUE; if the condition is FALSE, your script does nothing. If you’d like, you can add an else statement to make it do something else if the condition is FALSE. In the following example, if our mean readcount is <100, we replace any readcounts less than 100 with NA. If not, we instead move on to the else and replace any readcounts less than 200 with NA:

  if(mean(my_table$readcount) < 100)
  {
    my_table$readcount[my_table$readcount < 100] <- NA
  } else {
    my_table$readcount[my_table$readcount < 200] <- NA
  }

You can chain these conditions together. In this example, we check if the minimum proportion of any genome covered in our table is less than 50%:

  • If it is, we subset our table to rows with is_virus set to TRUE and (&) readcount at least 50 and we exit the chain.
  • If it isn’t, we follow the else to the next statement: we check if the minimum proportion of any genome covered in our table is greater than 70%.
    • If it is, we subset our table to rows with is_virus set to FALSE and (&) readcount at least 100 and we exit the chain.
    • If it isn’t, we follow the else to our next and final statement: we subset our table to rows with readcount at least 100 or (|) at least 50% of the genome covered, and then we set any Ct values greater than 25 to 25.

  if(min(my_table$genome_covered) < 0.50)
  {
    my_table <- subset(my_table, is_virus & readcount >= 50)
  } else if(min(my_table$genome_covered) > 0.70)
  {
    my_table <- subset(my_table, !is_virus & readcount >= 100)
  } else
  {
    my_table <- subset(my_table,
      readcount >= 100 | genome_covered >= 0.50)
    my_table$Ct[my_table$Ct > 25] <- 25
  }

Scaling up your analyses with loops

Looping through a predetermined list of values

If you would like to automatically do the same analysis or generate the same plot for multiple values, you can use a for-loop. In the following example, we subset the table to include only rows where species is Hepatitis C, HIV, or Influenza A, and save each table to an output file named after the species. We do that by using the variable my_species inside the { and } brackets and setting my_species to Hepatitis C, HIV, and then Influenza A. With each iteration of the loop, my_species takes on the next value and the code inside the { and } brackets is executed with my_species set to that value.

  for(my_species in c("Hepatitis C", "HIV", "Influenza A"))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt",
      sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Instead of using a for-loop, you could instead copy-paste the code inside the { and } brackets, changing my_species to your species of choice each time:

  my_table_subset <- subset(my_table, species == "Hepatitis C")
  output_table_path <- paste(input_table_path, "_", "Hepatitis C",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "HIV")
  output_table_path <- paste(input_table_path, "_", "HIV",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "Influenza A")
  output_table_path <- paste(input_table_path, "_", "Influenza A",
    ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

This is a lot longer and harder to read—and you’re only copy-pasting three lines of code. If you wanted to add another species of interest (or ten), you would need to copy-paste again and again instead of just adding another species name to your list. If you wanted to change something, maybe the names of the output files, you would need to change it for each copy-pasted version. Not only is that a lot of work, it also makes it more likely that you will make a mistake and, because your code is less readable, less likely that you will catch it.

Looping through a non-predetermined list of values

Another benefit of for-loops is that you do not need to decide what values you are cycling through ahead of time. Perhaps we want to print subsets of our table for all species appearing in the table, whatever they are, without knowing them ahead of time. This example does the same thing as our previous two examples, but it does it for all unique species in our table.

  for(my_species in as.list(unique(my_table$species)))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt",
      sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Maintaining data types in loops

Note that as.list is absolutely necessary in the above for-loop—I learned this the hard way. R for some reason strips values of their type when they are iterated through in a for-loop. Try it yourself—here is the list the for-loop is iterating over; it is of type factor:

  unique(my_table$species)
  class(unique(my_table$species))

Here is an iteration through the list with as.list—the elements of the list are, as before, of type factor:

  for(my_species in as.list(unique(my_table$species)))
  {
    print(my_species)
    print(class(my_species))
  }

Here is an iteration through the list without as.list—the elements of the list are now of type character:

  for(my_species in unique(my_table$species))
  {
    print(my_species)
    print(class(my_species))
  }

Looping through integers

You can also cycle through integer values. This chunk of code, for example, sets my_value to each integer from 2 through 10 and prints it each time.

  for(my_value in 2:10)
  {
    print(my_value)
  }

Reusing your analyses with functions

When you find yourself copy-pasting your code and changing a few things, you should consider using a function. Here is a function I wrote to calculate a 95% confidence interval given a proportion and a total:

  confidence_interval_from_proportion_and_total <- function(proportion, total)
  {
    confidence_interval_plus_minus <-
      1.96 * sqrt(proportion * (1 - proportion) / total)
    confidence_interval_min <- proportion - confidence_interval_plus_minus
    confidence_interval_max <- proportion + confidence_interval_plus_minus
    return(c(min = confidence_interval_min, proportion = proportion,
      max = confidence_interval_max))
  }

After defining the function, I call it whenever I need to calculate a confidence interval:

  confidence_interval_from_proportion_and_total(0.50, 500)
  confidence_interval_from_proportion_and_total(0.10, 10)
  confidence_interval_from_proportion_and_total(0.99, 200)

Instead of copy-pasting my code each time, I can call the function in one line, making my code shorter and a lot easier to read. If I decide to change how I calculate a confidence interval, perhaps to bound the confidence interval between 0 and 1, I can simply edit the code inside the function rather than needing to locate and edit many lines of code. Some functions can get pretty long, so the benefit adds up.

Avoiding (some) bugs

Put the following line at the very top of your script:

  rm(list=ls())

This line deletes whatever data is already loaded, so that your previously defined variables do not linger, ready to creep into your analyses when you reuse a name because of a typo or forget to assign it a new value as you intended to. You should run this line whenever you are starting over analyzing your data or when you open and run a new script—the top of the script is the best place for it.

Resources

Infectious Disease Reading List: My Qualifying Exam Experiences, Advice, and Syllabi

Every PhD student I talk with seems to have a different qualifying exam and a different qualifying exam experience. My department, Organismic and Evolutionary Biology at Harvard, has a very flexible and customizable qualifying exam, in line with the overall very flexible and customizable PhD program. Our qualifying exams usually happen in the second (G2) year, and consist of the following parts:

  • Three committee members, in addition to your advisor, one of whom (not your advisor) serves as Chair of the committee—all available at the same time, all in the same room (or Zoom room). This part was trickier than I expected. The committee does not all have to be from the department, which helps. Connecting with professors I had already had excellent and productive interactions with helped. Asking for broad availability before sending out possible exam times helped. Booking the room well in advance helped (but did not end up being necessary in my case).
  • A written dissertation research proposal describing the work you plan to do during the rest of your PhD. I have been told that a lot of people end up deviating from the proposal. My proposal consisted entirely of projects I had already started and am committed to finishing. Even so, I am already doing work I could not have even imagined when I wrote my proposal, even though I wrote it something like seven months ago.
  • Three syllabi of courses you think you are qualified to teach, on varying topics and of varying levels. This is open-ended, different from what I usually do, and (I think) a lot of fun.
  • An oral exam, up to three hours long, consisting of two parts. First, you present and answer questions about your work. This part is (I think) a lot of fun. Then, the committee asks you questions, guided but not constrained by your syllabi, to find the depth (or shallowness) of your knowledge. This part is (I think) a lot harder and less fun.

I passed my quals in the spring, during my G3 year, on Monday, April 13th (like Friday the 13th but worse, because it’s a Monday). We’re required to pass sometime during the G3 year, so I just slipped under the radar. (I had also scheduled a back-up time a few weeks later in case I failed, but I did not end up needing it (!!!!).) April 13th was at the very end of the very start of the pandemic in the United States—my quals were virtual, over Zoom. I had originally timed the exam to be right before my mom’s birthday and just after my dad’s and brother’s birthdays, and planned to go home to Pennsylvania right after, hopefully accomplished and with a weight off my shoulders and with full focus on family. Of course that did not happen, and I haven’t seen my family since spring break in March. Instead, I got back from my own birthday with my family over spring break in Florida to a lockdown, thinking it was temporary, and focused fully on quals prep.

I wrote my dissertation research proposal first, with three chapters covering my three in-progress projects (and one tiny transition chapter-ish section covering a relevant smaller completed project). These wound up being 1,970+510+2,245+2,628 = 7,353 words not including references and took a lot longer than I expected, largely because the writing required a lot of reading. I then compiled my syllabi. I got carried away and added far too many papers; I ended up (by request of my committee) sending another version with key papers highlighted. This sequence of events was bad, because it allowed early tasks to steal preparation time from later tasks; it was also good, because it allowed me to work on just one thing at a time, which (I think) I am better at than I am at multitasking. I give a lot of presentations at work, so my slide deck covering my research was largely already ready—which meant that in the weeks leading up to my exam I was able to focus almost entirely on reading the papers on my syllabus.

The exam itself was fine. I took it sitting on the floor between the couch and the coffee table in our living room, with my computer on the coffee table and cups and cups of water and coffee just offscreen on the floor next to me. I was very nervous leading up to the exam and didn’t sleep, which was a mistake. My presentation of my research was excellent, I think, though (not surprisingly) I was not able to get to everything I wanted to talk about and we exceeded the allotted time. The oral exam was a weaker point. I did not know the papers on my syllabus well enough to answer pointed questions about the material anywhere near as well as I would like, even though I had read every paper. I was very nervous, and made some embarrassingly dumb mistakes. In retrospect, for both the presentation and the syllabi, it would have been better to give myself less material—to go deeper into the material on the syllabi and to go less deep into my own work, at least for the presentation (not the written proposal).

When I entered the time crunch of the last few weeks, I put together a spreadsheet tracking my progress and the timing of remaining work. (You update the count of papers you’ve read in the “done” columns and everything else fills in automatically.) I make a spreadsheet like this one every time I have some work to do that is both time pressured and easily quantified, which is rarely the case in grad school (except for quals prep) but was usually the case in undergrad. I started making these kinds of spreadsheets a few weeks into freshman year; my friend Mika taught me pretty much immediately after we both arrived on campus. It is motivating and reassuring and probably also a method of procrastinating. I’ve attached a version of my spreadsheet below, with Halloween set as the deadline, in case you would like to go nuts in the way I particularly like to go nuts and use it as a template or inspiration:

All in all, I spent almost exactly a month on full-time/overtime quals prep (pretty much quals prep and sleeping (probably not enough sleeping) and very little else) from the middle of March to the middle of April. I think it was good for me to constrain this chapter of the unending project of self-improvement and mind expansion—but if I could go back in time, I would have started compiling my syllabi and reading the papers on my syllabi during the first year of my PhD. Some of the texts on my syllabi are material I read and learned at the start of my PhD, but because I chose to also include a lot of material I wanted to know well but didn’t, there was a lot for me to read leading up to the exam and I am not satisfied with how well I absorbed some of it. Reading just one or two papers a week spread out over a year would have probably resulted in far better retention and learning, and would have allowed me to dedicate more time to getting everything I could from each paper. At the time I was intimidated by the process of putting together my syllabi, but I didn’t need to be. Organizing my favorites of the papers I was already reading into vague themes would have been a good enough start to later retrofit to the desired format.

I have been told that some students dedicate an entire semester to preparing for quals. I don’t think I would like to do that (and if the pandemic hadn’t paused my primary project I probably would have continued to try to multitask and continue working on research—which probably would have ended badly for my qualifying exam, though who knows), but focusing entirely on reading and writing for a stretch of time was very productive for me.

I learned a lot, both about my current projects while preparing my dissertation research proposal and about what kinds of work and tools are available to me in the field more broadly while preparing my syllabi. As I expected, having to write out the current and future directions of my current projects and having to read deeply enough to write every sentence with full truth and confidence forced me to gain a much, much better understanding of my own work and of the adjacent literature. What surprised me was that when I returned to my research after my qualifying exam, I returned with a lot of clarity of a sort I hadn’t had before. I knew where I was and where I was going in my current projects. I also found myself coming up with exciting new project ideas at a rate and of a quality (if I may say so myself) I hadn’t expected from myself at this stage of my career; quals definitely caused a leap in my ability to think like a scientist.

The document itself is also helpful as a compilation—I rather frequently refer back to my project proposals, my syllabi, and especially the references at the end of each project proposal. My strategy in undergrad and at the start of grad school was to do the science first, then write only when the science was done. Now I am trying something new and writing the paper as I go, and I find that so far it has made the work far more focused, informed, and efficient, and has given me a way to identify (and hopefully fix) problems and gaps in my work well before I try to build anything on them.

To sum up, here is my advice to anyone getting ready to prepare for their qualifying exam:

  • Start identifying and reading papers for the exam well in advance, even before you actually start officially preparing for the exam—a little at a time. Amortize as much as you can of the paper reading part of the work.
  • For every paper you read, put together a few sentences summarizing the key takeaways from the paper. Review your list of papers and summaries in the days leading up to the exam.
  • I found quals to be a great opportunity to learn things I did not know but wanted to know. You can fill your syllabi with material you know well, material you want to learn, or a mix. Consider what you want to get out of the experience and plan from there.
  • When scheduling the test, first ask your committee for broad swaths of time (weeks or months) that are or aren’t good and for any recurring commitments when they are always busy. Then send out a poll with specific test time options. I initially sent out a poll with five timeslots, and then when none of those worked I sent out another poll with ten additional timeslots. I had access to two of my four committee members’ calendars, which helped. I also found it helpful to allow committee members to give each time slot a score from 1 to 5 (5—extremely convenient, 3—I can make it work, 1—doesn’t work for me) rather than just saying yes/no/maybe, which made it easier to work with potential scheduling conflicts. Here is what my first scheduling form looked like:
  • Ask your committee to reserve not one but two three-hour timeslots for your exam, several weeks apart. This way, it won’t be as hard to reschedule your test on short notice if someone has an emergency or a conference or an unexpected vacation or speaking opportunity. And if you fail your test, you have another one already lined up with time to prepare for it.
  • Don’t worry about fitting everything into your presentation. If you’re anything like me, you should make the presentation itself shorter than you think it should be—you can keep the extra material as hidden slides, ready in case it comes up in questions and discussion.
  • There will be questions you don’t know the answer to. Hopefully you are able to answer the shallower, easier questions before reaching something you do not know.
  • Have a nice, efficient stress-relief activity that doesn’t hook you into spending a lot of time on it. I almost never played video games until studying for my quals and for some reason occasionally playing Animal Crossing elevated rather than decreased my productivity, which is not something I would ever have expected.
  • Get plenty of sleep the nights leading up to the exam itself.

What follows are my three syllabi:

  1. Evolutionary Perspectives on Human Disease, which is meant to be an introduction to the immune system as it appears throughout life (in humans, in animals more broadly, in plants, and in bacteria), a not very deep look at disease (infectious and otherwise) across life, and interactions between and co-evolution of infectious agents and their hosts, especially when the hosts are human (but also, briefly, when the hosts are bacteria), culminating in the evolution of the placenta.
  2. Microbial Inhabitants and Infectious Agents of the Human Body, which is a sweeping view of past outbreaks and epidemics, culminating in the current COVID-19 pandemic, as well as a short visit to some of the microbes that we more happily coexist with.
  3. Introduction to Data Analysis Methods for Biological Inference, which covers everything from experimental design and statistical tests to multivariate models to GWAS and PCA to how sequencing works and metagenomic sequencing and genome assembly and phylogenetic trees, culminating in an exploration of how genomic sequencing can be used to track and react to infectious disease outbreaks (which is one of the things that I work on).

I tried to design the syllabi as if I were actually teaching these courses—and I would actually be very excited to teach them. They encompass, I think, most of what I know that is most relevant to my research, including a lot of things that I did not know until I put these syllabi together, found gaps in my knowledge that I was not satisfied with, and filled them. (I would also like to teach creative writing, but alas.)

You might notice that the readings include both actual papers and science journalism, in some cases science journalism about papers that are also included. (This actually came up as a question during my exam!) Including both was a very intentional choice: science journalism—specifically, Popular Science and then MIT Technology Review—was the first context in which I read about and got excited about research. I still get most of my science news from popular science journalism, especially in fields that I am curious about but am not doing research in. My hope, if I were actually teaching these courses, is that offering both research articles and popular science would

  1. allow students who are just starting to learn about infectious disease to engage in the class, hopefully leading to increasing comfort and a transition to the primary literature as the semester goes on,
  2. give students who are confused about or lost in a paper a way to get untangled (and teach students to seek out ways to get untangled), and
  3. show students some of the many different ways of writing about science, and show them good (and possibly bad) examples of how to communicate both with their peers and with a broader audience.

You might also notice that I put these syllabi together in March—some of the work on COVID-19 is already out of date.


As an Amazon Associate I earn from qualifying purchases. The next few paragraphs of this blog post include links with my Amazon referral code. If you click one and buy something, I get up to 4% of the price as commission. You don’t have to buy these books from Amazon—you can support local bookshops by buying books from Bookshop.org, or you can buy them used and donate them to or start or build a little lending library in your neighborhood, or you can not buy anything at all. You can also support me by buying merch of my art, by buying me a campground store decaf coffee, or by simply reading and enjoying. Thank you!


There are two books I reference a lot, because I like them a lot, that I highly recommend—

Regression and Other Stories (Analytical Methods for Social Research) by Andrew Gelman, Jennifer Hill, and Aki Vehtari:

We used Regression and Other Stories in OEB 201 (Introduction to experimental design and model building for ecologists and evolutionary biologists) with Professor Lizzie Wolkovich during the first semester of my PhD. The class and the textbook were both extremely useful and enjoyable—definitely one of my most efficient and relevant learning experiences. Our version of the textbook was an earlier draft, spiral bound, years before it came out—we got to read it early and we got to contribute feedback that went into the final version, which I thought was a fun and special experience and a neat way to feel connected to a work that I greatly enjoyed reading. My copy is very, very dog-eared and highlighted and covered in notes and thoughts in every margin. I refer to it often whenever I need to do any modeling or think about experimental design.

Zoobiquity: The Astonishing Connection Between Human and Animal Health by Barbara Natterson-Horowitz and Kathryn Bowers:

Zoobiquity completely changed the way I think about the human experience and broadened my view of human disease—which was extremely valuable because human disease is the focus of my work. I got to be a Teaching Fellow a few years ago for three sections of HEB 1328 (Evolutionary Medicine: Comparative Perspectives on Medical, Surgical and Psychiatric Illness) with Professor Barbara Natterson-Horowitz. The lectures largely followed the book, which is nice because it means you can get a good part of the learning by reading it.

Evolutionary Perspectives on Human Disease

Immune systems, infection, and inherited disease across life.

This is a lecture-based class introducing the human immune system from a comparative perspective, along with some of the diseases our immune systems help us fight or can cause. We will learn about immune systems across life, in bacteria, plants, humans, and non-human animals—and how comparing immune systems allows us to better understand zoonotic transmission of disease. We will then look at some examples of infectious disease and inherited disease in animals and plants, and how animal parallels of human disease have helped us solve our own, human mysteries and make strides in medicine. Finally, we will look at how pathogens and their hosts impact each other’s evolution, and how human evolution has been impacted by disease.

We will meet on Tuesdays and Thursdays. Every week, you are responsible for reading your choice of two of the other listed texts closely enough to be an expert, and emailing me one generous tweet-length response to each text that you choose (≤250 words each). For the last 15 minutes of each lecture, I will display two responses on the projector and we will discuss them as a class. Highlighted texts are strongly recommended.

In lieu of a final exam, you will choose your favorite of your peers’ “tweets” that we discussed in class (not your own) and use it as a jumping-off point to write a 1,000- to 2,000-word response drawing from the texts and from class discussion.

By the end of this course, you will have a broad understanding of immune systems and disease across life, and (hopefully) an appreciation of the value of knowing it all.

The Immune System

Week 1: The Human Immune System

  • “Understanding the Immune System: How It Works,” published by the NIH in 2003 [link]
  • “The immune system,” published in Essays in Biochemistry in 2016 [link]
  • “Overview of the human immune response,” published in The Journal of Allergy and Clinical Immunology in 2006 [link]

Week 2: Bacterial Immune Systems

  • “The Origin of the Bacterial Immune Response,” Chapter 1 of Self and Nonself, 2012 [link]
  • “Systematic discovery of antiphage defense systems in the microbial pangenome,” published in Science in 2018 [link]
  • “Temperate Bacterial Viruses as Double-Edged Swords in Bacterial Warfare,” published in PLOS ONE in 2013 [link]
  • “Viruses Have Their Own Version of CRISPR,” published in The Atlantic in 2016 [link]

Week 3: Plant Immune Systems

  • “The plant immune system,” published in Nature in 2016 [link]
  • “Origin and evolution of the plant immune system,” published in New Phytologist in 2019 [link]

Week 4: Animal Immune Systems and Evolution

  • “Comparative Immune Systems in Animals,” published in Annual Review of Animal Biosciences in 2014 [link]
  • “Origin and Evolution of Adaptive Immunity,” published in Annual Review of Animal Biosciences in 2014 [link]
  • “Evolution of Immune Systems From Viruses and Transposable Elements,” published in Frontiers in Microbiology in 2019 [link]

Week 5: Vector Immune Systems and Zoonotic Transmission

  • “The Immune Responses of the Animal Hosts of West Nile Virus: A Comparison of Insects, Birds, and Mammals,” published in Frontiers in Cellular and Infection Microbiology in 2018 [link]
  • “Mosquito Vectors and the Globalization of Plasmodium falciparum Malaria,” published in Annual Review of Genetics in 2016 [link]
  • “Host phylogenetic distance drives trends in virus virulence and transmissibility across the animal-human interface,” published in Philosophical Transactions of the Royal Society B: Biological Sciences in 2019 [link]
  • “Surprise! British Red Squirrels Carry Leprosy,” published in The Atlantic in 2016 [link]
  • “Is It Possible to Predict the Next Pandemic?” published in The Atlantic in 2017 [link]

Week 6: Bats as Disease Vectors

  • “Why Are Bats’ Immune Systems Totally Different From Any Other Mammal’s?” published in Popular Science in 2015 [link]
  • “Bats’ immune defenses may be why their viruses can be so deadly to people,” published in Science News in February 2020 [link]
  • “Accelerated viral dynamics in bat cell lines, with implications for zoonotic emergence,” published in eLife in 2019 [link]
  • “Dampened NLRP3-mediated inflammation in bats and implications for a special viral reservoir host,” published in Nature Microbiology in 2019 [link]

Infectious Disease Across Life

Week 7: Infectious Disease Across Life

  • “The Koala and the Clap: The Hidden Power of Infection,” Chapter 10 of Zoobiquity
  • “Plant and pathogen warfare under changing climate conditions,” published in Current Biology in 2018 [link]
  • “How Viruses Cooperate to Defeat CRISPR,” published in The Atlantic in 2018 [link]
  • “The Viruses That Eavesdrop on Their Hosts,” published in The Atlantic in 2018 [link]

Week 8: Extinctions and Mass Mortality Events

  • “Recent shifts in the occurrence, cause, and magnitude of animal mass mortality events,” published in PNAS in 2015 [link]
  • “A Starfish-Killing Disease Is Remaking the Oceans,” published in The Atlantic in 2019 [link]
  • “Why Did Two-Thirds of These Weird Antelope Suddenly Drop Dead?,” published in The Atlantic in 2018 [link]
  • “What We Can Learn From the Near-Death of the Banana,” published in Time Magazine in 2019 [link]

The Chytrid Fungus:

  • “Amphibian fungal panzootic causes catastrophic and ongoing loss of biodiversity,” published in Science in 2019 [link]
  • “The Worst Disease Ever Recorded,” published in The Atlantic in 2019 [link]
  • “The Cascading Consequences of the Worst Disease Ever,” published in The Atlantic in February 2020 [link]

Inherited Disease Across Life

Week 9: Inherited Disease Across Life

Heart Disease:

  • “The Feint of Heart: Why We Pass Out,” Chapter 2 of Zoobiquity
  • “Scared to Death: Heart Attacks in the Wild,” Chapter 6 of Zoobiquity

Mental Health:

  • “Grooming Gone Wild: Pain, Pleasure, and the Origins of Self-Injury,” Chapter 8 of Zoobiquity
  • “Fear of Feeding: Eating Disorders in the Animal Kingdom,” Chapter 9 of Zoobiquity
  • “A Landmark Study on the Origins of Alcoholism,” published in The Atlantic in 2018 [link]

Cancer:

  • “Jews, Jaguars, and Jurassic Cancer: New Hope for an Ancient Diagnosis,” Chapter 3 of Zoobiquity
  • “Elephants Have a Secret Weapon Against Cancer,” published in The Atlantic in 2018 [link]

Diabetes:

  • “The Blind Fish That Should Have Diabetes, But Somehow Doesn’t,” published in The Atlantic in 2018 [link]

Week 10: Allergy and Autoimmune Diseases

Allergies:

  • “Comparative Immunology of Allergic Responses,” published in Annual Reviews in 2015 [link]
  • “Early life factors that affect allergy development,” published in Nature Reviews Immunology in 2017 [link]
  • “Pet-keeping in early life reduces the risk of allergy in a dose-dependent fashion,” published in PLOS ONE in 2018 [link]
  • “Comparisons of Allergenic and Metazoan Parasite Proteins: Allergy the Price of Immunity,” published in PLOS Computational Biology in 2015 [link]
  • “Interactions between helminth parasites and allergy,” published in Current Opinion in Allergy and Clinical Immunology in 2009 [link]

Autoimmunity:

  • “Human autoimmune diseases: a comprehensive update,” published in The Journal of Internal Medicine in 2015 [link]
  • “Thymic tolerance as a key brake on autoimmunity,” published in Nature Immunology in 2018 [link]
  • “Regulatory T cells in autoimmune disease,” published in Nature Immunology in 2018 [link]
  • “Narcolepsy confirmed as autoimmune disease,” published in Nature News in 2013 [link]

Co-Evolution of the Human Immune System and Infectious Agents

Week 11: Co-Evolution of Microbial Pathogens and Their Hosts

  • “Rapid evolution of microbe-mediated protection against pathogens in a worm host,” published in The International Society for Microbial Ecology Journal in 2016 [link]
  • “The evolution of the host microbiome as an ecosystem on a leash,” published in Nature in 2017 [link]
  • “Harnessing the Power of Defensive Microbes: Evolutionary Implications in Nature and Disease Control,” published in PLOS Pathogens in 2016 [link]
  • “Some Microbes Have Been With Us Since Before We Existed,” published in The Atlantic in 2017 [link]

Relationships Between Bacteriophages, Bacteria, and the Human Immune System:

  • “Virus tricks the immune system into ignoring bacterial infections,” Nature News in 2019 [link]
  • “Bacteriophage trigger antiviral immunity and prevent clearance of bacterial infection,” published in Science in 2019 [link]
  • “We Might Absorb Billions of Viruses Every Day,” published in The Atlantic in 2017 [link]

Week 12: Human Evolution and Disease

  • “Signatures of Environmental Genetic Adaptation Pinpoint Pathogens as the Main Selective Pressure through Human Evolution,” published in PLOS Genetics in 2011 [link]
  • “Natural selection contributed to immunological differences between hunter-gatherers and agriculturalists,” published in Nature Ecology and Evolution in 2019 [link]
  • “How Viruses Infiltrated Our DNA and Supercharged Our Immune System,” published in The Atlantic in 2016 [link]
  • “Migrating microbes: what pathogens can tell us about population movements and human evolution,” published in Annals of Human Biology in 2017 [link]

Plasmodium falciparum and Sickle Cell:

  • “How Malaria Has Affected the Human Genome and What Human Genetics Can Teach Us about Malaria,” published in The American Journal of Human Genetics in 2005 [link]
  • “Sickle-cell mystery solved,” Nature News in 2011 [link]
  • “Hemoglobins S and C Interfere with Actin Remodeling in Plasmodium falciparum–Infected Erythrocytes,” published in Science in 2011 [link]

The Evolution of the Placenta:

  • “The Viruses That Made Us Human,” published by PBS in 2016 [link]
  • “Retroviruses and the Placenta,” published in Current Biology in 2012 [link]
  • “The placenta goes viral: Retroviruses control gene expression in pregnancy,” published in PLOS Biology in 2018 [link]

Microbial Inhabitants and Infectious Agents of the Human Body

Overview of common viruses, bacteria, and eukaryotes, pathogenic and not, and a history of disease outbreaks.

This class is an introduction to our neighbors in the human body: common viruses, bacteria, and eukaryotes—helpful, neutral, pathogenic, or some combination of the three—that we share our bodies and our lives with, and which have profound impacts on both.

At the end of this course, you should have a broad understanding of the kinds of microbes that live in the human body and how they affect our health. You should also have a perspective and opinion on disease outbreaks throughout history, and the lessons we have hopefully learned from them. Finally, you should be able to critically read primary literature and use it to contribute to the broad conversation about human health in both speech and writing.

We meet on Mondays and Wednesdays. On Mondays, this is a lecture class, covering the texts and the topics listed below. On Wednesdays, this is a fast-paced discussion-based class. Every Wednesday meeting starts with a prescribed question, then progresses to your questions, switching topics at any ≥30-second lull in conversation.

The first week, I would like you to read all five papers. Every week after, you are responsible for reading at least two of the provided texts closely enough to be an expert, and for skimming or lightly reading at least three of the others to whatever extent is necessary for you to be able to respond to arguments and carry on intelligent conversation. In both cases, you are expected to go beyond what we cover in the Monday lecture. Come to class on Wednesday with at least three unique and interesting questions about the text(s) you choose to focus on or their implications to discuss with your colleagues. Highlighted texts are strongly recommended.

This class is a safe space. Please feel welcome to share your questions, thoughts, and opinions, even ones that seem “dumb” or “wrong.” We will work through them with empathy together as a class. To enable this atmosphere, please approach debate and discussion with empathy and enthusiasm, and remember that we are growing together and through each other. One of my favorite professors in undergrad started the semester by distinguishing between “uncomfortable” and “unsafe.” Fruitful discussion and growth can, at times, feel uncomfortable. If at any point this class makes you feel unsafe, let me know.

In lieu of a final exam, you will choose your favorite question proposed by a classmate (not by me and not by you) and write a 500- to 1,500-word response to it drawing from the texts and from class discussion. I will compile all responses into one anonymized document, and you will choose at least three classmates’ thoughts to respond to in generous tweet-length (≤250 words each).

We include both scientific papers and publications from other media. I hope that every week, we will have a balance of experts in all texts in all formats, and that we start every new week more knowledgeable and thoughtful than we were the week before.

Introduction

Week 1: A Bird’s-Eye View

  • “Introduction to Pathogens,” from Molecular Biology of the Cell, published in 2002 [link]
  • “Cell Biology of Infection,” from Molecular Biology of the Cell, published in 2002 [link]
  • “Visualizing the History of Pandemics,” published in Visual Capitalist on March 14, 2020 [link]
  • “The Microbiome and Human Biology,” published in Annual Review of Genomics and Human Genetics in 2017 [link]
  • “Highlights from studies on the gut microbiome,” published in Nature Outlook in January 2020 [link]

Neutral or Helpful Inhabitants

Week 2: The Microbiome, and Occasionally Helpful Parasites

The Microbiome:

  • “Man and the Microbiome: A New Theory of Everything?” published in Annual Review of Clinical Psychology in 2019 [link]
  • “No Vacancy: How beneficial microbes cooperate with immunity to provide colonization resistance to pathogens,” published in The Journal of Immunology in 2015 [link]
  • “When Poop Becomes Medicine,” published in The Atlantic in 2018 [link]
  • “A Probiotic Skin Cream Made With a Person’s Own Microbes,” published in The Atlantic in 2017 [link]
  • “The Hottest New Cancer Drugs Depend on Gut Microbes,” published in The Atlantic in 2015 [link]
  • “How Bacteria Could Protect Tumors From Anticancer Drugs,” published in The Atlantic in 2017 [link]
  • “A Tiny Tweak to Gut Bacteria Can Extend an Animal’s Life,” published in The Atlantic in 2017 [link]

Parasites:

  • “Friendly foes: The evolution of host protection by a parasite,” published in Evolution Letters in 2017 [link]
  • “Parasites inside your body could be protecting you from disease,” published in The Conversation [link]
  • “Helminth infection, fecundity, and age of first pregnancy in women,” published in Science in 2015 [link]

Week 3: GB Virus C, a Helpful Virus

  • “GB virus C: the good boy virus?” published in Trends in Microbiology in 2012 [link]
  • “Effect of early and late GB virus C viraemia on survival of HIV-infected individuals: a meta-analysis,” published in HIV Medicine in 2006 [link]
  • “GBV-C/HIV-1 coinfection is associated with low HIV-1 viral load and high CD4+ T lymphocyte count,” published in Archives of Virology in 2017 [link]
  • “Pegivirus avoids immune recognition but does not attenuate acute-phase disease in a macaque model of HIV infection,” published in PLOS Pathogens in 2017 [link]
  • “Fighting the Public Health Burden of AIDS With the Human Pegivirus,” published in American Journal of Epidemiology in May 2019 [link]
  • “GB Virus C Coinfections in West African Ebola Patients,” published in Journal of Virology in 2015 [link]

Harmful Inhabitants

Week 4: The Common Cold and Influenza (and why they won’t go away)

The Common Cold:

  • “Rhinoviruses,” Chapter 238 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “Human Coronaviruses,” Chapter 222 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “Adenoviruses,” Chapter 210 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “The Economic Burden of Non–Influenza-Related Viral Respiratory Tract Infection in the United States,” published in Archives of Internal Medicine in 2013 [link]
  • “Why Haven’t We Cured the Common Cold Yet?” published in Scientific American in 2018 [link]

Curing the Common Cold:

  • “Scientists think the common cold may at last be beatable,” published in STAT in 2016 [link]
  • “A polyvalent inactivated rhinovirus vaccine is broadly immunogenic in rhesus macaques,” published in Nature Communications in 2016 [link]
  • “Scientists close in on a cure for the common cold,” published in Stanford Medicine Scope in 2019 [link]
  • “Enterovirus pathogenesis requires the host methyltransferase SETD3,” published in Nature Microbiology in 2019 [link]

Influenza:

  • “Influenza Viruses,” Chapter 229 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “Influenza Historic Timeline,” CDC [link]
  • “Estimating Vaccine-Driven Selection in Seasonal Influenza,” published in Viruses in 2018 [link]
  • “Within-Host Evolution of Human Influenza Virus,” published in Trends in Microbiology in 2018 [link]
  • “Global migration of influenza A viruses in swine,” published in Nature Communications in 2014 [link]

The 1918 Spanish Flu:

  • “The Deadliest Flu: The Complete Story of the Discovery and Reconstruction of the 1918 Pandemic Virus,” CDC [link]
  • “Public health interventions and epidemic intensity during the 1918 influenza pandemic,” published in PNAS in 2007 [link]

Week 5: Historical Illness

Bubonic Plague (Black Death) and The Plague of Justinian:

  • “Justinian’s Plague (541-542 CE),” Ancient History Encyclopedia [link]
  • “Black Death,” History.com [link]
  • “Plague genome: The Black Death decoded,” Nature News 2011 [link]
  • “Yersinia pestis and the plague of Justinian 541-543 AD: a genomic analysis,” published in 2014 [link]
  • “A draft genome of Yersinia pestis from victims of the Black Death,” published in Nature in 2011 [link]

Smallpox:

  • “A time transect of exomes from a Native American population before and after European contact,” published in Nature in 2016 [link]
  • “How Europeans brought sickness to the New World,” Science News 2015 [link]

Typhoid Mary:

  • “Mary Mallon (1869-1938) and the history of typhoid fever,” published in the Annals of Gastroenterology in 2013 [link]
  • “Typhoid Mary’s tragic tale exposed the health impacts of ‘super-spreaders’,” published in National Geographic in March 2020 [link]
  • “A Life in Pursuit of Health,” about Josephine Baker, published in The New York Times in 2013 [link]

And a Few Other Superspreaders:

  • “Extensive Transmission of Mycobacterium tuberculosis from a Child,” published in The New England Journal of Medicine in 1999 [link]
  • “Party Zero: How a Soirée in Connecticut Became a ‘Super Spreader,’” published in The New York Times on March 23, 2020 [link]

Week 6: Plasmodium/Malaria

  • “Plasmodium Species (Malaria),” Chapter 271 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “About Malaria,” CDC, especially “FAQs” [link], “Disease” [link], “Biology” [link], “Where Malaria Occurs” [link], and “Malaria’s Impact Worldwide” [link]
  • “The History of Malaria, an Ancient Disease,” by the CDC [link]
  • “Greater political commitment needed to eliminate malaria,” published in Infectious Diseases of Poverty in 2019 [link]
  • “Malaria Genomics in the Era of Eradication,” published in Cold Spring Harbor Perspectives in Medicine in 2017 [link]

How Malaria Spread to Humans and Around the World:

  • “Resurrection of the ancestral RH5 invasion ligand provides a molecular explanation for the origin of P. falciparum malaria in humans,” published in PLOS Biology in 2019 [link]
  • “Human migration and the spread of malaria parasites to the New World,” published in Nature in 2018 [link]

Acquired Immunity:

  • “Quantification of anti-parasite and anti-disease immunity to malaria as a function of age and exposure,” published in eLife in 2018 [link]
  • “Malaria: Age, exposure and immunity,” in eLife as an Insight, 2018 [link]
  • “Host-mediated selection impacts the diversity of Plasmodium falciparum antigens within infections,” published in Nature Communications in 2018 [link]

Week 7: Hepatitis A

  • “Hepatitis A Virus,” Chapter 237 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]
  • “Widespread outbreaks of hepatitis A across the United States,” CDC, March 2020 [link]
  • “Increase in Hepatitis A Virus Infections – United States, 2013-2018,” CDC, 2019 [link]
  • “Summary of reported hepatitis A cases linked to person-to-person outbreak, Massachusetts, April 1, 2018-March 6, 2020,” MA DPH [link]
  • “Forgotten but Not Gone: Learning From the Hepatitis A Outbreak and Public Health Response in San Diego,” published in Topics in Antiviral Medicine in 2019 [link]
  • “Molecular Genotyping of Hepatitis A Virus, California, USA, 2017–2018,” published in Emerging Infectious Diseases in 2019 [link]
  • “Emergence of Hepatitis A Virus Genotype IIIA during an Unprecedented Outbreak in New Hampshire, 2018-2019,” unpublished

Bathroom Access:

  • “An outbreak waiting to happen: Hepatitis A marches through San Diego’s homeless community,” published in STAT in 2017 [link]
  • “After crackdown on tent city, homeless recount Hepatitis horror stories,” published in the San Diego Union-Tribune in 2017 [link]
  • “Hepatitis A outbreak sparks call for L.A. to give homeless people more street toilets,” published in The Los Angeles Times in 2017 [link]
  • “The Politics of Going to the Bathroom,” published in The Nation in 2019 [link]

Herd Immunity and Co-Infections:

  • “Notes from the Field: Acute Hepatitis A Virus Infection Among Previously Vaccinated Persons with HIV Infection – Tennessee, 2018,” CDC, 2019 [link]
  • “Herd Immunity Likely Protected the Men Who Have Sex With Men in the Recent Hepatitis A Outbreak in San Diego, California,” published in Clinical Infectious Diseases in 2019 [link]

Week 8: HIV/AIDS

The Virus:

  • “Introduction to Retroviridae” Chapter 231 [link] and “Human Immunodeficiency Virus” Chapter 233 [link] of Principles and Practice of Pediatric Infectious Diseases, 2018

History:

  • “HIV epidemiology. The early spread and epidemic ignition of HIV-1 in human populations,” published in Science in 2014 [link]
  • “Origins of HIV and the AIDS Pandemic,” published in Cold Spring Harbor Perspectives in Medicine in 2011 [link]
  • “Response to the AIDS Pandemic—A Global Health Model,” published in The New England Journal of Medicine in 2013 [link]
  • “The Reagan Administration’s Unearthed Response to the AIDS Crisis is Chilling,” published in Vanity Fair in 2015 [link]
  • “How the Media, the White House, and Everyone Else Failed AIDS Victims in the 80s,” published in VICE in 2016 [link]
  • “Long-term survivors of HIV/AIDS reflect on what they’ve witnessed and endured,” published on PBS in February 2020 [link]

HIV/AIDS today:

  • “Today’s HIV/AIDS Epidemic,” CDC Fact Sheet published in 2016 [link]
  • “Ending AIDS? These three places show the epidemic is far from over,” published in Science News in 2018 [link]

Curing HIV:

  • “Loss and Recovery of Genetic Diversity in Adapting Populations of HIV,” published in PLOS Genetics in 2014 [link]
  • “Second person cured of HIV is still free of active virus two years on,” published in CNN on March 11, 2020 [link]
  • “Evidence for HIV-1 cure after CCR5Δ32/Δ32 allogeneic haemopoietic stem-cell transplantation 30 months post analytical treatment interruption: a case report,” published in The Lancet on March 10, 2020 [link]
  • “Sequential LASER ART and CRISPR Treatments Eliminate HIV-1 in a Subset of Infected Humanized Mice,” published in Nature Communications in 2019 [link]

Week 9: Viral Hemorrhagic Fevers: Ebola and Lassa

  • “Filoviruses and Arenaviruses,” Chapter 230 of Principles and Practice of Pediatric Infectious Diseases, 2018 [link]

Lessons from sequencing Ebola and Lassa:

  • “An Outbreak of Ebola Virus Disease in the Lassa Fever Zone,” published in The Journal of Infectious Diseases in 2016 [link]
  • “Clinical Sequencing Uncovers Origins and Evolution of Lassa Virus,” published in Cell in 2015 [link]
  • “Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak,” published in Science in 2014 [link]
  • “Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone,” published in Cell in 2015 [link]
  • “Ebola Virus Epidemiology and Evolution in Nigeria,” published in The Journal of Infectious Diseases in 2016 [link]
  • “Temporal and spatial analysis of the 2014–2015 Ebola virus outbreak in West Africa,” published in Nature in 2015 [link]
  • “Rapid outbreak sequencing of Ebola virus in Sierra Leone identifies transmission chains linked to sporadic cases,” published in Virus Evolution in 2016 [link]
  • “The evolution of Ebola virus: Insights from the 2013–2016 epidemic,” published in Nature in 2016 [link]

Ebola adaptations to host:

  • “Virus genomes reveal factors that spread and sustained the Ebola epidemic,” published in Nature in 2017 [link]
  • “Ebola Virus Glycoprotein with Increased Infectivity Dominated the 2013-2016 Epidemic,” published in Cell in 2016 [link]

Week 10: Genomic Epidemiology and Modern Outbreak Response

  • “Tracking virus outbreaks in the twenty-first century,” published in Nature Microbiology in January 2020 [link]
  • “Precision epidemiology for infectious disease control,” published in Nature Medicine in 2019 [link]
  • “Real-time digital pathogen surveillance — the time is now,” published in Genome Biology in 2015 [link]

Ebola:

  • “Knowledge of Ebola is the weapon to fight it,” published in The Boston Globe in 2014 [link]
  • “Roots, Not Parachutes: Research Collaborations Combat Outbreaks,” published in Cell in 2016 [link]
  • “Lessons from Ebola: Improving infectious disease surveillance to inform outbreak management,” published in Science Translational Medicine in 2015 [link]

Zika and mumps:

  • “Combining genomics and epidemiology to track mumps virus transmission in the United States,” published in PLoS Biology in February 2020 [link]
  • “Zika virus evolution and spread in the Americas,” published in Nature in 2017 [link]
  • “Genomic epidemiology reveals multiple introductions of Zika virus into the United States,” published in Nature in 2017 [link]

Week 11: Difficult Decisions and a Case Study in Progress: Coronavirus Outbreak Response

Genomic research:

  • “Data Sharing and Open Source Software Help Combat Covid-19,” published in WIRED on March 13, 2020 [link]
  • “Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China,” published in Cell on March 11, 2020 [link]
  • “Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak,” to be published in Cell in March 2020 [link]
  • “The proximal origin of SARS-CoV-2,” published in Nature Medicine on March 17, 2020 [link]
  • “Why the Coronavirus Has Been So Successful,” published in The Atlantic on March 20, 2020 [link]

Social measures against disease spread:

  • “Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand,” published by the Imperial College COVID-19 Response Team on March 16, 2020 [link]
  • “Review of Ferguson et al ‘Impact of non-pharmaceutical interventions…’” published by New England Complex Systems Institute on March 17, 2020 [link]
  • “The Korean Clusters,” published in Reuters on March 3, 2020 [link]
  • “The U.K.’s Coronavirus ‘Herd Immunity’ Debacle,” published in The Atlantic on March 16, 2020 [link]

Governmental and organizational outbreak response; economic impact and tradeoffs:

  • “The 4 Key Reasons the U.S. Is So Behind on Coronavirus Testing,” published in The Atlantic on March 13, 2020 [link]
  • “You’re Likely to Get the Coronavirus,” published in The Atlantic in February 2020 [link]
  • “A fiasco in the making? As the coronavirus pandemic takes hold, we are making decisions without reliable data,” published in STAT on March 17, 2020 [link]
  • The Daily podcast:
    • “Why the U.S. Wasn’t Ready for the Coronavirus” on March 11, 2020 [link]
    • “Learning to Live with the Coronavirus” on March 13, 2020 [link]
    • “Why This Recession Will Be Different” on March 16, 2020 [link]
    • “’It’s Like a War’” on March 17, 2020 [link]

Week 12: Disease Surveillance in the Age of Surveillance

Influenza:

  • “nextflu: real-time tracking of seasonal influenza virus evolution in humans,” published in Bioinformatics in 2015 [link]
  • “Flu Near You: Crowdsourced Symptom Reporting Spanning 2 Influenza Seasons,” published in American Journal of Public Health in 2015 [link]
  • “Comparison of crowd-sourced, electronic health records based, and traditional health-care based influenza-tracking systems at multiple spatial resolutions in the United States of America,” published in BMC Infectious Diseases in 2018 [link]

Coronavirus:

  • “This is how the CDC is trying to forecast coronavirus’s spread,” published in MIT Technology Review on March 13, 2020 [link]
  • “We’re not going back to normal,” published in MIT Technology Review on March 17, 2020 [link]
  • “Singapore is the model for how to handle the coronavirus,” published in MIT Technology Review on March 12, 2020 [link]
  • “To Track Coronavirus, Israel Moves to Tap Secret Trove of Cellphone Data,” published in The New York Times on March 16, 2020 [link]

Introduction to Data Analysis Methods for Biological Inference

Seminar on experimental design, modeling, working with multiple variables, wrangling messy data, genomic sequencing, and popular techniques and tools in computational biology.

This class is an introduction to some of the tools of computational biology. We will look at statistical tests and learn how to disentangle the effects of multiple variables. We will learn how to do genome-wide association studies and principal component analysis. We will learn about how genomic sequencing works, and look at how it can be used for diagnosis or discovery of novel organisms. Finally, we will learn how to use genomic sequencing to trace disease transmission. By the end of this course, you should have the tools you need to analyze your own or publicly available data.

We meet on Tuesdays and Fridays. Tuesdays are lectures on the topics and texts listed. Highlighted texts are strongly recommended. On Fridays, we meet for an extended workshop to apply the week’s tools to publicly available data or to data that you bring with you to class (except in Week 7, when we will generate new sequence data). Before every Friday, you are responsible for writing a short proposal for the week, including what dataset you plan to analyze, what tools you plan to use for what analyses, and any hypotheses you have (≤500 words). At the end of the semester, you will choose whichever workshop was most interesting or successful for you to extend into a short final project, which you can work on alone or in a group. On the last Friday of class we will go around the room and briefly summarize our analyses and findings in an informal setting over snacks.

Week 1: Experimental Design, Statistical Tests, Data Visualization

Experimental Design:

  • “Experimental Design,” Chapter 7 of MIT’s 6.S085 Statistics for Research Projects course notes [link]

Statistical Tests, from the Handbook of Biological Statistics, 2014:

  • “Basic concepts of hypothesis testing” [link]
  • “Confounding variables” [link]
  • Common Assumptions:
    • “Normality” [link]
    • “Homoscedasticity and heteroscedasticity” [link]
    • “Data transformations” [link]
  • “Choosing the right test” [link], with focus on:
    • “Fisher’s exact test of independence” [link]
    • “Chi-square test of independence” [link]
    • “Student’s t-test for one sample” [link]
    • “Student’s t-test for two samples” [link]
    • “One-way anova” [link]
    • “Nested anova” [link]
    • “Two-way anova” [link]
    • “Paired t-test” [link]
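Every test in the list above ships with base R, so you can try them as soon as you have the readings under your belt. Here is a minimal sketch using R's built-in example datasets (the 2×2 table is made-up illustrative data):

```r
# Two-sample t-test: compare a numeric outcome between two groups
data(sleep)                       # built-in dataset: extra sleep by drug group
t.test(extra ~ group, data = sleep)

# Paired t-test: the same subjects appear in both groups
t.test(extra ~ group, data = sleep, paired = TRUE)

# Chi-square and Fisher's exact tests of independence on a 2x2 table
tbl <- matrix(c(12, 5, 7, 16), nrow = 2)  # made-up counts for illustration
chisq.test(tbl)
fisher.test(tbl)

# One-way and two-way ANOVA
data(ToothGrowth)
summary(aov(len ~ supp, data = ToothGrowth))        # one-way
summary(aov(len ~ supp * dose, data = ToothGrowth)) # two-way
```

Run each line in the RStudio console and compare the output against the assumptions described in the Handbook chapters (normality, homoscedasticity) before trusting the p-values.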

Data Visualization:

  • “Data to Ink Ratio (Tufte principle of Data Visualisation),” on YouTube [link]
  • “Basic Design Principles,” on YouTube [link]
  • “Visualization of multiple alignments, phylogenies and gene family evolution,” published in Nature Methods in 2010 [link]

Notes on P-Values:

  • “The fickle P value generates irreproducible results,” published in Nature Methods in 2015 [link]
  • “Aligning statistical and scientific reasoning,” published in Science in 2016 [link]
  • “Measurement error and the replication crisis,” published in Science in 2017 [link]

Week 2: Modeling the Effects of a Single or Multiple Variables: Part I

Regression and Other Stories (to be published in 2020):

  • Chapter 5: “Background on regression modeling”
  • Chapter 6: “Linear regression with a single predictor”
  • Chapter 8: “Linear regression with multiple predictors”
  • Chapter 9: “Transformations and model building”

Week 3: Modeling the Effects of a Single or Multiple Variables: Part II

Regression and Other Stories (to be published in 2020):

  • Chapter 10: “Logistic regression”
  • Chapter 11: “Generalized linear models”
  • Chapter 14: “Missing-data imputation”
  • Chapter 15: “Using, evaluating, and comparing models”
  • Appendix A: “Six quick tips to improve your regression modeling”
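The modeling workflow these chapters describe maps directly onto R's `lm()` and `glm()`. A minimal sketch with the built-in `mtcars` dataset (the variable choices are illustrative, not drawn from the book):

```r
# Linear regression with a single predictor
fit1 <- lm(mpg ~ wt, data = mtcars)
summary(fit1)

# Linear regression with multiple predictors
fit2 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
summary(fit2)

# Logistic regression: a generalized linear model with a binomial link
fit3 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit3)

# Comparing models: AIC rewards fit but penalizes extra parameters
AIC(fit1, fit2)
```

The `summary()` output is where the Week 1 material on p-values resurfaces: read the coefficient estimates and their uncertainty together, not the stars alone.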

Week 4: Genome-Wide Association Studies, Part I

GWAS in Action:

  • “10 Years of GWAS Discovery: Biology, Function, and Translation,” published in The American Journal of Human Genetics in 2017 [link]
  • “Benefits and limitations of genome-wide association studies,” published in Nature Reviews Genetics in 2019 [link]

Understanding and Using GWAS:

  • “Microarrays – DNA Chips,” 2017 [link] and “DNA Microarray,” 2012 [link]
  • “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses,” published in The American Journal of Human Genetics in 2007 [link]
  • “A PLINK tutorial” [link]
  • “Methods and Tools in Genome-wide Association Studies,” Chapter 5 of Computational Cell Biology, 2018 [link]

Week 5: Genome-Wide Association Studies, Part II

  • “Population genetics and GWAS: A primer,” published in PLOS Biology in 2018 [link]
  • From Principles of Population Genetics, 2007:
    • Chapter 9.1: “Evolution of Genome Size and Composition”
    • Chapter 9.2: “Genome-Wide Patterns of Polymorphism”
    • Chapter 9.3: “Differences Between Species”
    • Chapter 10.1: “Human Polymorphism”
    • Chapter 10.2: “Population Genetic Inferences from Human SNPs”
    • Chapter 2.5: “Linkage and Linkage Disequilibrium”
    • Chapter 2.6: “Causes of Linkage Disequilibrium”
    • Chapter 10.3: “Linkage Disequilibrium across the Human Genome”
    • Chapter 10.7: “Seeking Signatures of Human-Specific Genetic Adaptations”

Week 6: Principal Component Analysis (PCA)

PCA in Action:

  • “Genes mirror geography within Europe,” published in Nature in 2008 [link]
  • “Spatial population genomics of the brown rat (Rattus norvegicus) in New York City,” published in Molecular Ecology in 2018 [link]
  • “Urban rat races: spatial population genomics of brown rats (Rattus norvegicus) compared across multiple cities,” published in Proceedings of the Royal Society B: Biological Sciences in 2018 [link]

Understanding and Using PCA:

  • “PCA in R Using FactoMineR: Quick Scripts and Videos,” 2017 [link]
  • “A Step by Step Explanation of Principal Component Analysis,” 2019 [link]
  • “PCA: A Practical Guide to Principal Component Analysis in R & Python,” 2016 [link]
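Before reaching for FactoMineR, note that base R's `prcomp()` covers the core workflow the tutorials above walk through. A minimal sketch using the built-in `iris` dataset:

```r
# PCA with base R's prcomp(); scaling puts variables on comparable footing
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
head(pca$x)         # the samples projected onto the principal components

# Quick look at the first two components, colored by species
plot(pca$x[, 1], pca$x[, 2], col = iris$Species,
     xlab = "PC1", ylab = "PC2")
```

This is the same maneuver behind “Genes mirror geography”: project high-dimensional genotype data down to a few components and see what structure falls out.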

Week 7: DNA and RNA Sequencing

  • “Illumina Sequencing by Synthesis,” 2016 [link]
  • “DNA sequencing at 40: past, present and future,” published in Nature in 2017 [link]
  • “Timeline: History of genomics” [link]
  • “The sequence of sequencers: The history of sequencing DNA,” published in Genomics in 2016 [link]
  • “The future of DNA sequencing,” published in Nature as a Comment in 2017 [link]

Low-Resource Settings:

  • “Real-time, portable genome sequencing for Ebola surveillance,” published in Nature in 2016 [link]
  • “Fighting Ebola With a Palm-Sized DNA Sequencer,” published in The Atlantic in 2015 [link]

Long-Read Sequencing:

  • “Long-read sequencing for rare human genetic diseases,” published in Journal of Human Genetics in 2019 [link]
  • “Multiple Long-Read Sequencing Survey of Herpes Simplex Virus Dynamic Transcriptome,” published in Frontiers in Genetics in 2019 [link]
  • “Direct sequencing of RNA with MinION Nanopore: detecting mutations based on associations,” published in Nucleic Acids Research in 2019 [link]

Week 8: Genome Assembly and Alignment

Genome Assembly:

  • “De novo genome assembly: what every biologist should know,” Technology Feature published in Nature Methods in 2012 [link]
  • “Assembly Information: A primer on genome assembly methods,” NCBI [link]
  • “Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing,” published in mBio in 2014 [link]
  • “Opportunities and challenges in long-read sequencing data analysis,” published in Genome Biology in February 2020 [link]

Genome Alignment and Other Tools:

  • “Basic Local Alignment Search Tool,” published in Journal of Molecular Biology in 1990 [link]
  • “Bioinformatics explained: BLAST,” 2007 [link]
  • A list of all NCBI resources [link]
  • NCBI documentation [link]

Week 9: Metagenomic Sequencing

Metagenomic Sequencing Tools:

  • “MEGAN analysis of metagenomic data,” published in Genome Research in 2007 [link]
  • “Kraken: ultrafast metagenomic sequence classification using exact alignments,” published in Genome Biology in 2014 [link]
  • “Benchmarking Metagenomics Tools for Taxonomic Classification,” published in Cell in 2019 [link]
  • “Capturing sequence diversity in metagenomes with comprehensive and scalable probe design,” published in Nature Biotechnology in 2019 [link]

Metagenomic Sequencing for Diagnosis:

  • “Diagnostic Testing in Central Nervous System Infection,” published in Seminars in Neurology in 2019 [link]
  • “Rapid Detection of Powassan Virus in a Patient With Encephalitis by Metagenomic Sequencing,” published in Clinical Infectious Diseases in 2018 [link]
  • “Current Trends in Diagnostics of Viral Infections of Unknown Etiology,” published in Viruses in February 2020 [link]

Week 10: Novel Organism Discovery

  • “Using Metagenomics to Characterize an Expanding Virosphere,” published in Cell in 2018 [link]
  • “Redefining the invertebrate RNA virosphere,” published in Nature in 2016 [link]
  • “The evolutionary history of vertebrate RNA viruses,” published in Nature in 2018 [link]
  • “Discovery of Novel Rhabdoviruses in the Blood of Healthy Individuals from West Africa,” published in PLOS Neglected Tropical Diseases in 2015 [link]
  • “Discovering viral genomes in human metagenomic data by predicting unknown protein families,” published in Nature in 2018 [link]
  • “Hiding in plain sight: New virus genomes discovered via a systematic analysis of fungal public transcriptomes,” published in PLOS ONE in 2019 [link]
  • “Welcome to the Virosphere,” published in The New York Times on March 24, 2020 [link]

Week 11: Phylogeny of Disease Transmission and Genomic Epidemiology: Part I

Phylogeny in theory:

  • Principles of Population Genetics, 2007:
    • Chapter 3.6: “Gene Trees and Coalescence”
    • Chapter 7.3: “The Molecular Clock”
    • Chapter 7.6: “Gene Genealogies”
    • Chapter 7.8: “Molecular Phylogenetics”
    • Chapter 7.9: “Multigene Families”
  • “Viral Phylodynamics,” published in PLOS Computational Biology in 2013 [link]

BEAST in action:

  • “How to read a phylogenetic tree,” a tutorial [link]
  • “BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis,” published in PLOS Computational Biology in 2019 [link]
  • “Phylogenetic analysis of nCoV-2019 genomes,” posted on virological.org on March 6, 2020 [link]

Week 12: Phylogeny of Disease Transmission and Genomic Epidemiology: Part II

  • “Real-Time Analysis and Visualization of Pathogen Sequence Data,” published in Journal of Clinical Microbiology in 2018 [link]
  • “Using genomics data to reconstruct transmission trees during disease outbreaks,” published in Scientific and Technical Review in 2016 [link]
  • “The ability of single genes vs full genomes to resolve time and space in outbreak analysis,” published in BMC Evolutionary Biology in 2019 [link]
  • “Predictive Modeling of Influenza Shows the Promise of Applied Evolutionary Biology,” published in Trends in Microbiology in 2018 [link]
  • “Eight challenges in phylodynamic inference,” published in Epidemics in 2014 [link]