Getting started running viral-ngs workflows in Terra for no-code, scalable sequence analysis

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park and Kat DeRuff in our lab for pointing me to the excellent resources that are linked here.

Contents

This document should take less than an hour. (If you would like, you can dedicate a day to become a true expert—but I didn’t, and you too will be perfectly fine if you don’t.)

Background

In our lab, we mostly analyze short-read sequences of RNA viruses, though some of our methods pick up lots of other things too. The samples we work with are occasionally from other sources, like ticks, but are usually from humans—as a lab, we’ve sequenced everything from plasma to nasal swabs to brains. Because most of the genetic material in the sample is going to belong to the host (and we almost never use and usually aren’t even allowed to collect the host’s genetic material), the host genome and all other DNA in the sample are removed before sequencing through DNA depletion. The RNA in the sample is then reverse transcribed to DNA, which is then sequenced either by us or by the Broad Genomics Platform. Members of the lab have explored long-read sequencing, and we can also access Sanger sequencing, but nearly all of the sequencing we do generates short reads.

Historically, most of the sequencing our lab has done has been metagenomic sequencing. Metagenomic sequencing is unbiased—it will pick up everything in the sample, and what you see in the reads produced by sequencing is more or less what you had in your sample, including relative proportions of the species or variants that you see.

Starting with the Zika project published in 2017, our lab has also used amplicon sequencing. In amplicon sequencing, primers based on known sequence from the target organism’s genome are used to amplify molecules from the target organism using PCR. Unlike metagenomic sequencing, amplicon sequencing is not unbiased: if everything goes right, the only thing sequenced is the target organism. The Zika team used amplicon sequencing to amplify Zika because Zika appeared in very small quantities and was difficult to sequence. Now, amplicon sequencing is also used as a cost-saving measure, since money is not spent sequencing anything other than the target organism: almost all SARS-CoV-2 sequencing, for example, is amplicon sequencing. The SARS-CoV-2 genome is a lot longer than the reads generated by short-read sequencing; therefore, many primer pairs are tiled across the genome to generate many short, overlapping sequences that can then be assembled into the whole SARS-CoV-2 genome.

As a compromise between unbiased metagenomic sequencing and biased amplicon sequencing, our lab also uses hybrid capture, which uses capture probes based on conserved regions from a predetermined set of all our human viruses of interest. For the organisms included in the hybrid capture probe set (as long as the genomes in the sample are not too diverged from the probe sequences), the method picks up everything; anything else, it gets rid of. Like amplicon sequencing, hybrid capture can allow us to sequence organisms that are present in very small quantities.

To save money, we sometimes pool samples by barcoding them, a process in which molecules from a sample are marked with barcodes unique to that sample, allowing us to mix multiple samples together and sequence them for the price of one. Barcoding also decreases contamination by allowing us to identify reads that did not come from the sample we are analyzing. We also spike samples with artificial spike-in molecules to help us detect contamination.

The actual output of the sequencer is not human-readable, and it also needs to go through a lot of work before it is ready for analysis. Reads from spike-ins need to be identified and catalogued. If samples were pooled, they need to be computationally disentangled using the unique barcodes that were used to mark each sample. (Because we often use barcodes even when the samples are not pooled, this process likely still needs to happen.) Barcodes are in the actual sequence of the reads; once they have served their purpose, they need to be computationally removed. The primers used for amplicon sequencing are also in the actual sequence of the reads; they also need to be computationally identified and removed. In our lab, all of this work is automatically done by viral-ngs.

viral-ngs also includes many computations that are necessary for almost all projects, such as aligning reads to a reference genome, assembling genomes from reads using an existing genome as a scaffold, assembling contigs without a scaffold, quantifying and plotting read depth across the genome, identifying within-host variation within a sample, and running tools like kraken or blast to identify, in bulk, the potentially many species a sample’s reads or contigs belong to.

Introduction

Our lab’s sequence analysis toolkit and codebase, called viral-ngs, is run through Terra, a smart, no-code interface for running analyses on any number of samples.

viral-ngs is our lab’s collection of frequently used, modular, generalizable computational workflows. My three favorite workflows are:

  • demux_plus, which takes you directly from the non-human-readable output of your sequencing run to completion of some frequently used analyses and files ready to input into any other analyses;
  • assemble_refbased, reference-based assembly;
  • and align_and_plot, which aligns reads to a reference genome and generates a visualization of coverage across the genome.

Every workflow consists of tasks, which are modular processes that can be put together like Legos. (Each task might appear in multiple workflows.) The tasks are where the action actually happens; workflows simply glue tasks together. Some workflows, such as demux_plus, incorporate many software tools that labs usually run individually, one after the other, for each dataset. Having frequently-used processes and tools compiled into one tool allows us to save time, minimize bugs, adapt to change quickly, and standardize our processes across projects. Many of these tools, such as SAMtools and MAFFT, are created and maintained by other groups; some of the tools and code are created by members of our lab, especially Danny and Chris TT.

All of the viral-ngs workflows are publicly available on the Dockstore, where you can access our workflows as well as workflows from other groups.

Terra is a wrapper around viral-ngs. Terra allows us to run workflows in the cloud through an intuitive interface (instead of, say, the terminal), handles launching the virtual machines that actually run the code, and allows us to run an analysis on many samples at once. Terra doesn’t actually contain any of the viral-ngs code; it just runs it for us. In addition to being easy to use, Terra is smart and cost-saving: if some or all of your workflow has already been run on the inputs you provided, it will swap in the outputs from last time and won’t redo the work.

Terra is especially powerful because of its use of data tables: you can organize your data into a data table with, for example, one row per sample, with each column corresponding to a particular file or value related to your samples (you might have a column for sample readcount, for example, or a column linking to each sample’s assembled genome). Instead of running an analysis on each of your samples individually, you can run that analysis on some or all of the samples in your data table all at once, specifying the inputs by column name, and Terra will add the outputs of the analysis to your data table as new columns.
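For a concrete picture, a sample data table uploaded to Terra as a tab-separated load file might look something like the sketch below. The column names, sample names, and gs:// paths here are made up for illustration; the one real convention is that Terra expects the first column header to follow the entity:<table name>_id pattern:

  entity:sample_id	readcount	assembly_fasta
  sampleA	1200000	gs://my-bucket/assemblies/sampleA.fasta
  sampleB	870000	gs://my-bucket/assemblies/sampleB.fasta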

The actual data lives and is analyzed in the cloud, in the Google Cloud Platform (GCP). Files in GCP have addresses that start with gs://. You can view a file in its home in GCP by clicking on it in Terra. If you ever need to, you can upload your own files to GCP as well, and use them as inputs to workflows on Terra by providing their gs:// addresses or adding the gs:// addresses to data tables. An important heads up: if a file changes but its filepath (the directory it lives in and its name) doesn’t change, Terra will not know that the file changed. Terra itself never modifies files in place, so this usually isn’t a problem, but if you modify a file yourself, make sure to re-upload it with a new filename or filepath.
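If you do end up moving files yourself, the gsutil command-line tool (part of the Google Cloud SDK) understands gs:// addresses. A quick sketch, with a made-up bucket and filenames:

    gsutil cp my_reference.fasta gs://my-workspace-bucket/references/my_reference.fasta
    gsutil ls gs://my-workspace-bucket/references/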

Getting started

Terra is located at terra.bio. Log in with your Broad Google account. You will need to be added to a workspace, or you can create a new workspace (you will need a cost object—I usually ask Danny).

Video walk-throughs

Pathogen Genome Assembly using the Terra Platform

This short, 20-minute training video includes:

  • An introduction to Terra
  • An introduction to the Google Cloud Platform (GCP) where our data lives
  • How to put together data tables to run analyses on many samples
  • How to launch workflows to analyze your data
  • How to find additional workflows in the Dockstore
  • How to access your input and output files in GCP

This video is part of the WARN-ID Genomic Epidemiology Training videos made by our WARN-ID consortium (orchestrated by lab alumnus Shirlee Wohl). The rest of the videos are also excellent.

Terra 101

This much longer, two-day training workshop delves into the above in far more detail. At the end of this workshop, you will be an expert. All slides and materials you might need for this workshop are available here.

Day 1:

Day 2:

This video is part of Introduction to Terra: A scalable platform for biomedical research, a recent DS-I Africa training workshop put together by the Broad Data Science Platform (DSP).

DSP leads introduction to Terra workshops (as well as more advanced workshops and workshops on other topics); there may well be an upcoming training you’re interested in.

More video walk-throughs:

Bonus: writing your own tasks and workflows and running them in Terra

If you would like to branch out from what is available in viral-ngs, this training from Broad DSP goes over how to write your own tasks and workflows in the WDL programming language and run them in Terra. It is far easier than I thought it would be! All slides and materials you might need for this workshop are available here. The full workshop lives here.

Once you are comfortable writing a basic WDL workflow and running it on Terra, you can explore more of WDL’s capabilities through code snippet examples on the OpenWDL github:

Resources

  • All viral-ngs workflows are described and documented in the viral-ngs readthedocs
  • The Terra team has put together guides to running some key viral-ngs workflows in Terra
  • You can view (and contribute to) the code behind the viral-ngs workflows at the viral-pipelines GitHub respository:
  • Terra’s documentation contains many how-to guides and answers to frequently asked questions
  • Once you have Google Cloud Platform file access set up on your computer, you can use my script download_files.pl to download gs:// files in bulk. Simply download the script, open your terminal, type

    perl [script directory/]download_files.pl [file with list of files to download] [optional output directory]

    and hit enter, replacing [script directory/] with the directory containing download_files.pl, for example Downloads/, and replacing [file with list of files to download] with the filepath of a text file containing a list of gs:// addresses of the files you’d like to download, one per line (optionally followed by an output directory). For example:
    perl Downloads/download_files.pl my_fun_directory/my_fun_list_of_files.txt
  • Similarly, you can use my script download_files_listed_in_table_column.pl to download all files listed in a column of a table (for example a data table that you have downloaded from Terra). Simply download the script, open your terminal, type

    perl [script directory/]download_files_listed_in_table_column.pl [table with list of files to download in one of the columns] "[title of column containing filepaths to download]"

    and hit enter, replacing [script directory/] with the directory containing download_files_listed_in_table_column.pl, for example Downloads/, replacing [table with list of files to download in one of the columns] with the filepath of your data table, and replacing [title of column containing filepaths to download] with the name of the column containing the gs:// addresses of the files you’d like to download.
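    For example, with a made-up table and column name:
    perl Downloads/download_files_listed_in_table_column.pl my_fun_directory/my_data_table.tsv "assembly_fasta"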

Getting started using R for data analysis and visualization

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park in our lab for introducing me to R and preempting my confusion about the difference between tables in Excel and R—a lot of this guide was verbally passed on to me from Danny when I first joined the lab; I just wrote it down.

Thank you to Lizzie Wolkovich for teaching me everything I know about modelling.

Contents

This document should take a veritable eternity to work through—but you don’t need to work through all or even most of it to do what you want to do. Take the pieces you need, and when you need more pieces, take those too.

Introduction

Our lab is split pretty evenly between people who use Python for data analysis and visualization and people who use R for data analysis and visualization. I strongly prefer R, because it was made specifically for data analysis, but there is no wrong answer. R is especially powerful because of the added functionality of its libraries, such as dplyr and ggplot2.

If R is about to become your first programming language (!), I recommend starting off with the first part of codecademy’s Learn R course.

Getting started

Using R within Terra

If you use Terra (our lab uses Terra), you can write and run R scripts directly in Terra, and run your R scripts directly on the data you analyze in Terra. You can watch the Terra team’s 15-minute Terra Notebooks Quickstart video to learn how:

Using R on your computer

Download and install R here.

Download and install RStudio here.

RStudio allows you to write and run your code as a full script in its own file, run subsets of your code by simply highlighting the individual lines you are interested in and hitting command-enter, or write and run throwaway code in RStudio’s console.

Here is a short video introduction to RStudio:

You can also follow University of British Columbia STAT 545’s guide: Install R and RStudio.

Packages/libraries

Package and library are terms used pretty much interchangeably in R. You can install a package, in this example a package named plyr, by entering the following command in RStudio’s console with your own package of interest:

install.packages("plyr")

You only need to install a package once. After you have installed it, you can load a package by adding the following line to your script anywhere before the code that will use the package, again using the package plyr as an example:

library(plyr)

You can learn more about packages/libraries in this tutorial.

I have found the following packages helpful:

  • tidyverse: a collection of helpful packages all packaged together—including dplyr, ggplot2, tidyr, and stringr below
  • data visualization:
    • ggplot2: the main package used for data visualization in R
    • ggpubr: customization add-ons for ggplot2
    • scales: customize axes and legends
    • patchwork: put multiple plots together in an aligned arrangement
    • grid and gridExtra: another option for arranging multiple plots into a layout
  • data manipulation and summarization:
  • statistical analysis:
    • binom: binomial confidence intervals
    • broom: summarizes statistical analysis outputs in more readable formats
  • beepr: beeps when your code is done running

Data tables in R

Differences between tables in R and Excel

The thing that most confused me when I started coding in R is that data tables in R are different from the tables I was used to working with in Excel.

If you’re used to working in Excel, you might be accustomed to tables that look like this one. In this table, each row is a patient. The patient name appears in the first column of each row. The rest of the columns belong to species. Each cell contains the number of reads that were mapped to the species (its column) from the patient (its row).

For input into R, this table would look quite different. Every readcount cell in the Excel version of the table gets its own row in the R version of the table.

The R data table format allows you to add more columns to each patient-species pair, which in the Excel table format would require you to make a whole new table. Notice that some of the columns refer to the patient (patient, is_food_handler, collection_date, Ct) while other columns refer to the species (species, is_virus) and other columns refer to the patient-species pair (readcount, genome_covered):

Each column is a variable, and the two terms can be used interchangeably when talking about data tables.
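If you already have the wide, Excel-style table loaded in R, the tidyr package (part of the tidyverse) can reshape it into this long format. Here is a minimal sketch, assuming a data frame named wide_table with a patient column and one readcount column per species:

  library(tidyr)
  long_table <- pivot_longer(wide_table, cols = -patient,
    names_to = "species", values_to = "readcount")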

(This data is clearly from an experiment mixing different sample types from each patient into a single sample.)

My example code will use this table. If you would like to follow along, download it here.

Reading in and saving data tables

To read in a comma-separated table:
  input_table_path <- "/Users/lakras/my_input_file.csv"
  my_table <- read.csv(input_table_path)

or
  my_table <- read.table(input_table_path, sep=",", header=TRUE)

To read in a tab-separated table:
  input_table_path <- "/Users/lakras/my_input_file.tsv"
  my_table <- read.delim(input_table_path)

or
  my_table <- read.table(input_table_path, sep="\t", header=TRUE)

To save a table as a comma-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.csv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep=',',   row.names=FALSE)

To save a table as a tab-separated output table:
  output_table_path <- "/Users/lakras/my_output_file.tsv"
  write.table(my_table, file=output_table_path, quote=FALSE, sep='\t',
    row.names=FALSE)

Viewing and manipulating data tables

Exploring your data table and its columns

You can view or access the full data table by simply entering its name:
  my_table

View a summary of the variables (columns) in your data table, their data types (int, num, chr, factor, logi, Date), and the first few values using the str function:
  str(my_table)

You can view or access the individual variables (columns) of your data table by entering the table name followed by $ followed by the column name:
  my_table$patient
  my_table$species
  my_table$readcount
  my_table$genome_covered
  my_table$is_virus
  my_table$Ct
  my_table$is_food_handler
  my_table$collection_date

You can view all values of a variable (column) and the number of rows with each value using the table function:
  table(my_table$species)

You can view a subset of variables (columns) using the select function (from the dplyr package, which is loaded as part of the tidyverse):
  select(my_table, patient, species, genome_covered)

You can view the first few rows of a data table using the head function or the last few rows using the tail function:
  head(my_table)
  tail(my_table)

Subsetting your data table

You can use the subset function to view a subset of your data table. Perhaps we want to view the subset of our rows related to Hepatitis C:
  subset(my_table, species == "Hepatitis C")

You can save the subset, or set the data table to be a subset of itself. Perhaps we are only interested in keeping the subset of our rows with readcount ≥100:
  my_table <- subset(my_table, readcount >= 100)

Creating new columns

To make a new column with all values set to TRUE:
  my_table$is_human <- TRUE

To make a new column by duplicating another column:
  my_table$readcount_2 <- my_table$readcount

To make a new column by multiplying another column by a scalar:
  my_table$percent_genome_covered <- my_table$genome_covered * 100

Conditional viewing and manipulating column values

You can view values of a column including only those rows that meet a certain condition. For example, you can view all readcounts ≥100:
  my_table$readcount[my_table$readcount >= 100]

You can also view all Ct values from rows with readcount ≥100:
  my_table$Ct[my_table$readcount >= 100]

In addition to viewing column values, you can also modify those column values. For example, you can set all readcounts <100 to NA:
  my_table$readcount[my_table$readcount < 100] <- NA

You can also modify a column that is different from the column you are testing in your condition, and the column you modify does not have to already exist. You can create a new variable with values conditional on another column’s values:
  my_table$passes_threshold[my_table$readcount >= 100] <- TRUE
  my_table$passes_threshold[my_table$readcount < 100] <- FALSE
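As a side note, because a comparison already returns TRUE and FALSE values, a TRUE/FALSE column like this one can also be created in a single line. This is an equivalent sketch, not a different result:

  my_table$passes_threshold <- my_table$readcount >= 100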

Data types

Here are some of the data types you will encounter and use:

  • chr: a text value
  • factor: a category
  • int: an integer (no decimal places)
  • num: a number (with decimal places)
  • logi: TRUE or FALSE (logi is short for logical)
  • Date: a date

The chr data type is a text value. This data type might be useful for text values that are unique to each row, for example patient name, sample name, or hospital address. In some cases, text values (or other kinds of values) might reappear from row to row and be used to group rows together, for example state of residency, presence of coughing, sample type, or patient gender. These values should be factors.

You can modify the data type of your variable.

To make a variable’s values into text values:
  my_table$patient <- as.character(my_table$patient)

To make a variable’s values into categories:
  my_table$species <- as.factor(my_table$species)

To make a variable’s values into integers (no decimal places):
  my_table$readcount <- as.integer(my_table$readcount)

To make a variable’s values into numerical values (with decimal places):
  my_table$Ct <- as.numeric(my_table$Ct)
  my_table$genome_covered <- as.numeric(my_table$genome_covered)

To make a variable’s values into TRUE and FALSE values:
  my_table$is_food_handler <- as.logical(my_table$is_food_handler)
  my_table$is_virus <- as.logical(my_table$is_virus)

To make a variable’s values into dates:
  my_table$collection_date <- as.Date(my_table$collection_date)

Special values

Some special values are NA (value not available), NaN (not a number), Inf (infinity), and -Inf (negative infinity). You will need to keep these values in mind and check for them in many analyses.

You can check if a value (my_value in this case) is one of these values:
  my_value <- 5
  is.na(my_value)
  is.nan(my_value)
  is.infinite(my_value)
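These special values usually appear as the result of a calculation or of missing data. A few quick examples you can try in the console (the results are noted in the comments):

  1/0     # Inf
  -1/0    # -Inf
  0/0     # NaN
  log(0)  # -Inf
  sum(is.na(my_table$Ct))  # how many Ct values are missing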

Statistical analysis

Summary statistics

You can view basic summary statistics for a variable.

Minimum, maximum, and range:
  min(my_table$genome_covered, na.rm=TRUE)
  max(my_table$genome_covered, na.rm=TRUE)
  range(my_table$genome_covered, na.rm=TRUE)

Mean and standard deviation:
  mean(my_table$genome_covered, na.rm=TRUE)
  sd(my_table$genome_covered, na.rm=TRUE)

Median, first quartile (the 25th percentile of your data), and third quartile (the 75th percentile of your data), and interquartile range (the middle 50% of your data, or the distance between the 25th percentile and the 75th percentile):
  median(my_table$genome_covered, na.rm=TRUE)
  quantile(my_table$genome_covered, 1/4, na.rm=TRUE)
  quantile(my_table$genome_covered, 3/4, na.rm=TRUE)
  IQR(my_table$genome_covered, na.rm=TRUE)

na.rm=TRUE excludes NA and NaN values from your calculations of the mean, median, min, max, etc. (there is no separate nan.rm argument; na.rm handles both).
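As a shortcut, the summary function reports the minimum, quartiles, median, mean, and maximum (plus a count of NAs, if there are any) in one call:

  summary(my_table$genome_covered)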

You can generate a table of summary statistics grouped by values of a variable using ddply from the plyr library. For example, we might calculate the minimum, maximum, and mean of readcounts for each species separately, saving them in a new data table we name readcount_summary:

  readcount_summary <- ddply(my_table, ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),mean_readcount=mean(readcount))

The output looks like this:

        species min_readcount max_readcount mean_readcount
  1     E. coli             0          4324       620.4286
  2 Hepatitis C             0          6457       931.7143
  3         HIV             0           865       217.0000
  4 Influenza A             0          4363       677.1429
  5  SARS-CoV-2            76          4532      1289.5714

Or we might calculate the minimum, maximum, and mean of both readcounts and Cts, grouping rows not only by species, but also by whether or not the patient is a food handler and whether or not at least half the genome is covered:

  readcount_summary <- ddply(my_table,
    genome_covered >= 0.50 ~ is_food_handler ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount),
    min_Ct=min(Ct),
    max_Ct=max(Ct),mean_Ct=mean(Ct))

That output looks like this:

     genome_covered >= 0.5 is_food_handler     species min_readcount max_readcount mean_readcount min_Ct max_Ct  mean_Ct
  1                  FALSE           FALSE     E. coli             0          4324         868.60   16.8   27.9 21.64000
  2                  FALSE           FALSE Hepatitis C             0            65          16.25   16.8   27.9 21.72500
  3                  FALSE           FALSE         HIV             0           865         303.80   16.8   27.9 21.64000
  4                  FALSE           FALSE Influenza A             0             0           0.00   16.8   27.9 22.70000
  5                  FALSE           FALSE  SARS-CoV-2            87           876         546.00   21.3   27.9 24.66667
  6                  FALSE            TRUE     E. coli             0             0           0.00   21.5   23.9 22.70000
  7                  FALSE            TRUE Hepatitis C             0             0           0.00   21.5   23.9 22.70000
  8                  FALSE            TRUE         HIV             0             0           0.00   21.5   23.9 22.70000
  9                  FALSE            TRUE Influenza A            65           312         188.50   21.5   23.9 22.70000
  10                 FALSE            TRUE  SARS-CoV-2            76           435         255.50   21.5   23.9 22.70000
  11                  TRUE           FALSE Hepatitis C          6457          6457        6457.00   21.3   21.3 21.30000
  12                  TRUE           FALSE Influenza A          4363          4363        4363.00   17.4   17.4 17.40000
  13                  TRUE           FALSE  SARS-CoV-2          2346          4532        3439.00   16.8   17.4 17.10000

dplyr is extremely powerful, and can do a lot more than summarize. You can learn more about dplyr by exploring these tutorials:
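As a point of comparison, the first ddply summary above can also be written with dplyr’s group_by and summarise. Here is a sketch of the equivalent call, assuming dplyr (or the tidyverse) is loaded:

  library(dplyr)
  readcount_summary <- my_table %>%
    group_by(species) %>%
    summarise(min_readcount=min(readcount),
      max_readcount=max(readcount),
      mean_readcount=mean(readcount))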

Statistical tests

R has many statistical tests built in. Identify the one that is most relevant to your data, determine how to do it in R, and interpret your p-value or other preferred statistical test output. Here is a guide from r-statistics.co.
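For example, using the example table from this guide, you might ask whether Ct values differ between food handlers and everyone else. The sketch below shows the syntax for two common built-in tests; whether either test is appropriate for your data is a separate question, so treat this as syntax rather than statistical advice:

  t.test(Ct ~ is_food_handler, data=my_table)
  wilcox.test(Ct ~ is_food_handler, data=my_table)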

Modelling

Modelling allows you to estimate the strength of a correlation and to compare the individual impacts of multiple variables on an outcome variable.

First, you should get an introduction to selecting the appropriate kind of model and writing the code to generate that model. datacamp.com has an interactive course on modelling in R, in which you practice coding it yourself and it judges your code. If you prefer just reading, r-statistics.co has a guide:

  1. Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
  2. Logistic Regression (for binary outcome variables)
  3. Multinomial Regression (for categorical outcome variables)
  4. Other regression models, if you need them
  5. Model Selection Approaches
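For a first taste of what the code looks like before working through those guides, here is a minimal sketch of a linear model and a logistic model fit to the example table; the choice of predictors is arbitrary and purely for illustration:

  # linear regression: model Ct as a function of readcount and species
  linear_model <- lm(Ct ~ readcount + species, data=my_table)
  summary(linear_model)

  # logistic regression: model a TRUE/FALSE outcome
  logistic_model <- glm(is_food_handler ~ Ct + readcount, family=binomial,
    data=my_table)
  summary(logistic_model)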

Next, you need to understand what your model is telling you. I learned everything I know about modelling from Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. This book is thorough, clear, correct, and readable, unlike most things. The pdf is (legally) available for free online. The following are the guides to interpreting your models that I have bookmarked (most of these sections are around 3-5 pages):

The whole book is excellent. Once you are comfortable selecting and interpreting models, I recommend reading the following chapters:

Data visualization

Scatterplots, bar plots, histograms, and other standard figures

Follow this tutorial from r-statistics.co to learn how to use ggplot2 for data visualization. It takes you through making a scatterplot, then extends what you’ve learned to other kinds of plots:

  1. The Complete ggplot2 Tutorial – Part1 | Introduction To ggplot2
  2. The Complete ggplot2 Tutorial – Part 2 | How To Customize ggplot2
  3. Top 50 ggplot2 Visualizations – The Master List

Additionally, I think you should learn about grouped vs. stacked vs. percent bars in bar plots or histograms, which the above tutorial does not cover.

I have found that in my own work, I most frequently use and encounter scatterplots, histograms, barplots, line plots, and violin plots. 

Here are some references with different kinds of plots you might like to make, explanations, and example code to copy-paste and modify:

Once you have mastered the general principles, you can google any kind of plot or modification to your plot that you would like to make. I have found that almost anything I want to do has at some point been asked about on StackOverflow.

Maps

Here are some resources on plotting colorful maps in R:

R in particular is the best solution for drawing a map with curved directional arrows illustrating transmission events (doing it in Python seems to be much more challenging and doesn’t end up looking as nice). There does not seem to be a good guide for making this kind of figure, but my colleague Gage Moreno found a way to do it using curvedarrow in the diagram package.

Conditionals, loops, and functions

Conditional statements

Conditional statements allow you to automatically run different lines of code depending on what your data looks like. In this example, we check if your minimum collection date is before May 9th; if it is, we subset our data to collection dates after May 10th:

  if(min(my_table$collection_date) < as.Date("2022-05-09"))
  {
    my_table <- subset(my_table, collection_date > as.Date("2022-05-10"))
  }

The above example only executes the code inside the { and } brackets if the condition boils down to TRUE; if the condition is FALSE, your script does nothing. If you’d like, you can add an else statement to make it do something else if the condition is FALSE. In the following example, if our mean readcount is <100, we replace any readcounts less than 100 with NA. If not, we instead move on to the else and replace any readcounts less than 200 with NA:

  if(mean(my_table$readcount) < 100)
  {
    my_table$readcount[my_table$readcount < 100] <- NA
  } else {
    my_table$readcount[my_table$readcount < 200] <- NA
  }

You can chain these conditions together. In this example, we check if the minimum proportion of any genome covered in our table is less than 50%:

  • If it is, we subset our table to rows with is_virus set to TRUE and (&) readcount at least 50 and we exit the chain.
  • If it isn’t, we follow the else to the next statement: we check if the minimum proportion of any genome covered in our table is greater than 70%.
    • If it is, we subset our table to rows with is_virus set to FALSE and (&) readcount at least 100 and we exit the chain.
    • If it isn’t, we follow the else to our next and final statement: we subset our table to rows with readcount at least 100 or (|) at least 50% of the genome covered, and then we set any Ct values greater than 25 to 25.

  if(min(my_table$genome_covered) < 0.50)
  {
    my_table <- subset(my_table, is_virus & readcount >= 50)
  } else if(min(my_table$genome_covered) > 0.70)
  {
    my_table <- subset(my_table, !is_virus & readcount >= 100)
  } else
  {
    my_table <- subset(my_table, readcount >= 100 | genome_covered >= 0.50)
    my_table$Ct[my_table$Ct > 25] <- 25
  }

Scaling up your analyses with loops

Looping through a predetermined list of values

If you would like to automatically do the same analysis or generate the same plot for multiple values, you can use a for-loop. In the following example, we subset the table to include only rows where species is Hepatitis C, HIV, or Influenza A, and save each table to an output file named after the species. We do that by using the variable my_species inside the { and } brackets and setting my_species to Hepatitis C, HIV, and then Influenza A. With each iteration of the loop, my_species takes on the next value and the code inside the { and } brackets is executed with my_species set to that value.

  for(my_species in c("Hepatitis C", "HIV", "Influenza A"))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt", sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Instead of using a for-loop, you could instead copy-paste the code inside the { and } brackets, changing my_species to your species of choice each time:

  my_table_subset <- subset(my_table, species == "Hepatitis C")
  output_table_path <- paste(input_table_path, "_", "Hepatitis C", ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "HIV")
  output_table_path <- paste(input_table_path, "_", "HIV", ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

  my_table_subset <- subset(my_table, species == "Influenza A")
  output_table_path <- paste(input_table_path, "_", "Influenza A", ".txt", sep="")
  write.table(my_table_subset, file=output_table_path, quote=FALSE,
    sep='\t', row.names=FALSE)

This is a lot longer and harder to read—and you’re only copy-pasting three lines of code. If you wanted to add another species of interest (or ten), you would need to copy-paste again and again instead of just adding another species name to your list. If you wanted to change something, maybe the names of the output files, you would need to change it for each copy-pasted version. Not only is that a lot of work, it also makes it more likely that you will make a mistake and, because your code is less readable, less likely that you will catch it.

Looping through a non-predetermined list of values

Another benefit of for-loops is that you do not need to decide what values you are cycling through ahead of time. Perhaps we want to print subsets of our table for all species appearing in the table, whatever they are, without knowing them ahead of time. This example does the same thing as our previous two examples, but it does it for all unique species in our table.

  for(my_species in as.list(unique(my_table$species)))
  {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt", sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE,
      sep='\t', row.names=FALSE)
  }

Maintaining data types in loops

Note that as.list is absolutely necessary in the above for-loop—I learned this the hard way. R for some reason strips values of their type when they are iterated through in a for-loop. Try it yourself—here is the list the for-loop is iterating over; it is of type factor:

  unique(my_table$species)
  class(unique(my_table$species))

Here is an iteration through the list with as.list—the elements of the list are as before of type factor:

  for(my_species in as.list(unique(my_table$species)))
  {
    print(my_species)
    print(class(my_species))
  }

Here is an iteration through the list without as.list—the elements of the list are now of type character:

  for(my_species in unique(my_table$species))
  {
    print(my_species)
    print(class(my_species))
  }

Looping through integers

You can also cycle through integer values. This chunk of code, for example, sets my_value to each integer from 2 through 10 and prints it each time.

  for(my_value in 2:10)
  {
    print(my_value)
  }

Reusing your analyses with functions

When you find yourself copy-pasting your code and changing a few things, you should consider using a function. Here is a function I wrote to calculate a 95% confidence interval given a proportion and a total:

  confidence_interval_from_proportion_and_total <- function(proportion, total)
  {
    confidence_interval_plus_minus <- 1.96 *
      sqrt(proportion * (1 - proportion) / total)
    confidence_interval_min <- proportion - confidence_interval_plus_minus
    confidence_interval_max <- proportion + confidence_interval_plus_minus
    return(c(min = confidence_interval_min, proportion = proportion,
      max = confidence_interval_max))
  }

After defining the function, I call it whenever I need to calculate a confidence interval:

  confidence_interval_from_proportion_and_total(0.50, 500)
  confidence_interval_from_proportion_and_total(0.10, 10)
  confidence_interval_from_proportion_and_total(0.99, 200)

Instead of copy-pasting my code each time, I can call the function in one line, making my code shorter and a lot easier to read. If I decide to change how I calculate a confidence interval, perhaps to bound the confidence interval between 0 and 1, I can simply edit the code inside the function rather than needing to locate and edit many lines of code. Some functions can get pretty long, so the benefit adds up.

Avoiding (some) bugs

Put the following line at the very top of your script:

  rm(list=ls())

This line deletes whatever data is already loaded, so that your previously defined variables do not linger, ready to creep into your analyses when you reuse a name because of a typo or forget to assign it a new value as you intended to. You should run this line whenever you are starting your analysis over or when you open and run a new script—the top of the script is the best place.

Resources

Getting started using regular expressions for pattern matching and smart find and replace

This fall I realized that a lot of the knowledge that I at some point during my PhD started taking for granted was not, in fact, universal. One of my favorite things about MIT was that almost no information lived exclusively in someone’s brain; all of it was somewhere accessible online. In that spirit, I started putting together written guides to whatever I happen to know, initially intended for anyone in our lab and now intended also for you. I’m not reinventing the wheel, not making anything new, just pointing newcomers to resources they can use to get started.

This guide is the first. Eventually, all the getting started guides will be here.

Thank you to my mentors in the Makova and Page labs, Melissa A. Wilson and Daniel Winston Bellott, for teaching me all of this when I first started out in computational biology.

Contents

This document should take about an hour. Don’t worry about memorizing anything! Google everything you need when you need it.

Background

Regular expressions (nicknamed regex) are an extremely powerful (and in my opinion vastly underused) tool for pattern matching and find and replace. Here are some things I have recently used regex for:

  • changing the contents of a column to be more useful to me—for example, changing a column with values like 21I (Delta), 21J (Delta), 21K (Omicron), 21L (Omicron), and so on to instead have values like Delta and Omicron
  • changing the format of dates from 10/5/2022 to 2022-10-05
  • deleting the sixth column from a table
  • creating a new table column depending on the contents of another column—for example, making a sample names column from a file paths column
  • scraping a web site to turn messy tables represented by html into simple tab-separated tables in a text file
  • scraping a web page to retrieve the first image appearing in each web page it links to
  • parsing the output of a script someone else wrote to determine when a Global Entry interview frees up (so that I could run it every five seconds and make my computer beep when an interview appears). (It worked—I got an interview!)

Regular expressions can be intimidating. Here’s one I wrote recently, to add missing tabs (and with them missing empty cells) to the last column of a tab-separated table:

^(([^\t]+?\t){29}[^\t]*?)$

I can’t read it just by glancing at it. But if I move through slowly, character by character, left to right, it makes sense. (Luckily, you will probably spend a lot more time writing regular expressions than reading them.)
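If it helps, here is that same expression broken into its pieces, in the same style as the breakdowns later in this guide (the 29 and the overall layout are specific to the table I was fixing):

^                           match the start of the line
[^\t]+?                     match at least one character in a row that is not a tab (one cell’s contents)
[^\t]+?\t                   match one cell followed by the tab that ends it
([^\t]+?\t){29}             match a cell followed by a tab, 29 times in a row
[^\t]*?                     match a final cell made of zero or more non-tab characters (so it can be empty)
$                           match the end of the line
^(([^\t]+?\t){29}[^\t]*?)$  match a whole line made of 29 tab-terminated cells plus one final cell, saving the whole line so the replacement can reuse it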

Getting started with BBEdit

BBEdit is my favorite text editor. BBEdit lets you do regex find and replace. I also use it to code in. Install it here (scroll down to “How can I get BBEdit 14?” for download options).

You can get a lot done just with regex find and replace in BBEdit. But if you like to code, most programming languages use the same or a very similar regular expression format. What you learn here will apply no matter how you like to code.

Getting started with regex101.com

If you’d prefer to get started even faster without downloading anything, go to regex101.com and check Python and Substitution on the left.

In addition to testing/running your regular expressions, regex101.com also has helpful explanations of all the parts of your regular expression and highlights errors.

Getting started with regular expressions

We’ll start with a few quick examples.

Example 1

Here is an example table. Copy/paste it into BBEdit or regex101.com:

patient	Ct	is_food_handler	collection_date	species	is_virus	readcount	genome_covered
Patient 1	21.5	TRUE	2022-05-12	SARS-CoV-2	TRUE	435	0.19
Patient 2	17.4	FALSE	2022-05-11	SARS-CoV-2	TRUE	2346	0.97
Patient 3	24.8	FALSE	2022-05-19	SARS-CoV-2	TRUE	87	0.05
Patient 4	23.9	TRUE	2022-05-12	SARS-CoV-2	TRUE	76	0.10
Patient 5	21.3	FALSE	2022-05-13	SARS-CoV-2	TRUE	675	0.20
Patient 6	16.8	FALSE	2022-05-09	SARS-CoV-2	TRUE	4532	0.99
Patient 7	27.9	FALSE	2022-05-07	SARS-CoV-2	TRUE	876	0.23
Patient 1	21.5	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	E. coli	FALSE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	E. coli	FALSE	4324	0.14
Patient 4	23.9	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	E. coli	FALSE	19	0.02
Patient 6	16.8	FALSE	2022-05-09	E. coli	FALSE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	E. coli	FALSE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Influenza A	TRUE	65	0.04
Patient 2	17.4	FALSE	2022-05-11	Influenza A	TRUE	4363	0.95
Patient 3	24.8	FALSE	2022-05-19	Influenza A	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	Influenza A	TRUE	312	0.12
Patient 5	21.3	FALSE	2022-05-13	Influenza A	TRUE	0	0.00
Patient 6	16.8	FALSE	2022-05-09	Influenza A	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Influenza A	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	Hepatitis C	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	Hepatitis C	TRUE	65	0.08
Patient 4	23.9	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	Hepatitis C	TRUE	6457	0.65
Patient 6	16.8	FALSE	2022-05-09	Hepatitis C	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Hepatitis C	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	HIV	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	HIV	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	HIV	TRUE	865	0.34
Patient 6	16.8	FALSE	2022-05-09	HIV	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	HIV	TRUE	654	0.26

Hit command F to bring up BBEdit’s find and replace dialogue. Enter the following (and click Replace All if you are in BBEdit):

FIND:    \t
REPLACE: ,

As long as your table is simple, your tabs should all be replaced by commas and your table should now be a comma-separated table. (If your table is not simple, you’ll want something more like this or this.)

Example 2

Let’s pretend that we want all numbers to be whole numbers, rounded down. Let’s find all digits following a decimal point and remove them:

FIND:    \.\d+
REPLACE:

If you’re in BBEdit, make sure that the Grep option is checked.

Breaking it down, here’s what that FIND is saying and how it would parse the input 27.9:

\.     match a period—the period is a special character, so it is escaped with a \
\d+    match at least one numerical digit (0-9) in a row
\.\d+  match a period followed by at least one numerical digit (0-9) in a row

Replace is empty, so when you click Replace All, matched values will simply be deleted.

Video walkthrough

Here is a very thorough walkthrough of all the potential pieces of a regular expression and how you can put them together to match any pattern you want:

You might notice that there are small differences in how regular expressions are written in different programming languages. BBEdit, for example, uses grep (as does the command line in Unix), where matched values are written as \1, \2, \3, and so on. Python uses the same syntax. In this video, matched values are instead written as $1, $2, $3, and so on. If you switch between programming languages, you’ll catch this difference easily. The general principles and the things that matter stay the same. You can play with regular expressions in different languages in regex101.com.

More examples

Example 3

More with the above example table, which you can open in BBEdit or copy/paste into regex101.com.

Let’s change the format of all of the dates. Right now, all of the dates look like this: 2022-05-12. Let’s make them look like this: 05/12/2022:

FIND:    (\d+)-(\d+)-(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

(\d+)              match at least one numerical digit (0-9) in a row and save it
-                  match a dash and don’t save it
(\d+)-(\d+)-(\d+)  match at least one numerical digit in a row (2022) and save it, then a dash (and don’t save it), then at least one numerical digit in a row (05) and save it, then a dash (and don’t save it), and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

\2        the second thing we saved (05)
/         a slash
\3        the third thing we saved (02)
/         a slash
\1        the first thing we saved (2022)
\2/\3/\1  the second thing we saved (05), then a slash, then the third thing we saved (02), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 05/02/2022, and all other dates will each individually be processed the same way.

Example 4

Let’s change the format of all the dates again in the same way, but this time let’s also get rid of the leading 0s. Click undo (command Z) and try this instead:

FIND:    0*(\d+)-0*(\d+)-0*(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

0*                       match any zeros in a row, if they are there, and match nothing if they aren’t—we could match nothing, or a 0, or 00, or 000, or 0000, and so on; this part is not inside parentheses, so we do not save it
(\d+)                    match at least one numerical digit (0-9) in a row and save it
-                        match a dash and don’t save it
0*(\d+)-0*(\d+)-0*(\d+)  match any zeros in a row (or match nothing if there aren’t any 0s) and don’t save it, then at least one numerical digit in a row (2022) and save it, then a dash and don’t save it, then any or no zeros in a row and don’t save it, then at least one numerical digit in a row (05) and save it, then a dash, then any or no zeros in a row and don’t save it, and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

\2        the second thing we saved (5)
/         a slash
\3        the third thing we saved (2)
/         a slash
\1        the first thing we saved (2022)
\2/\3/\1  the second thing we saved (5), then a slash, then the third thing we saved (2), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 5/2/2022, and all other dates will independently be processed the same way. Notice that the leading 0s are not included, because we wrote them into the regular expression without saving them between parentheses.

Cheat sheets

Don’t try to memorize anything! Once you know the general principles, you can use a regex cheat sheet to build any regular expression you need.

Here are two that I like.

As before, you might notice that there are small differences in how regular expressions are written in different programming languages. In the first cheat sheet, matched values are written as \1, \2, \3, and so on. In the second cheat sheet, matched values are instead written as $1, $2, $3, and so on. $1, $2, $3, and so on will work in some situations, but not in BBEdit or the command line. If you need to switch programming languages, this will be a difference you will catch easily (because it won’t work).

  • MIT regex cheat sheet: https://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
  • Dave Child’s regular expressions cheat sheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/

How to Get a Global Entry Interview Appointment Really Fast (With the Power of Code)

I’m travelling a lot in November, fly a lot in general, and don’t like taking my shoes off or waiting in non-Dunkin’ lines in airports so I finally caved and applied for Global Entry, which is like TSA PreCheck ($85) with the added benefit of faster customs (another $15, why not).

The application was easy. I applied September 16th, my application was conditionally approved September 19th, and then I just needed an interview, which was booking into 2023 at least at Boston Logan. Suboptimal.

This isn’t a blog post about how great Global Entry is. I’m sure it is (that’s why I applied for it), but I wouldn’t really know yet. What I do know is that I was able to get interviews not only for me but also for my partner nearly immediately without sitting refreshing the page for full days because I and other kind people wrote code to make it happen.

If you haven’t yet decided to learn to code, I hope this changes your mind. I code not just for work, and not just for personal projects outside of work, but also for dumb mundane things like this—specifically, to avoid dumb mundane things like this. (And in this case, I barely needed to write any of the actual code.) Laziness wins.

TTP Alerts

The first thing I did, after considering flying to a location with appointments, was look for some kind of existing tool that would alert me when an appointment opened up. I found two: TTP Alerts and Appointment Scanner. TTP Alerts was cheaper so I chose that one. I also used Global Entry Now to look at how many appointments had opened up that day in Boston, so I could have appointment opening FOMO and also feel more deeply in my soul how futile and time-wasting refreshing the appointments page would end up being. I also tried to use Global Entry Now in combination with Visualping to let me know when the page changed.

I ended up pretty disappointed with TTP Alerts, though it did end up giving me an alert about an opening I ended up going with and almost missed because I was in the shower and didn’t hear my computer beep (and I was playing music from my phone, so when my phone paused my music to tell me about an email from TTP Alerts I ran out of the shower and booked the appointment). That was the exception. Otherwise, TTP Alerts missed almost every opening I later tracked, and almost every opening shown on Global Entry Now, I think because the appointments moved faster than TTP Alerts checked. Global Entry Now had the same problem, showing past appointments accurately but usually not showing new appointments fast enough to reliably trigger a Visualping alert. I suspect Appointment Scanner would have a similar issue, though I wouldn’t know because I didn’t try Appointment Scanner. Instead, I moved on quickly to looking for some kind of less polished but better solution.

goes-notify

We are now on October 2nd. I almost immediately found a lovely script on GitHub, goes-notify. I couldn’t immediately get the script to work in Docker (which would have brought me the script prepackaged with all the correct versions of all the other software it depends on), so I instead cloned it to my computer.

Make a new directory for your Global Entry adventure, navigate to it in your terminal, and enter

git clone https://github.com/Drewster727/goes-notify.git

From there, follow the getting started instructions to fill out the config.json file with your preferences. Here are the fields I changed:

  • "current_interview_date_str": "October 31, 2022" —because I wanted an appointment before Halloween
  • "enrollment_location_id": "5441" —because that’s the code for Boston Logan
  • "use_gmail": false —because I did not want to connect to my email and anyway I did not trust that the script would check frequently enough or that the alert emails would come fast enough for me to actually get the appointments
  • "no_email": true —because I still did not want to connect to my email

You can try running as described in the instructions, and if it works skip the next part. In my case it did not work, because I have Python 3, not Python 2.

Modifying goes-notify to work with Python 3

There are a number of things I could have done once I discovered that goes-notify does not work with Python 3. I could have tried running the script with Docker again, but since it didn’t work for me immediately I didn’t feel like trying to get it to work. I also could have installed Python 2, but I didn’t want to risk messing up Python 3, though for all I know they can coexist just fine (but I didn’t want to find out, at least not for this). Instead I did what was fastest for me, which was to edit the script line by line until it worked on my laptop with Python 3. If you’re more dedicated than I am or if you are running ✨vintage python✨ you don’t need to do this, but in case you specifically want goes-notify to work with Python 3, here are the lines that I changed:

Line 6
  Original (Python 2): import commands
  Updated (Python 3):  import subprocess

Line 81
  Original (Python 2): commands.getstatusoutput
  Updated (Python 3):  subprocess.getstatusoutput

Lines 124-125
  Original (Python 2):
    if current_apt > dtp:
        dates.append(dtp.strftime('%A, %B %d @ %I:%M%p'))
  Updated (Python 3):
    if current_apt > dtp:
        logging.info(dtp)
        dates.append(dtp.strftime('%A, %B %d @ %I:%M%p'))
  (This change is a modification to print details about the appointment that was found; the original version works just fine in both Python 2 and 3.)

Lines 130-138
  Original (Python 2):
    hash = hashlib.md5(''.join(dates) + current_apt.strftime('%B %d, %Y @ %I:%M%p')).hexdigest()
    fn = "goes-notify_{0}.txt".format(hash)
    if settings.get('no_spamming') and os.path.exists(fn):
        return
    else:
        for f in glob.glob("goes-notify_*.txt"):
            os.remove(f)
        f = open(fn,"w")
        f.close()
  Updated (Python 3): deleted/commented out

Line 193
  Original (Python 2): for key, val in arguments.iteritems():
  Updated (Python 3):  for key, val in arguments.items():

Wrapping goes-notify to refresh automatically and beep when an appointment frees up

Finally, I wanted not only to be able to check if an appointment has opened up, but also to not have to be the one checking. And even if something is checking for me, I also did not want to have to stare at that something. I do spend most of my waking hours either directly in front of or somewhere near my laptop, so I am almost always within earshot (except while sleeping, taking a walk, commuting, or showering), so I wanted something that would check continually for free appointments and beep loudly and repeatedly if one appeared.

My favorite programming language is Perl, so I wrote a Perl script to run goes-notify.py every 15 seconds. I named it run_goes_notify.pl:

use strict;
use warnings;

# parameters you should change
my $number_seconds = 15;
my $goes_notify_script = "/mydirectory/goes-notify.py";

# potential messages from goes-notify
my $no_tests_string = "No tests available.";
my $error_string = "Something went wrong";
my $new_appointment = "Found new appointment";

while(1) # repeat forever
{
	# don't do anything for 15 seconds
	sleep $number_seconds;
	
	# run goes-notify and extract the output
	my $output = `python3 $goes_notify_script`;
	
	# if goes-notify says there are no interviews, print no interviews
	if($output =~ /$no_tests_string/)
	{
		print "no interviews\n";
	}
	
	# if goes-notify says there is an interview, print details and beep
	elsif($output =~ /$new_appointment/)
	{
		print $output."\n";
		system("echo -ne '\007'"); # beep
	}
	
	# if something went wrong, start this script over
	elsif($output =~ /$error_string/)
	{
		print "error. relaunching\n";
		system("perl /Users/lakras/2022_10_02_global_entry/run_script.pl");
		exit(0);
	}
	
	# if something else happened (it probably didn't but who knows), that probably
	# means there aren't any interviews
	else
	{
		print "no interviews\n";
	}
}

You’ll want to update the value of goes_notify_script to be the actual filepath of goes-notify.py on your computer.

Then just run from your terminal:

perl /mydirectory/run_goes_notify.pl

replacing /mydirectory/run_goes_notify.pl with the actual filepath of run_goes_notify.pl on your computer.

That’s it! That should work. Let it run for a minute to see “no interviews” printed a few times and make sure it is working. Then leave it alone in the corner of your screen while you work on other things, with the volume up loud enough that you’ll hear the beep when it beeps.

This worked so much better for me than everything else I tried. My computer consistently beeped immediately when an appointment opened up, and I was consistently able to get those appointments right away. Not only was I able to immediately get appointments in October for both me and my partner, I was also able to optimize those appointments to be more convenient for our work schedules. I hope it works for you too.

Good luck.

Little Library DIY

We made our own little library! Our little library is not only the fulfillment of a yearslong fantasy, it is also a constant source of joy when people stop by and a great excuse to buy books. I did most of the planning and designing, with construction and style guidance from my parents and my partner Cory and our friend and housemate and generous feudal lady PJ, and fixing from Cory, an actual mechanical engineer, when things broke. I think PJ wanted to buy a professionally built little library, at least at first, but I wanted to do something ill-advised, amateurish, in retrospect possibly manic, and from the soul, and also to use a dremel for the first time in ten years.

Our little library is painted black and the books have a sometimes spooky tint, because we live in a not-yet-painted-black house that may or may not have its own soul (and if it does have its own soul, or a visiting soul (other than our visiting souls, of course), it is absolutely a spooky one) across the street from a graveyard in Salem—which of course means that most of our neighbors and subsequently most of the visitors to our little library have died. Spooky books are often also joyful books, and hopeful books—but sometimes just spooky.

This is a blog post about how we made the library/libraries. Spoiler alert, it ends up looking like this:


This blog post includes links with my Amazon referral code. As an Amazon Associate I earn from qualifying purchases. If you click a link and buy something, I get about 4% of the price as commission. You don’t have to buy these things from Amazon—actually, you don’t have to buy these things at all. You can also support me by buying merch of my art, by buying me a campground store decaf coffee, or by simply reading and enjoying. Thank you!



Ingredients

Here’s everything I bought to make the library:

  • A reasonably-sized waterproof bathroom cabinet, to serve as the larger library—we painted it black, but this one was originally white, which already looked very nice as a potential library
  • A narrow waterproof bathroom cabinet, to serve as the small library—this one is dark brown, since it was closest to the intended black, but it also comes in white
  • A five-pack of 8×10″ plexiglass, to serve as the windows
  • Durable, waterproof plastic file folders, to serve as the roofs—in black since our library is black, but you can get colorful ones instead
  • Mounting tape, to attach the windows and the roof
  • Black outdoor paint, to paint the cabinets our preferred color, which was black—but you can choose a different color, or you can choose to not paint your libraries at all
  • Barrels to plant your libraries in—I bought these fancy wooden bucket barrels (18 inch, 15 inch, and 11.5 inch diameter, one for each library and the smallest destined for flowers). I like them a lot: the real wood and its smell and its texture and the metal handles were important to me, because smells and textures are important to me in general. But they are expensive; cheaper, perfectly acceptable, possibly more durable plastic bucket barrels exist, as do different sizes and shapes and quantities of the wooden ones—this is the fun part. You could even get this weird wishing well planter I’ve been trying to find some excuse to buy (but I have nothing in particular I want to do with it and nowhere in particular I want to put it) and stick a library in it, which is what I would probably try to do if we decided to add a third library

And here are things we already had that we also used:

  • A dremel, to cut out interestingly shaped window-holes
  • A sturdy pocketknife, to cut the plexiglass to fit the window-holes
  • Variously sized small pieces of scrapwood, to attach the roofs to and to make lock-type turning mechanisms so the doors don’t blow away
  • A drill and drill bits and screws, to attach the roofs so they don’t blow away and to attach the lock-type turning mechanisms
  • Lots and lots of rocks
  • Dirt
  • Flowers
  • A very strong glue to fix things when they break, like JB-Weld or Gorilla glue

And some things I bought to put in the library once we built it:


Methods

Here’s how we built the library.

First, I assembled the smaller of the two shelves. (In retrospect, I should have waited until after dremeling the doors, but it worked out fine.)

The shelf fit nicely in its intended bucket, as intended, with some books in it.

I dremeled windows into the doors of both the small shelf, which was easy because I just sliced the spaces between the horizontal gaps, and the larger shelf, which was more challenging. I tried to make the windows large enough that you could see in and see the books. I considered adding more windows to the other sides of the shelves, which you could if you wanted to, but we decided on just the doors.

(Does what I’m doing make you slightly uncomfortable? It probably should. I have no training in this except Science Olympiad in high school.)

I also reoriented the doors of the larger shelf to open in opposite directions because that is more interesting.

Here is how the larger shelf looks, dremeled and assembled:

I like how the large shelf looks as a white shelf, and white might be a good fit for a different project, but we had a whole vibe planned so it had to be painted black. I think it turned out nice and dramatic.

I measured and cut rectangles of plexiglass as large as would fit across each of the doors of the small shelf, covering the windows I had dremeled in (and which had partly already existed before my dremeling). I attached the plexiglass to the doors using mounting tape, which apparently is used for cars, so it is probably good enough for this purpose as well.

I cut small straps off a skirt belt I didn’t like and curled them into door handles for the larger shelf, and attached them using mounting tape as well.

Then I cut and attached the plexiglass windows on the doors of the larger shelf—big rectangles covering both the big window holes and the little door handle holes.

Here’s how they turned out, with books inside:

To make a roof, I decided to use file folders, supported by wooden blocks that had been used to deliver furniture. Here is what that brainstorming looked like.

I painted the wooden blocks black and used mounting tape to attach them to the libraries.

I wanted the roofs to be waterproof, and black, so I ordered black plastic file folders and used them as roofs, attached also with mounting tape. Later, when Cory was improving on this idea, he mentioned my use of plastic file folders as roofs to a coworker, who said that non-engineers sometimes come up with creative solutions to engineering challenges that a person boxed in by an engineering education might not have come up with. A very kind compliment.

They turned out quite nice, I think. Very witchy and spooky.

Here they are in their buckets, outside. We reserved the bottom part of each shelf to fill with rocks so that the libraries would be heavy and more or less sturdy. The shelf comprising the smaller library is actually upside down—the now-bottom shelf used to be the top shelf, intended to store toilet paper.

We bought flowers to plant in the buckets alongside and around the libraries:

Here is Cory planting the flowers. We planted the flowers on the sides and filled the rest of the space in the buckets with dirt. The smallest of the buckets is entirely dedicated to a flower we saw a lot of bees on, which seemed like a very good sign.

Here is how they turned out, after sunset and full of books. Very spooky and cozy:

We bought a ton of books to fit in the libraries. Here are some of the books we bought:

We dedicated the hall window overlooking the libraries to the books we plan to add to the libraries. Here they are at various moments. We ended up moving them from the windowsill to a dedicated shelf under it because there got to be too many.

PJ officially registered our libraries. Here are the fancy materials they sent us, including a little plaque:

Here are the libraries with their plaque. PJ also got a gorgeous flag and a wooden sign and little sculptures of reading owls.


Here’s where dreams meet reality, and it gets a little sad—but happy and better afterward. Around Halloween we had a very bad windstorm and everything that could blow away did. The library flag blew away and we found it somewhere down the street. The roof folders blew away and we did not find them. The skeleton hand you can see in the mulch also blew away and we found it later near the graveyard (maybe it was trying to return home). The fence came down, thankfully missing the libraries.

There were two problems we kept running into: one was wind and the other was rain. The roofs kept blowing away, and mounting tape was just not doing the trick. And the doors kept blowing open, letting in rain and getting the books soaked. Twice a door was blown open hard enough that it broke off.

Cory is an actual engineer. He made nice wooden door locks to keep the doors from blowing open and drilled them into the libraries.

Cory also fixed the broken door with superglue.

Finally, Cory added additional wooden supports for the roofs, and drilled screws through the roofs into the supports. No more flying away.

I’m very grateful to Cory for supplementing my—um—creativity with thoughtful and weather-aware actual engineering. Here’s how the libraries turned out, with their improvements:

And here they are now:


Tweeting About Your Science: My Guide/What I’ve Learned

All of the notable things I have ever done—being born, my weird space cow art, my long-form blogging about my mental health, my long-form blogging about underwear, probably other things—were on October 21st rapidly surpassed or at least matched in measurable viewership by 35 words, one emoji, two images, and a link.

The characters in question live in my twitter thread summarizing our study on the July COVID-19 outbreak in Provincetown, Massachusetts—then a preprint, now out in print on the cover of Cell. As I write this sentence, the first tweet of the thread has been viewed over 100,000 times, most of those views having happened within days of the tweet—which means that more than twice the population of my hometown saw at least a summary of our main findings at the moment they had the greatest potential for public health impact. I am quite delighted that the research we all poured our minds and hearts and summer and autumn into was seen by so many people so quickly. I am very impressed by how fast it spread and to whom.

Tweeting about science isn’t something I thought I was good at (and this was my first time doing it), but since it went so well I want to share with you what I’ve learned—for a long time as a consumer and now apparently as a creator. Here is my guide to writing a good twitter thread (tweetorial or tweetstorm, though I’m not a fan of either word) about your research.

Step 0: Before You Tweet

Before you start drafting your tweets, think about who your audience is and what you want your audience to take away from your work:

  1. What are your key scientific findings?
  2. How do you hope your key scientific findings will contribute to people’s everyday lives, immediately or someday?
  3. What action do you want to inspire? Are you hoping that people will alter their behavior? Are you hoping that people will use your tool? Are you hoping that people will build on something you’ve started?
  4. As you wrote the above, whom did you think of? Who are the people who are positioned to use your work immediately? Who are the other people who might also be interested in knowing about it? Who else might stumble on it?
  5. Do you have strong emotions about your work? Do you feel delighted? surprised? grateful? horrified? Is there a particular emotion you want your audience to feel? Are you hoping to delight? to surprise? to horrify? How strong is your own emotion? Does it contribute to your message or does it detract from your message?
  6. In what ways can your work be misinterpreted, misunderstood, or misused, unintentionally or maliciously? Can your work be used to fuel misinformation? Can your work be used to fuel hatred? What content did you personally find helpful in understanding how your work fits into a compassionate, good world? What content might help correct possible misinterpretations of your work?

Step 1: The First Tweet

The first tweet in your thread is the most important tweet of your thread. Most people will not click your link to read your paper and most people will not click your first tweet to read the full thread. For many people, this first tweet is all they will see of your work. It should communicate enough of your main findings and your call to action that not seeing the rest is more or less okay.

To maximize the chance that someone scrolling will read and understand it, your first tweet, more than any of the other tweets in the thread, needs to be short, be easy to understand, and not look like a block of text. Where possible, I recommend using bulleted lists and breaking up your text with plenty of vertical whitespace.

Your first tweet should include a link to the paper (though some people save it for late in the thread, as a sort of reward). If you don’t like the preview that appears when you paste in the URL, you can attach an image. I attached two images: one with the title and one with the abstract. Maybe you have a very pretty image, or a graphic that nicely summarizes your main findings—those would be nice to attach instead. We did not have a graphical abstract yet when I posted this thread, but if I were posting it today I would attach our graphical abstract instead of the images I did attach.

Your first tweet should have some indication that there is a thread attached to it. That indication can be the thread emoji (🧵) or the word thread, perhaps with an arrow (⬇), or it can be a 1/, or it can be something else.

The first tweet also needs to do the important work of appearing in search results when people look up keywords. Try searching terms related to your work. Look at what results appear on twitter, how many results appear, what kinds of discussions are happening around those terms, and if those discussions involve the audience you are hoping will see your work. In my case, I knew that my twitter thread needed to contain the following keywords:

  • COVID-19
  • SARS-CoV-2
  • Provincetown
  • Ptown
  • outbreak
  • Delta
  • vaccinated
  • vaxxed
  • vaccine
  • public health
  • symptoms

(I don’t see any reason to use hashtags for these keywords—the terms appear in search results regardless of whether or not they have a hashtag in front of them, and having that many hashtags just seems a bit alarming. But I could be wrong.)

These 11 keywords account for almost a third of the 35 total words of the tweet. I made sacrifices to ensure that they all appeared. (Vaccinated, for example, is much longer than vaxxed, and Provincetown is much longer than Ptown—but they each produce their own search results and I wanted our paper to appear in all of them.)

If you would like and if it comes naturally, you can also (or alternatively) include in your first tweet some kind of a hook—a funny joke, a cliffhanger, an unanswered question, or content that elicits an emotional response. With a hook people will be more likely to click to read the details, but they are also less likely to walk away understanding your work if they don’t click. If the thing that is important to you is that as many people as possible walk away with some understanding of your work, you probably want to summarize your work in the first tweet. If you would like to connect more deeply with a smaller number of people, rather than shallowly with a large number of people, it might be better to use a hook. (I of course did not do that.)

Finally, I think it is important to include yourself in the work. The science is done by people and people like other people. I start the first tweet with Our and many of my other tweets in the thread start with We. I am a person who likes exclamation marks so I include exclamation marks—because exclamation marks best capture how I feel about this paper.

Things to Know

Here are seven general principles to keep in mind as you draft your twitter thread.

  1. I recommend you draft your tweets in Google Docs or similar, and that you get feedback from your co-authors and revise at least once before posting. This thread took me three days with feedback from multiple co-authors: Katie, Bronwyn, Gage, and Pardis all helped me revise this thread and made it much better. (In its original form, it didn’t even have capital letters.)
  2. Each tweet should be its own complete thought. If a person retweets one tweet from your thread and not any of the others, its message should be clear. It should not be possible to take any individual tweet out of context to mean something you did not intend.
  3. Where possible, and especially when making important points, make your text easier to digest by breaking it into bullet points or breaking it up with vertical whitespace.
  4. If you include a url in your tweet, your tweet will have a nice link preview. If you attach any images to your tweet, the link preview will not appear. The link preview in the final, posted tweet will look the same as it looks in the draft tweet before you post.
  5. If you end your tweet in a url (if the url is the very, very last thing in your tweet) and you have not attached any images, the url itself will not appear in the tweet when it is posted (but the link preview will).
  6. Most people do not click on images. You should size your images so that the preview that people see on twitter looks the same as the image itself. If you are attaching one image, it should be 1500 pixels wide and 850 pixels tall (or scaled with that same aspect ratio). If you are attaching more than one image, all of your images should be square. The final cropping of the images does not always match the preview of your tweet draft, nor is it consistent between devices. If it is important that people see the entire image, including the edges, add some white space all around your image so that the content of the image will still be visible even if the edges are slightly cropped (if you’d rather script this than do it by hand, see the sketch after this list). (If you do not have image editing software, you can download and use GIMP for free.) I do not recommend attaching more than two images—the previews will be too small to see without clicking, and most people will not click.
  7. Create a secret fake twitter account to test out your tweets before posting them for real. Delete your test tweets immediately. Do test post the actual images you intend to post to make sure they appear as you expect them to appear and are cropped as you expect them to be cropped. Do not link to your work or post any of the actual text you intend to post—not even for a moment. Other than images, replace your actual text with dummy text and your actual urls with dummy urls when you test post. You can also use this account to draft your actual tweets (without posting them) to make sure they fit within the character limit—don’t rely on other applications to count characters; they all seem to count in their own ways.
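
If, like me, you are more comfortable scripting than nudging pixels by hand, here is a minimal sketch of the white-space padding from point 6, in Python, assuming you have the Pillow imaging library installed. The input filename figure.png, the output filename, and the 50-pixel margin are made up for illustration; the 1500 by 850 canvas is the single-image size described above.

from PIL import Image, ImageOps  # Pillow

TARGET_WIDTH, TARGET_HEIGHT = 1500, 850   # single-image size from point 6
MARGIN = 50                               # white border so slightly cropped edges don't lose content

# "figure.png" is a hypothetical filename; use your own figure
img = Image.open("figure.png").convert("RGB")
img = ImageOps.expand(img, border=MARGIN, fill="white")   # add white space all around
img.thumbnail((TARGET_WIDTH, TARGET_HEIGHT))              # shrink to fit, preserving aspect ratio

# center the padded figure on a white 1500 by 850 canvas and save it
canvas = Image.new("RGB", (TARGET_WIDTH, TARGET_HEIGHT), "white")
canvas.paste(img, ((TARGET_WIDTH - img.width) // 2, (TARGET_HEIGHT - img.height) // 2))
canvas.save("figure_for_twitter.png")

For more than one image you would do the same thing with a square canvas instead. Either way, test post the result (from your secret test account, per point 7) to check the cropping before posting for real.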

Step 2: Context, Background, Collaborations

This part is easy. I initially had this tweet closer to the end, but my co-authors encouraged me to move it to the front and I think it is much better here. We start off by acknowledging the organizations involved in this study, because the list was huge. If it were a much smaller study, I might start off by acknowledging the individual authors. It is also helpful to start off with some context—has this work been peer reviewed? Does this work build on previous work? Does this work build on anything familiar to the audience?

If someone has clicked your first tweet to read the thread, they are already in and willing to skim. Your second tweet does not have to be very exciting.

You might notice that while this tweet includes an attractive preview of the link, it does not include the actual url. That is because the url is pasted after the 2/. Because the url is the very last thing in the tweet and there is no image attached to the tweet, the actual url does not appear.

Step 3: Paper Summary

The paper summary will likely be the bulk of your thread. As much as possible, each tweet should include one complete and clear idea and be retweetable on its own. Where relevant, you should include figures from the paper. Try to add some emojis. You can also include links to previous work or work that you use or build on. I start many of these tweets with We.

The order in which ideas appear and the ways you present those ideas might not match their order or presentation in the paper. Your audience on twitter is almost definitely broader than the audience you wrote the paper for, and your language and narrative should adapt. Your focus should be on communicating the findings that are most relevant to this broader audience in a way that is understandable (and not condescending), with minimal jargon. In some cases you might want to delve into more detail, and that’s okay—but tweets that include jargon should be easy to skim, their purpose should be graspable without extensive background in your field, and they should not be necessary for understanding the overall story.

Step 4: Findings Summary

Your paper summary should end in a tweet summarizing the findings that are most relevant to your audience, or most actionable.

Step 5: Your Findings Out in the World

This is the hardest part.

This section is about the role you hope your work will play in the world, and it is a part of your efforts to shape that role. Here, you need to anticipate the ways in which your work could be used to misinform or hurt. You should address possible misinterpretations explicitly and head on. You should provide links to informative content people can engage with to address possible misinterpretations before they arise. Wherever possible, that content should be readable by people outside your field.

If you are less worried about misinterpretation or misuse of your work, you can instead or additionally use this space to explore the life you hope your work will live in the world—how you hope people will use or build on your work, or your own emotional response (if you have one you’d like to share) to your findings.

In either case, it is nice to end with a positive or hopeful feeling before your call to action, because hope is empowering.

Step 6: Key Takeaways/Calls to Action

Finally, at the very end, I have the final takeaways (the third time I am adding takeaways—but that is what this thread is). These takeaways do not relate to the content of the paper itself; instead, they are focused on how I hope the paper’s findings will be useful in the real world. I end by making explicit the calls to action I tried to imply in the first tweet and have been building toward throughout the thread.

Step 7: The End/Thank Yous

This part is fun (and easy). Thank your co-authors: the first and last authors first if the full author list is very long, and then the full author list, either listed out or in summary. If anyone is on twitter, tag them.

When you are ready to post, use twitter in a web browser on your computer (not the app on your phone) to prepare the thread in advance and then post the entire thread at the same time.

The time you post your thread is important. Don’t tweet when your target audience is having dinner or asleep. I think it’s probably best to time your thread for when your target audience is browsing twitter at work, but there are tons of more informed articles on timing of tweets that you can read instead of using my guesswork.

Finally, the number of people twitter shows your tweet to is determined not only by the number of likes and retweets your tweet gets, but also by the speed at which it gets them. Tell your colleagues and co-authors about your tweet immediately after you tweet it. Hopefully they will engage with it quickly and help it spread.


That’s all! I hope this guide helps your work reach its audience.

If you would like, you can view (and retweet!) this thread in its home on twitter and read the paper the thread is about. You can also read an article about the power of twitter to disseminate scientific research, which I found on twitter.