Getting started running viral-ngs workflows in Terra for no-code, scalable sequence analysis

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park and Kat DeRuff in our lab for pointing me to the excellent resources that are linked here.

Contents

This document should take less than an hour. (If you would like, you can dedicate a day to become a true expert—but I didn’t, and you too will be perfectly fine if you don’t.)

Background

In our lab, we mostly analyze short-read sequences of RNA viruses, though some of our methods pick up lots of other things too. The samples we work with are occasionally from other sources, like ticks, but are usually from humans—as a lab, we’ve sequenced everything from plasma to nasal swabs to brains. Because most of the genetic material in the sample is going to belong to the host (and we almost never use and usually aren’t even allowed to collect the host’s genetic material), the host genome and all other DNA in the sample is removed before sequencing through DNA depletion. The RNA in the sample is then reverse transcribed to DNA, which is then sequenced either by us or by the Broad Genomics Platform. Members of the lab have explored long-read sequencing, and we can also access Sanger sequencing, but nearly all of the sequencing we do generates short reads.

Historically, most of the sequencing our lab has done has been metagenomic sequencing. Metagenomic sequencing is unbiased—it will pick up everything in the sample, and what you see in the reads produced by sequencing is more or less what you had in your sample, including relative proportions of the species or variants that you see.

Starting with the Zika project published in 2017, our lab has also used amplicon sequencing. In amplicon sequencing, primers based on known sequence from the target organism’s genome are used to amplify molecules from the target organism using PCR. Unlike metagenomic sequencing, amplicon sequencing is not unbiased: if everything goes right, the only thing sequenced is the target organism. The Zika team used amplicon sequencing to amplify Zika because Zika appeared in very small quantities and was difficult to sequence. Now, amplicon sequencing is also used as a cost-saving measure, since money is not spent sequencing anything other than the target organism: almost all SARS-CoV-2 sequencing, for example, is amplicon sequencing. The SARS-CoV-2 genome is a lot longer than the short reads generated by short read sequencing; therefore, many primers are tiled to generate many short, overlapping sequences that can then be assembled into the whole SARS-CoV-2 genome.

As a compromise between unbiased metagenomic sequencing and biased amplicon sequencing, our lab also uses hybrid capture, which uses probes based on conserved regions from a predetermined set of all our human viruses of interest. When it comes to the set of organisms included in the hybrid capture probes (as long as the genomes in the sample are not too diverged from the probe sequences), the method picks up everything. When it comes to anything else, hybrid capture gets rid of it. Like amplicon sequencing, hybrid capture can allow us to sequence organisms that are present in very small quantities.

To save money, we sometimes pool samples by barcoding them, a process in which molecules from a sample are marked with barcodes unique to that sample, allowing us to mix multiple samples together and sequence them for the price of one. Barcoding also decreases contamination by allowing us to identify reads that did not come from the sample we are analyzing. We also spike samples with artificial spike-in molecules to help us detect contamination.

The actual output of the sequencer is not human-readable, and it also needs to go through a lot of work before it is ready for analysis. Reads from spike-ins need to be identified and catalogued. If samples were pooled, they need to be computationally disentangled using the unique barcodes that were used to mark each sample. (Because we often use barcodes even when the samples are not pooled, this process likely still needs to happen.) Barcodes are in the actual sequence of the reads; once they have served their purpose, they need to be computationally removed. The primers used for amplicon sequencing are also in the actual sequence of the reads; they also need to be computationally identified and removed. In our lab, all of this work is automatically done by viral-ngs.

viral-ngs also includes many computations that are necessary for almost all projects, such as aligning reads to a reference genome, assembling genomes from reads using an existing genome as a scaffold, assembling contigs without a scaffold, quantifying and plotting read depth across the genome, identifying within-host variation within a sample, and running tools like kraken or blast to identify, in bulk, the potentially many species a sample’s reads or contigs belong to.

Introduction

Our lab’s sequence analysis toolkit and codebase, called viral-ngs, is run through Terra, a smart, no-code interface for launching analyses on any number of samples.

viral-ngs is our lab’s collection of frequently used, modular, generalizable computational workflows. My three favorite workflows are:

  • demux_plus, which takes you directly from the non-human-readable output of your sequencing run to completion of some frequently used analyses and files ready to input into any other analyses;
  • assemble_refbased, which performs reference-based assembly;
  • and align_and_plot, which aligns reads to a reference genome and generates a visualization of coverage across the genome.

Every workflow consists of tasks, which are modular processes that can be put together like Legos. (Each task might appear in multiple workflows.) The tasks are where the action actually happens; workflows simply glue tasks together. Some workflows, such as demux_plus, incorporate many software tools that labs usually run individually, one after the other, for each dataset. Having frequently-used processes and tools compiled into one tool allows us to save time, minimize bugs, adapt to change quickly, and standardize our processes across projects. Many of these tools, such as SAMtools and MAFFT, are created and maintained by other groups; some of the tools and code are created by members of our lab, especially Danny and Chris TT.

All of the viral-ngs workflows are publicly available on the Dockstore, where you can access our workflows as well as others.

Terra is a wrapper around viral-ngs. Terra allows us to run workflows in the cloud through an intuitive interface (instead of, say, the terminal), handles launching the virtual machines that actually run the code, and allows us to run an analysis on many samples at once. Terra doesn’t actually contain any of the viral-ngs code; it just runs it for us. In addition to being easy to use, Terra is smart and cost-saving: if some or all of your workflow has already been run on the inputs you provided, it will swap in the outputs from last time instead of redoing the work.

Terra is especially powerful because of its use of data tables: you can organize your data into a data table with, for example, one row per sample, with each column corresponding to a particular file or value related to your samples (you might have a column for sample readcount, for example, or a column linking to each sample’s assembled genome). Instead of running an analysis on each of your samples individually, you can run that analysis on some or all of the samples in your data table all at once, specifying the inputs by column name, and Terra will add the outputs of the analysis to your data table as new columns.
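Concretely, a Terra data table can be uploaded as a tab-separated file whose first column header names the entity type (entity:sample_id for a table of samples). The other column names, sample IDs, and bucket paths below are invented for illustration:

```
entity:sample_id	readcount	assembly_fasta
sample_001	4532	gs://my-bucket/assemblies/sample_001.fasta
sample_002	876	gs://my-bucket/assemblies/sample_002.fasta
```

When you launch a workflow on rows of this table, you point each workflow input at a column (for example, this.assembly_fasta), and Terra writes the workflow’s outputs back into the table as new columns.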

The actual data lives and is analyzed in the cloud, in the Google Cloud Platform (GCP). Files in GCP have addresses that start with gs://. You can view a file in its home in GCP by clicking on it in Terra. If you ever need to, you can upload your own files to GCP as well, and use them as inputs to workflows on Terra by providing their gs:// addresses or adding the gs:// addresses to data tables. An important heads up: if a file changes but its filepath (the directory it lives in and its name) doesn’t change, Terra will not know that the file changed. Terra itself never modifies files, but if you modify a file yourself, make sure to re-upload it with a new filename or filepath.

Getting started

Terra is located at terra.bio. Log in with your Broad Google account. You will need to be added to a workspace, or you can create a new workspace (you will need a cost object—I usually ask Danny).

Video walk-throughs

Pathogen Genome Assembly using the Terra Platform

This short, 20-minute training video includes:

  • An introduction to Terra
  • An introduction to the Google Cloud Platform (GCP) where our data lives
  • How to put together data tables to run analyses on many samples
  • How to launch workflows to analyze your data
  • How to find additional workflows in the Dockstore
  • How to access your input and output files in GCP

This video is part of the WARN-ID Genomic Epidemiology Training videos made by our WARN-ID consortium (orchestrated by lab alumnus Shirlee Wohl). The rest of the videos are also excellent.

Terra 101

This much longer, two-day training workshop delves into the above in far more detail. At the end of this workshop, you will be an expert. All slides and materials you might need for this workshop are available here.

Day 1:

Day 2:

This video is part of Introduction to Terra: A scalable platform for biomedical research, a recent DS-I Africa training workshop put together by the Broad Data Science Platform (DSP).

DSP leads introduction-to-Terra workshops (as well as more advanced workshops and workshops on other topics); there may well be an upcoming training that interests you.

More video walk-throughs:

Bonus: writing your own tasks and workflows and running them in Terra

If you would like to branch out from what is available in viral-ngs, this training from Broad DSP goes over how to write your own tasks and workflows in the WDL programming language and run them in Terra. It is far easier than I thought it would be! All slides and materials you might need for this workshop are available here. The full workshop lives here.

Once you are comfortable writing a basic WDL workflow and running it on Terra, you can explore more of WDL’s capabilities through code snippet examples on the OpenWDL GitHub:

Resources

  • All viral-ngs workflows are described and documented in the viral-ngs readthedocs
  • The Terra team has put together guides to running some key viral-ngs workflows in Terra
  • You can view (and contribute to) the code behind the viral-ngs workflows at the viral-pipelines GitHub repository:
  • Terra’s documentation contains many how-to guides and answers to frequently asked questions
  • Once you have google cloud platform file access set up on your computer, you can use my script download_files.pl to download gs:// files in bulk. Simply download the script, open your terminal, type

    perl [script directory/]download_files.pl [file with list of files to download] [optional output directory]

    and hit enter, replacing [script directory/] with the directory containing download_files.pl (for example, Downloads/), [file with list of files to download] with the filepath of a text file containing a list of gs:// addresses of the files you’d like to download, one per line, and, optionally, [optional output directory] with the directory to download the files into. For example:
    perl Downloads/download_files.pl my_fun_directory/my_fun_list_of_files.txt
  • Similarly, you can use my script download_files_listed_in_table_column.pl to download all files listed in a column of a table (for example a data table that you have downloaded from Terra). Simply download the script, open your terminal, type

    perl [script directory/]download_files_listed_in_table_column.pl [table with list of files to download in one of the columns] "[title of column containing filepaths to download]"

    and hit enter, replacing [script directory/] with the directory containing download_files_listed_in_table_column.pl (for example, Downloads/), [table with list of files to download in one of the columns] with the filepath of your data table, and [title of column containing filepaths to download] with the name of the column containing the gs:// addresses of the files you’d like to download.

Getting started using regular expressions for pattern matching and smart find and replace

This fall I realized that a lot of the knowledge that I at some point during my PhD started taking for granted was not, in fact, universal. One of my favorite things about MIT was that almost no information lived exclusively in someone’s brain; all of it was somewhere accessible online. In that spirit I started putting together written guides to whatever I happen to know, initially intended for anyone in our lab and now intended also for you. I’m not reinventing the wheel, not making anything new, just pointing newcomers to resources they can use to get started.

This guide is the first. Eventually, all the getting started guides will be here.

Thank you to my mentors in the Makova and Page labs, Melissa A. Wilson and Daniel Winston Bellott, for teaching me all of this when I first started out in computational biology.

Contents

This document should take about an hour. Don’t worry about memorizing anything! Google everything you need when you need it.

Background

Regular expressions (nicknamed regex) are an extremely powerful (and in my opinion vastly underused) tool for pattern matching and find and replace. Here are some things I have recently used regex for:

  • changing the contents of a column to be more useful to me—for example, changing a column with values like 21I (Delta), 21J (Delta), 21K (Omicron), 21L (Omicron), and so on to instead have values like Delta and Omicron
  • changing the format of dates from 10/5/2022 to 2022-10-05
  • deleting the sixth column from a table
  • creating a new table column depending on the contents of another column—for example, making a sample names column from a file paths column
  • scraping a web site to turn messy tables represented by html into simple tab-separated tables in a text file
  • scraping a web page to retrieve the first image appearing in each web page it links to
  • parsing the output of a script someone else wrote to determine when a Global Entry interview frees up (so that I could run it every five seconds and make my computer beep when an interview appears). (It worked—I got an interview!)

Regular expressions can be intimidating. Here’s one I wrote recently, to add missing tabs (and, with them, missing empty cells) to the last column of a tab-separated table:

^(([^\t]+?\t){29}[^\t]*?)$

I can’t read it just by glancing at it. But if I move through slowly, character by character, left to right, it makes sense. (Luckily, you will probably spend a lot more time writing regular expressions than reading them.)
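If you’d like to see the same idea at a readable scale, here is a sketch in Python’s re module with the quantifier shrunk from {29} to {2}, i.e. a toy table that should have four tab-separated columns instead of thirty (the table contents are made up):

```python
import re

# Toy version of the pattern above: {2} instead of {29}, i.e. a table that
# should have four tab-separated columns rather than thirty.
pattern = r"^(([^\t]+?\t){2}[^\t]*?)$"

table = "a\tb\tc\td\ne\tf\tg\n"  # the second row is missing its last cell

# Rows with exactly three cells match, and \1\t appends a tab to create an
# empty fourth cell; complete four-cell rows contain an extra tab, so the
# pattern fails on them and they pass through untouched.
fixed = re.sub(pattern, r"\1\t", table, flags=re.M)

print(repr(fixed))
```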

Getting started with BBEdit

BBEdit is my favorite text editor. BBEdit lets you do regex find and replace. I also use it to code in. Install it here (scroll down to “How can I get BBEdit 14?” for download options).

You can get a lot done just with regex find and replace in BBEdit. But if you like to code, most programming languages use the same or a very similar regular expression format. What you learn here will apply no matter how you like to code.

Getting started with regex101.com

If you’d prefer to get started even faster without downloading anything, go to regex101.com and check Python and Substitution on the left.

In addition to testing/running your regular expressions, regex101.com also has helpful explanations of all the parts of your regular expression and highlights errors.

Getting started with regular expressions

We’ll start with a few quick examples.

Example 1

Here is an example table. Copy/paste it into BBEdit or regex101.com:

patient	Ct	is_food_handler	collection_date	species	is_virus	readcount	genome_covered
Patient 1	21.5	TRUE	2022-05-12	SARS-CoV-2	TRUE	435	0.19
Patient 2	17.4	FALSE	2022-05-11	SARS-CoV-2	TRUE	2346	0.97
Patient 3	24.8	FALSE	2022-05-19	SARS-CoV-2	TRUE	87	0.05
Patient 4	23.9	TRUE	2022-05-12	SARS-CoV-2	TRUE	76	0.10
Patient 5	21.3	FALSE	2022-05-13	SARS-CoV-2	TRUE	675	0.20
Patient 6	16.8	FALSE	2022-05-09	SARS-CoV-2	TRUE	4532	0.99
Patient 7	27.9	FALSE	2022-05-07	SARS-CoV-2	TRUE	876	0.23
Patient 1	21.5	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	E. coli	FALSE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	E. coli	FALSE	4324	0.14
Patient 4	23.9	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	E. coli	FALSE	19	0.02
Patient 6	16.8	FALSE	2022-05-09	E. coli	FALSE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	E. coli	FALSE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Influenza A	TRUE	65	0.04
Patient 2	17.4	FALSE	2022-05-11	Influenza A	TRUE	4363	0.95
Patient 3	24.8	FALSE	2022-05-19	Influenza A	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	Influenza A	TRUE	312	0.12
Patient 5	21.3	FALSE	2022-05-13	Influenza A	TRUE	0	0.00
Patient 6	16.8	FALSE	2022-05-09	Influenza A	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Influenza A	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	Hepatitis C	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	Hepatitis C	TRUE	65	0.08
Patient 4	23.9	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	Hepatitis C	TRUE	6457	0.65
Patient 6	16.8	FALSE	2022-05-09	Hepatitis C	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Hepatitis C	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	HIV	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	HIV	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	HIV	TRUE	865	0.34
Patient 6	16.8	FALSE	2022-05-09	HIV	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	HIV	TRUE	654	0.26

Hit command F to bring up BBEdit’s find and replace dialogue. Enter the following (and click Replace All if you are in BBEdit):

FIND:    \t
REPLACE: ,

As long as your table is simple, your tabs should all be replaced by commas and your table should now be a comma-separated table. (If your table is not simple, you’ll want something more like this or this.)
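If you like to code, the same find and replace can be done with Python’s re module (the two-row table here is a made-up stand-in for the full one above):

```python
import re

# A two-row, made-up stand-in for the table above, with literal tab characters.
table = "patient\tCt\nPatient 1\t21.5\n"

# FIND: \t   REPLACE: ,
csv_table = re.sub(r"\t", ",", table)

print(csv_table)
```

For a fixed string like a tab, plain table.replace("\t", ",") works too; re.sub only becomes necessary once the pattern is more than a literal.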

Example 2

Let’s pretend that we want all numbers to be whole numbers, rounded down. Let’s find all digits following a decimal point and remove them:

FIND:    \.\d+
REPLACE:

If you’re in BBEdit, make sure that the Grep option is checked.

Breaking it down, here’s what that FIND is saying and how it would parse the input 27.9:

\.	match a period—the period is a special character, so it is escaped with a \
\d+	match at least one numerical digit (0-9) in a row
\.\d+	match a period followed by at least one numerical digit (0-9) in a row.

Replace is empty, so when you click Replace All, matched values will simply be deleted.
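The equivalent in Python is a re.sub with an empty replacement string, shown here on a single made-up value:

```python
import re

# FIND: \.\d+   REPLACE: (empty) -- strip the decimal point and the digits after it
truncated = re.sub(r"\.\d+", "", "27.9")
print(truncated)  # 27
```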

Video walkthrough

Here is a very thorough walkthrough of all the potential pieces of a regular expression and how you can put them together to match any pattern you want:

You might notice that there are small differences in how regular expressions are written in different programming languages. BBEdit, for example, uses grep (as does the command line in Unix), where matched values are written as \1, \2, \3, and so on. Python uses the same syntax. In this video, matched values are instead written as $1, $2, $3, and so on. If you switch between programming languages, you’ll catch this difference easily. The general principles and the things that matter stay the same. You can play with regular expressions in different languages in regex101.com.

More examples

Example 3

More with the above example table, which you can open in BBEdit or copy/paste it into regex101.com.

Let’s change the format of all of the dates. Right now, all of the dates look like this: 2022-05-12. Let’s make them look like this: 05/12/2022:

FIND:    (\d+)-(\d+)-(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

(\d+)	match at least one numerical digit (0-9) in a row and save it
-	match a dash and don’t save it
(\d+)-(\d+)-(\d+)	match at least one numerical digit in a row (2022) and save it, then a dash (and don’t save it), then at least one numerical digit in a row (05) and save it, then a dash (and don’t save it), and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

\2	the second thing we saved (05)
/	a slash
\3	the third thing we saved (02)
/	a slash
\1	the first thing we saved (2022)
\2/\3/\1	the second thing we saved (05), then a slash, then the third thing we saved (02), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 05/02/2022, and all other dates will each individually be processed the same way.
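In Python, the saved groups are referenced as \1, \2, \3 in the replacement string, just as in BBEdit:

```python
import re

# FIND: (\d+)-(\d+)-(\d+)   REPLACE: \2/\3/\1
reformatted = re.sub(r"(\d+)-(\d+)-(\d+)", r"\2/\3/\1", "2022-05-02")
print(reformatted)  # 05/02/2022
```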

Example 4

Let’s change the format of all the dates again in the same way, but this time let’s also get rid of the leading 0s. Click undo (command Z) and try this instead:

FIND:    0*(\d+)-0*(\d+)-0*(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

0*	match any zeros in a row, if they are there, and match nothing if they aren’t—we could match nothing, or a 0, or 00, or 000, or 0000, and so on; this part is not inside parentheses, so we do not save it
(\d+)	match at least one numerical digit (0-9) in a row and save it
-	match a dash and don’t save it
0*(\d+)-0*(\d+)-0*(\d+)	match any zeros in a row (or match nothing if there aren’t any 0s) and don’t save it, then at least one numerical digit in a row (2022) and save it, then a dash and don’t save it, then any or no zeros in a row and don’t save it, then at least one numerical digit in a row (05) and save it, then a dash, then any or no zeros in a row and don’t save it, and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

\2	the second thing we saved (5)
/	a slash
\3	the third thing we saved (2)
/	a slash
\1	the first thing we saved (2022)
\2/\3/\1	the second thing we saved (5), then a slash, then the third thing we saved (2), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 5/2/2022, and all other dates will independently be processed the same way. Notice that the leading 0s are not included, because we wrote them into the regular expression without saving them between parentheses.
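The same substitution in Python, for anyone following along in code:

```python
import re

# FIND: 0*(\d+)-0*(\d+)-0*(\d+)   REPLACE: \2/\3/\1
# The unsaved 0* consumes each leading zero before the capture group starts.
reformatted = re.sub(r"0*(\d+)-0*(\d+)-0*(\d+)", r"\2/\3/\1", "2022-05-02")
print(reformatted)  # 5/2/2022
```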

Cheat sheets

Don’t try to memorize anything! Once you know the general principles, you can use a regex cheat sheet to build any regular expression you need.

Here are two that I like.

As before, you might notice that there are small differences in how regular expressions are written in different programming languages. In the first cheat sheet, matched values are written as \1, \2, \3, and so on. In the second cheat sheet, matched values are instead written as $1, $2, $3, and so on. $1, $2, $3, and so on will work in some situations, but not in BBEdit or the command line. If you need to switch programming languages, this will be a difference you will catch easily (because it won’t work).

  • First cheat sheet: https://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
  • Second cheat sheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/