This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.
Thank you to Danny Park and Kat DeRuff in our lab for pointing me to the excellent resources that are linked here.

Contents
This document should take less than an hour. (If you would like, you can dedicate a day to become a true expert—but I didn’t, and you too will be perfectly fine if you don’t.)
- Background
- Introduction
- Getting started
- Video walk-throughs
- Bonus: writing your own tasks and workflows and running them in Terra
- Resources
Background
In our lab, we mostly analyze short-read sequences of RNA viruses, though some of our methods pick up lots of other things too. The samples we work with are occasionally from other sources, like ticks, but are usually from humans—as a lab, we’ve sequenced everything from plasma to nasal swabs to brains. Because most of the genetic material in the sample will belong to the host (and we almost never use, and usually aren’t even allowed to collect, the host’s genetic material), the host genome and all other DNA in the sample are removed before sequencing through DNA depletion. The RNA in the sample is then reverse transcribed to DNA, which is then sequenced either by us or by the Broad Genomics Platform. Members of the lab have explored long-read sequencing, and we can also access Sanger sequencing, but nearly all of the sequencing we do generates short reads.
Historically, most of the sequencing our lab has done has been metagenomic sequencing. Metagenomic sequencing is unbiased—it will pick up everything in the sample, and what you see in the reads produced by sequencing is more or less what you had in your sample, including relative proportions of the species or variants that you see.
Starting with the Zika project published in 2017, our lab has also used amplicon sequencing. In amplicon sequencing, primers based on known sequence from the target organism’s genome are used to amplify molecules from the target organism using PCR. Unlike metagenomic sequencing, amplicon sequencing is not unbiased: if everything goes right, the only thing sequenced is the target organism. The Zika team used amplicon sequencing to amplify Zika because Zika appeared in very small quantities and was difficult to sequence. Now, amplicon sequencing is also used as a cost-saving measure, since money is not spent sequencing anything other than the target organism: almost all SARS-CoV-2 sequencing, for example, is amplicon sequencing. The SARS-CoV-2 genome is much longer than the short reads generated by short-read sequencing; therefore, many primers are tiled across the genome to generate many short, overlapping sequences that can then be assembled into the whole SARS-CoV-2 genome.
As a compromise between unbiased metagenomic sequencing and biased amplicon sequencing, our lab also uses hybrid capture, which uses probes based on conserved regions from a predetermined set of all our human viruses of interest. For the organisms included in the hybrid capture probe set (as long as the genomes in the sample are not too diverged from the probe sequences), the method picks up everything; anything else, hybrid capture gets rid of. Like amplicon sequencing, hybrid capture can allow us to sequence organisms that are present in very small quantities.
To save money, we sometimes pool samples by barcoding them, a process in which molecules from a sample are marked with barcodes unique to that sample, allowing us to mix multiple samples together and sequence them for the price of one. Barcoding also decreases contamination by allowing us to identify reads that did not come from the sample we are analyzing. We also spike samples with artificial spike-in molecules to help us detect contamination.
The actual output of the sequencer is not human-readable, and it also needs to go through a lot of work before it is ready for analysis. Reads from spike-ins need to be identified and catalogued. If samples were pooled, they need to be computationally disentangled using the unique barcodes that were used to mark each sample. (Because we often use barcodes even when the samples are not pooled, this process likely still needs to happen.) Barcodes are in the actual sequence of the reads; once they have served their purpose, they need to be computationally removed. The primers used for amplicon sequencing are also in the actual sequence of the reads; they also need to be computationally identified and removed. In our lab, all of this work is automatically done by viral-ngs.
viral-ngs also includes many computations that are necessary for almost all projects, such as aligning reads to a reference genome, assembling genomes from reads using an existing genome as a scaffold, assembling contigs without a scaffold, quantifying and plotting read depth across the genome, identifying within-host variation within a sample, and running tools like Kraken or BLAST to identify, in bulk, the potentially many species a sample’s reads or contigs belong to.
Introduction
Our lab’s sequence analysis toolkit and codebase, called viral-ngs, is housed in Terra, a smart, no-code interface for running analyses on any number of samples.
viral-ngs is our lab’s collection of frequently used, modular, generalizable computational workflows. My three favorite workflows are:
- demux_plus, which takes you directly from the non-human-readable output of your sequencing run through several frequently used analyses, producing files ready to feed into any other analysis;
- assemble_refbased, which performs reference-based assembly;
- and align_and_plot, which aligns reads to a reference genome and generates a visualization of coverage across the genome.
Every workflow consists of tasks, which are modular processes that can be put together like Legos. (Each task might appear in multiple workflows.) The tasks are where the action actually happens; workflows simply glue tasks together. Some workflows, such as demux_plus, incorporate many software tools that labs usually run individually, one after the other, for each dataset. Having frequently-used processes and tools compiled into one tool allows us to save time, minimize bugs, adapt to change quickly, and standardize our processes across projects. Many of these tools, such as SAMtools and MAFFT, are created and maintained by other groups; some of the tools and code are created by members of our lab, especially Danny and Chris TT.
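To make this concrete, here is a minimal sketch of what a WDL workflow gluing tasks together looks like. The two tasks here are made-up placeholders for illustration, not actual viral-ngs tasks:

```wdl
version 1.0

# Made-up placeholder task (not part of viral-ngs): uppercase a string.
task uppercase {
  input {
    String text
  }
  command <<<
    echo "~{text}" | tr '[:lower:]' '[:upper:]'
  >>>
  output {
    String result = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

# Made-up placeholder task (not part of viral-ngs): append an exclamation point.
task add_exclamation {
  input {
    String text
  }
  command <<<
    echo "~{text}!"
  >>>
  output {
    String result = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow shout {
  input {
    String text
  }
  # The workflow itself does no computation; it just wires tasks together,
  # feeding the output of one task into the input of the next.
  call uppercase { input: text = text }
  call add_exclamation { input: text = uppercase.result }
  output {
    String shouted = add_exclamation.result
  }
}
```

The same modularity is what lets a task like read alignment appear in many different workflows without being rewritten.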
All of the viral-ngs workflows are publicly available on the Dockstore, where you can access our workflows as well as those from other groups.
Terra is a wrapper around viral-ngs. Terra allows us to run workflows in the cloud through an intuitive interface (instead of, say, the terminal), handles launching the virtual machines that actually run the code, and allows us to run an analysis on many samples at once. Terra doesn’t actually contain any of the viral-ngs code; it just runs it for us. In addition to being easy to use, Terra is smart and cost-saving: if some or all of your workflow has already been run on the inputs you provided, it will swap in the outputs from last time and won’t redo the work.
Terra is especially powerful because of its use of data tables: you can organize your data into a data table with, for example, one row per sample, with each column corresponding to a particular file or value related to your samples (you might have a column for sample readcount, for example, or a column linking to each sample’s assembled genome). Instead of running an analysis on each of your samples individually, you can run that analysis on some or all of the samples in your data table all at once, specifying the inputs by column name, and Terra will add the outputs of the analysis to your data table as new columns.
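As an illustration, a Terra data table is typically uploaded as a tab-separated file whose first column header follows Terra’s entity:&lt;table name&gt;_id convention. The sample names, readcounts, and file paths below are made up:

```tsv
entity:sample_id	readcount	assembly_fasta
sample_001	1523498	gs://my-bucket/assemblies/sample_001.fasta
sample_002	987654	gs://my-bucket/assemblies/sample_002.fasta
```

When you launch a workflow on rows of this table, you refer to inputs by column (for example, this.assembly_fasta), and Terra writes the workflow’s outputs back as new columns.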
The actual data lives and is analyzed in the cloud, in the Google Cloud Platform (GCP). Files in GCP have addresses that start with gs://. You can view a file in its home in GCP by clicking on it in Terra. If you ever need to, you can upload your own files to GCP as well, and use them as inputs to workflows on Terra by providing their gs:// addresses or adding the gs:// addresses to data tables. An important heads up: if a file changes but its filepath (the directory it lives in and its name) doesn’t change, Terra will not know that the file changed. Terra itself never modifies files, but if you modify a file yourself, make sure to re-upload it with a new filename or filepath.
Getting started
Terra is located at terra.bio. Log in with your Broad Google account. You will need to be added to a workspace, or you can create a new workspace (you will need a cost object—I usually ask Danny).
Video walk-throughs
Pathogen Genome Assembly using the Terra Platform
This short, 20-minute training video includes:
- An introduction to Terra
- An introduction to the Google Cloud Platform (GCP) where our data lives
- How to put together data tables to run analyses on many samples
- How to launch workflows to analyze your data
- How to find additional workflows in the Dockstore
- How to access your input and output files in GCP
This video is part of the WARN-ID Genomic Epidemiology Training videos made by our WARN-ID consortium (orchestrated by lab alumnus Shirlee Wohl). The rest of the videos are also excellent.
Terra 101
This much longer, two-day training workshop delves into the above in far more detail. At the end of this workshop, you will be an expert. All slides and materials you might need for this workshop are available here.
Day 1:
Day 2:
This video is part of Introduction to Terra: A scalable platform for biomedical research, a recent DS-I Africa training workshop put together by the Broad Data Science Platform (DSP).
DSP leads introduction-to-Terra workshops (as well as more advanced workshops and workshops on other topics); there may well be an upcoming training you are interested in.
More video walk-throughs:
- The Terra team has put together very helpful video tutorials in their Getting Started guide:
- Making and Uploading a Data Table in Terra
- Terra Workflows Quickstart
- Terra Notebooks Quickstart (a heads up that Notebooks have been revamped, and are now Analyses—more info here)
- The Terra team has also put together non-video tutorials on the same topics:
- Intro to data tables
- Get started running workflows
- Notebooks QuickStart Worksheet (again a heads up that Notebooks have been revamped, and are now Analyses—more info here)
- Many more tutorials and FAQs are covered in Terra’s documentation
- DSP leads trainings, including introductions to Terra. Materials from previous trainings are publicly available, and there may well be upcoming trainings you are interested in
- Introduction to Dockstore.org: An “app” store for bioinformatics workflows, another recent DS-I Africa training video, this time focused on the Dockstore, where you can access our and other workflows.
- Training videos made by our partners at Theiagen Genomics. The parts about how to get around Terra are broadly applicable, but they train on their own workflows (similar to but slightly different from viral-ngs)—the motions will be generalizable but the specifics won’t be.
Bonus: writing your own tasks and workflows and running them in Terra
If you would like to branch out from what is available in viral-ngs, this training from Broad DSP goes over how to write your own tasks and workflows in the WDL programming language and run them in Terra. It is far easier than I thought it would be! All slides and materials you might need for this workshop are available here. The full workshop lives here.
Once you are comfortable writing a basic WDL workflow and running it on Terra, you can explore more of WDL’s capabilities through code snippet examples on the OpenWDL github:
- a linear workflow, similar to what you saw in the tutorial
- a workflow with two outputs
- a scatter-gather example—a workflow that scatters tasks, or performs a task on multiple inputs, running all of them in parallel, and then gathers the outputs
- a workflow that calls tasks with an alias nickname
- a workflow that calls another workflow
- a workflow that collects output values by checking which ones match a pattern
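For instance, a scatter-gather workflow in WDL looks roughly like the sketch below. The task here is a made-up placeholder, not part of viral-ngs:

```wdl
version 1.0

# Made-up placeholder task (not part of viral-ngs): reverse a string.
task reverse_string {
  input {
    String s
  }
  command <<<
    echo "~{s}" | rev
  >>>
  output {
    String reversed = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow scatter_gather_example {
  input {
    Array[String] strings
  }
  # Scatter: run the task on every input, all in parallel.
  scatter (s in strings) {
    call reverse_string { input: s = s }
  }
  # Gather: outputs of a scattered call are automatically
  # collected into an array, one element per scatter iteration.
  output {
    Array[String] reversed_strings = reverse_string.reversed
  }
}
```

This scatter-over-inputs pattern is essentially what Terra does for you when you launch a workflow on many rows of a data table at once.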
Resources
- All viral-ngs workflows are described and documented in the viral-ngs readthedocs
- The Terra team has put together guides to running some key viral-ngs workflows in Terra
- You can view (and contribute to) the code behind the viral-ngs workflows at the viral-pipelines GitHub repository:
- Workflows are in the /pipes/WDL/workflows/ directory
- Tasks are in the /pipes/WDL/tasks/ directory
- Python scripts written by our lab (which, in addition to other software, are incorporated into tasks) are the .py files in the viral-ngs GitHub repository
- Terra’s documentation contains many how-to guides and answers to frequently asked questions
- Once you have Google Cloud Platform file access set up on your computer, you can use my script download_files.pl to download gs:// files in bulk. Simply download the script, open your terminal, type
perl [script directory/]download_files.pl [file with list of files to download] [optional output directory]
and hit enter, replacing [script directory/] with the directory containing download_files.pl (for example, Downloads/) and replacing [file with list of files to download] with the filepath of a text file containing a list of gs:// addresses of the files you’d like to download, one per line. For example:
perl Downloads/download_files.pl my_fun_directory/my_fun_list_of_files.txt
- Similarly, you can use my script download_files_listed_in_table_column.pl to download all files listed in a column of a table (for example, a data table that you have downloaded from Terra). Simply download the script, open your terminal, type
perl [script directory/]download_files_listed_in_table_column.pl [table with list of files to download in one of the columns] "[title of column containing filepaths to download]"
and hit enter, replacing [script directory/] with the directory containing download_files_listed_in_table_column.pl (for example, Downloads/), replacing [table with list of files to download in one of the columns] with the filepath of your data table, and replacing [title of column containing filepaths to download] with the name of the column containing the gs:// addresses of the files you’d like to download.