This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.
Thank you to Danny Park in our lab for introducing me to R and preempting my confusion about the difference between tables in Excel and R—a lot of this guide was verbally passed on to me from Danny when I first joined the lab; I just wrote it down.
Thank you to Lizzie Wolkovich for teaching me everything I know about modelling.

Contents
This document should take a veritable eternity to work through—but you don’t need to work through all or even most of it to do what you want to do. Take the pieces you need and when you need more pieces take those too.
- Introduction
- Getting started
- Data tables in R
- Statistical analysis
- Data visualization
- Conditionals, loops, and functions
- Avoiding (some) bugs
- Resources
Introduction
Our lab is split pretty evenly between people who use Python and people who use R for data analysis and visualization. I strongly prefer R, because it was made specifically for data analysis, but there is no wrong answer. R is especially powerful because of the added functionality of its libraries, such as dplyr and ggplot2.
If R is about to become your first programming language (!), I recommend starting off with the first part of codecademy’s Learn R course.
Getting started
Using R within Terra
If you use Terra (our lab uses Terra), you can write and run R scripts directly in Terra, and run your R scripts directly on the data you analyze in Terra. You can watch the Terra team’s 15-minute Terra Notebooks Quickstart video to learn how:
Using R on your computer
Download and install R here.
Download and install RStudio here.
RStudio allows you to write and run your code as a full script in its own file, run subsets of your code by highlighting the individual lines you are interested in and hitting Command-Enter (Ctrl-Enter on Windows), or write and run throwaway code in RStudio’s console.
Here is a short video introduction to RStudio:
You can also follow University of British Columbia STAT 545’s guide: Install R and RStudio.
Packages/libraries
Package and library are terms used pretty much interchangeably in R. You can install a package, in this example a package named plyr, by entering the following command in RStudio’s console, substituting your own package of interest:
install.packages("plyr")
You only need to install a package once. After you have installed it, you can load a package by adding the following line to your script anywhere before the code that will use the package, again using the package plyr as an example:
library(plyr)
You can learn more about packages/libraries in this tutorial.
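One small convenience: install.packages also accepts a character vector, so you can install several packages in one command. For example:
install.packages(c("plyr", "ggplot2"))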
I have found the following packages helpful:
- tidyverse: a collection of helpful packages bundled together, including dplyr, ggplot2, tidyr, and stringr below
- data visualization:
- ggplot2: the main package used for data visualization in R
- ggpubr: customization add-ons for ggplot2
- scales: customize axes and legends
- patchwork: put multiple plots together in an aligned arrangement
- grid and gridExtra: another option for arranging multiple plots into a layout
- data manipulation and summarization:
- dplyr: a powerful library for manipulating and operating on tables of data
- plyr: the previous iteration of dplyr
- tidyr: rearranging data tables
- stringr: working with strings
- other:
- beepr: beeps when your code is done running
Data tables in R
Differences between tables in R and Excel
The thing that most confused me when I started coding in R is that data tables in R are different from the tables I was used to working with in Excel.
If you’re used to working in Excel, you might be accustomed to tables that look like this one. In this table, each row is a patient. The patient name appears in the first column, and each of the remaining columns corresponds to a species. Each cell contains the number of reads that were mapped to that species (its column) in that patient (its row).

For input into R, this table would look quite different. Every readcount cell in the Excel version of the table gets its own row in the R version of the table.

The R data table format allows you to add more columns to each patient-species pair, which in the Excel table format would require you to make a whole new table. Notice that some of the columns refer to the patient (patient, is_food_handler, collection_date, Ct) while other columns refer to the species (species, is_virus) and other columns refer to the patient-species pair (readcount, genome_covered):

Each column is a variable, and the two terms can be used interchangeably when talking about data tables.
(This data is clearly from an experiment mixing different sample types from each patient into a single sample.)
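If it helps to see one being built from scratch, here is a minimal sketch of a table in this long format, with invented values (the real example table used below has more rows and columns):
my_table_sketch <- data.frame(
    patient = c("patient_1", "patient_1", "patient_2"),  # invented patient names
    species = c("HIV", "SARS-CoV-2", "HIV"),
    readcount = c(0, 2346, 865)                          # invented readcounts
)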
My example code will use this table. If you would like to follow along, download it here.
Reading in and saving data tables
To read in a comma-separated table:
input_table_path <- "/Users/lakras/my_input_file.csv"
my_table <- read.csv(input_table_path)
or:
my_table <- read.table(input_table_path, sep=",", header=TRUE)
To read in a tab-separated table:
input_table_path <- "/Users/lakras/my_input_file.tsv"
my_table <- read.delim(input_table_path)
or:
my_table <- read.table(input_table_path, sep="\t", header=TRUE)
To save a table as a comma-separated output table:
output_table_path <- "/Users/lakras/my_output_file.csv"
write.table(my_table, file=output_table_path, quote=FALSE, sep=',', row.names=FALSE)
To save a table as a tab-separated output table:
output_table_path <- "/Users/lakras/my_output_file.tsv"
write.table(my_table, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)
Viewing and manipulating data tables
Exploring your data table and its columns
You can view or access the full data table by simply entering its name:
my_table
View a summary of the variables (columns) in your data table, their data types (int, num, chr, logi, factor, Date), and the first few values using the str function:
str(my_table)
You can view or access the individual variables (columns) of your data table by entering the table name followed by $ followed by the column name:
my_table$patient
my_table$species
my_table$readcount
my_table$genome_covered
my_table$is_virus
my_table$Ct
my_table$is_food_handler
my_table$collection_date
You can view all values of a variable (column) and the number of rows with each value using the table function:
table(my_table$species)
You can view a subset of variables (columns) using the select function from the dplyr package:
select(my_table, patient, species, genome_covered)
You can view the first few rows of a data table using the head function or the last few rows using the tail function:
head(my_table)
tail(my_table)
Subsetting your data table
You can use the subset function to view a subset of your data table. Perhaps we want to view the subset of our rows related to Hepatitis C:
subset(my_table, species == "Hepatitis C")
You can save the subset, or set the data table to be a subset of itself. Perhaps we are only interested in keeping the subset of our rows with readcount ≥100:
my_table <- subset(my_table, readcount >= 100)
Creating new columns
To make a new column with all values set to TRUE:
my_table$is_human <- TRUE
To make a new column by duplicating another column:
my_table$readcount_2 <- my_table$readcount
To make a new column by multiplying another column by a scalar:
my_table$percent_genome_covered <- my_table$genome_covered * 100
Conditional viewing and manipulating column values
You can view values of a column including only those rows that meet a certain condition. For example, you can view all readcounts ≥100:
my_table$readcount[my_table$readcount >= 100]
You can also view all Ct values from rows with readcount ≥100:
my_table$Ct[my_table$readcount >= 100]
In addition to viewing column values, you can also modify them. For example, you can set all readcounts <100 to NA:
my_table$readcount[my_table$readcount < 100] <- NA
You can also modify a column that is different from the column you are testing in your condition, and the column you modify does not have to already exist. You can create a new variable with values conditional on another column’s values:
my_table$passes_threshold[my_table$readcount >= 100] <- TRUE
my_table$passes_threshold[my_table$readcount < 100] <- FALSE
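As an aside, you can often do the same thing in one vectorized step; here is a sketch of two equivalent base R approaches (readcount_category is an invented column name for illustration):
# a logical comparison already returns TRUE/FALSE (and NA for NA readcounts)
my_table$passes_threshold <- my_table$readcount >= 100
# ifelse is handy when the two outcomes are not simply TRUE and FALSE
my_table$readcount_category <- ifelse(my_table$readcount >= 100, "high", "low")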
Data types
Here are some of the data types you will encounter and use:
- chr: a text value
- factor: a category
- int: an integer (no decimal places)
- num: a number (with decimal places)
- logi: TRUE or FALSE (logi is short for logical)
- Date: a date
The chr data type is a text value. This data type might be useful for text values that are unique to each row, for example patient name, sample name, or hospital address. In some cases, text values (or other kinds of values) might reappear from row to row and be used to group rows together, for example state of residence, presence of coughing, sample type, or patient gender. These values should be factors.
You can modify the data type of your variable.
To make a variable’s values into text values:
my_table$patient <- as.character(my_table$patient)
To make a variable’s values into categories:
my_table$species <- as.factor(my_table$species)
To make a variable’s values into integers (no decimal places):
my_table$readcount <- as.integer(my_table$readcount)
To make a variable’s values into numerical values (with decimal places):
my_table$Ct <- as.numeric(my_table$Ct)
my_table$genome_covered <- as.numeric(my_table$genome_covered)
To make a variable’s values into TRUE and FALSE values:
my_table$is_food_handler <- as.logical(my_table$is_food_handler)
my_table$is_virus <- as.logical(my_table$is_virus)
To make a variable’s values into dates:
my_table$collection_date <- as.Date(my_table$collection_date)
Special values
Some special values are NA (value not available), NaN (not a number), Inf (infinity), and -Inf (negative infinity). You will need to keep these values in mind and check for them in many analyses.
You can check if a value (my_value in this case) is one of these values:
my_value <- 5
is.na(my_value)
is.nan(my_value)
is.infinite(my_value)
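These values usually arise from computations or failed conversions; a quick demonstration:
0 / 0                        # NaN
1 / 0                        # Inf
-1 / 0                       # -Inf
log(0)                       # -Inf
as.numeric("not a number")   # NA (with a warning)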
Statistical analysis
Summary statistics
You can view basic summary statistics for a variable.
Minimum, maximum, and range:
min(my_table$genome_covered, na.rm=TRUE)
max(my_table$genome_covered, na.rm=TRUE)
range(my_table$genome_covered, na.rm=TRUE)
Mean and standard deviation:
mean(my_table$genome_covered, na.rm=TRUE)
sd(my_table$genome_covered, na.rm=TRUE)
Median, first quartile (the 25th percentile of your data), third quartile (the 75th percentile of your data), and interquartile range (the middle 50% of your data, or the distance between the 25th percentile and the 75th percentile):
median(my_table$genome_covered, na.rm=TRUE)
quantile(my_table$genome_covered, 1/4, na.rm=TRUE)
quantile(my_table$genome_covered, 3/4, na.rm=TRUE)
IQR(my_table$genome_covered, na.rm=TRUE)
na.rm=TRUE excludes NA and NaN values from your calculations of the mean, median, min, max, and so on (NaN values count as NA for these functions, so there is no separate argument for them).
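As a shortcut, base R’s summary function reports several of these at once: the minimum, first quartile, median, mean, third quartile, maximum, and the count of NA values:
summary(my_table$genome_covered)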
You can generate a table of summary statistics grouped by values of a variable using the ddply function of the plyr library. For example, we might calculate the minimum, maximum, and mean of readcounts for each species separately, saving them in a new data table we name readcount_summary:
readcount_summary <- ddply(my_table, ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount))
The output looks like this:
| | species | min_readcount | max_readcount | mean_readcount |
|---|---|---|---|---|
| 1 | E. coli | 0 | 4324 | 620.4286 |
| 2 | Hepatitis C | 0 | 6457 | 931.7143 |
| 3 | HIV | 0 | 865 | 217.0000 |
| 4 | Influenza A | 0 | 4363 | 677.1429 |
| 5 | SARS-CoV-2 | 76 | 4532 | 1289.5714 |
Or we might calculate the minimum, maximum, and mean of both readcounts and Cts, grouping rows not only by species, but also by whether or not the patient is a food handler and whether or not at least half the genome is covered:
readcount_summary <- ddply(my_table, genome_covered >= 0.50 ~ is_food_handler ~ species, summarise,
    min_readcount=min(readcount),
    max_readcount=max(readcount),
    mean_readcount=mean(readcount),
    min_Ct=min(Ct),
    max_Ct=max(Ct),
    mean_Ct=mean(Ct))
That output looks like this:
| | genome_covered >= 0.5 | is_food_handler | species | min_readcount | max_readcount | mean_readcount | min_Ct | max_Ct | mean_Ct |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FALSE | FALSE | E. coli | 0 | 4324 | 868.60 | 16.8 | 27.9 | 21.64000 |
| 2 | FALSE | FALSE | Hepatitis C | 0 | 65 | 16.25 | 16.8 | 27.9 | 21.72500 |
| 3 | FALSE | FALSE | HIV | 0 | 865 | 303.80 | 16.8 | 27.9 | 21.64000 |
| 4 | FALSE | FALSE | Influenza A | 0 | 0 | 0 | 16.8 | 27.9 | 22.70000 |
| 5 | FALSE | FALSE | SARS-CoV-2 | 87 | 876 | 546.00 | 21.3 | 27.9 | 24.66667 |
| 6 | FALSE | TRUE | E. coli | 0 | 0 | 0 | 21.5 | 23.9 | 22.70000 |
| 7 | FALSE | TRUE | Hepatitis C | 0 | 0 | 0 | 21.5 | 23.9 | 22.70000 |
| 8 | FALSE | TRUE | HIV | 0 | 0 | 0 | 21.5 | 23.9 | 22.70000 |
| 9 | FALSE | TRUE | Influenza A | 65 | 312 | 188.50 | 21.5 | 23.9 | 22.70000 |
| 10 | FALSE | TRUE | SARS-CoV-2 | 76 | 435 | 255.50 | 21.5 | 23.9 | 22.70000 |
| 11 | TRUE | FALSE | Hepatitis C | 6457 | 6457 | 6457.00 | 21.3 | 21.3 | 21.30000 |
| 12 | TRUE | FALSE | Influenza A | 4363 | 4363 | 4363.00 | 17.4 | 17.4 | 17.40000 |
| 13 | TRUE | FALSE | SARS-CoV-2 | 2346 | 4532 | 3439.00 | 16.8 | 17.4 | 17.10000 |
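You can produce the same kind of grouped summary with dplyr itself; here is a minimal sketch of the first summary above using group_by and summarise (note that if you load both plyr and dplyr, load plyr first to avoid masking problems):
library(dplyr)
readcount_summary <- my_table %>%
    group_by(species) %>%
    summarise(min_readcount = min(readcount),
              max_readcount = max(readcount),
              mean_readcount = mean(readcount))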
dplyr is extremely powerful, and can do a lot more than summarize. You can learn more about dplyr by exploring these tutorials:
- The Data aggregation tutorial of the STAT 545 course at the University of British Columbia
- Introduction to dplyr for Faster Data Manipulation in R (also available in video)
- The official introduction to dplyr
Statistical tests
R has many statistical tests built in. Identify the one that is most relevant to your data, figure out how to run it in R, and interpret your p-value or other preferred statistical test output. Here is a guide from r-statistics.co.
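As one illustration (not a recommendation; the right test depends on your data and question), here is how comparing Ct values between food handlers and everyone else in the example table might look, with a two-sample t-test and its non-parametric cousin:
t.test(Ct ~ is_food_handler, data = my_table)       # parametric
wilcox.test(Ct ~ is_food_handler, data = my_table)  # non-parametric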
Modelling
Modelling allows you to estimate the strength of a correlation and to compare the individual impacts of the variables affecting an outcome variable.
First, you should get an introduction to selecting the appropriate kind of model and writing the code to generate it. datacamp.com has an interactive course on modelling in R, in which you write the code yourself and get automated feedback on it. If you prefer just reading, r-statistics.co has a guide (a minimal sketch of the first two model types follows this list):
- Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
- Logistic Regression (for binary outcome variables)
- Multinomial Regression (for categorical outcome variables)
- Other regression models, if you need them
- Model Selection Approaches
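As promised above, here is a minimal sketch of the first two model types using the example table; the choice of predictors is arbitrary and purely illustrative, and half_genome_covered is an invented helper column:
# linear regression: a numerical outcome (here, Ct)
ct_model <- lm(Ct ~ readcount + is_food_handler, data = my_table)
summary(ct_model)

# logistic regression: a binary outcome (here, whether at least half the genome is covered)
my_table$half_genome_covered <- my_table$genome_covered >= 0.50
coverage_model <- glm(half_genome_covered ~ readcount + is_virus,
                      data = my_table, family = binomial)
summary(coverage_model)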
Next, you need to understand what your model is telling you. I learned everything I know about modelling from Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. This book is thorough, clear, correct, and readable, unlike most things. The pdf is (legally) available for free online. The following are the guides to interpreting your models that I have bookmarked (most of these sections are around 3-5 pages):
- Interpreting your models:
- Linear models:
- Logistic models for binary outcome variables:
- Other kinds of models, as needed:
- Judging your model:
- Transforming your data to make your coefficients interpretable:
The whole book is excellent. Once you are comfortable selecting and interpreting models, I recommend reading the following chapters:
- Study design:
- Dealing with missing data:
- Causal inference:
Data visualization
Scatterplots, bar plots, histograms, and other standard figures
Follow this tutorial from r-statistics.co to learn how to use ggplot2 for data visualization. It takes you through making a scatterplot, then extends what you’ve learned to other kinds of plots:
- The Complete ggplot2 Tutorial – Part1 | Introduction To ggplot2
- The Complete ggplot2 Tutorial – Part 2 | How To Customize ggplot2
- Top 50 ggplot2 Visualizations – The Master List
Additionally, I think you should learn about grouped vs. stacked vs. percent bars in bar plots or histograms, which the above tutorial does not cover.
I have found that in my own work, I most frequently use and encounter scatterplots, histograms, barplots, line plots, and violin plots.
Here are some references with different kinds of plots you might like to make, explanations, and example code to copy-paste and modify:
Once you have mastered the general principles, you can google any kind of plot or modification to your plot that you would like to make. I have found that almost anything I want to do has at some point been asked about on StackOverflow.
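To give a flavor of what ggplot2 code looks like, here is a minimal sketch of a scatterplot using the example table (the choice of aesthetics is arbitrary):
library(ggplot2)
ggplot(my_table, aes(x = Ct, y = readcount, color = species)) +
    geom_point() +
    labs(x = "Ct", y = "Reads mapped to species", color = "Species")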
Maps
Here are some resources on plotting colorful maps in R:
R in particular is the best solution for drawing a map with curved directional arrows illustrating transmission events (doing it in Python seems to be much more challenging and the result doesn’t look as nice). There does not seem to be a good guide for making this kind of figure, but my colleague Gage Moreno found a way to do it using curvedarrow in the diagram package.
Conditionals, loops, and functions
Conditional statements
Conditional statements allow you to automatically run different lines of code depending on what your data looks like. In this example, we check if the minimum collection date is before May 9th, 2022; if it is, we subset our data to collection dates after May 10th, 2022:
if(min(my_table$collection_date) < as.Date("2022-05-09")) {
    my_table <- subset(my_table, collection_date > as.Date("2022-05-10"))
}
The above example only executes the code inside the { and } brackets if the condition boils down to TRUE; if the condition is FALSE, your script does nothing. If you’d like, you can add an else statement to make it do something else if the condition is FALSE. In the following example, if our mean readcount is <100, we replace any readcounts less than 100 with NA. If not, we instead move on to the else and replace any readcounts less than 200 with NA:
if(mean(my_table$readcount) < 100) {
    my_table$readcount[my_table$readcount < 100] <- NA
} else {
    my_table$readcount[my_table$readcount < 200] <- NA
}
You can chain these conditions together. In this example, we check if the minimum proportion of any genome covered in our table is less than 50%:
- If it is, we subset our table to rows with is_virus set to TRUE and (&) readcount at least 50 and we exit the chain.
- If it isn’t, we follow the else to the next statement: we check if the minimum proportion of any genome covered in our table is greater than 70%.
- If it is, we subset our table to rows with is_virus set to FALSE and (&) readcount at least 100 and we exit the chain.
- If it isn’t, we follow the else to our next and final statement: we subset our table to rows with readcount at least 100 or (|) at least 50% of the genome covered, and then we set any Ct values greater than 25 to 25.
if(min(my_table$genome_covered) < 0.50) {
    my_table <- subset(my_table, is_virus & readcount >= 50)
} else if(min(my_table$genome_covered) > 0.70) {
    my_table <- subset(my_table, !is_virus & readcount >= 100)
} else {
    my_table <- subset(my_table, readcount >= 100 | genome_covered >= 0.50)
    my_table$Ct[my_table$Ct > 25] <- 25
}
Scaling up your analyses with loops
Looping through a predetermined list of values
If you would like to automatically do the same analysis or generate the same plot for multiple values, you can use a for-loop. In the following example, we subset the table to include only rows where species is Hepatitis C, HIV, or Influenza A, and save each table to an output file named after the species. We do that by using the variable my_species inside the { and } brackets and setting my_species to Hepatitis C, HIV, and then Influenza A. With each iteration of the loop, my_species takes on the next value and the code inside the { and } brackets is executed with my_species set to that value.
for(my_species in c("Hepatitis C", "HIV", "Influenza A")) {
my_table_subset <- subset(my_table, species == my_species)
output_table_path <- paste(input_table_path, "_", my_species, ".txt", sep="")
write.table(my_table_subset, file=output_table_path, quote=FALSE,
sep='\t', row.names=FALSE) }
Instead of using a for-loop, you could copy-paste the code inside the { and } brackets, changing my_species to your species of choice each time:
my_table_subset <- subset(my_table, species == "Hepatitis C")
output_table_path <- paste(input_table_path, "_", "Hepatitis C", ".txt", sep="")
write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

my_table_subset <- subset(my_table, species == "HIV")
output_table_path <- paste(input_table_path, "_", "HIV", ".txt", sep="")
write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

my_table_subset <- subset(my_table, species == "Influenza A")
output_table_path <- paste(input_table_path, "_", "Influenza A", ".txt", sep="")
write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)
This is a lot longer and harder to read—and you’re only copy-pasting three lines of code. If you wanted to add another species of interest (or ten), you would need to copy-paste again and again instead of just adding another species name to your list. If you wanted to change something, maybe the names of the output files, you would need to change it for each copy-pasted version. Not only is that a lot of work, it also makes it more likely that you will make a mistake and, because your code is less readable, less likely that you will catch it.
Looping through a non-predetermined list of values
Another benefit of for-loops is that you do not need to decide what values you are cycling through ahead of time. Perhaps we want to save subsets of our table for all species appearing in the table, whatever they are, without knowing them ahead of time. This example does the same thing as our previous two examples, but for all unique species in our table.
for(my_species in as.list(unique(my_table$species))) {
    my_table_subset <- subset(my_table, species == my_species)
    output_table_path <- paste(input_table_path, "_", my_species, ".txt", sep="")
    write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)
}
Maintaining data types in loops
Note that as.list is absolutely necessary in the above for-loop (I learned this the hard way). R strips factor values of their type when you iterate over them directly in a for-loop, converting them to type character. Try it yourself. Here is the vector the for-loop is iterating over; it is of type factor:
unique(my_table$species)
class(unique(my_table$species))
Here is an iteration through the list with as.list; the elements are, as before, of type factor:
for(my_species in as.list(unique(my_table$species))) {
    print(my_species)
    print(class(my_species))
}
Here is an iteration through the list without as.list; the elements are now of type character:
for(my_species in unique(my_table$species)) {
    print(my_species)
    print(class(my_species))
}
Looping through integers
You can also cycle through integer values. This chunk of code, for example, sets my_value to each integer from 2 through 10 and prints it each time.
for(my_value in 2:10) {
    print(my_value)
}
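A related pattern you will see often is looping over the row numbers of a table. Here is a sketch using seq_len, which (unlike 1:nrow(my_table)) correctly does nothing when the table has zero rows:
for(row_index in seq_len(nrow(my_table))) {
    print(my_table$patient[row_index])  # the patient in this row
}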
Reusing your analyses with functions
When you find yourself copy-pasting your code and changing a few things, you should consider using a function. Here is a function I wrote to calculate a 95% confidence interval given a proportion and a total:
confidence_interval_from_proportion_and_total <- function(proportion, total) {
    confidence_interval_plus_minus <- 1.96 * sqrt(proportion * (1 - proportion) / total)
    confidence_interval_min <- proportion - confidence_interval_plus_minus
    confidence_interval_max <- proportion + confidence_interval_plus_minus
    return(c(min = confidence_interval_min, proportion = proportion,
             max = confidence_interval_max))
}
After defining the function, I call it whenever I need to calculate a confidence interval:
confidence_interval_from_proportion_and_total(0.50, 500)
confidence_interval_from_proportion_and_total(0.10, 10)
confidence_interval_from_proportion_and_total(0.99, 200)
Instead of copy-pasting my code each time, I can call the function in one line, making my code shorter and a lot easier to read. If I decide to change how I calculate a confidence interval, perhaps to bound the confidence interval between 0 and 1, I can simply edit the code inside the function rather than needing to locate and edit many lines of code. Some functions can get pretty long, so the benefit adds up.
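For example, that bounding change would only touch two lines inside the function; a sketch:
confidence_interval_min <- max(0, proportion - confidence_interval_plus_minus)  # never below 0
confidence_interval_max <- min(1, proportion + confidence_interval_plus_minus)  # never above 1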
Avoiding (some) bugs
Put the following line at the very top of your script:
rm(list=ls())
This line deletes whatever data is already loaded, so that previously defined variables do not linger and creep into your analyses when you reuse a name because of a typo or forget to assign it a new value. You should run this line whenever you start over analyzing your data or when you open and run a new script; the top of the script is the best place for it.
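To make the failure mode concrete, here is a hypothetical example (readcount_cutoff is an invented name):
# a script you ran earlier in the same session:
readcount_cutoff <- 50

# a new script that forgets to define readcount_cutoff:
my_table <- subset(my_table, readcount >= readcount_cutoff)
# without rm(list=ls()) at the top, this silently reuses the stale value 50;
# with it, R stops immediately with "object 'readcount_cutoff' not found"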
Resources
- Additional tutorials and courses:
- codecademy’s Learn R course
- the r-statistics.co series of tutorials
- A hands-on Introduction to R workshop, with biologists as a target audience
- STAT 545: Data wrangling, exploration, and analysis with R at the University of British Columbia
- Introduction to R:
- datacamp.com’s interactive introduction to R course
- A short video introduction to RStudio
- University of British Columbia STAT 545: Install R and RStudio
- An introduction to R from r-statistics.co
- The Terra team’s 15-minute Terra Notebooks Quickstart video for using R in Terra
- Data manipulation and summarizing with dplyr:
- Data visualization examples with explanations and code snippets:
- Modelling:
- Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
- Logistic Regression (for binary outcome variables)
- Multinomial Regression (for categorical outcome variables)
- Other regression models, if you need them
- Model Selection Approaches
- Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari: (legally) available for free online
- Interpreting your models:
- Linear models:
- Logistic models for binary outcome variables:
- Other kinds of models, as needed:
- Judging your model:
- Transforming your data to make your coefficients interpretable:
- Additional chapters:
- Study design:
- Dealing with missing data:
- Causal inference:


