Getting started using R for data analysis and visualization

October 24, 2022 lydiaandreyevnakrasilnikovaLeave a comment

This guide is the second of the getting started guides I made originally for our lab, and now for you. All the getting started guides can be found here.

Thank you to Danny Park in our lab for introducing me to R and preempting my confusion about the difference between tables in Excel and R—a lot of this guide was verbally passed on to me from Danny when I first joined the lab; I just wrote it down.

Thank you to Lizzie Wolkovich for teaching me everything I know about modelling.

This document should take a veritable eternity to work through—but you don’t need to work through all or even most of it to do what you want to do. Take the pieces you need and when you need more pieces take those too.

Introduction

Our lab is split pretty evenly between people who use Python for data analysis and visualization and people who use R for data analysis and visualization. I strongly prefer R, because it was made specifically for data analysis, but there is no wrong answer. R is especially powerful because of the added functionality of its libraries, such as dplyr and ggplot2.

If R is about to become your first programming language (!), I recommend starting off with the first part of codecademy’s Learn R course.

Getting started

Using R within Terra

If you use Terra (our lab uses Terra), you can write and run R scripts directly in Terra, and run your R scripts directly on the data you analyze in Terra. You can watch the Terra team’s 15-minute Terra Notebooks Quickstart video to learn how:

Using R on your computer

Download and install R here.

Download and install RStudio here.

RStudio allows you to write and run your code as a full script in its own file, run subsets of your code by simply highlighting the individual lines you are interested in running and hitting command enter, or writing and running throwaway code in RStudio’s console.

Here is a short video introduction to RStudio:

You can also follow University of British Columbia STAT 545’s guide: Install R and RStudio.

Packages/libraries

Package and library are terms used pretty much interchangeably in R. You can install a package, in this example a package named plyr, by entering the following command in RStudio’s command line with your own package of interest:

install.packages("plyr")

You only need to install a package once. After you have installed it, you can load a package by adding the following line to your script anywhere before the code that will use the package, again using the package plyr as an example:

library(plyr)

You can learn more about packages/libraries in this tutorial.

I have found the following packages helpful:

tidyverse: a collection of helpful packages all packaged together—including dplyr, ggplot2, tidyr, and stringr below
data visualization:
- ggplot2: the main package used for data visualization in R
- ggpubr: customization add-ons for ggplot2
- scales: customize axes and legends
- patchwork: put multiple plots together in an aligned arrangement
- grid and gridExtra: another option for arranging multiple plots into a layout
data manipulation and summarization:
- dplyr: a powerful library for manipulating and operating on tables of data
- plyr: the previous iteration of dplyr
- tidyr: rearranging data tables
- stringr: working with strings
statistical analysis:
- binom: binomial confidence intervals
- broom: summarizes statistical analysis outputs in more readable formats
beepr: beeps when your code is done running

Data tables in R

Differences between tables in R and Excel

The thing that most confused me when I started coding in R is that data tables in R are different from the tables I was used to working with in Excel.

If you’re used to working in Excel, you might be accustomed to tables that look like this one. In this table, each row is a patient. The patient name appears in the first column of each row. The rest of the columns belong to species. Each cell contains the number of reads that were mapped to the species (its column) from the patient (its row).

For input into R, this table would look quite different. Every readcount cell in the Excel version of the table gets its own row in the R version of the table.

The R data table format allows you to add more columns to each patient-species pair, which in the Excel table format would require you to make a whole new table. Notice that some of the columns refer to the patient (patient, is_food_handler, collection_date, Ct) while other columns refer to the species (species, is_virus) and other columns refer to the patient-species pair (readcount, genome_covered):

Each column is a variable, and the two terms can be used interchangeably when talking about data tables.

(This data is clearly from an experiment mixing different sample types from each patient into a single sample.)

My example code will use this table. If you would like to follow along, download it here.

Reading in and saving data tables

To read in a comma-separated table:
input_table_path <- "/Users/lakras/my_input_file.csv" my_table <- read.csv(input_table_path)
or
my_table <- read.table(input_table_path, sep=",", header=TRUE)

To read in a tab-separated table:
input_table_path <- "/Users/lakras/my_input_file.tsv" my_table <- read.delim(input_table_path)
or
my_table <- read.table(input_table_path, sep="\t", header=TRUE)

To save a table as a comma-separated output table:
output_table_path <- "/Users/lakras/my_output_file.csv" write.table(my_table, file=output_table_path, quote=FALSE, sep=',', row.names=FALSE)

To save a table as a tab-separated output table:
output_table_path <- "/Users/lakras/my_output_file.tsv" write.table(my_table, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

Viewing and manipulating data tables

Exploring your data table and its columns

You can view or access the full data table by simply entering its name:
my_table

View a summary of the variables (columns) in your data table, their data types (int, num, chr, string, factor), and the first few values using the str function:
str(my_table)

You can view or access the individual variables (columns) of your data table by entering the table name followed by $ followed by the column name:
my_table$patient my_table$species my_table$readcount my_table$genome_covered my_table$is_virus my_table$Ct my_table$is_food_handler my_table$collection_date

You can view all values of a variable (column) and the number of rows with each value using the table function:
table(my_table$species)

You can view a subset of variables (columns) using the select function:
select(my_table, patient, species, genome_covered)

You can view the first few rows of a data table using the head function or the last few rows using the tail function:
head(my_table) tail(my_table)

Subsetting your data table

You can use the subset function to view a subset of your data table. Perhaps we want to view the subset of our rows related to Hepatitis C:
subset(my_table, species == "Hepatitis C")

You can save the subset, or set the data table to be a subset of itself. Perhaps we are only interested in keeping the subset of our rows with readcount ≥100:
my_table <- subset(my_table, readcount >= 100)

Creating new columns

To make a new column with all values set to TRUE:
my_table$is_human <- TRUE

To make a new column by duplicating another column:
my_table$readcount_2 <- my_table$readcount

To make a new column by multiplying another column by a scalar:
my_table$percent_genome_covered <- my_table$genome_covered * 100

Conditional viewing and manipulating column values

You can view values of a column including only those rows that meet a certain condition. For example, you can view all readcounts ≥100:
my_table$readcount[my_table$readcount >= 100]

You can also view all Ct values from rows with readcount ≥100:
my_table$Ct[my_table$readcount >= 100]

In addition of viewing column values, you can also modify those column values. For example, you can set all readcounts <100 to NA:
my_table$readcount[my_table$readcount < 100] <- NA

You can also modify a column that is different from the column you are testing in your condition, and the column you modify does not have to already exist. You can create a new variable with values conditional on another column’s values:
my_table$passes_threshold[my_table$readcount >= 100] <- TRUE my_table$passes_threshold[my_table$readcount < 100] <- FALSE

Data types

Here are some of the data types you will encounter and use:

chr: a text value
factor: a category
int: an integer (no decimal places)
num: a number (with decimal places)
logi: TRUE or FALSE (logi is short for logical)
Date: a date

The chr data type is a text value. This data type might be useful for text values that are unique to each row, for example patient name, sample name, or hospital address. In some cases, text values (or other kinds of values) might reappear from row to row and be used to group rows together, for example state of residency, presence of coughing, sample type, or patient gender. This values should be factors.

You can modify the data type of your variable.

To make a variable’s values into text values:
my_table$patient <- as.character(my_table$patient)

To make a variable’s values into categories:
my_table$species <- as.factor(my_table$species)

To make a variable’s values into integers (no decimal places):
my_table$readcount <- as.numeric(my_table$readcount)

To make a variable’s values into numerical values (with decimal places):
my_table$Ct <- as.numeric(my_table$Ct) my_table$genome_covered <- as.numeric(my_table$genome_covered)

To make a variable’s values into TRUE and FALSE values:
my_table$is_food_handler <- as.logical(my_table$is_food_handler) my_table$is_virus <- as.logical(my_table$is_virus)

To make a variable’s values into dates:
my_table$collection_date <- as.Date(my_table$collection_date)

Special values

Some special values are NA (value not available), NaN (not a number), Inf (infinity), and -Inf (negative infinity). You will need to keep these values in mind and check for them in many analyses.

You can check if a value (my_value in this case) is one of these values:
my_value <- 5 is.na(my_value) is.nan(my_value) is.infinite(my_value)

Statistical analysis

Summary statistics

You can view basic summary statistics for a variable.

Minimum, maximum, and range:
min(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE) max(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE) range(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE)

Mean and standard deviation:
mean(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE) sd(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE)

Median, first quartile (the 25th percentile of your data), and third quartile (the 75th percentile of your data), and interquartile range (the middle 50% of your data, or the distance between the 25th percentile and the 75th percentile):
median(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE) quantile(my_table$genome_covered, 1/4, na.rm=TRUE, nan.rm=TRUE) quantile(my_table$genome_covered, 3/4, na.rm=TRUE, nan.rm=TRUE) IQR(my_table$genome_covered, na.rm=TRUE, nan.rm=TRUE)

na.rm=TRUE and nan.rm=TRUE exclude NA and NaN values from your calculations of the mean, median, min, max, etc.

You can generate a table of summary statistics grouped by values of a variable using ddply of the plyr or dplyr library. For example, we might calculate the minimum, maximum, and mean of readcounts for each species separately, saving them in a new data table we name readcount_summary:

readcount_summary <- ddply(my_table, ~ species, summarise, min_readcount=min(readcount), max_readcount=max(readcount),mean_readcount=mean(readcount))

The output looks like this:

	species	min_readcount	max_readcount	mean_readcount
1	E. coli	0	4324	620.4286
2	Hepatitis C	0	6457	931.7143
3	HIV	0	865	217.0000
4	Influenza A	0	4363	677.1429
5	SARS-CoV-2	76	4532	1289.5714

Or we might calculate the minimum, maximum, and mean of both readcounts and Cts, grouping rows not only by species, but also by whether or not the patient is a food handler and whether or not the at least half the genome is covered:

readcount_summary <- ddply(my_table, genome_covered >= 0.50 ~ is_food_handler ~ species, summarise, min_readcount=min(readcount), max_readcount=max(readcount), mean_readcount=mean(readcount), min_Ct=min(Ct), max_Ct=max(Ct),mean_Ct=mean(Ct))

That output looks like this:

	genome_ covered >= 0.5	is_food_ handler	species	min_ readcount	max_ readcount	mean_ readcount	min _Ct	max _Ct	mean _Ct
1	FALSE	FALSE	E. coli	0	4324	868.60	16.8	27.9	21.64000
2	FALSE	FALSE	Hepatitis C	0	65	16.25	16.8	27.9	21.72500
3	FALSE	FALSE	HIV	0	865	303.80	16.8	27.9	21.64000
4	FALSE	FALSE	Influenza A	0	0	0	16.8	27.9	22.70000
5	FALSE	FALSE	SARS-CoV-2	87	876	546.00	21.3	27.9	24.66667
6	FALSE	TRUE	E. coli	0	0	0	21.5	23.9	22.70000
7	FALSE	TRUE	Hepatitis C	0	0	0	21.5	23.9	22.70000
8	FALSE	TRUE	HIV	0	0	0	21.5	23.9	22.70000
9	FALSE	TRUE	Influenza A	65	312	188.50	21.5	23.9	22.70000
10	FALSE	TRUE	SARS-CoV-2	76	435	255.50	21.5	23.9	22.70000
11	TRUE	FALSE	Hepatitis C	6457	6457	6457.00	21.3	21.3	21.30000
12	TRUE	FALSE	Influenza A	4363	4363	4363.00	17.4	17.4	17.40000
13	TRUE	FALSE	SARS-CoV-2	2346	4532	3439.00	16.8	17.4	17.10000

dplyr is extremely powerful, and can do a lot more than summarize. You can learn more about dplyr by exploring these tutorials:

The Data aggregation tutorial of the the STAT 545 course at the University of British Columbia
Introduction to dplyr for Faster Data Manipulation in R (also available in video)
The official introduction to dplyr

Statistical tests

R has many statistical tests built in. Identify the one that is most relevant to your data, determine how to do it in R, and interpret your p-value or other preferred statistical test output. Here is a guide from r-statistics.co.

Modelling

Modelling allows you to estimate the strength of a correlations and to compare the individual impacts of variables affecting an outcome variable.

First, you should get an introduction to selecting the appropriate kind of model and writing the code to generate that model. datacamp.com has an interactive course on modelling in R, in which you practice coding it yourself and it judges your code. If you prefer just reading, r-statistics.co has a guide:

Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
Logistic Regression (for binary outcome variables)
Multinomial Regression (for categorical outcome variables)
Other regression models, if you need them
Model Selection Approaches

Next, you need to understand what your model is telling you. I learned everything I know about modelling from Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari. This book is thorough, clear, correct, and readable, unlike most things. The pdf is (legally) available for free online. The following are the guides to interpreting your models that I have bookmarked (most of these sections are around 3-5 pages):

Interpreting your models:
- Linear models:
- Logistic models for binary outcome variables:
- Other kinds of models, as needed:
  - Chapter 15: Other generalized linear models
Judging your model:
Transforming your data to make your coefficients interpretable:

The whole book is excellent. Once you are comfortable selecting and interpreting models, I recommend reading the following chapters:

Study design:
- Chapter 16: Design and sample size decisions
Dealing with missing data:
- Chapter 17: Poststratification and missing-data imputation
Causal inference:

Data visualization

Scatterplots, bar plots, histograms, and other standard figures

Follow this tutorial from r-statistics.co to learn how to use ggplot2 for data visualization. It takes you through making a scatterplot, then extends what you’ve learned to other kinds of plots:

Additionally, I think you should learn about grouped vs. stacked vs. percent bars in bar plots or histograms, which the above tutorial does not cover.

I have found that in my own work, I most frequently use and encounter scatterplots, histograms, barplots, line plots, and violin plots.

Here are some references with different kinds of plots you might like to make, explanations, and example code to copy-paste and modify:

Once you have mastered the general principles, you can google any kind of plot or modification to your plot that you would like to make. I have found that almost anything I want to do has at some point been asked about on StackOverflow.

Maps

Here are some resources on plotting colorful maps in R:

R is in particular is the best solution for drawing a map with curved directional arrows illustrating transmission events (doing it in python seems to be much more challenging and doesn’t end up as nice). There does not seem to be a good guide for making this kind of figure, but my colleague Gage Moreno found a way to do it using curvedarrow in the diagram package.

Conditionals, loops, and functions

Conditional statements

Conditional statements allow you to automatically run different lines of code depending on what your data looks like. In this example, we check if your minimum collection date is before May 9th; if it is, we subset our data to collection dates after May 10th:

if(min(my_table$collection_date) < as.Date("2022-05-09")) { my_table <- subset(my_table, collection_date > as.Date("2022-05-10")) }

The above example only executes the code inside the { and } brackets if the condition boils down to TRUE; if the condition is FALSE, your script does nothing. If you’d like, you can add an else statement to make it do something else if the condition is FALSE. In the following example, if our mean readcount is <100, we replace any readcounts less than 100 with NA. If not, we instead move on to the else and replace any readcounts less than 200 with NA:

if(mean(my_table$readcount) < 100) { my_table$readcount[my_table$readcount < 100] <- NA } else { my_table$readcount[my_table$readcount < 200] <- NA }

You can chain these conditions together. In this example, we check if the minimum proportion of any genome covered in our table is less than 50%:

If it is, we subset our table to rows with is_virus set to TRUE and (&) readcount at least 50 and we exit the chain.
If it isn’t, we follow the else to the next statement: we check if the minimum proportion of any genome covered in our table is greater than 70%.
- If it is, we subset our table to rows with is_virus set to FALSE and (&) readcount at least 100 and we exit the chain.
- If it isn’t, we follow the else to our next and final statement: we subset our table to rows with readcount at least 100 or (|) at least 50% of the genome covered, and then we set any Ct values greater than 25 to 25.

if(min(my_table$genome_covered) < 0.50) { my_table <- subset(my_table, is_virus & readcount >= 50) } else if(min(my_table$genome_covered) > 0.70) { my_table <- subset(my_table, !is_virus & readcount >= 100) } else { my_table <- subset(my_table, readcount >= 100
| genome_covered >= 0.50) my_table$Ct[my_table$Ct > 25] <- 25 }

Scaling up your analyses with loops

Looping through a predetermined list of values

If you would like to automatically do the same analysis or generate the same plot for multiple values, you can use a for-loop. In the following example, we subset the table to include only rows where species is Hepatitis C, HIV, or Influenza A, and save each table to an output file named after the species. We do that by using the variable my_species inside the { and } brackets and setting my_species to Hepatitis C, HIV, and then Influenza A. With each iteration of the loop, my_species takes on the next value and the code inside the { and } brackets is executed with my_species set to that value.

for(my_species in c("Hepatitis C", "HIV", "Influenza A")) { my_table_subset <- subset(my_table, species == my_species) output_table_path <- paste(input_table_path, "_", my_species, ".txt",
sep="") write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE) }

Instead of using a for-loop, you could instead copy-paste the code inside the { and } brackets, changing my_species to your species of choice each time:

my_table_subset <- subset(my_table, species == "Hepatitis C") output_table_path <- paste(input_table_path, "_", "Hepatitis C",
".txt", sep="") write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

my_table_subset <- subset(my_table, species == "HIV") output_table_path <- paste(input_table_path, "_", "HIV",
".txt", sep="") write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

my_table_subset <- subset(my_table, species == "Influenza A") output_table_path <- paste(input_table_path, "_", "Influenza A",
".txt", sep="") write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE)

This is a lot longer and harder to read—and you’re only copy-pasting three lines of code. If you wanted to add another species of interest (or ten), you would need to copy-paste again and again instead of just adding another species name to your list. If you wanted to change something, maybe the names of the output files, you would need to change it for each copy-pasted version. Not only is that a lot of work, it also makes it more likely that you will make a mistake and, because your code is less readable, less likely that you will catch it.

Looping through a non-predetermined list of values

Another benefit of for-loops is that you do not need to decide what values you are cycling through ahead of time. Perhaps we want to print subsets of our table for all species appearing in the table, whatever they are, without knowing them ahead of time. This example does the same thing as our previous two examples, but it does it for all unique species in our table.

for(my_species in as.list(unique(my_table$species))) { my_table_subset <- subset(my_table, species == my_species) output_table_path <- paste(input_table_path, "_", my_species, ".txt",
sep="") write.table(my_table_subset, file=output_table_path, quote=FALSE, sep='\t', row.names=FALSE) }

Maintaining data types in loops

Note that as.list is absolutely necessary in the above for-loop—I learned this the hard way. R for some reason strips values of their type when they are iterated through in a for-loop. Try it yourself—here is the list the for-loop is iterating over; it is of type factor:

unique(my_table$species) class(unique(my_table$species))

Here is an iteration through the list with as.list—the elements of the list are as before of type factor:

for(my_species in as.list(unique(my_table$species))) { print(my_species) print(class(my_species)) }

Here is an iteration through the list without as.list—the elements of the list are now of type character:

for(my_species in unique(my_table$species)) { print(my_species) print(class(my_species)) }

Looping through integers

You can also cycle through integer values. This chunk of code, for example, sets my_value to each integer from 2 through 10 and prints it each time.

for(my_value in 2:10) { print(my_value) }

Reusing your analyses with functions

When you find yourself copy-pasting your code and changing a few things, you should consider using a function. Here is a function I wrote to calculate a 95% confidence interval given a proportion and a total:

  confidence_interval_from_proportion_and_total
    <- function(proportion, total)   {     confidence_interval_plus_minus <- 1.96 * sqrt(proportion
      * (1 - proportion) / total)     confidence_interval_min <- proportion
      - confidence_interval_plus_minus     confidence_interval_max <- proportion
      + confidence_interval_plus_minus     return(c(min = confidence_interval_min, proportion = proportion,       max = confidence_interval_max))   }

After defining the function, I call it whenever I need to calculate a confidence interval:

confidence_interval_from_proportion_and_total(0.50, 500) confidence_interval_from_proportion_and_total(0.10, 10) confidence_interval_from_proportion_and_total(0.99, 200)

Instead of copy-pasting my code each time, I can call the function in one line, making my code shorter and a lot easier to read. If I decide to change how I calculate a confidence interval, perhaps to bound the confidence interval between 0 and 1, I can simply edit the code inside the function rather than needing to locate and edit many lines of code. Some functions can get pretty long, so the benefit adds up.

Avoiding (some) bugs

Put the following line at the very top of your script:

rm(list=ls())

This line deletes whatever data is already loaded, so that your previously defined variables do not linger, ready to creep up into your analyses when you use that name again because of a typo or without assigning it a new value as you intended to. You should run this line whenever you are starting over analyzing your data or when you open and run a new script—top of the script is the best place.

Resources

Additional tutorials and courses:
- codecademy’s Learn R course
- the r-statistics.co series of tutorials
- A hands-on Introduction to R workshop, with biologists as a target audience
- STAT 545: Data wrangling, exploration, and analysis with R at the University of British Columbia
Introduction to R:
- datacamp.com’s interactive introduction to R course
- A short video introduction to RStudio
- University of British Columbia STAT 545: Install R and RStudio
- An introduction to R from r-statistics.co
The Terra team’s 15-minute Terra Notebooks Quickstart video for using R in Terra
Data manipulation and summarizing with dplyr:
- Introduction to dplyr for Faster Data Manipulation in R (also available in video)
- The official introduction to dplyr
Data visualization examples with explanations and code snippets:
- The Complete ggplot2 Tutorial from r-statistics.co
  - The Complete ggplot2 Tutorial – Part1 | Introduction To ggplot2
  - The Complete ggplot2 Tutorial – Part 2 | How To Customize ggplot2
- Top 50 ggplot2 Visualizations – The Master List
- The R graph gallery
- The tidyverse ggplot2 reference web page
Modelling:
- datacamp.com’s interactive course on modelling in R
- r-statistics.co’s guide:

Linear Regression (for numerical outcome variables, and an introduction to modelling in general)
Logistic Regression (for binary outcome variables)
Multinomial Regression (for categorical outcome variables)
Other regression models, if you need them
Model Selection Approaches
Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari: (legally) available for free online
- Interpreting your models:
  - Linear models:
  - Logistic models for binary outcome variables:
  - Other kinds of models, as needed:
    - Chapter 15: Other generalized linear models
- Judging your model:
- Transforming your data to make your coefficients interpretable:

Study design:
- Chapter 16: Design and sample size decisions
Dealing with missing data:
- Chapter 17: Poststratification and missing-data imputation
Causal inference:

Getting started using regular expressions for pattern matching and smart find and replace

October 17, 2022November 13, 2024 lydiaandreyevnakrasilnikovaLeave a comment

This fall I realized that a lot of the knowledge that I at some point during my PhD started taking for granted was not, indeed, universal. One of my favorite things about MIT was almost no information lived exclusively in someone’s brain; all of it was somewhere accessible online. In that spirit I started putting together written guides to whatever I happen to know, initially intended for anyone in our lab and now intended also for you. I’m not reinventing the wheel, not making anything new, just pointing newcomers to resources they can use to get started.

This guide is the first. Eventually, all the getting started guides will be here.

Thank you to my mentors in the Makova and Page labs for teaching me all of this when I first started out in computational biology, Melissa A. Wilson and Daniel Winston Bellott.

This document should take about an hour. Don’t worry about memorizing anything! Google everything you need when you need it.

Background
Getting started with BBEdit
Getting started with regex101.com
Getting started with regular expressions
More examples
- Example 3
- Example 4
Cheat sheets

Background

Regular expressions (nicknamed regex) are an extremely powerful (and in my opinion vastly underused) tool for pattern matching and find and replace. Here are some things I have recently used regex for:

changing the contents of a column to be more useful to me—for example, changing a column with values like 21I (Delta), 21J (Delta), 21K (Omicron), 21L (Omicron), and so on to instead have values like Delta and Omicron
changing the format of dates from 10/5/2022 to 2022-10-05
deleting the sixth column from a table
creating a new table column depending on the contents of another column—for example, making a sample names column from a file paths column
scraping a web site to turn messy tables represented by html into simple tab-separated tables in a text file
scraping a web page to retrieve the first image appearing in each web page it links to
parsing the output of a script someone else wrote to determine when a Global Entry interview frees up (so that I could run it every five seconds and make my computer beep when an interview appears). (It worked—I got an interview!)

Regular expressions can be intimidating. Here’s one I wrote recently, to fill add missing tabs to add missing cells to the last column in a tab-separated table:

^(([^\t]+?\t){29}[^\t]*?)$

I can’t read it just by glancing at it. But if I move through slowly, character by character, left to right, it makes sense. (Luckily, you will probably spend a lot more time writing regular expressions than reading them.)

Getting started with BBEdit

BBEdit is my favorite text editor. BBEdit lets you do regex find and replace. I also use it to code in. Install it here (scroll down to “How can I get BBEdit 14?” for download options).

You can get a lot done just with regex find and replace in BBEdit. But if you like to code, most programming languages use the same or a very similar regular expression format. What you learn here will apply no matter how you like to code.

Getting started with regex101.com

If you’d prefer to get started even faster without downloading anything, go to regex101.com and check Python and Substitution on the left.

In addition to testing/running your regular expressions, regex101.com also has helpful explanations of all the parts of your regular expression and highlights errors.

Getting started with regular expressions

We’ll start with a few quick examples.

Example 1

Here is an example table. Copy/paste it into BBEdit or regex101.com:

patient	Ct	is_food_handler	collection_date	species	is_virus	readcount	genome_covered
Patient 1	21.5	TRUE	2022-05-12	SARS-CoV-2	TRUE	435	0.19
Patient 2	17.4	FALSE	2022-05-11	SARS-CoV-2	TRUE	2346	0.97
Patient 3	24.8	FALSE	2022-05-19	SARS-CoV-2	TRUE	87	0.05
Patient 4	23.9	TRUE	2022-05-12	SARS-CoV-2	TRUE	76	0.10
Patient 5	21.3	FALSE	2022-05-13	SARS-CoV-2	TRUE	675	0.20
Patient 6	16.8	FALSE	2022-05-09	SARS-CoV-2	TRUE	4532	0.99
Patient 7	27.9	FALSE	2022-05-07	SARS-CoV-2	TRUE	876	0.23
Patient 1	21.5	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	E. coli	FALSE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	E. coli	FALSE	4324	0.14
Patient 4	23.9	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	E. coli	FALSE	19	0.02
Patient 6	16.8	FALSE	2022-05-09	E. coli	FALSE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	E. coli	FALSE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Influenza A	TRUE	65	0.04
Patient 2	17.4	FALSE	2022-05-11	Influenza A	TRUE	4363	0.95
Patient 3	24.8	FALSE	2022-05-19	Influenza A	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	Influenza A	TRUE	312	0.12
Patient 5	21.3	FALSE	2022-05-13	Influenza A	TRUE	0	0.00
Patient 6	16.8	FALSE	2022-05-09	Influenza A	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Influenza A	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	Hepatitis C	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	Hepatitis C	TRUE	65	0.08
Patient 4	23.9	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	Hepatitis C	TRUE	6457	0.65
Patient 6	16.8	FALSE	2022-05-09	Hepatitis C	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Hepatitis C	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	HIV	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	HIV	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	HIV	TRUE	865	0.34
Patient 6	16.8	FALSE	2022-05-09	HIV	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	HIV	TRUE	654	0.26

Hit command F to bring up BBEdit’s find and replace dialogue. Enter the following (and click Replace All if you are in BBEdit):

FIND: \t REPLACE: ,

As long as your table is simple, your tabs should all be replaced by commas and your table should now be a comma-separated table. (If your table is not simple, you’ll want something more like this or this.)

Example 2

Let’s pretend that we want all numbers to be whole numbers, rounded down. Let’s find all digits following a decimal point and remove them:

FIND: \.\d+ REPLACE:

If you’re in BBEdit, make sure that the Grep option is checked.

Breaking it down, here’s what that FIND is saying and how it would parse the input 27.9:

`\.`	match a period—the period is a special character, so it is escaped with a `\`
`\d+`	match at least one numerical digit (0-9) in a row
`\.\d+`	match a period followed by at least one numerical digit (0-9) in a row.

Replace is empty, so when you click Replace All, matched values will simply be deleted.

Video walkthrough

Here is a very thorough walkthrough of all the potential pieces of a regular expression and how you can put them together to match any pattern you want:

You might notice that there are small differences in how regular expressions are written in different programming languages. BBEdit, for example, uses grep (as does the command line in Unix), where matched values are written as \1, \2, \3, and so on. Python uses the same syntax. In this video, matched values are instead written as $1, $2, $3, and so on. If you switch between programming languages, you’ll catch this difference easily. The general principles and the things that matter stay the same. You can play with regular expressions in different languages in regex101.com.

More examples

Example 3

More with the above example table, which you can open in BBEdit or copy/paste it into regex101.com.

Let’s change the format of all of the dates. Right now, all of the dates look like this: 2022-05-12. Let’s make them look like this: 05/12/2022:

FIND: (\d+)-(\d+)-(\d+) REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

`(\d+)`	match at least one numerical digit (0-9) in a row and save it
`-`	match a dash and don’t save it
`(\d+)-(\d+)-(\d+)`	match at least one numerical digit in a row (2022) and save it, then a dash (and don’t save it), then at least one numerical digit in a row (05) and save it, then a dash (and don’t save it), and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

`\2`	the second thing we saved (05)
`/`	a slash
`\3`	the third thing we saved (12)
`/`	a slash
`\1`	the first thing we saved (2022)
`\2/\3/\1`	the second thing we saved (05), then a slash, then the third thing we saved (02), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 05/02/2022, and all other dates will each individually be processed the same way.

Example 4

Let’s change the format of all the dates again in the same way, but this time let’s also get rid of the leading 0s. Click undo (command Z) and try this instead:

FIND: 0*(\d+)-0*(\d+)-0*(\d+) REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

`0*`	match any zeros in a row, if they are there, and match nothing if they aren’t—we could match nothing, or a 0, or 00, or 000, or 0000, and so on; this part is not inside parentheses, so we do not save it
`(\d+)`	match at least one numerical digit (0-9) in a row and save it
`-`	match a dash and don’t save it
`0(\d+)-0(\d+)-0*(\d+)`	match any zeros in a row (or match nothing if there aren’t any 0s) and don’t save it, then at least one numerical digit in a row (2022) and save it, then a dash and don’t save it, then any or no zeros in a row and don’t save it, then at least one numerical digit in a row (05) and save it, then a dash, then any or no zeros in a row and don’t save it, and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

`\2`	the second thing we saved (5)
`/`	a slash
`\3`	the third thing we saved (12)
`/`	a slash
`\1`	the first thing we saved (2022)
`\2/\3/\1`	the second thing we saved (5), then a slash, then the third thing we saved (2), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 5/2/2022, and all other dates will independently be processed the same way. Notice that the leading 0s are not included, because we wrote them into the regular expression without saving them between parentheses.

Cheat sheets

Don’t try to memorize anything! Once you know the general principles, you can use a regex cheat to build any regular expression you need.

Here are two that I like.

As before, you might notice that there are small differences in how regular expressions are written in different programming languages. In the first cheat sheet, matched values are written as \1, \2, \3, and so on. In the second cheat sheet, matched values are instead written as $1, $2, $3, and so on. $1, $2, $3, and so on will work in some situations, but not in BBEdit or the command line. If you need to switch programming languages, this will be a difference you will catch easily (because it won’t work).