Getting started using regular expressions for pattern matching and smart find and replace

This fall I realized that a lot of the knowledge I started taking for granted at some point during my PhD was not, in fact, universal. One of my favorite things about MIT was that almost no information lived exclusively in someone’s brain; all of it was somewhere accessible online. In that spirit I started putting together written guides to whatever I happen to know, initially intended for anyone in our lab and now intended also for you. I’m not reinventing the wheel, not making anything new, just pointing newcomers to resources they can use to get started.

This guide is the first. Eventually, all the getting started guides will be here.

Thank you to my mentors in the Makova and Page labs, Melissa A. Wilson and Daniel Winston Bellott, for teaching me all of this when I first started out in computational biology.

This document should take about an hour. Don’t worry about memorizing anything! Google everything you need when you need it.

Background

Regular expressions (nicknamed regex) are an extremely powerful (and in my opinion vastly underused) tool for pattern matching and find and replace. Here are some things I have recently used regex for:

  • changing the contents of a column to be more useful to me—for example, changing a column with values like 21I (Delta), 21J (Delta), 21K (Omicron), 21L (Omicron), and so on to instead have values like Delta and Omicron
  • changing the format of dates from 10/5/2022 to 2022-10-05
  • deleting the sixth column from a table
  • creating a new table column depending on the contents of another column—for example, making a sample names column from a file paths column
  • scraping a web site to turn messy tables represented by html into simple tab-separated tables in a text file
  • scraping a web page to retrieve the first image appearing in each web page it links to
  • parsing the output of a script someone else wrote to determine when a Global Entry interview frees up (so that I could run it every five seconds and make my computer beep when an interview appears). (It worked—I got an interview!)
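
To make the first two items on that list concrete, here is a rough sketch of how they might look in Python. The function names and patterns are mine, made up for illustration; they are one way to do it, not necessarily how I did it at the time.

```python
import re

def clade_to_variant(label):
    # "21K (Omicron)" -> "Omicron": keep only what is inside the parentheses
    return re.sub(r"^\S+ \((.+)\)$", r"\1", label)

def us_date_to_iso(date):
    # "10/5/2022" -> "2022-10-05": reorder the pieces and pad month and day to two digits
    def reorder(match):
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return re.sub(r"(\d+)/(\d+)/(\d+)", reorder, date)

print(clade_to_variant("21I (Delta)"))  # Delta
print(us_date_to_iso("10/5/2022"))      # 2022-10-05
```

Don’t worry if the patterns look like noise right now. The examples further down build up to them piece by piece.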

Regular expressions can be intimidating. Here’s one I wrote recently to add missing tabs (and with them the missing cells) to the last column of a tab-separated table:

^(([^\t]+?\t){29}[^\t]*?)$

I can’t read it just by glancing at it. But if I move through slowly, character by character, left to right, it makes sense. (Luckily, you will probably spend a lot more time writing regular expressions than reading them.)
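
If you want to see what moving through it slowly looks like, here is the same expression spread over several lines with a comment on each piece, using Python’s re.VERBOSE option. This is only a sketch: the name ROW_WITH_30_CELLS, the helper make_row, and the sample rows are made up for illustration, and I am only checking which rows the pattern matches, not showing the replacement I used.

```python
import re

ROW_WITH_30_CELLS = re.compile(r"""
    ^              # start of the line
    (              # group 1: the whole line
      (            # group 2: one cell plus the tab after it
        [^\t]+?    #   one or more characters that are not tabs
        \t         #   the tab that ends the cell
      ){29}        # exactly 29 of those cells in a row
      [^\t]*?      # one last cell with no tab in it, possibly empty
    )
    $              # end of the line
""", re.VERBOSE)

def make_row(cell_count):
    # Hypothetical helper: a tab-separated row of identical dummy cells
    return "\t".join(["x"] * cell_count)

print(bool(ROW_WITH_30_CELLS.match(make_row(30))))  # True
print(bool(ROW_WITH_30_CELLS.match(make_row(31))))  # False
```

Read this way, the pattern matches rows that have 29 filled cells followed by one final cell, and nothing longer or shorter.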

Getting started with BBEdit

BBEdit is my favorite text editor. BBEdit lets you do regex find and replace. I also use it to code in. Install it here (scroll down to “How can I get BBEdit 14?” for download options).

You can get a lot done just with regex find and replace in BBEdit. But if you like to code, most programming languages use the same or a very similar regular expression format. What you learn here will apply no matter how you like to code.

Getting started with regex101.com

If you’d prefer to get started even faster without downloading anything, go to regex101.com and check Python and Substitution on the left.

In addition to testing/running your regular expressions, regex101.com also has helpful explanations of all the parts of your regular expression and highlights errors.

Getting started with regular expressions

We’ll start with a few quick examples.

Example 1

Here is an example table. Copy/paste it into BBEdit or regex101.com:

patient	Ct	is_food_handler	collection_date	species	is_virus	readcount	genome_covered
Patient 1	21.5	TRUE	2022-05-12	SARS-CoV-2	TRUE	435	0.19
Patient 2	17.4	FALSE	2022-05-11	SARS-CoV-2	TRUE	2346	0.97
Patient 3	24.8	FALSE	2022-05-19	SARS-CoV-2	TRUE	87	0.05
Patient 4	23.9	TRUE	2022-05-12	SARS-CoV-2	TRUE	76	0.10
Patient 5	21.3	FALSE	2022-05-13	SARS-CoV-2	TRUE	675	0.20
Patient 6	16.8	FALSE	2022-05-09	SARS-CoV-2	TRUE	4532	0.99
Patient 7	27.9	FALSE	2022-05-07	SARS-CoV-2	TRUE	876	0.23
Patient 1	21.5	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	E. coli	FALSE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	E. coli	FALSE	4324	0.14
Patient 4	23.9	TRUE	2022-05-12	E. coli	FALSE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	E. coli	FALSE	19	0.02
Patient 6	16.8	FALSE	2022-05-09	E. coli	FALSE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	E. coli	FALSE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Influenza A	TRUE	65	0.04
Patient 2	17.4	FALSE	2022-05-11	Influenza A	TRUE	4363	0.95
Patient 3	24.8	FALSE	2022-05-19	Influenza A	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	Influenza A	TRUE	312	0.12
Patient 5	21.3	FALSE	2022-05-13	Influenza A	TRUE	0	0.00
Patient 6	16.8	FALSE	2022-05-09	Influenza A	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Influenza A	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	Hepatitis C	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	Hepatitis C	TRUE	65	0.08
Patient 4	23.9	TRUE	2022-05-12	Hepatitis C	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	Hepatitis C	TRUE	6457	0.65
Patient 6	16.8	FALSE	2022-05-09	Hepatitis C	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	Hepatitis C	TRUE	0	0.00
Patient 1	21.5	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 2	17.4	FALSE	2022-05-11	HIV	TRUE	0	0.00
Patient 3	24.8	FALSE	2022-05-19	HIV	TRUE	0	0.00
Patient 4	23.9	TRUE	2022-05-12	HIV	TRUE	0	0.00
Patient 5	21.3	FALSE	2022-05-13	HIV	TRUE	865	0.34
Patient 6	16.8	FALSE	2022-05-09	HIV	TRUE	0	0.00
Patient 7	27.9	FALSE	2022-05-07	HIV	TRUE	654	0.26

Hit command F to bring up BBEdit’s find and replace dialogue. Enter the following (and click Replace All if you are in BBEdit):

FIND:    \t
REPLACE: ,

As long as your table is simple, your tabs should all be replaced by commas and your table should now be a comma-separated table. (If your table is not simple, you’ll want something more like this or this.)
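
If you would rather run this from code, the same find and replace might look like this in Python. It is a minimal sketch using re.sub on a trimmed-down piece of the table (the three-column table variable is made up to keep things short).

```python
import re

# A small piece of the example table, with tabs written as \t
table = (
    "patient\tCt\tspecies\n"
    "Patient 1\t21.5\tSARS-CoV-2\n"
    "Patient 2\t17.4\tE. coli\n"
)

# FIND: \t    REPLACE: ,
csv_table = re.sub(r"\t", ",", table)
print(csv_table)
```

The first argument to re.sub is the FIND, the second is the REPLACE, and the third is the text to search.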

Example 2

Let’s pretend that we want all numbers to be whole numbers, rounded down. Let’s find all digits following a decimal point and remove them:

FIND:    \.\d+
REPLACE:

If you’re in BBEdit, make sure that the Grep option is checked.

Breaking it down, here’s what that FIND is saying and how it would parse the input 27.9:

\.       match a period (the period is a special character, so it is escaped with a \)
\d+      match at least one numerical digit (0-9) in a row
\.\d+    match a period followed by at least one numerical digit (0-9) in a row (in 27.9, this matches .9)

Replace is empty, so when you click Replace All, matched values will simply be deleted.
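
Here is a sketch of the same thing in Python, where an empty REPLACE is just an empty string. The function name drop_decimals is made up for illustration.

```python
import re

def drop_decimals(text):
    # FIND: \.\d+    REPLACE: (nothing)
    return re.sub(r"\.\d+", "", text)

print(drop_decimals("27.9"))                   # 27
print(drop_decimals("E. coli\tFALSE\t0\t0.00"))  # the period in "E. coli" is left alone
```

The period in E. coli survives because it is followed by a space, not a digit, so the pattern never matches there.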

Video walkthrough

Here is a very thorough walkthrough of all the potential pieces of a regular expression and how you can put them together to match any pattern you want:

You might notice that there are small differences in how regular expressions are written in different programming languages. BBEdit, for example, uses grep (as does the command line in Unix), where matched values are written as \1, \2, \3, and so on. Python uses the same syntax. In this video, matched values are instead written as $1, $2, $3, and so on. If you switch between programming languages, you’ll catch this difference easily. The general principles and the things that matter stay the same. You can play with regular expressions in different languages in regex101.com.
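
Here is a quick sketch of what that difference looks like in Python, using a made-up two-word input. Python accepts \1 (and the longer \g<1>), and treats $1 as ordinary text.

```python
import re

text = "hello world"
find = r"(\w+) (\w+)"

# Backslash style: swaps the two saved words
backslash_style = re.sub(find, r"\2 \1", text)

# \g<n> style: a longer Python spelling of the same thing
g_style = re.sub(find, r"\g<2> \g<1>", text)

# Dollar style: Python's re module does not treat $2 and $1 as saved values
dollar_style = re.sub(find, "$2 $1", text)

print(backslash_style)  # world hello
print(g_style)          # world hello
print(dollar_style)     # $2 $1
```

This is what catching the difference easily looks like in practice: the wrong style shows up literally in your output.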

More examples

Example 3

We’ll keep working with the example table above, which you can open in BBEdit or copy/paste into regex101.com.

Let’s change the format of all of the dates. Right now, all of the dates look like this: 2022-05-12. Let’s make them look like this: 05/12/2022:

FIND:    (\d+)-(\d+)-(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

(\d+)                match at least one numerical digit (0-9) in a row and save it
-                    match a dash and don’t save it
(\d+)-(\d+)-(\d+)    match at least one numerical digit in a row (2022) and save it, then a dash (and don’t save it), then at least one numerical digit in a row (05) and save it, then a dash (and don’t save it), and finally at least one numerical digit in a row (02) and save it

Here is what that REPLACE is saying:

\2          the second thing we saved (05)
/           a slash
\3          the third thing we saved (02)
/           a slash
\1          the first thing we saved (2022)
\2/\3/\1    the second thing we saved (05), then a slash, then the third thing we saved (02), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 05/02/2022, and all other dates will each individually be processed the same way.
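
The same FIND and REPLACE might be written in Python like this. It is a minimal sketch; the function name iso_to_us_date is made up for illustration.

```python
import re

def iso_to_us_date(text):
    # FIND: (\d+)-(\d+)-(\d+)    REPLACE: \2/\3/\1
    return re.sub(r"(\d+)-(\d+)-(\d+)", r"\2/\3/\1", text)

print(iso_to_us_date("2022-05-02"))  # 05/02/2022
print(iso_to_us_date("Patient 1\t21.5\tTRUE\t2022-05-12\tSARS-CoV-2"))
```

In the second call only the date changes. SARS-CoV-2 has dashes too, but not in the digits-dash-digits-dash-digits shape the pattern asks for.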

Example 4

Let’s change the format of all the dates again in the same way, but this time let’s also get rid of the leading 0s. Click undo (command Z) and try this instead:

FIND:    0*(\d+)-0*(\d+)-0*(\d+)
REPLACE: \2/\3/\1

Breaking it down, here’s what that FIND is saying and how it would parse the input 2022-05-02:

0*                         match any zeros in a row, if they are there, and match nothing if they aren’t: we could match nothing, or a 0, or 00, or 000, or 0000, and so on; this part is not inside parentheses, so we do not save it
(\d+)                      match at least one numerical digit (0-9) in a row and save it
-                          match a dash and don’t save it
0*(\d+)-0*(\d+)-0*(\d+)    match any zeros in a row (or match nothing if there aren’t any 0s) and don’t save it, then at least one numerical digit in a row (2022) and save it, then a dash and don’t save it, then any or no zeros in a row and don’t save it, then at least one numerical digit in a row (5) and save it, then a dash, then any or no zeros in a row and don’t save it, and finally at least one numerical digit in a row (2) and save it

Here is what that REPLACE is saying:

\2          the second thing we saved (5)
/           a slash
\3          the third thing we saved (2)
/           a slash
\1          the first thing we saved (2022)
\2/\3/\1    the second thing we saved (5), then a slash, then the third thing we saved (2), then a slash, then the first thing we saved (2022)

The input 2022-05-02 will be replaced with 5/2/2022, and all other dates will independently be processed the same way. Notice that the leading 0s are not included, because we wrote them into the regular expression without saving them between parentheses.
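
Here is a sketch of this version in Python, with a second made-up date to show that zeros in the middle or at the end of a number are left alone. The function name is hypothetical.

```python
import re

def iso_to_short_us_date(text):
    # FIND: 0*(\d+)-0*(\d+)-0*(\d+)    REPLACE: \2/\3/\1
    # The 0* pieces sit outside the parentheses, so leading zeros are matched but not saved
    return re.sub(r"0*(\d+)-0*(\d+)-0*(\d+)", r"\2/\3/\1", text)

print(iso_to_short_us_date("2022-05-02"))  # 5/2/2022
print(iso_to_short_us_date("2022-10-20"))  # 10/20/2022
```

Only zeros at the very start of each number are dropped, because 0* comes before each saved group and never after it.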

Cheat sheets

Don’t try to memorize anything! Once you know the general principles, you can use a regex cheat sheet to build any regular expression you need.

Here are two that I like.

As before, you might notice that there are small differences in how regular expressions are written in different programming languages. In the first cheat sheet, matched values are written as \1, \2, \3, and so on. In the second cheat sheet, matched values are instead written as $1, $2, $3, and so on. $1, $2, $3, and so on will work in some situations, but not in BBEdit or the command line. If you need to switch programming languages, this will be a difference you will catch easily (because it won’t work).

Cheatsheet is at https://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
Cheatsheet is at https://cheatography.com/davechild/cheat-sheets/regular-expressions/