Code snippets – regular expressions

Photo of handwriting possibly an old manuscript or a language that is not modern English. The central lines are clear whilst the surrounding image is blurry.

Inspired by conversations on the NHS-R Slack where code answers are lost over time (it’s not a paid account), and also for those times when a detailed comment in code isn’t appropriate but would be really useful, this blog is part of a series of code snippet explanations.

Where this code snippet comes from

The first in this series relates to a small part of code shared as part of a larger piece of analysis from the Strategy Unit and the British Heart Foundation to visualise socio-economic inequalities in the Coronary Heart Disease (CHD) pathway. The report and analysis was presented at a Midlands Analyst Huddle in January if you would like to know more about the report and the code I’ll be referring to is published on GitHub.

The code is in two parts with the first being data formatting and the second part being the statistics for relative index of inequality (RII). This blog will explain the use of regex to search for strings in R code.

Thanks to Jacqueline Grout, Senior Healthcare Analyst and Tom Jemmett, Senior Data Scientist of the Strategy Unit.

Using regex1 to pick out metrics

The data set is presented in wide form with each column being a metric from 01 to 24c (and not in order!). A dummy data set in .rds format is available in the repository which is currently just a manually created dummy data set but there are plans to reproduce this in code in the future. In the meantime it’s possible to run the script to see the transformations needed in the data for the later RII statistics to be produced.


For this particular analysis metrics 02 to 09 use a denominator of GP population count of over 16 years old whilst all the others use prevalence of Coronary Heart Disease (CHD). For quickness, it’s possible to list out each metric and its corresponding calculation but that does make code less readable, particularly as the metrics included have gaps in the numbering:

activity_by_type_decile_stg %>%
  mutate(metric02_total_ratio = metric02_total / list_size_total * 1000) %>%
  # skip 03 and skip 04
  mutate(metric05_total_ratio = metric05_total / list_size_total * 1000) %>%
  mutate(metric06_total_ratio = metric06_total / list_size_total * 1000) %>%
  mutate(metric07_total_ratio = metric07_total / list_size_total * 1000) %>%
  mutate(metric08_total_ratio = metric08_total / list_size_total * 1000) %>%
  mutate(metric09_total_ratio = metric09_total / list_size_total * 1000) %>%
  # Above metrics 2 - 9 use GP population count as denominator rather than CHD prevalence
  mutate(metric10_total_ratio = metric10_total / metric01_total * 1000) %>%
  mutate(metric11_total_ratio = metric11_total / metric01_total * 1000)
# and so on

The first downside of this is that metrics may be added later and so get missed in future runs of the scripts. That means that 03 and 04 would have no corresponding metric_total_ratio column and it won’t necessarily be obvious or stand out that they are missing.

The other downside is readability is poor because fewer lines of code is less of a wall of text and so easier to debug. However, such shorter code can be confusing and not self explanatory unless you are familiar with it:


# Metrics 2 - 9 listed as use GP list size 16+ as denominator
# else uses CHD prevalence
by_list_total <- c(
  "02", "05",
  "06", "07",
  "08", "09"

[1] "02" "05" "06" "07" "08" "09"
# Produces a string to be used in an if(else) in later code to match against
# the metric numbers listed in by_list_total
by_list_regex <- glue("^metric({paste(by_list_total, collapse = '|')})_total$")


And this is where the blogs like this can be used to explain parts of the code in more detail!

  • by_list_total is creating a vector (a collection of same things, in this case text/strings) of the metrics which is used in by_list_regex. Writing it this way makes it easier to see what metrics are included, missed or incorrect. It’s not a wall of text and can be scan read really quickly.
  • by_list_regex also creates a vector which uses by_list_total. Together they are important for a later search in the metric names.

Translating the string output ^metric(02|05|06|07|08|09)_total$ into English would be a bit like:

At the beginning (^) of the text look for the word metric followed by 02 or (|) 05 or (|) 06 or 07 (|) or 08 (|) or 09, then an underscore and the word total which come at the end ($) of the string.

The magic of a string like this occurs in later code when it’s used in the ifelse() function:

# The wide data is first converted to longer, 
# "tidy" data before adding a new 
# total_column based on the metric name
activity_long <- activity_type_decile |>
    cols = c(starts_with("metric"), -metric01_total),
    names_to = "metric_name",
    values_to = "metric_total"
  ) |>
    total_column = ifelse(
      str_detect(metric_name, by_list_regex),

I usually use case_when() but when the options are binary, like in this case where it’s one population or another, then the ifelse() function is a bit neater.

Getting involved

Regex is very powerful and is used in various languages, not just R, so can be very useful. If you need any help with regex in R or would like to share your own use cases feel free to share them in the NHS-R Slack or submit a blog for this series.

NHS-R Community also have a repository for demos and how tos which people are welcome to contribute code to either through pull requests or issues.

  1. Regex is a contraction of “regular expression” which is a “pattern (or filter) that describes a set of strings that matches the pattern”