Optimising dplyr – NHS-R Community Quarto website

suppressPackageStartupMessages({
  library(tidyverse)
  library(data.table)
})
set.seed(2214)

Recently in a project I was working on I encountered an issue where one code chunk in my RMarkdown report started to take a significant amount of time to run, and it was using almost 100% of my RAM. The data I was working with was an extract of various tables from HES: each row was relating to some activity that was undertaken.

The particular bit of code that was causing me problems was trying to create a simple chart that showed what percentage of people had particular types of activity. As each person may have had more than one of any of the activity types I had to take distinct sets of rows first, then calculate the percentage of individuals having that activity type.

However, there was a slight complication in our data: as this was part of a wider report we would group some of the activity (the Critical Care Bed Day’s) together at one part in the report, but also be able to break these records down into Elective/Emergency.

The table below shows what some of this activity data looked like (we just show the type’s of activity and the subgroups below).

activity_data <- tribble(
  ~type,                  ~sub_group,                     ~alt_sub_group,        ~n,
  "Urgent Service Event", "Emergency Admission",          NA,                    2.00,
  "Urgent Service Event", "A&E Attendance",               NA,                    3.00,
  "Urgent Service Event", "111 Call",                     NA,                    2.00,
  "Planned Contact",      "Outpatient Attendance",        NA,                    4.00,
  "Planned Contact",      "Mental Health Contact",        NA,                    0.12,
  "Planned Contact",      "IAPT Contact",                 NA,                    0.10,
  "Planned Admission",    "Day Case Admission",           NA,                    0.25,
  "Planned Admission",    "Regular Attendance Admission", NA,                    0.07,
  "Planned Admission",    "Elective Admission",           NA,                    0.10,
  "Bed",                  "Critical Care Bed Day",        "Elective Admission",  0.08,
  "Bed",                  "Critical Care Bed Day",        "Emergency Admission", 0.10
)

# don't show the n column here, this is just used to generate rows later
activity_data %>% select(-n)

# A tibble: 11 × 3
   type                 sub_group                    alt_sub_group      
   <chr>                <chr>                        <chr>              
 1 Urgent Service Event Emergency Admission          <NA>               
 2 Urgent Service Event A&E Attendance               <NA>               
 3 Urgent Service Event 111 Call                     <NA>               
 4 Planned Contact      Outpatient Attendance        <NA>               
 5 Planned Contact      Mental Health Contact        <NA>               
 6 Planned Contact      IAPT Contact                 <NA>               
 7 Planned Admission    Day Case Admission           <NA>               
 8 Planned Admission    Regular Attendance Admission <NA>               
 9 Planned Admission    Elective Admission           <NA>               
10 Bed                  Critical Care Bed Day        Elective Admission 
11 Bed                  Critical Care Bed Day        Emergency Admission

To demonstrate the issue with the initial summary code I produced we can generate some synthetic data. The real dataset included far more columns (e.g. the date of the activity, the organisation where the activity happened at), but for this demonstration we just generate a patient id.

activity_synthetic <- function(people) {
  activity_data %>%
    mutate(pid = map(n, function(.x) {
      rpois(people, .x) %>%
        imap(~rep(.y, .x)) %>%
        flatten_dbl()
    })) %>%
    unnest(pid) %>%
    select(-n)
}

activity_100k <- activity_synthetic(100000)
activity_10k <- filter(activity_100k, pid <= 10000)

Here is a sample of this data:

sample_n(activity_10k, 10)

# A tibble: 10 × 4
   type                 sub_group             alt_sub_group         pid
   <chr>                <chr>                 <chr>               <dbl>
 1 Planned Contact      Outpatient Attendance <NA>                 6253
 2 Urgent Service Event A&E Attendance        <NA>                 1773
 3 Planned Contact      Outpatient Attendance <NA>                 3300
 4 Urgent Service Event 111 Call              <NA>                 9349
 5 Bed                  Critical Care Bed Day Emergency Admission  5617
 6 Planned Contact      Outpatient Attendance <NA>                 6971
 7 Urgent Service Event Emergency Admission   <NA>                 1552
 8 Urgent Service Event 111 Call              <NA>                 9607
 9 Urgent Service Event A&E Attendance        <NA>                 8370
10 Urgent Service Event Emergency Admission   <NA>                 4944

Now, for rendering in this particular part of the report we wanted to slightly modify the labels of activity used. We wanted to show the alternative subgroup labels along with the subgroup labels for the Critical Care Bed Day records.

Ideally we would have done this at the data loading stage. However, because these fields were being used at other parts in the report we couldn’t easily change the data at loading.

This is the code that I originally came up with, which I’m wrapping in a function for now so we can benchmark our different approaches later. First we update the type and subgroup columns, then we perform the summarisation steps (get the distinct rows for each individual, then count how many individuals there were for that activity).

dplyr_naive <- function(activity) {
  activity %>%
    mutate(across(type, ~ifelse(sub_group == "Critical Care Bed Day",
                                sub_group, .x)),
           across(sub_group,
                  ~ifelse(type == "Critical Care Bed Day",
                          paste0("Critical Care (",
                                 word(alt_sub_group, 1),
                                 ")"),
                          .x))) %>%
    group_by(type, sub_group) %>%
    distinct(pid) %>%
    count() %>%
    ungroup() %>%
    arrange(type, sub_group)
}

We can test how long this takes to run with our 100,000 patient dataset.

system.time( dplyr_naive(activity_100k) )[["elapsed"]]

[1] 18.34

This may not seem like a huge amount of time, but I was re-rendering this report upwards of 10 times an hour to ensure the pipeline was still working. Furthermore, we were rendering 12 versions of this report at a time (we had a report for each of the 11 STPs in the Midlands, and an overall Midlands report), and this particular bit of code was chewing up all of my available memory, meaning we couldn’t run in parallel.

What is the problem with the approach above? When we update the type and sub_group columns in the mutate step R has to update all of the rows in the dataframe (in this case, 1,181,633 rows). But, R will not overwrite the original table in memory. Instead it will copy the entire data frame. This is what slows everything down.

So the question is, can we reduce the amount of rows we have to update? Yes! We can simply perform the summary steps before updating the labels! This is the approach I settled on: it’s largely the same as above, I just have to include the alt_sub_group column in the group by step, and remove this column from the results at the end.

dplyr_improved <- function(activity) {
  activity %>%
    group_by(type, sub_group, alt_sub_group) %>%
    distinct(pid) %>%
    count() %>%
    ungroup() %>%
    mutate(across(type, ~ifelse(sub_group == "Critical Care Bed Day",
                                sub_group, .x)),
           across(sub_group,
                  ~ifelse(type == "Critical Care Bed Day",
                          paste0("Critical Care (",
                                 word(alt_sub_group, 1),
                                 ")"),
                          .x))) %>%
    select(-alt_sub_group) %>%
    arrange(type, sub_group)
}

How does this perform?

system.time( dplyr_improved(activity_100k) )[["elapsed"]]

[1] 0.19

Significantly better!

data.table

Many will tell you when start working with larger datasets in R you have to resort to using the data.table package. I’m not the most proficient data.table user in the world, so if you can think of a better way of solving this problem please let me know!

First, here is basically the first approach translated to data.table

data.table_naive <- function(activity) {
  adt <- as.data.table(activity)

  adt[, type := ifelse(sub_group == "Critical Care Bed Day",
                       "Critical Care Bed Day",
                       type)]
  adt[, sub_group := ifelse(type == "Critical Care Bed Day",
                            paste0("Critical Care (",
                            word(alt_sub_group, 1),
                            ")"),
                            sub_group)]
  unique(adt)[order(type, sub_group), .(n = .N),
              c("type", "sub_group")]
}

and here is the second approach

data.table_improved <- function(activity) {
  adt <- as.data.table(activity) %>%
    unique() %>%
    .[, .(n = .N), c("type", "sub_group", "alt_sub_group")]

  adt[sub_group == "Critical Care Bed Day",
      type := "Critical Care Bed Day"]
  adt[type == "Critical Care Bed Day",
      subgroup := paste0("Critical Care (", word(alt_sub_group, 1), ")")]
  adt[, alt_sub_group := NULL]

  adt[order(type, sub_group), ]
}

benchmarking

Now, we can benchmark the different functions to see how they perform.

bench::mark(
  dplyr_naive(activity_10k),
  data.table_naive(activity_10k),
  dplyr_improved(activity_10k),
  data.table_improved(activity_10k),
  iterations = 10,
  check = FALSE
) %>%
   select(expression, min, median, mem_alloc, n_gc)

# A tibble: 4 × 4
  expression                             min   median mem_alloc
  <bch:expr>                        <bch:tm> <bch:tm> <bch:byt>
1 dplyr_naive(activity_10k)         864.18ms    1.15s   44.85MB
2 data.table_naive(activity_10k)       1.38s    1.64s   46.16MB
3 dplyr_improved(activity_10k)       36.48ms  46.75ms    9.34MB
4 data.table_improved(activity_10k)   49.2ms  67.56ms    7.75MB

The table above shows how the improved approach is significantly better than the naive approach. We can see that the median time to run the function is significantly better (note that the naive functions are in seconds, but the improved functions are in milliseconds), but also the amount of memory that is being allocated is much, much lower. The n_gc value tells us how many times the “garbage collector” ran. We want this figure to be low.

We can also see how the function works with the 100,000 row dataset.

bench::mark(
  dplyr_naive(activity_100k),
  data.table_naive(activity_100k),
  dplyr_improved(activity_100k),
  data.table_improved(activity_100k),
  iterations = 1,
  check = FALSE
) %>%
  select(expression, min, median, mem_alloc, n_gc)

# A tibble: 4 × 4
  expression                              min   median mem_alloc
  <bch:expr>                         <bch:tm> <bch:tm> <bch:byt>
1 dplyr_naive(activity_100k)           18.06s   18.06s   455.3MB
2 data.table_naive(activity_100k)      16.14s   16.14s   449.6MB
3 dplyr_improved(activity_100k)      242.15ms 242.15ms    88.6MB
4 data.table_improved(activity_100k)    1.09s    1.09s    69.6MB

As we can see from both of these results both approaches in dplyr and data.table are broadly the same, with a slight performance edge to using data.table over dplyr. The big difference is to more sensibly arrange the order of operations so that you summarise first, then perform the costly data manipulation steps on a smaller data set.

Reuse

CC0 1.0