Code snippets – first(), last() and nth() {dplyr} functions

Inspired by conversations on the NHS-R Slack where code answers are lost over time (it’s not a paid account), and also for those times when a detailed comment in code isn’t appropriate but would be really useful, this blog is part of a series of code snippet explanations.

Where this code snippet comes from

This blog comes from small part of code shared as part of a larger piece of analysis from the Strategy Unit and the British Heart Foundation to visualise socio-economic inequalities in the Coronary Heart Disease (CHD) pathway. The report and analysis was presented at a Midlands Analyst Huddle in January. If you would like to know more about the report and the code I’ll be referring to, it is published on GitHub.

The code is in two parts with the first being data formatting and the second part being the statistics for relative index of inequality (RII).

Thanks to Jacqueline Grout, Senior Healthcare Analyst and Tom Jemmett, Senior Data Scientist of the Strategy Unit.

Creating a column with the `first()` or the `last()` data in a group

In the analysis example these functions are used to repeat the highest and lowest population count per Indices of Multiple Deprivation (IMD) decile for each GP practice area1.

The best way to show what is happening with the two {dplyr} functions `first()` and `last()` is to show it with an even simpler dummy data set:

``````library(dplyr, warn.conflicts = FALSE)

# There are two patients A and B with A having a numbers 1:10 and patient B has numbers 11:20

data <- tibble(id = rep(c("PatientA", "PatientB"), 10)) |>
group_by(id) |>
mutate(number = ifelse(id == "PatientA", 1:10, 11:20)) |>
arrange(number) |>
mutate(
max_number = last(number),
min_number = first(number)
) |>
ungroup() # persistent grouping is needed in the code but removed at the end for good practice

data``````
``````# A tibble: 20 × 4
id       number max_number min_number
<chr>     <int>      <int>      <int>
1 PatientA      1         10          1
2 PatientA      2         10          1
3 PatientA      3         10          1
4 PatientA      4         10          1
5 PatientA      5         10          1
6 PatientA      6         10          1
7 PatientA      7         10          1
8 PatientA      8         10          1
9 PatientA      9         10          1
10 PatientA     10         10          1
11 PatientB     11         20         11
12 PatientB     12         20         11
13 PatientB     13         20         11
14 PatientB     14         20         11
15 PatientB     15         20         11
16 PatientB     16         20         11
17 PatientB     17         20         11
18 PatientB     18         20         11
19 PatientB     19         20         11
20 PatientB     20         20         11``````

Taking the min and the max numbers for a patient is dependent upon a persisent `group_by()` and an `arrange()` and, if you are familiar with SQL, is similar to the Windows partitions `MIN` and `MAX` and would be written for this example as `MIN(number) OVER(PARTITION BY id ORDER BY number)`. What SQL can’t do so easily though is to select the `nth()` number for example:

``````library(dplyr)

# There are two patients A and B with A having a numbers 1:10 and patient B has numbers 11:20 and letters from

data <- tibble(id = rep(c("PatientA", "PatientB"), 10)) |>
mutate(letters = letters[1:20]) |> # occurs before grouping so that the letters don't get restricted to the 10 rows in a group
group_by(id) |>
mutate(number = ifelse(id == "PatientA", 1:10, 11:20)) |>
arrange(number) |>
mutate(sixth_number = nth(letters, 6))

data``````
``````# A tibble: 20 × 4
# Groups:   id [2]
id       letters number sixth_number
<chr>    <chr>    <int> <chr>
1 PatientA a            1 k
2 PatientA c            2 k
3 PatientA e            3 k
4 PatientA g            4 k
5 PatientA i            5 k
6 PatientA k            6 k
7 PatientA m            7 k
8 PatientA o            8 k
9 PatientA q            9 k
10 PatientA s           10 k
11 PatientB b           11 l
12 PatientB d           12 l
13 PatientB f           13 l
14 PatientB h           14 l
15 PatientB j           15 l
16 PatientB l           16 l
17 PatientB n           17 l
18 PatientB p           18 l
19 PatientB r           19 l
20 PatientB t           20 l           ``````

and in this example the 6th character for PatientA is k and for PatientB (row 16) is l.

Use case

Another potential use case for the `nth()` character selection is where analysis is looking for the last but one appointment date. For example, if a patient had appointments:

``````appts <- tibble(id = rep(c("PatientA", "PatientB"), 3)) |>
arrange(id) |>
mutate(
appointments = rep(c("2023-01-01", "2023-02-01", "2023-03-01"), 2),
team = rep(c("teamA", "teamB", "teamC"), 2)
)

appts``````
``````# A tibble: 6 × 3
id       appointments team
<chr>    <chr>        <chr>
1 PatientA 2023-01-01   teamA
2 PatientA 2023-02-01   teamB
3 PatientA 2023-03-01   teamC
4 PatientB 2023-01-01   teamA
5 PatientB 2023-02-01   teamB
6 PatientB 2023-03-01   teamC``````

and we needed to know the details from the second to last appointment to see who they had been seen before teamC:

``````appts |>
filter(appointments == nth(appointments, n() - 1), .by = id)``````
``````# A tibble: 2 × 3
id       appointments team
<chr>    <chr>        <chr>
1 PatientA 2023-02-01   teamB
2 PatientB 2023-02-01   teamB``````

The `nth()` function needs to know which column to look at, appointments, and then which number to select. Instead of hard coding this as 2 as that will only be useful in this very small data set, the code uses `n()` to count all the rows and then minus 1. The reason why this is an operation applied to both patients is because of the `.by = id` which is new feature of {dplyr} v1.1.0. Code using the `group_by()` function will do the same thing but is just an extra two lines as it will also require `ungroup()` to remove:

``````appts |>
group_by(id) |>
filter(appointments == nth(appointments, n() - 1)) |>
ungroup()``````
``````# A tibble: 2 × 3
id       appointments team
<chr>    <chr>        <chr>
1 PatientA 2023-02-01   teamB
2 PatientB 2023-02-01   teamB``````

Getting involved

If you need any help with {dplyr} or would like to share your own use cases feel free to share them in the NHS-R Slack or submit a blog for this series.

NHS-R Community also have a repository for demos and how tos which people are welcome to contribute code to either through pull requests or issues.

Footnotes
1. I’ve written more about IMD in a blog for the CDU Data Science Team.