Code snippets – first(), last() and nth() {dplyr} functions
dplyr
R tips
Author
Zoë Turner
Published
March 13, 2023
Modified
July 27, 2024
Inspired by conversations on the NHS-R Slack where code answers are lost over time (it’s not a paid account), and also for those times when a detailed comment in code isn’t appropriate but would be really useful, this blog is part of a series of code snippet explanations.
Where this code snippet comes from
This blog comes from a small part of code shared as part of a larger piece of analysis from the Strategy Unit and the British Heart Foundation to visualise socio-economic inequalities in the Coronary Heart Disease (CHD) pathway. The report and analysis were presented at a Midlands Analyst Huddle in January. If you would like to know more about the report and the code I’ll be referring to, it is published on GitHub.
The code is in two parts with the first being data formatting and the second part being the statistics for relative index of inequality (RII).
Thanks to Jacqueline Grout, Senior Healthcare Analyst and Tom Jemmett, Senior Data Scientist of the Strategy Unit.
Creating a column with the first() or the last() data in a group
In the analysis example these functions are used to repeat the highest and lowest population count per Indices of Multiple Deprivation (IMD) decile for each GP practice area [1].
The best way to show what is happening with the two {dplyr} functions first() and last() is to show it with an even simpler dummy data set:
    library(dplyr, warn.conflicts = FALSE)

    # There are two patients, A and B: PatientA has the numbers 1:10 and PatientB has 11:20
    data <- tibble(id = rep(c("PatientA", "PatientB"), 10)) |>
      group_by(id) |>
      mutate(number = ifelse(id == "PatientA", 1:10, 11:20)) |>
      arrange(number) |>
      mutate(
        max_number = last(number),
        min_number = first(number)
      ) |>
      # persistent grouping is needed in the code but removed at the end for good practice
      ungroup()

    data
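To see why the row order matters for first() and last(), here is a minimal sketch using made-up, deliberately unsorted values. The `unsorted` tibble and the column names `first_number`, `last_number`, `min_number` and `max_number` are assumptions for illustration and are not part of the original analysis.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical, deliberately unsorted values for two patients
unsorted <- tibble(
  id = c("PatientA", "PatientA", "PatientA", "PatientB", "PatientB", "PatientB"),
  number = c(3, 1, 2, 12, 13, 11)
)

# Without arrange(), first() and last() return whatever sits in the
# first and last row of each group, not the smallest and largest value
by_position <- unsorted |>
  group_by(id) |>
  mutate(
    first_number = first(number),
    last_number = last(number)
  ) |>
  ungroup()

# After arrange(), the first and last rows of each group hold the
# minimum and maximum, so first() and last() behave like min() and max()
by_order <- unsorted |>
  arrange(number) |>
  group_by(id) |>
  mutate(
    min_number = first(number),
    max_number = last(number)
  ) |>
  ungroup()

by_position
by_order
```

Because both functions are positional, the sort step is what turns "first row" into "lowest value" here.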
Taking the min and the max numbers for a patient depends on a persistent group_by() and an arrange(). If you are familiar with SQL, this is similar to the window functions MIN and MAX, with the minimum written for this example as MIN(number) OVER(PARTITION BY id ORDER BY number). What SQL can’t do so easily, though, is select the nth() number, for example:
    library(dplyr)

    # There are two patients, A and B: PatientA has the numbers 1:10 and PatientB has 11:20,
    # and every row also gets a letter
    data <- tibble(id = rep(c("PatientA", "PatientB"), 10)) |>
      # occurs before grouping so that the letters don't get restricted to the 10 rows in a group
      mutate(letters = letters[1:20]) |>
      group_by(id) |>
      mutate(number = ifelse(id == "PatientA", 1:10, 11:20)) |>
      arrange(number) |>
      mutate(sixth_number = nth(letters, 6))

    data
# A tibble: 20 × 4
# Groups: id [2]
id letters number sixth_number
<chr> <chr> <int> <chr>
1 PatientA a 1 k
2 PatientA c 2 k
3 PatientA e 3 k
4 PatientA g 4 k
5 PatientA i 5 k
6 PatientA k 6 k
7 PatientA m 7 k
8 PatientA o 8 k
9 PatientA q 9 k
10 PatientA s 10 k
11 PatientB b 11 l
12 PatientB d 12 l
13 PatientB f 13 l
14 PatientB h 14 l
15 PatientB j 15 l
16 PatientB l 16 l
17 PatientB n 17 l
18 PatientB p 18 l
19 PatientB r 19 l
20 PatientB t 20 l
In this example the 6th character for PatientA is k and for PatientB (row 16) is l.
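As a further sketch of how nth() behaves, the example below uses made-up data where the groups have different lengths. The `values` tibble and the summary column names are hypothetical. It relies on two documented features of nth(): a negative position counts back from the end, and `default` supplies a value when the requested position does not exist.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: PatientA has four values, PatientB only two
values <- tibble(
  id = c("PatientA", "PatientA", "PatientA", "PatientA", "PatientB", "PatientB"),
  number = c(1, 2, 3, 4, 11, 12)
)

positions <- values |>
  group_by(id) |>
  summarise(
    # NA for PatientB because there is no third value
    third = nth(number, 3),
    # default replaces the missing position instead of returning NA
    third_or_zero = nth(number, 3, default = 0),
    # a negative position counts back from the end of the group
    last_but_one = nth(number, -2)
  )

positions
```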
Use case
Another potential use case for the nth() character selection is where analysis is looking for the last but one appointment date. For example, if a patient had appointments:
# A tibble: 2 × 3
id appointments team
<chr> <chr> <chr>
1 PatientA 2023-02-01 teamB
2 PatientB 2023-02-01 teamB
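Only the filtered result is shown above, so the following is a sketch of one way it could be produced rather than the original code. The input data is an assumption: three appointments per patient, already in date order, with the hypothetical names `appointments_data` and `last_but_one`.

```r
library(dplyr, warn.conflicts = FALSE)

# Assumed input: three appointments per patient, in date order
appointments_data <- tibble(
  id = rep(c("PatientA", "PatientB"), each = 3),
  appointments = rep(c("2023-01-01", "2023-02-01", "2023-03-01"), times = 2),
  team = rep(c("teamA", "teamB", "teamC"), times = 2)
)

# Keep the last but one appointment for each patient:
# n() counts the rows per patient and .by = id (dplyr >= 1.1.0)
# applies the filter within each patient
last_but_one <- appointments_data |>
  filter(appointments == nth(appointments, n() - 1), .by = id)

last_but_one
```

Because nth() is positional, this only returns the last but one date if the rows are in date order within each patient.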
The nth() function needs to know which column to look at, appointments, and then which position to select. Hard coding this as 2 would only be useful in this very small data set, so the code uses n() to count all the rows and then subtracts 1. The operation is applied to each patient separately because of .by = id, which is a new feature of {dplyr} v1.1.0. Code using the group_by() function will do the same thing but takes an extra two lines, as it also requires ungroup() to remove the grouping:
# A tibble: 2 × 3
id appointments team
<chr> <chr> <chr>
1 PatientA 2023-02-01 teamB
2 PatientB 2023-02-01 teamB
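A sketch of the group_by() version, again under the assumption of three date-ordered appointments per patient (the names `appointments_data` and `last_but_one_grouped` are hypothetical):

```r
library(dplyr, warn.conflicts = FALSE)

# Assumed input: three appointments per patient, in date order
appointments_data <- tibble(
  id = rep(c("PatientA", "PatientB"), each = 3),
  appointments = rep(c("2023-01-01", "2023-02-01", "2023-03-01"), times = 2),
  team = rep(c("teamA", "teamB", "teamC"), times = 2)
)

# Same filter with persistent grouping: group_by() before the filter
# and ungroup() afterwards take the place of .by = id
last_but_one_grouped <- appointments_data |>
  group_by(id) |>
  filter(appointments == nth(appointments, n() - 1)) |>
  ungroup()

last_but_one_grouped
```

The practical difference is that .by groups for a single operation only, whereas group_by() stays in effect until ungroup() is called.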
Getting involved
If you need any help with {dplyr} or would like to share your own use cases feel free to share them in the NHS-R Slack or submit a blog for this series.
NHS-R Community also have a repository for demos and how tos which people are welcome to contribute code to either through pull requests or issues.
Footnotes
[1] I’ve written more about IMD in a blog for the CDU Data Science Team.