library(tidyverse)
Validity <- df %>%
mutate(
length = nchar(NHSNumber),
A = as.numeric(substring(NHSNumber, 1, 1)) * 10,
B = as.numeric(substring(NHSNumber, 2, 2)) * 9,
C = as.numeric(substring(NHSNumber, 3, 3)) * 8,
D = as.numeric(substring(NHSNumber, 4, 4)) * 7,
E = as.numeric(substring(NHSNumber, 5, 5)) * 6,
G = as.numeric(substring(NHSNumber, 6, 6)) * 5,
H = as.numeric(substring(NHSNumber, 7, 7)) * 4,
I = as.numeric(substring(NHSNumber, 8, 8)) * 3,
J = as.numeric(substring(NHSNumber, 9, 9)) * 2,
End = as.numeric(substring(NHSNumber, 10, 10)),
Total = A + B + C + D + E + G + H + I + J,
Remainder = Total %% 11,
Digit = 11 - Remainder,
Summary = case_when(
Digit == 10 ~ 999,
Digit == 11 ~ 0,
TRUE ~ Digit
),
Valid = case_when(
Summary == End & length == 10 ~ TRUE,
TRUE ~ FALSE
)
) %>%
filter(Valid == FALSE)
NHS Numbers are the best. They are numbers that follow you through life, from birth to death, changing only with certain situations and are easy to check for validity from just the numbers themselves.
What? You can check to see if a number is a genuine NHS Number without any other information?!
Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:
https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh). - Link no longer works
I’ve only just started doing things on GitHub but @Sellorm wasn’t very close in saying there may be millions of packages as there are, in fact there are only 2 for R, but surprisingly, not a single one in SQL.
I installed the package, had a quick play around and looked at the code on GitHub. The downside for this package, for me, was that you need to feed in your data using vectors and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE
, TRUE
, FALSE
, TRUE
but I wanted to see the NHS numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file and so the moral of the story is, even if the package doesn’t work for a task or problem, you may still be able to use parts of the code.
The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file directory looking things, included a link to the wiki page on NHS Numbers and its algorithm check https://en.Wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!
Wiki, like R, is open sourced and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf, so, in the spirit of open source, I updated the Wiki page. Proud moment!
The algorithm is quite simple and another package on GitHub https://github.com/samellisq/nhsnumbergenerator generates numbers using it but I didn’t spend too long on these packages as I decided to do my own script, nothing fancy, no loops, functions or packages…
Importing data
Data is imported into R via Excel or, in my case, I used a SQL connection which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. Also, it meant I wasn’t storing NHS Numbers anywhere to then run through R. My systems are secure but it’s always worrying having such large amounts of sensitive data in one place and outside of an even securer system like a clinical database.
Data doesn’t always import into R the same way and for mine I needed to remove NULLs and make the NHSNumber column, which was a factor, into a character and then numeric format:
df <- data %>%
filter(!is.na(NHSNumber)) %>%
mutate(NHSNumber = as.numeric(as.character(NHSNumber)))
Factors
Factors are a new thing for me as but, as I understand it, they put data into groups but in the background. For example, if you had male, female and other these would be 3 factors and if you join that to another data set which doesn’t happen to have “other” as a category the factor would still linger around in the data, appearing in a count as 0 as the table still has the factor information but no data in that group.
To manipulate factor data I’ve seen others change the format to character and so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.
Validity code – explanation
This code uses dplyr from tidyverse to add columns:
Validity <- df %>%
mutate(length = nchar(NHSNumber)),
An NHS number must be 10 characters and nchar()
reminds me of LEN()
in SQL which is what I would use to check the length of data.
One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
A = as.numeric(substring(NHSNumber,1,1)) *10,
B = as.numeric(substring(NHSNumber,2,2)) *9,
C = as.numeric(substring(NHSNumber,3,3)) *8,
D = as.numeric(substring(NHSNumber,4,4)) *7,
E = as.numeric(substring(NHSNumber,5,5)) *6,
I didn’t use F as it’s short for FALSE which changed the meaning and colour of the letter. It can be used as a column name but I missed it out for aesthetics of the script!
G = as.numeric(substring(NHSNumber,6,6)) *5,
H = as.numeric(substring(NHSNumber,7,7)) *4,
I = as.numeric(substring(NHSNumber,8,8)) *3,
J = as.numeric(substring(NHSNumber,9,9)) *2,
There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’ package nhsnumbergenerator creates a table using sequences:
checkdigit_weights = data.frame(digit.position=seq(from=1, to = 9, by =1),
weight=seq(from=10, to = 2, by =-1)
)
note he uses = rather than <-
I like this code. It’s very concise, but it just was easier writing out the multiplications for each part of the NHS Number; the first number is multiplied by 10, the second by 9, the third by 8 and so on.
End = as.numeric(substring(NHSNumber,10,10)),
put this in as the final number in the sequence is the check for the later Digit (also called checksum).
Total = A+B+C+D+E+G+H+I+J,
Just adding all the multiplied columns together.
Remainder = Total %% 11,
This gives the remainder from a division by 11.
Digit = 11- Remainder,
11 take away the remainder/checksum.
Summary = case_when(Digit == 10 ~ 999,
Digit == 11 ~ 0,
TRUE ~ Digit),
Lots of people use if(else())
in R but I like case_when()
because it’s like SQL. The lines run in logical order:
- when the digit = 10 then it’s invalid so I put in 999 as that’s so high (in the hundreds) it shouldn’t match the Digit/Checksum (which is a unit),
- if it’s 11 then change that to 0 following the methodology,
- else just use the digit number.
Actually, thinking about it I probably don’t need the 10 becomes 999 as 10 could never match a single unit number. Perhaps that’s redundant code.
Valid = case_when(Summary == End & length == 10 ~ TRUE,
TRUE ~ FALSE
)) %>%
case_when()
again but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10 then TRUE, else FALSE.
I liked this like of code as it was a bit strange saying TRUE then FALSE but it’s logical!
filter(Valid == FALSE)
Just bring back what isn’t valid.
Did it work?
I think it did. If I got anything wrong I’d love to get feedback but I ran through many hundreds of thousands of NHS Numbers through it and found…..
No invalid numbers
I possibly should have checked beforehand but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered in at source. Still, it was fun and could be applied to manually entered data from data kept in spreadsheets for example.
An interesting blog
As here were no SQL code scripts on GitHub for NHS Number validations I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/ which is a blog by @HealthAnalystUK from 2011. The reason I’m referring to it in this blog is because HealthAnalystUK not only shared SQL code that looked very similar to the code here but also R and uses it as a function:
NHSvalidation <- function(NHSnumber) {
NHSlength <- length(NHSnumber)
A <- as.numeric(substr(NHSnumber, 1, 1))
B <- as.numeric(substr(NHSnumber, 2, 2))
C <- as.numeric(substr(NHSnumber, 3, 3))
D <- as.numeric(substr(NHSnumber, 4, 4))
E <- as.numeric(substr(NHSnumber, 5, 5))
F <- as.numeric(substr(NHSnumber, 6, 6))
G <- as.numeric(substr(NHSnumber, 7, 7))
H <- as.numeric(substr(NHSnumber, 8, 8))
I <- as.numeric(substr(NHSnumber, 9, 9))
J <- as.numeric(substr(NHSnumber, 10, 10))
if ((A == B) & (B == C) & (C == D) & (D == E) & (E == F) & (F == G) & (G == H) & (H == I) & (I == J)) {
UniformNumberCheck <- 1
} else {
UniformNumberCheck <- 0
}
Modulus <- ((A * 10) + (B * 9) + (C * 8) + (D * 7) + (E * 6) + (F * 5) + (G * 4) + (H * 3) + (I * 2))
Modulus <- (11 - (Modulus %% 11))
if (
((Modulus == J) & (UniformNumberCheck != 1) & (NHSlength == 10)) | ((Modulus == 11) & (J == 0) & (UniformNumberCheck != 1) & (NHSlength == 10))) {
ReturnValue <- 1
} else {
ReturnValue <- 0
}
return(ReturnValue)
}
I hadn’t coded the check for repeating numbers:
if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
and I couldn’t find any reference to this in the Wiki page or the document from the University of Bristol so I’m unsure if this is a part of the methodology. If it is, then I’ve seen at least 1 NHS Number that would fail this test.
A conclusion
If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).
Blogs can be emailed to nhs.rcommunity@nhs.net and get checked by a group of enthusiast volunteers for publishing.
This blog was written Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.