Author

Mark Sellors

Published

July 19, 2022

Modified

July 27, 2024

Being the author of a package with tens of thousands of users must be incredibly rewarding. All those people getting value from your work and using it to do incredible things. Few of us will ever write a package that has that kind of reach though.

Most of us must be content to give back to our communities in smaller ways.

In 2019 I was working for a company building software for Genomics England and the NHS. We had many conversations about NHS numbers and NHS Spine and so on, and over time, I became interested in the numbers themselves.

As a manager, I wasn’t directly involved in writing any code, but was still heavily involved in the R community in my free time. So I wrote some code to validate the checksums used by NHS numbers to get a better understanding of how they work and to play with some R.

NHS numbers use a fairly simple format. The first 9 digits are the actual number and the 10th digit is a checksum: a piece of data used to verify another piece of data. That 10th digit is generated by an algorithm that takes the first 9 digits as its input. This means you can check the validity of a number by taking the first 9 digits, computing the checksum and comparing it with the provided checksum. If they match, the number is valid.

Taken together, all 10 digits form the complete NHS number.
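To make the idea concrete, here is a minimal sketch of how such a checksum can be computed, assuming the publicly documented Modulus 11 scheme for NHS numbers: weight the first 9 digits from 10 down to 2, sum the products, take the remainder after dividing by 11 and subtract it from 11. The helper names nhs_check_digit() and nhs_is_valid() are hypothetical and are not the nhsnumber package's own implementation. They also take the number as a character string, an assumption made here so that leading zeros are not lost.

```r
# Illustrative sketch only: hypothetical helpers, not the nhsnumber package API.
# Assumes the Modulus 11 scheme: weights 10..2 applied to the first 9 digits.

nhs_check_digit <- function(first_nine) {
  digits <- as.integer(strsplit(first_nine, "")[[1]])
  stopifnot(length(digits) == 9, !anyNA(digits))

  weights <- 10:2
  remainder <- sum(digits * weights) %% 11
  check <- 11 - remainder

  if (check == 11) return(0L)           # a result of 11 is written as 0
  if (check == 10) return(NA_integer_)  # a result of 10 means no valid number exists
  as.integer(check)
}

nhs_is_valid <- function(number) {
  # Must be exactly 10 digits before any arithmetic is attempted
  if (!grepl("^[0-9]{10}$", number)) return(FALSE)

  expected <- nhs_check_digit(substr(number, 1, 9))
  provided <- as.integer(substr(number, 10, 10))
  !is.na(expected) && expected == provided
}

nhs_check_digit("123456788")  # 1
nhs_is_valid("1234567881")    # TRUE
nhs_is_valid("1234567882")    # FALSE
```

For the made-up number 123456788 the weighted sum is 208, the remainder after dividing by 11 is 10, and 11 minus 10 gives a check digit of 1.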

library(nhsnumber)

## Warning: package 'nhsnumber' was built under R version
## 4.1.3

# Take a made up number and generate a checksum
get_checksum(123456788, full_output = TRUE)
[1] 1234567881
# Take that output and a version of the same input number with an
# incorrect checksum and test their validity
is_valid(c(1234567881, 1234567882))
[1]  TRUE FALSE

Credit card numbers, and many other numbers found out in the wild, use the same technique, though often with different algorithms (the Luhn algorithm in the case of credit card numbers). This makes it possible for us to check that numbers are well formed before we pass them on to upstream services for final validation and association with the people they were assigned to.
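For comparison, here is a minimal sketch of the Luhn check in R. The function name luhn_is_valid() is hypothetical and is not part of the nhsnumber package. It assumes the number is supplied as a string of digits with no spaces or dashes.

```r
# Illustrative sketch only: a hypothetical helper for the Luhn algorithm.

luhn_is_valid <- function(number) {
  # Reject anything that is not purely digits
  if (!grepl("^[0-9]+$", number)) return(FALSE)

  # Work from the rightmost digit (the check digit) towards the left
  digits <- rev(as.integer(strsplit(number, "")[[1]]))

  # Double every second digit, counting from the right (positions 2, 4, ...)
  idx <- seq_along(digits) %% 2 == 0
  doubled <- digits[idx] * 2

  # Subtracting 9 is equivalent to adding the two digits of the product
  doubled[doubled > 9] <- doubled[doubled > 9] - 9
  digits[idx] <- doubled

  # Valid when the total is a multiple of 10
  sum(digits) %% 10 == 0
}

luhn_is_valid("4111111111111111")  # TRUE (a widely used test card number)
luhn_is_valid("4111111111111112")  # FALSE
```

The shape is the same as the NHS check: a weighted sum over the digits, reduced with a modulus, with the final digit chosen so the result comes out right.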

Of course, in most cases, the algorithms are well known, and there’s nothing to stop people generating fake numbers that pass checksum validation. However, this early-stage checksum validation can be used to flag typos and transcription errors in a number, or weed out obvious chancers.

Once I had an implementation figured out, I wrapped it up into an R package, dropped it on GitHub, tweeted about it…

…and then promptly forgot about it.

Over the years I’ve written a lot of super-niche and one-off R packages, principally for my own entertainment, and this felt like another one of those.

Cut to a year later and I’m working for RStudio, helping out a little at Data Orchard and spending a lot of time thinking about the data science community and ways to give back to the community that’s always been so generous to me. It was at this point that I decided to make the effort to publish the package to CRAN.

Getting your first package on CRAN can be a nerve-wracking experience, but it was a fairly smooth process, with only small changes required by the CRAN team before it could be published.

When you publish any package, you never really know if anyone will use it. You’re pushing your work out into the world to see if it can survive on its own. After nhsnumber was published, I would occasionally check its stats and see low but consistent numbers of downloads. A couple of times since then, actual users have reached out to say thanks or report a bug.

As a stats-first language rather than a general purpose one, some might say that R is itself somewhat niche (though a pretty large niche, it must be said!). Add to that a package that only makes sense in one geographic region, and is specific to those working within and alongside one organisation in that region, and we’re knee-deep in niches! Clearly this sort of package is never going to be applicable to all R users.

But none of that means a package isn’t valuable. If only one user benefits from its existence, I’d consider that a success. If your work can serve a community, however small, and help improve their work in some way, I consider that a win. So if, like me, you have ideas for R packages but consider them too niche to be worth the time, I’d encourage you to share them in some way anyway. You’ll learn a lot along the way, maybe have a bit of fun thinking about how best to organise and present your package and may, just may, improve someone else’s life too.

Happy coding everyone!

Mark

PS. Because of the interest shown in the nhsnumber R package by the NHS-R community, I decided it would be fun to port the package to Python too.

Mark Sellors is a technologist working in the data science and technical computing space. He is the author of several niche packages for R as well as the Field Guide To The R Ecosystem. By day he works as a Solutions Engineering Manager at RStudio, is a Non-executive director at Data Orchard and the founder of the R4pi.org project.

The other R packages that Mark has published to CRAN are:

You can find Mark on Twitter and LinkedIn and read his infrequently updated blog.

Reuse

CC0