Promoting the use of R in the NHS

Blog Article

It’s at this time of year I need to renew my season ticket and I usually get one for the year. Out of interest, I wanted to find out how much the ticket cost per day, taking into account I don’t use it on weekends or my paid holidays. I started my workings out initially in Excel but got as far as typing the formula =WORKDAYS() before I realised it was going to take some working out and perhaps I should give it a go in R as a function…

@ChrisBeeley had recently shown me functions in R and I was surprised how familiar they were as I’ve seen them on Stack Overflow (usually skimmed over those) and they are similar to functions in SQL which I’ve used (not written) where you feed in parameters. When I write code I try to work out how each part works and build it up but writing a function requires running the whole thing and then checking the result, the objects that are created in the function do not materialise so are never available to check. Not having objects building up in the environment console is one of the benefits of using a function, that and not repeating scripts which then ALL need updating if something changes.

Bus ticket function

This is the final function which if you run you’ll see just creates a function.

#Week starts on Sunday (1)
DailyBusFare_function <- function(StartDate,EmployHoliday,Cost,wfh){
  startDate <- dmy(StartDate)
  endDate <- as.Date(startDate) %m+% months(12)
#Now build a sequence between the dates:
  myDates <-seq(from = startDate, to = endDate, by = "days")
  working_days <- sum(wday(myDates)>1&wday(myDates)<7)-length(holidayLONDON(year = lubridate::year(startDate))) - EmployHoliday - wfh
per_day <- Cost/working_days
print(per_day)
}

Running the function you feed in parameters which don’t create their own objects:

 DailyBusFare_function("11/07/2019",27,612,1) 

[1] 2.707965

Going through each line:

To make sure each part within the function works it’s best to write it in another script or move the bit betweeen the curly brackets {}.

Firstly, the startDate is self explanatory but within the function I’ve set the endDate to be dependent upon the startDate and be automatically 1 year later.

Originally when I was trying to find the “year after” a date I found some documentation about lubridate’s function dyear:

#Next couple of lines needed to run the endDate line!
library(lubridate)
startDate <- dmy("11/07/2019")
endDate <- startDate + dyears(1)

but this gives an exact year after a given date and doesn’t take into account leap years. I only remember this because 2020 will be a leap year so the end date I got was a day out!

Instead, @ChrisBeeley suggested the following:

endDate <- as.Date(startDate) %m+% months(12)

Next, the code builds a sequence of days. I got this idea of building up the days from the blogs on getting days between two dates but it has also come in use when plotting over time in things like SPCs when some of the time periods are not in the dataset but would make sense appearing as 0 count.

library(lubridate)

#To run so that the sequencing works
#using as.Date() returns incorrect date formats 0011-07-20 so use dmy from
#lubridate to transform the date
  startDate <- dmy("11/07/2019")
  endDate <- as.Date(startDate) %m+% months(12)
#Now build a sequence between the dates:
  myDates <- seq(from = startDate, to = endDate, by = "days")

All of these return values so trying to open them from the Global Environment won’t do anything. It is possible view the first parts of the values but also typing:

#compactly displays the structure of object, including the format (date in this case)
str(myDates)

Date[1:367], format: “2019-07-11” “2019-07-12” “2019-07-13” “2019-07-14” “2019-07-15” …

#gives a summary of the structure
summary(myDates)
Min.      1st Qu.       Median         Mean      3rd Qu.

“2019-07-11” “2019-10-10” “2020-01-10” “2020-01-10” “2020-04-10” Max. “2020-07-11”

To find out what a function does type ?str or ?summary in the console. The help file will then appear in the bottom right Help screen.

Next I worked out working_days. I got the idea from a blog which said to use length and which:

working_days <- length(which((wday(myDates)>1&wday(myDates)<7)))

Note that the value appears as 262L which is a count of a logical vector. Typing ?logical into the Console gives this in the Help:

Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer._

I was familiar with “length”, it does a count essentially of factors or vectors (type ?length into the Console for information) but “which” wasn’t something I knew about. @ChrisBeeley explained what which does with the following example:

#Generate a list of random logical values
a <- sample(c(TRUE, FALSE), 10, replace = TRUE)
#Look at list
a

[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE

#using which against the list
gives the number in the list where the logic = TRUE
which(a)

[1] 4 5 8

#counts how many logical
arguments in the list (should be 10)
length(a)

[1] 10

#counts the number of TRUE
logical arguments
length(which(a))

[1] 3

Then @ChrisBeeley suggested just using “sum” instead of length(which()) which counts a logical vector:

sum(a)

[1] 3

It seems this has been discussed on Stack Overflow before: https://stack overflow.com/questions/2190756/how-to-count-true-values-in-a-logical-vector

It’s worthy of note that using sum will also count NAs so the example on Stack Overflow suggest adding:

sum(a, na.rm = TRUE)

[1] 3

This won’t affect the objects created in this blog as there were no NAs in them but it’s just something that could cause a problem if used in a different context.

working_days <- sum(wday(myDates)>1&wday(myDates)<7)
#Just to check adding na.rm = TRUE gives the same result
  working_days <- sum(wday(myDates)>1&wday(myDates)<7, na.rm = TRUE)

I then wanted to take into account bank/public holidays as I wouldn’t use the ticket on those days so I used the function holidayLONDON from the package timeDate:

length(holidayLONDON(year = lubridate::year(startDate)))

[1] 8

I used lubridate::year because the package timeDate uses a parameter called year so the code would read year = year(startDate) which is confusing to me let alone the function!

Again, I counted the vectors using length. This is a crude way of getting bank/public holidays as it is looking at a calendar year and not a period (July to July in this case).

I did look at a package called bizdays but whilst that seemed to be good for building a calendar I couldn’t work out how to make it work so I just stuck with the timeDate package. I think as I get more confident in R it might be something I could look into the actual code for because all packages are open source and available to view through CRAN or GitHub.

Back to the function…

I then added “- EmployHoliday” so I could reduce the days by my paid holidays and “- wfh” to take into account days I’ve worked from home and therefore not travelled into work.

The next part of the code takes the entered Cost and divides by the Working_days, printing the output to the screen:

per_day <- Cost/working_days

print(per_day)

And so the answer to the cost per day is printed in the Console:

DailyBusFare_function("11/07/2019",27,612,1)

[1] 2.707965

A conclusion… of sorts

Whilst this isn’t really related to the NHS it’s been useful to go through the process of producing a function to solve a problem and then to explain it, line by line, for the benefit of others.

I’d recommend doing this to further your knowledge of R at whatever level you are and particularly if you are just learning or consider yourself a novice as sometimes blogs don’t always detail the reasons why things were done (or why they were not done because it all went wrong!).

This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

Leave a Reply