I suspect there are many blogs about R and why it’s so great:
- It’s free!
- It’s open source!
- There’s a great community!
- It’s reproducible!
You can certainly read on social media (#rstats) about what R can do for you, but what about what R does to you, particularly as an analyst in the NHS or social care?
The learning bit…
Back in April 2018, when NHS-R Community was running its first introduction to R course in Leeds, my only knowledge of R had come from a free online base R course with edX that I hadn’t really understood and never finished. Online learning for me is like being back at school: I get the right answer, feel good, move on and promptly forget it all. After the NHS-R course I dabbled with the dplyr package, tried to run lots of other people’s scripts and generally failed a lot. It was a frustrating time of slow delivery of tasks and bafflement. When things did work, I had no idea why, and I often wondered what all the fuss was about, particularly as I could do the same things in familiar programmes.
Hindsight is a wonderful thing and I can now see my frustrations weren’t just one lump of confusion, but could be broken down into how I used these ‘familiar programmes’, namely:
- SQL for data engineering and
- Excel for visualisations.
Although I still used (and use) SQL to get my data, I was copying it into Excel and then loading that into R; once loaded, I’d realise I needed to group and count something, or remove something I didn’t need, and it seemed too long-winded to go back to SQL, copy to Excel and load it all over again.
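Looking back, the sort of step I kept returning to SQL for can be done in R with the dplyr package. Here’s a minimal sketch (the referrals data frame and its column names are made up for illustration):

library(dplyr)

# hypothetical data frame of referrals loaded from SQL/Excel
referrals %>%
  group_by(team) %>%              # group by team
  summarise(patients = n()) %>%   # count the rows in each group
  filter(team != "Unknown")       # drop something I didn't need

A step like this in R means the ‘go back to SQL, copy to Excel, reload’ loop disappears.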
The second frustration, visualising in R, came from being unable to replicate the look of my Excel charts: getting the same colours, the same font-size headers and so on. I’ve yet to resolve that completely, but it was here that I realised the worth of R wasn’t in making it look like Excel; it was that R could do so much more than Excel. I needed to start thinking about what I should be visualising and how to do it.
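For what it’s worth, most of the Excel look can be chased in R through ggplot2’s colour settings and theme() options; a rough sketch (the data frame, colour and sizes here are made up) might be:

library(ggplot2)

# hypothetical monthly_counts data frame with columns month and count
ggplot(monthly_counts, aes(x = month, y = count)) +
  geom_col(fill = "#1F77B4") +                 # pick a bar colour to match Excel
  labs(title = "Referrals by month") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14),  # match the header font size
        axis.text = element_text(size = 10))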
It’s all about the data
The entire process, from SQL to Excel, is about the data: how it joins, what it’s counting, and so on. Getting the data ‘just so’ often requires so many clever solutions to so many problems that, as I now realise, it consumes most of my thinking time, leaving little energy for considering why I’m doing this at all: is it the best thing to do, and how can I get more meaning from the data?
If I’d picked up someone else’s script or Excel document on ordering benchmarking data, perhaps the time I spent would have gone on improving it instead of building it. In a perfect world, I would feed back on what I’d done and share any improvements or corrections.
But what can R do that SQL and Excel can’t?
As a very simple example, consider calculating a median.
In Excel it’s reasonably easy to use the formula MEDIAN(), but to make it dynamic (such that it automatically updates if you add more data) the formula becomes much more complicated. Here is a page explaining how it’s done:
https://www.extendoffice.com/documents/excel/2611-excel-average-dynamic-range.html
Lots of graphics are needed to describe how to do it, and you’ll note the page is actually for AVERAGE, which is the mean rather than the median.
In SQL, calculating the median can be solved in various ways:
https://stackoverflow.com/questions/1342898/function-to-calculate-median-in-sql-server
There are 204 examples to go through to solve this! I didn’t go through them, as that’s too much needless thinking required when R can do it in one line:
median(table$column)
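To show it actually running, here’s a tiny self-contained example (the waiting times are made up):

# made-up waiting times in days
waits <- data.frame(days = c(3, 7, 2, 41, 5, 12, 9))
median(waits$days)
#> [1] 7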
Something this simple removes all the otherwise necessary thinking about the best way to get a median, and then the checking that the answer is right. That may be easy enough in Excel, but I know I make mistakes and will have to repeat the exercise more times than I care to admit, and doing so in Excel involves so many steps that each require checking. All of this uses up that precious resource of focused thinking. With R doing the job quickly and reliably, I now have time to consider whether the median is actually the right thing to use, particularly if it’s a case of “we’ve always done that”. Then I can ponder what it is telling me: is it increasing or decreasing, and why? Is there any other data that could help explain the change?
As with any great piece of analysis or coding, time invested at the beginning pays off. With the benchmarking example of reordered bar charts, spending a couple of days getting it right produced something interactive that was engaging for the users. But Excel continues to demand thinking and checking time, whereas R doesn’t: once it’s coded, that’s it. And that’s where the analyst should be, in that space after the work is done. That’s where you can think about the analysis itself: was it enough? What does it tell you? What else can you look at? Although I’ve had analyst in my job title for a long time, this is what analysis is all about, and it’s something, I now realise, that I’ve not been doing because my “thinking time” has been used up elsewhere.
This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.