Promoting the use of R in the NHS

Blog Article

Project structure

The aim of this blog article it to describe the initial experience of creating a Shiny dashboard, this process involved a bit of reading on markdown documents and shiny apps to learn how to code it.

When designing this dashboard, I aimed to cover these four basic steps:

  1. Download open data from a Github Repository
  2. Create several indicators with their population rates by using World Band API
  3. Build a Shiny dashboard containing different visualizations types
  4. Use plotly library for interactive plots to animate charts and maps in the Shiny app

1.Download data

When creating any dashboard, I would like to feed daily data to it and also update it as soon as that data became available. Many resources have become available since the start of COVID19 pandemic to analyze information about cases and deaths on different countries worldwide, so I decided to use JHU Github [https://github.com/CSSEGISandData/COVID-19/archive/master.zip] to download daily data.

The scrip below selects daily confirmed, death and recovered COVID-19 cases, and downloads it in a zipped file. It gets extracted and updated.

DownloadCOVIDData <- function() {
  
  # Create data directory if doesn't exist
    if(!dir.exists("data")){dir.create("data")}
  # Download master.zip file 
  download.file(
    url = "https://github.com/CSSEGISandData/COVID-19/archive/master.zip",
    destfile = "data/covid19JH.zip"
  )
  data_path <- "COVID-19-master/csse_covid_19_data/csse_covid_19_time_series/"
  
  # Unzip covid19JH.zip file to extract .csv metric files (confirmed, deaths, recovered)
  # time_series_covid19_confirmed_global.csv, time_series_covid19_deaths_global.csv, 
  # time_series_covid19_recovered_global.csv
  unzip(zipfile = "data/covid19JH.zip",
    
    files = paste0(data_path, c("time_series_covid19_confirmed_global.csv",
                                "time_series_covid19_deaths_global.csv",
                                "time_series_covid19_recovered_global.csv")),
    exdir = "data",
    junkpaths = T
  ) 
}

DownloadCOVIDData()

Then I had to update that initial download every half an hour, in case they update the file throughout the day. I though about this script to be run on a server or VM seven days a week, so it will periodically check to get the most up to date information.

Dataupdate <- function(){
                              T_refresh = 0.5  # hours
                              if(!dir_exists("data")){
                              dir.create("data")
                              DownloadCOVIDData()
  }
  else if((!file.exists("data/covid19JH.zip"))||as.double( Sys.time() - file_info("data/covid19JH.zip")$change_time, units = "hours")>T_refresh ){
    # If the latest refresh exceeds 30 minutes, then you download it again
    DownloadCOVIDData()
  }
}

Dataupdate()

Once the data was downloaded, I did some cleansing and data transformations from wide to long format, and also included new calculations with popularization data extracted from the World Banck API to create each of the indicators as rates per 10,000 population, using seven days rolling average to obtain an average of those daily indicators.

2.Extract relevant indicators

After downloading the original data files, I extracted and assigned names to the three metrics I will use in the dashboard (data_confirmed,data_deceased,data_recovered)

input_covid <- list.files("data/",".csv")

NFILES <- length(input_covid)
file_Name <-c("data_confirmed","data_deceased","data_recovered","WDI_indicators")

for(i in 1:NFILES) {     
  assign(paste0(file_Name[i]),                                   # Read and store data frames
         read_csv(paste0("data/",
                         input_covid[i])))
}

3. Reshape data for plots

Originally, the data is created in a wide format, and I transformed it into long format, to include some calculations furthermore , I also wanted the data to be aggregated to Country level and apply relevant date format to display time series data and animations using a timeline in maps.

For the purpose of this blog post, I will only describe this process for one metric COVID19 Confrimed cases (the full code for remaining two metrics (deceased cases,recovered cases) can be found in my Githup repo at the end of this post).

# Confirmed cases 
library(tidyr)

names(data_confirmed)
# First rename the two first columns using rename() function 
confirmed_tidy <- data_confirmed %>% 
                  rename(Province = 'Province/State',
                         Country = 'Country/Region') %>% 
                  pivot_longer(names_to = "date", 
                  cols = 5:ncol(data_confirmed)) %>% 
                  group_by(Province,Country,Lat,Long,date) %>% 
                  summarise("Confirmed"= sum(value,na.rm = T)) %>% 
                  mutate(date =as.Date(date,"%m/%d/%y"))

4. Stack all COVID19 metrics and Lat Long metrics into a master file

The final metrics set is made of recovered and death COVID19 cases, by country and by date. Countries and dates are displayed in rows and metrics in colums.The original data also includes two columns for latitude and longitude used later to produce a map using Leaflet package.

# Now we merge them together
MAPDATA <- confirmed_tidy %>% 
              full_join(deceased_tidy)

MAPDATAF <- MAPDATA %>% 
              full_join(recovered_tidy) %>% 
              arrange(Province,Country,date) %>% 
              # Recode NA values into 0 
              mutate(  
                Confirmed = ifelse(is.na(Confirmed),0,Confirmed),
                Deaths = ifelse(is.na(Deaths),0,Deaths),
                Recovered = ifelse(is.na(Recovered),0,Recovered)
                    )

Along the process of building the final data set, I will produce several csv files to validate each data step.

file_pathCHK <-('C://Pablo UK//43 R projects 2021//04 My Shiny app//04 Mycovid19 app//CHECKS/')
File_name <-'/MAPDATAG.csv' 
write.csv(MAPDATAG,paste0(file_pathCHK,File_name),row.names = T)
MAPDATAG <- MAPDATAF %>% mutate(
              Confirmed = ifelse(is.na(Confirmed),0,Confirmed),
              Deaths = ifelse(is.na(Deaths),0,Deaths),
              Recovered = ifelse(is.na(Recovered),0,Recovered)
  )

MAPDATAH <- MAPDATAG %>% 
            pivot_longer(names_to = "Metric",
                         cols = c("Confirmed","Deaths","Recovered")) %>% 
            ungroup()

PLOT_LEAFLET_MAPS <- MAPDATAH %>%
                     pivot_wider(names_from = Metric, values_from = c(value))

5. Two output files: COVID metrics plus Lat Long variable for Leadlet maps and COVID1 metrics to be mereged with country population figures

5.1 COVID metrics set including Lat Long variables for Leadlet maps

The next step is the creation of new calculated fields for Confirmed and Deaths rated by 10,000 population for each country. I first created new variables for those rates and then I merged the population figures in using the world Bank API

MAPDATAH <- MAPDATAG %>% 
            pivot_longer(names_to = "Metric",
                         cols = c("Confirmed","Deaths","Recovered")) %>% 
            ungroup()

DATAMAP <- MAPDATAH 
PLOT_LEAFLET <- DATAMAP %>%
                pivot_wider(names_from = Metric, values_from = c(value))

PLOT_LEAFLET_MAPS <- PLOT_LEAFLET

save.image("C:/Pablo UK/43 R projects 2021/04 My Shiny app/04 Mycovid19 app/PLOT LEAFLET MAPS.RData")

Following the set of above calculations this is the set used for the maps in the app

head(PLOT_LEAFLET_MAPS)
5.2 COVID19 population rates

The next step is the creation of new calculated fields for Confirmed and Deaths rated by 10,000 population for each country. I first created new variables for those rates and then I merged the population figures in using the world Bank API

# This file weill be use to comparae COVID19 rates across different countries
PLOT_LEAFLET2_conf <- PLOT_LEAFLET_MAPS %>% 
                select(Country,date,Confirmed) %>% 
                group_by(Country,date) %>% 
                summarise("Confirmed" = sum(Confirmed,na.rm = T))

PLOT_LEAFLET2_death <- PLOT_LEAFLET_MAPS %>% 
                       select(Country,date,Deaths) %>% 
                       group_by(Country,date) %>% 
                        summarise("Death" = sum(Deaths,na.rm = T))
            
PLOT_LEAFLET2_Recov <- PLOT_LEAFLET_MAPS %>% 
                            select(Country,date,Recovered) %>% 
                            group_by(Country,date) %>% 
                            summarise("Recovered" = sum(Recovered,na.rm = T))

# Join together
PLOT_LEAFLET_RATES <- PLOT_LEAFLET2_conf %>% 
                       full_join(PLOT_LEAFLET2_death) %>% 
                       arrange(Country,date)

PLOT_LEAFLET_RATES <- PLOT_LEAFLET_RATES %>% 
                      full_join(PLOT_LEAFLET2_Recov) %>% 
                      arrange(Country,date)

PLOT_LEAFLET_CDR_NUM <-PLOT_LEAFLET_RATES

save.image("C:/Pablo UK/43 R projects 2021/04 My Shiny app/04 Mycovid19 app/PLOT LEAFLET CDR NUM.RData")

Our final set to compute COVID19 population rates is displayed below:

head(PLOT_LEAFLET_CDR_NUM)

6.Import WDI population figures to compute x10,000 population rates

The aim of this section is to download population figures and compute the relevant taxes *10,000from World bank Development Indicators API. All the details from this script can be found in the Github repository.

Just to highlight three main sub-sections included here:

1. The use of Source() function to bring another script that pulls 2019 countries population data from WDI API

# # Include population figures 

source("UI/ui_get_population_figures.R",local =TRUE) 

The aim of this first section is to download population figures from World bank Development Indicators API. After downloading the requested data by using the World Bank’s API, the resulting XMS file is formatted in long country-year format.

# LOAD population figures in the right way
WDIinputdata<-WDI(indicator=c("SP.POP.TOTL"),extra=TRUE)

2. Cleaning the population data to macth COVID19 country names list

This file is located in the Shiny folder structure within a specific folder for UI scripts Within the ui_get_population_figures.R file the script performs two other tasks:

  1. Using a which statement, account for mismatches in country names between COVID and WDI country names

I intend to match the country names in both data sources, so population data matchces COVID19 metrics

# LOAD population figures in the right way
# Input missing values
POP_POPULATED <- POP_DATA_2019

CNpop<-c("Bahamas, The","Brunei Darussalam","Congo, Dem. Rep.", "Congo, Rep." , "Egypt, Arab Rep.", "Gambia, The","Iran, Islamic Rep.","Korea, Rep.",
         "Kyrgyz Republic", "Micronesia, Fed. Sts." ,"Russian Federation"  , "St. Kitts and Nevis",  "St. Lucia","St. Vincent and the Grenadines" ,
         "Slovak Republic","Syrian Arab Republic","United States", "Venezuela, RB", "Yemen, Rep.")

length(CNpop)

CNindic<-c("Bahamas","Brunei","Congo (Brazzaville)","Congo (Kinshasa)","Egypt", "Gambia" , "Iran"   , "Korea, South", 
           "Kyrgyzstan"  , "Micronesia", "Russia", "Saint Kitts and Nevis"  ,"Saint Lucia",  "Saint Vincent and the Grenadines",
           "Slovakia","Syria","US", "Venezuela" ,"Yemen" )
length(CNindic)

# Then we replace those values 
POP_POPULATED[which(POP_POPULATED$country %in% CNpop ), "country"] <- CNindic

6.1 Compute 10000 population rates

There are a couple of calculations needed before rates for COVID19 confirmed, recovered and death cases are calculated:

First the JHU cases is based on cumulative data, so previous day is substracted to obtain the daily count of events for each metrics

POPG_RATES <- POPG %>% 
                  arrange(Country,date) %>% 
                  mutate(
                          ConD = Confirmed - lag(Confirmed, n=1),
                          RecD = Recovered - lag(Recovered, n=1),
                          DeathD = Death - lag(Death, n=1)
                  )
tail(POPG_RATES)

Compute now rates based on daily figures

POPG_RATESF <- POPG_RATES %>% 
  select(Country, date,year,population,ConD,RecD,DeathD) %>% 
  mutate(
            CONFR =ceiling(((ConD/population)*10000)),
            RECR = ceiling(((RecD/population)*10000)),
            DEATHR =ceiling(((DeathD/population)*10000))
        )

tail(POPG_RATESF)

Get those rates as 7 previous days rolling average to smooth daily fluctionations

library(zoo)

RATES7DGAVG <- POPRATESG %>%
                group_by(Country) %>%   
                select(date, Country,population,ConD,RecD,DeathD) %>%
                mutate(CONF_ma07 = rollmean(ConD, k = 7, fill = NA),
                       REC_ma07 = rollmean(RecD,k = 7, fill = NA),
                       DEATH_ma07 = rollmean(DeathD, k = 7, fill = NA))

Finally there is a round done on calculataed taxes to avoid any decimal places

POP_POPULATEDT <- POP_POPULATED_RENAME %>% 
                 mutate(
                          Confirmed_10000 = round(Confirmed_Rate,digits=0),
                          Recovered_10000 = round(Recovered_Rate,digits=0),
                          Deaths_10000 = round(Death_Rate,digits=0)
                      )

POP_POPULATED <- POP_POPULATEDT %>% 
                  select(
                            date,Country,Population,Confirmed,Recovered,Death,
                            Conf_7D_10000 = Confirmed_10000,
                            Rec_7D_10000 = Recovered_10000,
                            Death_7D_10000 = Deaths_10000
                            
                  )
                            
POP_POPULATED

6.2 Dataset with rates ready for Shiny app

This is the final data set that goes into the shiny app

head(POP_POPULATED)

7.Building the Shiny dashboard

Once that all the data is ready to build the Shiny dashboard, there where three main tabs that I wanted to display on the dashboard:

The script for the dashboard can be quite long, and it is available in the Github repository at the end of this blog article.

As a general overview, I will mention that I opted for a standard Sidebar layout, using fluidrow and column functions to arrange the plots layout on each tab.

ui <- dashboardPage(
  
  dashboardHeader(title = "COVID-19"),
  # THis Sidebar menu allows us to include new items on the sidebar
  dashboardSidebar(
                    sidebarMenu(
                    # Setting id makes input$tabs give the tabName of currently-selected tab
                    id = "tabs",
                    menuItem("About", tabName = "about", icon = icon("desktop")),
                    menuItem("Map", tabName = "map", icon = icon("map")),
                    menuItem("Plots", tabName = "plot", icon = icon("wifi")),
                    menuItem("Forecast", tabName = "forecast", icon = icon("chart-line")))
  )
  ,
  dashboardBody(  # Infobox: Total figures KPI world

This is an example of the functions used to populate the infoBoxes on top of the dashbaord

dashboardBody(  # Infobox: Total figures KPI world
    fluidRow( infoBoxOutput("Total_cases_WORLD",width=3),
              infoBoxOutput("Total_recovered_WORLD",width=3),
              infoBoxOutput("Total_deaths_WORLD",width=3),
                          infoBoxOutput("Date", width = 3)
              
              ),  

The reactive components used in the plots and the maps to produce several animations were created using a input$Time_Slider function

dailyData <- reactive(PLOT_LEAFLET_MAPS[PLOT_LEAFLET_MAPS$date ==   format(input$Time_Slider,"%Y/%m/%d"),])
  prevdailyData <-reactive(PLOT_LEAFLET_MAPS[PLOT_LEAFLET_MAPS$date == format(input$Time_Slider-1,"%Y/%m/%d"),])
  RATESTable <- reactive(POP_POPULATED[POP_POPULATED$date == format(input$Time_Slider,"%Y/%m/%d"),])

7.1. Interactive map with KPI and timeline

  • Pop-up and tool tips display COVID19 Total, recovered and death cases
  • Circles radius are proportional to number of cases per country
  • Dynamic animation: Map changes as data varies in time

alt text

7.2. Interactive line charts using plotly library

  • KPI number of cases and day to day percent change
  • Drop down menu to filter for specific countries
  • Line chart cases by country, selected by drop down menu
  • Top 10 country rates *10,000 cases
  • Interactive plots to Zoom in and Zoom out using Plotly library to display Top 10 country rates *10,000 cases

alt text

7.3. In development

  • My intention is to include a new tab in coming weeks to include a predictive modelling tool using some model tool such as tidymodels or modeltime, just to test it. In the long run I would like to learn how to implement hierarchical bayesian modeling into some dashboard.
  • Also I would like to include a specific .CSS file in YAML section of the shiny dashboard to fine tunning the format applied on each tab.

8.Source code available in Github

As this blog article was a brief description of the Shiny app I’ve designed, please follow the link below to get the source code from Github: 00 Maps data prep_SHINY_APP.R, 01 Leaf and pop figures_SHINY_APP.R, 02 ui_server_SHINY_APP.R

Pablo Github Repository

Any comments to this blog article, please feel free to email me at: pablo.leonrodenas@nhs.net

Pablo Leon-Rodenas

NHS England and NHS Improvement

Leave a Reply