Author

Zoë Turner

Published

April 23, 2020

This blog was originally a vignette for the NHSRdatasets package that I wrote when I had only been using R for about a year and I found myself contributing to a package! I produced this vignette which detailed all the laborious steps I took to take the wide data sets for provisionally recorded deaths up to 2019 and made them long form.

I had been using these spreadsheets regularly for a mortality report in the Trust I worked in, but this data became incredibly important from 2020 (as well as changed format yet again) and so I’m sure there are better way to clean it and would be much easier to find than it was back before 2020. For the sake of transparency and showing how code changes over time I’m going to share it in its original state here and perhaps it can be its own training challenge as you free to improve or correct things that inevitably go out of date.

The data output stored in NHSRdatasets won’t be changed and instead will have vignettes built to use the data and help explain and explore it.

The original vignette

This vignette details why and how the ons_mortality dataset was created.

The data were pulled together from the weekly xls spreadsheets provided by the Office of National Statistics (ONS) from 2010 to 2019 for training purposes and is a static data set.

Data is released every Friday and includes provisionally registered deaths for England and Wales for:

  • Total, all deaths
  • regions of usual residence
  • age bands (persons and for males/females)
  • by underlying cause ‘All respiratory diseases’ (the methodology has changed over the years, see the subsection for more information)

Further notes taken from the spreadsheet:

1 This average is based on the actual number of death registrations recorded for each corresponding week over the previous five years. Moveable public holidays, when register offices are closed, affect the number of registrations made in the published weeks and in the corresponding weeks in previous years.

2 Counts of deaths by underlying cause exclude deaths at age under 28 days.

3 Coding of deaths by underlying cause for the latest week is not yet complete.

4 Does not include deaths where age is either missing or not yet fully coded. For this reason counts of ‘Persons’, ‘Males’ and ‘Females’ may not sum to ‘Total Deaths, all ages’.

5 Does not include deaths of those resident outside England and Wales or those records where the place of residence is either missing or not yet fully coded. For this reason counts for “Deaths by Region of usual residence” may not sum to “Total deaths, all ages”.

Later confirmation of deaths

Aggregate numbers of deaths are later confirmed and released in by ONS monthly. Numbers are released at a lower geographical level to the weekly data.

Preparation of data

In order to prepare this data for analysis the worksheet for weekly figures was extracted and the data was moved to long form and merged together over the available years. This was extracted in April 2020 when full previous years’ data was available.

The dataset contains:

  • category_1: character, containing the names of the groups for counts, e.g. “Total deaths”, “all ages”.
  • category_2: character, subcategory of names of groups where necessary, e.g. details of region: “East”, details of age bands “15-44”.
  • counts: numeric, numbers of deaths in whole numbers and average numbers with decimal points. To retain the integrity of the format this column data is left as character.
  • date: date, format is yyyy-mm-dd; all dates are a Friday.
  • week_no: integer, each week in a year is numbered sequentially.

Source and licence acknowledgement

This data has been made available through Office of National Statistics under the Open Government Licence.

The main page for downloads is found on the ONS People Population and Community pages.

Creating the dataset

Spreadsheets are in xls format from 2010 to 2019.

Each week is released with a new file name listing the week number and all weeks for that year are included in the spreadsheet. Counts for weeks ahead are missing and listed as NA.

When a year has passed it is renamed to the year name and does not change.

# 2019
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2019/publishedweek522019.xls",
  destfile = "2019Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2018
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2018/publishedweek522018withupdatedrespiratoryrow.xls",
  destfile = "2018Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2017
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2017/publishedweek522017.xls",
  destfile = "2017Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2016
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2016/publishedweek522016.xls",
  destfile = "2016Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2015
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2015/publishedweek2015.xls",
  destfile = "2015Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2014
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2014/publishedweek2014.xls",
  destfile = "2014Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2013
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2013/publishedweek2013.xls",
  destfile = "2013Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2012
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2012/publishedweek2012.xls",
  destfile = "2012Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2011
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2011/publishedweek2011.xls",
  destfile = "2011Mortality.xls",
  method = "wininet",
  mode = "wb")

# 2010
download.file(
  "https://www.ons.gov.uk/file?uri=%2fpeoplepopulationandcommunity%2fbirthsdeathsandmarriages%2fdeaths%2fdatasets%2fweeklyprovisionalfiguresondeathsregisteredinenglandandwales%2f2010/publishedweek2010.xls",
  destfile = "2010Mortality.xls",
  method = "wininet",
  mode = "wb")

Extracting worksheets to csv

The data can be found in the worksheet called Weekly figures “date/year”. Other worksheets in spreadsheets 2010 to 2019 are:

  • Contents
  • Information
  • Terms and conditions
  • Weekly figures
  • Related publications

The following code finds the tab name and then creates a csv file for each tab/worksheet.

library(readxl)
library(dplyr)

files_list <- list.files(path = "working_files",
                         pattern = "*.xls",
                         full.names = TRUE)


read_then_csv <- function(sheet, path) {
  pathbase <- path %>%
    basename() %>%
    tools::file_path_sans_ext()
  path %>%
    read_excel(sheet = sheet) %>%
    write_csv(paste0(pathbase, "-", sheet, ".csv"))
}


for(j in 1:length(files_list)){

  path <- paste0(files_list[j])

  path %>%
    excel_sheets() %>%
    set_names() %>%
    map(read_then_csv, path = path)
}

Loading the weekly figure worksheets

From 2010 to 2015 the tab name was Weekly Figures then it changed capitalisation to Weekly figures.

files_list_sheets <- list.files(path = "working_files",
                         pattern = "Weekly",
                         full.names = TRUE
                         )

for(i in files_list_sheets) {

  x <- read_csv((i), col_types = cols(.default = col_character()))

  assign(i, x)
}

Format data functions

Two functions were required to format the spreadsheets as there were changes to the layout from 2016.

Packages used:

  • janitor: a specific package for cleaning data from Excel used here to remove the blank lines and columns, convert Excel serial numbers to dates and to format all column headers by removing spaces in those header names and removing capitalisation.
  • dplyr: specifically dplyr for data manipulation (already loaded earlier).
  • tidyr: for pivoting data from wide to long.
  • stringi: a package to works with strings and was used to find certain characters or words.
  • lubridate: a package to work with dates and used here to format dates to the same format.

2010 - 2015

library(janitor)
library(tidyr)
library(stringi)
library(lubridate)

# Column names that are not related to data points to be removed. This is the same for all years' spreadsheets.
# Note that single quotes are used for the categories as one sentence includes '' in the text (4). 

remove_lookup <- c('week over the previous five years1',
  'Deaths by underlying cause2,3',
  'Footnotes',
  '1 This average is based on the actual number of death registrations recorded for each corresponding week over the previous five years. Moveable public holidays, when register offices are closed, affect the number of registrations made in the published weeks and in the corresponding weeks in previous years.',
  '2 Counts of deaths by underlying cause exclude deaths at age under 28 days.',
  '3 Coding of deaths by underlying cause for the latest week is not yet complete.',
  "4Does not include deaths where age is either missing or not yet fully coded. For this reason counts of 'Persons', 'Males' and 'Females' may not sum to 'Total Deaths, all ages'.",
  '5 Does not include deaths of those resident outside England and Wales or those records where the place of residence is either missing or not yet fully coded. For this reason counts for "Deaths by Region of usual residence" may not sum to "Total deaths, all ages".',
  'Source: Office for National Statistics',
  'Deaths by age group'
)

formatFunction <- function(file){

ONS <- file %>%
  clean_names %>%
  remove_empty(c("rows","cols")) %>%
  filter(!contents %in% remove_lookup) %>%
  
  # useful categories are found in the column contents but also include the footnote number
  mutate(Category = case_when(is.na(x2) & str_detect(contents, "Region") ~ "Region",
                              is.na(x2) & str_detect(contents, "Persons") ~ "Persons",
                              is.na(x2) & str_detect(contents, "Females") ~ "Females",
                              is.na(x2) & str_detect(contents, "Males") ~ "Males")
  ) %>%
  select(contents, Category, everything()) %>%
  
  # to ensure data like Persons, Males and Females 
  fill(Category) %>%
  
  # categories with Persons, Males and Females in the original column do not correspond directly to data points (wide form data) so are removed by referring to str_detect to find the word
  filter(!str_detect(contents, "Persons"),
         !str_detect(contents, "Males"),
         !str_detect(contents, "Females")) %>%
  
  # the two columns for Category and contents are merged to Categories to bring the Category column first. Some categories don't have subcategories and these are preceded by NA_ with this merge
  unite("Categories", Category, contents) %>%
  filter(Categories != "Region_Deaths by Region of usual residence 5") %>%
  
  # the NA_ is removed from some of the category names
  mutate(Categories = case_when(str_detect(Categories, "NA_") ~ str_replace(Categories, "NA_", ""),
                                TRUE ~ Categories))

  # Push date row to column names
onsFormattedJanitor <- row_to_names(ONS, 3)

  # move data from wide to long form using pivot_longer
x <- onsFormattedJanitor %>%
  pivot_longer(cols = -`Week ended`,
               names_to = "allDates",
               values_to = "counts") %>%
  
  # some spreadsheets import with Excel serial numbers for dates and others as dates, janitor is used to correct this
  mutate(realDate = dmy(allDates),
         ExcelSerialDate = case_when(stri_length(allDates) == 5 ~ excel_numeric_to_date(as.numeric(allDates), date_system = "modern")),
         date = case_when(is.na(realDate) ~ ExcelSerialDate,
                          TRUE ~ realDate)) %>%
  group_by(`Week ended`) %>%
  
  # the week number is replaced as this was lost with the moving of dates to the column headers
  mutate(week_no = row_number()) %>%
  ungroup() %>%
  
  # Category is a staging name as this is followed by a splitting of the column into category_1 and category_2
  rename(Category = `Week ended`) %>%
  
  # to split the columns there are various characters used as a split point ",", "-", and ":" in the respiratory category the version is denoted by "v"
  mutate(category_1 = case_when(str_detect(Category, ",") ~
                                  substr(Category,1,str_locate(Category, ",") -1),
                                str_detect(Category, ":") ~
                                  substr(Category,1,str_locate(Category, ":") -1),
                                str_detect(Category, "_") ~
                                  substr(Category,1,str_locate(Category, "_") -1),
                                str_detect(Category, "respiratory")  ~
                                  "All respiratory diseases (ICD-10 J00-J99) ICD-10"),
         category_2 = case_when(str_detect(Category, ",") ~
                                  substr(Category,str_locate(Category, ", ") +2, str_length(Category)),
                                str_detect(Category, ":") ~
                                  substr(Category,str_locate(Category, ": ") +2, str_length(Category)),
                                str_detect(Category, "_") ~
                                  substr(Category,str_locate(Category, "_") +1, str_length(Category)),
                                str_detect(Category, "respiratory")  ~
                                  substr(Category,str_locate(Category, "v"), str_length(Category)) ),
         
         # the data for Total deaths: average of corresponding week over the previous 5 years is split over two cells in the spreasheet
         category_2 = recode(category_2,
                             "average of corresponding" = "average of same week over 5 years")) %>%
  select(category_1,
         category_2,
         counts,
         date,
         week_no
         ) %>%
  
  # 2011 requires this code to remove rows where there are no counts and because there are 2 rows relating to respiratory death numbers (see the Respiratory section) the previous methodology has been included in the spreadsheet with : for counts. This code does not affect other years' data/
  filter(!is.na(counts),
         counts != ":") %>%
  fill(category_1)

return(x)

}

Mortality2010 <- formatFunction(`working_files/Weekly/2010Mortality-Weekly Figures 2010.csv`)

# 2011 has two lines relating to respiratory, v 2001 only has one data point and the rest of the year is 2010
Mortality2011 <- formatFunction(`working_files/Weekly/2011Mortality-Weekly Figures 2011.csv`) %>%
  mutate(category_2 = case_when(is.na(category_2) & category_1 == "All respiratory diseases (ICD-10 J00-J99) ICD-10" ~ "v 2010",
                                TRUE ~ category_2))

Mortality2012 <- formatFunction(`working_files/Weekly/2012Mortality-Weekly Figures 2012.csv`)
Mortality2013 <- formatFunction(`working_files/Weekly/2013Mortality-Weekly Figures 2013.csv`)
Mortality2014 <- formatFunction(`working_files/Weekly/2014Mortality-Weekly Figures 2014.csv`)
Mortality2015 <- formatFunction(`working_files/Weekly/2015Mortality-Weekly Figures 2015.csv`)

Format data 2016 - 2019

formatFunction2016 <- function(file){

  ONS <- file %>%
    clean_names %>%
    
    # An extra column has been added for region codes (not included in the dataset) meaning contents are now found in the janitor generated column name x2
    mutate(x2 = case_when(is.na(x2) ~ contents,
                              TRUE ~ x2)) %>%
    remove_empty(c("rows","cols")) %>%
    select(-contents) %>%
    filter(!x2 %in% remove_lookup) %>%
    
    # Region has changed to region
    mutate(Category = case_when(is.na(x3) & str_detect(x2, "region") ~ "Region",
                                is.na(x3) & str_detect(x2, "Persons") ~ "Persons",
                                is.na(x3) & str_detect(x2, "Females") ~ "Females",
                                is.na(x3) & str_detect(x2, "Males") ~ "Males",
                                TRUE ~ NA_character_)
    ) %>%
    select(x2, Category, everything()) %>%
    fill(Category) %>%
    filter(!str_detect(x2, 'Persons'),
           !str_detect(x2, 'Males'),
           !str_detect(x2, 'Females')) %>%
    unite("Categories", Category, x2) %>%
    filter(Categories != 'Region_Deaths by Region of usual residence 5') %>%
    mutate(Categories = case_when(str_detect(Categories, "NA_") ~ str_replace(Categories, "NA_", ""),
                                  TRUE ~ Categories))

  # Push date row to column names
  onsFormattedJanitor <- row_to_names(ONS, 3)

  # move data from wide to long form using pivot_longer
  x <- onsFormattedJanitor %>%
    pivot_longer(cols = -`Week ended`,
                 names_to = "allDates",
                 values_to = "counts") %>%
    mutate(realDate = dmy(allDates),
           ExcelSerialDate = case_when(stri_length(allDates) == 5 ~ excel_numeric_to_date(as.numeric(allDates), date_system = "modern")),
           date = case_when(is.na(realDate) ~ ExcelSerialDate,
                            TRUE ~ realDate)) %>%
    group_by(`Week ended`) %>%
    mutate(week_no = row_number()) %>%
    ungroup() %>%
    rename(Category = `Week ended`) %>%
    mutate(category_1 = case_when(str_detect(Category, ",") ~
                                  substr(Category,1,str_locate(Category, ",") -1),
                                str_detect(Category, ":") ~
                                  substr(Category,1,str_locate(Category, ":") -1),
                                str_detect(Category, "_") ~
                                  substr(Category,1,str_locate(Category, "_") -1),
                                str_detect(Category, "respiratory")  ~
                                  "All respiratory diseases (ICD-10 J00-J99) ICD-10"),
         category_2 = case_when(str_detect(Category, ",") ~
                                  substr(Category,str_locate(Category, ", ") +2, str_length(Category)),
                                str_detect(Category, ":") ~
                                  substr(Category,str_locate(Category, ": ") +2, str_length(Category)),
                                str_detect(Category, "_") ~
                                  substr(Category,str_locate(Category, "_") +1, str_length(Category)),
                                str_detect(Category, "respiratory")  ~
                                  substr(Category,str_locate(Category, "v"), str_length(Category)) ),
         category_2 = recode(category_2,
                             "average of corresponding" = "average of same week over 5 years")) %>%
    select(category_1,
           category_2,
           counts,
           date,
           week_no
    ) %>%
    filter(!is.na(counts)) 

  return(x)

}


Mortality2016 <- formatFunction2016(`working_files/Weekly/2016Mortality-Weekly figures 2016.csv`)
Mortality2017 <- formatFunction2016(`working_files/Weekly/2017Mortality-Weekly figures 2017.csv`)
Mortality2018 <- formatFunction2016(`working_files/Weekly/2018Mortality-Weekly figures 2018.csv`)
Mortality2019 <- formatFunction2016(`working_files/Weekly/2019Mortality-Weekly figures 2019.csv`)

Bind together

ons_mortality <- do.call("rbind", list(
                      Mortality2010,
                      Mortality2011,
                      Mortality2012,
                      Mortality2013,
                      Mortality2014,
                      Mortality2015,
                      Mortality2016,
                      Mortality2017,
                      Mortality2018,
                      Mortality2019))

Load the data

Respiratory

Deaths by underlying cause of respiratory diseases All respiratory diseases (ICD-10 J00-J99) appear as:

ons_mortality %>% 
  mutate(year = year(date)) %>% 
  select(year, category_1, category_2) %>% 
  group_by(year, category_1, category_2) %>% 
  filter(category_1 == 'All respiratory diseases (ICD-10 J00-J99) ICD-10') %>% 
  slice(1)

Notes from the spreadsheets in relation to these methodology changes:

2011

  • Respiratory deaths for 2011 are coded to the new version of ICD-10 while for 2010 they are coded to the previous version. Week 1 2011 has been coded to both versions to give an indication of the impact of the change.

2014

  • For deaths registered from January 2014, cause of death is coded to the ICD-10 classification using IRIS software. Further information about the implementation of the software is available on the ONS website.

  • All deaths registered in week 2 were dual-coded using both ICD-10 v2010 (NCHS) and ICD-10 v2013 (IRIS). An information note providing the preliminary findings on the impact of the implementation of IRIS software for ICD-10 cause of death coding on mortality statistics in England and Wales is available on the ONS website.

2015

  • For deaths registered from January 2014, cause of death is coded to the ICD-10 classification using IRIS software. Further information about the implementation of the software is available on the ONS website.

Age bands

Numbers are provided for categories: Persons, Females and Males.

As detailed in the notes, if the age is missing or not coded at the time of release then the counts of ‘Persons’, ‘Males’ and ‘Females’ may not sum to ‘Total Deaths, all ages’.

Note that these age bands were changed in 2020 releases.

ons_mortality %>% 
  select(category_1, category_2) %>% 
  group_by(category_1, category_2) %>% 
  filter(category_1 %in% c('Persons','Females','Males')) %>% 
  slice(1)

Remove working_files folder

unlink("working_files", recursive = TRUE)

Citations

The data used to build the mortality dataset in this package is released under © Crown copyright and is free to use under the terms of the Open Government Licence. Any subsequent use should include a source accreditation to ONS - Source: Office for National Statistics licensed under the Open Government Licence.

Back to top

Reuse

CC0