Author

Sebastian Zeki

Published

July 23, 2018

Modified

March 23, 2024

There are many occasions when a column of data needs to be created from an already existing column for ease of data manipulation. For example, perhaps you have a body of text as a pathology report and you want to extract all the reports where the diagnosis is ‘dysplasia’.

You could just subset the data using grepl so that you only get the reports that mention this word…but what if the data needs to be cleaned prior to subsetting like excluding reports where the diagnosis is normal but the phrase ‘No evidence of dysplasia’ is present. Or perhaps there are other manipulations needed prior to subsetting.

This is where data accordionisation is useful. This simply means the creation of data from (usually) a column into another column in the same dataframe.

The neatest way to do this is with the mutate function from the {dplyr} package which is devoted to data cleaning. There are also other ways which I will demonstrate at the end.

The input data here will be an endoscopy data set:

Age <- sample(1:100, 130, replace = TRUE)
Dx <- sample(c("NDBE", "LGD", "HGD", "IMC"), 130, replace = TRUE)
TimeOfEndoscopy <- sample(1:60, 130, replace = TRUE)

library(dplyr)

EMRdf <- data.frame(Age, Dx, TimeOfEndoscopy, stringsAsFactors = F)

Perhaps you need to calculate the number of hours spent doing each endoscopy rather than the number of minutes

EMRdftbb <- EMRdf %>% mutate(TimeOfEndoscopy / 60)

# install.packages("knitr")
library(knitr)
library(kableExtra)

# Just show the top 20 results

kable(head(EMRdftbb, 20))
Age Dx TimeOfEndoscopy TimeOfEndoscopy/60
12 LGD 54 0.9000000
40 LGD 52 0.8666667
100 LGD 25 0.4166667
96 LGD 27 0.4500000
5 NDBE 39 0.6500000
54 LGD 38 0.6333333
65 HGD 41 0.6833333
11 IMC 53 0.8833333
2 HGD 40 0.6666667
59 HGD 21 0.3500000
19 NDBE 25 0.4166667
39 LGD 1 0.0166667
49 NDBE 51 0.8500000
21 LGD 22 0.3666667
23 LGD 1 0.0166667
60 NDBE 28 0.4666667
2 LGD 29 0.4833333
22 NDBE 46 0.7666667
44 NDBE 6 0.1000000
17 HGD 22 0.3666667

That is useful but what if you want to classify the amount of time spent doing each endoscopy as follows: <0.4 hours is too little time and >0.4 hours is too long.

Using ifelse() with mutate for conditional accordionisation.

For this we would use ifelse(). However this can be combined with mutate() so that the result gets put in another column as follows

EMRdf2 <- EMRdf %>%
  mutate(TimeInHours = TimeOfEndoscopy / 60) %>%
  mutate(TimeClassification = ifelse(TimeInHours > 0.4, "Too Long", "Too Short"))

# Just show the top 20 results

kable(head(EMRdf2, 20))
Age Dx TimeOfEndoscopy TimeInHours TimeClassification
12 LGD 54 0.9000000 Too Long
40 LGD 52 0.8666667 Too Long
100 LGD 25 0.4166667 Too Long
96 LGD 27 0.4500000 Too Long
5 NDBE 39 0.6500000 Too Long
54 LGD 38 0.6333333 Too Long
65 HGD 41 0.6833333 Too Long
11 IMC 53 0.8833333 Too Long
2 HGD 40 0.6666667 Too Long
59 HGD 21 0.3500000 Too Short
19 NDBE 25 0.4166667 Too Long
39 LGD 1 0.0166667 Too Short
49 NDBE 51 0.8500000 Too Long
21 LGD 22 0.3666667 Too Short
23 LGD 1 0.0166667 Too Short
60 NDBE 28 0.4666667 Too Long
2 LGD 29 0.4833333 Too Long
22 NDBE 46 0.7666667 Too Long
44 NDBE 6 0.1000000 Too Short
17 HGD 22 0.3666667 Too Short

Note how we can chain the mutate() function together.

Using multiple ifelse()

What if we want to get more complex and put several classifiers in? We just use more ifelse’s:

EMRdf2 <- EMRdf %>%
  mutate(TimeInHours = TimeOfEndoscopy / 60) %>%
  mutate(TimeClassification = ifelse(TimeInHours > 0.8, "Too Long", ifelse(TimeInHours < 0.5, "Too Short", ifelse(TimeInHours >= 0.5 & TimeInHours <= 0.8, "Just Right", "N"))))

# Just show the top 20 results

kable(head(EMRdf2, 20))
Age Dx TimeOfEndoscopy TimeInHours TimeClassification
12 LGD 54 0.9000000 Too Long
40 LGD 52 0.8666667 Too Long
100 LGD 25 0.4166667 Too Short
96 LGD 27 0.4500000 Too Short
5 NDBE 39 0.6500000 Just Right
54 LGD 38 0.6333333 Just Right
65 HGD 41 0.6833333 Just Right
11 IMC 53 0.8833333 Too Long
2 HGD 40 0.6666667 Just Right
59 HGD 21 0.3500000 Too Short
19 NDBE 25 0.4166667 Too Short
39 LGD 1 0.0166667 Too Short
49 NDBE 51 0.8500000 Too Long
21 LGD 22 0.3666667 Too Short
23 LGD 1 0.0166667 Too Short
60 NDBE 28 0.4666667 Too Short
2 LGD 29 0.4833333 Too Short
22 NDBE 46 0.7666667 Just Right
44 NDBE 6 0.1000000 Too Short
17 HGD 22 0.3666667 Too Short

Using multiple ifelse() with grepl() or string_extract

Of course we need to extract information from text as well as numeric data. We can do this using grepl() or string_extract() from the library(stringr).

Let’s say we want to extract all the samples that had IMC. We don’t want to subset the data, just extract IMC into a column that says IMC and the rest say ’Non-IMC’

Using the dataset above:

library(stringr)

EMRdf$MyIMC_Column <- str_extract(EMRdf$Dx, "IMC")

# to fill the NA's we would do:EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),"IMC","NoIMC")

# Another way to do this (really should be for more complex examples when you want to extract the entire contents of the cell that has the match)

EMRdf$MyIMC_Column <- ifelse(grepl("IMC", EMRdf$Dx), str_extract(EMRdf$Dx, "IMC"), "NoIMC")

So data can be usefully created from data for further analysis.

Hopefully this way of extrapolating data and especially using conditional expressions to categorise data according to some rules is a helpful way to get more out of your data.

Please follow @gastroDS on twitter

This article originally appeared on https://sebastiz.github.io/gastrodatascience/ and has been edited to render in Quarto and had NHS-R styles applied.

Back to top

Reuse

CC0

Citation

For attribution, please cite this work as:
Zeki, Sebastian. 2018. “How to Extrapolate Data from Data.” July 23, 2018. https://nhs-r-community.github.io/nhs-r-community/blog/how-to-extrapolate-data-from-data.html.