(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)

Literature reviews are the cornerstone of science. Keeping abreast of developments within any given field of enquiry has become increasingly difficult given the enormous amount of new research. Databases and search technology have made finding relevant literature easy, but keeping a coherent overview of the discourse within a field of enquiry is an ever more demanding task.

Scholars have proposed many approaches to analysing literature, which can be placed along a continuum from traditional narrative methods to systematic analytic syntheses of text using machine learning. Traditional reviews are biased because they rely entirely on the interpretation of the researcher. Analytical approaches follow a process that is more like scientific experimentation. These systematic methods are reproducible in the way literature is searched and collated but still rely on subjective interpretation.

Machine learning provides new methods to analyse large swaths of text. Although these methods sound exciting, they cannot provide insight by themselves. Machine learning cannot interpret a text; it can only summarise and structure a corpus. Human interpretation is still required to make sense of the information.

This article introduces a mixed-method technique for reviewing literature, combining qualitative and quantitative methods. I used this method to analyse literature published by the International Water Association as part of my dissertation into water utility marketing. You can read the code below, or download it from GitHub. Detailed information about the methodology is available through FigShare.

A literature review with RQDA

The purpose of this review was to ascertain the relevance of marketing theory to the discourse of literature in water management. This analysis uses a sample of 244 journal abstracts, each of which was coded with the RQDA library. This library provides functionality for qualitative data analysis. RQDA provides a graphical user interface to mark sections of text and assign them to a code, as shown below.

Marking topics in an abstract with RQDA.

You can load a corpus of text into RQDA and mark each of the texts with a series of codes. The texts and the codes are stored in an SQLite database, which can be easily queried for further analysis.
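As a rough illustration of querying such a database, the sketch below builds a tiny in-memory SQLite database with the DBI and RSQLite packages and joins codings to their code names and file names. The table and column names (`source`, `freecode`, `coding`, `cid`, `fid`) are simplified stand-ins chosen for this example and are not guaranteed to match the layout of a real RQDA project file.

```r
library(DBI)
library(RSQLite)

# Toy in-memory database standing in for an RQDA project file.
# Table and column names are illustrative assumptions.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "source",
             data.frame(id = 1:2, name = c("abstract_01", "abstract_02")))
dbWriteTable(con, "freecode",
             data.frame(id = 1:2, name = c("Asset management", "Water quality")))
dbWriteTable(con, "coding",
             data.frame(cid = c(1, 2, 1), fid = c(1, 1, 2)))

# Join each coding to the name of its code and the name of its file
codings <- dbGetQuery(con, "
    SELECT source.name AS filename, freecode.name AS codename
    FROM coding
    JOIN source   ON coding.fid = source.id
    JOIN freecode ON coding.cid = freecode.id
    ORDER BY filename, codename")

dbDisconnect(con)
print(codings)
```

For a real project, the connection would presumably point at the project file instead of `":memory:"`, and the query would need to follow the actual schema.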

I used a marketing dictionary to assess the abstracts from journals published by the International Water Association from the perspective of marketing. This phase resulted in a database with 244 abstracts and their associated coding.

Discourse Network Analysis

Once all abstracts are coded, we can start analysing the internal structure of the IWA literature. First, let’s have a look at the occurrence of the topics identified for the corpus of abstracts.

The first lines in this snippet call the tidyverse and RQDA libraries and open the abstracts database. The getCodingTable() function provides a data frame with each of the marked topics and their location. This function allows us to visualise the occurrence of the topics in the literature.

library(tidyverse)
library(RQDA)
## Open project
openProject("IWA_Abstracts.rqda", updateGUI = TRUE)

## Visualise codes
getCodingTable() %>%
    group_by(codename) %>%
    count() %>%
    arrange(n) %>%
    ungroup() %>%
    mutate(codename = factor(codename, levels = codename)) %>%
    ggplot(aes(codename, n)) +
        geom_col() +
        coord_flip() +
        xlab("Code name") + ylab("Occurrence")
Frequencies of topics in IWA literature.

This bar chart tells us that the literature is preoccupied with asset management and the quality of the product (water) or the service (customer perception). This finding is interesting, but it reveals little about how the topics relate to each other. We can use discourse network analysis to find a deeper structure in the literature.

Discourse Network Analysis

We can view each abstract with two or more topics as a network where each topic is connected. The example below shows four abstracts with two or more codes and their internal networks.

Examples of complete networks for four abstracts.

The union of these four networks forms a more extensive network that allows us to analyse the structure of the corpus of literature, shown below.

Union of networks and community detection.
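The idea of per-abstract complete networks and their union can be sketched on toy data. The abstract names and codes below are hypothetical, and pooling the edges into one edge list is just one way to form the union, not necessarily the approach behind the figures.

```r
library(igraph)

# Hypothetical abstracts, each with the codes marked in it
abstracts <- list(
    a1 = c("Asset management", "Water quality"),
    a2 = c("Water quality", "Customer perception", "Pricing"),
    a3 = c("Asset management", "Pricing")
)

# Within one abstract every pair of codes is connected (a complete network).
# combn() lists all pairs; t() turns them into a two-column edge list.
edges <- do.call(rbind, lapply(abstracts, function(codes) t(combn(codes, 2))))

# The union of the per-abstract networks: pool all edges and
# collapse any pair that occurs in more than one abstract
g <- simplify(graph_from_edgelist(edges, directed = FALSE))

vcount(g)  # number of distinct topics
ecount(g)  # number of distinct topic pairs
```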

We can create a network of topics with the igraph package. The first step is to create a Document-Term-Matrix. This matrix counts how often a topic occurs within each abstract. From this matrix, we can create a graph by transforming it into an Adjacency Matrix. This matrix describes the graph which can be visualised and analysed. For more detailed information about this method, refer to my dissertation.

library(igraph)
library(reshape2)
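## Document-term matrix: one row per abstract, one column per code
dtm <- getCodingTable() %>%
    mutate(freq = 1) %>%
    acast(filename ~ codename, sum, value.var = "freq")
## Adjacency matrix of codes that share an abstract
## (assumed reconstruction: crossprod(dtm) computes t(dtm) %*% dtm)
adj <- crossprod(dtm)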

The network of topics in IWA literature.
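The step from a document-term matrix to an adjacency matrix can be tried without an RQDA project. The sketch below uses a hypothetical coding table and base R's `table()` in place of `acast()`; the use of `crossprod()` to obtain the adjacency matrix is an assumption about how the transformation can be done, not a statement of the author's exact code.

```r
library(igraph)

# Hypothetical stand-in for the coding table (one row per marked topic)
codings <- data.frame(
    filename = c("a1", "a1", "a2", "a2", "a2", "a3", "a3"),
    codename = c("Asset management", "Water quality",
                 "Water quality", "Customer perception", "Pricing",
                 "Asset management", "Pricing")
)

# Document-term matrix: abstracts in rows, codes in columns
dtm <- unclass(table(codings$filename, codings$codename))

# t(dtm) %*% dtm: each off-diagonal cell counts the abstracts
# in which the two codes occur together
adj <- crossprod(dtm)

# Undirected weighted graph; diag = FALSE drops the self-loops
g <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
print(adj)
```

The diagonal of `adj` holds the number of abstracts per code, which is why it is excluded when building the graph.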

In this graph, each node is a topic in the literature, and each edge indicates that two topics occur in the same abstract. The graph uses the Fruchterman-Reingold algorithm to position the nodes, which tends to place the most connected topics near the centre.
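A minimal sketch of this kind of plot, using a small made-up graph: `layout_with_fr()` computes Fruchterman-Reingold coordinates and `degree()` counts the connections of each node. The graph, the seed, and the scaling of node size by degree are illustrative assumptions.

```r
library(igraph)

# Small hypothetical topic network
g <- graph_from_literal(A-B, A-C, A-D, B-C)

# The layout is randomised, so fix the seed for a repeatable picture
set.seed(1)
coords <- layout_with_fr(g)   # one row of x/y coordinates per node

# Number of edges attached to each node
deg <- degree(g)

# Draw the graph with node size growing with degree
plot(g, layout = coords, vertex.size = 10 + 5 * deg)
```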

The last step is to identify the structure of this graph using community detection. A community is a group of nodes that are more connected with each other than with nodes outside the community.

set.seed(123)
comms 

Community detection in IWA literature.
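Community detection can be tried on a toy graph as follows. This sketch assumes igraph's `cluster_louvain()` as the algorithm; the algorithm used for the figure above may well be a different one, and the graph here is made up for illustration.

```r
library(igraph)

# Hypothetical network: two tightly knit groups joined by a single edge
g <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

# Louvain clustering involves random tie-breaking, so fix the seed
set.seed(123)
comms <- cluster_louvain(g)

membership(comms)   # community label for each node
sizes(comms)        # number of nodes per community
modularity(comms)   # how well the partition separates the groups

# Plot the graph with the communities highlighted
plot(comms, g)
```

A higher modularity means that more edges fall inside communities than would be expected if the same nodes were connected at random.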

We have now succeeded in converting a corpus of 244 journal abstracts into a parsimonious overview of four communities of topics. This analysis resulted in greater insight into how marketing theory applies to water management, which was used to structure a book about water utility marketing.
