Animated population pyramids in R: part 1

Anastasiia Zharinova, Healthcare Analyst

12 March 2019

Even in my relatively short time working as a healthcare analyst, I have made heavy use of population pyramids to describe the local population and how it may change according to ONS population projections. So, I decided to try animated pyramids in R. For me, the overall process includes:
1. Wrangle data a bit to make it ready for ggplot.
2. Build one pyramid and see how it looks.
3. Create an animation with 25 pyramids for the period 2016 – 2041 using different animation packages and compare them.

In this part I will cover only the first two steps.

Change the data

I probably should have said this earlier, but I am not an expert in R (actually, I feel like I’m still a perpetual novice). At the data wrangling stage, I created a new dataset for almost every step of the data transformation. It made it easier for me to check for errors but made my code a bit ugly.

I used open data from the Office for National Statistics. It has population estimates for mid-2016 and population projections by age and gender for England and CCGs. I saved the ‘Females’ and ‘Males’ worksheets separately and loaded them into RStudio. I then added a gender column to both datasets and combined the two with rbind. Overall, the data wrangling process should produce this:
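The combining step could look something like the sketch below. The two small data frames are made-up stand-ins for the ‘Females’ and ‘Males’ worksheets: the names females and males, the area name and all the numbers are assumptions for illustration, not ONS figures, and only two year columns are shown.

```r
# Toy stand-ins for the two worksheets (invented numbers, not ONS figures)
females <- data.frame(AREA = "Area A", AGE.GROUP = c("0-4", "05-09"),
                      X2016 = c(10.0, 9.5), X2017 = c(10.1, 9.6),
                      stringsAsFactors = FALSE)
males <- data.frame(AREA = "Area A", AGE.GROUP = c("0-4", "05-09"),
                    X2016 = c(10.4, 9.9), X2017 = c(10.5, 10.0),
                    stringsAsFactors = FALSE)

# Add a gender column to each sheet, then stack one on top of the other
females$gender <- "Female"
males$gender <- "Male"
df <- rbind(females, males)
```

rbind needs both data frames to have the same column names, which is why the gender column is added to each sheet before they are combined.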

##    year AGE.GROUP gender population totalyears percentage
## 1  2016       0-4 Female       41.1     1167.5   3.520343
## 2  2016     05-09 Female       39.5     1167.5   3.383298
## 3  2016     10-14 Female       36.4     1167.5   3.117773
## 4  2016     15-19 Female       39.6     1167.5   3.391863
## 5  2016     20-24 Female       49.6     1167.5   4.248394
## 6  2016     25-29 Female       44.2     1167.5   3.785867
## 7  2016     30-34 Female       39.8     1167.5   3.408994
## 8  2016     35-39 Female       38.1     1167.5   3.263383
## 9  2016     40-44 Female       35.6     1167.5   3.049251
## 10 2016     45-49 Female       38.5     1167.5   3.297645

from this

##      ï..CODE    AREA AGE.GROUP    X2016    X2017    X2018    X2019
## 1  E92000001 England       0-4 1,671.40 1,648.60 1,639.50 1,635.80
## 2  E92000001 England     05-09 1,672.60 1,707.40 1,719.40 1,726.50
## 3  E92000001 England     10-14 1,497.80 1,544.00 1,595.70 1,635.80
## 4  E92000001 England     15-19 1,547.60 1,516.20 1,499.40 1,496.10
## 5  E92000001 England     20-24 1,735.90 1,716.40 1,699.50 1,680.50
## 6  E92000001 England     25-29 1,887.40 1,894.80 1,882.30 1,872.80
## 7  E92000001 England     30-34 1,874.70 1,881.90 1,895.70 1,907.30
## 8  E92000001 England     35-39 1,783.20 1,829.00 1,869.00 1,880.00
## 9  E92000001 England     40-44 1,778.80 1,730.80 1,703.40 1,715.20
## 10 E92000001 England     45-49 1,963.30 1,946.30 1,918.40 1,876.30
##       X2020    X2021    X2022    X2023    X2024    X2025    X2026    X2027
## 1  1,633.10 1,627.70 1,629.60 1,626.80 1,623.90 1,620.80 1,617.70 1,614.10
## 2  1,725.50 1,717.90 1,693.20 1,682.30 1,676.80 1,672.70 1,666.50 1,668.10
## 3  1,675.50 1,707.50 1,740.90 1,751.60 1,757.50 1,755.40 1,747.30 1,722.50
## 4  1,507.10 1,537.60 1,582.20 1,632.40 1,671.20 1,709.90 1,741.40 1,774.50
## 5  1,662.20 1,637.00 1,601.60 1,580.70 1,574.10 1,582.90 1,612.00 1,655.90
## 6  1,851.20 1,821.20 1,796.80 1,775.00 1,751.70 1,730.00 1,702.70 1,665.90
## 7  1,912.10 1,922.80 1,927.40 1,912.00 1,899.80 1,876.00 1,844.90 1,819.90
## 8  1,881.70 1,883.90 1,889.50 1,901.70 1,911.70 1,915.20 1,925.30 1,929.50
## 9  1,749.70 1,789.40 1,833.90 1,872.60 1,882.50 1,883.30 1,885.20 1,890.60
## 10 1,836.90 1,781.30 1,732.90 1,705.10 1,716.20 1,750.00 1,789.30 1,833.40
##       X2028    X2029    X2030    X2031    X2032    X2033    X2034    X2035
## 1  1,609.70 1,604.80 1,600.00 1,595.50 1,592.10 1,590.20 1,590.10 1,592.10
## 2  1,665.30 1,662.40 1,659.30 1,656.20 1,652.60 1,648.30 1,643.40 1,638.60
## 3  1,711.60 1,706.10 1,702.00 1,695.80 1,697.40 1,694.60 1,691.70 1,688.70
## 4  1,785.20 1,791.10 1,789.00 1,781.00 1,756.20 1,745.40 1,739.90 1,735.80
## 5  1,706.00 1,744.80 1,783.60 1,815.10 1,848.20 1,858.90 1,864.90 1,863.00
## 6  1,644.60 1,637.90 1,646.70 1,676.00 1,720.10 1,770.50 1,809.50 1,848.40
## 7  1,798.20 1,774.90 1,753.30 1,726.00 1,689.30 1,668.10 1,661.40 1,670.30
## 8  1,914.20 1,902.10 1,878.40 1,847.50 1,822.60 1,801.10 1,777.90 1,756.40
## 9  1,902.80 1,912.80 1,916.30 1,926.40 1,930.60 1,915.50 1,903.50 1,880.00
## 10 1,872.00 1,881.90 1,882.80 1,884.80 1,890.30 1,902.60 1,912.60 1,916.20
##       X2036    X2037    X2038    X2039    X2040    X2041 gender
## 1  1,596.30 1,602.80 1,611.50 1,622.40 1,635.10 1,649.10 Female
## 2  1,634.20 1,630.80 1,628.90 1,628.80 1,630.80 1,635.10 Female
## 3  1,685.50 1,681.90 1,677.60 1,672.80 1,668.00 1,663.60 Female
## 4  1,729.70 1,731.30 1,728.50 1,725.60 1,722.50 1,719.40 Female
## 5  1,855.10 1,830.30 1,819.50 1,814.10 1,809.90 1,803.70 Female
## 6  1,880.10 1,913.40 1,924.20 1,930.10 1,928.00 1,920.00 Female
## 7  1,699.60 1,743.70 1,794.10 1,833.10 1,872.00 1,903.60 Female
## 8  1,729.30 1,692.80 1,671.80 1,665.20 1,674.10 1,703.40 Female
## 9  1,849.20 1,824.60 1,803.20 1,780.30 1,759.00 1,732.00 Female
## 10 1,926.30 1,930.60 1,915.60 1,903.70 1,880.40 1,850.00 Female

Let’s see it step by step:

For simplicity, I kept only the areas I need for now. In 2016, Birmingham and Solihull CCG was still three different CCGs.

df1 <- subset(df, AREA %in% c("NHS Birmingham CrossCity CCG",
  "NHS Birmingham South and Central CCG", "NHS Solihull CCG"))

My data still has a separate column for each year, so I created a ‘year’ column and reshaped the data from wide to long format.

library(tidyr)  # provides gather()
df2 <- gather(df1, "year", "population", 4:29)  # columns 4 to 29 hold X2016 to X2041
##      ï..CODE                         AREA AGE.GROUP gender  year
## 1  E38000012 NHS Birmingham CrossCity CCG       0-4 Female X2016
## 2  E38000012 NHS Birmingham CrossCity CCG     05-09 Female X2016
## 3  E38000012 NHS Birmingham CrossCity CCG     10-14 Female X2016
## 4  E38000012 NHS Birmingham CrossCity CCG     15-19 Female X2016
## 5  E38000012 NHS Birmingham CrossCity CCG     20-24 Female X2016
## 6  E38000012 NHS Birmingham CrossCity CCG     25-29 Female X2016
## 7  E38000012 NHS Birmingham CrossCity CCG     30-34 Female X2016
## 8  E38000012 NHS Birmingham CrossCity CCG     35-39 Female X2016
## 9  E38000012 NHS Birmingham CrossCity CCG     40-44 Female X2016
## 10 E38000012 NHS Birmingham CrossCity CCG     45-49 Female X2016
##    population
## 1          28
## 2        26.3
## 3        24.1
## 4        26.5
## 5        30.9
## 6        30.5
## 7          27
## 8        25.3
## 9        22.8
## 10       24.2

Boring but important bits: convert the population column to numeric, aggregate the data by year, age band and gender, and drop the ‘All ages’ rows so they are not accidentally included in our plot.

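# Strip any thousands separators, then convert from character to numeric
df2$population <- as.numeric(gsub(",", "", df2$population))
# Sum over the three CCGs for each age band, gender and year
df3 <- aggregate(population ~ AGE.GROUP + gender + year, data = df2, FUN = sum)
# Drop the 'All ages' rows
df3 <- df3[df3$AGE.GROUP != 'All ages', ]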
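One thing to watch in this step: as.numeric() cannot parse numbers with thousands separators, such as the England figures shown earlier, and on a factor it returns the internal level codes rather than the values. The sketch below illustrates both pitfalls on made-up vectors. Stripping the commas with gsub() is one possible workaround, and whether it is needed depends on how the data was loaded.

```r
# Character values, one of them with a thousands separator
x_chr <- c("1,671.40", "28", "26.3")
naive <- suppressWarnings(as.numeric(x_chr))  # comma-formatted value becomes NA
clean <- as.numeric(gsub(",", "", x_chr))     # strip commas before converting

# The same kind of values stored as a factor
x_fac <- factor(c("28", "26.3"))
codes <- as.numeric(x_fac)                    # level codes, not the values
values <- as.numeric(as.character(x_fac))     # convert via character instead
```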

Now, let’s calculate percentages. For standard population pyramids, percentages are calculated from the total population for the year, so we should calculate this value, add it to the table and then calculate the percentage for each gender and age band pair.

# Total population per year
totalyear <- aggregate(population ~ year, data = df3, FUN = sum)
# Attach the yearly total to every row; merge() suffixes the clashing columns with .x and .y
df4 <- merge(x = df3, y = totalyear, by = "year", all.x = TRUE)
colnames(df4)[colnames(df4) == 'population.y'] <- 'totalyears'
colnames(df4)[colnames(df4) == 'population.x'] <- 'population'
df4$percentage <- df4$population / df4$totalyears * 100
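As a quick sanity check on this logic, the percentages within each year should add up to 100. The sketch below uses a small made-up table in the same shape as df3 (the name toy and all numbers are invented) and, as an alternative to aggregate plus merge, uses base R's ave() to put the yearly total on every row.

```r
# Toy data in the same shape as df3 (invented numbers)
toy <- data.frame(
  year = rep(c("X2016", "X2017"), each = 4),
  AGE.GROUP = rep(c("0-4", "05-09"), times = 4),
  gender = rep(rep(c("Female", "Male"), each = 2), times = 2),
  population = c(10, 9, 11, 10, 10, 10, 12, 8)
)

# Total population per year, repeated on every row of that year
toy$totalyears <- ave(toy$population, toy$year, FUN = sum)
toy$percentage <- toy$population / toy$totalyears * 100

# Percentages within each year should sum to 100
check <- tapply(toy$percentage, toy$year, sum)
```

If any value in check is not 100, a total row such as ‘All ages’ has probably slipped into the data.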

To draw population pyramids in Excel, I always used negative values for one of the genders and then changed the legend. I used the same logic in R.

df4 <- transform(df4, percentage = ifelse(gender == 'Male', -percentage, percentage))

Last but not least, I noticed an ‘X’ in front of each year. Let’s remove it!

df4$year <- substr(df4$year, 2, 5)  # keep characters 2 to 5, so "X2016" becomes "2016"
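A slightly more defensive variant, as a sketch: sub() removes a leading ‘X’ only when it is present, so running the line twice by accident does not chop off the first digit of the year, as substr() would. The vector year here is a made-up stand-in for df4$year.

```r
year <- c("X2016", "X2017", "X2041")
year <- sub("^X", "", year)  # drop a leading "X" if there is one
year <- sub("^X", "", year)  # running it again changes nothing
```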

Drawing the pyramid

Now that our data looks tidy and ready, we can move on to the most exciting part, using ggplot. The main steps are: build a bar chart, flip the axes and apply the theme we would like. I could not resist and used The Strategy Unit colours!

library(ggplot2)

ggplot(subset(df4, year == "2016"), aes(x = AGE.GROUP, y = percentage, fill = gender)) +
  geom_bar(stat = "identity", width = .85) +   # draw the bars
  scale_y_continuous(breaks = seq(-5, 5, length.out = 11),
                     labels = c('5%', '4%', '3%', '2%', '1%', '0', '1%', '2%', '3%', '4%', '5%')) +   # show absolute values on both sides
  coord_flip() +   # flip axes so age groups run vertically
  labs(title = "Birmingham and Solihull population", y = "Percentage of population", x = "Age group") +
  theme(plot.title = element_text(hjust = .5),   # centre plot title
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        strip.background = element_rect(colour = "white", fill = "white"),
        strip.text.x = element_text(size = 10)) +
  scale_fill_manual(values = c("goldenrod2", "gray32")) +   # Strategy Unit colours
  facet_grid(. ~ year)   # adds a strip showing the year
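Typing the axis labels by hand works, but they can also be derived from the breaks. The sketch below uses only base R; the resulting vectors could be passed to the breaks and labels arguments of scale_y_continuous(). Unlike the hand-typed version, it labels the midpoint ‘0%’ rather than ‘0’.

```r
breaks <- seq(-5, 5, by = 1)
labels <- paste0(abs(breaks), "%")  # absolute values, so both sides read as positive
```

This keeps the labels in step with the breaks if the axis range ever needs to change, for example for an area with a larger share in one age band.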

To be continued…

As I said earlier, my main aim in this exercise was to learn R animation and compare different packages. So far I have used the packages “magick” and “gganimate” and am happy to share the results in the next part. Please do not hesitate to leave a comment and suggest any other packages for creating animations; I want to test them all!