Anastasiia Zharinova, Healthcare Analyst
12 March 2019
Even in my relatively short experience of working as a healthcare analyst, I have made heavy use of population pyramids to describe the local population and how it may change according to ONS population projections. So I decided to try animated pyramids in R. The overall process for me includes:
1. Wrangle the data a bit to get it ready for ggplot.
2. Build one pyramid and see how it looks.
3. Create an animation with 26 pyramids, one per year for the period 2016 – 2041, using different animation packages and compare them.
In this part I will cover only the first two steps.
Change the data
I probably should have said this earlier, but I am not an expert in R (actually, I feel like I’m still a perpetual novice). At the data wrangling stage, I created a new dataset for almost every step of the transformation. This made it easier for me to check for errors, but it made my code a bit ugly.
I used openly available data from the Office for National Statistics. It has population estimates for mid-2016 and population projections by age and gender for England and CCGs. I saved the ‘Females’ and ‘Males’ worksheets separately and loaded them in RStudio. I then added a gender column to both datasets and combined the two with rbind. Overall, the data wrangling process should produce this:
## year AGE.GROUP gender population totalyears percentage
## 1 2016 0-4 Female 41.1 1167.5 3.520343
## 2 2016 05-09 Female 39.5 1167.5 3.383298
## 3 2016 10-14 Female 36.4 1167.5 3.117773
## 4 2016 15-19 Female 39.6 1167.5 3.391863
## 5 2016 20-24 Female 49.6 1167.5 4.248394
## 6 2016 25-29 Female 44.2 1167.5 3.785867
## 7 2016 30-34 Female 39.8 1167.5 3.408994
## 8 2016 35-39 Female 38.1 1167.5 3.263383
## 9 2016 40-44 Female 35.6 1167.5 3.049251
## 10 2016 45-49 Female 38.5 1167.5 3.297645
from this
## ï..CODE AREA AGE.GROUP X2016 X2017 X2018 X2019
## 1 E92000001 England 0-4 1,671.40 1,648.60 1,639.50 1,635.80
## 2 E92000001 England 05-09 1,672.60 1,707.40 1,719.40 1,726.50
## 3 E92000001 England 10-14 1,497.80 1,544.00 1,595.70 1,635.80
## 4 E92000001 England 15-19 1,547.60 1,516.20 1,499.40 1,496.10
## 5 E92000001 England 20-24 1,735.90 1,716.40 1,699.50 1,680.50
## 6 E92000001 England 25-29 1,887.40 1,894.80 1,882.30 1,872.80
## 7 E92000001 England 30-34 1,874.70 1,881.90 1,895.70 1,907.30
## 8 E92000001 England 35-39 1,783.20 1,829.00 1,869.00 1,880.00
## 9 E92000001 England 40-44 1,778.80 1,730.80 1,703.40 1,715.20
## 10 E92000001 England 45-49 1,963.30 1,946.30 1,918.40 1,876.30
## X2020 X2021 X2022 X2023 X2024 X2025 X2026 X2027
## 1 1,633.10 1,627.70 1,629.60 1,626.80 1,623.90 1,620.80 1,617.70 1,614.10
## 2 1,725.50 1,717.90 1,693.20 1,682.30 1,676.80 1,672.70 1,666.50 1,668.10
## 3 1,675.50 1,707.50 1,740.90 1,751.60 1,757.50 1,755.40 1,747.30 1,722.50
## 4 1,507.10 1,537.60 1,582.20 1,632.40 1,671.20 1,709.90 1,741.40 1,774.50
## 5 1,662.20 1,637.00 1,601.60 1,580.70 1,574.10 1,582.90 1,612.00 1,655.90
## 6 1,851.20 1,821.20 1,796.80 1,775.00 1,751.70 1,730.00 1,702.70 1,665.90
## 7 1,912.10 1,922.80 1,927.40 1,912.00 1,899.80 1,876.00 1,844.90 1,819.90
## 8 1,881.70 1,883.90 1,889.50 1,901.70 1,911.70 1,915.20 1,925.30 1,929.50
## 9 1,749.70 1,789.40 1,833.90 1,872.60 1,882.50 1,883.30 1,885.20 1,890.60
## 10 1,836.90 1,781.30 1,732.90 1,705.10 1,716.20 1,750.00 1,789.30 1,833.40
## X2028 X2029 X2030 X2031 X2032 X2033 X2034 X2035
## 1 1,609.70 1,604.80 1,600.00 1,595.50 1,592.10 1,590.20 1,590.10 1,592.10
## 2 1,665.30 1,662.40 1,659.30 1,656.20 1,652.60 1,648.30 1,643.40 1,638.60
## 3 1,711.60 1,706.10 1,702.00 1,695.80 1,697.40 1,694.60 1,691.70 1,688.70
## 4 1,785.20 1,791.10 1,789.00 1,781.00 1,756.20 1,745.40 1,739.90 1,735.80
## 5 1,706.00 1,744.80 1,783.60 1,815.10 1,848.20 1,858.90 1,864.90 1,863.00
## 6 1,644.60 1,637.90 1,646.70 1,676.00 1,720.10 1,770.50 1,809.50 1,848.40
## 7 1,798.20 1,774.90 1,753.30 1,726.00 1,689.30 1,668.10 1,661.40 1,670.30
## 8 1,914.20 1,902.10 1,878.40 1,847.50 1,822.60 1,801.10 1,777.90 1,756.40
## 9 1,902.80 1,912.80 1,916.30 1,926.40 1,930.60 1,915.50 1,903.50 1,880.00
## 10 1,872.00 1,881.90 1,882.80 1,884.80 1,890.30 1,902.60 1,912.60 1,916.20
## X2036 X2037 X2038 X2039 X2040 X2041 gender
## 1 1,596.30 1,602.80 1,611.50 1,622.40 1,635.10 1,649.10 Female
## 2 1,634.20 1,630.80 1,628.90 1,628.80 1,630.80 1,635.10 Female
## 3 1,685.50 1,681.90 1,677.60 1,672.80 1,668.00 1,663.60 Female
## 4 1,729.70 1,731.30 1,728.50 1,725.60 1,722.50 1,719.40 Female
## 5 1,855.10 1,830.30 1,819.50 1,814.10 1,809.90 1,803.70 Female
## 6 1,880.10 1,913.40 1,924.20 1,930.10 1,928.00 1,920.00 Female
## 7 1,699.60 1,743.70 1,794.10 1,833.10 1,872.00 1,903.60 Female
## 8 1,729.30 1,692.80 1,671.80 1,665.20 1,674.10 1,703.40 Female
## 9 1,849.20 1,824.60 1,803.20 1,780.30 1,759.00 1,732.00 Female
## 10 1,926.30 1,930.60 1,915.60 1,903.70 1,880.40 1,850.00 Female
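The loading and combining step described above is not shown as code in this post, so here is a minimal sketch of how it might look. It assumes the two worksheets were exported as CSV files. The file names and the tiny placeholder values are hypothetical and are only there so the snippet runs on its own. The `fileEncoding = "UTF-8-BOM"` argument is an optional extra: a byte-order mark at the start of a file is a common cause of a mangled first column name such as `ï..CODE`.

```r
# Hypothetical file names; the post does not say how the worksheets were saved.
# Two tiny placeholder files (made-up values) are written first so the sketch is runnable.
female_file <- file.path(tempdir(), "females.csv")
male_file   <- file.path(tempdir(), "males.csv")
writeLines(c("CODE,AREA,AGE GROUP,2016,2017",
             "E00000000,Example area,0-4,10.5,10.2"), female_file)
writeLines(c("CODE,AREA,AGE GROUP,2016,2017",
             "E00000000,Example area,0-4,11.0,11.3"), male_file)

# "UTF-8-BOM" removes a byte-order mark if the file has one; it is harmless otherwise.
# read.csv() turns "AGE GROUP" into AGE.GROUP and "2016" into X2016 by default.
females <- read.csv(female_file, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
males   <- read.csv(male_file,   fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)

# Label each dataset before stacking them
females$gender <- "Female"
males$gender   <- "Male"

df <- rbind(females, males)
```

With the real files, `df` would have the wide layout shown above: one row per area and age group, one column per year, and `gender` as the last column.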
Let’s see it step by step:
For simplicity, I kept only the areas I need for now. In 2016, what is now Birmingham and Solihull CCG was three different CCGs.
df1 <- subset(df, AREA %in% c("NHS Birmingham CrossCity CCG",
                              "NHS Birmingham South and Central CCG",
                              "NHS Solihull CCG"))
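My data still has a separate column for each year, so I created a ‘year’ column and reshaped the data from wide to long with gather from the tidyr package:
library(tidyr)
# Columns 4 to 29 are the year columns X2016 ... X2041
df2 <- gather(df1, "year", "population", 4:29)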
## ï..CODE AREA AGE.GROUP gender year
## 1 E38000012 NHS Birmingham CrossCity CCG 0-4 Female X2016
## 2 E38000012 NHS Birmingham CrossCity CCG 05-09 Female X2016
## 3 E38000012 NHS Birmingham CrossCity CCG 10-14 Female X2016
## 4 E38000012 NHS Birmingham CrossCity CCG 15-19 Female X2016
## 5 E38000012 NHS Birmingham CrossCity CCG 20-24 Female X2016
## 6 E38000012 NHS Birmingham CrossCity CCG 25-29 Female X2016
## 7 E38000012 NHS Birmingham CrossCity CCG 30-34 Female X2016
## 8 E38000012 NHS Birmingham CrossCity CCG 35-39 Female X2016
## 9 E38000012 NHS Birmingham CrossCity CCG 40-44 Female X2016
## 10 E38000012 NHS Birmingham CrossCity CCG 45-49 Female X2016
## population
## 1 28
## 2 26.3
## 3 24.1
## 4 26.5
## 5 30.9
## 6 30.5
## 7 27
## 8 25.3
## 9 22.8
## 10 24.2
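`gather()` still works, but current versions of tidyr mark it as superseded in favour of `pivot_longer()`. The sketch below shows what the same reshaping could look like with the newer function, picking the year columns by name pattern instead of by position. `df1_toy` and its values are made up for illustration; only the column layout mirrors the real data.

```r
library(tidyr)

# Toy stand-in for df1 with made-up values; only the column layout matters.
df1_toy <- data.frame(
  AREA      = c("Area A", "Area A"),
  AGE.GROUP = c("0-4", "05-09"),
  X2016     = c(28.0, 26.3),
  X2017     = c(28.1, 26.6),
  gender    = c("Female", "Female")
)

# Select every column named "X" followed by four digits, whatever its position.
df2_toy <- pivot_longer(
  df1_toy,
  cols      = matches("^X\\d{4}$"),
  names_to  = "year",
  values_to = "population"
)
```

Selecting by pattern means the code keeps working if a column is added or removed before the year columns. One difference to be aware of: `pivot_longer()` orders the result row by row, so the rows come out in a different order from `gather()`.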
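Boring but important bits: convert the population column to numeric format, aggregate the data by year, age band and gender, and drop the ‘All ages’ rows so they are not accidentally included in the plot.
# Strip thousands separators first: as.numeric("1,671.40") would return NA
df2$population <- as.numeric(gsub(",", "", df2$population))
# Sum the three CCGs for each age band, gender and year
df3 <- aggregate(population ~ AGE.GROUP + gender + year, data = df2, FUN = sum)
df3 <- df3[df3$AGE.GROUP != 'All ages', ]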
Now, let’s calculate percentages. For a standard population pyramid, percentages are calculated from the total population for the year, so we need to calculate that total, add it to the table and then calculate the percentage for each pair of gender and age band.
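# Total population per year (both genders, all age bands)
totalyear <- aggregate(population ~ year, data = df3, FUN = sum)
# Attach the yearly total to every row; merge() suffixes the two population columns with .x and .y
df4 <- merge(x = df3, y = totalyear, by = "year", all.x = TRUE)
colnames(df4)[colnames(df4) == 'population.y'] <- 'totalyears'
colnames(df4)[colnames(df4) == 'population.x'] <- 'population'
df4$percentage <- df4$population / df4$totalyears * 100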
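One possible shortcut, not the approach used in this post, is to let base R's `ave()` compute the yearly total for every row directly. That avoids the merge and the two renames. The toy data frame below uses made-up numbers chosen so the arithmetic is easy to check by hand.

```r
# Toy stand-in for df3 with made-up values.
df3_toy <- data.frame(
  year       = c("X2016", "X2016", "X2017", "X2017"),
  gender     = c("Female", "Male", "Female", "Male"),
  population = c(30, 20, 33, 27)
)

# ave() returns, for every row, the sum over all rows that share that row's year.
df3_toy$totalyears <- ave(df3_toy$population, df3_toy$year, FUN = sum)
df3_toy$percentage <- df3_toy$population / df3_toy$totalyears * 100
```

Here the 2016 total is 30 + 20 = 50, so the two 2016 rows get 60% and 40%. The 2017 total is 60, giving 55% and 45%.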
To draw population pyramids in Excel, I always used negative values for one of the genders and then changed the legend. I used the same logic in R:
# Flip the sign of the male percentages so their bars point left
df4 <- transform(df4, percentage = ifelse(gender == 'Male', -percentage, percentage))
Last but not least, I noticed an ‘X’ in front of each year. Let’s remove it!
df4$year <- substr(df4$year, 2, 5)
Drawing the pyramid
Now that our data looks tidy and ready, we can move on to the most exciting part: using ggplot. The main steps are to build a bar chart, flip the axes and apply the theme we would like. I could not resist and used The Strategy Unit colours!
library(ggplot2)

ggplot(subset(df4, year == "2016"), aes(x = AGE.GROUP, y = percentage, fill = gender)) + # fill bars by gender
  geom_bar(stat = "identity", width = .85) + # draw the bars
  scale_y_continuous(breaks = seq(-5, 5, length.out = 11),
                     labels = c('5%', '4%', '3%', '2%', '1%', '0', '1%', '2%', '3%', '4%', '5%')) + # show both sides as positive percentages
  coord_flip() + # flip axes
  labs(title = "Birmingham and Solihull population", y = "Percentage of population", x = "Age group") +
  theme(plot.title = element_text(hjust = .5), # centre plot title
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        strip.background = element_rect(colour = "white", fill = "white"),
        strip.text.x = element_text(size = 10)) +
  scale_fill_manual(values = c("goldenrod2", "gray32")) + # colours of The Strategy Unit
  facet_grid(. ~ year)
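The eleven axis labels above are typed by hand, which is easy to get wrong if the breaks ever change. A possible alternative is to pass a small function to `labels` so the text is generated from the break values. This is only a sketch: `abs_percent` is a helper name I made up, and the toy data uses invented percentages. Note that this version labels the centre of the axis as "0%" rather than "0".

```r
library(ggplot2)

# Turn signed break values into unsigned percentage labels, e.g. -3 becomes "3%".
abs_percent <- function(x) paste0(abs(x), "%")

# Toy data with made-up percentages; male values are negative, as in df4.
toy <- data.frame(
  AGE.GROUP  = rep(c("0-4", "05-09"), times = 2),
  gender     = rep(c("Female", "Male"), each = 2),
  percentage = c(3.5, 3.4, -3.7, -3.6)
)

p <- ggplot(toy, aes(x = AGE.GROUP, y = percentage, fill = gender)) +
  geom_col(width = 0.85) +  # geom_col() is shorthand for geom_bar(stat = "identity")
  scale_y_continuous(breaks = seq(-5, 5, by = 1), labels = abs_percent) +
  coord_flip()
```

Printing `p` draws the toy pyramid. The same `labels = abs_percent` argument could be dropped into the full plot in place of the hand-written vector.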

To be continued…
As I said earlier, the main aim of this exercise was to learn R animation and compare different packages. So far I have used the “magick” and “gganimate” packages and am happy to share the results in the next part. Please do not hesitate to leave a comment and suggest any other packages for creating animations. I want to test them all!