

(This article was first published on An Accounting and Data Science Nerd’s Corner, and kindly contributed to R-bloggers)

Last week, the German NGO Open Knowledge Foundation Deutschland e.V., together with the British NGO opencorporates, made German Trade Register data available via the project OffeneRegister.de. While the data from the German Trade Register is publicly available in principle, retrieving it is a case-by-case activity and very cumbersome (try for yourself if you like). The data provided by OffeneRegister.de instead comes with an easy-to-navigate API and, even more conveniently, is available for bulk download (either as JSON or as a SQLite database file).

Having a research focus on corporate transparency, I could not resist the temptation to take a peek. I downloaded the SQLite database. Let's access it with R.

library(DBI)
library(tidyverse)
library(ggmap)
library(rgdal)
library(rgeos)
library(knitr)

tmp <- tempdir()
# the file name of the downloaded SQLite database is assumed here
con <- dbConnect(RSQLite::SQLite(), "handelsregister.db")

While I have my reservations about using the individual-level data contained in the database for data quality and privacy reasons, some aggregate analysis seems in order. As a proof of concept, I will try to visualize where German companies are located. Thus, I focus on the company relation and try to extract some spatial variation.

sql <- "SELECT * FROM company"
company <- dbGetQuery(con, sql)

An inspection of the registered_address field shows that the data is relatively messy, but most non-empty cells contain five digit German post codes (Postleitzahlen, PLZ). I will focus on those.

company %>%
  filter(current_status == "currently registered",
         !is.na(registered_address)) %>%
  mutate(plz = str_extract(registered_address, "\\d{5}")) %>%
  filter(!is.na(plz)) %>%
  select(id, plz) -> company_plz
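The `"\\d{5}"` pattern can be sanity-checked on a made-up address of the kind found in `registered_address` (the address below is invented for illustration):

```r
library(stringr)

# invented example address, shaped like typical registered_address entries
addr <- "c/o Beispiel GmbH, Musterstrasse 12, 10117 Berlin"

# str_extract() returns the first run of five digits, or NA if none exists
str_extract(addr, "\\d{5}")                        # "10117"
str_extract("Musterstrasse 12, Berlin", "\\d{5}")  # NA
```

Entries without a five-digit run come back as `NA`, which is why the pipeline above filters on `!is.na(plz)`.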

Time to ask our first question: what are the top 10 company-hosting PLZs?

companies_plz <- company_plz %>%
  group_by(plz) %>%
  summarise(companies = n()) %>%
  # placeholder; the original builds an HTML map link from plz
  mutate(link = "Link to map") %>%
  arrange(-companies)

kable(head(companies_plz, 10), col.names = c("PLZ", "Registered companies", "Link to map"),
      format.args = list(big.mark = ","))
PLZ    Registered companies  Link to map
10117                 5,858  Link to map
20457                 4,678  Link to map
20354                 4,677  Link to map
20095                 3,574  Link to map
82031                 3,334  Link to map
10719                 2,854  Link to map
22767                 2,593  Link to map
60325                 2,313  Link to map
10707                 2,309  Link to map
55129                 2,223  Link to map

The top PLZ on the list looks oddly familiar… But the list is not really informative, as PLZs vary in size. Let's link this to some spatial information.

# Shape file is based on OSM data.
# Source: https://osm.wno-edv-service.de/pcboundaries/

unzip("../data/alle_plz_shapes_deu.zip", exdir = tmp)
plz_polys <- readOGR(tmp)  # add layer = ... if the shape file requires it

# pop (columns plz, einwohner) is assumed to hold PLZ-level population data,
# and area_sqkm is assumed to be part of the polygons' attribute data
plz_polys@data %>%
  left_join(pop) %>%
  left_join(companies_plz) %>%
  replace_na(list(companies = 0)) %>%
  rename(population = einwohner) %>%
  mutate(comp_by_sqkm = companies / area_sqkm,
         comp_by_1000pop = 1000 * companies / population) -> comp_by_plz
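The `replace_na()` step matters: PLZs without a single registered company do not appear in `companies_plz` at all, so after the `left_join()` they carry `NA` counts. A minimal sketch with invented toy data:

```r
library(tidyverse)

# toy data: two PLZ polygons, only one of which hosts companies
plz_attr <- tibble(plz = c("10117", "99998"), area_sqkm = c(2.1, 55.0))
counts   <- tibble(plz = "10117", companies = 5858)

plz_attr %>%
  left_join(counts, by = "plz") %>%     # 99998 gets companies = NA
  replace_na(list(companies = 0)) %>%   # ... which becomes 0
  mutate(comp_by_sqkm = companies / area_sqkm)
```

Without the `replace_na()`, the company-free PLZs would later be dropped from the ratios instead of showing up as zeros on the map.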

How many companies are included in this data?

tab <- comp_by_plz %>%
  mutate(pa = substr(plz, 1, 1)) %>%
  group_by(pa) %>%
  summarise(companies = sum(companies))

kable(tab, col.names = c("Post Area", "Registered companies"),
      format.args = list(big.mark = ","))
Post Area  Registered companies
0                        44,955
1                       124,408
2                       137,019
3                        79,518
4                       121,718
5                       110,496
6                       112,305
7                        97,356
8                       101,587
9                        67,897
Total                   997,259

Quite a few. Which reasonably populated PLZs are home to more firms than people?

kable(comp_by_plz %>%
        filter(population > 100,
               comp_by_1000pop > 1000) %>%
        arrange(-comp_by_1000pop),
      col.names = c("PLZ", "Area (km²)", "Population", "Registered companies", "Link to map",
                    "Companies by km²", "Companies by 1,000 inhabitants"),
      format.args = list(big.mark = ",", digits = 2)) 
PLZ    Area (km²)  Population  Registered companies  Link to map  Companies by km²  Companies by 1,000 inhabitants
40212        0.43         543                 1,390  Link to map             3,216                           2,560
20354        1.30       2,273                 4,677  Link to map             3,602                           2,058
20457       14.76       2,566                 4,678  Link to map               317                           1,823
20095        0.77       3,172                 3,574  Link to map             4,613                           1,127

Interesting: Düsseldorf and Hamburg make the cut. Maps. We want to see maps. First: registered companies per square kilometer.

# For visualization, values are log transformed. Set zero values to
# 80 % of the non-zero minimum to make them plottable.

min_1000pop <- 0.8 * min(comp_by_plz$comp_by_1000pop[comp_by_plz$comp_by_1000pop > 0],
                         na.rm = TRUE)
min_sqkm <- 0.8 * min(comp_by_plz$comp_by_sqkm[comp_by_plz$comp_by_sqkm > 0],
                      na.rm = TRUE)

log_safe <- comp_by_plz %>%
  mutate(comp_by_1000pop = replace(comp_by_1000pop,
                                   comp_by_1000pop == 0, min_1000pop),
         comp_by_sqkm = replace(comp_by_sqkm,
                                comp_by_sqkm == 0, min_sqkm))
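The zero replacement is needed because `trans = "log"` cannot place zeros on the scale (`log(0)` is `-Inf`); substituting 80 % of the smallest positive value keeps those polygons visible just below the rest. In isolation:

```r
x <- c(0, 0, 2, 5, 40)          # toy ratio values, including zeros
min_pos <- 0.8 * min(x[x > 0])  # 80 % of the smallest non-zero value, here 1.6
x[x == 0] <- min_pos
log(x)                          # now finite everywhere; former zeros map lowest
```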

plz_map <- fortify(plz_polys, region = "plz") %>%  # region column name assumed
  left_join(log_safe, by = c("id" = "plz"))

ggplot(plz_map, aes(x = long, y = lat, group = group, fill = comp_by_sqkm)) +
  geom_polygon(colour = NA, lwd=0, aes(group = group)) + 
  scale_fill_gradient2(name = "Registered companies per km²", 
                       low = "red", mid = "gray90", high = "blue", trans = "log", 
                       midpoint = log(median(log_safe$comp_by_sqkm, na.rm = TRUE)),
                       breaks = c(1, 10, 100, 1000)) + 
  coord_map() +
  theme_void() +
  theme(legend.justification=c(0,1), legend.position=c(0,1), plot.caption = element_text(hjust = 0)) +
  labs(caption = "Data as provided by OffeneRegister.de.")

Nice, but not really surprising. Some areas stand out (Ostwestfalen-Lippe and the Rhein valley) but in general, companies are where people live (meaning: in cities). Do we get better insights when we plot registered companies relative to the population?

# As we have seen above, a few PLZs have rather extreme companies-by-
# population ratios. Cap them at 1,000 so that they do not mess up the scale.

plz_map$comp_by_1000pop[plz_map$comp_by_1000pop > 1000] <- 1000

# the map itself: same ggplot call as above, with fill = comp_by_1000pop

Now this is interesting. You can see relatively strong regional patterns. Some cities are not as company-heavy as others (compare Leipzig to Dresden, and see the Ruhr area). Some rural areas show higher levels of corporate activity (North Brandenburg, Mecklenburg) while others have relatively low levels (North Schleswig-Holstein, Saxony, Swabia in Bavaria).

So the data provided by OffeneRegister.de, while somewhat messy at the individual level, can be used to generate informative insights at the aggregate level. Let's hope that the initiative helps trigger a public debate about the right way to host public data.
