(This article was first published on R on Sastibe’s Data Science Blog, and kindly contributed to R-bloggers)

A treasure trove of leaked passwords

The API of pwnedpasswords.com is quite remarkable. It not only lets you fetch, from the comfort of your shell, the results you would usually get by typing your e-mail address into the browser interface to find out whether you have been pwned. It also lets you check very simply whether a given password has ever appeared in any of the dumps they hold and, if so, how often. Since haveibeenpwned.com has collected over 550 million of these passwords from a multitude of data breaches, odds are that yours is among them.

Using the API is straightforward, but the way it is secured, so that even pwnedpasswords.com itself doesn’t know precisely which password or even which password hash you are checking, is ingenious. I highly recommend reading their tutorial. The following R snippet obtains the number of hits for a vector of passwords:

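library(httr)
library(digest)
library(plyr)
library(dplyr)
library(stringr)

# NOTE: the start and end of this function were lost in extraction; the lines
# marked "assumed" are a reconstruction, not necessarily the original code.
popularity <- function(password) {
  hash <- toupper(digest(password, algo = "sha1", serialize = FALSE))  # assumed
  passw_front <- substr(hash, 1, 5)   # assumed: only this prefix is sent
  passw_back <- substr(hash, 6, 40)   # the rest is compared locally
  hashes <- GET(paste0("https://api.pwnedpasswords.com/range/", passw_front)) %>%  # assumed
    content(as = "text", encoding = "UTF-8") %>%  # assumed
    str_split("\r\n") %>% unlist() %>% str_split(":") %>% ldply() %>%  # assumed
    rename(hashes = V1, count = V2) %>%
    filter(hashes == passw_back) %>%
    mutate(password = password) %>%
    select(password, count)
  if (nrow(hashes) == 0) {
    hashes <- data.frame(password = password, count = 0)  # assumed: no hit
  }
  hashes  # assumed return value
}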
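
The privacy property praised above rests on sending only the first five hexadecimal characters of the password's SHA-1 hash to the service and comparing the remaining 35 characters locally. The following is a minimal offline sketch of that idea in base R, not the author's code: the helper count_for_suffix and the response body fake_response are hypothetical, and every count and every suffix other than our own is invented for illustration.

```r
# Base R only; no network access is needed for this sketch.

# Well-known SHA-1 digest of the string "password" (upper-case hex).
sha1_hex <- "5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8"

# Only the first five characters would be sent to the service ...
prefix <- substr(sha1_hex, 1, 5)
# ... while the remaining 35 characters never leave the machine.
suffix <- substr(sha1_hex, 6, 40)

# Hypothetical response body: "SUFFIX:COUNT" lines separated by CRLF.
# The two foreign suffixes and all counts are made up for illustration.
fake_response <- paste(
  "0018A45C4D1DEF81644B54AB7F969B88D65:3",
  paste0(suffix, ":12345"),
  "FFFE1B0C0D2D4E6A8A0B1C2D3E4F5A6B7C8:7",
  sep = "\r\n"
)

# Hypothetical helper: look up one suffix in a range response.
count_for_suffix <- function(body, suffix) {
  lines <- strsplit(body, "\r\n", fixed = TRUE)[[1]]
  parts <- strsplit(lines, ":", fixed = TRUE)
  suffixes <- vapply(parts, `[`, character(1), 1)
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  hit <- counts[suffixes == suffix]
  # A suffix that is absent from the response means "never seen".
  if (length(hit) == 0) 0L else hit[1]
}

count_for_suffix(fake_response, suffix)
```

Because the server only ever sees the prefix, it cannot tell which of the many hashes sharing that prefix the client was interested in.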

City names as passwords

This function lets us search through various ranges of passwords. For instance, let’s see how many people have chosen the names of cities in Baden-Württemberg as their passwords [1]:

City Name     Number of Usages as Password
Freiburg      3077
Stuttgart     9496
Karlsruhe     1426
Heidelberg    4081
Mannheim      5040
Konstanz      924
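
The search strings behind such a table follow the rule given in footnote 1: lowercase letters only, with spaces removed. A small sketch of that preparation step, assuming a hypothetical helper named normalise_city (it does not appear in the original post), could look like this:

```r
# Hypothetical helper implementing the rule from footnote 1:
# lowercase letters only, spaces removed.
normalise_city <- function(name) {
  tolower(gsub(" ", "", name, fixed = TRUE))
}

cities <- c("Freiburg", "Stuttgart", "Karlsruhe", "Heidelberg",
            "Mannheim", "Konstanz", "New York")

# Strings that would then be looked up, one password query per city.
candidates <- normalise_city(cities)
candidates
```

Each element of candidates would then be passed to the lookup function, one query per city.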

It seems like the city name “stuttgart” appears most often in the password list, which is not very surprising, as Stuttgart is also the largest of these cities. A plot of the number of password hits in relation to the number of inhabitants [2] looks like this:

The red circle represents the number of inhabitants, the black circle the number of usages as a password. The plot was created with ggmap.

Quite obviously, the ratio of “number of uses of the city name as password” to “number of inhabitants”, i.e. the “use of city name as password per inhabitant”, differs from city to city. This ratio seems to be higher for Heidelberg and Freiburg, each of which is known for a high quality of living and a very picturesque old town. So, let’s look at some international (and especially British) competition:

City Name     Inhabitants   Usages as Password   Usages per 1000 Inhabitants
Liverpool     473073        280723               593.4
Manchester    520215        98831                190.0
Oxford        161291        23069                143.0
Cambridge     151832        12648                83.3
Heidelberg    160601        4081                 25.4
London        8787892       196220               22.3
Mannheim      307997        5040                 16.4
Paris         2190327       28699                13.1
Berlin        3613495       40952                11.3
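
The last column is simply the number of usages divided by the number of inhabitants, scaled to 1000 inhabitants. As a sketch in base R, using the figures copied from the table above (the data frame and its column names are my own choice, not the author's):

```r
# Figures copied from the table above.
cities <- data.frame(
  city        = c("Liverpool", "Manchester", "Oxford", "Cambridge", "Heidelberg",
                  "London", "Mannheim", "Paris", "Berlin"),
  inhabitants = c(473073, 520215, 161291, 151832, 160601,
                  8787892, 307997, 2190327, 3613495),
  usages      = c(280723, 98831, 23069, 12648, 4081,
                  196220, 5040, 28699, 40952),
  stringsAsFactors = FALSE
)

# Usages of the city name as a password per 1000 inhabitants.
cities$per_1000 <- round(cities$usages / cities$inhabitants * 1000, 1)

# Sort from highest to lowest ratio.
cities <- cities[order(cities$per_1000, decreasing = TRUE), ]
print(cities, row.names = FALSE)
```

Sorting by the ratio rather than by the raw count is what moves a small city such as Heidelberg ahead of London.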

In this longer list, the effect of having a famous football team (Liverpool, Manchester) or a famous university in a small city (Oxford, Cambridge and Heidelberg) becomes obvious. In other words, it’s not so much about how many people live in a certain city, but about how many people feel a positive connection to it, through loyalty to a sports team or through time spent at the university.

Let me conclude this article with the obvious question: can any city beat Liverpool in this contest? All my manual samples so far have yielded good results, but nothing close to Liverpool’s numbers, for instance:

City Name     Inhabitants   Usages as Password   Usages per 1000 Inhabitants
Liverpool     473073        280723               593.4
Green Bay     105139        23069                143.0
Barcelona     1620805       152196               129.9

Thus, until further notice, I hereby proclaim Liverpool the most popular city in the world (measured by password use per inhabitant).


  1. The rules are: only lowercase letters, with spaces removed. For “New York”, for instance, I looked up “newyork”.
  2. Numbers of inhabitants are taken from wikipedia.org.
