(This article was first published on **R – Open Source Automation**, and kindly contributed to R-bloggers)

The **apply** functions in R are awesome (see this post for some lesser known apply functions). However, if you can use pure vectorization, then you’ll probably end up making your code run a lot faster than just depending upon functions like *sapply* and *lapply*. This is because apply functions like these still loop through the elements of a vector or list behind the scenes, calling an R function on them *one at a time*. Vectorized functions, on the other hand, process the whole vector in a single call and do their looping in compiled code, which allows much faster computation. This post runs through a couple of such examples involving string substitution and fuzzy matching.
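To make the distinction concrete, here is a minimal sketch on a toy vector. The variable names are chosen purely for illustration and do not come from the original post.

```r
x <- 1:10

# Element-by-element: sapply calls the anonymous function once per element
looped <- sapply(x, function(v) v * 2)

# Vectorized: the arithmetic operator works on the whole vector in one call
vectorized <- x * 2

# Both approaches produce the same values
print(identical(looped, vectorized))
```

On ten elements the difference in speed is negligible; it only becomes visible on large inputs like the ones used in the rest of this post.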

**String substitution**

For example, let’s create a vector that looks like this:

**test1, test2, test3, test4, …, test1000000**

with one million elements.

With *sapply*, creating this vector takes over 4 1/2 seconds. However, if we generate the same vector using vectorization, we can get the job done in only 0.75 seconds!
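A minimal sketch of this comparison might look like the following. It assumes *paste0* is used to build the strings and *system.time* to measure each approach; neither choice is confirmed by the text, and the vector length is reduced from one million so the sketch runs quickly. Timings will differ from machine to machine.

```r
n <- 1e5  # assumption: the post uses one million; reduced here for speed

# Element-by-element: sapply calls the anonymous function once per number
sapply_time <- system.time(
  samples_sapply <- sapply(1:n, function(num) paste0("test", num))
)

# Vectorized: a single paste0 call handles the whole sequence at once
vectorized_time <- system.time(
  samples <- paste0("test", 1:n)
)

print(sapply_time)
print(vectorized_time)
```

Setting `n <- 1e6` reproduces the vector size described above.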

Now, we can also use *gsub* to remove the substring *test* from every element in the vector, *samples*, also with vectorization. This takes just over one second. In comparison, using *sapply* takes roughly eleven times longer!
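As a hedged sketch of the substitution step, the two approaches could be compared as below. The *gsub* arguments and the use of *system.time* are assumptions, as is the reduced vector length; only the names *gsub*, *sapply*, and *samples* come from the text.

```r
n <- 1e5  # assumption: the post uses one million; reduced here for speed
samples <- paste0("test", 1:n)

# Vectorized: gsub accepts the whole character vector in one call
vectorized_time <- system.time(
  stripped <- gsub("test", "", samples)
)

# Element-by-element: one gsub call per string
# (USE.NAMES = FALSE keeps the result unnamed so the two outputs are comparable)
sapply_time <- system.time(
  stripped_sapply <- sapply(samples, function(s) gsub("test", "", s),
                            USE.NAMES = FALSE)
)

print(vectorized_time)
print(sapply_time)
```

Both calls return the same character vector of digits; only the number of function calls differs.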

**Fuzzy matching**

Vectorization can also be used to vastly speed up fuzzy matching (as described in this post). For example, let’s use the *stringi* package to randomly generate one million strings. We’ll then use the *stringdist* package to compare the word “programming” to each random string. After loading both packages with `library(stringi)` and `library(stringdist)` and fixing the seed with `set.seed(1)`, the generated strings are stored in `random_strings`.

Now, let’s try using *sapply* to calculate a string similarity score (using default parameters) between “programming” and each of the one million strings. This takes quite a while in computational terms: over 193 seconds. However, we can vastly speed this up by using vectorization rather than *sapply*, which calculates the same similarity scores in under one second! This is vastly better than the first approach and is made possible by the work vectorization does under the hood, where the whole vector is processed in a single call rather than with one R function call per string.

To see the randomly generated word with the maximum similarity score to “programming”, we can run `names(which.max(results))`, which returns the string “wrrgrrmmrnb”.
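The fuzzy matching comparison can be sketched as follows. Several details here are assumptions rather than the author's confirmed code: the strings are generated with `stri_rand_strings` using lowercase letters and the same length as “programming”, the similarity score comes from `stringsim` with its default method, and the number of strings is reduced from one million so the *sapply* version finishes quickly. Because of the smaller sample, the best match will not necessarily be the string reported above.

```r
library(stringi)
library(stringdist)

set.seed(1)
n <- 1e4                 # assumption: the post uses one million strings
target <- "programming"

# Assumed generator settings: lowercase letters, same length as the target
random_strings <- stri_rand_strings(n, nchar(target), "[a-z]")

# Element-by-element: one stringsim call per random string
sapply_time <- system.time(
  sapply_scores <- sapply(random_strings, function(s) stringsim(target, s))
)

# Vectorized: stringsim compares the target against the whole vector at once
vectorized_time <- system.time(
  vectorized_scores <- stringsim(target, random_strings)
)

print(sapply_time)
print(vectorized_time)

# sapply names its result after the input strings, so names() finds the best match
print(names(which.max(sapply_scores)))

# Indexing the input works regardless of whether the scores carry names
print(random_strings[which.max(vectorized_scores)])
```

Indexing `random_strings` directly is used for the vectorized result because it does not depend on the scores being named.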

That’s it for this post! Click here to view other R posts of mine.

The post Speed Test: Sapply vs. Vectorization appeared first on Open Source Automation.
