We’re pleased to announce that sparklyr 0.8 is now available on CRAN! sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark ML. You can also learn more about Apache Spark and sparklyr at spark.rstudio.com and in the sparklyr webinar series. In this version, we added support for Spark 2.3 and Livy 0.5, along with various enhancements and bug fixes. In this post, we’d like to highlight a new feature from Spark 2.3 and introduce the mleap and graphframes extensions.
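If you’d like to follow along, you can grab the latest release from CRAN. Here’s a minimal setup sketch; the Spark version shown is just an example, and spark_install() is only needed if you don’t already have a local Spark installation:

# Install sparklyr from CRAN and a local copy of Spark
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.3.0")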
Spark 2.3 supports parallelism in hyperparameter tuning: instead of training each model specification serially, you can now train them in parallel. This is enabled by setting the parallelism parameter in ml_cross_validator() or ml_train_validation_split(). Here’s an example:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
iris_tbl <- sdf_copy_to(sc, iris)

# Define a pipeline: assemble the features, index the label,
# then fit a logistic regression
labels <- c("setosa", "versicolor", "virginica")
pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    c("Sepal_Width", "Sepal_Length", "Petal_Width", "Petal_Length"),
    "features"
  ) %>%
  ft_string_indexer_model("Species", "label", labels = labels) %>%
  ml_logistic_regression()

# Specify hyperparameter grid
grid <- list(
  logistic = list(
    elastic_net_param = c(0.25, 0.75),
    reg_param = c(1e-3, 1e-4)
  )
)

# Create the cross validator, training up to 4 models in parallel
cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3,
  parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)
Once the models are trained, you can inspect the performance results by using the newly available helper function ml_validation_metrics():

ml_validation_metrics(cv_model)
##       f1 elastic_net_param_1 reg_param_1
## 1 0.9506                0.25       1e-03
## 2 0.9384                0.75       1e-03
## 3 0.9384                0.25       1e-04
## 4 0.9569                0.75       1e-04
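Because ml_validation_metrics() returns a regular R data frame, you can work with it using the usual tools. As a minimal sketch, here’s one way to pick out the best-performing hyperparameter combination (the f1 column comes from the evaluator above):

library(dplyr)

# Sort the tuning results by F1 score and keep the best row
ml_validation_metrics(cv_model) %>%
  arrange(desc(f1)) %>%
  head(1)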
Pipelines in Production
Earlier this year, we announced support for ML Pipelines in sparklyr, and discussed how one can persist models onto disk. While that workflow is appropriate for batch scoring of large datasets, we also wanted to enable real-time, low-latency scoring using pipelines developed with sparklyr. To enable this, we’ve developed the mleap package, available on CRAN, which provides an interface to the MLeap open source project.
MLeap allows you to use your Spark pipelines in any Java-enabled device or service. It works by serializing Spark pipelines, which can later be loaded into the Java Virtual Machine (JVM) for scoring without requiring a Spark cluster. This means that software engineers can take Spark pipelines exported with sparklyr and easily embed them in web, desktop, or mobile applications.
To get started, simply grab the package from CRAN and install the necessary dependencies:
install.packages("mleap") library(mleap) install_maven() install_mleap()
Then, build a pipeline as usual:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

# Define and fit a pipeline: binarize hp, assemble the features,
# then fit a gradient-boosted trees regressor
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")

pipeline_model <- ml_fit(pipeline, mtcars_tbl)
Once we have the pipeline model, we can export it via ml_write_bundle():

# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, mtcars_tbl, model_path)
At this point, we’re ready to use mtcars_model.zip in other applications. Notice that the following code does not require Spark:
# Import model
library(mleap)
model <- mleap_load_bundle(model_path)

# Create a new data frame to score, then transform it with the model
newdata <- tibble::tribble(
  ~qsec, ~hp, ~wt,
  16.2,  101, 2.68,
  18.1,   99, 3.08
)
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec       <dbl> 16.2, 18.1
## $ hp         <dbl> 101, 99
## $ wt         <dbl> 2.68, 3.08
## $ big_hp     <dbl> 1, 0
## $ features   <list> [[[1, 2.68, 16.2], ...], [[0, 3.08, 18.1], ...]]
## $ prediction <dbl> ...
Notice that MLeap currently supports pipelines built with Spark 2.0 to 2.2. You can find additional details in the production pipelines guide.
Graph Analysis
The other extension we’d like to highlight is graphframes, which provides an interface to the GraphFrames Spark package. GraphFrames allows us to run graph algorithms at scale using a DataFrame-based API.
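At its core, a GraphFrame is built from two Spark DataFrames: a vertices table with an id column and an edges table with src and dst columns. Here’s a minimal sketch on a toy graph, assuming an existing Spark connection sc:

library(sparklyr)
library(graphframes)

# A toy graph: vertices need an `id` column, edges need `src` and `dst`
vertices_tbl <- sdf_copy_to(sc, data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE))
edges_tbl <- sdf_copy_to(sc, data.frame(src = c("a", "b"), dst = c("b", "c"), stringsAsFactors = FALSE))

g <- gf_graphframe(vertices_tbl, edges_tbl)

# Compute the degree of each vertex
gf_degrees(g)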
Let’s see graphframes in action through a quick example, where we analyze the dependency relationships among packages on CRAN.
library(graphframes)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")

# Grab the list of packages from CRAN, along with their
# Depends and Imports fields
packages <- available.packages() %>%
  `[`(, c("Package", "Depends", "Imports")) %>%
  as_tibble() %>%
  transmute(
    package = Package,
    dependencies = paste(Depends, Imports, sep = ",") %>%
      gsub("\\n|\\s+", "", .)
  )

# Copy data to Spark
packages_tbl <- sdf_copy_to(sc, packages, overwrite = TRUE)

# Create an edges table: strip version requirements, tokenize the
# dependency list, then explode it into one row per dependency
edges_tbl <- packages_tbl %>%
  mutate(
    dependencies = dependencies %>%
      regexp_replace("\\(([^)]+)\\)", "")
  ) %>%
  ft_regex_tokenizer(
    "dependencies", "dependencies_vector",
    pattern = "(\\s+)?,(\\s+)?", to_lower_case = FALSE
  ) %>%
  transmute(
    src = package,
    dst = explode(dependencies_vector)
  ) %>%
  filter(!dst %in% c("R", "NA"))
Once we have an edges table, we can easily create a GraphFrame object by calling gf_graphframe() and run PageRank:
# Create a GraphFrame object
g <- gf_graphframe(edges = edges_tbl)

# Run the PageRank algorithm
results <- gf_pagerank(g, tol = 0.01)

# Show the packages ranked by PageRank
results %>%
  gf_vertices() %>%
  arrange(desc(pagerank))
## # Source:     table [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(pagerank)
##    id        pagerank
##    <chr>        <dbl>
##  1 methods       259.
##  2 stats         209.
##  3 utils         194.
##  4 Rcpp          109.
##  5 graphics      104.
##  6 grDevices      60.0
##  7 MASS           53.7
##  8 lattice        34.7
##  9 Matrix         33.3
## 10 grid           32.1
## # ... with more rows
We can also collect a sample of the graph locally for visualization:
library(gh)
library(visNetwork)

# Retrieve repository names for the r-lib and tidyverse GitHub organizations
list_repos <- function(org) {
  gh("/orgs/:org/repos", org = org, .limit = Inf) %>%
    vapply("[[", "", "name")
}
rlib_repos <- list_repos("r-lib")
tidyverse_repos <- list_repos("tidyverse")

# Identify the base packages shipped with R
base_packages <- installed.packages() %>%
  as_tibble() %>%
  filter(Priority == "base") %>%
  pull(Package)

# Take the top 75 packages by PageRank
top_packages <- results %>%
  gf_vertices() %>%
  arrange(desc(pagerank)) %>%
  head(75) %>%
  pull(id)

# Collect the subgraph induced by the top packages
edges_local <- results %>%
  gf_edges() %>%
  filter(src %in% !!top_packages && dst %in% !!top_packages) %>%
  rename(from = src, to = dst) %>%
  collect()

vertices_local <- results %>%
  gf_vertices() %>%
  filter(id %in% top_packages) %>%
  mutate(
    group = case_when(
      id %in% !!rlib_repos ~ "r-lib",
      id %in% !!tidyverse_repos ~ "tidyverse",
      id %in% !!base_packages ~ "base",
      TRUE ~ "other"
    ),
    title = id
  ) %>%
  collect()

# Plot the graph, coloring nodes by organization
visNetwork(vertices_local, edges_local, width = "100%") %>%
  visEdges(arrows = "to")
Notice that GraphFrames currently supports Spark 2.0 and 2.1. You can find additional details in the graph analysis guide.