R News

sparklyr 0.5: Livy and dplyr improvements

by RStudio | Open source & professional software for data science teams on RStudio · January 24, 2017

This article is originally published at https://www.rstudio.com/blog/

We’re happy to announce that version 0.5 of the sparklyr package is now available on CRAN. The new version comes with many improvements over the first release, including:

Extended dplyr support by implementing: do() and n_distinct().
New functions including sdf_quantile(), ft_tokenizer() and ft_regex_tokenizer().
Improved compatibility, sparklyr now respects the value of the ‘na.action’ R option and dim(), nrow() and ncol().
Experimental support for Livy to enable clients, including RStudio, to connect remotely to Apache Spark.
Improved connections by simplifying initialization and providing error diagnostics.
Certified sparklyr, RStudio Server Pro and ShinyServer Pro with Cloudera.
Updated spark.rstudio.com with new deployment examples and a sparklyr cheatsheet.

Additional changes and improvements can be found in the sparklyr NEWS file.

For questions or feedback, please feel free to open a sparklyr github issue or a sparklyr stackoverflow question.

Extended dplyr support

sparklyr 0.5 adds supports for n_distinct() as a faster and more concise equivalent of length(unique(x)) and also adds support for do() as a convenient way to perform multiple serial computations over a group_by() operation:

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

by_cyl <- group_by(mtcars_tbl, cyl)
fit_sparklyr <- by_cyl %>%
   do(mod = ml_linear_regression(mpg ~ disp, data = .))

# display results
fit_sparklyr$mod

In this case, . represents a Spark DataFrame, which allows us to perform operations at scale (like this linear regression) for a small set of groups. However, since each group operation is performed sequentially, it is not recommended to use do() with a large number of groups. The code above performs multiple linear regressions with the following output:

[[1]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
 (Intercept)         disp
19.081987419  0.003605119

[[2]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
(Intercept)        disp
 40.8719553  -0.1351418

[[3]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
(Intercept)        disp
22.03279891 -0.01963409

It’s worth mentioning that while sparklyr provides comprehensive support for dplyr, dplyr is not strictly required while using sparklyr. For instance, one can make use of DBI without dplyr as follows:

library(sparklyr)
library(DBI)

sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris)
dbGetQuery(sc, "SELECT * FROM iris LIMIT 4")

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

New functions

The new sdf_quantile() function computes approximate quantiles (to some relative error), while the new ft_tokenizer() and ft_regex_tokenizer() functions split a string by white spaces or regex patterns.

For example, ft_tokenizer() can be used as follows:

library(sparklyr)
library(janeaustenr)
library(dplyr)

sc %>%
  spark_dataframe() %>%
  na.omit() %>%
  ft_tokenizer(input.col = "text", output.col = "tokens") %>%
  head(4)

Which produces the following output:

                   text                book     tokens
                  <chr>               <chr>     <list>
1 SENSE AND SENSIBILITY Sense & Sensibility <list [3]>
2                       Sense & Sensibility <list [1]>
3        by Jane Austen Sense & Sensibility <list [3]>
4                       Sense & Sensibility <list [1]>

Tokens can be further processed through, for instance, HashingTF.

Improved compatibility

‘na.action’ is a parameter accepted as part of the ‘ml.options’ argument, which defaults to getOption("na.action", "na.omit"). This allows sparklyr to match the behavior of R while processing NA records, for instance, the following linear model drops NA record appropriately:

library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local")
flights_clean <- na.omit(copy_to(sc, flights))

ml_linear_regression(
  flights_tbl
  response = "dep_delay",
  features = c("arr_delay", "arr_time"))

* Dropped 9430 rows with 'na.omit' (336776 => 327346)
Call: ml_linear_regression(flights_tbl, response = "dep_delay",
                           features = c("arr_delay", "arr_time"))

Coefficients:
 (Intercept)    arr_delay     arr_time
6.1001212994 0.8210307947 0.0005284729

In addition, dim(), nrow() and ncol() are now supported against Spark DataFrames.

Livy connections

Livy, “An Open Source REST Service for Apache Spark (Apache License)", is now available in sparklyr 0.5 as an experimental feature. Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster.

Livy running locally

To work with Livy locally, sparklyr supports livy_install() which installs Livy in your local environment, this is similar to spark_install(). Since Livy is a service to enable remote connections into Apache Spark, the service needs to be started with livy_service_start(). Once the service is running, spark_connect() needs to reference the running service and use method = "Livy", then sparklyr can be used as usual. A short example follows:

livy_install()
livy_service_start()

sc <- spark_connect(master = "http://localhost:8998",
                    method = "livy")
copy_to(sc, iris)

spark_disconnect(sc)
livy_service_stop()

Livy running in HDInsight

Microsoft Azure supports Apache Spark clusters configured with Livy and protected with basic authentication in HDInsight clusters. To use sparklyr with HDInsight clusters through Livy, first create the HDInsight cluster with Spark support:

Creating Spark Cluster in Microsoft Azure HDInsight

Once the cluster is created, you can connect with sparklyr as follows:

library(sparklyr)
library(dplyr)

config <- livy_config(user = "admin", password = "password")
sc <- spark_connect(master = "https://dm.azurehdinsight.net/livy/",
                    method = "livy",
                    config = config)

copy_to(sc, iris)

From a desktop running RStudio, the remote connection looks like this:

Improved connections

sparklyr 0.5 no longer requires internet connectivity to download additional Apache Spark packages. This enables connections in secure clusters that do not have internet access or while on the go.

Some community members reported a generic “Ports file does not exists” error while connecting with sparklyr 0.4. In 0.5, we’ve deprecated the ports file and improved error reporting. For instance, the following invalid connection example throws: a descriptive error, the spark-submit parameters and logging information that helps us troubleshoot connection issues.

> library(sparklyr)
> sc <- spark_connect(master = "local",
                      config = list("sparklyr.gateway.port" = "0"))
Error in force(code) :
  Failed while connecting to sparklyr to port (0) for sessionid (5305):
  Gateway in port (0) did not respond.
  Path: /spark-1.6.2-bin-hadoop2.6/bin/spark-submit
  Parameters: --class, sparklyr.Backend, 'sparklyr-1.6-2.10.jar', 0, 5305

---- Output Log ----
16/12/12 12:42:35 INFO sparklyr: Session (5305) starting

---- Error Log ----

Additional technical details can be found in the sparklyr gateway socket pull request.

Cloudera certification

sparklyr 0.4, sparklyr 0.5, RStudio Server Pro 1.0 and ShinyServer Pro 1.5 went through Cloudera’s certification and are now certified with Cloudera. Among various benefits, authentication features like Kerberos, have been tested and validated against secured clusters.

For more information see Cloudera’s partner listings.

Thanks for visiting r-craft.org
This article is originally published at https://www.rstudio.com/blog/
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

sparklyr 0.5: Livy and dplyr improvements

You may also like...

Categories

sparklyr 0.5: Livy and dplyr improvements

Extended dplyr support

New functions

Improved compatibility

Livy connections

Livy running locally

Livy running in HDInsight

Improved connections

Cloudera certification

You may also like...

Cheatsheet Updates

Fit and predict with tidymodels for #TidyTuesday bird baths in Australia

Top 7 Skills to Get Hired in Data Science in 2022

Categories