---
title: Spark SQL and Machine Learning with R
author: Ali Zaidi
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: true
  html_document:
    keep_md: yes
    toc: yes
    toc_float: true
---
# Using Spark SQL with R
There are three different R APIs we can use with Spark: [`SparkR`](http://spark.apache.org/docs/latest/sparkr.html), [`RxSpark`](https://msdn.microsoft.com/en-us/microsoft-r/scaler/rxspark), and [`sparklyr`](http://spark.rstudio.com).

We will examine `RxSpark` in depth later today. For now, let's take a look at `sparklyr`. The greatest advantage of `sparklyr` over `SparkR` is its clean, tidy interface to Spark SQL and Spark ML.
```{r, eval = FALSE}
library(sparklyr)
library(dplyr)

# Connect to Spark via YARN and list the tables registered in the metastore
sc <- spark_connect(master = "yarn-client")
src_tbls(sc)
```
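Because the connection object also implements the DBI interface, you can send Spark SQL directly when a query is easier to express that way. A minimal sketch, assuming the `pullrequest` table used later in this notebook is already registered in the metastore:

```{r, eval = FALSE}
library(DBI)

# Run a raw Spark SQL query and return the result as a local data frame
dbGetQuery(sc, "SELECT COUNT(*) AS n_pull_requests FROM pullrequest")
```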
## Configuration
To specify the configuration of your Spark session, use the `spark_config` function from the `sparklyr` package.
```{r}
config <- spark_config()
config$spark.executor.cores <- 3
config$spark.executor.memory <- "15G"

sc <- spark_connect(master = "yarn-client",
                    config = config)
```
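The object returned by `spark_config()` is a named list, so any standard Spark property can be set by name before calling `spark_connect()`. A small sketch; the property values below are placeholders, not tuning advice for this cluster:

```{r, eval = FALSE}
# Additional Spark properties can be added the same way (set before connecting)
config$spark.executor.instances <- 4
config$spark.yarn.executor.memoryOverhead <- "2g"
```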
## Working with Tables
```{r}
pullrequest <- tbl(sc, "pullrequest")
users <- tbl(sc, "users")

pullrequest %>% head()

# What is the schema of the users table?
sdf_schema(users)
```
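Before joining, it can help to check the size of each table. A quick sketch using `sdf_nrow()`, which computes the count on the cluster rather than locally:

```{r, eval = FALSE}
# Row counts are computed by Spark, not pulled into local memory
sdf_nrow(pullrequest)
sdf_nrow(users)
```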
## Joins
```{r}
users_sub <- select(users, login, userid, bio, blog, company) %>%
  sdf_register("usersSub")
tbl_cache(sc, 'usersSub')

pullrequest_sub <- select(pullrequest, repo, owner, pullrequestid,
                          additions, deletions, baserepofullname,
                          body, comments, commits, merged, mergeable,
                          userid, userlogin) %>%
  sdf_register('pr_sub')
tbl_cache(sc, 'pr_sub')

joined_tbl <- left_join(pullrequest_sub, users_sub, by = "userid")
joined_tbl %>% sdf_register('joined')
tbl_cache(sc, 'joined')
```
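A quick sanity check on the join; a sketch that assumes the `joined_tbl` reference from the chunk above:

```{r, eval = FALSE}
# Compare row counts before and after the join
sdf_nrow(pullrequest_sub)
sdf_nrow(joined_tbl)

# Peek at a few joined records
joined_tbl %>% select(repo, userlogin, login, merged) %>% head()
```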
## Calculate Number of Merged Pull Requests
```{r}
sum_pr_commit <- joined_tbl %>%
  group_by(repo) %>%
  summarise(counts = n(),
            ave_additions = mean(additions),
            ave_deletions = mean(deletions),
            merged_total = sum(as.numeric(merged)))

sum_pr_commit %>% sdf_register("summaryPRMerged")
tbl_cache(sc, 'summaryPRMerged')
```
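The summary table has one row per repository, so it is small enough to order and collect back into local R memory for inspection or plotting. A sketch:

```{r, eval = FALSE}
# Rank repositories by merged pull requests and collect the result locally
top_repos <- sum_pr_commit %>%
  arrange(desc(merged_total)) %>%
  head(20) %>%
  collect()

top_repos
```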
## Save Data to Parquet
Now that we have our joined dataset, let's save it to Parquet so that we can consume and analyze it with Microsoft R Server.
```{r save-to-parquet}
rxHadoopMakeDir("/joinedData/")
rxHadoopListFiles("/")

spark_write_parquet(joined_tbl,
                    path = "/joinedData/PR_users/")
```
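To verify the write before moving to Microsoft R Server, you can list the output directory, or read the Parquet files back through `sparklyr`; a hedged sketch:

```{r, eval = FALSE}
# The directory should now contain Parquet part files
rxHadoopListFiles("/joinedData/PR_users/")

# Optionally read it back into Spark to double-check the schema
pr_users_check <- spark_read_parquet(sc, "pr_users_check",
                                     path = "/joinedData/PR_users/")
sdf_schema(pr_users_check)
```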
## Loading Data in RevoScaleR
```{r}
myNameNode <- "default"
myPort <- 0

hdfsFS <- RxHdfsFileSystem(hostName = myNameNode,
                           port = myPort)

joined_parquet <- RxParquetData("/joinedData/PR_users",
                                fileSystem = hdfsFS)

computeContext <- RxSpark(consoleOutput = TRUE,
                          nameNode = myNameNode,
                          port = myPort,
                          executorCores = 14,
                          executorMem = "20g",
                          executorOverheadMem = "7g",
                          persistentRun = TRUE,
                          extraSparkConfig = "--conf spark.speculation=true")
rxSetComputeContext(computeContext)

spark_disconnect(sc)

rxGetInfo(joined_parquet, getVarInfo = TRUE, numRows = 10)

dtree <- rxDTree(merged ~ commits + comments, data = joined_parquet)

rxStopEngine(computeContext)
```
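Other ScaleR functions can run against the same Parquet source through the `RxSpark` compute context. A minimal sketch, meant to be run before the final `rxStopEngine()` call above, while the compute context is still active:

```{r, eval = FALSE}
# Summary statistics computed in Spark on the Parquet data
rxSummary(~ additions + deletions + commits, data = joined_parquet)
```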