---
title: "Distributed R"
output: html_notebook
---
## Class catch up
```{r}
library(dplyr)
library(purrr)
library(readr)
library(sparklyr)
library(lubridate)
top_rows <- read.csv("/usr/share/flights/data/flight_2008_1.csv", nrows = 5)

file_columns <- top_rows %>%
  rename_all(tolower) %>%
  map(function(x) "character")
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/flights/data/",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
## 7.1 - Basic distribution
1. Cache a sample of *flights*
```{r}
flights_sample <- spark_flights %>%
  sample_frac(0.01) %>%
  mutate(arrdelay = as.numeric(arrdelay)) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  compute("flights_sample")
```
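`compute()` both registers the result as a named Spark table and caches it. As a quick check from R (a sketch, assuming the `sc` connection above), the registered tables can be listed directly:

```{r}
# List the tables registered with the Spark session;
# "flights_sample" should appear after the compute() call above
src_tbls(sc)
```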
2. Navigate to the Storage page in the Spark UI
3. Pass `nrow` to `spark_apply()` to get the row count by partition
```{r}
spark_apply(flights_sample, nrow)
```
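The number of rows returned per partition depends on how the cached sample is split across the cluster. As a sketch, assuming the `flights_sample` object above, `sdf_num_partitions()` reports that partition count directly, which should match the number of rows `spark_apply()` returned:

```{r}
# Number of partitions in the cached sample; spark_apply(..., nrow)
# returns one row count per partition
sdf_num_partitions(flights_sample)
```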
4. Pass a function that computes the average distance in each partition
```{r}
spark_apply(
  flights_sample,
  function(x) mean(as.numeric(x$distance))
)
```
## 7.2 - Use group_by
1. Use the `group_by` argument to partition by the *month* field
```{r}
spark_apply(flights_sample, nrow, group_by = "month", columns = "count")
```
2. Pass the same function from the previous exercise to calculate the average distance by month
```{r}
spark_apply(
  flights_sample,
  function(x) mean(as.numeric(x$distance)),
  group_by = "month",
  columns = "avg_distance"
)
```
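As a sanity check, the same per-month average can be computed with dplyr verbs, which sparklyr translates to Spark SQL instead of shipping an R function to the executors. A sketch, assuming the objects above:

```{r}
# Equivalent aggregation pushed down to Spark SQL (no R runs on the workers)
flights_sample %>%
  group_by(month) %>%
  summarise(avg_distance = mean(as.numeric(distance), na.rm = TRUE))
```

When a dplyr translation exists, it is usually preferable to `spark_apply()`, since it avoids starting an R process on each worker.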
## 7.3 - Distributing packages
1. Use `broom::tidy()` to run one `glm()` model per month
```{r}
models <- spark_apply(
  flights_sample,
  function(e) broom::tidy(glm(delayed ~ arrdelay, data = e, family = "binomial")),
  columns = c("term", "estimate", "std_error", "statistic", "p_value"),
  group_by = "month"
)

models
```
2. Close Spark connection
```{r}
spark_disconnect(sc)
```