---
title: "Spark pipelines"
output: html_notebook
---
## Class catchup
```{r}
library(tidyverse)
library(sparklyr)
library(lubridate)

top_rows <- read.csv("/usr/share/flights/data/flight_2008_1.csv", nrows = 5)

file_columns <- top_rows %>%
  rename_all(tolower) %>%
  map(function(x) "character")

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "local", config = conf, version = "2.0.0")

spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/flights/data/",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
## 8.1 - Build a pipeline
*Step-by-step of how to build a new Spark pipeline*
1. Use `sdf_partition()` to split the *flights* table into a 1% training and a 1% testing sample.
```{r}
model_data <- sdf_partition(
  spark_flights,
  training = 0.01,
  testing = 0.01,
  # the remaining 98% goes into a third partition so each sample stays at 1%
  rest = 0.98
)
```
2. Recreate the `dplyr` code from the *cached_flights* variable in the previous unit. Assign it to a new variable called `pipeline_df`.
```{r}
pipeline_df <- model_data$training %>%
  mutate(
    arrdelay = ifelse(arrdelay == "NA", 0, arrdelay),
    depdelay = ifelse(depdelay == "NA", 0, depdelay)
  ) %>%
  select(
    month,
    dayofmonth,
    arrtime,
    arrdelay,
    depdelay,
    crsarrtime,
    crsdeptime,
    distance
  ) %>%
  mutate_all(as.numeric)
```
3. Start a new pipeline with `ml_pipeline()` and `dplyr`-pipe into `ft_dplyr_transformer()`. Use `pipeline_df` as the `tbl` argument.
```{r}
ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df)
```
4. Pipe code into `ft_binarizer()` to determine if *arrdelay* is over 15 minutes.
```{r}
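# One possible answer (sketch): extend the step 3 pipe with a binarizer that
# flags arrival delays over 15 minutes in a new "delayed" column
ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  )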
```
5. Pipe code into `ft_bucketizer()`. Use it to split *crsdeptime* into a new *dephour* column made of six even 4-hour segments.
```{r}
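# One possible answer (sketch), assuming crsdeptime holds HHMM clock times:
# seven split points make six even 4-hour buckets in a new "dephour" column
ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input_col = "crsdeptime",
    output_col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  )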
```
6. Add `ft_r_formula()` with a model that uses *depdelay* and *dephour* to predict the binarized *delayed* field.
```{r}
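# One possible answer (sketch): the R formula sets the binarized "delayed"
# column as the label and depdelay plus dephour as the features
ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input_col = "crsdeptime",
    output_col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  ft_r_formula(delayed ~ depdelay + dephour)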
```
7. Pipe into a logistic regression model with `ml_logistic_regression()`.
```{r}
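# One possible answer (sketch): finish the pipe by appending the estimator,
# so the last stage is a logistic regression model
ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input_col = "crsdeptime",
    output_col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  ft_r_formula(delayed ~ depdelay + dephour) %>%
  ml_logistic_regression()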
```
8. Assign the entire pipe to a new variable called `flights_pipeline`.
```{r}
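# The full pipe from steps 3-7, now assigned to flights_pipeline
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = pipeline_df) %>%
  ft_binarizer(
    input_col = "arrdelay",
    output_col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input_col = "crsdeptime",
    output_col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  ft_r_formula(delayed ~ depdelay + dephour) %>%
  ml_logistic_regression()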
```
## 8.2 - Fit, evaluate, save
1. Fit (train) the `flights_pipeline` pipeline using the training data in `model_data`. The function to use is `ml_fit()`.
```{r}
model <- ml_fit(flights_pipeline, model_data$training)
```
2. Use the newly fitted model to perform predictions using `ml_transform()`. Use the testing data from `model_data`
```{r}
predictions <- ml_transform(model, model_data$testing)
```
3. Use `group_by()` and `tally()` to see how the model performed.
```{r}
predictions %>%
  group_by(delayed, prediction) %>%
  tally()
```
4. Save the model to disk using `ml_save()`.
```{r}
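# A minimal sketch; the "saved_model" folder name is an assumption, any
# writable path works, and overwrite = TRUE replaces an earlier save
ml_save(model, "saved_model", overwrite = TRUE)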
```
5. Save the pipeline using `ml_save()`
```{r}
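# Same idea for the (unfitted) pipeline, under an assumed "saved_pipeline" folder
ml_save(flights_pipeline, "saved_pipeline", overwrite = TRUE)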
```
6. Close the Spark session
```{r}
spark_disconnect(sc)
```
## 8.3 - Reload model
*Use the saved model inside a different Spark session*
1. Open a new Spark connection and reload the data
```{r}
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.0.0")

spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/flights/data/",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
2. Use `ml_load()` to reload the model directly into the Spark session
```{r}
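# A minimal sketch, assuming the model was saved to "saved_model" in 8.2;
# the variable name reloaded_model is arbitrary
reloaded_model <- ml_load(sc, "saved_model")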
```
3. Create a new table called *current*. It needs to pull today's flights.
```{r}
library(lubridate)

current <- tbl(sc, "flights") %>%
  filter(
    month == !! month(now()),
    dayofmonth == !! day(now())
  )

show_query(current)
```
4. Run predictions against `current` using `ml_transform()`.
```{r}
new_predictions <- ml_transform(reloaded_model, current)
```
5. Get a quick count of expected delayed flights. The field to check is called `prediction`.
```{r}
new_predictions %>%
  summarise(late_flights = sum(prediction, na.rm = TRUE))
```
## 8.4 - Reload pipeline
*Overview of how to use new data to re-fit the pipeline, thus creating a new pipeline model*
1. Use `ml_load()` to reload the pipeline into the Spark session
```{r}
# Assumes the pipeline was saved to "saved_pipeline" in section 8.2
flights_pipeline <- ml_load(sc, "saved_pipeline")
```
2. Create a new sample data set using `sample_frac()`; a 1% sample of the total data should be sufficient.
```{r}
sample <- tbl(sc, "flights") %>%
  sample_frac(0.01)
```
3. Re-fit the model using `ml_fit()` and the new sample data
```{r}
new_model <- ml_fit(flights_pipeline, sample)
```
4. Save the newly fitted model
```{r}
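# Sketch: save the re-fitted model under an assumed "new_model" folder name
ml_save(new_model, "new_model", overwrite = TRUE)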
```
5. Disconnect from Spark
```{r}
spark_disconnect(sc)
```