Databricks installation script #457

Merged
merged 34 commits into from
Jan 31, 2019
Changes from 31 commits
Commits
0b70706
Databricks installation script
loomlike Jan 28, 2019
0d7a3ea
Update databricks installation script
loomlike Jan 29, 2019
f6f6309
Update SETUP
loomlike Jan 29, 2019
12c1b3d
Merge branch 'jumin/databricks_setup_sh' of github.com:Microsoft/Reco…
jreynolds01 Jan 29, 2019
3a2a771
Update SETUP
loomlike Jan 29, 2019
cfb600d
update error message
jreynolds01 Jan 29, 2019
fbcec2d
Fix to work on Windows git-bash
loomlike Jan 29, 2019
0a95037
add script to configure ADB for operationalization
jreynolds01 Jan 29, 2019
988fa9d
Merge branch 'jumin/databricks_setup_sh' of github.com:Microsoft/Reco…
jreynolds01 Jan 29, 2019
6c247d0
add a minor update to error message
jreynolds01 Jan 29, 2019
e94eaa5
update o16n prep script to work on gitbash
jreynolds01 Jan 29, 2019
213cf84
Revert "add comment for pyspark_version"
jreynolds01 Jan 29, 2019
93f2ce9
Revert "update help and error messages to be more informative."
jreynolds01 Jan 29, 2019
530f713
Revert "add default name to conda file"
jreynolds01 Jan 29, 2019
1dd9c74
Revert "update conda script to take a parameter for spark version"
jreynolds01 Jan 29, 2019
f7136ac
update usage message
jreynolds01 Jan 29, 2019
cf5b47c
update error message
jreynolds01 Jan 29, 2019
4cbbb75
add script to configure ADB for operationalization
jreynolds01 Jan 29, 2019
235d137
Update SETUP
loomlike Jan 29, 2019
42faa78
Fix to work on Windows git-bash
loomlike Jan 29, 2019
8e62c45
add a minor update to error message
jreynolds01 Jan 29, 2019
48b2fec
update o16n prep script to work on gitbash
jreynolds01 Jan 29, 2019
e039719
update usage message
jreynolds01 Jan 29, 2019
74ebb52
Merge branch 'jumin/databricks_setup_sh' of github.com:Microsoft/Reco…
jreynolds01 Jan 30, 2019
2b2d41a
Merge branch 'staging' into jumin/databricks_setup_sh
jreynolds01 Jan 30, 2019
60ec26b
Revert "Revert "update conda script to take a parameter for spark ver…
jreynolds01 Jan 30, 2019
6ebf3e3
Revert "Revert "add default name to conda file""
jreynolds01 Jan 30, 2019
e5999f4
Revert "Revert "update help and error messages to be more informative.""
jreynolds01 Jan 30, 2019
fd553ac
Revert "Revert "add comment for pyspark_version""
jreynolds01 Jan 30, 2019
23e9796
fix name fieldin generate_conda_file again
jreynolds01 Jan 30, 2019
2e115de
update SETUP.md for clarity and add a section for operationalization …
jreynolds01 Jan 30, 2019
bde7866
Merge branch 'staging' into jumin/databricks_setup_sh
miguelgfierro Jan 30, 2019
23190d0
update databricks envs supported
jreynolds01 Jan 31, 2019
e2b78b9
Merge branch 'jumin/databricks_setup_sh' of github.com:Microsoft/Reco…
jreynolds01 Jan 31, 2019
190 changes: 150 additions & 40 deletions SETUP.md
@@ -1,9 +1,12 @@
# Setup guide
# Setup guide

In this guide we show how to setup all the dependencies to run the notebooks of this repo on a local Linux system or Linux [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) and on [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).
This document describes how to set up all the dependencies to run the notebooks in this repository in two different environments:

* a Linux system (local or an [Azure Data Science Virtual Machine (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/))
* [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/).

## Table of Contents

* [Compute environments](#compute-environments)
* [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm)
* [Setup Requirements](#setup-requirements)
@@ -12,38 +15,41 @@ In this guide we show how to setup all the dependencies to run the notebooks of
* [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm)
* [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks)
* [Requirements of Azure Databricks](#requirements-of-azure-databricks)
* [Repository upload](#repository-upload)
* [Repository installation](#repository-installation)
* [Troubleshooting for Azure Databricks](#troubleshooting-for-azure-databricks)
</details>
* [Prepare Azure Databricks for Operationalization](#prepare-azure-databricks-for-operationalization)

## Compute environments

We have different compute environments, depending on the kind of machine
Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements.

Currently, this repository supports the following on a Linux DSVM:

Environments supported to run the notebooks on the Linux DSVM:
* Python CPU
* Python GPU
* PySpark
Environments supported to run the notebooks on Azure Databricks:
* PySpark

PySpark is the only supported environment on Azure Databricks.

## Setup guide for Local or DSVM

### Setup Requirements

- Anaconda with Python version >= 3.6. [Miniconda](https://conda.io/miniconda.html) is the fastest way to get started.
- The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
- Machine with Spark (optional for Python environment but mandatory for PySpark environment).
* Anaconda with Python version >= 3.6. [Miniconda](https://conda.io/miniconda.html) is the fastest way to get started.
* The Python library dependencies can be found in this [script](scripts/generate_conda_file.sh).
* Machine with Spark (optional for Python environment but mandatory for PySpark environment).

### Dependencies setup

We install the dependencies with Conda. As a pre-requisite, we may want to make sure that Conda is up-to-date:

conda update anaconda
```{shell}
conda update anaconda
```

We provided a script to [generate a conda file](scripts/generate_conda_file.sh), depending of the environment we want to use. This will create the environment using the Python version 3.6 with all the correct dependencies.
We provide a script to [generate a conda file](scripts/generate_conda_file.sh), depending on the environment we want to use. This will create the environment using Python version 3.6 with all the correct dependencies.

To install each environment, first we need to generate a conda yaml file and then install the environment. We can specify the environment name with the input `-n`.
To install each environment, first we need to generate a conda yaml file and then install the environment. We can specify the environment name with the input `-n`.

Click on the following menus to see more details:

@@ -58,7 +64,6 @@ Assuming the repo is cloned as `Recommenders` in the local system, to install th

</details>


<details>
<summary><strong><em>Python GPU environment</em></strong></summary>

@@ -79,6 +84,10 @@ To install the PySpark environment, which by default installs the CPU environmen
./scripts/generate_conda_file.sh --pyspark
conda env create -n reco_pyspark -f conda_pyspark.yaml

Additionally, if you want to test a particular version of Spark, you may pass the `--pyspark-version` argument:

./scripts/generate_conda_file.sh --pyspark-version 2.4.0
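
To pick a value for `--pyspark-version` that matches the machine, the version can be read from the Spark banner. The following is a minimal sketch, not part of this repository: `extract_spark_version` is a hypothetical helper, and it assumes that `spark-submit --version` prints a banner containing `version X.Y.Z` (typically on stderr).

```shell
# Hypothetical helper: print the first "version X.Y.Z" found on stdin.
extract_spark_version() {
    sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example use (assumes spark-submit is on PATH and writes its banner to stderr):
#   SPARK_VERSION=$(spark-submit --version 2>&1 | extract_spark_version)
#   ./scripts/generate_conda_file.sh --pyspark-version "$SPARK_VERSION"
```

The helper only parses text, so it is worth checking its result by eye before generating the conda file.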

**NOTE** - for this environment, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable.

To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). Assuming that we have installed the environment in `/anaconda/envs/reco_pyspark`, we create the file `/anaconda/envs/reco_pyspark/etc/conda/activate.d/env_vars.sh` and add:
@@ -109,20 +118,19 @@ To install all three environments:

</details>

### Register the conda environment as a kernel in Jupyter

### Register the conda environment in Jupyter notebook

We can register our created conda environment to appear as a kernel in the Jupyter notebooks.
We can register our created conda environment to appear as a kernel in the Jupyter notebooks.

conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"


### Troubleshooting for the DSVM

* We found that there could be problems if the Spark version of the machine is not the same as the one in the conda file. You will have to adapt the conda file to your machine.
* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`.
```
* We found that there can be problems if the Spark version of the machine is not the same as the one in the conda file. You can use the option `--pyspark-version` to address this issue.
* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`.

```{shell}
SPARK_LOCAL_DIRS="/mnt"
SPARK_WORKER_DIR="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true"
@@ -131,29 +139,131 @@ SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.a
## Setup guide for Azure Databricks

### Requirements of Azure Databricks
* Runtime version 4.1 (Apache Spark 2.3.0, Scala 2.11)

* Runtime version 4.3 (Apache Spark 2.3.1, Scala 2.11)
* Python 3

### Repository upload
We need to zip and upload the repository to be used in Databricks, the steps are the following:
* Clone Microsoft Recommenders repo in your local computer.
* Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the .egg suffix, so we don't use the standard .zip):
```
### Repository installation
You can set up the repository as a library on Databricks either manually or by running an [installation script](scripts/databricks_install.sh). Both options assume you have access to a provisioned Databricks workspace and cluster and that you have appropriate permissions to install libraries.

<details>
<summary><strong><em>Quick install</em></strong></summary>

This option utilizes an installation script to do the setup, and it requires additional dependencies in the environment used to execute the script.

> To run the script, the following **prerequisites** are required:
> * Install the [Azure Databricks CLI (command-line interface)](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#install-the-cli) and set up CLI authentication. Details on how to create a token and set up authentication are [here](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#set-up-authentication). Briefly, you can install and configure your environment with the following commands.
>
> ```{shell}
> pip install databricks-cli
> databricks configure --token
> ```
>
> * Get the target **cluster id** and **start** the cluster if its status is *TERMINATED*.
>     * You can get the cluster id from the databricks CLI with:
>         ```{shell}
>         databricks clusters list
>         ```
>     * If required, you can start the cluster with:
>         ```{shell}
>         databricks clusters start --cluster-id <CLUSTER_ID>
>         ```
> * The script also requires the `zip` command line utility, which may not be installed. You can install it with:
> ```{shell}
> sudo apt-get update
> sudo apt-get install zip
> ```
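
The prerequisites above can be checked in one pass before running the installation script. This is a minimal sketch for a POSIX shell; `missing_tools` is a hypothetical helper and not part of this repository. The command names `databricks`, `dbfs`, and `zip` are the ones the installation script calls.

```shell
# Print the name of every command given as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example use before running the installation script:
#   MISSING=$(missing_tools databricks dbfs zip)
#   [ -z "$MISSING" ] || echo "Please install: $MISSING"
```

This only confirms that the commands exist; it does not verify that CLI authentication has been configured.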

Once you have confirmed the databricks cluster is *RUNNING*, install the modules within this repository with the following commands:

```{shell}
cd Recommenders
zip -r Recommenders.egg .
```
* Once your cluster has started, go to the Databricks home workspace, then go to your user and press import.
* In the next menu there is an option to import a library, it says: `To import a library, such as a jar or egg, click here`. Press click here.
* Then, at the first drop-down menu, mark the option `Upload Python egg or PyPI`.
* Then press on `Drop library egg here to upload` and select the the file `Recommenders.egg` you just created.
* Then press `Create library`. This will upload the zip and make it available in your workspace.
* Finally, in the next menu, attach the library to your cluster.

To make sure it works, you can now create a new notebook and import the utilities:
./scripts/databricks_install.sh <CLUSTER_ID>
```

</details>

<details>
<summary><strong><em>Manual setup</em></strong></summary>

To install the repo manually onto Databricks, follow the steps:

1. Clone the Microsoft Recommenders repository to your local computer.
2. Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the `.egg` suffix, so we don't use the standard `.zip`):

```{shell}
cd Recommenders
zip -r Recommenders.egg .
```
3. Once your cluster has started, go to the Databricks workspace, and select the `Home` button.
4. Your `Home` directory should appear in a panel. Right click within your directory, and select `Import`.
5. In the pop-up window, there is an option to import a library, where it says: `(To import a library, such as a jar or egg, click here)`. Select `click here`.
6. In the next screen, select the option `Upload Python Egg or PyPI` in the first menu.
7. Next, click on the box that contains the text `Drop library egg here to upload` and use the file selector to choose the `Recommenders.egg` file you just created, and select `Open`.
8. Click on `Create library`. This will upload the egg and make it available in your workspace.
9. Finally, in the next menu, attach the library to your cluster.

</details>

### Confirm Installation

After installation, create a new notebook on Databricks and import the utilities to confirm that the installation worked.

```{python}
import reco_utils
```

### Troubleshooting for Azure Databricks
### Troubleshooting Installation on Azure Databricks

* For the [reco_utils](reco_utils) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder; if you zip from the directory above the Recommenders folder, it won't work.
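
One way to catch the zipping mistake described above before uploading is to check that `reco_utils/` sits at the top level of the archive. The sketch below is illustrative only: `has_top_level_package` is a hypothetical helper, and it assumes `unzip` is installed (its `-Z1` mode lists member names, one per line).

```shell
# Read archive member names (one per line) on stdin and succeed only if
# the given package directory sits at the top level of the archive.
has_top_level_package() {
    grep -q "^$1/"
}

# Example use (assumes unzip is installed):
#   unzip -Z1 Recommenders.egg | has_top_level_package reco_utils \
#       || echo "reco_utils is not at the top level; re-zip from inside Recommenders"
```

If the egg was created from the parent directory, every member name starts with `Recommenders/` and the check fails.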

## Prepare Azure Databricks for Operationalization
Collaborator

@nikhilrj - please take a look at this SETUP.

Contributor

LGTM!

Contributor

Actually do we even need the manual installation steps now? Maybe we should cut them out...?

Collaborator Author

@nikhilrj hmmm good point. But basically the install-script does the same thing as the manual steps... Maybe we can explain what the script does instead of having those contents as 'manual installation'?

Collaborator Author

Just read your comment below. I agree the SETUP is a bit long now... what's other's thought?

Collaborator

i think it's long, but i think the benefits of having all setup in 1 place is worth it. I think we should provide manual information in case someone can't run the scripts for some reason.

However, there are a few different ways to do that:

  • Say something like see the scripts and comments in the scripts for manual dependencies, and include links to documentation on how to add libraries, etc. The scripts do implement all the steps, so that is in some ways self-documenting
  • Another option would be to have a separate SETUP_MANUAL.md file, and use that as a an Appendix of sorts, where we could reference in the default SETUP.md.

I think if we want to clean it up first bullet is probably a good way to do it. I'd say let's go ahead and merge, and have that as a follow-up action.


This repository includes an end-to-end example notebook that uses Azure Databricks to estimate a recommendation model using Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that [notebook](notebooks/05_operationalize/als_movie_o16n.ipynb), you must install the Recommenders repository as a library (as described above), **AND** you must also install some additional dependencies. Similar to above, you can do so either manually or via an installation [script](scripts/prepare_databricks_for_o16n.sh).

<details>
<summary><strong><em>Quick install</em></strong></summary>

This option utilizes an installation script to do the setup, and it requires the same dependencies as the databricks installation script (see above).

Once you have:

* Installed and configured the databricks CLI
* Confirmed that the appropriate cluster is *RUNNING*
* Installed the Recommenders egg as described above
* Confirmed you are in the root directory of the Recommenders repository

you can install additional dependencies for operationalization with:

```{shell}
scripts/prepare_databricks_for_o16n.sh <CLUSTER_ID>
```

This script does all of the steps described in the *Manual setup* section below.

</details>

<details>
<summary><strong><em>Manual setup</em></strong></summary>

You must install three packages as libraries from PyPI:

* `azure-cli`
* `azureml-sdk[databricks]`
* `pydocumentdb`

You can follow instructions [here](https://docs.azuredatabricks.net/user-guide/libraries.html#install-a-library-on-a-cluster) for details on how to install packages from PyPI.

Additionally, you must install the [spark-cosmosdb connector](https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html) on the cluster. The easiest way to manually do that is to:

1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar) from Maven. **NOTE** This is the appropriate jar for Spark versions `2.3.X`, and is the appropriate version for the recommended Azure Databricks run-time detailed above.
2. Upload and install the jar by:
    1. Log into your `Azure Databricks` workspace.
    2. Select the `Clusters` button on the left.
    3. Select the cluster on which you want to import the library.
    4. Select the `Upload` and `Jar` options, and click in the box that has the text `Drop JAR here` in it.
    5. Navigate to the downloaded `.jar` file, select it, and click `Open`.
    6. Click on `Install`.
    7. Restart the cluster.
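
For the three PyPI packages listed above, the Databricks CLI can also be used instead of the workspace UI. The sketch below is not the contents of `prepare_databricks_for_o16n.sh`, which is not shown in this diff. It assumes the legacy `databricks-cli` and its `databricks libraries install --pypi-package` option, and the hypothetical helper `print_pypi_install_commands` only prints the commands so they can be reviewed before being run.

```shell
# Print (rather than run) one install command per PyPI package, so the
# commands can be reviewed first. Assumes the legacy databricks-cli.
print_pypi_install_commands() {
    cluster_id=$1
    shift
    for pkg in "$@"; do
        printf 'databricks libraries install --cluster-id %s --pypi-package "%s"\n' "$cluster_id" "$pkg"
    done
}

# Example use:
#   print_pypi_install_commands <CLUSTER_ID> azure-cli 'azureml-sdk[databricks]' pydocumentdb
```

The package name is quoted in the printed command because `azureml-sdk[databricks]` contains brackets that a shell could otherwise treat as a glob pattern.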

</details>
70 changes: 70 additions & 0 deletions scripts/databricks_install.sh
@@ -0,0 +1,70 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# ---------------------------------------------------------
# This script installs Recommenders into Databricks

DATABRICKS_CLI=$(which databricks)
if ! [ -x "$DATABRICKS_CLI" ]; then
    echo "No databricks-cli found!! Please see the SETUP.md file for installation prerequisites."
    exit 1
fi

CLUSTER_ID=$1
if [ -z "$CLUSTER_ID" ]; then
    echo "Please provide the target cluster id: 'databricks_install.sh <CLUSTER_ID>'."
    echo "Cluster id can be found by running 'databricks clusters list'"
    echo "which returns a list of <CLUSTER_ID> <CLUSTER_NAME> <STATUS>."
    exit 1
fi

CLUSTER_EXIST=false
while IFS=' ' read -ra ARR; do
    if [ "${ARR[0]}" = "$CLUSTER_ID" ]; then
        CLUSTER_EXIST=true

        STATUS=${ARR[2]}
        STATUS=${STATUS//[^a-zA-Z]/}
        if [ "$STATUS" = RUNNING ]; then
            echo
            echo "Preparing Recommenders library file (egg)..."
            zip -r -q Recommenders.egg . -i \*.py -x tests/\* scripts/\*

            echo
            echo "Uploading to databricks..."
            dbfs cp --overwrite Recommenders.egg dbfs:/FileStore/jars/Recommenders.egg

            echo
            echo "Installing the library onto databricks cluster $CLUSTER_ID..."
            databricks libraries install --cluster-id "$CLUSTER_ID" --egg dbfs:/FileStore/jars/Recommenders.egg

            echo
            echo "Done! Installation status checking..."
            databricks libraries cluster-status --cluster-id "$CLUSTER_ID"

            echo
            echo "Restarting the cluster to activate the library..."
            databricks clusters restart --cluster-id "$CLUSTER_ID"

            echo "This will take a few seconds. Please check the result from Databricks workspace."
            echo "Alternatively, run 'databricks clusters list' to check the restart status and"
            echo "run 'databricks libraries cluster-status --cluster-id $CLUSTER_ID' to check the installation status."

            rm Recommenders.egg
            exit 0
        else
            echo "Cluster $CLUSTER_ID found, but it is not running. Status=${STATUS}"
            echo "You can start the cluster with 'databricks clusters start --cluster-id $CLUSTER_ID'."
            echo "Then, check the cluster status by using 'databricks clusters list' and"
            echo "re-try installation once the status turns into RUNNING."
            exit 1
        fi
    fi
done < <(databricks clusters list)

if ! [ "$CLUSTER_EXIST" = true ]; then
    echo "Cannot find the target cluster $CLUSTER_ID. Please check if you entered a valid id."
    echo "Cluster id can be found by running 'databricks clusters list'"
    echo "which returns a list of <CLUSTER_ID> <CLUSTER_NAME> <STATUS>."
    exit 1
fi
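
The cluster lookup in the script's `while` loop can be tried in isolation against sample text. The sketch below is an illustrative variant rather than the script's own code: `cluster_status` is a hypothetical helper that reads `<CLUSTER_ID> <CLUSTER_NAME> <STATUS>` lines on stdin, and it assumes the status is the last column, as the script's messages describe.

```shell
# Read "<CLUSTER_ID> <CLUSTER_NAME> <STATUS>" lines on stdin and print the
# status of the requested cluster. Exits with status 1 if the id is not listed.
cluster_status() {
    awk -v id="$1" '
        ($1 "") == id {
            status = $NF
            gsub(/[^a-zA-Z]/, "", status)  # drop stray characters such as a trailing CR
            print status
            found = 1
        }
        END { exit !found }
    '
}

# Example use (assumes the databricks CLI is installed and configured):
#   databricks clusters list | cluster_status <CLUSTER_ID>
```

Taking the last field rather than the third keeps the lookup working when a cluster name contains spaces, and the `gsub` call mirrors the script's own cleanup of the status string.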
