
Databricks installation script #457

Merged: 34 commits into staging from jumin/databricks_setup_sh, Jan 31, 2019

Conversation

@loomlike (Collaborator)

Description

  • Add a repo installation shell script for Databricks
  • Update SETUP.md accordingly

Motivation and Context

To make repo installation onto Databricks easy.

Related Issues

#237

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

Add installation sh
Update SETUP.md accordingly
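
For orientation, a minimal sketch of how such a script would be invoked; the argument shape below is an assumption, since the exact interface isn't spelled out in this thread:

```bash
# Hypothetical invocation -- the real flags/arguments may differ.
# Assumes databricks-cli is already installed and configured:
#   pip install databricks-cli && databricks configure --token
cd Recommenders
./scripts/databricks_install.sh <CLUSTER_ID>
```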
@gramhagen (Collaborator) left a comment

Looks good, just a few small changes.

Also, in general I'm wondering if we should move towards Python-based scripts, which could generalize better across OSes. I wouldn't worry about it right now, but food for thought.

loomlike and others added 2 commits January 29, 2019 00:27
Handle error cases such as databricks-cli not installed, cluster id not found, etc.
SETUP.md bug fix, since databricks-cli works for more than just Azure Databricks
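
A minimal sketch of the kind of guards that commit describes, assuming a cluster-id argument; this is illustrative, not the script's actual code:

```bash
#!/usr/bin/env bash
# Fail early if the databricks CLI is missing.
if ! command -v databricks >/dev/null 2>&1; then
    echo "databricks-cli not found; install it with: pip install databricks-cli" >&2
    exit 1
fi

# Fail early if the target cluster id doesn't resolve.
CLUSTER_ID="$1"
if ! databricks clusters get --cluster-id "$CLUSTER_ID" >/dev/null 2>&1; then
    echo "Cluster '$CLUSTER_ID' not found; see 'databricks clusters list'." >&2
    exit 1
fi
```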
@gramhagen (Collaborator) left a comment

looks great, nice work

@jreynolds01 (Collaborator) commented Jan 29, 2019

Agreed with Scott re: python vs. shell.

This does a small part of the create and config notebook in the pull request here: #438.

It assumes a supported version of ADB (4.1) is running on the cluster. Using the .sh and CLI approach to do this is fine, but I would like to see that it is extensible to support the full setup, including cluster creation and potentially preparing for operationalization (which involves installing additional libraries, including the cosmosdb-spark connector).
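
Cluster creation itself is scriptable through the same CLI, e.g. as below; the spec file name is a placeholder and its contents are omitted:

```bash
# Create a cluster from a JSON spec (legacy databricks-cli).
databricks clusters create --json-file cluster_spec.json
```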

I had chosen Python rather than sh so I only had to make one REST call to install the libraries instead of multiple CLI calls, but moving those to sh is fine if we think that's easier.
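
For illustration, that single REST call would look roughly like this against the Libraries API 2.0; the library specs and DBFS path are placeholders, not the PR's actual payload:

```bash
# One POST installs any number of libraries at once.
curl -s -X POST "${DATABRICKS_HOST}/api/2.0/libraries/install" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "'"${CLUSTER_ID}"'",
        "libraries": [
          {"egg":  "dbfs:/FileStore/Recommenders.egg"},
          {"pypi": {"package": "pydocumentdb"}}
        ]
      }'
```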

Re: dependencies, it requires that zip is installed (usually that's available, but I didn't have it by default in a clean Ubuntu under Windows Subsystem for Linux). I don't know if we want to mention this or not.
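
If we do mention it, the guard is short (Debian/Ubuntu shown; the package manager varies by distro):

```bash
# Install zip if it's missing (e.g. on a clean Ubuntu under WSL).
if ! command -v zip >/dev/null 2>&1; then
    sudo apt-get update && sudo apt-get install -y zip
fi
```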

@jreynolds01 (Collaborator)

To follow up on my last comment: the question I would like feedback on is that we have a few options for how to simplify setup on Azure Databricks.

Do we want one setup script for ADB that installs only the reco_utils module, and then a second setup script that facilitates the preparation for operationalization? Or do we want one setup script that can install all dependencies end-to-end, depending on flags? Which do we think is easier for the user?
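
Concretely, the two shapes might look like this; the script names and flag are hypothetical, just to make the options visible:

```bash
# Option 1: two focused scripts
./scripts/databricks_install.sh <CLUSTER_ID>       # reco_utils only
./scripts/databricks_o16n_setup.sh <CLUSTER_ID>    # o16n dependencies

# Option 2: one script, opt-in flags
./scripts/databricks_install.sh --prepare-o16n <CLUSTER_ID>
```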

@yueguoguo (Collaborator)

> Do we want one setup script for ADB that installs only the reco_utils module, and then a second setup script that facilitates the preparation for operationalization? Or do we want one setup script that can install all dependencies end-to-end, depending on flags? Which do we think is easier for the user?

I prefer the former. It is more modular and more flexible.

@miguelgfierro (Collaborator) left a comment

Looks great. Scott's idea is interesting.

@jreynolds01 (Collaborator)

Currently testing an addition that does the second part of the installation, preparing for operationalization.

@loomlike (Collaborator, Author)

Just had a quick chat w/ @jreynolds01. We can utilize this script within the create_and_configure_(databricks)_cluster notebook, so that this script serves those who already have Databricks running and want to install our repo on it, while Jeremy's notebook serves more o16n-focused use cases.

Update to mention that the databricks installation script doesn't handle dependencies,
and set prerequisites apart from the main steps.
@loomlike (Collaborator, Author)

The last commit addresses the bash script issue when using Windows' git-bash.

@jreynolds01 (Collaborator)

@loomlike - can you do a check of the script I just added for o16n?

@jreynolds01 mentioned this pull request Jan 30, 2019
* For the [reco_utils](reco_utils) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder; if you zip directly above the Recommenders folder, it won't work.

## Prepare Azure Databricks for Operationalization
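
A sketch of the zipping step that hunk describes; the DBFS path is illustrative:

```bash
# Zip from INSIDE the Recommenders folder; zipping from the parent
# directory gives the archive the wrong layout for `import reco_utils`.
cd Recommenders
zip -r Recommenders.zip .

# Upload so clusters can attach it (path is a placeholder).
databricks fs cp Recommenders.zip dbfs:/FileStore/Recommenders.zip --overwrite
```

On the cluster side, the archive is then typically attached with `sc.addPyFile(...)` before importing reco_utils.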
Collaborator

@nikhilrj - please take a look at this SETUP.

Contributor

LGTM!

Contributor

Actually, do we even need the manual installation steps now? Maybe we should cut them out...?

@loomlike (Collaborator, Author)

@nikhilrj hmmm, good point. But basically the install script does the same thing as the manual steps... Maybe we can explain what the script does instead of presenting those contents as 'manual installation'?

@loomlike (Collaborator, Author)

Just read your comment below. I agree the SETUP is a bit long now... what are others' thoughts?

Collaborator

I think it's long, but the benefit of having all the setup in one place is worth it. I think we should provide manual information in case someone can't run the scripts for some reason.

However, there are a few different ways to do that:

  • Say something like "see the scripts, and the comments in the scripts, for manual dependencies", and include links to documentation on how to add libraries, etc. The scripts do implement all the steps, so they are in some ways self-documenting.
  • Another option would be to have a separate SETUP_MANUAL.md file and use it as an appendix of sorts, referenced from the default SETUP.md.

I think if we want to clean it up, the first bullet is probably a good way to do it. I'd say let's go ahead and merge, and keep that as a follow-up action.

@nikhilrj assigned and then unassigned himself Jan 30, 2019
@nikhilrj self-requested a review January 30, 2019 20:12
@miguelgfierro (Collaborator)

I don't understand why we are getting the error in the spark pipeline. I'll investigate tomorrow, but feel free to ignore it for this PR

@nikhilrj (Contributor) left a comment

Awesome work!


@nikhilrj (Contributor) left a comment

Looks good! My main comment is that SETUP.md is really long now. Maybe we should cut the manual installation steps out of the operationalization dependency setup?

I'm less worried about a long setup, since we have a mini setup in the top-level README.


@loomlike (Collaborator, Author)

@jreynolds01 The script looks good to me! We just need to update a paragraph in SETUP about Databricks environments to cover all of CPU, GPU, and Spark. See my comment above.

FYI, the Databricks Runtime ML includes the TensorFlow package. For non-ML CPU and GPU clusters, you can still install TF or PyTorch manually.
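
With the databricks CLI, that manual install would look roughly like this; the package list is illustrative:

```bash
# Attach TensorFlow / PyTorch as cluster libraries on a non-ML runtime.
databricks libraries install --cluster-id "$CLUSTER_ID" --pypi-package tensorflow
databricks libraries install --cluster-id "$CLUSTER_ID" --pypi-package torch
```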

@jreynolds01 merged commit ced29ce into staging Jan 31, 2019
@miguelgfierro deleted the jumin/databricks_setup_sh branch February 7, 2019 10:47