
Fix Databricks installation script and LightGBM test #1531

Merged
anargyri merged 4 commits into staging from andreas/mmlspark on Sep 28, 2021

Conversation

@anargyri (Collaborator) commented Sep 20, 2021

Description

Fixed issues arising in #1439

  • moved MMLSPARK_INFO so that the failing LightGBM test passes
  • added mmlspark-related arguments to start_or_get_spark() following Scott's suggestion
  • enclosed occurrences of start_or_get_spark() inside a check for not is_databricks() (see the sketch below)
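
A minimal sketch of the notebook pattern behind the last two bullets (the import paths and extra arguments here are illustrative and follow the review discussion below, so treat them as approximate):

# Only create a local Spark session when not running on Databricks, where a
# session already exists; pass the mmlspark coordinates explicitly.
from reco_utils.common.notebook_utils import is_databricks  # assumed module path
from reco_utils.common.spark_utils import start_or_get_spark

if not is_databricks():
    spark = start_or_get_spark(
        "ALS PySpark",
        memory="16g",
        packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"],
        repositories=["https://mvnrepository.com/artifact"],
    )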

Moreover,

  • changed databricks_install.py to install from PyPI instead of an egg file
  • fixed some Azure-related dependencies that were breaking the installation on Databricks.

Related Issues

#1439

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.


@@ -2,46 +2,34 @@
"cells": [
Collaborator

@anargyri, question about this: why are you checking if we are on Databricks?

Are you trying to make sure that the exact same notebook can work on a DSVM and in Databricks?



Collaborator Author

Yes, now the notebook won't fail on Databricks (it was failing because start_or_get_spark() fails).
This was a suggestion by Le in #1439.

Collaborator

One thing that is a little weird is that we have start_or_get_spark but then we are adding an if clause to all notebooks.

In the code of that function, at the end we have:

spark_opts.append("getOrCreate()")

So if we call start_or_get_spark on Databricks, since a Spark instance already exists, it should return it directly.

Based on that, it should be possible to remove the if in:

if not is_databricks():
    spark = start_or_get_spark("ALS PySpark", memory="16g")

and leave only:

spark = start_or_get_spark("ALS PySpark", memory="16g")

@anargyri (Collaborator Author) commented Sep 22, 2021

Actually, if you call start_or_get_spark() on Databricks, it fails.
We already had the if not is_databricks(): clause before this PR in a couple of notebooks, e.g. als_movie_o16n. I just added it to the rest.

Collaborator Author

But one alternative that may look better is to incorporate the if not is_databricks() check inside the definition of start_or_get_spark().
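
Roughly, a sketch of what that alternative could look like (just the idea, not what this PR implements):

from pyspark.sql import SparkSession

def start_or_get_spark(app_name="Sample", url="local[*]", memory="10g", **kwargs):
    # assuming is_databricks() is importable inside spark_utils
    if is_databricks():
        # On Databricks a session already exists, so just return it
        return SparkSession.builder.getOrCreate()
    # ... otherwise build the session from the arguments as before ...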

@anargyri (Collaborator Author) commented Sep 23, 2021

Hmm, I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works.
That is,

from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()

inside a cell returns the spark session as expected.

Collaborator Author

I think what I was doing was something like

%%bash
python -c "from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()"

inside a cell. This raises an error in start_or_get_spark().
I doubt though that this is a recommended way to use Databricks.

Collaborator Author

FYI @yueguoguo in case you have more experience with start_or_get_spark() on Databricks.

Collaborator

I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works

OK, this is cool. And I guess the same function also works on DSVM Spark, right? Then there is no need to add the if clause.

Collaborator Author

Right. I can remove the if not is_databricks(), except where it is needed for other reasons.

Comment on lines 14 to 26
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
MMLSPARK_REPO = "https://mvnrepository.com/artifact"

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
Collaborator

I saw the comment from @gramhagen and @yueguoguo in #1439 about adding mmlspark to the signature.

Another approach could be the following: have a generic function that can take any package, jar, etc.:

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
)

and then a second function that specifically includes mmlspark, something like:

def start_or_get_spark_with_mmlspark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
):
    # prepend the mmlspark coordinates to whatever the caller passes in
    packages_full = [MMLSPARK_PACKAGE]
    if packages is not None:
        packages_full += packages
    # same with repositories

The advantage of this is that we have a general function for Spark that can take any package, and another one explicitly for mmlspark.

What do you think @yueguoguo, @gramhagen, @anargyri? Happy to back off if you all agree we should go with the original approach.

Collaborator Author

🤔 I don't see a clear benefit from duplicating the function (the code of the two functions would be almost identical).
I think the Python philosophy of specific rather than general applies mainly when the functionalities differ significantly. In this case it's essentially the same functionality; just the repo info changes.

Collaborator

Is it not possible to add mmlspark using the current packages/repositories inputs? Just wondering if it's worth hiding that from the user and creating a library-specific option. My preference would be to let the user provide these values, and we can provide examples.

Collaborator Author

Yes, I was following your earlier suggestion, but this is another good option: have just one function and let the user define the list of packages and repositories (if I understand your suggestion correctly).

Collaborator

Right, it would be good to test whether this works:

start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])

Collaborator Author

This was already working here (the only change is that the repos argument is now a list).

Collaborator

Got it, makes sense. If you guys think that two functions (one general, one with mmlspark) are unnecessary, then I would lean more towards having just one function that can take any arbitrary input:

start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])

rather than a function with the specific mmlspark inputs:

start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
)

Collaborator Author

Ok, I will keep a single, simpler version then.

Collaborator Author

@miguelgfierro @gramhagen if the recent changes look good, could you approve the PR?

@codecov-commenter

Codecov Report

Merging #1531 (bc69bb8) into staging (09e31a5) will increase coverage by 0.01%.
The diff coverage is 84.61%.

❗ Current head bc69bb8 differs from pull request most recent head d40156c. Consider uploading reports for the commit d40156c to get more accurate results

@@             Coverage Diff             @@
##           staging    #1531      +/-   ##
===========================================
+ Coverage    62.20%   62.22%   +0.01%     
===========================================
  Files           84       84              
  Lines         8441     8449       +8     
===========================================
+ Hits          5251     5257       +6     
- Misses        3190     3192       +2     
Impacted Files                                  Coverage Δ
recommenders/utils/spark_utils.py               90.90% <84.61%> (-5.10%) ⬇️
recommenders/evaluation/spark_evaluation.py     86.66% <0.00%> (ø)
recommenders/evaluation/python_evaluation.py    93.68% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 09e31a5...d40156c.

@@ -11,25 +10,28 @@
pass # skip this import if we are in pure python environment


MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
Collaborator

Do we need these lines now?

Collaborator Author

Yes, for the mmlspark_lightgbm_criteo.ipynb notebook and databricks_install.py.
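
For reference, a hypothetical use of these constants along the lines discussed above (the actual call sites in the notebook and script may differ):

from recommenders.utils.spark_utils import start_or_get_spark, MMLSPARK_PACKAGE, MMLSPARK_REPO

spark = start_or_get_spark(packages=[MMLSPARK_PACKAGE], repositories=[MMLSPARK_REPO])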

@anargyri merged commit 546eb8d into staging on Sep 28, 2021
@miguelgfierro deleted the andreas/mmlspark branch on September 30, 2021