
Fix Databricks installation script and LightGBM test #1531

Merged
anargyri merged 4 commits into staging from andreas/mmlspark on Sep 28, 2021

Conversation

@anargyri (Collaborator) commented Sep 20, 2021

Description

Fixed issues arising in #1439

  • moved MMLSPARK_INFO so that the failing LightGBM test passes
  • added mmlspark-related arguments to start_or_get_spark() following Scott's suggestion
  • enclosed occurrences of start_or_get_spark() inside a check for not is_databricks() (see the sketch below)
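
A minimal sketch of the notebook pattern behind the last two bullets (the import paths and extra arguments here are illustrative and follow the review discussion below, so treat them as approximate):

# Only create a local Spark session when not running on Databricks, where a
# session already exists; pass the mmlspark coordinates explicitly.
from reco_utils.common.notebook_utils import is_databricks  # assumed module path
from reco_utils.common.spark_utils import start_or_get_spark

if not is_databricks():
    spark = start_or_get_spark(
        "ALS PySpark",
        memory="16g",
        packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"],
        repositories=["https://mvnrepository.com/artifact"],
    )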

Moreover,

  • changed databricks_install.py to install from PyPI instead of an egg file
  • fixed some Azure-related dependencies that were breaking the installation on Databricks.

Related Issues

#1439

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.


@@ -2,46 +2,34 @@
"cells": [
Collaborator

@anargyri, question about this: why are you checking if we are on Databricks?

Are you trying to make sure that the exact same notebook can work on a DSVM and in Databricks?



Collaborator Author

Yes, now the notebook won't fail on Databricks (it was failing because start_or_get_spark() fails).
This was a suggestion by Le in #1439.

Collaborator

One thing that is a little weird is that we have start_or_get_spark but then we are adding an if clause to all notebooks.

In the code of that function, at the end we have:

spark_opts.append("getOrCreate()")

So if we call start_or_get_spark on Databricks, since a Spark instance already exists, it should return it directly.

Based on that, it should be possible to remove the if in:

if not is_databricks():
    spark = start_or_get_spark("ALS PySpark", memory="16g")

and leave only:

spark = start_or_get_spark("ALS PySpark", memory="16g")

@anargyri (Collaborator Author) commented Sep 22, 2021

Actually, if you call start_or_get_spark() on Databricks, it fails.
We already had the if not is_databricks(): clause before this PR in a couple of notebooks, e.g. als_movie_o16n. I just added it to the rest.

Collaborator Author

But one alternative that may look better is to incorporate the if not is_databricks() check inside the definition of start_or_get_spark().
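
Roughly, a sketch of what that alternative could look like (just the idea, not what this PR implements):

from pyspark.sql import SparkSession

def start_or_get_spark(app_name="Sample", url="local[*]", memory="10g", **kwargs):
    # assuming is_databricks() is importable inside spark_utils
    if is_databricks():
        # On Databricks a session already exists, so just return it
        return SparkSession.builder.getOrCreate()
    # ... otherwise build the session from the arguments as before ...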

@anargyri (Collaborator Author) commented Sep 23, 2021

Hmm, I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works.
That is,

from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()

inside a cell returns the spark session as expected.

Collaborator Author

I think what I was doing was something like

%%bash
python -c "from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()"

inside a cell. This raises an error in start_or_get_spark().
I doubt though that this is a recommended way to use Databricks.

Collaborator Author

FYI @yueguoguo in case you have more experience with start_or_get_spark() on Databricks.

Collaborator

I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works

OK, this is cool. And I guess the same function also works on DSVM Spark, right? Then there is no need to add the if clause.

Collaborator Author

Right. I can remove the if not is_databricks(), except where it is needed for other reasons.

Comment on lines 14 to 26
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
MMLSPARK_REPO = "https://mvnrepository.com/artifact"

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
Collaborator

I saw the comment from @gramhagen and @yueguoguo in #1439 about adding mmlspark to the signature.

Another approach could be the following: have a generic function that can take any package, jar, etc.:

def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
)

and then a second function that specifically includes mmlspark, something like:

def start_or_get_spark_with_mmlspark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
):
    # prepend the mmlspark coordinates to whatever the caller passes in
    packages_full = [MMLSPARK_PACKAGE]
    if packages is not None:
        packages_full += packages
    # same with repositories

The advantage of this is that we have a general function for Spark that can take any package, and another one explicitly for mmlspark.

What do you think @yueguoguo, @gramhagen, @anargyri? Happy to back off if you all agree we should go with the original approach.

Collaborator Author

🤔 I don't see a clear benefit from duplicating the function (the code of the two functions would be almost identical).
I think the Python philosophy of specific rather than general applies mainly when the functionalities differ significantly. In this case it's essentially the same functionality; just the repo info changes.

Collaborator

Is it not possible to add mmlspark using the current packages/repositories inputs? Just wondering if it's worth hiding that from the user and creating a library-specific option. My preference would be to let the user provide these values, and we can provide examples.

Collaborator Author

Yes, I was following your earlier suggestion, but this is another good option: have just one function and let the user define the list of packages and repositories (if I understand your suggestion correctly).

Collaborator

Right, it would be good to test whether this works:

start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])

Collaborator Author

This was already working here (the only change is that the repos argument is now a list).

Collaborator

Got it, makes sense. If you guys think that two functions (one general, one with mmlspark) are unnecessary, then I would lean more towards having just one function that can take any arbitrary input:

start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])

rather than a function with the specific mmlspark inputs:

start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
)

Collaborator Author

Ok, I will keep a single, simpler version then.

Collaborator Author

@miguelgfierro @gramhagen if the recent changes look good, could you approve the PR?

@codecov-commenter

Codecov Report

Merging #1531 (bc69bb8) into staging (09e31a5) will increase coverage by 0.01%.
The diff coverage is 84.61%.

❗ Current head bc69bb8 differs from pull request most recent head d40156c. Consider uploading reports for the commit d40156c to get more accurate results

@@             Coverage Diff             @@
##           staging    #1531      +/-   ##
===========================================
+ Coverage    62.20%   62.22%   +0.01%     
===========================================
  Files           84       84              
  Lines         8441     8449       +8     
===========================================
+ Hits          5251     5257       +6     
- Misses        3190     3192       +2     
Impacted Files                                  Coverage Δ
recommenders/utils/spark_utils.py               90.90% <84.61%> (-5.10%) ⬇️
recommenders/evaluation/spark_evaluation.py     86.66% <0.00%> (ø)
recommenders/evaluation/python_evaluation.py    93.68% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 09e31a5...d40156c.

@@ -11,25 +10,28 @@
pass # skip this import if we are in pure python environment


MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
Collaborator

Do we need these lines now?

Collaborator Author

Yes, for the mmlspark_lightgbm_criteo.ipynb notebook and databricks_install.py.
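
For reference, a hypothetical use of these constants along the lines discussed above (the actual call sites in the notebook and script may differ):

from recommenders.utils.spark_utils import start_or_get_spark, MMLSPARK_PACKAGE, MMLSPARK_REPO

spark = start_or_get_spark(packages=[MMLSPARK_PACKAGE], repositories=[MMLSPARK_REPO])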

@anargyri merged commit 546eb8d into staging on Sep 28, 2021
@miguelgfierro deleted the andreas/mmlspark branch on September 30, 2021