Fix Databricks installation script and LightGBM test #1531
Conversation
@@ -2,46 +2,34 @@
"cells": [
Yes, now the notebook won't fail on Databricks (it was failing because start_or_get_spark() fails). This was a suggestion by Le here: #1439
One thing that is a little bit weird is that we have start_or_get_spark, but then we are adding an if clause to all notebooks. In the code of that function, at the end we have:

spark_opts.append("getOrCreate()")

so if we call start_or_get_spark on Databricks, since a Spark instance already exists, it should return it directly. Based on that, it should be possible to remove the if in:

if not is_databricks():
    spark = start_or_get_spark("ALS PySpark", memory="16g")

and leave only:

spark = start_or_get_spark("ALS PySpark", memory="16g")
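For context, a minimal sketch of that idea, assuming start_or_get_spark() ultimately builds the session with SparkSession.builder and ends in getOrCreate() (as the spark_opts.append("getOrCreate()") line suggests); getOrCreate() attaches to an already-running session when one exists, which is why the existing Databricks session should be returned directly:

from pyspark.sql import SparkSession

# Illustrative only: getOrCreate() returns the active session when one already
# exists (as on Databricks); otherwise it builds a new session with these options.
spark = (
    SparkSession.builder
    .appName("ALS PySpark")
    .master("local[*]")
    .config("spark.driver.memory", "16g")
    .getOrCreate()
)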
Actually, if you call start_or_get_spark() on Databricks it fails. We already had the if not is_databricks(): clause before this PR in a couple of notebooks, e.g. als_movie_o16n; I just added it to the rest.
But one alternative that may look better is to incorporate the if not is_databricks() check inside the definition of start_or_get_spark().
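A rough sketch of that alternative (names and import paths are illustrative, and it assumes is_databricks() is available from the repo's notebook utilities; not the actual implementation):

from pyspark.sql import SparkSession

def start_or_get_spark(app_name="Sample", url="local[*]", memory="10g", **kwargs):
    # Sketch: on Databricks, simply return the session the platform already provides.
    # is_databricks() is assumed importable from the repo utilities.
    if is_databricks():
        return SparkSession.builder.getOrCreate()
    # ...otherwise build the session with the custom options, as the function does today...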
Hmm, I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works. That is,

from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()

inside a cell returns the spark session as expected.
I think what I was doing was something like

%%bash
python -c "from reco_utils.common.spark_utils import start_or_get_spark
start_or_get_spark()"

inside a cell. This raises an error in start_or_get_spark(). I doubt, though, that this is a recommended way to use Databricks.
FYI @yueguoguo in case you have more experience with start_or_get_spark() on Databricks.
> I just tried start_or_get_spark() on ADB with the PyPI recommenders package and it works

Ok, this is cool, and I guess the same function also works on DSVM Spark, right? Then there is no need to add the if clause.
Right. I can remove the if not is_databricks(), except where it is needed for other reasons.
recommenders/utils/spark_utils.py (outdated diff)
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
MMLSPARK_REPO = "https://mvnrepository.com/artifact"


def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
I saw the comment of @gramhagen and @yueguoguo here: #1439 about adding mmlspark in the signature.
Another approach could be the following: a generic function that can take any package, jar, etc.:
def start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
)
and then a second function that includes specifically mmlspark, something like:
def start_or_get_spark_with_mmlspark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repositories=None
):
    packages_full = [MMLSPARK_PACKAGE]
    if packages is not None:
        packages_full += packages
    # same with repositories, then delegate to the generic start_or_get_spark()
The advantage of this is that we have a generic function for Spark that can take any package, and another one explicitly for mmlspark.
What do you think @yueguoguo, @gramhagen, @anargyri? Happy to back off if you all agree we should go with the original approach.
🤔 I don't see a clear benefit from duplicating the function (the code of the two functions will be almost identical).
I think the Python philosophy of specific rather than general applies mainly when the functionalities differ significantly. In this case it's essentially the same functionality; just the repo info changes.
Is it not possible to add mmlspark using the current packages/repositories inputs? Just wondering if it's worth hiding that from the user and creating a library-specific option. My preference would be to let the user provide these values and we can provide examples.
Yes, I was following your earlier suggestion, but this is another good option: have just one function and let the user define the list of packages and repositories (if I understand correctly what you suggest).
Right, it would be good to test if this works:
start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])
This was already working here (the only change is that the repos argument is now a list).
Got it, makes sense. If you guys think that two functions (one general, one with mmlspark) are unnecessary, then I would lean more towards having just one function that can take any arbitrary input:

start_or_get_spark(packages=["com.microsoft.ml.spark:mmlspark_2.11:0.18.1"], repositories=["https://mvnrepository.com/artifact"])

rather than a function with the specific mmlspark inputs:
start_or_get_spark(
    app_name="Sample",
    url="local[*]",
    memory="10g",
    config=None,
    packages=None,
    jars=None,
    repository=None,
    repositories=None,
    mmlspark_package=None,
    mmlspark_repository=None
)
Ok, I will keep a single, simpler version then.
@miguelgfierro @gramhagen if the recent changes look good, could you approve the PR?
Force-pushed from bc69bb8 to b14b755
Codecov Report

@@            Coverage Diff             @@
##           staging    #1531     +/-   ##
===========================================
+ Coverage    62.20%   62.22%   +0.01%
===========================================
  Files           84       84
  Lines         8441     8449       +8
===========================================
+ Hits          5251     5257       +6
- Misses        3190     3192       +2

Continue to review full report at Codecov.
@@ -11,25 +10,28 @@
pass  # skip this import if we are in pure python environment


MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
Do we need these lines now?
Yes, for the mmlspark_lightgbm_criteo.ipynb notebook and databricks_install.py
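For illustration, this is roughly how the constant would be consumed through the generic packages/repositories arguments discussed above (a sketch, assuming the MMLSPARK_PACKAGE and MMLSPARK_REPO constants are exported from spark_utils; the actual notebook code may differ):

from recommenders.utils.spark_utils import start_or_get_spark, MMLSPARK_PACKAGE, MMLSPARK_REPO

# Hypothetical usage in mmlspark_lightgbm_criteo.ipynb: pass the MMLSpark
# coordinates through the generic packages/repositories arguments.
spark = start_or_get_spark(
    app_name="LightGBM Criteo",
    packages=[MMLSPARK_PACKAGE],
    repositories=[MMLSPARK_REPO],
)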
Description

Fixed issues arising in #1439:
- Simplified start_or_get_spark() following Scott's suggestion
- Wrapped start_or_get_spark() inside a check for not is_databricks()
- Moreover, updated databricks_install.py to install from PyPI instead of the egg file (a sketch of the library spec is shown below)
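As a sketch of the PyPI change, the Databricks Libraries API accepts a library entry of the following form (illustrative payload only; the package name/extras and the script's exact call may differ):

# Library entry for installing a package from PyPI on a Databricks cluster,
# replacing the previous egg-file entry.
libraries = [{"pypi": {"package": "recommenders"}}]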
Related Issues

#1439

Checklist:
- Changes are merged to the staging branch and not to the main branch.