[BUG] Move MMLSPARK info from tools to pip installable package #1439

Closed
anargyri opened this issue Jun 14, 2021 · 6 comments

@anargyri
Collaborator

Description

pytest tests/unit/examples/test_notebooks_pyspark.py requires info from the tools/ directory:

E           papermill.exceptions.PapermillExecutionError:
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [1]":
E           ---------------------------------------------------------------------------
E           ModuleNotFoundError                       Traceback (most recent call last)
E           <ipython-input-1-a0f1f3501b03> in <module>
E                17 if not is_databricks():
E                18     # get the maven coordinates for MML Spark from databricks_install script
E           ---> 19     from tools.databricks_install import MMLSPARK_INFO
E                20     packages = [MMLSPARK_INFO["maven"]["coordinates"]]
E                21     repo = MMLSPARK_INFO["maven"].get("repo")
E
E           ModuleNotFoundError: No module named 'tools'

/anaconda/envs/reco/lib/python3.6/site-packages/papermill/execute.py:234: PapermillExecutionError

We could move the required MMLSPARK_INFO from tools/ into a definition that lives inside the pip-installable package.
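
One possible shape for that, as a rough sketch only (the module path follows the discussion below; the coordinates and repository URL are placeholders, not values decided in this issue):

# Hypothetical sketch: MMLSPARK_INFO defined as a module-level constant inside the
# pip package (e.g. in reco_utils/common/spark_utils.py), mirroring the structure
# the notebook expects: maven coordinates plus an optional repository.
MMLSPARK_INFO = {
    "maven": {
        "coordinates": "com.microsoft.ml.spark:mmlspark_2.11:<version>",  # placeholder
        "repo": "https://example.org/maven",  # placeholder repository URL
    }
}

# The failing notebook cell could then import it from the installed package:
#     from reco_utils.common.spark_utils import MMLSPARK_INFO
#     packages = [MMLSPARK_INFO["maven"]["coordinates"]]
#     repo = MMLSPARK_INFO["maven"].get("repo")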

In which platform does it happen?

Any platform

How do we replicate the issue?

  • Create a conda / virtual environment
  • pip install ms-recommenders[examples,spark]
  • Run unit test tests/unit/examples/test_notebooks_pyspark.py with pytest -k criteo

Expected behavior (i.e. solution)

The tests should pass even if the tools/ directory is not present.

Other Comments

anargyri added the bug (Something isn't working) label Jun 14, 2021
anargyri self-assigned this Jun 14, 2021
@gramhagen
Collaborator

It would be nice to have the mmlspark version set in setup.py with the rest of the dependencies, but I think that would mean setup.py would have to modify a separate file at installation (reco_utils/common/spark_utils.py?) to inject the location of the Spark package & Maven repo. Alternatively, we could just put that information directly into spark_utils and add a note in setup.py that anyone who wants to change the default version of mmlspark should modify spark_utils.py?

Were you thinking that there would be an additional flag in start_or_get_spark() that would add mmlspark, or just that we store the info and leave it up to the user to retrieve it and pass it to start_or_get_spark() using the packages and repositories inputs?
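
A hypothetical sketch of the second option, assuming MMLSPARK_INFO ends up in reco_utils/common/spark_utils.py as discussed below, with the caller forwarding the info explicitly:

# Second option (sketch): the package only stores MMLSPARK_INFO and the user
# retrieves it and passes it through the existing packages/repositories inputs.
from reco_utils.common.spark_utils import MMLSPARK_INFO, start_or_get_spark

packages = [MMLSPARK_INFO["maven"]["coordinates"]]
repo = MMLSPARK_INFO["maven"].get("repo")
spark = start_or_get_spark(
    packages=packages,
    repositories=[repo] if repo else None,
)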

@anargyri
Collaborator Author

I am open to suggestions. I was thinking of defining MMLSPARK_INFO somewhere in reco_utils/common/spark_utils.py, from which databricks_install.py would import it; it would then also be available from the pip package.

I think setup.py doesn't need to be aware of MMLSPARK_INFO (after all, it is irrelevant to Python). It is something the user could change independently of the pip package if they want to test another mmlspark version.

I don't know about a flag for mmlspark; it may be a nice idea. Wouldn't it conflict with other packages the user may want to install?

@yueguoguo @miguelgfierro @loomlike any comments?

@gramhagen
Collaborator

Makes sense. For the flag, I was thinking we could do something like this at the beginning of get_or_start_spark:

def get_or_start_spark(..., mmlspark=False):
    # work on copies so the caller's packages/repositories lists are not mutated
    new_packages = packages.copy() if isinstance(packages, list) else None
    new_repos = repositories.copy() if isinstance(repositories, list) else None
    if mmlspark:
        if new_packages is None:
            new_packages = []
        new_packages.append(MMLSPARK_PACKAGE)
        if new_repos is None:
            new_repos = []
        new_repos.append(MMLSPARK_REPO)
    ...
    # build the spark-submit arguments from whatever ended up in the lists
    if new_packages is not None:
        submit_args = "--packages {} ".format(",".join(new_packages))
    if jars is not None:
        submit_args += "--jars {} ".format(",".join(jars))
    if new_repos is not None:
        submit_args += "--repositories {}".format(",".join(new_repos))
    ...
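
A hypothetical call showing how a user would opt in with the flag above (the other arguments are only illustrative):

# Usage sketch for the mmlspark flag; any packages/repositories the caller passes
# would still be included alongside the MMLSpark ones.
spark = get_or_start_spark(app_name="sample-app", mmlspark=True)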

@yueguoguo
Collaborator

Vote for adding mmlspark in the Spark creation step. It makes less sense to put this very Spark-specific info into setup.py, which is generic to the package. One thing to note is that get_or_start_spark may not work nicely with Databricks, which has a different scheme for Spark config (UI-based). Maybe we want to make this clear somewhere.

@anargyri
Collaborator Author

> Vote for adding mmlspark in the Spark creation step. It makes less sense to put this very Spark-specific info into setup.py, which is generic to the package. One thing to note is that get_or_start_spark may not work nicely with Databricks, which has a different scheme for Spark config (UI-based). Maybe we want to make this clear somewhere.

I will wrap any calls in the notebooks inside an is_databricks() check (we already do this in a couple of notebooks).
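
For reference, a sketch of that guard, following the notebook cell from the traceback with the import pointing at the proposed package location:

if not is_databricks():
    # On Databricks the MMLSpark library is attached to the cluster separately,
    # so the maven coordinates are only needed when we create the session ourselves.
    from reco_utils.common.spark_utils import MMLSPARK_INFO  # proposed location (hypothetical)
    packages = [MMLSPARK_INFO["maven"]["coordinates"]]
    repo = MMLSPARK_INFO["maven"].get("repo")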

@anargyri
Collaborator Author

Addressed in #1531
