[BUG] Move MMLSPARK info from tools to pip installable package #1439

Closed
anargyri opened this issue Jun 14, 2021 · 6 comments

@anargyri
Collaborator

Description

pytest tests/unit/examples/test_notebooks_pyspark.py requires info from the tools/ directory:

E           papermill.exceptions.PapermillExecutionError:
E           ---------------------------------------------------------------------------
E           Exception encountered at "In [1]":
E           ---------------------------------------------------------------------------
E           ModuleNotFoundError                       Traceback (most recent call last)
E           <ipython-input-1-a0f1f3501b03> in <module>
E                17 if not is_databricks():
E                18     # get the maven coordinates for MML Spark from databricks_install script
E           ---> 19     from tools.databricks_install import MMLSPARK_INFO
E                20     packages = [MMLSPARK_INFO["maven"]["coordinates"]]
E                21     repo = MMLSPARK_INFO["maven"].get("repo")
E
E           ModuleNotFoundError: No module named 'tools'

/anaconda/envs/reco/lib/python3.6/site-packages/papermill/execute.py:234: PapermillExecutionError

We could move the required MMLSPARK_INFO from tools/ into a definition that lives inside the pip-installable package.
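
One possible shape for that, as a rough sketch only (the module path follows the discussion below; the coordinates and repository URL are placeholders, not values decided in this issue):

# Hypothetical sketch: MMLSPARK_INFO defined as a module-level constant inside the
# pip package (e.g. in reco_utils/common/spark_utils.py), mirroring the structure
# the notebook expects: maven coordinates plus an optional repository.
MMLSPARK_INFO = {
    "maven": {
        "coordinates": "com.microsoft.ml.spark:mmlspark_2.11:<version>",  # placeholder
        "repo": "https://example.org/maven",  # placeholder repository URL
    }
}

# The failing notebook cell could then import it from the installed package:
#     from reco_utils.common.spark_utils import MMLSPARK_INFO
#     packages = [MMLSPARK_INFO["maven"]["coordinates"]]
#     repo = MMLSPARK_INFO["maven"].get("repo")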

In which platform does it happen?

Any platform

How do we replicate the issue?

  • Create a conda / virtual environment
  • pip install ms-recommenders[examples,spark]
  • Run unit test tests/unit/examples/test_notebooks_pyspark.py with pytest -k criteo

Expected behavior (i.e. solution)

The tests should pass even if the tools/ directory is not present.

Other Comments

anargyri added the bug (Something isn't working) label Jun 14, 2021
anargyri self-assigned this Jun 14, 2021
@gramhagen
Collaborator

It would be nice to have the mmlspark version set in setup.py with the rest of the dependencies, but I think that would mean setup.py would have to modify a separate file at installation (reco_utils/common/spark_utils.py?) to inject the location of the Spark package & Maven repo. Alternatively, we could just put that information directly into spark_utils and add a note in setup.py that anyone who wants to change the default version of mmlspark should modify spark_utils.py?

Were you thinking that there would be an additional flag in start_or_get_spark() that would add mmlspark, or just that we store the info and leave it up to the user to retrieve it and pass it to start_or_get_spark() using the packages and repositories inputs?
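
A hypothetical sketch of the second option, assuming MMLSPARK_INFO ends up in reco_utils/common/spark_utils.py as discussed below, with the caller forwarding the info explicitly:

# Second option (sketch): the package only stores MMLSPARK_INFO and the user
# retrieves it and passes it through the existing packages/repositories inputs.
from reco_utils.common.spark_utils import MMLSPARK_INFO, start_or_get_spark

packages = [MMLSPARK_INFO["maven"]["coordinates"]]
repo = MMLSPARK_INFO["maven"].get("repo")
spark = start_or_get_spark(
    packages=packages,
    repositories=[repo] if repo else None,
)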

@anargyri
Collaborator Author

I am open to suggestions. I was thinking of defining MMLSPARK_INFO somewhere in reco_utils/common/spark_utils.py, from which databricks_install.py would import it; it would then also be available from the pip package.

I think setup.py doesn't need to be aware of MMLSPARK_INFO (after all, it is irrelevant to Python). It is something the user could change independently of the pip package if they want to test another mmlspark version.

I don't know about a flag for mmlspark; it may be a nice idea. Wouldn't it conflict with other packages the user may want to install?

@yueguoguo @miguelgfierro @loomlike any comments?

@gramhagen
Collaborator

Makes sense. For the flag, I was thinking we could do something like this at the beginning of get_or_start_spark:

def get_or_start_spark(..., mmlspark=False):
    # work on copies so the caller's packages/repositories lists are not mutated
    new_packages = packages.copy() if isinstance(packages, list) else None
    new_repos = repositories.copy() if isinstance(repositories, list) else None
    if mmlspark:
        if new_packages is None:
            new_packages = []
        new_packages.append(MMLSPARK_PACKAGE)
        if new_repos is None:
            new_repos = []
        new_repos.append(MMLSPARK_REPO)
    ...
    # build the spark-submit arguments from whatever ended up in the lists
    if new_packages is not None:
        submit_args = "--packages {} ".format(",".join(new_packages))
    if jars is not None:
        submit_args += "--jars {} ".format(",".join(jars))
    if new_repos is not None:
        submit_args += "--repositories {}".format(",".join(new_repos))
    ...
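
A hypothetical call showing how a user would opt in with the flag above (the other arguments are only illustrative):

# Usage sketch for the mmlspark flag; any packages/repositories the caller passes
# would still be included alongside the MMLSpark ones.
spark = get_or_start_spark(app_name="sample-app", mmlspark=True)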

@yueguoguo
Collaborator

Vote for adding mmlspark in the Spark creation step. It makes less sense to put this very Spark-specific info into setup.py, which is generic to the package. One thing to note is that get_or_start_spark may not work nicely with Databricks, which has a different scheme for Spark config (UI-based). Maybe we want to make this clear somewhere.

@anargyri
Collaborator Author

> Vote for adding mmlspark in the Spark creation step. It makes less sense to put this very Spark-specific info into setup.py, which is generic to the package. One thing to note is that get_or_start_spark may not work nicely with Databricks, which has a different scheme for Spark config (UI-based). Maybe we want to make this clear somewhere.

I will wrap any calls in the notebooks inside an is_databricks() check (we already do this in a couple of notebooks).
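
For reference, a sketch of that guard, following the notebook cell from the traceback with the import pointing at the proposed package location:

if not is_databricks():
    # On Databricks the MMLSpark library is attached to the cluster separately,
    # so the maven coordinates are only needed when we create the session ourselves.
    from reco_utils.common.spark_utils import MMLSPARK_INFO  # proposed location (hypothetical)
    packages = [MMLSPARK_INFO["maven"]["coordinates"]]
    repo = MMLSPARK_INFO["maven"].get("repo")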

@anargyri
Collaborator Author

Addressed in #1531
