
Use spark3 by default #1549

Merged: laserprec merged 7 commits into staging from laserprec/ci-runs-spark3 on Oct 19, 2021
Conversation

@laserprec (Contributor) commented Oct 14, 2021

Description

  1. Set pyspark dependency to >=3.0.0, <4.0.0
    • Updated to >=2.4.5, <3.2.0 (see the sketch below)
  2. Update CIs to run the build on spark3
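
For illustration, a minimal sketch of how such a pin could be expressed in a setuptools-based setup.py. This is an assumption for readability, not the repository's actual file; the extra name "spark" and the surrounding metadata are placeholders.

    # Illustrative sketch only; the real setup.py in this repository differs.
    from setuptools import setup, find_packages

    setup(
        name="recommenders",
        packages=find_packages(),
        extras_require={
            # Spark extra pinned to the range discussed in this PR (extra name assumed).
            "spark": ["pyspark>=2.4.5,<3.2.0"],
        },
    )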

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@laserprec changed the base branch from main to staging on October 14, 2021, 17:10
@codecov-commenter commented Oct 14, 2021

Codecov Report

Merging #1549 (ce81bdd) into staging (2a426d7) will increase coverage by 62.07%.
The diff coverage is n/a.


@@             Coverage Diff              @@
##           staging    #1549       +/-   ##
============================================
+ Coverage     0.00%   62.07%   +62.07%     
============================================
  Files           84       84               
  Lines         8492     8492               
============================================
+ Hits             0     5271     +5271     
- Misses           0     3221     +3221     
Flag      Coverage Δ
nightly   ?
pr-gate   62.07% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
recommenders/datasets/mind.py 0.00% <0.00%> (ø)
recommenders/datasets/movielens.py 68.58% <0.00%> (+68.58%) ⬆️
recommenders/datasets/download_utils.py 90.19% <0.00%> (+90.19%) ⬆️
recommenders/models/geoimc/geoimc_data.py 41.66% <0.00%> (+41.66%) ⬆️
recommenders/models/newsrec/models/npa.py 95.71% <0.00%> (+95.71%) ⬆️
recommenders/models/newsrec/models/naml.py 92.50% <0.00%> (+92.50%) ⬆️
recommenders/models/newsrec/models/nrms.py 91.80% <0.00%> (+91.80%) ⬆️
recommenders/models/newsrec/models/lstur.py 87.50% <0.00%> (+87.50%) ⬆️
recommenders/evaluation/python_evaluation.py 93.68% <0.00%> (+93.68%) ⬆️
recommenders/models/newsrec/newsrec_utils.py 70.00% <0.00%> (+70.00%) ⬆️
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2a426d7...ce81bdd.

@anargyri (Collaborator)

@laserprec @miguelgfierro I wonder whether it is possible to specify a custom command line flag that tells pip install to install v2 instead of v3 (I don't mean using an additional extras option), in case we would like to allow v2 as an option for users.
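
For context, pip has no built-in per-package version flag, but one possible workaround (purely a sketch, not something this PR implements) is to let setup.py pick the pyspark pin from an environment variable at build time. The variable name RECOMMENDERS_SPARK2 below is hypothetical, and the trick only works for source installs, since a published wheel bakes in whatever pin was used when it was built.

    # Hypothetical sketch: choose the pyspark pin via an environment variable.
    # RECOMMENDERS_SPARK2 is an invented name, not an existing option.
    import os
    from setuptools import setup, find_packages

    spark_requirement = (
        "pyspark>=2.4.5,<3.0.0"  # opt in to Spark 2
        if os.environ.get("RECOMMENDERS_SPARK2") == "1"
        else "pyspark>=3.0.0,<3.2.0"  # default: Spark 3
    )

    setup(
        name="recommenders",
        packages=find_packages(),
        extras_require={"spark": [spark_requirement]},
    )

A user would then run something like RECOMMENDERS_SPARK2=1 pip install ".[spark]" from a source checkout. A simpler option that needs no packaging change is to pass an explicit constraint alongside the package, e.g. pip install recommenders[spark] "pyspark<3.0.0", since pip resolves both requirements together.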

@gramhagen (Collaborator)

> @laserprec @miguelgfierro I wonder whether it is possible to specify a custom command line flag that tells pip install to install v2 instead of v3 (I don't mean using an additional extras option), in case we would like to allow v2 as an option for users.

I don't know of an approach other than using extras. Was the issue with this change that version 3 is not installed by default?

@anargyri (Collaborator)

Yes, sometimes pip installed v3 (when I tried on a DSVM) and other times v2 (when Jianjie tried on Windows). I think we would like to install v3 by default (since it's the version we mainly support), but maybe we would like to leave v2 as an option (although I don't see any use case where v2 is a strong requirement).

@gramhagen (Collaborator)

> Yes, sometimes pip installed v3 (when I tried on a DSVM) and other times v2 (when Jianjie tried on Windows). I think we would like to install v3 by default (since it's the version we mainly support), but maybe we would like to leave v2 as an option (although I don't see any use case where v2 is a strong requirement).

@laserprec do you know why that happens? pip on Windows should by default install the latest pyspark release (3.2.0); this is what happens on my local machine when I do pip install "pyspark>=2.4.5,<4.0.0". Curious whether there is some configuration or package dependency that is restricting the version used?

@laserprec (Contributor, Author) commented Oct 18, 2021

> @laserprec do you know why that happens? pip on Windows should by default install the latest pyspark release (3.2.0); this is what happens on my local machine when I do pip install "pyspark>=2.4.5,<4.0.0". Curious whether there is some configuration or package dependency that is restricting the version used?

Ah, looking into this, it turns out my local environment was corrupted with a global dependency that pins pyspark to 2 😅. After removing it, pip install "pyspark>=2.4.5,<4.0.0" installs pyspark==3.1.2 on my machine. This is also consistent with the tox installation in our CIs.

I think the question I have with keeping the dependency anchored to the range pyspark>=2.4.5,<4.0.0 is whether we should run our tests on both spark3 and spark2 in our CIs, or keep it simple and run everything on spark3 only. If the team wants to continue carrying spark2 support into our future releases, I think it's best to have the CIs run tests on spark2 to ensure its reliability.

Like @gramhagen mentioned, there isn't a good way to switch between spark3 and spark2 unless we introduce a new extra, recommenders[spark2]. My motivation for this PR is to do a hard upgrade to spark3 to simplify our pipelines and to recommend that users install recommenders==0.7.0 if they still need spark 2.
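
If the team did decide to keep a spark2 job alongside spark3, one lightweight way to keep a single test suite working across both (an illustrative sketch, not code from this repository) is to gate version-specific tests on the installed pyspark major version:

    # Hypothetical test helper: skip Spark-3-only tests when the CI job
    # installed pyspark 2.x, so one suite can serve both pipelines.
    import pyspark
    import pytest

    SPARK_MAJOR = int(pyspark.__version__.split(".")[0])

    spark3_only = pytest.mark.skipif(
        SPARK_MAJOR < 3, reason="requires pyspark>=3.0.0"
    )

    @spark3_only
    def test_some_spark3_only_behavior():
        ...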

@anargyri (Collaborator)

> @laserprec do you know why that happens? pip on Windows should by default install the latest pyspark release (3.2.0); this is what happens on my local machine when I do pip install "pyspark>=2.4.5,<4.0.0". Curious whether there is some configuration or package dependency that is restricting the version used?

> Ah, looking into this, it turns out my local environment was corrupted with a global dependency that pins pyspark to 2 😅. After removing it, pip install "pyspark>=2.4.5,<4.0.0" installs pyspark==3.1.2 on my machine. This is also consistent with the tox installation in our CIs.

> I think the question I have with keeping the dependency anchored to the range pyspark>=2.4.5,<4.0.0 is whether we should run our tests on both spark3 and spark2 in our CIs, or keep it simple and run everything on spark3 only. Not sure if the team wants to continue carrying spark2 support into our future releases.

> Like @gramhagen mentioned, there isn't a good way to switch between spark3 and spark2 unless we introduce a new extra, recommenders[spark2]. My motivation for this PR is to do a hard upgrade to spark3 to simplify our pipelines and to recommend that users install recommenders==0.6.0 if they still need spark 2.

So if you follow the installation process as in the README in a clean environment, do you get version 3?

@laserprec (Contributor, Author) commented Oct 18, 2021

> So if you follow the installation process as in the README in a clean environment, do you get version 3?

I reran the installation in a clean docker image; we get the latest pyspark version with pyspark>=2.4.5, <4.0.0.

FYI @anargyri, I think our smoke tests failed on pyspark==3.2.0. For more detail see: https://github.com/microsoft/recommenders/runs/3928935302?check_suite_focus=true

No issue running on pyspark==3.1.2:
https://github.com/microsoft/recommenders/runs/3897543705?check_suite_focus=true

3.2.0 was released 9 hours ago 😄.

Should we also adjust the upper bound to <3.2.0 instead of <4.0.0?
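
For reference, one quick way to confirm which pyspark a fresh environment actually resolved to (an illustrative check, run after pip install "pyspark>=2.4.5,<3.2.0" or whichever range is under discussion):

    # Print the pyspark version that pip resolved in the current environment.
    import pyspark

    print(pyspark.__version__)  # 3.1.2 was the newest release satisfying <3.2.0 at the time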

@anargyri (Collaborator)

:-> Yes, I agree, let's do <3.2.0
And >= 2.4.5 if everyone else agrees.

@laserprec (Contributor, Author) commented Oct 19, 2021

> :-> Yes, I agree, let's do <3.2.0

Sounds good. I can apply this change.

I think we have a consensus on not introducing an extra dependency solely for spark2 support (e.g. pip install recommenders[spark2]). @gramhagen and @miguelgfierro, please let us know if you have other thoughts. If there are no objections, we will run our pipelines on Java 11 + pyspark v3.1.2 (the default version that satisfies >=2.4.5, <3.2.0) and drop support for spark 2 and Java 8 in the pipeline.

> And >= 2.4.5 if everyone else agrees.

I think eventually we want to update the lower bound to pyspark>=3.0.0, but since we will incorporate many version upgrades in our next release, we can be flexible with spark2 support and see how the community responds. Perhaps two releases from now, we can tighten the lower bound from pyspark>=2.4.5 to pyspark>=3.0.0. Does that sound good, @anargyri?

@laserprec linked an issue on Oct 19, 2021 that may be closed by this pull request
@gramhagen (Collaborator)

> :-> Yes, I agree, let's do <3.2.0

> Sounds good. I can apply this change.

> I think we have a consensus on not introducing an extra dependency solely for spark2 support (e.g. pip install recommenders[spark2]). @gramhagen and @miguelgfierro, please let us know if you have other thoughts. If there are no objections, we will run our pipelines on Java 11 + pyspark v3.1.2 (the default version that satisfies >=2.4.5, <3.2.0) and drop support for spark 2 and Java 8 in the pipeline.

> And >= 2.4.5 if everyone else agrees.

2.4.5<=pyspark<3.2.0 is fine with me. Can we add an issue capturing what needs to change to support >=3.2.0?

@miguelgfierro (Collaborator) left a comment

LGTM

@laserprec (Contributor, Author)

> Can we add an issue capturing what needs to change to support >=3.2.0?

Done. Please see #1553. The errors are accessible in the pipeline history and are linked in the issue.

@laserprec merged commit d867e87 into staging on Oct 19, 2021
@miguelgfierro deleted the laserprec/ci-runs-spark3 branch on February 7, 2022, 10:04
Successfully merging this pull request may close these issues:
[BUG] Build failed on Pyspark 3.2.0