
Feature: Add support for Python 3.12, latest versions of Dask #127

Merged
merged 7 commits into main from feature/python-312 on Oct 16, 2024

Conversation

johncmerfeld
Contributor

@johncmerfeld johncmerfeld commented Sep 12, 2024

Pertaining to this Monday item. Two main ideas here:

  1. In order to upgrade to Python 3.12, we need to upgrade Dask. Dask now uses a query optimizer that does not work with Earthmover. Fortunately we can turn it off with a config setting (see the sketch after this list), which allows us to use the latest version of Dask and pin our dependency to ~=2024.8.0. That in turn lets us pin our pandas dependency to ~=2.2.0 and resume receiving updates. A future Dask version may force us to use the query optimizer, but hopefully by then the bugs that emerge when we use it with Earthmover will have been fixed.
  2. In order to continue supporting Python 3.9 and below, we need to specify conditional dependencies in our requirements.txt (example after the list).
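
A minimal sketch of turning the optimizer off, assuming the dask.config route (the setting has to land before dask.dataframe is first imported; the environment-variable spelling is equivalent):

import dask

# Disable Dask's new expression-based query planner; equivalently,
# set DASK_DATAFRAME__QUERY_PLANNING=False in the environment.
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd  # safe to import now that the setting is in place

And for the conditional dependencies, requirements.txt can branch on interpreter version using PEP 508 environment markers. These pins are illustrative, not the actual ones in this PR:

# Newer interpreters get the current pins
pandas~=2.2.0; python_version >= "3.9"
dask[dataframe]~=2024.8.0; python_version >= "3.9"
# Python 3.8 stays on the last releases that support it
pandas~=2.0.3; python_version < "3.9"
dask[dataframe]~=2023.5.0; python_version < "3.9"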

Personally I think we should consider "deprecating" support for Python 3.9 (and especially 3.8, which is at end-of-life) and removing them in the next major earthmover release - i.e., with earthmover 3.5.0, we would require users to be on Python 3.10 or later. I think this is fair game and comparable to what other tools do, but it also might be self-defeating for our purposes 🤷‍♂️

Testing

  • earthmover -t (Python 3.8-3.12, MacOS and Ubuntu 22.04)
  • example_projects run_all.sh (Python 3.8-3.12, MacOS and Ubuntu)
  • CogAT assessment (full data) (Python 3.8-3.12, Ubuntu)
    • In this assessment (but no others so far) I've noticed a significant decrease in performance with the later versions (20-25% slower). Dask also no longer prints the stream of warnings about fragmented dataframes that it used to, so there's probably some specific operation in the CogAT bundle that is materially different, and less efficient, in the newer versions of Dask.
  • SC Ready (10k row synthetic) (Python 3.8-3.12, MacOS)
  • MAP Growth (12k row synthetic) (Python 3.8-3.12, MacOS)

Anyone else is welcome to do their own testing -- trying a variety of bundles, data volumes, and Python versions* would be great. At this point I feel confident about what I have here.

*The easiest way I've found to do this is to use pyenv to set the global Python version, then use venv to create an environment that uses that version (named e.g. py39). After that, any time you activate that environment, it'll use that Python version. Be careful when installing packages with pip, though: before reinstalling earthmover in a new environment, run the following to make sure you're not using the wrong versions of packages:

pip cache purge                      # drop wheels cached for other Python versions
pip uninstall earthmover -y          # remove any existing earthmover install
pip freeze | xargs pip uninstall -y  # uninstall everything else left in the environment
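
For reference, the pyenv + venv setup described above looks roughly like this (version numbers are just examples):

pyenv install 3.9.19               # install the interpreter
pyenv global 3.9.19                # make it the default python
python -m venv ~/venvs/py39        # the venv captures whichever python created it
source ~/venvs/py39/bin/activate   # from here on, "python" is 3.9.19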

Collaborator

@tomreitz tomreitz left a comment


Let's not merge this until @jayckaiser has a chance to review, but I'm approving this.

I also tested it (via earthmover -t && cd example_projects/ && ./run_all.sh) under python versions 3.9.19, 3.10.12, 3.11.9, and 3.12.5, and all tests completed successfully.

One minor thing to look into (but shouldn't block merging this) is under python 3.12.5, I got this warning about the regex in the group_by operation when running earthmover -t:

SyntaxWarning: invalid escape sequence '\('
  "([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)",

Again, earthmover -t succeeded, so the group_by operation must have executed correctly, but based on the warning, I wonder if Python 3.12 changed how invalid escape sequences in string literals are handled?

Excellent work here @johncmerfeld - thank you!

@jayckaiser
Collaborator

The SyntaxWarning above can be resolved by sticking a raw string identifier (r"") at the start of the regex pattern in the group-by operation.
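
Concretely, using the pattern from the warning above (a minimal sketch; the variable name is illustrative):

# Before: Python 3.12 warns because "\(" is an invalid escape in a normal string
pattern = "([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)"

# After: the raw-string prefix keeps the backslashes literal, and the warning goes away
pattern = r"([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)"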

@jayckaiser
Collaborator

Next steps:

  1. Use Bhenzel's tool (or something comparable) to profile memory and time across runs and Python versions (i.e., is the PyArrow string serialization in Pandas 2.0 taking effect?)
  2. Investigate why the query optimizer fails (for future update)

I want us to support Python 3.8 through 3.12 for the time being while we look into upgrading our infrastructure to the latest versions. Additionally, I'd like to confirm that the decrease in time-performance for the CogAT assessment comes with improved memory usage.

@tomreitz
Collaborator

@johncmerfeld looks like you can see the peak memory usage of any process by running it with

/usr/bin/time -v command

e.g.,

/usr/bin/time -v earthmover -t

and looking at the line Maximum resident set size (kbytes). (Might be good to average a few runs to eliminate outliers.)

This could be a quick way to profile the runtime and memory usage under the various Python environments, as Jay suggested (next step #1 above).
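
A quick sketch of that averaging, assuming GNU time's -v output format (the script name and run count are arbitrary):

#!/usr/bin/env bash
# avg_rss.sh -- average peak RSS over N runs of the given command
N=5
total=0
for i in $(seq "$N"); do
  # GNU time writes to stderr; route it into awk and discard the command's stdout
  kb=$(/usr/bin/time -v "$@" 2>&1 >/dev/null | awk -F': ' '/Maximum resident set size/ {print $2}')
  total=$((total + kb))
done
echo "mean peak RSS over $N runs: $((total / N)) kB"

Usage would be something like ./avg_rss.sh earthmover -t.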

@johncmerfeld
Contributor Author

@jayckaiser I learned a few things and collected a lot of data. It paints an ambiguous picture, but we don't really have a choice about whether to go through with this upgrade.

You have to turn on PyArrow-backed dataframes

The use of PyArrow for serialization does not happen automatically. The official docs almost make it seem like something to be used on a case-by-case basis, although they never come out and say that. There are ways to set it globally, and I did so using config in the global __init__.py. The lines I added might be redundant, but I don't know exactly how Dask and Pandas share settings, so it seemed appropriate.
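
Roughly, the settings look like this (a sketch only; the exact lines are on the branch):

import dask
import pandas as pd

# pandas: back string columns with PyArrow instead of NumPy object arrays
pd.set_option("mode.string_storage", "pyarrow")

# dask: automatically convert object/string data to PyArrow strings
dask.config.set({"dataframe.convert-string": True})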

There's also an optional performance download that comes with Pandas now. I included this in my tests.
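
(For anyone reproducing this, I believe that's the pandas extras bundle, assuming the standard extras name:)

pip install "pandas[performance]"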

Overall I don't see significant time/memory improvements between what we had before and this "fully enhanced" code

Unfortunate, but that's what I see after much testing. On CogAT (the most realistic dataset I have), both time and max memory increase.

Why might this be? I haven't dug fully into what people are saying about Pandas 2; it's easy to find benchmarks showing astonishing improvements, and it's possible the kinds of data I tested with simply don't fit those patterns well. I've also seen claims that PyArrow-backed Pandas is worse at some operations than the NumPy version. My takeaway is that none of this makes enough of a difference for us to block this change, and it's not worth the effort to optimize at the level of individual operation types.

However, if you assume we're using 3.12, the memory improvement becomes more noticeable

I'm assuming we more or less have to upgrade Python, and I still think it's the right thing to do. Given that, I believe the global settings applied on this branch are probably beneficial.

@tomreitz tomreitz merged commit 82076a9 into main Oct 16, 2024
@tomreitz tomreitz deleted the feature/python-312 branch October 16, 2024 21:48