
Feature: Add support for Python 3.12, latest versions of Dask #127

Merged
merged 7 commits into main from feature/python-312 on Oct 16, 2024

Conversation

johncmerfeld
Contributor

@johncmerfeld johncmerfeld commented Sep 12, 2024

Pertaining to this Monday item. Two main ideas here:

  1. In order to upgrade to Python 3.12, we need to upgrade Dask. Dask now uses a query optimizer that does not work with Earthmover. Fortunately we can turn it off with a config setting (see the sketch after this list), which allows us to use the latest version of Dask and pin our dependency to ~=2024.8.0. That in turn lets us pin our pandas dependency to ~=2.2.0 and resume receiving updates. A future Dask version may force us to use the query optimizer, but hopefully by then the bugs that emerge when we use it with Earthmover will have been fixed.
  2. In order to continue supporting Python 3.9 and below, we need to specify conditional dependencies in our requirements.txt (example after the list).
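
A minimal sketch of turning the optimizer off, assuming the dask.config route (the setting has to land before dask.dataframe is first imported; the environment-variable spelling is equivalent):

import dask

# Disable Dask's new expression-based query planner; equivalently,
# set DASK_DATAFRAME__QUERY_PLANNING=False in the environment.
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd  # safe to import now that the setting is in place

And for the conditional dependencies, requirements.txt can branch on interpreter version using PEP 508 environment markers. These pins are illustrative, not the actual ones in this PR:

# Newer interpreters get the current pins
pandas~=2.2.0; python_version >= "3.9"
dask[dataframe]~=2024.8.0; python_version >= "3.9"
# Python 3.8 stays on the last releases that support it
pandas~=2.0.3; python_version < "3.9"
dask[dataframe]~=2023.5.0; python_version < "3.9"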

Personally I think we should consider "deprecating" support for Python 3.9 (and especially 3.8, which is at end-of-life) and removing them in the next major earthmover release - i.e., with earthmover 3.5.0, we would require users to be on Python 3.10 or later. I think this is fair game and comparable to what other tools do, but it also might be self-defeating for our purposes 🤷‍♂️

Testing

  • earthmover -t (Python 3.8-3.12, MacOS and Ubuntu 22.04)
  • example_projects run_all.sh (Python 3.8-3.12, MacOS and Ubuntu)
  • CogAT assessment (full data) (Python 3.8-3.12, Ubuntu)
    • In this assessment (but no others so far) I've noticed a significant decrease in performance with the later versions (20-25% slower). Dask also no longer prints the stream of warnings about fragmented dataframes that it used to, so there's probably some specific operation in the CogAT bundle that is materially different, and less efficient, in the newer versions of Dask.
  • SC Ready (10k row synthetic) (Python 3.8-3.12, MacOS)
  • MAP Growth (12k row synthetic) (Python 3.8-3.12, MacOS)

Anyone else is welcome to do their own testing -- trying a variety of bundles, data volumes, and Python versions* would be great. At this point I feel confident about what I have here.

*The easiest way I've found to do this is to use pyenv to set the global Python version, then use venv to create an environment that uses that version (named e.g. py39). After that, any time you activate that environment, it'll use that Python version. Be careful when installing packages with pip, though: before reinstalling earthmover in a new environment, run the following to make sure you're not using the wrong versions of packages:

pip cache purge                      # drop wheels cached for other Python versions
pip uninstall earthmover -y          # remove any existing earthmover install
pip freeze | xargs pip uninstall -y  # uninstall everything else left in the environment
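
For reference, the pyenv + venv setup described above looks roughly like this (version numbers are just examples):

pyenv install 3.9.19               # install the interpreter
pyenv global 3.9.19                # make it the default python
python -m venv ~/venvs/py39        # the venv captures whichever python created it
source ~/venvs/py39/bin/activate   # from here on, "python" is 3.9.19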

Collaborator

@tomreitz tomreitz left a comment


Let's not merge this until @jayckaiser has a chance to review, but I'm approving this.

I also tested it (via earthmover -t && cd example_projects/ && ./run_all.sh) under python versions 3.9.19, 3.10.12, 3.11.9, and 3.12.5, and all tests completed successfully.

One minor thing to look into (but shouldn't block merging this) is under python 3.12.5, I got this warning about the regex in the group_by operation when running earthmover -t:

SyntaxWarning: invalid escape sequence '\('
  "([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)",

Again, earthmover -t succeeded, so the group_by operation must have executed correctly, but based on the warning, I wonder if Python 3.12 changed how invalid escape sequences in string literals are handled?

Excellent work here @johncmerfeld - thank you!

@jayckaiser
Collaborator

The SyntaxWarning above can be resolved by sticking a raw string identifier (r"") at the start of the regex pattern in the group-by operation.
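
Concretely, using the pattern from the warning above (a minimal sketch; the variable name is illustrative):

# Before: Python 3.12 warns because "\(" is an invalid escape in a normal string
pattern = "([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)"

# After: the raw-string prefix keeps the backslashes literal, and the warning goes away
pattern = r"([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)"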

@jayckaiser
Collaborator

Next steps:

  1. Use Bhenzel's tool (or something comparable) to profile memory and time across runs and Python versions (i.e., is the PyArrow string serialization in Pandas 2.0 taking effect?)
  2. Investigate why the query optimizer fails (for future update)

I want us to support Python 3.8 through 3.12 for the time being while we look into upgrading our infrastructure to the latest versions. Additionally, I'd like to confirm that the decrease in time-performance for the CogAT assessment comes with improved memory usage.

@tomreitz
Collaborator

@johncmerfeld looks like you can see the peak memory usage of any process by running it with

/usr/bin/time -v command

e.g.,

/usr/bin/time -v earthmover -t

and looking at the line Maximum resident set size (kbytes). (Might be good to average a few runs to eliminate outliers.)

This could be a quick way to profile the runtime and memory usage under the various Python environments, as Jay suggested (next step #1 above).
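
A quick sketch of that averaging, assuming GNU time's -v output format (the script name and run count are arbitrary):

#!/usr/bin/env bash
# avg_rss.sh -- average peak RSS over N runs of the given command
N=5
total=0
for i in $(seq "$N"); do
  # GNU time writes to stderr; route it into awk and discard the command's stdout
  kb=$(/usr/bin/time -v "$@" 2>&1 >/dev/null | awk -F': ' '/Maximum resident set size/ {print $2}')
  total=$((total + kb))
done
echo "mean peak RSS over $N runs: $((total / N)) kB"

Usage would be something like ./avg_rss.sh earthmover -t.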

@johncmerfeld
Contributor Author

@jayckaiser I learned a few things and collected a lot of data. It paints an ambiguous picture, but we don't really have a choice about whether to go through with this upgrade.

You have to turn on PyArrow-backed dataframes

The use of PyArrow for serialization does not happen automatically. The official docs almost make it seem like something to be used on a case-by-case basis, although they never come out and say that. There are ways to set it globally, and I did so using config in the global __init__.py. The lines I added might be redundant, but I don't know exactly how Dask and Pandas share settings, so it seemed appropriate.
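
Roughly, the settings look like this (a sketch only; the exact lines are on the branch):

import dask
import pandas as pd

# pandas: back string columns with PyArrow instead of NumPy object arrays
pd.set_option("mode.string_storage", "pyarrow")

# dask: automatically convert object/string data to PyArrow strings
dask.config.set({"dataframe.convert-string": True})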

There's also an optional performance download that comes with Pandas now. I included this in my tests.
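
(For anyone reproducing this, I believe that's the pandas extras bundle, assuming the standard extras name:)

pip install "pandas[performance]"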

Overall I don't see significant time/memory improvements between what we had before and this "fully enhanced" code

Unfortunate, but that's what I see after much testing. On CogAT (the most realistic dataset I have), both time and max memory increase.

Why might this be? I haven't dug fully into what people are saying about Pandas 2; it's easy to find benchmarks showing astonishing improvements, and it's possible the kinds of data I tested with simply don't fit those patterns well. I've also seen claims that PyArrow-backed Pandas is worse at some operations than the NumPy version. My takeaway is that none of this makes enough of a difference for us to block this change, and it's not worth the effort to optimize at the level of individual operation types.

However, if you assume we're using 3.12, the memory improvement becomes more noticeable

I'm assuming we more or less have to upgrade Python, and I still think it's the right thing to do. Given that, I believe the global settings applied on this branch are probably beneficial.

@tomreitz tomreitz merged commit 82076a9 into main Oct 16, 2024
@tomreitz tomreitz deleted the feature/python-312 branch October 16, 2024 21:48