-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Add support for Python 3.12, latest versions of Dask #127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not merge this until @jayckaiser has a chance to review, but I'm approving this.
I also tested it (via earthmover -t && cd example_projects/ && ./run_all.sh
) under python versions 3.9.19
, 3.10.12
, 3.11.9
, and 3.12.5
, and all tests completed successfully.
One minor thing to look into (but shouldn't block merging this) is under python 3.12.5
, I got this warning about the regex in the group_by operation when running earthmover -t
:
SyntaxWarning: invalid escape sequence '\('"([A-Za-z0-9_]*)\(([A-Za-z0-9_]*)?,?(.*)?\)",
Again, earthmover -t
succeeded, so the group by operation must have executed correctly, but based on the warning, I wonder if some change to the regex library was made in python 3.12.*
+?
Excellent work here @johncmerfeld - thank you!
This issue can be resolved by sticking a raw string identifier to the start of the regex query in the group-by operation. ( |
Next steps:
I want us to support Python 3.8 through 3.12 for the time being while we look into upgrading our infrastructure to the latest versions. Additionally, I'd like to confirm that the decrease in time-performance for the CogAT assessment comes with improved memory usage. |
@johncmerfeld looks like you can see the peak memory usage of any process by running it with /usr/bin/time -v command e.g., /usr/bin/time -v earthmover -t and look at the line This could be a quick way to profile the runtime and memory usage under the various Python environments, as Jay suggested (next step #1 above). |
@jayckaiser I learned a few things and collected a lot of data. It paints an ambiguous picture but we don't really have a choice of whether to go through with this upgrade. You have to turn on pyarrow-backed dataframesThe use of pyarrow for serialization does not happen automatically. The official docs almost make it seem like something that should be used on a case-by-case basis, although they never come out and say that. There are ways to set it globally, and I did so using config in the global There's also an optional Overall I don't see significant time/memory improvements between what we had before and this "fully enhanced" codeUnfortunate but that's what I see, after much testing. On CogAT (the most realistic dataset I have) both time and max memory increase. Why might this be? I haven't dived fully into what people are saying about Pandas 2; it's easy to find benchmarks showing astonishing improvements, and it's possible the kinds of data I tested with simply don't fit those patterns well. I've also seen claims that Pyarrow-backed-Pandas is worse at some operations than the NumPy version. My takeaway is that none of this makes enough of a difference to us to block this change, and it's not worth the effort trying to optimize at the level of individual types of operations. However, if you assume we're using 3.12, the memory improvement becomes more noticeableI'm assuming we more or less have to upgrade Python, and I still think it's the right thing to do. Given that, I believe the global settings applied on this branch are probably beneficial. |
Pertaining to this Monday item. Two main ideas here:
~=2024.8.0
-- this also allows us to pin our pandas dependency to~=2.2.0
and resume receiving updates. A future Dask version may force us to use the query optimizer, but hopefully by then, the bugs that emerge when we use it with Earthmover will have been fixed.Personally I think we should consider "deprecating" support for Python 3.9 (and especially 3.8, which is at end-of-life) and removing them from the next major earthmover release - i.e. with earthmover 3.5.0, we would require users to be at Python 3.10 or later. I think this is fair game and comparable with what other tools do, but also might be self-defeating for our purposes 🤷♂️
Testing
earthmover -t
(Python 3.8-3.12, MacOS and Ubuntu 22.04)run_all.sh
(Python 3.8-3.12, MacOS and Ubuntu)Anyone else is welcome to do their own testing -- trying on a variety of bundles, data volumes, and Python versions* would be great. At this point I feel confident about what I have here
*the easiest way I've found to do this is to use pyenv to set the global python version, then use venv to create an environment that uses that version (named e.g.
py39
). After that, any time you activate that environment, it'll use that python version. Be careful when installing packages with pip, though. You'll need to run this before reinstalling earthmover in a new environment to make sure you're not using the wrong versions of packages:pip cache purge pip uninstall earthmover -y pip freeze | xargs pip uninstall -y