Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pandas and numpy >= 2 compatibility #287

Merged
merged 1 commit into from
Jan 20, 2025

Conversation

mkopec87
Copy link
Contributor

@mkopec87 mkopec87 commented Dec 6, 2024

  • align requirements.txt with pyproject.toml
  • remove calls to np.string_ not existing in numpy >= 2.0.0
  • remove calls to pd._testing.makeMixedDataFrame not existing in new pandas versions
  • fix install and test commands in documentation for developers
  • replace np.mean with column-wise version
  • drop pandas dependency constraint <2
  • require Python 3.9 in pyproject.toml
  • add PySpark 3.5.3 to test pipeline matrix
  • update test pipeline matrix: exclude Python 3.8, include Python 3.12

@mkopec87 mkopec87 force-pushed the feature/update-pandas branch 2 times, most recently from c32e4ad to b5f7f11 Compare December 16, 2024 10:02
@mkopec87
Copy link
Contributor Author

mkopec87 commented Dec 16, 2024

Still 5 failing tests :(

FAILED tests/popmon/notebooks/test_notebooks.py::test_notebook_advanced - nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
FAILED tests/popmon/pipeline/test_metrics.py::test_hists_stability_metrics - pandas.errors.DataError: Cannot aggregate non-numeric type: object
FAILED tests/popmon/pipeline/test_report.py::test_df_stability_report_external - TypeError: float() argument must be a string or a real number, not 'CategorizeHistogramMethods'
FAILED tests/popmon/pipeline/test_report.py::test_df_stability_report_rolling - pandas.errors.DataError: Cannot aggregate non-numeric type: object
FAILED tests/popmon/pipeline/test_report.py::test_df_stability_report_expanding - pandas.errors.DataError: Cannot aggregate non-numeric type: object

BTW. do we need 'requirements.txt' for anything?

@mkopec87 mkopec87 force-pushed the feature/update-pandas branch 4 times, most recently from 6e0e1e6 to cfc85d1 Compare December 16, 2024 17:59
@@ -233,7 +238,7 @@ def __init__(
:param kwargs: (dict, optional): residual kwargs passed on to mean and std functions
"""
super().__init__(
np.mean,
ReferencePullCalculator.mean,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functools.partial could be used here instead of staticmethod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mkopec87 mkopec87 force-pushed the feature/update-pandas branch 2 times, most recently from 4f1b459 to 0c424a1 Compare January 3, 2025 10:59
@mkopec87
Copy link
Contributor Author

mkopec87 commented Jan 3, 2025

Managed to fix 3 tests, two more to go:

  • FAILED tests/popmon/notebooks/test_notebooks.py::test_notebook_advanced - nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
  • FAILED tests/popmon/pipeline/test_report.py::test_df_stability_report_external - TypeError: float() argument must be a string or a real number, not 'CategorizeHistogramMethods'

Something seems wrong with 'external' reference type.

@mbaak
Copy link
Contributor

mbaak commented Jan 3, 2025

Hi @mkopec87, thanks for following up. Happy to have a look at these tests then - I'll give it a go.

@mbaak
Copy link
Contributor

mbaak commented Jan 4, 2025

Hi @mkopec87,
I found the problem: np.mean and np.std call df.mean and df.std underneath.
However, df.mean() and df.std() have now become more strict. We now need: df.std(numeric_only=True).
Before it was skipping non-number columns by default. But not anymore.

@mbaak
Copy link
Contributor

mbaak commented Jan 4, 2025

@mkopec87 With these fixes below the two tests work for me.
Can you try these?

In popmon/analysis/profiling/pull_calculator.py, in class ReferencePullCalculator, lines 236-237, set instead:

  • for mean: instead of np.mean use partial(pd.DataFrame.mean, numeric_only=True)
  • for std: instead of np.std use partial(pd.DataFrame.std, numeric_only=True, ddof=0)

That should fix the crash, and return the same values as before.

@mkopec87 mkopec87 force-pushed the feature/update-pandas branch from 0c424a1 to 7762ae8 Compare January 4, 2025 23:12
@mkopec87
Copy link
Contributor Author

mkopec87 commented Jan 4, 2025

Thanks @mbaak, I added changes you've suggested :)

@mkopec87 mkopec87 marked this pull request as ready for review January 4, 2025 23:15
@mkopec87 mkopec87 force-pushed the feature/update-pandas branch 2 times, most recently from 3df42f5 to 7fc8ffb Compare January 5, 2025 14:03
- align requirements.txt with pyproject.toml
- remove calls to np.string_ not existing in numpy >= 2.0.0
- remove calls to pd._testing.makeMixedDataFrame not existing in new pandas versions
- fix install and test commands in documentation for developers
- replace np.mean with column-wise version
- drop pandas dependency constraint <2
- require histogrammar>=1.0.34
- require Python 3.9 in pyproject.toml
- add PySpark 3.5.3 to test pipeline matrix
- update test pipeline matrix: exclude Python 3.8, include Python 3.12
- add test notebook output to .gitignore
- switch to importlib from pkg_resources
- install project dependencies after pyspark in spark build tests
- run mean and std calculations only on numeric columns
- add dependency versions constraints to Spark tests
- update version of actions/upload-artifact task in build pipeline
@mkopec87 mkopec87 force-pushed the feature/update-pandas branch from 7fc8ffb to 150710a Compare January 5, 2025 21:49
@mkopec87
Copy link
Contributor Author

mkopec87 commented Jan 5, 2025

I've set up a build in my forked repo, seems it's passing now :)

@mbaak
Copy link
Contributor

mbaak commented Jan 6, 2025

Let's give @sbrugman a day or two still for a quick, final review.

@mkopec87
Copy link
Contributor Author

@sbrugman bump ;)

@mkopec87
Copy link
Contributor Author

Hi guys, any chance you'd find some time for a final review? :) I'm eager to start using this new feature as soon as possible :)

@mbaak mbaak merged commit 0bc188a into ing-bank:master Jan 20, 2025
10 checks passed
@mbaak
Copy link
Contributor

mbaak commented Jan 20, 2025

@mkopec87 merged. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants