Support uncertainties as EA Dtype #53970

MichaelTiemannOSC · 2023-07-02T14:06:56Z

In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype.

See hgrecco/pint#1615
and hgrecco/pint-pandas#140

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype. See hgrecco/pint#1615 and hgrecco/pint-pandas#140 Signed-off-by: Michael Tiemann <[email protected]>

Make black and ruff happy. Signed-off-by: Michael Tiemann <[email protected]>

Somewhere along the line, pandas-dev#39098 was fixed. Remove the XFAIL mark so we can celebrate one more success! Signed-off-by: Michael Tiemann <[email protected]>

Fix pylint complaint about useless parent delgation. Signed-off-by: Michael Tiemann <[email protected]>

jbrockmendel · 2023-07-03T20:24:48Z

IIUC the problem here is isna not recognizing your custom scalar? I think the way to address this is directly in #27462/#27825

My last attempt at an implementation was #51378, which I still think is the way to go.

With these changes (and pandas-dev/pandas#53970 and hgrecco/pint#1615) the test suite passes or xpasses everything (no failures or error). Indeed, the uncertainties code has essentially doubled the scope of the test suite (to test with and without it). The biggest gotcha is that the EA for complex numbers is not compatible with the EA for uncertainties, due to incompatible hacks: The hack for complex numbers is to np.nan (which is, technically, a complex number) for na_value across all numeric types. But that doesn't work for uncertainties, because uncertainties doesn't accept np.nan as an uncertain value. The hack for uncertainties is to use pd.NA for na_value. This works for Int64, Float64, and uncertainties, but doesn't work for complex (which cannot tolerate NAType). Some careful subclassing fills in what doesn't easily work, with fixtures to prevent the improper mixing of complex and uncertainty types in the same python environment. Happy to discuss! Signed-off-by: Michael Tiemann <[email protected]>

Documentation updated. Signed-off-by: Michael Tiemann <[email protected]>

rst ``code`` is two backticks (which my pre-commit didn't catch). Also, try to better guess the proper rst doc pattern for referencing test files in pandas/tests directory. Signed-off-by: Michael Tiemann <[email protected]>

Try referencing pull request using html reference as there is no extlink definition for `:pull:` Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC · 2023-07-12T15:35:02Z

Please consider adding the https://github.com/pandas-dev/pandas/milestone/103 label to this pull-request. There should be no user-visible changes. The 2.1 milestone contains several enhancements that would further enhance the development versions of Pint and Pint-Pandas which await the changes from this pull request (see hgrecco/pint#1615 and hgrecco/pint-pandas#140).

jbrockmendel · 2023-07-12T16:18:58Z

pandas/core/arrays/masked.py

-        if allow_fill and notna(fill_value):
+        # NaN with uncertainties is scalar but does not register as `isna`,
+        # so use fact that NaN != NaN
+        if allow_fill and notna(fill_value) and fill_value == fill_value:


how would you even get here with uncertainties? is your EA subclassing BaseMaskedArray?

The EA itself is a very vanilla EA. The test suite instantiates the EA in myriad ways for test purposes. In this case the EA gets instantiated as a MaskedArray (which derives from BaseMaskedArray) for testing MaskedArray functionality. I think the test suite has every right to rigorously instantiate EAs in every way imaginable, which it does.

im still confused here. the EA i see in hgrecco/pint-pandas#140 subclasses ExtensionArray but not BaseMaskedArray, so i don't see how the code change here would affect it.

OK...I looked more deeply into this when I get back the computer where I did original work and didn't upload some notes to the cloud. I'm pretty sure it's due to some conditions created when trying to merge two DataFrames that I dummied up in my notes. I'll create a test case that can go into the Pandas test suite. It won't depend on uncertainties, but if and when used with uncertainties, it should trigger the path in question. Should be done by Monday.

Well, that was a lot more challenging to reconstruct than I expected, but I re-created the condition. TL;DR: now that I use pd.NA as my NA value for my particular EA, this change that caught your eye is indeed unnecessary and I can revert it.

But, since you asked, here's the path I took initially. Starting with PintType we define this:

@property def na_value(self): if HAS_UNCERTAINTIES: return self.ureg.Quantity(_ufloat_nan, self.units) else: return self.ureg.Quantity(np.nan, self.units)

The idea (before settling on pd.NA, which works without this particular change) was to use a NaN to which we could cast any numerical type (we cannot downcast uncertainties to float).

In the test code we set up data_missing and data_missing_for_sorting to use ufloat(np.nan, 0) as the NA value where appropriate. The test case test_groupby_agg_extension uses data_for_grouping, which is called without numeric_dtype as None, and this is where we start going off the rails.

Here's the data_for_grouping code:

@pytest.fixture def data_for_grouping(numeric_dtype, USE_UNCERTAINTIES): a = 1.0 b = 2.0**32 + 1 c = 2.0**32 + 10 if USE_UNCERTAINTIES: a = a + ufloat(0, 0) b = b + ufloat(0, 0) c = c + ufloat(0, 0) _n = _ufloat_nan numeric_dtype = None elif numeric_dtype: numeric_dtype = dtypemap.get(numeric_dtype, numeric_dtype) _n = np.nan else: _n = pd.NA return PintArray.from_1darray_quantity( ureg.Quantity(pd.array([b, b, _n, _n, a, a, b, c], dtype=numeric_dtype), ureg.m) )

The way the test code is run, because numeric_dtype is None we get:

(Pdb) data_for_grouping <PintArray> [4294967297.0, 4294967297.0, <NA>, <NA>, 1.0, 1.0, 4294967297.0, 4294967306.0] Length: 8, dtype: pint[meter]

When we execute expected = df.iloc[[0, 2, 4, 7]] there comes a point where we come here (/Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/internals/managers.py(646)<listcomp>()):

644 else: 645 new_blocks = [ 646 -> blk.take_nd( 647 indexer, 648 axis=1, 649 fill_value=( 650 fill_value if fill_value is not None else blk.fill_value 651 ),

With fill_value as None, we inherit the fill value from the PintArray, which is the uncertain NaN. Here's the stack trace down from there:

pandas/core/internals/managers.py(645)reindex_indexer() -> new_blocks = [ pandas/core/internals/managers.py(646)<listcomp>() -> blk.take_nd( pandas/core/internals/blocks.py(975)take_nd() -> new_values = algos.take_nd( pandas/core/array_algos/take.py(115)take_nd() -> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill) pint-pandas/pint_pandas/pint_array.py(523)take() -> result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill) pandas/core/algorithms.py(1307)take() -> result = take_nd( pandas/core/array_algos/take.py(115)take_nd() -> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill) > pandas/core/arrays/masked.py(879)take() -> result[fill_mask] = fill_value

In this particular case, because none of the indexes are -1, no filling is going to happen. But should we have things to fill, they'd be filled with the uncertain nan. And because this code really doesn't want to fill add new NaNs to the array, I put this extra defensive code in. And that's how we hit this condition.

I'm going to take this extraneous test out (fill_value == fill_value). Does this answer your question?

pandas/tests/extension/json/test_json.py

Having observed that GH#39098 works in the current release, allow the test that previously XFAIL'd to run (and pass). This is done to make CI/CD happy, and has nothing to do with changes otherwise present in this PR. Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC · 2023-07-14T11:49:32Z

Looks like we now depend on #54110 for CI/CD to pass (which is not yet merged).

MichaelTiemannOSC · 2023-07-14T11:55:15Z

I also don't understand this complaint from pylint:

pylint...................................................................Failed
- hook id: pylint
- duration: 793.25s
- exit code: 4

************* Module pandas.tests.extension.json.test_json
pandas/tests/extension/json/test_json.py:[359](https://github.com/pandas-dev/pandas/actions/runs/5553184370/jobs/10141428866?pr=53970#step:12:364):4: W0246: Useless parent or super() delegation in method 'test_groupby_agg_extension' (useless-parent-delegation)

The code I un-commented was original code, and it appears to follow precisely the pattern of all other code in the module. I'm at a loss as to how to proceed because I didn't create the TestGroupby class hierarchy and don't know why this particular case is getting singled out.

jbrockmendel · 2023-07-14T14:15:44Z

The code I un-commented was original code

Again, just remove it and the parent class test will run.

MichaelTiemannOSC · 2023-07-14T14:40:47Z

The code I un-commented was original code

Again, just remove it and the parent class test will run.

Ah, now I understand. I was afraid to remove a test that should be run, and did not grok that the test would run in the parent class. I'll re-submit, waiting a few hours in case there's an answer to my pyarrow question.

Removed the superfluous check testing whether fill_value was a NaN or not (in pandas/core/arrays/masked.py::take). pylint properly flags an unnecessary implementation of TestGroupby.test_groupby_agg_extension (which merely calls the parent method by the same name). By deleting that implementation, the parent method is called automatically. Signed-off-by: Michael Tiemann <[email protected]>

Needed in order to fix broken reduce operations (pandas-dev#52788). Signed-off-by: Michael Tiemann <[email protected]>

pandas/core/arrays/masked.py

pandas/core/groupby/groupby.py

jbrockmendel · 2023-07-25T17:17:12Z

pandas/core/groupby/ops.py

@@ -590,7 +590,7 @@ def nkeys(self) -> int:

    def get_iterator(
        self, data: NDFrameT, axis: AxisInt = 0
-    ) -> Iterator[tuple[Hashable, NDFrameT]]:
+    ) -> Iterator[tuple[Hashable, NDFrameT]]:  # Doesn't work with non-hashable EA types


jbrockmendel · 2023-07-25T17:17:47Z

pandas/tests/extension/test_boolean.py

@@ -28,7 +28,7 @@


 def make_data():
-    return [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False]
+    return [True, False] * 4 + [pd.NA] + [True, False] * 44 + [pd.NA] + [True, False]


why is this necessary?

I made this change to track the identical changes in test_integer.py and test_floating.py. The it does not matter whether the NA values in this case are pd.NA, np.nan, or a mixture. But I did think it better to make the three files consistent. I'm happy to revert this change if you prefer.

Agreed this should be removed. It's ideal to minimize extraneous changes

Patches in response to four review comments/questions received. Signed-off-by: Michael Tiemann <[email protected]>

`_na_value` is a property of ExtensionDtype, not ExtensionArray. Signed-off-by: Michael Tiemann <[email protected]>

Testing in Python 3.12 reveals that we *do* need to fix this after all. Signed-off-by: Michael Tiemann <[email protected]>

Alas, _na_value is not a documented field of xarray. Current code checks do not seem to cover this code path, so the first failure was found attribute checks in doc build process. Also re-reverting test_boolean.py back to original state. The python-dev checks first tripped on boolean REPL, not groupby checks. Signed-off-by: Michael Tiemann <[email protected]>

doc/source/whatsnew/v2.1.0.rst

mroeschke · 2023-07-26T17:56:33Z

doc/source/whatsnew/v2.1.0.rst

@@ -160,6 +161,7 @@ Other enhancements
 - :meth:`DataFrame.stack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
 - :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
 - :meth:`DataFrameGroupby.agg` and :meth:`DataFrameGroupby.transform` now support grouping by multiple keys when the index is not a :class:`MultiIndex` for ``engine="numba"`` (:issue:`53486`)
+- :meth:`GroupBy.first` if series contains only NA values (which might be NaN), return the first NA value, else return np.nan (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_)


Not sure what this is decribing

This is describing a change of behavior of the GroupBy.first helper function to not assume that np.nan is a valid NA value for the EA in question. The idea of the change (which, if valid, should also be applied to GroupBy.last) is that if we cannot find any non-NA values in the array, but if there are NA values, the return value should be the first (or last) NA value in the array. I did this because I thought that my EA could only deal with NaN values that were UFloats (having a nominal value and a std_dev value). But as I traced through the code and observed things more closely, I see that the EA does have tolerance for np.nan as an element among UFloats, giving np.nan as the nominal value and zero as the std_dev. So I will remove this change (by being less strict in some of my own assumptions in checking EA consistency).

doc/source/whatsnew/v2.1.0.rst

Not much left for this PR after all! Signed-off-by: Michael Tiemann <[email protected]>

Clean up more doc changes missed in previous commit. Signed-off-by: Michael Tiemann <[email protected]>

Turns out that this fixes GH#39098. Maybe I should resubmit under that guise just this one patch? Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC · 2023-07-29T16:11:27Z

This PR is obsolete now.

MichaelTiemannOSC added 4 commits July 2, 2023 10:04

Support uncertainties as EA Dtype

4c04e30

In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype. See hgrecco/pint#1615 and hgrecco/pint-pandas#140 Signed-off-by: Michael Tiemann <[email protected]>

Update masked.py

a165278

Make black and ruff happy. Signed-off-by: Michael Tiemann <[email protected]>

Update test_json.py

0b833aa

Somewhere along the line, pandas-dev#39098 was fixed. Remove the XFAIL mark so we can celebrate one more success! Signed-off-by: Michael Tiemann <[email protected]>

Update test_json.py

2a1913b

Fix pylint complaint about useless parent delgation. Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC marked this pull request as ready for review July 5, 2023 23:28

MichaelTiemannOSC requested a review from rhshadrach as a code owner July 5, 2023 23:28

This was referenced Jul 5, 2023

Make AffineScalarFunc hashable for Pandas, Pint and Pint-Pandas lmfit/uncertainties#170

Closed

Values dtype hgrecco/pint-pandas#192

Open

MichaelTiemannOSC added 3 commits July 12, 2023 07:21

Update v2.1.0.rst

7d8ba79

Documentation updated. Signed-off-by: Michael Tiemann <[email protected]>

Update v2.1.0.rst

8ada662

rst ``code`` is two backticks (which my pre-commit didn't catch). Also, try to better guess the proper rst doc pattern for referencing test files in pandas/tests directory. Signed-off-by: Michael Tiemann <[email protected]>

Update v2.1.0.rst

42ff1a8

Try referencing pull request using html reference as there is no extlink definition for `:pull:` Signed-off-by: Michael Tiemann <[email protected]>

jbrockmendel reviewed Jul 12, 2023

View reviewed changes

pandas/tests/extension/json/test_json.py Outdated Show resolved Hide resolved

MichaelTiemannOSC added 2 commits July 17, 2023 22:24

Merge remote-tracking branch 'upstream/main' into uncertainties

93be04f

Needed in order to fix broken reduce operations (pandas-dev#52788). Signed-off-by: Michael Tiemann <[email protected]>

jbrockmendel reviewed Jul 25, 2023

View reviewed changes

pandas/core/arrays/masked.py Outdated Show resolved Hide resolved

jbrockmendel reviewed Jul 25, 2023

View reviewed changes

pandas/core/groupby/groupby.py Show resolved Hide resolved

jbrockmendel reviewed Jul 25, 2023

View reviewed changes

MichaelTiemannOSC added 3 commits July 26, 2023 08:33

Code review cleanpus

ed69055

Patches in response to four review comments/questions received. Signed-off-by: Michael Tiemann <[email protected]>

Update groupby.py

65cf17a

`_na_value` is a property of ExtensionDtype, not ExtensionArray. Signed-off-by: Michael Tiemann <[email protected]>

Update test_boolean.py

2f92811

Testing in Python 3.12 reveals that we *do* need to fix this after all. Signed-off-by: Michael Tiemann <[email protected]>

mroeschke reviewed Jul 26, 2023

View reviewed changes

doc/source/whatsnew/v2.1.0.rst Outdated Show resolved Hide resolved

mroeschke reviewed Jul 26, 2023

View reviewed changes

doc/source/whatsnew/v2.1.0.rst Outdated Show resolved Hide resolved

mroeschke reviewed Jul 26, 2023

View reviewed changes

doc/source/whatsnew/v2.1.0.rst Outdated Show resolved Hide resolved

MichaelTiemannOSC added 2 commits July 26, 2023 15:42

Finish removing unneeded code and doc changes.

2964f44

Not much left for this PR after all! Signed-off-by: Michael Tiemann <[email protected]>

Update v2.1.0.rst

6644fe0

Clean up more doc changes missed in previous commit. Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC marked this pull request as draft July 26, 2023 21:27

Update groupby.py

c9e0de3

Turns out that this fixes GH#39098. Maybe I should resubmit under that guise just this one patch? Signed-off-by: Michael Tiemann <[email protected]>

MichaelTiemannOSC closed this Jul 29, 2023

MichaelTiemannOSC deleted the uncertainties branch July 29, 2023 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support uncertainties as EA Dtype #53970

Support uncertainties as EA Dtype #53970

MichaelTiemannOSC commented Jul 2, 2023 •

edited

Loading

jbrockmendel commented Jul 3, 2023 •

edited

Loading

MichaelTiemannOSC commented Jul 12, 2023

jbrockmendel Jul 12, 2023

MichaelTiemannOSC Jul 12, 2023

jbrockmendel Jul 14, 2023

MichaelTiemannOSC Jul 14, 2023

MichaelTiemannOSC Jul 17, 2023

MichaelTiemannOSC commented Jul 14, 2023 •

edited

Loading

MichaelTiemannOSC commented Jul 14, 2023

jbrockmendel commented Jul 14, 2023

MichaelTiemannOSC commented Jul 14, 2023

jbrockmendel Jul 25, 2023

jbrockmendel Jul 25, 2023

MichaelTiemannOSC Jul 26, 2023

mroeschke Jul 26, 2023

mroeschke Jul 26, 2023

MichaelTiemannOSC Jul 26, 2023

MichaelTiemannOSC commented Jul 29, 2023

Support uncertainties as EA Dtype #53970

Support uncertainties as EA Dtype #53970

Conversation

MichaelTiemannOSC commented Jul 2, 2023 • edited Loading

jbrockmendel commented Jul 3, 2023 • edited Loading

MichaelTiemannOSC commented Jul 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelTiemannOSC commented Jul 14, 2023 • edited Loading

MichaelTiemannOSC commented Jul 14, 2023

jbrockmendel commented Jul 14, 2023

MichaelTiemannOSC commented Jul 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelTiemannOSC commented Jul 29, 2023

MichaelTiemannOSC commented Jul 2, 2023 •

edited

Loading

jbrockmendel commented Jul 3, 2023 •

edited

Loading

MichaelTiemannOSC commented Jul 14, 2023 •

edited

Loading