TST (string dtype): resolve xfails for frame methods #60336

Draft
wants to merge 2 commits into
base: main

Member

@WillAyd WillAyd commented Nov 16, 2024

No description provided.

@WillAyd WillAyd added this to the 2.3 milestone Nov 16, 2024
@WillAyd WillAyd force-pushed the fix-string-frame-methods branch from 9592c2d to a2e8dc3 Compare November 16, 2024 15:32
Member Author

@WillAyd WillAyd left a comment

These are pretty tricky, and I'm not sure I've approached them correctly. I could use some extra input, @jorisvandenbossche

@@ -2362,5 +2362,6 @@ def external_values(values: ArrayLike) -> ArrayLike:
values.flags.writeable = False

# TODO(CoW) we should also mark our ExtensionArrays as read-only
Member Author

Have we already had discussions on how to make ExtensionArrays readonly?

 dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())
 dt2 = datetime.datetime(2015, 2, 2, tzinfo=dateutil.tz.tzutc())
-df["Time"] = [dt1]
+df = DataFrame({"Time": [dt1]})
Member Author

The problem with this test is that an empty DataFrame is created first, which gives it an object-dtype columns Index; the subsequent column assignment then keeps the columns dtype as object.

That seems like a more general usage issue which needs to be resolved, although for this test I didn't think it was important to use that construction pattern
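To make the difference concrete, here is a minimal sketch contrasting the two construction patterns. It uses `datetime.timezone.utc` instead of `dateutil` to avoid the extra dependency, and it only prints the dtypes rather than asserting them, since the result depends on the pandas version and on whether `future.infer_string` is enabled.

```python
import datetime

import pandas as pd

dt1 = datetime.datetime(2015, 1, 1, tzinfo=datetime.timezone.utc)

# Pattern 1: start from an empty frame, then assign a column.
df_assigned = pd.DataFrame()
df_assigned["Time"] = [dt1]

# Pattern 2: pass the data to the constructor directly.
df_constructed = pd.DataFrame({"Time": [dt1]})

# Both frames hold the same single timestamp; what may differ is the
# dtype of the columns Index, so inspect it rather than assume it.
print(df_assigned.columns.dtype, df_constructed.columns.dtype)
```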

Member

Yeah, see my comment above about this (and opened #60338 about it), but for the tests the above change is indeed fine

@@ -6273,6 +6274,10 @@ class max type
else:
to_insert = ((self.index, None),)

if len(new_obj.columns) == 0 and names:
Member Author

This is a local fix to the problem of appending column names to an empty set, which defaults the column dtype to object. While this fixes the tests, there seems to be a larger issue at play that I'm not sure how to solve.

Member

This seems to be the same kind of issue as the pattern of creating an empty DataFrame and then adding columns, which I also encountered in the tests. So far I have always got those tests passing either by ensuring the expected result uses object dtype or by ensuring the empty DataFrame starts with an empty columns Index of dtype "str".

I am not sure we should "fix" this issue, as it would also introduce an inconsistency in the expected dtype, but opened #60338 to give this a bit more visibility.
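The second test workaround described above could look roughly like the sketch below. It assumes a pandas version that accepts `dtype="str"` for an Index; on versions without the new string dtype, `"str"` is expected to fall back to object, so the dtype is printed rather than asserted. The column name and data are made up for illustration.

```python
import pandas as pd

# Start the empty frame with an explicitly typed, empty columns Index.
empty_cols = pd.Index([], dtype="str")
df = pd.DataFrame(columns=empty_cols)

# Add a column afterwards, as the affected tests do.
df["a"] = [1, 2]

# Inspect the columns dtype; it depends on the pandas version in use.
print(df.columns.dtype)
```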

Member

I am not sure we should "fix" this issue

I think any code changes should perhaps go in PRs separate from the general xfail-resolving PRs and, to avoid any regressions on 2.3.x, maybe be wrapped in using_string_dtype if-blocks?

Member Author

@WillAyd WillAyd Nov 18, 2024

Thanks for the feedback. I'll get this removed

@simonjayhawkins just to confirm I understand, are you asking to separate out PRs that need to change tests to correct the xfails from PRs that need to change the core implementation?

Member

If the PR title implies that the changes are test-related, then I don't normally expect to see changes to the core implementation, so yes, I think splitting this PR is wise.

assert item is pd.NA

# For non-NA values, we should match what we get for non-EA str
alt = obj.astype(str)
Member

Maybe repeat the above also with dta.astype("str"), so we test the default string dtype as well

Member Author

Meant to leave a comment on this. It's not visible in the diff, but there is already a tm.assert_frame_equal call a few lines up from this. Is there any value in calling that and then calling it again with a slice?

Member

I meant to repeat the expected = frame_or_series(dta.astype("string")) as expected = frame_or_series(dta.astype("str")) (to test both the NA and NaN variant)
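As a rough illustration of the two variants, the sketch below converts a datetime Series with a missing value. The input data is made up. Only the `"string"` result is stated with confidence; the `"str"` conversion is just inspected, because what it yields depends on the pandas version and on the `future.infer_string` setting.

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2015-01-01", None]))

# "string" is the NA variant: the missing value becomes pd.NA.
as_string = ser.astype("string")
print(as_string.iloc[1] is pd.NA)

# "str" is the NaN variant under the new default string dtype; on older
# versions it behaves like plain astype(str), so only inspect the result.
as_str = ser.astype("str")
print(as_str.dtype, repr(as_str.iloc[1]))
```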

expected = Series([np.array(["bar"])])
else:
expected = Series(["bar"])
expected = Series(np.array(["bar"]), dtype=object)
Member

Hmm, shouldn't we expect str dtype here if that is enabled?

Member Author

I could see either way and I don't have a strong preference

Member

Ah, I missed that this is a kind of "reducing" apply, because the applied lambda returns a 0-dim array (kind of an array scalar).

When doing a normal apply preserving the column length, it already infers it as string:

In [25]: result = df.apply(lambda col: np.array(["bar"]))

In [26]: result
Out[26]: 
     0
0  bar

In [27]: result.dtypes
Out[27]: 
0    str
dtype: object

So here it is essentially reducing each column and then creating a Series with the results. Now, also in this case I would expect that we infer the dtype?
But it seems this is not specific to strings, because also when doing the same with an integer, we get object dtype:

In [31]: result = df.apply(lambda col: np.array(1))

In [32]: result
Out[32]: 
0    1
dtype: object

Member

Ah, and the reason that it is object dtype is that we actually store the 0-dim array object in the Series. Continuing with the last example above:

In [33]: result.values
Out[33]: array([array(1)], dtype=object)

So yes, object dtype is correct here, but it's also just a strange test. (I would say that ideally we "unpack" those 0-dim arrays into actual scalars and then do proper type inference.)
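The "unpack, then infer" idea could be sketched as follows. This is a hypothetical illustration of the suggestion, not what DataFrame.apply does today, and the helper name `unpack_0dim` is made up.

```python
import numpy as np
import pandas as pd


def unpack_0dim(value):
    """Return the Python scalar held by a 0-dim array; pass anything else through."""
    if isinstance(value, np.ndarray) and value.ndim == 0:
        return value.item()
    return value


# Stand-in for the per-column results of a reducing apply.
reduced = [np.array(1), np.array(2)]

# After unpacking, the Series constructor can infer a proper dtype.
unpacked = pd.Series([unpack_0dim(v) for v in reduced])
print(unpacked.dtype)
```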

Member Author

Yea I'm not sure. I don't quite understand how this test is useful in practice, so hard to form an opinion

Member

The 0-dim numpy scalar in an object array is the expected result and is intentional, xref #46199

and the same as creating a Series with a scalar....

>>> pd.Series(np.array("bar"))
0    bar
dtype: object
>>> pd.Series(np.array("bar")).item()
array('bar', dtype='<U3')
>>> 

so I think we just need to be more explicit in the expected, i.e. expected = Series(np.array("bar")), and this will also pass with both future.infer_string = True and future.infer_string = False

I think Series(np.array("bar")) is more explicit than Series(np.array(["bar"]), dtype=object) even though they both compare equal in testing...

>>> pd.Series(np.array(["bar"]), dtype=object)
0    bar
dtype: object
>>> pd.Series(np.array(["bar"]), dtype=object).item()
'bar'
>>> import pandas._testing as tm
>>> tm.assert_series_equal(pd.Series(np.array("bar")), pd.Series(np.array(["bar"]), dtype=object))
>>> 

@@ -64,7 +64,6 @@ def test_interpolate_inplace(self, frame_or_series, request):
assert np.shares_memory(orig, obj.values)
assert orig.squeeze()[1] == 1.5

# TODO(infer_string) raise proper TypeError in case of string dtype
Member

This still needs to be done?

Member

pandas/core/arrays/arrow/array.py:2164, in ArrowExtensionArray.interpolate raises a ValueError

        if not self.dtype._is_numeric:
            raise ValueError("Values must be numeric.")

whereas for the legacy NumPy object dtype, this gets caught in the internal block code (pandas/core/internals/blocks.py) before dispatch to an array method.

        if self.dtype == _dtype_obj:
            # GH#53631
            name = {1: "Series", 2: "DataFrame"}[self.ndim]
            raise TypeError(f"{name} cannot interpolate with object dtype.")

        copy, refs = self._get_refs_and_copy(inplace)

        # Dispatch to the EA method.
        new_values = self.array_values.interpolate(

Now, I assume (maybe wrongly) that we don't want to change the EA code to raise a TypeError or change the error message?

Is the solution to catch the ValueError in the block code and raise the TypeError there?
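For discussion, that option could look like the sketch below. Both functions are hypothetical stand-ins written in plain Python, not the actual pandas internals, and the TypeError message is illustrative only.

```python
def ea_interpolate(values, is_numeric):
    """Hypothetical stand-in for the extension array method."""
    if not is_numeric:
        raise ValueError("Values must be numeric.")
    return values


def block_interpolate(values, is_numeric, ndim=1):
    """Hypothetical block-level wrapper that translates the exception type."""
    try:
        return ea_interpolate(values, is_numeric)
    except ValueError as err:
        name = {1: "Series", 2: "DataFrame"}[ndim]
        # Illustrative message only, not an existing pandas message.
        raise TypeError(f"{name} cannot interpolate with string dtype.") from err
```

One drawback of this shape is that a blanket `except ValueError` would also translate unrelated ValueErrors raised by the array method, which is an argument for raising the desired exception at the source instead.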

Member

@jorisvandenbossche jorisvandenbossche Nov 30, 2024

Thanks for the explanation

Now, I assume (maybe wrongly) that we don't want to change the EA code to raise a TypeError or change the error message?

I would say we can just update the error raised from ArrowEA.interpolate. Is there a reason not to do it there?

Member

I know of no reason, other than I assumed this was a conscious choice when implemented.

Looking quickly at some of the code, a NotImplementedError(f"interpolate is not implemented for dtype={self.dtype}") may be more consistent.

alt = obj.astype(str)
assert np.all(alt.iloc[1:] == result.iloc[1:])
else:
assert item is np.nan
Member

item should never be np.nan with the original string dtype?

Member Author

Nice catch - I'll take a closer look as to why that happens

Member

There should be no need for inference here: the result and expected are both astyped. I would expect that the using_infer_string fixture is not needed at all. @jorisvandenbossche has asked that you also test with astype("str"), and that would not change any inference. There is a fixture for testing the different string dtypes.

Member

It appears that the error message NotImplementedError: eq not implemented for <class 'pandas.core.arrays.string_.StringArray'> is misleading.

If you do result = obj.astype("string[pyarrow]"), this test passes; it fails only for the numpy backed object array, i.e. because the default for obj.astype("string") is obj.astype("string[python]").

In the <ArrowStringArrayNumpySemantics> code...

    def _cmp_method(self, other, op) -> ArrowExtensionArray:
        pc_func = ARROW_CMP_FUNCS[op.__name__]
        if isinstance(
            other, (ArrowExtensionArray, np.ndarray, list, BaseMaskedArray)
        ) or isinstance(getattr(other, "dtype", None), CategoricalDtype):
            try:
                result = pc_func(self._pa_array, self._box_pa(other))
            except pa.ArrowNotImplementedError:

For the object backed string array, other is not an instance of ArrowExtensionArray but an instance of StringArray. Although the comparison is skipped and a NotImplementedError is raised, it appears that pc_func(self._pa_array, self._box_pa(other)) does give the expected result with a numpy backed object StringArray

Member

it appears that pc_func(self._pa_array, self._box_pa(other)) does give the expected result with a numpy backed object StringArray

Yes, that is also what I would expect

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Nov 18, 2024
-warning = FutureWarning if using_infer_string else None
-with tm.assert_produces_warning(warning, match="empty entries"):
-    comb = float_frame.combine_first(DataFrame())
+comb = float_frame.combine_first(DataFrame())
Member

so this now does not issue a FutureWarning on main since #58056.

I guess the changes to this file would not be backported to 2.3.x. There is no xfail for this test on 2.3.x and the FutureWarning: The behavior of array concatenation with empty entries is deprecated. is issued when future.infer_string = True

Member

and I think you can also remove the using_infer_string fixture from the function arguments too.

Member Author

Noted for backport - nice catch

import pyarrow as pa

-with pytest.raises(pa.lib.ArrowNotImplementedError, match="has no kernel"):
+with pytest.raises(TypeError, match="Cannot perform reduction"):
Member

Now that the ArrowNotImplementedError is no longer raised, I wonder whether there is any value in keeping the if using_infer_string: block, or whether to just combine the two messages like we do elsewhere, i.e. msg = msg1|msg2 (preferred syntax is "|".join(...)) and then match=msg.
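A minimal sketch of the combined-message pattern, using the two message prefixes quoted in this thread. It relies on `re.search` directly, which is what `pytest.raises(..., match=msg)` applies to the exception text; the helper name `matches` and the shortened sample text are made up for illustration.

```python
import re

# Join the alternative messages into one regular expression.
msg = "|".join(
    [
        "Cannot perform reduction 'mean' with string dtype",
        "Could not convert .* to numeric",
    ]
)


def matches(exc_text):
    """Return True if the exception text matches either alternative."""
    return re.search(msg, exc_text) is not None


print(matches("Cannot perform reduction 'mean' with string dtype"))
print(matches("Could not convert ['foofoo'] to numeric"))
```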

However, this also raises the question of whether the new message that users see, TypeError: Cannot perform reduction 'mean' with string dtype, is better or worse than the previous message, TypeError: Could not convert ['foofoofoofoofoofoofoofoofoofoo'] to numeric. As the most informative message would probably suggest that they use numeric_only=True?

Member

I think in this specific case it is useful to keep the if using_infer_string/else blocks, because then when cleaning up the usage of that, we can remove the entire else block, and we don't keep testing the two error messages later on.

As the most informative message would probably suggest that they use numeric_only=True?

That would indeed be nice, but that's something specifically for the aggregation at the DataFrame level (e.g. df.mean()) to catch and amend the error message (not something specifically for the string dtype / array to do). But definitely worth having a separate issue for that.
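To sketch what amending the message at the DataFrame level might mean, here is a hypothetical wrapper around `df.mean()`. The function name `mean_with_hint` and the hint wording are assumptions for illustration; this is not an existing pandas code path, and it assumes a pandas version where `df.mean()` raises TypeError for string columns.

```python
import pandas as pd


def mean_with_hint(df):
    """Hypothetical wrapper: re-raise reduction TypeErrors with a hint."""
    try:
        return df.mean()
    except TypeError as err:
        # Keep the original message and append a suggestion.
        raise TypeError(f"{err}. Consider passing numeric_only=True.") from err


df = pd.DataFrame({"num": [1.0, 2.0], "text": ["foo", "bar"]})

# Restricting the reduction to numeric columns avoids the error.
print(df.mean(numeric_only=True))
```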

@WillAyd WillAyd force-pushed the fix-string-frame-methods branch from a2e8dc3 to e559a1b Compare December 13, 2024 16:30
Labels
Strings String extension data type and string data

3 participants