-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST (string dtype): resolve xfails for frame methods #60336
base: main
Are you sure you want to change the base?
Conversation
9592c2d
to
a2e8dc3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are pretty tricky and not sure I've approached correctly. Could use some extra input @jorisvandenbossche
@@ -2362,5 +2362,6 @@ def external_values(values: ArrayLike) -> ArrayLike: | |||
values.flags.writeable = False | |||
|
|||
# TODO(CoW) we should also mark our ExtensionArrays as read-only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we already had discussions on how to make ExtensionArrays readonly?
dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc()) | ||
dt2 = datetime.datetime(2015, 2, 2, tzinfo=dateutil.tz.tzutc()) | ||
df["Time"] = [dt1] | ||
df = DataFrame({"Time": [dt1]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this test is that an empty DataFrame is created first, which creates an object dtype column; subsequently, the assignment of a column keeps the column dtype as object.
That seems like a more general usage issue which needs to be resolved, although for this test I didn't think it was important to use that construction pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, see my comment above about this (and opened #60338 about it), but for the tests the above change is indeed fine
pandas/core/frame.py
Outdated
@@ -6273,6 +6274,10 @@ class max type | |||
else: | |||
to_insert = ((self.index, None),) | |||
|
|||
if len(new_obj.columns) == 0 and names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a local fix to the problem of appending column names to an empty set, which defaults the column dtype to object. While this fix the tests, there seems to be a larger issue at play that I'm not sure how to solve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a similar issue like the pattern of creating an empty dataframe and then adding columns that I also encountered in the tests (for now I always got the tests passing by either ensuring the expected uses object dtype or ensuring the empty dataframe starts with an empty columns Index of dtype "str"
).
I am not sure we should "fix" this issue, as it would also introduce an inconsistency in the expected dtype, but opened #60338 to give this a bit more visibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we should "fix" this issue
I think any code changes should perhaps be separate PRs to the general resolving xfails PRs and maybe to avoid any regressions on 2.3.x be wrapped in using_string_dtype if blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I'll get this removed
@simonjayhawkins just to confirm I understand, are you asking to separate out PRs that need to change tests to correct the xfails from PRs that need to change the core implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the PR title implies that the changes are test related then I don't normally expect to see code changes to the core implementation, so yes, I think splitting this PR is wise.
assert item is pd.NA | ||
|
||
# For non-NA values, we should match what we get for non-EA str | ||
alt = obj.astype(str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe repeat the above also with dta.astype("str")
, so we test the default string dtype as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to leave a comment on this. It's not visible in the diff but there is already a tm.assert_frame_equal
call a few lines up from this. Is there any expected value calling that and then calling it with a slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to repeat the expected = frame_or_series(dta.astype("string"))
as expected = frame_or_series(dta.astype("str"))
(to test both the NA and NaN variant)
dt1 = datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc()) | ||
dt2 = datetime.datetime(2015, 2, 2, tzinfo=dateutil.tz.tzutc()) | ||
df["Time"] = [dt1] | ||
df = DataFrame({"Time": [dt1]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, see my comment above about this (and opened #60338 about it), but for the tests the above change is indeed fine
expected = Series([np.array(["bar"])]) | ||
else: | ||
expected = Series(["bar"]) | ||
expected = Series(np.array(["bar"]), dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, shouldn't we expect str dtype here if that is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see either way and I don't have a strong preference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that this is a kind of "reducing" apply, because the applied lambda returns an 0dim array (kind of an array scalar).
When doing a normal apply preserving the column length, it already infers it as string:
In [25]: result = df.apply(lambda col: np.array(["bar"]))
In [26]: result
Out[26]:
0
0 bar
In [27]: result.dtypes
Out[27]:
0 str
dtype: object
So here it is essentially reducing each column and then creating a Series with the results. Now, also in this case I would expect that we infer the dtype?
But it seems this is not specific to strings, because also when doing the same with an integer, we get object dtype:
In [31]: result = df.apply(lambda col: np.array(1))
In [32]: result
Out[32]:
0 1
dtype: object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and the reason that it is object dtype is because we actually store the 0dim array object in the Series.. Continuing with the last example above:
In [33]: result.values
Out[33]: array([array(1)], dtype=object)
So yes, object dtype is correct here, but it's also just a strange test .. (I would say that ideally we "unpack" those 0dim arrays into actual scalars and then do proper type inference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I'm not sure. I don't quite understand how this test is useful in practice, so hard to form an opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 0-dim numpy scalar in an object array is the expected result and is intentional, xref #46199
and the same as creating a Series with a scalar....
>>> pd.Series(np.array("bar"))
0 bar
dtype: object
>>> pd.Series(np.array("bar")).item()
array('bar', dtype='<U3')
>>>
so I think need to just be more explicit in the expected, i.e. expected = Series(np.array("bar"))
and this will also pass with both future.infer_string = True
and future.infer_string = False
I think Series(np.array("bar"))
is more explicit than Series(np.array(["bar"]), dtype=object)
even though they both compare equal in testing...
>>> pd.Series(np.array(["bar"]), dtype=object)
0 bar
dtype: object
>>> pd.Series(np.array(["bar"]), dtype=object).item()
'bar'
>>> import pandas._testing as tm
>>> tm.assert_series_equal(pd.Series(np.array("bar")), pd.Series(np.array(["bar"]), dtype=object))
>>>
@@ -64,7 +64,6 @@ def test_interpolate_inplace(self, frame_or_series, request): | |||
assert np.shares_memory(orig, obj.values) | |||
assert orig.squeeze()[1] == 1.5 | |||
|
|||
# TODO(infer_string) raise proper TypeError in case of string dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs to be done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/arrays/arrow/array.py:2164, in ArrowExtensionArray.interpolate raises a ValueError
if not self.dtype._is_numeric:
raise ValueError("Values must be numeric.")
whereas for the legacy NumPy object dtype, this gets caught in the internal block code (pandas/core/internals/blocks.py) before dispatch to an array method.
if self.dtype == _dtype_obj:
# GH#53631
name = {1: "Series", 2: "DataFrame"}[self.ndim]
raise TypeError(f"{name} cannot interpolate with object dtype.")
copy, refs = self._get_refs_and_copy(inplace)
# Dispatch to the EA method.
new_values = self.array_values.interpolate(
Now, I assume (maybe wrongly) that we don't want to change the EA code to raise a TypeError or change the error message?
Is the solution to catch the ValueError in the block code and raise the TypeError there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation
Now, I assume (maybe wrongly) that we don't want to change the EA code to raise a TypeError or change the error message?
I would say we can just update the error raised from ArrowEA.interpolate. Is there a reason not to do it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know of no reason, other than I assumed this was a conscious choice when implemented.
Quickly looking at some of the code a NotImplementedError(f"interpolate is not implemented for dtype={self.dtype}")
maybe more consistent.
alt = obj.astype(str) | ||
assert np.all(alt.iloc[1:] == result.iloc[1:]) | ||
else: | ||
assert item is np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item should never be np.nan with the original string
dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch - I'll take a closer look as to why that happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be no need for inference here. the result and expected are both astyped. I would expect that the using_infer_string fixture is not needed at all. @jorisvandenbossche has asked that you also test with astype("str")
and that would not change any inference. There is a fixture for testing the different the string dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the error message NotImplementedError: eq not implemented for <class 'pandas.core.arrays.string_.StringArray'>
is misleading.
if you do result = obj.astype("string[pyarrow]")
this test passes and only fails for the numpy backed object array. i.e. the default for obj.astype("string") is obj.astype("string[python]")
In the <ArrowStringArrayNumpySemantics>
code...
def _cmp_method(self, other, op) -> ArrowExtensionArray:
pc_func = ARROW_CMP_FUNCS[op.__name__]
if isinstance(
other, (ArrowExtensionArray, np.ndarray, list, BaseMaskedArray)
) or isinstance(getattr(other, "dtype", None), CategoricalDtype):
try:
result = pc_func(self._pa_array, self._box_pa(other))
except pa.ArrowNotImplementedError:
for the object backed string array, other is not an instance of ArrowExtensionArray
but is an instance of StringArray
and although the comparison is skipped and a NotImplementedError raised, it appears that pc_func(self._pa_array, self._box_pa(other))
does give the expected result with a numpy backed object StringArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that
pc_func(self._pa_array, self._box_pa(other))
does give the expected result with a numpy backed objectStringArray
Yes, that is also what I would expect
warning = FutureWarning if using_infer_string else None | ||
with tm.assert_produces_warning(warning, match="empty entries"): | ||
comb = float_frame.combine_first(DataFrame()) | ||
comb = float_frame.combine_first(DataFrame()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this now does not issue a FutureWarning on main since #58056.
I guess the changes to this file would not be backported to 2.3.x. There is no xfail for this test on 2.3.x and the FutureWarning: The behavior of array concatenation with empty entries is deprecated.
is issued when future.infer_string = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and I think you can also remove the using_infer_string fixture from the function arguments too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted for backport - nice catch
import pyarrow as pa | ||
|
||
with pytest.raises(pa.lib.ArrowNotImplementedError, match="has no kernel"): | ||
with pytest.raises(TypeError, match="Cannot perform reduction"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that the ArrowNotImplementedError
is no longer raised, I wonder whether there is any value having the if using_infer_string:
block or whether just to have the two messages combined like we do elsewhere, i.e. msg - msg1|msg2
(prefered syntax is "|".join(...
) and then match=msg
.
However, this also raises the question is the new message that users see, TypeError: Cannot perform reduction 'mean' with string dtype
and better/worse than the previous message, TypeError: Could not convert ['foofoofoofoofoofoofoofoofoofoo'] to numeric
. As the most informative message would probably suggest that they use numeric_only=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this specific case it is useful to keep the if using_infer_string
/else
blocks, because then when cleaning up the usage of that, we can remove the entire else
block, and we don't keep testing the two error messages later on.
As the most informative message would probably suggest that they use numeric_only=True?
That would indeed be nice, but that's something specifically for the aggregation at the DataFrame level (e.g. df.mean()
) to catch and amend the error message (not something specifically for the string dtype / array to do). But definitely worth having a separate issue for that.
a2e8dc3
to
e559a1b
Compare
No description provided.