-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Integer NA Extension Array #21160
Conversation
pandas/core/arrays/base.py
Outdated
|
||
Parameters | ||
---------- | ||
other : ExtenionArray or list/tuple of ExtenionArrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo Extenion-->Extension
Codecov Report
@@ Coverage Diff @@
## master #21160 +/- ##
==========================================
+ Coverage 91.96% 91.98% +0.01%
==========================================
Files 166 167 +1
Lines 50329 50606 +277
==========================================
+ Hits 46287 46551 +264
- Misses 4042 4055 +13
Continue to review full report at Codecov.
|
There's a lot here, so preliminary thoughts based on a quick pass:
|
@jreback Here is my take on the operators implementation, based on what I did in #20889. Looking at your implementation in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only partway through. Will try to take a closer look later today.
pandas/core/dtypes/base.py
Outdated
@@ -156,6 +156,12 @@ def name(self): | |||
""" | |||
raise AbstractMethodError(self) | |||
|
|||
@property | |||
def array_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this to the list of abstract methods / properties on line 110
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change limits us to 1 array type per extension dtype. Is everyone OK with that? I don't see any downsides to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a problem, you simply have multiple arrays and multiple dtypes (as I did here).
pandas/core/dtypes/dtypes.py
Outdated
from pandas import compat | ||
from pandas.core.dtypes.generic import ABCIndexClass, ABCCategoricalIndex | ||
|
||
from .base import ExtensionDtype, _DtypeOpsMixin | ||
|
||
|
||
class Registry(object): | ||
""" class to register our dtypes for inference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just "Registry for dtype inference"
I split out #21185 with the changes to EA. To answer some of @jbrockmendel questions
I think this would overcomplicate things.
how so, the implementation is barely changed here
yes this is tricky, we are matching on the current impl. if we were to change this we should adjust.
sure, but we don't really support the concept of an ExtensionIndex yet. |
741edac
to
97b01e4
Compare
if mask is None: | ||
mask = isna(values) | ||
else: | ||
assert len(mask) == len(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this raise a ValueError instead of AssertionError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this is an internal construction error, need to satisfy the input guarantees
pandas/core/ops.py
Outdated
is_extension_array_dtype(y) and not | ||
is_scalar(y)): | ||
y = x.__class__._from_sequence(y) | ||
return op(x, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overlap in comment in #21191 since IntegerArray has an __init__
that I'm pretty sure can take left
here (actually below in wrapper
), I'm hopeful that in this case we can save a lot of trouble by using dispatch_to_index_op
directly (passing the EA subclass rather than an Index subclass)
5c9975f
to
d1e5281
Compare
latest push remove the type specific arrays, IOW everything is now just a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pandas/core/arrays/base.py, I think you are missing __rmod__
in the list of arithmetic operators
def _from_factorized(cls, values, original): | ||
return cls(values, dtype=original.dtype) | ||
|
||
def __getitem__(self, item): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without trying to overcomplicated things have we considered moving some of these items to a MaskedEAMixin
? I'm thinking of taking a stab at the Boolean EA next and can see this being generalizable along with a few other methods (__iter__
, __setitem__
, perhaps take
, etc...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certainly could, though I think might be better just to just directly subclass IntegerArray
and the dtype, but that's for another PR :>
anyhow any final comments @pandas-dev/pandas-core certainly will be follows in any event. |
Made a couple doc follow-up notes in #22003 |
I am not necessarily saying that we should use None, but the fact is that we currently somewhat 'misuse' |
you can certainly make an issue to discuss this. and I agree we should prob move to a singular Null type. But this is not the PR for it. I disagree that |
Again, I didn't say we should use None for this. I also think we should not use that, it is already used in many other contexts. |
ok, let's certainly discuss. In reality is a pretty cosmetic detail (mostly for actual printing). |
bombs away |
Ahum, I was actually reviewing this now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make it easier to review if you do not rebase your commits, once you do that github's features to automatically let me see what has changed since the last review does not work anymore.
To come back to my previous comment: can you add a section to the actual documentation? (not only whatsnew)
expected = pd.Series([True, True, False, False], | ||
index=list('ABCD')) | ||
result = df.dtypes.apply(str) == str(dtype) | ||
self.assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also test with the dtype itself? (result = df.dtypes == dtype
)
IntegerArray | ||
""" | ||
self._data, self._mask = coerce_to_array( | ||
values, dtype=dtype, mask=mask, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you respond here further?
# coerce when needed | ||
s + 0.01 | ||
|
||
These dtypes can operate as part of of ``DataFrame``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"of of " -> "of a"
""" | ||
We represent an IntegerArray with 2 numpy arrays | ||
- data: contains a numpy integer array of the appropriate dtype | ||
- mask: a boolean array holding a mask on the data, False is missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
False is missing
I don't think this is true currently?
But given the discussion earlier, I think it would be good to actually implement what you stated there to follow the example of arrow?
|
||
# coerce | ||
data = self._coerce_to_ndarray() | ||
return data.astype(dtype=dtype, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we treat converting to float separately here? (that could be easily made more performant, as a probably common use case for astype)
|
||
def _maybe_mask_result(self, result, mask, other, op_name): | ||
""" | ||
Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one
raise NotImplementedError( | ||
"can only perform ops with 1-d structures") | ||
elif is_list_like(other): | ||
other = np.asarray(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try to convert to IntegerArray here if possible?
eg s + s.tolist()
gives floats (in case s
is a series with int-na dtype)
# otherwise perform the op | ||
if isinstance(right, compat.string_types): | ||
raise TypeError("{typ} cannot perform the operation mod".format( | ||
typ=type(left).__name__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one
if isinstance(right, np.ndarray): | ||
|
||
# handle numpy scalars, this is a PITA | ||
# TODO(jreback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify this to do comment?
# assert our expected result | ||
self.assert_series_equal(result, expected) | ||
|
||
def test_arith_integer_array(self, data, all_arithmetic_operators): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss how to handle this (it will pop-up in all the other internal extension dtypes as well), so now is a good time.
I would personally move all specific tests that you add here that are not in the parent base tests to tests/arrays/integer
, and only subclass here to check that the base tests make sense
ExtensionScalarOpsMixin) | ||
from .categorical import Categorical # noqa | ||
from .datetimes import DatetimeArrayMixin # noqa | ||
from .interval import IntervalArray # noqa | ||
from .period import PeriodArrayMixin # noqa | ||
from .timedeltas import TimedeltaArrayMixin # noqa | ||
from .integer import ( # noqa | ||
IntegerArray, to_integer_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the goal of exposing to_integer_array
here? Is it used somewhere else?
well I'll adress these in a followup. and for the record I didn't actually rebase this. |
your comment about the init. Well I already address this and it is simply not possible to have the init do nothing. I am -1 on adding from_* methods. This is how it is implemented. If you want to change it you are welcome to submit a PR. |
Sorry, but I don't think this should be an acceptable response. I think we should either roll back this PR or disable all public interfaces for this functionality until the (collective) development team is happy with the new features. As of now, I would consider this a blocker for new releases. As a general policy, I think we should require explicit approval from at least one other core developer before merging pull requests. I am all for avoiding large PRs and doing collaborative development in master, but we should not merge PRs unless they pass code review (certainly not for large changes like this one). |
@shoyer that is a disengenuos response. If you want to do code review on the 100's of open PR's be my guest. This PR has been in a ready state for weeks. |
If folks respond in a timely manner, sure. If you haven't noticed I approve every other PR in a pretty timely manner. After round and rounds of comments. I believe I have earned, and deserve the same consideration. |
This PR is not just "one of the 100's of open PRs", but a major change that needs more attention than smaller PRs. I agree with Stephan that it would be good for such PRs to have the explicit +1 of other core devs. But OK, although I would have preferred to iterate further in this PR, let's move forward in follow-up PRs! |
|
||
.. warning:: | ||
|
||
The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
captilized -> capitalized?
This misses the point. I spend an inordinate amount of time reviewing practically everything. I don't expect everyone to do this, rather for my PR's, which I DO expect review time from other folks. At the same time, endless comments, while certainly debatable, we don't force on anyone. Many many times, we have done follow up PR's just to cut down on the endless review cycle. In any event I will address your comments. |
* ENH: add integer-na support via an ExtensionArray closes pandas-dev#20700 closes pandas-dev#20747
closes #20700
closes #20747