REF/ENH: Refactor NDFrame finalization #28334
Conversation
In preparation for pandas-dev#27108 (disallowing duplicates), we need to enhance our metadata propagation. *We need a way for a particular attribute to determine how it is propagated for a particular method.* Our current method of metadata propagation lacked two features:

1. It only copies an attribute from a source NDFrame to a new NDFrame. There is no way to propagate metadata from a collection of NDFrames (say from `pd.concat`) to a new NDFrame.
2. It only and always copies the attribute. This is not always appropriate when dealing with a collection of input NDFrames, as the source attributes may differ. The resolution of conflicts will differ by attribute (for `Series.name` we might throw away the name; for `Series.allow_duplicates`, any Series disallowing duplicates should mean the output disallows duplicates).
cc @jreback @jbrockmendel, this is the start of what I need for #27108 (comment). This PR puts the infrastructure in place to support this. We aren't using it yet (outside of tests), but will when I make the duplicate labels PR. Briefly, adding a

```python
duplicate_labels_meta = PandasMetadata("allows_duplicates")


@duplicate_labels_meta.register(pd.concat)
def _(new, concatenater):
    new.allows_duplicate_labels = all(x.allows_duplicate_labels for x in concatenater.objs)


@duplicate_labels_meta.register(Series.__array_ufunc__)
def _(new, inputs):
    new.allows_duplicate_labels = ...
```

But I split this off so that PR will be smaller.
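For orientation, here is a minimal sketch of a dispatch object that would support the usage above; the method names (`register`, `default`, `finalize`) and internals are assumptions for illustration, not necessarily the API added in `pandas/core/_meta.py`:

```python
# Hypothetical sketch only: a per-attribute dispatcher matching the usage above.
from typing import Callable, Dict


class PandasMetadata:
    """Dispatch finalization logic for a single metadata attribute."""

    def __init__(self, name: str):
        self.name = name
        self._finalizers: Dict[Callable, Callable] = {}

    def register(self, pandas_method: Callable) -> Callable:
        """Decorator: register a finalizer for one pandas method."""
        def decorator(func: Callable) -> Callable:
            self._finalizers[pandas_method] = func
            return func
        return decorator

    def default(self, new, other) -> None:
        """Fallback: copy the attribute straight from the source object."""
        object.__setattr__(new, self.name, getattr(other, self.name, None))

    def finalize(self, new, other, method: Callable = None) -> None:
        """Apply the per-method finalizer if one was registered, else the default."""
        self._finalizers.get(method, self.default)(new, other)
```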
pandas/core/_meta.py (outdated)

```python
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries

if TYPE_CHECKING:
    from pandas.core.generic import NDFrame
```
I think you want FrameOrSeries from pandas._typing
Not necessarily a precursor to this, but it would be nice if we had a systematic way to make sure we were using …
pandas/core/_meta.py (outdated)

```
Parameters
----------
pandas_method : callable or str
```
this looks like you can register a single finalizer? but we already have internal ones, shouldn't this just append to a list of finalizers? how is the default done if we have 1 or more finalizers?
The idea was to register one finalizer per pandas method. I think Brock's subclass-based approach will make this clearer.
Pandas will provide a default implementation, which the subclass can override.
> how is the default done if we have 1 or more finalizers?
Previously, `__finalize__` iterated over each metadata attribute and applied the "default finalizer" (copy from `self` to `new`).

Now we iterate over metadata attributes, look up the finalizer for that attribute, and then apply that finalizer. This gives you potentially different finalization behavior for different attributes (which we need for `.name` vs. `.allows_duplicates`).
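In other words, the loop could look roughly like the sketch below; `apply_finalizers`, `registry`, and `meta.finalize` are hypothetical names used only to illustrate the lookup-then-apply flow described above:

```python
# Illustrative sketch only: how per-attribute finalizers might be applied.
# "registry" and its contents stand in for the machinery in pandas/core/_meta.py.
def apply_finalizers(new, other, method=None, registry=None):
    registry = registry or {}
    for name in new._metadata:
        meta = registry.get(name)
        if meta is None:
            # Old/default behavior: straight copy from the source object.
            object.__setattr__(new, name, getattr(other, name, None))
        else:
            # New behavior: the attribute decides how it propagates for
            # this particular method (e.g. pd.concat vs. a ufunc).
            meta.finalize(new, other, method=method)
    return new
```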
pandas/core/_meta.py
Outdated
finalizer = dispatch.get(key_of(method), {}).get(name) | ||
|
||
if finalizer: | ||
finalizer(new, other) |
should not these return new?
Just a style choice. All these operations are inplace. My hope is that by returning `None`, we make it clearer that you can't return a new object.
I refactored to use a class-based approach, rather than decorators. The weirdest thing is that users need to actually create an instance of their subclass so that we know about it, but don't actually do anything with it.

```python
>>> mymeta = MyMeta("mymeta")
```

Now that I think about it, I guess that we could have a class decorator for registering. So then they do something like

```python
@register_metadata
class MyMeta(PandasMetadata):
    name = "mymeta"
```

And then we just instantiate it when we need it, rather than relying on singletons per …

Edit: If we prefer this class-based approach, then I'll need to write out additional …
Yah, this is a little confusing. What if my Meta subclass can take on multiple values? I'd assume I'd need multiple instances of it.
Do you mean refer to multiple attributes? If so, then yes, that would get strange.

```python
class MyMetaBase(PandasMetadata):
    ...


@register_metadata
class MyMetaA(MyMetaBase):
    name = 'a'


@register_metadata
class MyMetaB(MyMetaBase):
    name = 'b'
```

versus

```python
my_meta_a = MyMetaBase(name='a')
my_meta_b = MyMetaBase(name='b')
```

Neither is great.
I mean different values that a single attribute can take, e.g. …
Hmm OK. Not sure I fully understand, but let me try to clarify. Everything I've put up here has to do with metadata resolution: how do we transfer metadata from the source object(s) to the new object. It looks like your example is getting into specifying metadata values, though I may be misunderstanding. I haven't thought much about changing …
That part of my example was probably unhelpful. If it makes it any clearer, pretend I use something other than …
Sorry, I'm still not following your example.
Changed my mind one more time. I started down a class-based approach where we define …

While that's nice, since it gives a well-defined API for what finalize does, it complicates the implementation of the common case a decent amount. You can see it at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:metadata-dispatch+complex?expand=1

So I've backed off that and simplified this to get things working for my short-term needs (disallow_duplicates). A piece of metadata still has per-method control (pandas will need to be updated to pass …).

Doing some perf checks now.
Do we need this class-based approach? What can you not do with the current implementation?

This is already possible, as …

In principle, you can do this in …
To be clear, I am not opposed to making the implementation nicer, or making it more easily extensible to add custom logic, etc. I just first want to understand the constraints / needed functionality before forming my opinion on the new implementation. An easier way for external people to define custom behaviour for metadata without clashing with pandas (e.g. currently by overriding …)
Class-based isn't necessary. The root thing that I'd like to see is a way for a specific piece of … Since we (pandas) are the one adding …

```diff
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index 68308b2f83..5d63105335 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -5174,7 +5174,11 @@ class NDFrame(PandasObject, SelectionMixin):
         """
         if isinstance(other, NDFrame):
             for name in self._metadata:
-                object.__setattr__(self, name, getattr(other, name, None))
+                if name == 'allows_duplicates':
+                    allows_duplicates = all(getattr(x, name, None) for x in other)
+                    object.__setattr__(self, name, allows_duplicates)
+                else:
+                    object.__setattr__(self, name, getattr(other, name, None))
         return self

     def __getattr__(self, name):
```

I'll have a few more changes propagating metadata for things like _Concatenator, … Given that I'd like to do this for 1.0, should we avoid the larger changes here?
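To spell out the resolution rule that the hardcoded branch above encodes ("any input disallowing duplicates wins"), here is a toy, self-contained illustration; `FakeSeries` is a stand-in, since `allows_duplicates` is only the hypothetical metadata under discussion and not a real Series attribute in this PR:

```python
# Toy illustration of the all(...) resolution rule from the diff above.
class FakeSeries:
    def __init__(self, allows_duplicates):
        self.allows_duplicates = allows_duplicates


inputs = [FakeSeries(True), FakeSeries(False), FakeSeries(True)]

# Mirrors: all(getattr(x, name, None) for x in other)
print(all(x.allows_duplicates for x in inputs))  # False: one input disallowed duplicates
```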
```python
from pandas.core.meta import PandasMetadata


class MyMeta(PandasMetadata):
```
Do you have an implementation for index_allows_duplicates? I'd be more comfortable if there were a more fully fleshed-out example/test.
Hmm, I thought I did but am having trouble finding it right now.
@jorisvandenbossche I put up #28394 for ease of comparison. That'll be the "minimal" approach (no general changes to …)
Right now, I'm going to push forward on #28394. That may give some guidance on how to design a metadata finalization API. Apologies for the wasted review time.