Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/ENH: Refactor NDFrame finalization #28334

Closed

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 7, 2019

In preperation for #27108
(disallowing duplicates), we need to enhance our metadata propagation.

We need a way for a particiular attribute to deterimine how it's
propagated for a particular method
. Our current method of metadata
propagation lacked two features

  1. It only copies an attribute from a source NDFrame to a new NDFrame.
    There is no way to propagate metadata from a collection of NDFrames
    (say from pd.concat) to a new NDFrame.
  2. It only and always copies the attribute. This is not always
    appropriate when dealing with a collection of input NDFrames, as the
    source attributes may differ. The resolution of conflicts will differ
    by attribute (for Series.name we might throw away the name. For
    Series.allow_duplicates, any Series disallowing duplicates should
    mean the output disallows duplicates)

In preperation for pandas-dev#27108
(disallowing duplicates), we need to enhance our metadata propagation.

*We need a way for a particiular attribute to deterimine how it's
propagated for a particular method*. Our current method of metadata
propagation lacked two features

1. It only copies an attribute from a source NDFrame to a new NDFrame.
   There is no way to propagate metadata from a collection of NDFrames
   (say from `pd.concat`) to a new NDFrame.
2. It only and always copies the attribute. This is not always
   appropriate when dealing with a collection of input NDFrames, as the
   source attributes may differ. The resolution of conflicts will differ
   by attribute (for `Series.name` we might throw away the name. For
   `Series.allow_duplicates`, any Series disallowing duplicates should
   mean the output disallows duplicates)
@TomAugspurger TomAugspurger added the metadata _metadata, .attrs label Sep 7, 2019
@TomAugspurger
Copy link
Contributor Author

cc @jreback @jbrockmendel, this is the start of what I need for #27108 (comment). Briefly, adding a .allows_duplicate_labels attribute to NDFrame, which will be propagated through all operations. That already works for most things, but it's dropped for operations involving multiple inputs (concat, binops, etc.).

This PR puts the infrastructure in place to support this. We aren't using it yet (outside of tests), but will when I make the duplicate labels PR.

duplicate_labels_meta = PandasMetadata("allows_duplicates")


@duplicate_labels_meta.register(pd.concat)
def _(new, concatenater):
    new.allows_duplicate_labels = all(x.allows_duplicate_labels for x in concatenater.objs)

@duplicate_labels_meta.register(Series.__array_ufunc__)
def _(new, inputs):
    new.allows_duplicate_labels = ...

But I split this off so that PR will be smaller.

from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries

if TYPE_CHECKING:
from pandas.core.generic import NDFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want FrameOrSeries from pandas._typing

pandas/core/_meta.py Outdated Show resolved Hide resolved
pandas/core/_meta.py Outdated Show resolved Hide resolved
pandas/core/_meta.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member

Not necessarily a precursor to this, but it would be nice if we had a systematic way to make sure we were using _constructor and __finalize__ in all the appropriate places.

pandas/core/_meta.py Outdated Show resolved Hide resolved

Parameters
----------
pandas_method : callable or str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like you can register a single finalizer? but we already have internal ones, shouldn't this just append to a list of finalizers? how is the default done if we have 1 or more finalizers?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Sep 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to register one finalizer per pandas method. I Brock's subclassed based approach will make this clearer.

Pandas will provide a default implementation, which the subclass can override.

how is the default done if we have 1 or more finalizers?

Previously, the __finalize__ iterated over each metadata and applied the "default finalizer" (copy from self to new).

Now we iterate over metadata attributes, look up the finalizer for that attribute, and then apply that finalizer. This gives you potentially different finalization behavior for different attributes (which we need for .name vs. .allows_duplicates).

finalizer = dispatch.get(key_of(method), {}).get(name)

if finalizer:
finalizer(new, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not these return new?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a style choice. All these operations are inplace. My hope is that by returning None, we make it clearer that you can't return a new object.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 9, 2019

it would be nice if we had a systematic way to make sure we were using _constructor and finalize in all the appropriate places.

I refactored to use a class-based approach, rather than decorators. The weirdest thing is that users need to actually create an instance of their subclass so that we know about it, but don't actually do anything with it.

>>> mymeta = MyMeta("mymeta")

Now that I think about it, I guess that we could have a class decorator for registering. So then they do something like

@register_metadata
class MyMeta(PandasMetadata):
    name = "mymeta"

And then we just instantiate it when we need it, rather than relying on singletons per name.

Edit: If we prefer this class-based approach, then I'll need to write out additional finalize_* methods, one per method calling finalize. This will give a nice API on what calls __finalize__ with which types.

@jbrockmendel
Copy link
Member

The weirdest thing is that users need to actually create an instance of their subclass so that we know about it, but don't actually do anything with it.

Yah, this is a little confusing. What if my Meta subclass can take on multiple values, I'd assume I'd need multiple instances of it.

@TomAugspurger
Copy link
Contributor Author

What if my Meta subclass can take on multiple values

Do you mean refer to multiple attributes? If so, then yes, that would get strange.

class MyMetaBase(PandasMetadata):
    ...

@register_metdata
class MyMyetaA(MyMetaBase):
    name = 'a'

@register_metadata
class MyMetaB(MyMetaBase):
    name = 'b

versus

my_meta_a = MyMetaBase(name='a')
my_meta_b = MyMetaBase(name='b')

neither is great.

@jbrockmendel
Copy link
Member

Do you mean refer to multiple attributes? If so, then yes, that would get strange.

I mean different values that a single attribute can take. e.g.

class AllowsDuplicateLabels(PandasMetadata):
    def __init__(self, allows):
        self.allows = allows

    def __bool__(self):
        return self.allows

    [...]


ser = pd.Series(range(3)
ser._metadata["allows"] = AllowsDuplicateLabels(False)

ser2 = pd.Series(range(4))
ser2._metadata["allows"] = AllowsDuplicateLabels(True)

>>> ser.index = [1, 1, 1]
ValueError: Series does not allow duplicate labels.

@TomAugspurger
Copy link
Contributor Author

Hmm OK. Not sure I fully understand, but let me try to clarify.

Everything I've put up here as to do with metadata resolution: How do we transfer metadata from the source object(s) to the new object. It looks like your example is getting into specifying metadata values, though I may be misunderstanding. I haven't thought much about changing NDFrame._metadata from a list of strings specifying attributes.

@jbrockmendel
Copy link
Member

I haven't thought much about changing NDFrame._metadata from a list of strings specifying attributes.

That part of my example was probably unhelpful. If it makes it any clearer, pretend I use something other than _metadata that doesn't already exist.

@TomAugspurger
Copy link
Contributor Author

Sorry, I'm still not following your example.

@TomAugspurger
Copy link
Contributor Author

Changed my mind one more time. I started down a class based approach where we define

class PandasMeta:
    def copy(self, new, other):
    def concat(self, new, other):
    def sort_index(self, new, other):

While that's nice, since it gives a well-defined API for what finalize does, it complicates the implementation of the common case a decent amount. You can see it at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:metadata-dispatch+complex?expand=1

So I've backed off that and simplified this to get things working for my short-term needs (disallow_duplicates). A piece of metadata still has per-method control (pandas will need to be updated to pass method= in more places).

Doing some perf checks now.

@TomAugspurger TomAugspurger changed the title [WIP]REF/ENH: Refactor NDFrame finalization REF/ENH: Refactor NDFrame finalization Sep 10, 2019
@jorisvandenbossche
Copy link
Member

Do we need this class-based approach? What can you not do with the current implementation?

It only copies an attribute from a source NDFrame to a new NDFrame.
There is no way to propagate metadata from a collection of NDFrames
(say from pd.concat) to a new NDFrame.

This is already possible, as __finalize__ gets passed the _concatenator object, which has access to all frames passed to concat.
This is what we do in geopandas (https://github.com/geopandas/geopandas/blob/29add0a735b00dc20c79e0fccc8e6a775c4997b0/geopandas/geodataframe.py#L561-L574)

It only and always copies the attribute. This is not always
appropriate when dealing with a collection of input NDFrames, as the
source attributes may differ. The resolution of conflicts will differ
by attribute

In principle, you can do this in __finalize__ as well. You have access to the name of the attribute, so you can implement different logic for different attributes. It might not necessarily nice to have to put this all in __finalize__, and another approach could be cleaner. But I don't fully understand what functionality you are trying to add.

@jorisvandenbossche
Copy link
Member

To be clear, I am not opposed to making the implementation nicer, or making it more easily extensible to add custom logic, etc. I just first want to understand the constraints / needed functionality, before forming my opinion on the new implementation.

An easier way for external people to define custom behaviour for metadata without clashing with pandas (eg currently by overriding __finalize__, if pandas would add custom logic for an allow_duplicates, the current implementation of geopandas would override that I think), would certainly be very valuable!

@TomAugspurger
Copy link
Contributor Author

Class-based isn't necessary.

The root thing that I'd like to see is a way for a specific piece of
metadata (Series.name, NDFrame.allows_duplicates) to dictate how it's
resolved. Right now, all attributes are handled the same (copy from the source
to the new when the source is an NDFrame).

Since we (pandas) are the one adding NDFrame.allows_duplicates, we can of
course just do that handling ourself in NDFrame.__finalize__. Roughly

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index 68308b2f83..5d63105335 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -5174,7 +5174,11 @@ class NDFrame(PandasObject, SelectionMixin):
         """
         if isinstance(other, NDFrame):
             for name in self._metadata:
-                object.__setattr__(self, name, getattr(other, name, None))
+                if name == 'allows_duplicates':
+                    allows_duplicates = all(getattr(x, name, None) for x in other)
+                    object.__setattr__(self, name, allows_duplicates)
+                else:
+                    object.__setattr__(self, name, getattr(other, name, None))
         return self
 
     def __getattr__(self, name):

I'll have a few more changes propagating metdata for things like _Concatenator,
and I'll want to pass different metadata in places (Series.__array_ufunc__
should pass inputs). But that's the basic idea.

Given that I'd like to do this for 1.0, should we avoid the larger changes here?
Certainly, adding special handling for allows_duplicates will let us establish
the desired behavior on master, guiding a later refactor.

from pandas.core.meta import PandasMetadata


class MyMeta(PandasMetadata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an implementation for index_allows_duplicates? I'd be more comfortable if there were a more fully fleshed-out example/test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I thought I did but am having trouble finding it right now.

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche I put up #28394 for ease of comparison. That'll be the "minimal" approach (no general changes to NDArray.__finalize__)

@TomAugspurger
Copy link
Contributor Author

Right now, I'm going to push forward on #28394. That may give some guidance on how to design a metadata finalization API. Apologies for the wasted review time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata _metadata, .attrs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants