REF/ENH: Refactor NDFrame finalization #28334

TomAugspurger · 2019-09-07T19:01:13Z

In preparation for #27108
(disallowing duplicates), we need to enhance our metadata propagation.

We need a way for a particular attribute to determine how it's
propagated for a particular method. Our current method of metadata
propagation lacks two features:

1. It only copies an attribute from a source NDFrame to a new NDFrame.
   There is no way to propagate metadata from a collection of NDFrames
   (say from pd.concat) to a new NDFrame.
2. It only and always copies the attribute. This is not always
   appropriate when dealing with a collection of input NDFrames, as the
   source attributes may differ. The resolution of conflicts will differ
   by attribute: for Series.name we might throw away the name, while for
   Series.allow_duplicates, any Series disallowing duplicates should
   mean the output disallows duplicates.
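
As a rough illustration of the second point, the two attributes mentioned above could resolve conflicts differently when several inputs are combined. The helper names `resolve_name` and `resolve_allows_duplicates` are hypothetical and not part of pandas, and "keep the name only when all inputs agree" is just one possible reading of "throw away the name".

```python
from typing import Hashable, Iterable, Optional


def resolve_name(names: Iterable[Optional[Hashable]]) -> Optional[Hashable]:
    """Keep the name only if every input agrees on it; otherwise drop it."""
    unique = set(names)
    return unique.pop() if len(unique) == 1 else None


def resolve_allows_duplicates(flags: Iterable[bool]) -> bool:
    """The output allows duplicates only if every input allows them."""
    return all(flags)


print(resolve_name(["a", "a"]))                  # a
print(resolve_name(["a", "b"]))                  # None
print(resolve_allows_duplicates([True, False]))  # False
```

The point is only that a single "copy the attribute" rule cannot express both behaviours.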

- Closes DOC/API: document how to use metadata #8572
- Update existing calls to __finalize__ to pass through method
- Profile / run ASV

TomAugspurger · 2019-09-07T19:11:28Z

cc @jreback @jbrockmendel, this is the start of what I need for #27108 (comment). Briefly, adding a .allows_duplicate_labels attribute to NDFrame, which will be propagated through all operations. That already works for most things, but it's dropped for operations involving multiple inputs (concat, binops, etc.).

This PR puts the infrastructure in place to support this. We aren't using it yet (outside of tests), but will when I make the duplicate labels PR.

duplicate_labels_meta = PandasMetadata("allows_duplicates")


@duplicate_labels_meta.register(pd.concat)
def _(new, concatenater):
    new.allows_duplicate_labels = all(x.allows_duplicate_labels for x in concatenater.objs)

@duplicate_labels_meta.register(Series.__array_ufunc__)
def _(new, inputs):
    new.allows_duplicate_labels = ...

But I split this off so that PR will be smaller.
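
To make the registration snippet above concrete, here is a minimal, self-contained sketch of how such a decorator-based registry might work. It is not the implementation from this PR: the `PandasMetadata` class below is a toy stand-in, methods are keyed by a plain string (`"concat"`) rather than by the function object, and `SimpleNamespace` objects stand in for NDFrames and for the concatenator.

```python
from types import SimpleNamespace


class PandasMetadata:
    """Toy stand-in for the proposed class; not the PR's implementation."""

    def __init__(self, name):
        self.name = name
        self._finalizers = {}  # method key -> finalizer(new, other)

    def register(self, method):
        """Decorator that stores a finalizer for one method key."""
        def decorator(func):
            self._finalizers[method] = func
            return func
        return decorator

    def finalize(self, new, other, method=None):
        """Run the finalizer registered for `method`, else copy the attribute."""
        finalizer = self._finalizers.get(method)
        if finalizer is not None:
            finalizer(new, other)
        else:
            setattr(new, self.name, getattr(other, self.name, None))


duplicate_labels_meta = PandasMetadata("allows_duplicate_labels")


@duplicate_labels_meta.register("concat")
def _(new, concatenator):
    # The output allows duplicates only if every input does.
    new.allows_duplicate_labels = all(
        obj.allows_duplicate_labels for obj in concatenator.objs
    )


a = SimpleNamespace(allows_duplicate_labels=True)
b = SimpleNamespace(allows_duplicate_labels=False)
result = SimpleNamespace()
duplicate_labels_meta.finalize(result, SimpleNamespace(objs=[a, b]), method="concat")
print(result.allows_duplicate_labels)  # False
```

Methods without a registered finalizer fall back to the plain copy, which matches the existing single-source behaviour.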

WillAyd · 2019-09-07T19:29:55Z

pandas/core/_meta.py

+from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
+
+if TYPE_CHECKING:
+    from pandas.core.generic import NDFrame


I think you want FrameOrSeries from pandas._typing

jbrockmendel · 2019-09-07T23:23:21Z

Not necessarily a precursor to this, but it would be nice if we had a systematic way to make sure we were using _constructor and __finalize__ in all the appropriate places.

jreback · 2019-09-08T14:58:45Z

pandas/core/_meta.py

+
+        Parameters
+        ----------
+        pandas_method : callable or str


this looks like you can register a single finalizer? but we already have internal ones, shouldn't this just append to a list of finalizers? how is the default done if we have 1 or more finalizers?

The idea was to register one finalizer per pandas method. I think Brock's subclass-based approach will make this clearer.

Pandas will provide a default implementation, which the subclass can override.

> how is the default done if we have 1 or more finalizers?

Previously, __finalize__ iterated over each metadata attribute and applied the "default finalizer" (copy from self to new).

Now we iterate over metadata attributes, look up the finalizer for that attribute, and then apply that finalizer. This gives you potentially different finalization behavior for different attributes (which we need for .name vs. .allows_duplicates).
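
The difference between the two loops can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `FINALIZERS`, `default_finalizer`, and `drop_name` are hypothetical names, and `SimpleNamespace` stands in for an NDFrame.

```python
from types import SimpleNamespace


def default_finalizer(new, other, name):
    """Old behaviour: copy the attribute from the source to the new object."""
    setattr(new, name, getattr(other, name, None))


def drop_name(new, other, name):
    """Example of an attribute-specific rule: always discard the value."""
    setattr(new, name, None)


# Hypothetical per-attribute table; attributes without an entry use the default.
FINALIZERS = {"name": drop_name}


def finalize(new, other, metadata):
    """Look up the finalizer for each attribute and apply it."""
    for name in metadata:
        finalizer = FINALIZERS.get(name, default_finalizer)
        finalizer(new, other, name)
    return new


source = SimpleNamespace(name="x", allows_duplicates=False)
new = finalize(SimpleNamespace(), source, ["name", "allows_duplicates"])
print(new.name, new.allows_duplicates)  # None False
```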

jreback · 2019-09-08T14:59:35Z

pandas/core/_meta.py

+        finalizer = dispatch.get(key_of(method), {}).get(name)
+
+        if finalizer:
+            finalizer(new, other)


shouldn't these return new?

Just a style choice. All these operations are inplace. My hope is that by returning None, we make it clearer that you can't return a new object.

TomAugspurger · 2019-09-09T14:03:17Z

> it would be nice if we had a systematic way to make sure we were using _constructor and __finalize__ in all the appropriate places.

I refactored to use a class-based approach, rather than decorators. The weirdest thing is that users need to actually create an instance of their subclass so that we know about it, but don't actually do anything with it.

>>> mymeta = MyMeta("mymeta")

Now that I think about it, I guess that we could have a class decorator for registering. So then they do something like

@register_metadata
class MyMeta(PandasMetadata):
    name = "mymeta"

And then we just instantiate it when we need it, rather than relying on singletons per name.

Edit: If we prefer this class-based approach, then I'll need to write out additional finalize_* methods, one per method calling __finalize__. This will give a nice API on what calls __finalize__ with which types.

jbrockmendel · 2019-09-09T14:46:14Z

> The weirdest thing is that users need to actually create an instance of their subclass so that we know about it, but don't actually do anything with it.

Yah, this is a little confusing. What if my Meta subclass can take on multiple values? I'd assume I'd need multiple instances of it.

TomAugspurger · 2019-09-09T15:07:30Z

> What if my Meta subclass can take on multiple values

Do you mean refer to multiple attributes? If so, then yes, that would get strange.

class MyMetaBase(PandasMetadata):
    ...

@register_metadata
class MyMetaA(MyMetaBase):
    name = 'a'

@register_metadata
class MyMetaB(MyMetaBase):
    name = 'b'

versus

my_meta_a = MyMetaBase(name='a')
my_meta_b = MyMetaBase(name='b')

neither is great.

jbrockmendel · 2019-09-09T15:23:19Z

> Do you mean refer to multiple attributes? If so, then yes, that would get strange.

I mean different values that a single attribute can take. e.g.

class AllowsDuplicateLabels(PandasMetadata):
    def __init__(self, allows):
        self.allows = allows

    def __bool__(self):
        return self.allows

    [...]


ser = pd.Series(range(3))
ser._metadata["allows"] = AllowsDuplicateLabels(False)

ser2 = pd.Series(range(4))
ser2._metadata["allows"] = AllowsDuplicateLabels(True)

>>> ser.index = [1, 1, 1]
ValueError: Series does not allow duplicate labels.

TomAugspurger · 2019-09-09T15:28:23Z

Hmm OK. Not sure I fully understand, but let me try to clarify.

Everything I've put up here has to do with metadata resolution: how do we transfer metadata from the source object(s) to the new object? It looks like your example is getting into specifying metadata values, though I may be misunderstanding. I haven't thought much about changing NDFrame._metadata from a list of strings specifying attributes.

jbrockmendel · 2019-09-09T15:34:21Z

> I haven't thought much about changing NDFrame._metadata from a list of strings specifying attributes.

That part of my example was probably unhelpful. If it makes it any clearer, pretend I use something other than _metadata that doesn't already exist.

TomAugspurger · 2019-09-10T16:15:22Z

Sorry, I'm still not following your example.

TomAugspurger · 2019-09-10T17:23:50Z

Changed my mind one more time. I started down a class based approach where we define

class PandasMeta:
    def copy(self, new, other): ...
    def concat(self, new, other): ...
    def sort_index(self, new, other): ...

While that's nice, since it gives a well-defined API for what __finalize__ does, it complicates the implementation of the common case a decent amount. You can see it at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:metadata-dispatch+complex?expand=1

So I've backed off that and simplified this to get things working for my short-term needs (disallow_duplicates). A piece of metadata still has per-method control (pandas will need to be updated to pass method= in more places).

Doing some perf checks now.

jorisvandenbossche · 2019-09-11T07:57:08Z

Do we need this class-based approach? What can you not do with the current implementation?

> It only copies an attribute from a source NDFrame to a new NDFrame.
> There is no way to propagate metadata from a collection of NDFrames
> (say from pd.concat) to a new NDFrame.

This is already possible, as __finalize__ gets passed the _Concatenator object, which has access to all frames passed to concat.
This is what we do in geopandas (https://github.com/geopandas/geopandas/blob/29add0a735b00dc20c79e0fccc8e6a775c4997b0/geopandas/geodataframe.py#L561-L574)

> It only and always copies the attribute. This is not always
> appropriate when dealing with a collection of input NDFrames, as the
> source attributes may differ. The resolution of conflicts will differ
> by attribute

In principle, you can do this in __finalize__ as well. You have access to the name of the attribute, so you can implement different logic for different attributes. It might not necessarily be nice to have to put all of this in __finalize__, and another approach could be cleaner. But I don't fully understand what functionality you are trying to add.
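
The approach described here, handling both the collection case and per-attribute logic inside a single `__finalize__` override, might look roughly like the sketch below. `Frame` is a toy stand-in rather than a pandas or geopandas class, the `crs` attribute and both resolution rules are purely illustrative, and the concatenator is assumed to expose its inputs as `.objs`.

```python
from types import SimpleNamespace


class Frame:
    """Toy stand-in for an NDFrame-like object; not the pandas class."""

    _metadata = ["name", "crs"]

    def __init__(self, name=None, crs=None):
        self.name = name
        self.crs = crs

    def __finalize__(self, other, method=None):
        if method == "concat":
            # `other` is assumed to expose the inputs as `other.objs`.
            sources = list(other.objs)
        else:
            sources = [other]
        for attr in self._metadata:
            values = [getattr(src, attr, None) for src in sources]
            first = values[0] if values else None
            if attr == "crs":
                # Attribute-specific rule: take the first input's value.
                resolved = first
            else:
                # Generic rule: keep the value only when all inputs agree.
                agree = all(v == first for v in values)
                resolved = first if agree else None
            setattr(self, attr, resolved)
        return self


inputs = [Frame("a", "EPSG:4326"), Frame("b", "EPSG:3857")]
out = Frame().__finalize__(SimpleNamespace(objs=inputs), method="concat")
print(out.name, out.crs)  # None EPSG:4326
```

This works, but every new attribute adds another branch to the same method, which is the maintainability concern raised above.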

jorisvandenbossche · 2019-09-11T08:03:18Z

To be clear, I am not opposed to making the implementation nicer, or making it more easily extensible to add custom logic, etc. I just first want to understand the constraints / needed functionality, before forming my opinion on the new implementation.

An easier way for external people to define custom behaviour for metadata without clashing with pandas would certainly be very valuable! (Currently this is done by overriding __finalize__; if pandas added custom logic for an allow_duplicates attribute, the current implementation of geopandas would override that, I think.)

TomAugspurger · 2019-09-11T20:41:38Z

Class-based isn't necessary.

The root thing that I'd like to see is a way for a specific piece of
metadata (Series.name, NDFrame.allows_duplicates) to dictate how it's
resolved. Right now, all attributes are handled the same (copy from the source
to the new when the source is an NDFrame).

Since we (pandas) are the ones adding NDFrame.allows_duplicates, we can of
course just do that handling ourselves in NDFrame.__finalize__. Roughly

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index 68308b2f83..5d63105335 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -5174,7 +5174,11 @@ class NDFrame(PandasObject, SelectionMixin):
         """
         if isinstance(other, NDFrame):
             for name in self._metadata:
-                object.__setattr__(self, name, getattr(other, name, None))
+                if name == 'allows_duplicates':
+                    allows_duplicates = all(getattr(x, name, None) for x in other)
+                    object.__setattr__(self, name, allows_duplicates)
+                else:
+                    object.__setattr__(self, name, getattr(other, name, None))
         return self
 
     def __getattr__(self, name):

I'll have a few more changes propagating metadata for things like _Concatenator,
and I'll want to pass different metadata in places (Series.__array_ufunc__
should pass inputs). But that's the basic idea.

Given that I'd like to do this for 1.0, should we avoid the larger changes here?
Certainly, adding special handling for allows_duplicates will let us establish
the desired behavior on master, guiding a later refactor.

jbrockmendel · 2019-09-11T21:57:10Z

pandas/tests/generic/test_metadata.py

+from pandas.core.meta import PandasMetadata
+
+
+class MyMeta(PandasMetadata):


Do you have an implementation for index_allows_duplicates? I'd be more comfortable if there were a more fully fleshed-out example/test.

Hmm, I thought I did but am having trouble finding it right now.

TomAugspurger · 2019-09-11T22:13:45Z

@jorisvandenbossche I put up #28394 for ease of comparison. That'll be the "minimal" approach (no general changes to NDFrame.__finalize__)

TomAugspurger · 2019-09-13T18:35:11Z

Right now, I'm going to push forward on #28394. That may give some guidance on how to design a metadata finalization API. Apologies for the wasted review time.

TomAugspurger added the metadata label Sep 7, 2019

WillAyd reviewed Sep 7, 2019

jbrockmendel reviewed Sep 7, 2019

jreback requested changes Sep 8, 2019

TomAugspurger added 2 commits September 9, 2019 08:58

refactor

d7bb99c

Merge remote-tracking branch 'upstream/master' into metadata-dispatch

60bc89c

move to meta

b05782c

mypy

53576eb

TomAugspurger added 2 commits September 9, 2019 13:23

fixed subclass

3009732

Merge remote-tracking branch 'upstream/master' into metadata-dispatch

d68e5bb

TomAugspurger added 2 commits September 10, 2019 11:58

define API

710d73a

simplify

ecf3989

TomAugspurger changed the title ~~[WIP]REF/ENH: Refactor NDFrame finalization~~ REF/ENH: Refactor NDFrame finalization Sep 10, 2019

jbrockmendel reviewed Sep 11, 2019

TomAugspurger mentioned this pull request Sep 11, 2019

Optionally disallow duplicate labels #28394

Merged

6 tasks

TomAugspurger closed this Sep 13, 2019
