Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: set accessor for Series (WIP) #21547

Closed
wants to merge 2 commits into from

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Jun 19, 2018

Closes #4480, and a fair bit more that was discussed in comments. In particular, it makes the numpy methods NaN-safe, index aware, and adds a bit of wrapping convenience.

This is still a WIP, so no tests yet. Also more work to do on docstrings, whatsnew, and some overview documentation similar to pandas/doc/source/text.rst. The implementation borrows heavily from pandas/core/strings.py.

Before I write any tests etc., I'd like to have a discussion on the API - which functions to add (I only added union/intersection/difference/symmetric difference), how to name them, and what arguments they should have. I've followed the suggestions in #13877 (not yet realised for the .str-accessor) to have an errors- and a fill_value-parameter, as well as the join-parameter from #20347.

Here's a sample API documentation:
4

Here's some more usage examples. The basic idea is to follow the numpy set methods, which work with | & ^ - between arrays and arrays (as well as arrays and singletons), but are not NaN-safe. Basiscs:

s = pd.Series([{1, 2}, {2, 4}, {3, 1}])
s
# 0    {1, 2}
# 1    {2, 4}
# 2    {1, 3}
# dtype: object

s.set.union()  # apply (sequentially) to elements of calling Series
# {1, 2, 3, 4}

s.set.union({2})  # broadcast like "s.values | {2}"
# 0       {1, 2}
# 1       {2, 4}
# 2    {1, 2, 3}
# dtype: object

With another Series:

t = pd.Series([{2, 3}, {1, 2}, np.nan])
t
# 0    {2, 3}
# 1    {1, 2}
# 2       NaN
# dtype: object

s.values | t.values  # remember...
# TypeError

s.set.union(t)
# 0    {1, 2, 3}
# 1    {1, 2, 4}
# 2          NaN
# dtype: object

s.set.union(t, fill_value={5})
# 0    {1, 2, 3}
# 1    {1, 2, 4}
# 2    {1, 3, 5}
# dtype: object

With different indices (fill_value also works for NaNs created by alignment):

u = pd.Series(t.values, index=[1, 2, 3])
u
# 1    {2, 3}
# 2    {1, 2}
# 3       NaN
# dtype: object

s.set.union(u)
# 0          NaN
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# dtype: object

s.set.union(u, join='outer')
# 0          NaN
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# 3          NaN
# dtype: object

s.set.union(u, join='outer', fill_value={5})
# 0    {1, 2, 5}
# 1    {2, 3, 4}
# 2    {1, 2, 3}
# 3          {5}
# dtype: object

Finally, the behaviour of the errors-parameter. Since strings are not list-like, but can be coerced into a set, I made a distinction between 'coerce' and 'wrap', which is the most permissive (but treats strings as singletons).

v = pd.Series([{1, 2}, [2, 3], 'abcd', 4, np.nan])
v
# 0    {1, 2}
# 1    [2, 3]
# 2      abcd
# 3         4
# 4       NaN
# dtype: object

v.set.union(set(), errors='raise')  # default
# ValueError

v.set.union(set(), errors='ignore')
# 0    {1, 2}
# 1    {2, 3}
# 2       NaN
# 3       NaN
# 4       NaN
# dtype: object

v.set.union(set(), errors='coerce')
# 0          {1, 2}
# 1          {2, 3}
# 2    {a, c, b, d}
# 3             NaN
# 4             NaN
# dtype: object

v.set.union(set(), errors='wrap')
# 0    {1, 2}
# 1    {2, 3}
# 2    {abcd}
# 3       {4}
# 4       NaN
# dtype: object

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests - they should be the first thing you write

@codecov
Copy link

codecov bot commented Jun 20, 2018

Codecov Report

Merging #21547 into master will increase coverage by 0.02%.
The diff coverage is 98.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21547      +/-   ##
==========================================
+ Coverage   91.91%   91.94%   +0.02%     
==========================================
  Files         153      154       +1     
  Lines       49546    49716     +170     
==========================================
+ Hits        45542    45709     +167     
- Misses       4004     4007       +3
Flag Coverage Δ
#multiple 90.34% <98.05%> (+0.02%) ⬆️
#single 41.78% <33.76%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.2% <100%> (+0.02%) ⬆️
pandas/core/sets.py 98.01% <98.01%> (ø)
pandas/core/arrays/categorical.py 95.63% <0%> (-0.06%) ⬇️
pandas/core/generic.py 96.12% <0%> (-0.01%) ⬇️
pandas/core/indexes/base.py 96.62% <0%> (ø) ⬆️
pandas/core/indexes/category.py 97.28% <0%> (+0.18%) ⬆️
pandas/core/dtypes/cast.py 88.49% <0%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5fbb683...8c98d30. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

@jreback IMO there's something that needs to come before tests, and that's having a good idea about the API (which I can't decide by myself; hence the discussion here). Otherwise, a lot of time is sunk in writing obsolete tests.

Could you please comment on your thoughts for the API choices so far? The rendered docstring should give a good starting point.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jun 20, 2018

@jreback I added over 200 tests (including parametrization almost everywhere) that cover everything I could think of. It's hopefully modular enough that renaming/changing the API won't be too much of a hassle.

I also added the same 'errors'/'fill_value' parameters to set.len, which I could therefore refactor (and cut out a lot of code I had pasted from pandas/core/strings.py).

@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

at a glance it looks ok, but you have way too much complexity here. all of the errors options and the fill_value. some simpler errors are ok, but still to the coerce, ignore, raise options.

Furthermore this is fundamentally flawed in that you can use this with anything, in reality this should be be restricted to an inferred_dtype=='set', so that's the first thing to change.

@jorisvandenbossche
Copy link
Member

To be the devil's advocate here: given the new accessor registry api: isn't this a perfect case for having this in an external package?

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jun 21, 2018

@jreback I don't care much if it's only 'raise'|'ignore'|'coerce', just added 'wrap' for convenience and 'skip' for performance. The 'errors'/'fill_value' pair is used in many places, and propsed for .str-methods in #13877, so it made sense for me to emulate that. fill_value has negligible complexity attached, BTW.

To me, the whole point of giving an 'errors'-parameter is to be able to use it with anything, and whether it's lists or strings or whatnot, this should be made available IMO. Especially since it's such a bitch to prepare real-world data to a point where it's just sets (not least because fillna doesn't take lists or sets).

Aside from the templatization to avoid repetition, I went out of my way to keep complexity at a minimum. Could you tell me what you find complex? There's one clean-up method for Series, one clean-up for others, one method for executing the numpy method (and one putting the pieces together), all nicely segregated...

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche I maybe don't fully understand the implications/requirements of accessor registry, but since this is really just a wrapper to make the existing numpy methods nan-safe and index-aware, I think this should be part of the core, and not an external package?

@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

@h-vetinari
pls read my last point, which is the main point. sets are not a first class type, that is the first problem to address.

@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

@h-vetinari in reality this is much better done as a real ExtensionArray type. The point of the accessors is to extend with dtype specific functionaility. But first you have to have actual set operations supported.

Furthermore these are quite non-performant.

@h-vetinari
Copy link
Contributor Author

@jreback To which I responded as follows

To me, the whole point of giving an 'errors'-parameter is to be able to use it with anything, [...]

I agree this is an important discussion - but there's normally lots of preprocessing necessary to get this to work with dirty real-world data, and every restriction makes it harder to actually transform real data. Something about consenting adults and all that. ;-)

@jreback
Copy link
Contributor

jreback commented Jun 21, 2018

agree this is an important discussion - but there's normally lots of preprocessing necessary to get this to work with dirty real-world data, and every restriction makes it harder to actually transform real data. Something about consenting adults and all that. ;-)

sure but that's what pandas does. set intelligent defaults.

The issue I have is that this is the very first thing that works on containers inside a pandas object. Everything else is a scalar. Sure its 'ok' now, you 'can' use whatever you want but there is no first class support. So need to think about best way / if to provide this support.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jun 21, 2018

@jreback
Re:

@h-vetinari in reality this is much better done as a real ExtensionArray type. The point of the accessors is to extend with dtype specific functionaility. But first you have to have actual set operations supported.

But then there are no methods to transform the data to a point where the accessor can be used. At least .str has .astype(str). Of course, then the 'errors'/'fill_value' kwargs could be positioned in a potential .astype(set, errors, fill_value) call, which does have a certain appeal, I admit.

Furthermore these are quite non-performant.

This is not a pressing issue IMO; the numpy methods are orders of magnitude faster than writing for-loops already.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jun 21, 2018

@jreback

The issue I have is that this is the very first thing that works on containers inside a pandas object. Everything else is a scalar. Sure its 'ok' now, you 'can' use whatever you want but there is no first class support. So need to think about best way / if to provide this support.

Lots of things work on containers already, like .str.len(), .str[slice], for lists etc (yes, this is deliberately misusing the .str-accessor, but it works). Also, .str.extract and str.split with expand=False yield lists in Series, for example.

And people are using pandas in this way a lot - for clean data, one can just call map(set), map(list), 'map(len)` etc. and continue, but often the clean-up is the biggest headache. Granted, sometimes it's just an intermediate processing step on the way to a DF or a MultiIndex, but sometimes its genuinely a set/list that needs per-row processing. There's several SO questions on this, for example.

@gfyoung gfyoung added Enhancement Dtype Conversions Unexpected or buggy dtype conversions labels Jun 21, 2018
@h-vetinari
Copy link
Contributor Author

@jreback How do you propose to continue with this? As I said above, I like the idea of:

  • a set dtype
  • having all the conversion logic in .astype(set, errors, fill_value)
  • having the .set accessor only for that dtype, and then the methods don't need the kwargs themselves

But the problem is, I have no idea what creating a new dtype would imply. It would be easy enough to write the code for .astype(set, errors, fill_value), but I'm guessing that's not nearly enough.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jun 22, 2018

@jreback looking at #21160 and related PRs for example, that looks to be a much too tall of an order for me...

Similarly for #8640 and the work you refer to in your last comment:

nice library by xhochy https://github.com/xhochy/fletcher

@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

http://pandas-docs.github.io/pandas-docs-travis/extending.html#extension-types

this should be an extension array. the actual impl is pretty easy, it could just e a list of sets. The problem is adding an accessor to disambiguate an object dtype is a bit odd. I am not totally against it, but having a defined dtype is much much cleaner.

@h-vetinari
Copy link
Contributor Author

@jreback Not sure I follow completely - I'm afraid I'm getting a bit confused between implementing an EA (your link; apparently not so hard) and implementing a new dtype set (very hard?).

Further up you said:

[...] this should be be restricted to an inferred_dtype=='set', so that's the first thing to change.

which sounds like implementing a new dtype, and now:

[...] adding an accessor to disambiguate an object dtype is a bit odd.

which sounds (more) like the version I already implemented (or the same thing translated to EA)? Again, it's not clear to me what the requirements/recommendations are at this point.

As a side note, I think that sets are such a fundamental building block in python (as opposed to IP addresses in cyperpandas or gps coordinates in geopandas), that the methods for them shouldn't be externalised to some package - this is what the EA documentation says, if I understand correctly.

@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

@h-vetinari whether this is external or not is slightly orthogonal to this discussion, the point of EA is that it can be external (or maybe start that way).

EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts)

@h-vetinari h-vetinari mentioned this pull request Aug 16, 2018
@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

closing as out of scope. This should be an EA type (could potentially be internal), but needs to be restructured to that.

@Florents-Tselai
Copy link

I also needed something like this, but as @jorisvandenbossche said, initially makes sense to have it as an extension. pip install pandas-sets

You can have a look here https://github.com/Florents-Tselai/pandas-sets - related post: https://tselai.com/pandas-sets.html.

The implementation is very primitive (and based on StringMethods) , but it does the job.

May come back later with a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants