Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy Rank Operations With Infinity Incorrect #20561

Closed
WillAyd opened this issue Mar 30, 2018 · 5 comments · Fixed by #20681
Closed

GroupBy Rank Operations With Infinity Incorrect #20561

WillAyd opened this issue Mar 30, 2018 · 5 comments · Fixed by #20681
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@WillAyd
Copy link
Member

WillAyd commented Mar 30, 2018

xref #20091 @peterpanmj

In [51]: df = pd.DataFrame([1, np.nan, np.inf, -np.inf, 25])

In [52]: df['key'] = 'foo'

In [53]: df.groupby("key").rank()      # not working properly for infinity
Out[53]:
     0
0  2.0
1  NaN
2  NaN
3  1.0
4  3.0

In [54]: df.rank()     # this is ok
Out[54]:
     0  key
0  2.0  3.0
1  NaN  3.0
2  4.0  3.0
3  1.0  3.0
4  3.0  3.0

In [55]: df.groupby("key").apply(lambda x:x.rank())   # this is also ok
Out[55]:
     0  key
0  2.0  3.0
1  NaN  3.0
2  4.0  3.0
3  1.0  3.0
4  3.0  3.0
@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations Difficulty Intermediate labels Apr 1, 2018
@jreback jreback added this to the Next Major Release milestone Apr 1, 2018
@peterpanmj
Copy link
Contributor

At line 723, groupby_helper.pxi

            if (i == N - 1 or (
                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) and not
                    (mask[_as[i]] and mask[_as[i+1]]))):

This is the cause of it. We cannot separate infs from nans.
When inf and nan are stacked together after the lexsort, this logic won't catch the transition point,
It is always True that, Masked_vals[_as[i]] == masked_vals[_as[i+1]], because they are all infinity (one real inf, one filled inf).

@WillAyd
Copy link
Member Author

WillAyd commented Apr 1, 2018

@peterpanmj doesn't the mask array still differentiate between np.nan and np.inf? Your comment is correct in that the masked_vals array obfuscates that different because it uses np.inf as a fill value, but I'm assuming mask would still help you distinguish the two, right?

@peterpanmj
Copy link
Contributor

peterpanmj commented Apr 2, 2018

Yes, mask should distinguish the two. However the code combines the two conditions with an and. When
not (mask[_as[i]] and mask[_as[i+1]]) is True at the transition point from real infs to filled infs , (masked_vals[_as[i]] != masked_vals[_as[i+1]]) is False. The result is a False. It missed it.

@WillAyd
Copy link
Member Author

WillAyd commented Apr 2, 2018

So would it make sense to have those be separate conditions then? Something along the lines of

if masked_vals[_as[i]] != masked_vals[_as[i+1]]:  
    if mask[_as[i]] or mask[_as[i+1]:
        # Either nan or inf here...
        mask[_as[i]] and mask[_as[i+1]:
            # This would mean both are nan
        else:
            # One is inf
    else:
        # Not dealing with nan or inf

Not saying the above is the right approach and couldn't be refactored but think its in the right direction. All of the elements are there so just need to piece together appropriately

@peterpanmj
Copy link
Contributor

            if (i == N - 1 or         # condition 1
                    (masked_vals[_as[i]] != masked_vals[_as[i+1]]) or   # condition 2
                    (mask[_as[i]] ^ mask[_as[i+1]])):   # condition 3
  1. End of Arrary,
  2. A new value occurs
  3. Transition from nans to not-nans ( or the reverse depends on na_option and ascending)
    I think condition 2 and 3 are independent to each others. Haven't tested it though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants