Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Copy categorical codes if empty (fixes #18051) #18279

Closed

Conversation

ghasemnaddaf
Copy link
Contributor

@ghasemnaddaf ghasemnaddaf commented Nov 14, 2017

If old_categories is empty (all nan categories) then _recode_for_categories
should return codes.copy() so that the writable flag is True.

@@ -2279,7 +2279,7 @@ def _recode_for_categories(codes, old_categories, new_categories):

if len(old_categories) == 0:
# All null anyway, so just retain the nulls
return codes
return codes.copy()
Copy link
Contributor Author

@ghasemnaddaf ghasemnaddaf Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return codes causes writable flag to be False hence we get the error reported in #18051

@codecov
Copy link

codecov bot commented Nov 14, 2017

Codecov Report

Merging #18279 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18279      +/-   ##
==========================================
- Coverage    91.4%   91.38%   -0.02%     
==========================================
  Files         164      164              
  Lines       49878    49878              
==========================================
- Hits        45590    45581       -9     
- Misses       4288     4297       +9
Flag Coverage Δ
#multiple 89.19% <ø> (ø) ⬆️
#single 39.41% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.75% <ø> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69472f9...b35f16e. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 14, 2017

Codecov Report

Merging #18279 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18279      +/-   ##
==========================================
- Coverage    91.4%   91.38%   -0.02%     
==========================================
  Files         164      164              
  Lines       49880    49880              
==========================================
- Hits        45592    45583       -9     
- Misses       4288     4297       +9
Flag Coverage Δ
#multiple 89.19% <100%> (ø) ⬆️
#single 39.42% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.75% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 148ed63...21734aa. Read the comment docs.

@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels Nov 14, 2017
@jorisvandenbossche jorisvandenbossche added this to the 0.21.1 milestone Nov 14, 2017
@@ -1227,6 +1227,10 @@ def test_set_categories(self):
exp_categories = Index(["a", "b", "c", "d"])
tm.assert_index_equal(cat.categories, exp_categories)

# all-nan categories GH 18051
cat_nan = Categorical([np.nan])
assert cat_nan.unique()._codes.flags.writeable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this are all tests about set_categories so it feels a bit strange to put this one in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note in 0.21.1

@@ -1227,6 +1227,10 @@ def test_set_categories(self):
exp_categories = Index(["a", "b", "c", "d"])
tm.assert_index_equal(cat.categories, exp_categories)

# all-nan categories GH 18051
cat_nan = Categorical([np.nan])
assert cat_nan.unique()._codes.flags.writeable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just assert the results of nunique

Copy link
Contributor Author

@ghasemnaddaf ghasemnaddaf Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no nunique for categorical (?) What is being tested makes sense to me, that is the root cause was that the codes in the output of unique() was not writeable.

rebased to remove conflicts in whats new
@ghasemnaddaf ghasemnaddaf force-pushed the catDtype_copy_nan_codes branch from 9189c78 to 6c751c0 Compare November 15, 2017 00:32
@ghasemnaddaf
Copy link
Contributor Author

synced off master and rebased to remove conflict in whatsnew

@jorisvandenbossche
Copy link
Member

I would maybe add the original reported case of pd.Series(pd.Categorical([np.nan])).nunique() as a test as well? (eg in test_categorical.py::TestCategoricalAsBlock)

@topper-123
Copy link
Contributor

Works for me.

This issue is a bit too much under the hood for me to understand, so I can't speak to whether this is the best solution, but it works fine for me.

@@ -1673,6 +1673,10 @@ def test_unique(self):
exp_cat = Categorical(["b", np.nan, "a"], categories=["b", "a"])
tm.assert_categorical_equal(res, exp_cat)

# GH 18051 unique()._codes should be writeable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you just need to compare the result of .unique() that's the user visible thing we are testing here. The non-writable issue is detail.

cat = Categorical([np.nan])
res = cat.unique()
exp_cat = Categorical([np.nan], categories=[])
tm.assert_categorical_equal(res, exp_cat)
Copy link
Contributor Author

@ghasemnaddaf ghasemnaddaf Nov 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback , do you mean like this? (This is not failing on 0.21.0 though)

Copy link
Contributor

@topper-123 topper-123 Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a test for

assert pd.Series(pd.Categorical([np.nan])).nunique() == 0

Note that nunique is a method on Series, not Categorical.

Copy link
Contributor Author

@ghasemnaddaf ghasemnaddaf Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@topper-123 topper-123 Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great IMO.

Also, you mention above that Categorical([np.nan]).unique() doesn't fail in 0.21.0, but pd.Series(pd.Categorical([np.nan])).unique() does fail. So if you could add a test for that as well in the test for series, then IMO the PR is good (but let @jreback make the final call on that).

@topper-123
Copy link
Contributor

Hi,

I've looked into this again and IMO the tests should go in tests/series/test_analytics.py, as that's where Series.unique and Series.nunique() are tested. These test failures actually are not happening on Categoricals themselves, o tests shouldn't go into test_categorical.py.

So in method test_value_counts_nunique I'd add these lines:

# GH 18051
s = pd.Series(pd.Categorical([]))
assert s.nunique() == 0
s = pd.Series(pd.Categorical([np.nan]))
assert s.nunique() == 0

and in method test_unique I'd add:

# GH 18051
s = pd.Series(pd.Categorical([]))
tm.assert_categorical_equal(s.unique(), pd.Categorical([]),
                            check_dtype=False)
s = pd.Series(pd.Categorical([np.nan]))
tm.assert_categorical_equal(s.unique(), pd.Categorical([np.nan]),
                            check_dtype=False)

and have no tests in test_categorical.py.

@jreback , do you agree?

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

@topper-123 suggestions look reasonable.

@jreback jreback removed this from the 0.21.1 milestone Nov 22, 2017
@topper-123
Copy link
Contributor

@jreback, is it too late to get this in 0.21.1?

@ghasemnaddaf, if you don't have time to finish this up, I could open a new PR in order to get this into 0.21.1. Alternatively, I'd be very happy if you could finish this up for 0.21.1.

@ghasemnaddaf
Copy link
Contributor Author

sorry @topper-123 im busy till weekend. Go for it thanks

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

superseded by #18436

@jreback jreback closed this Nov 22, 2017
@jreback jreback added this to the No action milestone Nov 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: method .nunique on categorical series in v0.21 with only NaNs gives ValueError
4 participants