Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Categorical sort_values and sort Documentation #12785

Closed
gfyoung opened this issue Apr 3, 2016 · 26 comments
Closed

DOC: Categorical sort_values and sort Documentation #12785

gfyoung opened this issue Apr 3, 2016 · 26 comments
Labels
Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas
Milestone

Comments

@gfyoung
Copy link
Member

gfyoung commented Apr 3, 2016

In categorical.py, we enforce the fact that self must be ordered when calling min or max. However, self can be unordered when calling sort_values. This doesn't make sense in my mind, for if you can sort the values for unordered self, then I can then find a minimum value of self. The same comment applies to argsort as well.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2016

prob an overight, should be using check_for_ordered(..) internally there.

@jreback jreback added Difficulty Novice Error Reporting Incorrect or improved errors from pandas Categorical Categorical Data Type labels Apr 3, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Apr 3, 2016

Given how straightforward the PR is, would it be for v0.18.1 or v0.19.0?

@jreback
Copy link
Contributor

jreback commented Apr 3, 2016

its a bug, why would it not be 0.18.1?

@jreback jreback added this to the 0.18.1 milestone Apr 3, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Apr 3, 2016

Just wanted to double check. 😄

@jorisvandenbossche
Copy link
Member

Are we sure this is a bug?
We have had a lot of discussions about the API, and certainly also about the ordered implication, min() raising etc. But I can't really remember if we discussed this specific issue.

In any case, I think it can be quite annoying if you cannot sort values (eg to just group them).
And as a comparison, R does allow to sort, but raises for min() in case of unordered factors.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2016

you could be right @jorisvandenbossche, I don't really remember.

cc @JanSchulz

@gfyoung
Copy link
Member Author

gfyoung commented Apr 3, 2016

@jorisvandenbossche : Is calling .as_ordered() that problematic? How about my explanation above?

@everyone : Regardless of whether this is a bug or not, something will have to give. The documentation insists that argsort and sort can only be called on ordered Categorical objects, but the internals do not match.

@jankatins
Copy link
Contributor

That was a consious descision sometime after implementing categoricals (and I just saw that the docstring was not changed: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L205)

commit is here: 87fec5b

@jreback
Copy link
Contributor

jreback commented Apr 3, 2016

#9347 and #9611

@jankatins
Copy link
Contributor

Yep, I think it started here: #9611 (comment)

@jorisvandenbossche
Copy link
Member

But in the end, it was changed here: #9622 (@JanSchulz you can see that in the commit (below the commit message))

@gfyoung
Copy link
Member Author

gfyoung commented Apr 3, 2016

Perhaps refactoring groupby would be too difficult, but IMHO don't agree that the convenience is worth compromising the overall logical consistency of the interface. I suspect there should exist some way to refactor groupby so that the categorical columns can be ordered internally.

@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

What are people's thoughts on this? Should we stick with the current functionality, which is consistent with R but seems to be logically inconsistent, or should it be enforced?

@jorisvandenbossche
Copy link
Member

Personally, I would leave it as is.
I also think it makes 'some' sense to be able to sort categoricals. To make the analogy, if you have colored cubes, you can 'sort' these to form groups of all cubes of the same color, but you cannot give an 'order' to them. Of course we use the sort method for both 'sort' and 'order' meaning in other cases, which makes it a bit confusing.

Anyway, if we do the above, the docs of sort_values should be updated.
And argsort does currently not even work on a categorical series:

In [2]: s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

In [5]: s.argsort()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bb845bcfbc4f> in <module>()
----> 1 s.argsort()

c:\users\vdbosscj\scipy\pandas-joris\pandas\core\series.py in argsort(self, axis
, kind, order)
   1848         else:
   1849             return self._constructor(
-> 1850                 np.argsort(values, kind=kind), index=self.index,
   1851                 dtype='int64').__finalize__(self)
   1852

C:\Anaconda\envs\devel\lib\site-packages\numpy\core\fromnumeric.pyc in argsort(a
, axis, kind, order)
    906     except AttributeError:
    907         return _wrapit(a, 'argsort', axis, kind, order)
--> 908     return argsort(axis, kind, order)
    909
    910

TypeError: argsort() takes at most 2 arguments (4 given)

Another numpy compat issue .. :-)

@gfyoung
Copy link
Member Author

gfyoung commented Apr 11, 2016

@jorisvandenbossche : Don't worry about the numpy compat issue. I've got that one covered. 😄 But regarding the semantics of what sort means, okay, that is beginning to make more sense in my head. I agree at the very least that the documentation should make that clearer because that meaning wasn't the first one that came to my head.

@gfyoung
Copy link
Member Author

gfyoung commented Apr 11, 2016

@everyone: I think if it is just a semantics thing, it's just a DOC-change PR then (not an API one as I had originally titled it)?

@jreback
Copy link
Contributor

jreback commented Apr 11, 2016

yes why don't u update doc string and docs as needed to be more clear

@gfyoung gfyoung changed the title API: Why are unordered Categoricals sortable? DOC: Why are unordered Categoricals sortable? Apr 11, 2016
@gfyoung gfyoung changed the title DOC: Why are unordered Categoricals sortable? DOC: Categorical sort_values and sort Documentation Apr 12, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

I have generalized this issue to just questions about Categorical.sort and Categorical.sort_values since this is another question about semantics and wording. In the documentation, the wording does not make any mention of na_position being dependent on ascending = False (i.e. it is only respected provided ascending = False). Just curious, why is that the case?

@jorisvandenbossche
Copy link
Member

@gfyoung Because having good docs is hard and needs many eyes.
However, for me the question in this case is more: why is na_position only implemented for ascending=False?

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jorisvandenbossche : Ah, that's what I actually meant. Sorry, the "it" was a little vague! My question is: why is na_position even dependent on ascending at all?

@jreback
Copy link
Contributor

jreback commented Apr 12, 2016

@gfyoung na_position has to do with the nan sorting, IOW, are nans first or last. So it is tied to ascending by-definition. This is all doced for Series & Index I believe, many not for Categorical. this is why we use shared_docs as much as possible (though not in this case because it wasn't done, though it should be done)

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jreback : Does first and last mean that nan is the "largest" or "smallest" element in the _codes? In that case, I can understand the dependence. Nevertheless, it's not being respected when ascending = True, so something has to give here. The documentation at the very least has to be updated to reflect that dependence.

>>> c = pd.Categorical([np.nan, 2, 2, np.nan, 5])
>>> c.sort_values(ascending=True, na_position='first')
# expected : [2.0, 2.0, 5.0, NaN, NaN]
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]
>>> c.sort_values(ascending=True, na_position='last')
[NaN, NaN, 2.0, 2.0, 5.0]
Categories: (2, int64): [2, 5]

@jreback
Copy link
Contributor

jreback commented Apr 12, 2016

In [6]: s = Series([np.nan, 2, 2, np.nan, 5])

In [7]: s.sort_values(ascending=True, na_position='first')
Out[7]: 
0    NaN
3    NaN
1    2.0
2    2.0
4    5.0
dtype: float64

In [8]: s.sort_values(ascending=True, na_position='last')
Out[8]: 
1    2.0
2    2.0
4    5.0
0    NaN
3    NaN
dtype: float64

In [9]: s.sort_values(ascending=False, na_position='first')
Out[9]: 
0    NaN
3    NaN
4    5.0
2    2.0
1    2.0
dtype: float64

In [10]: s.sort_values(ascending=False, na_position='last')
Out[10]: 
4    5.0
2    2.0
1    2.0
0    NaN
3    NaN
dtype: float64

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jreback : Ah, okay, so it makes sense to align with the Series API, but observe that na_position is completely independent of ascending.

@jreback
Copy link
Contributor

jreback commented Apr 12, 2016

yep, misspoke, its independent, but would like to tie Series,Index,Categorical doc-strings at the least together

@gfyoung
Copy link
Member Author

gfyoung commented Apr 12, 2016

@jreback : IMO the Categorical documentation should be separate because the meaning of sort is not the same as the other two per the discussion above.

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 13, 2016
Clarifies the meaning of 'sort' in the context of
Categorical to mean 'organization' rather than 'order',
as it is possible to call this method (as well as
'sort_values') when the Categorical is unordered.

Also patches a bug in 'Categorical.sort_values' in
which 'na_position' was not being respected when
'ascending' was set to 'True'. This commit aligns
the behaviour with that of Series.

Finally, this commit deprecates 'sort' in favor
of 'sort_values,' which is in alignment with the
Series API as well.

Closes pandas-devgh-12785.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

No branches or pull requests

4 participants