Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: document dropna kwarg of pd.factorize #35667

Closed
FlorianWetschoreck opened this issue Aug 11, 2020 · 11 comments · Fixed by #35852
Closed

DOC: document dropna kwarg of pd.factorize #35667

FlorianWetschoreck opened this issue Aug 11, 2020 · 11 comments · Fixed by #35852
Labels
Milestone

Comments

@FlorianWetschoreck
Copy link

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

Documentation problem

The docs show the existence of a kwarg "dropna" which does not exist

Suggested fix for documentation

Delete the kwarg "dropna"

@FlorianWetschoreck FlorianWetschoreck added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020
@jorisvandenbossche
Copy link
Member

I think it's the other way around: dropna is an actual keyword, but is missing in the "Parameter" section of the docstring.
(#30584 added the keyword recently, by @charlesdong1991 )

Doc updates are certainly welcome!

@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Aug 11, 2020
@MarcoGorelli MarcoGorelli added good first issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020
@FlorianWetschoreck
Copy link
Author

FlorianWetschoreck commented Aug 11, 2020

Thank you for the hint. It indeed works:

d = {'col1': [1, pd.NA], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
pd.factorize(df['col1'], dropna=False)
pd.factorize(df['col1'], dropna=True)

It seems to be the case that the dropna controls if NA is encoded with the na_sentinel or in the usual logic. Is this correct?
I like the feature but I think the naming is confusing because I would have expected that it drops the rows based on the name dropna.


During testing I also found that the kwargs does not seem to be available on the Series API:

df['col1'].factorize(dropna=True)
>> TypeError: factorize() got an unexpected keyword argument 'dropna'

@jorisvandenbossche
Copy link
Member

I agree the name can be a bit confusing:

In [9]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=True)   
Out[9]: (array([ 0,  1,  0, -1]), array([1., 2.]))

In [10]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=False)   
Out[10]: (array([0, 1, 0, 2]), array([ 1.,  2., nan]))

So the "dropping" is about dropping NA from the uniques, not from the original values / code.

I suppose we mainly added this for internal usage to implement groupby's dropna keyword (@charlesdong1991 ?), but since this is a public method, we should also ensure it makes sense for factorize itself.
I can't directly think of a better name (I think I am also fine with keeping the name, if we make the meaning very clear in the docstring).

@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 1.2 Aug 12, 2020
@swapniljha001
Copy link

This issue is no longer an issue, as the confusion has been cleared by @jorisvandenbossche and #35667 should be closed now, right? Or am I misunderstanding something?

@jorisvandenbossche
Copy link
Member

The keyword still needs to be documented (and potentially its name discussed)

@jorisvandenbossche jorisvandenbossche changed the title DOC: pd.factorize shows non-existing kwarg dropna DOC: document dropna kwarg of pd.factorize Aug 12, 2020
@charlesdong1991
Copy link
Member

emm, indeed, the dropna was intended to serve only for internal usage to support dropna in groupby. But I didn't realize it is also public and thus caused confusion in naming and functionality.

I am very busy recently, so I could try to wrap up a PR to add docstring (also change the naming if needed) and add support for Series case at this weekend.

But if anyone has time and is up for a PR, feel free to do so!!

@zeina99
Copy link

zeina99 commented Aug 13, 2020

does documentation include adding examples?

Edit:
After reading the contributing guide, I understand it needs the parameters section changed and an example using the parameter dropna correct?

@data-RanDan
Copy link

Is this one still active?
If so,please clarify what needs to be done..

@simonjayhawkins
Copy link
Member

The keyword still needs to be documented (and potentially its name discussed)

since the keyword is not documented, could one assume that it is not yet part of the public API

if so, could we incorporate the functionality using the na_sentinel keyword instead.

I agree the name can be a bit confusing:

In [9]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=True)   
Out[9]: (array([ 0,  1,  0, -1]), array([1., 2.]))

In [10]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=False)   
Out[10]: (array([0, 1, 0, 2]), array([ 1.,  2., nan]))

So the "dropping" is about dropping NA from the uniques, not from the original values / code.

on master

>>> pd.__version__
'1.2.0.dev0+128.gdb6414fbe'
>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)
(array([0, 1, 0, 0], dtype=int64), array([1., 2.]))
>>>
>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=None)
Traceback (most recent call last):
...
TypeError: 'NoneType' object cannot be interpreted as an integer
>>>

The False case doesn't raise and gives inappropriate results. maybe the expected output could be

>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)
(array([0, 1, 0, 2]), array([ 1.,  2., nan]))

@jorisvandenbossche
Copy link
Member

The False case doesn't raise and gives inappropriate results. maybe the expected output could be

It actually is "working" .., it's just that the False is cast to the dtype of the codes, being int:

In [25]: pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)   # interpreted as 0
Out[25]: (array([0, 1, 0, 0]), array([1., 2.]))

In [26]: pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=True)   # interpreted as 1
Out[26]: (array([0, 1, 0, 1]), array([1., 2.]))

Now, since the codes can only be int, we can of course special case booleans.
So that might indeed be a nice idea, na_sentinel=False meaning: don't use a specific sentinel in the codes, but just include it in the uniques instead.

@data-RanDan
Copy link

data-RanDan commented Aug 21, 2020

Hey guys
Why not just drop the na's, and do a dropna=False to include them?
And maybe an optional na_token=int
I think it's more pandas'y

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants