DOC: document dropna kwarg of pd.factorize #35667

FlorianWetschoreck · 2020-08-11T09:18:43Z

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

Documentation problem

The docs show the existence of a kwarg "dropna" which does not exist

Suggested fix for documentation

Delete the kwarg "dropna"

jorisvandenbossche · 2020-08-11T11:08:26Z

I think it's the other way around: dropna is an actual keyword, but is missing in the "Parameter" section of the docstring.
(#30584 added the keyword recently, by @charlesdong1991 )

Doc updates are certainly welcome!

FlorianWetschoreck · 2020-08-11T13:03:55Z

Thank you for the hint. It indeed works:

d = {'col1': [1, pd.NA], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
pd.factorize(df['col1'], dropna=False)
pd.factorize(df['col1'], dropna=True)

It seems to be the case that the dropna controls if NA is encoded with the na_sentinel or in the usual logic. Is this correct?
I like the feature but I think the naming is confusing because I would have expected that it drops the rows based on the name dropna.

During testing I also found that the kwargs does not seem to be available on the Series API:

df['col1'].factorize(dropna=True)
>> TypeError: factorize() got an unexpected keyword argument 'dropna'

jorisvandenbossche · 2020-08-12T11:26:39Z

I agree the name can be a bit confusing:

In [9]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=True)   
Out[9]: (array([ 0,  1,  0, -1]), array([1., 2.]))

In [10]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=False)   
Out[10]: (array([0, 1, 0, 2]), array([ 1.,  2., nan]))

So the "dropping" is about dropping NA from the uniques, not from the original values / code.

I suppose we mainly added this for internal usage to implement groupby's dropna keyword (@charlesdong1991 ?), but since this is a public method, we should also ensure it makes sense for factorize itself.
I can't directly think of a better name (I think I am also fine with keeping the name, if we make the meaning very clear in the docstring).

swapniljha001 · 2020-08-12T14:43:17Z

This issue is no longer an issue, as the confusion has been cleared by @jorisvandenbossche and #35667 should be closed now, right? Or am I misunderstanding something?

jorisvandenbossche · 2020-08-12T14:56:05Z

The keyword still needs to be documented (and potentially its name discussed)

charlesdong1991 · 2020-08-12T20:42:28Z

emm, indeed, the dropna was intended to serve only for internal usage to support dropna in groupby. But I didn't realize it is also public and thus caused confusion in naming and functionality.

I am very busy recently, so I could try to wrap up a PR to add docstring (also change the naming if needed) and add support for Series case at this weekend.

But if anyone has time and is up for a PR, feel free to do so!!

zeina99 · 2020-08-13T16:56:29Z

does documentation include adding examples?

Edit:
After reading the contributing guide, I understand it needs the parameters section changed and an example using the parameter dropna correct?

data-RanDan · 2020-08-15T16:33:37Z

Is this one still active?
If so,please clarify what needs to be done..

simonjayhawkins · 2020-08-21T11:21:14Z

The keyword still needs to be documented (and potentially its name discussed)

since the keyword is not documented, could one assume that it is not yet part of the public API

if so, could we incorporate the functionality using the na_sentinel keyword instead.

I agree the name can be a bit confusing:
In [9]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=True)   
Out[9]: (array([ 0,  1,  0, -1]), array([1., 2.]))

In [10]: pd.factorize(np.array([1, 2, 1, np.nan]), dropna=False)   
Out[10]: (array([0, 1, 0, 2]), array([ 1.,  2., nan]))
So the "dropping" is about dropping NA from the uniques, not from the original values / code.

on master

>>> pd.__version__
'1.2.0.dev0+128.gdb6414fbe'
>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)
(array([0, 1, 0, 0], dtype=int64), array([1., 2.]))
>>>
>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=None)
Traceback (most recent call last):
...
TypeError: 'NoneType' object cannot be interpreted as an integer
>>>

The False case doesn't raise and gives inappropriate results. maybe the expected output could be

>>> pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)
(array([0, 1, 0, 2]), array([ 1.,  2., nan]))

jorisvandenbossche · 2020-08-21T15:19:18Z

The False case doesn't raise and gives inappropriate results. maybe the expected output could be

It actually is "working" .., it's just that the False is cast to the dtype of the codes, being int:

In [25]: pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=False)   # interpreted as 0
Out[25]: (array([0, 1, 0, 0]), array([1., 2.]))

In [26]: pd.factorize(np.array([1, 2, 1, np.nan]), na_sentinel=True)   # interpreted as 1
Out[26]: (array([0, 1, 0, 1]), array([1., 2.]))

Now, since the codes can only be int, we can of course special case booleans.
So that might indeed be a nice idea, na_sentinel=False meaning: don't use a specific sentinel in the codes, but just include it in the uniques instead.

data-RanDan · 2020-08-21T17:39:29Z

Hey guys
Why not just drop the na's, and do a dropna=False to include them?
And maybe an optional na_token=int
I think it's more pandas'y

FlorianWetschoreck added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020

jorisvandenbossche added good first issue and removed good first issue labels Aug 11, 2020

jorisvandenbossche added this to the Contributions Welcome milestone Aug 11, 2020

MarcoGorelli added good first issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020

jorisvandenbossche modified the milestones: Contributions Welcome, 1.2 Aug 12, 2020

jorisvandenbossche changed the title ~~DOC: pd.factorize shows non-existing kwarg dropna~~ DOC: document dropna kwarg of pd.factorize Aug 12, 2020

jorisvandenbossche removed the good first issue label Aug 17, 2020

charlesdong1991 mentioned this issue Aug 22, 2020

API: replace dropna=False option with na_sentinel=None in factorize #35852

Merged

5 tasks

simonjayhawkins modified the milestones: 1.2, 1.1.2 Aug 28, 2020

charlesdong1991 closed this as completed in #35852 Sep 2, 2020

jorisvandenbossche mentioned this issue May 2, 2022

Deprecate na_sentinel, add use_na_sentinel #46910

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: document dropna kwarg of pd.factorize #35667

DOC: document dropna kwarg of pd.factorize #35667

FlorianWetschoreck commented Aug 11, 2020

jorisvandenbossche commented Aug 11, 2020

FlorianWetschoreck commented Aug 11, 2020 •

edited

Loading

jorisvandenbossche commented Aug 12, 2020

swapniljha001 commented Aug 12, 2020

jorisvandenbossche commented Aug 12, 2020

charlesdong1991 commented Aug 12, 2020

zeina99 commented Aug 13, 2020 •

edited

Loading

data-RanDan commented Aug 15, 2020

simonjayhawkins commented Aug 21, 2020

jorisvandenbossche commented Aug 21, 2020

data-RanDan commented Aug 21, 2020 •

edited

Loading

DOC: document dropna kwarg of pd.factorize #35667

DOC: document dropna kwarg of pd.factorize #35667

Comments

FlorianWetschoreck commented Aug 11, 2020

Location of the documentation

Documentation problem

Suggested fix for documentation

jorisvandenbossche commented Aug 11, 2020

FlorianWetschoreck commented Aug 11, 2020 • edited Loading

jorisvandenbossche commented Aug 12, 2020

swapniljha001 commented Aug 12, 2020

jorisvandenbossche commented Aug 12, 2020

charlesdong1991 commented Aug 12, 2020

zeina99 commented Aug 13, 2020 • edited Loading

data-RanDan commented Aug 15, 2020

simonjayhawkins commented Aug 21, 2020

jorisvandenbossche commented Aug 21, 2020

data-RanDan commented Aug 21, 2020 • edited Loading

FlorianWetschoreck commented Aug 11, 2020 •

edited

Loading

zeina99 commented Aug 13, 2020 •

edited

Loading

data-RanDan commented Aug 21, 2020 •

edited

Loading