ENH: standardize fill_value behavior across the API #15533

ResidentMario · 2017-02-28T21:00:17Z

Problem

While working on the PR for #15486, I found that type validation for the fill_value parameters scattered across many pandas API methods is done ad hoc, so the set of accepted inputs varies widely from method to method. I think it would be good to standardize this so that all of these methods share one behavior: the one currently used by fillna.

Implementation Details

Part of the point of providing a fill_value is to avoid a slow type conversion after the fact (via .fillna().astype()). However, accepting fills of other types is nevertheless a useful convenience. The implementation would roughly be:

Before executing the rest of the method body, check whether or not the fill_value is valid (using a centralized maybe_fill method). If it is not, raise a ValueError. If it is, check whether or not incorporating the fill_value would result in an upcast of the column dtype. If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like a fillna operation at the end before returning.
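A minimal sketch of that flow, using a plain NumPy shift as the method body. The name shift_with_fill is hypothetical, only real-number fills are accepted, and as a simplification the result is allocated in the wider dtype up front rather than filled in a trailing fillna step:

```python
import numbers

import numpy as np


def shift_with_fill(values, periods, fill_value):
    """Hypothetical helper: shift a 1-D array, filling the vacated slots."""
    # Step 1: validate before doing any work (assumption: real numbers only).
    if not isinstance(fill_value, numbers.Real):
        raise ValueError(f"invalid fill_value: {fill_value!r}")

    # Step 2: find the dtype that can hold both the data and the fill.
    # If it equals values.dtype, no type conversion happens below.
    target = np.result_type(values, fill_value)

    # Step 3: one code path for both cases; only the output dtype differs.
    out = np.empty(len(values), dtype=target)
    if periods > 0:
        out[:periods] = fill_value
        out[periods:] = values[:-periods]
    elif periods < 0:
        out[periods:] = fill_value
        out[:periods] = values[-periods:]
    else:
        out[:] = values
    return out


same_dtype = shift_with_fill(np.array([1, 2, 3]), 1, 0)    # stays integer
upcast = shift_with_fill(np.array([1, 2, 3]), 1, np.nan)   # becomes float
```

The point of computing target first is that the common case (a fill that already fits) never touches astype at all.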

Target Implementation

The same as what fillna currently does, which is as follows.

Invalid:

categorical fill for a category not in the categories will raise a ValueError.
sparse matrices refuse upcasting.
passing an object or list or other non-coercible "thing" as a fill is rejected.

Valid, upcast:

int fill will promote bool dtypes to int.
float fill will promote int and bool dtypes to float (this is what happens with np.nan already).
object (str) fill would promote lesser dtypes to object.
int, float, and bool fill to a datetime dtype will be treated as a UNIX-like timestamp and promoted to datetime.
object fill will promote datetime dtype to object.

Valid, no-cast:

Everything else.
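As a rough illustration of the three buckets above, here is a sketch that classifies a scalar fill against a column dtype. classify_fill and _RANK are hypothetical names, and the sketch only covers bool, int, float and object columns (datetime, categorical and sparse are deliberately left out):

```python
import numpy as np

# Rank of each NumPy dtype kind in the promotion order bool < int < float < object.
_RANK = {"b": 0, "i": 1, "u": 1, "f": 2, "O": 3}


def classify_fill(fill_value, dtype):
    """Hypothetical helper: return 'invalid', 'upcast' or 'no-cast'."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(fill_value, (bool, np.bool_)):
        fill_rank = 0
    elif isinstance(fill_value, (int, np.integer)):
        fill_rank = 1
    elif isinstance(fill_value, (float, np.floating)):
        fill_rank = 2
    elif isinstance(fill_value, str):
        fill_rank = 3
    else:
        # Lists, arbitrary objects, callables and so on.
        return "invalid"

    column_rank = _RANK[np.dtype(dtype).kind]
    return "upcast" if fill_rank > column_rank else "no-cast"


print(classify_fill(1, "bool"))
print(classify_fill(np.nan, "int64"))
print(classify_fill(0, "float64"))
print(classify_fill([1, 2], "int64"))
```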

Current Implementation

...is ad-hoc. The following are the methods which currently provide a fill_value input, as well as where they deviate from the model above.

Series.combine, DataFrame.combine, Series.to_sparse: These are unique usages of fill_value which aren't compatible with the rest of them.
Series.unstack, DataFrame.unstack: any fill_value is allowed. You can pass an object if you'd like, or even another DataFrame (yo dawg...).
DataFrame.align: Any fill_value is allowed.
DataFrame.reindex_axis: Lists and dicts are allowed, objects are not.
DataFrame.asfreq, Series.asfreq: any fill_value is allowed.
pd.pivot_table: ...
Series.add, DataFrame.add: ...
Series.subtract, DataFrame.subtract: ...
Probably others, there's a lot of these.


jreback · 2017-02-28T22:24:25Z

this is kind of like _maybe_promote, see here: https://github.com/pandas-dev/pandas/blob/master/pandas/types/cast.py#L230, though in this case it's a validator, but same idea.

a lot of the validation is prob occurring now but at a lower level and with no consistency of error messages. A lot of the routines expect certain types for filling, IOW filling floats needs a compatible float/int (or would raise).

So a friendly high level check would be nice. The hard part about this issue is not the code changes, but the tests :>

Also collecting these tests into a standard place would be fine as well (this is tricky because we like to keep them with the types, e.g. in pandas/tests/series, but for example routines in pandas/tests/missing would be nice as well).

ResidentMario · 2017-02-28T23:51:32Z

I'm hopeful I can figure out how to implement this. Why are the tests the hard part? I assume you mean figuring out where to put them, which does sound challenging...I might collect them in a separate file instead, just for the meantime.

ResidentMario · 2017-03-04T04:22:38Z

_maybe_upcast doesn't have guards against upcasting "weird" stuff. So for example the following is legal:

_maybe_upcast(np.array([np]))

When a fill_value parameter is passed, _maybe_upcast is the first and only validation step that parameter has to go through. So since the above is legal, so is whatever garbage you pass it, e.g. Series.shift(fill_value=<class 'garbage_type'>).

In other cases (e.g. fillna) there is additional validation that prevents this from happening.

Should _maybe_upcast (continue to) allow this behavior? This is deep in the internals, so I suspect not touching it would be best, but it does seem like an odd thing to allow, to me.

It wouldn't be too hard to add a separate check to prevent this sort of input from reaching _maybe_upcast at all.
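To illustrate why anything gets through: NumPy will wrap any Python object, even a module, in an object-dtype array, so an "upcast to object" never fails. A separate guard could run first. The sketch below uses a hypothetical reject_non_scalar_fill built on np.isscalar, which is only a stand-in: it is narrower than what pandas would need to accept (it may not treat datetime-like values as scalars, for example).

```python
import numpy as np

# Any Python object can live in an object-dtype array, including a module.
wrapped = np.array([np])
print(wrapped.dtype)


def reject_non_scalar_fill(fill_value):
    """Hypothetical guard to run before any upcasting logic."""
    # None is let through on the assumption that it means "missing".
    if fill_value is not None and not np.isscalar(fill_value):
        raise TypeError(
            "fill_value must be a scalar, got "
            f"{type(fill_value).__name__}"
        )
    return fill_value


reject_non_scalar_fill(0)        # fine
reject_non_scalar_fill(np.nan)   # fine
```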

ResidentMario · 2017-03-04T04:29:34Z

(sorry about the close/open, fat-fingered the wrong button there)

jreback · 2017-03-04T07:30:40Z

you might be able to add a check here
though I suspect having a _maybe_cast_fill which does validation might be easier

internal routines just have implicit guarantees (or better yet, explicit ones that are in the docstring)

ResidentMario · 2017-03-05T21:19:35Z

FYI, this is legal in fillna right now:

pd.Series([1, 2, np.nan]).fillna(lambda f: f)

Which is at odds with a (separate) TypeError check in the method body:

    if isinstance(value, (list, tuple)):
        raise TypeError('"value" parameter must be a scalar or dict, but '
                        'you passed a "{0}"'.format(type(value).__name__))

(to fix this you could do if isinstance(value, (list, tuple)) or callable(value))

jreback · 2017-03-05T22:43:41Z

yeah you prob have to check inclusion rather than exclusion

e.g. is_scalar, is_dict_like, is_list_like
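A sketch of what checking inclusion could look like, using the public helpers in pandas.api.types. The function name check_fillna_value and the exact error wording are assumptions for illustration, not what fillna does today:

```python
from pandas.api.types import is_dict_like, is_list_like, is_scalar


def check_fillna_value(value):
    """Hypothetical validator: accept only scalars and dict-likes."""
    if is_scalar(value):
        return "scalar"
    if is_dict_like(value):
        return "dict-like"
    # Everything else is rejected, whether it is a list, a callable,
    # or some arbitrary object. No exclusion list needs maintaining.
    kind = "list-like" if is_list_like(value) else type(value).__name__
    raise TypeError(
        f'"value" parameter must be a scalar or dict, but you passed a {kind}'
    )


print(check_fillna_value(0))
print(check_fillna_value({"A": 0}))
```

With this shape, the lambda from the earlier comment is rejected without anyone having had to think of callables in advance.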

ResidentMario · 2017-03-05T22:45:52Z

Yeah. Funny little bug with that:

 >>> import pandas.core.common as com
 >>> com.is_string_dtype(type)
 True

ResidentMario · 2017-03-05T22:49:00Z

In com.is_string_dtype:

dtype.kind in ('O', 'S', 'U')

The type checker naively assumes that if you passed it an object, it must have been a string!
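The effect can be reproduced with NumPy alone. The function below is a hypothetical stand-in for the kind-based check, not the pandas source: NumPy maps any Python class it does not recognize to the object dtype, whose kind is 'O', so the check passes.

```python
import numpy as np


def loose_is_string_dtype(obj):
    """Hypothetical stand-in for a check based only on dtype.kind."""
    return np.dtype(obj).kind in ("O", "S", "U")


print(np.dtype(type))                # unrecognized classes fall back to object
print(loose_is_string_dtype(str))
print(loose_is_string_dtype(type))
print(loose_is_string_dtype(float))
```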

jreback · 2017-03-05T22:57:51Z

is_string_dtype is not strict. It really can't be w/o a lot of code inference (which is not cheap). You can certainly add a comment to it if you'd like. If you really need inference then you can do lib.infer_dtype which IS strict (but again is not free, though not too bad as it short-circuits).
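For comparison, a small sketch of the strict route. It assumes a reasonably recent pandas, where the inference function is exposed publicly as pandas.api.types.infer_dtype; unlike a dtype.kind check it looks at the values themselves:

```python
import numpy as np
from pandas.api.types import infer_dtype

# Both arrays have object dtype, so a kind-based check cannot tell them apart.
all_strings = np.array(["a", "b"], dtype=object)
not_strings = np.array(["a", object()], dtype=object)

# infer_dtype inspects the elements and returns a descriptive label.
print(infer_dtype(all_strings))
print(infer_dtype(not_strings))
```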

jreback · 2017-03-05T23:03:43Z

@ResidentMario note that imports from pandas.core.common are almost all deprecated, use pandas.types.common

jorisvandenbossche · 2017-03-09T09:54:08Z

@ResidentMario Is the description at the top of this issue still up to date with how you are trying to implement things in #15587 ?

ResidentMario · 2017-03-09T16:06:23Z

Hmm. This list is incomplete, and I think there's been a couple of changes there:

Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)
I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

jreback · 2017-03-09T18:43:21Z

> Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)

yes this is a special case atm, you can simply use is_period_arraylike on an object column to check, and if true, then restrict the fill value.

> I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

yes

jreback · 2017-03-09T18:48:13Z

> A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

I think a reasonable way to do this is to:

add an errors='ignore'|'raise'|'force' kw to .fillna* routines
if errors='ignore' (default), then allow filling of a column only if the types match (IOW don't fill datetimes with ints, just skip them)
if errors='raise' then raise on anything that is not compat with the filling
if errors='force' then coerce the columns as needed (even to object).

this would give nice behavior by default of filling things that can take that value and providing error checking otherwise (with an option for force filling, but that's user selected).

The current situation is effectively errors='force'.

In [2]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})

In [3]: df
Out[3]: 
   A          B
0  1 2013-01-01
1  2 2013-01-02
2  3 2013-01-03

In [4]: df.iloc[1] = np.nan

In [5]: df
Out[5]: 
     A          B
0  1.0 2013-01-01
1  NaN        NaT
2  3.0 2013-01-03

In [6]: df.fillna(0)
Out[6]: 
     A          B
0  1.0 2013-01-01
1  0.0 1970-01-01
2  3.0 2013-01-03

In [7]: df.fillna(pd.Timestamp('20130110'))
Out[7]: 
                     A          B
0                    1 2013-01-01
1  2013-01-10 00:00:00 2013-01-10
2                    3 2013-01-03

I suppose we could also make the default errors='raise', though that is not back-compat. This would be more obvious. errors='ignore' is more convenient.
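A rough sketch of how the three modes could behave column by column. Nothing here exists in pandas: fillna_with_errors and _is_compatible are hypothetical, the compatibility rule is a deliberately crude assumption (timestamps fill datetime columns, real numbers fill numeric columns, anything else fills only object columns), and column names are assumed to be unique.

```python
import numbers

import pandas as pd
from pandas.api.types import (
    is_datetime64_any_dtype,
    is_numeric_dtype,
    is_object_dtype,
)


def _is_compatible(value, dtype):
    """Crude, assumed rule for 'the fill fits this column without a cast'."""
    if isinstance(value, pd.Timestamp):
        return is_datetime64_any_dtype(dtype)
    if isinstance(value, numbers.Real):
        return is_numeric_dtype(dtype)
    return is_object_dtype(dtype)


def fillna_with_errors(df, value, errors="ignore"):
    """Hypothetical fillna wrapper taking the proposed errors keyword."""
    if errors not in ("ignore", "raise", "force"):
        raise ValueError(f"unknown errors mode: {errors!r}")
    out = df.copy()
    for name in out.columns:
        column = out[name]
        if _is_compatible(value, column.dtype):
            out[name] = column.fillna(value)
        elif errors == "raise":
            raise TypeError(f"cannot fill column {name!r} with {value!r}")
        elif errors == "force":
            # Coerce to object so that any fill value can be stored.
            out[name] = column.astype(object).fillna(value)
        # errors == "ignore": leave the incompatible column untouched.
    return out


df = pd.DataFrame({
    "A": [1.0, None, 3.0],
    "B": pd.to_datetime(["2013-01-01", None, "2013-01-03"]),
})
ignored = fillna_with_errors(df, 0)                 # fills A, skips B
forced = fillna_with_errors(df, 0, errors="force")  # fills both, B becomes object
```

Under 'ignore' the NaT in B survives instead of silently turning into 1970-01-01 as in the session above.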

ResidentMario · 2017-03-10T19:29:07Z

Ok so then:

New PR implementing an errors param for fillna via a new validator func (BUG: fillna('') does not replace NaT #11953).
PR implementing that validator func in the various fill_value routines with the default validator func behavior (ENH: standardize fill_value behavior across the API #15533; PR#15587).
PR adding the errors param to shift (ENH: fill_value argument for shift #15486; PR#15527).

I suggest also adding a new pd.set_option param for letting the user pick their error coercion mode if they want.

jreback added the Dtype Conversions, Missing-data, and Difficulty Intermediate labels Feb 28, 2017

jreback added this to the Next Major Release milestone Feb 28, 2017

jreback added the Error Reporting label Feb 28, 2017

ResidentMario closed this as completed Mar 4, 2017

ResidentMario reopened this Mar 4, 2017

jreback mentioned this issue Mar 5, 2017

API: is_string_dtype is not strict #15585

Closed

This was referenced Mar 6, 2017

ENH: standardize fill_value behavior across the API #15587

Closed

BUG: fillna('') does not replace NaT #11953

Open

jreback mentioned this issue Mar 9, 2017

pandas.DataFrame.where not replacing NaTs properly #15613

Closed

ResidentMario mentioned this issue Mar 11, 2017

ENH: Provide an errors parameter to fillna #15653

Closed

4 tasks

kernc mentioned this issue Apr 24, 2017

SparseDataFrame.fillna() doesn't fill all NaNs #16112

Closed

jreback mentioned this issue May 20, 2017

API: inconsistent in handling fill/where/setitem of incompat type #16402

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel added the API - Consistency label Dec 19, 2019

mroeschke added the Enhancement label May 7, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
