Avoiding the "pandas trap" #4

TomAugspurger · 2020-05-16T11:29:35Z

Split from the discussions in #2.

To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API, which we'll intentionally deviate from. To the extent possible we should limit this discussion to issues with the API, rather than implementation.

pandas.DataFrame can't implement collections.abc.Mapping because .values is a property returning an array, rather than a method. (added by @TomAugspurger)
The "groupby-apply" pattern of passing opaque functions for non-trivial aggregations that could otherwise be expressed easily in e.g. SQL (consider any aggregation expression that involves more than one column). (added by @wesm)
Indexing in pandas accepts a variety of different inputs, each with its own semantics, e.g. passing a function to __getitem__ or loc/iloc. The difference between df[["a","b","c"]], df[slice(5)] and df[lambda idx: idx % 5 == 0] is not clear to new users. (added by @devin-petersohn)
pandas allows the dot operator (__getattr__) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
Duplicate APIs:
- Simple aliases: e.g. isna and isnull, multiply and mul, etc. (added by @devin-petersohn)
- More complex duplication: e.g. query("a > b") and df[df["a"] > df["b"]] (added by @devin-petersohn)
- Indexing: there are 7 or 8 ways to get one or more columns in pandas, e.g. __getitem__, __getattr__, loc, iloc, apply, drop (added by @devin-petersohn)
- merge and join call each other and are confusing for new users (added by @devin-petersohn)
Having a separate object to represent a one-column dataframe (i.e. Series). This creates the complexity of having to reimplement most dataframe functionality, and it does not provide a consistent way of applying operations to N columns (including 1). See "Separate object for a dataframe colum? (is Series needed?)" #6. (added by @datapythonista)
"missing" APIs / extension points:
- These are APIs or extension points that pandas and/or numpy lack, and whose absence -- for one reason or another -- has led libraries that need to consume pandas objects (e.g. DataFrame, Series) to hard-code support for these types. This makes pandas work well with these libraries but means it's not easy (or even possible) for other DataFrame implementations to be supported. Lack of interop support between alternative DataFrame implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).
- Introspection API for autocomplete / "IntelliSense" APIs.
  - In riptide we've implemented a hook + protocol and implemented it on our dataframe class Dataset. This provides more-detailed data compared to what a "static" tool like Jedi can return; compared to dir, our protocol allows our Dataset class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
  - Our protocol also allows richer metadata to be provided for data columns. For example, the dtype or array subclass name; for Categoricals, we can provide the number of labels/categories.
  - The features mentioned above could alternatively be implemented through some property(ies) on the standardized DataFrame and/or Array APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary).
    ...
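
To make the Mapping and dot-access items above a bit more concrete, here is a minimal sketch using only the Python standard library. The `Frame` class, its constructor, and the sample column names are hypothetical and invented for this illustration; they are not pandas API and not a proposal. The sketch only shows that when `values` is left as a method, a column container can satisfy `collections.abc.Mapping`, and that restricting column access to `[]` keeps a column named "count" from clashing with any attribute.

```python
from collections.abc import Mapping


class Frame(Mapping):
    """Hypothetical column container (an assumption for illustration only)."""

    def __init__(self, columns):
        # Copy each column into a list so the frame owns its data.
        self._columns = {name: list(col) for name, col in columns.items()}

    # The three abstract methods that Mapping requires.
    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        # Number of columns, as for any Mapping (not the number of rows).
        return len(self._columns)


frame = Frame({"a": [1, 2, 3], "count": [10, 20, 30]})

# keys(), values() and items() are inherited from Mapping as methods.
column_names = list(frame.keys())
column_data = list(frame.values())

# Columns are reachable only through [], so a column named "count"
# cannot collide with a method or attribute of the class.
count_column = frame["count"]
has_count_attribute = hasattr(frame, "count")
```

Because `Frame` is a Mapping, generic code such as `dict(frame)` works without any knowledge of the class. Whether the length of a dataframe should mean columns or rows is a separate design question that this sketch does not try to settle.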

devin-petersohn · 2020-05-16T15:30:28Z

To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API, which we'll intentionally deviate from.

I think there are multiple traps here, for example specifically deviating from or removing entire semantics (not APIs) from pandas. As you may guess, I am all for removing duplicated or unused APIs.

I know you are not talking about Modin (or cuDF) specifically with this trap comment, but I want to address this discussion because there are some obvious points of disagreement that some have made in the past. In Modin, we have specifically chosen to be drop-in compatible with the pandas API. The main goal here is to fix the pandas API by slowly moving people away from it, and users have overwhelmingly agreed with this stance.

The argument and disagreement here will no doubt be "should we tell users what they need or should users tell us what they need?". I believe that users do know what they need to do, even if they cannot always accurately describe it in words.

@TomAugspurger are we editing your first comment? I see @wesm did this and I have a few APIs to add to the problem list.

wesm · 2020-05-16T15:34:36Z

I added bylines to the list, so go right ahead

TomAugspurger mentioned this issue May 16, 2020

Trying to define "data frame" #2

Open

TomAugspurger mentioned this issue May 18, 2020

Separate object for a dataframe colum? (is Series needed?) #6

Closed

TomAugspurger changed the title ~~Deficiencies in pandas' API~~ Avoiding the "pandas trap" May 18, 2020

kgryte mentioned this issue Dec 10, 2020

Add statistical methods #33

Closed

rgommers mentioned this issue Jan 30, 2024

Expressions - another attempt #346

Closed
