Avoiding the "pandas trap" #4
Comments
I think there are multiple traps here, for example specifically deviating from or removing entire semantics (not APIs) from pandas. As you may guess, I am all for removing duplicated or unused APIs.

I know you are not talking about Modin (or cuDF) specifically with this trap comment, but I want to address this discussion because there are some obvious points of disagreement that some have made in the past. In Modin, we have specifically chosen to be drop-in compatible with the pandas API. The main goal here is to fix the pandas API by slowly moving people away from it, and users have overwhelmingly agreed with this stance.

The argument and disagreement here will no doubt be "should we tell users what they need or should users tell us what they need?". I believe that users do know what they need to do, even if they cannot always accurately describe it in words.

@TomAugspurger are we editing your first comment? I see @wesm did this and I have a few APIs to add to the problem list.
I added bylines to the list, so go right ahead.
Split from the discussions in #2.
To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API, which we'll intentionally deviate from. To the extent possible, we should limit this discussion to issues with the API, rather than the implementation.
- `collections.abc.Mapping` because `.values` is a property returning an array, rather than a method. (added by @TomAugspurger)
- `__getitem__` or `loc`/`iloc`. It is not explicitly clear to new users what the difference is between `df[["a","b","c"]]`, `df[slice(5)]`, and `df[lambda idx: idx % 5 == 0]`. (added by @devin-petersohn)
- `__getattr__`) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
- `isna` and `isnull`, `multiply` and `mul`, etc. (added by @devin-petersohn)
- `query("a > b")` and `df[df["a"] > df["b"]]` (added by @devin-petersohn)
- `__getitem__`, `__getattr__`, `loc`, `iloc`, `apply`, `drop` (added by @devin-petersohn)
- `merge` and `join` call each other and are confusing for new users (added by @devin-petersohn)
- `Series`). Creating all the complexity of having to reimplement most functionality of dataframe. And not providing a consistent way of applying operations to N columns (including 1). Separate object for a dataframe column? (is Series needed?) #6 @datapythonista
- `DataFrame`, `Series`) to hard-code support for these types. This makes pandas work well with these libraries but means it's not easy (or even possible) for other `DataFrame` implementations to be supported. Lack of interop support between alternative `DataFrame` implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).
- `Dataset`. This provides more-detailed data compared to what a "static" tool like Jedi can return; compared to `dir`, our protocol allows our `Dataset` class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
- `dtype` or array subclass name; for Categoricals, we can provide the number of labels/categories.
- `DataFrame` and/or `Array` APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary)....
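
Two of the items above can be illustrated without pandas itself. The sketch below is only a toy: `ToyFrame` and `total_cells` are hypothetical names invented for illustration, not part of pandas or any proposal in this thread. It assumes the `collections.abc.Mapping` item refers to the clash between the `values()` method that `Mapping` expects and a `values` property, and that the `__getattr__` item refers to column names being shadowed by existing attributes.

```python
from collections.abc import Mapping


class ToyFrame:
    """Hypothetical column store that mimics two pandas API choices."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)

    @property
    def values(self):
        # A property returning row-wise data, where Mapping expects a method.
        return [list(row) for row in zip(*self._columns.values())]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so any existing
        # attribute (such as the `values` property) wins over a column name.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


def total_cells(mapping: Mapping) -> int:
    """Generic code written against the Mapping interface."""
    return sum(len(column) for column in mapping.values())


# Registering as a virtual subclass satisfies isinstance checks only;
# it does not make the object behave like a Mapping.
Mapping.register(ToyFrame)

frame = ToyFrame({"a": [1, 2], "values": [3, 4]})

by_attribute = frame.a           # falls through to __getattr__
shadowed = frame.values          # the property, not the "values" column
by_item = frame["values"]        # item access still reaches the column

try:
    total_cells(frame)
    mapping_compatible = True
except TypeError:
    # frame.values is a list, so calling it like a method fails.
    mapping_compatible = False
```

With a plain `dict`, `total_cells` works as expected; with `ToyFrame` it raises `TypeError` even though `isinstance(frame, Mapping)` is true, and the column named `"values"` is only reachable through item access.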