Using Array API functions on DataFrame objects #50

Open
jakirkham opened this issue Aug 18, 2021 · 5 comments

@jakirkham
Member

In some cases users like to use Array API functions (for example `where`) on DataFrame objects (in particular Series). Is this something that we would like to support in the API? If not, how would we recommend users approach these kinds of problems?

For an example of this, please see dask/distributed#5224.

@rgommers
Member

> In some cases users like to use Array API functions (for example where) on DataFrame objects (in particular Series). Is this something that we would like to support in the API? If not, how would we recommend users approach these kinds of problems?

We had (/have) a pretty strong consensus that there should not be a separate Series-like object, but only a DataFrame object with a single column. IIRC the key issue is that Series and DataFrame have so much API duplication, for little benefit.

That's not a complete answer though. Your question can be rephrased as something like:

  • "Can array API functions be applied to dataframes?", or
  • "Is there a connection between the array and dataframe APIs?"

In principle, it makes sense to me to have array API functions work on dataframes with a homogeneous dtype (which includes single-column dataframes). I'm not sure there's a good way to pick and choose which functions in the array API make sense. I can imagine a dataframe library providing the whole array API somehow, or reusing an existing array library that is already a dependency.

@jakirkham
Member Author

When we discussed this in the call a few weeks back (and please feel free to correct me), we explored a few options. In the end, we gravitated towards having some way to share/convert data between DataFrames and Arrays, with the idea that one could then use Array operations on the converted Array. If the underlying library doesn't have an actual Array, it could just return some object that implements the Array API. Here are some related discussions on this topic: #25, #39, #48.

cc @jorisvandenbossche (in case I missed anything here 🙂)
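The convert-then-operate approach described above could look roughly like the sketch below. It uses pandas and NumPy purely as stand-ins (an assumption; the thread does not say which conversion method the dataframe API would expose), with `Series.to_numpy()` playing the role of the conversion step and `np.where` the role of the array function.

```python
import numpy as np
import pandas as pd

# A single column of data, standing in for a "Series"/column object.
s = pd.Series([-1.0, 2.0, -3.0, 4.0], name="x")

# Step 1: convert the column to an array (pandas' own conversion method).
arr = s.to_numpy()

# Step 2: apply an ordinary array function to the converted data.
clipped = np.where(arr < 0, 0.0, arr)

# Step 3 (optional): wrap the result back into a column, keeping the index.
result = pd.Series(clipped, index=s.index, name=s.name)
```

The dataframe library never has to implement `where` itself here; it only needs a reliable way to hand its data to something array-like.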

@jbrockmendel
Contributor

is something analogous to __array_ufunc__ or __array_function__ an option? IIUC the numpy devs haven't always been happy with these designs

@rgommers
Member

> is something analogous to __array_ufunc__ or __array_function__ an option?

The array API standard has __array_namespace__. If Column is array API standard compliant and has an __array_namespace__ method, things will work.

> IIUC the numpy devs haven't always been happy with these designs

Indeed. __array_namespace__ is based on the design ideas that were trying to address the limitations of __array_ufunc/function__.
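As a rough illustration of how `__array_namespace__` lets generic code work on a column, here is a minimal sketch. The `Column` and `ColumnNamespace` classes and the `zero_out_negatives` helper are hypothetical names invented for this example (not part of any library or of the dataframe API), and the namespace exposes only `where` rather than the full standard.

```python
import numpy as np


def _unwrap(obj):
    # Helper for this sketch only: pull the NumPy data out of a Column.
    return obj.values if isinstance(obj, Column) else obj


class Column:
    """Hypothetical array-like column; a stand-in, not a real library type."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __lt__(self, other):
        return Column(self.values < _unwrap(other))

    def __array_namespace__(self, api_version=None):
        # The array API standard's hook: return the namespace of functions
        # that know how to operate on this object.
        return ColumnNamespace


class ColumnNamespace:
    """Hypothetical namespace exposing a single array API function."""

    @staticmethod
    def where(condition, x1, x2):
        return Column(np.where(_unwrap(condition), _unwrap(x1), _unwrap(x2)))


def zero_out_negatives(x):
    # Library-agnostic consumer code: look up the namespace, then call into it.
    xp = x.__array_namespace__()
    return xp.where(x < 0, 0, x)


result = zero_out_negatives(Column([-1, 2, -3, 4]))
```

Note that `zero_out_negatives` never imports NumPy or mentions `Column`; it only relies on the object advertising its own namespace, which is the difference from the `__array_ufunc__`/`__array_function__` style of overriding NumPy's functions.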

@rgommers
Member

> We had (/have) a pretty strong consensus that there should not be a separate Series-like object, but only a DataFrame object with a single column.

We collectively changed our minds on this one. I think that came over time with the realization that (a) we're really building a library-author-focused API that is quite different from the public APIs in current dataframe libraries, (b) there are things that a one-column dataframe does not do well enough (e.g. we are now introducing a unique-values function for Column, which is pretty clean; unique for DataFrame would be different and much more complex), and (c) a column object which is array-like but with solid missing data support and string/datetime/categorical dtypes is a useful primitive.

So, I think we still want the thing here that @jakirkham originally asked for - but on Column rather than on DataFrame.
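To make point (b) a bit more concrete, here is a small sketch using pandas as a stand-in (an assumption; the Column API under discussion is a separate design). Unique values of one column are a simple 1-D result, whereas the closest dataframe-level analogue has to deduplicate whole rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "x", "y"]})

# Column-level: unique values of a single column, a plain 1-D result.
col_unique = df["a"].unique()

# DataFrame-level: "unique" has to mean unique rows across all columns,
# which is a different operation with a different result shape.
row_unique = df.drop_duplicates()
```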
