
feat: add to_pandas_batches #6805

Closed
thegreymatter opened this issue Aug 8, 2023 · 6 comments · Fixed by #7123
Labels
feature Features or general enhancements

Comments

@thegreymatter
Contributor

Is your feature request related to a problem?

Today there is no easy way to get pandas batches from an ibis expression.
I can do it manually by calling to_pyarrow_batches() and then to_pandas() on each batch (see the sketch after this list), but there are several issues with that approach:

  • It adds a bit more complexity and makes the workflow less approachable.
  • There are sometimes implementation differences between ibis->pyarrow->pandas and ibis->pandas that produce different results (for example, array columns in Snowflake play nicely with to_pandas but not with to_pyarrow).
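For reference, a minimal sketch of that manual workaround (the DuckDB connection and the "events" table name are placeholders; any backend that supports to_pyarrow_batches() works the same way):

import ibis

con = ibis.duckdb.connect()  # placeholder backend/connection
expr = con.table("events")   # placeholder table

# Stream Arrow record batches and convert each one to pandas by hand.
for batch in expr.to_pyarrow_batches():
    df = batch.to_pandas()  # pyarrow.RecordBatch -> pandas.DataFrame
    ...  # per-chunk processing goes here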

Describe the solution you'd like

Add a to_pandas_batches() method on table expressions.
By default it could be implemented using the pyarrow-batches -> pandas approach, and if a specific backend has a better implementation we can use that instead (see the usage sketch below).
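The intended usage would be something like this (a sketch of the proposed API surface, not a committed signature; process is a placeholder):

# Iterate over pandas DataFrames chunk by chunk, without
# materializing the whole result in memory at once.
for df in expr.to_pandas_batches():
    process(df)  # placeholder for per-chunk work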

What version of ibis are you running?

6.0.0

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@thegreymatter thegreymatter added the feature Features or general enhancements label Aug 8, 2023
@cpcloud
Member

cpcloud commented Aug 8, 2023

@thegreymatter Nice to see you again, and thanks for the issue!

for example, array columns in Snowflake play nicely with to_pandas but not with to_pyarrow

Yeah, this is a bit of a pickle.

The issue is that pandas is more flexible in this case because it can hold arbitrary Python objects in a column. It turns out this is useful for Snowflake complex types, since those types are effectively untyped JSON. PyArrow is more efficient across the board, but lacks an in-memory representation more specific than a string type for Snowflake's complex types.

to_pandas_batches seems reasonable to me

@gforsyth @jcrist Thoughts?

@gforsyth
Member

I'm not opposed. The fallback implementation for pyarrow batches should work for this case too, although I think we should decouple pandas batching from pyarrow so we can still provide the more "flexible" typing.

@NickCrews
Contributor

Are there other limitations of just having to_pyarrow, or is that the only one? If that is the only one, then I would vote for keeping things simple and just supporting .to_pyarrow(). I want to see pyarrow being the lingua franca, the "hub of the conversion wheel", where every other implementation can convert to/from that, and then the conversion path is always X <-> pyarrow <-> Y. Otherwise, every time we add a way to convert directly between X <-> Y, we risk adding another place for inconsistencies to slip in, and it is more maintenance.

@cpcloud
Member

cpcloud commented Sep 7, 2023

I think this makes sense to do. While it is additional code to maintain, I'm not sure it further entrenches pandas anywhere or reduces the need/desire to Arrow-ize all the things.

The implementation might even be as simple as:

yield from (
    # Convert each Arrow record batch to pandas, then coerce the result
    # to the expression's ibis schema (this is where backend-specific
    # conversions such as Snowflake's JSON columns get handled).
    PandasSchema.convert_table(batch.to_pandas(), expr.schema())
    for batch in expr.to_pyarrow_batches()
)
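Wrapped into a free-standing generator, the same idea might look like this (a sketch only; PandasSchema is taken from the snippet above and stands in for ibis's internal pandas converter, not a public API):

from collections.abc import Iterator

import pandas as pd

def to_pandas_batches(expr) -> Iterator[pd.DataFrame]:
    # Resolve the schema once; every yielded chunk is coerced to it,
    # so the batches are mutually consistent.
    schema = expr.schema()
    for batch in expr.to_pyarrow_batches():
        yield PandasSchema.convert_table(batch.to_pandas(), schema)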

@NickCrews
Contributor

OK, sounds good! It will definitely be appreciated and used by people.

Does that implementation still suffer from the drawback of Snowflake complex types, since you're still going snowflake -> pyarrow -> pandas?

@cpcloud
Member

cpcloud commented Sep 8, 2023

Does that implementation still suffer from the drawback of Snowflake complex types, since you're still going snowflake -> pyarrow -> pandas?

No, because PandasSchema.convert_table should handle calling json.loads on those columns (for Snowflake only).
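To illustrate the kind of conversion meant here (an assumption about the mechanism, shown with plain pandas; the column name is hypothetical):

import json

import pandas as pd

# Snowflake surfaces ARRAY/OBJECT values as JSON strings; mapping
# json.loads over the column restores Python lists/dicts.
df = pd.DataFrame({"tags": ['["a", "b"]', '["c"]']})
df["tags"] = df["tags"].map(json.loads)
print(df["tags"].iloc[0])  # ['a', 'b']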

@cpcloud cpcloud added this to the 7.0 milestone Sep 8, 2023
@cpcloud cpcloud removed this from the 7.0 milestone Sep 28, 2023