
feat: add to_pandas_batches #6805

Closed
thegreymatter opened this issue Aug 8, 2023 · 6 comments · Fixed by #7123
Labels
feature Features or general enhancements

Comments

@thegreymatter
Contributor

Is your feature request related to a problem?

Today there is no easy way to get pandas batches from an ibis expression.
I can do it manually by calling to_pyarrow_batches() and then to_pandas() on each batch (see the sketch after this list), but there are several issues with that approach:

  • It adds a bit more complexity and makes the workflow less approachable.
  • There are sometimes implementation differences between ibis->pyarrow->pandas and ibis->pandas that produce different results (for example, array columns in Snowflake play nicely with to_pandas but not with to_pyarrow).
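For reference, a minimal sketch of that manual workaround (the DuckDB connection and the "events" table name are placeholders; any backend that supports to_pyarrow_batches() works the same way):

import ibis

con = ibis.duckdb.connect()  # placeholder backend/connection
expr = con.table("events")   # placeholder table

# Stream Arrow record batches and convert each one to pandas by hand.
for batch in expr.to_pyarrow_batches():
    df = batch.to_pandas()  # pyarrow.RecordBatch -> pandas.DataFrame
    ...  # per-chunk processing goes here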

Describe the solution you'd like

Add a to_pandas_batches() method on table expressions.
By default it could be implemented using the pyarrow-batches -> pandas approach, and if a specific backend has a better implementation we can use that instead (see the usage sketch below).
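The intended usage would be something like this (a sketch of the proposed API surface, not a committed signature; process is a placeholder):

# Iterate over pandas DataFrames chunk by chunk, without
# materializing the whole result in memory at once.
for df in expr.to_pandas_batches():
    process(df)  # placeholder for per-chunk work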

What version of ibis are you running?

6.0.0

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@thegreymatter thegreymatter added the feature Features or general enhancements label Aug 8, 2023
@cpcloud
Member

cpcloud commented Aug 8, 2023

@thegreymatter Nice to see you again, and thanks for the issue!

for example, array columns in Snowflake play nicely with to_pandas but not with to_pyarrow

Yeah, this is a bit of a pickle.

The issue is that pandas is more flexible in this case because it can hold arbitrary Python objects in a column. It turns out this is useful for Snowflake complex types, since those types are effectively untyped JSON. PyArrow is more efficient across the board, but lacks an in-memory representation more specific than a string type for Snowflake's complex types.

to_pandas_batches seems reasonable to me

@gforsyth @jcrist Thoughts?

@gforsyth
Member

I'm not opposed. The fallback implementation for pyarrow batches should work for this case too, although I think we should decouple pandas batching from pyarrow so we can still provide the more "flexible" typing.

@NickCrews
Contributor

Are there other limitations of just having to_pyarrow, or is that the only one? If that is the only one, then I would vote for keeping things simple and just supporting .to_pyarrow(). I want to see pyarrow being the lingua franca, the "hub of the conversion wheel", where every other implementation can convert to/from that, and then the conversion path is always X <-> pyarrow <-> Y. Otherwise, every time we add a way to convert directly between X <-> Y, we risk adding another place for inconsistencies to slip in, and it is more maintenance.

@cpcloud
Member

cpcloud commented Sep 7, 2023

I think this makes sense to do. While it is additional code to maintain, I'm not sure it further entrenches pandas anywhere or reduces the need/desire to Arrow-ize all the things.

The implementation might even be as simple as:

yield from (
    # Convert each Arrow record batch to pandas, then coerce the result
    # to the expression's ibis schema (this is where backend-specific
    # conversions such as Snowflake's JSON columns get handled).
    PandasSchema.convert_table(batch.to_pandas(), expr.schema())
    for batch in expr.to_pyarrow_batches()
)
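Wrapped into a free-standing generator, the same idea might look like this (a sketch only; PandasSchema is taken from the snippet above and stands in for ibis's internal pandas converter, not a public API):

from collections.abc import Iterator

import pandas as pd

def to_pandas_batches(expr) -> Iterator[pd.DataFrame]:
    # Resolve the schema once; every yielded chunk is coerced to it,
    # so the batches are mutually consistent.
    schema = expr.schema()
    for batch in expr.to_pyarrow_batches():
        yield PandasSchema.convert_table(batch.to_pandas(), schema)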

@NickCrews
Contributor

OK, sounds good! It will definitely be appreciated and used by people.

Does that implementation still suffer from the drawback of Snowflake complex types, since you're still going snowflake -> pyarrow -> pandas?

@cpcloud
Member

cpcloud commented Sep 8, 2023

Does that implementation still suffer from the drawback of Snowflake complex types, since you're still going snowflake -> pyarrow -> pandas?

No, because PandasSchema.convert_table should handle calling json.loads on those columns (for Snowflake only).
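To illustrate the kind of conversion meant here (an assumption about the mechanism, shown with plain pandas; the column name is hypothetical):

import json

import pandas as pd

# Snowflake surfaces ARRAY/OBJECT values as JSON strings; mapping
# json.loads over the column restores Python lists/dicts.
df = pd.DataFrame({"tags": ['["a", "b"]', '["c"]']})
df["tags"] = df["tags"].map(json.loads)
print(df["tags"].iloc[0])  # ['a', 'b']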

@cpcloud cpcloud added this to the 7.0 milestone Sep 8, 2023
@cpcloud cpcloud removed this from the 7.0 milestone Sep 28, 2023