feat: allow registering all in-memory table types via `create_table` #6593

jcrist · 2023-07-06T21:10:47Z

Summarizing the discussions below, here's my (@gforsyth) rough outline:

Add fallback behavior to all backends such that if create_table or insert gets a non-ibis-expression as obj, convert it to a memtable and then continue.
For Polars, Datafusion, pandas, and dask, use the native methods to read in these objects
For DuckDB, use native methods to read in these objects, then create a table from the object so that create_table is actually creating a table.
For Polars, Datafusion, DuckDB, pandas, and dask, additionally support passing in-memory objects to create_view (for all but DuckDB, the distinction between views and tables carries no meaning)

Is your feature request related to a problem?

Originally we had con.register to handle data ingestion of anything a backend could support. This method works fine and is defined for many backends, but is a bit "magical" in that it determines the input format based on the type and value of the input. Later on we split part of this functionality out into standard read_csv/read_parquet/read_* methods, which register dispatches to.

The duckdb backend also has a read_in_memory method for reading from an in-memory object. Unlike files (where we dispatch based on things like the file extension), in memory data always has an explicit python type, so having a method like read_in_memory to handle any in-memory data sources seems like a nice interface. It would be good to standardize this interface across all the backends that could support it. For the most part this would be extracting out the existing functionality from the .register methods and moving it to a new read_in_memory method.

Describe the solution you'd like

For any backend that could support it to define a method like:

def read_in_memory(self, source: pd.DataFrame | pa.Table | pa.RecordBatchReader, table_name: str | None = None) -> ir.Table:
    ...

What version of ibis are you running?

dev

What backend(s) are you using, if any?

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

chloeh13q · 2024-03-07T17:55:37Z

I might be missing some context/motivation, but I was wondering: would this not function effectively the same as create_table() with an in-memory obj?

gforsyth · 2024-04-15T21:15:44Z

We discussed this today briefly and I think @chloeh13q is right -- this ends up being (mostly) a specialized version of passing an in-memory object to create_table.

Currently, if the obj isn't an Ibis expression, we convert it to a memtable and then create the table from that.

I think that's a reasonable default/fallback behavior for most backends and allows for users to insert pyarrow tables, polars dataframes and pandas dataframes into all backends. This is also currently implemented.

For DuckDB, Datafusion, and Polars, there are extra options available.

For Datafusion and Polars, there are native methods to read in pyarrow tables, pyarrow recordbatchreaders, polars dataframes, pandas dataframes, etc. Datafusion creates all of these objects as "Tables", Polars does not distinguish between tables and views, so they are also created as what Ibis refers to as "tables".

Adding extra handling for the fast-path for querying these objects will allow us to deprecate register without also introducing a new read_in_memory to these backends. We can also deprecate some of the existing format-specific methods in those backends (e.g. polars read_pandas).

DuckDB also has fast-paths for querying in-memory objects -- if we use the DuckDB connection register method, this will create a view, though, not a table. This seems like a potential source of confusion for users -- I'd expect to have a persistent table if I pass something to create_table.

Maybe we should also support passing in-memory objects to create_view and have that be the fast-path for DuckDB?

jcrist · 2024-04-15T21:24:44Z

Agreed, makes sense to me!

Maybe we should also support passing in-memory objects to create_view and have that be the fast-path for DuckDB?

I think that makes sense. To clarify, creating a view from an in-memory dataset would only work for backends that would implement it as a true view (meaning querying the in-memory data directly) - effectively limiting con.create_view("test", df) to things like duckdb/pandas/polars? FWIW I think this is more of an optimization thing and not a thing most users would need to do.

gforsyth · 2024-04-16T12:54:17Z

effectively limiting con.create_view("test", df) to things like duckdb/pandas/polars?

Ehh, there's a weird bifurcation.

DuckDB is the only one of the in-memory friendly systems that has it's own persistent format, so for DuckDB, there's a very real distinction between a view of an arrow table vs. a table of an arrow table.

For pandas, dask, polars, and datafusion, there is effectively no difference between views and tables, since everything is in memory and nothing persists outside of a given session.

I think it may not actually matter all that much -- we'll allow arrow tables to be passed to both create_view and create_table for these 5 backends, and on 4 of them, there will be no difference in how that manifests, and with DuckDB, there's a performance edge-case but not a big one.

This PR adds/codifies support for passing in-memory data to `create_table`. The default behavior for most backends is to first create a `memtable` with whatever `obj` is passed to `create_table`, then we create a table based on that `memtable` -- because of this, semantics around `temp` tables and `catalog.database` locations are handled correctly. After the new table (that the user has provided a name for) is created, we drop the intermediate `memtable` so we don't add two tables for every in-memory object passed to `create_table`. Currently most backends fail when passed `RecordBatchReaders`, or a single `RecordBatch`, or a `pyarrow.Dataset` -- if we add support for these to `memtable`, all of those backends would start working, so I've marked those xfails as `notimpl` for now. A few backends _don't_ work this way: `polars` reads in the table directly using their fast-path local-memory reading stuff. `datafusion` uses a fast-path read, then creates a table from the table that is created by the fast-path -- this is because the `datafusion` dataframe API has no way to specify things like `overwrite`, or table location, but the CTAS from already present tables is very quick (and _possibly_ zero-copy?) so no issue there. `duckdb` has a refactored `read_in_memory` (which we should deprecate), but it isn't entirely hooked up inside of `create_table` yet, so some paths may go via `memtable` creation, but `memtable` creation on DuckDB is especially fast, so I'm all for fixing this up eventually. `pyspark` works with the intermediate `memtable` -- there are possibly fast-paths available, but they aren't currently implemented. `pandas` and `dask` have a custom `_convert_object` path TODO: * ~[ ] Flink~ Flink can't create tables from in-memory data? * [x] Impala * [x] BigQuery * [x] Remove `read_in_memory` from datafusion and polars Resolves #6593 xref #8863 Signed-off-by: Gil Forsyth <[email protected]> - refactor(duckdb): add polars df as option, move test to backend suite - feat(polars): enable passing in-memory data to create_table - feat(datafusion): enable passing in-memory data to create_table - feat(datafusion): use info_schema for list_tables - feat(duckdb): enable passing in-memory data to create_table - feat(postgres): allow passing in-memory data to create_table - feat(trino): allow passing in-memory date to create_table - feat(mysql): allow passing in-memory data to create_table - feat(mssql): allow passing in-memory data to create_table - feat(exasol): allow passing in-memory data to create_table - feat(risingwave): allow passing in-memory data to create_table - feat(sqlite): allow passing in-memory data to create_table - feat(clickhouse): enable passing in-memory data to create_table - feat(oracle): enable passing in-memory data to create_table - feat(snowflake): allow passing in-memory data to create_table - feat(pyspark): enable passing in-memory data to create_table - feat(pandas,dask): allow passing in-memory data to create_table --------- Signed-off-by: Gil Forsyth <[email protected]> Co-authored-by: Phillip Cloud <[email protected]>

jcrist added the feature Features or general enhancements label Jul 6, 2023

lostmygithubaccount added this to Ibis planning and roadmap Nov 30, 2023

github-project-automation bot moved this to backlog in Ibis planning and roadmap Nov 30, 2023

gforsyth mentioned this issue Apr 5, 2024

refactor: standardize read_in_memory across backends #8858

Closed

gforsyth added this to the 9.0 milestone Apr 5, 2024

gforsyth mentioned this issue Apr 11, 2024

refactor: deprecate register api #8863

Merged

gforsyth self-assigned this Apr 11, 2024

cpcloud removed this from the 9.0 milestone Apr 12, 2024

gforsyth moved this from backlog to todo in Ibis planning and roadmap Apr 18, 2024

gforsyth mentioned this issue Apr 25, 2024

[meta] Ibis expression API stability #8996

Closed

10 tasks

gforsyth changed the title ~~feat: Standardize read_in_memory across backends~~ feat: allow registering all in-memory table types via create_table May 6, 2024

This was referenced May 6, 2024

feat: ibis.from_* methods for injesting in-memory data structures #8622

Closed

[meta] Ibis backend public API stability #8994

Closed

gforsyth mentioned this issue May 24, 2024

feat(all): enable passing in-memory data to create_table #9251

Merged

3 tasks

gforsyth closed this as completed in #9251 May 29, 2024

github-project-automation bot moved this from todo to done in Ibis planning and roadmap May 29, 2024

csubhodeep mentioned this issue Jun 23, 2024

bug: Error writing a table expression to a create (or replace existing) table when using read_parquet #9432

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow registering all in-memory table types via `create_table` #6593

feat: allow registering all in-memory table types via `create_table` #6593

jcrist commented Jul 6, 2023 •

edited by gforsyth

Loading

chloeh13q commented Mar 7, 2024 •

edited

Loading

gforsyth commented Apr 15, 2024

jcrist commented Apr 15, 2024

gforsyth commented Apr 16, 2024

feat: allow registering all in-memory table types via create_table #6593

feat: allow registering all in-memory table types via create_table #6593

Comments

jcrist commented Jul 6, 2023 • edited by gforsyth Loading

Is your feature request related to a problem?

Describe the solution you'd like

What version of ibis are you running?

What backend(s) are you using, if any?

Code of Conduct

chloeh13q commented Mar 7, 2024 • edited Loading

gforsyth commented Apr 15, 2024

jcrist commented Apr 15, 2024

gforsyth commented Apr 16, 2024

feat: allow registering all in-memory table types via `create_table` #6593

feat: allow registering all in-memory table types via `create_table` #6593

jcrist commented Jul 6, 2023 •

edited by gforsyth

Loading

chloeh13q commented Mar 7, 2024 •

edited

Loading