# feat: allow registering all in-memory table types via `create_table` (#6593)
Comments
I might be missing some context/motivation, but I was wondering: would this not function effectively the same as `create_table`?
We discussed this today briefly and I think @chloeh13q is right -- this ends up being (mostly) a specialized version of passing an in-memory object to `create_table`.

Currently, if `create_table` gets a non-ibis-expression as `obj`, we convert it to a `memtable` and then continue. I think that's a reasonable default/fallback behavior for most backends and allows users to insert pyarrow tables, polars dataframes, and pandas dataframes into all backends. This is also currently implemented.

For DuckDB, Datafusion, and Polars, there are extra options available. For Datafusion and Polars, there are native methods to read in pyarrow tables, pyarrow recordbatchreaders, polars dataframes, pandas dataframes, etc. Datafusion creates all of these objects as "tables"; Polars does not distinguish between tables and views, so they are also created as what Ibis refers to as "tables". Adding extra handling for the fast-path for querying these objects will allow us to deprecate `read_in_memory`.

DuckDB also has fast-paths for querying in-memory objects if we use the DuckDB connection directly.

Maybe we should also support passing in-memory objects to `create_view`.
Agreed, makes sense to me!
I think that makes sense. To clarify, creating a view from an in-memory dataset would only work for backends that would implement it as a true view (meaning querying the in-memory data directly) -- effectively limiting that behavior to backends like DuckDB?
Ehh, there's a weird bifurcation. DuckDB is the only one of the in-memory friendly systems that has its own persistent format, so for DuckDB, there's a very real distinction between a view of an arrow table vs. a table of an arrow table. For pandas, dask, polars, and datafusion, there is effectively no difference between views and tables, since everything is in memory and nothing persists outside of a given session. I think it may not actually matter all that much -- we'll allow arrow tables to be passed to both `create_table` and `create_view`.
This PR adds/codifies support for passing in-memory data to `create_table`.

The default behavior for most backends is to first create a `memtable` from whatever `obj` is passed to `create_table`, then create a table based on that `memtable` -- because of this, semantics around `temp` tables and `catalog.database` locations are handled correctly. After the new table (with the user-provided name) is created, we drop the intermediate `memtable` so we don't add two tables for every in-memory object passed to `create_table`.

Currently most backends fail when passed a `RecordBatchReader`, a single `RecordBatch`, or a `pyarrow.Dataset` -- if we add support for these to `memtable`, all of those backends would start working, so I've marked those xfails as `notimpl` for now.

A few backends _don't_ work this way:

* `polars` reads in the table directly using its fast-path local-memory reading stuff.
* `datafusion` uses a fast-path read, then creates a table from the table that is created by the fast-path -- this is because the `datafusion` dataframe API has no way to specify things like `overwrite` or table location, but the CTAS from already present tables is very quick (and _possibly_ zero-copy?), so no issue there.
* `duckdb` has a refactored `read_in_memory` (which we should deprecate), but it isn't entirely hooked up inside of `create_table` yet, so some paths may go via `memtable` creation; `memtable` creation on DuckDB is especially fast, so I'm all for fixing this up eventually.
* `pyspark` works with the intermediate `memtable` -- there are possibly fast-paths available, but they aren't currently implemented.
* `pandas` and `dask` have a custom `_convert_object` path.

TODO:

* ~[ ] Flink~ Flink can't create tables from in-memory data?
* [x] Impala
* [x] BigQuery
* [x] Remove `read_in_memory` from datafusion and polars

Resolves #6593
xref #8863

Signed-off-by: Gil Forsyth <[email protected]>

- refactor(duckdb): add polars df as option, move test to backend suite
- feat(polars): enable passing in-memory data to create_table
- feat(datafusion): enable passing in-memory data to create_table
- feat(datafusion): use info_schema for list_tables
- feat(duckdb): enable passing in-memory data to create_table
- feat(postgres): allow passing in-memory data to create_table
- feat(trino): allow passing in-memory data to create_table
- feat(mysql): allow passing in-memory data to create_table
- feat(mssql): allow passing in-memory data to create_table
- feat(exasol): allow passing in-memory data to create_table
- feat(risingwave): allow passing in-memory data to create_table
- feat(sqlite): allow passing in-memory data to create_table
- feat(clickhouse): enable passing in-memory data to create_table
- feat(oracle): enable passing in-memory data to create_table
- feat(snowflake): allow passing in-memory data to create_table
- feat(pyspark): enable passing in-memory data to create_table
- feat(pandas,dask): allow passing in-memory data to create_table

---------

Signed-off-by: Gil Forsyth <[email protected]>
Co-authored-by: Phillip Cloud <[email protected]>
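The intermediate-`memtable` flow described in the PR description can be sketched with a toy backend. All names below are illustrative, not the actual Ibis internals: wrap the raw object as a registered memtable, "CTAS" it into the user-named table, then drop the intermediate so only one table remains.

```python
# Toy sketch of the default create_table fallback; class and method names
# are illustrative, not the actual Ibis implementation.
class ToyBackend:
    def __init__(self):
        self.tables = {}

    def _register_memtable(self, rows):
        # Register the raw in-memory object under a generated name.
        name = f"ibis_memtable_{len(self.tables)}"
        self.tables[name] = list(rows)
        return name

    def create_table(self, name, obj, *, overwrite=False):
        if name in self.tables and not overwrite:
            raise ValueError(f"table {name!r} already exists")
        mem = self._register_memtable(obj)              # 1. intermediate memtable
        try:
            self.tables[name] = list(self.tables[mem])  # 2. "CTAS" into the target
        finally:
            del self.tables[mem]                        # 3. drop the intermediate
        return name


con = ToyBackend()
con.create_table("t", [(1, "x"), (2, "y")])
assert list(con.tables) == ["t"]  # only the named table remains
```

The `try`/`finally` mirrors the guarantee the PR describes: even if table creation fails, the intermediate memtable is cleaned up rather than left behind as a second table.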
Summarizing the discussions below, here's my (@gforsyth) rough outline:

* If `create_table` or `insert` gets a non-ibis-expression as `obj`, convert it to a memtable and then continue. `create_table` is actually creating a table.
* Fast-path views of in-memory data go through `create_view` (for all but DuckDB, the distinction between views and tables carries no meaning).

### Is your feature request related to a problem?
Originally we had `con.register` to handle data ingestion of anything a backend could support. This method works fine and is defined for many backends, but is a bit "magical" in that it determines the input format based on the type and value of the input. Later on we split part of this functionality out into standard `read_csv`/`read_parquet`/`read_*` methods, which `register` dispatches to.

The duckdb backend also has a `read_in_memory` method for reading from an in-memory object. Unlike files (where we dispatch based on things like the file extension), in-memory data always has an explicit python type, so having a method like `read_in_memory` to handle any in-memory data sources seems like a nice interface. It would be good to standardize this interface across all the backends that could support it. For the most part this would be extracting out the existing functionality from the `.register` methods and moving it to a new `read_in_memory` method.

### Describe the solution you'd like
For any backend that could support it, define a method like:
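The concrete signature proposed in the issue did not survive extraction; a minimal sketch of what such a method could look like follows. The parameter names and types here are hypothetical, not the actual Ibis API:

```python
from __future__ import annotations

from typing import Any


class Backend:
    # Hypothetical sketch only -- parameter names and behavior are
    # illustrative; the signature proposed in the original issue was lost.
    def read_in_memory(self, source: Any, table_name: str | None = None):
        """Register an in-memory object (pyarrow.Table, pandas or polars
        DataFrame, RecordBatchReader, ...) with the backend and return a
        table expression referring to it."""
        raise NotImplementedError
```

Because in-memory data always carries an explicit Python type, the method can dispatch on `type(source)` rather than on file extensions, which is the distinction the issue draws against the `read_csv`/`read_parquet` family.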
### What version of ibis are you running?

dev

### What backend(s) are you using, if any?

No response