Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow registering all in-memory table types via create_table #6593

Closed
1 task done
jcrist opened this issue Jul 6, 2023 · 4 comments · Fixed by #9251
Closed
1 task done

feat: allow registering all in-memory table types via create_table #6593

jcrist opened this issue Jul 6, 2023 · 4 comments · Fixed by #9251
Assignees
Labels
feature Features or general enhancements

Comments

@jcrist
Copy link
Member

jcrist commented Jul 6, 2023

Summarizing the discussions below, here's my (@gforsyth) rough outline:

  • Add fallback behavior to all backends such that if create_table or insert gets a non-ibis-expression as obj, convert it to a memtable and then continue.
  • For Polars, Datafusion, pandas, and dask, use the native methods to read in these objects
  • For DuckDB, use native methods to read in these objects, then create a table from the object so that create_table is actually creating a table.
  • For Polars, Datafusion, DuckDB, pandas, and dask, additionally support passing in-memory objects to create_view (for all but DuckDB, the distinction between views and tables carries no meaning)

Is your feature request related to a problem?

Originally we had con.register to handle data ingestion of anything a backend could support. This method works fine and is defined for many backends, but is a bit "magical" in that it determines the input format based on the type and value of the input. Later on we split part of this functionality out into standard read_csv/read_parquet/read_* methods, which register dispatches to.

The duckdb backend also has a read_in_memory method for reading from an in-memory object. Unlike files (where we dispatch based on things like the file extension), in memory data always has an explicit python type, so having a method like read_in_memory to handle any in-memory data sources seems like a nice interface. It would be good to standardize this interface across all the backends that could support it. For the most part this would be extracting out the existing functionality from the .register methods and moving it to a new read_in_memory method.

Describe the solution you'd like

For any backend that could support it to define a method like:

def read_in_memory(self, source: pd.DataFrame | pa.Table | pa.RecordBatchReader, table_name: str | None = None) -> ir.Table:
    ...

What version of ibis are you running?

dev

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jcrist jcrist added the feature Features or general enhancements label Jul 6, 2023
@chloeh13q
Copy link
Contributor

chloeh13q commented Mar 7, 2024

I might be missing some context/motivation, but I was wondering: would this not function effectively the same as create_table() with an in-memory obj?

@gforsyth gforsyth added this to the 9.0 milestone Apr 5, 2024
@gforsyth gforsyth self-assigned this Apr 11, 2024
@cpcloud cpcloud removed this from the 9.0 milestone Apr 12, 2024
@gforsyth
Copy link
Member

We discussed this today briefly and I think @chloeh13q is right -- this ends up being (mostly) a specialized version of passing an in-memory object to create_table.

Currently, if the obj isn't an Ibis expression, we convert it to a memtable and then create the table from that.

I think that's a reasonable default/fallback behavior for most backends and allows for users to insert pyarrow tables, polars dataframes and pandas dataframes into all backends. This is also currently implemented.

For DuckDB, Datafusion, and Polars, there are extra options available.

For Datafusion and Polars, there are native methods to read in pyarrow tables, pyarrow recordbatchreaders, polars dataframes, pandas dataframes, etc. Datafusion creates all of these objects as "Tables", Polars does not distinguish between tables and views, so they are also created as what Ibis refers to as "tables".

Adding extra handling for the fast-path for querying these objects will allow us to deprecate register without also introducing a new read_in_memory to these backends. We can also deprecate some of the existing format-specific methods in those backends (e.g. polars read_pandas).

DuckDB also has fast-paths for querying in-memory objects -- if we use the DuckDB connection register method, this will create a view, though, not a table. This seems like a potential source of confusion for users -- I'd expect to have a persistent table if I pass something to create_table.

Maybe we should also support passing in-memory objects to create_view and have that be the fast-path for DuckDB?

@jcrist
Copy link
Member Author

jcrist commented Apr 15, 2024

Agreed, makes sense to me!

Maybe we should also support passing in-memory objects to create_view and have that be the fast-path for DuckDB?

I think that makes sense. To clarify, creating a view from an in-memory dataset would only work for backends that would implement it as a true view (meaning querying the in-memory data directly) - effectively limiting con.create_view("test", df) to things like duckdb/pandas/polars? FWIW I think this is more of an optimization thing and not a thing most users would need to do.

@gforsyth
Copy link
Member

effectively limiting con.create_view("test", df) to things like duckdb/pandas/polars?

Ehh, there's a weird bifurcation.

DuckDB is the only one of the in-memory friendly systems that has it's own persistent format, so for DuckDB, there's a very real distinction between a view of an arrow table vs. a table of an arrow table.

For pandas, dask, polars, and datafusion, there is effectively no difference between views and tables, since everything is in memory and nothing persists outside of a given session.

I think it may not actually matter all that much -- we'll allow arrow tables to be passed to both create_view and create_table for these 5 backends, and on 4 of them, there will be no difference in how that manifests, and with DuckDB, there's a performance edge-case but not a big one.

@gforsyth gforsyth moved this from backlog to todo in Ibis planning and roadmap Apr 18, 2024
@gforsyth gforsyth changed the title feat: Standardize read_in_memory across backends feat: allow registering all in-memory table types via create_table May 6, 2024
gforsyth added a commit that referenced this issue May 29, 2024
This PR adds/codifies support for passing in-memory data to
`create_table`.

The default behavior for most backends is to first create a `memtable`
with
whatever `obj` is passed to `create_table`, then we create a table based
on that
`memtable` -- because of this, semantics around `temp` tables and
`catalog.database` locations are handled correctly. 

After the new table (that the user has provided a name for) is created,
we
drop the intermediate `memtable` so we don't add two tables for every
in-memory
object passed to `create_table`.

Currently most backends fail when passed `RecordBatchReaders`, or a
single
`RecordBatch`, or a `pyarrow.Dataset` -- if we add support for these to
`memtable`, all of those backends would start working, so I've marked
those
xfails as `notimpl` for now.

A few backends _don't_ work this way:

`polars` reads in the table directly using their fast-path local-memory
reading stuff.

`datafusion` uses a fast-path read, then creates a table from the table
that is
created by the fast-path -- this is because the `datafusion` dataframe
API has
no way to specify things like `overwrite`, or table location, but the
CTAS from
already present tables is very quick (and _possibly_ zero-copy?) so no
issue
there.

`duckdb` has a refactored `read_in_memory` (which we should deprecate),
but it
isn't entirely hooked up inside of `create_table` yet, so some paths may
go via
`memtable` creation, but `memtable` creation on DuckDB is especially
fast, so
I'm all for fixing this up eventually.

`pyspark` works with the intermediate `memtable` -- there are possibly
fast-paths available, but they aren't currently implemented.

`pandas` and `dask` have a custom `_convert_object` path


TODO:
* ~[ ] Flink~  Flink can't create tables from in-memory data?
* [x] Impala
* [x] BigQuery 
* [x] Remove `read_in_memory` from datafusion and polars

Resolves #6593 
xref #8863


Signed-off-by: Gil Forsyth <[email protected]>
- refactor(duckdb): add polars df as option, move test to backend suite
- feat(polars): enable passing in-memory data to create_table
- feat(datafusion): enable passing in-memory data to create_table
- feat(datafusion): use info_schema for list_tables
- feat(duckdb): enable passing in-memory data to create_table
- feat(postgres): allow passing in-memory data to create_table
- feat(trino): allow passing in-memory date to create_table
- feat(mysql): allow passing in-memory data to create_table
- feat(mssql): allow passing in-memory data to create_table
- feat(exasol): allow passing in-memory data to create_table
- feat(risingwave): allow passing in-memory data to create_table
- feat(sqlite): allow passing in-memory data to create_table
- feat(clickhouse): enable passing in-memory data to create_table
- feat(oracle): enable passing in-memory data to create_table
- feat(snowflake): allow passing in-memory data to create_table
- feat(pyspark): enable passing in-memory data to create_table
- feat(pandas,dask): allow passing in-memory data to create_table

---------

Signed-off-by: Gil Forsyth <[email protected]>
Co-authored-by: Phillip Cloud <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants