feat(all): enable passing in-memory data to create_table #9251

gforsyth · 2024-05-24T20:31:46Z

This PR adds/codifies support for passing in-memory data to create_table.

The default behavior for most backends is to first create a memtable with
whatever obj is passed to create_table, then we create a table based on that
memtable -- because of this, semantics around temp tables and
catalog.database locations are handled correctly.

After the new table (that the user has provided a name for) is created, we
drop the intermediate memtable so we don't add two tables for every in-memory
object passed to create_table.
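
The flow described above (stage the in-memory object, create the named table from the staged copy, then drop the staged copy) can be sketched with nothing but the standard library. This is not the ibis implementation: `create_table_from_rows` and the `_memtable_` prefix are hypothetical names, and a SQLite temp table stands in for the intermediate memtable.

```python
import sqlite3
import uuid


def create_table_from_rows(con, name, columns, rows, overwrite=False):
    """Create table `name` from in-memory rows via a throwaway staging table."""
    # Hypothetical stand-in for the intermediate memtable: a uniquely named temp table.
    staging = f"_memtable_{uuid.uuid4().hex}"
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)

    con.execute(f'CREATE TEMP TABLE "{staging}" ({col_list})')
    try:
        con.executemany(f'INSERT INTO "{staging}" VALUES ({placeholders})', rows)
        if overwrite:
            con.execute(f'DROP TABLE IF EXISTS "{name}"')
        # Create the user-named table from the staged copy (CTAS).
        con.execute(f'CREATE TABLE "{name}" AS SELECT * FROM "{staging}"')
    finally:
        # Drop the intermediate table so only the user-named table remains.
        con.execute(f'DROP TABLE "{staging}"')
    con.commit()


con = sqlite3.connect(":memory:")
create_table_from_rows(
    con, "penguins", ["species", "mass_g"], [("Adelie", 3750), ("Gentoo", 5000)]
)
tables = [r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
temp_tables = [r[0] for r in con.execute("SELECT name FROM sqlite_temp_master WHERE type = 'table'")]
```

After the call, only the named table is left; the staged copy is gone whether or not the CTAS step succeeded.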

Currently most backends fail when passed RecordBatchReaders, or a single
RecordBatch, or a pyarrow.Dataset -- if we add support for these to
memtable, all of those backends would start working, so I've marked those
xfails as notimpl for now.

A few backends don't work this way:

polars reads the table in directly using its fast path for local in-memory data.

datafusion uses a fast-path read, then creates a table from the table that is
created by the fast-path -- this is because the datafusion dataframe API has
no way to specify things like overwrite, or table location, but the CTAS from
already present tables is very quick (and possibly zero-copy?) so no issue
there.

duckdb has a refactored read_in_memory (which we should deprecate), but it
isn't fully hooked up inside create_table yet, so some paths may still go
through memtable creation. Memtable creation on DuckDB is especially fast, so
this isn't urgent, but I'm all for fixing it up eventually.

pyspark works with the intermediate memtable -- there are possibly
fast-paths available, but they aren't currently implemented.

pandas and dask have a custom _convert_object path.

TODO:

~~[ ] Flink~~ Flink can't create tables from in-memory data?
Impala
BigQuery
Remove read_in_memory from datafusion and polars

Resolves #6593
xref #8863

Signed-off-by: Gil Forsyth [email protected]

refactor(duckdb): add polars df as option, move test to backend suite
feat(polars): enable passing in-memory data to create_table
feat(datafusion): enable passing in-memory data to create_table
feat(datafusion): use info_schema for list_tables
feat(duckdb): enable passing in-memory data to create_table
feat(postgres): allow passing in-memory data to create_table
feat(trino): allow passing in-memory data to create_table
feat(mysql): allow passing in-memory data to create_table
feat(mssql): allow passing in-memory data to create_table
feat(exasol): allow passing in-memory data to create_table
feat(risingwave): allow passing in-memory data to create_table
feat(sqlite): allow passing in-memory data to create_table
feat(clickhouse): enable passing in-memory data to create_table
feat(oracle): enable passing in-memory data to create_table
feat(snowflake): allow passing in-memory data to create_table
feat(pyspark): enable passing in-memory data to create_table
feat(pandas,dask): allow passing in-memory data to create_table

cpcloud

Nitiest of nits. LGTM overall!

ibis/backends/datafusion/__init__.py

kszucs · 2024-05-25T18:57:13Z

ibis/backends/datafusion/__init__.py

+
+
+@lazy_singledispatch
+def _read_in_memory(


Ideally this could be ibis.memtable()

Yeah, and that would unify the implementations across the backends, too. I'll open a follow-up to make use of the lazy single-dispatching for memtable insertion for the in-process backends.
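
For readers unfamiliar with the idea, here is a minimal stdlib-only sketch of lazy single dispatch: implementations are registered under a type or a dotted type name, so the module that defines the type does not have to be imported when the handler is registered. `LazyDispatcher` is a hypothetical class written for this illustration, not the `lazy_singledispatch` decorator used in the diff, and `fractions.Fraction` merely stands in for an optional dependency's dataframe type.

```python
class LazyDispatcher:
    """Dispatch on the type of the first argument; types may be named by string."""

    def __init__(self, default):
        self._default = default
        self._by_type = {}  # concrete classes -> handler
        self._by_name = {}  # "module.QualName" strings -> handler

    def register(self, target):
        def decorator(func):
            if isinstance(target, str):
                self._by_name[target] = func
            else:
                self._by_type[target] = func
            return func

        return decorator

    def __call__(self, obj, *args, **kwargs):
        # Walk the MRO so subclasses fall back to a handler for a base class.
        for cls in type(obj).__mro__:
            if cls in self._by_type:
                return self._by_type[cls](obj, *args, **kwargs)
            qualname = f"{cls.__module__}.{cls.__qualname__}"
            if qualname in self._by_name:
                # Cache the resolved class so later calls skip the string lookup.
                func = self._by_type[cls] = self._by_name[qualname]
                return func(obj, *args, **kwargs)
        return self._default(obj, *args, **kwargs)


@LazyDispatcher
def read_in_memory(obj, name):
    raise NotImplementedError(f"unsupported in-memory type: {type(obj).__name__}")


# Registered by name only: nothing is imported at registration time.
@read_in_memory.register("fractions.Fraction")
def _(obj, name):
    return ("fraction", name)


@read_in_memory.register(dict)
def _(obj, name):
    return ("dict", name)
```

A call such as `read_in_memory({"a": [1, 2]}, "t")` picks the `dict` handler, while an unregistered type falls through to the default and raises `NotImplementedError`.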

gforsyth · 2024-05-28T12:28:31Z

Question: would folks rather I add polars as an extra to many CI jobs to test polars inputs to create_table or put an importorskip around it?
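
For context on the second option, "importorskip" refers to pytest's `pytest.importorskip`, which skips a test when an optional module is missing. A rough stdlib-only equivalent looks like the sketch below; `import_or_skip` is a hypothetical helper written for illustration, not something in the ibis test suite.

```python
import importlib
import unittest


def import_or_skip(module_name):
    """Return the imported module, or raise unittest.SkipTest if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Signal "skip this test" instead of failing it.
        raise unittest.SkipTest(f"{module_name} is not installed") from None
```

The trade-off is that a skipped test passes silently on jobs without polars, which is why installing polars on at least a few CI jobs still matters for coverage.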

I don't know that we can unregister `_clean_up_tmp_table` for specific tables, so Exasol might throw some atexit errors (which are ignored) at shutdown, because it's attempting to drop tables that have already been dropped (also not sure why Exasol complains about this with `force=True`). Still, I think it's better to not pollute the table-space with copies of memtables for every table we create.
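
A small sketch of the situation, assuming a cleanup hook that is registered once per temp table and cannot easily be unregistered for just one of them: if the drop is idempotent, running it eagerly and then again at shutdown is harmless. SQLite and `DROP TABLE IF EXISTS` stand in for the backend and for `force=True` here; `clean_up_tmp_table` is a hypothetical function, not the ibis method of a similar name.

```python
import atexit
import sqlite3


def clean_up_tmp_table(con, name):
    """Drop `name` if it still exists; a no-op when it was already dropped."""
    con.execute(f'DROP TABLE IF EXISTS "{name}"')


con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "staging_copy" (x)')

# Registered at creation time, the way a temp-table finalizer might be.
atexit.register(clean_up_tmp_table, con, "staging_copy")

# Eager drop right after the real table has been created from the copy.
clean_up_tmp_table(con, "staging_copy")

# What the shutdown hook will do later: the same call again, which must not raise.
clean_up_tmp_table(con, "staging_copy")

remaining = [row[0] for row in con.execute("SELECT name FROM sqlite_master")]
```

On a backend where the second drop does raise, the shutdown hook produces the kind of ignorable atexit noise described above.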

cpcloud · 2024-05-28T21:49:05Z

@gforsyth Can we add polars to one or two backends instead of all of them?

gforsyth · 2024-05-28T22:01:10Z

> @gforsyth Can we add polars to one or two backends instead of all of them?

Yep -- added it explicitly to MySQL, MSSQL, and Oracle. And we already have it available on the DuckDB jobs, and the Postgres torch job (and obviously on the polars jobs) -- seems like reasonably good coverage?

cpcloud approved these changes May 25, 2024

cpcloud added this to the 9.1 milestone May 25, 2024

kszucs reviewed May 26, 2024

gforsyth and others added 22 commits May 28, 2024 11:21

refactor(duckdb): add polars df as option, move test to backend suite

e9e0adc

Signed-off-by: Gil Forsyth <[email protected]>

feat(polars): enable passing in-memory data to create_table

1c6833d

supports pandas, polars, and pyarrow tablelikes

feat(datafusion): enable passing in-memory data to create_table

68a6c3c

feat(datafusion): use info_schema for list_tables

28231bd

This means you can actually select a database with `list_tables`

feat(duckdb): enable passing in-memory data to create_table

0b3ea31

feat(postgres): allow passing in-memory data to create_table

66d2fb7

feat(trino): allow passing in-memory data to create_table

819abd1

feat(mysql): allow passing in-memory data to create_table

13bd22f

feat(mssql): allow passing in-memory data to create_table

7545abf

feat(risingwave): allow passing in-memory data to create_table

7fe4130

feat(sqlite): allow passing in-memory data to create_table

8ddcd9a

feat(clickhouse): enable passing in-memory data to create_table

235c03f

feat(oracle): enable passing in-memory data to create_table

4486097

feat(snowflake): allow passing in-memory data to create_table

59b8b73

feat(pyspark): enable passing in-memory data to create_table

a17e76e

feat(pandas,dask): allow passing in-memory data to create_table

0219d63

chore: apply suggestions

2f9252b

Co-authored-by: Phillip Cloud <[email protected]>

chore(polars,datafusion): remove nascent read_in_memory

f65379d

This branch initially started with my adding `read_in_memory` everywhere before we settled on making this functionality part of `create_table` instead. This hasn't landed in a release, so I'm removing it.

feat(bigquery): allow passing in-memory data to create_table

e9fef1c

feat(impala): allow passing in-memory data to create_table

fe6b114

test(create_table): create initial memtable on backend being tested

a1d9227

gforsyth force-pushed the ibis-create-table-in-memory branch from a5d34a3 to 1bfbdb8 Compare May 28, 2024 15:25

gforsyth marked this pull request as ready for review May 28, 2024 15:26

test(create_table): use lambdas for all inputs

4e45c10

gforsyth force-pushed the ibis-create-table-in-memory branch from 1bfbdb8 to 4e45c10 Compare May 28, 2024 16:08

gforsyth mentioned this pull request May 28, 2024

fix(ddl): use column names, not position, for insertion order #9264

Merged

ncclementi mentioned this pull request May 28, 2024

refactor: deprecate register api #8863

Merged

chore(ci): add polars to mysql, mssql, and oracle

9a3a698

And we also have it installed already in the postgres torch build and DuckDB.

gforsyth force-pushed the ibis-create-table-in-memory branch from 1c5abe5 to 9a3a698 Compare May 28, 2024 22:00

cpcloud approved these changes May 28, 2024

gforsyth merged commit fa15c7d into ibis-project:main May 29, 2024
74 checks passed

gforsyth deleted the ibis-create-table-in-memory branch May 29, 2024 01:06

gforsyth mentioned this pull request May 29, 2024

fix(bigquery): only register memtable if obj is not None #9268

Merged

csubhodeep mentioned this pull request Jun 23, 2024

bug: Error writing a table expression to a create (or replace existing) table when using read_parquet #9432

Closed

1 task
