
INSERT INTO does not allow to insert only specified columns #8091

Closed
karlovnv opened this issue Nov 8, 2023 · 5 comments · Fixed by #8146
Labels
bug Something isn't working

Comments


karlovnv commented Nov 8, 2023

Describe the bug

Most SQL databases (PG, MSSQL, ClickHouse, ...) allow INSERTs without providing all of the columns.

This is allowed by the insert into TABLE(columns) VALUES (values) syntax, where nullable columns can be omitted.

This is important for upgrading table schemas in systems with zero downtime: in the first step we add the new column, and in the second we deploy new application code that works with the new schema.

Currently it is impossible to add a column to the schema without downtime, due to a logical planner error:
Error during planning: Inserting query must have the same schema with the table.

To Reproduce

create table foo(x int, y int);

insert into foo(x) values (10);

Expected:
(10, NULL) is inserted

Actual:

DataFusion CLI v32.0.0
❯ create table foo(x int, y int);
0 rows in set. Query took 0.002 seconds.

❯ insert into foo(x) values (10);
Error during planning: Inserting query must have the same schema with the table.

Expected behavior

row (10, NULL) is inserted

Additional context

Minimal repro without CLI:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    let queries = [
        "create table foo(x int, y int);",
        "insert into foo(x) values (10);",
        "insert into foo(x, y) values (10, 11);",
    ];
    for query in queries {
        let df = ctx.sql(query).await?;
        println!("> {query}");
        if let Err(err) = df.show().await {
            println!("{}", err);
        }
    }

    Ok(())
}


> create table foo(x int, y int);
++
++
> insert into foo(x) values (10);
Error during planning: Inserting query must have the same schema with the table.
> insert into foo(x, y) values (10, 11);
+-------+
| count |
+-------+
| 1     |
+-------+

karlovnv added the bug label Nov 8, 2023
jonahgao (Member) commented Nov 9, 2023

I plan to support this.

I think we can begin by supporting NULL as the default value, and later on, allow specifying default values for columns.
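That two-step plan (NULL first, declared defaults later) can be pictured with a minimal std-only Rust sketch. The types here are hypothetical stand-ins, not DataFusion's actual API; `None` models SQL NULL:

```rust
use std::collections::HashMap;

/// Resolve a column that the INSERT did not name: use its declared
/// default when one exists (the later step of the plan), otherwise
/// fall back to NULL (the first step).
pub fn resolve_missing_column(
    column: &str,
    defaults: &HashMap<String, i64>,
) -> Option<i64> {
    defaults.get(column).copied()
}
```

With no entry in `defaults`, the function yields `None`, i.e. the NULL-only behavior of the first version; adding an entry later upgrades that column to a real default without changing the call sites.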

karlovnv (Author) commented Nov 9, 2023

I plan to support this.

I think we can begin by supporting NULL as the default value, and later on, allow specifying default values for columns.

Now I'm considering several workarounds:

  1. Write a custom user-defined LogicalPlan rewriter to add the skipped columns.
    The problem is that we have wide tables (1000+ columns), and I think it might be slow.
  2. Implement something like a UDF placeholder that is replaced by custom logic.
  3. Use a RecordBatch (consisting of the rows to insert) as a parameter and provide it via a user-defined VarProvider; unfortunately, VarProvider can only return a ScalarValue (not a RecordBatch).

devinjdangelo (Contributor) commented:
One simple approach would be to update the check here: https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/core/src/datasource/listing/table.rs#L827-L835 to be more sophisticated. You can see that it fails immediately if the schemas have different lengths: https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/common/src/dfschema.rs#L398-L403

That check could be updated to fail only if the input schema is missing a field of the table which is not nullable. That way, the insert plan will happily continue, writing out files which are missing those nullable columns.
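A std-only sketch of that relaxed check, using hypothetical stand-in types (the real code would operate on DFSchema / arrow fields rather than these structs):

```rust
// Minimal stand-in for a schema field; hypothetical, for illustration only.
#[derive(Clone, Debug)]
pub struct ColumnDef {
    pub name: String,
    pub nullable: bool,
}

/// Succeeds when every table column absent from the insert input is
/// nullable; fails on the first missing non-nullable column.
pub fn check_insert_compatible(
    table: &[ColumnDef],
    input: &[ColumnDef],
) -> Result<(), String> {
    for col in table {
        let provided = input.iter().any(|f| f.name == col.name);
        if !provided && !col.nullable {
            return Err(format!(
                "column '{}' is non-nullable and missing from the insert",
                col.name
            ));
        }
    }
    Ok(())
}
```

Note this only compares by name and nullability; the real check would also need to verify data types of the columns that are present.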

A longer-term solution would have a more explicit way to fill in missing nullable columns or columns with default values. This could be a dedicated execution plan that runs right before the insertion (filling in NULLs and defaults), or the behavior could be added directly to the existing FileSinkExec plan.

oleggator commented Nov 9, 2023

One more check is performed in FileSinkExec, so fixing the check in the TableProvider is not enough.

jonahgao (Member) commented Nov 9, 2023

Now I'm considering several workarounds:

  1. Write a custom user-defined LogicalPlan rewriter to add the skipped columns.
    The problem is that we have wide tables (1000+ columns), and I think it might be slow.

I think the first version of the implementation could adopt an approach similar to the first workaround, since it is intuitive and should be applicable to different sinks.

We can fill missing values with NULLs inside insert_to_plan.
https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/sql/src/statement.rs#L1067

For scenarios with a large number of columns, maybe we can do some profiling later.
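The NULL-filling itself can be pictured as a projection from the provided columns onto the table's column order. Below is a std-only sketch with hypothetical types; the real insert_to_plan would build NULL literal expressions with the appropriate casts rather than `Option<i64>` values:

```rust
/// For each table column, in table order, take the provided value when
/// the INSERT named that column, otherwise fill with None (SQL NULL).
pub fn project_with_nulls(
    table_columns: &[&str],
    provided: &[(&str, i64)],
) -> Vec<Option<i64>> {
    table_columns
        .iter()
        .map(|col| {
            provided
                .iter()
                .find(|(name, _)| name == col)
                .map(|(_, value)| *value)
        })
        .collect()
}
```

The linear scan per column is quadratic in the worst case; for the wide-table (1000+ column) scenario mentioned above, a name-to-index map would make this a single pass, which is the kind of thing the suggested profiling would reveal.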
