
INSERT INTO does not allow to insert only specified columns #8091

Closed
karlovnv opened this issue Nov 8, 2023 · 5 comments · Fixed by #8146
Labels
bug Something isn't working

Comments


karlovnv commented Nov 8, 2023

Describe the bug

Most SQL databases (PG, MSSQL, ClickHouse, ...) allow INSERTs without providing all of the columns.

This is allowed by the insert into TABLE(columns) VALUES (values) syntax, where nullable columns can be omitted.

This is important for upgrading table schemas in systems with zero downtime: in the first step we add the new column, and in the second we deploy new application code that works with the new schema.

Currently it is impossible to add a column to the schema without downtime, due to a logical planner error:
Error during planning: Inserting query must have the same schema with the table.

To Reproduce

create table foo(x int, y int);

insert into foo(x) values (10);

Expected:
(10, NULL) is inserted

Actual:

DataFusion CLI v32.0.0
❯ create table foo(x int, y int);
0 rows in set. Query took 0.002 seconds.

❯ insert into foo(x) values (10);
Error during planning: Inserting query must have the same schema with the table.

Expected behavior

row (10, NULL) is inserted

Additional context

Minimal repro without CLI:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    let queries = [
        "create table foo(x int, y int);",
        "insert into foo(x) values (10);",
        "insert into foo(x, y) values (10, 11);",
    ];
    for query in queries {
        let df = ctx.sql(query).await?;
        println!("> {query}");
        if let Err(err) = df.show().await {
            println!("{}", err);
        }
    }

    Ok(())
}


> create table foo(x int, y int);
++
++
> insert into foo(x) values (10);
Error during planning: Inserting query must have the same schema with the table.
> insert into foo(x, y) values (10, 11);
+-------+
| count |
+-------+
| 1     |
+-------+

karlovnv added the bug label Nov 8, 2023
jonahgao (Member) commented Nov 9, 2023

I plan to support this.

I think we can begin by supporting NULL as the default value, and later on, allow specifying default values for columns.
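That two-step plan (NULL first, declared defaults later) can be pictured with a minimal std-only Rust sketch. The types here are hypothetical stand-ins, not DataFusion's actual API; `None` models SQL NULL:

```rust
use std::collections::HashMap;

/// Resolve a column that the INSERT did not name: use its declared
/// default when one exists (the later step of the plan), otherwise
/// fall back to NULL (the first step).
pub fn resolve_missing_column(
    column: &str,
    defaults: &HashMap<String, i64>,
) -> Option<i64> {
    defaults.get(column).copied()
}
```

With no entry in `defaults`, the function yields `None`, i.e. the NULL-only behavior of the first version; adding an entry later upgrades that column to a real default without changing the call sites.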

karlovnv (Author) commented Nov 9, 2023

I plan to support this.

I think we can begin by supporting NULL as the default value, and later on, allow specifying default values for columns.

Now I'm considering several workarounds:

  1. Write a custom user-defined LogicalPlan rewriter to add the skipped columns.
    The problem is that we have wide tables (1000+ columns), and I think it might be slow.
  2. Implement something like a UDF placeholder that is replaced by custom logic.
  3. Use a RecordBatch (consisting of the rows to insert) as a parameter and provide it via a user-defined VarProvider; unfortunately, VarProvider can only return a ScalarValue (not a RecordBatch).

devinjdangelo (Contributor) commented:
One simple approach would be to update the check here: https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/core/src/datasource/listing/table.rs#L827-L835 to be more sophisticated. You can see that it fails immediately if the schemas have different lengths: https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/common/src/dfschema.rs#L398-L403

That check could be updated to fail only if the input schema is missing a field of the table which is not nullable. That way, the insert plan will happily continue, writing out files which are missing those nullable columns.
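A std-only sketch of that relaxed check, using hypothetical stand-in types (the real code would operate on DFSchema / arrow fields rather than these structs):

```rust
// Minimal stand-in for a schema field; hypothetical, for illustration only.
#[derive(Clone, Debug)]
pub struct ColumnDef {
    pub name: String,
    pub nullable: bool,
}

/// Succeeds when every table column absent from the insert input is
/// nullable; fails on the first missing non-nullable column.
pub fn check_insert_compatible(
    table: &[ColumnDef],
    input: &[ColumnDef],
) -> Result<(), String> {
    for col in table {
        let provided = input.iter().any(|f| f.name == col.name);
        if !provided && !col.nullable {
            return Err(format!(
                "column '{}' is non-nullable and missing from the insert",
                col.name
            ));
        }
    }
    Ok(())
}
```

Note this only compares by name and nullability; the real check would also need to verify data types of the columns that are present.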

A longer-term solution would have a more explicit way to fill in missing nullable columns or columns with default values. This could be a dedicated execution plan that runs right before the insertion (filling in NULLs and defaults), or the behavior could be added directly to the existing FileSinkExec plan.

oleggator commented Nov 9, 2023

One more check is performed in FileSinkExec, so fixing the check in the TableProvider is not enough.

jonahgao (Member) commented Nov 9, 2023

Now I'm considering several workarounds:

  1. Write a custom user-defined LogicalPlan rewriter to add the skipped columns.
    The problem is that we have wide tables (1000+ columns), and I think it might be slow.

I think the first version of the implementation could adopt an approach similar to the first workaround, since it is intuitive and should be applicable to different sinks.

We can fill missing values with NULLs inside insert_to_plan.
https://github.com/apache/arrow-datafusion/blob/4512805c2087d1a5538afdaba9d2e2ca5347c90c/datafusion/sql/src/statement.rs#L1067

For scenarios with a large number of columns, maybe we can do some profiling later.
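The NULL-filling itself can be pictured as a projection from the provided columns onto the table's column order. Below is a std-only sketch with hypothetical types; the real insert_to_plan would build NULL literal expressions with the appropriate casts rather than `Option<i64>` values:

```rust
/// For each table column, in table order, take the provided value when
/// the INSERT named that column, otherwise fill with None (SQL NULL).
pub fn project_with_nulls(
    table_columns: &[&str],
    provided: &[(&str, i64)],
) -> Vec<Option<i64>> {
    table_columns
        .iter()
        .map(|col| {
            provided
                .iter()
                .find(|(name, _)| name == col)
                .map(|(_, value)| *value)
        })
        .collect()
}
```

The linear scan per column is quadratic in the worst case; for the wide-table (1000+ column) scenario mentioned above, a name-to-index map would make this a single pass, which is the kind of thing the suggested profiling would reveal.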
