Insert DML results in incorrect record batch schema (doesn't correctly identify values as non-nullable), leading to errors in future queries that are sensitive to them #7693

Closed
matthewgapp opened this issue Sep 29, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@matthewgapp
Contributor

matthewgapp commented Sep 29, 2023

Describe the bug

Filing on behalf of xhwhis

When inserting values into a table (MemTable) whose schema fields are all non-nullable and then running a window function with a PARTITION BY clause, the following runtime panic occurs:

called `Result::unwrap()` on an `Err` value: ArrowError(InvalidArgumentError("batches[0] schema is different with argument schema.\n            batches[0] schema: Schema { fields: [Field { name: \"id\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"name\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} },\n            argument schema: Schema { fields: [Field { name: \"id\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"name\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }\n            "))
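
The message suggests a strict schema equality check in which the `nullable` flag takes part, so two schemas with identical names and data types still compare as different. As a minimal sketch of that idea, assuming a simplified std-only model (the `Field` struct, `field` helper, and `check_batch_schema` function below are hypothetical stand-ins, not arrow's or DataFusion's actual types or code):

```rust
// Simplified, std-only model of a strict schema check. These types are
// hypothetical stand-ins for illustration, not arrow's `Field`/`Schema`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Convenience constructor for the illustrative `Field`.
fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

// Returns an error unless the batch schema is exactly equal to the expected
// schema, including the `nullable` flag of every field.
fn check_batch_schema(expected: &[Field], batch: &[Field]) -> Result<(), String> {
    if expected == batch {
        Ok(())
    } else {
        Err(format!(
            "batch schema {:?} differs from argument schema {:?}",
            batch, expected
        ))
    }
}

fn main() {
    // Table schema as declared in the report: both fields non-nullable.
    let table = vec![field("id", "Int32", false), field("name", "Utf8", false)];
    // Schema carried by the inserted batches per the error: both nullable.
    let inserted = vec![field("id", "Int32", true), field("name", "Utf8", true)];

    assert!(check_batch_schema(&table, &table).is_ok());
    // Same names and types, but the nullability differs, so the check fails.
    assert!(check_batch_schema(&table, &inserted).is_err());
    println!("{:?}", check_batch_schema(&table, &inserted));
}
```

Under this model, any step that stores batches with a looser nullability than the table declares will fail later, as soon as some operator compares the two schemas exactly.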

To Reproduce

To illustrate, @xhwhis created the minimal reproducible example below. Note that both the INSERT query and the PARTITION BY clause in the window function are required to trigger the error.

use std::sync::Arc;

use datafusion::{
    arrow::{
        datatypes::{DataType, Field, Schema},
        util::pretty::print_batches,
    },
    datasource::MemTable,
    prelude::{SessionConfig, SessionContext},
};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let config = SessionConfig::new()
        .with_create_default_catalog_and_schema(true)
        .with_information_schema(true);

    let ctx = SessionContext::with_config(config);

    ctx.register_table("source_table", Arc::new(create_mem_table()))
        .unwrap();

    let insert_table_query = r#"INSERT INTO source_table VALUES (1, 'Alice'),(2, 'Bob'),(3, 'Charlie'),(4, 'David'), (5, 'Eve')"#;
    let _ = ctx
        .sql(insert_table_query)
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();

    let create_table_query =
        r#"SELECT *, RANK() OVER (PARTITION BY id) AS row_num FROM source_table"#;

    let batches = ctx
        .sql(create_table_query)
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();

    print_batches(&batches).unwrap();
}

fn create_mem_table() -> MemTable {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));

    MemTable::try_new(schema, vec![vec![]]).unwrap()
}

Expected behavior

No panic. The schema set at insert time should match that of the table the values were inserted into.

Additional context

No response

@jonahgao
Member

jonahgao commented Jun 1, 2024

I have verified that this issue has been fixed.

jonahgao closed this as completed Jun 1, 2024