
INSERT INTO support for MemTable #5520

Merged · 5 commits · Mar 14, 2023
Conversation

metesynnada (Contributor):

Which issue does this PR close?

Improves #5130.

Rationale for this change

"INSERT INTO abc SELECT * FROM xyz" support for MemTable.

What changes are included in this PR?

  • insert_into API for the TableProvider trait.
  • insert_into implementation for the MemTable
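The shape of the new API can be sketched in simplified form. This is a hypothetical, dependency-free mirror of the idea (the real trait method is async and works with Arrow RecordBatches; `TableProviderSketch`, `ProviderError`, and the `Vec<Vec<i64>>` row type are invented for illustration): a trait method with a default "not implemented" error that concrete providers such as MemTable override.

```rust
// Invented stand-ins for the real DataFusion types, to show the
// default-method pattern the PR adds to the TableProvider trait.
#[derive(Debug, PartialEq)]
enum ProviderError {
    NotImplemented(String),
}

trait TableProviderSketch {
    // Default implementation mirrors the PR's fallback behavior:
    // providers that don't support insertion return an error.
    fn insert_into(&self, _rows: Vec<Vec<i64>>) -> Result<(), ProviderError> {
        Err(ProviderError::NotImplemented(
            "Insertion not implemented for this table".to_owned(),
        ))
    }
}

// A provider that does not override insert_into keeps the error.
struct ReadOnlyTable;
impl TableProviderSketch for ReadOnlyTable {}

// A MemTable-like provider overrides it and appends the new batches.
struct MemTableSketch {
    batches: std::cell::RefCell<Vec<Vec<i64>>>,
}
impl TableProviderSketch for MemTableSketch {
    fn insert_into(&self, rows: Vec<Vec<i64>>) -> Result<(), ProviderError> {
        self.batches.borrow_mut().extend(rows);
        Ok(())
    }
}
```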

If the created plan's partition count differs from the table's, we handle it as follows:

// Get the number of partitions in the plan and the table.
let plan_partition_count = plan.output_partitioning().partition_count();
let table_partition_count = self.batches.read().await.len();

// Adjust the plan as necessary to match the number of partitions in the table.
let plan: Arc<dyn ExecutionPlan> =
    if plan_partition_count == table_partition_count {
        plan
    } else if table_partition_count == 1 {
        // If the table has only one partition, coalesce the partitions in the plan.
        Arc::new(CoalescePartitionsExec::new(plan))
    } else {
        // Otherwise, repartition the plan using a round-robin partitioning scheme.
        Arc::new(RepartitionExec::try_new(
            plan,
            Partitioning::RoundRobinBatch(table_partition_count),
        )?)
    };
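The branching above can be distilled into a dependency-free decision function. This is a sketch for illustration only — `PartitionFixup` and `plan_partition_fixup` are invented names, not DataFusion API:

```rust
// Given the plan's and the table's partition counts, decide whether to
// keep the plan as-is, coalesce it to a single partition, or
// round-robin repartition it to match the table.
#[derive(Debug, PartialEq)]
enum PartitionFixup {
    KeepAsIs,
    CoalesceToOne,
    RoundRobin(usize),
}

fn plan_partition_fixup(plan_parts: usize, table_parts: usize) -> PartitionFixup {
    if plan_parts == table_parts {
        PartitionFixup::KeepAsIs
    } else if table_parts == 1 {
        // One table partition: merge all plan partitions into it.
        PartitionFixup::CoalesceToOne
    } else {
        // Otherwise spread batches round-robin over the table partitions.
        PartitionFixup::RoundRobin(table_parts)
    }
}
```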

Are these changes tested?

Yes

Are there any user-facing changes?

The INSERT INTO ... SELECT query is now supported for MemTable.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Mar 9, 2023
@metesynnada metesynnada changed the title Feature/insert into memory table INSERT INTO support for MemTable Mar 9, 2023
@jackwener jackwener self-requested a review March 9, 2023 15:37
jackwener (Member):

I plan to review this PR tomorrow, thank you @metesynnada

_state: &SessionState,
_input: &LogicalPlan,
) -> Result<()> {
let msg = "Insertion not implemented for this table".to_owned();
Contributor:

If the table has a name, it is likely better to return it to the user in the error message instead of this.

Contributor:
I don't think TableProvider has a way to fetch its own name yet 🤔

comphead (Contributor), Mar 9, 2023:

IMHO, TableProvider should likely have its own name... I can investigate the purpose of having a name.

Contributor:
I think the name for a particular table provider is currently managed by its containing schema - this allows, for example, the same provider to be registered as different table names.

// Create a physical plan from the logical plan.
let plan = state.create_physical_plan(input).await?;

// Check that the schema of the plan matches the schema of this table.
Contributor:
could you please elaborate why exactly this check is needed?

Contributor:
What would the alternate behavior be? Pad missing columns with nulls?

metesynnada (Contributor, Author), Mar 9, 2023:
The check was done with a defensive approach in mind. I am not sure how the two schemas would be different.

Contributor:
I think whatever is creating the plan should ensure that the incoming rows match the record batch (as in I prefer this error approach).
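The defensive check under discussion amounts to an exact comparison of the plan's schema against the table's, rejecting the insert on any mismatch. A minimal, dependency-free sketch (`FieldSketch` and `check_insert_schema` are invented stand-ins for Arrow's Field/Schema types, not the real API):

```rust
// Each field carries a name, a type, and nullability; the check
// requires all three to match for every column, in order.
#[derive(Debug, PartialEq)]
struct FieldSketch {
    name: String,
    data_type: String,
    nullable: bool,
}

fn check_insert_schema(
    table: &[FieldSketch],
    plan: &[FieldSketch],
) -> Result<(), String> {
    if table == plan {
        Ok(())
    } else {
        // Mirrors the error-on-mismatch approach preferred above,
        // rather than padding missing columns with nulls.
        Err("inserting query must have the same schema as the table".to_string())
    }
}
```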

alamb (Contributor) left a comment:
Thank you @metesynnada -- this is very exciting!

Can you please update the sqllogictest runner https://github.com/apache/arrow-datafusion/blob/bd645279874df8e70f1cd891ecea89577733b788/datafusion/core/tests/sqllogictests/src/engines/datafusion/insert.rs#L34-L88 to use this new API (I think maybe all that is needed is to remove the special case code)

Not only will that reduce code duplication, it will increase the test coverage (with existing tests that use the sqllogictest runner)

I am very excited to see this!

datafusion/core/src/datasource/memory.rs
Comment on lines 507 to 510
let table_scan = Arc::new(
LogicalPlanBuilder::scan("source", multi_partition_provider, None)?
.build()?,
);
Contributor:
Similarly, I think you can refactor this into a function; it would make the tests easier to follow:

let table_scan = create_table_scan(multi_partition_provider);

Contributor (Author):
Addressed 👍🏻

@@ -2714,6 +2728,46 @@ mod tests {
Ok(())
}

#[tokio::test]
async fn sql_table_insert() -> Result<()> {
Contributor:
Would it be possible to write this as a sqllogictest https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests (after updating it to use the new codepath)?

Contributor (Author):
Addressed 👍🏻

datafusion/expr/src/logical_plan/builder.rs
/// Convert a logical plan into a builder with a [DmlStatement]
pub fn insert_into(
input: LogicalPlan,
table_name: impl Into<String>,
Contributor:
I think this will (logically) conflict with changes @Jefffrey is making in #5343

metesynnada (Contributor, Author), Mar 9, 2023:

I don't have complete knowledge of that PR at the moment, but I'll look into it. In any case, it's not a critical inclusion, so it can be removed if necessary.

Contributor:

I think it would just involve changing from:

table_name: impl Into<String>

to:

table_name: impl Into<OwnedTableReference>

and use that directly if #5343 gets merged first, since there will be a From<String> implementation for OwnedTableReference anyway.
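The suggested signature works because `impl Into<OwnedTableReference>` subsumes the existing `impl Into<String>` callers once a `From<String>` impl exists. A minimal sketch (`OwnedTableReference` here is an invented stand-in struct, not the real DataFusion type):

```rust
// Stand-in for the real OwnedTableReference type.
#[derive(Debug, PartialEq)]
struct OwnedTableReference {
    table: String,
}

// With this impl, any caller passing a String still compiles after
// the signature change.
impl From<String> for OwnedTableReference {
    fn from(table: String) -> Self {
        OwnedTableReference { table }
    }
}

// Sketch of the builder method taking the wider bound.
fn insert_into(table_name: impl Into<OwnedTableReference>) -> OwnedTableReference {
    table_name.into()
}
```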

Contributor (Author):

Addressed 👍🏻


// Execute the plan and collect the results into batches.
let mut tasks = vec![];
for idx in 0..table_partition_count {
Contributor:
This will try to run all the partitions in parallel, which is probably fine (though maybe we want to limit it to target_partitions 🤔).

I don't think this is important to fix now
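The "one task per partition, all running at once" pattern being discussed can be sketched with plain threads. This is an illustration only — it uses `std::thread` in place of the async tasks the PR actually spawns, and `collect_all_partitions` plus the doubling "work" are invented:

```rust
use std::thread;

// Spawn one task per partition; all run in parallel (which is what
// the reviewer notes may exceed target_partitions), then gather the
// results back in partition order.
fn collect_all_partitions(partitions: Vec<Vec<i64>>) -> Vec<Vec<i64>> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|part| {
            thread::spawn(move || {
                // Stand-in for "execute the plan for this partition
                // and collect its batches".
                part.into_iter().map(|v| v * 2).collect::<Vec<i64>>()
            })
        })
        .collect();

    // Joining in order preserves the partition ordering.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```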

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 13, 2023
metesynnada (Contributor, Author) commented Mar 13, 2023:

I made changes to the code based on the feedback received. While testing, I discovered a problem with how a table is created using a DDL query such as:

CREATE TABLE table_without_values(field1 BIGINT, field2 BIGINT);

By default this creates the table with non-nullable columns, which contradicts PostgreSQL 15:

NULL
The column is allowed to contain null values. This is the default.
This clause is only provided for compatibility with non-standard SQL databases. Its use is discouraged in new applications.

To fix this, I changed the DDL queries to explicitly specify that the columns can contain null values, like this: CREATE TABLE table_without_values(field1 BIGINT NULL, field2 BIGINT NULL);. This was necessary because the AS VALUES ... statement used to insert data into the table created a plan that assumed a nullable schema, which is not compatible with a schema that has non-nullable fields.

PS: The code that produces the non-nullable default:
https://github.com/apache/arrow-datafusion/blob/1a22f9fd436c9892566b668e535e0d6c8cb9fbd3/datafusion/sql/src/planner.rs#L127-L144

@metesynnada metesynnada requested review from Jefffrey, comphead and alamb and removed request for jackwener, Jefffrey and comphead March 13, 2023 11:57
alamb (Contributor) left a comment:

Looks great to me -- thank you @metesynnada

@@ -1,93 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Contributor:
🎉

alamb (Contributor) commented Mar 13, 2023:

The issue was that the table was made with non-nullable columns by default, which contradicts PostgreSQL 15.

I agree -- I have been hit by this too -- filed #5575 to track

@alamb alamb merged commit ff55eff into apache:main Mar 14, 2023
ursabot commented Mar 14, 2023:

Benchmark runs are scheduled for baseline = 3df1cb3 and contender = ff55eff. ff55eff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@andygrove andygrove added the enhancement New feature or request label Mar 24, 2023