feat: delete operation #1176
Conversation
I want to continue to push this forward to have an MVP for deletion. My biggest issue with the current implementation is the Requires further changes when #1303 is merged. Looking forward to the great feedback as always. 😄
This is great work @Blajda!
Left some comments, mostly around some Option treatment. Also had a thought on how we can achieve the parallelism you mentioned when scanning files, but not sure yet if that will actually work :)
rust/src/operations/delete.rs
Outdated
let mut table = DeltaTable::new_with_state(this.store, this.snapshot);
table.update().await?;
Here we can probably avoid some I/O by returning the actions and version from execute and updating the new state directly. Essentially something like:
this.snapshot.merge(DeltaTableState::from_actions(actions, version)?, true, true);
I implemented this and I imagine we would want to use this pattern for other operations. Should we return the version and actions in an optional? Currently I return an empty list + the current version if there are no changes.
rust/src/operations/delete.rs
Outdated
    &execution_props,
)?;
let filter = Arc::new(FilterExec::try_new(predicate_expr, parquet_scan)?);
let limit: Arc<dyn ExecutionPlan> = Arc::new(GlobalLimitExec::new(filter, 0, Some(1)));
Haven't fully thought this through, but maybe one way for us to exploit parallelism is to create the ParquetScan in a way that each file group contains exactly one file and use LocalLimitExec. Looked a little bit through the DataFusion code, and I think the order would be preserved, so we can infer from the partition number which file was read.
If that cannot work and we have to add the file name as a column, that should be feasible by treating it as a "virtual partition column". However this may require some more updates and special cases in how we create the schema / statistics in various places. Including the file name as a partition value in PartitionedFile is straightforward though.
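A rough sketch of the PartitionedFile part of this idea; the helper name is hypothetical, and the struct fields assume the DataFusion / object_store versions in use around this PR (newer object_store releases add fields such as e_tag to ObjectMeta):

```rust
use chrono::Utc;
use datafusion::datasource::listing::PartitionedFile;
use datafusion::scalar::ScalarValue;
use object_store::{path::Path, ObjectMeta};

// Hypothetical helper: attach the file's own path as an extra partition value
// so rows in the scan output can be traced back to the file they came from.
fn partitioned_file_with_path(path: &str, size: usize) -> PartitionedFile {
    PartitionedFile {
        object_meta: ObjectMeta {
            // field set assumed for object_store ~0.5; newer versions add e_tag
            location: Path::from(path),
            last_modified: Utc::now(),
            size,
        },
        // the "virtual partition column" carrying the source file path
        partition_values: vec![ScalarValue::Utf8(Some(path.to_string()))],
        range: None,
        extensions: None,
    }
}

fn main() {
    let file = partitioned_file_with_path("part-00000-abc.snappy.parquet", 1024);
    println!("{:?}", file.partition_values);
}
```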
Thanks. The primary insight that I missed was adding the file path as a partition column. I'll give this a try and report any issues :)
I was able to implement the suggestions. Let me know if you have any suggestions for the new implementation.
I'll take a closer look tomorrow, but provided some initial comments. Thanks for working on this!
rust/src/operations/delete.rs
Outdated
// rewrite phases.
match expr {
    Expr::ScalarVariable(_, _) | Expr::Literal(_) => (),
    Expr::Alias(expr, _) => validate_expr(expr, partition_columns, properties)?,
Perhaps we can avoid recursion using the TreeNode trait methods on Expr?
https://docs.rs/datafusion/23.0.0/datafusion/prelude/enum.Expr.html#impl-TreeNode-for-Expr
Would help avoid stack overflows if there is a very nested expression. I could imagine that happening if someone passes a filter like the example below (a rough sketch of the visitor approach follows it):
x = 1
OR x = 2
OR x = 3
...
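A minimal sketch of that suggestion, assuming DataFusion ~23 module paths; the allow-list is purely illustrative (not the actual checks in validate_expr) and only demonstrates walking the predicate with TreeNode::apply instead of hand-rolled recursion:

```rust
use datafusion::common::tree_node::{TreeNode, VisitRecursion};
use datafusion::error::{DataFusionError, Result};
use datafusion::prelude::{col, lit, Expr};

// Walk every node of the predicate and reject unsupported expression variants.
// The set of allowed variants here is hypothetical and only for illustration.
fn validate_predicate(predicate: &Expr) -> Result<()> {
    predicate.apply(&mut |expr| match expr {
        Expr::Column(_)
        | Expr::Literal(_)
        | Expr::BinaryExpr(_)
        | Expr::Not(_)
        | Expr::IsNull(_) => Ok(VisitRecursion::Continue),
        other => Err(DataFusionError::Plan(format!(
            "unsupported expression in delete predicate: {other}"
        ))),
    })?;
    Ok(())
}

fn main() -> Result<()> {
    // deeply nested OR chains are handled without growing our own call stack
    let predicate = col("x").eq(lit(1)).or(col("x").eq(lit(2)));
    validate_predicate(&predicate)
}
```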
The DataFusion implementation still uses recursion, but overflow issues are now on them. The visitor pattern also makes the code look cleaner!
> The DataFusion implementation still uses recursion but overflow issues are now on them

🤣
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Thanks for the updates. I have a few more suggestions.
#[tokio::test]
async fn test_delete_on_nonpartiton_column() {
    // Delete based on a nonpartition column
How are null values handled? For example, if I have a column `x: [1, 2, null, 4]`, does `DELETE FROM table_name WHERE x > 2` delete just the last row, or also the third row? Seems like it's worth a unit test to verify.
Yeah, great call out.
The previous commit would delete the row. I've added a new test and changed the behavior to not delete the third record in this case.
A record should only be deleted if the predicate evaluates to true; otherwise it is kept. `null > 2` evaluates to UNKNOWN.
I've checked the Spark implementation and the behavior aligns.
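A hypothetical helper illustrating those semantics (not necessarily how this PR implements the rewrite): the rows to keep are the ones where the predicate is false or null.

```rust
use datafusion::prelude::{col, lit, Expr};

// A row is removed only when `predicate` evaluates to TRUE; FALSE and NULL
// (unknown) both keep the row, matching the Spark behaviour noted above.
fn rows_to_keep(predicate: Expr) -> Expr {
    (!predicate.clone()).or(predicate.is_null())
}

fn main() {
    // For DELETE ... WHERE x > 2 on x = [1, 2, null, 4], only the last row is
    // removed; the null row is kept because `null > 2` is unknown.
    let keep = rows_to_keep(col("x").gt(lit(2)));
    println!("{keep}");
}
```

Phrasing the rewrite as "keep when not true" rather than "drop when true" keeps the null handling explicit in the plan.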
rust/src/operations/delete.rs
Outdated
}

// Create a record batch that contains the partition columns plus the path of the file
fn create_partition_record_batch(
Maybe you can re-use this function for now:
delta-rs/rust/src/table_state_arrow.rs, line 51 in e046a77: pub fn add_actions_table(
It handles the different data types for partition values. It has a little more overhead since it also parses out the statistics, but I think that's fine for now. Later on, I expect we'll replace this with expression simplification, which will let us use statistics and remove redundant parts of the predicate.
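A usage sketch of that reuse; the accessor name (get_state), the flatten parameter, and the RecordBatch return type are assumptions about the helper linked above:

```rust
use deltalake::{open_table, DeltaTableError};

// Reuse the existing add-actions-to-Arrow conversion instead of building a
// partition-values RecordBatch by hand.
#[tokio::main]
async fn main() -> Result<(), DeltaTableError> {
    let table = open_table("./data/my_table").await?;
    // flatten = true is assumed to expand nested fields such as partition values
    let adds = table.get_state().add_actions_table(true)?;
    println!("{} add actions, {} columns", adds.num_rows(), adds.num_columns());
    Ok(())
}
```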
Hi @wjones127 @roeap
From my end this looks great - thanks for this excellent contribution @Blajda.
I'll leave it open though for @wjones127 to chime in.
Description
This is a full implementation of the Delta Delete command. Users can now delete records that match a predicate. The predicate is not limited to partition columns; non-partition columns are supported as well.
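A usage sketch, assuming the builder-style API this PR introduces (DeltaOps::delete with a with_predicate option) and the crate's datafusion re-export:

```rust
use deltalake::datafusion::prelude::{col, lit};
use deltalake::operations::DeltaOps;

// Delete every row whose (non-partition) column `value` is greater than 100.
#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    let table = deltalake::open_table("./data/my_table").await?;
    let (table, _metrics) = DeltaOps(table)
        .delete()
        .with_predicate(col("value").gt(lit(100)))
        .await?;
    println!("table is now at version {}", table.version());
    Ok(())
}
```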
This also implements a `find_files` function which can be used to implement the Update command.

Related Issue(s)
Documentation