refactor: move vacuum command to operations module #1045

Merged
merged 10 commits into from
Jan 5, 2023

Conversation

roeap
Collaborator

@roeap roeap commented Dec 31, 2022

Description

This PR moves the vacuum operation into the operations module and adopts IntoFuture for the command builder. This breaks the builder's API (which now has consistent setter names), but we are able to keep the DeltaTable APIs in Rust and Python.

In a follow-up I would like to move the optimize command as well. This, however, may require refactoring PartitionValue, since we can only deal with 'static lifetimes when using IntoFuture. A while back we talked about pulling in ScalarValue from datafusion to optimize that implementation, and maybe that's a good opportunity to look into that as well.
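
To make the IntoFuture pattern concrete, here is a minimal sketch using only the standard library. `DemoVacuumBuilder`, its fields, the 168-hour fallback, and the `block_on` helper are all hypothetical stand-ins and not the delta-rs API; the real builder also carries an object store and a table state. The boxed future is `'static`, so the builder has to own everything it needs, which is the lifetime constraint mentioned above.

```rust
use std::future::{Future, IntoFuture};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a vacuum command builder (not the delta-rs type).
#[derive(Debug, Default)]
pub struct DemoVacuumBuilder {
    retention_hours: Option<u64>,
    dry_run: bool,
}

impl DemoVacuumBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Consistent `with_*` setters that consume and return the builder.
    pub fn with_retention_hours(mut self, hours: u64) -> Self {
        self.retention_hours = Some(hours);
        self
    }

    pub fn with_dry_run(mut self, dry_run: bool) -> Self {
        self.dry_run = dry_run;
        self
    }
}

impl IntoFuture for DemoVacuumBuilder {
    type Output = Result<String, String>;
    // The boxed future is 'static, so it cannot borrow from the caller.
    type IntoFuture = Pin<Box<dyn Future<Output = Self::Output> + Send + 'static>>;

    fn into_future(self) -> Self::IntoFuture {
        Box::pin(async move {
            // Assumed fallback of 168 hours, used here for illustration only.
            let hours = self.retention_hours.unwrap_or(168);
            Ok(format!("retention={}h dry_run={}", hours, self.dry_run))
        })
    }
}

/// Waker that does nothing; enough for futures that are ready when polled.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor so the sketch runs without an async runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Awaiting the builder directly goes through `IntoFuture::into_future`.
pub fn run_demo() -> Result<String, String> {
    block_on(async {
        DemoVacuumBuilder::new()
            .with_retention_hours(169)
            .with_dry_run(true)
            .await
    })
}
```

In real code an async runtime would drive the future; `block_on` exists here only so the example has no external dependencies.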

Related Issue(s)

Documentation

@houqp
Member

houqp commented Jan 1, 2023

I do agree that the builder interface makes the style more consistent and the argument passing less error prone 👍

That said, I think having an extra thin wrapper API on the table struct like below will provide a better developer UX:

        let result = table.vacuum()
            .with_retention_period(Duration::hours(169))
            .with_dry_run(true)
            .await
            .unwrap();

@github-actions github-actions bot added the python label Jan 1, 2023
@roeap
Collaborator Author

roeap commented Jan 1, 2023

wrapper API on the table struct like below will provide a better developer UX

Absolutely, even more so on the Python side of things. This unfortunately led me down a bit of a rabbit hole :D. Essentially, it is again the question of how to most efficiently share the log for commands etc. Right now in the code we just clone that state, but to my understanding this is not the way to go ...

Currently I am exploring a few options, without a clear winner yet. However I do want to avoid making the DeltaTableState mutable when being shared, since this to me defeats the point of a snapshot, which is what we need for conflict resolution etc. Also looking forward to @wjones127's work in #1033 getting merged, since this should give us a nice baseline to finally move to an arrow-based state :). I do hope to make some general progress on this in this PR though, to prepare for adopting it.

Comment on lines 1069 to 1078
let state = std::mem::take(&mut self.state);
let mut plan = VacuumBuilder::new(self.object_store(), state)
    .with_dry_run(dry_run)
    .with_enforce_retention_duration(enforce_retention_duration);
if let Some(hours) = retention_hours {
    plan = plan.with_retention_period(Duration::hours(hours as i64));
}

let (table, metrics) = plan.await?;
self.state = table.state;
Collaborator Author

@roeap roeap Jan 1, 2023

Here I am a little bit concerned about the error handling. If we fail during the execution, we are still left with a DeltaTable with an empty state. If users wanted to recover from a failure and keep using the table, they would probably see unexpected results.

Once we manage the state in arrow, we can likely have near zero-cost clones of the state and mitigate the issue. For now, one way to handle this could be to handle the error internally and load the previous table state again. But I'm not sure what users would expect / want to see.

The same applies on the python side.
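
The failure mode described here can be reproduced in a few lines. This is a sketch with hypothetical types (`DemoState`, `DemoTable`, `run_operation`), not the delta-rs code: `std::mem::take` swaps in the `Default` value before the fallible call, so an early return via `?` leaves the table holding an empty state, whereas passing a clone keeps the previous state on failure at the cost of a copy.

```rust
/// Hypothetical table state; `Default` yields the "empty" state.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct DemoState {
    pub version: i64,
    pub files: Vec<String>,
}

/// Hypothetical table that owns its state.
pub struct DemoTable {
    pub state: DemoState,
}

/// Hypothetical operation that consumes a state and may fail.
fn run_operation(state: DemoState, fail: bool) -> Result<DemoState, String> {
    if fail {
        Err("operation failed".to_string())
    } else {
        Ok(DemoState { version: state.version + 1, ..state })
    }
}

impl DemoTable {
    /// Moves the state out; on error `self.state` stays at its default.
    pub fn op_with_take(&mut self, fail: bool) -> Result<(), String> {
        let state = std::mem::take(&mut self.state);
        self.state = run_operation(state, fail)?;
        Ok(())
    }

    /// Passes a clone; on error `self.state` is untouched.
    pub fn op_with_clone(&mut self, fail: bool) -> Result<(), String> {
        self.state = run_operation(self.state.clone(), fail)?;
        Ok(())
    }
}
```

With `op_with_take`, a failed call leaves `state` equal to `DemoState::default()`; with `op_with_clone`, the version and file list from before the call are still there.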

Collaborator

We could make it immutable and put it behind an Arc, so we can easily clone pointers to it. Not sure how viable that is though.

Collaborator

That would make the mutation operations like merge() and process_action() less efficient.
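
A small sketch of the trade-off discussed in the two comments above, using a hypothetical `SharedState` type rather than DeltaTableState: cloning an `Arc` only copies a pointer, but mutating through `Arc::make_mut` while another handle is alive deep-clones the inner value first.

```rust
use std::sync::Arc;

/// Hypothetical state type used only for this illustration.
#[derive(Debug, Clone, Default)]
pub struct SharedState {
    pub files: Vec<String>,
}

/// Returns (files seen by the snapshot, files seen by the current handle,
/// whether both handles still point at the same allocation).
pub fn share_then_mutate() -> (usize, usize, bool) {
    let mut current = Arc::new(SharedState {
        files: vec!["a.parquet".to_string()],
    });

    // Cheap: only the reference count changes.
    let snapshot = Arc::clone(&current);

    // Because `snapshot` is still alive, `make_mut` clones the inner value
    // before handing out a mutable reference (copy-on-write).
    Arc::make_mut(&mut current).files.push("b.parquet".to_string());

    (
        snapshot.files.len(),
        current.files.len(),
        Arc::ptr_eq(&snapshot, &current),
    )
}
```

The snapshot keeps its original contents, which is the desired snapshot semantics, while every mutation of a shared handle pays for a full clone.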

Collaborator

Can we clone the state for now, and as a follow up let's figure out how to do this more efficiently?

@@ -330,6 +345,39 @@ impl DeltaTableState {

Ok(())
}

/// Obtain Add actions for files that match the filter
pub fn get_active_add_actions_by_partitions<'a>(
Collaborator Author

Moving this function to the state was part of an experiment. I did not move it back, since it works exclusively on the state and, from what I can see, we are going to need functionality like this as we evolve the operations capabilities.

@roeap roeap marked this pull request as ready for review January 1, 2023 11:30
Collaborator

@wjones127 wjones127 left a comment

This is a nice improvement.

However I do want to avoid making the DeltaTableState mutable when being shared

I'm thinking maybe the builders need to take &mut DeltaTable or &mut DeltaTableState. Then the borrow checker will enforce this. DeltaTableState is clone-able, so that means users can always clone themselves if they don't want to deal with the lifetimes.

rust/src/operations/vacuum.rs
let expired_tombstones = get_stale_files(&self.snapshot, retention_period, now_millis);
let valid_files = self.snapshot.file_paths_iter().collect::<HashSet<Path>>();

let mut files_to_delete = vec![];
Collaborator

It might be nice to eventually make this an iterator, so we don't have to wait to resolve the entire file list to start deleting.
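
As a rough illustration of the iterator idea, the sketch below filters a listing lazily instead of collecting into a `Vec` first. It is synchronous and uses plain `String` paths and a hypothetical `files_to_delete` helper; the actual code works with object store paths and async listing, so treat this only as the shape of the approach.

```rust
use std::collections::HashSet;

/// Lazily yields every listed path that is not in the set of valid files.
/// Nothing is evaluated until the caller pulls items from the iterator.
pub fn files_to_delete<'a, I>(
    listed: I,
    valid_files: &'a HashSet<String>,
) -> impl Iterator<Item = String> + 'a
where
    I: Iterator<Item = String> + 'a,
{
    listed.filter(move |path| !valid_files.contains(path))
}
```

A caller can start deleting as soon as the first item is produced, for example by looping over `files_to_delete(listing, &valid)` and issuing a delete per item.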

.ok_or(DeltaTableError::NoMetadata)?
.partition_columns
.iter()
.any(|partition_column| path_name.starts_with(partition_column)))
Collaborator

I wonder if we need this line. I worry it will accidentally exclude a directory that shouldn't be there. And the partition directories should already be handled by the tombstones, right? Plus, Hive-partitioned directory structure isn't guaranteed.
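
The concern can be illustrated in isolation. The helper below is hypothetical and only mirrors the `starts_with` check from the snippet above; it is not the actual delta-rs function. A plain prefix match also accepts names that merely begin with a partition column name.

```rust
/// Returns true if `path_name` starts with any partition column name.
/// This mirrors a bare prefix check, with no `=` separator required.
pub fn matches_partition_prefix(path_name: &str, partition_columns: &[&str]) -> bool {
    partition_columns
        .iter()
        .any(|column| path_name.starts_with(*column))
}
```

With a partition column `_date`, both `_date=2022-07-03` and an unrelated directory such as `_date_scratch` satisfy the check, so the second would be treated like a partition directory.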

Collaborator

I feel like we should take this out and see if the _date=2022-07-03/delete_me.parquet test case passes.

Collaborator Author

Just deleting the line unfortunately led to failing tests, but I did not dive deeper. My general feeling is we have to revisit the logic anyhow, since we are not yet cleaning up "associated" files. I can look further into this here, but would prefer to do so in a follow-up :) - maybe when doing #688.

Collaborator

Okay. Thanks for looking into it.

wjones127
wjones127 previously approved these changes Jan 5, 2023
Collaborator

@wjones127 wjones127 left a comment

LGTM. Though if you want, hiding that one function would be good.

@roeap
Collaborator Author

roeap commented Jan 5, 2023

@wjones127 - applied your suggestions! Could you re-approve? :)

@roeap roeap merged commit 50ea9f5 into delta-io:main Jan 5, 2023
@roeap roeap deleted the move-operations branch January 5, 2023 15:19
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description

Moving the `vacuum` operation into the operations module and adopting
`IntoFuture` for the command builder. This is breaking the APIs for the
builder (now with consistent setter names) but we are able to keep the
APIs for `DeltaTable` in rust and python.

In a follow-up I would like to move the optimize command as well. This,
however, may require refactoring `PartitionValue`, since we can only
deal with `'static` lifetimes when using `IntoFuture`. A while back we
talked about pulling in `ScalarValue` from datafusion to optimize that
implementation, and maybe that's a good opportunity to look into that as
well.

# Related Issue(s)
<!---
For example:

- closes delta-io#106
--->

# Documentation

<!---
Share links to useful documentation
--->

Co-authored-by: Will Jones <[email protected]>