refactor: introduce `LogStore` trait #1706
ACTION NEEDED
delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
This trait is supposed to serve as the entry point to read and write commits in the Delta log.
fcd4dd6 to 2f475e2
From my perspective, we will benefit from having this abstraction, and it lines up with some other refactorings and improvements I am hoping to make, e.g. #1713
I like the idea of getting us in line with Databricks' DynamoDB protocol. Had one question on this interface.
rust/src/logstore/mod.rs (outdated)
    &self,
    version: i64,
    actions: Vec<Action>,
    overwrite: bool,
What does `overwrite` do? Why is it necessary? (I would think overwriting a commit would not be compatible with consistency.)
Great point. I copied that over from `LogStore::write` from the reference implementation, but couldn't trace back any justification for the `overwrite` flag. The design doc linked above has the following to say:
If overwrite=true, then write directly into S3 with no DynamoDB interaction
else
which doesn't help me understand the "why" either. As you correctly point out, `overwrite` would violate the consistency the delta log is supposed to deliver in the first place.
If you don't see a use case / call to that method that would set `overwrite = true`, my proposal would be to drop that argument from `write_commit_entry` and potentially add it when we can justify its existence.
As discussed in the PR, it's not clear what this would be useful for.
closing in favor of #1742
Description
This PR introduces a new trait, `LogStore`, which is meant to encapsulate interaction with the delta commit log, i.e. things in the directory `_delta_log/`. This is half a PR and half a discussion starting point, hence the PR description that is substantially longer than the actual proposed code change :-).

The major goal of this exercise is to align the implementation of multi-cluster writes for Delta Lake on S3 with the one provided by the original `delta` library, enabling multi-cluster writes with some writers using Spark / Delta library and other writers using `delta-rs`.

For an overview of how it's done in delta, please see:

This approach requires readers of a delta table to "recover" unfinished commits from writers. As a result, reading and writing are combined in a single interface, which in this PR is modeled after LogStore.java. Currently in `delta-rs`, the read path for commits is implemented directly in `DeltaTable`, and there's no mechanism to implement storage-specific behavior like interacting with DynamoDB.

In this draft, `LogStore` provides:

- `read_commit_entry(version)` to read a commit entry representing a specific version
- `write_commit_entry(version, actions)` to write a set of actions as a commit entry
- `get_latest_version()` to find the latest version in the delta log

This trait could be extended to cover all interactions with anything inside `_delta_log`, e.g. checkpoints, finding the latest commit for a timestamp, etc. However, this is not necessary to implement the S3 log store, and additional functionality can be integrated incrementally when it makes sense.

Implementation
This PR does not include an actual implementation for `LogStore` or its potential integration into the existing code base. It serves as a basis for discussion and feedback on the proposed changes only.

However, some thoughts on how an implementation would fit in:

A default implementation of `trait LogStore` would be configured with a location and an `ObjectStore` (or `DeltaObjectStore`), and combine functionality that is currently distributed over `src/table/mod.rs` (reading data for individual commits) and `src/writer/mod.rs` (`try_commit_transaction` mostly). It would probably be owned by `DeltaTable`.

The S3 + DynamoDB implementation would be different, utilizing DynamoDB for both read and write operations to enable multi-cluster writes on S3. I believe all other object stores are able to share a single implementation for now, as they sort of do in the current code base anyway.
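As a rough illustration of the default implementation described above, the sketch below pairs a simplified version of the trait with a store configured from a location and a key-value map standing in for an `ObjectStore`. It rests on several assumptions that are not part of the PR: the methods are synchronous, commit bodies are raw bytes instead of `Vec<Action>`, errors are plain strings, and `DefaultLogStore` and `commit_path` are hypothetical names. The only detail taken from the Delta protocol is that commit files live under `_delta_log/` and are named by the zero-padded 20-digit version.

```rust
use std::collections::HashMap;

/// Simplified, synchronous sketch of the proposed trait. The real trait would
/// likely be async and use delta-rs action and error types (assumption).
pub trait LogStore {
    fn read_commit_entry(&self, version: i64) -> Option<Vec<u8>>;
    fn write_commit_entry(&mut self, version: i64, body: Vec<u8>) -> Result<(), String>;
    fn get_latest_version(&self) -> Option<i64>;
}

/// Hypothetical default log store: a table location plus a map that stands in
/// for an object store (path -> object bytes).
pub struct DefaultLogStore {
    location: String,
    objects: HashMap<String, Vec<u8>>,
}

impl DefaultLogStore {
    pub fn new(location: &str) -> Self {
        Self {
            location: location.trim_end_matches('/').to_string(),
            objects: HashMap::new(),
        }
    }

    /// Commit files are named by zero-padded 20-digit version,
    /// e.g. `_delta_log/00000000000000000001.json`.
    fn commit_path(&self, version: i64) -> String {
        format!("{}/_delta_log/{:020}.json", self.location, version)
    }
}

impl LogStore for DefaultLogStore {
    fn read_commit_entry(&self, version: i64) -> Option<Vec<u8>> {
        self.objects.get(&self.commit_path(version)).cloned()
    }

    fn write_commit_entry(&mut self, version: i64, body: Vec<u8>) -> Result<(), String> {
        let path = self.commit_path(version);
        // Never replace an existing commit entry.
        if self.objects.contains_key(&path) {
            return Err(format!("version {} already exists", version));
        }
        self.objects.insert(path, body);
        Ok(())
    }

    fn get_latest_version(&self) -> Option<i64> {
        let prefix = format!("{}/_delta_log/", self.location);
        // Consider only `<version>.json` objects under `_delta_log/`.
        self.objects
            .keys()
            .filter_map(|key| {
                key.strip_prefix(prefix.as_str())?
                    .strip_suffix(".json")?
                    .parse::<i64>()
                    .ok()
            })
            .max()
    }
}
```

An S3 + DynamoDB variant would implement the same three methods but consult DynamoDB on both the read and write paths. That part is deliberately left out here, since its details depend on the design doc rather than on anything stated in this description.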