Implement ReadWrite blockstore resumption #147
* Implement an option for the read-write blockstore that, if enabled, lets writing resume from where the writer left off. For resumption to work, the `WithResumption` option needs to be set explicitly. Otherwise, if a path to an existing file is passed, the blockstore construction will return an error. Resumption requires the roots passed to the constructor, as well as the padding options, to be identical to those in the file. Resumption only works on paths where at least the V2 pragma and the CAR v1 header were successfully written to the file. Otherwise an error is returned.
* Implement a resumption test that verifies that resumed files match the expected header, data and index.
* Implement a CAR v1 equals function to check if two given headers are identical. This implementation requires exact ordering of root elements. A TODO is left to relax the exact ordering requirement.
* Implement Seeker in the internal offset writer in order to forward the offset of the CAR v1 writer within a resumed read-write blockstore after resumption. The offset of the writer needs to be set to the latest written frame in order for consecutive writes to be at the right offset.
* Fix a bug in the offset read seeker in internal IO, where the seek offset and the returned position were not normalized by the base. Reflect the fix in the read-only blockstore's AllKeysChan, where the reader had been twisted to work. We now read the header to get its size, then seek past it and iterate over blocks to populate the channel.
* Add TODOs in places to make treating zero-length frames as EOF optional; see #140 for context.
* Run `gofumpt -l -w .` on everything to maintain consistent formatting.
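The offset read seeker fix described above can be illustrated with a minimal sketch. This is not the code from the PR: `offsetReadSeeker`, `newOffsetReadSeeker` and `readPayload` are hypothetical names, and the wrapper only assumes an `io.ReaderAt` underneath. The point it shows is that both the offset passed to `Seek` and the position it returns are relative to the base, never absolute file offsets.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// offsetReadSeeker is a hypothetical wrapper: it presents the bytes of r from
// base onwards as a stream that starts at position zero.
type offsetReadSeeker struct {
	r    io.ReaderAt
	base int64 // absolute offset at which the logical stream begins
	off  int64 // absolute offset of the next read
}

func newOffsetReadSeeker(r io.ReaderAt, base int64) *offsetReadSeeker {
	return &offsetReadSeeker{r: r, base: base, off: base}
}

func (o *offsetReadSeeker) Read(p []byte) (int, error) {
	n, err := o.r.ReadAt(p, o.off)
	o.off += int64(n)
	if n > 0 && errors.Is(err, io.EOF) {
		err = nil // report EOF on the next call, once no bytes are left
	}
	return n, err
}

// Seek treats offset as relative to base and returns a base-relative position.
// io.SeekEnd is left out because io.ReaderAt does not expose a length.
func (o *offsetReadSeeker) Seek(offset int64, whence int) (int64, error) {
	var abs int64
	switch whence {
	case io.SeekStart:
		abs = o.base + offset
	case io.SeekCurrent:
		abs = o.off + offset
	default:
		return 0, errors.New("unsupported whence")
	}
	if abs < o.base {
		return 0, errors.New("seek before start of stream")
	}
	o.off = abs
	return abs - o.base, nil
}

// readPayload skips `skip` bytes of the payload that starts at base in file,
// reads n bytes, and reports the base-relative position afterwards.
func readPayload(file string, base, skip int64, n int) (string, int64, error) {
	s := newOffsetReadSeeker(strings.NewReader(file), base)
	if _, err := s.Seek(skip, io.SeekStart); err != nil {
		return "", 0, err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(s, buf); err != nil {
		return "", 0, err
	}
	pos, err := s.Seek(0, io.SeekCurrent)
	return string(buf), pos, err
}

func main() {
	// "PRAGMA" stands in for the bytes that precede the payload.
	got, pos, err := readPayload("PRAGMApayload", 6, 3, 4)
	fmt.Println(got, pos, err)
}
```

With normalized positions, a caller can read the header, note its size, and seek past it without knowing where the payload sits in the enclosing file.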
@@ -59,50 +66,164 @@ func WithIndexPadding(p uint64) Option {
 // This can help avoid redundancy in a CARv1's list of CID-Block pairs.
 //
 // Note that this compares whole CIDs, not just multihashes.
-func WithCidDeduplication(b *ReadWrite) {
+func WithCidDeduplication(b *ReadWrite) { // TODO should this take a bool and return an option to allow disabling deduplication?
the current setup (default off, option for on) seems good enough for the first pass
In the context of finalising the interfaces, do you think we should keep this function as is, instead of taking a bool, for the v2 release? If released, we need to keep it for the lifetime of v2 and introduce a WithCidDeduplicationDisabled or something like that if we want to allow disabling of deduplication.
On a side note, other option functions in this file return Option, so whichever we decide, it would be good to keep them consistent.
being consistent with an Option seems good. i don't think i have strong feelings about whether there's a boolean argument
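The two shapes under discussion in this thread can be compared with a small sketch. The types and names here (`readWrite`, `option`, `withIndexPadding`, `withCidDeduplication`, `newReadWrite`) are lowercase stand-ins invented for illustration, not the package's actual API; the sketch assumes option functions that return a closure and shows the variant that takes a bool.

```go
package main

import "fmt"

// readWrite is a hypothetical stand-in for the blockstore being configured.
type readWrite struct {
	dedupCids    bool
	indexPadding uint64
}

// option is a function that mutates the blockstore under construction.
type option func(*readWrite)

// withIndexPadding follows the "return an option" shape.
func withIndexPadding(p uint64) option {
	return func(b *readWrite) { b.indexPadding = p }
}

// withCidDeduplication takes a bool, so a caller can switch deduplication
// off again even if the default changes to "on" in a later release.
func withCidDeduplication(enabled bool) option {
	return func(b *readWrite) { b.dedupCids = enabled }
}

// newReadWrite applies the options in order over the defaults.
func newReadWrite(opts ...option) *readWrite {
	b := &readWrite{} // deduplication is off by default in this sketch
	for _, o := range opts {
		o(b)
	}
	return b
}

func main() {
	b := newReadWrite(withIndexPadding(8), withCidDeduplication(true))
	fmt.Println(b.dedupCids, b.indexPadding)
}
```

The trade-off raised above is visible here: a bool parameter avoids a second `...Disabled` option later, at the cost of a slightly noisier call site.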
v2/blockstore/readwrite.go
Outdated
Version: 1,
fFlag := os.O_RDWR | os.O_CREATE
if !b.resume {
fFlag = fFlag | os.O_EXCL
don't we want to be exclusive in the resume case as well?
No, because I explicitly want the ReadWrite constructor to return an error when resumption is set to false but the file already exists. I have documented this on the resumption option.
The rationale is to avoid overwriting existing files by accident.
Following the same rationale, I altered the code slightly in e12754d to avoid creating files when resumption is enabled.
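The flag selection described in this thread can be sketched as follows. The helper names (`openFlags`, `open`, `runScenario`) are made up for illustration; the sketch only assumes the behaviour stated above at this point in the review: with resumption disabled the file must not exist yet, and with resumption enabled the file is never created.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// openFlags picks open flags for the two cases discussed in the thread.
func openFlags(resume bool) int {
	flags := os.O_RDWR
	if !resume {
		// Fresh write: create the file, but fail if it is already there.
		flags |= os.O_CREATE | os.O_EXCL
	}
	return flags
}

// open opens and immediately closes path with the flags for the given mode.
func open(path string, resume bool) error {
	f, err := os.OpenFile(path, openFlags(resume), 0o666)
	if err != nil {
		return err
	}
	return f.Close()
}

// runScenario exercises the three interesting cases in a temp directory.
func runScenario() (missingOnResume, createdFresh, existsOnFresh bool, err error) {
	dir, err := os.MkdirTemp("", "resume-flags")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sample.car")

	// Resuming a file that does not exist fails and creates nothing.
	missingOnResume = errors.Is(open(path, true), fs.ErrNotExist)
	// A fresh write creates the file.
	createdFresh = open(path, false) == nil
	// A second fresh write refuses to touch the existing file.
	existsOnFresh = errors.Is(open(path, false), fs.ErrExist)
	return missingOnResume, createdFresh, existsOnFresh, nil
}

func main() {
	fmt.Println(runScenario())
}
```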
return err
}

for {
i would at least pull this utility into the index package rather than having it inline here
That makes sense. I thought I would do this as part of #118, where most of the index types need to be unexposed and moved back to internal packages.
Doing this now would mean either exposing more things in the index package that we need to unexpose later, or making this PR much larger by also doing #118.
If you agree, I have left the TODO here and will attend to it in a separate PR.
* Implement equality check for CAR v1 headers where roots in different order are considered to be equal.
* Improve resumption docs to clarify what matching roots mean.
When resumption is enabled, we do not want to create a file if it does not exist. Similarly, when resumption is disabled, fail if the given file at path already exists to avoid unintended file overrides and creations.
@aarshkshah1992 @dirkmc Thank you for the feedback on this; here is a summary:
* When the file exists, attempt resumption by default. This seems like a more user-friendly API, with fewer requirements to explicitly tune knobs.
* Explicitly fail if the file is finalized. We technically can append to a finalized file, but that involves changes in multiple places in the reader abstractions. Left a TODO as a feature for the future when asked for.
* Close off the file if construction of the ReadWrite blockstore fails after the file is opened. If left open, it causes test failures when `t` cleans up temp files.
I removed the resumption option completely now that it is attempted by default. I have also made the error when resuming on finalised files explicit. Merging; thank you all for the reviews.
return nil, err
}
}
f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o666) // TODO: Should the user be able to configure FileMode permissions?
@masih Open once then stat on the open file, because it is possible for the file to have moved in between stat and open calls.
}
}
// Seek to the end of last skipped block where the writer should resume writing.
_, err = b.carV1Writer.Seek(frameOffset, io.SeekStart)
@masih resumption should be used with deduplicated puts, so that puts which already happened before the resume don't get written again.
We can add this to the doc and/or warn people about it, unless they explicitly want duplicate blocks in there.
Maybe deduplication should be enabled by default?
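A toy model may make the concern above concrete. The `store` type below is hypothetical and keeps everything in memory, with string keys standing in for CIDs; it only assumes that a resumed blockstore rebuilds its index from the frames already on disk, so a deduplicating put can skip blocks the index already knows.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for a resumed blockstore.
type store struct {
	dedup bool
	index map[string]int // key -> offset of the block within data
	data  []byte
}

func newStore(dedup bool) *store {
	return &store{dedup: dedup, index: make(map[string]int)}
}

// put appends the block unless deduplication is on and the key is indexed.
func (s *store) put(key string, block []byte) {
	if s.dedup {
		if _, ok := s.index[key]; ok {
			return // already written before the resume; skip it
		}
	}
	s.index[key] = len(s.data)
	s.data = append(s.data, block...)
}

func main() {
	for _, dedup := range []bool{true, false} {
		s := newStore(dedup)
		s.put("block-a", []byte("abc")) // written before the interruption
		s.put("block-a", []byte("abc")) // replayed by the caller after resuming
		fmt.Println(dedup, len(s.data))
	}
}
```

Without deduplication the replayed put appends a second copy, which is the duplicate-block outcome the comment warns about.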
// Two headers are considered equal if:
// 1. They have the same version number, and
// 2. They contain the same root CIDs in any order.
func (h CarHeader) Equals(other CarHeader) bool {
@masih Rename Equal to something else, since Equal has a specific meaning in Go, more along the lines of reflect.DeepEqual, where the order of elements matters.
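An order-insensitive comparison of the kind documented on the function above can be sketched like this. The `header` type is a simplified stand-in that uses strings in place of CIDs, and counting roots in a map (so that repeated roots must appear the same number of times on both sides) is an assumption of this sketch, not necessarily how the PR implements it.

```go
package main

import "fmt"

// header is a simplified stand-in for a CAR v1 header.
type header struct {
	Version uint64
	Roots   []string // stand-ins for root CIDs
}

// Matches reports whether both headers have the same version and the same
// roots, ignoring the order in which the roots are listed.
func (h header) Matches(other header) bool {
	if h.Version != other.Version || len(h.Roots) != len(other.Roots) {
		return false
	}
	// Count each root on one side, then consume the counts from the other.
	counts := make(map[string]int, len(h.Roots))
	for _, r := range h.Roots {
		counts[r]++
	}
	for _, r := range other.Roots {
		if counts[r] == 0 {
			return false
		}
		counts[r]--
	}
	return true
}

func main() {
	a := header{Version: 1, Roots: []string{"root-1", "root-2"}}
	b := header{Version: 1, Roots: []string{"root-2", "root-1"}}
	fmt.Println(a.Matches(b))
}
```

Because the result ignores ordering, a name other than Equal, as suggested above, signals that this is weaker than element-by-element equality.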
* Use the file already open to get the stats for resumption purposes. On getting the stats, because the open flag contains `os.O_CREATE`, check the size of the file as a way to determine whether we should attempt resumption or not. In practice this is the same as the code that explicitly differentiates non-existing files, since file existence doesn't matter here; the key thing is the file content. Plus, this also avoids unnecessary errors where the file exists but is empty.
* Add TODOs in places to consider enabling deduplication by default and to optimise the computational complexity of the roots check. Note: explicitly enable deduplication by default in a separate PR, so that the change is clearly communicated to dependent clients in case it causes any unintended side-effects.
* Rename `Equals` to `Matches` to avoid confusion about what it does.
Relates to:
- #147 (comment)
- #147 (comment)
- #147 (comment)
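The first item above, opening once and then calling Stat on the open handle, can be sketched as follows. `openAndCheck` and `runScenario` are hypothetical helpers; the sketch assumes, as the commit message describes, that a non-zero size is the signal to attempt resumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openAndCheck opens path, creating it if needed, and reports whether the
// file already has content. Stat is called on the open handle rather than on
// the path, so the answer refers to the file that was actually opened.
func openAndCheck(path string) (*os.File, bool, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o666)
	if err != nil {
		return nil, false, err
	}
	info, err := f.Stat()
	if err != nil {
		f.Close() // do not leak the handle when construction fails
		return nil, false, err
	}
	return f, info.Size() > 0, nil
}

// runScenario opens a new file, writes to it, and opens it a second time.
func runScenario() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "resume-stat")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sample.car")

	f, first, err := openAndCheck(path)
	if err != nil {
		return false, false, err
	}
	if _, err := f.WriteString("partial data"); err != nil {
		f.Close()
		return false, false, err
	}
	if err := f.Close(); err != nil {
		return false, false, err
	}

	f, second, err := openAndCheck(path)
	if err != nil {
		return false, false, err
	}
	return first, second, f.Close()
}

func main() {
	fmt.Println(runScenario())
}
```

An empty file and a missing file end up on the same path here, which is the simplification the commit message points out.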
Assume the read seeker passed in is already positioned at the beginning of the CAR payload. This is both for consistency of the API and to avoid unnecessary seeks when the reader may already be at the right place.
This commit was moved from ipld/go-car@cb2b58d
Fixes #98