Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ReadWrite blockstore resumption #147

Merged
merged 4 commits into from
Jul 9, 2021
Merged

Implement ReadWrite blockstore resumption #147

merged 4 commits into from
Jul 9, 2021

Conversation

masih
Copy link
Member

@masih masih commented Jul 8, 2021

  • Implement an option for read-write blockstore, that if enabled the
    write can resume from where the writer left off. For resumption to work
    the WithResumption option needs to be set explicitly. Otherwise, if
    path to an existing file is passed, the blockstore construction will
    return an error. The resumption requires the roots passed to constructor
    as well as padding options to be identical with roots in file.
    Resumption only works on paths where at least V2 pragma and CAR v1
    header was successfully written onto the file. Otherwise an error is
    returned.

  • Implement resumption test that verifies files resumed from match
    expected header, data and index.

  • Implement a CAR v1 equals function to check if two given headers are
    identical. This implementation requires exact ordering of root elements.
    A TODO is left to relax the exact ordering requirement.

  • Implement Seeker in internal offset writer in order to forward offset
    of CAR v1 writer within a resumed read-write blockstore after
    resumption. The offset of the writer needs to be set to the latest
    written frame in order for consecutive writes to be at the right offset.

  • Fix bug in offset read seeker in internal IO, where seek and returned
    position was not normalized by the base. Reflect the fix in read-only
    blockstore AllKeysChan where reader was twisted to work. We now read the
    header to get its size, then seek past it to then iterate over blocks
    to populate the channel.

  • Add TODOs in places to make treating zero-length frames as EOF
    optional; See add support for null-padded carv1 payloads #140 for context.

  • Run gofumpt -l -w . on everything to maintain consistent formatting.

Fixes #98

@masih masih added this to the CAR v2 milestone Jul 8, 2021
@masih masih linked an issue Jul 8, 2021 that may be closed by this pull request
* Implement an option for read-write blockstore, that if enabled the
write can resume from where the writer left off. For resumption to work
the `WithResumption` option needs to be set explicitly. Otherwise, if
path to an existing file is passed, the blockstore construction will
return an error. The resumption requires the roots passed to constructor
as well as padding options to be identical with roots in file.
Resumption only works on paths where at least V2 pragma and CAR v1
header was successfully written onto the file. Otherwise an error is
returned.

* Implement resumption test that verifies files resumed from match
expected  header, data and index.

* Implement a CAR v1 equals function to check if two given headers are
identical. This implementation requires exact ordering of root elements.
A TODO is left to relax the exact ordering requirement.

* Implement Seeker in internal offset writer in order to forward offset
of CAR v1 writer within a resumed read-write blockstore after
resumption. The offset of the writer needs to be set to the latest
written frame in order for consecutive writes to be at the right offset.

* Fix bug in offset read seeker in internal IO, where seek and returned
position was not normalized by the base. Reflect the fix in read-only
blockstore AllKeysChan where reader was twisted to work. We now read the
 header to get its size, then seek past it to then iterate over blocks
 to populate the channel.

* Add TODOs in places to make treating zero-length frames as EOF
optional; See #140 for context.

* Run `gofumpt -l -w .` on everything to maintain consistent formatting.
@masih masih marked this pull request as ready for review July 8, 2021 16:00
@masih masih requested review from willscott and aarshkshah1992 July 8, 2021 16:00
@@ -59,50 +66,164 @@ func WithIndexPadding(p uint64) Option {
// This can help avoid redundancy in a CARv1's list of CID-Block pairs.
//
// Note that this compares whole CIDs, not just multihashes.
func WithCidDeduplication(b *ReadWrite) {
func WithCidDeduplication(b *ReadWrite) { // TODO should this take a bool and return an option to allow disabling dedupliation?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current is setup (default off, option for on) seems good enough for the first pass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of finalising the interfaces, do you think we should keep this function as is, instead of taking a bool for the v2 release? If released we need to keep it for the lifetime of v2 and introduce a WithCidDeduplicationDisabled or something like that if we want to allow disabling of deduplication.

On a side note, other option functions in this file return Option so whichever we decide it would be good to keep them consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

being consistent with an Option seems good. i don't think i have strong feelings about whether there's a boolean argument

v2/blockstore/readwrite.go Outdated Show resolved Hide resolved
Version: 1,
fFlag := os.O_RDWR | os.O_CREATE
if !b.resume {
fFlag = fFlag | os.O_EXCL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want to be exclusive in the resume case as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because I explicitly want ReadWrite constructor to return an error, when resumption is set to false but the file already exists. I have documented this on resumption option here.

The rationale is to avoid overriding existing files by accident.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the same rationale, altered the code slight to avoid creating files when resumption is enabled in e12754d.

return err
}

for {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would at least pull this utility into the index package rather than having it inline here

Copy link
Member Author

@masih masih Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I thought I do this as part of #118 where most of the index types need to be unexposed and moved back to internal packages.

Doing this now would either mean exposing more things in index package that we need to unexpose later or making this PR much larger by also doing #118.

If you agree, I have left the TODO here and will attend to it in a separate PR.

v2/internal/carv1/car.go Outdated Show resolved Hide resolved
masih added 2 commits July 9, 2021 12:10
* Implement equality check for CAR v1 headers where roots in different
  order are considered to be equal.

* Improve resumption docs to clarify what matching roots mean.
When resumption is enabled, we do not want to create a file if it does
not exist. Similarly, when resumption is disabled, fail if the given
file at path already exists to avoid unintended file overrides and
creations.
@masih
Copy link
Member Author

masih commented Jul 9, 2021

@aarshkshah1992 @dirkmc Thank you for the feedback on this; here is a summary:

  • By default resume if file exists, removing the need to explicitly set the resumption flag to true for it to work.
  • On Finalize write the CAR v2 header last. This reduces the chances of having unresumable files, since header size is much smaller than index.
  • Improve docs by explicitly stating on which API calls the resumption conditions are met.

When file exists, attempt resumption by default. This seems like a more
user-friendly API and less requirements to explicitly tune knobs.

Explicitly fail of the file is finalized. We technically can append to a
finalized file but that involves changes in multiple places in reader
abstractions. Left a TODO as a feature for future when asked for.

Close off file if construction of ReadWrite blockstore fails after file
is opened. If left open, it causes test failures when t cleans up temp
files.
@masih
Copy link
Member Author

masih commented Jul 9, 2021

@aarshkshah1992 @dirkmc Thank you for the feedback on this; here is a summary:

  • By default resume if file exists, removing the need to explicitly set the resumption flag to true for it to work.
  • On Finalize write the CAR v2 header last. This reduces the chances of having unresumable files, since header size is much smaller than index.
  • Improve docs by explicitly stating on which API calls the resumption conditions are met.

I removed the resumption option completely now that it is attempted by default.
I thought we will add the option to disable it when needed, since it seems redundant just now.

I have also made the error when resuming on finalised files explicit.

Merging; thank you all for the reviews.

return nil, err
}
}
f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o666) // TODO: Should the user be able to configure FileMode permissions?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masih Open once then stat on the open file, because it is possible for the file to have moved in between stat and open calls.

}
}
// Seek to the end of last skipped block where the writer should resume writing.
_, err = b.carV1Writer.Seek(frameOffset, io.SeekStart)
Copy link
Member Author

@masih masih Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masih resumption should be used with deduplicated puts, so puts that got resumed don't get written again.

We can add this to the doc and or warn people about it unless they explicitly want duplicate blocks in there.
Maybe deduplication should be enabled by default?

// Two headers are considered equal if:
// 1. They have the same version number, and
// 2. They contain the same root CIDs in any order.
func (h CarHeader) Equals(other CarHeader) bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masih Rename Equal to something else since Equal has a specific meaning in go, more in the lines of reflect.DeepEqual where order of elements matter.

masih added a commit that referenced this pull request Jul 13, 2021
* Use the file already open to get the stats for resumption purposes. On
getting the stats, because the open flag contains `os.O_CREATE` check
the size of the file as a way to determine wheter we should attempt
resumption or not. In practice this is the same as the code that
explicitly differentiates from non-existing files, since file existence
doesn't matter here; the key thing is the file content. Plus this also
avoids unnecessary errors where the files exists but is empty.

* Add TODOs in places to consider enabling deduplication by default and
optimise computational complexity of roots check. Note, explicitly
enable deduplication by default in a separate PR so that the change is
clearly communicated to dependant clients in case it causes any
unintended side-effects.

* Rename `Equals` to `Matches` to avoid confusion about what it does.

Relates to:
- #147 (comment)
- #147 (comment)
- #147 (comment)
mvdan pushed a commit that referenced this pull request Jul 16, 2021
Assume the read seeker passed in is seeked to the beginning of CAR
payload. This is both for consistency of API and avoid unnecessary
seeks when the reader may already be at the right place.

Address review comments

* Use the file already open to get the stats for resumption purposes. On
getting the stats, because the open flag contains `os.O_CREATE` check
the size of the file as a way to determine wheter we should attempt
resumption or not. In practice this is the same as the code that
explicitly differentiates from non-existing files, since file existence
doesn't matter here; the key thing is the file content. Plus this also
avoids unnecessary errors where the files exists but is empty.

* Add TODOs in places to consider enabling deduplication by default and
optimise computational complexity of roots check. Note, explicitly
enable deduplication by default in a separate PR so that the change is
clearly communicated to dependant clients in case it causes any
unintended side-effects.

* Rename `Equals` to `Matches` to avoid confusion about what it does.

Relates to:
- #147 (comment)
- #147 (comment)
- #147 (comment)
Jorropo pushed a commit to ipfs/boxo that referenced this pull request Mar 22, 2023
Assume the read seeker passed in is seeked to the beginning of CAR
payload. This is both for consistency of API and avoid unnecessary
seeks when the reader may already be at the right place.

Address review comments

* Use the file already open to get the stats for resumption purposes. On
getting the stats, because the open flag contains `os.O_CREATE` check
the size of the file as a way to determine wheter we should attempt
resumption or not. In practice this is the same as the code that
explicitly differentiates from non-existing files, since file existence
doesn't matter here; the key thing is the file content. Plus this also
avoids unnecessary errors where the files exists but is empty.

* Add TODOs in places to consider enabling deduplication by default and
optimise computational complexity of roots check. Note, explicitly
enable deduplication by default in a separate PR so that the change is
clearly communicated to dependant clients in case it causes any
unintended side-effects.

* Rename `Equals` to `Matches` to avoid confusion about what it does.

Relates to:
- ipld/go-car#147 (comment)
- ipld/go-car#147 (comment)
- ipld/go-car#147 (comment)


This commit was moved from ipld/go-car@cb2b58d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read/Write blockstore should support resuming
2 participants