registry/storagedriver S3 Walk optimization #17
Looks good to me, other than the nit about explaining the principle of the new doWalk implementation.
Further testing shows that the current implementation does not work for scan repositories, so I am addressing that now.
// => [ "/path/to/folder/folder2", "/path/to/folder/folder2/folder1" ]
// Eg 5 directoryDiff("/", "/path/to/folder/folder/file")
// => [ "/path", "/path/to", "/path/to/folder", "/path/to/folder/folder" ],
func directoryDiff(prev, current string) []string {
Are you familiar with the filepath package? Specifically, filepath.Rel. Along with the filepath.SplitList function, we should be able to simplify the loop/logic constructing parents and eliminate the sort.Sort by looping over the split directory names to construct the parents one by one. It is not totally clear to me whether that would make this function more efficient overall, but if so it should be worth the effort, since it looks like we run this for every filepath we see.
Will take a look at that 👍🏻
I'm not sure how filepath.Rel and SplitList would help here, though I can replace sort.Sort with something that reverses the list ordering, as the way this is done generates the list in the reverse of the order in which we want the directories to be walked. If you have an idea for an alternate implementation, do you mind writing it up?
Using a simple reverse function now and all the unit tests still pass 👍🏻
"I'm not sure how filepath.Rel and SplitList would help here"
The main benefit in my view would be improved readability and removing the need for the sort (or reverse now, I guess).
This looks great, though it took me a couple of read-throughs (and some S3 doc reading) to understand the correctness. I've left a couple of comment suggestions that would help make it more obvious for the reader.
… linked/blobstore
…rking & added a Files Removed test for WalkFilesFallback.
…l that was left in.
…ng ErrSkipDir from stopping gracefully
…rom stopping gracefully
…it all into S3 tests.
…es to walk between files. This is needed for manifest enumeration among others
Upstream PR #3480
Objective
Blobstore enumeration with the S3 storage driver (and possibly others, with follow-up effort) can be optimized by several orders of magnitude in most cases by offloading more work to the S3 API. In some cases this gives identical performance, but in extreme cases, e.g. thousands of blobs in separate folders, it gives a huge performance boost.
Changes
Use ListObjectsV2PagesWithContext without Delimiter, giving all objects of subpaths in batches of up to 1000.
Reproduce the directory walk (previously done with Delimiter & a recursive implementation) by comparing subsequent object paths of different subdirectories.
Bug Fix
While testing, I noticed that WalkFallback does not handle ErrSkipDir as documented for a non-directory.
WalkFallback should stop when ErrSkipDir is returned for a non-directory, as documented.
Before this change, WalkFallback handled ErrSkipDir for a non-directory by skipping the file and did not stop. The fix is tested with the added case TestWalkFallback/stop early.
Run S3 Tests
Performance
On a few test registries, I performed a rough benchmark using BlobEnumerator::enumerate twice: once before making these changes & again with the changes. I used a few local changes to keep track of the number of objects / folders enumerated and API calls made.
Test 1 (medium) ~300 blobs
Results
Test 2 (large) ~50k blobs
Only the first 5 minutes of Walk are recorded and extrapolated, which I think is fair to get the point across.
Results