-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(az): optimize non-recursive _list_dir
#447
Conversation
_list_dir
Thanks, @M0dEx. Could you also run before and after perf numbers for Azure? You'll need to run this command but with You should get a table you can paste into the PR like this: |
dfdbd8b
to
1af198a
Compare
Added performance tests before and after. |
1af198a
to
8f27b82
Compare
Looks like the performance generally is the same, with |
8f27b82
to
a230114
Compare
Pushed a fix for the failing tests (apparently caused by the |
a230114
to
2783e38
Compare
Removed the |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #447 +/- ##
========================================
- Coverage 93.7% 93.3% -0.5%
========================================
Files 23 23
Lines 1654 1659 +5
========================================
- Hits 1551 1548 -3
- Misses 103 111 +8
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a few questions about the approach here. Preferably we can use a more explicitly supported version of things.
is_folder = ( | ||
metadata.content_settings.content_type is None | ||
and metadata.content_settings.content_md5 is None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to a more specific reference? I don't see the link you sent supporting the claim that this is definitive for folders.
It does seem like x-ms-resource-type
is useful to answer the question directly, but only for certain configurations so could be included as an optimization. It may be the case that all the scenarios where those vars are none are also ones where x-ms-resource-type
is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the reference is sort of vague. I was mainly going of off these two statements:
Content-Type: The content type that's specified for the blob. If no content type is specified, the default content type is application/octet-stream.
Content-MD5: If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity.
In version 2012-02-12 and later, Put Blob sets a block blob’s MD5 value even when the Put Blob request doesn’t include an MD5 header.
Using this, coupled with our observations from real-life usage, we concluded that these parameters are always None
for folders and never None
for blobs, even if they are empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does your configuration have the x-ms-resource-type
header for these folder blobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That header does not seem to be returned within BlobProperties
, nor can I see it in the in-browser Azure storage explorer.
|
||
for blob in blobs: | ||
# walk_blobs returns folders with a trailing slash | ||
blob_path = blob.name.rstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it problematic to keep the trailing slashes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it to be consistent with the current behaviour, but we can simply not remove the trailing slashes.
Incorporated into #453, thanks! |
As described in #446, the current implementation of
_list_dir
in theAzureBlobClient
can be painfully slow for directories with a high level of nesting, even if the result is supposed to be non-recursive.This PR significantly improves the performance by using a newer portion of the Azure Blob Storage API, more specifically the
BlobServiceClient.walk_blobs
function, which allows for much faster non-recursive iteration over the contents of a directory. Additionally, it is now possible to recursively iterate over the root of an account - over all containers.Furthermore, the detection of whether a blob is a file or a directory has been improved, using
content_type
andcontent_md5
properties as described by the Azure Blob API.Closes #446.
Closes #444.
Performance before:
Performance after: