Add optimization to only list log files starting at a certain name #1252
Comments
Good idea! |
Heard from Scott that this was recently implemented in the Spark impl. Definitely seems like low-hanging fruit. |
Would you mind if I worked on this? While I would love to tackle some of the "aged" good first issues, I haven't been able to determine which ones are still active and require assistance. Therefore, I would like to begin by addressing a low-hanging fruit before diving into more extensive contributions. |
Feel free to! I'm not 100% sure this optimization actually applies to our codebase yet. It looks like instead of listing, we just look for the N+1 log file until we get a 404 error back: Lines 792 to 793 in 8a4b2b8
But I think there's at least one function that could benefit now: Line 657 in 8a4b2b8
|
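For readers following along, the "look for the N+1 log file until we get a 404" approach described above can be sketched roughly as follows. This is a minimal illustration, not the code at the referenced lines: `latest_version_by_probing` and the `exists` callback are hypothetical names, and the callback stands in for whatever HEAD/GET request the real store issues.

```rust
/// Probe for successive commit files, starting at `start`, until one is missing.
/// Returns the highest version that was found, or `None` if `start` itself is absent.
/// `exists` is a hypothetical stand-in for a per-object existence check
/// (a "not found" / 404 response corresponds to `false`).
fn latest_version_by_probing(start: u64, exists: impl Fn(&str) -> bool) -> Option<u64> {
    let mut latest = None;
    let mut version = start;
    // Delta commit files are named with the version zero-padded to 20 digits.
    while exists(&format!("{:020}.json", version)) {
        latest = Some(version);
        version += 1;
    }
    latest
}
```

Each probe is one request, so this costs one call per new commit plus one final miss, and never needs a listing at all. That is why an offset-aware listing would mainly help code paths that do list the directory.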
@wjones127 |
FYI: |
@ognis1205 - no worries at all. I do believe that for this optimization to really take effect, it needs to be added in the object_store crate rather than here; otherwise we would be getting the full payload in any case. After a quick scan, it seems S3 supports this. Although I haven't seen any explicit mention of ordering in the response, there is a "start-after" parameter that can be passed to list. I guess if there is a concept of "after", there should be a concept of ordering as well :). For Azure I unfortunately have not found an equivalent parameter. |
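To make the "start-after" idea concrete, here is a small self-contained sketch. It relies on the fact that Delta commit files use the version zero-padded to 20 digits, so lexicographic key order matches version order. The names `commit_file_name` and `list_after` are hypothetical, and the `BTreeSet` only simulates a store that returns keys in sorted order; it is not the object_store API.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Build the commit file name for a version: zero-padded to 20 digits plus `.json`.
fn commit_file_name(version: u64) -> String {
    format!("{:020}.json", version)
}

/// Hypothetical stand-in for an offset-aware listing: return every key that
/// sorts strictly after `offset`. A real store would apply this filter
/// server-side, so the skipped keys are never sent over the wire.
fn list_after<'a>(keys: &'a BTreeSet<String>, offset: &str) -> Vec<&'a str> {
    keys.range::<str, _>((Bound::Excluded(offset), Bound::Unbounded))
        .map(String::as_str)
        .collect()
}
```

For example, calling `list_after(&keys, &commit_file_name(2))` on a set holding versions 0 through 4 yields only the files for versions 3 and 4. Doing the same filtering on the client after a full listing would give the same result but not the savings, which is the point made above about pushing this into the store.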
@roeap |
FYI:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_Example_3
https://cloud.google.com/storage/docs/json_api/v1/objects/list
https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=azure-ad
https://github.com/apache/arrow-rs/blob/master/object_store/CHANGELOG-old.md
https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list
|
delta-rs/rust/src/storage/mod.rs Line 201 in 930d16e
Line 858 in 8a4b2b8
Line 670 in 8a4b2b8
In this part, |
I think I found a bug relating to this issue: |
Hi folks, Sorry for being late; I have been a bit busy lately. Regarding the issue, after reviewing the current implementation of
Instead, while reviewing the current implementation, I guess I found a bug relating to |
We don't use |
@wjones127
Would you mind if I proceed this way? |
Yes, that sounds good |
@wjones127 |
# Description
Adds the `list_with_offset` delegation method to `DeltaObjectStore`.

# Related Issue(s)
- closes #1252

# Documentation
apache/arrow-rs#3970

Signed-off-by: Shingo OKAWA <[email protected]>
Description
When listing the `_delta_log` directory, we should be able to say "start listing at `000...0010000.json`" to only get log files after the 10000th (skipping a lot of calls) if that's all we need.
Use Case
Related Issue(s)