-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure we return a sorted listing when using a local client #344
Ensure we return a sorted listing when using a local client #344
Conversation
kernel/src/scan/mod.rs
Outdated
@@ -627,10 +627,6 @@ mod tests { | |||
|
|||
#[test] | |||
fn test_scan_data() { | |||
use tracing_subscriber::EnvFilter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was left over that shouldn't be here, and was causing test failures because you shouldn't init in a test like this
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #344 +/- ##
==========================================
+ Coverage 78.22% 78.34% +0.12%
==========================================
Files 49 50 +1
Lines 10253 10341 +88
Branches 10253 10341 +88
==========================================
+ Hits 8020 8102 +82
- Misses 1783 1785 +2
- Partials 450 454 +4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code seems fine, if we decide that ordered listing is the way to go. Given that S3 directory buckets don't support ordered listing, I wonder if we need to seriously consider hardening kernel against disorder.
HOWEVER -- everything I can think of would require materializing the listing result, and if we do that unconditionally we penalize stores that do provide ordered listing. Also, it's probably simpler overall to ensure the listing is ordered up front, so the rest of kernel doesn't have to worry about it?
9513ed0
to
4b19ed3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm just added a question on expanding this to all non-s3-normal-bucket object_store implementations
@@ -72,7 +80,14 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> { | |||
} | |||
}); | |||
|
|||
Ok(Box::new(receiver.into_iter())) | |||
if self.is_local { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like we might need to just detect if it "isn't S3 non-directory bucket" and then keep expanding from there? Seems safer to do a bit of extra sorting for now and just remove it as we discover different object_stores can do sorting? (or better yet we do some upstream fixes to expose a new API to communicate sorting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be huge for building analytical engines, as it will leave them in a dilemma about whether to trust the iterators to return the sorted entries. Do you know if any conversations are going on upstream about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! I added some context here #344 (comment). But I think we want to guarantee ordered from our listing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with @zachschuermann here -- but it's probably important to at least have answers for AWS, GCP, and Azure if we're going that route?
Actually... thinking more... we don't currently rely on ordered listing, so unrecognized cloud stores will just keep today's bad behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to assume sorting is needed unless proven otherwise: It's not clear that Azure supports an efficient list_with_offset, which makes me worry it also might not return ordered results:
Some stores, such as S3 and GCS, may be able to push offset down to reduce the number of network requests required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another reason: object_store
supports various adapters which would throw off name- or type-based checks, so any such check needs to fail-safe not fail-dangerous.
@scovich: Should I base the PR #322 on these changes? Are we ignoring the |
@nicklan -- what is left for this to be able to merge? Are we concerned about the approach? Worried it's somehow incorrect? Something else? |
Sorry for the delay here! This this is ready now. I did some work to figure out if we need to do anything for non-local stores. As far as I can tell, aws, azure, and gcp all return sorted listings.
So it should just be local that we need for now. I've made a follow-up issue to check if the hdfs integration returns in order (I suspect it does) (TODO: Link issue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple questions but seems fine :) and maybe we want to add the comment that you had about each cloud's storing as an actual code comment for posterity?
kernel/src/engine/default/mod.rs
Outdated
} | ||
|
||
pub fn new(store: Arc<DynObjectStore>, prefix: Path, task_executor: Arc<E>) -> Self { | ||
// HACK to check if we're using a LocalFileSystem from ObjectStore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make an issue for follow-up to fix in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I don't think there's a better way to do it, or I would have :)
commit_files | ||
); | ||
// We assume listing returned ordered, we want reverse order | ||
let commit_files = commit_files.into_iter().rev().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious: it says .rev()
works on double-ended iterators - is that what's going on here? This still has to be O(N) right? makes me wonder if we lose that much just by sorting like we used to..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think reversing is still gonna be faster than a full sort. Or, at least not slower.
Good call, done |
@@ -72,7 +80,14 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> { | |||
} | |||
}); | |||
|
|||
Ok(Box::new(receiver.into_iter())) | |||
if self.is_local { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with @zachschuermann here -- but it's probably important to at least have answers for AWS, GCP, and Azure if we're going that route?
Actually... thinking more... we don't currently rely on ordered listing, so unrecognized cloud stores will just keep today's bad behavior?
@@ -72,7 +80,14 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> { | |||
} | |||
}); | |||
|
|||
Ok(Box::new(receiver.into_iter())) | |||
if self.is_local { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to assume sorting is needed unless proven otherwise: It's not clear that Azure supports an efficient list_with_offset, which makes me worry it also might not return ordered results:
Some stores, such as S3 and GCS, may be able to push offset down to reduce the number of network requests required
@@ -72,7 +80,14 @@ impl<E: TaskExecutor> FileSystemClient for ObjectStoreFileSystemClient<E> { | |||
} | |||
}); | |||
|
|||
Ok(Box::new(receiver.into_iter())) | |||
if self.is_local { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another reason: object_store
supports various adapters which would throw off name- or type-based checks, so any such check needs to fail-safe not fail-dangerous.
// - Azure: Docs state | ||
// [here](https://learn.microsoft.com/en-us/rest/api/storageservices/enumerating-blob-resources): | ||
// "A listing operation returns an XML response that contains all or part of the requested | ||
// list. The operation returns entities in alphabetical order." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! I couldn't find that documented before.
But what about non-blob azure storage? Isn't there ADLS gen 1 and ADLS gen 2 as well? Or are they supersets of the blob store API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question on adls gen 2. I think that it doesn't matter because alds gen 2 is built on blob store, so the rest apis should be the same.
63571a4
to
2e03a28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has_ordered_listing LGTM. highlighting concern from slack we will need to follow up on: if we have S3-compat filesystem e.g. minio but that doesn't return ordered listing then we wouldn't be able to trust s3:// url to mean ordered listing.
Filed #410 to follow-up |
This ensures that calls to
list_files
are returned sorted if aLocalFileSystem
ObjectStore
is used.It's not clean for a few reasons:
type_of
or similar. We have a foreign trait object that doesn't have anas_any
on it, so we can force the reference into anAny
which would allow us to useis
, and we can't add an implementation of something likeas_any
becausedyn ObjectStore
isn'tSized
. So this resorts toformat
ing the object at creation (sinceObjectStore
requiresDisplay
) and checking if it starts withLocalFileSystem
....Adds a test that when using local client things come back sorted now. Without these changes the test failed.
I think we should merge (or something like it) and then also see about adding a
list_sorted
andlist_with_offset_sorted
toObjectStore