fix: inconsistent path in azure list #673
Conversation
@@ -368,7 +368,7 @@ impl StorageBackend for AdlsGen2Backend {
         .flat_map(|it| match it {
             Ok(paths) => Either::Left(stream::iter(paths.into_iter().map(|p| {
                 Ok(ObjectMeta {
-                    path: path.to_string(),
+                    path: format!("adls2://{}/{}", self.file_system_name, path.to_owned()),
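The changed line prepends the `adls2://` scheme and file system name to each listed path. As a minimal, self-contained sketch of that formatting (the helper name `to_adls2_uri` is hypothetical; in delta-rs the `format!` call sits inline in the listing stream):

```rust
// Hypothetical helper mirroring the fixed line in the diff above.
// In the real backend, `file_system_name` is a field on AdlsGen2Backend
// and `path` is the object key returned by the ADLS Gen2 list call.
fn to_adls2_uri(file_system_name: &str, path: &str) -> String {
    format!("adls2://{}/{}", file_system_name, path)
}

fn main() {
    assert_eq!(
        to_adls2_uri("fs", "table/part-00000.parquet"),
        "adls2://fs/table/part-00000.parquet"
    );
    println!("ok");
}
```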
It's unclear to me how this fixes the bug reported:

> The file system and Azure `list_objs` implementations return only objects within the directory specified by `path`, which differs from the S3 implementation, which returns all objects that have a prefix of `path`.

It seems like this change just changes the format of the file paths in `list_dir`? I can clearly see `.recursive(false)` above, so maybe that's the change being asked for? Am I missing something?
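The distinction being discussed can be illustrated with plain strings. Both functions below are illustrative only (not the delta-rs API): prefix listing, as S3 does, returns every key starting with the prefix, while a non-recursive directory listing returns only immediate children.

```rust
use std::collections::BTreeSet;

// Illustrative sketch: S3-style prefix listing returns all matching keys.
fn list_by_prefix<'a>(keys: &[&'a str], prefix: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| k.starts_with(prefix)).collect()
}

// Illustrative sketch: a non-recursive directory list returns only the
// immediate children of `dir`, collapsing sub-directories to one entry.
fn list_dir(keys: &[&str], dir: &str) -> Vec<String> {
    let mut out = BTreeSet::new();
    for k in keys {
        if let Some(rest) = k.strip_prefix(dir) {
            let child = match rest.find('/') {
                Some(i) => &rest[..=i], // a sub-directory, keep the slash
                None => rest,           // a file directly under `dir`
            };
            out.insert(format!("{}{}", dir, child));
        }
    }
    out.into_iter().collect()
}

fn main() {
    let keys = ["t/a.parquet", "t/_delta_log/0.json"];
    // Prefix listing sees both objects, including those under _delta_log.
    assert_eq!(list_by_prefix(&keys, "t/").len(), 2);
    // A directory listing only sees the file and the sub-directory itself.
    assert_eq!(list_dir(&keys, "t/"), vec!["t/_delta_log/", "t/a.parquet"]);
    println!("ok");
}
```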
You are right, I was too hasty there... but this is yet another bug.
Btw, I do remember now where this comes from - likely the same for the local store. So far (aside from vacuum) we never list the file directory, but always have the delta log when accessing files. Vacuum assumes that there might be orphaned files, not discoverable via the delta log, so it does a list on the files. This is where folders first come into play... In an ideal world this would not even be necessary, as we should only remove tombstones if we remove files - right?
> we should only remove tombstones if we remove files - right?
What is "removing tombstones"? Is that dropping part of the delta log? I thought vacuum just deleted files and didn't necessarily modify the log.
With that I meant: delete files that have remove actions in the log. Ideally we would never have any files in the data folder that are not referenced at all in the log. But I am not sure the remove actions - tombstones - are guaranteed to be retained. I.e. they might disappear when we create checkpoints, and we would get orphaned files after all.
I think the reason we list is that there might be some files left over from aborted transactions, which wouldn't show up in the logs.
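The orphan-detection idea discussed here boils down to a set difference: files found by listing the data directory minus files referenced anywhere in the delta log. A minimal sketch, with hypothetical names (the actual vacuum logic in delta-rs also applies retention-period checks):

```rust
use std::collections::HashSet;

// Illustrative: files present on storage but referenced by no add or
// remove action in the log are candidates for vacuum.
fn orphaned<'a>(listed: &[&'a str], logged: &[&str]) -> Vec<&'a str> {
    let logged: HashSet<&str> = logged.iter().copied().collect();
    listed
        .iter()
        .copied()
        .filter(|f| !logged.contains(f))
        .collect()
}

fn main() {
    // part-tmp.parquet was left behind by an aborted transaction and
    // therefore never appears in the log.
    let listed = ["part-0.parquet", "part-1.parquet", "part-tmp.parquet"];
    let logged = ["part-0.parquet", "part-1.parquet"];
    assert_eq!(orphaned(&listed, &logged), vec!["part-tmp.parquet"]);
    println!("ok");
}
```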
Since we need to fix local as well, I was thinking of moving the walkdir-based implementation over from object_store_rs... also really looking forward to having this migration :). I think we need some more work on the auth mechanism over there before we can switch, but we are getting there...
Yes, I'll be focusing on that soon.
Also, what would you think of integration testing with the Azurite docker container? We could add it to the
Integration tests are definitely the way to go - unfortunately Azurite does not support the ADLS Gen2 APIs we use here. I once thought about migrating the current state in object_store to also support the blob APIs and do integration tests. There is a way to do something integration-like with the ADLS Gen2 APIs - a record/replay pattern. While this creates a lot of files in the repo at an un-configurable location, it does do testing :). On other days I prefer getting object_store up to a state where we can use it here.
Description
This PR fixes the list method in the Azure storage backend; the bug was introduced in the latest refactor.
Related Issue(s)
Documentation