Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: inconsistent path in azure list #673

Merged
merged 1 commit into from
Jul 7, 2022
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Jul 5, 2022

Description

This PR fixes the list method in the azure storage backend, that was introduced in the latest refactor

Related Issue(s)

Documentation

@@ -368,7 +368,7 @@ impl StorageBackend for AdlsGen2Backend {
.flat_map(|it| match it {
Ok(paths) => Either::Left(stream::iter(paths.into_iter().map(|p| {
Ok(ObjectMeta {
path: path.to_string(),
path: format!("adls2://{}/{}", self.file_system_name, path.to_owned()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me how this fixes the bug reported:

The file system and Azure list_objs implementations return only objects within the directory specified by path which differs with the S3 implementation which returns all objects that have a prefix of path.

It seems like this change just changes the format of the file paths in list_dir? I can clearly see .recursive(false) above, so maybe that's the change being asked for? Am I missing something?

Copy link
Collaborator Author

@roeap roeap Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, i was too hasty there. .. but this yet another bug....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw. I do remember now where this comes from - likely the same for the local store. SO far (aside from vacuum) we never list the file directory, bur always have the delta log when accessing files. Vacuum assumes that there might be orphaned files, not discoverable via the delta log, so it does a list on the files. This is where folders first come into play ... in an ideal world this would not even be necessary, as we should only remove tombstones if we remove files - right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only remove tombstones if we remove files - right?

What is "removing tombstones"? Is that dropping part of the delta log? I thought vacuum just deleted files and didn't necessarily modify the log.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wirh that i meant delete files that have remove actions in the log. Ideally we would never have any files in the data folder that are not referenced at all in the log. But i am not sure if the remove actions - tombstones - are guaranteed to be retained. I.e they might disappear when we create checkpoints and get orphaned files in the log after all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason we list is there might be some files left over from aborted transactions, which wouldn't show up in the logs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to fix local as well, I was thinking to move the walkdir based implementation over from object_store_rs ... also really looking forward to having this migration :). I think we need some more more on the auth mechanism over there before we can switch, but getting there ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll be focusing on that soon.

@wjones127
Copy link
Collaborator

Also what would you think of integration testing with the Azurite docker container? We could add it to the docker_compose.yml file at the top level.

@roeap
Copy link
Collaborator Author

roeap commented Jul 6, 2022

Integration tests are definately the way to go - unfortunately azurite does not support the alds gen2 apis we use here. I once thought about migrating the current state in object_store to also support blob apis and do integration tests.

There is a way to something integration-like with the ADLS2 apis - a record / replay pattern. While this creates a lof of files in the repo at an un-configurable location, it does do testing :).

On other days I prefer getting object_store up to state that we can use it here.

@roeap roeap merged commit d301727 into delta-io:main Jul 7, 2022
@roeap roeap deleted the fix-azure-list branch September 12, 2022 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants