Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid some allocations in DeltaStorageHandler #1115

Merged
merged 5 commits into from
Feb 22, 2023

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Feb 1, 2023

Description

Fixes an error in the DeltaFileSystemHandler, when reading file metadata from remote storages. Due to an inconsistency between the behaviour object stores when invoking list operations on a path that points to a file, we incorrectly returned an Directory type for files in case of object stores. The bug only surfaced when using pyarrow < 9, since we used the call only when getting the file size, which we avoid when using more recent pyarrow versions.

@tustvold - I seem to vaguely remember discussing this at some point, but am not sure anymore. Is this something we should look into in object-store?

Update: validated locally, that the upstream fixes will fix the linked issue, so the main reason for this PR is resolved elsewhere. There are some changes included which safe us some allocation (admittedly very few), but hopefully an improvement anyhow.

Related Issue(s)

closes #1109

Documentation

@github-actions github-actions bot added the binding/python Issues for the Python package label Feb 1, 2023
@roeap roeap marked this pull request as ready for review February 13, 2023 06:46
@tustvold
Copy link

tustvold commented Feb 13, 2023

I seem to vaguely remember discussing this at some point, but am not sure anymore. Is this something we should look into in object-store?

object_store should not return directories in listing responses, it follows an object store / KV model and not a filesystem model. If one of the implementations is returning directories that is a bug. I'm not sure if this is what is going on here though?

@roeap
Copy link
Collaborator Author

roeap commented Feb 13, 2023

From what I could tell happend. If you have a store with a single object file at the root. Using list (or list with prefix) like

path = Path::from("file")
store.list(Some(path))

I think the remote stores will return a list containing that single object, whereas LocalFileSystem will return an empty list.

@tustvold
Copy link

tustvold commented Feb 13, 2023

Both should return an empty list, that may be a bug, I'll take a look

Edit: filed a bug report upstream, thank you for reporting - apache/arrow-rs#3712

@roeap roeap marked this pull request as draft February 13, 2023 19:35
@roeap
Copy link
Collaborator Author

roeap commented Feb 13, 2023

turning this back into a draft, since the upstream fixes are in the works ...

@roeap roeap marked this pull request as ready for review February 14, 2023 14:07
@roeap roeap changed the title fix: file object info for delta fs handler fix: avoid some allocations in DeltaStorageHandler Feb 14, 2023
@wjones127 wjones127 merged commit 48f2e70 into delta-io:main Feb 22, 2023
@roeap roeap deleted the fix-delta-fs branch February 22, 2023 07:07
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description

Fixes an error in the `DeltaFileSystemHandler`, when reading file
metadata from remote storages. Due to an inconsistency between the
behaviour object stores when invoking list operations on a path that
points to a file, we incorrectly returned an Directory type for files in
case of object stores. The bug only surfaced when using pyarrow < 9,
since we used the call only when getting the file size, which we avoid
when using more recent pyarrow versions.

@tustvold - I seem to vaguely remember discussing this at some point,
but am not sure anymore. Is this something we should look into in
object-store?

Update: validated locally, that the upstream fixes will fix the linked
issue, so the main reason for this PR is resolved elsewhere. There are
some changes included which safe us some allocation (admittedly very
few), but hopefully an improvement anyhow.

# Related Issue(s)

closes delta-io#1109 

# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
3 participants