Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: integrate with object_store / datafusion APIs #703

Merged
merged 31 commits into from
Aug 1, 2022

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Jul 18, 2022

Description

Using the latest arrow (18) and datafusion (10) dependencies.

supersedes #666
part of #610

This PR moves most file path handling to use the Path abstraction from object_store. We do so by way of moving our internal StorageBackend behind a new DeltaObjectStore, which implements ObjectStore. The delta store exposes all path as relative to the table root, consistent with how they are treated in the log.

To integrate with datafusion, a table-specific specific store is registered on the runtime environment. Datafusion required get_range to be defined, which I did for local using the new store. This was just to get datafusion tests to pass - this needs to be improved in a follow up :).

One major change that is implied by Path is, that for local file system the table root folder has to exist now prior to creating the table, but can be empty.

The most annoying aspect was handling windlows path for local. but I tried to keep all special handling contained in DeltaObjectStore and suspect most of it will disappear, once we move the storage backends to object_store.

  • also bumps arrow (18) and datafusion (10)

Last but not least, I found that the Azure paths are somewhat of an oddity compared to other - will open a follow up for that :).

@Blajda - this should also remove the delete issues with vacuum. I activated them in windows and it seems to work

cc @Tom-Newton

Related Issue(s)

closes #671

Documentation

@roeap
Copy link
Collaborator Author

roeap commented Jul 19, 2022

@houqp @wjones127 - latest datafusion is already using the Path abstraction from object_strore_rs as well as the specific "view" on what an object store is. For the unix systems quickly hacked something together to make it work. For windows paths its still failing.

However since #696 has also been raised, I was wondering if we should start to integrate with the object_Store_rs crate. While its too early to use the object store itself just yet, maybe we should start to introduce the Path abstraction along with the implications on how we handle paths in general.

Here I could either switch the focus of this PR - or maybe skip datafusion tests on windows and do a follow up.

Of course I could also go into handling windows paths, but right now this feels very temporary ... Thoughts?

@wjones127
Copy link
Collaborator

While its too early to use the object store itself just yet, maybe we should start to introduce the Path abstraction along with the implications on how we handle paths in general.

If it seems feasible, that seems like it's likely a good idea. I've been thinking trying to integrate object_store_rs entirely in 1 PR would be too much. So if we can migrate the Paths part first, I think that's a good idea.

@houqp
Copy link
Member

houqp commented Jul 21, 2022

yeah +1 on incrementally migrate to object_store_rs

@roeap roeap changed the title chore: bump arrow and datafusion feat: integrate with object_store / datafusion APIs Jul 31, 2022
@roeap roeap requested review from houqp and wjones127 July 31, 2022 19:31
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, except for two small things that I noticed.

This is a good start to integrating the object store.

Comment on lines 35 to +37
std::fs::create_dir_all(path).unwrap();
std::fs::remove_dir_all(path).unwrap();
std::fs::create_dir_all(path).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is happening here?

Copy link
Collaborator Author

@roeap roeap Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question :D! Due to Path always canonicalizing all local path, they must exists - but I guess one create_all is enough...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out we needed the first crate all - remove all will panic if the folders do not exist (i.,e. in CI), but we still need empty folders, thus the second one. Added it back since it is "just" test code, but the proper solution would be to clean up directly after the tests.

Comment on lines 268 to 277
/// Move an object from one path to another in the same object store.
///
/// By default, this is implemented as a copy and then delete source. It may not
/// check when deleting source that it was the same object that was originally copied.
///
/// If there exists an object at the destination, it will be overwritten.
async fn rename(&self, from: &Path, to: &Path) -> ObjectStoreResult<()> {
self.copy(from, to).await?;
self.delete(from).await
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, you can skip including this and use the default implementation.

@roeap roeap requested a review from wjones127 August 1, 2022 06:23
@roeap roeap merged commit 8fe368f into delta-io:main Aug 1, 2022
@roeap roeap deleted the bump-arrows branch August 1, 2022 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'list_objs` behaviour is not uniform among backends
3 participants