Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: bump to datafusion 39, arrow 52, pyo3 0.21 #2581

Merged
merged 6 commits into from
Jun 11, 2024

Conversation

abhiaagarwal
Copy link
Contributor

Description

Updates the arrow and datafusion dependencies to 52 and 39(-rc1) respectively. This is necessary for updating pyo3.

While most changes with trivial, some required big rewrites. Namely, the logic for the Updates operation had to be rewritten (and simplified) to accommodate some new sanity checks inside datafusion: (apache/datafusion#10088).

Depends on delta-kernel having its arrow and object-store version bumped as well. This PR doesn't include any major changes for pyo3, I'll open a separate PR depending on this PR.

Related Issue(s)

Documentation

@abhiaagarwal abhiaagarwal changed the title chore: bump to datafusion 39, arrow 52 chore: bump to datafusion 39, arrow 52, pyo3 0.21 Jun 9, 2024
@abhiaagarwal abhiaagarwal marked this pull request as ready for review June 9, 2024 19:29
@abhiaagarwal
Copy link
Contributor Author

This is now ready for review. All python and rust tests are passing locally on my machine (docs seem to have issues building, though). This only enables the gil-refs feature in the python bindings, it does not migrate to the new API (will be done in a follow-up PR)

@@ -276,33 +272,6 @@ impl<'a> Display for SqlFormat<'a> {
write!(f, "{expr} IN ({})", expr_vec_fmt!(list))
}
}
Expr::GetIndexedField(GetIndexedField { expr, field }) => match field {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being removed?

Copy link
Contributor Author

@abhiaagarwal abhiaagarwal Jun 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those functions don't exist anymore, and based on my understanding from apache/datafusion#10568, those rewrites happen earlier now so the functionality stays the same, it's just done by DF

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if we can catch the new functions and keep the old display formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, I'm not entirely sure this is possible. Those operations have been converted into ScalarFunctions, so the concept of "taking an index of a field" is now "a scalar function that operates on a field and an index". Had to modify a bunch of tests to accommodate.

simple!(
col("_struct").field("a").eq(lit(20_i64)),
"get_field(_struct, 'a') = 20".to_string()
),

It could be possible manually match on scalar functions and convert them into the old style, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it would make the roundtrippable format closer to SQL, but I guess with the timestamp parsing I already kind of stopped doing that 😆

crates/core/src/operations/update.rs Show resolved Hide resolved
python/src/filesystem.rs Show resolved Hide resolved
python/src/filesystem.rs Outdated Show resolved Hide resolved
ion-elgreco
ion-elgreco previously approved these changes Jun 9, 2024
@ion-elgreco
Copy link
Collaborator

Nice work! @abhiaagarwal

@abhiaagarwal
Copy link
Contributor Author

Thank you @ion-elgreco! DF 39 should be released later this week, and the upstream delta-kernel needs to be resolved as well. I'll open a follow-up PR for moving to the new pyo3 lifetimes, and I'm also interested in implementing an async-native API using pyo3-asyncio since a good chunk of my work code can benefit from it :)

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Jun 9, 2024

@abhiaagarwal yeah leaving it unmerged for now. Also checking with rest on slack if they are fine with keeping the git reference of delta-kernel-rs for time being.

Great, you can ping me when you got the new bound APIs working!

Just two things regarding pyo3-asyncio. It seems the original maintainer is not very active anymore, so it's still locked to pyo3 0.20 unfortunately.

The other thing is, where do you need it for? :) as most of the async stuff is internal

@abhiaagarwal
Copy link
Contributor Author

@ion-elgreco yeah, but it appears that the maintainer of pyo3 is interested in taking stewardship over it, so maybe that'll be resolved soon.

I'm mostly interested to see if there are any perf improvements. Most of the code I use with the python deltalake package in already is async native (ie. async main), having async methods won't "infect" the rest of our codebase. We use it as a pseudo backend with FastAPI, for reference. There appears to be work being done in the main pyo3 repo that'll allow the gil to be released in async methods, so we could have the best of both worlds.

@ion-elgreco
Copy link
Collaborator

@abhiaagarwal can you bump delta-kernel-rs to 0.1.1, it includes your changes now. And also bump the python cargo toml to 0.18.1?

@abhiaagarwal
Copy link
Contributor Author

@ion-elgreco sure, I'll do it by EOD. Also will bump DF since they released 39

Cargo.toml Outdated
@@ -26,33 +26,33 @@ debug = true
debug = "line-tables-only"

[workspace.dependencies]
delta_kernel = { version = "0.1" }
delta_kernel = { git = "https://github.com/abhiaagarwal/delta-kernel-rs", branch = "arrow_52" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should now be updated to 0.1.1

@abhiaagarwal
Copy link
Contributor Author

Bumped to published versions, also added a new test that seemingly confirms that the issues related to decimals in #1778 et al on both engines seem to be fixed

@ion-elgreco
Copy link
Collaborator

Bumped to published versions, also added a new test that seemingly confirms that the issues related to decimals in #1778 et al on both engines seem to be fixed

Yes I was waiting for some time for the arrow bump since they added Scientific notation support for decimals :)

Thanks for adding the test! 🙂

@ion-elgreco ion-elgreco enabled auto-merge (squash) June 11, 2024 18:44
@ion-elgreco ion-elgreco merged commit 57795da into delta-io:main Jun 11, 2024
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants