Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Incompatibility with deltalake v0.19.0 and above - failure in write_deltalake() #2753 #2754

Closed

Conversation

igor-pechersky
Copy link

@igor-pechersky igor-pechersky changed the title Incompatibility with deltalake v0.19.0 and above - failure in write_deltalake() #2753 [BUG] Incompatibility with deltalake v0.19.0 and above - failure in write_deltalake() #2753 Aug 28, 2024
@github-actions github-actions bot added the bug Something isn't working label Aug 28, 2024
@igor-pechersky
Copy link
Author

this PR conflicts with #2704 and overrides it

Copy link

codspeed-hq bot commented Aug 28, 2024

CodSpeed Performance Report

Merging #2754 will degrade performances by 57.59%

Comparing igor-pechersky:bugfix/deltalake-2753 (2e96ee5) with main (a97d871)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 14 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main igor-pechersky:bugfix/deltalake-2753 Change
test_count[1 Small File] 10.1 ms 23.7 ms -57.59%
test_show[100 Small Files] 55.8 ms 49.7 ms +12.22%

@samster25
Copy link
Member

samster25 commented Aug 29, 2024

hi @igor-pechersky! Thanks for contribution. Do you mind sharing the error you receiving prior to this PR?

Looks like we have a test cast failing because the data schema and table schema differ

Data schema:
a: int64
b: double
c: string
d: binary

vs

Table Schema:
a: int64
b: double
c: large_string
d: large_binary

We may need to still perform a manual cast

@igor-pechersky
Copy link
Author

igor-pechersky commented Sep 1, 2024

#2754 (comment)
hi @samster25

Do you mind sharing the error you receiving prior to this PR?

sure
#2753

We may need to still perform a manual cast

here
dc6c75b

@djouallah
Copy link
Contributor

try large_dtypes=True when writing to delta

@kevinzwang
Copy link
Member

Hi @igor-pechersky I didn't realize that we already had a PR open for this issue before I made #2827 😄

Thanks for contributing this! However I think I would prefer to merge in my PR instead because it maintains compatibility with past deltalake versions.

@kevinzwang
Copy link
Member

Closing this because we are merging in #2827!

@kevinzwang kevinzwang closed this Sep 11, 2024
@jaychia
Copy link
Contributor

jaychia commented Sep 11, 2024

Thanks @igor-pechersky for taking a stab at this!

kevinzwang added a commit that referenced this pull request Sep 11, 2024
Deltalake v0.19 changes their `_convert_pa_schema_to_delta` function to
take in a `schema_conversion_mode` instead of `large_dtypes`. Pyarrow
also needed to be upgraded to v16.0.0 as well to be compatible with the
new version of deltalake

The difference between this PR and #2754 is that it still maintains
compatibility with older deltalake versions.

The reason why this PR also includes a change to arrow2 is because
starting in version 0.19, deltalake uses arrow-rs by default instead of
pyarrow to write files when calling `deltalake.write_deltalake`. We do
not actually use this functionality but our tests do, and arrow-rs
writes map arrays in a way that does not conform to the parquet spec. I
figured it would be good to just add that compatibility in there just in
case some user is using `arrow-rs` to write their parquet files.

However, there are also other issues with deltalake's rust writer,
including improper encoding of partitioned binary columns, so we will
use their pyarrow writer for testing.
desmondcheongzx pushed a commit to desmondcheongzx/Daft that referenced this pull request Sep 16, 2024
…nc#2827)

Deltalake v0.19 changes their `_convert_pa_schema_to_delta` function to
take in a `schema_conversion_mode` instead of `large_dtypes`. Pyarrow
also needed to be upgraded to v16.0.0 as well to be compatible with the
new version of deltalake

The difference between this PR and Eventual-Inc#2754 is that it still maintains
compatibility with older deltalake versions.

The reason why this PR also includes a change to arrow2 is because
starting in version 0.19, deltalake uses arrow-rs by default instead of
pyarrow to write files when calling `deltalake.write_deltalake`. We do
not actually use this functionality but our tests do, and arrow-rs
writes map arrays in a way that does not conform to the parquet spec. I
figured it would be good to just add that compatibility in there just in
case some user is using `arrow-rs` to write their parquet files.

However, there are also other issues with deltalake's rust writer,
including improper encoding of partitioned binary columns, so we will
use their pyarrow writer for testing.
desmondcheongzx pushed a commit to desmondcheongzx/Daft that referenced this pull request Sep 16, 2024
…nc#2827)

Deltalake v0.19 changes their `_convert_pa_schema_to_delta` function to
take in a `schema_conversion_mode` instead of `large_dtypes`. Pyarrow
also needed to be upgraded to v16.0.0 as well to be compatible with the
new version of deltalake

The difference between this PR and Eventual-Inc#2754 is that it still maintains
compatibility with older deltalake versions.

The reason why this PR also includes a change to arrow2 is because
starting in version 0.19, deltalake uses arrow-rs by default instead of
pyarrow to write files when calling `deltalake.write_deltalake`. We do
not actually use this functionality but our tests do, and arrow-rs
writes map arrays in a way that does not conform to the parquet spec. I
figured it would be good to just add that compatibility in there just in
case some user is using `arrow-rs` to write their parquet files.

However, there are also other issues with deltalake's rust writer,
including improper encoding of partitioned binary columns, so we will
use their pyarrow writer for testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants