
[CHORE] Ensure compatibility with deltalake version v0.19 #2827

Merged: 10 commits into main from kevin/deltalake-v0.19, Sep 11, 2024

Conversation

kevinzwang
Member

@kevinzwang kevinzwang commented Sep 10, 2024

Deltalake v0.19 changes its `_convert_pa_schema_to_delta` function to take a `schema_conversion_mode` argument instead of `large_dtypes`. PyArrow also needed to be upgraded to v16.0.0 to be compatible with the new version of deltalake.

The difference between this PR and #2754 is that it still maintains compatibility with older deltalake versions.
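One way such version gating can work is sketched below. This is illustrative only, not Daft's exact implementation: the version-parsing approach and the `"large"`/`"normal"` mode strings are assumptions based on deltalake v0.19's API, and the sketch takes the version string as a parameter rather than reading `deltalake.__version__` so it is easy to test in isolation.

```python
# Illustrative sketch of version-gated kwargs for deltalake's
# _convert_pa_schema_to_delta. The mode strings ("large"/"normal") and the
# version-parsing approach are assumptions, not Daft's exact code.
def large_dtypes_kwargs(deltalake_version: str, large_dtypes: bool) -> dict:
    """Return the keyword arguments appropriate for the installed deltalake version."""
    major, minor, *_ = (int(p) for p in deltalake_version.split("."))
    if (major, minor) >= (0, 19):
        # v0.19+ replaced the boolean flag with a schema_conversion_mode string
        return {"schema_conversion_mode": "large" if large_dtypes else "normal"}
    # older versions take the boolean flag directly
    return {"large_dtypes": large_dtypes}
```

In the real fix, a helper like this would inspect the installed deltalake version once and then be splatted into the call, e.g. `_convert_pa_schema_to_delta(schema, **large_dtypes_kwargs(...))`.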

The reason this PR also includes a change to arrow2 is that, starting in version 0.19, deltalake uses arrow-rs by default instead of pyarrow to write files when calling `deltalake.write_deltalake`. We do not actually use this functionality, but our tests do, and arrow-rs writes map arrays in a way that does not conform to the Parquet spec. I figured it would be good to add that compatibility just in case some user is using arrow-rs to write their Parquet files.

However, there are other issues with deltalake's Rust writer, including improper encoding of partitioned binary columns, so we will use their pyarrow writer for testing.

@github-actions github-actions bot added the chore label Sep 10, 2024

codspeed-hq bot commented Sep 10, 2024

CodSpeed Performance Report

Merging #2827 will degrade performance by 56.51%

Comparing kevin/deltalake-v0.19 (97e651d) with main (4582192)

Summary

❌ 1 regression
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | kevin/deltalake-v0.19 | Change |
| --- | --- | --- | --- |
| `test_count[1 Small File]` | 10.6 ms | 24.4 ms | -56.51% |

@kevinzwang kevinzwang changed the title [CHORE] Upgrade deltalake version to v0.19.2 [CHORE] Ensure compatibility with deltalake version to v0.19.2 Sep 10, 2024
@kevinzwang kevinzwang changed the title [CHORE] Ensure compatibility with deltalake version to v0.19.2 [CHORE] Ensure compatibility with deltalake version v0.19 Sep 10, 2024

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 64.04%. Comparing base (2a89d0d) to head (97e651d).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| daft/io/_deltalake.py | 88.88% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #2827      +/-   ##
==========================================
+ Coverage   63.37%   64.04%   +0.66%     
==========================================
  Files        1010     1007       -3     
  Lines      114165   112948    -1217     
==========================================
- Hits        72355    72334      -21     
+ Misses      41810    40614    -1196     
| Files with missing lines | Coverage Δ |
| --- | --- |
| daft/dataframe/dataframe.py | 86.27% <100.00%> (+0.03%) ⬆️ |
| daft/io/__init__.py | 95.83% <100.00%> (ø) |
| daft/table/table_io.py | 85.45% <100.00%> (+0.03%) ⬆️ |
| src/arrow2/src/io/parquet/read/schema/convert.rs | 95.84% <100.00%> (+0.12%) ⬆️ |
| daft/io/_deltalake.py | 77.50% <88.88%> (ø) |

... and 75 files with indirect coverage changes

daft/dataframe/dataframe.py: outdated review comments (resolved)
daft/table/table_io.py: outdated review comments (resolved)
@kevinzwang kevinzwang enabled auto-merge (squash) September 11, 2024 18:48
@kevinzwang kevinzwang merged commit 7048b97 into main Sep 11, 2024
34 of 35 checks passed
@kevinzwang kevinzwang deleted the kevin/deltalake-v0.19 branch September 11, 2024 19:11
@fecet

fecet commented Sep 12, 2024

I'm running into this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 df_with_image.write_deltalake("/data/delta/sample1")

File /opt/conda/lib/python3.10/site-packages/daft/api_annotations.py:26, in DataframePublicAPI.<locals>._wrap(*args, **kwargs)
     24 type_check_function(func, *args, **kwargs)
     25 timed_method = time_df_method(func)
---> 26 return timed_method(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/daft/analytics.py:198, in time_df_method.<locals>.tracked_method(*args, **kwargs)
    195 @functools.wraps(method)
    196 def tracked_method(*args, **kwargs):
    197     if _ANALYTICS_CLIENT is None:
--> 198         return method(*args, **kwargs)
    200     start = time.time()
    201     try:

File /opt/conda/lib/python3.10/site-packages/daft/dataframe/dataframe.py:841, in DataFrame.write_deltalake(self, table, mode, schema_mode, name, description, configuration, custom_metadata, dynamo_table_name, io_config)
    838 pyarrow_schema = pa.schema((f.name, f.dtype.to_arrow_dtype()) for f in self.schema())
    840 large_dtypes = True
--> 841 delta_schema = _convert_pa_schema_to_delta(pyarrow_schema, **large_dtypes_kwargs(large_dtypes))
    843 if table:
    844     table.update_incremental()

File /opt/conda/lib/python3.10/site-packages/deltalake/schema.py:139, in _convert_pa_schema_to_delta(schema, schema_conversion_mode)
    136     fields_cast = [f.with_type(dtype_to_delta_dtype(f.type)) for f in fields]
    137     return pa.struct(fields_cast)
--> 139 return pa.schema([f.with_type(dtype_to_delta_dtype(f.type)) for f in schema])

File /opt/conda/lib/python3.10/site-packages/deltalake/schema.py:139, in <listcomp>(.0)
    136     fields_cast = [f.with_type(dtype_to_delta_dtype(f.type)) for f in fields]
    137     return pa.struct(fields_cast)
--> 139 return pa.schema([f.with_type(dtype_to_delta_dtype(f.type)) for f in schema])

File /opt/conda/lib/python3.10/site-packages/deltalake/schema.py:99, in _convert_pa_schema_to_delta.<locals>.dtype_to_delta_dtype(dtype)
     97     return pa.binary()
     98 try:
---> 99     return dtype_map[dtype]
    100 except KeyError:
    101     return dtype

TypeError: unhashable type: 'DaftExtension'

with deltalake 0.19.2
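For context on the mechanism behind this traceback: `dtype_to_delta_dtype` looks the dtype up in a plain dict, and a dict lookup with an unhashable key raises `TypeError` before the `except KeyError` fallback can apply. A minimal stand-in (the class and `dtype_map` contents below are hypothetical, purely to illustrate the failure mode, not deltalake's actual table):

```python
# Minimal illustration: an unhashable key raises TypeError on dict lookup,
# so the `except KeyError` pass-through never gets a chance to run.
class FakeExtensionDtype:
    __hash__ = None  # explicitly unhashable, like the extension dtype here

dtype_map = {"int64": "long"}  # hypothetical stand-in for deltalake's dtype_map

def dtype_to_delta_dtype(dtype):
    try:
        return dtype_map[dtype]  # raises TypeError if dtype is unhashable
    except KeyError:
        return dtype  # unknown but hashable dtypes pass through unchanged

try:
    dtype_to_delta_dtype(FakeExtensionDtype())
except TypeError as e:
    print(e)  # unhashable type: 'FakeExtensionDtype'
```

This suggests the fix needs to catch extension dtypes before they reach the dict lookup (or hand them to the converter in an already-supported form).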

@kevinzwang
Member Author

kevinzwang commented Sep 12, 2024

Hi @fecet, would you mind creating a GitHub issue for this? Also, could you check whether this worked in the current release version of Daft with deltalake 0.18.2? It would also help to see more details about the query you are running.

desmondcheongzx pushed a commit to desmondcheongzx/Daft that referenced this pull request Sep 16, 2024
[CHORE] Ensure compatibility with deltalake version v0.19 (Eventual-Inc#2827)

Deltalake v0.19 changes their `_convert_pa_schema_to_delta` function to
take in a `schema_conversion_mode` instead of `large_dtypes`. Pyarrow
also needed to be upgraded to v16.0.0 as well to be compatible with the
new version of deltalake

The difference between this PR and Eventual-Inc#2754 is that it still maintains
compatibility with older deltalake versions.

The reason why this PR also includes a change to arrow2 is because
starting in version 0.19, deltalake uses arrow-rs by default instead of
pyarrow to write files when calling `deltalake.write_deltalake`. We do
not actually use this functionality but our tests do, and arrow-rs
writes map arrays in a way that does not conform to the parquet spec. I
figured it would be good to just add that compatibility in there just in
case some user is using `arrow-rs` to write their parquet files.

However, there are also other issues with deltalake's rust writer,
including improper encoding of partitioned binary columns, so we will
use their pyarrow writer for testing.