optimize.compact() fails with bad schema after updating to pyarrow 8.0 #1889

Closed
sebdiem opened this issue Nov 20, 2023 · 4 comments · Fixed by #1926
Labels
bug Something isn't working

Comments

@sebdiem
Contributor

sebdiem commented Nov 20, 2023

Environment

Delta-rs version: 0.12.0

Binding: python

Environment:

  • OS: macOS Sonoma 14.1.1 (Apple M2)

Bug

What happened: When I perform an optimize.compact() operation following an append to a table, I encounter an exception. Oddly, this optimize.compact() succeeds after the initial append to an empty table but fails after subsequent appends. This is the exception I get:

_internal.DeltaError: Data does not match the schema or partitions of the table: Unexpected Arrow schema: got: Field { name: "name", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "surname", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, expected: Field { name: "name", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "surname", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }

The diff in the exception shows that fields expected to be non-nullable are marked nullable in the Arrow schema of the data being compacted.
The issue does not manifest when using pyarrow == 7.0.0.
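Note that converting a pandas DataFrame to Arrow marks every column nullable by default, which may be where the nullable fields come from; a minimal illustration:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame([{"name": "john", "surname": "doe"}])
tbl = pa.Table.from_pandas(df)
for field in tbl.schema:
    print(field.name, field.nullable)  # prints True for both columns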

What you expected to happen: The optimize.compact() operation should work.

How to reproduce it:

# The problem exists with pyarrow >= 8.0 and deltalake == 0.12.0

import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake
from deltalake.schema import Field, Schema

path = "/tmp/reproduce_delta"

delta_schema = Schema(
    [
        Field("name", "string", nullable=False),
        Field("surname", "string", nullable=False),
    ]
)
pyarrow_schema = pa.schema(field.to_pyarrow() for field in delta_schema.fields)

try:
    table = DeltaTable(path)
except Exception:  # table does not exist yet; create it below
    write_deltalake(
        path,
        data=[],
        schema=pyarrow_schema,
        name="test_table",
        partition_by=[],
        mode="error",
        overwrite_schema=False,
        configuration={
            "delta.isolationLevel": "Serializable",
            "delta.enableChangeDataFeed": "false",
            "delta.autoOptimize.autoCompact": "true",
            "delta.autoOptimize.optimizeWrite": "true",
            "delta.targetFileSize": str(128 * 1024 * 1024),
            "delta.deletedFileRetentionDuration": "interval 7 day",
            "delta.logRetentionDuration": "interval 30 day",
        },
    )

table = DeltaTable(path)

for i in range(2):
    write_deltalake(
        table,
        pd.DataFrame(
            [
                {"name": "john", "surname": "doe"},
                {"name": "johny", "surname": "doey"},
            ]
        ),
        schema=pyarrow_schema,
        mode="append",
    )

    print(f"Attempting compaction after append #{i}")
    table.optimize.compact()
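
On an affected setup (pyarrow >= 8.0, deltalake 0.12.0) the first compaction succeeds and the second raises the DeltaError quoted above.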


@sebdiem sebdiem added the bug Something isn't working label Nov 20, 2023
@ion-elgreco
Collaborator

Can you share a sample dataframe so that I can reproduce it?

@sebdiem
Contributor Author

sebdiem commented Nov 22, 2023

The code provided contains a very simple pandas DataFrame that illustrates the problem (at least in my environment).

@ion-elgreco
Collaborator

@sebdiem I have a fix in the pipeline for this.

@sebdiem
Contributor Author

sebdiem commented Nov 30, 2023

Great news! Thanks a lot.

ion-elgreco added a commit that referenced this issue Dec 2, 2023
…ed large/normal arrow (#1926)

# Description
- Fixes optimize.compact not working when a table has parquet files with
large and normal arrow types. Basically it casts the record batches to
normal arrow types.

# Issues
- closes #1889
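
A rough sketch of that cast using pyarrow's Table.cast (illustrative only, not the code from the PR):

import pyarrow as pa

# A record batch read back with "large" Arrow types, as newer pyarrow may produce.
large = pa.table({"name": pa.array(["john"], type=pa.large_string())})

# Target schema using the normal (non-large) variants.
target = pa.schema([pa.field("name", pa.string())])

print(large.cast(target).schema)  # name: string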