Getting error when converting a partitioned parquet table to delta table #2626

Closed
AngeloFrigeri opened this issue Jun 26, 2024 · 2 comments
Labels: bug


@AngeloFrigeri

Environment

Delta-rs version: 0.18.1

Binding: python 3.8.19

Environment:

  • Cloud provider: AWS
  • OS: Ubuntu 22.04
  • Other:

Bug

What happened: When converting a partitioned Parquet table to a Delta table, I got the following error:

File ~/Envs/general-p38/lib/python3.8/site-packages/deltalake/writer.py:610, in convert_to_deltalake(uri, mode, partition_by, partition_strategy, name, description, configuration, storage_options, custom_metadata)
    607 if mode == "ignore" and try_get_deltatable(uri, storage_options) is not None:
    608     return
--> 610 _convert_to_deltalake(
    611     str(uri),
    612     partition_by,
    613     partition_strategy,
    614     name,
    615     description,
    616     configuration,
    617     storage_options,
    618     custom_metadata,
    619 )
    620 return

DeltaError: Generic error: The schema of partition columns must be provided to convert a Parquet table to a Delta table

What you expected to happen: A _delta_log folder to be created on our S3 path.

How to reproduce it:

import pyarrow
from deltalake import convert_to_deltalake

s3_storage_options = {
    "AWS_ACCESS_KEY_ID": "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
convert_to_deltalake(
    uri=f"s3://{BUCKET}/{PREFIX}_0.18.1/",
    storage_options=s3_storage_options,
    partition_by=pyarrow.schema(
        [
            pyarrow.field("product", pyarrow.string()),
        ]
    ),
    partition_strategy="hive",
)
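
For diagnosis, a minimal sketch (not taken from the original report, reusing the placeholder credentials and the BUCKET/PREFIX variables from the snippet above) to inspect which hive partition columns pyarrow actually discovers under the same prefix before converting:

# Sketch: verify the discovered hive partition columns match what is passed
# to partition_by. Credentials and BUCKET/PREFIX are placeholders.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(
    access_key="AWS_ACCESS_KEY_ID",      # placeholder
    secret_key="AWS_SECRET_ACCESS_KEY",  # placeholder
)
dataset = ds.dataset(
    f"{BUCKET}/{PREFIX}_0.18.1/",        # S3FileSystem paths omit the s3:// scheme
    filesystem=s3,
    format="parquet",
    partitioning="hive",
)
# With hive partitioning discovery, partition columns appear in the dataset
# schema; "product" should be listed here with its inferred type.
print(dataset.schema)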

More details:

@AngeloFrigeri added the bug label on Jun 26, 2024
@sherlockbeard (Contributor)

Tried with the code below; not able to reproduce locally. Maybe it's specific to S3.
@AngeloFrigeri can you check that the partition column is correct in your code and that the issue isn't there?

import pandas as pd
import pyarrow as pa
from deltalake import convert_to_deltalake

df = pd.DataFrame(data={'blaaPara': ['a', 'a', 'b'],
                        'year':  [2020, 2020, 2021],
                        'month': [1, 12, 2],
                        'day':   [1, 31, 28],
                        'value': [1000, 2000, 3000]})

df.to_parquet('./mydf', partition_cols=['blaaPara'])

convert_to_deltalake(
    './mydf',
    partition_by=pa.schema(
        [
            pa.field("blaaPara", pa.string()),
        ]
    ),
    partition_strategy="hive",
)
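
As a quick follow-up check (not part of the original comment), the local conversion can be verified by reading the table back and confirming the partition column was registered:

# Check that convert_to_deltalake created ./mydf/_delta_log and recorded
# the partition column in the table metadata.
from deltalake import DeltaTable

dt = DeltaTable('./mydf')
print(dt.metadata().partition_columns)  # expected: ['blaaPara']
print(dt.to_pandas())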

@ion-elgreco (Collaborator)

Can't reproduce this either, so closing this one. If you can provide an MRE that reproduces it, we can have a look, but as far as the error message indicates, you didn't provide the correct partition columns.

@ion-elgreco closed this as not planned on Jul 23, 2024