Unable to use deltalake Schema in write_deltalake #1862

Closed
LoicRaillon opened this issue Nov 14, 2023 · 10 comments · Fixed by #1922
Labels: binding/python (Issues for the Python package), enhancement (New feature or request)

@LoicRaillon

Environment

Delta-rs version: 0.13.0
Binding: Python 3.11
OS: Windows 10 WSL2 (Ubuntu 22.04.02 LTS)


Bug

I want to use a deltalake.schema.Schema instance in the function write_deltalake(schema=...) in order to control the nullable parameter and add metadata. However, a pyarrow.lib.Schema type is expected instead of deltalake._internal.Schema. This issue is reproducible with both Polars and Pandas, as shown in the snippet below:

import json

import polars as pl
import pandas as pd
from deltalake import write_deltalake
from deltalake.schema import Schema


schema = Schema.from_json(
    json.dumps(
        {
            "type": "struct",
            "fields": [
                {"name": "x", "type": "integer", "nullable": True, "metadata": {"description": "variable x"}},
                {"name": "y", "type": "float", "nullable": False, "metadata": {"description": "variable y"}},
            ],
        }
    )
)

data = {"x": [1, 2], "y": [3.0, 4.0]}

try:
    write_deltalake(
        "data/test_schema",
        data=pl.DataFrame(data),
        name="test",
        schema=schema,
        description="A test",
        mode="overwrite",
    )
except TypeError as e:
    print(f"Polars -> {e}")

try:
    write_deltalake(
        "data/test_schema",
        data=pd.DataFrame(data),
        name="test",
        schema=schema,
        description="A test",
        mode="overwrite",
    )
except TypeError as e:
    print(f"Pandas -> {e}")

Output:

Polars -> Argument 'schema' has incorrect type (expected pyarrow.lib.Schema, got deltalake._internal.Schema)
Pandas -> Argument 'schema' has incorrect type (expected pyarrow.lib.Schema, got deltalake._internal.Schema)

@LoicRaillon LoicRaillon added the bug Something isn't working label Nov 14, 2023
@r3stl355
Contributor

take

@r3stl355
Contributor

I'm not actually looking into this right now (maybe I will later); I just tried the take comment action.

@r3stl355 r3stl355 removed their assignment Nov 15, 2023
@r3stl355
Contributor

I wouldn't call this a bug as the write_deltalake definition clearly states it's accepting a pyarrow.Schema. On top of that, deltalake._internal.Schema has a to_pyarrow function which you can use on your Schema before passing to write_deltalake.

@ion-elgreco
Collaborator

@r3stl355 I think there is a method call to convert a delta schema to pyarrow, so perhaps we should allow it to take both inputs

@r3stl355
Contributor

r3stl355 commented Nov 15, 2023

yep @ion-elgreco, deltalake._internal.Schema.to_pyarrow as I mentioned earlier. I can extend the function to take either if you think it's useful

@ion-elgreco
Collaborator

@r3stl355 ah my bad, read too fast over your post

@r3stl355
Contributor

take

@LoicRaillon
Author

LoicRaillon commented Nov 16, 2023

I missed the fact that nullable and metadata are stored as attributes of the fields of the pyarrow.Schema. Should I close this issue, or is it still pertinent?

@r3stl355
Contributor

I don't know, maybe remove the bug label for now. What do you think @ion-elgreco - should I extend write_deltalake to accept either of the Schema types?

@ion-elgreco ion-elgreco added enhancement New feature or request binding/python Issues for the Python package and removed bug Something isn't working labels Nov 16, 2023
@ion-elgreco
Collaborator

@r3stl355 yes makes sense to add
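
The idea agreed on here can be sketched roughly as follows. This is not the code merged for this issue. The helper name to_arrow_schema and the stand-in class FakeDeltaSchema are hypothetical, and the sketch uses duck typing on the to_pyarrow method mentioned earlier in the thread so that it runs without deltalake installed.

```python
from typing import Any, Optional


def to_arrow_schema(schema: Optional[Any]) -> Optional[Any]:
    """Hypothetical helper: return a PyArrow-style schema for either input type.

    Objects exposing a callable to_pyarrow() (as the thread says the deltalake
    Schema does) are converted; anything else is passed through unchanged.
    """
    if schema is None:
        return None
    convert = getattr(schema, "to_pyarrow", None)
    if callable(convert):
        return convert()
    return schema


class FakeDeltaSchema:
    """Stand-in for deltalake.schema.Schema, used only for this demo."""

    def to_pyarrow(self) -> str:
        return "converted-pyarrow-schema"


print(to_arrow_schema(FakeDeltaSchema()))  # goes through to_pyarrow()
print(to_arrow_schema(None))               # None stays None
```

A writer function could call such a helper once at the top and keep the rest of its logic working on a single schema type. Until a release with that change is available, calling to_pyarrow() on the deltalake Schema before passing it in, as suggested above, achieves the same thing.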

@ion-elgreco ion-elgreco added this to the python v0.14 milestone Nov 22, 2023
ion-elgreco pushed a commit that referenced this issue Nov 29, 2023
# Description
A second attempt to extend the write_deltalake to accept either PyArrow
or Deltalake schema (messed up the previous PR with some rebase issues)
Added a test

# Related Issue(s)
closes #1862

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
ion-elgreco pushed a commit to ion-elgreco/delta-rs that referenced this issue Dec 1, 2023