Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow projection of schemas/structs #38615

Open
Fokko opened this issue Nov 6, 2023 · 3 comments
Open

Allow projection of schemas/structs #38615

Fokko opened this issue Nov 6, 2023 · 3 comments

Comments

@Fokko
Copy link
Contributor

Fokko commented Nov 6, 2023

Describe the enhancement requested

For PyIceberg recently, concatenation of tables has been added: #36846 To add new fields I concat the requested schema with the data that was loaded. However, now I'm hitting the next barrier, unable to project the schemas of nested structs.

Bit of context. For the top-level schema it is not an issue because we can select the columns that we need when reading in the table, but it doesn't allow selection of nested columns.

Selecting a subset:

Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
>>> tbl = pa.Table.from_pylist(
...     [
...         {"x": 52.371807, "y": 4.896029},
...         {"x": 52.387386, "y": 4.646219},
...         {"x": 52.078663, "y": 4.288788},
...     ],
...     schema=current_schema,
... )
>>> schema_with_z = pa.schema(
...     [
...         pa.field("x", pa.float32()),
...     ]
... )
>>> tbl.cast(schema_with_z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Or in a nested struct:

Desktop python3
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> current_schema = pa.schema(
...     pa.field(
...         "location",
...         pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
...     )
... )
>>> 
>>> tbl = pa.Table.from_pylist(
...     [
...         {"location": {"x": 52.371807, "y": 4.896029}},
...         {"location": {"x": 52.387386, "y": 4.646219}},
...         {"location": {"x": 52.078663, "y": 4.288788}},
...     ],
...     schema=current_schema,
... )
>>> schema_without_x = pa.schema(
...     pa.field(
...         "location",
...         pa.struct(
...             [
...                 pa.field("x", pa.float32()),
...             ]
...         ),
...     )
... )
>>> tbl.cast(schema_without_x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']

Any thoughts on adding this? Or can we achieve this in another way?

Component(s)

Python

@lidavidm
Copy link
Member

lidavidm commented Nov 6, 2023

There's the pyarrow.compute.struct_field kernel which can do these sorts of things for nested structs, but that is much lower level than what you're after here

I'm not quite sure casting is the right name to give this operation, but I think there is room to add an operation like what you're proposing (whether it's called casting or not)

@Fokko
Copy link
Contributor Author

Fokko commented Nov 6, 2023

Thanks again for the quick response @lidavidm

The current operation that does this is called .select() which accepts names and indices. However, this would require traversing the schema for finding all the nested structs (and I'd rather not do it in Python). Also, if there are any valid promotions, another cast would be needed. Therefore I think it makes sense to make it part of the .cast operation.

@Fokko
Copy link
Contributor Author

Fokko commented Jan 5, 2024

Still running into this. I would expect something like below to work:

In [1]: import pyarrow as pa

In [2]: current_schema = pa.schema([
   ...: ^Ipa.field("x", pa.float32()),
   ...: ^Ipa.field("y", pa.float32())
   ...: ])
   ...: 
   ...: tbl = pa.Table.from_pylist(
   ...:     [
   ...:         {"x": 52.371807, "y": 4.896029},
   ...:         {"x": 52.387386, "y": 4.646219},
   ...:         {"x": 52.078663, "y": 4.288788},
   ...:     ],
   ...:     schema=current_schema,
   ...: )
   ...: 
   ...: schema_with_z = pa.schema(
   ...:     [
   ...: ^I^Ipa.field("x", pa.float32()),
   ...: ^I^Ipa.field("y", pa.float32()),
   ...:         pa.field("x", pa.float32()),
   ...:     ]
   ...: )
   ...: 
   ...: tbl.cast(schema_with_z)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 23
      6 tbl = pa.Table.from_pylist(
      7     [
      8         {"x": 52.371807, "y": 4.896029},
   (...)
     12     schema=current_schema,
     13 )
     15 schema_with_z = pa.schema(
     16     [
     17 		pa.field("x", pa.float32()), 
   (...)
     20     ]
     21 )
---> 23 tbl.cast(schema_with_z)

File /opt/homebrew/lib/python3.11/site-packages/pyarrow/table.pxi:3793, in pyarrow.lib.Table.cast()

ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x', 'y', 'x']

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants