
failed to map column projection - incompatible data types list field element vs item #31

Open
AlJohri opened this issue Oct 3, 2022 · 3 comments

AlJohri commented Oct 3, 2022

I have a table that reads correctly using the Spark + Delta Lake libraries, but I'm having trouble reading it via pv.

Do you know which downstream dependency could be giving me this error?

Error: ArrowError(ExternalError(Execution("Failed to map column projection for field mycolumn. Incompatible data types List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })")))

I checked the schema from the Delta transaction log and didn't see a hardcoded "item" or "element":

❯ aws s3 cp s3://mybucket/year=2022/month=6/day=9/myprefix/_delta_log/00000000000000000000.json - | head -n 3 | tail -n 1 | jq '.metaData.schemaString | fromjson | .fields[] | select(.name == "mycolumn")'
{
  "name": "mycolumn",
  "type": {
    "type": "array",
    "elementType": "string",
    "containsNull": true
  },
  "nullable": true,
  "metadata": {}
}

When I look at the schema of a sample Parquet file on S3, I do indeed see that the item in the list is called "element":

pqrs schema =(s5cmd cat s3://mybucket/year=2022/month=6/day=9/myprefix/_partition=00001/part-00037-cb2e71c3-4f26-4de0-9e9a-18298489ccdc.c000.snappy.parquet)

...
message spark_schema {
  ...
  OPTIONAL group mycolumn (LIST) {
    REPEATED group list {
      OPTIONAL BYTE_ARRAY element (UTF8);
    }
  }
  ...
}

I see this exact error comes from here: https://github.com/apache/arrow-datafusion/blob/aad82fbb32dc1bb4d03e8b36297f8c9a3148df89/datafusion/core/src/physical_plan/file_format/mod.rs#L253

And I also see that element is hardcoded in delta-rs here:

https://github.com/delta-io/delta-rs/blob/83b8296fa5d55ebe050b022ed583dc57152221fe/rust/src/delta_arrow.rs#L38-L48 (pr: delta-io/delta-rs#228)

But I can't seem to find where the schema mismatch is coming from.

timvw (Owner) commented Oct 4, 2022

Thanks for the feedback!

I've seen this issue pop up in the past (datafusion-contrib/datafusion-catalogprovider-glue#4 (comment)) but it fell off my radar... Seems that this could use a bit more investigation.

timvw (Owner) commented Oct 18, 2022

When I find some time, this could help in tracing back the mismatch: https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/

AlJohri (Author) commented Oct 30, 2022

@timvw I found the documentation for use_compliant_nested_type for PyArrow helpful for understanding this issue: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html

use_compliant_nested_type : bool, default False

Whether to write compliant Parquet nested type (lists) as defined here, defaults to False. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

<list-repetition> group <name> (LIST) {
    repeated group list {
          <element-repetition> <element-type> element;
    }
}

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> item;
    }
}
