Describe the bug
See #246 and 6a65543. There are some notes referring to this issue in that PR.
The issue is that the different Parquet implementations handle non-null structs (and possibly lists) differently.
Spark doesn't seem to have a facility to create non-null struct schemas, so structs are nullable by default. If one creates a non-null struct with null children, PySpark won't read it.
The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.
I'm also unsure whether a non-null parent with a null child is logically correct or compliant with the Arrow specification.
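To make the non-null parent + nullable child case concrete, here is a minimal sketch of how Parquet definition levels would come out for a two-level path (struct, then leaf), assuming the usual rule that only nullable fields on the path add a level. The `Field` class and both helper functions are hypothetical names for illustration only; they are not part of the Rust, C++ or Spark readers, and lists (repetition levels) are not covered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Field:
    """Hypothetical schema node: just a name and a nullability flag."""
    name: str
    nullable: bool


def max_definition_level(path: List[Field]) -> int:
    """Only nullable fields on the path from root to leaf add a level."""
    return sum(1 for field in path if field.nullable)


def definition_levels(
    struct: Field,
    child: Field,
    rows: List[Optional[Dict[str, Optional[int]]]],
) -> List[int]:
    """Compute one definition level per row for the leaf column.

    Each row is either None (the struct itself is null) or a dict
    mapping the child name to a value or None.
    """
    levels = []
    for row in rows:
        if row is None:
            if not struct.nullable:
                raise ValueError("null struct in a non-null struct column")
            levels.append(0)
            continue
        # A present nullable struct contributes one level.
        level = 1 if struct.nullable else 0
        value = row[child.name]
        if value is None:
            if not child.nullable:
                raise ValueError("null value in a non-null child column")
        elif child.nullable:
            # A present nullable child contributes one more level.
            level += 1
        levels.append(level)
    return levels


child = Field("a", nullable=True)
required_struct = Field("s", nullable=False)
optional_struct = Field("s", nullable=True)

# Non-null struct, nullable child: the struct adds no level.
required_levels = definition_levels(required_struct, child, [{"a": 1}, {"a": None}])

# Nullable struct, nullable child: level 0 now means "struct is null".
optional_levels = definition_levels(
    optional_struct, child, [{"a": 1}, {"a": None}, None]
)
```

Under these assumptions, a null child under a non-null struct is written with level 0, the same number that would mean "the struct itself is null" if the struct were nullable. A reader that assumes every struct is nullable could therefore misinterpret the levels, which may be one source of the differences described above.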
To Reproduce
Create a RecordBatch that has a non-null struct with a nullable child.
Write that to Parquet.
Read the Parquet file with Spark.
Expected behavior
There should be some clear behaviour that is also documented.
Additional context
See the commit 6a65543, specifically the comments added around the tests.
nevi-me changed the title from "[placeholder] Investigate compatibility quirks between arrow and parquet structs" to "Investigate compatibility quirks between arrow and parquet structs" on May 4, 2021
alamb changed the title from "Investigate compatibility quirks between arrow and parquet structs" to "Fix compatibility quirks between arrow and parquet structs" on May 16, 2021