Fix compatibility quirks between arrow and parquet structs #245

Closed
nevi-me opened this issue May 2, 2021 · 0 comments · Fixed by #270
Labels
bug parquet Changes to the parquet crate

Comments

@nevi-me
Contributor

nevi-me commented May 2, 2021

Describe the bug

See #246 and 6a65543. There are some notes referring to this issue in that PR.

The issue is that the different parquet implementations handle non-null structs (and possibly lists) differently.
Spark doesn't seem to have a facility to create non-null struct schemas, so structs are nullable by default. If one creates a non-null struct with null children, pyspark won't read it.

The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.

I also have some uncertainty on whether a non-null parent + null child is logically correct or Arrow specification compliant.
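To make the non-null parent + nullable child case concrete, here is a minimal sketch of how Parquet definition levels work for a single `struct { child }` column. It is not the parquet crate's code: `Repetition`, `max_def_level` and `def_level` are hypothetical names used only for illustration, and repeated fields (lists) are left out. It relies on the Parquet rule that a leaf's maximum definition level is the number of optional nodes on its path.

```rust
/// Repetition of a schema node. Hypothetical type for this sketch;
/// repeated fields (lists) are deliberately not modelled.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Repetition {
    Required,
    Optional,
}

/// Maximum definition level of a leaf: the number of optional nodes
/// on the path from the root to the leaf.
pub fn max_def_level(path: &[Repetition]) -> i16 {
    path.iter().filter(|r| **r == Repetition::Optional).count() as i16
}

/// Definition level for one row of a `struct { child }` column.
/// `struct_valid` and `child_valid` stand in for the Arrow validity bits.
/// A required node never contributes a level, so its validity bit is
/// ignored here (an assumption of this sketch).
pub fn def_level(
    struct_rep: Repetition,
    child_rep: Repetition,
    struct_valid: bool,
    child_valid: bool,
) -> i16 {
    let mut level = 0;
    if struct_rep == Repetition::Optional {
        if !struct_valid {
            return level; // the struct itself is null
        }
        level += 1;
    }
    if child_rep == Repetition::Optional {
        if !child_valid {
            return level; // struct is present, child is null
        }
        level += 1;
    }
    level
}
```

Under these assumptions, a required struct with an optional child has a maximum definition level of 1, and a null child is written as level 0. With an optional struct the maximum is 2, and level 0 means the struct itself is null. So the combination can be represented in the format, and the open question is how each reader maps those levels back to struct and child validity.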

To Reproduce

  • Create a RecordBatch that has a non-null struct with a nullable child.
  • Write that to Parquet
  • Read the Parquet file with Spark

Expected behavior

There should be some clear behaviour that is also documented.

Additional context

See the commit 6a65543, specifically the comments added around the tests.

@nevi-me nevi-me added parquet Changes to the parquet crate bug labels May 2, 2021
@nevi-me nevi-me changed the title [placeholder] Investigate compatibility quirks between arrow and parquet structs Investigate compatibility quirks between arrow and parquet structs May 4, 2021
@alamb alamb changed the title Investigate compatibility quirks between arrow and parquet structs Fix compatibility quirks between arrow and parquet structs May 16, 2021