Creating checkpoints for tables with missing column stats results in Err
Delta-rs version: 0.16.5
Bug
In a table with more than 32 columns, when trying to create a checkpoint (using the checkpoints::create_checkpoint API) on a Delta table whose transaction log was written by Spark, which by default only collects stats for the first 32 columns rather than for all of them, we get the following Err:
Failed to convert into Arrow schema: Json error: whilst decoding field 'add': whilst decoding field 'stats_parsed': whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: < child>
I suspect it is either a bug in the arrow-json crate, which for some reason receives null positions for the overflowing columns when decoding the transaction log statistics, or a bug in the 'add' action JSON that delta-rs builds during checkpoint creation, where the schema contains more than 32 columns but the 'stats_parsed' JSON does not have a corresponding value for every column.
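For context, here is a minimal sketch of the call path that hits the error. The table path and the tokio runtime setup are placeholders, and I'm assuming the crate-level checkpoints re-export in 0.16.x:

```rust
use deltalake::checkpoints;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a table whose latest log entries were written by Spark,
    // i.e. stats are only present for the first 32 columns.
    let table = deltalake::open_table("/path/to/table").await?;

    // This is the call that returns the "Failed to convert into Arrow schema" Err.
    checkpoints::create_checkpoint(&table).await?;
    Ok(())
}
```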
What you expected to happen:
I expect the Arrow JSON schema conversion to succeed when stats are not present for all columns, and more broadly, to be able to create a checkpoint file with the delta-rs library after Spark has optimized the table.
How to reproduce it:
A table with more than 32 columns on which the Spark engine has run an OPTIMIZE transaction, so that stats are not included for all fields.
The delta log itself is enough to reproduce the issue; I can provide example files if needed. A sketch for confirming the truncated stats follows below.
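If it helps triage, here is a rough sketch for confirming that the Spark-written log only carries stats for a subset of the columns. It assumes the 0.16.x API surface (get_state().files() and a raw `stats` JSON string on each Add action), plus serde_json as an extra dependency:

```rust
use deltalake::DeltaTable;
use serde_json::Value;

// Count how many columns actually have minValues stats on each add action.
fn stats_column_counts(table: &DeltaTable) -> Result<(), Box<dyn std::error::Error>> {
    for add in table.get_state().files() {
        if let Some(stats) = &add.stats {
            // Parse the raw stats string and count the keys under minValues.
            let parsed: Value = serde_json::from_str(stats)?;
            let n = parsed["minValues"].as_object().map(|m| m.len()).unwrap_or(0);
            println!("{}: stats for {} columns", add.path, n);
        }
    }
    Ok(())
}
```

On an affected table this should report 32 columns per file even though the schema has more.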
More details:
If this is intended behavior and not a bug, please let me know. Thanks in advance!