
fix: make struct fields nullable in stats schema #2346

Merged 3 commits into delta-io:main on Mar 28, 2024

Conversation

@qinix (Contributor) commented Mar 27, 2024

Description

Currently, only top-level fields are mapped to nullable=true in the stats schema. However, delta-spark-generated stats may contain null stats fields even when the origin data fields are declared NOT NULL, so stats for columns nested within a struct should be mapped to nullable=true as well.
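The idea behind the fix can be sketched in plain Python (this is an illustrative dict-based schema model, not the actual delta-rs code): when deriving the stats schema, nullability must be applied recursively to struct children, not just to top-level fields.

```python
# Illustrative sketch, assuming an Arrow-style schema modeled as plain dicts,
# where a struct's "type" is a list of child field dicts. The stats schema
# must mark EVERY field nullable, recursing into nested structs.
def make_nullable(field: dict) -> dict:
    out = dict(field, nullable=True)
    if isinstance(field.get("type"), list):  # struct: recurse into children
        out["type"] = [make_nullable(child) for child in field["type"]]
    return out

# Schema from the reproduction below: struct "s" with NOT NULL children.
schema = [
    {"name": "s", "nullable": False, "type": [
        {"name": "l", "nullable": False, "type": "long"},
        {"name": "b", "nullable": False, "type": "boolean"},
    ]},
]

stats_schema = [make_nullable(f) for f in schema]
# Both the struct and its nested children are now nullable, so stats
# entries with missing nested values can be decoded without error.
```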

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Mar 27, 2024
@rtyler (Member) commented Mar 27, 2024

Hm, interesting. Would you be able to add a test case or example stats to the pull request, which we can incorporate as a test to ensure that nullable nested fields continue to be supported?

@qinix (Contributor, Author) commented Mar 27, 2024

Sure, I'll try to add a test case.

BTW, here is a minimal pyspark snippet to reproduce this case.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.10.13 (main, Aug 24 2023 12:59:26)
Spark context Web UI available at http://192.168.31.133:4040
Spark context available as 'sc' (master = local[*], app id = local-1711560099703).
SparkSession available as 'spark'.

>>> from pyspark.sql.types import *; from delta.tables import *; import deltalake;

>>> schema = StructType([StructField("s", dataType = StructType([StructField("l", LongType(), nullable = False), StructField("b", BooleanType(), nullable = False)]), nullable = False)])

>>> DeltaTable.createOrReplace(spark).location('/tmp/test_delta_notnull_struct').addColumns(schema).execute()
<delta.tables.DeltaTable object at 0x1596c7ee0>

>>> spark.createDataFrame([{'s': {'l': 10, 'b': True}}, {'s': {'l': 20, 'b': False}}], schema).write.format('delta').mode('append').save('/tmp/test_delta_notnull_struct')

>>> deltalake.DeltaTable('/tmp/test_delta_notnull_struct')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/qinix/.virtualenvs/matrix/lib/python3.10/site-packages/deltalake/table.py", line 408, in __init__
    self._table = RawDeltaTable(
Exception: Json error: whilst decoding field 'minValues': whilst decoding field 's': Encountered unmasked nulls in non-nullable StructArray child: Field { name: "b", data_type: Boolean, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
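The traceback above comes from decoding a stats JSON string like the following (a hypothetical payload with invented values, shaped like what delta-spark writes for this table): boolean columns carry no min/max statistics, so "b" is simply absent, i.e. null, inside the struct "s", even though "b" is declared NOT NULL in the data schema.

```python
import json

# Hypothetical stats payload for the reproduction table (values are
# assumptions for illustration, not copied from an actual log). Note that
# "b" has no entry under minValues/maxValues: a strict decoder whose stats
# schema keeps "b" non-nullable rejects this with the error above.
stats = json.loads(json.dumps({
    "numRecords": 2,
    "minValues": {"s": {"l": 10}},
    "maxValues": {"s": {"l": 20}},
    "nullCount": {"s": {"l": 0, "b": 0}},
}))
```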

@rtyler rtyler enabled auto-merge (rebase) March 28, 2024 03:45
@rtyler rtyler merged commit 01f832c into delta-io:main Mar 28, 2024
20 checks passed