[Python] write_dataset does not preserve non-nullable columns in schema #35730
Comments
@ildipo Thanks for the report! The parquet format doesn't have such a flag directly, but it stores nulls via repetition levels and lets you mark a field as "required". It seems that when writing individual tables to Parquet files we translate "not null" into required parquet types, and when reading we convert a required field back to "not null":
>>> pq.write_table(table, "test_nullability.parquet")
>>> pq.read_metadata("test_nullability.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f21b778fec0>
required group field_id=-1 schema {
required int64 field_id=-1 x;
optional int64 field_id=-1 y;
optional int32 field_id=-1 date (Date);
}
>>> pq.read_table("test_nullability.parquet").schema
x: int64 not null
y: int64
date: date32[day]

So it seems this is supported in the Parquet module itself, so this must be something in the dataset API that loses this information. Quick guess is that it has to do with partitioning:
>>> pq.write_to_dataset(table, "test_dataset_nullability")
# reading directory -> lost "not null"
>>> ds.dataset("test_dataset_nullability/", format="parquet").schema
x: int64
y: int64
date: date32[day]
# reading single file -> preserved "not null"
>>> ds.dataset("test_nullability.parquet", format="parquet").schema
x: int64 not null
y: int64
date: date32[day] |
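For reference, the construction of `table` isn't quoted above; a setup consistent with the printed schemas (an assumption, for illustration only) would be:
import datetime
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Assumed setup matching the output above: "x" is non-nullable, "y" and "date" are nullable.
schema = pa.schema([
    pa.field("x", "int64", nullable=False),
    pa.field("y", "int64"),
    pa.field("date", pa.date32()),
])
table = pa.table(
    {"x": [1, 2], "y": [3, 4],
     "date": [datetime.date(2023, 1, 1), datetime.date(2023, 1, 2)]},
    schema=schema,
)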
Yes, write_dataset is a bit tricky when it comes to schema information. If the input is multiple tables, then write_dataset is probably going to combine them into a single output table, so which metadata do we use? What the write node does today is allow custom key/value metadata to be specified on the write node options. Then we have a bit of a hack in place today for "if the input is a single table then preserve the metadata"; this lives in the dataset write path.
That being said, …
|
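A minimal sketch of the ambiguity described above (not from the thread): two input tables that differ only in the nullability of the same column, so a write node combining them has to settle on one output schema, and only the nullable variant is valid for both inputs.
import pyarrow as pa

# Two inputs that differ only in the nullability of "x".
t1 = pa.table({"x": [1]}, schema=pa.schema([pa.field("x", "int64", nullable=False)]))
t2 = pa.table({"x": [2]}, schema=pa.schema([pa.field("x", "int64", nullable=True)]))
print(t1.schema)  # x: int64 not null
print(t2.schema)  # x: int64
# A writer combining both must pick a single schema; only the nullable one is
# valid for both inputs, which is effectively how "not null" gets dropped.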
The behavior changed sometime between arrow 7 and 12, since it used to work with arrow 7. |
I think we want the solution that is easier to backport to arrow 12 |
Note that this is also broken when the 'schema' parameter is passed explicitly. |
Does it work if you set |
Nvm, I see this is |
@weston note that this is not (AFAIU) about custom metadata, but just about how the arrow schema gets translated to a Parquet schema (or how the arrow schema gets changed throughout dataset writing). If we write a single file (directly using the Parquet file writer, not going through datasets), then a pyarrow field with nullable=False gets translated into a "required" parquet field:
>>> schema = pa.schema([pa.field("col1", "int64", nullable=True), pa.field("col2", "int64", nullable=False)])
>>> table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
>>> table.schema
col1: int64
col2: int64 not null
>>> pq.write_table(table, "test_nullability.parquet")
>>> pq.read_metadata("test_nullability.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f21957c9700>
required group field_id=-1 schema {
optional int64 field_id=-1 col1;
required int64 field_id=-1 col2; # <--- this is "required" instead of "optional"
}

But if we write this as a single file (in a directory) through the dataset API (so not even using a partitioning column), the non-nullable column is no longer "required" in the parquet schema:
>>> ds.write_dataset(table, "test_dataset_nullability/", format="parquet")
>>> pq.read_metadata("test_dataset_nullability/part-0.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f219d16cfc0>
required group field_id=-1 schema {
optional int64 field_id=-1 col1;
optional int64 field_id=-1 col2; # <--- no longer "required" !
}

So I suppose that somewhere in the dataset writing code path, the schema loses the field nullability information.
I suppose this is because we now use |
Digging a bit further, this nullable field information is lost in Acero's ProjectNode. Small reproducer in Python:
import pyarrow as pa
from pyarrow.acero import Declaration, TableSourceNodeOptions, ProjectNodeOptions, field
schema = pa.schema([pa.field("col1", "int64", nullable=True), pa.field("col2", "int64", nullable=False)])
table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)
table_source = Declaration("table_source", options=TableSourceNodeOptions(table))
project = Declaration("project", ProjectNodeOptions([field("col1"), field("col2")]))
decl = Declaration.from_sequence([table_source, project])
>>> table.schema
col1: int64
col2: int64 not null
>>> decl.to_table().schema
col1: int64
col2: int64

This happens because the ProjectNode naively recreates the schema from the names/exprs, ignoring the field information of the original input schema: arrow/cpp/src/arrow/acero/project_node.cc (lines 64 to 75 at 6bd31f3)
So this only preserves the type of the original input schema, but ignores any nullable flag or field metadata information (and then we only have some special code to preserve the custom metadata of the full schema).

@westonpace rereading your original comment, while your explanation first focused on the schema metadata, you actually already said essentially the above:
But for what we need to do about this: shouldn't the ProjectNode just try to preserve this information for trivial field ref expressions? |
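As a user-level workaround (a sketch, not part of the original discussion), the original schema can be reapplied to the project output with Table.cast, since only the schema-level flag is dropped and the data itself is unchanged. Continuing the reproducer above:
result = decl.to_table()        # "col2" comes back as plain int64
restored = result.cast(schema)  # reapplies "col2: int64 not null"
assert not restored.schema.field("col2").nullable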
in 7.0 we were using |
If this is enough it should be pretty quick |
So here is the change that introduced this: #31452. Before the change we used to require the schema to be specified on the write node options. This was an unnecessary burden when you didn't care about any custom field information (since we've already calculated the schema).
I think there is still the problem that we largely ignore nullability. We can't usually assume that all batches will have the same nullability. For example, imagine a scan node where we are scanning two different parquet files. One of the parquet files marks a column as nullable and the other does not. I suppose the correct answer, if Acero were nullability-aware and once evolution is a little more robust, would be to "evolve" the schema of the file with a nullable type to a non-nullable type so that we have a common input schema. In the meantime, the quickest simple fix to this regression is to allow the user to specify an output schema instead of just key/value metadata. |
… write (#35860)

### Rationale for this change

The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type.

### What changes are included in this PR?

Added `custom_schema` to `WriteNodeOptions` and removed `custom_metadata`.

### Are these changes tested?

Yes, I added a new C++ unit test to verify that the custom info is applied to written files.

### Are there any user-facing changes?

No. Only new functionality (which is user facing)

* Closes: #35730

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
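With a pyarrow build that includes this change, something like the following sketch would be expected to carry the nullability flag through to the written file (the part-0.parquet name assumes the default basename template):
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

schema = pa.schema([pa.field("col1", "int64", nullable=True),
                    pa.field("col2", "int64", nullable=False)])
table = pa.table({"col1": [1, 2, 3], "col2": [2, 3, 4]}, schema=schema)

# Passing the schema explicitly should now preserve "col2" as a required field.
ds.write_dataset(table, "test_fixed_nullability/", format="parquet", schema=schema)
print(pq.read_metadata("test_fixed_nullability/part-0.parquet").schema)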
@github-actions crossbow submit test-r-ubuntu-22.04 |
@github-actions crossbow submit test-r-versions |
the two jobs above have failed on the maintenance branch (https://github.com/ursacomputing/crossbow/actions/runs/5149898329/jobs/9273436127 and https://github.com/ursacomputing/crossbow/actions/runs/5149898179/jobs/9273436054). I am checking the status here as they seem related to this change. |
Failure is occurring in this context: https://github.com/apache/arrow/pull/35860/files#diff-0d1ff6f17f571f6a348848af7de9c05ed588d3339f46dd3bcf2808489f7dca92R340 |
I do see |
@thisisnic Would you be able to take a look? |
The error is a bit of a red herring. It is not building Arrow-C++. Instead it is downloading Arrow-C++. If you look at a passing build (e.g. from the nightly tests) you can see:
On the other hand, if you look at these failing builds, you see:
So the nightly test looks for The test build you've shared is looking for |
@github-actions crossbow submit test-r-ubuntu-22.04 |
ok, I've finally realised this is the issue, not the PR :) |
When writing a table whose schema has non-nullable columns using write_dataset, the non-nullable info is not saved.
To reproduce
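The original snippet is not preserved here; a reconstruction based on the discussion above (a sketch, with assumed column names) is:
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([pa.field("x", "int64", nullable=False), pa.field("y", "int64")])
table = pa.table({"x": [1, 2], "y": [3, 4]}, schema=schema)

ds.write_dataset(table, "repro_nullability/", format="parquet")
# "x" comes back as plain "int64" instead of "int64 not null"
print(ds.dataset("repro_nullability/", format="parquet").schema)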
Component(s)
Python