Sanitized special character column name before writing to parquet #590
Conversation
pyiceberg/io/pyarrow.py (Outdated)
@@ -1772,12 +1772,13 @@ def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteT
    )

    def write_parquet(task: WriteTask) -> DataFile:
        df = pa.Table.from_batches(task.record_batches)
        df = df.rename_columns(schema.column_names)
We need to change the schema (column names) of the Arrow dataframe; if there's a better way to do this, please let me know.
Shall we extend the integration test to cover the nested schema case? For example:
pa.field('name', pa.string()),
pa.field('address', pa.struct([
    pa.field('street', pa.string()),
    pa.field('city', pa.string()),
    pa.field('zip', pa.int32()),
]))
Updated: I got
pyarrow.lib.ArrowInvalid: tried to rename a table of 4 columns but only 7 names were provided
when trying with the following dataset:
TEST_DATA_WITH_SPECIAL_CHARACTER_COLUMN = {
    column_name_with_special_character: ['a', None, 'z'],
    'id': [1, 2, 3],
    'name': ['AB', 'CD', 'EF'],
    'address': [
        {'street': '123', 'city': 'SFO', 'zip': 12345},
        {'street': '456', 'city': 'SW', 'zip': 67890},
        {'street': '789', 'city': 'Random', 'zip': 10112}
    ]
}
pa_schema = pa.schema([
    pa.field(column_name_with_special_character, pa.string()),
    pa.field('id', pa.int32()),
    pa.field('name', pa.string()),
    pa.field('address', pa.struct([
        pa.field('street', pa.string()),
        pa.field('city', pa.string()),
        pa.field('zip', pa.int32())
    ]))
])
Good catch! I don't think rename_columns works well with nested schemas.
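For context, a minimal standalone reproduction (not taken from the PR, assuming the same test data as above): pa.Table.rename_columns expects exactly one name per top-level column, so the seven names derived from the Iceberg schema (four top-level columns plus the three nested fields of address) don't line up with the four Arrow columns, and nested struct fields can't be renamed this way in any case.
import pyarrow as pa

# Hypothetical reproduction of the failure discussed above.
column_name_with_special_character = "TEST:A1B2.RAW.ABC-GG-1-A"
table = pa.table({
    column_name_with_special_character: ["a", None, "z"],
    "id": [1, 2, 3],
    "name": ["AB", "CD", "EF"],
    "address": [
        {"street": "123", "city": "SFO", "zip": 12345},
        {"street": "456", "city": "SW", "zip": 67890},
        {"street": "789", "city": "Random", "zip": 10112},
    ],
})

# Works: one name per top-level column (4 names for 4 columns).
table = table.rename_columns(["sanitized", "id", "name", "address"])

# Fails: 7 names (top-level plus nested fields) for 4 top-level columns, and the
# nested 'street'/'city'/'zip' fields are out of reach for rename_columns anyway.
try:
    table.rename_columns(["sanitized", "id", "name", "address", "street", "city", "zip"])
except pa.lib.ArrowInvalid as exc:
    print(exc)  # tried to rename a table of 4 columns but only 7 names were provided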
pyiceberg/io/pyarrow.py (Outdated)
@@ -1122,12 +1121,12 @@ def project_table(
     return result

-def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table:
-    struct_array = visit_with_partner(requested_schema, table, ArrowProjectionVisitor(file_schema), ArrowAccessor(file_schema))
+def to_requested_schema(table: pa.Table, from_schema: Schema, to_schema: Schema) -> pa.Table:
I refactored the helper method
I can pull this refactor into a separate PR if it helps with review.
This is a public method, so we're breaking the API here. I'm not sure a refactor justifies the breaking change. Also, the names file_schema and requested_schema are more informative to me.
I didn't realize it's a public API. I reverted the refactor
pyiceberg/io/pyarrow.py (Outdated)
@@ -1772,16 +1772,17 @@ def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteT
    )

    def write_parquet(task: WriteTask) -> DataFile:
        arrow_table = pa.Table.from_batches(task.record_batches)
        df = to_requested_schema(table=arrow_table, from_schema=iceberg_table_schema, to_schema=parquet_schema)
Currently, we batch the incoming dataframe first (in _dataframe_to_data_files) and then transform the schema for each batch. We could optimize by transforming first and then batching.
I want the schema transformation to happen as close to the parquet writing as possible, so I'm going with the first approach for now.
LGTM! Sorry for the merge conflict caused by my taking the integration test in #597.
Follow-up PR: As proposed in apache/iceberg#10120, shall we also add a configuration to allow writing parquet files with the original column names?
About the milestone: shall we add this to the 0.7.0 milestone? I think changing the column-naming behavior of parquet generation may be too much for a patch release. Also, I hesitate to label it as a "bug" since Iceberg relies on the field-id and this does not violate the spec. Releasing it in 0.7.0 also gives us more time to develop the follow-up PR and discuss apache/iceberg#10120 further. WDYT?
tbl.overwrite(arrow_table_with_special_character_column)
# PySpark toPandas() turns nested field into tuple by default, but returns the proper schema when Arrow is enabled
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
Shall we add this to the spark fixture in conftest.py? Since the fixture's scope is "session", if we change the config here, all tests before this line will not have the configuration and all tests after this line will have it enabled. Moving it to the initialization part ensures we have a consistent set of spark configs throughout the integration tests. WDYT?
Good catch! I didn't know about the fixture scope behavior. Moved to conftest.
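For illustration, a minimal sketch of what setting the config at fixture-creation time could look like; the actual session-scoped spark fixture in conftest.py has more Iceberg-specific configuration, and the fixture name and builder options here are assumptions.
# conftest.py (sketch only, not the repository's actual fixture)
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Setting the config when the session is created keeps it consistent for
    # every integration test, instead of flipping it mid-session in one test.
    return (
        SparkSession.builder.appName("pyiceberg-integration-tests")
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )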
@@ -186,8 +185,6 @@ def test_inspect_entries(

        assert df_lhs == df_rhs, f"Difference in data_file column {df_column}: {df_lhs} != {df_rhs}"
-   elif column == 'readable_metrics':
-       right = right.asDict(recursive=True)
Don't need this anymore: because of spark.sql.execution.arrow.pyspark.enabled, the pandas DataFrame turns the tuple into a dict.
Nice!
pyiceberg/io/pyarrow.py (Outdated)
    parquet_schema = sanitize_column_names(iceberg_table_schema)
    arrow_file_schema = parquet_schema.as_arrow()
Nit: I realize we already have many names here, and that might get confusing. parquet_schema is appropriate today since we only support Parquet, but we might also support ORC and Avro later.
Suggested change:
-    parquet_schema = sanitize_column_names(iceberg_table_schema)
-    arrow_file_schema = parquet_schema.as_arrow()
+    arrow_file_schema = sanitize_column_names(iceberg_table_schema).as_arrow()
pyiceberg/io/pyarrow.py (Outdated)
@@ -1780,16 +1781,17 @@ def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteT
    )

    def write_parquet(task: WriteTask) -> DataFile:
        arrow_table = pa.Table.from_batches(task.record_batches)
        df = to_requested_schema(requested_schema=parquet_schema, file_schema=iceberg_table_schema, table=arrow_table)
Do we know if from_arrays in the ArrowProjectionVisitor is a no-op? The reason I'm asking is that we're introducing quite a bit of logic here, and I think the rewrites are only applicable to Avro: https://avro.apache.org/docs/1.8.1/spec.html#names
Quick check: judging by how long the from_arrays call takes, it doesn't seem to copy anything:
python3.9
Python 3.9.18 (main, Aug 24 2023, 18:16:58)
[Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> numbers = pa.array(range(100000000))
>>> pa.Table.from_arrays([numbers], names=['abc'])
pyarrow.Table
abc: int64
----
abc: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
table_schema = task.schema
arrow_table = pa.Table.from_batches(task.record_batches)
# if schema needs to be transformed, use the transformed schema and adjust the arrow table accordingly
# otherwise use the original schema
if (sanitized_schema := sanitize_column_names(table_schema)) != table_schema:
    file_schema = sanitized_schema
    arrow_table = to_requested_schema(requested_schema=file_schema, file_schema=table_schema, table=arrow_table)
else:
    file_schema = table_schema
@Fokko wdyt of this? I only used the transformed schema when necessary.
Another option could be to push this logic up to the caller of write_file, or maybe even into WriteTask.
My preference is to leave out all this logic when writing Parquet files. It doesn't affect Parquet files (since the characters are supported). With Iceberg we resolve the fields by ID, so names are unimportant. I would like to get @rdblue's opinion since he did the initial work (I also have a meeting with him tomorrow, and I can ping him there).
Sounds good! I opened apache/iceberg#10120 to track this.
I pinged Ryan and he's in favor of adding the aliases 👍
Thanks for working on this @kevinjqliu and thanks @HonahX for the review
Fixes #584
Before this PR, PyIceberg allowed writing parquet files with special characters in column names. This is currently not allowed in the Java Iceberg library; instead, the Java Iceberg library transforms the special characters before writing to parquet, and applies the same transformation during reading.
For example, in the Java Iceberg library, an Iceberg table column named TEST:A1B2.RAW.ABC-GG-1-A is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA, which is then used to write the parquet files. apache/iceberg#10120 is opened as a feature request to allow writing parquet files with special characters in the column name.
In the meantime, we want to mirror the Java Iceberg library behavior. #83 does the column name transformation during reading; this PR does the column name transformation during writing.
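For illustration, a simplified sketch of the escaping scheme shown in the example above (not the actual sanitize_column_names implementation in pyiceberg, which follows the Avro name rules more carefully, e.g. for leading digits): characters that are not valid in an Avro name are replaced with _x followed by their hex code.
# Simplified, illustrative sketch of the _xNN escaping; the real logic lives in pyiceberg.
def sanitize_name(name: str) -> str:
    def sanitize_char(ch: str) -> str:
        # Letters, digits and underscores pass through; anything else is escaped.
        return ch if ch.isalnum() or ch == "_" else f"_x{ord(ch):X}"
    return "".join(sanitize_char(ch) for ch in name)

print(sanitize_name("TEST:A1B2.RAW.ABC-GG-1-A"))
# TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA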