[feature request] Allow Java Iceberg library to write parquet files with special character column names #10120

kevinjqliu · 2024-04-11T03:06:03Z

Feature Request / Improvement

Based on discussions from iceberg-python/#584, we found that the Java Iceberg library "sanitizes" and transforms column names with special characters before writing to parquet.

For example, an Iceberg table with TEST:A1B2.RAW.ABC-GG-1-A column is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA which is then used to write the parquet files.
The same transformation is applied on both reads and writes. The behavior was introduced in #601.
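To make the transformation concrete, here is a minimal Python sketch that reproduces the two sanitized names quoted in this thread. It is not the Java library's code: the function name `sanitize_name` is hypothetical, and the rule it uses (replace any character that is not a letter, digit, or underscore with `_x` plus its uppercase hex code point) is an assumption inferred from those examples. Any special handling of leading digits or other edge cases is left out.

```python
def sanitize_name(name: str) -> str:
    """Escape characters outside [letters, digits, underscore] as _x<HEX>.

    Hypothetical helper: the rule is inferred from the examples in this
    issue, not copied from the Iceberg Java implementation.
    """
    parts = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            # Letters, digits, and underscores pass through unchanged.
            parts.append(ch)
        else:
            # Everything else becomes _x followed by the uppercase hex code point.
            parts.append("_x" + format(ord(ch), "X"))
    return "".join(parts)


print(sanitize_name("TEST:A1B2.RAW.ABC-GG-1-A"))
print(sanitize_name("a.b+c"))
```

Under that assumed rule, the first call yields the `TEST_x3A...` name from the description above, and the second yields `a_x2Eb_x2Bc`.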

I think Iceberg should (optionally) allow writing column names without the "sanitization" and transformation. This can be made configurable to enable backward compatibility.

Query engine

None


Fokko · 2024-04-16T19:22:23Z

I can confirm this. When creating a table and inserting rows:

CREATE TABLE default.abc(`a.b+c` string);

%%sql
INSERT INTO default.abc VALUES ('a'), ('b')

The field name is sanitized:

parq 00000-0-bd071a55-7d0f-4ecd-b4be-44f55532624d-0-00001.parquet --schema 

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x14cf4fdc0>
required group field_id=-1 table {
  required binary field_id=1 a_x2Eb_x2Bc (String);
}

github-actions · 2024-10-29T00:15:56Z

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

kevinjqliu added the improvement label Apr 11, 2024

This was referenced Apr 11, 2024

[BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name apache/iceberg-python#584

Closed

Sanitized special character column name before writing to parquet apache/iceberg-python#590

Merged

Fokko mentioned this issue Apr 16, 2024

Incorrect Metrics Calculation for Iceberg Table Due to Column Name Transformation with Special Characters #10115

Closed

This was referenced Aug 21, 2024

Support both legacy(pre 0.13, letters and numbers) and new specification(post 0.13, all unicode in backticks) for the Hive identifiers facebookincubator/velox#10785

Open

Customize field separators facebookincubator/velox#7252

Open

github-actions bot added the stale label Oct 29, 2024

kevinjqliu removed the stale label Oct 31, 2024

smaheshwar-pltr mentioned this issue Dec 21, 2024

URL-encode partition field names in file locations apache/iceberg-python#1457

Merged
