[feature request] Allow Java Iceberg library to write parquet files with special character column names #10120

Open
kevinjqliu opened this issue Apr 11, 2024 · 2 comments
Labels
improvement PR that improves existing functionality

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

Based on the discussion in iceberg-python/#584, we found that the Java Iceberg library "sanitizes" column names that contain special characters, transforming them before writing to Parquet.

For example, a column named TEST:A1B2.RAW.ABC-GG-1-A in an Iceberg table is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA, and the transformed name is the one written to the Parquet files.
The transformation is applied on both reads and writes. The behavior was introduced in #601.
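
To make the mapping concrete, here is a minimal Python sketch that reproduces the names shown in this thread. It is not the Java library's implementation: the function name `sanitize` is hypothetical, and the rule (replace every character outside ASCII letters, digits, and underscore with `_x` followed by its uppercase hex code point) is inferred only from the examples here. Edge cases such as a leading digit or non-ASCII letters are not handled.

```python
import re

# Assumption: any character that is not an ASCII letter, digit, or underscore
# is considered "special". This rule is inferred from the examples in this
# thread, not taken from the Iceberg source.
_SPECIAL = re.compile(r"[^A-Za-z0-9_]")


def sanitize(name: str) -> str:
    """Replace each special character with _x<UPPERCASE HEX CODE POINT>."""
    return _SPECIAL.sub(lambda m: "_x{:X}".format(ord(m.group())), name)


if __name__ == "__main__":
    print(sanitize("TEST:A1B2.RAW.ABC-GG-1-A"))
    print(sanitize("a.b+c"))
```

Names that contain only letters, digits, and underscores pass through this sketch unchanged.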

I think Iceberg should (optionally) allow writing column names without this "sanitization" and transformation. The behavior could be made configurable to preserve backward compatibility.

Query engine

None

@Fokko
Contributor

Fokko commented Apr 16, 2024

I can confirm this. When creating a table and inserting rows:

CREATE TABLE default.abc(`a.b+c` string);

%%sql
INSERT INTO default.abc VALUES ('a'), ('b')

The field name is sanitized:

parq 00000-0-bd071a55-7d0f-4ecd-b4be-44f55532624d-0-00001.parquet --schema 

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x14cf4fdc0>
required group field_id=-1 table {
  required binary field_id=1 a_x2Eb_x2Bc (String);
}


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
