AWS: provide option to hide old fields in Glue table #7584

LucasRoesler · 2023-05-11T08:07:49Z

Feature Request / Improvement

In #3888 the Glue schema generation was adjusted so that all old fields are included in the schema. The original reasoning was

so that people know what were the columns that were already used in the past and avoid adding the same name column.

In my organization, there are many users of these tables via Athena who are not the data engineers that own the schema. They have no idea about the old schema, they are not editing the schema, and their default use case is querying the current data. They report it as confusing that the schema shows a field that does not exist and produces errors if they attempt to use it.

Neither Athena nor Glue seems to have any support for displaying these old fields as non-active or deprecated, or for hiding them. Therefore, it would be nice to have a configuration option to disable including non-current fields in the schema.

Query engine

Athena

dertodestod · 2023-05-16T13:35:24Z

I also don't quite understand the current behavior in Athena/Glue when a column is dropped. I can see that a new schema is created in the metadata file without the column, and in Glue the column moves to the end of the table and gets an "iceberg.field.current": "false" setting. However, the column still shows up for consumers in the Athena web console (but not when doing a DESCRIBE of the table), so this has led to some confusion in our business.

I couldn't check if the column appears via JDBC (because of some errors) but I guess the column won't be listed because I see in Athena that a DESCRIBE query is used to retrieve that information. Can someone confirm that?

I personally think that Athena should not show the deleted columns (neither in the web nor via JDBC). Is there perhaps a way to keep track of the dropped column(s) without showing them in Athena? If not, it would be great if one could be created.
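For consumers who read column lists from the Glue API directly, the "iceberg.field.current" setting described above can be used to filter out dropped columns on the client side. The following is only a minimal sketch, not a fix in Iceberg itself: the function name `current_columns` and the sample table are hypothetical, and it assumes the table definition has the shape of the `Table` entry in a Glue `GetTable` response, with the flag stored under each column's `Parameters`.

```python
from typing import Any


def current_columns(table: dict[str, Any]) -> list[dict[str, Any]]:
    """Return only the columns that are not flagged as non-current.

    `table` is assumed to look like the "Table" entry of a Glue GetTable
    response. Columns without the flag are kept, on the assumption that
    they belong to a non-Iceberg table or to older metadata.
    """
    columns = (table.get("StorageDescriptor") or {}).get("Columns") or []
    return [
        col
        for col in columns
        if (col.get("Parameters") or {})
        .get("iceberg.field.current", "true")
        .lower()
        != "false"
    ]


# Hand-written sample (hypothetical data): "legacy_code" was dropped from the table.
sample_table = {
    "Name": "events",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "bigint",
             "Parameters": {"iceberg.field.current": "true"}},
            {"Name": "name", "Type": "string",
             "Parameters": {"iceberg.field.current": "true"}},
            {"Name": "legacy_code", "Type": "string",
             "Parameters": {"iceberg.field.current": "false"}},
        ]
    },
}

print([col["Name"] for col in current_columns(sample_table)])
# prints ['id', 'name']
```

With boto3, the input would typically come from `boto3.client("glue").get_table(DatabaseName=..., Name=...)["Table"]`. This only helps tools that you control; it does not change what the Athena web console displays.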

wojciechjak · 2023-06-07T10:39:00Z

Also the same issue when renaming columns.

pdehaansbp · 2023-07-27T13:57:08Z

Same issue. Curious to read what @jackye1995 and @yyanyy think about it.

tcassou · 2023-12-30T14:54:35Z

Hello! Our organization is facing the same problem. In particular, the Glue API will return columns that cannot be resolved in the source data, causing queries to fail. We've been using dynamically created Presto views, and they break every time a column is dropped.

Technically, schema versioning is meant to solve this challenge:

so that people know what were the columns that were already used in the past and avoid adding the same name column.

The latest schema of a table should be aligned with the data, and previous versions will keep track of historical modifications.
Could we think of publishing new schema versions in Glue instead of this workaround, which introduced bugs/defects?
Or at the very least make this newly introduced behavior optional?

github-actions · 2024-08-30T00:14:20Z

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

tcassou · 2024-08-31T10:15:35Z

Hi there!
This is still an issue, and the only workaround we found is to build a custom Iceberg jar without the faulty commit, which is not really sustainable of course.
Any chance this could get prioritized, or even just acknowledged to start with?

Thanks a lot!

LucasRoesler changed the title ~~AWS: provide option to hid old fields in Glue table~~ AWS: provide option to hide old fields in Glue table May 19, 2023

Raphael-Vignes mentioned this issue Jan 5, 2024

AWS: Add Option to don't write non current columns in glue schema closes #7584 #9420

Closed

github-actions bot added the stale label Aug 30, 2024

github-actions bot removed the stale label Sep 1, 2024

This was referenced Oct 1, 2024

wr.athena.to_iceberg not using temp_path aws/aws-sdk-pandas#2978

Closed

fix: return only "current" iceberg columns aws/aws-sdk-pandas#2982

Merged

duoxoud mentioned this issue Oct 16, 2024

Add optional Glue Schema configuration to exclude Non-Current Fields #11334

Open
