Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS: Retain Glue Catalog column comment after updating Iceberg table #10276

Merged
merged 5 commits into from
Jul 5, 2024

Conversation

lawofcycles
Copy link
Contributor

@lawofcycles lawofcycles commented May 6, 2024

What this PR does

This PR addresses the issue where column comments in the Glue catalog are not retained when updating an existing Iceberg table schema. It fixes the problem by modifying the IcebergToGlueConverter.setTableInputInformation method to preserve existing column comments from the Glue table when creating a new TableInput for an update operation.

When an existing Glue table is provided, the modified method retrieves the comments from its columns and applies them to the corresponding columns in the new TableInput, but only if the Iceberg table's column does not already have a comment.

This change ensures that any user-defined column comments in the Glue catalog are carried over during table updates, preventing unintentional loss of documentation.

Testing

Added a new unit test method testSetTableInputInformationWithExistingTable to verify that column comments from an existing Glue table are correctly preserved when creating a new TableInput.
I'll also add Integration Tests same as #10199.

Fixes #10220

@github-actions github-actions bot added the AWS label May 6, 2024
@lawofcycles
Copy link
Contributor Author

I built the jars and manually tested on Amazon EMR as a Spark runtime.

@lawofcycles lawofcycles changed the title AWS: Updating Glue catalog table removes column descriptions AWS: Retain Glue Catalog column comment after updating Iceberg table May 12, 2024
@lawofcycles
Copy link
Contributor Author

lawofcycles commented May 13, 2024

Hi @geruh would you review this PR?
I noticed you are reviewing #10199, so I'd like to ask for this similar issue.

List<Column> columns = toColumns(metadata);
if (existingTable != null) {
List<Column> existingColumns = existingTable.storageDescriptor().columns();
Map<String, Column> existingColumnMap =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making this a map of column name to comment here. Since we aren't using any additional information from the existing comments

@@ -501,6 +501,97 @@ public void testColumnCommentsAndParameters() {
assertThat(actualColumns).isEqualTo(expectedColumns);
}

@Test
public void testGlueTableColumnCommentsPreserved() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where we drop a column from a table. I believe today we keep those columns and descriptions. Should we add a test for that?

Copy link
Contributor

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, @lawofcycles! I've taken an initial pass. Thanks for the solid work!

@lawofcycles
Copy link
Contributor Author

@geruh Thank you for your review!
I have pushed the changes addressing your review comments.
Could you please take another look when you have a chance?

Comment on lines 258 to 259
Optional.ofNullable(properties.get(GLUE_DESCRIPTION_KEY))
.ifPresent(tableInputBuilder::description);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lawofcycles thank you for your patch!

Now this method has the existing table information, so I think it would be better to move the function to keep the table description https://github.com/aajisaka/iceberg/blob/58c61241bd61de7a4062dbf664691f05eeb2ea53/aws/src/main/java/org/apache/iceberg/aws/glue/GlueTableOperations.java#L319-L321 into this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review!
I moved keeping the table description logic to IcebergToGlueConverter.

@lawofcycles
Copy link
Contributor Author

Hi @geruh would you review the latest change?

Optional.ofNullable(existingTable.description()).ifPresent(tableInputBuilder::description);
}

List<Column> columns = toColumns(metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we can simplify the implementation by creating existingColumnMap first and use it inside addColumnWithDedupe method. In this way we don't have to replace the new columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged existing column comment processing into creating column set on to toColumns and addColumnWithDedupe method

Copy link
Member

@aajisaka aajisaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @lawofcycles

@lawofcycles
Copy link
Contributor Author

lawofcycles commented Jun 8, 2024

Hi @amogh-jahagirdar would you review the latest change?
I noticed you are reviewing #10199, so I'd like to ask for this similar issue.

@lawofcycles
Copy link
Contributor Author

Hi @amogh-jahagirdar,

I hope this message finds you well. I wanted to kindly follow up on this PR, which addresses an important issue with preserving Glue Catalog column comments when updating Iceberg tables (#10220).

Could you please take a look when you have a moment? Thank you for your time and consideration.

@amogh-jahagirdar
Copy link
Contributor

Really sorry for the delayed review @lawofcycles , I'll take a look today.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lawofcycles I think this is pretty good, just had some questions and nits. I'll wait a bit if @geruh @rahil-c have any concerns and if not I'll go ahead and merge

@lawofcycles lawofcycles force-pushed the keep-glue-column-desc branch 3 times, most recently from 6f32211 to 8df1fc2 Compare June 27, 2024 14:03
@Fokko Fokko merged commit 9a0fc49 into apache:main Jul 5, 2024
45 checks passed
@Fokko
Copy link
Contributor

Fokko commented Jul 5, 2024

Let's get this in, thanks @lawofcycles for working on this, and @amogh-jahagirdar and @rahil-c for the review!

@lawofcycles
Copy link
Contributor Author

Thanks for reviewing @amogh-jahagirdar , @aajisaka , @geruh and @rahil-c !!

@lawofcycles
Copy link
Contributor Author

Thank you for merging @Fokko !

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
* Retain Glue Catalog column comment

* merge handling existing column comment to creating column set

* apply spotless

* Add newline after else block for improved readability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS: Updating Glue catalog table removes column descriptions
6 participants