Preserve original field order during schema evolution #96

Open
wants to merge 1 commit into base: develop

istreeter
Collaborator

@istreeter istreeter commented Nov 18, 2024

This overcomes a limitation in how Hudi syncs schemas to the Glue catalog. Previously, if version `1-0-0` of a schema had fields `a` and `b`, and version `1-0-1` then added a field `c`, the new field might be added _before_ the original fields in the Hudi schema. The new field would get synced to Glue, but only for new partitions; it was not back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original fields `a` and `b` in the Hudi schema, so there is no need to sync the new field to existing partitions in Glue.
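
The ordering rule can be illustrated with a minimal sketch. This is not the schema-ddl or common-streams implementation; the function name `merge_field_order` and the modelling of a schema version as a plain list of field names are assumptions made purely for illustration.

```python
from typing import Iterable, List


def merge_field_order(versions: Iterable[List[str]]) -> List[str]:
    """Merge field names from successive schema versions, oldest first.

    A field keeps the position at which it first appeared. Fields that a
    later version introduces are appended to the end, never inserted
    ahead of existing fields.
    """
    merged: List[str] = []
    seen = set()
    for fields in versions:
        for name in fields:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# Hypothetical inputs: version 1-0-1 happens to list its new field first.
v1_0_0 = ["a", "b"]
v1_0_1 = ["c", "a", "b"]

columns = merge_field_order([v1_0_0, v1_0_1])
print(columns)  # ['a', 'b', 'c']
```

Because `a` and `b` keep their original positions, the column list for existing partitions remains a prefix of the evolved column list.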

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and snowplow-incubator/common-streams#98 and imported via a new version of common-streams.

This change does not impact Delta or Iceberg, where nothing was broken.

@istreeter istreeter force-pushed the common-streams-0.8.2 branch from ec99815 to 9b2376f Compare November 18, 2024 11:41
@istreeter istreeter force-pushed the common-streams-0.8.2 branch from 9b2376f to 29e7689 Compare November 19, 2024 19:13