Preserve original field order during schema evolution
This overcomes a limitation in how Hudi syncs schemas to the Glue
catalog. Previously, if version `1-0-0` of a schema had fields `a` and
`b`, and version `1-0-1` then added a field `c`, the new field might
be added _before_ the original fields in the Hudi schema. The new field
would get synced to Glue, but only for new partitions; it was not
back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original
fields `a` and `b` in the Hudi schema.  Then there is no need to sync
the new field to existing partitions in Glue.
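
To illustrate the ordering rule described above, here is a minimal sketch in Scala. It is not the code from schema-ddl or common-streams; `FieldOrder`, `Field`, and `evolve` are hypothetical names used only for this example, and it assumes fields are matched by name alone.

```scala
// Hypothetical illustration only; not the schema-ddl implementation.
object FieldOrder {
  // A simplified stand-in for a column in the table schema.
  final case class Field(name: String, dataType: String)

  // Keep every existing field in its original position and append
  // any field not seen before at the end of the schema.
  def evolve(existing: Vector[Field], incoming: Vector[Field]): Vector[Field] = {
    val known = existing.map(_.name).toSet
    existing ++ incoming.filterNot(f => known.contains(f.name))
  }
}
```

With fields `a` and `b` from version `1-0-0`, and a `1-0-1` schema that lists `c` first, `FieldOrder.evolve` returns the fields in the order `a`, `b`, `c`, so the column positions already recorded for existing partitions stay valid.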

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and
snowplow-incubator/common-streams#98 and imported via a new version of
common-streams.

This change does not impact Delta or Iceberg, where nothing was broken.
istreeter committed Nov 19, 2024
1 parent c1e275e commit 29e7689
Showing 2 changed files with 2 additions and 3 deletions.
3 changes: 1 addition & 2 deletions modules/core/src/main/resources/reference.conf
@@ -57,9 +57,8 @@
# -- Record key and partition settings. Chosen to be consistent with `hudiTableOptions`.
"hoodie.keygen.timebased.timestamp.type": "SCALAR"
"hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd"
"hoodie.datasource.write.reconcile.schema": "true"
"hoodie.datasource.write.partitionpath.field": "load_tstamp"
"hoodie.schema.on.read.enable": "true"
"hoodie.write.set.null.for.missing.columns": "true"
"hoodie.metadata.index.column.stats.column.list": "load_tstamp,collector_tstamp,derived_tstamp,dvce_created_tstamp,true_tstamp"
"hoodie.metadata.index.column.stats.enable": "true"
"hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled": "true"
2 changes: 1 addition & 1 deletion project/Dependencies.scala
@@ -49,7 +49,7 @@ object Dependencies {
val awsRegistry = "1.1.20"

// Snowplow
- val streams = "0.8.0"
+ val streams = "0.8.2-M1"
val igluClient = "4.0.0"

// Transitive overrides