Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize schema fingerprint for column permutations #17044

Merged

Conversation

findingrish
Copy link
Contributor

@findingrish findingrish commented Sep 12, 2024

Parent issue: #14989

It is possible for the order of columns to vary across segments especially during realtime ingestion.
Since, the schema fingerprint is sensitive to column order this leads to creation of a large number of segment schema in the metadata database for essentially the same set of columns.

This is wasteful, this patch fixes this problem by computing schema fingerprint on lexicographically sorted columns. This would result in creation of a single schema in the metadata database with the first observed column order for a given signature.

Release notes

In the CentralizedDatasourceSchema feature, different permutations of the same column order do not result in distinct schemas in the database.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise.

// thus avoiding schema explosion in the metadata database
// Note that this signature is not persisted anywhere, it is only used for fingerprint computation
final RowSignature sortedSignature = getLexicographicallySortedSignature(schemaPayload.getRowSignature());
final SchemaPayload updatedPayload = new SchemaPayload(sortedSignature, schemaPayload.getAggregatorFactories());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mention a note about the aggregator factories as well that they are column order independent since they are backed by a map.

@abhishekagarwal87 abhishekagarwal87 merged commit 43d790f into apache:master Sep 18, 2024
89 of 90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants