Normalize schema fingerprint for column permutations #17044

findingrish · 2024-09-12T14:49:57Z

Parent issue: #14989

The order of columns can vary across segments, especially during realtime ingestion.
Because the schema fingerprint is sensitive to column order, this creates a large number of segment schemas in the metadata database for essentially the same set of columns.

This is wasteful. This patch fixes the problem by computing the schema fingerprint on lexicographically sorted columns. As a result, a single schema is created in the metadata database, stored with the first observed column order for a given signature.
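To illustrate why sorting removes the sensitivity to column order, here is a minimal standalone sketch. It is not the code in `FingerprintGenerator`: the class `SortedFingerprintSketch`, the `fingerprint` helper, the name/type string encoding, and the choice of SHA-256 are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class SortedFingerprintSketch
{
  // Hypothetical helper: hashes column name/type pairs in lexicographic name order,
  // so the result does not depend on the iteration order of the input map.
  public static String fingerprint(Map<String, String> columns)
  {
    // TreeMap iterates its keys in natural (lexicographic) String order.
    Map<String, String> sorted = new TreeMap<>(columns);
    StringBuilder canonical = new StringBuilder();
    for (Map.Entry<String, String> e : sorted.entrySet()) {
      canonical.append(e.getKey()).append(':').append(e.getValue()).append('\n');
    }
    try {
      // SHA-256 is an assumption here; any stable hash would show the same effect.
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    }
    catch (NoSuchAlgorithmException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args)
  {
    // Same columns, two different observed orders.
    Map<String, String> first = new LinkedHashMap<>();
    first.put("__time", "LONG");
    first.put("dim1", "STRING");
    first.put("m1", "DOUBLE");

    Map<String, String> second = new LinkedHashMap<>();
    second.put("m1", "DOUBLE");
    second.put("__time", "LONG");
    second.put("dim1", "STRING");

    // prints true
    System.out.println(fingerprint(first).equals(fingerprint(second)));
  }
}
```

Without the `TreeMap` copy, the two inputs would serialize differently and hash to different values, which is the schema explosion described above.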

Release notes

In the CentralizedDatasourceSchema feature, different permutations of the same set of columns no longer result in distinct schemas in the database.


cryptoe

LGTM otherwise.

cryptoe · 2024-09-16T02:53:46Z

server/src/main/java/org/apache/druid/segment/metadata/FingerprintGenerator.java

+    // thus avoiding schema explosion in the metadata database
+    // Note that this signature is not persisted anywhere, it is only used for fingerprint computation
+    final RowSignature sortedSignature = getLexicographicallySortedSignature(schemaPayload.getRowSignature());
+    final SchemaPayload updatedPayload = new SchemaPayload(sortedSignature, schemaPayload.getAggregatorFactories());


Please also add a note about the aggregator factories: they are independent of column order since they are backed by a map.
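The code comment in the diff above says the sorted signature is used only for fingerprint computation and is not persisted. A minimal sketch of that idea follows, assuming a hypothetical in-memory map in place of the metadata database and a joined, sorted column list in place of the real fingerprint. The class `FirstSeenSchemaSketch` and its methods are illustrative names, not Druid APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstSeenSchemaSketch
{
  // Hypothetical stand-in for the schema table: key -> column order as first observed.
  private final Map<String, List<String>> schemas = new HashMap<>();

  // Stand-in for a fingerprint: sorted column names joined with commas.
  // The sorted copy is used only to build the key and is never stored.
  public static String key(List<String> columns)
  {
    List<String> sorted = new ArrayList<>(columns);
    Collections.sort(sorted);
    return String.join(",", sorted);
  }

  // Stores the original (unsorted) order, but only if the key has not been seen before.
  public void register(List<String> columns)
  {
    schemas.putIfAbsent(key(columns), new ArrayList<>(columns));
  }

  public int size()
  {
    return schemas.size();
  }

  public List<String> storedOrder(List<String> columns)
  {
    return schemas.get(key(columns));
  }
}
```

Registering `[dim1, __time]` and then `[__time, dim1]` leaves one entry whose stored order is `[dim1, __time]`, matching the "first observed column order" behavior in the description.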


findingrish added 3 commits September 12, 2024 20:03

Normalize schema fingerprint for column permutations

1e6b7b6

Add test

c39326e

Add test to verify that only a single schema is created in the databa…

45c9f9a

…se for different column permutations

cryptoe approved these changes Sep 16, 2024


findingrish added 4 commits September 17, 2024 10:29

Update docs

9b2735a

Merge remote-tracking branch 'upstream/master' into first-seen-column…

2229635

…-order-in-schema

checkstyle

7076e30

Merge remote-tracking branch 'upstream/master' into first-seen-column…

8c9f5af

…-order-in-schema

abhishekagarwal87 merged commit 43d790f into apache:master Sep 18, 2024
89 of 90 checks passed
