Centralize datasource schema management in Coordinator #14989

findingrish · 2023-09-14T13:15:22Z

Motivation

Original proposal https://github.com/abhishekagarwal87/druid/blob/metadata_design_proposal/design-proposal.md#3-proposed-solution-storing-segment-schema-in-metadata-store

In summary, the current approach to constructing table schemas, in which brokers query data nodes and tasks for segment schemas, has several limitations and operational challenges. These include slow broker startup, excessive communication within the cluster, schema rebuilding on broker startup, and the lack of a single schema owner. It also has functional limitations, such as the inability to query data from deep storage.

The proposed solution is to centralize schema management within the coordinator. Tasks publish their segment schemas to the metadata database, along with segment row count information. The coordinator can then build the table schema by combining the individual segment schemas within the datasource.
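To illustrate the combining step, here is a minimal sketch and not the actual Coordinator implementation. The `SegmentSchema` record, the `build_datasource_schema` function, and the merge rule (segments passed newest first, the first type seen for a column wins) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentSchema:
    """Hypothetical record a task might publish for one segment."""
    segment_id: str
    columns: dict  # column name -> type name
    num_rows: int


def build_datasource_schema(segments):
    """Combine per-segment schemas into one datasource-level schema.

    Assumed merge rule: `segments` is ordered newest first, and the first
    type seen for a column wins. Row counts are summed across segments.
    """
    merged = {}
    total_rows = 0
    for seg in segments:
        total_rows += seg.num_rows
        for name, type_name in seg.columns.items():
            # Keep the type from the newest segment that has this column.
            merged.setdefault(name, type_name)
    return merged, total_rows
```

Under this assumed rule, a column whose type changed between segments takes the type from the newest segment, while a column that exists only in older segments is still included in the datasource schema.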

Design

Changes are required in tasks, the coordinator, and the broker.
The detailed design is described in the individual PRs.

Phases

The first phase is to move existing schema building functionality from the brokers to the coordinator and allow the broker to query schema from the coordinator, while retaining the capability to build table schema if the need arises.
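The broker-side behavior in this phase can be pictured as a fetch with a fallback. This is only a sketch: `resolve_table_schema` and the two callables it takes are hypothetical names, and treating `ConnectionError` or a `None` result as "coordinator unavailable" is an assumption, not Druid's actual API.

```python
def resolve_table_schema(datasource, fetch_from_coordinator, build_locally):
    """Prefer the coordinator's schema; build it locally if that fails.

    Both callables are hypothetical stand-ins: one asks the coordinator
    for the datasource schema, the other runs the existing broker-side
    schema building.
    """
    try:
        schema = fetch_from_coordinator(datasource)
    except ConnectionError:
        # Assumed failure mode: the coordinator cannot be reached.
        schema = None
    if schema is None:
        # Retained capability: build the table schema on the broker.
        schema = build_locally(datasource)
    return schema
```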

The next step is to have the coordinator publish segment schema in the background to reduce the volume of segment metadata queries during coordinator startup.

In parallel, tasks should be updated to publish their schema in the database, eventually eliminating the need to query segment schema directly from data nodes and tasks.

Changes are also required to fetch and publish schema for cold tier segments. This can be done in the Coordinator.

Future work involves serving system table queries from the Coordinator.

findingrish · 2023-09-14T13:18:43Z

PR to move schema building functionality to the coordinator: #14985

…er (#15496)

Description

With the CentralizedDatasourceSchema (#14989) feature enabled, metadata for appended segments was not being refreshed. This caused numRows to be 0 for the new segments and would probably cause the datasource schema to not include columns from the new segments.

Analysis

The problem turned out to be in the new QuerySegmentWalker implementation in the Coordinator. It first finds the segment to be queried in the Coordinator timeline. Then it creates a new timeline of the segments present in the timeline. The problem was that it looked up only complete partition sets in the new timeline. Since the appended segments by themselves do not make a complete partition set, no SegmentMetadataQuery was executed.

…rce Schema Building (#15817)

Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This is the final change, which involves publishing segment schema for finalized segments from tasks and periodically polling them in the Coordinator.

Parent issue: #14989

It is possible for the order of columns to vary across segments, especially during realtime ingestion. Since the schema fingerprint is sensitive to column order, this leads to the creation of a large number of segment schemas in the metadata database for essentially the same set of columns. This is wasteful. This patch fixes the problem by computing the schema fingerprint on lexicographically sorted columns. This results in the creation of a single schema in the metadata database, stored with the first observed column order for a given signature.
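The idea of an order-insensitive fingerprint can be sketched as follows. This is illustrative only: the hash function (SHA-256), the JSON serialization, and the `schema_fingerprint` and `SchemaStore` names are assumptions for this example and are not taken from the patch.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Fingerprint a list of (name, type) pairs independent of column order."""
    # Sort lexicographically so that column order does not affect the hash.
    canonical = json.dumps(sorted(columns), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SchemaStore:
    """Hypothetical in-memory stand-in for the schema table in the metadata database."""

    def __init__(self):
        self._by_fingerprint = {}

    def publish(self, columns):
        fingerprint = schema_fingerprint(columns)
        # Only the first observed column order is stored per fingerprint.
        self._by_fingerprint.setdefault(fingerprint, list(columns))
        return fingerprint

    def get(self, fingerprint):
        return self._by_fingerprint[fingerprint]

    def __len__(self):
        return len(self._by_fingerprint)
```

With this scheme, two segments that list the same columns in a different order map to one stored schema, and the stored entry keeps whichever order was published first.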

findingrish · 2024-10-04T15:42:16Z

Completed!

findingrish added Design Review Proposal labels Sep 14, 2023

findingrish mentioned this issue Sep 22, 2023

Relocating Table Schema Building: Shifting from Brokers to Coordinator for Improved Efficiency #14985

Merged

10 tasks

findingrish changed the title ~~Centralize table schema management in Coordinator~~ Centralize datasource schema management in Coordinator Dec 4, 2023

This was referenced Feb 1, 2024

Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building findingrish/druid#4

Closed

Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building #15817

Merged

This was referenced Jul 9, 2024

Enable querying entirely cold datasources #16676

Merged

Followup changes to 15817 (Segment schema publishing and polling) #16368

Merged

findingrish mentioned this issue Aug 20, 2024

Filter out tombstone segments from metadata cache #16890

Merged

10 tasks

findingrish mentioned this issue Sep 4, 2024

Log a small subset of segments to refresh for debugging Coordinator refresh logic #16998

Merged

findingrish closed this as completed Oct 4, 2024
