Centralize datasource schema management in Coordinator #14989

Closed
findingrish opened this issue Sep 14, 2023 · 2 comments

findingrish commented Sep 14, 2023

Motivation

Original proposal: https://github.com/abhishekagarwal87/druid/blob/metadata_design_proposal/design-proposal.md#3-proposed-solution-storing-segment-schema-in-metadata-store

In summary, the current approach to constructing table schemas, in which brokers query data nodes and tasks for segment schemas, has several limitations and operational challenges. These include slow broker startup, excessive communication in the system, schema rebuilding on broker startup, and the lack of a single schema owner. It also has functional limitations, such as the inability to query data from deep storage.

The proposed solution is to centralize schema management within the coordinator. Tasks publish their schemas to the metadata database, along with segment row count information. The coordinator can then build the table schema for a datasource by combining the schemas of its individual segments.
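
As a rough illustration of that last step, the sketch below merges per-segment schemas into one datasource schema and sums the row counts. The names `SegmentSchema` and `merge_datasource_schema`, the sample segment IDs, and the rule that the first-seen type wins on a conflict are all assumptions made for this example; they are not Druid's classes or its actual merge policy.

```python
from dataclasses import dataclass


@dataclass
class SegmentSchema:
    """Hypothetical record a task might publish for one segment."""
    segment_id: str
    columns: dict  # column name -> type name, in the segment's own order
    num_rows: int


def merge_datasource_schema(segments):
    """Combine segment schemas into (columns, total_rows).

    Assumption: if two segments disagree on a column's type, the type
    from the segment seen first is kept.
    """
    merged = {}
    total_rows = 0
    for seg in segments:
        total_rows += seg.num_rows
        for name, type_name in seg.columns.items():
            merged.setdefault(name, type_name)
    return merged, total_rows


segments = [
    SegmentSchema("wiki_2023-09-14", {"__time": "LONG", "page": "STRING"}, 100),
    SegmentSchema("wiki_2023-09-13", {"__time": "LONG", "added": "LONG"}, 50),
]
columns, rows = merge_datasource_schema(segments)
```

Here the merged schema is the union of the columns of both segments, and the row count is the sum over segments.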

Design

Changes are required in tasks, coordinator and broker.
Detailed design in individual PRs.

Phases

The first phase is to move existing schema building functionality from the brokers to the coordinator and allow the broker to query schema from the coordinator, while retaining the capability to build table schema if the need arises.
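
The broker side of this phase could look roughly like the following: ask the coordinator first, and build the schema locally only if that fails. The function name `resolve_table_schema` and the callables passed to it are hypothetical stand-ins for illustration, not Druid APIs.

```python
def resolve_table_schema(datasource, fetch_from_coordinator, build_locally):
    """Return (schema, source), preferring the coordinator.

    Both callables are hypothetical: one stands in for a call to the
    coordinator, the other for the broker's existing schema building.
    """
    try:
        schema = fetch_from_coordinator(datasource)
    except ConnectionError:
        schema = None
    if schema is None:
        # Fall back to the broker's own schema building.
        return build_locally(datasource), "broker"
    return schema, "coordinator"


def coordinator_up(datasource):
    return {"__time": "LONG", "page": "STRING"}


def coordinator_down(datasource):
    raise ConnectionError("coordinator unreachable")


def local_build(datasource):
    return {"__time": "LONG", "page": "STRING"}


from_coordinator = resolve_table_schema("wiki", coordinator_up, local_build)
from_fallback = resolve_table_schema("wiki", coordinator_down, local_build)
```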

The next step is to have the coordinator publish segment schema in the background to reduce the volume of segment metadata queries during coordinator startup.

In parallel, tasks should be updated to publish their schema in the database, eventually eliminating the need to query segment schema directly from data nodes and tasks.

Changes are also required to fetch and publish schema for cold tier segments. This can be done in the Coordinator.

Future work involves serving system table queries from the Coordinator.

findingrish commented Sep 14, 2023

PR to move schema building functionality to the Coordinator: #14985

findingrish changed the title from "Centralize table schema management in Coordinator" to "Centralize datasource schema management in Coordinator" on Dec 4, 2023
abhishekagarwal87 pushed a commit that referenced this issue Dec 6, 2023
…er (#15496)

Description
With the CentralizedDatasourceSchema (#14989) feature enabled, metadata for appended segments was not being refreshed. This caused numRows to be 0 for the new segments and would probably cause the datasource schema to omit columns from the new segments.

Analysis
The problem turned out to be in the new QuerySegmentWalker implementation in the Coordinator. It first finds the segments to be queried in the Coordinator timeline, then creates a new timeline from those segments.
The lookup in the new timeline only returned complete partition sets. Since the appended segments by themselves do not make a complete partition set, no SegmentMetadataQuery was executed for them.
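
The effect can be reproduced with a deliberately simplified model of partition completeness. Everything below is an assumption made for illustration: the `lookup` and `is_complete` helpers, the idea that an interval is complete once partitions `0..n-1` are present, and the sample interval are not Druid's timeline implementation.

```python
def is_complete(partition_nums, num_core_partitions):
    """Assumed rule: complete when every core partition 0..n-1 is present."""
    return set(range(num_core_partitions)).issubset(partition_nums)


def lookup(timeline, num_core_partitions, require_complete):
    """Return (interval, partition) pairs from a toy timeline.

    timeline maps an interval label to the partition numbers it holds.
    """
    found = []
    for interval, partition_nums in sorted(timeline.items()):
        complete = is_complete(partition_nums, num_core_partitions[interval])
        if require_complete and not complete:
            continue  # the whole interval is skipped
        found.extend((interval, p) for p in sorted(partition_nums))
    return found


# The interval has two core partitions (0 and 1); partitions 2 and 3 were
# appended later and are the only ones placed in the new timeline.
num_core_partitions = {"2023-12-01": 2}
appended_only = {"2023-12-01": [2, 3]}

skipped = lookup(appended_only, num_core_partitions, require_complete=True)
refreshed = lookup(appended_only, num_core_partitions, require_complete=False)
```

With `require_complete=True` the appended segments are never returned, so nothing would be queried for them. The second call shows what a lookup that also accepts incomplete partition sets would return; whether the actual fix takes this form is not stated here.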
cryptoe pushed a commit that referenced this issue Apr 24, 2024
…rce Schema Building (#15817)

Issue: #14989

The initial step in optimizing segment metadata was to centralize the construction of datasource schema in the Coordinator (#14985). Thereafter, we addressed the problem of publishing schema for realtime segments (#15475). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information.

This is the final change, which involves publishing segment schema for finalized segments from tasks and periodically polling them in the Coordinator.
abhishekagarwal87 pushed a commit that referenced this issue Sep 18, 2024
Parent issue: #14989

It is possible for the order of columns to vary across segments, especially during realtime ingestion.
Since the schema fingerprint is sensitive to column order, this leads to the creation of a large number of segment schemas in the metadata database for essentially the same set of columns.

This is wasteful. This patch fixes the problem by computing the schema fingerprint on lexicographically sorted columns, which results in a single schema in the metadata database, stored with the first observed column order for a given signature.
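
A minimal sketch of an order-insensitive fingerprint, assuming a schema is a list of (name, type) pairs. The use of SHA-256, the byte delimiters, and the `schema_fingerprint` and `publish` helpers are assumptions for this example; the hash and storage layout Druid actually uses are not described in this issue.

```python
import hashlib


def schema_fingerprint(columns):
    """Hash (name, type) pairs in lexicographic order of the pairs.

    SHA-256 and the delimiter bytes are assumptions for this sketch.
    """
    digest = hashlib.sha256()
    for name, type_name in sorted(columns):
        digest.update(name.encode("utf-8") + b"\x00")
        digest.update(type_name.encode("utf-8") + b"\x01")
    return digest.hexdigest()


def publish(store, columns):
    """Store the schema under its fingerprint, keeping the first order seen."""
    fingerprint = schema_fingerprint(columns)
    store.setdefault(fingerprint, list(columns))
    return fingerprint


store = {}
first = publish(store, [("__time", "LONG"), ("page", "STRING"), ("added", "LONG")])
second = publish(store, [("__time", "LONG"), ("added", "LONG"), ("page", "STRING")])
```

Both calls produce the same fingerprint, so `store` ends up with a single entry that keeps the column order of the first call.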

findingrish commented

Completed!
