Eliminate Periodic Realtime Segment Metadata Queries: Tasks Now Publish Schema for Seamless Coordinator Updates #5
Description
Issue: apache#14989
The initial step in optimizing segment metadata was to centralize the construction of table schema in the Coordinator (apache#14985). Subsequently, our goal is to eliminate the requirement for regularly executing queries to obtain segment schema information. This task encompasses addressing both realtime and finalized segments.
This modification specifically addresses the issue with realtime segments. Tasks will now routinely communicate the schema for realtime segments during the segment announcement process. The Coordinator will identify the schema alongside the segment announcement and subsequently update the schema for realtime segments in the metadata cache.
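The flow described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class names `SinkSchemaTracker` and `RealtimeSchemaCache`, the use of plain column-to-type maps, and the string segment IDs are all hypothetical stand-ins chosen for illustration. It only shows the general idea of a task announcing schema changes and a Coordinator-side cache applying them without issuing a query.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of schema announcement; these are not Druid's actual classes. */
public class SchemaAnnouncementSketch {

    /** Task side (assumed design): remembers what was last announced and emits only changes. */
    static class SinkSchemaTracker {
        private final Map<String, Map<String, String>> lastAnnounced = new HashMap<>();

        /** Returns the segments whose column-to-type map differs from the last announcement. */
        Map<String, Map<String, String>> computeChanges(Map<String, Map<String, String>> current) {
            Map<String, Map<String, String>> changes = new LinkedHashMap<>();
            for (Map.Entry<String, Map<String, String>> e : current.entrySet()) {
                if (!Objects.equals(lastAnnounced.get(e.getKey()), e.getValue())) {
                    // Store an immutable copy so later caller mutations cannot leak in.
                    changes.put(e.getKey(), Map.copyOf(e.getValue()));
                }
            }
            // Forget segments that are no longer present (e.g. their sink was dropped).
            lastAnnounced.keySet().retainAll(current.keySet());
            lastAnnounced.putAll(changes);
            return changes;
        }
    }

    /** Coordinator side (assumed design): applies announced schemas instead of querying tasks. */
    static class RealtimeSchemaCache {
        private final Map<String, Map<String, String>> schemas = new HashMap<>();

        void apply(Map<String, Map<String, String>> changes) {
            schemas.putAll(changes);
        }

        Map<String, String> schemaFor(String segmentId) {
            return schemas.get(segmentId);
        }
    }

    public static void main(String[] args) {
        SinkSchemaTracker tracker = new SinkSchemaTracker();
        RealtimeSchemaCache cache = new RealtimeSchemaCache();

        Map<String, Map<String, String>> sinks = new HashMap<>();
        sinks.put("seg1", Map.of("__time", "LONG", "page", "STRING"));

        // First announcement carries the full schema of the new segment.
        cache.apply(tracker.computeChanges(sinks));
        System.out.println("columns after first announcement: " + cache.schemaFor("seg1").size());

        // Nothing changed, so nothing needs to be announced.
        System.out.println("changes when schema is unchanged: " + tracker.computeChanges(sinks).size());

        // A new column appears in the sink; only then is the segment announced again.
        sinks.put("seg1", Map.of("__time", "LONG", "page", "STRING", "added", "LONG"));
        cache.apply(tracker.computeChanges(sinks));
        System.out.println("columns after schema change: " + cache.schemaFor("seg1").size());
    }
}
```

The sketch re-announces the whole schema of a changed segment for simplicity; whether the actual implementation sends full schemas or per-column deltas is not stated in this description.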
Design
Task
- StreamAppenderator.SinkSchemaAnnouncer will compute sink schema changes and announce them to the DataSegmentAnnouncer.
- DataSegmentAnnouncer has been changed to receive sink schema information and manage schema cleanup when a task is closed.
- SegmentSchemas has been added to facilitate the passing of schema information for multiple segments.
- A new DataSegmentChangeRequest has been introduced, named SegmentSchemasChangeRequest.
Coordinator
- HttpServerInventoryView has been changed to handle schema information.
- The CoordinatorSegmentMetadata cache has been updated to incorporate schema changes. Changes have also been made to the refresh logic to eliminate the need for executing segment metadata queries for realtime segments.
Testing
Potential side effects
TBA
Limitations
Currently, this feature does not work with ZooKeeper-based segment announcement.
Upgrade considerations
The general upgrade order should be followed. The new code is behind a feature flag, so it is compatible with existing setups. If only centralized table schema building (apache#14985) is enabled, without the new flag, realtime segments will still be refreshed using segment metadata queries to the Indexer/Task.
Release notes
This experimental feature aims to eliminate the necessity for periodically executing the SegmentMetadataQuery to the Indexer/Task for retrieving the schema of realtime segments. Presently, it is accessible through two feature flags and should only be enabled for Proof of Concept (PoC) or testing purposes. To activate it, configure the following settings in the common configurations:
druid.coordinator.centralizedTableSchema.enabled and druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema. Note that the feature flag druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema is temporary and will be removed in a subsequent update.
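As a sketch, the two flags named above might appear as follows in the common runtime properties. The property names come from this description; the `true` values and the file name in the comment are assumptions, since the description only says the settings must be configured.

```properties
# common.runtime.properties (assumed location for "common configurations")
# Enables centralized table schema building on the Coordinator (apache#14985).
druid.coordinator.centralizedTableSchema.enabled=true
# Temporary flag: tasks announce realtime segment schemas to the Coordinator.
druid.coordinator.centralizedTableSchema.announceRealtimeSegmentSchema=true
```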