Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building #4

Closed
wants to merge 114 commits into from

@findingrish (Owner) commented on Nov 6, 2023

Description

Issue: apache#14989

The first step in optimizing segment metadata was to centralize construction of the datasource schema in the Coordinator (apache#14985). Next, we addressed publishing the schema for realtime segments (apache#15475). The remaining goal is to eliminate the need to execute queries regularly to obtain segment schema information.

This is the final change: tasks publish the segment schema for finalized segments, and the Coordinator polls the published schemas periodically.
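
The polling side of this flow can be illustrated with a minimal sketch. It is not the code in this PR: the class and method names (`SchemaPollSketch`, `SchemaRow`, `pollOnce`) are hypothetical, the metadata store is replaced by an in-memory list, and `created_date` is modeled as a `java.time.Instant` although the design below stores it as a varchar. The assumption is that each poll reads only schemas created at or after the last timestamp seen, and that caching by `id` makes re-reading boundary rows harmless.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of incremental schema polling; not the PR's implementation.
final class SchemaPollSketch {

    // Stand-in for one row of the schema table (payload omitted for brevity).
    static final class SchemaRow {
        final long id;
        final Instant createdDate;
        final String fingerprint;

        SchemaRow(long id, Instant createdDate, String fingerprint) {
            this.id = id;
            this.createdDate = createdDate;
            this.fingerprint = fingerprint;
        }
    }

    private final List<SchemaRow> table;                 // stand-in for the metadata store
    private final Map<Long, SchemaRow> cache = new HashMap<>();
    private Instant watermark = Instant.EPOCH;           // latest created_date seen so far

    SchemaPollSketch(List<SchemaRow> table) {
        this.table = table;
    }

    // Runs one poll cycle and returns how many schemas were newly cached.
    int pollOnce() {
        int added = 0;
        Instant maxSeen = watermark;
        for (SchemaRow row : table) {
            // Filter: skip rows created strictly before the watermark.
            if (row.createdDate.isBefore(watermark)) {
                continue;
            }
            // Rows at the boundary may be read again; putIfAbsent ignores them.
            if (cache.putIfAbsent(row.id, row) == null) {
                added++;
            }
            if (row.createdDate.isAfter(maxSeen)) {
                maxSeen = row.createdDate;
            }
        }
        // Advance the watermark only after the whole batch has been scanned.
        watermark = maxSeen;
        return added;
    }

    int cachedCount() {
        return cache.size();
    }
}
```

In a real Coordinator, `pollOnce` would presumably run on a scheduled executor and issue a filtered query against the metadata store rather than scan a list.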

Design

Database

Schema Table

Table Name: SegmentSchema
Purpose: Store each unique segment schema.

Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| id | autoincrement | primary key |
| created_date | varchar | creation time, allows filtering schema created after a point |
| fingerprint | varchar | unique identifier for the schema |
| payload | blob | includes rowSignature, aggregatorFactories |
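
The PR does not say how the fingerprint is derived. As a minimal sketch, one way to obtain a stable identifier is to hash a canonical rendering of the payload. Everything here is an assumption made for illustration: the class `SchemaFingerprintSketch`, the use of SHA-256, the representation of the row signature as an ordered map of column name to type, and the choice to treat column order as significant.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the hashing scheme and all names are assumptions, not the PR's code.
final class SchemaFingerprintSketch {

    // Renders the schema as one string so that equal schemas produce equal bytes.
    // Column order is preserved as given, so callers should pass an ordered map.
    static String canonicalForm(Map<String, String> rowSignature, List<String> aggregatorFactories) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> column : rowSignature.entrySet()) {
            sb.append(column.getKey()).append(':').append(column.getValue()).append(';');
        }
        sb.append('|'); // separates columns from aggregators
        for (String aggregator : aggregatorFactories) {
            sb.append(aggregator).append(';');
        }
        return sb.toString();
    }

    // Returns the SHA-256 digest of the canonical form as 64 lowercase hex characters.
    static String fingerprint(Map<String, String> rowSignature, List<String> aggregatorFactories) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(
                canonicalForm(rowSignature, aggregatorFactories).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Two segments with the same columns and aggregators then map to the same fingerprint, which is what lets the table keep a single row per distinct schema.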

Segments Table

New columns will be added to the existing Segments table.

Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| num_rows | long | number of rows in the segment |
| schema_id | long | foreign key, references id in the schema table |
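
To make the relationship between the two tables concrete, here is a small in-memory model. It is a sketch under stated assumptions, not the PR's persistence code: `SchemaStoreSketch` and its methods are hypothetical, plain maps stand in for the tables, and a counter mimics the autoincrement `id`. The point it illustrates is that a schema row is inserted only when its fingerprint is new, while every segment row records `schema_id` and `num_rows`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the schema and segments tables; not the PR's code.
final class SchemaStoreSketch {

    // fingerprint -> schema id, standing in for the unique fingerprint column
    private final Map<String, Long> schemaIdByFingerprint = new HashMap<>();
    // segment id -> {schema_id, num_rows}, standing in for the new segment columns
    private final Map<String, long[]> segments = new HashMap<>();
    private long nextSchemaId = 1; // mimics an autoincrement primary key

    // Records a segment and returns the schema id it references.
    long publishSegment(String segmentId, String fingerprint, long numRows) {
        Long existing = schemaIdByFingerprint.get(fingerprint);
        long schemaId;
        if (existing == null) {
            // First time this schema is seen: insert a new schema row.
            schemaId = nextSchemaId++;
            schemaIdByFingerprint.put(fingerprint, schemaId);
        } else {
            // Schema already stored: reuse its id instead of duplicating the payload.
            schemaId = existing;
        }
        segments.put(segmentId, new long[] {schemaId, numRows});
        return schemaId;
    }

    int schemaCount() {
        return schemaIdByFingerprint.size();
    }

    long schemaIdOf(String segmentId) {
        return segments.get(segmentId)[0];
    }

    long numRowsOf(String segmentId) {
        return segments.get(segmentId)[1];
    }
}
```

With this layout, many segments of the same datasource can share one schema row, so the stored payloads grow with the number of distinct schemas rather than the number of segments.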

Task

The task is changed to publish the segment schema along with the segment metadata.

Streaming

  • Changes in StreamAppenderator to get the RowSignature, AggregatorFactories and numRows for the segment.

Batch

TBA

MSQ

TBA

Coordinator

Schema Poll

Schema Caching

SegmentMetadataCache changes

Schema Cleanup

Testing

TBA

Potential side effects

TBA

Limitations

TBA

Upgrade considerations

TBA

@findingrish changed the title from "Coordinator schema read write" to "Introduce Segment Schema Publishing and Polling for Efficient Datasource Schema Building" on Feb 1, 2024
@findingrish closed this on Feb 1, 2024
andrisnoko pushed a commit to andrisnoko/druid that referenced this pull request Jul 17, 2024
add empty response and GrpcResponseHandler tests