Add "is_overshadowed" column to sys.segments table #7233
Comments
Hmm, maybe |
It might be fine to call them "not published" especially if we define it right (like, "any segments that have used = true in the MD store and are not overshadowed"). I'm thinking that is the most useful definition for what this column is likely to be used for: finding all segments active in the metadata store. (Overshadowed ones aren't really active since they don't do anything) |
Do you mean adding a definition to the docs? I'm not sure people would check it before using the system schema. |
Well, it shouldn't be called just |
I think (2) makes sense, since as this proposal points out, (1) is unintuitive when doing an overwrite of a set of segments. The old set and new set will both have
This is sorta like Maybe what is happening here is that we want versions of both |
Hmm, ok. It looks like there is some misunderstanding and confusion. First of all, I think it makes sense to keep I also have some questions about the meaning of "published segments".
I think this means any segments in the
This isn't exactly the same as the meaning of
Yeah, if you're thinking of the lag in brokers until they refresh their cache, it might not be "currently" being queried. |
The statement below is correct:
the
The doc here says above, but I think it's incomplete, as it omits the part about @jihoonson Did you mean to suggest that the |
@surekhasaharan could you please clarify the motivation "Showing overshadowed segments as part of published segments with is_published=true may not be correct behavior" - showing where and by whom?
There is already an |
Sorry that it's not clear, will edit the issue to clarify. What I meant to say is |
@leventov oh, I missed it's already there. Thanks. I talked with @gianm and @surekhasaharan offline and now it's more clear to me. Here is the current behavior.
The issue here is that neither column considers overshadowing, which makes debugging harder. Probably, in many use cases, we want to know which segments are not loaded yet and why. To figure this out, we need to know which segments are published and not fully overshadowed, considering only other segments in the metadata store. This information is not being served yet, but it looks worth adding a new column. We came up with a name of |
This suggested terminology sounds reasonable to me - in summary, not changing the meaning of |
Is this right that Or not, since this message
Contains the verb "stores"? |
I don't like |
Yes this seems right, I guess "contains" conveys the meaning better than "stores" ? Not sure on which verb to use. |
We were trying to make it easier for the user to get active segments just by writing
That's the plan. |
@leventov I agree that Regarding |
Any particular reason you’re thinking it makes sense to define I’m thinking “overshadowed by some published segments” would be more useful in a debugging scenario. It allows you to write |
Perhaps Regarding the semantics of Maybe we need both:
|
I think this statement is a bit incorrect because segments which are currently being generated by stream ingestion tasks should also be able to be queried but |
Yes. Also, the part "can" is wrong, because |
Thanks everyone for the discussion and your inputs on this. I am thinking of adding a column |
I have updated the issue with proposal components, which describe the way I am thinking of implementing |
@surekhasaharan Can you please add the definition of
I am ok with leaving |
thanks @gianm for articulating the definition of |
The current proposal (add |
@surekhasaharan what do you think about "includeOvershadowedStatus", "SegmentWithOvershadowedStatus" instead of "overshadowInfo"? |
@leventov |
@surekhasaharan then the proposal looks good to me too. If you plan to rename "used segments" to "published segments" across the codebase, I think it would be easier to do this after #7306 is merged, see #7306 (comment). |
Closed by #7425 |
Description
The sys.segments table retrieves published segments from the coordinator (segments which are in the metadata store with used=1), and marks all those segments with is_published=true. The published segments can include overshadowed segments if compaction is underway. Currently, the sys.segments table gets all segments from the coordinator where used=1. Some of these may be overshadowed yet still have used=1 in the coordinator, because there can be a lag before used=0 is set for overshadowed segments, at which point they are no longer returned by the coordinator. If those overshadowed segments are not handled properly, the sys.segments table may show dangling segments for a while.

Motivation
The sys.segments table contains overshadowed segments with their is_published column set to true, which may not be the correct behavior because those overshadowed segments are going to be marked used=0 once compaction finishes. Until then, sys.segments queries would show phantom segments for some time. A better behavior for sys.segments is to add another column called is_overshadowed, which marks the overshadowed segments with is_overshadowed=true in the query results until they are removed completely from the table. Users can then get the list of segments which should be available by querying is_published && !is_overshadowed.
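As a rough illustration of the filter described above, here is a minimal Python sketch that applies the same condition to a few made-up rows. The row values and the helper name should_be_available are hypothetical; the SQL string only restates the filter from the proposal and assumes the proposed is_overshadowed column exists.

```python
# Hypothetical rows shaped like sys.segments results; the values are made up.
rows = [
    {"segment_id": "wiki_2019-01-01_v1", "is_published": 1, "is_overshadowed": 1},
    {"segment_id": "wiki_2019-01-01_v2", "is_published": 1, "is_overshadowed": 0},
    {"segment_id": "wiki_2019-01-02_v1", "is_published": 0, "is_overshadowed": 0},
]

# The same filter expressed as SQL against sys.segments, assuming the
# proposed is_overshadowed column has been added.
QUERY = (
    "SELECT segment_id FROM sys.segments "
    "WHERE is_published = 1 AND is_overshadowed = 0"
)


def should_be_available(row):
    """Hypothetical helper: published and not overshadowed."""
    return row["is_published"] == 1 and row["is_overshadowed"] == 0


active = [r["segment_id"] for r in rows if should_be_available(r)]
print(active)
```

Only the second row passes: the first is published but overshadowed by the newer version, and the third is not published.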
Proposed changes
A new column, is_overshadowed, will be added to the sys.segments virtual table.
A new query parameter, includeOvershadowedStatus, will be added to the coordinator API /druid/coordinator/v1/metadata/segments. So the proposed API is GET /druid/coordinator/v1/metadata/segments?includeOvershadowedStatus
The is_overshadowed flag for a segment is set to true if the given segment is published and is completely overshadowed by some other published segment. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for is_published = 1 AND is_overshadowed = 0. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet.

Without the includeOvershadowedStatus query param, the API returns a stream of DataSegment objects; with this param, it will return a stream of DataSegment plus overshadow flag.

It would be a new data structure, something like SegmentWithOvershadowedStatus, with 2 properties:

The MetadataSegmentView would store SegmentWithOvershadowedStatus objects in publishedSegments when cacheEnabled is true; otherwise it will pass along the stream of SegmentWithOvershadowedStatus to SystemSchema, where the is_overshadowed column of sys.segments would be filled with the boolean value per segment.

Rationale
The current approach seems to have no or minimal impact on broker memory, as the overshadow info per segment is retrieved in the coordinator. The other approach considered was to create a VersionedIntervalTimeline in the broker, but it was discarded due to the additional memory overhead it would introduce in the broker to maintain a timeline for all published segments.

Operational impact
No operational impact I can think of.
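
For reference, the idea of pairing each DataSegment with an overshadowed flag on the coordinator side can be sketched roughly as follows. This is a simplified Python illustration, not the actual implementation. The Segment class is a hypothetical stand-in for DataSegment, intervals are plain integers, and a segment counts as overshadowed only when a single newer-version segment of the same datasource covers its whole interval. A real timeline would also handle partitions and coverage by several segments together.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for DataSegment with only the fields needed here."""
    datasource: str
    start: int    # interval start (inclusive), arbitrary integer time units
    end: int      # interval end (exclusive)
    version: str  # compared as a string; a larger value means newer


@dataclass(frozen=True)
class SegmentWithOvershadowedStatus:
    """Pairs a segment with its overshadowed flag, as named in the proposal."""
    segment: Segment
    overshadowed: bool


def is_overshadowed(seg: Segment, published: List[Segment]) -> bool:
    # Simplifying assumption: one newer segment must cover the whole interval.
    return any(
        other.datasource == seg.datasource
        and other.version > seg.version
        and other.start <= seg.start
        and other.end >= seg.end
        for other in published
    )


def with_overshadowed_status(
    published: List[Segment],
) -> List[SegmentWithOvershadowedStatus]:
    return [
        SegmentWithOvershadowedStatus(s, is_overshadowed(s, published))
        for s in published
    ]


published = [
    Segment("wiki", 0, 10, "v1"),
    Segment("wiki", 0, 10, "v2"),
    Segment("wiki", 10, 20, "v1"),
]
result = with_overshadowed_status(published)
flags = [r.overshadowed for r in result]
print(flags)
```

Computing the flag once where the full set of published segments is already known is what lets the broker simply copy the boolean into is_overshadowed without maintaining its own timeline.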