Feature/16361 permits to control kill task lock type #16362
Conversation
Idea looks good to me @IgorBerman.
Thanks for the comments @AmatyaAvadhanula and @abhishekrb19.
Force-pushed from 30e4eb4 to 6ed5022.
@AmatyaAvadhanula I think I've addressed all your suggestions in #16362 (comment).
```java
TaskLockType actualLockType = getContextValue(Tasks.TASK_LOCK_TYPE, defaultLockType);

if (markAsUnused && actualLockType != TaskLockType.EXCLUSIVE) {
```
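To make the lock-selection logic under discussion concrete, here is a minimal, self-contained sketch of resolving the task lock type from a context map with a default that depends on whether concurrent locks are enabled. The enum and the `resolveLockType` helper are simplified stand-ins for the Druid classes, not the actual implementation; the `"taskLockType"` context key mirrors `Tasks.TASK_LOCK_TYPE`.

```java
import java.util.Map;

public class LockTypeDemo
{
  // Simplified mirror of Druid's TaskLockType enum, for illustration only.
  enum TaskLockType { EXCLUSIVE, SHARED, REPLACE, APPEND }

  // Pick the lock type from the task context, falling back to a default that
  // is REPLACE when concurrent locks are in use and EXCLUSIVE otherwise.
  static TaskLockType resolveLockType(Map<String, Object> context, boolean useConcurrentLocks)
  {
    final TaskLockType defaultLockType =
        useConcurrentLocks ? TaskLockType.REPLACE : TaskLockType.EXCLUSIVE;
    final Object value = context.get("taskLockType");
    return value == null ? defaultLockType : TaskLockType.valueOf(value.toString());
  }

  public static void main(String[] args)
  {
    System.out.println(resolveLockType(Map.of(), true));                          // REPLACE
    System.out.println(resolveLockType(Map.of(), false));                         // EXCLUSIVE
    System.out.println(resolveLockType(Map.of("taskLockType", "APPEND"), true));  // APPEND
  }
}
```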
`markAsUnused` is undocumented and has been deprecated for a few releases since Druid 28. Instead of expanding on the legacy flag, we can target removing it in Druid 31. It should simplify this code and related testing a bit. cc: @kfaraz
@IgorBerman, would you be interested in doing that in a separate PR?
You mean a separate PR for removing this flag completely? I can try that when 31 is around the corner.
Yeah, we can remove the deprecated `markAsUnused` parameter from the kill task. IMO, it'd be best if we handle that separately first, since this feature expands on the deprecated parameter when it shouldn't be used or needed.
Alternatively, as part of this patch, we can add a validation check in the kill task constructor, similar to this. The `isReady()` method here can then assume that `markAsUnused` is unset or false, so we'd have fewer branches to determine which lock to use.
The branch for Druid 30 is already cut here: https://github.com/apache/druid/tree/30.0.0. So any changes merged to master from now on will automatically be targeted for Druid 31. Let us know what you think.
Ok, sure. So I'm removing `markAsUnused` and any references to it. Hopefully I'll find all of them.
Ok, hopefully removed in all relevant places.
```java
return isReady(taskActionClient, TaskLockType.EXCLUSIVE);
}

protected boolean isReady(TaskActionClient taskActionClient, TaskLockType taskLockType) throws Exception
```
Can we leave the base implementation as it is, without adding this extra protected method? The logic is simple enough for child class implementations to fully override `isReady()` without calling `super.isReady()`.
@abhishekrb19 thanks for the suggestion. Wdyt about the following: the parent's `isReady()` uses the private `interval` field, so either we convert it to protected or keep some base method. Do you think that would be better?
@IgorBerman, I see. You can just call the public getter `getInterval()` from the base class, which is already used by the kill task.
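The shape being suggested — the child class fully overriding `isReady()` and reading the interval through the public getter, so the base class needs no extra protected helper — can be sketched as follows. All class and method names here are simplified stand-ins for the Druid classes (`AbstractFixedIntervalTask`, `KillUnusedSegmentsTask`), and `tryLock` is a placeholder for the task-action round trip that actually acquires the lock.

```java
public class IsReadyDemo
{
  enum TaskLockType { EXCLUSIVE, REPLACE }

  static class BaseTask
  {
    private final String interval; // private, as in the discussion above

    BaseTask(String interval) { this.interval = interval; }

    public String getInterval() { return interval; }

    // Base behavior: always request an EXCLUSIVE lock.
    public boolean isReady() { return tryLock(TaskLockType.EXCLUSIVE, getInterval()); }

    // Placeholder for the lock-acquisition task action.
    static boolean tryLock(TaskLockType type, String interval) { return interval != null; }
  }

  static class KillTask extends BaseTask
  {
    private final TaskLockType lockType;

    KillTask(String interval, TaskLockType lockType)
    {
      super(interval);
      this.lockType = lockType;
    }

    // Full override: no call to super.isReady(), interval read via the getter.
    @Override
    public boolean isReady() { return tryLock(lockType, getInterval()); }
  }

  public static void main(String[] args)
  {
    System.out.println(new KillTask("2024-01-01/2024-02-01", TaskLockType.REPLACE).isReady());
  }
}
```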
updated
Force-pushed from 1992742 to 05014be.
I had some problems with squashing all commits; I had a wrong merge-from-master commit.
Changes look good to me.
I guess it makes sense to try out the new lock types for `kill` tasks. But we must be cautious while using these in production as they may have unforeseen effects (see next section).
The PR description needs a release note which explains that:
- The `markAsUnused` parameter has been removed from `kill` tasks.
- `kill` tasks may use different lock types such as `APPEND`, `REPLACE`, but this is only experimental and it may have unforeseen effects in a production environment.

Future work
Going ahead, in order to determine the best lock type for `kill`, we would need to do a proper evaluation of the interplay of these actions:
- `markAsUsed`, `markAsUnused` API
- `kill` task
- `REPLACE` ingestion task
- `APPEND` ingestion task
There are some concerns here:
1. API `markAsUsed`/`markAsUnused` MUST NOT happen concurrently with a `kill` task (otherwise, we might kill a segment that the user now wants to retain). I don't think this check currently exists.
2. API `markAsUsed`/`markAsUnused` MUST NOT happen concurrently with an `APPEND` job (otherwise, the allocated segment IDs could be all over the place and we could have data loss).
3. API `markAsUsed`/`markAsUnused` CAN happen concurrently with a `REPLACE` job. (Thanks to the `druid_upgradeSegments` metadata table, the `REPLACE` job knows that it needs to upgrade only the entries present in this table. I don't think there is any other aspect of a `REPLACE` job which cares about other used segments present in the system.)
4. `kill` should PROBABLY NOT happen concurrently with a `REPLACE` job, in case the `REPLACE` job wants to upgrade a segment that the `kill` is trying to get rid of. Having the auto-kill buffer period mostly safeguards against such situations, but it is still a possibility.
5. `kill` CAN happen concurrently with an `APPEND` job, because the append job itself has the information of all the segments that it needs to commit. Deleting any other segments from the system should not have any effect on it (except maybe killing segments of the version to which we are trying to append data, but that is again safeguarded by the auto-kill buffer period).
A point to note is that the buffer period applies only to auto-kill triggered by the coordinator and not to kill tasks submitted by the user from the web-console.
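The five concurrency rules above can be encoded as a small symmetric compatibility check, which makes the matrix easy to eyeball. This is one reading of the discussion, not Druid's actual lock logic; the "PROBABLY NOT" case is conservatively treated as incompatible, and the `Op` names are illustrative.

```java
import java.util.Set;

public class ConcurrencyRulesDemo
{
  enum Op { MARK_API, KILL, REPLACE, APPEND }

  // Symmetric check over an unordered pair of distinct operations.
  static boolean mayRunConcurrently(Op a, Op b)
  {
    Set<Op> pair = Set.of(a, b);
    if (pair.equals(Set.of(Op.MARK_API, Op.KILL)))    return false; // point 1
    if (pair.equals(Set.of(Op.MARK_API, Op.APPEND)))  return false; // point 2
    if (pair.equals(Set.of(Op.MARK_API, Op.REPLACE))) return true;  // point 3
    if (pair.equals(Set.of(Op.KILL, Op.REPLACE)))     return false; // point 4 (conservative)
    if (pair.equals(Set.of(Op.KILL, Op.APPEND)))      return true;  // point 5
    return true; // pairs not discussed in the comment above
  }

  public static void main(String[] args)
  {
    System.out.println(mayRunConcurrently(Op.MARK_API, Op.KILL)); // false
    System.out.println(mayRunConcurrently(Op.KILL, Op.APPEND));   // true
    System.out.println(mayRunConcurrently(Op.REPLACE, Op.KILL));  // false
  }
}
```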
```diff
@@ -880,7 +846,7 @@ public void testKillBatchSizeThree() throws Exception
     Assert.assertEquals(Collections.emptyList(), observedUnusedSegments);
     Assert.assertEquals(
-        new KillTaskReport.Stats(4, 3, 4),
+        new KillTaskReport.Stats(4, 3, 0),
```
The field `numSegmentsMarkedAsUnused` should be completely removed from the `KillTaskReport.Stats` class rather than passing 0.
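The suggested change — dropping the deprecated counter from the stats holder entirely instead of reporting 0 — could look roughly like the sketch below. The remaining field names follow the diff above, but the class shape (and the `equals`/`hashCode` pair, which becomes relevant to the coverage discussion later in this thread) is an illustrative simplification, not the actual Druid class.

```java
import java.util.Objects;

public class StatsDemo
{
  // Immutable stats holder with numSegmentsMarkedAsUnused removed.
  static final class Stats
  {
    private final int numSegmentsKilled;
    private final int numBatchesProcessed;

    Stats(int numSegmentsKilled, int numBatchesProcessed)
    {
      this.numSegmentsKilled = numSegmentsKilled;
      this.numBatchesProcessed = numBatchesProcessed;
    }

    @Override
    public boolean equals(Object o)
    {
      if (!(o instanceof Stats)) {
        return false;
      }
      Stats that = (Stats) o;
      return numSegmentsKilled == that.numSegmentsKilled
          && numBatchesProcessed == that.numBatchesProcessed;
    }

    @Override
    public int hashCode()
    {
      return Objects.hash(numSegmentsKilled, numBatchesProcessed);
    }
  }

  public static void main(String[] args)
  {
    // Two stats objects with the same counts compare equal.
    System.out.println(new Stats(4, 3).equals(new Stats(4, 3)));
  }
}
```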
@kfaraz thank you, updated PR
@kfaraz can you confirm that release notes should be placed in docs/release-info/release-notes.md?
No, we just need to add it in the PR description under the heading Release Note. The release notes in the documentation are updated only when a new Druid version is released.
updated PR description, do you think kill task documentation should be updated as well?
@kfaraz regarding your comment about locks and their interconnection, I have a few questions/comments if you permit me:
1. "API markAsUsed/markAsUnused MUST NOT happen concurrently with a kill task" — this one is interesting. Why are markAsUsed/markAsUnused not taking any locks?
2. "API markAsUsed/markAsUnused MUST NOT happen concurrently with an APPEND job." — if we are using concurrent locks, this may already happen, may it not? Once again, markAsUsed/markAsUnused are not using any locks (at least I haven't found it).
3. Overall it seems to me that markAsUsed/markAsUnused don't currently play "nice" with concurrent locks; am I right?
4. "kill should PROBABLY NOT happen concurrently with a REPLACE job in case the REPLACE job wants to upgrade a segment that the kill is trying to get rid of. Having the auto-kill buffer period mostly safeguards against such situations but it is still a possibility." — if one uses a kill job with a REPLACE lock (the default for concurrent locks), shouldn't two REPLACE jobs be mutually exclusive? Maybe we shouldn't permit any other lock besides REPLACE?
5. "kill CAN happen concurrently with an APPEND job, because the append job itself has the information of all the segments that it needs to commit. Deleting any other segments from the system should not have any effect on it (except maybe killing segments of the version to which we are trying to append data, but that is again safeguarded by the auto-kill buffer period)" — if we are trying to append data to some segments, shouldn't they be 'used'? I mean, a kill job should delete only unused segments, shouldn't it? Or am I missing something?
Yes, that's true, the markAsUsed/markAsUnused APIs don't seem to take any locks right now. I think one way would be to wire up those APIs to use task actions.

> if one uses kill job with REPLACE lock(default for concurrent locks), shouldn't 2 REPLACE jobs be mutually exclusive? Maybe we shouldn't permit any other lock besides REPLACE?

Yes, two REPLACE jobs are mutually exclusive. So point 4 is handled if we use a REPLACE lock for a `kill` task.

> if we trying to append data to some segments, shouldn't they be 'used'? I mean kill job should delete only unused segments, isn't it? Or I'm missing something

That's true. I have just tried to list out all the possible corner-case scenarios (that I could think of) for posterity. In point 5, I am trying to call out that it is actually okay to run a `kill` task concurrently with `APPEND` jobs.
The only case when this might cause an issue is if, while the append is in progress, the kill task marks the segments being appended to as unused AND deletes them. But since we are getting rid of the `markAsUnused` parameter from `kill` tasks in this PR, this too is not a concern anymore.
(However, there is still the possibility of the markAsUnused API marking the segments that we are appending to as unused and then the `kill` task deleting them. But this is already covered by point 2.)
Force-pushed from 0a0f78b to 21e92ea.
I've missed one place and all tests failed; fixed it already.
Can somebody re-trigger checks please?
Coverage failed on code of `KillTaskReport`. I've removed the `numSegmentsMarkedAsUnused` field from it. It seems to me it wasn't covered before either.
@kfaraz Hi, can you please look at the results of the latest checks? It fails on coverage of `KillTaskReport.Stats.hashCode()`.
@IgorBerman, there is a test that should help with the coverage without adding any redundant tests.
Looks good, left one non-blocking comment pertaining to context parameter preference order.
… not with markAsUnused on feature-13324 better granularity for lock type in kill task, removing markAsUnused and all its traces
Force-pushed from 21e92ea to aa5d8f5.
@kfaraz @AmatyaAvadhanula @abhishekrb19 Thank you for the support, suggestions, and review.
…arameter (apache#16362)
Changes:
- Remove deprecated `markAsUnused` parameter from `KillUnusedSegmentsTask`
- Allow `kill` task to use `REPLACE` lock when `useConcurrentLocks` is true
- Use `EXCLUSIVE` lock by default
Fixes #16361.
Description
Use a `REPLACE` lock for the kill task if concurrent locks are used.
`isReady` is overridden in the kill task to take the lock type from the context, with a default of an exclusive lock; if concurrent locks are used, the default is a replace lock.
Release note
The `markAsUnused` parameter has been removed from kill tasks.
Kill tasks may use different lock types such as `APPEND` and `REPLACE`; however, this is experimental only and may have unforeseen effects in a production environment.