
Feature/16361 permits to control kill task lock type #16362

Merged: 1 commit merged into apache:master on May 10, 2024

Conversation

@IgorBerman (Contributor) commented Apr 30, 2024:

Fixes #16361.

Description

Use a REPLACE lock for the kill task when concurrent locks are enabled.

The kill task's isReady() is overridden to take the lock type from the task context, with EXCLUSIVE as the default; when concurrent locks are enabled, the default becomes REPLACE.
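
In rough terms, the override looks like the following (a minimal sketch assembled from the code excerpts quoted later in this thread; constant names such as Tasks.USE_CONCURRENT_LOCKS are assumptions based on the Druid codebase, and the merged code may differ in details such as revoked-lock handling):

```java
@Override
public boolean isReady(TaskActionClient taskActionClient) throws Exception
{
  // Default to EXCLUSIVE; switch the default to REPLACE when concurrent locks are enabled.
  final boolean useConcurrentLocks = Boolean.TRUE.equals(
      getContextValue(Tasks.USE_CONCURRENT_LOCKS, false)
  );
  final TaskLockType defaultLockType =
      useConcurrentLocks ? TaskLockType.REPLACE : TaskLockType.EXCLUSIVE;

  // An explicit taskLockType in the task context overrides the default.
  final TaskLockType actualLockType = getContextValue(Tasks.TASK_LOCK_TYPE, defaultLockType);

  // Try to acquire a time-chunk lock of the chosen type over the kill interval.
  return taskActionClient.submit(
      new TimeChunkLockTryAcquireAction(actualLockType, getInterval())
  ) != null;
}
```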

Release note

  1. The markAsUnused parameter has been removed from kill tasks.

  2. Kill tasks may use other lock types such as APPEND or REPLACE; however, this is experimental only and may have unforeseen effects in a production environment.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula (Contributor) commented May 1, 2024:

Idea looks good to me @IgorBerman.

  1. Kill tasks must hold a lock to avoid conflicts with every other fixed-interval task type.
    Move, Restore, and Archive tasks change the set of used segments and should always use an exclusive lock.

  2. Although kill tasks can coexist with other task types, such as appending tasks, even without concurrent append and replace, using a SHARED or APPEND lock introduces the possibility of multiple kill tasks running for the same interval. Using an EXCLUSIVE lock as the default seems right to me, with REPLACE locks under concurrent append / replace. A user may choose SHARED / APPEND when they are certain that there are no duplicate kill tasks, or when that is acceptable.

  3. Please add validation that the task lock type is EXCLUSIVE whenever markAsUnused: true is used.

@IgorBerman (Contributor, Author) commented:

Thanks for the comments, @AmatyaAvadhanula and @abhishekrb19.
I'll address them today or tomorrow.

@IgorBerman force-pushed the feature/16361-kill-task-lock-type branch from 30e4eb4 to 6ed5022 on May 1, 2024 07:07
@IgorBerman (Contributor, Author) commented:

@AmatyaAvadhanula I think I've addressed all your suggestions #16362 (comment)


TaskLockType actualLockType = getContextValue(Tasks.TASK_LOCK_TYPE, defaultLockType);

if (markAsUnused && actualLockType != TaskLockType.EXCLUSIVE) {
@abhishekrb19 (Contributor) commented May 2, 2024:

markAsUnused is undocumented and has been deprecated for a few releases since Druid 28. Instead of expanding on the legacy flag, we can target removing it in Druid 31. It should simplify this code and related testing a bit. cc: @kfaraz

@IgorBerman, would you be interested in doing that in a separate PR?

@IgorBerman (Contributor, Author) replied:

Do you mean a separate PR for removing this flag completely?
I can try that when Druid 31 is around the corner.

@abhishekrb19 (Contributor) commented May 3, 2024:

Yeah, we can remove the deprecated markAsUnused parameter from the kill task. IMO, it'd be best to handle that separately first, since this feature would otherwise expand on a deprecated parameter that shouldn't be used or needed.

Alternatively, as part of this patch, we can add a validation check in the kill task constructor, similar to this. The isReady() method here can then assume that markAsUnused is unset or false, so we'd have fewer branches to determine which lock to use.

The branch for Druid 30 is already cut here: https://github.com/apache/druid/tree/30.0.0. So any changes merged to master from now on will automatically be targeted for Druid 31. Let us know what you think.
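
For reference, the constructor-time validation alternative might have looked something like this (purely illustrative; the PR ultimately removed markAsUnused instead, and IAE here is Druid's formatted IllegalArgumentException):

```java
// Hypothetical check in the kill task constructor: reject markAsUnused=true
// unless the task lock type is EXCLUSIVE. Not the approach that was merged.
final TaskLockType lockType = getContextValue(Tasks.TASK_LOCK_TYPE, TaskLockType.EXCLUSIVE);
if (Boolean.TRUE.equals(markAsUnused) && lockType != TaskLockType.EXCLUSIVE) {
  throw new IAE("markAsUnused[true] requires an EXCLUSIVE lock, but got lock type[%s]", lockType);
}
```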

@IgorBerman (Contributor, Author) replied:

OK, sure. So I'm removing markAsUnused and any references to it. Hopefully I'll find all of them.

@IgorBerman (Contributor, Author) replied:

OK, hopefully it's now removed in all the relevant places.

return isReady(taskActionClient, TaskLockType.EXCLUSIVE);
}

protected boolean isReady(TaskActionClient taskActionClient, TaskLockType taskLockType) throws Exception
@abhishekrb19 (Contributor) commented:

Can we leave the base implementation as it is, without adding this extra protected method? The logic is simple enough for child class implementations to fully override isReady() without calling super.isReady().

@IgorBerman (Contributor, Author) replied:

@abhishekrb19 thanks for the suggestion. What do you think about the following: the parent's isReady() uses a private interval field, so we would either have to make it protected or keep some base method.
Do you think that would be better?

@abhishekrb19 (Contributor) replied:

@IgorBerman, I see. You can just call the public getter getInterval() from the base class, which is already used by the kill task.
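
The resolved shape is then roughly as follows (a sketch; computeKillLockType() is a hypothetical stand-in for however the task derives its lock type from the context, as sketched in the PR description above):

```java
// The base class keeps its original isReady(), which always takes an EXCLUSIVE
// time-chunk lock. The kill task overrides isReady() outright and reads the
// interval through the public getter instead of a new protected overload.
@Override
public boolean isReady(TaskActionClient taskActionClient) throws Exception
{
  return taskActionClient.submit(
      new TimeChunkLockTryAcquireAction(computeKillLockType(), getInterval())
  ) != null;
}
```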

@IgorBerman (Contributor, Author) replied:

Updated.

@IgorBerman (Contributor, Author) commented:

I had some problems squashing all the commits; there was a bad merge-from-master commit.
Hopefully it's better now.

@kfaraz (Contributor) left a review:

Changes look good to me.

I guess it makes sense to try out the new lock types for kill tasks.

But we must be cautious while using these in production, as they may have unforeseen effects (see the next section).

The PR description needs a release note which explains that:

  • The markAsUnused parameter has been removed from kill tasks.
  • kill tasks may use other lock types such as APPEND or REPLACE, but this is only experimental and may have unforeseen effects in a production environment.

Future work

Going ahead, in order to determine the best lock type for kill, we would need to do a proper evaluation of the interplay of these actions:

  • markAsUsed, markAsUnused API
  • kill task
  • REPLACE ingestion task
  • APPEND ingestion task

There are some concerns here:

  1. API markAsUsed/markAsUnused MUST NOT happen concurrently with a kill task (Otherwise, we might kill a segment that the user now wants to retain). I don't think this check currently exists.
  2. API markAsUsed/markAsUnused MUST NOT happen concurrently with an APPEND job. (Otherwise, the allocated segment IDs could be all over the place and we could have data loss)
  3. API markAsUsed/markAsUnused CAN happen concurrently with a REPLACE job. (Thanks to the druid_upgradeSegments metadata table, the REPLACE job knows that it needs to upgrade only the entries present in this table. I don't think there is any other aspect of a REPLACE job which cares about other used segments present in the system.)
  4. kill should PROBABLY NOT happen concurrently with a REPLACE job in case the REPLACE job wants to upgrade a segment that the kill is trying to get rid of. Having the auto-kill buffer period mostly safeguards against such situations but it is still a possibility.
  5. kill CAN happen concurrently with an APPEND job, because the append job itself has the information of all the segments that it needs to commit. Deleting any other segments from the system should not have any effect on it (except maybe killing segments of the version to which we are trying to append data, but that is again safeguarded by the auto-kill buffer period).

A point to note is that the buffer period applies only to auto-kill triggered by the coordinator and not to kill tasks submitted by the user from the web-console.

@@ -880,7 +846,7 @@ public void testKillBatchSizeThree() throws Exception

Assert.assertEquals(Collections.emptyList(), observedUnusedSegments);
Assert.assertEquals(
new KillTaskReport.Stats(4, 3, 4),
new KillTaskReport.Stats(4, 3, 0),
@kfaraz (Contributor) commented:

The field numSegmentsMarkedAsUnused should be completely removed from the KillTaskReport.Stats class rather than passing 0.
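
The trimmed class would then look roughly like this (a sketch reconstructed from the coverage report quoted later in this thread; equals(), getters, and toString() are elided):

```java
public static class Stats
{
  private final int numSegmentsKilled;
  private final int numBatchesProcessed;

  @JsonCreator
  public Stats(
      @JsonProperty("numSegmentsKilled") int numSegmentsKilled,
      @JsonProperty("numBatchesProcessed") int numBatchesProcessed
  )
  {
    this.numSegmentsKilled = numSegmentsKilled;
    this.numBatchesProcessed = numBatchesProcessed;
  }

  @Override
  public int hashCode()
  {
    // numSegmentsMarkedAsUnused no longer participates.
    return Objects.hash(numSegmentsKilled, numBatchesProcessed);
  }
}
```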

@IgorBerman (Contributor, Author) replied:

@kfaraz thank you, I've updated the PR.

@IgorBerman (Contributor, Author) commented:

@kfaraz can you confirm that release notes should be placed in docs/release-info/release-notes.md?

@kfaraz (Contributor) replied:

No, we just need to add it in the PR description under the heading Release Note. The release notes in the documentation are updated only when a new Druid version is released.

@IgorBerman (Contributor, Author) replied:

Updated the PR description. Do you think the kill task documentation should be updated as well?

@IgorBerman (Contributor, Author) commented:

@kfaraz regarding your comment about locks and their interplay, I have a few questions/comments, if you permit me:

  1. "API markAsUsed/markAsUnused MUST NOT happen concurrently with a kill task"
    This one is interesting. Why don't markAsUsed/markAsUnused take any locks?

  2. "API markAsUsed/markAsUnused MUST NOT happen concurrently with an APPEND job."
    If we're using concurrent locks, this may already happen, right? Once again, markAsUsed/markAsUnused don't take any locks (at least I haven't found where they do).

  3. Overall it seems to me that markAsUsed/markAsUnused don't currently play "nice" with concurrent locks. Am I right?

  4. "kill should PROBABLY NOT happen concurrently with a REPLACE job in case the REPLACE job wants to upgrade a segment that the kill is trying to get rid of. Having the auto-kill buffer period mostly safeguards against such situations but it is still a possibility."
    If one runs a kill job with a REPLACE lock (the default under concurrent locks), shouldn't two REPLACE jobs be mutually exclusive? Maybe we shouldn't permit any lock besides REPLACE?

  5. "kill CAN happen concurrently with an APPEND job, because the append job itself has the information of all the segments that it needs to commit. Deleting any other segments from the system should not have any effect on it (except maybe killing segments of the version to which we are trying to append data, but that is again safeguarded by the auto-kill buffer period)."
    If we're trying to append data to some segments, shouldn't they be 'used'? I mean, a kill job should delete only unused segments, right? Or am I missing something?

@kfaraz (Contributor) replied:

Yes, that's true, markAsUsed / markAsUnused APIs don't seem to take any locks right now. I think one way would be to wire up those APIs to use task actions.

if one uses kill job with REPLACE lock(default for concurrent locks), shouldn't 2 REPLACE jobs be mutually exclusive? Maybe we shouldn't permit any other lock besides REPLACE?

Yes, two REPLACE jobs are mutually exclusive. So point 4 is handled if we use a REPLACE lock for a kill task.

if we trying to append data to some segments, shouldn't they be 'used'? I mean kill job should delete only unused segments, isn't it? Or I'm missing something

That's true. I have just tried to list out all the possible corner case scenarios (that I could think of) for posterity.

In point 5, I am trying to call out that it is actually okay to run a kill task concurrently with APPEND jobs.

The only case when this might cause an issue is if while the append is in progress, the kill task marks the segments being appended to as unused AND deletes them. But since we are getting rid of the markAsUnused parameter from kill tasks in this PR, this too is not a concern anymore.

(However, there is still the possibility of the markAsUnused API marking the segments that we are appending to as unused and then kill task deleting them. But this is already covered by point 2.)

@IgorBerman force-pushed the feature/16361-kill-task-lock-type branch 2 times, most recently from 0a0f78b to 21e92ea, on May 6, 2024 14:48
@github-actions bot added the labels Area - Batch Ingestion and Area - MSQ (for multi-stage queries: https://github.com/apache/druid/issues/12262) on May 6, 2024
@IgorBerman (Contributor, Author) commented:

I missed one place and all the tests failed; it's already fixed.

@IgorBerman (Contributor, Author) commented:

Can somebody please re-trigger the checks?

@kfaraz closed this May 7, 2024
@kfaraz reopened this May 7, 2024
@IgorBerman (Contributor, Author) commented:

Coverage failed on KillTaskReport code. I removed the numSegmentsMarkedAsUnused field from it; it seems to me it wasn't covered before either.

------------------------------------------------------------------------------
org/apache/druid/indexer/report/KillTaskReport.java
------------------------------------------------------------------------------
22                import com.fasterxml.jackson.annotation.JsonCreator;
23                import com.fasterxml.jackson.annotation.JsonProperty;
24                

91                        @JsonProperty("numBatchesProcessed") int numBatchesProcessed

121                            && numBatchesProcessed == that.numBatchesProcessed;

127 F | L               return Objects.hash(numSegmentsKilled, numBatchesProcessed);
------------------------------------------------------------------------------

@IgorBerman (Contributor, Author) commented:

Hi @kfaraz, can you please look at the results of the latest checks? They fail on coverage of KillTaskReport.Stats.hashCode().
I've tried to find any place that uses hashCode/equals of KillTaskReport and/or KillTaskReport.Stats and couldn't find any, i.e. those classes are not used as map keys, for example.
I could add a test that exercises hashCode/equals for those classes, but I'm not sure what value that would bring.
Any suggestions would be appreciated.

@kfaraz (Contributor) commented May 9, 2024:

@IgorBerman, there is a test TaskReportSerdeTest.testSerdeOfKillTaskReport().
There, you can add an assertion Assert.assertEquals(originalReport.hashCode(), deserializedReport.hashCode()).

That should help with the coverage without adding any redundant tests.
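
Concretely, the added assertion would sit at the end of the existing round-trip, something like this (a sketch; the report construction and jsonMapper setup are assumed to already exist in the test):

```java
// Round-trip the report through Jackson, then assert that equals() and
// hashCode() agree, which exercises KillTaskReport.Stats.hashCode().
final KillTaskReport originalReport =
    new KillTaskReport("taskId", new KillTaskReport.Stats(1, 2));
final String json = jsonMapper.writeValueAsString(originalReport);
final KillTaskReport deserializedReport =
    jsonMapper.readValue(json, KillTaskReport.class);

Assert.assertEquals(originalReport, deserializedReport);
Assert.assertEquals(originalReport.hashCode(), deserializedReport.hashCode());
```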

@kfaraz (Contributor) left a review:

Looks good, left one non-blocking comment pertaining to context parameter preference order.

… not with markAsUnused on

feature-13324 better granularity for lock type in kill task, removing markAsUnused  and all it's traces
@IgorBerman force-pushed the feature/16361-kill-task-lock-type branch from 21e92ea to aa5d8f5 on May 9, 2024 11:09
@kfaraz merged commit d0f3fda into apache:master on May 10, 2024 (87 checks passed).
@IgorBerman (Contributor, Author) commented:

@kfaraz @AmatyaAvadhanula @abhishekrb19 Thank you for the support, suggestions, and review.
I appreciate your help!

gianm pushed a commit to gianm/druid that referenced this pull request May 10, 2024
…arameter (apache#16362)

Changes:
- Remove deprecated `markAsUnused` parameter from `KillUnusedSegmentsTask`
- Allow `kill` task to use `REPLACE` lock when `useConcurrentLocks` is true
- Use `EXCLUSIVE` lock by default
@kfaraz added this to the 31.0.0 milestone on Oct 4, 2024
Successfully merging this pull request may close these issues:

Kill task using exclusive locks currently and no way to change it (#16361)

4 participants