-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix geoip index deletion race condition #105367
Fix geoip index deletion race condition #105367
Conversation
We can probably do better than just 'synchronized' but this is a good WIP version for thinking aloud.
logger.warn("failed to remove " + databasesIndex, e); | ||
} | ||
})); | ||
if (geoIpDownloader != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this needs to be here because the code is set up to be able to stop the task from any node, but the current task is only set on the node that is running it. Check out the clusterChanged method. We're not ensuring that the stopTask method is only run on the master node like we are in the setEnabled
method...
To mirror my thoughts from off-page discussions: I think the bones of the change is good, but there is a new issue when going with this approach: In I don't know if there is a teardown hook for persistent tasks that we can use instead, or if we should just change it so that every node fires off the stop action like how we do it in But then again - if you toggle the enabled switch on and off really fast how bad do things break now? |
@@ -323,6 +327,20 @@ private void cleanDatabases() { | |||
stats = stats.expiredDatabases((int) expiredDatabases); | |||
} | |||
|
|||
synchronized void deleteIndex() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right beneath this method is the onCancelled()
method, which runs on the node executing the task (on a GENERIC thread, so we should be fine). The task should not be scheduled again until after markAsCompleted()
is called, so as long as we have a way to avoid running the delete at the same time as the update (synchronized is probably fine?) then the delete will happen always happen on the right node.
I'm going to take this one back to the drawing board, the problem here is real, but this solution is mostly barking up the wrong tree. The comments on this PR and elsewhere have been illuminating, though, so I think there's a good chance we can put this issue to bed (edit: albeit by way of a different PR!). |
Closes #75221, #100361, and #101418
Just a draft/WIP PR, thinking about how to prevent the scenario from #101418 (comment) from occurring.
I'm uncertain about the
threadPool
bit, especially, but also thegeoIpDownloader != null
check is interesting (without it we do get test failures, but there was no such requirement before that there'd a current task...).