Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix geoip index deletion race condition #105367

Closed

Conversation

joegallo
Copy link
Contributor

@joegallo joegallo commented Feb 9, 2024

Closes #75221, #100361, and #101418

Just a draft/WIP PR, thinking about how to prevent the scenario from #101418 (comment) from occurring.

I'm uncertain about the threadPool bit, especially, but also the geoIpDownloader != null check is interesting (without it we do get test failures, but there was no such requirement before that there'd a current task...).

We can probably do better than just 'synchronized' but this is a good
WIP version for thinking aloud.
@joegallo joegallo added >test Issues or PRs that are addressing/adding tests WIP :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team labels Feb 9, 2024
logger.warn("failed to remove " + databasesIndex, e);
}
}));
if (geoIpDownloader != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this needs to be here because the code is set up to be able to stop the task from any node, but the current task is only set on the node that is running it. Check out the clusterChanged method. We're not ensuring that the stopTask method is only run on the master node like we are in the setEnabled method...

@jbaiera
Copy link
Member

jbaiera commented Feb 9, 2024

To mirror my thoughts from off-page discussions: I think the bones of the change is good, but there is a new issue when going with this approach:

In org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor#setEnabled we only start and stop the task on the master node. If we end up stopping the task via the master node, then the currentTask references is likely to always be null, and thus, the delete will not happen.

I don't know if there is a teardown hook for persistent tasks that we can use instead, or if we should just change it so that every node fires off the stop action like how we do it in org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor#clusterChanged which will ensure the node that is running the task does indeed have a chance to execute the cleanup in a threadsafe way.

But then again - if you toggle the enabled switch on and off really fast how bad do things break now?

@@ -323,6 +327,20 @@ private void cleanDatabases() {
stats = stats.expiredDatabases((int) expiredDatabases);
}

synchronized void deleteIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right beneath this method is the onCancelled() method, which runs on the node executing the task (on a GENERIC thread, so we should be fine). The task should not be scheduled again until after markAsCompleted() is called, so as long as we have a way to avoid running the delete at the same time as the update (synchronized is probably fine?) then the delete will happen always happen on the right node.

@joegallo
Copy link
Contributor Author

joegallo commented Feb 12, 2024

I'm going to take this one back to the drawing board, the problem here is real, but this solution is mostly barking up the wrong tree. The comments on this PR and elsewhere have been illuminating, though, so I think there's a good chance we can put this issue to bed (edit: albeit by way of a different PR!).

@joegallo joegallo closed this Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team >test Issues or PRs that are addressing/adding tests v8.13.0 WIP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] GeoIpDownloaderCliIT.testInvalidTimestamp failure
3 participants