Fix geoip index deletion race condition #105367

joegallo · 2024-02-09T20:33:29Z

Just a draft/WIP PR, thinking about how to prevent the scenario from #101418 (comment) from occurring.

I'm uncertain about the threadPool bit, especially, but also the geoIpDownloader != null check is interesting (without it we do get test failures, but there was no such requirement before that there'd be a current task...).

We can probably do better than just 'synchronized' but this is a good WIP version for thinking aloud.

jbaiera · 2024-02-09T21:06:23Z

...s/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloaderTaskExecutor.java

-                        logger.warn("failed to remove " + databasesIndex, e);
-                    }
-                }));
+            if (geoIpDownloader != null) {


I wonder if this needs to be here because the code is set up to be able to stop the task from any node, but the current task is only set on the node that is running it. Check out the clusterChanged method. We're not ensuring that the stopTask method is only run on the master node like we are in the setEnabled method...

jbaiera · 2024-02-09T21:47:07Z

To mirror my thoughts from off-page discussions: I think the bones of the change are good, but there is a new issue when going with this approach:

In org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor#setEnabled we only start and stop the task on the master node. If we end up stopping the task via the master node, then the currentTask reference is likely to always be null, and thus the delete will not happen.

I don't know if there is a teardown hook for persistent tasks that we can use instead, or if we should just change it so that every node fires off the stop action, as we do in org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor#clusterChanged. That would ensure the node that is running the task does indeed have a chance to execute the cleanup in a thread-safe way.

But then again, if you toggle the enabled switch on and off really fast, how badly do things break now?
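
To make the null-currentTask concern concrete, here is a toy model, not Elasticsearch code. NodeSketch, StopTaskSketch, and every field and method on them are hypothetical names invented for illustration. The sketch assumes the master is not the node executing the task, and that the cleanup is guarded by a local null check as in the diff above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a cluster node; hypothetical, not an Elasticsearch class. */
class NodeSketch {
    final String name;
    final boolean isMaster;
    Object currentTask; // non-null only on the node executing the task
    int deletes = 0;    // counts how often this node ran the index cleanup

    NodeSketch(String name, boolean isMaster, Object currentTask) {
        this.name = name;
        this.isMaster = isMaster;
        this.currentTask = currentTask;
    }

    /** Guarded cleanup: without a local task reference nothing is deleted. */
    void stopTask() {
        if (currentTask != null) {
            deletes++;
            currentTask = null;
        }
    }
}

class StopTaskSketch {
    /** Three nodes: node-0 is master, node-2 is the one running the task. */
    static List<NodeSketch> cluster() {
        List<NodeSketch> nodes = new ArrayList<>();
        nodes.add(new NodeSketch("node-0", true, null));
        nodes.add(new NodeSketch("node-1", false, null));
        nodes.add(new NodeSketch("node-2", false, new Object()));
        return nodes;
    }

    /** Only the master issues the stop, as described for setEnabled. */
    static int deletesWhenOnlyMasterStops() {
        int total = 0;
        for (NodeSketch node : cluster()) {
            if (node.isMaster) {
                node.stopTask();
            }
            total += node.deletes;
        }
        return total;
    }

    /** Every node issues the stop, as described for clusterChanged. */
    static int deletesWhenEveryNodeStops() {
        int total = 0;
        for (NodeSketch node : cluster()) {
            node.stopTask();
            total += node.deletes;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("master only: " + deletesWhenOnlyMasterStops() + " delete(s)");
        System.out.println("every node:  " + deletesWhenEveryNodeStops() + " delete(s)");
    }
}
```

In this model the master-only path performs no delete, while the every-node path performs exactly one, on the node that holds the task. It says nothing about how the real persistent task framework routes the stop request.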

jbaiera · 2024-02-10T22:55:19Z

modules/ingest-geoip/src/main/java/org/elasticsearch/ingest/geoip/GeoIpDownloader.java

@@ -323,6 +327,20 @@ private void cleanDatabases() {
        stats = stats.expiredDatabases((int) expiredDatabases);
    }

+    synchronized void deleteIndex() {


Right beneath this method is the onCancelled() method, which runs on the node executing the task (on a GENERIC thread, so we should be fine). The task should not be scheduled again until after markAsCompleted() is called, so as long as we have a way to avoid running the delete at the same time as the update (synchronized is probably fine?) then the delete will always happen on the right node.
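
A minimal sketch of the "synchronized is probably fine?" idea, under stated assumptions: DownloaderSketch and MutexSketch are hypothetical stand-ins, not the real GeoIpDownloader, and runUpdate() stands in for whatever method performs the periodic update. The only point shown is that two synchronized instance methods share one monitor, so the update and the delete cannot overlap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the downloader; not the real GeoIpDownloader. */
class DownloaderSketch {
    private final List<String> events = new ArrayList<>();
    private boolean indexExists = true;
    private int writes = 0;

    /** Stand-in for the periodic update; holds the monitor for the whole run. */
    synchronized void runUpdate() {
        events.add("update-start");
        if (indexExists) {
            writes++; // pretend to write database chunks into the index
        }
        events.add("update-end");
    }

    /** Stand-in for deleteIndex(); same monitor, so it never overlaps an update. */
    synchronized void deleteIndex() {
        events.add("delete-start");
        indexExists = false;
        events.add("delete-end");
    }

    synchronized List<String> events() {
        return new ArrayList<>(events);
    }
}

class MutexSketch {
    /** Runs update and delete on two threads and checks they did not interleave. */
    static boolean raceIsSerialized() {
        DownloaderSketch downloader = new DownloaderSketch();
        Thread update = new Thread(downloader::runUpdate);
        Thread delete = new Thread(downloader::deleteIndex);
        update.start();
        delete.start();
        try {
            update.join();
            delete.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        List<String> events = downloader.events();
        if (events.size() != 4) {
            return false;
        }
        // Each "-start" must be followed directly by the matching "-end".
        for (int i = 0; i < events.size(); i += 2) {
            String start = events.get(i);
            if (!start.endsWith("-start")) {
                return false;
            }
            String op = start.substring(0, start.length() - "-start".length());
            if (!events.get(i + 1).equals(op + "-end")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("serialized: " + raceIsSerialized());
    }
}
```

Which operation runs first is left to the scheduler. The monitor only guarantees they do not run at the same time, which is the property this comment asks for.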

joegallo · 2024-02-12T15:05:56Z

I'm going to take this one back to the drawing board, the problem here is real, but this solution is mostly barking up the wrong tree. The comments on this PR and elsewhere have been illuminating, though, so I think there's a good chance we can put this issue to bed (edit: albeit by way of a different PR!).

joegallo added 3 commits February 9, 2024 15:18

Re-enable some tests

555b0fc

Extract index deletion into a method

d686e1d

Use a threadpool and (super simple) mutex

8afcae4

We can probably do better than just 'synchronized' but this is a good WIP version for thinking aloud.

joegallo added the >test, WIP, :Data Management/Ingest Node, and Team:Data Management labels Feb 9, 2024

elasticsearchmachine added the v8.13.0 label Feb 9, 2024

jbaiera reviewed Feb 9, 2024

jbaiera reviewed Feb 10, 2024

joegallo closed this Feb 12, 2024
