Fix enrich coordinator to reject documents instead of deadlocking #56247

Merged
jbaiera merged 6 commits into elastic:master from fix-enrich-queue-rejections on May 26, 2020

Conversation

@jbaiera (Member) commented May 5, 2020

This PR removes the blocking call used to insert ingest documents into a queue in the coordinator. It replaces it with an offer call that throws a rejection exception when the queue is full. This prevents deadlocks of the write threads when the queue fills to capacity and there is more than one enrich processor in a pipeline.

Relates #55634

This does not solve the entire issue we have with #55634 - we still need to find a way to process the search results off of the search threads, and in a way that does not flood the write thread pool queue with small tasks. We are weighing options and will be fixing that problem soon.
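
As a rough illustration of the pattern this change moves to (a minimal sketch with made-up names, not the PR's actual diff), a bounded queue where enqueueing never blocks the caller and a full queue surfaces as a rejection the caller can translate into a 429:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class NonBlockingEnqueueSketch {
    // Bounded queue standing in for the coordinator's internal queue (capacity is arbitrary here).
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);

    void enqueue(Runnable slot) {
        // put(slot) would park the calling (write) thread until space frees up; offer(slot)
        // returns false immediately instead, so the caller can reject just this document.
        boolean accepted = queue.offer(slot);
        if (accepted == false) {
            // Stand-in for the rejection exception the PR describes (reported to the client as a 429).
            throw new IllegalStateException("coordination queue at capacity; rejecting document");
        }
    }
}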

@jbaiera jbaiera added the >bug and :Data Management/Ingest Node labels May 5, 2020
@jbaiera jbaiera requested a review from martijnvg May 5, 2020 21:22
@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@elasticmachine elasticmachine added the Team:Data Management label May 5, 2020
@martijnvg (Member) left a comment

LGTM, I left some minor comments.

boolean accepted = queue.offer(new Slot(searchRequest, listener));
int queueSize = queue.size();

// coordinate lookups no matter what, even if queues were full
Member

Can you describe why it is important to coordinate lookups even when the queue is full?

Member Author

I left a short comment on the code but wanted to mirror some thoughts here:

One of the issues with the current code is that once the queue is full, only a search thread can drain it. The search thread does so only after it completes processing the results of the multi-search, during which the thread may end up in this part of the code again. If the queue is full here, and the code does not coordinate lookups on the data in the queue no matter what, then the search thread will eventually fail all the records it's processing with 429 errors because they cannot enter the queue for the next enrich processor in the pipeline, essentially halting ingestion until the queues can accept writes again. All the while, the bulk threads are also rejecting documents until a search thread can drain the queue a bit. If the queue fills up again while that search is running, then when the search comes back, it too will reject all the documents it's processing at the time.

Now that I'm thinking about this more, scheduling lookups no matter what may solve the rejection problem at this layer, but it puts more strain on the search thread pool. I still think it is better though to rely on the thread pool task queues to regulate back pressure rather than this coordination queue, which to me seems more like a mechanism to facilitate combining multiple requests together.
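
To tie the two snippets in this review together, roughly the flow being discussed (a fragment-style sketch that assumes the surrounding coordinator's queue, Slot, and a coordinateLookups() draining method; it is not the exact code in this PR):

void schedule(SearchRequest searchRequest, ActionListener<SearchResponse> listener) {
    // Non-blocking enqueue; a full queue no longer parks the write thread.
    boolean accepted = queue.offer(new Slot(searchRequest, listener));

    // Kick off draining regardless of whether this particular request made it into the
    // queue; otherwise a full queue could only be drained by a search thread that may
    // itself be stuck on this code path.
    coordinateLookups();

    if (accepted == false) {
        // Reject only this document (surfaces as a 429); everything already queued still runs.
        listener.onFailure(new EsRejectedExecutionException("coordination queue at capacity"));
    }
}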

Member

Thanks for sharing your thoughts here.

> I still think it is better though to rely on the thread pool task queues to regulate back pressure rather than this coordination queue, which to me seems more like a mechanism to facilitate combining multiple requests together.

Yes, this is the purpose of the coordination queue.
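
For context on that purpose, an illustrative fragment of what combining queued lookups into one multi-search could look like (client, Slot, maxLookupsPerRequest, and the listener wiring here are assumptions for the sketch, not the coordinator's actual implementation):

// Drain a batch of pending lookups and send them as a single multi-search.
List<Slot> slots = new ArrayList<>();
queue.drainTo(slots, maxLookupsPerRequest);

MultiSearchRequest multiSearch = new MultiSearchRequest();
for (Slot slot : slots) {
    multiSearch.add(slot.searchRequest);
}

client.multiSearch(multiSearch, ActionListener.wrap(
    response -> {
        // Fan the individual responses back out to each slot's listener.
        for (int i = 0; i < slots.size(); i++) {
            slots.get(i).listener.onResponse(response.getResponses()[i].getResponse());
        }
    },
    e -> slots.forEach(slot -> slot.listener.onFailure(e))
));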

// Use offer(...) instead of put(...). We are on a write thread and blocking here can be dangerous,
// especially since the logic to kick off draining the queue is located right after this section. If we
// cannot insert a request to the queue, we should reject the document with a 429 error code.
boolean accepted = queue.offer(new Slot(searchRequest, listener));
Member
👍

@jbaiera (Member Author) commented May 11, 2020

@elasticmachine run elasticsearch-ci/bwc

@jbaiera (Member Author) commented May 11, 2020

@elasticmachine run elasticsearch-ci/default-distro

@elastic elastic deleted a comment from elasticmachine May 12, 2020
@elastic elastic deleted a comment from elasticmachine May 12, 2020
@jbaiera jbaiera merged commit 9f5c06d into elastic:master May 26, 2020
@jbaiera jbaiera deleted the fix-enrich-queue-rejections branch May 26, 2020 18:05
jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request May 26, 2020
…astic#56247)

This PR removes the blocking call to insert ingest documents into a queue in the
coordinator. It replaces it with an offer call which will throw a rejection exception
in the event that the queue is full. This prevents deadlocks of the write threads
when the queue fills to capacity and there are more than one enrich processors
in a pipeline.
jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request May 27, 2020
…astic#56247)
jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request May 27, 2020
…astic#56247)
jbaiera added a commit that referenced this pull request May 27, 2020
…6247) (#57179)
jbaiera added a commit that referenced this pull request May 27, 2020
…6247) (#57188)
jbaiera added a commit that referenced this pull request May 27, 2020
…6247) (#57189)
Labels
>bug, :Data Management/Ingest Node, Team:Data Management, v7.7.1, v7.8.1, v7.9.0, v8.0.0-alpha1