High volume of ingest traffic can cause Enrich to deadlock #55634
Comments
Pinging @elastic/es-core-features (:Core/Features/Ingest)
Nice find @jbaiera! I think this deadlock can only happen if a pipeline contains more than one enrich processor. The root cause of the deadlock, I believe, is as Jimmy mentions: the search thread ends up processing (parts of) the ingest pipeline(s) and gets hung up on the same queue.put call that the write threads are stuck on. The search threads need to complete to allow the write threads to complete, but the search threads cannot, because they are blocked on the same resource as the write threads.

I did some debugging with the following additional logging: https://gist.github.com/jakelandis/61d18359baa325c6c12b40b8d015e798 (log in comments of gist). Using the following repro case, I was able to see a search thread processing the ingest document. I believe the fix here is to ensure that the CompoundProcessor forks execution back onto an appropriate thread pool.
The relevant (custom) logs are:
Basically, the pipeline execution via CompoundProcessor.innerExecute is happening on the search thread. It needs to fork back to the write thread pool (or some other thread pool) to prevent the deadlock.
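For what it's worth, here is a minimal plain-Java sketch of that forking idea (the names ForkingResponseHandler, writeExecutor, and continuePipeline are hypothetical, and this is not the actual Elasticsearch code): the search response callback hands the remaining pipeline work to a separate executor instead of running it inline, so the search thread returns to its pool and can never block on the coordinator queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch only: hypothetical names, not Elasticsearch's actual classes or APIs.
public class ForkingResponseHandler<R> implements Consumer<R> {
    private final ExecutorService writeExecutor;   // stands in for the write thread pool
    private final Consumer<R> continuePipeline;    // runs the remaining ingest processors

    public ForkingResponseHandler(ExecutorService writeExecutor, Consumer<R> continuePipeline) {
        this.writeExecutor = writeExecutor;
        this.continuePipeline = continuePipeline;
    }

    @Override
    public void accept(R searchResponse) {
        // Invoked on the search thread when the multi-search completes.
        // Do NOT keep executing the pipeline here: hand it off so the search
        // thread can never end up blocked on the coordinator's bounded queue.
        writeExecutor.execute(() -> continuePipeline.accept(searchResponse));
    }

    public static void main(String[] args) {
        ExecutorService fakeWritePool = Executors.newSingleThreadExecutor();
        Consumer<String> rest = doc -> System.out.println("remaining processors ran for: " + doc);
        new ForkingResponseHandler<>(fakeWritePool, rest).accept("doc-1");
        fakeWritePool.shutdown();
    }
}
```

Whether the hand-off target should be the write pool or a dedicated executor is the open question above; the sketch only illustrates the shape of the fix.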
And the following assertion gets tripped when running from source with assertions enabled:

```java
assert Thread.currentThread().getName().contains(ThreadPool.Names.WRITE)
    || Thread.currentThread().getName().contains(ThreadPool.Names.MANAGEMENT);
```

With the following repro:
Good catch @jbaiera and thanks @jakelandis for the additional explanation and easy reproduction!
I was able to get a solid reproduction by eliding the assertions (similar to how a production runtime would do so) and dramatically throttling the maximum allowed throughput (only 1 concurrent search at a time, queue size set to 10). Importantly, the pipeline must indeed contain two enrich processors, but they do not need to be separated by a pipeline processor. The deadlock still occurs when running a single pipeline with two enrich processors back to back. For reference, here are the stack traces that came out of the pipeline processor scenario:
How do we increase the 1024 max capacity of the enrich coordinator? We are just barely tripping that breaker and would like to adjust that up a bit.
@jmp601 By default, the queue size is set to the number of concurrent enrich search operations allowed at a time (default 8) times the number of enrich lookups per search operation (default 128), i.e. 1024. You can increase the queue capacity for a node by giving the queue capacity setting a concrete value. If you have enough extra headroom on your deployment to run more enrich operations at a time, you could also look at increasing either of those two limits.
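As a quick sanity check of that arithmetic (illustrative Java only; the variable names below are made up and are not the actual setting keys):

```java
public class EnrichQueueCapacityMath {
    public static void main(String[] args) {
        // Defaults quoted above; names are illustrative, not real setting keys.
        int concurrentSearchOps = 8;    // concurrent enrich search operations at a time
        int lookupsPerSearchOp = 128;   // enrich lookups collapsed into each multi-search
        int defaultQueueCapacity = concurrentSearchOps * lookupsPerSearchOp;
        System.out.println("default enrich coordinator queue capacity = " + defaultQueueCapacity); // 1024
    }
}
```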
Enrich processors all route their search requests through the EnrichCoordinatorProxyAction, which collects enrichment search requests together in order to collapse them down and submit them in one multi-search request. The coordinator maintains an internal queue of search requests for this purpose. Each thread entering the coordinator adds to this queue, then atomically drains its contents into a multi-search request which is executed asynchronously on a search thread. A maximum number of in-flight search requests is allowed (default 8). If that limit is reached then the coordinator simply queues ingest documents up until a new multi-search request can be executed. When the enrich coordinator queue reaches maximum capacity (1024 requests by default), it blocks the write thread under the assumption that a search request will eventually complete and begin draining the queue. This is meant to create back pressure on the rest of the ingestion framework.

The discovered bug pertains to when the search thread completes the enrich lookup. When the multi-search completes, the search thread calls the response handler for the search. This handler simply returns to the ingestion framework and begins processing the next set of processors in the pipeline, potentially even the next document in the bulk request. Since the pipeline contains another enrich processor, the search thread will attempt to add a search request to the coordinator queue when it reaches it, just like a write thread would. If this queue is full, the search thread is captured waiting for the queue to drain, just as the write threads are. No thread can then pass this critical section to drain the queue and schedule the next search, so a deadlock arises, consuming the write threads and a portion of the search threads on the node.
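To make that failure mode concrete, here is a minimal, self-contained Java sketch of the pattern (hypothetical names; it models the coordinator as a small bounded queue and the search pool as a single thread, not the real Elasticsearch classes). The "search" thread is the only drainer, yet its response handler keeps executing the pipeline inline and performs a blocking put into the same queue, so once the queue is full every thread ends up parked in queue.put and nothing can drain it.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of the deadlock; names are hypothetical, not the real Elasticsearch classes.
public class EnrichDeadlockSketch {

    // "Coordinator" queue of pending enrich lookups, deliberately tiny so the
    // wedge happens right away (the real default capacity is 1024).
    static final BlockingQueue<Runnable> coordinatorQueue = new ArrayBlockingQueue<>(2);

    // Any thread hitting an enrich processor enqueues here; put() blocks when the
    // queue is full -- the intended back pressure on write threads.
    static void enqueueLookup(Runnable responseHandler) {
        try {
            coordinatorQueue.put(responseHandler);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // The lone "search" thread: drains lookups, pretends to run the multi-search,
        // then invokes the response handler inline -- the problematic part.
        Thread searchThread = new Thread(() -> {
            try {
                while (true) {
                    Runnable handler = coordinatorQueue.take();
                    handler.run(); // continues the ingest pipeline on the search thread
                }
            } catch (InterruptedException e) {
                // exit
            }
        }, "search-thread");
        searchThread.start();

        // "Write" threads ingest documents whose pipeline contains TWO enrich
        // processors, so each response handler enqueues a second lookup. Once the
        // search thread finds the queue full inside that nested enqueueLookup(),
        // the only drainer is parked and every write thread parks behind it.
        // (Like the real bug, it is a race -- but with 8 writers and capacity 2
        // it is lost almost instantly. A thread dump then shows every thread
        // waiting in ArrayBlockingQueue.put, mirroring the queue.put hang above.)
        for (int i = 0; i < 8; i++) {
            new Thread(
                () -> enqueueLookup(() -> enqueueLookup(() -> { /* last processor */ })),
                "write-thread-" + i).start();
        }
    }
}
```

With the tiny sizes above the wedge is near-instant; the real coordinator needs sustained load to fill all 1024 slots before the same thing happens.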
Normally, even though search threads are erroneously captured to perform ingestion work, they are eventually released back to the search pool once the bulk request they are stuck in completes processing. This may be why the bug flew under the radar: it only manifests when high load is placed on the enrich system for an extended period of time. If the write threads are able to create more search requests in the coordinator queue than the search threads can keep up with, the system will degrade until it exceeds the queue capacity and locks in place.