
Avoiding BulkProcessor deadlock in ILMHistoryStore #91238

Merged: 64 commits merged into elastic:main on Jan 9, 2023

Conversation

masseyke (Member) commented Nov 1, 2022

We have been seeing deadlocks in ILMHistoryStore in production (#68468). The deadlock appears to be caused by the fact that BulkProcessor uses two locks (BulkProcessor.lock and BulkRequestHandler.semaphore) and holds onto the latter lock for an indefinite amount of time.

This PR avoids the deadlock by introducing a new BulkProcessor2 that never requires both locks to be held at once, and drastically shortens the time for which either lock is held. It does this by adding a new queue (in the Retry2 class).
Note that we have left the original BulkProcessor in place and cloned/modified it into BulkProcessor2. For now BulkProcessor2 is used only by ILMHistoryStore, but it will likely replace BulkProcessor altogether in the near future.

The flow in the original BulkProcessor is like this:

  • ILMHistoryStore adds IndexRequests to BulkProcessor asynchronously.
  • BulkProcessor acquires its lock to build up a BulkRequest from these IndexRequests.
  • If a call from ILMHistoryStore adds an IndexRequest that pushes the BulkRequest over its configured threshold (size or count), BulkProcessor calls BulkRequestHandler to send that BulkRequest to the server.
  • BulkRequestHandler must acquire a permit from its semaphore to do this.
  • It calls Client::bulk from the current thread.
  • If the request fails, it keeps the semaphore permit while it retries (possibly multiple times).
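The blocking in the last two steps can be sketched as follows. This is a hypothetical, simplified model (the class and method names are illustrative, not the real BulkRequestHandler): the permit acquired before the bulk call is held across every retry, so other flushes queue up behind it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified model of the old flow: the semaphore permit is held
// for the whole synchronous retry loop, so concurrent flushes block behind it.
public class OldFlowSketch {

    // Stand-in for BulkRequestHandler.semaphore (one concurrent request allowed).
    static final Semaphore concurrentRequests = new Semaphore(1);

    // Stand-in for BulkRequestHandler: hold the permit across the call and all retries.
    static void executeWithRetries(Runnable bulkCall, int retries) throws InterruptedException {
        concurrentRequests.acquire();
        try {
            for (int attempt = 0; attempt <= retries; attempt++) {
                bulkCall.run(); // "Client::bulk" from the current thread
            }
        } finally {
            concurrentRequests.release();
        }
    }

    // Returns whether a second caller could get a permit while retries are in progress.
    static boolean permitAvailableDuringRetries() {
        CountDownLatch retrying = new CountDownLatch(1);
        CountDownLatch finishRetries = new CountDownLatch(1);
        Thread flusher = new Thread(() -> {
            try {
                executeWithRetries(() -> {
                    retrying.countDown();
                    try {
                        finishRetries.await(); // simulate a slow, failing bulk call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, 2);
            } catch (InterruptedException ignored) {
            }
        });
        flusher.start();
        try {
            retrying.await();
            boolean available = concurrentRequests.tryAcquire();
            if (available) {
                concurrentRequests.release();
            }
            finishRetries.countDown();
            flusher.join();
            return available;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("permit available during retries: " + permitAvailableDuringRetries());
    }
}
```

Any thread that tries to flush while the retry loop is running blocks on the semaphore, which is one half of the two-lock deadlock described above.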

The flow in the new BulkProcessor2:

  • ILMHistoryStore adds IndexRequests to BulkProcessor2 synchronously (since this part is now very fast).
  • BulkProcessor2 acquires its lock to build up a BulkRequest from these IndexRequests.
  • If a call from ILMHistoryStore adds an IndexRequest that pushes the BulkRequest over its configured threshold (size or count), BulkProcessor2 calls Retry2::withBackoff, which attempts to send the BulkRequest, retrying up to a fixed number of times.
  • If the number of bytes already in flight to Elasticsearch is higher than a configured limit, or if Elasticsearch is too busy, the listener is notified with an EsRejectedExecutionException.
  • Either way, control returns immediately and the BulkProcessor2 lock is released.
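The non-blocking handoff in the steps above can be sketched like this (hypothetical names throughout; a single-threaded executor stands in for Retry2's queue, and completing a future stands in for the real client call):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the new flow: crossing the threshold enqueues a flush
// and returns immediately; the lock is never held while a request is in flight.
public class NewFlowSketch {
    private final ExecutorService retryQueue = Executors.newSingleThreadExecutor(); // stand-in for Retry2's queue
    private final Object lock = new Object(); // stand-in for BulkProcessor2's lock
    private final AtomicInteger pendingDocs = new AtomicInteger();
    private final int flushThresholdDocs = 3;

    // Add a document; if the threshold is crossed, hand the batch to the queue and return at once.
    void add(String doc, CompletableFuture<Integer> flushListener) {
        synchronized (lock) { // held only for bookkeeping, never during I/O
            if (pendingDocs.incrementAndGet() >= flushThresholdDocs) {
                int docsToFlush = pendingDocs.getAndSet(0);
                retryQueue.submit(() -> flushListener.complete(docsToFlush)); // "Client::bulk" off-thread
            }
        }
    }

    static int demo() {
        NewFlowSketch processor = new NewFlowSketch();
        CompletableFuture<Integer> flushed = new CompletableFuture<>();
        processor.add("doc-1", flushed);
        processor.add("doc-2", flushed);
        processor.add("doc-3", flushed); // crosses the threshold; flush happens asynchronously
        try {
            return flushed.get(5, TimeUnit.SECONDS); // the three adds above all returned immediately
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            processor.retryQueue.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("docs flushed asynchronously: " + demo());
    }
}
```

Because the lock protects only the bookkeeping and the send happens off-thread, no caller can end up waiting on a lock that is held across network I/O or retries.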

We no longer use a semaphore to throttle how many concurrent requests can be sent to Elasticsearch at once, and there is no longer any blocking. Instead we throttle the (approximate) total number of bytes in flight to Elasticsearch, and allow Elasticsearch to throw an EsRejectedExecutionException if it thinks there are too many concurrent requests.
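A minimal sketch of such a bytes-in-flight throttle (the class and method names here are hypothetical; a false return stands in for notifying the listener with an EsRejectedExecutionException):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: throttle by approximate bytes in flight rather than a
// semaphore. Over-limit reservations are rejected immediately, never blocked.
public class BytesInFlightThrottle {
    private final AtomicLong bytesInFlight = new AtomicLong();
    private final long maxBytesInFlight;

    public BytesInFlightThrottle(long maxBytesInFlight) {
        this.maxBytesInFlight = maxBytesInFlight;
    }

    // Try to reserve capacity for a request. A false return stands in for
    // completing the caller's listener with EsRejectedExecutionException.
    public boolean tryReserve(long requestBytes) {
        while (true) {
            long current = bytesInFlight.get();
            if (current + requestBytes > maxBytesInFlight) {
                return false; // too many bytes already in flight
            }
            if (bytesInFlight.compareAndSet(current, current + requestBytes)) {
                return true;
            }
        }
    }

    // Called when a bulk response (or failure) arrives for a reserved request.
    public void release(long requestBytes) {
        bytesInFlight.addAndGet(-requestBytes);
    }

    public static void main(String[] args) {
        BytesInFlightThrottle throttle = new BytesInFlightThrottle(100);
        System.out.println(throttle.tryReserve(60)); // true
        System.out.println(throttle.tryReserve(60)); // false: would exceed 100 bytes in flight
        throttle.release(60);
        System.out.println(throttle.tryReserve(60)); // true again once the first bulk completes
    }
}
```

The compare-and-set loop keeps the accounting lock-free, which is why the count is only approximate: a request's bytes are reserved before the send and released when its response arrives.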

Closes #50440
Closes #68468

@masseyke masseyke added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Nov 1, 2022
masseyke (Member, Author) commented Nov 2, 2022

@elasticmachine update branch

masseyke (Member, Author) commented Nov 3, 2022

@elasticmachine update branch

@masseyke masseyke changed the title Bulk processor deadlock Avoiding BulkProcessor deadlock in ILMHistoryStore Nov 3, 2022
@masseyke masseyke removed the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Nov 4, 2022
@masseyke masseyke requested a review from joegallo January 4, 2023 15:15
@masseyke masseyke added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 5, 2023
masseyke (Member, Author) commented Jan 5, 2023

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 0e5844f into elastic:main Jan 9, 2023
@masseyke masseyke deleted the BulkProcessor-deadlock branch January 9, 2023 15:58
masseyke added a commit that referenced this pull request Mar 2, 2023
In #91238 we rewrote BulkProcessor to avoid a deadlock that had been seen in the ILMHistoryStore.
This commit ports watcher over to the new BulkProcessor2 implementation. The only real change
is that watcher history documents are now indexed asynchronously instead of in a blocking way,
so tests had to change to account for this.
elasticsearchmachine pushed a commit that referenced this pull request Mar 2, 2023
* backporting 91238 and 86184

* increasing test timeouts (#92771)

BulkProcessor2IT can occasionally fail with timeouts like this:

```
java.util.concurrent.TimeoutException: (No message provided)

  at __randomizedtesting.SeedInfo.seed([164F04355E8E8724:9D44A00946BBB3F3]:0)
  at java.util.concurrent.Phaser.awaitAdvanceInterruptibly(Phaser.java:795)
  at org.elasticsearch.action.bulk.Retry2.awaitClose(Retry2.java:129)
  at org.elasticsearch.action.bulk.BulkProcessor2.awaitClose(BulkProcessor2.java:254)
  at org.elasticsearch.action.bulk.BulkProcessor2IT.testBulkProcessor2ConcurrentRequestsReadOnlyIndex(BulkProcessor2IT.java:197)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)
```

It looks like we're just cutting it a little too closely using a
1-second timeout to wait for all requests to complete. This PR bumps
that timeout to 5 seconds. In the previous version of this test
(BulkProcessorIT) the code did not actually wait for all requests to
complete, which explains why this behavior is new. Closes #92770

* fixing build problems

* reverting accidental change

* fixing build problems

* fixing a unit test

* fixing tests

* fixing tests

* Not propagating TimeoutException from Retry2::awaitClose (#92773)

Logging a message rather than propagating a TimeoutException from Retry2::awaitClose
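The stack trace above points at Phaser.awaitAdvanceInterruptibly inside Retry2.awaitClose. The following is an illustrative sketch, not the real Retry2 code, of how such a bounded wait for in-flight requests can work, returning false and logging on timeout in the spirit of this change:

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of awaitClose-style waiting, inferred from the Phaser frame
// in the stack trace; names and structure are assumptions, not the real Retry2.
public class AwaitCloseSketch {
    // One party for the closer itself; each in-flight request registers another.
    private final Phaser inFlightRequests = new Phaser(1);

    void requestStarted() {
        inFlightRequests.register();
    }

    void requestFinished() {
        inFlightRequests.arriveAndDeregister();
    }

    // Wait up to the timeout for all in-flight requests to finish; on timeout,
    // report the problem instead of propagating the TimeoutException.
    boolean awaitClose(long timeout, TimeUnit unit) {
        try {
            int phase = inFlightRequests.arriveAndDeregister(); // the closer bows out
            inFlightRequests.awaitAdvanceInterruptibly(phase, timeout, unit);
            return true;
        } catch (TimeoutException e) {
            System.out.println("Timed out waiting for in-flight bulk requests to complete");
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static boolean closeWithNoRequests() {
        return new AwaitCloseSketch().awaitClose(100, TimeUnit.MILLISECONDS);
    }

    static boolean closeWithStuckRequest() {
        AwaitCloseSketch sketch = new AwaitCloseSketch();
        sketch.requestStarted(); // this request never finishes
        return sketch.awaitClose(100, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        System.out.println("close with no requests:   " + closeWithNoRequests());
        System.out.println("close with stuck request: " + closeWithStuckRequest());
    }
}
```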

---------

Co-authored-by: Joe Gallo <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Mar 3, 2023
…BulkProcessor (#94172)

In #91238 we rewrote
BulkProcessor to avoid deadlock that had been seen in the
IlmHistoryStore. At some point we will remove BulkProcessor altogether.
This PR ports a couple of integration tests that were using BulkProcessor
over to BulkProcessor2.
masseyke added a commit that referenced this pull request Mar 6, 2023
In #91238 we rewrote BulkProcessor to avoid deadlock that had been seen in the IlmHistoryStore.
This commit ports deprecation logging over to the new BulkProcessor2 implementation.
masseyke added a commit that referenced this pull request Mar 22, 2023
In #91238 we rewrote BulkProcessor to avoid deadlock that had been seen in the
IlmHistoryStore. This PR ports TSDB downsampling over to the new
BulkProcessor2 implementation.
Labels
  • auto-merge-without-approval: Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
  • >bug
  • :Data Management/ILM+SLM: Index and Snapshot lifecycle management
  • Team:Data Management: Meta label for data/management team
  • v8.7.0