Resolve concurrency with watcher trigger service #39092
Conversation
The watcher trigger service could attempt to modify the perWatchStats map simultaneously from multiple threads. This would cause the internal state to become inconsistent; in particular, the count() method could return an incorrect value for the number of watches. This change replaces the implementation of the map with a ConcurrentHashMap so that its internal state remains consistent even when accessed from multiple threads. Resolves: elastic#39087
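For context, a minimal sketch of the shape of this change. Only perWatchStats and count() are named in the description above; TriggerService and WatchStats stand in for the real classes and the methods are illustrative, not the exact Elasticsearch source:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TriggerService {

    // Before this change the map was a plain (unsynchronized) map, so
    // concurrent add/remove could corrupt its internal state.
    private final Map<String, WatchStats> perWatchStats = new ConcurrentHashMap<>();

    void add(String watchId, WatchStats stats) {
        perWatchStats.put(watchId, stats);      // safe from any thread
    }

    void remove(String watchId) {
        perWatchStats.remove(watchId);          // safe from any thread
    }

    int count() {
        // With ConcurrentHashMap, size() stays consistent with the
        // entries actually present, even under concurrent modification.
        return perWatchStats.size();
    }
}

class WatchStats {}
```

ConcurrentHashMap reads are non-blocking and writes lock only a single bin, so unlike Collections.synchronizedMap this does not serialize the trigger path on a global lock.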
Pinging @elastic/es-core-features
```diff
@@ -141,7 +141,6 @@ public void testLoadMalformedWatchRecord() throws Exception {
         });
     }
 
-    @AwaitsFix(bugUrl = "Supposedly fixed; https://github.com/elastic/x-pack-elasticsearch/issues/1915")
     public void testLoadExistingWatchesUponStartup() throws Exception {
```
This test would fail approximately 1 in 50 times due to this bug. The final assertBusy would fail because the returned getWatchesCount() would be 1 less than expected (numWatches), due to the size of the internal map becoming corrupt. If you iterated through the map and explicitly counted the items, the count was correct, but size() returned an incorrect value.
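That failure mode can be reproduced in isolation, independent of the watcher code: hammer a plain HashMap from two threads, then compare size() with an explicit count of the entries. Since this is undefined behaviour it won't misbehave on every run (the test above only failed ~1 in 50 times), and a corrupted map may also throw or hang instead of merely miscounting:

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapRace {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                map.put(i, i);      // unsynchronized concurrent writes:
                map.remove(i);      // undefined behaviour on a HashMap
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Iteration happens after both threads finish, but the damage
        // is already done: size() may disagree with the iterated count,
        // which is exactly the mismatch described above.
        long iterated = map.entrySet().stream().count();
        System.out.println("size()=" + map.size() + " iterated=" + iterated);
    }
}
```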
This test didn't set an id() on the Watch, but ConcurrentHashMap doesn't allow null keys. Note: per Watch.equals and Watch.hashCode, id is not allowed to be null anyway.
```java
Watch watch = mock(Watch.class);
when(watch.trigger()).thenReturn(trigger);
when(watch.id()).thenReturn(id);
```
ConcurrentHashMap requires a non-null key, so this test would otherwise fail. Watch.hashCode and Watch.equals already assume id is non-null, so the requirement in ConcurrentHashMap is fine; it's just the test that was wrong.
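The null-key difference between the two map implementations is easy to demonstrate on its own:

```java
import java.util.HashMap;
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        // HashMap tolerates a single null key.
        new HashMap<String, String>().put(null, "ok");
        try {
            // ConcurrentHashMap rejects null keys (and null values).
            new ConcurrentHashMap<String, String>().put(null, "boom");
        } catch (NullPointerException e) {
            // This is why the mocked Watch now needs a non-null id().
            System.out.println("ConcurrentHashMap rejects null keys");
        }
    }
}
```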
@tvernum nice find! LGTM
The watcher trigger service could attempt to modify the perWatchStats map simultaneously from multiple threads. This would cause the internal state to become inconsistent; in particular, the count() method could return an incorrect value for the number of watches. This change replaces the implementation of the map with a ConcurrentHashMap so that its internal state remains consistent even when accessed from multiple threads. Backport of: elastic#39092
…follow

* elastic/master: (37 commits)
  Enable test logging for TransformIntegrationTests#testSearchTransform.
  stronger wording for ilm+rollover in docs (elastic#39159)
  Mute SingleNodeTests (elastic#39156)
  AwaitsFix XPackUsageIT#testXPackCcrUsage.
  Resolve concurrency with watcher trigger service (elastic#39092)
  Fix median calculation in MedianAbsoluteDeviationAggregatorTests (elastic#38979)
  [DOCS] Edits the remote clusters documentation (elastic#38996)
  add version 6.6.2
  Revert "Mute failing test 20_mix_typless_typefull (elastic#38781)" (elastic#38912)
  Rebuild remote connections on profile changes (elastic#37678)
  Document 'max_size' parameter as shard size for rollover (elastic#38750)
  Add some missing toString() implementations (elastic#39124)
  Migrate Streamable to Writeable for cluster block package (elastic#37391)
  fix RethrottleTests retry (elastic#38978)
  Disable date parsing test in non english locale (elastic#39052)
  Remove BCryptTests (elastic#39098)
  [ML] Stop the ML memory tracker before closing node (elastic#39111)
  Allow retention lease operations under blocks (elastic#39089)
  ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs (elastic#38822)
  Fix retention leases sync on recovery test
  ...
There is a strong indication that the test was originally failing for the same reason as testLoadExistingWatchesUponStartup. This was fixed in #39092 and the cause is explained in https://github.com/elastic/elasticsearch/pull/39092/files#r257895150
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few times with various causes (not all of which we have logs for). This change enables the test again.
1. The fix from elastic#39092 should resolve any issues in assertWatchCount.
2. In at least one case, getWatchHistoryEntry failed due to a ResponseException, which is not caught by assertBusy. This commit catches those and calls "fail" so that assertBusy will sleep and retry (see the sketch below).
3. Additional logging has been included to help diagnose any other failure causes.
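A sketch of point 2, using the getWatchHistoryEntry helper mentioned above plus the usual ESTestCase/Hamcrest utilities; watchId and the exact assertion are illustrative, not copied from the test:

```java
// assertBusy retries only on AssertionError; a checked exception such as
// ResponseException escaping the lambda aborts the loop immediately.
// Converting it into an AssertionError via fail() makes assertBusy sleep
// and retry instead. (The enclosing test method declares `throws Exception`.)
assertBusy(() -> {
    try {
        ObjectPath objectPath = getWatchHistoryEntry(watchId);
        assertThat(objectPath.evaluate("state"), is("executed"));
    } catch (ResponseException e) {
        // e.g. the .watcher-history index is not searchable yet;
        // fail() throws AssertionError, so assertBusy will retry
        fail("error getting watch history entry: " + e.getMessage());
    }
});
```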