
Resolve concurrency with watcher trigger service #39092

Merged: 2 commits merged into elastic:master on Feb 19, 2019

Conversation

@tvernum (Contributor) commented Feb 19, 2019

The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent; in particular, the count()
method could return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Resolves: #39087
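
The change itself is small. A minimal sketch of the idea, assuming a trigger-service field named perWatchStats and a placeholder stats type (the exact field and value types in the Elasticsearch source may differ):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TriggerService {
    // Placeholder for the real per-watch statistics type in Watcher.
    static final class TriggerWatchStats { }

    // Before the fix the field was a plain HashMap, whose internal
    // structure can be corrupted by unsynchronized concurrent writes,
    // e.g. size() drifting out of sync with the stored entries.
    // private final Map<String, TriggerWatchStats> perWatchStats = new HashMap<>();

    // After the fix: a ConcurrentHashMap keeps size() and the stored
    // entries consistent under concurrent access, without external locking.
    private final Map<String, TriggerWatchStats> perWatchStats = new ConcurrentHashMap<>();

    public long count() {
        return perWatchStats.size();
    }
}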

@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features

@@ -141,7 +141,6 @@ public void testLoadMalformedWatchRecord() throws Exception {
         });
     }

-    @AwaitsFix(bugUrl = "Supposedly fixed; https://github.com/elastic/x-pack-elasticsearch/issues/1915")
     public void testLoadExistingWatchesUponStartup() throws Exception {
@tvernum (Contributor, Author) commented:

This test would fail approximately 1 in 50 times due to this bug.
The final assertBusy would fail because the returned getWatchesCount() would be 1 less than expected (numWatches), since the size of the internal map had become corrupted.

If you iterated through the map and explicitly counted the items, the count was correct, but size() returned an incorrect value.
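
A minimal, self-contained sketch of this failure mode, for illustration only (not code from this PR; results vary by run and JVM):

import java.util.HashMap;
import java.util.Map;

// Unsynchronized concurrent put() calls on a plain HashMap can leave
// size() inconsistent with the entries actually stored.
public class HashMapSizeRace {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int offset = t * 10_000;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    map.put(offset + i, i); // racy writes, no locking
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        int counted = 0;
        for (Integer key : map.keySet()) { // manual count by iteration
            counted++;
        }
        // With HashMap these two numbers frequently disagree (as seen in
        // the failing test, iteration may count correctly while size()
        // does not); with ConcurrentHashMap both are reliably 40000.
        System.out.println("size()  = " + map.size());
        System.out.println("counted = " + counted);
    }
}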

This test didn't set an id() on the Watch, but ConcurrentHashMap
doesn't allow null keys.

// Note: per Watch.equals and Watch.hashCode, id is not allowed to be null
Watch watch = mock(Watch.class);
when(watch.trigger()).thenReturn(trigger);
when(watch.id()).thenReturn(id);
@tvernum (Contributor, Author) commented:

ConcurrentHashMap requires a non-null key, so this test would otherwise fail.
Watch.hashCode and Watch.equals already assume id is non-null, so the requirement imposed by ConcurrentHashMap is fine; it was just the test that was wrong.
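
For reference, ConcurrentHashMap rejects null keys (and null values) outright, unlike HashMap, which permits one null key. A tiny demonstration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> map = new ConcurrentHashMap<>();
        map.put("my-watch", "stats");  // fine
        map.put(null, "stats");        // throws NullPointerException
    }
}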

@jakelandis (Contributor) left a comment:

@tvernum nice find! LGTM

@tvernum tvernum merged commit e694473 into elastic:master Feb 19, 2019
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
Backport of: elastic#39092
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
Backport of: elastic#39092
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 20, 2019
…follow

* elastic/master: (37 commits)
  Enable test logging for TransformIntegrationTests#testSearchTransform.
  stronger wording for ilm+rollover in docs (elastic#39159)
  Mute SingleNodeTests (elastic#39156)
  AwaitsFix XPackUsageIT#testXPackCcrUsage.
  Resolve concurrency with watcher trigger service (elastic#39092)
  Fix median calculation in MedianAbsoluteDeviationAggregatorTests (elastic#38979)
  [DOCS] Edits the remote clusters documentation (elastic#38996)
  add version 6.6.2
  Revert "Mute failing test 20_mix_typless_typefull (elastic#38781)" (elastic#38912)
  Rebuild remote connections on profile changes (elastic#37678)
  Document 'max_size' parameter as shard size for rollover (elastic#38750)
  Add some missing toString() implementations (elastic#39124)
  Migrate Streamable to Writeable for cluster block package (elastic#37391)
  fix RethrottleTests retry (elastic#38978)
  Disable date parsing test in non english locale (elastic#39052)
  Remove BCryptTests (elastic#39098)
  [ML] Stop the ML memory tracker before closing node (elastic#39111)
  Allow retention lease operations under blocks (elastic#39089)
  ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs (elastic#38822)
  Fix retention leases sync on recovery test
  ...
jkakavas added a commit that referenced this pull request Feb 20, 2019
There is a strong indication that the test was originally failing
for the same reason as testLoadExistingWatchesUponStartup. This was
fixed in #39092 and the cause is explained in
https://github.com/elastic/elasticsearch/pull/39092/files#r257895150
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry (see the sketch below)
3. Additional logging has been included to help diagnose any other
   failure causes.
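
A minimal sketch of the retry pattern described in point 2, assuming an ESTestCase-style test; getWatchHistoryEntry and watchId are names taken from or assumed for this test, not verbatim code:

// Inside the test method. A transient ResponseException from the REST
// call would normally propagate and fail the test immediately, because
// assertBusy retries only when the block throws an AssertionError.
// Converting the exception into a failed assertion makes assertBusy
// sleep and retry instead.
assertBusy(() -> {
    try {
        Object entry = getWatchHistoryEntry(watchId);
        assertThat(entry, notNullValue());
    } catch (ResponseException e) {
        fail("transient failure fetching watch history: " + e.getMessage());
    }
});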
tvernum added a commit that referenced this pull request Feb 20, 2019
Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
(Same message as the SmokeTestWatcherTestSuiteIT commit above.)
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 21, 2019
Backport of: elastic#39092
tvernum added a commit that referenced this pull request Feb 21, 2019
Backport of: #39092
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
(Same message as the SmokeTestWatcherTestSuiteIT commit above.)
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
(Same message as the SmokeTestWatcherTestSuiteIT commit above.)
Successfully merging this pull request may close these issues.

BootStrapTests.testLoadExistingWatchesUponStartup fails