[CI] FullClusterRestartIT#testWatcher failures #48381

Closed · romseygeek opened this issue Oct 23, 2019 · 28 comments
Labels: :Data Management/Watcher · medium-risk · Team:Data Management · >test-failure

Comments

@romseygeek (Contributor)

This has been failing a couple of times a day since it was re-enabled in #48000.

Failures fall into two types. On 7.x and 7.5:

org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:35174], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"v6.7.0","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
	at __randomizedtesting.SeedInfo.seed([1CA518BE73754281:7148EB738314D577]:0)
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:567)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:160)

e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191023094703-5F65A9C2&_g=()
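
(For reference, the health check that times out here boils down to a request like the following. This is a minimal sketch reconstructed from the URI in the failure, not the exact body of FullClusterRestartIT#waitForYellow; Request comes from the low-level REST client and client() from the ESRestTestCase base class.)

import org.elasticsearch.client.Request;

Request request = new Request("GET", "/_cluster/health/.watches,bwc_watch_index,.watcher-history*");
request.addParameter("wait_for_status", "yellow");
request.addParameter("wait_for_no_relocating_shards", "true");
request.addParameter("wait_for_no_initializing_shards", "true");
request.addParameter("timeout", "30s");
// When the wait times out, the server answers 408 and performRequest throws
// the ResponseException shown in the stack trace above.
client().performRequest(request);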

and on master:

java.lang.AssertionError: 
Expected: <2>
     but: was <3>
	at __randomizedtesting.SeedInfo.seed([16B952622AA67F39:7B54A1AFDAC7E8CF]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:351)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:176)

e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191022054329-BF6E5EA6&_g=()

@romseygeek added the >test-failure and :Data Management/Watcher labels Oct 23, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-core-features (:Core/Features/Watcher)

@jakelandis (Contributor)

Increased the timeout for yellow state in #48434 (will backport) and will look into the assertion error: https://gradle-enterprise.elastic.co/s/kcjrg2hoa7zqe/

@hendrikmuhs

Another failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+bwc/1837/console

I think the last failures already had the 60s timeout in place.

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Nov 4, 2019
…r yellow

The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: elastic#48381
Related: elastic#48434
Related: elastic#47950
Related: elastic#40178
jakelandis added a commit that referenced this issue Nov 4, 2019
The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: #48381
Related: #48434
Related: #47950
Related: #40178
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Nov 4, 2019
…tic#48848)

The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: elastic#48381
Related: elastic#48434
Related: elastic#47950
Related: elastic#40178
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Nov 4, 2019
…tic#48848)

The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: elastic#48381
Related: elastic#48434
Related: elastic#47950
Related: elastic#40178
jakelandis added a commit that referenced this issue Nov 11, 2019
The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: #48381
Related: #48434
Related: #47950
Related: #40178
jakelandis added a commit that referenced this issue Nov 11, 2019
The timeout was increased to 60s to allow this test more time to reach a
yellow state. However, the test will still on occasion fail even with the
60s timeout.

Related: #48381
Related: #48434
Related: #47950
Related: #40178
@jakelandis (Contributor)

This test has been re-muted across branches and the timeout reduced back to the original 30s.

@martijnvg self-assigned this Dec 23, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 23, 2019
Previously this test failed waiting for yellow:
https://gradle-enterprise.elastic.co/s/fv55holsa36tg/console-log#L2676

Oddly cluster health returned red status, but there were no unassigned, relocating or initializing shards.
Placed the waiting for green in a try-catch block, so that when this fails again then cluster state gets printed.

Relates to elastic#48381
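
(A minimal sketch of the pattern this commit message describes, assuming the ESRestTestCase helpers client() and logger; the exact change is in the commit itself.)

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;

Request health = new Request("GET", "/_cluster/health");
health.addParameter("wait_for_status", "green");
health.addParameter("timeout", "60s");
try {
    client().performRequest(health);
} catch (ResponseException e) {
    // Dump the full cluster state before rethrowing, so the next CI failure
    // shows why health never reached green.
    Request state = new Request("GET", "/_cluster/state");
    logger.info("cluster state on failed wait for green: {}",
        EntityUtils.toString(client().performRequest(state).getEntity()));
    throw e;
}
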
martijnvg added a commit that referenced this issue Jan 6, 2020
Previously this test failed waiting for yellow:
https://gradle-enterprise.elastic.co/s/fv55holsa36tg/console-log#L2676

Oddly cluster health returned red status, but there were no unassigned, relocating or initializing shards.
Placed the waiting for green in a try-catch block, so that when this fails again then cluster state gets printed.

Relates to #48381
martijnvg added a commit that referenced this issue Jan 21, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Previously this test failed waiting for yellow:
https://gradle-enterprise.elastic.co/s/fv55holsa36tg/console-log#L2676

Oddly cluster health returned red status, but there were no unassigned, relocating or initializing shards.
Placed the waiting for green in a try-catch block, so that when this fails again then cluster state gets printed.

Relates to elastic#48381
@martijnvg (Member) commented Jan 24, 2020

This test hasn't failed since it was enabled in the master branch (committed December 23rd, merged January 6th).
I'm going to enable it on the 7.x branch too, and if it doesn't fail for a while, I will close this issue.

martijnvg added a commit that referenced this issue Jan 24, 2020
Previously this test failed waiting for yellow:
https://gradle-enterprise.elastic.co/s/fv55holsa36tg/console-log#L2676

Oddly cluster health returned red status, but there were no unassigned, relocating or initializing shards.
Placed the waiting for green in a try-catch block, so that when this fails again then cluster state gets printed.

Relates to #48381
martijnvg added a commit that referenced this issue Jan 27, 2020
martijnvg added a commit that referenced this issue Jan 27, 2020
removed unchecked suppress warnings.

See #48381
martijnvg added a commit that referenced this issue Jan 27, 2020
removed unchecked suppress warnings.

See #48381
martijnvg added a commit that referenced this issue Jan 27, 2020
@martijnvg (Member)

This test failed twice yesterday on the master and 7.x branches. I will mute the test now and investigate these failures with the additional logging that was added recently.

martijnvg added a commit that referenced this issue Jan 29, 2020
martijnvg added a commit that referenced this issue Jan 29, 2020
@martijnvg (Member)

Unmuted the test; when it fails again, it now captures watcher stats as well.

The watcher debug logs don't indicate that a watch executes, which could result in the index action not being executed. I think the watch may be stuck; hopefully the watcher stats will capture this.
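
(Sketched, the stats capture might look like the following; _watcher/stats with the _all metric includes current and queued watch executions, which is where a stuck watch should show up. Assumes the usual ESRestTestCase helpers.)

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;

Request statsRequest = new Request("GET", "/_watcher/stats/_all");
String stats = EntityUtils.toString(client().performRequest(statsRequest).getEntity());
// A stuck watch would appear under current_watches with a long-running
// execution; an empty list plus missing history would point at triggering instead.
logger.info("watcher stats: {}", stats);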

@martijnvg (Member)

This test hasn't failed for almost a month, so I will close this issue.
If it fails again, a new issue should be opened.

@cbuescher (Member)

@martijnvg I just saw that this test is still muted on 7.x; should your unmute commit 9d7b80f be backported, or was this done on purpose?

@martijnvg (Member)

@cbuescher I think I forgot the backport... I will backport the commit you mention today.

@mayya-sharipova (Contributor) commented Aug 21, 2020

Reopening the issue, as we had another failure on master today:

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/176/console
Build Scans: https://gradle-enterprise.elastic.co/s/humaqzeduckx4

REPRODUCE WITH: ./gradlew ':x-pack:qa:full-cluster-restart:v8.0.0#upgradedClusterTest' --tests "org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher"
-Dtests.seed=7EF337E5D8E436CC
-Dtests.security.manager=true
-Dtests.locale=ar-KW
-Dtests.timezone=US/Central
-Druntime.java=11

java.lang.AssertionError:
Expected: <2>
     but: was <3>
	at __randomizedtesting.SeedInfo.seed([7EF337E5D8E436CC:131EC4282885A13A]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:418)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:207)

The failure doesn't reproduce locally.

@joegallo added the Team:Data Management label Nov 18, 2020
@droberts195 (Contributor)

The logging that was added to help debug this issue is still there today - this line:

setting 'logger.org.elasticsearch.xpack.watcher', 'DEBUG'

That line was added over a year ago in 106c3ce. It means that the server-side logs for the X-Pack full cluster restart tests are 99% watcher debug, making it hard to debug anything else. Is this extra debug level logging still required today?

@martijnvg (Member)

@droberts195 I will remove that line. If needed, we can enable watcher debug logging again when we get back to investigating this test failure (and not forget to remove it afterwards).
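
(As an aside, instead of baking the logger into the test cluster config, it can be toggled at runtime through the cluster settings API, which makes it easier to scope and remove; a sketch:)

import org.elasticsearch.client.Request;

Request settings = new Request("PUT", "/_cluster/settings");
// Set the logger only while investigating; reset it by sending null afterwards.
settings.setJsonEntity("{\"persistent\": {\"logger.org.elasticsearch.xpack.watcher\": \"DEBUG\"}}");
client().performRequest(settings);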

@martijnvg removed their assignment May 19, 2021
martijnvg added a commit that referenced this issue May 19, 2021
maybe enabling it again when investigating watcher full cluster restart qa tests (#48381)
martijnvg added a commit that referenced this issue May 19, 2021
maybe enabling it again when investigating watcher full cluster restart qa tests (#48381)
martijnvg added a commit that referenced this issue May 19, 2021
maybe enabling it again when investigating watcher full cluster restart qa tests (#48381)
@martijnvg (Member)

A different instance of this test failing: https://gradle-enterprise.elastic.co/s/alpg4ojedmfhg

org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED	
    org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:44101/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]	
    {"cluster_name":"v7.0.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}	
        at __randomizedtesting.SeedInfo.seed([7E0AB41A5FE5E699:13E747D7AF84716F]:0)	
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)	
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)	
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)	
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:757)	
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:159)

@arteam (Contributor) commented Jun 21, 2022

Started happening again (failed 5 times over the last week):
https://gradle-enterprise.elastic.co/s/4nqy4ic7pc2cc/tests/:x-pack:qa:full-cluster-restart:v7.3.1%23oldClusterTest/org.elasticsearch.xpack.restart.FullClusterRestartIT/testWatcher?top-execution=1

org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:38531], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]

@benwtrent (Member)

Another failure, again due to timeout:

https://gradle-enterprise.elastic.co/s/o6w6vll7qjego

org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED	
    org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:37476/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]	
    {"cluster_name":"v7.3.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}	
        at __randomizedtesting.SeedInfo.seed([4A1898BDB88606D4:27F56B7048E79122]:0)	
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)	
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)	
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)	
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:752)	
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:158)

@andreidan (Contributor)

Another failure https://gradle-enterprise.elastic.co/s/htn2dtr4ajl5c

This one seems to be a genuine test failure:

java.lang.AssertionError:
    Expected: <2>
         but: was <3>
        at __randomizedtesting.SeedInfo.seed([8299F6568D3BC8E1:EF74059B7D5A5F17]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:722)
        at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:200)

@thecoop (Member) commented May 3, 2023

And another https://gradle-enterprise.elastic.co/s/snmnlk6itki4w, timeout this time

org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:33149], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"elasticsearch","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

@cbuescher (Member)

Looks like this fits here: https://gradle-enterprise.elastic.co/s/tb7rzfmtfukac

@dakrone added the medium-risk label Oct 12, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke (Member)

I think that the version is getting incremented because watcher sometimes happens to run, and it updates the document in .watches with status fields. By default it does this every 500 ms (xpack.watcher.trigger.schedule.ticker.tick_interval), so it seems believable that every once in a while this happens while the test is running. I'm going to change the test to assert that the version is at least 2.
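
(A sketch of that relaxed assertion; the names here are illustrative rather than the test's actual code. The watch document starts at version 1, the test's own update makes it 2, and a concurrent status write from the watcher ticker can bump it further, hence a lower bound.)

import static org.hamcrest.Matchers.greaterThanOrEqualTo;

import java.util.Map;
import org.elasticsearch.client.Request;

Map<String, Object> getWatch =
    entityAsMap(client().performRequest(new Request("GET", "/_watcher/watch/new_watch")));
// Allow any number of watcher-driven status updates on top of the expected two.
assertThat((int) getWatch.get("_version"), greaterThanOrEqualTo(2));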

@masseyke (Member)

The timeout failure is more interesting though. From the latest failure, it appears to be happening because the bwc_watch_index does not exist, so the health check fails (the other two indices exist, according to the cluster state that's dumped out on failure). I have no idea why bwc_watch_index would not exist though. I can see that the bwc_watch was created. And according to the watcher stats that are dumped out on failure, watcher is up and running, and has run 29 watches (which probably maps to running this watch once per second for the 29 seconds it could before the 30-second timeout). There is no mention of bwc_watch failing, and no mention of bwc_watch_index being created or failing to be created. I haven't been able to reproduce this locally.
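
(For anyone unfamiliar with the setup: the watch's index action is what creates bwc_watch_index, so if the action never fires, the index never appears and the health wait times out. Illustratively, a watch of that shape looks roughly like the following; this is not the test's exact watch definition.)

import org.elasticsearch.client.Request;

Request putWatch = new Request("PUT", "/_watcher/watch/bwc_watch");
putWatch.setJsonEntity(
    "{"
        + "\"trigger\": {\"schedule\": {\"interval\": \"1s\"}},"
        + "\"input\": {\"simple\": {\"payload\": \"test\"}},"
        + "\"condition\": {\"always\": {}},"
        // The index action below is what should create bwc_watch_index on
        // first execution.
        + "\"actions\": {\"index_payload\": {\"index\": {\"index\": \"bwc_watch_index\"}}}"
        + "}");
client().performRequest(putWatch);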

@masseyke (Member)

I realized I hadn't been paying attention when I said that the watch had run 29 times -- the threadpool max size was 29. In the run that failed there were only 2 watches in the watch stats. The test creates 3 watches. I dumped out the watcher stats in a successful run, and one node showed 2 watches and the other 1 watch (as you'd expect). So it appears that for some reason in this timeout run the bwc_watch silently failed to be created in some way.

@elasticsearchmachine closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 5, 2024
@elasticsearchmachine (Collaborator)

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.
