[CI] RemoteClusterServiceTests/TransportSearchActionTests#testCollectSearchShards times out waiting in latch #33852
Pinging @elastic/es-search-aggs
This test occasionally fails in `testCollectSearchShards` while waiting up to one second on what seems to be a search request to a remote cluster. Since the failure is very rare, I suspect that one second is occasionally not enough, so we could fix it by slightly increasing the maximum wait time. Closes elastic#33852
I believe this is the same issue, seen on master today. It doesn't reproduce.
Another failure waiting on the latch at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/838/console
Another one today in 6.x: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1026
We have recently gone from 1 second to 5, and we still had some failures. Increasing this one last time, considerably, which should help us understand whether the failures are caused by test infra slowness or by a bug. Relates to elastic#33852
We have recently gone from 1 second to 5, and we still had some failures. Note that testCollectSearchShards was recently moved and reworked from RemoteClusterServiceTests to TransportSearchActionTests. Increasing this one last time, considerably, which should help us understand whether the failures are caused by test infra slowness or by a bug. Relates to elastic#33852
I have done quite a bit of digging on this one. I was not able to reproduce the issue, even when running this test in a loop for hours and hours. The last failure is from January 23rd; at the time, this test was failing around one or two times per day. We previously increased the wait for the responses a couple of times. The failure that I have analyzed happens before we even simulate network failures. At the second
The three I am not too happy with increasing the timeout once more like I proposed in #38198 and #38199, at least not unless we understand exactly what slows things down. We could try to decrease the number of remote clusters (hence nodes) that we connect to, but I am not sure that is the problem either. Also, this test and the corresponding tested code have been moved on master, 7.0 and 7.x to @tbrooks8 do you have any suggestion on changes to make here, or on what could have caused this situation?
Closing this one as it has not failed in a month or so. I think that infra changes have helped with this.
Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+intake/35/consoleText
heya @dnhatn, at first glance this seems like a different failure? This is a connect timeout, while historically we have had latch awaits time out in this test.
Just had this failure in a PR build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/311/console
But I found several instances over time in the CI mails, e.g.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+release-tests/922/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/6789/console
and some a while ago, in February.
It seems to be a rare timing issue, so this doesn't reproduce for me locally:
The failure shows the test checking a latch after waiting for one second; maybe that is not long enough and should be increased.
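For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of waiting on a latch with a bounded timeout. It is not the actual test code: the class name `LatchTimeoutSketch`, the method `completesWithin`, and the simulated delays are all hypothetical, and a plain background thread stands in for the remote search request. It only illustrates why a fixed wait can fail spuriously when the asynchronous work is occasionally slower than expected.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the Elasticsearch test suite.
class LatchTimeoutSketch {

    /**
     * Starts a background task that takes roughly workMillis to finish and
     * waits at most maxWaitMillis for it. Returns true if the latch was
     * counted down in time, false if the wait timed out.
     */
    static boolean completesWithin(long workMillis, long maxWaitMillis) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                // Stand-in for an asynchronous request whose latency varies.
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            latch.countDown();
        });
        worker.setDaemon(true); // do not keep the JVM alive after a timed-out wait
        worker.start();
        // await(timeout, unit) returns false when the timeout elapses first.
        return latch.await(maxWaitMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("fast task, generous wait: " + completesWithin(10, 2000));
        System.out.println("slow task, short wait:    " + completesWithin(2000, 50));
    }
}
```

A test that asserts on the boolean returned by `await` fails whenever the work overruns the limit, regardless of whether the delay comes from a bug or from a slow CI machine, which is why raising the limit alone cannot distinguish the two.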