[CI] resource_already_exists_exception failures #45605
Pinging @elastic/es-distributed
I just noticed this is a more detailed investigation of one specific case of #45600, so this may not be a rare problem at all.
Pinging @elastic/es-core-infra
Could it be that somehow the clusters united, so in fact the tests were running against one larger cluster rather than two separate ones? I think that would explain the behavior from the logs.
Maybe that has happened in other CI runs, but in the CI run described in the description of this issue the two test clusters did not unite. The server side logs for the docs cluster show that it only had 1 node (as expected) and the server side logs for the
Of the two examples we have so far, https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2016/71/console is one where the failure to delete an index between tests occurs and just a single test fails in the CI run:
The failure that follows on from that is:
The repro command for this failure was:
Given that the problem occurs in the cleanup between tests, I'm not sure individual repro commands are that helpful though.
As a side note, it seems that we should fail if the cleanup fails instead of attempting to run the remaining tests.
We looked at this failure on Slack a couple of days ago with @original-brownbear and did not update here, so doing so now in case someone finds it useful. There were some CCR test failures that looked similar (index not found) that had to do with caching test tasks when we shouldn't have. CCR tests are usually made up of a test that sets CCR up and another one that checks on it, so if part of it gets cached the index will never replicate. This was fixed in #45659 and is unrelated to this issue, as here we are dealing with tests that do run with testclusters, which can correctly deal with caching. It also seems that this fails on tests that run against a multi node cluster. The following exception shows up after the test when the test framework tries to clean up all indices:
In this case we had 2 tests that used the same index. There is no evidence that the index that exists was ever created. No tests that use that index seem to have run prior to this, and the test is connected to the right cluster, which has the right nodes, ruling out cross-talk with other tests. @dnhatn suggested this could be related to #45409 but we seem to have seen failures after that was merged as well.
In internal test clusters tests we check that wiping all indices was acknowledged but in REST tests we didn't. This aligns the behavior in both kinds of tests. Relates elastic#45605 which might be caused by unacked deletes that were just slow.
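The change described above amounts to inspecting the acknowledged flag on the delete-indices response during REST test cleanup and failing loudly when it is false. The snippet below is only a rough sketch of that idea using the low-level REST client, not the actual ESRestTestCase code; the class name, endpoint, and use of Jackson for response parsing are assumptions made for illustration.

```java
// Illustrative sketch only, not the actual ESRestTestCase change: delete every index during
// cleanup and fail if the delete was not acknowledged, instead of silently moving on and
// letting the next test trip over a leftover index.
import java.io.InputStream;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public final class WipeIndicesSketch {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    static void wipeAllIndices(RestClient client) throws Exception {
        Response response = client.performRequest(new Request("DELETE", "/_all"));
        try (InputStream body = response.getEntity().getContent()) {
            Map<?, ?> parsed = MAPPER.readValue(body, Map.class);
            if (Boolean.TRUE.equals(parsed.get("acknowledged")) == false) {
                // An unacknowledged delete means the indices may still exist when the next
                // test starts, which would surface as resource_already_exists_exception.
                throw new AssertionError("wiping all indices was not acknowledged: " + parsed);
            }
        }
    }
}
```

With a check like this in the shared cleanup, an unacked (or merely slow) delete shows up as a cleanup failure rather than as a confusing failure in whichever test happens to run next.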
On this intake build, OpenCloseIndexIT failed with symptoms that were similar to what @atorok described above: mysterious test failure, then
As noted in #45956 (comment) I do not think that
Interrupting this wait for the shard lock while the node is closing does seem like it might be problematic.
I've edited the title here. We still need to figure out what's causing the resource_already_exists_exception.
Subsequent updates moved to #46091, so closing this one. |
This problem occurred while running indices.update_aliases/10_basic/Basic test for multiple aliases in the qa/smoke-test-multinode directory. However, I do not think that particular test is to blame for the failure - it seems more like a very rare problem with index management in multi-node clusters. The build where this problem was seen was https://gradle-enterprise.elastic.co/s/fknt4cque3p4e
The client side problem that made the test fail was "index already exists":
The underlying reason from the server side logs was that a previous test's index of the same name could not be deleted:
The repro command is:
This did not reproduce for me.
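For what it's worth, the client-side symptom on its own is easy to trigger outside of CI by creating an index with a name that already exists; the second creation request is rejected with resource_already_exists_exception. A minimal, hypothetical reproduction using the low-level REST client (the host and index name here are made up):

```java
// Hypothetical standalone reproduction of the client-side error only, not of the CI failure:
// the second PUT for the same index name fails with resource_already_exists_exception.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

public final class AlreadyExistsRepro {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            client.performRequest(new Request("PUT", "/test_index"));       // first creation succeeds
            try {
                client.performRequest(new Request("PUT", "/test_index"));   // second creation fails
            } catch (ResponseException e) {
                // Expect HTTP 400 with resource_already_exists_exception in the response body.
                System.out.println(e.getResponse().getStatusLine());
            }
        }
    }
}
```

In the CI failure, of course, the index only "already exists" because the previous test's index of the same name could not be deleted, which is why the interesting part is the cleanup rather than the test itself.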
The other thing that might be relevant is that, around the time this was failing, a completely different test against a different test cluster failed on the same CI worker. It was:
The client side error for that was:
The server side for that docs test failure shows nothing apart from a big gap:
These two test clusters were different - it's not that the docs tests and the multi-node smoke tests were simultaneously talking to the same test cluster. But they were running at the same time and suffered failures around the same time.
It may be that these failures are caused by the CI worker being completely overloaded by multiple test clusters running simultaneously around the time of these failures.
But maybe there's something interesting in the ShardLockObtainFailedException that might be worth considering in case users see it on machines that are under heavy load.