[CI] IndexingIT testIndexing failing #100371
Comments
Pinging @elastic/es-data-management (Team:Data Management) |
OK, this particular failure might be due to ML. Attaching the logs. (I am not sure if this is indicative of all the "failure to get green" BWC tests). In the logs for X-Pack rolling upgrade, I see the following logs on some of the nodes (one node failed to start) and thus couldn't get to green. @elastic/ml-core What do y'all think?
And this trace:
|
The earliest failure I can find that's similar to this is https://gradle-enterprise.elastic.co/s/in6lvjac7niak from 3rd October. That would put it around the time of #100143. That change caused another problem, in #100180, which was fixed in #100388. This issue was opened before #100388 was merged. I cannot see any failures of @davidkyle please can you check if this failure would also be fixed by #100388? If it would then #100285 can probably be closed too, as I think that's the same problem, just raised against a different test (as 3 tests fail together). |
I'm wondering if the TrainedModelAssignmentMetadata serialization problem is also causing #100379 (once it happens there, all the tests after it fail for a while). |
Still no failures since the day #100388 was merged. I'll close this and we can reopen it if we start seeing this again. |
I have a suspicion that the test failure is actually due to the muting of another test in the same test suite that was failing and fixed by #100388; this would explain why the failure appears to be fixed by #100388. The failure is caused by the ML model deployment code updating the cluster state with a new named writable, which should only happen once all the nodes in the cluster have been upgraded. The unknown named writable is a fatal error for the 3rd node, causing IndexingIT to time out waiting for the cluster to have 3 nodes. I've pushed some logging in #100800 and re-muted |
There's still a problem with This is a failure from today that shows it: https://gradle-enterprise.elastic.co/s/36366ajvxm45i In the server-side logs,
The error happens on the node that's still on 8.2 after the other two nodes in the cluster have been upgraded to 8.12. I am pretty sure that something has been done in the last few weeks that invalidates the assumptions of #88289:
I am wondering if something has changed in the negotiation of transport versions that means the cluster briefly thinks all the nodes are on 8.12 even though one of them is on 8.2. This would cause the following test to be true: Lines 527 to 528 in 2ce5392
We could test the theory by temporarily changing that trace to an info and appending the min transport version to the message. |
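To make the theory above concrete, here is a minimal, self-contained sketch of the kind of gate being discussed: a new metadata format is written only when the minimum transport version across all nodes is at least the version that introduced it, and the decision is logged at info level together with the minimum version. This is not the Elasticsearch code. The class `MinVersionGate`, its methods, and the integer version ids are hypothetical stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical illustration only; names and version ids are not taken from Elasticsearch.
final class MinVersionGate {
    private static final Logger logger = Logger.getLogger(MinVersionGate.class.getName());

    // Returns the lowest transport version id among the known nodes.
    // The map is node name -> version id, a simplified stand-in for per-node version tracking.
    static int minTransportVersion(Map<String, Integer> nodeVersions) {
        return nodeVersions.values().stream()
            .mapToInt(Integer::intValue)
            .min()
            .orElseThrow(() -> new IllegalArgumentException("cluster has no nodes"));
    }

    // Allows the new format only if every node is at or above requiredVersion.
    // Logging at info level with the minimum version makes a wrong decision visible in CI logs.
    static boolean canWriteNewFormat(Map<String, Integer> nodeVersions, int requiredVersion) {
        final int min = minTransportVersion(nodeVersions);
        final boolean allowed = min >= requiredVersion;
        logger.info(() -> "new format allowed=" + allowed
            + ", min transport version=" + min
            + ", required=" + requiredVersion);
        return allowed;
    }

    // Builds a three-node cluster where the last node may still be on the old version.
    static Map<String, Integer> threeNodeCluster(int upgradedVersion, int lastNodeVersion) {
        Map<String, Integer> nodes = new HashMap<>();
        nodes.put("node-0", upgradedVersion);
        nodes.put("node-1", upgradedVersion);
        nodes.put("node-2", lastNodeVersion);
        return nodes;
    }
}
```

With made-up ids 802 and 812, `canWriteNewFormat(threeNodeCluster(812, 802), 812)` refuses the write at the two-thirds-upgraded stage. If the old node's version were missing from the map, or briefly reported as 812, the gate would pass and the old node would receive a named writable it cannot read, which matches the failure described above.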
The assumption was that the #100886 looks likely to be the cause although I cannot find a convincing explanation as to why. It does seem related to having no model deployments at the 2/3 stage and hence empty |
Pinging @elastic/ml-core (Team:ML) |
Reopening due to recent failures: https://gradle-enterprise.elastic.co/s/ened3yzjpisnq/console-log/task/:qa:rolling-upgrade:v8.10.3%23bwcTest?anchor=155&page=1 |
@volodk85 please could you open a separate issue for the new failure. The new failure is in the This issue is already pretty confusing and complex as the problem turned out to be in ML even though the failing test was an indexing test. In this new failure ML is almost certainly not the problem as the test is not an X-Pack test. |
We have far too many overloaded test suite names. This is just one of them. Sure, the package name is unique but it's easy to conflate these. I'm wondering if we should go through and give these test suites unique names, even if that means losing some test history for a time. |
This has started failing a ton the last few days. I frankly have no idea what team this should be assigned to.
Build scan:
https://gradle-enterprise.elastic.co/s/rnuiwsuovl67g/tests/:x-pack:qa:rolling-upgrade:v8.2.2%23twoThirdsUpgradedTest/org.elasticsearch.upgrades.IndexingIT/testIndexing
Reproduction line:
Applicable branches:
8.11, main
Reproduces locally?:
No
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.IndexingIT&tests.test=testIndexing
Failure excerpt: