[CI] Multicluster transform test failures (yaml=multi_cluster/80_transform/Batch transform from local and remote cluster) #51629
Comments
Pinging @elastic/ml-core (:ml/Transform) |
Another one earlier today on 7.x |
And I found another one from 6 days ago on master:
|
The root cause in all 3 cases seems to be a race condition when storing the state in the state document:
We use optimistic concurrency control and keep the seq_no/primary term in an atomic variable, so it looks like two save-state calls are running at the same time. |
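To illustrate the suspected race, here is a minimal sketch in Python. It is not the transform code: `ToyStateIndex`, `VersionConflict` and `simulate_overlapping_saves` are hypothetical names, and the in-memory store only imitates a conditional write that is rejected when the caller's sequence number and primary term are stale. The point is that two saves which both read the cached token before either write completes cannot both succeed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class VersionConflict(Exception):
    """The caller's (seq_no, primary_term) no longer matches the stored document."""


@dataclass
class StoredDoc:
    body: dict
    seq_no: int
    primary_term: int


class ToyStateIndex:
    """Hypothetical in-memory stand-in for the index that holds the state document."""

    def __init__(self) -> None:
        self._doc: Optional[StoredDoc] = None
        self._next_seq_no = 0
        self._primary_term = 1

    def index(self, body: dict,
              if_seq_no: Optional[int] = None,
              if_primary_term: Optional[int] = None) -> Tuple[int, int]:
        current = self._doc
        # Conditional write: reject the update if the token is stale.
        if current is not None and (if_seq_no, if_primary_term) != (
            current.seq_no, current.primary_term
        ):
            raise VersionConflict(f"expected seq_no={current.seq_no}, got {if_seq_no}")
        self._doc = StoredDoc(dict(body), self._next_seq_no, self._primary_term)
        self._next_seq_no += 1
        return self._doc.seq_no, self._doc.primary_term


def simulate_overlapping_saves() -> Tuple[bool, Tuple[int, int]]:
    index = ToyStateIndex()
    cached = index.index({"checkpoint": 1})  # initial write, no condition needed

    # Both saves read the cached token before either write has finished.
    token_a = cached
    token_b = cached

    cached = index.index({"checkpoint": 2}, *token_a)  # first save wins
    try:
        index.index({"checkpoint": 3}, *token_b)  # second save uses a stale token
        conflict = False
    except VersionConflict:
        conflict = True
    return conflict, cached


if __name__ == "__main__":
    print(simulate_overlapping_saves())
```

In this toy model the second save always hits a conflict, because the shared token is only updated after the first write returns. Keeping the token in an atomic variable does not help on its own if two callers can read it before either write completes.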
Ran into this on one of my PRs as well. Looks like this is happening in |
Thank you @mark-vieira! After some more investigation, I think my first assumption is not the problem. Together with the failures we also see:
I assumed this was benign, but it is the other way around: the version conflict is benign and the refresh failure isn't. TL;DR: Transform stores its state in an index, and that state is returned when the transform is stopped (no task, therefore no in-memory information). The refresh is triggered to ensure the latest state is flushed. Because the refresh fails, that is not the case, so an old state is returned. Note that an index refresh is eventually triggered internally by Lucene. This problem only causes trouble for non-human interaction (CI is a great way to spot these issues), and only if security is enabled and a non-admin user is used. |
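The stale read described above can be sketched with a toy model in Python. Everything here is an assumption for illustration: `ToySearchableIndex`, `RefreshDenied` and `read_state_after_stop` are hypothetical, and the model assumes the refresh fails because the caller lacks the privilege to refresh, which the comment suggests (security enabled, non-admin user) but does not spell out. Writes land in a buffer and only become readable after a successful refresh.

```python
from typing import Optional


class RefreshDenied(Exception):
    """Hypothetical error: the caller is not allowed to refresh the index."""


class ToySearchableIndex:
    """Toy model: writes are buffered and only become readable after a refresh."""

    def __init__(self) -> None:
        self._buffer = {}
        self._searchable = {}

    def index(self, doc_id: str, body: dict) -> None:
        self._buffer[doc_id] = dict(body)

    def refresh(self, allowed: bool = True) -> None:
        if not allowed:
            raise RefreshDenied("caller lacks permission to refresh")
        # Publish everything written since the last refresh.
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search(self, doc_id: str) -> Optional[dict]:
        return self._searchable.get(doc_id)


def read_state_after_stop(refresh_allowed: bool) -> Optional[dict]:
    index = ToySearchableIndex()
    index.index("state", {"checkpoint": 1})
    index.refresh()  # the older state is already visible

    index.index("state", {"checkpoint": 2})  # latest state, not yet visible
    try:
        index.refresh(allowed=refresh_allowed)
    except RefreshDenied:
        pass  # failure treated as benign, so the read below may be stale
    return index.search("state")


if __name__ == "__main__":
    print(read_state_after_stop(True))   # latest state
    print(read_state_after_stop(False))  # old state
```

With the refresh allowed the read returns the latest checkpoint. With the refresh denied and the error swallowed, it returns the older one, which matches the symptom a test sees when it reads the state immediately after stopping the transform.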
I am 99.9% certain that #51732 fixes this issue. The other optimistic concurrency control hiccup still needs to be addressed. |
The error hasn't happened again after the fix, closing. Follow up regarding logs: #52035 |
Unfortunately this just failed again in master. :x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner FAILED
|
Failure:
I wasn't able to reproduce locally with this: