[CI] StackOverflowError when executing SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization #28169
@tlrx I think I know what happened. Even though the logs don't provide evidence for the whole scenario, what's in the logs makes the following quite plausible. The test deliberately causes a …

What happens in that case in …

Now to the question of what could have caused it, or why we're not seeing this more often. What's odd is that it took such a long time for …

What fixes should we do? The first thing we should change is for …
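For what it's worth, here is a minimal, self-contained sketch of the general pattern (hypothetical code, not the actual Elasticsearch retry logic): if a failed attempt is retried synchronously from within the failure callback, every retry adds a frame to the same stack, and a long enough series of failures ends in a StackOverflowError.

```java
// Hypothetical illustration of a retry-on-failure pattern that recurses on the
// calling thread: each failed attempt triggers the next attempt from inside the
// failure callback, so the stack grows until it overflows.
public class RecursiveRetryDemo {

    interface ActionListener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Simulates an action that always fails (e.g. the request cannot be served yet).
    static void execute(ActionListener<Void> listener) {
        listener.onFailure(new IllegalStateException("not ready yet"));
    }

    static void executeWithRetry(ActionListener<Void> listener) {
        execute(new ActionListener<Void>() {
            @Override
            public void onResponse(Void response) {
                listener.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                // Retrying synchronously from the failure callback: no new thread,
                // no scheduled task, just a deeper and deeper call stack.
                executeWithRetry(listener);
            }
        });
    }

    public static void main(String[] args) {
        try {
            executeWithRetry(new ActionListener<Void>() {
                @Override
                public void onResponse(Void response) {}
                @Override
                public void onFailure(Exception e) {}
            });
        } catch (StackOverflowError error) {
            System.out.println("retry loop blew the stack: " + error);
        }
    }
}
```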
@ywelsch Thanks for the detailed comment. I understand the execution flow and I agree that it can explain the StackOverflowError.

Please let me know if/how I can help. Also, feel free to reassign this issue to yourself if you want.
ClusterHealthAction does not use the regular retry logic, possibly causing StackOverflowErrors. Relates #28169
I think we can close this. The StackOverflowError should be addressed by #28195.
The test SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization() failed on CI today on Windows Server 2012 R2 6.3 amd64 / Oracle Corporation 1.8.0_92 (64-bit). I first thought it was a snapshot/restore regression due to my recent changes in #28078 or #27931, but after looking at the test execution log I'm not so sure.

I wonder if in this test the cluster ends up in a situation where a listener.onFailure() call caused a stack overflow error on a network thread, which went uncaught by Elasticsearch's usual UncaughtExceptionHandler and also caused the NIO Selector to be closed, so the node stopped listening to incoming requests.

The test starts 3 master-only nodes and 1 data-only node. Once the cluster is stable, it sets up a snapshot repository and creates a first snapshot to check that everything is working correctly. It then sets up a disruption scheme that is designed to isolate the master node as soon as a snapshot-in-progress entry in INIT state is found in the cluster state. The next step in the test is to create a second snapshot, which triggers the disruption scheme, and to wait for the cluster to elect a new master that terminates this second snapshot.
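To illustrate the failure mode I have in mind, here is a rough sketch in plain java.nio (an assumption about the mechanism, not Elasticsearch's actual transport code): an event loop that only guards against Exception lets an Error escape, the selector gets closed on the way out, and the thread stops serving I/O.

```java
import java.io.IOException;
import java.nio.channels.Selector;

// Plain-NIO sketch of an event loop whose error handling only covers Exception.
// An Error raised while handling events escapes the loop, the finally block
// closes the Selector, and the thread stops processing I/O events.
public class SelectorLoopDemo {

    public static void main(String[] args) throws IOException, InterruptedException {
        Selector selector = Selector.open();

        Thread selectorThread = new Thread(() -> {
            try {
                while (selector.isOpen()) {
                    try {
                        selector.select(100);      // wait up to 100 ms for ready channels
                        handleSelectedKeys();      // pretend to process the ready events
                    } catch (Exception e) {
                        // Exceptions are tolerated and the loop keeps running,
                        // but an Error is not caught here and escapes the loop.
                        System.out.println("recovered from: " + e);
                    }
                }
            } finally {
                try {
                    selector.close();              // the loop is gone: no more I/O on this selector
                } catch (IOException ignored) {
                }
            }
        }, "selector-loop");

        selectorThread.setUncaughtExceptionHandler((t, e) ->
                System.out.println(t.getName() + " died with " + e));
        selectorThread.start();
        selectorThread.join();
        System.out.println("selector still open after loop exit? " + selector.isOpen());
    }

    // Stand-in for event handling; unbounded recursion here simulates the
    // StackOverflowError seen in the test logs.
    private static void handleSelectedKeys() {
        handleSelectedKeys();
    }
}
```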
In the logs of this execution, node_tm0, node_tm1 and node_tm2 are started as master-only nodes. node_tm1 is elected as master and adds the data-only node node_td3 to the cluster.

The test runs correctly up to the point where the second snapshot is created:
Once the disruption is started, the master node node_tm1 is isolated. The other nodes think it left:

So the remaining nodes elect node_tm0 as the new master node. It uses the last committed cluster state, in version 25:

The new master node updates the cluster state to announce that it is now the master. But the publication of cluster state version 26 is not processed by the old master node, which is still isolated:
So the new master node gives up, removes the old master node from the cluster state and publishes a new version 27 of the cluster state where node_tm1 is removed:

And the new master node cleans up the second snapshot as expected:
So far the test behaves as intended, since its primary purpose is to check that the snapshot is correctly terminated by the new master. Before the test ends, it stops the disruption and waits for the cluster to be stable again:
The old master node detects the timeout when trying to publish the initial cluster state 25 (the one where the second snapshot is STARTED in the cluster state):
[2018-01-10T05:23:40,407][WARN ][o.e.c.s.MasterService ] [node_tm1] failing [update_snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ]]: failed to commit cluster state version [25]
And the old master fails the snapshot locally, which is also expected.
And the old master rejoins the cluster...

... and this is where things get blurry for me.

It seems that node_tm1 cannot ping the other nodes, and the ESSelector loop is closed:

Then all connection attempts from node_tm1 fail, so the cluster cannot recover to 4 nodes and the test suite times out. I think that all the errors after that point are caused by the test framework trying to stop the nodes.
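That would also explain why the node never comes back: once a Selector has been closed, any further use of it fails. Again a plain java.nio illustration under that assumption, not the actual ESSelector code:

```java
import java.io.IOException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Once a Selector is closed it cannot be used again: registering a channel with
// it throws ClosedSelectorException, so every new connection that relies on that
// selector fails.
public class ClosedSelectorDemo {

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        selector.close();

        try (SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_CONNECT);
        } catch (ClosedSelectorException e) {
            System.out.println("cannot register new connections: " + e);
        }
    }
}
```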
But what I'm worried about is the StackOverflowError in the logs:
As well as the NPEs at the beginning of the tests:
@tbrooks8 @bleskes I'm having some trouble digging further into this failure, so I'd be happy to have your opinion / gut feeling on this. Even a quick look would be helpful at this stage.