[Test] TestDowngradeUpgradeClusterOf3 timeout #14540
Comments
Trying to reproduce
That's actually an interesting case; check out the logs:
The raft request to downgrade was successfully propagated to all members.
Then member and leader:
The timeout is then caused by the leader being stuck checking whether downgrades are valid, while the other members never reach "The server is ready to downgrade", which is what the test expects when it times out.
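For context, here is a minimal sketch of the kind of wait the test performs: poll every member's captured log for the readiness line and fail on timeout. The helper names (`memberLogs`, `waitForDowngradeReady`) are illustrative assumptions, not etcd's actual e2e test API.

```go
// Hypothetical sketch of the wait the test performs: poll each member's
// captured log for the readiness line and fail if it never appears.
// Helper names are illustrative, not etcd's e2e test API.
package e2esketch

import (
	"fmt"
	"strings"
	"time"
)

const readyLine = "The server is ready to downgrade"

// waitForDowngradeReady polls each member's log output until every member has
// printed readyLine, or returns an error once the timeout is exceeded.
func waitForDowngradeReady(memberLogs []func() string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for i, logOf := range memberLogs {
		for !strings.Contains(logOf(), readyLine) {
			if time.Now().After(deadline) {
				return fmt.Errorf("member %d: timed out waiting for %q", i, readyLine)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}
	return nil
}
```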
The downgrade process is meant to allow temporary lowering of the cluster version.
This shows that the cluster version was already downgraded. Looks like a bug in etcd/server/etcdserver/version/monitor.go, lines 64 to 92 at b8be237.
Still, it's strange that the downgrade test doesn't proceed, given that the cluster version was lowered as expected.
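One possible explanation: lowering the cluster version and lowering the storage schema version are separate steps, and the readiness message depends on the latter. Below is a simplified, hypothetical illustration of that kind of check; it is not the actual code from monitor.go, and the names are invented for clarity.

```go
// Simplified, hypothetical illustration of the storage-version check discussed
// above. This is NOT the code from etcd/server/etcdserver/version/monitor.go.
package versionsketch

import "github.com/coreos/go-semver/semver"

// readyToDowngrade reports whether a member could announce
// "The server is ready to downgrade": the detected storage schema version must
// already have been brought down to (or below) the downgrade target.
func readyToDowngrade(storageVersion, target *semver.Version) bool {
	if storageVersion == nil {
		// Schema version not detectable yet (e.g. confstate not committed),
		// so nothing can be decided in this monitor round.
		return false
	}
	// Ready only once storageVersion <= target.
	return !target.LessThan(*storageVersion)
}
```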
Even in the happy-path case where the test passes, it complains about that version difference for a short time before shutting down. I assume it's mostly cosmetic as of now. Should the
Just one member (the leader) printed the required ready message; the other members seem stuck (?). These look like totally separate code paths to me right now, so very strange indeed.
Added #14587, but I'm highly unsure this is the right way to fix it; this smells more like a race condition lingering somewhere.
In the TestDowngradeUpgradeCluster case, the brand-new cluster uses simple-config-changer, which means that entries have been committed before leader election and are applied once etcdserver starts receiving apply requests. The simple-config-changer marks the `confState` dirty, and the storage backend's precommit hook then updates the `confState`.

For the new cluster, the storage version is nil at the beginning. It becomes v3.5 once the `confState` record has been committed, and >v3.5 once the `storageVersion` record has been committed. When the new cluster is ready, the leader sets the initial cluster version to v3.6.x, which triggers `monitorStorageVersion` to update the `storageVersion` to v3.6.x.

If the `confState` record has been updated before the cluster version update, we get the `storageVersion` record. But if the storage backend doesn't commit in time, `monitorStorageVersion` won't update the version because of `cannot detect storage schema version: missing confstate information`. If we then file the downgrade request before the next round of `monitorStorageVersion` (every 4 seconds), the cluster version will be v3.5.0, which is equal to the result of `UnsafeDetectSchemaVersion`, and we never see `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit the process to two CPUs. So we should wait for the new cluster's storage to be ready before issuing the downgrade request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <[email protected]>
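Consistent with that explanation, the fix is to wait until the storage schema version is detectable before filing the downgrade request. A minimal sketch of such a wait is below; `getStorageVersion` is a hypothetical stand-in for however the test reads the detected schema version, not an etcd API.

```go
// Hypothetical sketch of "wait for the storage version before filing the
// downgrade request". getStorageVersion stands in for however the test reads
// the detected schema version; it is not an etcd API. The point is simply to
// poll until a version is detectable instead of racing monitorStorageVersion's
// ~4-second loop.
package downgradesketch

import (
	"fmt"
	"time"

	"github.com/coreos/go-semver/semver"
)

// waitForStorageVersion polls until the schema version becomes detectable or
// the timeout expires.
func waitForStorageVersion(getStorageVersion func() *semver.Version, timeout time.Duration) (*semver.Version, error) {
	deadline := time.Now().Add(timeout)
	for {
		if v := getStorageVersion(); v != nil {
			return v, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("storage schema version still not detectable after %v", timeout)
		}
		// monitorStorageVersion runs roughly every 4 seconds, so poll faster than that.
		time.Sleep(500 * time.Millisecond)
	}
}
```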
Which github workflows are flaking?
test (linux-amd64-e2e)
Which tests are flaking?
TestDowngradeUpgradeClusterOf3
Github Action link
https://github.com/etcd-io/etcd/actions/runs/3156389973/jobs/5136053128
Reason for failure (if possible)
e2e test cluster logging
Anything else we need to know?
No response