Describe the bug:
In a single-node etcd cluster today, the peer URL is not TLS-enabled. When the etcd cluster is scaled from 1 to 3, the new members talk only over TLS, but since the first member cannot talk TLS, the other 2 members fail to join the cluster and their pods go into CrashLoopBackOff. We earlier merged #530 to correct that, but it is not sufficient. To ensure that TLS is enabled for the peer URL, etcd-druid restarts the cluster (implemented as delete sts + create sts) 2 times. Since the etcd configuration has already changed by then, the restart makes the new configuration available to the etcd member, and in the new configuration the cluster size is already 3. However, in backuprestoreserver.go the call to UpdateMemberPeerURL is only made when the cluster size == 1, so the peer URL never gets TLS-enabled.
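The flawed guard described above can be sketched as follows. This is a minimal illustration only; the actual function name and surrounding logic in backuprestoreserver.go differ, and `updatePeerURLIfNeeded` is a hypothetical name:

```go
package main

import "fmt"

// updatePeerURLIfNeeded sketches the guard described in the bug report:
// the peer URL is only updated (TLS-enabled) when the configured cluster
// size is exactly 1. After the scale-up restart, the configuration already
// reports size 3, so the branch is never taken and the peer URL stays
// non-TLS. (Hypothetical sketch, not the actual backuprestoreserver.go code.)
func updatePeerURLIfNeeded(clusterSize int) bool {
	if clusterSize == 1 {
		// UpdateMemberPeerURL would be called here.
		return true
	}
	// Scale-up case: the update is silently skipped.
	return false
}

func main() {
	fmt.Println(updatePeerURLIfNeeded(1)) // before scale-up: update happens
	fmt.Println(updatePeerURLIfNeeded(3)) // after scale-up: update is skipped
}
```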
Another issue is that the way we identify whether this is a scale-up case is incorrect. We depend upon the following check:
This check is not deterministic: UpdatedReplicas only means that the sts controller has updated all replicas to the current revision; it does not mean they are ready. As soon as the replica count is changed from 1 to 3, UpdatedReplicas also jumps to 3 almost immediately, causing the check above to fail. This makes the check timing-dependent and thus non-deterministic. The state returned is ClusterStateNew, which causes one of the members to start a new cluster; once it does, its clusterID does not match the existing clusterID and it fails to join the cluster.
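The race can be sketched as below. This is a hypothetical reconstruction of the kind of check described, not the actual code; the `stsStatus` struct and `clusterState` function are illustrative names, and only the two StatefulSet status fields relevant here are modeled:

```go
package main

import "fmt"

type ClusterState string

const (
	ClusterStateNew      ClusterState = "new"
	ClusterStateExisting ClusterState = "existing"
)

// stsStatus models the relevant StatefulSet status fields.
type stsStatus struct {
	Replicas        int32 // desired replica count
	UpdatedReplicas int32 // replicas at the current revision (NOT necessarily ready)
}

// clusterState sketches the timing-dependent check: it treats
// UpdatedReplicas < Replicas as "scale-up in progress". But the sts
// controller bumps UpdatedReplicas as soon as pods exist at the new
// revision, long before they are ready, so shortly after scaling
// 1 -> 3 this returns ClusterStateNew even though the cluster exists.
func clusterState(s stsStatus) ClusterState {
	if s.UpdatedReplicas < s.Replicas {
		return ClusterStateExisting // detected as scale-up
	}
	return ClusterStateNew // misfires once UpdatedReplicas races to 3
}

func main() {
	// Moments after the scale-up, UpdatedReplicas is already 3:
	fmt.Println(clusterState(stsStatus{Replicas: 3, UpdatedReplicas: 3}))
}
```

A time-insensitive alternative would decide based on state that does not change while pods converge, rather than on a transient status counter.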
Expected behavior:
The peer URL should always be TLS-enabled when the etcd cluster is scaled from 1 to 3. Also, the identification that this is a scale-up case rather than a bootstrap case should be based on time-insensitive conditions.
How To Reproduce (as minimally and precisely as possible):
Logs:
Screenshots (if applicable):
Environment (please complete the following information):
Etcd version/commit ID :
Etcd-backup-restore version/commit ID:
Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:
Anything else we need to know?: