[BUG] Member Peer URL not updated when scaled to multi-node, scale identification is also not correct #533

unmarshall · 2022-09-14T06:32:15Z

Describe the bug:
In a single-node etcd cluster today, the peer URL is not TLS enabled. When the etcd cluster is scaled from 1 to 3, the new members talk only over TLS, but since the first member cannot talk TLS, the other 2 members fail to join the cluster and their pods go into CrashLoopBackOff. To correct that we earlier merged #530, but that is not sufficient. To ensure that TLS is enabled for the peer URL, etcd-druid performs a restart (implemented as delete sts + create sts) 2 times. Since the etcd configuration has already changed, the restart makes the new configuration available to the etcd member, and in the new configuration the cluster size is already 3. In backuprestoreserver.go the call to UpdateMemberPeerURL is made only when the cluster size == 1, so after the restart it is skipped. As a result, the peer URL never gets TLS enabled.
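To illustrate why the update is skipped, here is a minimal sketch of the guard as described above. The type `memberConfig` and the function `shouldUpdatePeerURL` are hypothetical names made up for this illustration; they are not the actual code in backuprestoreserver.go.

```go
package main

import "fmt"

// memberConfig is a hypothetical view of the configuration an etcd member
// reads at startup. It is not a type from etcd-backup-restore.
type memberConfig struct {
	ClusterSize int
}

// shouldUpdatePeerURL models the guard described in the issue: the peer URL
// update is attempted only while the configured cluster size is 1.
func shouldUpdatePeerURL(cfg memberConfig) bool {
	return cfg.ClusterSize == 1
}

func main() {
	beforeScaleUp := memberConfig{ClusterSize: 1}
	afterRestart := memberConfig{ClusterSize: 3} // restart picks up the new config

	fmt.Println(shouldUpdatePeerURL(beforeScaleUp)) // true
	fmt.Println(shouldUpdatePeerURL(afterRestart))  // false: update never runs
}
```

By the time the restart happens, the member already sees a cluster size of 3, so a guard of this shape can never fire for the scale-up case.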

Another issue is that the way we identify if this is a scale up case is incorrect. We depend upon the following check:

	if *curSts.Spec.Replicas > 1 && *curSts.Spec.Replicas > curSts.Status.UpdatedReplicas {
		return pointer.StringPtr(ClusterStateExisting), nil
	}

This check is not deterministic, because UpdatedReplicas only means that the sts controller has updated that many replicas to the current revision. It does not mean they are ready. As soon as the replicas are changed from 1 to 3, UpdatedReplicas also changes to 3 very quickly, so the condition above evaluates to false. Whether the check passes therefore depends on timing. When it fails, the state returned is ClusterStateNew, which causes one of the members to start a new cluster. Its clusterID then does not match the existing clusterID, and it fails to join the cluster.
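The timing dependence can be reproduced in isolation. The sketch below copies the condition quoted above into a standalone function. The type `stsSnapshot` is a hypothetical stand-in for the StatefulSet fields the check reads, and the string values of the two constants are assumed to match etcd's `initial-cluster-state` values.

```go
package main

import "fmt"

// Constant names come from the issue; the string values are an assumption.
const (
	ClusterStateNew      = "new"
	ClusterStateExisting = "existing"
)

// stsSnapshot is a hypothetical stand-in for the StatefulSet fields
// (Spec.Replicas and Status.UpdatedReplicas) read by the check.
type stsSnapshot struct {
	SpecReplicas    int32
	UpdatedReplicas int32
}

// clusterState applies the same condition as the check quoted in the issue.
func clusterState(s stsSnapshot) string {
	if s.SpecReplicas > 1 && s.SpecReplicas > s.UpdatedReplicas {
		return ClusterStateExisting
	}
	return ClusterStateNew
}

func main() {
	// Same scale-up from 1 to 3, observed at two different moments.
	early := stsSnapshot{SpecReplicas: 3, UpdatedReplicas: 1}
	late := stsSnapshot{SpecReplicas: 3, UpdatedReplicas: 3}

	fmt.Println(clusterState(early)) // existing
	fmt.Println(clusterState(late))  // new
}
```

Both snapshots describe the same scale-up, yet the result flips once the sts controller has caught up.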

Expected behavior:

The peer URL should always be TLS enabled when the etcd cluster is scaled from 1 to 3. Also, identifying that this is a scale-up case rather than a bootstrap case should be based on time-insensitive conditions.
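One possible shape for a time-insensitive condition is sketched below. It assumes a persistent marker is recorded when replicas are raised from 1 and removed only after all members have joined. The annotation key, the `stsView` type, and the function `stateFromMarker` are all hypothetical; this is an illustration of the idea, not a description of the change made in #534.

```go
package main

import "fmt"

// scaleUpMarker is a hypothetical annotation key used only in this sketch.
const scaleUpMarker = "example.com/scaled-to-multi-node"

// stsView is a hypothetical stand-in for the StatefulSet fields of interest.
type stsView struct {
	SpecReplicas    int32
	UpdatedReplicas int32
	Annotations     map[string]string
}

// stateFromMarker decides the cluster state from the persisted marker and the
// desired replica count only. It deliberately ignores UpdatedReplicas.
func stateFromMarker(s stsView) string {
	if _, marked := s.Annotations[scaleUpMarker]; marked && s.SpecReplicas > 1 {
		return "existing"
	}
	return "new"
}

func main() {
	marked := map[string]string{scaleUpMarker: "true"}

	early := stsView{SpecReplicas: 3, UpdatedReplicas: 1, Annotations: marked}
	late := stsView{SpecReplicas: 3, UpdatedReplicas: 3, Annotations: marked}
	bootstrap := stsView{SpecReplicas: 3, UpdatedReplicas: 0}

	fmt.Println(stateFromMarker(early))     // existing
	fmt.Println(stateFromMarker(late))      // existing
	fmt.Println(stateFromMarker(bootstrap)) // new
}
```

Because the decision reads only state that was written explicitly, it gives the same answer no matter how quickly the sts controller updates its status.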

How To Reproduce (as minimally and precisely as possible):

Logs:

Screenshots (if applicable):

Environment (please complete the following information):

Etcd version/commit ID :
Etcd-backup-restore version/commit ID:
Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:


unmarshall · 2022-09-14T06:32:24Z

/assign

unmarshall added the kind/bug Bug label Sep 14, 2022

unmarshall self-assigned this Sep 14, 2022

unmarshall mentioned this issue Sep 14, 2022

Member Peer URL not updated when scaled to multi-node, scale identification is also not correct #534

Merged

unmarshall closed this as completed in #534 Sep 16, 2022

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 16, 2022
