[BUG] Member Peer URL not updated when scaled to multi-node, scale identification is also not correct #533

Closed
unmarshall opened this issue Sep 14, 2022 · 1 comment · Fixed by #534
Assignees
Labels
kind/bug Bug status/closed Issue is closed (either delivered or triaged)

Comments

@unmarshall
Contributor

Describe the bug:
In a single-node etcd cluster today, the peer URL is not TLS enabled. When the etcd cluster is scaled from 1 to 3, the new members talk only over TLS, but since the first member cannot talk TLS, the other 2 members fail to join the cluster and their pods go into CrashLoopBackOff. To correct that we earlier merged #530, but that is not sufficient. To ensure that TLS is enabled for the peer URL, etcd-druid performs a restart (implemented as delete sts + create sts) 2 times. Since the etcd configuration has already changed, the restart makes the new configuration available to the etcd member, and in the new configuration the cluster size is already 3. In backuprestoreserver.go, however, the call to UpdateMemberPeerURL is only made when the cluster size == 1. As a result, the peer URL never gets TLS enabled.
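
To make the gating problem concrete, here is a minimal sketch. It is not the code in backuprestoreserver.go. The function names, the `peerTLSConfigured` flag, and the example peer URL are all hypothetical. The first function mirrors the condition described above, and the second shows one possible alternative that looks at the member's current peer URL instead of the configured cluster size.

```go
package main

import (
	"fmt"
	"net/url"
)

// updateGatedOnClusterSize mirrors the condition described in this issue:
// the peer URL update is attempted only when the configured size is 1.
// (Hypothetical helper, not the actual etcd-backup-restore code.)
func updateGatedOnClusterSize(configuredSize int) bool {
	return configuredSize == 1
}

// updateGatedOnPeerURL is a hypothetical alternative: attempt the update
// whenever peer TLS is configured but the advertised peer URL is not https,
// regardless of the configured cluster size.
func updateGatedOnPeerURL(peerURL string, peerTLSConfigured bool) (bool, error) {
	u, err := url.Parse(peerURL)
	if err != nil {
		return false, fmt.Errorf("parse peer URL %q: %w", peerURL, err)
	}
	return peerTLSConfigured && u.Scheme != "https", nil
}

func main() {
	// Assumed example URL of the first member before TLS is enabled.
	const peerURL = "http://etcd-main-0.etcd-main-peer:2380"

	// After the restart the configuration already says size 3,
	// so the size-based gate skips the update.
	fmt.Println("size gate attempts update:", updateGatedOnClusterSize(3))

	needsUpdate, err := updateGatedOnPeerURL(peerURL, true)
	fmt.Println("URL gate attempts update:", needsUpdate, "err:", err)
}
```

With the size-based gate, a member that restarts with the 3-node configuration never reaches the update path, which matches the behavior reported here.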

Another issue is that the way we identify if this is a scale up case is incorrect. We depend upon the following check:

	if *curSts.Spec.Replicas > 1 && *curSts.Spec.Replicas > curSts.Status.UpdatedReplicas {
		return pointer.StringPtr(ClusterStateExisting), nil
	}

This check is not deterministic. UpdatedReplicas only counts the replicas that the sts controller has already updated to the current revision; it does not mean they are ready. As soon as the replicas are changed from 1 to 3, UpdatedReplicas also changes very quickly to 3, so the above check fails. This makes the check timing dependent and thus non-deterministic. When the check fails, the state returned is ClusterStateNew, which causes one of the members to start a new cluster. Once it does that, its clusterID does not match the existing clusterID and it fails to join the cluster.
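
The timing dependence can be illustrated with a small self-contained sketch. It models only the two fields used by the check. The `stsSnapshot` type, the function name, and the string values of the two states are assumptions made for illustration; the comparison itself is copied from the snippet above.

```go
package main

import "fmt"

// String values are assumed for illustration.
const (
	clusterStateNew      = "new"
	clusterStateExisting = "existing"
)

// stsSnapshot is a hypothetical stand-in for the two StatefulSet fields
// that the check reads: Spec.Replicas and Status.UpdatedReplicas.
type stsSnapshot struct {
	specReplicas    int32
	updatedReplicas int32
}

// clusterStateFromStatus applies the same comparison as the check in this issue.
func clusterStateFromStatus(s stsSnapshot) string {
	if s.specReplicas > 1 && s.specReplicas > s.updatedReplicas {
		return clusterStateExisting
	}
	return clusterStateNew
}

func main() {
	// The same 1 -> 3 scale-up, observed at two different moments.
	early := stsSnapshot{specReplicas: 3, updatedReplicas: 1}
	late := stsSnapshot{specReplicas: 3, updatedReplicas: 3}

	fmt.Println("observed early:", clusterStateFromStatus(early))
	fmt.Println("observed late: ", clusterStateFromStatus(late))
}
```

The two snapshots describe the same scale-up, yet they produce different cluster states depending only on when the status was read.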

Expected behavior:

The peer URL should always be TLS enabled when the etcd cluster is scaled from 1 to 3. Also, identifying that this is a scale-up case rather than a bootstrap case should be based on time-insensitive conditions.
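
One way to picture a time-insensitive condition is to record the scale-up intent once, at the moment the replica count changes, and to read that record later instead of inferring the case from status fields. The sketch below is only an illustration under that assumption and is not necessarily how the fix is implemented. The annotation key, the function names, and the string state values are hypothetical.

```go
package main

import "fmt"

// scaleUpMarker is a hypothetical annotation key used only in this sketch.
const scaleUpMarker = "example.com/scaled-to-multi-node"

// markScaleUp records the scale-up intent when replicas go from 1 to more than 1.
// The annotations map must be non-nil.
func markScaleUp(annotations map[string]string, oldReplicas, newReplicas int32) {
	if oldReplicas == 1 && newReplicas > 1 {
		annotations[scaleUpMarker] = "true"
	}
}

// clusterStateFromMarker decides between "existing" and "new" (assumed values)
// using only the persisted marker, so the answer does not depend on when
// status fields such as UpdatedReplicas are observed.
func clusterStateFromMarker(annotations map[string]string) string {
	if annotations[scaleUpMarker] == "true" {
		return "existing"
	}
	return "new"
}

func main() {
	scaled := make(map[string]string)
	markScaleUp(scaled, 1, 3)
	fmt.Println("scale-up 1 -> 3:", clusterStateFromMarker(scaled))

	bootstrapped := make(map[string]string)
	markScaleUp(bootstrapped, 0, 3)
	fmt.Println("bootstrap with 3:", clusterStateFromMarker(bootstrapped))
}
```

Because the marker is written once and persisted, reading it early or late in the rollout gives the same answer. A real implementation would also need to clear the marker once the scale-up has completed.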

How To Reproduce (as minimally and precisely as possible):

Logs:

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

@unmarshall unmarshall added the kind/bug Bug label Sep 14, 2022
@unmarshall
Contributor Author

/assign
