Add etcd-quorum-guard manifests and doc #613

RobertKrawitz · 2019-04-09T14:46:21Z

- What I did
Add etcd-quorum-guard manifests and documentation describing it.

- How to verify it
oc get pods -n kube-system | grep etcd-quorum-guard

- Description for the changelog
Add etcd-quorum-guard

openshift-ci-robot · 2019-04-09T14:46:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RobertKrawitz
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: cgwalters

If they are not already assigned, you can assign the PR to them by writing /assign @cgwalters in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

RobertKrawitz · 2019-04-09T14:46:42Z

/cc @derekwaynecarr @sjenning

rphillips · 2019-04-09T15:17:36Z

I suspect the pkg/operator/sync.go needs an update to include the deployment.

perhaps @kikisdeliveryservice or @runcom can confirm.

abhinavdahiya · 2019-04-09T16:12:54Z

manifests/etcdquorumguard/deployment.yaml

+  name: etcd-quorum-guard
+  namespace: kube-system
+spec:
+  replicas: 3


I think it should be possible to teach the MCO to scale up / decide the replica count based on the number of master nodes.

Agreed beyond 4.1; for 4.1, it has been decided to only support 3 masters.
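
As a rough illustration of the suggestion above (not code from this PR), deriving the replica count from the master nodes could look something like the sketch below. The `node` type and `quorumGuardReplicas` function are hypothetical stand-ins; a real implementation would list Node objects through the Kubernetes client. The only assumption about the cluster is that masters carry the `node-role.kubernetes.io/master` label.

```go
package main

import "fmt"

// masterRoleLabel is the label assumed to mark control-plane nodes.
const masterRoleLabel = "node-role.kubernetes.io/master"

// node is a hypothetical, trimmed-down stand-in for a Kubernetes Node;
// only the labels matter for this sketch.
type node struct {
	name   string
	labels map[string]string
}

// quorumGuardReplicas returns one replica per node carrying the master label.
func quorumGuardReplicas(nodes []node) int32 {
	var count int32
	for _, n := range nodes {
		if _, ok := n.labels[masterRoleLabel]; ok {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{name: "master-0", labels: map[string]string{masterRoleLabel: ""}},
		{name: "master-1", labels: map[string]string{masterRoleLabel: ""}},
		{name: "master-2", labels: map[string]string{masterRoleLabel: ""}},
		{name: "worker-0", labels: map[string]string{"node-role.kubernetes.io/worker": ""}},
	}
	fmt.Println("replicas:", quorumGuardReplicas(nodes))
}
```

For 4.1 this would always come out to 3, which is why the hard-coded value in the manifest is equivalent for now.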

abhinavdahiya · 2019-04-09T16:13:37Z

manifests/etcdquorumguard/deployment.yaml

+        effect: NoExecute
+        operator: Exists
+      containers:
+      - image: registry.svc.ci.openshift.org/openshift/origin-v4.0:base


This must be plumbed through the release image, e.g.:

"{{.Images.etcdQuorumGuardImage}}"

abhinavdahiya · 2019-04-09T16:14:46Z

manifests/etcdquorumguard/deployment.yaml

+                declare -r key="${cert%.crt}.key"
+                declare -r cacert="$croot/ca.crt"
+                [[ -z $cert || -z $key ]] && exit 1
+                curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'


/cc @hexfusion
please use the metrics client certs that were created to connect to etcd

Where are those certs located?

You could get crt/key with something like

oc -n openshift-config get secrets etcd-metric-client -o yaml

and the CA with

oc get configmap -n openshift-config etcd-metric-serving-ca -o yaml

You can use the etcd proxy for /health with these certs (port 9979 rather than 2379).

I can do that inside the pod?

I agree with the rationale; my question is how to get the appropriate cert.

working on this now

Note that the etcd-quorum-guard proper does not have any Go code in it; it's simply (right now) a static deployment and disruption budget, with the lone pod being a trivial script.

With #623 you should be able to mount the resources and then consume them in your bash script as local files. Something like:

    volumeMounts:
    - mountPath: "/etc/ssl/certs/etcd"
      name: etcd-metric-client
      readOnly: true
    volumes:
    - name: etcd-metric-client
      secret:
        secretName: etcd-metric-client
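
The guard in this PR is a bash script built around curl and grep. Purely as an illustration of the same check, here is a minimal Go sketch that reads the mounted client cert and key as local files and inspects the /health response. The function names, the CA file parameter, and the file names `tls.crt` and `tls.key` under the mount path are assumptions for the example, not something this PR defines. Note that `isHealthy` parses JSON rather than pattern-matching as the grep does, so it is slightly more lenient about whitespace and extra fields.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// isHealthy reports whether body is a JSON object whose "health" field is
// the string "true", the shape the grep in the deployment script looks for.
func isHealthy(body []byte) bool {
	var resp struct {
		Health string `json:"health"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	return resp.Health == "true"
}

// newClient builds an HTTPS client from PEM files mounted into the pod.
// certDir is assumed to contain tls.crt and tls.key; caFile is a PEM CA bundle.
func newClient(certDir, caFile string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(
		filepath.Join(certDir, "tls.crt"),
		filepath.Join(certDir, "tls.key"),
	)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}
	return &http.Client{
		Timeout: 2 * time.Second, // mirrors curl --max-time 2
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      pool,
			},
		},
	}, nil
}

// probe fetches url and reports whether the response body says healthy.
func probe(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return isHealthy(body), nil
}

func main() {
	// Offline demonstration of the body check only; no network access needed.
	for _, sample := range []string{`{"health":"true"}`, `{"health":"false"}`} {
		fmt.Printf("%s -> %v\n", sample, isHealthy([]byte(sample)))
	}
}
```

Inside the pod, one would call `newClient("/etc/ssl/certs/etcd", caFile)` with wherever the CA ConfigMap ends up mounted, then `probe` against the /health endpoint on port 9979.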

runcom · 2019-04-09T16:26:25Z

> I suspect the pkg/operator/sync.go needs an update to include the deployment.
>
> perhaps @kikisdeliveryservice or @runcom can confirm.

yeah, if we're now watching this manifest, we need to sync it up as well

runcom · 2019-04-09T16:27:32Z

test/e2e/etcdquorumguard_test.go

+	kclient, err := k8sclient.NewForConfig(config)
+	if err != nil {
+		return nil, fmt.Errorf("Error creating client: %s\n", err.Error())
+	}


we do have this initialization in e2e, you can reuse

kikisdeliveryservice · 2019-04-09T23:54:17Z

@RobertKrawitz I added some comments. Also, could you ensure that your final commits have a brief one-sentence "why" summary in the body?

Thank you for adding the doc!!

RobertKrawitz · 2019-04-10T14:24:23Z

> @RobertKrawitz added some comments, also could you ensure that your final commits have a brief one-sentence "why" summary in the body.
>
> Thank you for adding the doc!!

Yup.

kikisdeliveryservice · 2019-04-10T19:31:47Z

After talking to @hexfusion: #623 needs to merge before this one, so I'm adding a hold label to make sure they go in in the right order.

/hold

RobertKrawitz · 2019-04-11T15:52:56Z

xref openshift/installer#1597

runcom · 2019-04-12T08:54:19Z

pkg/operator/sync.go

@@ -171,6 +172,28 @@ func (optr *Operator) syncMachineConfigController(config renderConfig) error {
 	return nil
 }

+func (optr *Operator) syncEtcdQuorumGuard(config renderConfig) error {
+	eqgBytes, err := renderAsset(config, "manifests/etcdquorumguard/deployment.yaml")


should this sync also wait for the deployment to correctly roll out? (see waitForDeploymentRollout)

runcom · 2019-04-12T09:21:02Z

Ok, this has the usual chicken and egg issue when adding something to bootkube (and a new image).

Abhinav summarized what (generally) needs to happen when adding a new image here: #538 (comment). You can follow that (it should be pretty straightforward, and you already have the installer PR up).

Other than that, looking at a past PR, e.g. the one adding infraImage, the flow has been like this:

- create the main PR: "Pick image from release payload for crio.conf pause_image" #471
- create the installer PR: "Add infra-image to MCO bootstrap" installer#1292
- in order to merge the installer PR above, we needed this (it's a group of PRs because we missed some changes):
- You can see the above as just one change (even though it was 3 PRs for us that time)
- The above (3 PRs) unblocked the merge of the installer PR, so we merged that
- The installer PR unblocked the merge of the main PR: "Pick image from release payload for crio.conf pause_image" #471
- Then we just needed to clean up with "bootstrap: Final switch to CVO pod image" #518

I hope the above is clear enough (let me know otherwise)

RobertKrawitz · 2019-04-12T13:15:39Z

@runcom so to be clear, I need a mini PR against the MCO, the installer PR, @hexfusion's PR so I can get the correct cert, this PR, and finally an additional installer PR for cleanup (five PRs all told)?

hexfusion · 2019-04-22T16:22:29Z

manifests/etcdquorumguard/deployment.yaml

+          declare -r cert="$croot/tls.crt"
+          declare -r key="$croot/tls.key"
+          declare -r cacert="/var/run/secrets/kubernetes.io/serviceaccount/etcd-metric-serving-ca.crt"
+          ls -lR "$croot"


etcd-metric-serving-ca is a ConfigMap, not a Secret:

https://github.com/openshift/machine-config-operator/blob/f4d79247db439ae08e2f5e17cf0347c94998e94d/pkg/operator/sync.go#L375-L378

RobertKrawitz · 2019-04-22T18:16:37Z

/retest

cgwalters · 2019-04-23T09:50:14Z

> The reason it's in kube-system currently is that it needs to mount the host filesystem.

We also have the MCD which is privileged and mounts the host in the openshift-machine-config namespace.

> That may not be needed when #623 goes in, but it does need to be able to hit the network etcd listens on,

etcd is just hostNetwork: true right? I don't think the kube namespace affects hostnetwork pods.

RobertKrawitz · 2019-04-23T21:44:02Z

/retest

RobertKrawitz · 2019-04-24T00:05:26Z

/retest

RobertKrawitz · 2019-04-24T02:18:54Z

/retest

runcom · 2019-04-24T07:19:32Z

This times out in e2e-aws-op, which means that (for now) you just need to raise the test timeout to, let's say, 70 minutes. You can do that here: https://github.com/openshift/machine-config-operator/blob/master/Makefile#L101

RobertKrawitz · 2019-04-24T12:21:45Z

First experiment will be to see whether it behaves differently done the original way.

RobertKrawitz · 2019-04-24T14:44:41Z

/retest

RobertKrawitz · 2019-04-24T19:12:40Z

Switching back to using the etcd-quorum-guard standalone, so this is now moot.

runcom · 2019-04-24T20:25:50Z

Wut

openshift-ci-robot added the size/L label Apr 9, 2019

openshift-ci-robot requested review from derekwaynecarr, sjenning, cgwalters and LorbusChris April 9, 2019 14:46

RobertKrawitz mentioned this pull request Apr 9, 2019

Add e2e test case for etcd-quorum-guard #614

Closed

RobertKrawitz force-pushed the etcd-quorum-guard branch from 6d09ab4 to 20df4fc Compare April 9, 2019 15:17

openshift-ci-robot added the size/XL label and removed the size/L label Apr 9, 2019

abhinavdahiya reviewed Apr 9, 2019

openshift-ci-robot requested a review from hexfusion April 9, 2019 16:14

runcom reviewed Apr 9, 2019

kikisdeliveryservice previously requested changes Apr 9, 2019

hexfusion mentioned this pull request Apr 10, 2019

*: add resource sync for etcd-quorum-guard #623

Closed

openshift-ci-robot added the do-not-merge/hold label Apr 10, 2019

RobertKrawitz force-pushed the etcd-quorum-guard branch from 8beb6b9 to a331155 Compare April 11, 2019 15:02

RobertKrawitz mentioned this pull request Apr 11, 2019

Add etcd-quorum-guard openshift/installer#1597

Closed

runcom reviewed Apr 12, 2019

hexfusion reviewed Apr 22, 2019

RobertKrawitz force-pushed the etcd-quorum-guard branch 2 times, most recently from ccf5879 to d24d38a Compare April 22, 2019 17:18

RobertKrawitz force-pushed the etcd-quorum-guard branch 3 times, most recently from dafd571 to 5f2cb56 Compare April 22, 2019 22:58

RobertKrawitz force-pushed the etcd-quorum-guard branch 3 times, most recently from 7df7f01 to 9ff5246 Compare April 23, 2019 14:19

Apply openshift#623

ce3acd6

RobertKrawitz force-pushed the etcd-quorum-guard branch from 9ff5246 to ce3acd6 Compare April 23, 2019 15:28

RobertKrawitz added 2 commits April 23, 2019 14:06

Turn sync off again to see what the quorum guard does.

da7a6cf

Fix the rest of the namespace names in eqg test

b594807

RobertKrawitz force-pushed the etcd-quorum-guard branch 2 times, most recently from 5bf8bc6 to 8598f3d Compare April 24, 2019 14:25

Try use kube-system namespace and full powered cert

901e42a

RobertKrawitz force-pushed the etcd-quorum-guard branch from 8598f3d to 901e42a Compare April 24, 2019 14:46

Try using jedi cert in openshift-machine-config-operator namespace

0882043

RobertKrawitz closed this Apr 24, 2019

hexfusion mentioned this pull request Apr 25, 2019

templates/master/00-master: add 127.0.0.1 to etcd metric proxy SAN. #664

Closed
