Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add etcd-quorum-guard manifests and doc #613

Closed
wants to merge 28 commits into from

Conversation

RobertKrawitz
Copy link
Contributor

- What I did
Add etcd-quorum-guard manifests and documentation describing it.

- How to verify it
oc get pods -n kube-system | grep etcd-quorum-guard

- Description for the changelog
Add etcd-quorum-guard

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RobertKrawitz
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: cgwalters

If they are not already assigned, you can assign the PR to them by writing /assign @cgwalters in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 9, 2019
@RobertKrawitz
Copy link
Contributor Author

/cc @derekwaynecarr @sjenning

@rphillips
Copy link
Contributor

I suspect the pkg/operator/sync.go needs an update to include the deployment.

perhaps @kikisdeliveryservice or @runcom can confirm.

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 9, 2019
name: etcd-quorum-guard
namespace: kube-system
spec:
replicas: 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be possible to teach mco to scale up / decide the replica count based on number of master node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed beyond 4.1; for 4.1, it has been decided to only support 3 masters.

effect: NoExecute
operator: Exists
containers:
- image: registry.svc.ci.openshift.org/openshift/origin-v4.0:base
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be plumbed through release image.

Copy link
Member

@runcom runcom Apr 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"{{.Images.etcdQuorumGuardImage}}"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

declare -r key="${cert%.crt}.key"
declare -r cacert="$croot/ca.crt"
[[ -z $cert || -z $key ]] && exit 1
curl --max-time 2 --silent --cert "${cert//:/\:}" --key "$key" --cacert "$cacert" "$health_endpoint" |grep '{ *"health" *: *"true" *}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @hexfusion
please use the metrics client certs that were created to connect to etcd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are those certs located?

Copy link
Contributor

@hexfusion hexfusion Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could get crt/key with something like

oc -n openshift-config get secrets etcd-metric-client -o yaml 

ca

oc get configmap -n openshift-config etcd-metric-serving-ca -o yaml

Copy link
Contributor

@hexfusion hexfusion Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use etcd proxy for /health with these certs. port 9979 vs 2379

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that inside the pod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the rationale; my question is how to get the appropriate cert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on this now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the etcd-quorum-guard proper does not have any Go code in it; it's simply (right now) a static deployment and disruption budget, with the lone pod being a trivial script.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with #623 you should be able to mount the resources and then consume in your bash as local files. Something like.

        volumeMounts:
          - mountPath: "/etc/ssl/certs/etcd"
            name: etcd-metric-client
            readOnly: true
      volumes:
        - name: etcd-metric-client
          secret:
            secretName: etcd-metric-client

@runcom
Copy link
Member

runcom commented Apr 9, 2019

I suspect the pkg/operator/sync.go needs an update to include the deployment.

perhaps @kikisdeliveryservice or @runcom can confirm.

yeah, if we're now watching this manifest, we need to sync it up as well

kclient, err := k8sclient.NewForConfig(config)
if err != nil {
return nil, fmt.Errorf("Error creating client: %s\n", err.Error())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do have this initialization in e2e, you can reuse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

test/e2e/etcdquorumguard_test.go Show resolved Hide resolved
docs/etcd-quorum-guard.md Outdated Show resolved Hide resolved
docs/etcd-quorum-guard.md Outdated Show resolved Hide resolved
docs/etcd-quorum-guard.md Outdated Show resolved Hide resolved
docs/etcd-quorum-guard.md Outdated Show resolved Hide resolved
test/e2e/etcdquorumguard_test.go Outdated Show resolved Hide resolved
test/e2e/etcdquorumguard_test.go Outdated Show resolved Hide resolved
test/e2e/etcdquorumguard_test.go Show resolved Hide resolved
test/e2e/etcdquorumguard_test.go Show resolved Hide resolved
test/e2e/etcdquorumguard_test.go Show resolved Hide resolved
@kikisdeliveryservice
Copy link
Contributor

@RobertKrawitz added some comments, also could you ensure that your final commits have a brief i sentence "why" summary in the body.

Thank you for adding the doc!!

@RobertKrawitz
Copy link
Contributor Author

@RobertKrawitz added some comments, also could you ensure that your final commits have a brief i sentence "why" summary in the body.

Thank you for adding the doc!!

Yup.

@kikisdeliveryservice
Copy link
Contributor

talking to @hexfusion , #623 needs to merge before this one, so adding a hold label to make sure they go in correctly.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2019
@RobertKrawitz
Copy link
Contributor Author

xref openshift/installer#1597

@@ -171,6 +172,28 @@ func (optr *Operator) syncMachineConfigController(config renderConfig) error {
return nil
}

func (optr *Operator) syncEtcdQuorumGuard(config renderConfig) error {
eqgBytes, err := renderAsset(config, "manifests/etcdquorumguard/deployment.yaml")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this sync also wait for the deployment to correctly roll out? (see waitForDeploymentRollout)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done.

@runcom
Copy link
Member

runcom commented Apr 12, 2019

Ok, this has the usual chicken and egg issue when adding something to bootkube (and a new image).

Abhinav summarized what needs to happen (generally) when adding a new image here: #538 (comment) You can follow that (and should be pretty straightforward, you also already have the installer PR up)

Other than that, by looking at past PR, e.g. adding infraImage, the flow has been like this:

I hope the above is clear enough (let me know otherwise)

@RobertKrawitz
Copy link
Contributor Author

@runcom so to be clear, I need a mini PR against the MCO, the installer PR, @hexfusion's PR so I can get the correct cert, this PR, and finally an aditional installer PR for cleanup (five PRs all told)?

declare -r cert="$croot/tls.crt"
declare -r key="$croot/tls.key"
declare -r cacert="/var/run/secrets/kubernetes.io/serviceaccount/etcd-metric-serving-ca.crt"
ls -lR "$croot"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobertKrawitz RobertKrawitz force-pushed the etcd-quorum-guard branch 2 times, most recently from ccf5879 to d24d38a Compare April 22, 2019 17:18
@RobertKrawitz
Copy link
Contributor Author

/retest

@RobertKrawitz RobertKrawitz force-pushed the etcd-quorum-guard branch 3 times, most recently from dafd571 to 5f2cb56 Compare April 22, 2019 22:58
@cgwalters
Copy link
Member

The reason it's in kube-system currently is that it needs to mount the host filesystem.

We also have the MCD which is privileged and mounts the host in the openshift-machine-config namespace.

That may not be needed when #623 goes in, but it does need to be able to hit the network etcd listens on,

etcd is just hostNetwork: true right? I don't think the kube namespace affects hostnetwork pods.

@RobertKrawitz RobertKrawitz force-pushed the etcd-quorum-guard branch 3 times, most recently from 7df7f01 to 9ff5246 Compare April 23, 2019 14:19
@RobertKrawitz
Copy link
Contributor Author

/retest

2 similar comments
@RobertKrawitz
Copy link
Contributor Author

/retest

@RobertKrawitz
Copy link
Contributor Author

/retest

@runcom
Copy link
Member

runcom commented Apr 24, 2019

This times out in e2e-aws-op meaning that you (for now) just need to raise the test timeout to 70minutes let's say, you can do that here https://github.com/openshift/machine-config-operator/blob/master/Makefile#L101

@RobertKrawitz
Copy link
Contributor Author

First experiment will be to see whether it behaves differently done the original way.

@RobertKrawitz RobertKrawitz force-pushed the etcd-quorum-guard branch 2 times, most recently from 5bf8bc6 to 8598f3d Compare April 24, 2019 14:25
@RobertKrawitz
Copy link
Contributor Author

/retest

@RobertKrawitz
Copy link
Contributor Author

Switching back to using the etcd-quorum-guard standalone, so this is now moot.

@runcom
Copy link
Member

runcom commented Apr 24, 2019

Wut

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants