The Story of my Morning's F*** Up: Where is the "letsencrypt-prod" ClusterIssuer? #908

Closed
maelvls opened this issue Sep 15, 2023 · 5 comments · Fixed by #919

Member

maelvls commented Sep 15, 2023

tl;dr: I mistakenly removed the ClusterIssuer letsencrypt-prod from the github-build-infra cluster that runs https://prow.build-infra.jetstack.net/. There is no outage yet, but the four certificates cannot be renewed until I re-apply the manifest.

This morning, I made a huge mistake! I was reviewing cert-manager/openshift-routes#32 so I installed openshift-routes as well as cert-manager 1.13.0... Except I didn't install it to my kind cluster... I installed it to the github-build-infra cluster where all the Prow services are running!

The commands I intended for my kind cluster were:

kubectl apply -f https://raw.githubusercontent.com/openshift/router/master/deploy/route_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/openshift/router/master/deploy/router_rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/openshift/router/master/deploy/router.yaml
kubectl set image -n openshift-ingress deploy/ingress-router router=quay.io/openshift/origin-haproxy-router:4.11
kubectl set env -n openshift-ingress deploy/ingress-router ROUTER_DOMAIN=domain.com
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.0/cert-manager.yaml

The last command reported that the cert-manager-webhook's selectors could not be
changed. So I figured I would delete the cert-manager deployment I had installed
a while back on my kind cluster (or so I thought).

kubectl delete -f https://github.com/jetstack/cert-manager/releases/download/v1.13.0/cert-manager.yaml

It was at this point that I realized that I had nuked the cert-manager
installation that was on github-build-infra.

I thus carefully backtracked by deleting what I had installed:

kubectl delete -f https://raw.githubusercontent.com/openshift/router/master/deploy/route_crd.yaml --wait=false
kubectl delete -f https://raw.githubusercontent.com/openshift/router/master/deploy/router_rbac.yaml --wait=false
kubectl delete -f https://raw.githubusercontent.com/openshift/router/master/deploy/router.yaml --wait=false

And then reinstalled cert-manager:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.0/cert-manager.yaml

Then I looked at the logs, and it seems like four certificates existed before my mistake:

  • prometheus-prod-thanos (created from the Ingress resource prometheus/prometheus-prod-thanos)
  • prometheus-prod-server (created from the Ingress resource prometheus/prometheus-prod-server)
  • prow-tls
  • triageparty-tls

Thanks to the fact that there are no owner references set on Secret resources
created by cert-manager, there was no outage at this point!

But all four Certificates are now targeting a missing ClusterIssuer:
letsencrypt-prod in the cert-manager namespace.
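
For orientation, here is a rough sketch of what an ACME ClusterIssuer with this name commonly looks like. The real manifest for this cluster is not shown in this issue, so everything except the name `letsencrypt-prod` is an assumption: the email, the account key Secret name, and the HTTP-01 solver with an `nginx` ingress class are placeholders, and `print_clusterissuer` is a hypothetical helper.

```shell
# Hypothetical helper: print a typical ACME ClusterIssuer manifest.
# Only the name "letsencrypt-prod" comes from this issue; the email,
# Secret name, and solver settings below are placeholders.
print_clusterissuer() {
  email="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
EOF
}

# Usage: review the output first, then pipe it to kubectl with an
# explicit context, for example:
#   print_clusterissuer someone@example.com | kubectl --context <context> apply -f -
```

Passing `--context` explicitly on the apply avoids relying on whatever context happens to be current.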

I haven't found the manifest for this ClusterIssuer. I ran:

cd prow/
gcloud container clusters get-credentials github-build-infra --project jetstack-build-infra --zone europe-west1-b
make deploy-prow

Unrelated: I also had to comment out the PersistentVolumeClaim in
cluster/tot_deployment.yaml because applying it failed:

The PersistentVolumeClaim "tot-storage" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
        AccessModes:      {"ReadWriteOnce"},
        Selector:         nil,
        Resources:        {Requests: {s"storage": {i: {...}, s: "1Gi", Format: "BinarySI"}}},
-       VolumeName:       "pvc-3d2fb6df-b8a6-11e7-8292-42010a8401a1",
+       VolumeName:       "",
-       StorageClassName: &"standard",
+       StorageClassName: nil,
        VolumeMode:       &"Filesystem",
        DataSource:       nil,
        DataSourceRef:    nil,
  }

Where is the ClusterIssuer manifest?

maelvls changed the title: Where is defined the "letsencrypt-prod" ClusterIssuer? → Where is defined the "letsencrypt-prod" ClusterIssuer? The story of this morning's f*** up (Sep 15, 2023)
Member Author

maelvls commented Sep 15, 2023

I need to know whether I can postpone this work. Let's check whether any of the certificates inside the Secret resources managed by cert-manager are about to expire:

$ kubectl get secrets -A -o json \
   | jq -r '.items[]
       | select(.metadata.annotations["certmanager.k8s.io/issuer-name"])
       | select(.data["tls.crt"])
       | "\(.metadata.name)\t\(.data["tls.crt"])"' \
   | while IFS=$'\t' read -r name crt; do
       printf '%s\t%s\n' "$name" "$(echo "$crt" | base64 -d | openssl x509 -noout -enddate 2>/dev/null)"
     done | column -t

This showed that the next X.509 certificate to expire does so in about one month (Oct 20, 2023).

cherrypick-tls               notAfter=May  18  14:57:11  2019  GMT
prow-tls                     notAfter=Nov  6   08:11:02  2023  GMT
prometheus-prod-server-tls   notAfter=Nov  19  08:08:50  2023  GMT
prometheus-prod-sidecar-tls  notAfter=Oct  20  15:19:50  2023  GMT

So no problem for now!
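
If a yes/no answer is enough, the same check can be sketched with `openssl x509 -checkend`, which takes a number of seconds and exits non-zero when the certificate expires within that window. The helper name `expires_within_days` is made up for this example, and it assumes a PEM certificate on stdin.

```shell
# Hypothetical helper: succeed (exit 0) if the PEM certificate on stdin
# expires within the given number of days.
expires_within_days() {
  days="$1"
  # -checkend exits 0 when the certificate is still valid after N seconds,
  # so the result is inverted. Unreadable input also counts as "expiring",
  # which errs on the side of caution.
  ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}

# Usage, with $crt holding the base64 value of tls.crt as in the loop above:
#   if echo "$crt" | base64 -d | expires_within_days 30; then
#     echo "$name expires within 30 days"
#   fi
```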

Member Author

maelvls commented Sep 15, 2023

Ohh I just found out about @ChaosInTheCRD's kube-lock project. That's exactly what would have stopped me from stupidly installing my stuff to the wrong cluster!
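
Separately from kube-lock (this is not how that project works, just a minimal sketch of the same idea), a small shell guard can refuse to continue unless the current kubectl context is the expected one. `expect_context` is a hypothetical helper, and `kind-kind` is the context name kind creates for its default cluster.

```shell
# Hypothetical guard: fail unless the current kubectl context matches
# the one passed as the first argument.
expect_context() {
  expected="$1"
  current="$(kubectl config current-context 2>/dev/null)"
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: current context is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Usage: chain it in front of anything destructive, for example:
#   expect_context kind-kind && kubectl delete -f cert-manager.yaml
```

Passing `--context kind-kind` to each kubectl command is another way to get a similar effect without a helper.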

maelvls changed the title: Where is defined the "letsencrypt-prod" ClusterIssuer? The story of this morning's f*** up → The Story of my Morning's F*** Up: Where is the "letsencrypt-prod" ClusterIssuer? (Sep 15, 2023)
@ChaosInTheCRD

> Ohh I just found out about @ChaosInTheCRD's kube-lock project. That's exactly what would have stopped me from stupidly installing my stuff to the wrong cluster!

Sorry to see that you've been through such a disaster, these things happen to the best of us unfortunately 😢

It's made my day that you found my project! It needs a bit of work, and I definitely should set aside some time for it. Nonetheless, thanks for giving it a shoutout 😄


a0s commented Oct 14, 2023

offtopic:
@maelvls I came across your message by chance, and I can offer another way to avoid mistakes when working with different environments. I'm using direnv + separate context files (KUBECONFIG). This way I explicitly go to the ./production or ./development folder, and direnv updates the KUBECONFIG variable to point at the correct context file.
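
A minimal sketch of that layout, assuming each folder holds its own kubeconfig file. The file name `kubeconfig.yaml` and the helper `setup_env_dir` are made up for the example.

```shell
# Hypothetical layout: one directory per environment, each with its own
# kubeconfig file and an .envrc that points KUBECONFIG at it.
setup_env_dir() {
  dir="$1"
  mkdir -p "$dir"
  # Single quotes keep $PWD literal here. direnv evaluates the .envrc from
  # the directory that contains it, so $PWD resolves to that directory.
  printf '%s\n' 'export KUBECONFIG="$PWD/kubeconfig.yaml"' > "$dir/.envrc"
}

# Usage: create both folders, copy the matching kubeconfig.yaml into each,
# then run `direnv allow` once inside each folder:
#   setup_env_dir production
#   setup_env_dir development
```

With this in place, kubectl only sees the production credentials after an explicit `cd ./production`.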

Member Author

maelvls commented Nov 4, 2023

I have removed the leftover cherrypick-tls since we removed https://cherrypick.build-infra.jetstack.net a while back.
