Config Tries to Be Loaded Before Secrets Have Been Injected Into Pod #9593
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This is not showing a bug. /remove-kind bug The error message is stating this
and I suspect that is related to that pem file. And that pem file could be related to the auth-tls-secret annotation. You could create another app and ingress with a vanilla nginx:alpine image and see if a simple ingress with no extra annotations works. If the simple ingress works, you can then add that annotation and see whether the previously working ingress fails after adding it.
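A minimal sketch of that test (placeholder names demo and demo.example.com; assumes the controller is installed with ingress class nginx):

# Hypothetical sanity check: a plain app and ingress with no extra annotations
kubectl create deployment demo --image=nginx:alpine
kubectl expose deployment demo --port=80
kubectl create ingress demo --class=nginx --rule="demo.example.com/*=demo:80"

# If that works, add the annotation under suspicion and re-test
kubectl annotate ingress demo nginx.ingress.kubernetes.io/auth-tls-secret=<namespace>/<secret-name>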
this file /etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem
@longwuyuan Is this not showing a bug? The file referenced in the error does end up on disk, and the referenced Kubernetes secret exists and is valid. This leads me to think ingress-nginx is attempting to validate/load the nginx config, which references that PEM on disk, before ingress-nginx has actually read the secret and written it to its local filesystem. What are your thoughts?
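For reference, one way to check this from outside the pod (controller pod name and namespace are placeholders):

# Hypothetical check: is the PEM the config references actually present in the pod?
kubectl -n <controller-namespace> exec <controller-pod> -- ls -l /etc/ingress-controller/ssl/

# And did nginx log emergency-level errors about it during startup?
kubectl -n <controller-namespace> logs <controller-pod> | grep -i emerg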
hi @Evesy,
With the Cloudflare CA being involved in your post, I think there is a lot to be considered, so even the smallest details of the problem will help a lot. Cloudflare CA and fullchain handling for auth is a specialist's area.
This is stale, but we won't close it automatically. Just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out.
Hello 👋 By chance, do you have any other findings around this @Evesy ? I believe I'm experiencing a similar situation, but with the CA CRL file instead of the CRT (my secret provided by the annotation holds both "ca.crt" and "ca.crl"). I've confirmed it's happening on versions 1.1.3, 1.3.1, 1.4.0 and 1.5.1.
Hey @Restless-ET -- Unfortunately we haven't seen a recurrence of this since I raised the issue, and I was never able to reliably reproduce it either.
Yes, I experience the same... when I release a new version or simply do a rollout restart it doesn't happen every time, and even when it does it's not for all the controller pods. It doesn't seem to affect functionality on any of the configured endpoints, so I guess at this stage it's really more about log-noise reduction (and quicker detection of actual problems) than anything else. Anyway, thank you for getting back on this. :)
This problem has severely impacted us in the past, and I have just now been able to compile the information and replicate it. I also believe it's the same underlying issue causing #10234 and #10265.

Our context
The following makes this issue occur more often
Symptoms
or
Replicating the problem

I did the following in minikube:
#!/bin/bash
VERSION=4.7.1
NS=ingress-test
# Install ingress controller
helm upgrade nginx ingress-nginx/ingress-nginx -i --version ${VERSION} -n ${NS} --create-namespace
echo Wait for ingress controller to be live
until kubectl wait -n ${NS} --for=condition=Ready pod --selector app.kubernetes.io/component=controller
do
sleep 1
done
# Create large truststore (increases likelihood of the race condition)
cat << EOF | kubectl apply -n ${NS} -f - --server-side
apiVersion: v1
data:
ca.crt: |
$(cat /etc/ssl/certs/ca-certificates.crt | base64 | sed "s/^/ /")
kind: Secret
metadata:
name: truststore
type: Opaque
EOF
# Create ingress
cat <<EOF | kubectl apply -n ${NS} -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
nginx.ingress.kubernetes.io/auth-tls-pass-certificate-to-upstream: "true"
nginx.ingress.kubernetes.io/auth-tls-secret: ingress-test/truststore
nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
nginx.ingress.kubernetes.io/auth-tls-verify-depth: "1"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
update-time: ""
name: ingress
spec:
ingressClassName: nginx
rules:
- host: dummy.host.com
http:
paths:
- backend:
service:
name: dummy-service
port:
number: 8080
path: /
pathType: ImplementationSpecific
tls:
- hosts:
- dummy.host.com
EOF

Use 2 terminals.

Terminal 1

Exec into the controller pod and run the following command/script, which loops forever and reports whenever the md5sum of the CA file on disk differs from the expected value:
expected_md5=$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)
cnt=0
while true
do
if [[ "$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)" == "${expected_md5}" ]] ; then
let cnt++
else
echo "success count: $cnt"
cnt=0
echo "failure! $(date)"
fi
done

outputs:
or run nginx -tq in a loop (nginx -t is performed internally by the controller to validate the config):

cnt=0
while true
do
if nginx -tq ; then
let cnt++
else
echo "success count: $cnt"
cnt=0
echo "failure! $(date)"
fi
done

outputs:
Terminal 2

After the monitoring is running in terminal 1, run:

while true; do
kubectl patch -n ingress-test ingress ingress --type merge --patch "metadata: {annotations: {update-time: \"$(date)\"}}"
done

outputs:
Causes
Mitigations that we applied
Ex. where

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress/mtls-truststore
status:
  loadBalancer:
    ingress:
    - ip: x.x.x.x

Possible solutions
func ConfigureCACert(name string, ca []byte, sslCert *ingress.SSLCert) error {
caName := fmt.Sprintf("ca-%v.pem", name)
+ tmpFileName := fmt.Sprintf("%v/.%v", file.DefaultSSLDirectory, caName)
fileName := fmt.Sprintf("%v/%v", file.DefaultSSLDirectory, caName)
+ // Perform atomic write by doing a write followed by a rename (unix only)
- err := os.WriteFile(fileName, ca, 0644)
+ err := os.WriteFile(tmpFileName, ca, 0644)
+ if err == nil {
+ err = os.Rename(tmpFileName, fileName)
+ }
if err != nil {
return fmt.Errorf("could not write CA file %v: %v", fileName, err)
}
sslCert.CAFileName = fileName
klog.V(3).InfoS("Created CA Certificate for Authentication", "path", fileName)
return nil
}
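A note on the design choice above (my commentary, not project code): rename within the same directory/filesystem is atomic on Unix, so a concurrent reader such as nginx -t sees either the complete old file or the complete new file, never a half-written one. A minimal shell sketch of the same pattern, with hypothetical paths:

SSL_DIR=/etc/ingress-controller/ssl            # same directory (and filesystem) as the final file
CA_FILE=ca-ingress-test-truststore.pem         # example name from the repro above

# Write the full contents to a hidden temp file first...
cat /etc/ssl/certs/ca-certificates.crt > "${SSL_DIR}/.${CA_FILE}"
# ...then move it into place in one atomic step; readers never observe a partial PEM.
mv "${SSL_DIR}/.${CA_FILE}" "${SSL_DIR}/${CA_FILE}"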
Other potentially affected pieces of code:
I am willing to provide a PR with fixes if you can provide some guidance on my proposed solution(s).
Just to add some information on this, we are able to consistently reproduce the issue by deploying ingresses with the following annotations
Attempts to deploy many such ingresses simultaneously give errors such as
Observations:
As our ingresses only need to use a single shared CA bundle which doesn't change often, our workaround right now is to mount said bundle as a configmap into the nginx pods, then use a configuration snippet to turn on TLS verification to the backend pods, referencing the mounted bundle.
This seems to dodge the race condition but is far from ideal, not least because enabling configuration snippets exposes vulnerabilities. I created a helm chart which consistently reproduces the issue: it deploys a placeholder secret, then deploys many ingresses with the above annotations which reference said secret.
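A rough sketch of that kind of workaround (my interpretation, with hypothetical names and paths; shown here for verifying the certificates of upstream/backend connections, adjust to your use case):

# Hypothetical: publish the shared CA bundle as a ConfigMap
kubectl -n ingress-test create configmap shared-ca --from-file=ca.crt=/etc/ssl/certs/ca-certificates.crt

# Mount it into the controller pods via the chart's extraVolumes/extraVolumeMounts values
cat <<EOF > values-shared-ca.yaml
controller:
  allowSnippetAnnotations: true
  extraVolumes:
    - name: shared-ca
      configMap:
        name: shared-ca
  extraVolumeMounts:
    - name: shared-ca
      mountPath: /etc/nginx/shared-ca
      readOnly: true
EOF
helm upgrade nginx ingress-nginx/ingress-nginx -i -n ingress-test -f values-shared-ca.yaml

# Per ingress, reference the mounted file from a snippet instead of auth-tls-secret
kubectl -n ingress-test annotate ingress ingress \
  'nginx.ingress.kubernetes.io/configuration-snippet=proxy_ssl_trusted_certificate /etc/nginx/shared-ca/ca.crt;
proxy_ssl_verify on;'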
Hi,

It seems clear that an event like a rollout of the controller, with existing controller pods terminating and new controller pods being created, is required to cause this. Another trigger seems to be a large volume of ingresses using the annotation that injects secrets. I see that some comments also concur that race-condition-like situations are not ruled out. To state the obvious, just one or a few ingresses syncing concurrently does not cause this problem.

It is also obvious that users who have mTLS secrets in ingresses, either in large volumes or involved in a rollout during upgrades, need a better experience. But the project is extremely short on resources and there is no developer time available to work on this. If a PR is submitted it will likely get reviewed, but an e2e test that mirrors these conditions in a kind cluster is an absolute requirement; I see the need for lots of certs there.

The project's resources are prioritised on securing the controller by default and on implementing the Gateway API. We have actually deprecated features that are far outside the implications of the Ingress API spec, like TCP/UDP forwarding. The best step forward is that you join the community meeting, announce the intent to work on this, and discuss it in the ingress-nginx-dev channel of the Kubernetes Slack. It would help a lot.
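For what it's worth, a hedged sketch of how such a test could generate a large CA bundle for a kind cluster (hypothetical names; not the project's e2e framework):

# Generate many throwaway CAs and concatenate them into one large truststore
mkdir -p /tmp/test-cas && cd /tmp/test-cas
for i in $(seq 1 200); do
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/CN=test-ca-${i}" \
    -keyout "ca-${i}.key" -out "ca-${i}.crt"
done
cat ca-*.crt > bundle.pem
kubectl -n ingress-test create secret generic truststore --from-file=ca.crt=bundle.pem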
What happened:
During startup of nginx we observed nginx emitting emergency-level logs, as the configuration contained references to certificate files that had not yet been written into the pod.
What you expected to happen:
ingress-nginx should fully write secrets to the pod before attempting to start up

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.4.0

Kubernetes version (use kubectl version): 1.24.8

Environment:

Cloud provider or hardware configuration: GKE
OS (e.g. from /etc/os-release): ContainerOS
Kernel (e.g. uname -a): Linux ingress-nginx-external-controller-6c9449fbfd-p584h 5.10.147+ #1 SMP Thu Nov 10 04:41:53 UTC 2022 x86_64 Linux

Current state of ingress object, if applicable:
How to reproduce this issue:
This hasn't been reproducible in a smaller test environment as of yet; it only seems to happen on our cluster with ~1000 ingresses. We've been on 1.4 for some time now, and this is the first time we've observed the issue while nginx is rolling out.