
[BUG] starting container process caused "exec: \"/azure-keyvault/azure-keyvault-env\": stat /azure-keyvault/azure-keyvault-env: no such file or directory" #42

Closed
howardjones opened this issue Mar 11, 2020 · 36 comments
Labels: bug (Something isn't working)

@howardjones
Contributor

howardjones commented Mar 11, 2020

Note: Make sure to check out known issues (https://github.com/sparebankenvest/azure-key-vault-to-kubernetes#known-issues) before submitting

Describe the bug
Using environment injection with either the supplied test image or my own results in the container not starting, apparently because the expected volume mounts are not actually created.

AKS with Kubernetes 1.17.0

To Reproduce
Steps to reproduce the behavior:

Helm installation as per the manual.

apiVersion: v1
kind: Namespace
metadata:
  name: akv-test
  labels:
    azure-key-vault-env-injection: enabled
---
apiVersion: spv.no/v1alpha1
kind: AzureKeyVaultSecret
metadata:
  name: db-secret-inject
  namespace: akv-test
spec:
  vault:
    name: akvtest # name of key vault
    object:
      name: login-secret # name of the akv object
      type: secret # akv object type
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: akv2k8s-test-injection
  namespace: akv-test
  labels:
    app: akv2k8s-test-injection
spec:
  selector:
    matchLabels:
      app: akv2k8s-test-injection
  template:
    metadata:
      labels:
        app: akv2k8s-test-injection
    spec:
      containers:
      - name: akv2k8s-env-test
        image: spvest/akv2k8s-env-test
        env:
        - name: TEST_SECRET
          value: "secret-inject@azurekeyvault"
        - name: SP_SECRET
          value: "db-secret-inject@azurekeyvault"
        - name: ENV_INJECTOR_LOG_LEVEL
          value: debug
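
For completeness, a sketch of applying the manifests above and watching the result (the file name is a placeholder):

# Apply the Namespace, AzureKeyVaultSecret and Deployment from above,
# then watch the pod start (or crash-loop, as described below)
kubectl apply -f akv-test.yaml
kubectl -n akv-test get pods -w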

Expected behavior

The SP_SECRET environment variable is defined with the value from AKV, inside a running container.

Logs
If applicable, add logs to help explain your problem.

The env-injector pod seems to think it did its work:

time="2020-03-11T09:20:02Z" level=info msg="found pod to mutate in namespace 'akv-test'"
time="2020-03-11T09:20:02Z" level=info msg="found container 'akv2k8s-env-test' to mutate"
time="2020-03-11T09:20:02Z" level=info msg="checking for env vars containing '@azurekeyvault' in container akv2k8s-env-test"
time="2020-03-11T09:20:02Z" level=info msg="found env var: secret-inject@azurekeyvault"
time="2020-03-11T09:20:02Z" level=info msg="found env var: db-secret-inject@azurekeyvault"
time="2020-03-11T09:20:02Z" level=info msg="did not find credentials to use with registry 'spvest' - getting default credentials"
time="2020-03-11T09:20:02Z" level=info msg="registry host 'spvest' is not a acr registry"
time="2020-03-11T09:20:02Z" level=info msg="pulling docker image docker.io/spvest/akv2k8s-env-test:latest to get entrypoint and cmd, timeout is 120 seconds"
time="2020-03-11T09:20:04Z" level=info msg="docker image docker.io/spvest/akv2k8s-env-test:latest pulled successfully"
time="2020-03-11T09:20:04Z" level=info msg="inspecting container image docker.io/spvest/akv2k8s-env-test:latest, looking for entrypoint and cmd"
time="2020-03-11T09:20:04Z" level=info msg="using 'entrypoint.sh' as arguments for env-injector"
time="2020-03-11T09:20:04Z" level=info msg="containers mutated and pod updated with init-container and volumes"

But in the pod description:

Events:
  Type     Reason     Age                   From                                        Message
  ----     ------     ----                  ----                                        -------
  Normal   Scheduled  <unknown>             default-scheduler                           Successfully assigned akv-test/akv2k8s-test-injection-6c796746df-vqspz to aks-nodepool1-40643695-vmss000001
  Normal   Pulled     13m                   kubelet, aks-nodepool1-40643695-vmss000001  Container image "spvest/azure-keyvault-env:1.0.1" already present on machine
  Normal   Created    13m                   kubelet, aks-nodepool1-40643695-vmss000001  Created container copy-azurekeyvault-env
  Normal   Started    13m                   kubelet, aks-nodepool1-40643695-vmss000001  Started container copy-azurekeyvault-env
  Normal   Started    13m                   kubelet, aks-nodepool1-40643695-vmss000001  Started container akv2k8s-env-test
  Normal   Pulled     12m (x4 over 13m)     kubelet, aks-nodepool1-40643695-vmss000001  Successfully pulled image "spvest/akv2k8s-env-test"
  Normal   Created    12m (x4 over 13m)     kubelet, aks-nodepool1-40643695-vmss000001  Created container akv2k8s-env-test
  Warning  Failed     12m (x3 over 12m)     kubelet, aks-nodepool1-40643695-vmss000001  Error: failed to start container "akv2k8s-env-test": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"/azure-keyvault/azure-keyvault-env\": stat /azure-keyvault/azure-keyvault-env: no such file or directory": unknown
  Normal   Pulling    11m (x5 over 13m)     kubelet, aks-nodepool1-40643695-vmss000001  Pulling image "spvest/akv2k8s-env-test"
  Warning  BackOff    3m11s (x44 over 12m)  kubelet, aks-nodepool1-40643695-vmss000001  Back-off restarting failed container

The init container does not show any error state (exit code 0). Both containers have the /azure-keyvault/ mount listed.

The logs for the init container simply say that the file was copied:

kubectl logs -n akv-test akv2k8s-test-injection-6c796746df-vqspz -c copy-azurekeyvault-env
Copying /azure-keyvault/azure-keyvault-env to /azure-keyvault/
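
For reference, one way to confirm that both containers list the /azure-keyvault mount is something like the following (pod name taken from the events above):

# Print the volumeMounts of all containers in the failing pod
kubectl -n akv-test get pod akv2k8s-test-injection-6c796746df-vqspz \
  -o jsonpath='{.spec.containers[*].volumeMounts}'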

Additional context

This did work yesterday! Nothing has changed in the AKS cluster. I have redeployed the akv2k8s and application deployments a few times to be sure.

@howardjones howardjones added the bug Something isn't working label Mar 11, 2020
@torresdal
Collaborator

Hi @howardjones - I have a theory about what's happening, but I believe it should only be a problem if a container crashes on first run. Could you try the deployment as detailed in this tutorial? https://akv2k8s.io/tutorials/env-injection/1-secret/

Specifically, use image v2.0.1 and specify args and env:

containers:
- name: akv2k8s-env-test
  image: spvest/akv2k8s-env-test:2.0.1
  args: ["TEST_SECRET"]
  env:
  - name: TEST_SECRET
    value: "secret-inject@azurekeyvault" # ref to akvs

@torresdal
Collaborator

torresdal commented Mar 11, 2020

@howardjones My theory is that when a Pod starts up, the init-container executes, copying the env-injector executable and creds to a shared volume. When the original program starts up, these sensitive files are deleted (because they are sensitive). However, if the pod crashes and tries to restart, the init-container will not run again (by design), and the sensitive files, which are now needed to start the container, are no longer there. To fix a crashed container, the pod needs to be deleted, which will make the init-container run again.

This is of course not the way we would like it to work, and we are looking into options - do you have any suggestions?

@torresdal
Collaborator

To temporarily work around this issue, you could try setting the env var DELETE_SENSITIVE_FILES=false in your container; the env-injector should pick that up and not delete the files. This is at least something you should try, to see whether the problem I describe above is the cause...
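
A minimal sketch of where that variable would go, assuming the test deployment from the tutorial above:

containers:
- name: akv2k8s-env-test
  image: spvest/akv2k8s-env-test:2.0.1
  args: ["TEST_SECRET"]
  env:
  - name: TEST_SECRET
    value: "secret-inject@azurekeyvault" # ref to akvs
  - name: DELETE_SENSITIVE_FILES # workaround: keep /azure-keyvault/* across restarts
    value: "false"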

@hfgbarrigas

hfgbarrigas commented Mar 11, 2020

Hey, just encountered this issue. Tried the env variable suggested by @torresdal but still getting the same error. Here's a screenshot.

[screenshot of the same exec error]

The main container might be crashing on boot - at the moment I'm not sure, because the only info I have available is about this particular error. FWIW, though, the main container is most likely failing on boot.

@howardjones
Contributor Author

Pinning the version to 2.0.1 successfully started the test container with the right AKV value.
I don't think my own container is crashing at startup... I'll try with the environment variable you mention and report back.

@torresdal torresdal self-assigned this Mar 11, 2020
@hfgbarrigas

Just tested with spvest/akv2k8s-env-test:2.0.1, to make sure everything was working at least from an injector point of view. Got the same error.
Here's the deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: akvs-secret-app
  namespace: akv2k8s-managed
  labels:
    app: akvs-secret-app
spec:
  selector:
    matchLabels:
      app: akvs-secret-app
  template:
    metadata:
      labels:
        app: akvs-secret-app
    spec:
      containers:
      - name: akv2k8s-env-test
        image: spvest/akv2k8s-env-test:2.0.1
        args: [...secrets]
        env:
        - name: ENVIRONMENT
          value: dev
        - name: DELETE_SENSITIVE_FILES
          value: "false"
        ... more secrets

@torresdal
Collaborator

@hfgbarrigas which container image version is the init-container running?

@hfgbarrigas

My app deployment is using:
[screenshot: init image spvest/azure-keyvault-env, tag 1.0.1]

The test deployment using spvest/akv2k8s-env-test:2.0.1 is using the same init image (1.0.1).

@howardjones
Contributor Author

DELETE_SENSITIVE_FILES=false did not make a difference. I can't see any sign that the container has restarted, either - the only Failed event is for the /azure-keyvault/azure-keyvault-env error. My init container is spvest/azure-keyvault-env:1.0.1

@torresdal
Collaborator

There must be something off here. Both of you, @hfgbarrigas and @howardjones, are running the same scenario; one fails, the other succeeds...

@torresdal
Collaborator

I need to dig deep here and find a good solution for this, plus a test harness. Sorry about the inconvenience. Will update here as soon as I can.

@hfgbarrigas

Indeed - if my test deployment had worked from a secret perspective, I would've been satisfied. At least the error is consistent. Looking forward to the fix, and thank you for the prompt feedback.

@looping-18

Hello,
I have the exact same issue using version 1.0.1.

@hfgbarrigas

@torresdal with version 1.0.2-beta.1:

  • test deployment with spvest/akv2k8s-env-test:2.0.1 works as expected
  • my app deployment works as expected - although it fails on boot, from a secret perspective we're good

@howardjones
Contributor Author

howardjones commented Mar 11, 2020

Agreed. With --set image.tag=1.0.2-beta.1 --set envImage.tag=1.0.2-beta.1 I get a working app as expected (with my own app).

@torresdal
Collaborator

Thanks @howardjones & @hfgbarrigas - the only difference between the two versions is that the files will only be deleted IF your application starts successfully.

Should your app crash for some reason after this, though, the pod will not be able to recover.

This is not acceptable for this project and we will keep working to find a solution.

@Aaron-ML

Aaron-ML commented Mar 11, 2020

@torresdal what's the official workaround for this currently?

Not sure how to deal with pods that can never recover without manual intervention.

@torresdal
Collaborator

We have a few ideas of how to solve this now, but it will require more time - so I see no other solution right now than disabling the functionality for deleting sensitive files in a patch (on the way - will be 1.0.2).

As for the long-term solution, we're looking into how we can exchange AKV auth tokens securely with our webhook instead of storing creds on an in-memory disk. Other suggestions are welcome!

@torresdal
Collaborator

torresdal commented Mar 11, 2020

Can anyone verify that the issues described here are solved by no longer deleting sensitive files? Version 1.0.2-beta.2 => https://github.com/SparebankenVest/azure-key-vault-to-kubernetes/releases/tag/1.0.2-beta.2

Will release 1.0.2 as soon as I have confirmation.

@hfgbarrigas

@torresdal All good from my side using 1.0.2-beta.1 and 1.0.2-beta.2. The sensitive files will persist throughout the pod's lifetime in the in-memory volume, right?

@torresdal
Collaborator

@hfgbarrigas yes - that is right.
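
For reference, the in-memory volume in question is a Kubernetes emptyDir backed by tmpfs; a generic sketch (the volume name here is a placeholder, not necessarily what the injector generates):

volumes:
- name: azure-keyvault-files # placeholder name
  emptyDir:
    medium: Memory # tmpfs-backed; contents persist for the pod's lifetime, but not across pod deletion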

@torresdal
Collaborator

The official release is now out: https://github.com/SparebankenVest/azure-key-vault-to-kubernetes/releases/tag/1.0.2

Currently working on a better and more secure solution for handling default credentials.

@torresdal
Collaborator

torresdal commented Mar 17, 2020

Two fixes have been implemented to prevent the issues identified in this thread. Detailed description in Release 1.1.0-beta.4. Any help verifying before the final release would be greatly appreciated!

@torresdal torresdal added this to the Version 1.1.0 milestone Mar 19, 2020
@hfgbarrigas

I'll test it asap.
Regarding the implementation, can you shed a bit more light on the "gets an OAuth token to access AKV from an endpoint that is protected via client certificate" part?
Thank you.

@howardjones
Contributor Author

howardjones commented Mar 23, 2020 via email

@torresdal
Collaborator

torresdal commented Mar 23, 2020

@hfgbarrigas @howardjones fixed several issues over the weekend. You should now point images to 1.1.0-beta.28 and the Helm chart to version 1.1.0-beta.12:

--set envImage.tag=1.1.0-beta.28 --set authService.image.tag=1.1.0-beta.28 --set image.tag=1.1.0-beta.28 --version 1.1.0-beta.12
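
For context, a sketch of the full Helm invocation these flags might belong to; the chart and release names here are assumptions, not confirmed in this thread:

# Sketch only - chart/release names are assumptions
helm upgrade --install azure-key-vault-env-injector \
  spv-charts/azure-key-vault-env-injector \
  --namespace akv2k8s \
  --set envImage.tag=1.1.0-beta.28 \
  --set authService.image.tag=1.1.0-beta.28 \
  --set image.tag=1.1.0-beta.28 \
  --version 1.1.0-beta.12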

Update: I will create a new official Beta release shortly, with updated Helm chart.

@torresdal
Collaborator

@hfgbarrigas regarding "gets an OAuth token to access AKV from an endpoint that is protected via client certificate":

We have implemented an auth-service that issues Azure JWT tokens valid for AKV resources only. The service uses the credentials given to the auth-service (the default is the AKS credentials).

The main difference is:

  • JWT tokens expire after 1 hour (the default; configurable in Azure AD)
  • the token is issued specifically for the Azure Key Vault resource (default https://vault.azure.net) - see the sketch after this list
  • the auth service is protected by a client certificate that gets issued to Pods using the env-injector
  • no azure.json in Pods and no files to delete
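
For illustration only (this is not the project's code): a resource-scoped token of the kind described above can be requested from the Azure AD OAuth2 endpoint; the tenant and client values are placeholders:

# Illustration only - not akv2k8s code. Requests an Azure AD token scoped
# to the Key Vault resource, similar to what the auth-service issues.
# TENANT_ID, CLIENT_ID and CLIENT_SECRET are placeholders.
curl -s -X POST "https://login.microsoftonline.com/${TENANT_ID}/oauth2/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=${CLIENT_ID}" \
  -d "client_secret=${CLIENT_SECRET}" \
  -d "resource=https://vault.azure.net"
# The returned JWT expires (default 1 hour) and is valid for AKV only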

@howardjones
Contributor Author

Hi @torresdal. My test application is working happily with 1.1.0-beta.28 and helm chart 1.1.0-beta.12 as described above. I have not yet tested any failure (pod restart) scenarios though.

@jemag

jemag commented Mar 26, 2020

Hmm, I have never encountered this problem before with a cluster that ran env-injector version 0.1.15 for multiple months, but now that I have upgraded 2 clusters to version 1.0.2, I have had multiple incidents of CrashLoopBackOff, and it seems to happen when Kured restarts nodes for OS updates.

Could this be because there does not seem to be any readiness probe for the env-injector in v1.0.2? Some pods would then contact a not-yet-ready env-injector, fail the injection, and get stuck in a restart loop without properly injected credentials.
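
A hypothetical sketch of the kind of probe being suggested; the endpoint path, port and scheme are assumptions, not the chart's actual values:

# Hypothetical readiness probe for the env-injector webhook deployment;
# path, port and scheme are assumed for illustration
readinessProbe:
  httpGet:
    path: /healthz # assumed health endpoint
    port: 443
    scheme: HTTPS
  initialDelaySeconds: 5
  periodSeconds: 10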

@jemag

jemag commented Mar 26, 2020

Did some more testing on a 2-node cluster with version 1.0.2 and around 12 pods using the env-injector for their secrets. If I drain one of the nodes, there is a fairly high likelihood that 1 or 2 of the pods will start with their secrets not injected, fail, and then enter a CrashLoopBackOff cycle.

I then went back to version 0.1.4 of the env-injector chart (app version 0.1.15) and did the same thing for a couple of tests; so far I cannot get pods into a CrashLoopBackOff cycle. They all get their secrets injected properly.

Finally, I did a quick test with 1.1.0-beta.28 and Helm chart 1.1.0-beta.12, but I'm having some problems getting secret injection to work at all with that version. Will look into it more tomorrow.

@torresdal
Collaborator

Thanks for taking the time to do thorough testing, @jemag. We're in the middle of fixing and testing this issue and aim to have a working version in beta tomorrow. I would advise you not to spend time on the current beta, as we have also seen issues similar to yours, until the new beta is out. This part of the env-injector turned out to be more error-prone than we had hoped, but the good news is we feel confident we have a much more stable implementation now. I'll update here as soon as the beta is out. Thanks!

@howardjones
Contributor Author

Is there a more current RC, or is 1.1.0-beta.28 still my best bet? (starting up a new cluster).

@howardjones
Contributor Author

With the latest GitHub code, I'm hitting

time="2020-04-15T15:09:01Z" level=fatal msg="failed to download ca cert, error: Get http://.akv2k8s.svc/ca: dial tcp: lookup .akv2k8s.svc: no such host" application=env-injector component=akv2k8s namespace=akv-test

Which looks like webhook_auth_service is blank. I'm using the latest published Helm charts, though. Is there something new that I need to set for the new auth service?

@torresdal
Collaborator

@howardjones sorry for the long wait. We had to pause dev on this for a bit (we have a bank to run also 😄), but are back on this now. Give us a few days to clean up a few things and get a new release out. Thanks.

@torresdal
Collaborator

This is now implemented and working correctly as far as we've managed to test in our clusters. Any help verifying this would be great: #115

@torresdal
Collaborator

Testing by multiple parties shows this is working as expected. Closing.
