
Juju units get stuck in "Waiting for vault to be available" waiting state #69

Closed
gruyaume opened this issue Oct 26, 2023 · 1 comment · Fixed by #81
Labels: bug (Something isn't working)

Comments

@gruyaume (Collaborator)

Describe the bug

Juju units get stuck in the "Waiting for vault to be available" waiting state when Kubernetes pods get rescheduled.

To Reproduce

  1. Deploy Vault
juju deploy vault-k8s --channel=edge --trust -n 4
  2. Wait for vault units to be in active/idle state
  3. Delete one of the pods
kubectl delete pod vault-k8s-1 -n <your model>
  4. Run juju status
guillaume@potiron:~$ juju status
Model  Controller          Cloud/Region        Version  SLA          Timestamp
dem    microk8s-localhost  microk8s/localhost  3.1.6    unsupported  11:50:00-04:00

App        Version  Status   Scale  Charm      Channel  Rev  Address         Exposed  Message
vault-k8s           waiting      4  vault-k8s  edge      44  10.152.183.110  no       installing agent

Unit          Workload  Agent  Address      Ports  Message
vault-k8s/0*  active    idle   10.1.182.49         
vault-k8s/1   waiting   idle   10.1.182.42         Waiting for vault to be available
vault-k8s/2   active    idle   10.1.182.57         
vault-k8s/3   active    idle   10.1.182.38

Expected behavior

The expectation is that the affected units return to an active/idle state automatically.

Environment

  • Charm / library version (if relevant):
  • Juju version (output from juju --version): 3.1.6
  • Cloud Environment: MicroK8s v1.27.6
  • Kubernetes version: 1.27
@gruyaume added the bug label Oct 26, 2023
@gruyaume (Collaborator, Author) commented Nov 5, 2023

This issue may be due to the fact that we use IPs instead of hostnames. When pods go down and come back up, they return with a different IP. I'd recommend that we use K8s hostnames instead, so that unit identity is not affected by pods being rescheduled.
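
For illustration, a minimal sketch of what connecting over the stable hostname could look like. This is an assumption, not the charm's actual code: it assumes the hvac client library, Vault's default API port, and a hypothetical CA bundle path.

import socket

import hvac

# On a Juju Kubernetes charm, socket.getfqdn() returns the stable per-unit
# DNS name, e.g. "vault-k8s-1.vault-k8s-endpoints.<model>.svc.cluster.local".
# Unlike the pod IP, this name survives pod rescheduling.
fqdn = socket.getfqdn()

# Connect to Vault over the stable hostname instead of the pod IP.
# The CA bundle path below is hypothetical.
client = hvac.Client(url=f"https://{fqdn}:8200", verify="/path/to/ca.pem")
print("Vault initialized:", client.sys.is_initialized())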

DanielArndt added a commit that referenced this issue Nov 23, 2023
Fixes #69 by using the FQDN to connect to the vault instance instead of
using the IP address.

Using the IP address caused an issue because the IP address would change
after a crash or removal of a pod. In turn, this would cause the TLS
certificate to no longer be valid, as the TLS certificate is validated
against the new IP address (but was only issued for the old IP address).
We could re-issue the certificate, but the certificate is also valid for
anything that uses the same FQDN, which the new pod will share.

This change removes the IP address from the certificate, and relies on
the FQDN instead.
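
For reference, a minimal sketch of a CSR whose SANs carry only the DNS name, with no IP entry. It uses the cryptography package directly and is not necessarily the charm's actual TLS helper code.

import socket

from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
fqdn = socket.getfqdn()  # stable per-unit name on Juju Kubernetes charms

csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, fqdn)]))
    # DNS SAN only: no x509.IPAddress entry, so a rescheduled pod's new IP
    # does not invalidate certificates issued against this CSR.
    .add_extension(
        x509.SubjectAlternativeName([x509.DNSName(fqdn)]),
        critical=False,
    )
    .sign(private_key, hashes.SHA256())
)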