This error seems to come from k8s, and it looks like a resource (etcd? ceph?) is just taking too long to respond. This aligns with the ceph outage mentioned at the time.
This was in a live model on our prodstack. I believe there may have been an ongoing ceph outage at the time, so the entire controller was sluggish to respond. To resolve the issue, I had to redeploy vault once the controller was back on track.
Relevant log output
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1575, in <module>
    main(VaultCharm)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 548, in main
    manager.run()
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 527, in run
    self._emit()
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 516, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 408, in _configure
    self._configure_pki_secrets_engine()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 485, in _configure_pki_secrets_engine
    vault = self._get_active_vault_client()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1377, in _get_active_vault_client
    role_id, secret_id = self._get_approle_auth_secret()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1252, in _get_approle_auth_secret
    juju_secret = self.model.get_secret(label=VAULT_CHARM_APPROLE_SECRET_LABEL)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 285, in get_secret
    content = self._backend.secret_get(id=id, label=label)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3504, in secret_get
    result = self._run('secret-get', *args, return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3141, in _run
    raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR cannot ensure service account "unit-vault-0": Internal error occurred: resource quota evaluation timed out
Additional context
We can catch the error here, but there are some implications. If we can't retrieve something because of an intermittent error, what do we set the status to? What if the calls are asymmetrical (we can retrieve in configure but not in collect status, or vice versa)? Even more disruptively, what do we do when we're trying to store a secret and this happens? Since we don't use defers, we lose the context in which we were attempting to add or update this secret. I think this topic deserves a bigger discussion.
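To make the read-side question a bit more concrete, here is a rough sketch of one possible shape for catching the error. It is not the charm's actual code: `read_secret_or_none`, `fetch`, and `transient` are hypothetical names, and the sketch takes the exception types as a parameter so it does not depend on `ops` being importable. The idea is that both configure and collect status would go through the same helper, so they at least fail in the same way.

```python
from typing import Callable, Dict, Optional, Tuple, Type

SecretContent = Dict[str, str]


def read_secret_or_none(
    fetch: Callable[[], SecretContent],
    transient: Tuple[Type[Exception], ...],
) -> Tuple[Optional[SecretContent], Optional[str]]:
    """Return (content, None) on success, or (None, reason) on a transient failure.

    `fetch` is any zero-argument callable that returns the secret content.
    `transient` lists the exception types treated as "try again later".
    """
    try:
        return fetch(), None
    except transient as exc:
        # Keep the message so the caller can surface it, for example in a
        # waiting-style status, instead of crashing the hook.
        return None, str(exc)
```

In the charm, `fetch` could be something like `lambda: self.model.get_secret(label=VAULT_CHARM_APPROLE_SECRET_LABEL).get_content(refresh=True)` with `transient=(ops.model.ModelError,)`. One caveat: in `ops`, `SecretNotFoundError` is a subclass of `ModelError`, so a caller that needs to tell "secret does not exist" apart from "backend timed out" would have to catch it first. What status to set when the reason is not `None` is still the open question above.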
For now, I'll move forward with this and catch the error. I think we may also be able to remove some dependence on secrets: for example, we store the CSR for PKI in three separate places (vault, the relation data, and juju secrets). In the other cases, it should be fairly straightforward to write the code such that subsequent calls will update the secret as expected, although there might be some inconsistency in the in-between time.
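For the write side, "subsequent calls will update the secret as expected" could look roughly like the sketch below. It assumes the desired content can be recomputed from a source of truth on every event, so nothing needs to be deferred. `reconcile_secret`, `read`, `write`, and `transient` are hypothetical names for illustration, not existing charm or `ops` APIs.

```python
from typing import Callable, Dict, Optional, Tuple, Type

SecretContent = Dict[str, str]


def reconcile_secret(
    desired: SecretContent,
    read: Callable[[], Optional[SecretContent]],
    write: Callable[[SecretContent], None],
    transient: Tuple[Type[Exception], ...],
) -> bool:
    """Bring the stored secret in line with `desired`.

    Returns True when the stored content matches `desired` after the call,
    and False when a transient error prevented that. `read` is expected to
    return None if the secret does not exist yet.
    """
    try:
        if read() == desired:
            return True  # Already in sync, nothing to write.
        write(desired)
        return True
    except transient:
        # No context is saved here: the next event recomputes `desired`
        # and calls this again, which is where the temporary
        # inconsistency comes from.
        return False
```

Because the function only compares desired and stored content, calling it again on the next hook is safe whether or not the previous attempt succeeded.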
Reported by @alesstimec
To Reproduce
Unknown
Environment
#407 (comment)