Deferring start/restart event causes lock to be prematurely released #183
Labels: bug (Something isn't working)
Discovered the same issue while reviewing the opensearch codebase in preparation for adding in-place upgrades. After an event is deferred, the lock is immediately released here: opensearch-operator/lib/charms/rolling_ops/v0/rollingops.py, lines 395 to 407 at e7acbdd.
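A minimal, self-contained toy sketch of the failure mode described above. The names (`SimpleLock`, `run_with_lock`, `restart_callback`) are illustrative assumptions, not the actual rollingops API; the point is only that releasing the lock unconditionally after the callback returns leaves a deferred (re)start without the lock.

```python
class Event:
    """Toy stand-in for an ops event that can be deferred."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class SimpleLock:
    """Toy stand-in for the rolling-ops lock."""

    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False


def run_with_lock(callback):
    """Mimics the problematic flow: the lock is released right after the
    callback returns, even if the callback deferred the event."""
    lock = SimpleLock()
    lock.acquire()
    event = Event()
    callback(event)
    # No check of event.deferred here: the lock is released even though
    # the unit still intends to (re)start on a later dispatch.
    lock.release()
    return event, lock


def restart_callback(event):
    # Simulate a unit that is not ready to restart yet and defers.
    event.defer()


if __name__ == "__main__":
    event, lock = run_with_lock(restart_callback)
    print(f"deferred={event.deferred}, lock held={lock.held}")
    # Prints: deferred=True, lock held=False -> the deferred restart will
    # later run without holding the lock.
```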
carlcsaposs-canonical added a commit that referenced this issue on Apr 17, 2024:
## Issue

### Fix issues with current lock implementation
- Fixes #183: Units can start/restart without the rollingops lock (if the event is deferred, the lock is released but the unit (re)starts on the next event anyway).
- A unit can scale down (using the "ops lock") while another unit is (re)starting (using the separate rollingops lock).

### Prepare for in-place upgrades
- In order to maintain HA, only one unit should start, restart, join the cluster (scale up), leave the cluster (scale down), or upgrade at a time.
- Upgrades add an additional challenge: if the charm code is upgraded (in addition to the workload), the leader unit can go into error state, which prevents coordination of locks via the peer databag (the current rolling ops implementation) and can block rollback for other units.

## Options considered
1. Use the peer relation databag for the lock:
   - Unit requests the lock by adding data to its unit databag -> relation-changed event triggered on the leader
   - Leader grants the lock by adding data to the app databag -> relation-changed event triggered on the unit
   - Unit proceeds & releases the lock from its unit databag
2. Don't use a lock for upgrades. Make the upgrade order deterministic (so that each unit can independently determine the upgrade order and the order does not change), and each unit upgrades when it sees that the units before it have upgraded.
3. Use an opensearch index/document for the lock.
4. Option 3, but fall back to option 1 if ~~no units online~~ fewer than 2 units are online.

Cons of each option:
1. If the leader unit is in error state (raises an uncaught exception), rollback is blocked for all units.
2. A unit can restart for non-upgrade-related reasons (role re-balancing from a network cut or scale-up) while another unit is restarting to upgrade.
3. Doesn't work if all units are offline (initial start, full cluster crash/network cut).
4. Implementation complexity; some edge cases are not supported, e.g. https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_locking.py#L227-L228

Pros of each option:
1. Only one unit will (re)start at a time, for any reason.
2. Rollback with `juju refresh` will immediately roll back the highest unit even if the leader/other units are in error state.
3. Only one unit will (re)start at a time, for any reason, and rollback with `juju refresh` will quickly roll back the highest unit even if the leader unit's charm code is in error state.
4. Same as 3, and locking works (mostly, except the aforementioned edge cases) when all units are offline.

More context: https://chat.canonical.com/canonical/pl/9fah5tfxd38ybnx3tq7zugxfyh (option 1 in that discussion is option 2 here; option 2 in the discussion is option 1 here)

**Option chosen:** Option 4

### Opensearch index vs document for lock
In the current "ops lock" implementation with an opensearch index, each unit requests the lock by trying to create an index; if the index does not exist, the "lock" is granted. However, if a unit requests the lock, the charm goes into error state, and the error state is later resolved (e.g. after rollback), that unit will not be able to use the lock: no unit will be aware that it holds the lock and no unit will be able to release it.

Solution: use document id 0 that stores "unit-name" as the lock.

Discussion: https://chat.canonical.com/canonical/pl/biddxzzk3fbpjgbhmatzr8n6bw

## Solution

### Design (option 4)
Use an opensearch document as the lock (for any (re)start, join cluster, leave cluster, or upgrade). Fall back to the peer databag if all units are offline.
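A hedged sketch of the document-as-lock idea, assuming a reachable OpenSearch endpoint; the URL, index name, and TLS handling below are placeholders rather than the charm's actual configuration (the real implementation lives in `opensearch_locking.py`). It shows why storing the holder's unit name in document id 0 lets a unit that recovered from error state recognize and release its own lock, unlike the index-as-lock approach.

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"  # placeholder endpoint
LOCK_INDEX = ".charm_node_lock"            # placeholder index name
LOCK_DOC_ID = 0


def acquire_lock(unit_name: str) -> bool:
    """Try to create document id 0; the _create endpoint returns 409 Conflict
    if the document already exists, i.e. another unit holds the lock."""
    response = requests.put(
        f"{OPENSEARCH_URL}/{LOCK_INDEX}/_create/{LOCK_DOC_ID}",
        json={"unit-name": unit_name},
        params={"refresh": "true"},
        verify=False,  # placeholder; the real charm uses proper TLS and auth
    )
    return response.status_code == 201


def release_lock(unit_name: str) -> None:
    """Release the lock only if this unit holds it; because the document
    stores the holder's name, a unit recovering from error state can tell
    whether the lock is its own."""
    doc = requests.get(
        f"{OPENSEARCH_URL}/{LOCK_INDEX}/_doc/{LOCK_DOC_ID}", verify=False
    )
    if doc.ok and doc.json().get("_source", {}).get("unit-name") == unit_name:
        requests.delete(
            f"{OPENSEARCH_URL}/{LOCK_INDEX}/_doc/{LOCK_DOC_ID}",
            params={"refresh": "true"},
            verify=False,
        )
```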
### Implementation
Create custom events `_StartOpensearch` and `_RestartOpensearch`: https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_base_charm.py#L121-L132

When opensearch should be (re)started, emit the custom event. The custom event requests the lock. If granted, it (re)starts opensearch. Once opensearch is fully ready, the lock is released. If opensearch fails to start, the lock is also released. While opensearch is starting, or while the lock is not granted, the custom event is continually deferred.

Note: the original event is not deferred, only the custom event. This is so that any logic that ran before the request to (re)start opensearch does not get re-run.

By requesting the lock within the custom event, and attempting to reacquire the lock each time the custom event runs (i.e. after every time it's deferred), we solve the design issue with rollingops and deferred events detailed in #183.
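A hedged sketch of this custom-event pattern using the ops framework. The event class name matches the linked code, but the handler body and the lock/start/readiness helpers are illustrative stubs, not the charm's actual implementation.

```python
from ops.charm import CharmBase, CharmEvents
from ops.framework import EventBase, EventSource


class _StartOpensearch(EventBase):
    """Custom event emitted when opensearch should be started."""


class _OpenSearchCharmEvents(CharmEvents):
    start_opensearch = EventSource(_StartOpensearch)


class OpenSearchBaseCharm(CharmBase):
    on = _OpenSearchCharmEvents()

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.start_opensearch, self._on_start_opensearch)

    # Illustrative stubs: the real charm delegates to its lock and
    # opensearch distribution helpers.
    def _try_acquire_lock(self) -> bool: ...
    def _release_lock(self) -> None: ...
    def _start_opensearch_process(self) -> None: ...
    def _opensearch_fully_ready(self) -> bool: ...

    def _on_start_opensearch(self, event: _StartOpensearch) -> None:
        # Re-request the lock every time the (possibly deferred) custom event
        # runs, so a deferral never leaves the unit starting without the lock.
        if not self._try_acquire_lock():
            event.defer()
            return
        try:
            self._start_opensearch_process()
        except Exception:
            self._release_lock()  # start failed: give the lock back
            raise
        if not self._opensearch_fully_ready():
            event.defer()  # keep the lock while opensearch is still starting
            return
        self._release_lock()  # started and ready: release the lock
```

Note that only this custom event is deferred; the hook's original event completes normally, so earlier hook logic is never re-run.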
I am currently seeing failures such as this error in the early steps of the CI run.
The entire description here is based on this PR: https://github.com/canonical/opensearch-operator/tree/DPE-3352-last-passing-tests
Full deployment:
Unit opensearch/2 is failing because all shards disappeared at once. At 14:15:01, everything is there:
Then, it disappears:
Looking at the logs, I can see that:
2024-02-07 14:15:57 DEBUG unit.opensearch/0.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
2024-02-07 14:15:34 DEBUG unit.opensearch/1.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
That means that when node opensearch/2 runs its routine, the other nodes are gone and it fails.
This is happening because both opensearch/0 and opensearch/1 are deferring their RunWithLock events:
Looking into the rolling-ops code: there is no post-processing to check whether the event has been deferred or not. It only reacts to an exception, which breaks the entire hook. I will report a bug with rolling ops as well, asking for `run_on_lock` to capture any exceptions and double-check whether the event has been marked for deferral after the callback returns.
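A hedged sketch of the fix being requested from rolling-ops, using the same kind of toy `Event`/`SimpleLock` stand-ins as the sketch earlier in this thread (not the rollingops API): catch callback exceptions and only release the lock when the event was not deferred.

```python
class Event:
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class SimpleLock:
    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False


def run_with_lock_fixed(callback):
    """Release the lock only when the callback succeeded and did not defer."""
    lock = SimpleLock()
    lock.acquire()
    event = Event()
    try:
        callback(event)
    except Exception:
        # Design choice left open by the bug report: here the lock is released
        # before re-raising so a broken hook does not leave a stale lock.
        lock.release()
        raise
    if event.deferred:
        # The callback asked to run again later, so keep holding the lock
        # instead of releasing it prematurely.
        return event, lock
    lock.release()
    return event, lock


if __name__ == "__main__":
    event, lock = run_with_lock_fixed(lambda e: e.defer())
    print(f"deferred={event.deferred}, lock held={lock.held}")
    # Prints: deferred=True, lock held=True -> the deferred run still owns
    # the lock when it is retried.
```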