Deferring start/restart event causes lock to be prematurely released #183
Labels: bug (Something isn't working)
Discovered the same issue while reviewing the opensearch codebase in preparation for adding in-place upgrades. After an event is deferred, the lock is immediately released here: opensearch-operator/lib/charms/rolling_ops/v0/rollingops.py, lines 395 to 407 at e7acbdd.
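A minimal, self-contained toy sketch of the failure mode described above. The names (`SimpleLock`, `run_with_lock`, `restart_callback`) are illustrative assumptions, not the actual rollingops API; the point is only that releasing the lock unconditionally after the callback returns leaves a deferred (re)start without the lock.

```python
class Event:
    """Toy stand-in for an ops event that can be deferred."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class SimpleLock:
    """Toy stand-in for the rolling-ops lock."""

    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False


def run_with_lock(callback):
    """Mimics the problematic flow: the lock is released right after the
    callback returns, even if the callback deferred the event."""
    lock = SimpleLock()
    lock.acquire()
    event = Event()
    callback(event)
    # No check of event.deferred here: the lock is released even though
    # the unit still intends to (re)start on a later dispatch.
    lock.release()
    return event, lock


def restart_callback(event):
    # Simulate a unit that is not ready to restart yet and defers.
    event.defer()


if __name__ == "__main__":
    event, lock = run_with_lock(restart_callback)
    print(f"deferred={event.deferred}, lock held={lock.held}")
    # Prints: deferred=True, lock held=False -> the deferred restart will
    # later run without holding the lock.
```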
carlcsaposs-canonical added a commit that referenced this issue on Apr 17, 2024:
## Issue

### Fix issues with current lock implementation
- Fixes #183: Units can start/restart without the rollingops lock (if the event is deferred, the lock is released but the unit (re)starts on the next event anyway).
- A unit can scale down (using the "ops lock") while another unit is (re)starting (using the separate rollingops lock).

### Prepare for in-place upgrades
- In order to maintain HA, only one unit should start, restart, join the cluster (scale up), leave the cluster (scale down), or upgrade at a time.
- Upgrades add an additional challenge: if the charm code is upgraded (in addition to the workload), the leader unit can go into error state, which prevents coordination of locks via the peer databag (the current rolling ops implementation) and can block rollback for other units.

## Options considered
1. Use the peer relation databag for the lock:
   - Unit requests the lock by adding data to its unit databag -> relation-changed event triggered on the leader
   - Leader grants the lock by adding data to the app databag -> relation-changed event triggered on the unit
   - Unit proceeds & releases the lock from its unit databag
2. Don't use a lock for upgrades. Make the upgrade order deterministic (so that each unit can independently determine the upgrade order and the order does not change), and each unit upgrades when it sees that the units before it have upgraded.
3. Use an opensearch index/document for the lock.
4. Option 3, but fall back to option 1 if ~~no units online~~ fewer than 2 units are online.

Cons of each option:
1. If the leader unit is in error state (raises an uncaught exception), rollback is blocked for all units.
2. A unit can restart for non-upgrade-related reasons (role re-balancing from a network cut or scale-up) while another unit is restarting to upgrade.
3. Doesn't work if all units are offline (initial start, full cluster crash/network cut).
4. Implementation complexity; some edge cases are not supported, e.g. https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_locking.py#L227-L228

Pros of each option:
1. Only one unit will (re)start at a time, for any reason.
2. Rollback with `juju refresh` will immediately roll back the highest unit even if the leader/other units are in error state.
3. Only one unit will (re)start at a time, for any reason, and rollback with `juju refresh` will quickly roll back the highest unit even if the leader unit's charm code is in error state.
4. Same as 3, and locking works (mostly, except the aforementioned edge cases) when all units are offline.

More context: https://chat.canonical.com/canonical/pl/9fah5tfxd38ybnx3tq7zugxfyh (option 1 in that discussion is option 2 here; option 2 in the discussion is option 1 here)

**Option chosen:** Option 4

### Opensearch index vs document for lock
In the current "ops lock" implementation with an opensearch index, each unit requests the lock by trying to create an index; if the index does not exist, the "lock" is granted. However, if a unit requests the lock, the charm goes into error state, and the error state is later resolved (e.g. after rollback), that unit will not be able to use the lock: no unit will be aware that it holds the lock and no unit will be able to release it.

Solution: use document id 0 that stores "unit-name" as the lock.

Discussion: https://chat.canonical.com/canonical/pl/biddxzzk3fbpjgbhmatzr8n6bw

## Solution

### Design (option 4)
Use an opensearch document as the lock (for any (re)start, join cluster, leave cluster, or upgrade). Fall back to the peer databag if all units are offline.
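A hedged sketch of the document-as-lock idea, assuming a reachable OpenSearch endpoint; the URL, index name, and TLS handling below are placeholders rather than the charm's actual configuration (the real implementation lives in `opensearch_locking.py`). It shows why storing the holder's unit name in document id 0 lets a unit that recovered from error state recognize and release its own lock, unlike the index-as-lock approach.

```python
import requests

OPENSEARCH_URL = "https://localhost:9200"  # placeholder endpoint
LOCK_INDEX = ".charm_node_lock"            # placeholder index name
LOCK_DOC_ID = 0


def acquire_lock(unit_name: str) -> bool:
    """Try to create document id 0; the _create endpoint returns 409 Conflict
    if the document already exists, i.e. another unit holds the lock."""
    response = requests.put(
        f"{OPENSEARCH_URL}/{LOCK_INDEX}/_create/{LOCK_DOC_ID}",
        json={"unit-name": unit_name},
        params={"refresh": "true"},
        verify=False,  # placeholder; the real charm uses proper TLS and auth
    )
    return response.status_code == 201


def release_lock(unit_name: str) -> None:
    """Release the lock only if this unit holds it; because the document
    stores the holder's name, a unit recovering from error state can tell
    whether the lock is its own."""
    doc = requests.get(
        f"{OPENSEARCH_URL}/{LOCK_INDEX}/_doc/{LOCK_DOC_ID}", verify=False
    )
    if doc.ok and doc.json().get("_source", {}).get("unit-name") == unit_name:
        requests.delete(
            f"{OPENSEARCH_URL}/{LOCK_INDEX}/_doc/{LOCK_DOC_ID}",
            params={"refresh": "true"},
            verify=False,
        )
```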
### Implementation
Create custom events `_StartOpensearch` and `_RestartOpensearch`: https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_base_charm.py#L121-L132

When opensearch should be (re)started, emit the custom event. The custom event requests the lock. If granted, it (re)starts opensearch. Once opensearch is fully ready, the lock is released. If opensearch fails to start, the lock is also released. While opensearch is starting, or while the lock is not granted, the custom event is continually deferred.

Note: the original event is not deferred, only the custom event. This is so that any logic that ran before the request to (re)start opensearch does not get re-run.

By requesting the lock within the custom event, and attempting to reacquire the lock each time the custom event runs (i.e. after every time it's deferred), we solve the design issue with rollingops and deferred events detailed in #183.
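A hedged sketch of this custom-event pattern using the ops framework. The event class name matches the linked code, but the handler body and the lock/start/readiness helpers are illustrative stubs, not the charm's actual implementation.

```python
from ops.charm import CharmBase, CharmEvents
from ops.framework import EventBase, EventSource


class _StartOpensearch(EventBase):
    """Custom event emitted when opensearch should be started."""


class _OpenSearchCharmEvents(CharmEvents):
    start_opensearch = EventSource(_StartOpensearch)


class OpenSearchBaseCharm(CharmBase):
    on = _OpenSearchCharmEvents()

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.start_opensearch, self._on_start_opensearch)

    # Illustrative stubs: the real charm delegates to its lock and
    # opensearch distribution helpers.
    def _try_acquire_lock(self) -> bool: ...
    def _release_lock(self) -> None: ...
    def _start_opensearch_process(self) -> None: ...
    def _opensearch_fully_ready(self) -> bool: ...

    def _on_start_opensearch(self, event: _StartOpensearch) -> None:
        # Re-request the lock every time the (possibly deferred) custom event
        # runs, so a deferral never leaves the unit starting without the lock.
        if not self._try_acquire_lock():
            event.defer()
            return
        try:
            self._start_opensearch_process()
        except Exception:
            self._release_lock()  # start failed: give the lock back
            raise
        if not self._opensearch_fully_ready():
            event.defer()  # keep the lock while opensearch is still starting
            return
        self._release_lock()  # started and ready: release the lock
```

Note that only this custom event is deferred; the hook's original event completes normally, so earlier hook logic is never re-run.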
I am currently seeing failures such as this error in the early steps of the CI run.
The entire description here is based on this PR: https://github.com/canonical/opensearch-operator/tree/DPE-3352-last-passing-tests
Full deployment:
Unit opensearch/2 is failing because all shards disappeared at once. At 14:15:01, everything is there:
Then, it disappears:
Looking at the logs, I can see that:
2024-02-07 14:15:57 DEBUG unit.opensearch/0.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
2024-02-07 14:15:34 DEBUG unit.opensearch/1.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
That means that when node opensearch/2 runs its routine, the other nodes are gone and it fails.
This is happening because both opensearch/0 and opensearch/1 are deferring their RunWithLock events:
Looking into the rolling-ops code: there is no post-processing to check whether the event has been deferred or not. It only reacts to an exception, which breaks the entire hook. I will report a bug with rolling ops as well, asking for `run_on_lock` to capture any exceptions and double-check whether the event has been marked for deferral after the callback returns.
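A hedged sketch of the fix being requested from rolling-ops, using the same kind of toy `Event`/`SimpleLock` stand-ins as the sketch earlier in this thread (not the rollingops API): catch callback exceptions and only release the lock when the event was not deferred.

```python
class Event:
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class SimpleLock:
    def __init__(self):
        self.held = False

    def acquire(self):
        self.held = True

    def release(self):
        self.held = False


def run_with_lock_fixed(callback):
    """Release the lock only when the callback succeeded and did not defer."""
    lock = SimpleLock()
    lock.acquire()
    event = Event()
    try:
        callback(event)
    except Exception:
        # Design choice left open by the bug report: here the lock is released
        # before re-raising so a broken hook does not leave a stale lock.
        lock.release()
        raise
    if event.deferred:
        # The callback asked to run again later, so keep holding the lock
        # instead of releasing it prematurely.
        return event, lock
    lock.release()
    return event, lock


if __name__ == "__main__":
    event, lock = run_with_lock_fixed(lambda e: e.defer())
    print(f"deferred={event.deferred}, lock held={lock.held}")
    # Prints: deferred=True, lock held=True -> the deferred run still owns
    # the lock when it is retried.
```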