Deferring start/restart event causes lock to be prematurely released #183

Closed
phvalguima opened this issue Feb 7, 2024 · 2 comments · Fixed by #211
Labels
bug Something isn't working

Comments

@phvalguima (Contributor)

I am currently seeing failures such as the one below in the early steps of the CI run.

The entire description here is based on this PR: https://github.com/canonical/opensearch-operator/tree/DPE-3352-last-passing-tests

Full deployment:

Model              Controller           Cloud/Region         Version  SLA          Timestamp
test-backups-23v7  github-pr-2fa19-lxd  localhost/localhost  3.1.7    unsupported  14:16:34Z

App                       Version  Status       Scale  Charm                     Channel  Rev  Exposed  Message
opensearch                         maintenance      3  opensearch                           0  no       Beginning rolling service
s3-integrator                      active           1  s3-integrator             stable    13  no       
self-signed-certificates           active           1  self-signed-certificates  stable    72  no       

Unit                         Workload  Agent      Machine  Public address  Ports  Message
opensearch/0                 waiting   idle       2        10.241.94.24           Waiting for OpenSearch to start...
opensearch/1*                waiting   executing  3        10.241.94.127          Waiting for OpenSearch to start...
opensearch/2                 error     idle       4        10.241.94.7            hook failed: "opensearch-peers-relation-changed"
s3-integrator/0*             active    idle       1        10.241.94.199          
self-signed-certificates/0*  active    idle       0        10.241.94.194          

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.241.94.194  juju-ad79f3-0  [email protected]      Running
1        started  10.241.94.199  juju-ad79f3-1  [email protected]      Running
2        started  10.241.94.24   juju-ad79f3-2  [email protected]      Running
3        started  10.241.94.127  juju-ad79f3-3  [email protected]      Running
4        started  10.241.94.7    juju-ad79f3-4  [email protected]      Running

Integration provider                   Requirer                           Interface            Type     Message
opensearch:opensearch-peers            opensearch:opensearch-peers        opensearch_peers     peer     
opensearch:service                     opensearch:service                 rolling_op           peer     
s3-integrator:s3-credentials           opensearch:s3-credentials          s3                   regular  
s3-integrator:s3-integrator-peers      s3-integrator:s3-integrator-peers  s3-integrator-peers  peer     
self-signed-certificates:certificates  opensearch:certificates            tls-certificates     regular  

Storage Unit  Storage ID         Type        Pool    Mountpoint                   Size    Status    Message
opensearch/0  opensearch-data/0  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  
opensearch/1  opensearch-data/1  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  
opensearch/2  opensearch-data/2  filesystem  rootfs  /var/snap/opensearch/common  72 GiB  attached  

Unit opensearch/2 is failing because all shards disappeared at once. At 14:15:01, everything is there:

2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/indices?v HTTP/1.1" 200 606
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Getting secret app:admin-password
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: Starting new HTTPS connection (1): 10.241.94.7:9200
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/shards?v HTTP/1.1" 200 1774
2024-02-07 14:15:01 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: indices status:
[{'health': 'green', 'status': 'open', 'index': '.opensearch-observability', 'uuid': '_pC83dCcQVuYW5VWnbasfg', 'pri': '1', 'rep': '2', 'docs.count': '0', 'docs.deleted': '0', 'store.size': '416b', 'pri.store.size': '208b'}, {'health': 'green', 'status': 'open', 'index': '.plugins-ml-config', 'uuid': 'RXeUPKcXToOBpRQPM4VQEA', 'pri': '1', 'rep': '2', 'docs.count': '1', 'docs.deleted': '0', 'store.size': '7.7kb', 'pri.store.size': '3.8kb'}, {'health': 'green', 'status': 'open', 'index': '.opendistro_security', 'uuid': 'FXZtZFctTWyEjVt9ZDRqcw', 'pri': '1', 'rep': '2', 'docs.count': '10', 'docs.deleted': '2', 'store.size': '105.2kb', 'pri.store.size': '57.8kb'}]
indices shards:
[{'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.8kb', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '0', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opensearch-observability', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': None, 'store': None, 'ip': '10.241.94.24', 'node': 'opensearch-0'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '10', 'store': '0b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '10', 'store': '57.8kb', 'ip': '10.241.94.127', 'node': 'opensearch-1'}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '10', 'store': '47.4kb', 'ip': '10.241.94.24', 'node': 'opensearch-0'}]

Then, at 14:16:21, the shards are gone:

2024-02-07 14:16:21 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: https://10.241.94.7:9200 "GET /_cat/shards?v HTTP/1.1" 200 555
2024-02-07 14:16:21 DEBUG unit.opensearch/2.juju-log server.go:325 opensearch-peers:1: indices status:
[{'health': 'green', 'status': 'open', 'index': '.opensearch-observability', 'uuid': '_pC83dCcQVuYW5VWnbasfg', 'pri': '1', 'rep': '0', 'docs.count': '0', 'docs.deleted': '0', 'store.size': '208b', 'pri.store.size': '208b'}, {'health': 'green', 'status': 'open', 'index': '.plugins-ml-config', 'uuid': 'RXeUPKcXToOBpRQPM4VQEA', 'pri': '1', 'rep': '0', 'docs.count': '1', 'docs.deleted': '0', 'store.size': '3.9kb', 'pri.store.size': '3.9kb'}, {'health': 'red', 'status': 'open', 'index': '.opendistro_security', 'uuid': 'FXZtZFctTWyEjVt9ZDRqcw', 'pri': '1', 'rep': '0', 'docs.count': None, 'docs.deleted': None, 'store.size': None, 'pri.store.size': None}]
indices shards:
[{'index': '.opensearch-observability', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '0', 'store': '208b', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.241.94.7', 'node': 'opensearch-2'}, {'index': '.opensearch-sap-log-types-config', 'shard': '0', 'prirep': 'p', 'state': 'UNASSIGNED', 'docs': None, 'store': None, 'ip': None, 'node': None}, {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'UNASSIGNED', 'docs': None, 'store': None, 'ip': None, 'node': None}]

Looking at the logs, I can see that:

  • Opensearch/0: stops at: 2024-02-07 14:15:57 DEBUG unit.opensearch/0.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called
  • Opensearch/1: stops at: 2024-02-07 14:15:34 DEBUG unit.opensearch/1.juju-log server.go:325 service:2: Rolling Ops Manager: stop_opensearch called

That means that by the time opensearch/2 runs its routine, the other nodes are already gone and it fails.

This happens because both opensearch/0 and opensearch/1 are deferring their RunWithLock events:

  • Opensearch/0: RunWithLock via OpenSearchOperatorCharm/on/service_run_with_lock[283] appears multiple times
  • Opensearch/1: RunWithLock via OpenSearchOperatorCharm/on/service_run_with_lock[318] appears multiple times

Looking into the rolling-ops code: there is no post-processing to check whether the event has been deferred. It only reacts to an exception, which breaks the entire hook.

I will report a bug against rolling-ops as well, asking for the run-with-lock handler to capture any exceptions and to double-check whether the event has been marked for deferral after the callback returns.

phvalguima added the bug label on Feb 7, 2024

@carlcsaposs-canonical (Contributor)

Discovered the same issue while reviewing the opensearch codebase in preparation for adding in-place upgrades.

After the event is deferred, the lock is immediately released here:

def _on_run_with_lock(self: CharmBase, event: RunWithLock):
    lock = Lock(self)
    self.model.unit.status = MaintenanceStatus("Executing {} operation".format(self.name))
    relation = self.model.get_relation(self.name)
    # default to instance callback if not set
    callback_name = relation.data[self.charm.unit].get(
        "callback_override", self._callback.__name__
    )
    callback = getattr(self.charm, callback_name)
    callback(event)
    lock.release()  # Updates relation data
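
A possible shape for the fix proposed in the issue description, sketched against the handler above (the exception-capture part of the proposal is omitted). It relies on the event's deferred attribute, which ops sets when defer() is called; that is a framework internal, so treat this as illustrative only, not the rolling-ops library's actual code:

def _on_run_with_lock(self: CharmBase, event: RunWithLock):
    lock = Lock(self)
    self.model.unit.status = MaintenanceStatus("Executing {} operation".format(self.name))
    relation = self.model.get_relation(self.name)
    # default to instance callback if not set
    callback_name = relation.data[self.charm.unit].get(
        "callback_override", self._callback.__name__
    )
    callback = getattr(self.charm, callback_name)
    callback(event)
    if getattr(event, "deferred", False):
        # The callback deferred the event: keep holding the lock so the
        # deferred run can finish before other units are allowed to proceed.
        return
    lock.release()  # Updates relation data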

carlcsaposs-canonical added a commit that referenced this issue Apr 17, 2024
## Issue
### Fix issues with current lock implementation
- Fixes #183: units can start/restart without the rollingops lock (if the event
is deferred, the lock is released but the unit (re)starts on the next event anyway).
- A unit can scale down (using the "ops lock") while another unit is
(re)starting (using the separate rollingops lock)

### Prepare for in-place upgrades
- In order to maintain HA, only one unit should start, restart, join
cluster (scale up), leave cluster (scale down), or upgrade at a time
- Upgrades add an additional challenge: if the charm code is upgraded
(in addition to the workload), it is possible for the leader unit to go
into error state, which prevents coordination of locks via the peer databag
(the current rolling ops implementation) and can block rollback (for the other
units)

## Options considered
1. Use peer relation databag for lock
- Unit requests lock by adding data to unit databag -> relation-changed
event triggered on leader
- Leader grants lock by adding data to app databag -> relation-changed
event triggered on unit
    - Unit proceeds & releases lock from unit databag
2. Don't use lock for upgrades. Make upgrade order deterministic (so
that each unit can independently determine upgrade order & upgrade order
does not change) and each unit upgrades when it sees the units before it
have upgraded
3. Use opensearch index/document for lock
4. Option 3 but fallback to option 1 if ~~no units online~~ less than 2
units online

Cons of each option:
1. if leader unit is in error state (raises uncaught exception),
rollback will be blocked for all units
2. a unit can restart for non-upgrade related reasons (role re-balancing
from network cut or scale up) while another unit is restarting to
upgrade
3. doesn't work if all units offline (initial start, full cluster
crash/network cut)
4. implementation complexity, some edge cases not supported e.g.
https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_locking.py#L227-L228

Pros of each option:
1. only one unit will (re)start at a time, for any reason
2. rollback with `juju refresh` will immediately rollback highest unit
even if leader/other units in error state
3. only one unit will (re)start at a time, for any reason and rollback
with `juju refresh` will quickly rollback highest unit even if leader
unit charm code in error state
4. same as 3 and locking (mostly, except aforementioned edge cases)
works when all units offline

More context:
Discussion: https://chat.canonical.com/canonical/pl/9fah5tfxd38ybnx3tq7zugxfyh
(Option 1 in the discussion is option 2 here, and option 2 in the discussion is option 1 here.)

**Option chosen:** Option 4

### Opensearch index vs document for lock
Current "ops lock" implementation with opensearch index:
Each unit requests the lock by trying to create an index. If the index
did not already exist (i.e. the creation succeeds), the "lock" is granted.

However, if a unit requests the lock, the charm goes into error state, and
the error state is later resolved (e.g. after rollback), the unit will not be
able to use the lock: no unit will be aware that it holds the lock and no unit
will be able to release it.

Solution: use document id 0, storing the holder's "unit-name", as the lock
Discussion:
https://chat.canonical.com/canonical/pl/biddxzzk3fbpjgbhmatzr8n6bw
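
A minimal sketch of the document-as-lock idea (not the charm's actual implementation; the index name, credentials handling and helper names are made up for illustration). It relies on OpenSearch's `_create` endpoint, which only succeeds if the document id does not already exist:

import requests

LOCK_INDEX = ".charm-node-lock"  # hypothetical index name
LOCK_DOC_ID = 0                  # single well-known document acts as the lock


def acquire_lock(host: str, unit_name: str, auth, ca_cert) -> bool:
    """Try to take the lock by creating document id 0 with this unit's name.

    PUT <index>/_create/<id> fails with 409 if the document already exists,
    so only one unit can create it; the stored unit name records the holder.
    """
    response = requests.put(
        f"https://{host}:9200/{LOCK_INDEX}/_create/{LOCK_DOC_ID}",
        json={"unit-name": unit_name},
        auth=auth,
        verify=ca_cert,
    )
    if response.status_code == 409:  # another unit already holds the lock
        return False
    response.raise_for_status()
    return True


def release_lock(host: str, auth, ca_cert) -> None:
    """Release the lock by deleting the document (a 404 means it is already gone)."""
    requests.delete(
        f"https://{host}:9200/{LOCK_INDEX}/_doc/{LOCK_DOC_ID}",
        auth=auth,
        verify=ca_cert,
    )

Because the document stores the holder's unit name, a unit whose charm went into error state after acquiring the lock can still recognise and release its own lock once the error is resolved, which is the advantage over the index-only approach described above.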

## Solution
### Design
(Option 4): Use an opensearch document as the lock (for any (re)start, join
cluster, leave cluster, or upgrade). Fall back to the peer databag if all
units are offline.
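
For reference, a rough sketch of the peer-databag fallback flow (option 1), with a hypothetical relation name and key names; the actual charm implementation differs:

from ops.charm import CharmBase

PEER_RELATION = "node-lock-fallback"  # hypothetical peer relation name


def request_lock(charm: CharmBase) -> None:
    """Unit side: request the lock by writing to this unit's own databag."""
    relation = charm.model.get_relation(PEER_RELATION)
    relation.data[charm.unit]["lock-requested"] = "true"


def grant_lock_if_free(charm: CharmBase) -> None:
    """Leader side (on relation-changed): grant the lock to one requesting unit."""
    if not charm.unit.is_leader():
        return
    relation = charm.model.get_relation(PEER_RELATION)
    if relation.data[charm.app].get("lock-granted"):
        return  # a unit already holds the lock
    for unit in relation.units | {charm.unit}:  # peers plus this unit
        if relation.data[unit].get("lock-requested") == "true":
            relation.data[charm.app]["lock-granted"] = unit.name
            return


def release_lock(charm: CharmBase) -> None:
    """Unit side: clear the request; the leader clears the grant on the next
    relation-changed event so another requester can be granted the lock."""
    relation = charm.model.get_relation(PEER_RELATION)
    relation.data[charm.unit]["lock-requested"] = ""  # empty value removes the key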

### Implementation
Create custom events `_StartOpensearch` and `_RestartOpensearch`

https://github.com/canonical/opensearch-operator/blob/b1b28c10f2bd70bee4270707b616d8d565b8b616/lib/charms/opensearch/v0/opensearch_base_charm.py#L121-L132

When opensearch should be (re)started, the custom event is emitted.
The custom event requests the lock; if granted, it (re)starts opensearch.
Once opensearch is fully ready, the lock is released.
If opensearch fails to start, the lock is also released.
While opensearch is starting, or while the lock is not granted, the custom
event is repeatedly deferred.

Note: only the custom event is deferred, never the original event. This is
so that any logic that ran before the request to (re)start opensearch
does not get re-run.
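
A condensed sketch of this pattern, assuming helpers `_acquire_lock`, `_release_lock`, `_start_opensearch_service` and `_opensearch_fully_ready` exist (they are placeholders here, not the charm's real API):

from ops.charm import CharmBase, CharmEvents
from ops.framework import EventBase, EventSource


class _StartOpensearch(EventBase):
    """Request to (re)start opensearch once the lock is granted."""
    # A _RestartOpensearch event would follow the same shape.


class _OpenSearchCharmEvents(CharmEvents):
    start_opensearch = EventSource(_StartOpensearch)


class OpenSearchExampleCharm(CharmBase):
    on = _OpenSearchCharmEvents()

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.start_opensearch, self._on_start_opensearch)

    def _trigger_start(self):
        # Called from the original event handler; the original event itself is
        # never deferred, only the custom event below.
        self.on.start_opensearch.emit()

    def _on_start_opensearch(self, event: _StartOpensearch):
        # Re-attempt to acquire the lock on every run of the (possibly
        # previously deferred) custom event.
        if not self._acquire_lock():            # placeholder helper
            event.defer()
            return
        try:
            self._start_opensearch_service()    # placeholder, assumed idempotent
        except Exception:
            self._release_lock()                # start failed: free the lock
            raise
        if not self._opensearch_fully_ready():  # placeholder readiness check
            event.defer()                       # keep the lock while starting
            return
        self._release_lock()                    # fully ready: free the lock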

By requesting the lock within the custom event, and attempting to reacquire
the lock each time the custom event runs (i.e. after every time it is
deferred), we solve the design issue with rollingops and deferred events
detailed in #183.