_start_opensearch does not catch OpenSearchUserMgmtError #164

Closed · phvalguima opened this issue Jan 24, 2024 · 5 comments
Labels: bug (Something isn't working)

@phvalguima (Contributor):
There seems to be a race condition in _post_start_init: some of the CI tests fail with OpenSearchUserMgmtError, which is raised at the end of the stack trace below.

Side note: _post_start_init should be renamed to something else, as it runs as the very first step of _start_opensearch.
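
For illustration, a minimal sketch of the kind of handling the title asks for, assuming a defer-and-retry strategy; the class and function below are stand-ins that only mirror names from the stack trace, not the charm's actual code:

```python
import logging

logger = logging.getLogger(__name__)


class OpenSearchUserMgmtError(Exception):
    """Stand-in for charms.opensearch.v0.opensearch_users.OpenSearchUserMgmtError."""


def start_opensearch(post_start_init, event) -> None:
    """Sketch: run the post-start steps without letting user-management errors escape."""
    try:
        post_start_init()
    except OpenSearchUserMgmtError as err:
        # The .opensearch-security index may be briefly unreachable while a
        # peer unit restarts; defer the event and retry later instead of
        # letting the hook fail with an uncaught exception.
        logger.warning("Post-start init failed, deferring: %s", err)
        event.defer()
```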

Stack trace: https://pastebin.ubuntu.com/p/WRMY3FnZCc/

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 276, in call
    response = s.request(**request_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.206.26.175', port=9200): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_users.py", line 137, in get_users
    return self.opensearch.request("GET", f"{USER_ENDPOINT}/")
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 297, in request
    resp = call(retries, resp_status_code)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 285, in call
    return call(
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 238, in call
    raise OpenSearchHttpError()
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/./src/charm.py", line 94, in <module>
    main(OpenSearchOperatorCharm)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/main.py", line 434, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/framework.py", line 863, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/rolling_ops/v0/rollingops.py", line 410, in _on_run_with_lock
    callback(event)
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 618, in _start_opensearch
    self._post_start_init()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 720, in _post_start_init
    self._put_monitoring_user()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_base_charm.py", line 840, in _put_monitoring_user
    users = self.user_manager.get_users()
  File "/var/lib/juju/agents/unit-opensearch-4/charm/lib/charms/opensearch/v0/opensearch_users.py", line 139, in get_users
    raise OpenSearchUserMgmtError(e)
charms.opensearch.v0.opensearch_users.OpenSearchUserMgmtError
2024-01-24 08:12:51 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
phvalguima added the bug label and updated the title on Jan 24, 2024.

@phvalguima (Contributor, Author) commented Jan 24, 2024:
So, I did some more digging on this issue.

In a nutshell, I believe we try to access the .opensearch-security index, which is hosted on unit opensearch/3, at the moment that unit is restarting.

The main question here is where the valid replicas of .opensearch-security were at the moment _put_monitoring_user() is called from _post_start_init.
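
As a reference point, one way to ask the cluster where those shards live, sketched here as a direct REST call; the host, credentials, and TLS handling are placeholders for this test deployment, not charm code:

```python
import requests

# Ask _cat/shards which nodes currently hold .opensearch-security.
# Host and credentials are placeholders for this deployment.
resp = requests.get(
    "https://10.206.26.175:9200/_cat/shards/.opensearch-security?format=json",
    auth=("admin", "<admin-password>"),
    verify=False,  # the test deployment uses self-signed certificates
    timeout=5,
)
for shard in resp.json():
    # Each entry reports the shard number, primary/replica flag, state, and node.
    print(shard["shard"], shard["prirep"], shard["state"], shard.get("node"))
```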


There are 3x OpenSearch nodes:

Model              Controller           Cloud/Region         Version  SLA          Timestamp
test-backups-xhv5  localhost-localhost  localhost/localhost  3.1.7    unsupported  11:38:45+01:00

App                       Version  Status  Scale  Charm                     Channel      Rev  Exposed  Message
opensearch                         active      3  opensearch                               1  no       
s3-integrator                      active      1  s3-integrator             latest/edge   13  no       
self-signed-certificates           active      1  self-signed-certificates  latest/edge   60  no       

Unit                         Workload  Agent  Machine  Public address  Ports  Message
opensearch/3*                active    idle   5        10.206.26.226          
opensearch/4                 active    idle   6        10.206.26.175          
opensearch/5                 active    idle   7        10.206.26.254          
s3-integrator/0*             active    idle   0        10.206.26.246          
self-signed-certificates/0*  active    idle   4        10.206.26.139          

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.206.26.246  juju-c72925-0  [email protected]      Running
4        started  10.206.26.139  juju-c72925-4  [email protected]      Running
5        started  10.206.26.226  juju-c72925-5  [email protected]      Running
6        started  10.206.26.175  juju-c72925-6  [email protected]      Running
7        started  10.206.26.254  juju-c72925-7  [email protected]      Running

The node having trouble is opensearch/4, which starts failing to reach the internalusers API at 08:12:15:

2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "GET /_cluster/settings HTTP/1.1" 200 182
2024-01-24 08:12:15 ERROR unit.opensearch/4.juju-log server.go:325 [Errno 111] Connection refused
2024-01-24 08:12:15 ERROR unit.opensearch/4.juju-log server.go:325 [Errno 111] Connection refused
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "PUT /_cluster/settings HTTP/1.1" 200 125
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "PUT /.plugins-ml-config/_settings HTTP/1.1" 200 21
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "PUT /.opensearch-sap-log-types-config/_settings HTTP/1.1" 200 21
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "PUT /.opensearch-sap-pre-packaged-rules-config/_settings HTTP/1.1" 404 575
2024-01-24 08:12:15 ERROR unit.opensearch/4.juju-log server.go:325 [Errno 111] Connection refused
2024-01-24 08:12:15 ERROR unit.opensearch/4.juju-log server.go:325 [Errno 111] Connection refused
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 https://10.206.26.175:9200 "GET /_cluster/health HTTP/1.1" 200 447
2024-01-24 08:12:15 INFO unit.opensearch/4.juju-log server.go:325 Health: {'cluster_name': 'opensearch-5raf', 'status': 'red', 'timed_out': False, 'number_of_nodes': 1, 'number_of_data_nodes': 1, 'discovered_master': True, 'discovered_cluster_manager': True, 'activ>
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Getting secret app:admin-password
2024-01-24 08:12:15 DEBUG unit.opensearch/4.juju-log server.go:325 Starting new HTTPS connection (1): 10.206.26.175:9200
2024-01-24 08:12:20 ERROR unit.opensearch/4.juju-log server.go:325 Request GET to https://10.206.26.175:9200/_plugins/_security/api/internalusers/ with payload: None failed. (Attempts left: 0)
HTTPSConnectionPool(host='10.206.26.175', port=9200): Read timed out. (read timeout=5)
2024-01-24 08:12:21 ERROR unit.opensearch/4.juju-log server.go:325 Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-opensearch-4/charm/venv/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from

###### THE STACK TRACE MENTIONED IN THIS BUG STARTS HERE

In the logs above, it is possible to see that multiple calls actually go through, but eventually the GET https://10.206.26.175:9200/_plugins/_security/api/internalusers/ fails. Also, the cluster health status is reported as "red".
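
One way the charm could guard against exactly this window, sketched under the assumption that waiting for the security index to reach at least yellow health is acceptable; the helper below is hypothetical, not the charm's code:

```python
import requests


def security_index_usable(host: str, auth: tuple) -> bool:
    """Return True once .opensearch-security has at least yellow health."""
    resp = requests.get(
        f"https://{host}:9200/_cluster/health/.opensearch-security",
        params={"wait_for_status": "yellow", "timeout": "10s"},
        auth=auth,
        verify=False,  # self-signed certificates in this test environment
        timeout=15,
    )
    return resp.ok and not resp.json().get("timed_out", True)
```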

Looking at the logs of opensearch/4, I can see the unit loses contact with opensearch/3 and eventually decides to remove it from the cluster:

[2024-01-24T08:11:37,496][WARN ][o.o.c.NodeConnectionsService] [opensearch-4] failed to connect to {opensearch-3}{nWI1aoofTkasg0WYc3Wc1A}{mytOU1xoSYWg5xJOS-CAlg}{10.206.26.226}{10.206.26.226:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true} 
org.opensearch.transport.ConnectTransportException: [opensearch-3][10.206.26.226:9300] connect_exception
        at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1074) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.core.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:217) ~[opensearch-core-2.10.0.jar:2.10.0]
        at org.opensearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:57) ~[opensearch-common-2.10.0.jar:2.10.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
        at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
        at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) ~[?:?]
        at org.opensearch.common.concurrent.CompletableContext.addListener(CompletableContext.java:60) ~[opensearch-common-2.10.0.jar:2.10.0]
        at org.opensearch.transport.netty4.Netty4TcpChannel.addConnectListener(Netty4TcpChannel.java:135) ~[?:?]
        at org.opensearch.transport.TcpTransport.initiateConnection(TcpTransport.java:378) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.transport.TcpTransport.openConnection(TcpTransport.java:338) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.transport.ClusterConnectionManager.internalOpenConnection(ClusterConnectionManager.java:274) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:157) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.transport.TransportService.connectToNode(TransportService.java:493) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.transport.TransportService.connectToNode(TransportService.java:473) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.cluster.NodeConnectionsService$ConnectionTarget$1.doRun(NodeConnectionsService.java:344) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at org.opensearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:157) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.cluster.service.ClusterApplierService.connectToNodesAndWait(ClusterApplierService.java:583) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:548) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) ~[opensearch-2.10.0.jar:2.10.0]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) ~[opensearch-2.10.0.jar:2.10.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.206.26.226/10.206.26.226:9300
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        ... 1 more

Then, the unit is removed:

[2024-01-24T08:11:37,582][INFO ][o.o.c.s.ClusterApplierService] [opensearch-4] removed {{opensearch-3}{nWI1aoofTkasg0WYc3Wc1A}{mytOU1xoSYWg5xJOS-CAlg}{10.206.26.226}{10.206.26.226:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}}, term: 2, version: 44, reason: Publication{term=2, version=44}

Now, looking at unit opensearch/3, the unit is indeed restarted at that time:

[2024-01-24T08:11:37,037][INFO ][o.o.d.PeerFinder         ] [opensearch-3] setting findPeersInterval to [1s] as node commission status = [true] for local node [{opensearch-3}{nWI1aoofTkasg0WYc3Wc1A}{mytOU1xoSYWg5xJOS-CAlg}{10.206.26.226}{10.206.26.226:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}]
[2024-01-24T08:11:37,131][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-3] Detected cluster change event for destination migration
[2024-01-24T08:11:37,212][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-3] Detected cluster change event for destination migration
[2024-01-24T08:11:37,345][INFO ][o.o.i.r.RecoverySourceHandler] [opensearch-3] [.opensearch-observability][0][recover to opensearch-4] finalizing recovery took [44.6ms]
[2024-01-24T08:11:37,351][INFO ][o.o.i.r.RecoverySourceHandler] [opensearch-3] [.plugins-ml-config][0][recover to opensearch-4] finalizing recovery took [22.7ms]
Stopping Service for snap application opensearch.daemon...
[2024-01-24T08:11:37,398][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-3] Detected cluster change event for destination migration
[2024-01-24T08:11:37,405][INFO ][o.o.n.Node               ] [opensearch-3] stopping ...
[2024-01-24T08:11:37,421][INFO ][o.o.s.a.r.AuditMessageRouter] [opensearch-3] Closing AuditMessageRouter
[2024-01-24T08:11:37,422][INFO ][o.o.s.a.s.SinkProvider   ] [opensearch-3] Closing DebugSink
[2024-01-24T08:11:37,425][INFO ][o.o.c.c.Coordinator      ] [opensearch-3] cluster-manager node [{opensearch-4}{CxJCO-5xTRiPaEnbQk2p2w}{ELLVDYRMSAqC9rpXIzr0Qg}{10.206.26.175}{10.206.26.175:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.transport.NodeDisconnectedException: [opensearch-4][10.206.26.175:9300][disconnected] disconnected
[2024-01-24T08:11:37,500][INFO ][o.o.n.Node               ] [opensearch-3] stopped
[2024-01-24T08:11:37,501][INFO ][o.o.n.Node               ] [opensearch-3] closing ...
[2024-01-24T08:11:37,515][INFO ][o.o.s.a.i.AuditLogImpl   ] [opensearch-3] Closing AuditLogImpl
[2024-01-24T08:11:37,527][INFO ][o.o.n.Node               ] [opensearch-3] closed

and only comes back at 08:12:30:

[2024-01-24T08:12:30,154][INFO ][o.o.n.Node               ] [opensearch-3] started
[2024-01-24T08:12:30,155][INFO ][o.o.s.OpenSearchSecurityPlugin] [opensearch-3] Node started

So, PUT /<index-name> requests will work, since they create new indices, but a GET on a node that does not hold a replica of the given index will fail.

@Mehdi-Bendriss (Contributor) commented:

> So, PUT /<index-name> commands will work, as it means creating new indices, but GET on a node that does not hold a replica of a given index will fail.

This is not quite accurate; search requests are restricted to the local node only if ?_only_local=true is added.
I believe we must continue investigating.

@phvalguima (Contributor, Author) replied:

Hi @Mehdi-Bendriss, this GET is not a search, but rather a GET against one of the plugins' APIs. The actual call that breaks is:

GET /_plugins/_security/api/internalusers

Which is present here

And this log line:

2024-01-24 08:12:20 ERROR unit.opensearch/4.juju-log server.go:325 Request GET to https://10.206.26.175:9200/_plugins/_security/api/internalusers/ with payload: None failed. (Attempts left: 0)

Does not specify any parameters, at least none injected into the URL.
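
For what it's worth, a minimal sketch of how this call could tolerate a briefly unreachable security index; the exception classes, retry count, and backoff below are stand-ins and assumptions, not the charm library's actual code:

```python
import time


class OpenSearchHttpError(Exception):
    """Stand-in for the charm library's HTTP-level error."""


class OpenSearchUserMgmtError(Exception):
    """Stand-in for the charm library's user-management error."""


def get_users(request, attempts: int = 3, backoff: float = 5.0) -> dict:
    """Retry the internalusers GET a few times before converting the error."""
    for attempt in range(attempts):
        try:
            return request("GET", "/_plugins/_security/api/internalusers/")
        except OpenSearchHttpError as err:
            if attempt == attempts - 1:
                raise OpenSearchUserMgmtError("failed to list users") from err
            time.sleep(backoff)  # give a restarting peer time to come back
```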

phvalguima added a commit that referenced this issue Jan 27, 2024
The main goal is to have a more controlled stop routine, which
checks whether all shards belonging to the given node have been
reassigned, and fails otherwise. That behavior can also be
disabled with an option: soft=False.

The option is set to True by default.

Fixes-Bug: #164
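
For context, an illustrative sketch of the idea described in the commit message; the function name, REST calls, and timeouts below are assumptions, not the actual patch:

```python
import time

import requests


def wait_for_shards_to_drain(host, node_name, auth, soft=True, timeout=300.0, poll=5.0):
    """Before stopping, wait until no shards remain allocated on this node."""
    if not soft:
        return  # soft=False: stop immediately without checking shard placement
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        shards = requests.get(
            f"https://{host}:9200/_cat/shards?format=json",
            auth=auth,
            verify=False,  # self-signed certificates in this test environment
            timeout=5,
        ).json()
        if not any(s.get("node") == node_name for s in shards):
            return  # nothing left on this node; safe to stop the service
        time.sleep(poll)
    raise TimeoutError(f"shards still allocated on {node_name} after {timeout}s")
```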