From 45ee1966ba32c5fad272f14dabe22758557f8070 Mon Sep 17 00:00:00 2001 From: Balazs Gibizer Date: Tue, 19 Sep 2023 16:12:17 +0200 Subject: [PATCH] Use heartbeat_in_pthread=false for nova-api With heartbeat_in_pthread=true we saw that sometimes nova-api hangs. If the amqp logging was also enabled then the hang could be reproduced with any request that needed rabbitmq connection. It seems that the root cause is that the native thread running the heartbeat tries to log but the oslo.log stack was initialized in the main thread where locking was monkey patched by eventlet. The oslo.log wants to take a lock but taking an eventlet lock from another native thread is not supported by eventlet and the heartbeat thread hangs at: ``` /usr/lib/python3.9/site-packages/oslo_messaging/_drivers/impl_rabbit.py:782 in __init__ `LOG.debug('[%(connection_id)s] Connected to AMQP server on '` /usr/lib64/python3.9/logging/__init__.py:1800 in debug `self.log(DEBUG, msg, *args, **kwargs)` /usr/lib64/python3.9/logging/__init__.py:1844 in log `self.logger.log(level, msg, *args, **kwargs)` /usr/lib64/python3.9/logging/__init__.py:1512 in log `self._log(level, msg, args, **kwargs)` /usr/lib64/python3.9/logging/__init__.py:1589 in _log `self.handle(record)` /usr/lib64/python3.9/logging/__init__.py:1599 in handle `self.callHandlers(record)` /usr/lib64/python3.9/logging/__init__.py:1661 in callHandlers `hdlr.handle(record)` /usr/lib64/python3.9/logging/__init__.py:950 in handle `self.acquire()` /usr/lib64/python3.9/logging/__init__.py:901 in acquire `self.lock.acquire()` /usr/lib64/python3.9/threading.py:160 in acquire `rc = self._block.acquire(blocking, timeout)` /usr/lib/python3.9/site-packages/eventlet/semaphore.py:115 in acquire `hubs.get_hub().switch()` /usr/lib/python3.9/site-packages/eventlet/hubs/hub.py:313 in switch `return self.greenlet.switch()` ``` while the main thread prints: ``` Traceback (most recent call last): File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers timer() File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__ cb(*args, **kw) File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 147, in _do_acquire waiter.switch() greenlet.error: cannot switch to a different thread ``` There is an oslo.log fix that might be able to help https://review.opendev.org/c/openstack/oslo.log/+/852443 but it is not backported to antelop. So this PR moves that heartbeat back to the main thread into a greenlet. This has a negative side effect of the heartbeat being mostly non effective as apache froze the main thread if no request is being handled. So the heartbeat greenlet is only scheduled when nova-api is actively handling a request and does some IO so that eventlet can switch greenlet execution. When the heartbeat times out due to this, the rabbit connection is dropped and nova-api needs to reconnect to rabbit during the next request. This does not break any request handling but produce extra INFO logs due to the rabbit reconnect. --- templates/nova.conf | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/templates/nova.conf b/templates/nova.conf index d180ea391..64fb37212 100644 --- a/templates/nova.conf +++ b/templates/nova.conf @@ -49,7 +49,10 @@ amqp_auto_delete=false # rabbit_quorum_queue=true {{/*we might just want to make this always false*/}} {{ if eq .service_name "nova-api"}} -heartbeat_in_pthread=true +# We cannot set this to true while is +# https://review.opendev.org/c/openstack/oslo.log/+/852443 is not used in the +# nova-api image otherwise logging from the heartbeat thread will cause hangs. +heartbeat_in_pthread=false {{else}} heartbeat_in_pthread=false {{end}}