Skip to content

Commit

Permalink
Use heartbeat_in_pthread=false for nova-api
Browse files Browse the repository at this point in the history
With heartbeat_in_pthread=true we saw that sometimes nova-api hangs. If
the amqp logging was also enabled then the hang could be reproduced with
any request that needed rabbitmq connection.

It seems that the root cause is that the native thread running the
heartbeat tries to log but the oslo.log stack was initialized in the
main thread where locking was monkey patched by eventlet. The oslo.log
wants to take a lock but taking an eventlet lock from another native
thread is not supported by eventlet and the heartbeat thread hangs at:

```
/usr/lib/python3.9/site-packages/oslo_messaging/_drivers/impl_rabbit.py:782 in __init__
    `LOG.debug('[%(connection_id)s] Connected to AMQP server on '`

/usr/lib64/python3.9/logging/__init__.py:1800 in debug
    `self.log(DEBUG, msg, *args, **kwargs)`

/usr/lib64/python3.9/logging/__init__.py:1844 in log
    `self.logger.log(level, msg, *args, **kwargs)`

/usr/lib64/python3.9/logging/__init__.py:1512 in log
    `self._log(level, msg, args, **kwargs)`

/usr/lib64/python3.9/logging/__init__.py:1589 in _log
    `self.handle(record)`

/usr/lib64/python3.9/logging/__init__.py:1599 in handle
    `self.callHandlers(record)`

/usr/lib64/python3.9/logging/__init__.py:1661 in callHandlers
    `hdlr.handle(record)`

/usr/lib64/python3.9/logging/__init__.py:950 in handle
    `self.acquire()`

/usr/lib64/python3.9/logging/__init__.py:901 in acquire
    `self.lock.acquire()`

/usr/lib64/python3.9/threading.py:160 in acquire
    `rc = self._block.acquire(blocking, timeout)`

/usr/lib/python3.9/site-packages/eventlet/semaphore.py:115 in acquire
    `hubs.get_hub().switch()`

/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py:313 in switch
    `return self.greenlet.switch()`
```

while the main thread prints:

```
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
    timer()
  File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
greenlet.error: cannot switch to a different thread
```

There is an oslo.log fix that might be able to help
https://review.opendev.org/c/openstack/oslo.log/+/852443 but it is not
backported to antelop.

So this PR moves that heartbeat back to the main thread into a greenlet.
This has a negative side effect of the heartbeat being mostly
non effective as apache froze the main thread if no request is being
handled. So the heartbeat greenlet is only scheduled when nova-api is
actively handling a request and does some IO so that eventlet can switch
greenlet execution. When the heartbeat times out due to this, the rabbit
connection is dropped and nova-api needs to reconnect to rabbit during
the next request. This does not break any request handling but produce
extra INFO logs due to the rabbit reconnect.
  • Loading branch information
gibizer authored and openshift-merge-robot committed Sep 22, 2023
1 parent 4a535c8 commit a862ac1
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion templates/nova.conf
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,10 @@ amqp_auto_delete=false
# rabbit_quorum_queue=true
{{/*we might just want to make this always false*/}}
{{ if eq .service_name "nova-api"}}
heartbeat_in_pthread=true
# We cannot set this to true while is
# https://review.opendev.org/c/openstack/oslo.log/+/852443 is not used in the
# nova-api image otherwise logging from the heartbeat thread will cause hangs.
heartbeat_in_pthread=false
{{else}}
heartbeat_in_pthread=false
{{end}}
Expand Down

0 comments on commit a862ac1

Please sign in to comment.