Transport potentially blocks at long http response times #7120
Comments
If I understand the issue correctly, I think having the RetryQueue hold a lock whose contention propagates to events dispatched to the state machine is definitely wrong. The state machine should not block, ever. Something important to remember is that gevent only yields control when an async operation is done. Synchronous operations (like appending to a list) don't yield control. Therefore, I think the best way to move forward would be something like:
Before, the retry queue's semaphore had to be acquired in order to enqueue messages. This made the supposedly asynchronous enqueue() method, and all methods calling it, dependent on the queue's greenlet successfully processing all messages in the queue. Effectively this resulted in a synchronous enqueue() call and could cause the state machine to block during long-running requests or retries in the transport layer. The retry queue's lock was removed, since it is not necessary as long as there is only ever one retry-queue instance per channel running, which is ensured by the logic in `_get_retrier()`. Fixes: #7120
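As a rough sketch of the approach the commit message describes (assuming a gevent-based transport; the class name and the `_notify_enqueued` event are illustrative, not the actual Raiden API), `enqueue()` only appends to a plain list and sets an event, so callers never wait on the greenlet that drains the queue:

```python
from gevent.event import Event


class RetryQueueSketch:
    """Illustrative retry queue where enqueue() never blocks on the
    greenlet that drains the queue."""

    def __init__(self, transport):
        self._transport = transport
        self._message_queue = []          # plain list; appending never yields
        self._notify_enqueued = Event()   # wakes up the draining greenlet

    def enqueue(self, message):
        # Synchronous and non-blocking: no lock is taken, so the state
        # machine and the event handler can always hand messages over.
        self._message_queue.append(message)
        self._notify_enqueued.set()

    def _run(self):
        # A single greenlet per queue drains the messages; a slow HTTP
        # request only blocks this greenlet, never callers of enqueue().
        while True:
            self._notify_enqueued.wait()
            self._notify_enqueued.clear()
            while self._message_queue:
                message = self._message_queue.pop(0)
                self._transport.send_to_device(message)  # may be slow
```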
There is a potential issue with retry queues which can block other parts of the system, such as processing messages and handling Raiden events, if http requests to Matrix are slow or rate limited.
Description
The `RetryQueue` has a semaphore (`RetryQueue.lock`) to control adding and removing messages from the queue. This is also described here with the comment.
Enqueuing messages
`check_and_send()`, which ultimately calls the `send_to_device()` method, therefore blocks new messages from being enqueued (see the sketch below). But where do messages get enqueued?
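A simplified sketch of this locking pattern (illustrative only, not the actual Raiden code): both `enqueue()` and `check_and_send()` acquire the same semaphore, so `enqueue()` has to wait for as long as a send is in flight:

```python
from gevent.lock import Semaphore


class BlockingRetryQueueSketch:
    """Simplified illustration of the locking pattern described above."""

    def __init__(self, transport):
        self._transport = transport
        self._message_queue = []
        self.lock = Semaphore()  # guards both enqueuing and sending

    def enqueue(self, message):
        # Has to wait while check_and_send() holds the lock, i.e. for the
        # whole duration of a slow to_device HTTP request.
        with self.lock:
            self._message_queue.append(message)

    def check_and_send(self):
        with self.lock:
            while self._message_queue:
                message = self._message_queue.pop(0)
                # A long http response time keeps the lock held here.
                self._transport.send_to_device(message)
```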
Processing received Raiden messages
Upon processing received messages, `Delivered` messages are created and enqueued into the retry queue for the partner address (RetryQueues are mapped to addresses). This happens in `_process_raiden_messages()`. Afterwards, messages are converted into state changes (if necessary), forwarded to the state machine and applied.
Enqueuing new messages to be sent
State changes create Raiden events which are handled in the `RaidenEventHandler`. An event can be converted into a message which needs to be sent to a partner (e.g. LockedTransferMessage). These messages get enqueued by calling `MatrixTransport.send_async()`, which ultimately calls `RetryQueue.enqueue` again.
Potential blocking event
Imagine a scenario where a `to_device` message has a very long http response time. While the retry queue gets emptied by calling `send_to_device`, it holds the lock so that no new messages can be enqueued in the meanwhile (as described above). As long as the to_device call is ongoing, all other greenlets which enqueue messages are basically blocked trying to acquire the lock.
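A minimal runnable illustration of this scenario, with `gevent.sleep()` standing in for the slow `to_device` request and made-up function names:

```python
import gevent
from gevent.lock import Semaphore

lock = Semaphore()


def slow_send():
    # Stands in for check_and_send() -> send_to_device() while Matrix is
    # slow or rate limiting; the lock stays held for the whole request.
    with lock:
        gevent.sleep(5)  # simulated long http response time


def enqueue(message):
    # Stands in for RetryQueue.enqueue() called from message processing
    # or from the Raiden event handler; it cannot return until
    # slow_send() releases the lock.
    with lock:
        print(f"enqueued {message}")


sender = gevent.spawn(slow_send)
gevent.sleep(0)  # let slow_send() acquire the lock first
producer = gevent.spawn(enqueue, "Delivered")
gevent.joinall([sender, producer])  # enqueue() only returns after ~5 seconds
```

Running this, the enqueue greenlet only prints after the simulated request finishes, mirroring how state-machine and event-handler greenlets pile up behind the lock.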
Processing messages being blocked
If message processing is blocked by the RetryQueue, basically no new state changes will be created and applied until the lock is released (which depends on the http request to Matrix). This means that none of the messages received from a sync can be processed by the state machine. As a consequence, no events are created which could progress the current state.
Handling events being blocked
The other part of the code which acquires the lock is enqueuing messages in the event handler. If this is blocked, all subsequent events are blocked from being processed. These can include interactions with the blockchain and therefore critical operations.
What should be done
I think it is intended that the retry queue should never cause other greenlets to be stuck, especially system-critical ones. Unfortunately, I see that this happens if http responses take a long time. IMHO it is essential to fix this as soon as possible. It could also lead to an increase in payment speed, as we avoid greenlets being stuck by slow http requests to Matrix.
Illustration