Transport potentially blocks at long http response times #7120

Closed
fredo opened this issue Jun 2, 2021 · 1 comment · Fixed by #7126

Comments

fredo (Contributor) commented Jun 2, 2021

There is a potential issue with retry queues: they can block other parts of the system, such as processing messages and handling Raiden events, if HTTP requests to Matrix are slow or rate limited.

Description

The RetryQueue has a semaphore (RetryQueue.lock) to control adding and removing messages from the queue. This is also described in the source with the comment

# once entered the critical section, block any other enqueue or notify attempt

Enqueuing messages

check_and_send(), which ultimately calls the send_to_device() method, therefore blocks any attempt to enqueue new messages while it holds the lock. But where do messages get enqueued?
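A simplified sketch of this locking pattern (class layout and method bodies are illustrative, not the actual Raiden implementation):

```python
# Illustrative sketch only -- not the real Raiden code.  enqueue() and
# check_and_send() share one semaphore, so a slow send_to_device() call
# blocks every enqueue attempt for that partner.
from gevent.lock import Semaphore


class RetryQueueSketch:
    def __init__(self, transport, receiver):
        self._transport = transport
        self._receiver = receiver
        self._message_queue = []
        self.lock = Semaphore()

    def enqueue(self, message):
        # Callers wait here while check_and_send() is inside its critical section.
        with self.lock:
            self._message_queue.append(message)

    def check_and_send(self):
        # once entered the critical section, block any other enqueue or notify attempt
        with self.lock:
            if self._message_queue:
                # The (possibly slow or rate-limited) Matrix HTTP request
                # happens while the lock is held.
                self._transport.send_to_device(self._receiver, self._message_queue)
                del self._message_queue[:]
```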

Processing received Raiden messages

Upon processing received messages, Delivered messages are created and enqueued into the retry queue for the partner address (RetryQueues are mapped to addresses). This happens in _process_raiden_messages(). Afterwards, the messages are converted into state changes (if necessary), forwarded to the state machine and applied.
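In rough pseudocode (the helper names and message shapes below are placeholders, not Raiden's real API), the ordering is:

```python
# Hedged sketch of the receive path; helper names are placeholders.
def process_raiden_messages(retry_queue_for, state_machine, messages):
    for message in messages:
        delivered = {"type": "Delivered", "delivered_message_id": message["id"]}
        # Enqueued into the partner's RetryQueue -- this needs RetryQueue.lock,
        # so it can block while check_and_send() is in flight.
        retry_queue_for(message["sender"]).enqueue(delivered)

    # Only afterwards are the received messages turned into state changes
    # and handed to the state machine.
    for message in messages:
        state_machine.handle_state_changes([{"from_message": message["id"]}])
```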

Enqueuing new messages to be sent

State changes create Raiden events which are handled in the RaidenEventHandler. An event can be converted into a message which needs to be sent to a partner (e.g. LockedTransferMessage). These messages get enqueued by calling MatrixTransport.send_async(), which ultimately calls RetryQueue.enqueue() again.
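Sketched out (reusing the RetryQueueSketch class from above; the method signatures are simplified assumptions, not the real transport API):

```python
# Hedged sketch of the send path: send_async() funnels into
# RetryQueue.enqueue(), competing for the same lock as the receive path.
class MatrixTransportSketch:
    def __init__(self):
        self._address_to_retrier = {}

    def _get_retrier(self, receiver):
        # One RetryQueueSketch per partner address.
        if receiver not in self._address_to_retrier:
            self._address_to_retrier[receiver] = RetryQueueSketch(self, receiver)
        return self._address_to_retrier[receiver]

    def send_async(self, receiver, message):
        self._get_retrier(receiver).enqueue(message)  # may block on RetryQueue.lock

    def send_to_device(self, receiver, messages):
        pass  # stands in for the Matrix to_device HTTP request
```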

Potential blocking event

Imagine a scenario where a to_device message has a very long HTTP response time. While the retry queue is being emptied by send_to_device, it holds the lock so that no new messages can be enqueued in the meantime (as described above). As long as the to_device call is ongoing, all other greenlets which enqueue messages are effectively blocked trying to acquire the lock.
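A minimal gevent example of the effect (gevent.sleep stands in for the slow or rate-limited Matrix request; everything else is illustrative):

```python
# Demo of the blocking effect, assuming the lock usage sketched above.
import gevent
from gevent.lock import Semaphore

lock = Semaphore()
queue = []


def slow_check_and_send():
    with lock:
        print("to_device request in flight ...")
        gevent.sleep(5)  # stands in for a slow or rate-limited Matrix request
        del queue[:]
        print("request finished, lock released")


def enqueue(message):
    print("state machine / event handler wants to enqueue:", message)
    with lock:  # stuck here for the whole five seconds
        queue.append(message)
        print("enqueued:", message)


gevent.joinall([
    gevent.spawn(slow_check_and_send),
    gevent.spawn(enqueue, "Delivered"),
])
```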

Processing messages being blocked

If message processing is blocked by the RetryQueue, no new state changes will be created and applied until the lock is released (which depends on the HTTP request to Matrix). This means that none of the messages received from a sync can be processed by the state machine. As a consequence, no events are created which could progress the current state.

Handling events being blocked

The other part of the code which acquires the lock is the enqueuing of messages in the event handler. If this is blocked, all subsequent events are blocked from being processed. These could include interactions with the blockchain, and are therefore potentially critical.

What should be done

I think it is intended that the retry queue should never cause other greenlets to be stuck, especially system-critical ones. Unfortunately, this is exactly what happens when HTTP responses take a long time. IMHO it is essential to fix this as soon as possible. The fix could also increase payment speed, since greenlets would no longer be stuck behind slow HTTP requests to Matrix.

Illustration

[attached sequence diagram: 1622636287407]

andrevmatos (Contributor) commented Jun 2, 2021

If I understand the issue correctly, I think having the RetryQueue hold a lock which ends up blocking events on their way to the state machine is definitely wrong. The state machine should not block, ever. Something important to remember is that gevent only yields control when an async operation is performed. Synchronous operations (like appending to a list) don't yield control. Therefore, I think the best way to move forward would be something like:

  • RetryQueue may still have a lock, but it's used only internally, to prevent multiple parallel send_to_device
    • If you can guarantee only a single RetryQueue instance (e.g. it runs with a for loop), then there's no need for this lock either
  • The first thing it does is make a shallow copy of the messages list and clear it. This is synchronous, and therefore there's no need for a lock protecting the list itself or anything which happens elsewhere (see the sketch after this list).
  • The RetryQueue operations are then done over this shallow copy, which guarantees exclusive access.
  • Enqueueing messages doesn't need a lock; it's also a synchronous operation, so you can just append to the list and be sure it won't block anything.
  • If the RetryQueue happens to need to re-queue messages, it can just push them to the front before releasing its lock (which may be used for the next iteration of the RetryQueue), as this is also a synchronous call.
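A minimal sketch of that approach (assuming a single RetryQueue worker greenlet per partner address; names are illustrative, not the eventual fix):

```python
# Lock-free sketch of the proposal: enqueue() is a plain, synchronous append
# (gevent cannot switch greenlets mid-append), and the worker takes batches
# via a synchronous copy-and-clear.
import gevent


class LockFreeRetryQueueSketch:
    def __init__(self, transport, receiver):
        self._transport = transport
        self._receiver = receiver
        self._pending = []

    def enqueue(self, message):
        # Synchronous append: never blocks the state machine or event handler.
        self._pending.append(message)

    def _check_and_send(self):
        if not self._pending:
            return
        # Shallow copy + clear is synchronous, so this worker gets exclusive
        # access to the batch without locking the list.
        batch = self._pending[:]
        del self._pending[:]
        try:
            self._transport.send_to_device(self._receiver, batch)
        except Exception:
            # Push failed messages back to the front, also synchronously.
            self._pending[:0] = batch

    def run(self):
        # Single worker greenlet per partner address.
        while True:
            self._check_and_send()
            gevent.sleep(1)  # retry interval placeholder
```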

ezdac self-assigned this Jun 3, 2021
ezdac added a commit to ezdac/raiden that referenced this issue Jun 3, 2021
ezdac added a commit to ezdac/raiden that referenced this issue Jun 4, 2021
ulope pushed a commit that referenced this issue Jun 4, 2021
Before, the retry queue's semaphore had to be acquired in order to
enqueue messages.
This caused the supposedly asynchronous enqueue() method, and all methods
calling it, to depend on successfully processing all messages in the
greenlet that processes the queue.

Effectively this resulted in a synchronous enqueue() call and could
cause the state-machine to block during long-running requests or retries
to the transport layer.

The retry queue's lock was removed, since it was not necessary at all
as long as there is only ever one retry-queue instance per channel
running, which is the case due to the logic in `_get_retrier()`.

Fixes: #7120