Transport potentially blocks at long http response times #7120

Closed
fredo opened this issue Jun 2, 2021 · 1 comment · Fixed by #7126

Comments

fredo (Contributor) commented Jun 2, 2021

There is a potential issue with retry queues: they can block other parts of the system, such as processing messages and handling Raiden events, if HTTP requests to Matrix are slow or rate limited.

Description

The RetryQueue has a semaphore (RetryQueue.lock) to control adding and removing messages from the queue. This is also described in the source with the comment

# once entered the critical section, block any other enqueue or notify attempt

Enqueuing messages

check_and_send(), which ultimately calls the send_to_device() method, therefore blocks any attempt to enqueue new messages while it holds the lock. But where do messages get enqueued?
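A simplified sketch of this locking pattern (class layout and method bodies are illustrative, not the actual Raiden implementation):

```python
# Illustrative sketch only -- not the real Raiden code.  enqueue() and
# check_and_send() share one semaphore, so a slow send_to_device() call
# blocks every enqueue attempt for that partner.
from gevent.lock import Semaphore


class RetryQueueSketch:
    def __init__(self, transport, receiver):
        self._transport = transport
        self._receiver = receiver
        self._message_queue = []
        self.lock = Semaphore()

    def enqueue(self, message):
        # Callers wait here while check_and_send() is inside its critical section.
        with self.lock:
            self._message_queue.append(message)

    def check_and_send(self):
        # once entered the critical section, block any other enqueue or notify attempt
        with self.lock:
            if self._message_queue:
                # The (possibly slow or rate-limited) Matrix HTTP request
                # happens while the lock is held.
                self._transport.send_to_device(self._receiver, self._message_queue)
                del self._message_queue[:]
```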

Processing received Raiden messages

Upon processing received messages, Delivered messages are created and enqueued into the retry queue for the partner address (RetryQueues are mapped to addresses). This happens in _process_raiden_messages(). Afterwards, the messages are converted into state changes (if necessary), forwarded to the state machine and applied.
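In rough pseudocode (the helper names and message shapes below are placeholders, not Raiden's real API), the ordering is:

```python
# Hedged sketch of the receive path; helper names are placeholders.
def process_raiden_messages(retry_queue_for, state_machine, messages):
    for message in messages:
        delivered = {"type": "Delivered", "delivered_message_id": message["id"]}
        # Enqueued into the partner's RetryQueue -- this needs RetryQueue.lock,
        # so it can block while check_and_send() is in flight.
        retry_queue_for(message["sender"]).enqueue(delivered)

    # Only afterwards are the received messages turned into state changes
    # and handed to the state machine.
    for message in messages:
        state_machine.handle_state_changes([{"from_message": message["id"]}])
```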

Enqueuing new messages to be sent

State changes create Raiden events which are handled in the RaidenEventHandler. An event can be converted into a message which needs to be sent to a partner (e.g. LockedTransferMessage). These messages get enqueued by calling MatrixTransport.send_async(), which ultimately calls RetryQueue.enqueue() again.
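Sketched out (reusing the RetryQueueSketch class from above; the method signatures are simplified assumptions, not the real transport API):

```python
# Hedged sketch of the send path: send_async() funnels into
# RetryQueue.enqueue(), competing for the same lock as the receive path.
class MatrixTransportSketch:
    def __init__(self):
        self._address_to_retrier = {}

    def _get_retrier(self, receiver):
        # One RetryQueueSketch per partner address.
        if receiver not in self._address_to_retrier:
            self._address_to_retrier[receiver] = RetryQueueSketch(self, receiver)
        return self._address_to_retrier[receiver]

    def send_async(self, receiver, message):
        self._get_retrier(receiver).enqueue(message)  # may block on RetryQueue.lock

    def send_to_device(self, receiver, messages):
        pass  # stands in for the Matrix to_device HTTP request
```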

Potential blocking event

Imagine a scenario where a to_device message has a very long HTTP response time. While the retry queue is being emptied by send_to_device, it holds the lock so that no new messages can be enqueued in the meantime (as described above). As long as the to_device call is ongoing, all other greenlets which enqueue messages are effectively blocked trying to acquire the lock.
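A minimal gevent example of the effect (gevent.sleep stands in for the slow or rate-limited Matrix request; everything else is illustrative):

```python
# Demo of the blocking effect, assuming the lock usage sketched above.
import gevent
from gevent.lock import Semaphore

lock = Semaphore()
queue = []


def slow_check_and_send():
    with lock:
        print("to_device request in flight ...")
        gevent.sleep(5)  # stands in for a slow or rate-limited Matrix request
        del queue[:]
        print("request finished, lock released")


def enqueue(message):
    print("state machine / event handler wants to enqueue:", message)
    with lock:  # stuck here for the whole five seconds
        queue.append(message)
        print("enqueued:", message)


gevent.joinall([
    gevent.spawn(slow_check_and_send),
    gevent.spawn(enqueue, "Delivered"),
])
```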

Processing messages being blocked

If message processing is blocked by the RetryQueue, no new state changes will be created and applied until the lock is released (which depends on the HTTP request to Matrix). This means that none of the messages received from a sync can be processed by the state machine. As a consequence, no events are created which could progress the current state.

Handling events being blocked

The other part of the code which acquires the lock is the enqueuing of messages in the event handler. If this is blocked, all subsequent events are blocked from being processed. These could include interactions with the blockchain, and are therefore potentially critical.

What should be done

I think it is intended that the retry queue should never cause other greenlets to be stuck, especially system-critical ones. Unfortunately, this is exactly what happens when HTTP responses take a long time. IMHO it is essential to fix this as soon as possible. The fix could also increase payment speed, since greenlets would no longer be stuck behind slow HTTP requests to Matrix.

Illustration

[attached sequence diagram: 1622636287407]

andrevmatos (Contributor) commented Jun 2, 2021

If I understand the issue correctly, I think having the RetryQueue hold a lock which ends up blocking events on their way to the state machine is definitely wrong. The state machine should not block, ever. Something important to remember is that gevent only yields control when an async operation is performed. Synchronous operations (like appending to a list) don't yield control. Therefore, I think the best way to move forward would be something like:

  • RetryQueue may still have a lock, but it's used only internally, to prevent multiple parallel send_to_device
    • If you can guarantee only a single RetryQueue instance (e.g. it runs with a for loop), then there's no need for this lock either
  • The first thing it does is make a shallow copy of the messages list and clear it. This is synchronous, and therefore there's no need for a lock protecting the list itself or anything which happens elsewhere (see the sketch after this list).
  • The RetryQueue operations are then done over this shallow copy, which guarantees exclusive access.
  • Enqueueing messages doesn't need a lock; it's also a synchronous operation, so you can just append to the list and be sure it won't block anything.
  • If the RetryQueue happens to need to re-queue messages, it can just push them to the front before releasing its lock (which may be used for the next iteration of the RetryQueue), as this is also a synchronous call.
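A minimal sketch of that approach (assuming a single RetryQueue worker greenlet per partner address; names are illustrative, not the eventual fix):

```python
# Lock-free sketch of the proposal: enqueue() is a plain, synchronous append
# (gevent cannot switch greenlets mid-append), and the worker takes batches
# via a synchronous copy-and-clear.
import gevent


class LockFreeRetryQueueSketch:
    def __init__(self, transport, receiver):
        self._transport = transport
        self._receiver = receiver
        self._pending = []

    def enqueue(self, message):
        # Synchronous append: never blocks the state machine or event handler.
        self._pending.append(message)

    def _check_and_send(self):
        if not self._pending:
            return
        # Shallow copy + clear is synchronous, so this worker gets exclusive
        # access to the batch without locking the list.
        batch = self._pending[:]
        del self._pending[:]
        try:
            self._transport.send_to_device(self._receiver, batch)
        except Exception:
            # Push failed messages back to the front, also synchronously.
            self._pending[:0] = batch

    def run(self):
        # Single worker greenlet per partner address.
        while True:
            self._check_and_send()
            gevent.sleep(1)  # retry interval placeholder
```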

ezdac self-assigned this Jun 3, 2021
ezdac added a commit to ezdac/raiden that referenced this issue Jun 3, 2021
ezdac added a commit to ezdac/raiden that referenced this issue Jun 4, 2021
ulope pushed a commit that referenced this issue Jun 4, 2021
Before, the retry queue's semaphore had to be acquired in order to
enqueue messages.
This caused the supposedly asynchronous enqueue() method, and all methods
calling it, to depend on successfully processing all messages in the
greenlet that processes the queue.

Effectively this resulted in a synchronous enqueue() call and could
cause the state-machine to block during long-running requests or retries
to the transport layer.

The retry queue's lock was removed, since it was not necessary at all
as long as there is only ever one retry-queue instance per channel
running, which is the case due to the logic in `_get_retrier()`.

Fixes: #7120