
rpc/transport: fix temporary dispatch loop stalls #11023

Merged
merged 4 commits into redpanda-data:dev from transport_debug on Jun 3, 2023

Conversation

bharathv
Contributor

@bharathv bharathv commented May 25, 2023

Currently, the way the code is structured, a temporary dispatch stall can
time out a bunch of requests.

Consider a request queue of sequence numbers [5, 6, 7, 8]

Let's assume the dispatch of seq=6 got delayed. This can happen if the timeout
is low or there is an unexpected delay from the scheduler (e.g. debug builds).

_last_seq=4, queue = [5, 7, 8] - 5 is dispatched right away
_last_seq=5, queue = [7,8] -- out of order.

Dispatch of 7 and 8 doesn't happen right away because their sequence numbers
are out of order, and we never do a dispatch with seq=6 because we detect it
has already timed out.

It now takes a new RPC request to clear the stalled queue, but that may not
happen for a while (e.g. until the queued RPCs time out), and meanwhile
seq=7/8 time out through no fault of their own.
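
For illustration, here is a minimal sketch of the pre-fix behaviour under a
simplified model of the request queue; the names (pending, entry, send_now,
on_timeout) are hypothetical and not the actual transport.cc code:

#include <cstdint>
#include <functional>
#include <map>

using sequence_t = uint64_t;

struct entry {
    std::function<void()> send_now; // writes the serialized request to the socket
};

std::map<sequence_t, entry> pending; // requests waiting for dispatch, keyed by sequence
sequence_t last_seq = 4;

// Pre-fix dispatch loop: only sends while the head of the queue is exactly
// the next expected sequence number.
void dispatch_pending() {
    while (!pending.empty() && pending.begin()->first == last_seq + 1) {
        auto it = pending.begin();
        it->second.send_now();
        last_seq = it->first;
        pending.erase(it);
    }
}

// Pre-fix timeout path: the timed-out entry is simply erased and the dispatch
// loop is never re-run, so last_seq stays at 5 and seq=7/8 sit behind a
// permanent hole until some new request happens to call dispatch_pending().
void on_timeout(sequence_t seq) {
    pending.erase(seq);
}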

This patch does two main things.

  • Consolidates _last_seq tracking. Currently it is not monotonic and can
    jump all over the place. This is confusing. Now we only update it in a
    centralized place in dispatch_send().

  • ^^ Requires that dispatch_send() is called in all cases, which also
    avoids dispatch loop stalls. For example, a timed-out request will now
    clear the stalled queue right away (which was not the case before); see
    the sketch after this list.
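
A minimal sketch of the fixed flow, under the same simplified model as above
(again, the names are illustrative, not the actual transport.cc
implementation):

#include <cstdint>
#include <functional>
#include <map>

using sequence_t = uint64_t;

struct entry {
    std::function<void()> send_now; // writes the serialized request to the socket
    bool timed_out = false;         // set by the timeout path instead of erasing the entry
};

std::map<sequence_t, entry> pending;
sequence_t last_seq = 4;

// dispatch_send() is now the only place last_seq advances, so it stays
// monotonic, and it runs on every outcome (successful enqueue or timeout),
// so a timed-out entry drains the queue instead of stalling it.
void dispatch_send() {
    while (!pending.empty() && pending.begin()->first == last_seq + 1) {
        auto it = pending.begin();
        last_seq = it->first;          // centralized, monotonic update
        if (!it->second.timed_out) {
            it->second.send_now();     // normal dispatch
        }                              // timed out: skip the write but keep draining
        pending.erase(it);
    }
}

void on_timeout(sequence_t seq) {
    if (auto it = pending.find(seq); it != pending.end()) {
        it->second.timed_out = true;
    }
    dispatch_send(); // clears the stall immediately; seq=7/8 go out right away
}

With this shape, the seq=6 timeout in the example above immediately unblocks
7 and 8 instead of waiting for a future request to arrive.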

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

src/v/rpc/transport.cc (outdated review thread, resolved)
@bharathv bharathv requested a review from dotnwat May 25, 2023 18:13
Comment on lines 208 to 209
from_now(timing.memory_reserved_at),
from_now(timing.enqueued_at),
Member

are enqueue and memory reserved out of order?

@@ -204,6 +205,7 @@ transport::make_response_handler(
_correlations.size(),
from_now(
timing.timeout.timeout_at() - timing.timeout.timeout_period),
from_now(timing.memory_reserved_at),
Contributor Author

This should go after enqueued_at, just noticed.. will fix.

In case RPCs are timing out due to memory pressure (unlikely given we do
not set a small limit on the semaphore, but just in case).
.. to track the sequence of dispatches.
@bharathv
Contributor Author

bharathv commented Jun 1, 2023

Moving to draft as I do more ci-repeat runs. I folded the dispatch loop stall fix into this PR.

@bharathv bharathv marked this pull request as draft June 1, 2023 19:38
Currently, the way the code is structured, a temporary dispatch stall can
time out a bunch of requests.

Consider a request queue of sequence numbers [5, 6, 7, 8]

Let's assume the dispatch of seq=6 got delayed. This can happen if the timeout
is low or there is an unexpected delay from the scheduler (e.g. debug builds).

_last_seq=4, queue = [5, 7, 8] - 5 is dispatched right away
_last_seq=5, queue = [7,8] -- out of order.

Dispatch of 7 and 8 doesn't happen right away because their sequence numbers
are out of order, and we never do a dispatch with seq=6 because we detect it
has already timed out.

Now it takes a new RPC request to clear the stalled queue, but one may not
arrive for a while as most of these are timer based, and meanwhile seq=7/8
time out through no fault of their own.

This patch does two main things.

* Consolidates _last_seq tracking. Currently it is not monotonic and can
jump all over the place. This is confusing. Now we only update it in a
centralized place in dispatch_send().

* ^^ Requires that dispatch_send() is called in all cases, which also
avoids dispatch loop stalls. For example, a timed-out request will now clear
the stalled queue right away (which was not the case before).
@bharathv
Contributor Author

bharathv commented Jun 1, 2023

/ci-repeat 5
debug
skip-units

@bharathv bharathv changed the title from "rpc/transport: additional debug information for spurious timeouts" to "rpc/transport: fix temporary dispatch loop stalls" Jun 2, 2023
@bharathv bharathv requested a review from dotnwat June 2, 2023 01:53
@bharathv bharathv marked this pull request as ready for review June 2, 2023 01:54
@bharathv
Contributor Author

bharathv commented Jun 2, 2023

The repeat-5 debug failures are all known flaky issues currently happening with debug builds.

@dotnwat dotnwat requested a review from graphcareful June 2, 2023 05:10
@bharathv bharathv self-assigned this Jun 2, 2023
@dotnwat
Member

dotnwat commented Jun 2, 2023

Following up here from out-of-band conversation:

q: do we have any rpc users that depend on ordered delivery?
a: dunno, probably not. but there is that optimization related to in-order delivery of raft messages
q: does this pr change delivery order?
a: messages may be dropped (but that happened before, too), so no?
q: is that accurate, are there other concerns?

i guess the important thing is to fully understand if there are material differences for delivery order?

cc @mmaslankaprv @bharathv

@bharathv
Contributor Author

bharathv commented Jun 2, 2023

i guess the important thing is to fully understand if there are material differences for delivery order?

I think there is no change.. out-of-order messages are always possible with timeouts, e.g. the head of the queue times out and then the next RPC that depends on it is successfully dispatched. AIUI in-order delivery is done on a best-effort basis as a performance optimization, but we have checks in the raft layer for correctness.

@dotnwat
Member

dotnwat commented Jun 2, 2023

Thanks @bharathv

@bharathv
Contributor Author

bharathv commented Jun 2, 2023

/ci-repeat 1

@dotnwat dotnwat merged commit 68d67a5 into redpanda-data:dev Jun 3, 2023
@vbotbuildovich
Collaborator

/backport v23.1.x

@dotnwat
Member

dotnwat commented Jun 3, 2023

Did we see this issue reported in 23.1.x?

@bharathv
Contributor Author

bharathv commented Jun 5, 2023

Did we see this issue reported in 23.1.x?

No, not really.. I closed the backport; we can revisit it later.

@bharathv bharathv deleted the transport_debug branch June 5, 2023 16:06