# Make sure Zebra uses poll_ready and Buffer reservations correctly #1593
The remaining work in this ticket is routine RFC documentation.
The remaining design work in this ticket can be done as part of any future sprint.
## Motivation

The constraints imposed by the `tower::Buffer` and `tower::Batch` implementations are:

- `poll_ready` must be called at least once for each `call`
- a reservation is held once we get `Poll::Ready` from a buffer, regardless of the current readiness of the buffer or its underlying service
- `Buffer`/`Batch` capacity limits the number of concurrently waiting tasks. Once this limit is reached, further tasks will block, awaiting a free reservation.
- `Buffer`/`Batch` capacity must be larger than the maximum number of concurrently waiting tasks, or Zebra could deadlock (hang).
- slots should be acquired using the `ready!` macro, rather than polling them all simultaneously

These constraints are mitigated by:

- `Buffer`/`Batch` reservation release when the response future is returned by the buffer/batch, even if the future doesn't complete
- prompt `call`s after each successful `poll_ready`
- generous `Buffer`/`Batch` bounds

## Tasks
- document Zebra's `poll_ready` patterns in an RFC
- add a deadlock section to that RFC, based on the deadlock constraints listed above

Here's a deadlock avoidance example from #1735:
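As a minimal sketch of the pattern (not the original #1735 code), assuming a hypothetical echo service behind a `Buffer`:

```rust
use tower::{buffer::Buffer, service_fn, Service, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), tower::BoxError> {
    // A hypothetical inner service that just echoes its request.
    let inner = service_fn(|req: u32| async move { Ok::<u32, tower::BoxError>(req) });

    // Wrap it in a small `Buffer` so callers can clone it cheaply.
    let buffered = Buffer::new(inner, 3);

    // The safe pattern: clone the buffer handle, then let `oneshot`
    // pair `poll_ready` with `call` on that clone. The reservation is
    // always used by the matching `call`, so it can't leak and
    // deadlock the buffer.
    let response = buffered.clone().oneshot(42).await?;
    assert_eq!(response, 42);

    Ok(())
}
```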
## Outdated

The rest of this ticket is outdated, but might be useful for writing the RFC.

### Obsolete Tasks

- check every use of `Service` or `ServiceExt`
- check every `poll_ready` or `call` on a `Service`
- turn this ticket text into a Zebra RFC, using examples from "Fix poll_ready usage in ChainVerifier" (#1700) (obsoleted)

## Motivation
### Symptoms
Zebra sometimes hangs (#1435) for long periods of time. These hangs seem to happen in the peer set, but they could be caused by bugs elsewhere in Zebra.
### Analysis

Zebra uses Tower services extensively. The `tower::Buffer` documentation says that every `poll_ready` and `call` should happen in a matched pair, or the `Buffer` will run out of reservation slots and hang.

But Zebra often calls `Buffer::poll_ready` (or `ready_and`), gets an `Ok` result, then never uses the `Buffer` reservation in a `Buffer::call`.
This bug can also be transitive (sketched in the code below):

- an `Inner` Zebra service correctly follows every `Buffer::poll_ready` with a `Buffer::call`
- an `Outer` Zebra service uses `Inner::poll_ready`, but doesn't follow it with an `Inner::call`
- `Outer::poll_ready` reserves slots via the `Inner` service's `Buffer::poll_ready`, but never follows it with an `Inner::call`, so the `Buffer` slot stays reserved, pending an `Inner::call` that never comes
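A minimal sketch of that transitive leak, with hypothetical request and response types:

```rust
use std::task::{Context, Poll};

use futures::future::{self, BoxFuture, FutureExt};
use tower::{BoxError, Service};

/// A hypothetical request type: only some requests need `Inner`.
pub enum Request {
    Relevant(u32),
    Other,
}

pub struct Outer<S> {
    inner: S,
}

impl<S> Service<Request> for Outer<S>
where
    S: Service<u32, Response = u32, Error = BoxError>,
    S::Future: Send + 'static,
{
    type Response = u32;
    type Error = BoxError;
    type Future = BoxFuture<'static, Result<u32, BoxError>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
        // Reserves a slot in `Inner`'s `Buffer` on every readiness check.
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, request: Request) -> Self::Future {
        match request {
            // This branch uses the reservation taken in `poll_ready`.
            Request::Relevant(n) => self.inner.call(n).boxed(),
            // BUG: this branch returns without calling `Inner`, so the
            // `Buffer` slot reserved in `poll_ready` is never used,
            // and `Inner`'s buffer slowly fills up.
            Request::Other => future::ready(Ok(0)).boxed(),
        }
    }
}
```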
This issue also affects Zebra's own `tower-batch`, which has a very similar implementation to `tower::Buffer`. See #1668 for details.

## Requirements
Therefore, all Zebra services must follow `poll_ready` with `call` on all of the services they use, so that any `Buffer`s or `Batch`es lower down in the service stack are used correctly. (Since Zebra or its dependencies may add `Buffer`s or `Batch`es in future changes, this requirement applies even if they are not currently in the service stack.)

As of January 2021, Zebra often breaks these requirements using the following anti-patterns:

`poll_ready`:

- we make `poll_ready`s on several `Buffer`s, but return `Poll::Pending` if any of them is not ready
- when we get `Poll::Pending` and retry our `poll_ready`, we keep the existing `Buffer` reservations, without any `call`s to use those reservations, filling up the `Buffer`s of the ready services

`call`:

- we only `call` the services used for this specific kind of request
- so the `Buffer`s of the unused services fill up with reservations

## Design
### Overall Summary

Zebra uses 3 different design patterns for service readiness:

- simple wrapper services forward `poll_ready` and `call` unconditionally
- complex services use `ServiceExt::oneshot` to make sure `poll_ready` and `call` happen together
- complex services that transmit backpressure combine the `oneshot` pattern with a `ReadyCache`
### Simple Wrapper Services
Simple wrapper services can forward `poll_ready` and `call` when the `Wrapper` service:

- wraps a single `Inner` service, and
- makes exactly one unconditional call to that `Inner` service.

We can't use this pattern if the `Inner` service call happens:

- conditionally, or
- inside the returned future, or
- after a potential early return, such as the `?` operator.

Implementation:

`Wrapper::poll_ready`:

- check the readiness of the `Wrapper` service itself, returning immediately if it is not ready or errored
- if the `Wrapper` service is ready, check the readiness of the `Inner` service and return the result, converting the error types as needed

`Wrapper::call`:

- call the `Inner::call` method directly
- `Inner::call` must be called immediately, unconditionally, and regardless of any errors
- if any code runs before `Inner::call`, wrap it in a function that can not return an error

Simple services transmit backpressure directly from `Inner::poll_ready`, so a `Buffer` is optional.
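A sketch of the wrapper pattern under these rules, assuming a hypothetical `String`-based `Inner` service (the uppercasing step is a stand-in for infallible preprocessing):

```rust
use std::task::{Context, Poll};

use futures::future::{BoxFuture, FutureExt, TryFutureExt};
use tower::{BoxError, Service};

pub struct Wrapper<S> {
    inner: S,
}

impl<S> Service<String> for Wrapper<S>
where
    S: Service<String, Response = String>,
    S::Error: Into<BoxError>,
    S::Future: Send + 'static,
{
    type Response = String;
    type Error = BoxError;
    type Future = BoxFuture<'static, Result<String, BoxError>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
        // `Wrapper` has no readiness of its own, so just forward to
        // `Inner`, converting the error type as needed.
        self.inner.poll_ready(cx).map_err(Into::into)
    }

    fn call(&mut self, request: String) -> Self::Future {
        // Only infallible preprocessing happens before `Inner::call`,
        // so the reservation taken in `poll_ready` is always used.
        let request = request.to_uppercase();
        self.inner.call(request).map_err(Into::into).boxed()
    }
}
```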
### Complex Services

We need to follow every `poll_ready` with an immediate `call` when an `Outer` service:

- uses multiple inner services, or
- calls inner services conditionally or inside the returned future, or
- can return early, for example via the `?` operator.

Implementation:

`Outer::poll_ready`:

- check the readiness of the `Outer` service and return the result

`Outer::call`:

- call each inner service using `service.clone().oneshot(request)` (`oneshot` is from `tower::ServiceExt`)
- wrap the inner services in `Buffer`s so they are cloneable

When non-service components call into service code, they should also use this `oneshot` pattern. (Non-services can't transmit backpressure, so the other patterns can't be used for them.)
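A sketch of a complex service using this pattern, assuming a single buffered inner verifier (the names and `u32` request type are hypothetical):

```rust
use std::task::{Context, Poll};

use futures::future::{BoxFuture, FutureExt};
use tower::{buffer::Buffer, BoxError, Service, ServiceExt};

pub struct Outer<S>
where
    S: Service<u32>,
{
    // Buffered so it can be cloned into the returned future.
    verifier: Buffer<S, u32>,
}

impl<S> Service<u32> for Outer<S>
where
    S: Service<u32, Response = u32>,
    S::Error: Into<BoxError>,
    S::Future: Send + 'static,
{
    type Response = u32;
    type Error = BoxError;
    type Future = BoxFuture<'static, Result<u32, BoxError>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
        // Only check `Outer`'s own readiness here. The inner service's
        // reservation is taken and used together, inside `call`.
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, request: u32) -> Self::Future {
        let verifier = self.verifier.clone();
        async move {
            // `oneshot` pairs `poll_ready` with `call` on this clone,
            // so the `Buffer` reservation can't leak.
            verifier.oneshot(request).await
        }
        .boxed()
    }
}
```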
### Complex Backpressure

If a complex service needs to transmit backpressure from one or more inner services, it can store those services in a `ReadyCache`:

`Outer::poll_ready`:

- check the readiness of the `Outer` service, returning immediately if it is not ready or errored
- if the `Outer` service is ready, check the readiness of the backpressure inner services using `ReadyCache::poll_pending` and return the result, converting the error types as needed

`Outer::call`:

- call the backpressure inner services using `ReadyCache::call_ready`
- call the other inner services using `service.clone().oneshot(request)`

`Outer` must not call `ReadyCache::check_ready` (or `ReadyCache::check_ready_index`), because they use `poll_ready` on services that are already ready. This ensures that each inner service has at most one outstanding `Buffer` reservation in the `ReadyCache`.

It's also possible to store multiple services in a `ReadyCache`. But Zebra services typically have inner services with different types, so this requires type erasure, which complicates the design.

See https://docs.rs/tower/0.4.3/tower/ready_cache/cache/struct.ReadyCache.html
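A rough sketch of the `ReadyCache` half of this pattern, assuming tower 0.4's `ready_cache` API (the key and `u32` request type are hypothetical, and other inner services would still use `clone().oneshot(request)`):

```rust
use std::task::{Context, Poll};

use futures::future::{BoxFuture, FutureExt, TryFutureExt};
use tower::{ready_cache::ReadyCache, BoxError, Service};

pub struct Outer<S>
where
    S: Service<u32>,
{
    // One buffered inner service that transmits backpressure,
    // stored under a hypothetical key.
    backpressure: ReadyCache<String, S, u32>,
}

impl<S> Service<u32> for Outer<S>
where
    S: Service<u32, Response = u32>,
    S::Error: Into<BoxError>,
    S::Future: Send + 'static,
{
    type Response = u32;
    type Error = BoxError;
    type Future = BoxFuture<'static, Result<u32, BoxError>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), BoxError>> {
        // Drive the stored inner services to readiness, holding at most
        // one reservation per service.
        self.backpressure.poll_pending(cx).map_err(Into::into)
    }

    fn call(&mut self, request: u32) -> Self::Future {
        // `call_ready` uses the reservation taken by `poll_pending`
        // above, so the reservation can't leak.
        self.backpressure
            .call_ready("inner", request)
            .map_err(Into::into)
            .boxed()
    }
}
```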
### Stop Using `call_all`

`call_all` leaks a buffer reservation every time the inner service returns `Poll::Pending`:
https://github.com/tower-rs/tower/blob/master/tower/src/util/call_all/common.rs#L112

`Buffer`s return `Poll::Pending` when they are full:

- https://github.com/tower-rs/tower/blob/master/tower/src/buffer/service.rs#L119
- https://github.com/tower-rs/tower/blob/master/tower/src/semaphore.rs#L55

So we should stop using `call_all`, because buffers can always return `Poll::Pending`.

For quick requests, we can just use a loop inside the returned future (see the sketch below). Correctness and simplicity are more important than a possible small performance gain from doing concurrent tasks (as long as they are quick).
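A sketch of the loop-based replacement, assuming a cloneable (buffered) service:

```rust
use tower::{Service, ServiceExt};

/// Process `requests` in order, without `call_all`.
/// Each iteration pairs `poll_ready` with `call` via `oneshot`,
/// so no `Buffer` reservation is ever leaked.
async fn respond_to_all<S>(
    service: S,
    requests: Vec<u32>,
) -> Result<Vec<u32>, S::Error>
where
    S: Service<u32, Response = u32> + Clone,
{
    let mut responses = Vec::with_capacity(requests.len());

    for request in requests {
        let response = service.clone().oneshot(request).await?;
        responses.push(response);
    }

    Ok(responses)
}
```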
If the requests are slow, or have dependencies on each other, we should `spawn` each request, and store their `JoinHandle`s in a `FuturesUnordered`. We might also want to wrap each request future in a `Timeout`, so we don't hang on very slow tasks or missing dependencies.
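A sketch of that variant, using `tokio::spawn` and `tokio::time::timeout` (the 10-second limit is an arbitrary placeholder):

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use tower::{BoxError, Service, ServiceExt};

/// Spawn one task per request, with a timeout on each.
async fn respond_concurrently<S>(
    service: S,
    requests: Vec<u32>,
) -> Result<Vec<u32>, BoxError>
where
    S: Service<u32, Response = u32> + Clone + Send + 'static,
    S::Error: Into<BoxError> + Send + 'static,
    S::Future: Send + 'static,
{
    let mut handles = FuturesUnordered::new();

    for request in requests {
        let service = service.clone();
        handles.push(tokio::spawn(tokio::time::timeout(
            Duration::from_secs(10),
            service.oneshot(request),
        )));
    }

    // Responses are collected in completion order, not request order.
    let mut responses = Vec::new();
    while let Some(joined) = handles.next().await {
        // Unwrap the `JoinError`, `Elapsed`, and service error layers.
        let response = joined?.map_err(BoxError::from)?.map_err(Into::into)?;
        responses.push(response);
    }

    Ok(responses)
}
```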
### Buffers

A service should be provided wrapped in a `Buffer` if:

- it is called using the `oneshot` pattern, which needs a cloneable service, or
- it has multiple concurrent callers.

Services might also have other reasons for using a `Buffer`. These reasons should be documented.

### Choosing Buffer Bounds
Zebra's `Buffer` bounds should be set to the maximum number of concurrent requests, plus 1. The extra slot protects us from future changes that add an extra caller, or extra concurrency.

As a general rule, Zebra `Buffer`s should all have at least 3 slots (2 + 1), because most Zebra services can be called concurrently by at least two other components.
We should minimise
Buffer
lengths for services whose requests or responses containBlock
s (or other large data items, such asTransaction
vectors). A longBuffer
full ofBlock
s can significantly increase memory usage.Long
Buffer
s can also increase request latency. Latency isn't a concern for Zebra's core use case as a node software, but it might be an issue if wallets, exchanges, or block explorers want to use Zebra.Implementation Tasks
We should review every Zebra service, and all the code that calls Zebra services, and make sure it uses one of the design patterns listed in this ticket. In particular, we should search for the following strings:

- `poll_ready`
- `ready_and`
- `call`
- `Service` or `ServiceExt` functions that call `poll_ready`

We should stop using `ServiceExt::call_all`, because it has a bug that leaks service reservations, which can fill up buffers, causing hangs.

## Underlying Abstraction Failure
These kinds of issues occur because what Zebra wants from its services doesn't match Tower's canonical use case, and Tower's design and tooling are still a work in progress.
In addition, Zebra's definition of service misuse doesn't quite match Tower's.
For example, we might want to panic rather than hanging if a service's buffer fills up. Typically, networked services are de-prioritised, restarted or replaced if they are slower than other services. But Zebra only creates a single instance of each service at startup.
### Stream Abstraction

`Stream` is immune from these issues, because its `poll_next` function returns the underlying result when ready.

## Open Questions
### Batch Verification and Blocking Tasks

Do batch verification, CPU-bound services, and other potentially blocking tasks need larger `Buffer`s?

Do the service layers above blocking services need larger `Buffer`s, up to the point where concurrency is introduced?

For example, some desktop machines have 64 cores and 128 threads:
https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-pro-3995wx
(I haven't seen any CPU-limiting issues on my 18-core / 36-thread machine, so this question isn't a high priority.)
## Alternative Designs

Here are some other designs for readiness:

- making `Buffer`s bigger - just delays the hangs
- sending `Request::Null` to services we don't use - makes the API really messy
- `ready_and().await?.call(request)` - more complex than `clone().oneshot(request)`, and both ways require a `Buffer` anyway
- a `ReadyCache` - easier to get wrong

Here are some other designs for backpressure:

- relying on a `Buffer` for backpressure - only works if blocking or slow tasks happen directly in the `call`, rather than the returned future, because the `Buffer` slot is released after the future is returned

## Related Issues
This work is part of fixing the hangs in #1435.
## Follow-Up Tasks
The content of this ticket should be turned into a design RFC, with example code based on the implementation of this ticket. But we need to check that it works first.
We should keep the current backpressure implementations as much as possible, but review them as a separate task (#1618).