Cancel-able reservations for ratelimiters #6030
Conversation
…er than IRL and it still behaves the same
Codecov Report: All modified and coverable lines are covered by tests ✅ (plus 9 files with indirect coverage changes).
kind of makes sense at a high level. added some nit comments, will do another round of review once it's marked ready for review.
common/clock/ratelimiter.go
Outdated
```go
// implementation if they are handled separately.
// - All MethodAt APIs have been excluded as we do not use them, and they would
//   have to bypass the contained TimeSource, which seems error-prone.
// - All `MethodN` APIs have been excluded because we do not use them, but they
```
side note: these would be useful if we decide to have different weights per call or its content
their interaction with burst is a bit funky and I'm not sure that's something we want to play around with (as opposed to just repeatedly calling `Allow()`)... but yea. easy and has useful scenarios, I'm pretty confident there won't be any surprises if we add it.
They can be useful, if the limit is not based on QPS but BPS.
do we have any of those? I can add them, I just haven't seen it
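For illustration only (not something this PR adds): a bytes-per-second style limit is roughly where `AllowN` would matter, using the stdlib limiter directly. The numbers and scenario below are made up:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical bytes-per-second limit: 1 MiB/s with a 64 KiB burst.
	lim := rate.NewLimiter(rate.Limit(1<<20), 64<<10)

	payload := make([]byte, 16<<10) // a 16 KiB message

	// AllowN weights the call by payload size instead of counting "1 request".
	if lim.AllowN(time.Now(), len(payload)) {
		fmt.Println("send allowed")
	} else {
		fmt.Println("send throttled")
	}
}
```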
Benchmarking and checking logs carefully showed a flaw! Using "now" rather than the reserved-at time when checking `Allow()` is incorrect, and can lead to far more allowed calls than intended. This is now fixed, and full benchmark runs are saved in testdata.
common/clock/ratelimiter.go
Outdated
```go
// in high-CPU benchmarks, if you use the limiter's "now", you can sometimes
// see this allowing many times more requests than it should, due to the
// interleaving time advancement.
return r.res.OK() && r.res.DelayFrom(r.reservedAt) == 0
```
Found a flaw while benchmarking!
Previously this was using `r.limiter.lockNow()`'s value, and under high concurrency it could allow millions rather than the intended couple-thousand (afaict due to the limiter's internal "now" advancing to a time where the reservation would be allowed).
Now it's fixed, and I have not seen it misbehave again.
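A minimal sketch of that idea against the raw `x/time/rate` types; the field and method names here are illustrative rather than the PR's exact code:

```go
package ratelimitsketch

import (
	"time"

	"golang.org/x/time/rate"
)

// reservation pins the time it was created at, so "was this immediately
// allowed?" is evaluated against that pinned time rather than a fresh Now(),
// which can drift under heavy concurrency.
type reservation struct {
	res        *rate.Reservation
	reservedAt time.Time
}

func reserve(lim *rate.Limiter, now time.Time) *reservation {
	return &reservation{res: lim.ReserveN(now, 1), reservedAt: now}
}

func (r *reservation) Allow() bool {
	// DelayFrom(reservedAt) == 0 means a full token was available at the
	// moment the reservation was taken; using time.Now() here instead can
	// spuriously report "allowed" once enough wall time has passed.
	return r.res.OK() && r.res.DelayFrom(r.reservedAt) == 0
}
```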
…tect from that same flaw
```go
return d
}

type timesourceContext struct {
```
possibly useful to break out separately: context-timeouts need a timesource-wrapper.
main caveat is that using it in prod will mean losing stdlib-context-internal optimizations (cancels are propagated more efficiently if it's an stdlib implementation).
build tags could solve that, but we have generally avoided them because they make a few things annoying... but with a standardized and checked approach I'd be entirely in favor of it.
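Roughly the shape such a wrapper could take, as a sketch only: the `TimeSource` interface below is an assumed stand-in (including `AfterFunc`'s signature), and parent-cancellation propagation — exactly the stdlib-internal optimization mentioned above — is omitted for brevity:

```go
package clocksketch

import (
	"context"
	"sync"
	"time"
)

// TimeSource is an assumed stand-in for the clock package's interface; only
// the two methods this sketch needs are included.
type TimeSource interface {
	Now() time.Time
	AfterFunc(d time.Duration, f func())
}

// timesourceContext is a deadline-only context driven by a TimeSource, so a
// fake clock can trigger timeouts in tests. Parent cancellation is not
// propagated here to keep the sketch short.
type timesourceContext struct {
	context.Context // parent, used for Value()
	deadline        time.Time

	mu   sync.Mutex
	done chan struct{}
	err  error
}

func WithDeadline(parent context.Context, ts TimeSource, d time.Time) (context.Context, context.CancelFunc) {
	c := &timesourceContext{Context: parent, deadline: d, done: make(chan struct{})}
	ts.AfterFunc(d.Sub(ts.Now()), func() { c.cancel(context.DeadlineExceeded) })
	return c, func() { c.cancel(context.Canceled) }
}

func (c *timesourceContext) cancel(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.err == nil {
		c.err = err
		close(c.done)
	}
}

func (c *timesourceContext) Deadline() (time.Time, bool) { return c.deadline, true }
func (c *timesourceContext) Done() <-chan struct{}       { return c.done }

func (c *timesourceContext) Err() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.err
}
```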
common/clock/ratelimiter.go
Outdated
```go
// unusable, immediately cancel it to get rid of the future-reservation
// since we don't allow waiting for it.
//
// this restores MANY more tokens when heavily contended, as time skews
// less before canceling, so it gets MUCH closer to the intended limit,
// and it performs very slightly better.
res.CancelAt(now)
return deniedReservation{}
```
noticed some super-low allow rates in a mock-time benchmark when advancing time slowly (and sometimes in real-time too), tracked the issue down to "cancels often arrive later than desired in this test", leading to the ratelimiter mostly staying at negative tokens (due to not-allowed pending reservations) and only allowing ~0.001% through.
synchronously canceling solves that, and should also improve behavior IRL.
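A sketch of that synchronous-cancel behavior against the raw `x/time/rate` types (names are illustrative, not the wrapper's exact code):

```go
package ratelimitsketch

import (
	"time"

	"golang.org/x/time/rate"
)

// tryAllow: if the reservation isn't usable right now, cancel it at the same
// "now" instead of leaving a future reservation pinning the bucket at
// negative tokens.
func tryAllow(lim *rate.Limiter, now time.Time) bool {
	res := lim.ReserveN(now, 1)
	if res.OK() && res.DelayFrom(now) == 0 {
		return true // a full token was available and has been claimed
	}
	// Not immediately available, and we never wait here, so synchronously
	// restore the token before time can skew any further.
	res.CancelAt(now)
	return false
}
```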
Tests cleaned up and flattened, docs updated, minor optimizations optimized, and 100% coverage even without the fuzz test. Seems good to go :)
```go
// - If false, the Reservation will be rolled back, possibly restoring a Ratelimiter token.
//   This is equivalent to calling Cancel() on a [rate.Reservation], but it is more likely
//   to return the token.
Used(wasUsed bool)
```
dumb question: what's the intent one might try to convey by specifying that a reservation was used, even when it wasn't allowed? What does that do to the underlying limiter?
In this implementation: nothing, it's always a no-op.
In the global ratelimiter thing that led to all this: so you have an "accepted/rejected" event that can be monitored, for counting usage.
With just "Reserve -> Cancel", you have to assume all reserves are "used" immediately, and subtract 1 if cancel is called. And then you may have to deal with counts going negative briefly.
With something called in all cases, that's not a concern, just defer the count until it's marked as used-or-not. It'll even be more time-correct since you're "using" it closer to the `Used(true)` call than the `Reserve()` call.
I suspect it'll also be useful in tests, as this way you always `EXPECT` a call (but the arg varies) rather than "it might be called or it might not".
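A hypothetical caller-side sketch of that pattern; the interface shape and names below are assumptions based on this thread, not necessarily the exact API:

```go
package usedsketch

import "errors"

var errRateLimited = errors.New("rate limited")

// Reservation mirrors the wrapper's API as described in this thread; treat
// the exact shape as an assumption for this sketch.
type Reservation interface {
	Allow() bool
	Used(wasUsed bool)
}

type Ratelimiter interface {
	Reserve() Reservation
}

// doLimited shows the intended call pattern: every Reserve() is eventually
// paired with exactly one Used(true/false), so an accepted/rejected counter
// never has to guess or briefly go negative.
func doLimited(rl Ratelimiter, work func() error) error {
	res := rl.Reserve()
	if !res.Allow() {
		res.Used(false) // rejected: count a "rejected" event, roll back
		return errRateLimited
	}
	err := work()
	res.Used(true) // accepted and consumed: count an "accepted" event
	return err
}
```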
Ah, you probably meant with the real limiter / a somewhat different question.

`*rate.Limiter` reservations don't really have an "allowed" concept. We (currently and in this wrapper) consider a reservation "allowed" if it is `OK()` (possible to reserve, literally just "N <= burst") and `Delay() == 0` (the limiter had a full token available right now, which has already been claimed).

Delay can be positive, if there wasn't a full token in the bucket. You can take advantage of this if you wanted to reserve a token and wait for it to become available. At that point, reservations are essentially identical to `Wait()` but you can tell ahead of time how long you would have to wait, instead of it being a black box that you can give up on (cancel the context) but not predict.

When you get one of these positive-delay reservations, a token has still been removed. The limiter goes from e.g. 0.5 tokens to -0.5 tokens. Reserve another and you'll be at -1.5 tokens, and it'll be 2.5 ratelimiter-periods before any `Allow()` call can succeed, as it refills the bucket from negative values. But you can continue `Reserve()`ing (or `Wait()`ing) as much as you like, and push that to-be-allowed-call backlog out arbitrarily far, each one just waits long enough to produce a steady throughput.

You can cancel these reservations and return a token while waiting, allowing new reservations to occur sooner and possibly even an `Allow()` call*. But not after your time has passed, the Cancel() call just no-ops.

*: ... I'm not actually sure this is reasonable, now that I think about it. I suspect it can allow arbitrarily large bursts. But it is how `*rate.Limiter` works.
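A small runnable illustration of those positive-delay reservations, using `x/time/rate` directly (the 1-token / burst-1 numbers are just for the demo):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// 1 token per second, burst of 1: the scenario described above.
	lim := rate.NewLimiter(rate.Limit(1), 1)

	_ = lim.Allow() // consume the only full token

	// This reservation is OK() but has a positive delay: a token has already
	// been claimed, pushing the bucket negative, and Delay() says how long
	// until that claim is "paid for".
	res := lim.Reserve()
	fmt.Println("ok:", res.OK(), "delay:", res.Delay().Round(time.Millisecond))

	// Further Allow() calls fail until the bucket refills past zero again.
	fmt.Println("allow while backlogged:", lim.Allow())

	// Canceling before the delay elapses returns the token (subject to the
	// bookkeeping discussed above); after the delay has passed it is a no-op.
	res.Cancel()
}
```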
> *: ... I'm not actually sure this is reasonable, now that I think about it. I suspect it can allow arbitrarily large bursts.
Ah, no, that's handled. It may take me a while to fully understand the details, but it's tracking "reserves since reserved" and not exceeding that value, so carefully-timed reserve+cancel interactions don't seem to allow scheduling a bunch of reserved events at the same time.
common/clock/ratelimiter.go
Outdated
```go
// and cause undefined behavior.
func (r *ratelimiter) lockNow() (now time.Time, unlock func()) {
	r.mut.Lock()
	// important design note: `r.timesource.Now()` MUST be gathered while the lock is held,
```
good point, but repeated above
I had intended this as "callers of this func must hold the lock as long as the limiter is being modified (often handled by `defer unlock()`)" being a different concern from "this func's implementation must lock before gathering `time.Now()`" (which can't be observed by callers of `lockNow()`).
"do less inside the lock" -> "move Now() outside the lock" is a fairly predictable future optimization idea, but it'd reproduce the underlying `*rate.Limiter` flaw.
I can reword it, but I do think both need to be documented explicitly, in a way that'd be noticed during code review for any changes.
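A compressed sketch of both requirements in one place, with illustrative names (the PR's real code documents them with the two comments discussed above):

```go
package ratelimitsketch

import (
	"sync"
	"time"
)

// Minimal stand-in for the wrapper's state; field names are illustrative.
type ratelimiter struct {
	mut        sync.Mutex
	timesource interface{ Now() time.Time }
}

// lockNow: Now() is read only while the lock is held, so timestamps reach the
// wrapped limiter in order, and callers keep the returned unlock for as long
// as they touch the limiter.
func (r *ratelimiter) lockNow() (now time.Time, unlock func()) {
	r.mut.Lock()
	// MUST be gathered under the lock; hoisting it out as a "do less inside
	// the lock" optimization would reintroduce the *rate.Limiter rewind flaw.
	return r.timesource.Now(), r.mut.Unlock
}
```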
ah, wait, you were probably referring to the comment above `timesource` in the struct definition. yea, that feels like a duplicate. I'll clean that up.
common/clock/ratelimiter.go
Outdated
```go
// ensure we are unlocked, which will not have happened if an err
// was returned immediately.
once.Do(unlock)
```
slightly confused about the number of unlocks here. My recollection is that defers are LIFO and you're manually unlocking here. Won't this be called a second time on 327?
no, that's what you're using once for 🤦 ignore me
Yea - it just felt (a lot) easier to `once` it and not worry about handling specific branches. Much harder to screw up.
I can figure out exactly what that'd take if we wanted a different route / see the comparison, and it doesn't actually have to be a sync-once to work, so it can be optimized a bit. I just fought with it briefly and then realized "wait, this approach is obviously correct, I'll just do that" and never looked back.
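A self-contained sketch of the once-guarded unlock pattern, with the actual Wait() logic elided and names made up for the example:

```go
package ratelimitsketch

import (
	"context"
	"sync"
	"time"
)

// waitSketch: every early-return path and the happy path can call unlock, and
// the sync.Once guarantees the mutex is released exactly once regardless of
// which branch runs first.
func waitSketch(ctx context.Context, mut *sync.Mutex) error {
	mut.Lock()
	unlock := mut.Unlock
	var once sync.Once
	defer once.Do(unlock) // covers early error returns

	if err := ctx.Err(); err != nil {
		return err // the deferred once.Do unlocks
	}

	// ... reserve a token under the lock, then release it before sleeping ...
	once.Do(unlock)

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(0): // stand-in for waiting out the reservation's delay
		return nil
	}
}
```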
can't fault anything, to the extent that I was following along with the minutiae of the rate-limiter behaviour
```go
if delay == 0 {
	return nil // available token, allow it
}
if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
```
We may want a few milliseconds of buffer (now+delay < deadline-buffer). Ideally specified by the endpoint being protected but due to lack of annotations that will be a mess. Probably fine to leave like this for now
from IRL: this kind of "reserve a buffer for the caller to use the thing protected by the limiter" is better handled by layers above this, by making a context with a deadline before that buffer.
which does very much seem worth making easier to do, it's quite verbose with plain contexts and I suspect that's part of why we don't do it much. and it may be worth building a `Wait(ctx, reserveTimeBeforeCtxTimesOut)` API to make it super easy and clear. we just don't really have anything that works this way right now, and the decision of how much time to buffer wouldn't be implemented at this layer anyway.
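A sketch of that layer-above approach with plain contexts, which is the verbosity being referred to; the helper name is invented for this example:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withReservedTail returns a child context whose deadline ends `buffer`
// before the parent's, leaving the caller that much time to actually use
// whatever the limiter protects.
func withReservedTail(parent context.Context, buffer time.Duration) (context.Context, context.CancelFunc) {
	deadline, ok := parent.Deadline()
	if !ok {
		return context.WithCancel(parent) // no deadline to shrink
	}
	return context.WithDeadline(parent, deadline.Add(-buffer))
}

func main() {
	parent, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	waitCtx, cancelWait := withReservedTail(parent, 100*time.Millisecond)
	defer cancelWait()

	d, _ := waitCtx.Deadline()
	fmt.Println("limiter wait must give up by:", d)
	// limiter.Wait(waitCtx) would then fail early enough to leave time for the real call.
}
```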
Merging since I got approvals, very much open to documentation improvements / things to try though. And I'll probably present it internally to the team and make some tweaks based on how that goes.
Implements a fix for cadence-workflow#6029, allowing _normal_ ratelimiter usage to reserve tokens and be able to cancel them, assuming no racing claims on the limiter.

In the course of benchmarking this, I discovered what I believe to be some fundamental flaws with how `*rate.Limiter` works, which effectively _require_ a wrapper like I've built here. So unless this has unsustainable latency or CPU costs, IMO it should be preferred over the underlying `*rate.Limiter` on correctness alone.

The good news is that latency seems entirely survivable, less than 2x slower in basically all cases (and usually much less). Surprisingly, it's even slightly _faster_ for one of our common uses (reserve -> cancel, used in multi-stage ratelimits, which I suspect we will use more heavily in the future).

# What this wrapper does

A few core things:
1. re-implements `*rate.Limiter` in terms of an interface, to allow mocks and wrappers (`*rate.Limiter` has `Reserve() *rate.Reservation` which can't be mocked in any reasonable way)
2. implements ^ the interface while using a `TimeSource` so tests can resist CPU noise more easily (will also require a context-wrapper in some cases, which will be done later when needed. there's a simple prototype in this commit.)
3. prevents `*rate.Limiter`-internal time from going backwards (leads to irrational behavior / violated limits)
4. synchronously cancels any reservations that are not immediately available (more efficient, matches our intended use, and has much more correct throughput when contended)

# The flaw

Time travel does weird things.

Essentially, it's a combination of:
- `*rate.Limiter`'s *use of* `time.Now()` is not monotonic, despite `time.Now()` *itself* being monotonic: `Now()` is collected _before_ the ratelimiter's mutex is held, so values may be arbitrarily delayed and out of order while waiting to acquire the mutex.
- `*rate.Limiter` allows its internal time to advance _and rewind_ by design, but it does not preserve enough information to e.g. rewind 1s and then advance 1s without affecting the next call.
- as a mitigating factor on all this, `*rate.Limiter` only updates its `last` time when an event is _allowed_. this means ^ this essentially only applies if time skew is greater than your limiter's rate, especially if it's >=2x or other time-changing methods are called.

With fast benchmarks this is very easy to trigger, IRL... I'm unsure. Some of our limits are set to >100,000 per second, or <10µs of mutex-imposed delay tolerated. With a tight loop in a benchmark I regularly exceed 100µs of latency between locks even without parallel calls (zero contention), so I suspect it's not all that rare.

In any case, because it explicitly allows time to rewind and the simpler methods do not acquire the lock before gathering `time.Now()`, concurrent use can lead to the following behavior. Assuming the limit refills a token every 5 time-periods, and the calls acquire the mutex in the following global order (as observed through the limiter's mutex) but with their original T=time.Now() values:

1. `Allow()` called at T0: successfully claims a token, emptying the token bucket.
   - The limiter's internal time, `last`, is now T0.
2. `Allow()` called at T1: fails, 0 tokens available.
   - The limiter's internal time does not advance, nothing changes.
3. `Allow()` called at T10: successfully claims a token.
   - The limiter's internal time advances from T0 to T10, and adds 2 tokens.
   - The 1 full token is claimed, the call is allowed, and 1 token remains after the call.
4. `Allow()` called at T0: successfully claims a token.
   - The limiter's internal time advances from T10 to T0. No tokens have been added because the new time is lower.
   - There is 1 token in the bucket, so the call is allowed. 0 tokens remain.
5. `Allow()` called at T5: successfully claims a token.
   - The limiter's internal time advances from T0 to T5, and adds a token to the bucket.
   - The 1 full token is claimed, the call is allowed, and 0 tokens remain after the call.

From an outside observer who knows "real" time has only moved from T0 to T10, _four_ calls have been allowed when only three should have been (one each at T0, T5, and T10). This can happen _any number of times_ between real-T0 and real-T10 if more old/new stale-time-alternating calls occur (especially with larger skew), and some benchmarks show this allowing _thousands_ of times more events than it should.

# The fix

Never allow the internal time to rewind.

It's pretty straightforward conceptually, and I am _absolutely thrilled_ that `*rate.Limiter` has these "pass in a time" APIs: they allow this kind of wrapper, both for mocking time and for implementing the fix. Implementing it is a bit subtle and fiddly (as shown in this PR), but thankfully possible.

This eliminates rapid thrashing (all old-time events simply occur "at the same time"), and as long as "real" time is used, it eventually converges on real-time-like throughput because real time keeps advancing at a steady rate when viewed at larger time scales. At small time scales there is no correct time available anyway, so only larger-scale behavior is important.

I've also chosen to "pin" reservations' time, because this allows canceling them well after they are claimed _if no other time-advancing calls have occurred_. Conceptually I believe this is sound, and benchmarks show essentially perfect timing between allowed events:

- Events that occur in the real-past / at the "same" time as the reservation (or after) can be allowed to use the returned token because no possibly-over-limit calls have occurred after it (in fact, none at all).
  - If using `time.Now()`, this token likely could not be returned because time has usually advanced, and canceling does nothing when that occurs.
  - You can see an isolated version of this in the pinned-time benchmarks, notice how its sequential call throughput is essentially perfect.
- If other calls have occurred "after" the reservation, returning a consumed token cannot be guaranteed safe / may not correctly limit throughput, so it is not returned.
  - e.g. burst-1 limit-1/s and allowing late cancels would allow: reserve (0 delay) -> wait 1s -> Allow() (true) -> cancel() -> Allow() (true), which means 2 allowed events in the same second, violating the limit.

# A simpler fix we can contribute upstream

Lock before acquiring `time.Now()`, not after.

For "normal" use (`Wait(ctx)` and `Allow()`), this would ensure that time progresses monotonically, which seems like an obvious assumption for users. Even high frequency events should behave correctly in this case.

Even if this is done, this wrapper still has some use to protect against rewinds when canceling while recovering more tokens. In fact, the entire wrapper would likely be completely unchanged, because we have no way to know what the internal time is when canceling...

... but all non-wrapper and non-at-time-passing users Go-ecosystem-wide would be much more correct. And adding some `AllowN(int)` (not `AllowN(time, int)`) time-free non-singular APIs would likely take care of the VAST majority of the remaining cases, leaving only Reserve-users to suffer through the details.

(I would need to think harder about it, but changing how CancelAt works to compare against `r.lim.last` might resolve it for cancellation too)
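For readers skimming the description, a compressed sketch of the core of the fix (lock first, read time second, clamp so the wrapped limiter never sees time rewind); names are illustrative and this is not the PR's exact code:

```go
package ratelimitsketch

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// monotonicLimiter: every call locks first, reads the clock second, and
// clamps the value so the wrapped *rate.Limiter's internal time never moves
// backwards.
type monotonicLimiter struct {
	mut    sync.Mutex
	latest time.Time // latest timestamp ever handed to the wrapped limiter
	lim    *rate.Limiter
}

func (m *monotonicLimiter) Allow() bool {
	m.mut.Lock()
	defer m.mut.Unlock()
	now := time.Now() // gathered under the lock, so callers are serialized in time order
	if now.Before(m.latest) {
		now = m.latest // never rewind; "old" events just happen "at the same time"
	} else {
		m.latest = now
	}
	return m.lim.AllowN(now, 1)
}
```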