upstream: Implement WRSQ Scheduler #14681
Conversation
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
fix shuffle to work past c++17 Signed-off-by: Tony Allen <[email protected]>
/retest
Retrying Azure Pipelines:
Signed-off-by: Tony Allen <[email protected]>
This reverts commit a954816. Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
// Outside of this case, object picking is logarithmic with the number of unique weights. Adding
// objects is always constant time.
//
// For the case where all object weights are the same, WRSQ behaves identical to vanilla
Does wrsq handle first pick determinism for the case that all hosts have the same weight?
This is something that will be addressed in a subsequent patch when plumbing WRSQ into the load balancer. In the case you mention, first pick determinism is fixed for WRSQ by simply adding hosts to the scheduler in a random order during refresh.
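For illustration, a minimal sketch of that idea (the helper name, Host placeholder, and add() signature are assumptions, not code from this PR): shuffling the hosts before inserting them randomizes which host sits at the front of each per-weight queue, so the first pick is no longer deterministic when all weights are equal.

#include <algorithm>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

// Illustrative only: insert hosts into the scheduler in a random order during
// refresh so that, with equal weights, the first pick varies across rebuilds.
template <class Host, class Scheduler>
void addHostsInRandomOrder(std::vector<std::shared_ptr<Host>> hosts, Scheduler& scheduler,
                           uint64_t seed) {
  std::shuffle(hosts.begin(), hosts.end(), std::mt19937_64(seed));
  for (auto& host : hosts) {
    scheduler.add(/*weight=*/1.0, host);
  }
}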
namespace Envoy {
namespace Upstream {

// Weighted Random Selection Queue (WRSQ) Scheduler
Looking at this comment after not thinking about this for 2 months, it can be improved. Some observations to reference later:
- Elaborate on why it's like vanilla RR or WRS based on the weight distribution.
- Beef up the note.
- Consider an ASCII diagram.
Took a look at this with fresh eyes. Adding a few more comments for the next round (after someone reviews).
bool empty() const override { return queue_map_.empty(); }

private:
using ObjQueue = std::queue<std::weak_ptr<C>>;
This may benefit from being a class that automatically sets rebuild_cumulative_weights_ when it's mutated.
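As a rough sketch of that suggestion (the wrapper name and method set are hypothetical; only rebuild_cumulative_weights_ comes from the PR), a thin wrapper around the per-weight queue could flip the rebuild flag on every mutation so call sites cannot forget to mark the cumulative weights stale:

#include <cstddef>
#include <memory>
#include <queue>

// Hypothetical wrapper: any mutation marks the scheduler's cumulative weights
// as needing a rebuild.
template <class C> class RebuildTrackingQueue {
public:
  explicit RebuildTrackingQueue(bool& rebuild_flag) : rebuild_flag_(rebuild_flag) {}

  void push(std::weak_ptr<C> obj) {
    queue_.push(std::move(obj));
    rebuild_flag_ = true; // e.g. bound to rebuild_cumulative_weights_
  }

  void pop() {
    queue_.pop();
    rebuild_flag_ = true;
  }

  const std::weak_ptr<C>& front() const { return queue_.front(); }
  bool empty() const { return queue_.empty(); }
  std::size_t size() const { return queue_.size(); }

private:
  std::queue<std::weak_ptr<C>> queue_;
  bool& rebuild_flag_;
};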
/assign akonradi
include/envoy/upstream/scheduler.h
Outdated
/**
 * Insert entry into queue with a given weight.
 *
 * @param weight entry weight.
 * @param entry shared pointer to entry.
 */
virtual void add(double weight, std::shared_ptr<C> entry) = 0;
I was toying around with removing add() in a separate PR since for all usages, all items are present at construction time. Does it provide value to be able to dynamically add instead of rebuilding?
I don't think so, but it doesn't seem to be adding much complexity. I'm not opposed to ripping it out, but I think it's out of scope for this PR.
// NOTE: This class only supports integral weights and does not allow for the changing of object
// weights on the fly.
Wait, what? This really shouldn't inherit from the Scheduler interface as written, then, since it doesn't actually implement the exposed API.
Also, I'm not sure how much we benefit from using floating-point weights. If we move forward with this scheduler, we should strongly consider switching to integral weights for the purpose of efficiency. The upside of floats is the dynamic range, so you can have weights of 10E1 and 10E10. I'm not sure that the relative difference between weights of 5.0, 5.00001, and 5.00002 really matters though.
Also, I'm not sure how much we benefit from using floating-point weights. If we move forward with this scheduler, we should strongly consider switching to integral weights for the purpose of efficiency.
It looks like we're using doubles because of the least request LB variations. If all of the object weights are 1, we want some fractional bias towards hosts with less active requests.
Since the LbEndpoint weights are uint32, we could consider casting the weights to uint64 internally and scaling everything by ~1e5 and performing integer math. I think that's a high enough precision to make this feasible. WDYT?
I'm not sure that's sufficient. With the existing implementation, weights of 0.0000001 and 0.0000002 still result in a 1:2 balancing ratio. We'd have to re-normalize within the scheduler any time a weight is added or changed. If we want to use integer representations, we'd need to expose that at the function interface.
Well, currently the only way to get weights like that is for them to be adjusted via LRLB or something like slow-start in #13176. There would need to be not only a weight of 1 specified for a host in the proto, but also O(tens-of-thousands) of outstanding requests to get down to a number like that, which the scaling wouldn't be able to represent.
If for some reason we did end up with those numbers, they would just round down to zero until whatever calculation doing the adjustments results in something > 1e-5. I still think this can work.
TBH I am not at all familiar with how the scheduler is actually used. If you say this will work in practice, I believe you. I'm arguing that this implementation doesn't match the Scheduler interface as proposed. If we want a minimum value, let's encode that either in a comment or (preferably) in the type of the weights.
I think what I'm asking for is to switch to fixed-point values. That allows us to encode in the type that, say with 2 decimal places of precision, 3.112 and 3.114 are sufficiently close as to be indistinguishable. Using integers as weights would just be a special case where # decimals = 0.
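One possible shape for such a fixed-point weight, purely as an illustration of the idea (the class, its name, and the chosen precision are assumptions, not code from this PR):

#include <cmath>
#include <cstdint>

// Illustrative fixed-point weight with two decimal places of precision: weights
// closer together than 0.01 become indistinguishable by construction.
class FixedWeight {
public:
  static constexpr uint64_t Scale = 100; // 10^2 -> two decimal places

  explicit FixedWeight(double w) : raw_(static_cast<uint64_t>(std::llround(w * Scale))) {}

  uint64_t raw() const { return raw_; }
  double toDouble() const { return static_cast<double>(raw_) / Scale; }

  friend bool operator==(FixedWeight a, FixedWeight b) { return a.raw_ == b.raw_; }

private:
  uint64_t raw_;
};

// With this representation, FixedWeight(3.112) == FixedWeight(3.114): both
// round to a raw value of 311, matching the example in the comment above.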
I just caught your update here, sorry for the delay. Introducing this to the base scheduler type makes sense to me.
There are some other things I'd like to knock out before getting back to this, but I'll knock it out and update.
I started doing this, but it quickly became a spookier change than I'd like to make across all the LB code. I encoded it in a comment instead, but if it makes folks wince we can chat this out a bit more and figure out a better path forward.
Okay, ignoring my comments regarding integral weights, we really should consider changing the interface if this implementation isn't going to have the ability to update object weights.
if (!prepicked_obj.expired()) {
  return std::shared_ptr<C>(prepicked_obj);
}
There's a race here that is probably also present in the EDF scheduler, where the referenced object gets deleted between the call to .expired() and creating the shared_ptr. I'm not sure if the use of shared_ptr here implies that the queue objects can span threads. Either way, we should prefer weak_ptr::lock() since it's a single atomic operation and should be less expensive.
Nice catch! TIL about weak_ptr::lock().
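A minimal sketch of the suggested pattern (the helper name is illustrative, not from the PR): lock() checks liveness and produces the shared_ptr in one atomic step, so there is no window between an expired() check and shared_ptr construction (which throws if the object is already gone).

#include <memory>

// lock() returns a non-null shared_ptr only if the object is still alive.
template <class C>
std::shared_ptr<C> tryGetPrepicked(const std::weak_ptr<C>& prepicked_obj) {
  if (std::shared_ptr<C> obj = prepicked_obj.lock()) {
    return obj;
  }
  return nullptr;
}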
using QueueMap = absl::flat_hash_map<double, ObjQueue>;

// Used to store a queue's weight info necessary to perform the weighted random selection.
struct QueueInfo {
  double cumulative_weight;
  double weight;
  ObjQueue* q;
};
There's some duplication of information here, where weight is stored both as a key in QueueMap and as a value in QueueInfo. We can use an absl::flat_hash_set and heterogeneous lookup to store the value inline in ObjQueue and still be able to look up by weight.
Sorry, I'm not sure I follow. Can you spell it out a bit more?
Sure! Since we're using a property of the value as the key, we can either have the same information in two places (what we're currently doing), or use absl::flat_hash_set with a custom hash function. That's an improvement on space usage, but then to look up a QueueInfo by weight, you'd have to construct a temporary one with the desired weight and check whether that's in the set or something - not great. absl::flat_hash_set supports heterogeneous lookup, though, where you can check for presence by providing a key of a different type, as long as it hashes to the same value. See https://abseil.io/tips/144.
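A hypothetical sketch of that suggestion, with a simplified QueueInfo and illustrative functor names (not code from the PR): key the set on the weight member and make both the hasher and the equality functor transparent so lookups can be done with a bare double.

#include "absl/container/flat_hash_set.h"
#include "absl/hash/hash.h"

struct QueueInfo {
  double cumulative_weight;
  double weight;
};

struct QueueInfoHash {
  using is_transparent = void; // opt in to heterogeneous lookup
  size_t operator()(const QueueInfo& info) const { return absl::Hash<double>()(info.weight); }
  size_t operator()(double weight) const { return absl::Hash<double>()(weight); }
};

struct QueueInfoEqual {
  using is_transparent = void;
  bool operator()(const QueueInfo& a, const QueueInfo& b) const { return a.weight == b.weight; }
  bool operator()(const QueueInfo& a, double weight) const { return a.weight == weight; }
  bool operator()(double weight, const QueueInfo& b) const { return weight == b.weight; }
};

using QueueInfoSet = absl::flat_hash_set<QueueInfo, QueueInfoHash, QueueInfoEqual>;

// Usage: set.find(2.5) locates the QueueInfo whose weight is 2.5 without
// constructing a temporary QueueInfo.

Note, as comes up later in this thread, that flat_hash_set elements are const, which is part of why this approach was ultimately dropped for this PR.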
Sorry, I've wasted a tremendous amount of time trying to guess my way through this and haven't had any luck. Is there some snippet of code you can share that does what you're asking? I think this may be common inside of Google, but externally there's not any example for me to base this off of.
Would this help?
envoy/source/common/common/hash.h
Line 119 in a0ca08b
absl::flat_hash_set<SharedString, HeterogeneousStringHash, HeterogeneousStringEqual>;
That does! Thanks, I missed that.
I did want to ask whether it's really worth the additional complexity to save the inline double for each unique weight. This scheduler will mostly be used in scenarios where there are not many unique weights per object, since its performance otherwise degrades to be much worse than EDF's. We can't use it for things like the least-request LB, where there are many individual weights.
I'd expect the additional memory utilization to be roughly equivalent to the hash table's unused slots (depending on the load factor). If we're concerned about this, since the number of unique weights is expected to be small, it may be more performant/efficient to simply keep these things in a vector and scan the vector for each lookup.
Yeah let's punt or drop this. I had forgotten that flat_hash_set imposes constness requirements that we don't want to hack around.
Signed-off-by: Tony Allen <[email protected]>
@tonya11en @akonradi what's the status of this PR? Tony, is this ready for another look by Alex? I couldn't quite suss out whether commit 8ef7453 from Tony addressed the concern in the comment.
It didn't address the heterogeneous lookup comment. Let me try to wrap this up today. It's actually a bit embarrassing-- I've had trouble getting it to build and I've found no examples to base this off of.
Signed-off-by: Tony Allen <[email protected]>
Checking in here, is this awaiting review?
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
@snowp I'm waiting for folks to take another look and also respond to #14681 (comment). I successfully made the hash set based on the example @jmarantz gave, but the compiler complains when I try to mutate elements referenced via heterogeneous lookup.
complains with:
with this type definition:
If it's not obvious what's going on with that failure and this redundant double isn't a dealbreaker for this patch, I'd like to just put in a TODO and field more comments. At this point, I've spent more time fiddling with this heterogeneous lookup than I did writing the original patch :(
Signed-off-by: Tony Allen <[email protected]>
/retest
Retrying Azure Pipelines:
@akonradi can you give this PR another pass? I think we can defer heterogeneous lookup to a followup PR.
Agreed on heterogeneous lookup. My objection to ignoring the weight update function still stands, though: we shouldn't claim to implement an interface and then ignore meaningful arguments provided by callers.
Signed-off-by: Tony Allen <[email protected]>
I went ahead and just made the WRSQ scheduler mutate weights. It should be in line with the interface now.
LGTM pending the minor changes requested
for (auto& it : queue_map_) {
  const auto weight_val = it.first;
  weight_sum += weight_val * it.second.size();
  cumulative_weights_.emplace_back(QueueInfo{weight_sum, weight_val, it.second});
Nit: this will result in a call to QueueInfo's move constructor within cumulative_weights_.emplace_back, in addition to the call to its (implicitly declared) converting constructor. Pass weight_sum and friends directly to emplace_back to remove the extra constructor call.
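A hypothetical, simplified illustration of the nit (not the PR's actual types; the queue pointer is dropped for brevity, and a matching constructor is assumed so emplace_back can construct the element in place):

#include <vector>

struct QueueInfo {
  QueueInfo(double cumulative, double w) : cumulative_weight(cumulative), weight(w) {}
  double cumulative_weight;
  double weight;
};

int main() {
  std::vector<QueueInfo> cumulative_weights;
  double weight_sum = 3.0;
  double weight_val = 1.5;
  // Before: builds a temporary QueueInfo, then move-constructs it into the vector.
  cumulative_weights.emplace_back(QueueInfo{weight_sum, weight_val});
  // After: the arguments are forwarded and the element is constructed in place.
  cumulative_weights.emplace_back(weight_sum, weight_val);
  return 0;
}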
// number of times equal to its weight.
for (uint32_t i = 0; i < weight_sum; ++i) {
  EXPECT_CALL(random, random()).WillOnce(Return(i));
  // The weights will not change with WRSQ, so the predicate does not matter.
This is out of date now.

{
  auto second_entry = std::make_shared<uint32_t>(42);
  auto first_entry = std::make_shared<uint32_t>(37);
Change this to something other than 37 so it's obvious that the peek isn't picking this one?
EXPECT_TRUE(sched.pickAndAdd({}) == nullptr);
}

TEST(WRSQSchedulerTest, ManyPeekahead) {
Please add a short comment describing what the objective of this test is, something like "Ensure that values returned via peeks match the values that are picked afterwards"
++weight5pick;
break;
default:
  EXPECT_TRUE(false) << "bogus value returned";
Nit: prefer the FAIL() macro. Same below
EXPECT_EQ(*peek, *p);
}

auto p = sched.pickAndAdd(f);
Please add a comment here describing what the weights look like after this, maybe something like "After this, weights are e1 -> 0, e2 -> 0, e3 -> 1"
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
@akonradi I addressed those comments. Thanks again for the review and patience-- this was on the back-burner for a while.
Same, sorry for the slow reviews
@tonya11en can you merge main to make sure we are up to date and then we can get this in and iterate? Thanks. /wait
Signed-off-by: Tony Allen <[email protected]>
Nice, thanks for pushing this forward @akonradi and @tonya11en. Let's ship and iterate.
Weighted Random Selection Queue (WRSQ) Scheduler
This patch implements the WRSQ scheduler as progress towards #14597. No changes are made to the WRR load balancer in this patch, so release notes and docs have been left out. However, it does include the scheduler interface (that the EDF scheduler now implements), benchmarks comparing WRSQ/EDF, and tests.
More context can be found either in #14597 or in the subsequent PR after this review closes that will contain the docs and release note.
The WRSQ scheduler keeps a queue for each unique weight among all inserted objects and adds each object to the queue matching its weight. When performing a pick operation, a queue is selected and an object is pulled from it. Each queue gets its own selection probability, weighted as the sum of the weights of the objects it contains. Once a queue is picked, you can simply pull from its front and still honor the expected selection probability of each object.
Adding an object will cause the scheduler to rebuild internal structures on the first pick that follows. This rebuild is linear in the number of unique weights among the inserted objects. Outside of this case, object picking is logarithmic in the number of unique weights (as opposed to the number of objects, as with EDF). Adding objects is always constant time. For the case where all object weights are the same, WRSQ behaves identically to vanilla round-robin. If all object weights are different, it behaves identically to weighted random selection. An added bonus is that this removes the requirement that all LB weights be 1 to avoid the EDF scheduler's overhead for vanilla RR.
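To make the description above concrete, here is a simplified, illustrative sketch of the pick path (names and details are mine, not this PR's source; it assumes positive integral weights so the modulo step is well defined, and omits the weight-update support added later in the review):

#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <queue>
#include <vector>

// One FIFO queue per unique weight, a lazily rebuilt cumulative-weight table,
// a weighted random choice of queue, then a pop/re-push from that queue, which
// yields round-robin behavior within a weight class.
template <class C> class WrsqSketch {
public:
  void add(double weight, std::shared_ptr<C> entry) {
    queues_[weight].push(std::move(entry));
    rebuild_cumulative_weights_ = true; // rebuilt lazily on the next pick
  }

  std::shared_ptr<C> pick(uint64_t random_value) {
    if (rebuild_cumulative_weights_) {
      rebuildCumulativeWeights(); // linear in the number of unique weights
    }
    if (cumulative_weights_.empty()) {
      return nullptr;
    }
    // Weighted random selection of a queue via binary search over the
    // cumulative weights: logarithmic in the number of unique weights.
    const double target =
        static_cast<double>(random_value % static_cast<uint64_t>(total_weight_)) + 1;
    auto it = std::lower_bound(cumulative_weights_.begin(), cumulative_weights_.end(), target,
                               [](const Entry& e, double t) { return e.cumulative < t; });
    auto& queue = queues_[it->weight];
    auto obj = queue.front();
    queue.pop();
    queue.push(obj); // rotate within the weight class -> round-robin among equals
    return obj;
  }

private:
  struct Entry {
    double cumulative;
    double weight;
  };

  void rebuildCumulativeWeights() {
    cumulative_weights_.clear();
    total_weight_ = 0;
    for (auto& [weight, queue] : queues_) {
      total_weight_ += weight * queue.size();
      cumulative_weights_.push_back(Entry{total_weight_, weight});
    }
    rebuild_cumulative_weights_ = false;
  }

  std::map<double, std::queue<std::shared_ptr<C>>> queues_;
  std::vector<Entry> cumulative_weights_;
  double total_weight_{0};
  bool rebuild_cumulative_weights_{false};
};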
The evidence for the performance claims above can be seen in the benchmark results:
(updated 8/11/2021)