
upstream: Implement WRSQ Scheduler #14681

Merged · 20 commits · Aug 13, 2021
Conversation

@tonya11en (Member) commented Jan 13, 2021

Weighted Random Selection Queue (WRSQ) Scheduler

This patch implements the WRSQ scheduler as progress towards #14597. No changes are made to the WRR load balancer in this patch, so release notes and docs have been left out. However, it does include the scheduler interface (that the EDF scheduler now implements), benchmarks comparing WRSQ/EDF, and tests.

More context can be found either in #14597 or in the subsequent PR after this review closes that will contain the docs and release note.

The WRSQ scheduler keeps a queue for each unique weight among the inserted objects and adds each object to the queue matching its weight. When performing a pick, a queue is selected and an object is pulled from it. Each queue gets its own selection probability, weighted as the sum of the weights of all objects it contains. Once a queue is picked, simply pulling from its front honors the expected selection probability of each object.

Adding an object causes the scheduler to rebuild its internal structures on the first pick that follows. This operation is linear in the number of unique weights among the inserted objects. Outside of this case, picking an object is logarithmic in the number of unique weights (as opposed to the number of objects with EDF). Adding objects is always constant time. When all object weights are the same, WRSQ behaves identically to vanilla round-robin; when all object weights are different, it behaves identically to weighted random selection. As a bonus, this removes the requirement that all LB weights be 1 in order to avoid EDF scheduler overhead when doing vanilla RR.
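To make the mechanism concrete, here is a minimal sketch of the approach (illustrative only: names are invented, the weak_ptr expiry handling from the actual patch is omitted, and the real implementation uses Envoy's injected random generator rather than a local RNG):

#include <algorithm>
#include <map>
#include <memory>
#include <queue>
#include <random>
#include <vector>

template <class C> class WrsqSketch {
public:
  // O(1): enqueue the object on the queue for its weight and mark the
  // cumulative-weight table dirty.
  void add(double weight, std::shared_ptr<C> entry) {
    queue_map_[weight].push(std::move(entry));
    rebuild_cumulative_weights_ = true;
  }

  // O(log(unique weights)) per pick; the first pick after an add pays an
  // O(unique weights) rebuild.
  std::shared_ptr<C> pick() {
    if (rebuild_cumulative_weights_) {
      rebuildCumulativeWeights();
    }
    if (cumulative_weights_.empty()) {
      return nullptr;
    }
    // Weighted random selection of a queue: draw in [0, total weight) and
    // binary-search the cumulative-weight table.
    std::uniform_real_distribution<double> dist(
        0.0, cumulative_weights_.back().cumulative_weight);
    const double r = dist(rng_);
    auto it = std::lower_bound(
        cumulative_weights_.begin(), cumulative_weights_.end(), r,
        [](const QueueInfo& qi, double v) { return qi.cumulative_weight <= v; });
    if (it == cumulative_weights_.end()) {
      --it; // guard against r landing exactly on the total due to rounding.
    }
    // All objects in a queue share a weight, so FIFO rotation within the
    // chosen queue preserves each object's expected selection probability.
    auto obj = it->q->front();
    it->q->pop();
    it->q->push(obj);
    return obj;
  }

private:
  using ObjQueue = std::queue<std::shared_ptr<C>>;
  struct QueueInfo {
    double cumulative_weight;
    double weight;
    ObjQueue* q;
  };

  // Each queue's selection weight is the sum of its objects' weights, i.e.
  // weight * queue size, accumulated into a running total for binary search.
  void rebuildCumulativeWeights() {
    cumulative_weights_.clear();
    double weight_sum = 0;
    for (auto& [weight, q] : queue_map_) {
      weight_sum += weight * q.size();
      cumulative_weights_.push_back(QueueInfo{weight_sum, weight, &q});
    }
    rebuild_cumulative_weights_ = false;
  }

  std::map<double, ObjQueue> queue_map_;
  std::vector<QueueInfo> cumulative_weights_;
  bool rebuild_cumulative_weights_{true};
  std::mt19937 rng_{std::random_device{}()};
};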

Evidence for the performance claims above can be seen in the benchmark results (updated 8/11/2021):

Run on (40 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 256 KiB (x20)
  L3 Unified 30720 KiB (x2)
Load Average: 21.13, 20.29, 15.47
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
splitWeightAddEdf/64             7.04 us         7.04 us        97415
splitWeightAddEdf/512            37.7 us         37.8 us        14428
splitWeightAddEdf/4096            386 us          386 us         2659
splitWeightAddEdf/16384          1691 us         1691 us          509
splitWeightAddWRSQ/64            4.19 us         4.20 us       165605
splitWeightAddWRSQ/512           28.5 us         28.6 us        24873
splitWeightAddWRSQ/4096           239 us          239 us         2958
splitWeightAddWRSQ/16384         1299 us         1299 us          547
splitWeightPickEdf/64             125 ns          125 ns      5595971
splitWeightPickEdf/512            153 ns          153 ns      4574393
splitWeightPickEdf/4096           182 ns          182 ns      3870913
splitWeightPickEdf/16384          196 ns          196 ns      3599974
splitWeightPickWRSQ/64           76.8 ns         76.8 ns      9038291
splitWeightPickWRSQ/512          80.5 ns         80.5 ns      8692647
splitWeightPickWRSQ/4096         84.8 ns         84.8 ns      8245446
splitWeightPickWRSQ/16384        86.2 ns         86.2 ns      8090606
uniqueWeightAddEdf/64            7.44 us         7.43 us       135596
uniqueWeightAddEdf/512           51.5 us         51.6 us        10000
uniqueWeightAddEdf/4096           409 us          409 us         2299
uniqueWeightAddEdf/16384         1877 us         1877 us          479
uniqueWeightAddWRSQ/64           4.79 us         4.81 us       146618
uniqueWeightAddWRSQ/512          36.1 us         36.2 us        19524
uniqueWeightAddWRSQ/4096          339 us          339 us         2094
uniqueWeightAddWRSQ/16384        1537 us         1537 us          445
uniqueWeightPickEdf/64            121 ns          121 ns      5749699
uniqueWeightPickEdf/512           144 ns          144 ns      4858103
uniqueWeightPickEdf/4096          187 ns          187 ns      3726080
uniqueWeightPickEdf/16384         214 ns          214 ns      3247980
uniqueWeightPickWRSQ/64           115 ns          115 ns      6111640
uniqueWeightPickWRSQ/512          157 ns          157 ns      4482860
uniqueWeightPickWRSQ/4096         236 ns          236 ns      2964728
uniqueWeightPickWRSQ/16384        297 ns          297 ns      2339000

Signed-off-by: Tony Allen <[email protected]>
@mattklein123 mattklein123 self-assigned this Jan 13, 2021
Base automatically changed from master to main January 15, 2021 23:02

@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Jan 16, 2021
fix shuffle to work past c++17

Signed-off-by: Tony Allen <[email protected]>
@tonya11en (Member, Author):

/retest

@repokitteh-read-only:
Retrying Azure Pipelines: retried failed jobs in envoy-presubmit (triggered by a #14681 comment from @tonya11en).

This reverts commit a954816.

Signed-off-by: Tony Allen <[email protected]>
// Outside of this case, object picking is logarithmic with the number of unique weights. Adding
// objects is always constant time.
//
// For the case where all object weights are the same, WRSQ behaves identical to vanilla
Reviewer:

Does wrsq handle first pick determinism for the case that all hosts have the same weight?

tonya11en (Member, Author):

This is something that will be addressed in a subsequent patch when plumbing WRSQ into the load balancer. In the case you mention, first pick determinism is fixed for WRSQ by simply adding hosts to the scheduler in a random order during refresh.
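A sketch of that refresh-time fix (illustrative placeholders; not code from the follow-up patch):

#include <algorithm>
#include <memory>
#include <random>
#include <vector>

// Randomizing insertion order means equally-weighted hosts land in their
// shared queue in a different order on every refresh, so the first pick is
// no longer deterministic. `Host` and `Scheduler` are hypothetical stand-ins.
template <class Host, class Scheduler>
void refreshHosts(std::vector<std::shared_ptr<Host>> hosts, Scheduler& scheduler,
                  std::mt19937& rng) {
  std::shuffle(hosts.begin(), hosts.end(), rng);
  for (auto& host : hosts) {
    scheduler.add(host->weight(), host);
  }
}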

namespace Envoy {
namespace Upstream {

// Weighted Random Selection Queue (WRSQ) Scheduler
tonya11en (Member, Author):

Looking at this comment after not thinking about this for 2 months, it can be improved. Some observations to reference later:

  • Elaborate on why it's like vanilla RR or WRS based on the weight distribution.
  • Beef up the note.
  • Consider an ASCII diagram.

tonya11en (Member, Author) left a comment:

Took a look at this with fresh eyes. Adding a few more comments for the next round (after someone reviews).

source/common/upstream/wrsq_scheduler.h (outdated; resolved)
bool empty() const override { return queue_map_.empty(); }

private:
using ObjQueue = std::queue<std::weak_ptr<C>>;
tonya11en (Member, Author):

This may benefit from being a class that automatically sets rebuild_cumulative_weights_ when it's mutated.
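A sketch of that idea (hypothetical; DirtyingQueue is an invented name): a thin wrapper whose mutating operations flip the flag so callers can't forget:

#include <memory>
#include <queue>

template <class C> class DirtyingQueue {
public:
  explicit DirtyingQueue(bool& rebuild_flag) : rebuild_flag_(rebuild_flag) {}
  void push(std::weak_ptr<C> obj) {
    rebuild_flag_ = true;
    q_.push(std::move(obj));
  }
  void pop() {
    rebuild_flag_ = true;
    q_.pop();
  }
  const std::weak_ptr<C>& front() const { return q_.front(); }
  size_t size() const { return q_.size(); }

private:
  bool& rebuild_flag_; // refers to the scheduler's rebuild_cumulative_weights_
  std::queue<std::weak_ptr<C>> q_;
};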

@akonradi (Contributor) commented May 5, 2021:

/assign akonradi

Comment on lines 38 to 44
/**
* Insert entry into queue with a given weight.
*
* @param weight entry weight.
* @param entry shared pointer to entry.
*/
virtual void add(double weight, std::shared_ptr<C> entry) = 0;
akonradi (Contributor):

I was toying around with removing add() in a separate PR since for all usages, all items are present at construction time. Does it provide value to be able to dynamically add instead of rebuilding?

tonya11en (Member, Author):

I don't think so, but it doesn't seem to be adding much complexity. I'm not opposed to ripping it out, but I think it's out of scope for this PR.

Comment on lines 37 to 38
// NOTE: This class only supports integral weights and does not allow for the changing of object
// weights on the fly.
akonradi (Contributor):

Wait, what? This really shouldn't inherit from the Scheduler interface as written, then, since it doesn't actually implement the exposed API.

Also, I'm not sure how much we benefit from using floating-point weights. If we move forward with this scheduler, we should strongly consider switching to integral weights for the purpose of efficiency. The upside of floats is the dynamic range, so you can have weights of 10E1 and 10E10. I'm not sure that the relative difference between weights of 5.0, 5.00001, and 5.00002 really matters though.

tonya11en (Member, Author), May 6, 2021:

> Also, I'm not sure how much we benefit from using floating-point weights. If we move forward with this scheduler, we should strongly consider switching to integral weights for the purpose of efficiency.

It looks like we're using doubles because of the least request LB variations. If all of the object weights are 1, we want some fractional bias towards hosts with less active requests.

Since the LbEndpoint weights are uint32, we could consider casting the weights to uint64 internally and scaling everything by ~1e5 and performing integer math. I think that's a high enough precision to make this feasible. WDYT?
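Roughly, the scaling idea (a sketch under the ~1e5 assumption above; toFixedWeight is an invented name):

#include <cstdint>

constexpr double kWeightScale = 1e5;

// Five decimal digits of weight precision survive the conversion; anything
// below 1e-5 rounds down to 0.
uint64_t toFixedWeight(double weight) {
  return static_cast<uint64_t>(weight * kWeightScale);
}

// toFixedWeight(1.0) == 100000, toFixedWeight(0.00001) == 1,
// toFixedWeight(0.000001) == 0.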

akonradi (Contributor):

I'm not sure that's sufficient. With the existing implementation, weights of 0.0000001 and 0.0000002 still result in a 1:2 balancing ratio. We'd have to re-normalize within the scheduler any time a weight is added or changed. If we want to use integer representations, we'd need to expose that at the function interface.

tonya11en (Member, Author):

Well, currently the only way to get weights like that is for them to be adjusted via the LRLB or something like slow-start in #13176. There would need to be not only a weight of 1 specified for a host in the proto, but also O(tens-of-thousands) of outstanding requests to get down to a number that small, which the scaling wouldn't be able to represent.

If for some reason we did end up with those numbers, they would just round down to zero until whatever calculation is doing the adjustments produces something > 1e-5. I still think this can work.

akonradi (Contributor):

TBH I am not at all familiar with how the scheduler is actually used. If you say this will work in practice, I believe you. I'm arguing that this implementation doesn't match the Scheduler interface as proposed. If we want a minimum value, let's encode that either in a comment or (preferably) in the type of the weights.

akonradi (Contributor):

I think what I'm asking for is to switch to fixed-point values. That allows us to encode in the type that, say, with 2 decimal points of precision, 3.112 and 3.114 are sufficiently close as to be indistinguishable. Using integers as weights would just be a special case where # decimals = 0.
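Sketched as a type (illustrative only; FixedWeight is invented and nothing like this landed in the PR):

#include <cstdint>

// With Decimals = 2, weights 3.112 and 3.114 both map to raw 311 and compare
// equal; Decimals = 0 degenerates to plain integer weights.
template <unsigned Decimals> class FixedWeight {
public:
  explicit FixedWeight(double w) : raw_(static_cast<uint64_t>(w * scale())) {}
  bool operator==(FixedWeight other) const { return raw_ == other.raw_; }
  uint64_t raw() const { return raw_; }

private:
  static constexpr uint64_t scale() {
    uint64_t s = 1;
    for (unsigned i = 0; i < Decimals; ++i) {
      s *= 10;
    }
    return s;
  }
  uint64_t raw_;
};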

tonya11en (Member, Author):

I just caught your update here, sorry for the delay. Introducing this to the base scheduler type makes sense to me.

There are some other things I'd like to knock out before getting back to this, but I'll knock it out and update.

tonya11en (Member, Author):

I started doing this, but it quickly became a spookier change than I'd like to make across all the LB code. I encoded it in a comment instead, but if it makes folks wince we can chat this out a bit more and figure out a better path forward.

akonradi (Contributor):

Okay, ignoring my comments regarding integral weights, we really should consider changing the interface if this implementation isn't going to have the ability to update object weights.

Comment on lines 56 to 58
if (!prepicked_obj.expired()) {
return std::shared_ptr<C>(prepicked_obj);
}
akonradi (Contributor), May 5, 2021:

There's a race here that is probably also present in the EDF scheduler, where the referenced object gets deleted between the call to .expired() and the construction of the shared_ptr. I'm not sure if the use of shared_ptr here implies that the queue objects can span threads. Either way, we should prefer weak_ptr::lock(), since it's a single atomic operation and should be less expensive.

tonya11en (Member, Author):

Nice catch! TIL about weak_ptr::lock().
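For reference, a sketch of the difference (hypothetical helper, not the patch's code):

#include <memory>

template <class C>
std::shared_ptr<C> tryPromote(const std::weak_ptr<C>& prepicked_obj) {
  // Racy version from the snippet above: the object can be destroyed between
  // expired() and the shared_ptr constructor, which then throws
  // std::bad_weak_ptr:
  //   if (!prepicked_obj.expired()) { return std::shared_ptr<C>(prepicked_obj); }
  //
  // lock() does the expiry check and the promotion as one atomic operation,
  // returning nullptr if the object is already gone.
  return prepicked_obj.lock();
}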

Comment on lines 73 to 80
using QueueMap = absl::flat_hash_map<double, ObjQueue>;

// Used to store a queue's weight info necessary to perform the weighted random selection.
struct QueueInfo {
double cumulative_weight;
double weight;
ObjQueue* q;
};
akonradi (Contributor):

There's some duplication of information here: weight is stored both as a key in QueueMap and as a value in QueueInfo. We can use an absl::flat_hash_set and heterogeneous lookup to store the value inline in ObjQueue and still be able to look up by weight.

tonya11en (Member, Author):

Sorry, I'm not sure I follow. Can you spell it out a bit more?

akonradi (Contributor):

Sure! Since we're using a property of the value as the key, we can either have the same information in two places (what we're currently doing), or use absl::flat_hash_set with a custom hash function. That's an improvement on space usage, but then to look up a QueueInfo by weight, you'd have to construct a temporary one with the desired weight and check whether that's in the set or something - not great. absl::flat_hash_set supports heterogeneous lookup, though, where you can check for presence by providing a key of a different type, as long as it hashes to the same value. See https://abseil.io/tips/144.

tonya11en (Member, Author):

Sorry, I've wasted a tremendous amount of time trying to guess my way through this and haven't had any luck. Is there some snippet of code you can share that does what you're asking? I think this may be common inside of Google, but externally there's not any example for me to base this off of.

akonradi (Contributor):

Would this help?

absl::flat_hash_set<SharedString, HeterogeneousStringHash, HeterogeneousStringEqual>;
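These functors follow the pattern from the Abseil tip linked above. A sketch of their general shape (an assumption about the pattern, not necessarily the exact definitions being quoted, and assuming SharedString converts implicitly to absl::string_view):

#include "absl/hash/hash.h"
#include "absl/strings/string_view.h"

struct HeterogeneousStringHash {
  using is_transparent = void; // NOLINT(readability-identifier-naming)
  // One string_view overload covers lookups by SharedString (via implicit
  // conversion) and by plain string_view alike.
  size_t operator()(absl::string_view s) const { return absl::Hash<absl::string_view>{}(s); }
};

struct HeterogeneousStringEqual {
  using is_transparent = void; // NOLINT(readability-identifier-naming)
  bool operator()(absl::string_view a, absl::string_view b) const { return a == b; }
};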

tonya11en (Member, Author), Jul 5, 2021:

That does! Thanks, I missed that.

I did want to ask whether it's really worth the additional complexity to save the inline double for each unique weight. This scheduler is mostly intended for scenarios where there are not many unique weights per object, since beyond that its performance degrades to be much worse than EDF's. We can't use this for things like the least-request LB, where there are many individual weights.

I'd expect the additional memory utilization to be roughly equivalent to the hash table's unused slots (depending on the load factor). If we're concerned about this, since the number of unique weights is expected to be small, it may be more performant/efficient to simply keep these things in a vector and scan the vector for each lookup.

akonradi (Contributor):

Yeah let's punt or drop this. I had forgotten that flat_hash_set imposes constness requirements that we don't want to hack around.

source/common/upstream/wrsq_scheduler.h (two outdated, resolved comment threads)
@jmarantz (Contributor) commented Jul 5, 2021:

@tonya11en @akonradi what's the status of this PR? Tony is this ready for another look by Alex? I couldn't quite suss out whether commit 8ef7453 from Tony addressed the concern in the comment.

@tonya11en (Member, Author):

> @tonya11en @akonradi what's the status of this PR? Tony is this ready for another look by Alex? I couldn't quite suss out whether commit 8ef7453 from Tony addressed the concern in the comment.

It didn't address the heterogeneous lookup comment. Let me try to wrap this up today.

It's actually a bit embarrassing-- I've had trouble getting it to build and I've found no examples to base this off of.

Signed-off-by: Tony Allen <[email protected]>
@snowp (Contributor) commented Jul 14, 2021:

Checking in here, is this awaiting review?

Signed-off-by: Tony Allen <[email protected]>
@tonya11en (Member, Author):

@snowp I'm waiting for folks to take another look and also respond to #14681 (comment).

I successfully made the hash set based on the example @jmarantz gave, but the compiler complains when I try to mutate elements referenced via heterogeneous lookup.

  void add(double weight, std::shared_ptr<C> entry) override {
    rebuild_cumulative_weights_ = true;

    auto it = queue_set_.emplace(weight).first;
    it->q.emplace(std::move(entry));
  }

complains with:

./source/common/upstream/wrsq_scheduler.h:69:11: error: no matching member function for call to 'emplace'
    it->q.emplace(std::move(entry));
    ~~~~~~^~~~~~~
test/common/upstream/wrsq_scheduler_test.cc:31:11: note: in instantiation of member function 'Envoy::Upstream::WRSQScheduler<unsigned int>::add' requested here
    sched.add(1, entries[i]);
          ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_queue.h:276:2: note: candidate function template not viable: 'this' argument has type 'const Envoy::Upstream::WRSQScheduler<unsigned int>::ObjQueue' (aka 'const queue<std::weak_ptr<unsigned int>>'), but method is not marked const
        emplace(_Args&&... __args)
        ^

with this type definition:

struct HeterogeneousQueueInfoHash {
  using is_transparent = void; // NOLINT(readability-identifier-naming)

  size_t operator()(double d) const {
    size_t ret;
    memcpy(&ret, &d, sizeof(d));
    return ret;
  }
  size_t operator()(QueueInfo qi) const {
    size_t ret;
    memcpy(&ret, &qi.weight, sizeof(qi.weight));
    return ret;
  }
};

struct HeterogeneousQueueInfoEqual {
  using is_transparent = void; // NOLINT(readability-identifier-naming)

  bool operator()(QueueInfo a, QueueInfo b) const { return a.weight == b.weight; }
  bool operator()(double a, double b) const { return a == b; }
  bool operator()(QueueInfo a, double b) const { return a.weight == b; }
  bool operator()(double a, QueueInfo b) const { return a == b.weight; }
};


using QueueSet =
    absl::flat_hash_set<QueueInfo, HeterogeneousQueueInfoHash, HeterogeneousQueueInfoEqual>;

If it's not obvious what's going on with that failure and this redundant double isn't a dealbreaker for this patch, I'd like to just put in a TODO and field more comments. At this point, I've spent more time fiddling with this heterogeneous lookup than I did writing the original patch :(
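For posterity, that failure is the flat_hash_set constness requirement mentioned above: set elements are only exposed as const, since mutating them in place could change their hash. A conventional workaround (sketched here as an assumption; not applied in this patch) is to mark the non-key member mutable so it can be mutated through the const element:

struct QueueInfo {
  double weight;      // key material: feeds the hash, must stay immutable
  mutable ObjQueue q; // not hashed, so mutation through a const element is safe
};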

@tonya11en (Member, Author):

/retest

@repokitteh-read-only:
Retrying Azure Pipelines: retried failed jobs in envoy-presubmit (triggered by a #14681 comment from @tonya11en).

@yanavlasov (Contributor):

@akonradi can you give this PR another pass? I think we can defer heterogeneous lookup to a followup PR.

@akonradi (Contributor) commented Aug 4, 2021:

Agreed on heterogeneous lookup. My objection to ignoring the weight update function still stands, though: we shouldn't claim to implement an interface and then ignore meaningful arguments provided by callers.

@tonya11en (Member, Author):

> Agreed on heterogeneous lookup. My objection to ignoring the weight update function still stands, though: we shouldn't claim to implement an interface and then ignore meaningful arguments provided by callers.

I went ahead and just made the WRSQ scheduler mutate weights. It should be in line with the interface now.

akonradi (Contributor) left a comment:

LGTM pending the minor changes requested

for (auto& it : queue_map_) {
const auto weight_val = it.first;
weight_sum += weight_val * it.second.size();
cumulative_weights_.emplace_back(QueueInfo{weight_sum, weight_val, it.second});
akonradi (Contributor):

Nit: this will result in a call to QueueInfo's move constructor within cumulative_weights_.emplace_back, in addition to the call to its (implicitly declared) converting constructor. Pass weight_sum and friends directly to emplace_back to remove the extra constructor call.
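I.e., roughly the following (this assumes QueueInfo gains a matching constructor, or that C++20 parenthesized aggregate initialization is available):

cumulative_weights_.emplace_back(weight_sum, weight_val, it.second);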

// number of times equal to its weight.
for (uint32_t i = 0; i < weight_sum; ++i) {
EXPECT_CALL(random, random()).WillOnce(Return(i));
// The weights will not change with WRSQ, so the predicate does not matter.
akonradi (Contributor):

This is out of date now.


{
auto second_entry = std::make_shared<uint32_t>(42);
auto first_entry = std::make_shared<uint32_t>(37);
akonradi (Contributor):

Change this to something other than 37 so it's obvious that the peek isn't picking this one?

EXPECT_TRUE(sched.pickAndAdd({}) == nullptr);
}

TEST(WRSQSchedulerTest, ManyPeekahead) {
akonradi (Contributor):

Please add a short comment describing what the objective of this test is, something like "Ensure that values returned via peeks match the values that are picked afterwards"

++weight5pick;
break;
default:
EXPECT_TRUE(false) << "bogus value returned";
akonradi (Contributor):

Nit: prefer the FAIL() macro. Same below

EXPECT_EQ(*peek, *p);
}

auto p = sched.pickAndAdd(f);
akonradi (Contributor):

Please add a comment here describing what the weights look like after this, maybe something like "After this, weights are e1 -> 0, e2 -> 0, e3 -> 1"

Signed-off-by: Tony Allen <[email protected]>
@tonya11en (Member, Author):

@akonradi I addressed those comments. Thanks again for the review and patience-- this was on the back-burner for a while.

@akonradi (Contributor):

Same, sorry for the slow reviews

@mattklein123 (Member):

@tonya11en can you merge main to make sure we are up to date and then we can get this in and iterate? Thanks.

/wait

mattklein123 (Member) left a comment:

Nice thanks for pushing this forward @akonradi and @tonya11en. Let's ship and iterate.
