Draft: Make max client connections configurable and change shard_for hash #11345

Conversation
To evaluate the new hash function relative to the stated requirements, this testing code was used (a rough sketch of such a harness appears after the list below). For each mapping, the charts show:

Current:
- The number of connections node 0 has to the rest of the brokers in a cluster
- A summary of the number of shards that use a given connection to a destination node
- A summary of the number of connections to other brokers a given shard has

Current, modified to allow for variable connection counts:
- The number of connections node 0 has to the rest of the brokers in a cluster
- A summary of the number of shards that use a given connection to a destination node
- A summary of the number of connections to other brokers a given shard has

Newest hash in this PR:
- The number of connections node 0 has to the rest of the brokers in a cluster
- A summary of the number of shards that use a given connection to a destination node
- A summary of the number of connections to other brokers a given shard has
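For reference, here is a rough, standalone sketch of what such a harness might look like. The shard_for stand-in below is a placeholder, not this PR's implementation, and the shard/node counts are arbitrary:

#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

// Placeholder mapping: substitute the implementation under evaluation here.
uint64_t shard_for(uint64_t src_shard, uint64_t dest_node, uint64_t total_shards) {
    return (src_shard + dest_node) % total_shards; // stand-in only
}

int main() {
    constexpr uint64_t total_shards = 16; // shards on the connecting node
    constexpr uint64_t num_nodes = 5;     // other brokers in the cluster

    // For every (destination node, owning shard) pair, count how many source
    // shards end up sharing that connection.
    std::map<std::pair<uint64_t, uint64_t>, uint64_t> shards_per_connection;
    for (uint64_t dest = 0; dest < num_nodes; ++dest) {
        for (uint64_t src = 0; src < total_shards; ++src) {
            ++shards_per_connection[{dest, shard_for(src, dest, total_shards)}];
        }
    }

    // Summarize how many connections each shard owns overall.
    std::map<uint64_t, uint64_t> connections_per_shard;
    for (const auto& [conn, users] : shards_per_connection) {
        ++connections_per_shard[conn.second];
        std::cout << "node " << conn.first << ", connection on shard " << conn.second
                  << ": used by " << users << " source shards\n";
    }
    for (const auto& [shard, conns] : connections_per_shard) {
        std::cout << "shard " << shard << " owns " << conns << " connections\n";
    }
    return 0;
}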
Results look great.
Can I just confirm this upgrades fine? I.e.: this doesn't affect any persisted state anywhere and different nodes also don't need to have the same implementation?
Adding some unit tests would probably also be a good idea.
, rpc_client_max_connections(
  *this,
  "rpc_client_max_connections",
  "The max client connections that will be open to a given broker.",
I think we need to be clear here that this is internal RPC traffic.
Maybe "Maximum internal RPC connections a broker will open to each other broker."
  *this,
  "rpc_client_max_connections",
  "The max client connections that will be open to a given broker.",
  {.example = "8"},
I am not sure what the defaults are or why the other examples above don't list them, but it's probably good to make reloadability and visibility explicit.
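Something along these lines is presumably what's being asked for. The metadata field names and enum values (needs_restart, visibility) are assumptions based on how other properties in configuration.cc are typically declared, and the description is the wording suggested above:

// Sketch only: field names, enum values, and the trailing default are assumed
// to follow the pattern of other properties and may not match this PR exactly.
, rpc_client_max_connections(
    *this,
    "rpc_client_max_connections",
    "Maximum internal RPC connections a broker will open to each other broker.",
    {.needs_restart = needs_restart::yes,
     .example = "8",
     .visibility = visibility::tunable},
    8)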
@@ -20,7 +20,18 @@ namespace rpc {
connection_cache::connection_cache(
  ss::sharded<ss::abort_source>& as,
  std::optional<connection_cache_label> label)
  : _label(std::move(label)) {
  : _max_connections(8)
Can this constructor just be delegated? I know we have something like a dummy binding for testing but not sure there is something like that for non-testing code.
Good point, I think I should probably just remove this constructor entirely. I added it mainly to avoid having to change a bunch of unit tests, but it's kind of a foot-gun now that I look at it again.
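For what it's worth, a standalone illustration of the delegation pattern being discussed; the class and types here are simplified stand-ins, not the PR's connection_cache:

#include <cstddef>

class cache_like {
public:
    // Primary constructor: takes the configured connection limit.
    explicit cache_like(std::size_t max_connections)
      : _max_connections(max_connections) {}

    // Legacy/default constructor delegates to the primary one with the old
    // hard-coded limit of 8, instead of duplicating the member init list.
    cache_like() : cache_like(8) {}

private:
    std::size_t _max_connections;
};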
uint64_t remainder = total_shards % _max_connections;
uint64_t ring_size = total_shards - remainder;

// Vary the starting shard of the ring depending on the connection hash.
So this is the actual "consistent hash" right? In that case I don't understand why we use remainder below for the bucket size.
Can you please add comments/links for where/how those magic primes (there is more than one!) came to be? Is there a paper, or is it an OSS implementation from elsewhere?
@piyushredpanda The primes came from https://planetmath.org/goodhashtableprimes; I'll re-add the attribution to the code. As for the hash method itself, it is vaguely inspired by an older consistent hash algorithm mentioned in the paper we reference for our jump_consistent_hash implementation here: https://arxiv.org/pdf/1406.2294.pdf
I have some results I didn't include that show this, but I was surprised how much just mixing in a prime number helps with ensuring connections are uniformly distributed. So I was mainly just mixing primes into the hashes to avoid clustering.
I guess boost::hash_combine does not do very good mixing? I wonder if absl::HashOf is better. It seems a bit weird to mix in a "magic prime" which seems to be optimized for use as the second argument to % in a hash table, and not for use as the seed of a mixing function like this.
It's worth noting that these primes aren't new: the original shard_for used a list of them from the same source too; it's just not that clear from the diff because it moved files.
absl::HashOf will work much better. The magic prime is mainly there since std::hash is just the identity function for integers, which doesn't result in a very good distribution of connections.
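For illustration, the absl::HashOf alternative mentioned above could look roughly like this, with plain integers standing in for the node id types:

#include <absl/hash/hash.h>

#include <cstddef>
#include <cstdint>

// Abseil mixes the inputs itself, so no manual seed prime or hash_combine is needed.
std::size_t connection_hash(int32_t src_node, int32_t dest_node) {
    return absl::HashOf(src_node, dest_node);
}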
  connection_hash, remainder + 1);
uint64_t ring_end_shard = (ring_size - 1) + ring_start_shard;

// Hash any shards not in [ring_start_shard, ring_end_shard] to a point
So this happens because of the truncation above, right? For example, with 11 shards and 8 connections we handle shards 9, 10, 11 here?
Yep, exactly. This will be used in any case where the shard count isn't a multiple of the number of connections.
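Spelling out the arithmetic for that example (11 shards, 8 connections), as I read the snippet above:

#include <cstdint>
#include <iostream>

int main() {
    uint64_t total_shards = 11, max_connections = 8;
    uint64_t remainder = total_shards % max_connections; // 11 % 8 = 3
    uint64_t ring_size = total_shards - remainder;       // 11 - 3 = 8
    // The ring covers 8 consecutive shards; the 3 shards that fall outside it
    // get re-hashed to a point inside the ring, per the comment in the diff.
    std::cout << "remainder=" << remainder << " ring_size=" << ring_size << "\n";
    return 0;
}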
  ss::shard_id src_shard,
  model::node_id n,
  model::node_id dest_node,
  ss::shard_id total_shards) const {
    if (total_shards <= _max_connections) {
        return src_shard;
    }

Having an overall explanation with a small ASCII chart (maybe use a line instead of a ring) would be great here, I think.
@ballard26 has probably set a new high bar for the best PR cover letter/detail here. 🥇
@StephanDollberg Definitely agree, will be adding in some unit tests for the hash function specifically before moving the PR out of the draft stage.
Basic question: does this need to be a hash at all? Can we not just assign the shard/client mapping explicitly, in which case it can be exactly optimal (e.g., a simple linear mapping of all the connections to node 0, then 1, etc., gives optimal distribution, right?). That is, since modifications to the set of nodes are rare and connections are basically ephemeral (we can tear one down and create one on another shard w/o much cost), it's not clear why we use a hash or consistent hash at all.
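A minimal standalone sketch of that "explicit linear mapping" idea, with simplified integer ids rather than the PR's types; purely illustrative:

#include <cstdint>

// Assign connection `conn_idx` (0 <= conn_idx < max_connections) for
// destination node `dest` by walking shards linearly, so connections are
// spread as evenly as possible across shards.
uint64_t owning_shard(
  uint64_t dest, uint64_t conn_idx, uint64_t max_connections, uint64_t total_shards) {
    return (dest * max_connections + conn_idx) % total_shards;
}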
Why did the existing hash underperform? Was it not sufficiently uniformly random?
@travisdowns It was basically that it didn't meet the three requirements I outlined in the PR description. The data I provided in the first comment for this PR goes into detail about how each hash function performs relative to the requirements I stated.
@travisdowns I did something similar in the current "hash" function, though I spaced out connections to the same broker with an equal number of shards between each connection. I had hoped this could give us better cache locality compared to a linear mapping. I also had to handle some edge cases, like when the shard count isn't a multiple of the requested connection count. It can definitely be made perfectly optimal if we're willing to store some information about the allocations in the connection_cache.
@ballard26 - I guess maybe I'm misunderstanding what we mean by a "hash function". As I see it, the best hash function can only map inputs to shards uniformly at random, which will never produce a particularly good result in terms of your requirements due to randomness. I.e., even "uniform random" selections aren't going to meet those requirements in general. So I guess by "hash function" we mean arbitrary logic which maps shards to connections, and it can take the whole list of input shards + nodes in the cluster as input?

Right, that's fine: I wasn't suggesting linear was the best, only that it was the simplest example of something that would satisfy all the requirements.

How does cache locality play into this? I'm not following.
Right you are. At this point it'd be a misnomer to call what I wrote a hash function. It's just a function that tries to map connections to shards in a way that satisfies my stated requirements.

Yeah.

It's more to do with locality for the data being sent rather than the connection itself. If a shard wants to use a connection on a different shard, it'd be nice if they shared, say, the same L2 cache, as it's likely the data that needs to be sent over that connection will already be there. Though I've noticed at least one new architecture (AMD's Zen 3) shares an L3 cache among all cores and has private per-core L1 & L2 caches. So maybe it's not something that's important to optimize for anymore.
uint64_t connection_hash = 201326611; // magic prime
boost::hash_combine(connection_hash, std::hash<model::node_id>{}(src_node));
boost::hash_combine(
  connection_hash, std::hash<model::node_id>{}(dest_node));
absl::HashOf is a more concise way to hash several values, of presumably similar quality.
(though I see now this comes from the existing mapping function which had moved files)
@@ -34,11 +34,11 @@ class connection_cache final
using underlying = std::unordered_map<model::node_id, transport_ptr>;
using iterator = typename underlying::iterator;

static inline ss::shard_id shard_for(
ss::shard_id shard_for(
Seems like a good time to add a comment to this function, especially given the ambiguity of some of the parameter names.

As I understand it: given that we are currently on (self, shard) with max_shards total shards, we would like to know the shard id (the return value) we should use to connect to node node.
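A possible comment along those lines; the reading of the parameters here, in particular n being this node's own id, is my interpretation of the thread rather than anything stated in the PR:

/// Returns the shard on this node whose cached connection should be used to
/// reach `dest_node`.
///   src_shard    - the shard the caller is currently running on
///   n            - this node's own id (i.e. "self"), per the reading above
///   dest_node    - the broker we want to talk to
///   total_shards - the number of shards on this node (smp::count)
ss::shard_id shard_for(
  ss::shard_id src_shard,
  model::node_id n,
  model::node_id dest_node,
  ss::shard_id total_shards) const;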
}

connection_cache::connection_cache(
  config::binding<size_t> max_connections,
What is the point of passing it through as a binding? The value is used once and saved in a member, so it could just be a size_t, which would simplify things.
Good catch. I wasn't sure at first if this is something we'd want to make dynamically configurable without a restart, but I don't believe we do, so I'll just switch this over to size_t.
I would add a 4th desirable requirement to @ballard26's list: a shard that owns a connection to the destination should use its own connection (a "self-preference" rule).

This is trivially true when max_connections >= smp::count, since we always return src_shard.
OK sure, but how does the current approach or any contemplated approach maximize this kind of locality? Wouldn't you have to examine the topology of the system and make explicit decisions based on the topology<->shard mapping to do that? Or are you saying that shard IDs have some correlation to topology? [1]

FWIW basically no modern big CPUs have cores that share an L2 cache unless they are hyper-siblings, in which case they also share an L1 cache. The basic rule is that a physical core has private L1 and L2 and shares the L3 with "a set of other cores". The cores that share an L3 are generally all the cores on a socket (most AWS boxes are dual socket), though on AMD it's a bit more complicated because even within the shared domain L3 access is pretty NUMA due to the CCX boundaries.

[1] Well there is some if you assume shard ID maps to an equivalent or at least "shifted" cpuid, but it works both ways: the closest cores topology-wise (hyper-siblings) are usually far apart numerically (e.g., 0 and N for an N*2 lcore system), but then CPUs on the same socket are usually contiguous. So basically close/far numerically isn't very useful.
So I guess the main problem with just using an even mapping without "hashing" is that the broker IDs:

Shard ID doesn't really have any of those problems, but the combination of shard + broker does. So like you say, maybe a stateful connection cache is the way to go? Basically you'd discover the broker IDs over time or something and maintain an explicit mapping table? Because it really seems like, while you are in here, we should try to remove this source of imbalance completely.
The current approach doesn't try to use the "self" shard either, so I guess there is kind of a double performance cliff when you go over 8 shards currently: not only do you get the imbalance that originally drove this change, you also go from 100% connection locality to say 6.25% connection locality on a 16-shard system (1/16), even though that could be 50% if the self-preference rule was used.
Interesting related change (PR 8 !!): so actually the source shard for a connection can affect the port selected (on the local side), which in turn can affect the shard assigned to the connection on the remote side (if
Yeah, these are the problems I ended up running into when trying to have a perfectly distributed stateless solution. The distribution function I ended up with is pretty close to meeting the requirements stated in the PR. However, the hashing that I ended up using still results in some collisions that skew the distribution of connections per shard. Storing some state in the connection_cache, like you suggest, may be the way to go.
Shards not using their own connections was definitely a nasty surprise when I went to analyze the results, and it was a behavior I made sure the new distribution function didn't have. One potential "benefit" could be that it means no shard has priority access to a connection. I.e., if none of the shards use their own connections, then each of them will have around the same latency to send data over a connection. Vs. if some shards use their own connection, they may be able to send data over it without having to schedule a task in the queue. This could lead to some partitions having better latency than others in the worst case. That being said, the locality and reduced inter-shard communication we get from using a connection on the same shard is much more important.
@ballard26 - we should make this property dynamic, so the k8s operator can take this into account and get it upstreamed into the cloud. cc: @joejulian @alejandroEsc and @RafalKorepta
During testing with OMB it was found that traffic from internal RPC clients can cause reactor utilization on some shards to be 40% higher than on others. This is largely caused by poor distribution of client connections to other brokers amongst shards.

Looking at RPC client in + out bytes in the graph below, we can see that some shards are processing 4x the throughput of others, which explains a lot of the extra reactor utilization on some shards.

The first change this PR makes is to make the number of client connections to each of the brokers user configurable. In the chart below, the allowed connections was set to equal the number of shards on a given broker. This allows each shard to have its own connections to all of the brokers in the cluster, and we can see in the chart that the standard deviation for reactor utilization is reduced by 50%.
In cases where increasing the number of client connections to other brokers to equal the shard count is impractical, this PR also attempts to improve the existing shard_for hashing function to better match a set of stated requirements, the first being that at most max_connections connections are opened to every other node in the cluster.
to every other node in the cluster.Also this PR is the first of probably two that will focus on fixing the mentioned issue. There will be a coming PR that tries to move as much work as possible(i.e, encoding the request) to the shard that requests a message to be sent. Rather than doing it all on the shard that happens to hold the connection to the broker the message is to be sent to.
Backports Required
Release Notes