NetworkTransport make pipelining configurable and default to max 2 in flight. #541
Conversation
Great find and amazing description with the visualizations to make the issue easier to follow. I left one minor nit suggestion, but overall approving...we should get others with more experience with this package to also approve.
net_transport.go (outdated diff)
```go
if maxInFlight < 1 {
	// Default (zero or nonsense negative value given) to 2 (which translates to
	// a zero length channel buffer)
	maxInFlight = 2
}
```
[NIT] For readability, can we use a constant for 2 that describes it as the "default", and use it here and in AppendEntriesPipeline() and newNetPipeline()?
🤔 I was just doing this, and I think we should make this case a constant like the other defaults.
But I realised that all the 2s in newNetPipeline and AppendEntriesPipeline probably should not be changed. The rationale is: they are 2 because 2 is the minimum value that is possible in our current implementation - they are guarding against panics, or against using code that can't work with a value specifically less than 2.
For example, if we ever decided that the best default was actually 1 (disabling pipelining) or maybe a bit more than two, every usage except this one should still be 2, since that's a property of the actual code implementation, not just the default value.
Do you think it's worth making two separate constants, something like DefaultMaxInFlight = 2 and minValidMaxInFlight = 2, so that we can show that we are using the 2 to protect the implementation from doing something invalid in those other cases, not necessarily because it's the default?
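A minimal sketch of what that two-constant approach could look like (the names are the ones proposed in this thread, not necessarily what was merged):

```go
const (
	// DefaultMaxInFlight is what MaxInFlight falls back to when the config
	// leaves it at zero (or a nonsense negative value).
	DefaultMaxInFlight = 2

	// minValidMaxInFlight is the smallest value the pipelining code path can
	// work with: the send-then-enqueue pattern always allows two requests on
	// the wire, so anything lower must fall back to synchronous AppendEntries.
	minValidMaxInFlight = 2
)
```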
I just pushed a commit with the approach above - two separate constants to express the semantics of the subtly-different 2 values!
Thank you for putting this together @banks! I have an inline question but nothing blocking.
I'm also wondering about the implications of this for a gRPC transport. The PR that @mkeeler put together to add a gRPC transport (#538) leverages streaming to achieve pipelining of append entries; I'm wondering if that ends up reproducing the same behaviour we are trying to avoid with this PR 🤔?
net_transport.go (outdated diff)
```go
if n.maxInFlight < 2 {
	// Pipelining is disabled since no more than one request can be outstanding
	// at once. Skip the whole code path and use synchronous requests.
	return nil, ErrPipelineReplicationNotSupported
}
```
What is the implication of making a pipeline of size 1 instead of returning an error in that use case? With this, if we decide to thread this config up to Consul, would we need to add extra logic to call the right API?
A pipeline of size one doesn't make any sense with our current implementation and will panic.
This is because we only block on the buffer after sending the next request, and because the first request sent is immediately consumed by the receiving goroutine. That means even with a buffer size of zero you still allow 2 requests to be in flight: one because the receiving goroutine immediately consumes and unblocks the buffer while it waits, and the second because we send it before pushing to inflight, where we block waiting for the first response.
We could change the behaviour of the pipeline to push to the channel first before sending, so that a zero-buffered chan would correspond to MaxInFlight = 1... But I don't see much benefit to that - it would just make it ambiguous whether MaxInFlight = 1 means "no pipeline at all" or "use the pipelining code path for parallel decoding/encoding, but don't allow more than one outstanding request". In theory that still has some potential performance benefit, but I was not able to measure one, so I don't think it's worth adding the extra complexity/risk by changing how pipelining works and making there be even more subtle config options.
Does that make sense?
The flip side to this is that it might in some ways be simpler to make the config hide the complexity less, by exposing a separate DisablePipelining and PipelineBufferSize, but then operators have to work out which thing they want and we still have to document all the unintuitive stuff like "Buffer size of 0 actually allows 2 requests to be in flight". I don't have a strong opinion but chose to lean towards making the config map more directly to the behaviour it enables rather than obscure implementation details.
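To illustrate the buffer semantics described above, here is a standalone toy sketch (not the transport's actual code) of the send-then-enqueue pattern: with maxInFlight = 2 the channel buffer is zero-length, yet two requests can still be on the wire at once.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	maxInFlight := 2
	// Mirrors the pattern described above: buffer length is maxInFlight - 2.
	inflight := make(chan int, maxInFlight-2)

	// Response reader: immediately pulls the oldest in-flight request off the
	// channel and waits for its (simulated) response, unblocking the sender.
	go func() {
		for id := range inflight {
			time.Sleep(2 * time.Millisecond) // follower commit + return trip
			fmt.Printf("response received for request %d\n", id)
		}
	}()

	for id := 1; id <= 5; id++ {
		// The request is "sent" *before* we block on the in-flight channel,
		// which is why even a zero-length buffer allows 2 requests in flight.
		fmt.Printf("request %d sent\n", id)
		inflight <- id
	}
	close(inflight)
	time.Sleep(20 * time.Millisecond) // let the reader drain before exiting
}
```

Note that MaxInFlight = 1 would make the buffer length negative, and make panics on a negative channel capacity, which is the panic referred to above.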
The point I was trying to make is mostly that, from a UX point of view, I expect a pipeline size of 1 or 0 (depending on how we design it) to work as if I called AppendEntries() without any pipelining. I guess we already missed that opportunity when we made the choice to have separate APIs for the pipelining option, way before this PR. The user, in that case raft, needs to deal with making the right choice, which is already in place.
I agree that hard failing, by panic, is a reasonable choice to avoid misconfiguration here.
It's been a looong time since I've been over this code, so thanks to you and @mkeeler for walking me through it offline.
I suspect we'll be referring back to this excellent PR for years to come, so it might be worth adding something like the following to the tl;dr or Conclusions sections. They currently seem to focus on the parameters and effects while not mentioning the shift in mechanism that adjusting the parameters (ideally) produces.
Since the leader batches entries from the follower's last acknowledged index, the fewer pipelined requests you allow, the more opportunity you leave for the leader to batch. There is an optimal balance between in-flight requests and batching, and the current max of 128 pipelined requests is not it!
...or something to that effect.
Excellent PR, Paul! As a newcomer to this codebase, I very much appreciate the time taken to outline the specifics on this change and the comprehensive real-world examples.
net_transport_test.go (outdated diff)
```go
}()

select {
case <-errCh:
```
There is a bug here: we aren't capturing the error from the channel to use in the assertion below...
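The fix is presumably something like the following fragment slotted into the existing select (a sketch only, assuming testify's require is in scope for the assertion):

```go
case err := <-errCh:
	// Capture the received error so the assertion checks what the goroutine
	// actually reported rather than silently discarding it.
	require.NoError(t, err)
```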
Thanks for all the feedback folks. Finally got around to the last few bits here, but it's all minor: docs comments, constant replacement, or a small change to just ignore negative values (i.e. treat them like "no pipelining" rather than treating them like the "default amount of pipelining").
This includes upgrading to Hashicorp Raft 1.5, which brings in some performance improvements. See hashicorp/raft#541 for more details.
Thanks for your hard work. All my rqlite testing passes with this change in place, so I'm looking forward to seeing if I can see any performance improvements with this release.
BREAKING CHANGES:
* extensions: The Lua extension now targets local proxy listeners for the configured service's upstreams, rather than remote downstream listeners for the configured service, when ListenerType is set to outbound in extension configuration. See the CVE-2023-2816 changelog entry for more details.

SECURITY:
* Update the UBI base image to 9.2.
* Upgrade golang.org/x/net to address CVE-2022-41723.
* Upgrade to use Go 1.20.4. This resolves vulnerabilities CVE-2023-24537 (go/scanner), CVE-2023-24538 (html/template), CVE-2023-24534 (net/textproto) and CVE-2023-24536 (mime/multipart). Also, golang.org/x/net has been updated to v0.7.0 to resolve CVE-2022-41721, CVE-2022-27664 and CVE-2022-41723.
* extensions: Disable remote downstream proxy patching by Envoy Extensions other than AWS Lambda. Previously, an operator with service:write ACL permissions for an upstream service could modify Envoy proxy config for downstream services without equivalent permissions for those services. This issue only impacts the Lua extension. [CVE-2023-2816]

FEATURES:
* hcp: Add a new metrics sink to collect, aggregate and export server metrics to HCP in OTEL format.

IMPROVEMENTS:
* Fixes a performance issue in Raft where commit latency can increase by 100x or more when under heavy load. For more details see hashicorp/raft#541.
* agent: add a configurable maximum age (default: 7 days) to prevent servers re-joining a cluster with stale data.
* agent: add new metrics to track cpu, disk and memory usage for server hosts (defaults to: enabled).
* connect: update supported envoy versions to 1.22.11, 1.23.8, 1.24.6, 1.25.4.
* envoy: add MaxEjectionPercent and BaseEjectionTime to passive health check configs.
* hcp: Add support for linking existing Consul clusters to the HCP management plane.
* logging: change the snapshot log header from agent.server.snapshot to agent.server.raft.snapshot.
* peering: allow re-establishing a terminated peering from a new token without deleting the existing peering first.
* peering: gRPC queries for TrustBundleList, TrustBundleRead, PeeringList, and PeeringRead now support blocking semantics, reducing network and CPU demand. The HTTP APIs for Peering List and Read have been updated to support blocking.
* raft: Remove expensive reflection from raft/mesh hot path.
* xds: rename envoy_hcp_metrics_bind_socket_dir to envoy_telemetry_collector_bind_socket_dir to remove HCP naming references.

BUG FIXES:
* Fix a bug where decoding some Config structs with unset pointer fields could fail with reflect: call of reflect.Value.Type on zero Value.
* acl: (Enterprise only) Check permissions in the correct partition/namespace when resolving a service in a non-default partition/namespace.
* acl: Fix an issue where the anonymous token was synthesized in non-primary datacenters, which could cause permission errors when federating clusters with ACL replication enabled.
* acls: Fix an ACL bug that can result in sidecar proxies having incorrect endpoints.
* connect: Fix multiple inefficient behaviors when querying service health.
* gateways: Fix a bug where targeting a virtual service defined by a service-resolver was broken for HTTPRoutes.
* grpc: ensure the grpc resolver correctly uses lan/wan addresses on servers.
* namespaces: adjust the return type from the HTTP list API to return the api module representation of a namespace. This fixes an error with the consul namespace list command when a namespace has a deferred deletion timestamp.
* peering: Fix an issue where modifying the list of exported services did not correctly replicate changes for services that exist in a non-default namespace.
* peering: Fix an issue where peer streams could incorrectly deregister services in various scenarios.
* peering: ensure that merged central configs of peered upstreams for partitioned downstreams work.
* xds: Fix a possible panic that can occur when generating clusters before the root certificates have been fetched.
This PR changes the default behaviour of the "pipelining" optimization in our NetworkTransport, which is used by all of HashiCorp's products. This feature has been found to cause major performance problems in practice, although it seems like it should be an optimization.
TL;DR Summary
* Pipelined AppendEntries replication now defaults to MaxInFlight = 2.
* The previous behaviour can be restored by setting MaxInFlight = 130 in the NetworkTransport config.

This issue was so unintuitive to us that we thought there must be some mistake. We have gone to significant lengths to explain to ourselves what is happening, as well as test different settings in actual Consul clusters in a lab setting. Read on for more details.
Replication Basics
The problem with pipelining is that it interacts unintuitively with our batching optimization: specifically, the batching that occurs during follower replication in Raft.pipelineReplicate (raft/replication.go, lines 480 to 482 at 60ddd51).
(There are other sources of batching in our raft library, such as the "group commit" a leader does before writing to its own log, which is not the same thing although it could end up correlating.)
Each call to r.pipelineSend() will send all messages since the last one the follower has been sent, up to the most recent one on the leader's disk, in a single batch. It will also block if the pipeline has hit its maximum in-flight RPCs, until the next response comes back from the follower and so is no longer "in-flight". When it does block, this causes batch sizes to increase, to catch the follower up faster on the next send, and so on.

Until now NetworkTransport has always used a constant of 128 as the size of the buffered channel used to control the number of in-flight RPCs (raft/net_transport.go, lines 28 to 30 at 60ddd51).
Meanwhile, the follower handles each AppendEntries RPC in strict serial order. There is a single goroutine running handleConn per inbound TCP connection from the leader, and it calls handleCommand serially inline. handleCommand reads the request from the wire and decides what to do; in the case of an AppendEntries with actual data, it calls processLogs synchronously (which waits for the write to disk to complete) before writing a response to the wire and returning to accept the next request.

In a happy case where writes are far enough apart that they can be written to disk individually, all is well:
This diagram models a case where the follower takes 2 time units (we'll say milliseconds here as it's close to realistic but the numbers are arbitrary) to write each log to disk, and the round trip time between the leader and follower is 0.5ms. This is approximately what we saw when testing for real in AWS on an io2 Volume across AZs in the same region.
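For reference, the serial shape described above looks roughly like this. This is a deliberately simplified sketch, not the library's actual handleConn/handleCommand (which do considerably more and use a different codec); it only shows the decode, synchronous commit, respond loop:

```go
package sketch

import (
	"encoding/gob"
	"net"
)

type appendEntriesRequest struct{ Entries [][]byte }
type appendEntriesResponse struct{ Success bool }

// handleConn runs once per inbound connection from the leader. Each RPC is
// fully decoded, committed to disk, and answered before the next one is read,
// so the follower processes AppendEntries strictly serially.
func handleConn(conn net.Conn, processLogs func([][]byte) error) {
	dec := gob.NewDecoder(conn)
	enc := gob.NewEncoder(conn)
	for {
		var req appendEntriesRequest
		if err := dec.Decode(&req); err != nil {
			return // connection closed or stream corrupt
		}
		err := processLogs(req.Entries) // synchronous: waits for the disk write
		if encErr := enc.Encode(appendEntriesResponse{Success: err == nil}); encErr != nil {
			return
		}
	}
}
```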
The Problem
The problem occurs when the write rate increases. As soon as there are enough requests per second that the next one comes in while the last is still being written on the follower, the pipeline starts to fill up. The scenario below has a max in-flight of 5, not 128 like the actual code, to make it more readable! These diagrams are generated using LaTeX and a simple Go program.

The same thing happens, just a little more slowly, if the rate is only just quicker than the disk can commit, but we show it at about double here to make the pattern easier to see. Pretty soon the pipeline channel is full, and it will stay full for as long as the write rate is maintained.
That means even though batching is kicking in to ensure the follower does keep up with replication, the latency of each request is now on average about 5x longer.
That's with a max in-flight of only 5; with 128, as this library currently has, you very quickly get into a case where the average latency is about 130x the mean time to commit to disk, when it should be only just over 1x!
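This is not the program that generated the diagrams in this PR, but a minimal queueing sketch under the same assumptions (a strictly serial follower, batched sends, fixed RTT and commit time): once the pipeline stays full, latency settles at roughly the in-flight limit multiplied by the disk commit time.

```go
package main

import "fmt"

func main() {
	const (
		rttHalf    = 0.25 // ms, one-way network latency (0.5ms RTT)
		commitTime = 2.0  // ms, follower disk write per AppendEntries batch
		interval   = 1.0  // ms between client writes (arriving faster than the disk)
		numWrites  = 2000
	)
	arrival := make([]float64, numWrites)
	for i := range arrival {
		arrival[i] = float64(i) * interval
	}

	for _, maxInFlight := range []int{2, 5, 128} {
		ack := make([]float64, numWrites) // when each write's response reaches the leader
		var responses []float64           // response arrival time of each RPC, in send order
		followerFree, now := 0.0, 0.0
		sent := 0
		for sent < numWrites {
			if now < arrival[sent] {
				now = arrival[sent] // wait until the next unsent entry exists
			}
			// Respect the in-flight limit: if full, wait for the oldest response.
			if len(responses) >= maxInFlight {
				if oldest := responses[len(responses)-maxInFlight]; oldest > now {
					now = oldest
				}
			}
			// Batching: one RPC carries every entry that has arrived but not been sent.
			batchEnd := sent
			for batchEnd < numWrites && arrival[batchEnd] <= now {
				batchEnd++
			}
			// The follower handles RPCs strictly serially.
			start := now + rttHalf
			if followerFree > start {
				start = followerFree
			}
			followerFree = start + commitTime
			respAt := followerFree + rttHalf
			for i := sent; i < batchEnd; i++ {
				ack[i] = respAt
			}
			responses = append(responses, respAt)
			sent = batchEnd
		}

		mean, worst := 0.0, 0.0
		for i, a := range ack {
			lat := a - arrival[i]
			mean += lat / numWrites
			if lat > worst {
				worst = lat
			}
		}
		fmt.Printf("maxInFlight=%3d  mean latency ~%6.1fms  worst ~%6.1fms\n",
			maxInFlight, mean, worst)
	}
}
```

Plugging in numbers like those in the "Where Pipelining Helps" section below (a large RTT, a fast disk, and writes arriving slower than the disk can commit) tends to reverse the ranking, which is exactly the trade-off that section describes.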
We observed this in real-world performance testing in Consul: mean commit times were around 5ms right up to the point where the inbound rate crossed the threshold, and then suddenly every request was taking about 300ms. All this while the actual time it took for a follower to write to disk stayed constant at around 2ms. So why were commit times hundreds of times larger? Because of pipelining.
The Solution
Perhaps surprisingly, the solution is just not to use pipelining (almost). This is what the exact same scenario as above looks like without pipelining at all:
It's a bit slower than before on average because of batching; the diagrams above are showing the worst-case latency, not the mean. Without batching though, the follower would be unable to keep up at all at this throughput, and the cluster would already be inoperable and failing most writes.
So should we just rip out pipelining altogether then? Surely there must be some cases where it is actually a win.
Where Pipelining Helps
Pipelining helps when the round-trip time on the network is much more significant compared to the processing time of each request. For example, if you have a much bigger RTT, say 5ms, and super fast disks that can commit in 1ms, the diagram looks like this:
Now clearly pipelining is much better as we can keep that follower busy and don't have to wait for nearly 2 RTTs in the worst case shown.
Here's the thing though: all HashiCorp products that use raft only support low-latency local-area networks. The follower's work involves a synchronous disk write. Even super fast NVMe disks are not that much quicker than a LAN round trip on average. In my testing on EC2, committing a small batch of logs on NVMe was about 200-300us for Consul's workload, and the networking mean RTT was about 500us. That implies that any pipelining above 2 or 3 in-flight is not going to help, and having a much larger limit will eventually still hurt latency. This is hard to see with more realistic numbers for NVMe, but you can see that eventually, if we add enough throughput, we start to hit the same pattern of a full buffer and added latency:
Real World
So far the diagrams are using a simplified model. They intentionally leave out a bunch of complications that matter in the real world. For example, they don't model the time taken to encode and decode messages on the leader, which is overlapped with pipelining but synchronous without it.
But the following is taken from real-world testing. First we test on an io2 disk that takes about 2ms to commit each batch to disk and see the latency problem outlined here clearly.
The legend shows the number allowed in-flight, where 1 indicates that pipelining is entirely disabled and every request is synchronous. The right hand graph is the same but with the Y axis zoomed to show the detail for the lower pipeline sizes.
We see that with 128 in flight (current behaviour), the mean request took over 300ms even though the disk was committing consistently in < 10ms. Once the pipeline number is brought right down, or even disabled, the same hardware could service the same workload with only around 10ms mean latency and only about 30ms at the 99.99th percentile!
We also wanted to validate that a much smaller pipeline wasn't significantly detrimental when the disk is somewhat faster than the network. These results show the same workload and hardware but using an NVMe disk. The throughput is doubled.
In this case the right hand graph is different and shows windowed latency percentiles on individual followers (taken from Consul's consul.raft.replication.appendEntries.rpc metrics, taking the max of the 99th-percentile values for each 10s window).
First though, the left graph: in this case we are not pushing this disk hard enough (despite 10k writes per second) to consistently fill up the pipeline and cause orders-of-magnitude worse latency. In fact latencies are pretty amazing across the board here. Even with a 128 pipeline, we are still getting a 99.99th percentile around 50ms and a p50 of less than 2ms!
The tail latencies are also noisy enough that, although lower pipeline numbers are generally better, it's not totally consistent. We'll come back to those in a second.
The right hand graph tells an interesting story (which was observed more clearly on time series graphs). Although the overall commit times are low, the worst follower is still occasionally taking 45ms to commit a single log batch with the 128 in-flight limit. It's not the reason for the P99.99 results on the left, since those are similar for all settings; we'll get to that. Commit in Raft only requires 1 follower to ack in a 3 node cluster like this, so the worst follower hitting spikes of 45ms is not enough to make client-observed latencies spike. The time series graphs showed both followers having spikes like this every so often throughout the test but recovering quickly enough that commit times weren't impacted. Before we move on, the P50 on the right hand graph is important: it's showing us that pipelining disabled, 1, and 2 buffering are all roughly the same.
We need some more detail to see what is going on:
The left chart shows the max consul.raft.commitTime P50 and P99 (calculated over each 10s window) observed in the same test run. The right one is the same for consul.raft.rpc.appendEntries, which measures how long the follower actually took to process each AppendEntries RPC.
Here we can see that even in the 128 in-flight case, the P99 commit latency was never more than 4.5ms - that's an order of magnitude smaller than the 45ms the client observed. The reason in this case is that the NVMe disk is so fast that we are hitting other bottlenecks in Consul where we apply the committed change to the in-memory store. Because we can't batch that update right now, and it involves quite a few allocations due to the way MemDB works, it seems to be the thing that's causing those occasional spikes of client-observed write latency at 10k req/s.
Also interesting is that service times on the right never have a P99 above 0.6ms! Even when the leader is observing the same RPC taking 45.7ms. This is evidence of the pipeline issue we saw above causing trouble again - even though it's not consistent enough to cause commit times to suffer, it's still present and could still tip over the edge to impacting client requests if the workload increased further or hardware had a latency spike (e.g. caused by a large snapshot being taken).
The commit latency graph is the most interesting one. It seems to indicate that for this setup, there is some benefit though mostly in the P99 of keeping pipelining with a short buffer. This is likely because the NVMe disk is fast enough that the encoding and decoding time for the RPC is just about significant enough that overlapping it with the network call itself allows things to get through a little quicker.
Since io2 was only marginally better with no pipelining, and NVMe was slightly better with it, we've chosen a default of MaxInFlight = 2, which means we still pipeline and overlap encoding/decoding on the leader, but without letting a backlog of requests build up in the network.
Conclusions
* Pipelined AppendEntries replication now defaults to MaxInFlight = 2.
* The previous behaviour can be restored by setting MaxInFlight = 130 in the NetworkTransport config (a usage sketch follows below).
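For completeness, usage looks roughly like the following. This is a hypothetical sketch: the knob is called MaxInFlight throughout this PR discussion, but check NetworkTransportConfig in the released library for the exact exported field name.

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// newTransport builds a NetworkTransport for the given stream layer. Leaving
// the in-flight setting at zero picks up the new default of 2; setting it to
// 130 restores the previous effective behaviour (a 128-deep buffer plus the
// two requests that are always allowed on the wire).
func newTransport(stream raft.StreamLayer) *raft.NetworkTransport {
	return raft.NewNetworkTransportWithConfig(&raft.NetworkTransportConfig{
		Stream:      stream,
		MaxPool:     3,
		Timeout:     10 * time.Second,
		MaxInFlight: 130, // assumed field name, taken from this PR's discussion
	})
}

func main() {}
```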