Inconsistent Raft Latency on Different Infrastructures #578

Closed
aminst opened this issue Oct 23, 2023 · 8 comments

aminst commented Oct 23, 2023

Issue Summary

I've encountered an issue with HashiCorp's Raft library where I observe inconsistent Raft latencies across different infrastructures. This issue seems specific to HashiCorp Raft, as I've successfully used Etcd on the same infrastructure without encountering similar problems.

Description

I have extensively tested the HashiCorp Raft library on three different infrastructures: my local machine, physical servers, and GCP (Google Cloud Platform). While I don't encounter issues with other communications, I'm experiencing significant differences in Raft replication latencies.

  1. On physical machines and my local machine, I'm observing a replication latency of approximately 150 ms, even for small objects. This latency is considerably higher than expected and not ideal for my use case.
  2. On GCP, the replication latency is significantly lower, approximately 10ms, which is acceptable for my requirements.

This inconsistency in replication latency is puzzling, as I would expect more uniform behavior across different infrastructures, especially when using the same Raft library.

Environment

  • HashiCorp Raft Library Version: v1.5.0
  • Infrastructure:
    • Local Machine: MacBook Air, Apple M1, 2020
    • GCP: e2-medium, Intel Broadwell, x86_64, Ubuntu 20.04
    • Physical Servers: 2x Intel E5-2620v2 (12 physical cores), 32 GB memory, interconnect 10GbE, Ubuntu 20.04

Steps to Reproduce

Use the Raft Example and test the replication latency.
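For reference, measuring this can be as simple as timing the future returned by Apply on the leader. A minimal sketch, assuming an already-bootstrapped *raft.Raft leader r built by the example's own setup code (the wiring around it is hypothetical):

```go
import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// measureApply times a single replicated write on the leader.
// r is assumed to be an already-bootstrapped *raft.Raft leader.
func measureApply(r *raft.Raft) {
	start := time.Now()
	future := r.Apply([]byte("small-payload"), 5*time.Second)
	if err := future.Error(); err != nil {
		log.Fatalf("apply failed: %v", err) // e.g. not leader, or timeout
	}
	log.Printf("commit+apply latency: %s", time.Since(start))
}
```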


banks commented Oct 23, 2023

@aminst thanks for the great report! We'd love to help figure out what you are seeing.

I have a few questions:

  1. What version of this library are you using?
  2. You mentioned "raft latency" and "replication latency" here, but what specifically are you measuring? The total time to apply a write (i.e. commit and apply on the leader), or the time it takes to also notify and apply that write on all followers?
  3. Do you see these "bad" latencies all the time or only when the system is under high load (e.g. a write benchmark)?
  4. Can you describe one of your test environments in more detail? Were you using the laptop and the physical servers in one cluster or in separate ones? What was your workload, and how did you measure the latency? Is 150ms the mean, the P99, etc.?
  5. Usually in a local network the write times are dominated by how fast the disks can fsync. Apple laptops are notoriously slow at this despite being really fast for "cached" writes - they typically take ~20ms per fsync, and Raft has to write to the leader's disk and then at least one follower's disk in serial. If both are on the same laptop, that is at least 40ms per write alone (a quick way to check raw fsync latency is sketched at the end of this comment). You didn't mention disk specs for the other environments though - can you share more on those?
  6. Were you using the raft example code when you tested or something else? If it was something else can you describe it in more detail? Or did you reproduce using that example code in the same environments?

You might be interested in this issue we fixed recently, which could artificially inflate commit latencies by up to 128x the actual disk write time in some high-throughput situations. But if you see this issue even on single writes on an unloaded cluster, then it's not likely to be the cause.
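Regarding point 5 above, raw fsync latency on a given machine can be sanity-checked with a small standalone program like the following sketch (standard library only, not part of this repository):

```go
package main

import (
	"log"
	"os"
	"time"
)

// Rough fsync micro-benchmark: append a small record and fsync it,
// a few times in a row, to get a feel for per-fsync latency.
func main() {
	f, err := os.OpenFile("fsync-test.dat", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 256) // roughly the size of a small raft log entry
	for i := 0; i < 10; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil {
			log.Fatal(err)
		}
		log.Printf("write+fsync #%d: %s", i, time.Since(start))
	}
}
```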


aminst commented Oct 23, 2023

Hi @banks, Thanks for your response.

  1. I'm using the v1.5.0 version.
  2. I'm measuring the total time to call the Apply method on the Raft leader and get a response. I removed any code that could have taken time from the finite state machine's Apply implementation, so that I only measure the time it takes to communicate between replicas. I wait on the future that Apply returns, so I measure the time it takes to apply the write on all followers.
  3. These numbers are for single requests. I started by testing the system with single requests, without putting any load on the system.
  4. I ran both my code and the raft-example code on the machines to measure latency. I deleted everything from the FSM Apply method and only measured the Raft apply time. The three types of infrastructure I mentioned were tested separately. On my local machine and on GCP, I deployed all the replicas locally and issued a request to check the latency. On the physical servers, I tested the latency both locally and with the replicas spread across multiple machines, but it didn't make a difference.
  5. Oh, I see! But the problem is that even after commenting out all the disk writes, the latency doesn't change. I am measuring the latency of communication between the replicas. The FSM Apply function only returns nil and doesn't write to disk in my testing environment (a minimal version is sketched just after this list).
  6. I used both my project and the raft-example implementation to test the latency. My code only replicates small objects of type bool or integer. Both resulted in approximately the same latency numbers.
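For reference, a no-op FSM of the kind described in points 2 and 5 might look like the following minimal sketch (an illustration of the approach, not the exact code used here):

```go
import (
	"io"

	"github.com/hashicorp/raft"
)

// noopFSM does no work and no disk I/O in Apply, so timing
// raft.Apply measures only replication and commit, not the FSM.
type noopFSM struct{}

func (noopFSM) Apply(l *raft.Log) interface{}       { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return noopSnapshot{}, nil }
func (noopFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

type noopSnapshot struct{}

func (noopSnapshot) Persist(sink raft.SnapshotSink) error { return sink.Close() }
func (noopSnapshot) Release()                             {}
```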

Thank you again for your help!


aminst commented Oct 23, 2023

Hi @banks
You were right about the fsync problem! I did more experimentation and moved everything to memory; now the latency is great. The problem was that on my physical servers and on my local machine the fsync took so long, and since I was sending single requests, the cost wasn't being amortized across batched writes.
Thank you so much for your help! I appreciate it!
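For anyone else who hits this: one way to take disks entirely out of the picture in a latency test is to wire a node with the library's in-memory stores and transport. A minimal sketch of that wiring (for measurement only, since it gives up durability; this is not the exact change made here):

```go
import "github.com/hashicorp/raft"

// newInmemNode builds a raft node with no disk I/O at all:
// in-memory log/stable stores, snapshot store, and transport.
// fsm can be any raft.FSM, e.g. a no-op FSM used only for timing.
func newInmemNode(id raft.ServerID, fsm raft.FSM) (*raft.Raft, *raft.InmemTransport, error) {
	conf := raft.DefaultConfig()
	conf.LocalID = id

	logs := raft.NewInmemStore()           // LogStore
	stable := raft.NewInmemStore()         // StableStore
	snaps := raft.NewInmemSnapshotStore()  // SnapshotStore
	_, trans := raft.NewInmemTransport("") // "" lets the transport pick an address

	r, err := raft.NewRaft(conf, fsm, logs, stable, snaps, trans)
	return r, trans, err
}
```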


banks commented Oct 24, 2023

@aminst glad we could help, other folks here also reviewed and contributed to that response so I can't take all the credit!

In general, fsyncs are the slowest thing in Raft (or in any database that actually has strong durability guarantees). The main way to improve throughput and lower latency is to issue many writes in parallel - this library will "group commit", that is, batch all writes made in parallel into a single disk write, so you can get up to 64 (by default) writes for the price of one fsync!
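To illustrate, issuing many Apply calls concurrently lets the library batch them, so the total wall-clock time is far less than the same number of sequential writes. A hypothetical benchmark sketch, assuming an existing *raft.Raft leader r:

```go
import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/hashicorp/raft"
)

// applyParallel issues n writes concurrently so the library can
// group-commit them, then reports the total wall-clock time.
func applyParallel(r *raft.Raft, n int) {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			payload := []byte(fmt.Sprintf("op-%d", i))
			if err := r.Apply(payload, 10*time.Second).Error(); err != nil {
				log.Printf("apply %d failed: %v", i, err)
			}
		}(i)
	}
	wg.Wait()
	log.Printf("%d concurrent applies took %s total", n, time.Since(start))
}
```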

It's also the reason why Raft (and most consensus-based systems) are not very effective at utilizing disk hardware. Modern SSDs have huge IOPS available, but only if your workload can issue lots (usually thousands) of larger reads or writes in parallel to take advantage of the parallel chips/controllers/channels etc. in the device.

There is one optimization I'm considering for this library that could help reduce latency: writing to the leader's disk in parallel with replicating to the followers. It's in Diego's Raft thesis (section 10.2.1), and I have a pretty good idea and a prototype of how to make it happen in this library, but I'm not sure if/when we'll get to it. It means that instead of needing to wait for the leader's disk, then the RTT and the follower's disk write before committing, we effectively only wait for the RTT and one disk write, since they all happen in parallel with the leader's disk write. So if your disk writes take 5ms and the RTT is 0.1ms, today a commit will take at least 5 + 0.1 + 5 = 10.1ms, whereas in theory it could be done in just 0.1 + 5 = 5.1ms.

I'm going to close this as it sounds like you worked out what is going on but feel free to let us know if you have more feedback or info on this!

@banks banks closed this as completed Oct 24, 2023

aminst commented Oct 24, 2023

Hi @banks
Thank you for your explanation. That makes sense.
The optimization you discussed sounds interesting; it will help when there are not many writes to aggregate.
I would like to help implement it. Is there a contribution guide for this project? I would be happy to help and learn more by doing this.


banks commented Oct 25, 2023

I would like to help implement it. Is there a contribution guide for this project? I would be happy to help and learn more by doing this.

I love your enthusiasm to contribute, thank you!

To be very honest, we've not yet found a great model for working on really major contributions from the community for Raft. The issue is that Raft is very easy to break, and it's very hard to have confidence in either the completeness of the tests or the thoroughness with which we've understood a change. That means even internally it takes a big "cost/benefit" discussion to motivate taking the risk of a significant change that will impact the core reliability of all our products! Trying to coordinate all of that around community contributions is something we've not yet achieved. I'd love to find ways to be more open to community input though.

In this case I think the best next step is for me to write up a GitHub issue with my current thoughts and experiments around this. One really useful contribution to that effort would be having others test and validate those changes, or helping us think of ways to reduce the risk, etc.! I'll try to do that next week.

In general, it's usually best to open an issue if you have an idea for something that might help so we can think through together the motivation and design before you invest too much work in it and find we're not able to justify the risk or time investment to build enough confidence in the correctness and benefit to get it merged.


aminst commented Oct 25, 2023

Thank you for your encouraging words.
I understand what you're saying since lots of products depend on Raft as the base layer.
I'm looking forward to your upcoming GitHub issue, and I'll be eager to dive into the details and help reduce the risk. Please let me know once you've written it, if possible.


banks commented Oct 30, 2023

@aminst I've opened #579 as a WIP experiment with my current thinking on how we can improve this. It will likely remain WIP until we've done a bunch more testing to check that the performance improvement is actually enough to warrant pursuing it further, and then we'll need to work out how to convince ourselves it's correct and thoroughly tested!

@banks banks pinned this issue Oct 30, 2023
@banks banks unpinned this issue Nov 20, 2023