physical plaintext network bottleneck? #3538
Comments
This is possible. For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables 😅
lol. If you have some spare 10G cards, maybe you can bond them. The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale. We'll see when you find a sponsor :)
The 10G cards in the servers we have are double-NIC, but I don't know enough about the hardware to be certain that bonding both NICs on the same card would be an improvement.
Agreed, though I have learned over the years that nothing really beats C in terms of performance when written well. That said, I personally want Rust to compete.
I'm out there looking!
@msmith-techempower link aggregation is definitely something you should look into for getting 20 Gbps between servers instead of just 10.
I can see 9M RPS with the latest runs on Citrine ... how is that possible then?
Indeed. Citrine must have more than 10G then.
I'm looking a bit at these daily results (https://tfb-status.techempower.com/). The variation across runs is huge. plaintext:
Correct my maths if they are wrong, but using Octane's numbers from the first good Citrine run:
This seems to make sense on the 10 Gb link.
@msmith-techempower the math is good for the response.
Maybe this one is stuck at 40G then 😄
Is that an actual request sent via wrk?
The request is much smaller than that on plaintext when we use wrk.
You can send that via wrk. I copied the request from https://www.techempower.com/benchmarks/#section=code |
I think that is a simple example request for testing, but not actually representative of what wrk sends. That said, with 10 Gb Ethernet and full duplex, I don't really see anything suggesting that we would be performing above 10 Gbps (in the theoretical sense). In fact, testing with ...
Locally I think I got 9.8 Gbps with ...
Yes I know ;) hence the comment |
My math assumed 400 B requests when creating this issue. It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.
Understood, may not be accurate. I will check.
Completely agree.
Part 1 is definitely true and part 2 is likely for a few extremely high-performance test implementations.
I think the math is off, as networking bandwidth is measured in base 1000 for Kbps -> Mbps -> Gbps, unlike storage which uses 1024. So the theoretical max for plaintext on the network layer is limited by two factors: packets per second, and total bandwidth.
Given the responses are pipelined at a depth of 16, the most optimal packet size would be ...
For total bandwidth: ...
In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, but quite a way off the theoretical maximum based on the above.
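For a rough sense of scale, here is a minimal sketch of those two limits; the response size, segment size, and packet-rate ceiling below are illustrative assumptions rather than figures from this environment:

```python
# Rough sketch of the two limits described above. All figures are
# illustrative assumptions, not measurements from the Citrine environment.
LINK_BPS = 10e9              # assumed 10 GbE link, decimal (base-1000) units
PIPELINE_DEPTH = 16          # wrk pipelines 16 requests per write
RESPONSE_BYTES = 140         # assumed size of one plaintext response incl. headers
MSS = 1460                   # TCP payload per segment with a 1500-byte MTU
SWITCH_PPS = 10e6            # assumed packet-per-second ceiling of the path

# Limit 1: total bandwidth on the egress (response) side.
bandwidth_limit_rps = LINK_BPS / 8 / RESPONSE_BYTES          # ~8.9M resp/s

# Limit 2: packets per second. 16 pipelined responses coalesce into a burst
# of PIPELINE_DEPTH * RESPONSE_BYTES bytes, i.e. ceil(burst / MSS) segments.
segments_per_batch = -(-(PIPELINE_DEPTH * RESPONSE_BYTES) // MSS)   # = 2
packet_limit_rps = SWITCH_PPS * PIPELINE_DEPTH / segments_per_batch

print(f"bandwidth limit: {bandwidth_limit_rps / 1e6:.1f} M responses/sec")
print(f"packet-rate limit: {packet_limit_rps / 1e6:.1f} M responses/sec")
```

Under these assumptions the bandwidth limit (~8.9M responses/sec) is the tighter of the two, which is at least consistent with the top frameworks bunching up below it.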
@msmith-techempower would be able to elaborate some, but earlier continuous Citrine runs were improperly reporting even higher measured responses per second (above 9 million in the cases of the top performers). I say these measurements were "improper" because they were collected without the request headers we expect to be sent by the load generator (wrk). This was due to a bug we had introduced during the Docker conversion: the command line arguments specifying the request headers to wrk were not being escaped. The request headers are significant, and although I have not done the math, my hunch is the requests are longer in total bytes than the responses. After correcting the command line arguments so that wrk is sending the expected request headers, the top performers are clustering at ~7M rps. We are presently focused on increasing stability across all frameworks in order to wrap Round 16. After that, some further tweaks to the network are being considered which could allow the top performers to be better differentiated.
I believe the request headers for plaintext are currently the following:
That would make the full request the following:
With the terminating CRLF ...
That of course is a theoretical limit, so it seems very likely we're bumping up against our max environmental limit for the ingress side of plaintext now. @bhauer seems like we should prioritize investigating adapter bonding to raise the upper limit to roughly twice what it is now.
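As a sanity check on that ceiling, a minimal sketch of the ingress-side arithmetic, assuming roughly 160 bytes per plaintext request (an assumed figure, not the exact byte count of the request quoted above) on a 10 GbE link:

```python
# Ingress-side ceiling sketch. REQUEST_BYTES is an assumed figure for one
# plaintext request including headers and the terminating CRLF; substitute
# the exact byte count of the request shown above.
LINK_BPS = 10e9        # 10 GbE, decimal (base-1000) units
REQUEST_BYTES = 160    # assumption, not the measured value

ingress_ceiling_rps = LINK_BPS / 8 / REQUEST_BYTES
print(f"theoretical ingress ceiling: {ingress_ceiling_rps / 1e6:.2f} M req/sec")
# ~7.8 M req/s under these assumptions -- in the neighborhood of the ~7M
# convergence observed, before accounting for Ethernet/IP/TCP framing overhead.
```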
Agreed on the correct maths. It actually pointed me in the direction of fixing a bug I introduced into our ...
As an aside, it seems possible we're also starting to hit limits in the JSON test too, although in that case it seems aligned with the packet switching limit.
@DamianEdwards The math looks solid and conforms with the observed convergence we're seeing in continuous runs. In this recent example continuous run, we see a plaintext convergence at just over 7M. I suspect the theoretical limit is slightly higher than reality since additional bytes are presumably needed for overhead such as frame and packet headers. Agreed on increasing the network capacity. That said, I'd prefer to get Round 16 finalized before we do that. We will focus on making a "Preview" out of the next good continuous run. (To be clear, a "Preview" is not special from a data perspective; I merely think it will be helpful for getting the attention of less-active project participants.) Then we'll aim to finalize a week or two after.
Sounds good. 100G fiber here we come 😁
We might currently have issues with the benchmarker becoming the bottleneck, rather than the web servers, above 10 GbE: wg/wrk#337
Hey, just wondering why, for such small packets, you don't count the IP+TCP overhead of at least 40 bytes. For the small packets in these tests, the overhead is quite significant, no? (And I'm not counting the 18-byte Ethernet header, plus even a little latency from the TCP window size, plus the small percentage of ACKs/packet loss, etc.)
One of the popular switch models I know that has 6x QSFP28 ports (the Nexus 31108PC/TC-V) should really be used in the non-default hardware profile mode when it has only 4x QSFP28 and 2x QSFP+. Maybe the other switches based on the same merchant silicon have these different modes too.
@onyxmaster Thanks for the alert on that. That is the switch model we're using, so we will need to be mindful of the configuration.
Plaintext is pipelined, so it may result in only 2 packets for 16 requests and 2 packets for 16 responses.
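For concreteness, a rough sketch of that coalescing together with the per-packet overhead raised above; all sizes below are illustrative assumptions:

```python
# Back-of-the-envelope cost of Ethernet/IP/TCP overhead when 16 pipelined
# requests are coalesced into as few full TCP segments as possible.
# All sizes are illustrative assumptions.
MSS = 1460                  # TCP payload per segment with a 1500-byte MTU
PER_PACKET_OVERHEAD = 58    # ~18 B Ethernet framing + 20 B IP + 20 B TCP
PIPELINE_DEPTH = 16
REQUEST_BYTES = 160         # assumed bytes per plaintext request

batch_payload = PIPELINE_DEPTH * REQUEST_BYTES              # 2560 B
packets = -(-batch_payload // MSS)                          # ceiling division -> 2
wire_bytes = batch_payload + packets * PER_PACKET_OVERHEAD
overhead_pct = 100 * packets * PER_PACKET_OVERHEAD / wire_bytes

print(f"{packets} packets per 16-request batch, "
      f"~{overhead_pct:.1f}% of wire bytes are framing overhead")
```

With pipelining and these assumed sizes, framing overhead stays in the single-digit percentage range, which is why it only nudges the HTTP-only estimate rather than changing the picture.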
Yeah, the math was just meant to be an HTTP-only theoretical max, to see if we're in the ballpark, and it's clear we are. It would be interesting to include overheads to see just how close to the network's theoretical limit we are too. We're also working to help update the test infrastructure to capture CPU, TX/RX, and packet rate during every run, and include that data in the results.
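As a sketch of that kind of per-run capture, sampling the Linux per-interface counters is one lightweight option; the interface name and interval below are placeholders rather than anything from the TFB toolset:

```python
"""Minimal sketch: sample per-interface byte and packet rates from
/proc/net/dev on Linux. The interface name and interval are placeholders,
not values used by the TFB toolset."""
import time

def read_counters(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                # RX columns: bytes packets errs drop fifo frame compressed
                # multicast, followed by the TX columns: bytes packets ...
                return int(fields[0]), int(fields[1]), int(fields[8]), int(fields[9])
    raise ValueError(f"interface {iface!r} not found")

def sample(iface="eth0", interval=1.0):
    rx_b0, rx_p0, tx_b0, tx_p0 = read_counters(iface)
    time.sleep(interval)
    rx_b1, rx_p1, tx_b1, tx_p1 = read_counters(iface)
    print(f"RX {8 * (rx_b1 - rx_b0) / interval / 1e9:.2f} Gbps, "
          f"{(rx_p1 - rx_p0) / interval / 1e6:.2f} Mpps | "
          f"TX {8 * (tx_b1 - tx_b0) / interval / 1e9:.2f} Gbps, "
          f"{(tx_p1 - tx_p0) / interval / 1e6:.2f} Mpps")

if __name__ == "__main__":
    sample()
```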
@DamianEdwards I finally got some time, so I set up the 100G cards and plugged everything in this morning, but I found that they seem to not be plug-and-play (sort of expected) and Intel does not support Ubuntu Server. Have you guys experienced this as well? Any workaround? I think the worst-case scenario would be that we switch the machines over to a supported CentOS/openSUSE/RHEL to get driver support. Since everything is done in Docker, I do not really see any cause for concern.
@bhauer asked that I cc @sebastienros for the above concern. Basically, we want to make sure that there is parity between our environments, so if you guys got the Debian drivers working (I'm not at all sure how or if that's possible), then we would want to stick with that approach; otherwise, let's land on some choice to proceed.
I will ask our colleagues who manage the lab, but as of last week I know they had not tried it yet.
This is kind of the most classical flaw of benchmarking - bottlenecking the wrong part and crediting something different.
You don't need to chase some particular super network adapter for 20 thousand USD - just lower your CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.
Any plans to work on this? I'm looking forward to seeing that plaintext 10x top-1 become a top 10.
I agree that lowering the available CPU is an alternative to increasing bandwidth. The numbers for plaintext speak to the imagination, and I guess that is why the desire is to increase bandwidth.
TechEmpower provides a set of tests, and runs frameworks against those tests. While most tests don't reflect real-world applications, they do stress particular aspects of the frameworks. The results can then be used to improve those frameworks. This is very much what we've seen in the .NET Core space: Microsoft actively used TechEmpower benchmarks to improve the framework, and those performance benefits are measurable in end-user applications.
Yes, that is what this issue is tracking. The top for plaintext physical is noise. Your blog post has a broader message: why people should stop listening to TechEmpower, which is what I was replying to.
Yes, the JSON test runs without pipelining.
That is not necessarily the case. However, it can be (and likely is the) case that 8 test implementations are performing at or near the maximum throughput of our hardware at present, and that's still very interesting.
We disagree on this.
and this.
It is probably worth mentioning that due to issue #3804, the fastest frameworks in the cached queries test might also be hitting the network bandwidth limit. For example, according to ...
Also ...
Isn't the solution simple? Just run the benchmarks on slower hardware. Or am I wrong?
Idea: Add a new dimension with the number of cores (1, 4, .. MAX). That would show how some frameworks behave in constrained environments. |
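For what it's worth, a constrained-core dimension could likely be prototyped with the existing Docker setup by pinning containers to a CPU subset; a minimal sketch using Docker's standard --cpuset-cpus flag, where the image name and port are placeholders:

```python
"""Minimal sketch: launch a framework container pinned to a subset of cores.
--cpuset-cpus is a standard Docker flag; the image name and port mapping
below are placeholders, not actual TFB artifacts."""
import subprocess

def run_pinned(image, cpus="0-3", port=8080):
    # Pin the container to the given cores, e.g. "0" for 1 core, "0-3" for 4.
    subprocess.run([
        "docker", "run", "--rm", "--detach",
        "--cpuset-cpus", cpus,
        "-p", f"{port}:{port}",
        image,
    ], check=True)

if __name__ == "__main__":
    run_pinned("example/plaintext-framework", cpus="0-3")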
The benchmarks show the best-performing plaintext framework on physical hardware at 2.7M rps.
The requests are about 400 B each (TCP content only), so 2.7M rps is about 8.6 Gbps.
This seems close to the 10 Gbps of the Ethernet switch.
Perhaps the physical benchmark is constrained by the network bandwidth?
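For reference, the arithmetic behind that estimate, using the 400 B per-request figure assumed above:

```python
rps = 2.7e6           # best plaintext result on physical hardware, from above
request_bytes = 400   # assumed TCP payload per request, from above
gbps = rps * request_bytes * 8 / 1e9
print(f"{gbps:.2f} Gbps")   # ~8.64 Gbps, close to the 10 GbE line rate
```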