
physical plaintext network bottleneck? #3538

Open
tmds opened this issue Apr 11, 2018 · 44 comments

Comments

@tmds
Contributor

tmds commented Apr 11, 2018

The benchmarks show best performing plaintext on physical to be at 2.7M rps.
The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps.
This seems close to the 10Gbps of the ethernet switch.
Perhaps the physical benchmark is constrained by the network bandwidth?
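
(A quick back-of-the-envelope check of that estimate in Python; the 400-byte request size is an approximation, so treat this as a rough sketch.)

requests_per_second = 2_700_000
request_bytes = 400                                    # approximate request size, TCP payload only
gbps = requests_per_second * request_bytes * 8 / 1e9
print(f"{gbps:.2f} Gbps")                              # ~8.64 Gbps, close to a 10 Gbps link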

@msmith-techempower
Member

This is possible.

For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables 😅

@tmds
Contributor Author

tmds commented Apr 11, 2018

For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables

lol. If you have some spare 10G cards, maybe you can bond them.

The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.

We'll see when you find a sponsor :)

@msmith-techempower
Member

lol. If you have some spare 10G cards, maybe you can bond them.

The 10G cards in the servers we have are double-NIC, but I don't know enough about the hardware to be certain that bonding both NICs on the same card would be an improvement.

The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.

Agreed, though I have learned over the years that nothing really beats C in terms of performance when written well. That said, I personally want Rust to compete.

We'll see when you find a sponsor :)

I'm out there looking!

@RX14
Contributor

RX14 commented Apr 11, 2018

@msmith-techempower link aggregation is definitely something you should look into for getting 20 Gbps between servers instead of just 10.

@sebastienros
Contributor

The benchmarks show best performing plaintext on physical to be at 2.7M rps.

I can see 9M RPS with the latest runs on Citrine ... how is that possible then?

@tmds
Contributor Author

tmds commented Apr 17, 2018

I can see 9M RPS with the latest runs on Citrine ... how is that possible then?

Indeed. Citrine must have more than 10G then.

@tmds
Contributor Author

tmds commented Apr 17, 2018

I'm looking a bit at these daily results (https://tfb-status.techempower.com/). The variation across runs is huge.

plaintext:

| Date  | Netty             | aspnetcore        |
|-------|-------------------|-------------------|
| 16/04 | 5.5 Mrps (35.7ms) | 3.0 Mrps (156ms)  |
| 10/04 | 3.4 Mrps (2500ms) | 2.7 Mrps (150ms)  |
| 21/03 | 5.7 Mrps (35.7ms) | 2.4 Mrps (51.1ms) |

@msmith-techempower
Member

Correct my maths if they are wrong, but using Octane's numbers from the first good Citrine run:

=============================
Octane(plaintext) on Citrine
=============================
9,346,826 responses per second


=============================
Octane(plaintext) response
=============================
HTTP/1.1 200 OK\r\n
Server: octane\r\n
Content-Type: text/plain\r\n
Content-Length: 13\r\n
Date: Thu Apr 12 16:18:26 2018\r\n
\r\n
Hello, World!
=============================
126 total bytes received


9,346,826 * 126 =
  1,177,700,076 bytes per second =
    9,421.600608 Mbits per second =
      9.4216 Gbits per second

This seems to make sense on the 10Gb.
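
(The same response-side arithmetic as a small Python sketch, using the numbers above:)

responses_per_second = 9_346_826
response_bytes = 126                                   # size of the Octane plaintext response above
gbps = responses_per_second * response_bytes * 8 / 1e9
print(f"{gbps:.2f} Gbps")                              # ~9.42 Gbps on a 10 Gbps link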

@tmds
Contributor Author

tmds commented Apr 17, 2018

@msmith-techempower the math is good for the response.
For the request there are about 400 bytes:

GET /json HTTP/1.1
Host: server
User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20130501 Firefox/30.0 AppleWebKit/600.00 Chrome/30.0.0000.0 Trident/10.0 Safari/600.00
Cookie: uid=12345678901234567890; __utma=1.1234567890.1234567890.1234567890.1234567890.12; wd=2560x1600
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive

Maybe this one is stuck at 40G then 😄
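
(Counting the bytes of the request above in Python; the exact total depends on the real Host value and line endings, so this is only a sketch:)

request = (
    "GET /json HTTP/1.1\r\n"
    "Host: server\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20130501 Firefox/30.0 "
    "AppleWebKit/600.00 Chrome/30.0.0000.0 Trident/10.0 Safari/600.00\r\n"
    "Cookie: uid=12345678901234567890; __utma=1.1234567890.1234567890.1234567890."
    "1234567890.12; wd=2560x1600\r\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
    "Accept-Language: en-US,en;q=0.5\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
print(len(request))  # ~409 bytes, which is where the ~400B estimate comes from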

@msmith-techempower
Member

Is that an actual request sent via wrk?

@sebastienros
Contributor

The request is much smaller than that on plaintext when we use wrk.

@tmds
Contributor Author

tmds commented Apr 17, 2018

Is that an actual request sent via wrk?

You can send that via wrk. I copied the request from https://www.techempower.com/benchmarks/#section=code

@msmith-techempower
Member

I think that is a simple example request for testing, but not actually representative of what wrk does. I am sorry for the confusion, but that documentation was written long before we even started using wrk (we were using some benchmarking tool from Apache at one point... I don't even remember... it was like 7 years ago, even before we moved into the open source space).

That said, with 10Gb ethernet and full duplex, I don't really see anything suggesting that we would be performing above 10Gbps (in the theoretical sense). In fact, testing with iperf confirms this, as well.

@sebastienros
Contributor

Locally I think I got 9.8 Gbps with iperf and measured a max of 1.5M packets per second. Any benchmark that is over 1.5M RPS is obviously and correctly using pipelining.

@msmith-techempower
Member

Any benchmark that is over 1.5M is obviously and correctly using pipelining.

plaintext is using pipelining.

@sebastienros
Contributor

Yes I know ;) hence the comment

@tmds
Contributor Author

tmds commented Apr 17, 2018

The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps.

My math assumed 400B requests when creating this issue.
So Round 15 didn't hit 10G, and the daily runs on Citrine may be hitting it?

It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.

@msmith-techempower
Member

My math assumed 400B requests when creating this issue.

Understood, though that may not be accurate. I will check.

It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.

Completely agree.

So Round 15 didn't hit 10G, and the daily runs on Citrine may be hitting it?

Part 1 is definitely true and part 2 is likely for a few extremely high-performance test implementations.

@DamianEdwards
Contributor

I think the math is off, as networking bandwidth is measured in base 1000 for Kbps -> Mbps -> Gbps, unlike storage, which uses 1024.

So the theoretical max for plaintext on the network layer is limited by two factors: packets per second, and total bandwidth. Given the responses are pipelined at a depth of 16, the most optimal packet size would be 16 x 126 = 2016 bytes (not accounting for any overhead). At a switching rate of 1.5 million packets/second (from @sebastienros above), that gives us a max packet throughput rate of 3,024,000,000 bytes per second, or 24.192 Gbps. Now, I don't know what the packet sizes (MTU) are set to, so frameworks are potentially sending more packets than that (due to them being smaller), but they'd need to be less than 1024 bytes to even approach packet switching being the limit, which seems unlikely.

For total bandwidth: 10Gbps is 10,000,000,000 bits per second, or 1,250,000,000 bytes per second. To get max RPS for plaintext: 1,250,000,000 / 126 = 9,920,634.9 RPS.

In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, since that is quite a way off the theoretical maximum based on the above.
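
(Both limits as a short Python sketch, using the figures above; the 1.5M packets/second number is the one @sebastienros measured, so treat it as approximate:)

packet_rate = 1_500_000                  # packets/second, per @sebastienros above
pipeline_depth = 16
response_bytes = 126                     # Octane response size from earlier in the thread
link_bps = 10_000_000_000                # 10 Gbps, base 1000

packet_bytes = pipeline_depth * response_bytes               # 2016 bytes per pipelined packet
packet_limit_gbps = packet_rate * packet_bytes * 8 / 1e9     # ~24.19 Gbps if packet rate is the limit
bandwidth_limit_rps = link_bps / 8 / response_bytes          # ~9.92M RPS if bandwidth is the limit
print(packet_limit_gbps, bandwidth_limit_rps)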

@bhauer
Contributor

bhauer commented May 11, 2018

@msmith-techempower would be able to elaborate some, but earlier continuous Citrine runs were improperly reporting even higher measured responses per second—above 9 million in the cases of the top performers.

I say these measurements were "improper" because they were collected without the request headers we expect to be sent by the load generator (wrk). This was due to a bug we had introduced during the Docker conversion: the command line arguments specifying the request headers to wrk were not being escaped.

The request headers are significant, and although I have not done the math, my hunch is the requests are longer in total bytes than the responses. After correcting the command line arguments so that wrk is sending the expected request headers, the top performers are clustering at ~7M rps.

We are presently focused on increasing stability across all frameworks in order to wrap Round 16. After that, some further tweaks to the network are being considered which could allow the top performers to be better differentiated.

@DamianEdwards
Contributor

I believe the request headers for plaintext are currently the following:

"text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7"

That would make the full request the following:

GET /plaintext HTTP/1.1
Host: tfb-server
Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Connection: keep-alive

With the terminating CRLF CRLF that's a total of 167 bytes per request. The math for the request side then looks like: 1,250,000,000 / 167 = 7,485,029.94 RPS.
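
(The request-side ceiling in Python; the 167-byte figure assumes exactly the request shown above:)

link_bps = 10_000_000_000
request_bytes = 167                       # full plaintext request including the terminating CRLF CRLF
max_rps = link_bps / 8 / request_bytes
print(f"{max_rps:,.0f} RPS")              # ~7,485,030 RPS, matching the ~7M plateau in recent runs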

That of course is a theoretical limit, so it seems very likely we're bumping up against our max environmental limit for the ingress side of plaintext now.

@bhauer seems like we should prioritize investigating adapter bonding to raise the upper limit to roughly twice what it is now.

@msmith-techempower
Member

Agreed on the corrected maths. It actually pointed me in the direction of a bug I had introduced into our wrk image, which we resolved some time ago.

@DamianEdwards
Contributor

As an aside, it seems possible we're starting to hit limits in the JSON test too, although in that case it seems aligned with the packet-switching limit.

@bhauer
Contributor

bhauer commented May 17, 2018

@DamianEdwards The math looks solid and conforms with the observed convergence we're seeing in continuous runs. In this recent example continuous run, we see a plaintext convergence at just over 7M. I suspect the theoretical limit is slightly higher than reality since additional bytes are presumably needed for overhead such as frame and packet headers.

Agreed on increasing the network capacity. That said, I'd prefer to get Round 16 finalized before we do that. We will focus on making a "Preview" out of the next good continuous run. (To be clear, a "Preview" is not special from a data perspective; I merely think it will be helpful for getting the attention of less-active project participants.) Then we aim to finalize a week or two after that.

@DamianEdwards
Contributor

Sounds good. 100G fiber here we come 😁

@benaadams
Contributor

Above 10GbE we might currently have issues with the benchmarker becoming the bottleneck rather than the webservers; see wg/wrk#337.

@xoofx

xoofx commented May 17, 2018

In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, but quite a way off the theoretical maximum based on above.

Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes? For the small packets in these tests the overhead is quite significant, no? (And I'm not counting the 18-byte Ethernet header, the micro-latency from the TCP window size, the small percentage of ACKs/packet loss, etc.)
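
(A rough sketch of that overhead for a single small response per packet, i.e. without pipelining; the header sizes are the usual minimums and only an assumption about what is actually on the wire:)

response_bytes = 126
overhead = 40 + 18                              # minimum IPv4 + TCP headers, plus Ethernet header/FCS
print(overhead / (response_bytes + overhead))   # ~0.32: roughly a third of each frame is headers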

@onyxmaster

One of the popular switch models I know of with 6x QSFP28 ports (the Nexus 31108PC/TC-V) really needs to be put into a non-default hardware profile mode to use all of them; in the default mode it has only 4x QSFP28 and 2x QSFP+. Maybe the other switches based on the same merchant silicon have these different modes too.

@bhauer
Contributor

bhauer commented May 17, 2018

@onyxmaster Thanks for the alert on that. That is the switch model we're using, so we will need to be mindful of the configuration.

@benaadams
Contributor

Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes? For the small packets in these tests the overhead is quite significant, no?

Plaintext is pipelined, so 16 requests may result in only 2 packets, and 16 responses in 2 packets.
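
(A sketch of why pipelining makes the overhead mostly disappear, assuming a 1500-byte MTU so a 16-request batch spans about 2 segments:)

request_bytes = 167
pipeline_depth = 16
overhead = 40 + 18                        # IPv4 + TCP + Ethernet overhead per packet
batch = pipeline_depth * request_bytes    # 2672 bytes of HTTP per pipelined batch
effective = batch + 2 * overhead          # the batch spans ~2 packets at a 1500-byte MTU
print(effective / batch)                  # ~1.04: only about 4% overhead once pipelined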

@DamianEdwards
Contributor

Yeah, the math was just meant to be an HTTP-only theoretical max, to see if we're in the ballpark, and it's clear we are. It would be interesting to include the overheads to see just how close to the network's theoretical limit we are too.

We're also working to help update the test infrastructure to capture CPU, TX/RX, & packet rate during every run, and include that data in the results.

@msmith-techempower
Member

msmith-techempower commented Sep 10, 2018

Sounds good. 100G fiber here we come 😁

@DamianEdwards I finally got some time, so I set up the 100G cards and plugged everything in this morning, but I found that they seem to not be plug-n-play (sort of expected) and Intel does not support Ubuntu server. Have you guys experienced this as well? Any workaround?

I think the worst-case scenario would be that we switch the machines over to a supported CentOS/openSUSE/RHEL to get driver support. Since everything is done in Docker, I do not really see any cause for concern.

@msmith-techempower
Member

@bhauer asked that I cc @sebastienros for the above concern.

Basically, we want to make sure that there is parity between our environments, so if you guys got the Debian drivers working (I'm not at all sure how or if that's possible), then we would want to stick with that approach; otherwise, let's agree on a choice and proceed.

@sebastienros
Contributor

I will ask our colleagues who manage the lab, but as of last week I know they had not tried it yet.

@ghost

ghost commented Oct 22, 2018

This is kind of the most classical flaw of benchmarking - bottlenecking the wrong part and crediting something different.

Michael Jackson looked at the moon, so did Leonardo Da Vinci.
They are both dead.
The moon kills people.

You don't need to chase some particular super network adapter at 20 thousand USD - just lower your CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.

@tmds
Contributor Author

tmds commented Feb 15, 2019

Any plans to work on this? I'm looking forward to seeing that plaintext 10x top-1 become a top 10.

@tmds
Contributor Author

tmds commented Feb 15, 2019

You don't need to chase some particular super network adapter at 20 thousand USD - just lower your CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.

I agree that lowering the available CPU is an alternative to increasing bandwidth. The plaintext numbers are the ones that speak to the imagination, though, and I guess that is why the preference is to increase bandwidth.

I wrote a post about this just now actually: https://medium.com/@alexhultman/why-people-should-stop-listening-to-techempower-c544e7b538a5

TechEmpower provides a set of tests, and runs frameworks against those tests. While most tests don't reflect real world applications, they do stress particular aspects of the frameworks. The results can then be used to improve those frameworks.

This is very much what we've seen in the .NET Core space. Microsoft actively used TechEmpower benchmarks to improve the framework. And those performance benefits are measurable in the end-user applications.

@tmds
Contributor Author

tmds commented Feb 15, 2019

Ehm, no. The top 13 servers are not capped on CPU, so you can't draw any valid conclusion from them. Did you even read the post?

Yes, that is what this issue is tracking: the top of the physical plaintext ranking is noise. Your blog post has a broader message, "Why people should stop listening to TechEmpower", which is what I was replying to.

@tmds
Contributor Author

tmds commented Feb 15, 2019

I have a question regarding the "JSON serialization" test - this looks identical to the plaintext test, only instead of Hello world it's JSON. Why are the results 7x lower in the JSON test? Are you not using pipelining there?

Yes, the JSON test runs without pipelining.

@msmith-techempower
Member

It cannot be the case that 8 servers score almost identically at 7 million req/sec.

That is not necessarily the case. However, it can be (and likely is the) case that 8 test implementations are performing at or near the maximum throughput of our hardware at present, and that's still very interesting.

We are at round 17 now, so you've known this for some time and you still post invalid results.

We disagree on this.

It really tears down the validity of TechEmpower altogether.

and this.

@volyrique
Contributor

It is probably worth mentioning that, due to issue #3804, the fastest frameworks in the cached queries test might also be hitting the network bandwidth limit. For example, according to wrk, servlet-postgresql transfers 1.07 GB/s on average.
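
(A quick check of that number; whether wrk's "GB" is base 1000 or base 1024 is an assumption here, so the real figure may be even closer to the limit:)

transfer_bytes_per_second = 1.07e9        # 1.07 GB/s as reported by wrk, assuming base 1000
gbps = transfer_bytes_per_second * 8 / 1e9
print(f"{gbps:.2f} Gbps")                 # ~8.56 Gbps of a 10 Gbps link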

@zloster
Contributor

zloster commented Mar 25, 2019

Also, ulib is at 1.37M RPS for the case "1 object extracted from the cache" (Edit: currently 16 objects), which is also very similar to the JSON serialisation test (256 and 512 concurrency). Given the math above (1.5M packets per second without pipelining) and assuming 1 packet per response, it is very close.

@ctxcode

ctxcode commented Oct 25, 2021

Isn't the solution simple? Just run the benchmarks on slower hardware. Or am I wrong?

@sebastienros
Contributor

Idea: add a new dimension with the number of cores (1, 4, ..., MAX). That would show how some frameworks behave in constrained environments.
