
physical plaintext network bottleneck? #3538

Open
tmds opened this issue Apr 11, 2018 · 44 comments

Comments

@tmds
Contributor

tmds commented Apr 11, 2018

The benchmarks show best performing plaintext on physical to be at 2.7M rps.
The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps.
This seems close to the 10Gbps of the ethernet switch.
Perhaps the physical benchmark is constrained by the network bandwidth?
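
(A quick back-of-the-envelope check of that estimate in Python; the 400-byte request size is an approximation, so treat this as a rough sketch.)

requests_per_second = 2_700_000
request_bytes = 400                                    # approximate request size, TCP payload only
gbps = requests_per_second * request_bytes * 8 / 1e9
print(f"{gbps:.2f} Gbps")                              # ~8.64 Gbps, close to a 10 Gbps link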

@msmith-techempower
Member

This is possible.

For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables 😅

@tmds
Contributor Author

tmds commented Apr 11, 2018

For what it's worth, our new hardware environment has a switch with 6x QSFP-100G ports if you were looking to donate three network cards and a few cables

lol. If you have some spare 10G cards, maybe you can bond them.

The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.

We'll see when you find a sponsor :)

@msmith-techempower
Member

lol. If you have some spare 10G cards, maybe you can bond them.

The 10G cards in the servers we have are double-NIC, but I don't know enough about the hardware to be certain that bonding both NICs on the same card would be an improvement.

The cloud benchmark for plaintext has tokio-minihttp at 64.5% of ulib. These two are at the top of the physical benchmark (100% and 97.8%). I'm curious how much the graph would rescale.

Agreed, though I have learned over the years that nothing really beats C in terms of performance when written well. That said, I personally want Rust to compete.

We'll see when you find a sponsor :)

I'm out there looking!

@RX14
Contributor

RX14 commented Apr 11, 2018

@msmith-techempower link aggregation is definitely something you should look into for getting 20 Gbps between servers instead of just 10.

@sebastienros
Contributor

The benchmarks show best performing plaintext on physical to be at 2.7M rps.

I can see 9M RPS with the latest runs on Citrine ... how is that possible then?

@tmds
Contributor Author

tmds commented Apr 17, 2018

I can see 9M RPS with the latest runs on Citrine ... how is that possible then?

Indeed. Citrine must have more than 10G then.

@tmds
Contributor Author

tmds commented Apr 17, 2018

I'm looking a bit at these daily results (https://tfb-status.techempower.com/). The variation across runs is huge.

plaintext:

| Date  | Netty             | aspnetcore        |
|-------|-------------------|-------------------|
| 16/04 | 5.5 Mrps (35.7ms) | 3.0 Mrps (156ms)  |
| 10/04 | 3.4 Mrps (2500ms) | 2.7 Mrps (150ms)  |
| 21/03 | 5.7 Mrps (35.7ms) | 2.4 Mrps (51.1ms) |

@msmith-techempower
Member

Correct my maths if they are wrong, but using Octane's numbers from the first good Citrine run:

=============================
Octane(plaintext) on Citrine
=============================
9,346,826 responses per second


=============================
Octane(plaintext) response
=============================
HTTP/1.1 200 OK\r\n
Server: octane\r\n
Content-Type: text/plain\r\n
Content-Length: 13\r\n
Date: Thu Apr 12 16:18:26 2018\r\n
\r\n
Hello, World!
=============================
126 total bytes received


9,346,826 * 126 =
  1,177,700,076 bytes per second =
    9,421.600608 Mbits per second =
      9.4216 Gbits per second

This seems to make sense on the 10Gb.
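
(The same response-side arithmetic as a small Python sketch, using the numbers above:)

responses_per_second = 9_346_826
response_bytes = 126                                   # size of the Octane plaintext response above
gbps = responses_per_second * response_bytes * 8 / 1e9
print(f"{gbps:.2f} Gbps")                              # ~9.42 Gbps on a 10 Gbps link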

@tmds
Contributor Author

tmds commented Apr 17, 2018

@msmith-techempower the math is good for the response.
For the request there are about 400 bytes:

GET /json HTTP/1.1
Host: server
User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20130501 Firefox/30.0 AppleWebKit/600.00 Chrome/30.0.0000.0 Trident/10.0 Safari/600.00
Cookie: uid=12345678901234567890; __utma=1.1234567890.1234567890.1234567890.1234567890.12; wd=2560x1600
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive

Maybe this one is stuck at 40G then 😄
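
(Counting the bytes of the request above in Python; the exact total depends on the real Host value and line endings, so this is only a sketch:)

request = (
    "GET /json HTTP/1.1\r\n"
    "Host: server\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) Gecko/20130501 Firefox/30.0 "
    "AppleWebKit/600.00 Chrome/30.0.0000.0 Trident/10.0 Safari/600.00\r\n"
    "Cookie: uid=12345678901234567890; __utma=1.1234567890.1234567890.1234567890."
    "1234567890.12; wd=2560x1600\r\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
    "Accept-Language: en-US,en;q=0.5\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
print(len(request))  # ~409 bytes, which is where the ~400B estimate comes from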

@msmith-techempower
Member

Is that an actual request sent via wrk?

@sebastienros
Contributor

The request is much smaller than that on plaintext when we use wrk.

@tmds
Contributor Author

tmds commented Apr 17, 2018

Is that an actual request sent via wrk?

You can send that via wrk. I copied the request from https://www.techempower.com/benchmarks/#section=code

@msmith-techempower
Member

I think that is a simple example request for testing, but not actually representative of what wrk does. I am sorry for the confusion, but that documentation was written long before we even started using wrk (we were using some benchmarking tool from Apache at one point... I don't even remember... it was like 7 years ago, even before we moved into the open source space).

That said, with 10Gb ethernet and full duplex, I don't really see anything suggesting that we would be performing above 10Gbps (in the theoretical sense). In fact, testing with iperf confirms this, as well.

@sebastienros
Contributor

Locally I think I got 9.8 Gbps with iperf and measured a max of 1.5M packets per second. Any benchmark that is over 1.5M RPS is obviously and correctly using pipelining.

@msmith-techempower
Member

Any benchmark that is over 1.5M is obviously and correctly using pipelining.

plaintext is using pipelining.

@sebastienros
Contributor

Yes I know ;) hence the comment

@tmds
Contributor Author

tmds commented Apr 17, 2018

The requests have about 400B (TCP content only). So 2.7M rps is about 8.6Gbps.

My math assumed 400B requests when creating this issue.
So Round 15 didn't hit 10G, and the daily runs on Citrine may be hitting it?

It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.

@msmith-techempower
Member

My math assumed 400B requests when creating this issue.

Understood, though that may not be accurate. I will check.

It would be nice to update https://www.techempower.com/benchmarks/#section=code to reflect the headers/sizes used by wrk.

Completely agree.

So Round 15 didn't hit 10G, and the daily runs on Citrine may be hitting it?

Part 1 is definitely true and part 2 is likely for a few extremely high-performance test implementations.

@DamianEdwards
Contributor

I think the math is off, as networking bandwidth is measured in base 1000 for Kbps -> Mbps -> Gbps, unlike storage, which uses 1024.

So the theoretical max for plaintext on the network layer is limited by two factors: packets per second, and total bandwidth. Given the responses are pipelined at a depth of 16, the most optimal packet size would be 16 x 126 = 2016 bytes (not accounting for any overhead). At a switching rate of 1.5 million packets/second (from @sebastienros above), that gives us a max packet throughput rate of 3,024,000,000 bytes per second, or 24.192 Gbps. Now, I don't know what the packet sizes (MTU) are set to, so frameworks are potentially sending more packets than that (due to them being smaller), but they'd need to be less than 1024 bytes to even approach packet switching being the limit, which seems unlikely.

For total bandwidth: 10Gbps is 10,000,000,000 bits per second, or 1,250,000,000 bytes per second. To get max RPS for plaintext: 1,250,000,000 / 126 = 9,920,634.9 RPS.

In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, since that is quite a way off the theoretical maximum based on the above.
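
(Both limits as a short Python sketch, using the figures above; the 1.5M packets/second number is the one @sebastienros measured, so treat it as approximate:)

packet_rate = 1_500_000                  # packets/second, per @sebastienros above
pipeline_depth = 16
response_bytes = 126                     # Octane response size from earlier in the thread
link_bps = 10_000_000_000                # 10 Gbps, base 1000

packet_bytes = pipeline_depth * response_bytes               # 2016 bytes per pipelined packet
packet_limit_gbps = packet_rate * packet_bytes * 8 / 1e9     # ~24.19 Gbps if packet rate is the limit
bandwidth_limit_rps = link_bps / 8 / response_bytes          # ~9.92M RPS if bandwidth is the limit
print(packet_limit_gbps, bandwidth_limit_rps)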

@bhauer
Contributor

bhauer commented May 11, 2018

@msmith-techempower would be able to elaborate some, but earlier continuous Citrine runs were improperly reporting even higher measured responses per second—above 9 million in the cases of the top performers.

I say these measurements were "improper" because they were collected without the request headers we expect to be sent by the load generator (wrk). This was due to a bug we had introduced during the Docker conversion: the command line arguments specifying the request headers to wrk were not being escaped.

The request headers are significant, and although I have not done the math, my hunch is the requests are longer in total bytes than the responses. After correcting the command line arguments so that wrk is sending the expected request headers, the top performers are clustering at ~7M rps.

We are presently focused on increasing stability across all frameworks in order to wrap Round 16. After that, some further tweaks to the network are being considered which could allow the top performers to be better differentiated.

@DamianEdwards
Contributor

I believe the request headers for plaintext are currently the following:

"text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7"

That would make the full request the following:

GET /plaintext HTTP/1.1
Host: tfb-server
Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Connection: keep-alive

With the terminating CRLF CRLF that's a total of 167 bytes per request. The math for the request side then looks like: 1,250,000,000 / 167 = 7,485,029.94 RPS.
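
(The request-side ceiling in Python; the 167-byte figure assumes exactly the request shown above:)

link_bps = 10_000_000_000
request_bytes = 167                       # full plaintext request including the terminating CRLF CRLF
max_rps = link_bps / 8 / request_bytes
print(f"{max_rps:,.0f} RPS")              # ~7,485,030 RPS, matching the ~7M plateau in recent runs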

That of course is a theoretical limit, so it seems very likely we're bumping up against our max environmental limit for the ingress side of plaintext now.

@bhauer seems like we should prioritize investigating adapter bonding to raise the upper limit to roughly twice what it is now.

@msmith-techempower
Member

Agreed on the corrected maths. It actually pointed me in the direction of a bug I had introduced into our wrk image, which we resolved some time ago.

@DamianEdwards
Contributor

As an aside, it seems possible we're starting to hit limits in the JSON test too, although in that case it seems aligned with the packet-switching limit.

@bhauer
Contributor

bhauer commented May 17, 2018

@DamianEdwards The math looks solid and conforms with the observed convergence we're seeing in continuous runs. In this recent example continuous run, we see a plaintext convergence at just over 7M. I suspect the theoretical limit is slightly higher than reality since additional bytes are presumably needed for overhead such as frame and packet headers.

Agreed on increasing the network capacity. That said, I'd prefer to get Round 16 finalized before we do that. We will focus on making a "Preview" out of the next good continuous run. (To be clear, a "Preview" is not special from a data perspective; I merely think it will be helpful for getting the attention of less-active project participants.) Then we aim to finalize a week or two after that.

@DamianEdwards
Contributor

Sounds good. 100G fiber here we come 😁

@benaadams
Contributor

Above 10GbE we might currently have issues with the benchmarker becoming the bottleneck rather than the webservers; see wg/wrk#337.

@xoofx

xoofx commented May 17, 2018

In the latest runs, the top plaintext frameworks seem to be bunching up at around 7,000,000 RPS, which is strange, but quite a way off the theoretical maximum based on above.

Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes? For the small packets in these tests the overhead is quite significant, no? (And I'm not counting the 18-byte Ethernet header, the micro-latency from the TCP window size, the small percentage of ACKs/packet loss, etc.)
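
(A rough sketch of that overhead for a single small response per packet, i.e. without pipelining; the header sizes are the usual minimums and only an assumption about what is actually on the wire:)

response_bytes = 126
overhead = 40 + 18                              # minimum IPv4 + TCP headers, plus Ethernet header/FCS
print(overhead / (response_bytes + overhead))   # ~0.32: roughly a third of each frame is headers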

@onyxmaster

One of the popular switch models I know of with 6x QSFP28 ports (the Nexus 31108PC/TC-V) really needs to be put into a non-default hardware profile mode to use all of them; in the default mode it has only 4x QSFP28 and 2x QSFP+. Maybe the other switches based on the same merchant silicon have these different modes too.

@bhauer
Contributor

bhauer commented May 17, 2018

@onyxmaster Thanks for the alert on that. That is the switch model we're using, so we will need to be mindful of the configuration.

@benaadams
Contributor

Hey, just wondering why, on such small packets, you don't count the IP+TCP overhead of at least 40 bytes? For the small packets in these tests the overhead is quite significant, no?

Plaintext is pipelined, so 16 requests may result in only 2 packets, and 16 responses in 2 packets.
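
(A sketch of why pipelining makes the overhead mostly disappear, assuming a 1500-byte MTU so a 16-request batch spans about 2 segments:)

request_bytes = 167
pipeline_depth = 16
overhead = 40 + 18                        # IPv4 + TCP + Ethernet overhead per packet
batch = pipeline_depth * request_bytes    # 2672 bytes of HTTP per pipelined batch
effective = batch + 2 * overhead          # the batch spans ~2 packets at a 1500-byte MTU
print(effective / batch)                  # ~1.04: only about 4% overhead once pipelined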

@DamianEdwards
Contributor

Yeah, the math was just meant to be an HTTP-only theoretical max, to see if we're in the ballpark, and it's clear we are. It would be interesting to include the overheads to see just how close to the network's theoretical limit we are too.

We're also working to help update the test infrastructure to capture CPU, TX/RX, & packet rate during every run, and include that data in the results.

@msmith-techempower
Member

msmith-techempower commented Sep 10, 2018

Sounds good. 100G fiber here we come 😁

@DamianEdwards I finally got some time, so I set up the 100G cards and plugged everything in this morning, but I found that they seem to not be plug-n-play (sort of expected) and Intel does not support Ubuntu server. Have you guys experienced this as well? Any workaround?

I think the worst-case scenario would be that we switch the machines over to a supported CentOS/openSUSE/RHEL to get driver support. Since everything is done in Docker, I do not really see any cause for concern.

@msmith-techempower
Member

@bhauer asked that I cc @sebastienros for the above concern.

Basically, we want to make sure that there is parity between our environments, so if you guys got the Debian drivers working (I'm not at all sure how or if that's possible), then we would want to stick with that approach; otherwise, let's agree on a choice and proceed.

@sebastienros
Contributor

I will ask our colleagues who manage the lab, but as of last week I know they had not tried it yet.

@ghost

ghost commented Oct 22, 2018

This is kind of the most classical flaw of benchmarking - bottlenecking the wrong part and crediting something different.

Michael Jackson looked at the moon, so did Leonardo Da Vinci.
They are both dead.
The moon kills people.

You don't need to chase some particular super network adapter at 20 thousand USD - just lower your CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.

@tmds
Contributor Author

tmds commented Feb 15, 2019

Any plans to work on this? I'm looking forward to seeing that plaintext 10x top-1 become a top 10.

@tmds
Contributor Author

tmds commented Feb 15, 2019

You don't need to chase some particular super network adapter at 20 thousand USD - just lower your CPU time and you're done. Much cheaper, and now you're no longer benchmarking essentially noise.

I agree that lowering the available CPU is an alternative to increasing bandwidth. The plaintext numbers are the ones that speak to the imagination, though, and I guess that is why the preference is to increase bandwidth.

I wrote a post about this just now actually: https://medium.com/@alexhultman/why-people-should-stop-listening-to-techempower-c544e7b538a5

TechEmpower provides a set of tests, and runs frameworks against those tests. While most tests don't reflect real world applications, they do stress particular aspects of the frameworks. The results can then be used to improve those frameworks.

This is very much what we've seen in the .NET Core space. Microsoft actively used TechEmpower benchmarks to improve the framework. And those performance benefits are measurable in the end-user applications.

@tmds
Contributor Author

tmds commented Feb 15, 2019

Ehm, no. The top 13 servers are not capped on CPU, so you can't draw any valid conclusion from them. Did you even read the post?

Yes, that is what this issue is tracking: the top of the physical plaintext ranking is noise. Your blog post has a broader message, "Why people should stop listening to TechEmpower", which is what I was replying to.

@tmds
Contributor Author

tmds commented Feb 15, 2019

I have a question regarding the "JSON serialization" test - this looks identical to the plaintext test, only instead of Hello world it's JSON. Why are the results 7x lower in the JSON test? Are you not using pipelining there?

Yes, the JSON test runs without pipelining.

@msmith-techempower
Member

It cannot be the case that 8 servers score almost identically at 7 million req/sec.

That is not necessarily the case. However, it can be (and likely is the) case that 8 test implementations are performing at or near the maximum throughput of our hardware at present, and that's still very interesting.

We are at round 17 now, so you've known this for some time and you still post invalid results.

We disagree on this.

It really tears down the validity of TechEmpower altogether.

and this.

@volyrique
Contributor

It is probably worth mentioning that, due to issue #3804, the fastest frameworks in the cached queries test might also be hitting the network bandwidth limit. For example, according to wrk, servlet-postgresql transfers 1.07 GB/s on average.
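
(A quick check of that number; whether wrk's "GB" is base 1000 or base 1024 is an assumption here, so the real figure may be even closer to the limit:)

transfer_bytes_per_second = 1.07e9        # 1.07 GB/s as reported by wrk, assuming base 1000
gbps = transfer_bytes_per_second * 8 / 1e9
print(f"{gbps:.2f} Gbps")                 # ~8.56 Gbps of a 10 Gbps link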

@zloster
Contributor

zloster commented Mar 25, 2019

Also, ulib is at 1.37M RPS for the case "1 object extracted from the cache" (Edit: currently 16 objects), which is also very similar to the JSON serialisation test (256 and 512 concurrency). Given the math above (1.5M packets per second without pipelining) and assuming 1 packet per response, it is very close.

@ctxcode

ctxcode commented Oct 25, 2021

Isn't the solution simple? Just run the benchmarks on slower hardware. Or am I wrong?

@sebastienros
Contributor

Idea: add a new dimension with the number of cores (1, 4, ..., MAX). That would show how some frameworks behave in constrained environments.
