
pool4 is underperforming #214

Closed · ydahhrk opened this issue Apr 28, 2016 · 19 comments

@ydahhrk
Member

ydahhrk commented Apr 28, 2016

Started here. If my theory is correct, the bug is not tied to --mark; it's tied to the number of rows pool4 contains.

Even if --mark is at fault, this bug should be addressed: RFC 7422 and --mark are (for most purposes) roughly the same feature, and while the former should naturally scale more elegantly as more clients need to be serviced, the latter is more versatile, since it matches clients arbitrarily by means of iptables rules. So I don't see any reason to drop --mark once RFC 7422 is implemented.

I believe this is the symptom that needs to be addressed:

When there are many pool4 entries, translation rate drops noticeably.
@ydahhrk
Member Author

ydahhrk commented May 3, 2016

New status:

Before getting to profiling, I wanted to make sure I had a clear baseline to optimize from, so I tried extracting more data. I wound up with a different opinion on what's going on.

Here's the data. (See the README for details.)

From the fact that the "ip6tables full" curve stays stubbornly below "pool4 full", it looks like pool4 is actually less critical than ip6tables on performance. (At least when populated with 1500 rows.)

That is, I know the pool4 entry lookup can be optimized, but I don't think this will speed up translation as much as Pier hopes.

(@pierky: Do you get different results than me if you prolong the tests through several iperf calls?)

ydahhrk added a commit that referenced this issue May 12, 2016
Added a new by-address index so there's no need to iterate
sequentially over anything now.

The new pool4 is unit-tested and looking stable, but there are two
TODOs preventing the code from being usable:

1. The API changed a little and I still need to tweak the callers.
2. I can't use RCU anymore and all the locking code is commented out.

Progress on #214.
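
(For context, the gist of the "by-address index" change is replacing a sequential walk of pool4 with a keyed lookup. Below is a minimal user-space sketch of the idea; the names and the sorted-array index are illustrative assumptions, not Jool's actual kernel code, which also has to deal with the locking/RCU issues mentioned above.)

/* Illustrative user-space sketch of a "by-address index": keep pool4
 * entries sorted by address and binary-search them, instead of walking
 * a list sequentially. Names and types are made up for this example. */
#include <stdint.h>
#include <stdlib.h>

struct pool4_entry {
	uint32_t addr;     /* IPv4 address, host byte order */
	uint16_t port_min;
	uint16_t port_max;
	uint32_t mark;
};

static int cmp_by_addr(const void *a, const void *b)
{
	const struct pool4_entry *x = a, *y = b;
	return (x->addr > y->addr) - (x->addr < y->addr);
}

/* Table must have been sorted beforehand with
 * qsort(table, n, sizeof(*table), cmp_by_addr).
 * O(log n) instead of an O(n) list walk. */
static struct pool4_entry *find_by_addr(struct pool4_entry *table, size_t n,
		uint32_t addr)
{
	struct pool4_entry key = { .addr = addr };
	return bsearch(&key, table, n, sizeof(*table), cmp_by_addr);
}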
ydahhrk added a commit that referenced this issue May 13, 2016
Was missing:

- Locks.
- Fall back to use interface addresses when pool4 is empty.
- Fix the API users.
- A flush unit test; --flush was crashing pool4.

Looks stable, but I've only run unit and informal integration tests.

Fixes #214.
@pierky
Contributor

pierky commented May 19, 2016

I performed some tests using the current test branch (ba4e7db), with the same hardware and scenario as in my previous tests (100 Mbps NIC, see #175), but this time I introduced the mark-randomizer module. Tests were performed using 20 cycles of iperf.

  • Test n. 1: 1536 ip6tables rules + 1536 jool --pool4 --add statements = 92.7 Mbps
  • Test n. 2: 6400 ip6tables rules + 6400 jool --pool4 --add statements = 35.4/35.9 Mbps
  • Test n. 3: 6400 ip6tables rules and only 1 jool --pool4 --add statement = 35.4/35.9 Mbps

Tests 2 and 3 brought the CPU to 100% because of hardware interrupts; test n. 1 only to 52%.
@ydahhrk do you have similar results?

@ydahhrk
Member Author

ydahhrk commented May 19, 2016

Oh wow, we were planning to report at the same time :)

Before I analyse your data, I'd like to report my tests on the new code.

pool4 looks a lot faster now, at least compared to ip6tables. I even got rid of those annoying waves somehow.

Notice that the code was forked from the Jool 3.5 development branch, which might not be production-ready yet.

@ydahhrk
Member Author

ydahhrk commented May 19, 2016

Tests 2 and 3 brought the CPU to 100% because of hardware interrupts; test n. 1 only to 52%.
@ydahhrk do you have similar results?

Oh, well I'd have to run the tests again, but I can believe it.

I take it that you think of that as a problem?

Is pool4 getting too full? Jool selects ports based on algorithm 3 of RFC 6056. This algorithm degrades "mostly" gracefully because reserved ports tend to scatter themselves randomly across the pool4 domain, which means that when there is a collision, finding a nearby unused port is relatively fast.

That is, until most ports are reserved. When pool4 is completely reserved, for example, the processor will waste a lot of time looping through the whole pool4 domain looking for an unused port.
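
(To make the degradation concrete, here is a minimal user-space sketch in the spirit of RFC 6056's algorithm 3. The hash, the port range and the "in use" check are placeholders, not Jool's actual code.)

/* Simplified, user-space illustration of RFC 6056 algorithm-3-style port
 * selection over a single pool4 port range. The hash, the range and the
 * "in use" check are placeholders; this is not Jool's actual code. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define MIN_PORT 36000
#define MAX_PORT 36999
#define RANGE    (MAX_PORT - MIN_PORT + 1)

static bool in_use[RANGE]; /* stand-in for the "is this port taken?" BIB check */

/* Placeholder for RFC 6056's F(): a hash over the connection tuple + secret. */
static uint32_t tuple_hash(uint32_t daddr, uint16_t dport, uint32_t secret)
{
	return daddr * 31u + dport * 17u + secret;
}

/* Returns a port, or -1 when the whole range is already reserved. */
static int select_port(uint32_t daddr, uint16_t dport, uint32_t secret)
{
	uint32_t offset = tuple_hash(daddr, dport, secret);
	size_t i;

	/* While the range is mostly free, this loop exits after a few probes.
	 * As the range approaches exhaustion it degenerates into a scan of
	 * all RANGE ports per packet, which is the peak described above. */
	for (i = 0; i < RANGE; i++) {
		uint32_t candidate = (offset + i) % RANGE;
		if (!in_use[candidate])
			return MIN_PORT + candidate;
	}

	return -1; /* pool4 entry exhausted */
}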

This is an approximate representation of how port selection should degrade as the number of reserved bindings reaches the limit imposed by pool4:

[graph: port-selection cost rising as reserved bindings approach the pool4 limit]

Maybe this can be optimized, too.

@pierky
Contributor

pierky commented May 19, 2016

I'll run new tests too: my last ones used a short range of ports per pool4 entry (~30), so collisions may have hurt performance. I'll try with more ports per entry.

@ydahhrk
Member Author

ydahhrk commented May 19, 2016

We might be looking at a different problem than the port selection peak, actually.

I failed to mention it in my previous post, but the y-axis in the graph has an upper limit. (Which is the reason why I didn't draw it as an arrow.)

That limit is the number of transport addresses the relevant pool4 mark has.

So if your client's mark only has 30 ports, then that graph degrades as shown, except the peak is at 30. In other words, the worst-case scenario is 30 iterations. I don't think 30 iterations per packet are enough to keep your processor busy. (Edit: I deleted a bunch of text here because it didn't really go anywhere.)

Edit: Actually, scratch that idea. If you ran 20 iperf calls, then each client is only using 20 of the 30 pool4 transport addresses. Assuming the mark randomizer didn't cause clients to be mapped to marks that didn't belong to them, you are not exhausting pool4 entries.

On the other hand, is it really degrading badly? I see two sort-of-comparable numbers (92.7 and 35.4/35.9), but that's not enough to draw a curve.

@ydahhrk
Member Author

ydahhrk commented May 19, 2016

(See edits above)

Questions:

  • Is this 100% processor usage leading to unresponsiveness and packet drops?
  • What is our target client count? (I mean, no matter how much we optimize it, there will always be a number of clients that will lead to 100% processor usage)

@pierky
Contributor

pierky commented May 19, 2016

Actually, I don't have a real target client count; I just wanted to see how much better the new code performs and, as you also said, it's very good to see how much faster it is now.
I hope to have more time to run new tests in a full GbE environment and to build a config closer to a real-life scenario (something like 1000/2000 ports per entry). At that point I'll capture more metrics to see how that curve degrades.

Edit: the "do you have similar results" was not related to CPU usage; I just wondered if you had the same performance improvement with the new code. Sorry, my fault.

@ydahhrk
Member Author

ydahhrk commented May 19, 2016

Oh, ok. Thank you. :)

Then I guess that's it for the moment. I'll go back to tweaking the other issues.

@ydahhrk
Member Author

ydahhrk commented May 24, 2016

So I noticed the other day that the new bottleneck (ip6tables accessing thousands of entries sequentially) can also be addressed by using a particular variation of the MARK target.

The basic idea is, we might not be able to prevent ip6tables from walking through the whole database, but we can condense several ip6tables entries into one.
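
(To sketch the idea: a single range rule can derive the mark from the packet's source address instead of needing one rule per client. The snippet below assumes the mark is first_mark plus the index of the client's /64 within the rule's /56, which is consistent with the rule listings further down this thread; the names are made up, and this is not the module's actual code.)

/* User-space sketch of what a MARKSRCRANGE-style rule computes: one rule
 * covers a whole /56 and maps each client /64 to its own mark. Assumed
 * semantics (mark = first_mark + /64 index); names are made up. */
#include <stdint.h>
#include <netinet/in.h>

struct srcrange_rule {
	struct in6_addr prefix;      /* e.g. 2001:db8:1234:5600:: */
	unsigned int prefix_len;     /* 56: one rule per /56 */
	unsigned int sub_prefix_len; /* 64: one mark per client /64 */
	uint32_t first_mark;         /* e.g. 4097 */
};

/* For the /56 -> /64 case, the /64 index is simply byte 7 of the source
 * address (bits 56-63), so a client at 2001:db8:1234:56ff::2 under a rule
 * whose marks start at 4097 gets mark 4097 + 0xff = 4352. */
static uint32_t compute_mark(const struct srcrange_rule *rule,
		const struct in6_addr *src)
{
	uint32_t index = src->s6_addr[7];
	return rule->first_mark + index;
}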

The following graph shows the Mbits/sec I gained by swapping 2048 MARK rules/2048 pool4 entries (yellow pattern) for the equivalent 1 MARKSRCRANGE rule/2048 pool4 entries (orange pattern). The blue pattern is 0 rules/0 pool4 entries:

[graph: MARK vs MARKSRCRANGE throughput comparison]

Here are the details of the experiment.

@ydahhrk ydahhrk added the Status: Tested Needs release label May 24, 2016
@pierky
Contributor

pierky commented May 26, 2016

So, very good results here using the brand new MARKSRCRANGE (NICMx/mark-src-range@0e3fde5).

I performed some measurements in the following scenario:

|--------|            |-------|            |----------|
| sender | --[GbE]--> | NAT64 | --[GbE]--> | receiver |
|--------|            |-------|            |----------|

NAT64 is the same hardware as before; this time it's connected via GbE to the other two hosts.

Using the old configuration (6400 jool/ip6tables rules, without MARKSRCRANGE and with mark-randomizer) I got results close to the previous ones: TCP 42.6 Mbps.

In the new configuration, I used the same branch of Jool as in my previous test (ba4e7db), this time with 6656 pool4 entries (jool --pool4 prints 19968 samples) and 6656 source IPv6 prefixes marked using 26 ip6tables MARKSRCRANGE rules (26 x /56, split into /64s), with 1000 ports per source /64.

The sender's IPv6 address falls in the last ip6tables MARKSRCRANGE rule.

Using iperf from sender to receiver, I got:

  • TCP 10 seconds: avg 830 Mbps (10 iterations: 777, 784, 829, 928, 928, 928, 928, 759, 777, 757)
  • TCP 30 seconds: avg 881 Mbps (10 iterations: 928, 928, 928, 928, 770, 776, 928, 928, 928, 771)
  • UDP 5 threads @ 200Mbps/each, -l 1450: 943 Mbps (10 iterations, always 943)

These are the same values I obtained using iperf between sender and NAT64 directly (peak at 929 Mbps TCP and 941 Mbps UDP).

@ydahhrk
Member Author

ydahhrk commented May 27, 2016

(^_^)/ \(^_^)

These are the same values I obtained using iperf between sender and NAT64 directly (peak at 929 Mbps TCP and 941 Mbps UDP).

Does this mean that translation adds no overhead whatsoever?

I find this a little too impressive (as in "worrying")

@pierky
Contributor

pierky commented May 27, 2016

I find it odd too; this is why I wanted to report it here.

I had only a little time to run these tests, and I spent most of it setting up the new scenario with the two GbE-enabled hosts (sender and receiver). The hosts are connected directly to each other, just like in the "diagram" above; no switches or other equipment were used. I can't say I saw with my own eyes (for example with tcpdump) packets entering from the v6 interface and leaving from the v4 one, but the iperf commands used on the sender (iperf -V -c 64:ff9b::10.0.0.2) and on the receiver (iperf -s), together with the network topology and addressing scheme, make me confident that they were translated by Jool somehow. The source ports seen by the receiver were consistent with those expected from the translation rule for the given source prefix. I also had a look at top's output on NAT64 during a transfer and saw the two CPUs handling the NICs' interrupts rise to 90% and 30% (if I recall correctly).

Unfortunately this will have to wait until next week, when I'll spend some time on it again; now that the lab is already up, I'll have more time for the real measurements. First of all I'll double-check every step and capture and write down all the aforementioned metrics.

In the meantime, any suggestion or hint to dispel this doubt will be much appreciated. :-)

@ydahhrk
Member Author

ydahhrk commented May 27, 2016

Well, I'm guessing it most likely won't add much valuable insight, but maybe the BIB/session tables can also be queried to validate that Jool is actually doing what we expect it to.

@pierky
Contributor

pierky commented May 27, 2016

I forgot to mention it, but I also checked --session's output, and the iperf sessions were there.

Yes, I feel like I'm missing something very big; maybe I'm not seeing the forest for the trees here.
I'll reproduce the tests and try to dig deeper into this unexpected (excessively well-performing) behaviour.

@ydahhrk
Member Author

ydahhrk commented May 27, 2016

Thanks :)

@pierky
Contributor

pierky commented May 30, 2016

So, everything seems fine with the results I got last week.

Packets from sender to receiver enter the v6 interface and leave through the v4 interface; Jool translates them and uses the right IPv4:port expected from the MARKSRCRANGE mark / pool4 entry mapping. (The sender's /64, 2001:db8:1234:56ff::/64, is the 256th /64 under the 2001:db8:1234:5600::/56 rule, so it gets mark 4097 + 255 = 4352, whose pool4 entry is 10.0.0.75, ports 36000-36999; those are exactly the ports that show up in the session table below.)

NAT64 is using CPU3 for the v6 interface's interrupts and CPU0 for v4 interface's ones.

IPv6-to-IPv6 tests from sender toward NAT64 give 928 Mbps, with max 20% CPU3 hardware interrupt (as reported by top).

IPv6-to-IPv4 tests from sender toward receiver give the same value, 928 Mbps, with max 90% CPU3 and 53% CPU1 hardware interrupts.

So, CPU3 (= v6 interface interrupts) rises from 20% to 90%, but traffic keeps flowing without problems.

I tried putting the ip6tables rule that matches the sender's source address both at the end and in the middle of the ip6tables ruleset, with no change in the results.

  • sender's source address: 2001:db8:1234:56ff::2/64
  • ip6tables ruleset:
MARKSRCRANGE  all      2001:db8::/56       ::/0                marks 1-256 (0x1-0x100) /56/64
MARKSRCRANGE  all      2001:db8:1::/56     ::/0                marks 257-512 (0x101-0x200) /56/64
...
MARKSRCRANGE  all      2001:db8:15::/56    ::/0                marks 3841-4096 (0xf01-0x1000) /56/64
MARKSRCRANGE  all      2001:db8:1234:5600::/56  ::/0                marks 4097-4352 (0x1001-0x1100) /56/64
MARKSRCRANGE  all      2001:db8:16::/56    ::/0                marks 4353-4608 (0x1101-0x1200) /56/64
MARKSRCRANGE  all      2001:db8:17::/56    ::/0                marks 4609-4864 (0x1201-0x1300) /56/64
...
MARKSRCRANGE  all      2001:db8:30::/56    ::/0                marks 7937-8192 (0x1f01-0x2000) /56/64
  • jool's --pool4 entry:
4352    TCP     10.0.0.75       36000-36999
  • jool --session --csv --numeric
TCP,2001:db8:1234:56ff::2,58094,64:ff9b::a00:2,5001,10.0.0.75,36661,10.0.0.2,5001,00:03:03.60,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58096,64:ff9b::a00:2,5001,10.0.0.75,36662,10.0.0.2,5001,00:03:13.84,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58098,64:ff9b::a00:2,5001,10.0.0.75,36663,10.0.0.2,5001,00:03:23.116,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58100,64:ff9b::a00:2,5001,10.0.0.75,36664,10.0.0.2,5001,00:03:33.148,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58102,64:ff9b::a00:2,5001,10.0.0.75,36665,10.0.0.2,5001,00:03:43.180,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58104,64:ff9b::a00:2,5001,10.0.0.75,36666,10.0.0.2,5001,00:03:53.212,V4_FIN_V6_FIN_RCV
TCP,2001:db8:1234:56ff::2,58106,64:ff9b::a00:2,5001,10.0.0.75,36667,10.0.0.2,5001,02:00:00.0,ESTABLISHED
  • iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36661
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36662
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36663
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36664
[  5]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36665
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36666
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36667
[  4]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36668
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec
[  4] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36669
[  4]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec
[  5] local 10.0.0.2 port 5001 connected with 10.0.0.75 port 36670
[  5]  0.0-10.0 sec  1.08 GBytes   928 Mbits/sec

NAT64 configuration:

  • Dell Computer Corporation PowerEdge 2850/0T7971
  • 2 x Intel(R) Xeon(TM) CPU 3.00GHz, 800 MHz bus, 2 MB L2 cache (fam: 0f, model: 04, stepping: 03)
  • 2 GB RAM, ECC DDR2
  • 2 x Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05), e1000 (PCI:66MHz:32-bit) Intel(R) PRO/1000 Network Connection
  • uname -a output:
Linux nat64-test 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

@ydahhrk
Member Author

ydahhrk commented May 30, 2016

Thank you for your efforts!

I'm guessing a single client is simply unable to saturate the NAT64 now that, in this configuration, Jool has stopped being the bottleneck. If more clients and bandwidth are added to the mini-DoS attack, the CPUs are probably going to start hobbling.

Also, iperf only measures bandwidth. Other parameters (latency, throughput, jitter...) might also provide further insight.

(But I'd say that is outside of the scope of this issue.)

@ydahhrk ydahhrk added this to the 3.5.0 milestone Jun 15, 2016
@ydahhrk
Member Author

ydahhrk commented Sep 26, 2016

3.5 released; closing.

@ydahhrk ydahhrk closed this as completed Sep 26, 2016
@ydahhrk ydahhrk removed the Status: Tested Needs release label Sep 26, 2016