Beta testing Gatekeeper #417
I don't think we need to specify https://github.com/AltraMayor/gatekeeper/blob/master/config/dynamic.c#L486-L499 Instead, I think that the call to If so, it is likely because:
If it seems to be one of the above issues, let me know, as I think we should certainly add a log message with more information. |
In my log file I see: GATEKEEPER: lpm: IPv4 lookup miss Here is my network config:
I double-checked that my back interface (10.255.0.226/29) is in the same subnet as the gateway (10.255.0.225). |
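The same-subnet check mentioned above can be reproduced outside Gatekeeper. This is a minimal sketch using Python's standard `ipaddress` module; the helper name `same_subnet` is hypothetical and is not part of Gatekeeper.

```python
import ipaddress


def same_subnet(interface_cidr, gateway):
    """Return True if `gateway` falls inside the subnet of `interface_cidr`."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(gateway) in iface.network


# Addresses taken from the comment above.
print(same_subnet("10.255.0.226/29", "10.255.0.225"))
```

A check like this only confirms the addressing; it says nothing about whether the gateway actually answers ARP requests.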
Hi @dahnevskiy, Although my side notes below don't address the problem at hand, they are still relevant here:
|
Please ignore item 1 above; my mistake. |
Thanks for the clarifications @dahnevskiy, I will see if I can replicate the issue. |
Are you actually seeing Try adding this to the end of your Lua script to dump the ARP table:
If the gateway entry you're adding (10.255.0.225) doesn't respond to ARP requests, it will show up as |
I added the LLS check to the Lua script, and yes, I get the stale ARP:
I found and fixed the problem in my configuration, thanks for the help! By the way, I can't use a /30 netmask in my net.lua. If I use it, for example in my net.lua:
in my logs, when I launch Gatekeeper:
It's not a huge problem for us, since we can use a /29 or a less specific network, but I guess it may be a bug and you'd want to know about it. |
Hi @dahnevskiy, We are going to investigate the /30 bug, but would you mind describing how you solved the configuration issue? Understanding it may help us to conceive ways to make the configuration more robust. |
I'm continuing my tests, and I'm trying to generate a 1 Mpps SYN flood toward destination 10.254.71.130. On my Grantor server I see these logs:
I guess it's a problem, because on 10.254.71.130 I can't see any traffic. |
I just had a problem with the links from the Gatekeeper server to the ASR9k; nothing to do with the Gatekeeper software. |
The log message This is not a fundamental limitation and we can think of how to support it. Would it be possible for us to schedule a meeting, so our team can understand your deployment environment? |
Looks like the DPDK cuckoo hash table must have a length of at least 8. |
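To illustrate why a /30 could trip a minimum-size limit while a /29 does not, here is a small sketch. It assumes, purely for illustration, that a table is sized with one entry per address in the subnet; the thread does not say how Gatekeeper actually sizes its tables, and `neighbor_table_size` is a hypothetical helper.

```python
import ipaddress

# Minimum size reported in this thread for DPDK hash tables.
MIN_HASH_ENTRIES = 8


def neighbor_table_size(cidr):
    """Hypothetical sizing rule: one entry per subnet address, clamped to the minimum."""
    n_addresses = ipaddress.ip_network(cidr, strict=False).num_addresses
    return max(n_addresses, MIN_HASH_ENTRIES)


for cidr in ("10.255.0.224/30", "10.255.0.224/29", "10.255.0.224/28"):
    network = ipaddress.ip_network(cidr)
    print(cidr, network.num_addresses, neighbor_table_size(cidr))
```

Under that assumption, a /30 yields only 4 addresses, so an unclamped size would fall below the minimum, whereas a /29 yields exactly 8.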
Thank you for the helpful diagram. Based on your diagram, there's only one gateway for all traffic that is not local to a Grantor server. Is this assumption correct? If so, Grantor servers can support a single gateway for non-local traffic and still avoid having an LPM lookup. Would this solution work for you? |
If I understand you correctly - this assumption is correct. |
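The single-gateway idea discussed above can be sketched as a next-hop decision that needs no LPM table: destinations inside the local subnet are reached directly, and everything else goes to the one gateway. This is an illustrative sketch, not Gatekeeper's code; `next_hop` is a hypothetical name and the addresses are borrowed from earlier comments only as examples.

```python
import ipaddress


def next_hop(dst, local_cidr, gateway):
    """Pick the next hop without a longest-prefix-match lookup.

    Local destinations are their own next hop; all non-local traffic
    is sent to the single configured gateway.
    """
    local_network = ipaddress.ip_interface(local_cidr).network
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip in local_network:
        return dst_ip
    return ipaddress.ip_address(gateway)


print(next_hop("10.255.0.227", "10.255.0.226/29", "10.255.0.225"))
print(next_hop("10.254.71.130", "10.255.0.226/29", "10.255.0.225"))
```

The trade-off is that this only works when there really is a single exit for non-local traffic, which is the assumption confirmed in the reply above.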
The DPDK hash table library requires that hash tables be of at least size 8, as described in AltraMayor#417.
This patch allows Grantor servers to support a single gateway for non-local traffic and still avoid having an LPM lookup. It also demonstrates the network configurations for a deployment scenario, as described in AltraMayor#267 and AltraMayor#417.
This patch replaces the requirement that Grantor servers be deployed in the same subnet as the protected destination with the requirement that either Grantor servers are deployed in the same subnet, or the last hop on the path from a Gatekeeper server to a Grantor server is a router that can forward the encapsulated packets to their destinations. This new requirement supports the deployment environment discussed in issue AltraMayor#417.
Hi @dahnevskiy, Both issues have been fixed: the 30-bit prefix length and replying to the router for non-local traffic. None of these improvements changed the configuration files, but you'll have to compile the source to obtain the binaries. |
Thanks! |
I continue my tests. my environment: net_config:
my lua script:
In my logs I can see ARP from front and back interface of gatekeeper:
but when I apply my lua script, gatekeeper says:
and in logs:
I can't understand why, because for now I have ARP entries from both front and back interface. |
The checksum bug discussed in AltraMayor#417 (comment) arises because we assumed that the two's complement subtraction currently in use is the same as "subtracting complements with borrow" under one's complement, as required by RFC1624. While these two operations often produce the same result, they also frequently differ. In addition, the two's complement subtraction is not endianness preserving! To solve this issue, we followed the example implementation of RFC1624 [Eqn. 3] in the Linux kernel: https://elixir.bootlin.com/linux/latest/source/net/ipv4/netfilter/ipt_ECN.c#L38
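For readers unfamiliar with RFC 1624, the incremental update of Eqn. 3, HC' = ~(~HC + ~m + m'), can be sketched as follows. This is a Python illustration rather than the C code of the patch, and the function names are hypothetical. All additions are one's complement additions, i.e. the carry out of 16 bits is folded back in.

```python
def ones_complement_add(a, b):
    """Add two 16-bit values, folding the carry back in (end-around carry)."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def checksum(words):
    """Full Internet checksum over a sequence of 16-bit words."""
    total = 0
    for word in words:
        total = ones_complement_add(total, word)
    return ~total & 0xFFFF


def update_checksum(hc, m_old, m_new):
    """RFC 1624 Eqn. 3: HC' = ~(~HC + ~m + m')."""
    s = ones_complement_add(~hc & 0xFFFF, ~m_old & 0xFFFF)
    s = ones_complement_add(s, m_new)
    return ~s & 0xFFFF


# Worked example from RFC 1624: HC = 0xDD2F, m = 0x5555, m' = 0x3285.
print(hex(update_checksum(0xDD2F, 0x5555, 0x3285)))
```

The incremental result can be cross-checked by recomputing the checksum over the whole modified header, which is a convenient way to test an implementation when no live traffic is available.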
Hi @dahnevskiy, We have merged pull request #425 that we believe fixes the checksum bug you describe above. However, we currently don't have a proper way to test the code. So, even if it's fine, we'd appreciate it if you reported back. We have also fixed the issue related to the error message We are going to take a look at the log entries from |
It works, thanks! I think I found a problem with the KNI interfaces: if an IPv6 address is NOT defined in net.lua, for example:
then after receiving an IPv6 neighbor discovery packet from the ASR9k, in the logs:
After that, the IPv4 address disappears from the KNI front interface within 15-30 seconds. In ifconfig, right after Gatekeeper starts, it looks fine:
but after 30 seconds:
the IPv4 address has disappeared from the kni_front interface. |
The error message I don't see why this is causing the IPv4 address to be lost on the KNI, but I will look into it. However, I'm curious why there is an IPv6 address on the front KNI if no IPv6 address was configured on Gatekeeper's front interface. The addresses on the KNI should mirror the addresses from its Gatekeeper interface counterpart. Did you separately add an IPv6 address to the KNI? |
Actually, on testing this, I see that even when IPv6 is not configured on Gatekeeper, the KNI is automatically assigned a link-local IPv6 address from Linux. This should be fine, and it is expected that you'll see the What's not expected, as you pointed out, is that the Are you doing any sort of operations on the KNI devices from Linux, i.e. |
I'm not doing any operations on the KNI devices from Linux, but I will double-check and report back. |
Oh, the problem was on my side: it was the NetworkManager service on CentOS:
So I disabled this service, because we don't use it, and everything works fine. Sorry for wasting your time :( |
Hi @dahnevskiy, We haven't been able to reproduce the repeated log entries from |
I apologize for such a long response. I'm continuing my tests; I recompiled Gatekeeper yesterday from the current master branch. I'm using a 3 Mpps SYN flood with spoofed source IP addresses from subnet 10.161.0.0/24, so as far as I understand, Gatekeeper should have created 255 flows. Gatekeeper did create them, but in the logs:
On lcores 2, 3, 4, and 5 there is a huge amount of pkts_num_request, but I generate only 255 flows...
but it seems that the GK instance running on this lcore just ignores this flow entry and sends GK_REQUEST to Grantor. Also, while my SYN flood test is running, I see multiple messages in the logs:
Perhaps these messages are somehow related to this problem... Unfortunately, the problem with gk_del_flow_entry_from_hash() still exists:
|
I generated a SYN flood with only 5 spoofed source addresses in the range 10.161.0.1 - 10.161.0.5, and it works well!
There is no pkts_num_request, as expected. Also, I don't have logs |
But if I generate a SYN flood with 15 source addresses (10.161.0.1 - 10.161.0.15), I have all the problems described above. |
It seems that I can reproduce this problem with a SYN flood from just 1 IP. The flow was installed on lcore 5, but it was ignored:
All packets were encapsulated to Grantor as pkts_num_request. After that, while the SYN flood was running, I restarted Gatekeeper, and after the restart it works fine:
|
Perhaps I found a possible problem. This is a good flow, working as expected:
And this is a bad flow; Gatekeeper doesn't use this flow and sends all traffic to Grantor as GK_REQUEST:
I repeated this test multiple times, and all bad flows always have flow hash value = 0. |
Hi @dahnevskiy, Having a wrong flow hash would explain most of what's going on. But the problem can be subtle because this value is supposed to be computed by the NIC, so it shouldn't be absent. The flow hash is computed in software only when the information comes from the GGU block. Based on the information you posted, other things could be going on at the same time. For example, Just to confirm, the SYN flood is the only traffic going toward the Gatekeeper servers, isn't it? Would it be possible to share the whole log file? |
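As background for the flow hash discussed above, NICs commonly compute the RSS hash with the Toeplitz function over the packet's address and port fields. The sketch below is a generic software version for illustration only. The key is an example value, the tuple layout (source address, destination address, source port, destination port) is the common IPv4/TCP convention, and the ports are made up; none of this is read from the NIC or from Gatekeeper in this deployment.

```python
import ipaddress
import struct

# Example 40-byte RSS key; an assumption for illustration, not read from any NIC.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key, data):
    """Generic Toeplitz hash: XOR a sliding 32-bit key window for every set input bit."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # Window starts at the position of the current input bit.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def ipv4_tcp_tuple(src, dst, sport, dport):
    """Pack the hash input in network byte order (hypothetical helper)."""
    return (ipaddress.ip_address(src).packed
            + ipaddress.ip_address(dst).packed
            + struct.pack("!HH", sport, dport))


# Addresses from this thread; the ports are hypothetical.
flow = ipv4_tcp_tuple("10.161.0.4", "10.254.71.130", 40000, 80)
print(hex(toeplitz_hash(RSS_KEY, flow)))
```

If a flow entry is stored under one hash value and later looked up under another (for example, a stored value of 0 versus the value the NIC reports), the lookup misses, which is consistent with the symptom of packets being treated as new requests.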
Could you dump a couple of flows that show up in the log entries of |
I allocated 20 CPUs to GK processes in lua/main_config.lua. It's expected, but it's not working correctly. Yes, the SYN flood is the only traffic going toward the Gatekeeper servers. I dumped flows from the log entries of gk_del_flow_entry_from_hash():
So yes, they all have an absent flow hash. |
Is the front interface of the Gatekeeper server in your deployment going to be 10Gbps? In our tests with 10Gbps front interfaces, more than two GK blocks per NUMA node was only needed when the flow tables of the GK blocks were too large and faster scanning for expired entries was needed. What value are you assigning to variable If you are using faster NICs as front interfaces, more GK blocks are needed to handle the extra packets. But we don't have guidance for this setup at this point; it's one of our future milestones. We are going to continue the investigation to figure out what's going wrong when |
Hi @dahnevskiy, I just issued a patch that solves some of the problems going on, if not all. Could you test it? There's a possibility that more than one problem is going on, but with one less problem, it should be easier to diagnose what is left. |
With n_gk_lcores >= 16: I used a SYN flood from source 10.161.0.4, and it seems that the problem flows are now not created at all. In the logs I have multiple:
But the flow doesn't exist:
There is no gk_del_flow_entry_from_hash error, but I guess that's because, instead of creating flows with an absent flow hash, this version doesn't create some of the flows at all. With n_gk_lcores < 16, everything works as expected. |
As far as I can tell, this last issue is related to how the front NIC is responding to the setup of the RSS function. My hypothesis is that there is something wrong with the initialization of the RSS function that is throwing packets into queue 0, which belongs to the LLS block. This explains why the LLS block is receiving lots of packets that are not destined to it and the flow is not being created in the GK block (i.e. the GK block running on lcore 41 never receives its packets). Would it be possible for us to have an SSH connection to the test Gatekeeper server? We don't need to run the experiment, we only need to run |
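The hypothesis above can be pictured with a toy model of an RSS redirection table (RETA): the NIC uses the low bits of the RSS hash to index the table, and the table entry names the receive queue. The queue numbers and table size below are hypothetical; only the statement that queue 0 belongs to the LLS block comes from this thread.

```python
def build_reta(queues, size=128):
    """Spread the given receive queues round-robin over a RETA of `size` entries."""
    return [queues[i % len(queues)] for i in range(size)]


def queue_for(rss_hash, reta):
    """Select the receive queue for a packet from its RSS hash."""
    return reta[rss_hash % len(reta)]


LLS_QUEUE = 0                  # per the thread, queue 0 belongs to the LLS block
gk_queues = [1, 2, 3, 4]       # hypothetical queues for GK blocks

good_reta = build_reta(gk_queues)
broken_reta = [LLS_QUEUE] * 128  # models an RSS setup that did not take effect

sample_hashes = [0x12345678, 0x0BADF00D, 0xCAFEBABE]
print([queue_for(h, good_reta) for h in sample_hashes])
print([queue_for(h, broken_reta) for h in sample_hashes])
```

In the broken case every packet lands on the LLS queue regardless of its hash, so the GK block that owns the flow never sees the traffic, matching the symptom described above.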
Unfortunately, an SSH connection is not possible, but we have other options:
|
Let's take both options. I describe below a couple of things that you can do to gather some general information, and we should schedule an online meeting next week to probe the issue further if needed. I'm going to find some time slots next week for the online meeting. I need to talk to other project members to see if someone can join us. While optional, it's better to replace
Print the function parameters
Once done with each call to
|
Our group can join the debugging online meeting on either 24th or 25th at 9am EST. Which date works for you? |
Hello. Unfortunately, for now I don't have time for gdb debugging and troubleshooting, but I will return to this issue in 2 weeks, so please don't close it, and we will continue our work. Also, we really want support for i40e network cards, but as far as I can see, this will require an upgrade of the DPDK version... |
That's such great news! We look forward to learning about your deployment. We'll keep this issue open until we nail the issue of the NIC initialization. Could you explain the issue you have found with i40e NICs? The version of DPDK that Gatekeeper is currently using already includes the i40e driver. |
It's been more than two months now without an update and this issue is already quite big, so I'm closing it. If the NIC initialization is still happening, please open a new issue just for it. Thank you for all the help testing Gatekeeper. |
Hello!
I'm trying to inject a prefix into the FIB using lua/examples/example_of_dynamic_config_request.lua
As far as I can see, this example file is not working:
because it's missing:
require "gatekeeper/dylib"
which we need in order to call functions such as dylib.c.add_fib_entry.
I fixed this, so this is my simple final Lua script:
but it's still not working when I try to apply the changes:
Perhaps there is an error in dylib.lua, or maybe I'm doing something wrong, because C++ and Lua are not my strong side :)