Beta testing Gatekeeper #417

Closed
dahnevskiy opened this issue Jul 20, 2020 · 53 comments

@dahnevskiy

Hello!

I'm trying to inject a prefix into the FIB using lua/examples/example_of_dynamic_config_request.lua.

As far as I can see, this example file does not work because it is missing:

require "gatekeeper/dylib"

which is needed to call the dylib.c.add_fib_entry function, for example.

I fixed this, so this is my simple final Lua script:

require "gatekeeper/staticlib"
require "gatekeeper/dylib"

local dy_conf = staticlib.c.get_dy_conf()
if dy_conf == nil then
	error("Failed to allocate dy_conf")
end

local ret = dylib.c.add_fib_entry("192.168.0.0/16", "10.255.0.66", "10.255.0.225", dylib.c.GK_FWD_GRANTOR, dy_conf.gk)
if ret < 0 then
        return "gk: failed to add an FIB entry\n"
end

But it still does not work when I try to apply the changes:

[root@srv351531 build]# ./gkctl lua/dyn.config
./lua/gatekeeper/dylib.lua:68: ';' expected near '=' at line 2

Perhaps there is an error in dylib.lua, or maybe I am doing something wrong, because C++ and Lua are not my strong suit :)

@cjdoucette
Collaborator

I don't think we need to specify require "gatekeeper/dylib" in the file because we add the path to the dynamic configuration file when Gatekeeper is started, and load that file:

https://github.com/AltraMayor/gatekeeper/blob/master/config/dynamic.c#L486-L499

Instead, I think that the call to add_fib_entry() might be failing because the FIB is not correctly set up. If you look in the Gatekeeper log, do you see entries that say GATEKEEPER: lpm: IPv4 lookup miss?

If so, it is likely because:

  1. The back interface of your Gatekeeper server is not in the same network as the gateway address in the add_fib_entry() call (see the sketch below).

  2. Even if you fixed (1), your back interface/gateway and Grantor appear to be in the same network (10.255.0.0/24?), a feature which is not currently supported (What are the expected network configurations across a deployment? #267 (comment)) because it is not in our targeted deployment scenarios.

If it seems to be one of the above issues, let me know, as I think we should certainly add a log message with more information.
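To illustrate item 1 above (all addresses here are hypothetical, not taken from this deployment): if the back interface were configured as 10.0.2.1/24, a FIB entry whose gateway lies inside that subnet would look roughly like this, with the Grantor itself in a different network as item 2 requires.

-- Hypothetical sketch: the gateway 10.0.2.254 is inside the back network
-- 10.0.2.0/24, while the Grantor 10.0.3.10 sits in another network.
local ret = dylib.c.add_fib_entry("192.168.0.0/16", "10.0.3.10",
	"10.0.2.254", dylib.c.GK_FWD_GRANTOR, dy_conf.gk)
if ret < 0 then
	return "gk: failed to add an FIB entry\n"
end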

@AltraMayor AltraMayor added this to the First deployment milestone Jul 20, 2020
@dahnevskiy
Author

In my log file I see:

GATEKEEPER: lpm: IPv4 lookup miss
GATEKEEPER: lpm: IPv4 lookup miss

Here is my network config:

	local front_ports = {"enp133s0f0"}
	local front_ips  = {"10.255.0.18/29"}
	local front_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
	local front_vlan_tag = 0x123
	local front_vlan_insert = false
	local front_mtu = 1500

	local back_ports = {"enp133s0f1"}
	local back_ips  = {"10.255.0.226/29"}
	local back_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
	local back_vlan_tag = 0x456
	local back_vlan_insert = false
	local back_mtu = 1500

I double-checked that my back interface (10.255.0.226/29) is in the same subnet as the gateway (10.255.0.225).
Also, the Grantor server (10.255.0.66) is not in my Gatekeeper subnet (10.255.0.224/29).

@AltraMayor
Owner

Hi @dahnevskiy,

Although my side notes below don't address the problem at hand, they are still relevant here:

  1. There is a non-zero bit after the prefix length in the network prefix of the back network: 10.255.0.226/29. In detail, the last octet 226 in binary is 1110 0010, and the /29 mask covers the first 5 bits of that octet, namely "1110 0", so the remaining bits (i.e. "010") should be all zeros to avoid confusion. Replacing 226 with 224 would solve this issue.

  2. The MTU of the back network should be slightly bigger than the MTU of the front network to avoid the heavy work of fragmenting the encapsulated packets sent to Grantor servers (see the sketch below). This extra, unnecessary load can cost a lot during a DDoS attack.
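A rough sketch of what item 2 looks like in net.lua (the 1600 below is only an illustrative value, not a recommendation; the only assumption is that the encapsulation toward Grantor adds a small, fixed header on top of the front-side packet):

	-- Front network MTU (what the front/upstream side uses).
	local front_mtu = 1500
	-- Back network MTU: a bit larger than front_mtu, so packets that are
	-- encapsulated toward Grantor servers do not have to be fragmented.
	-- Illustrative value; e.g. 2048 is used later in this thread.
	local back_mtu = 1600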

@AltraMayor
Owner

Please ignore item 1 above; my mistake.

@cjdoucette
Collaborator

Thanks for the clarifications @dahnevskiy, I will see if I can replicate the issue.

@cjdoucette
Collaborator

Are you actually seeing gk: failed to add an FIB entry or is it failing silently?

Try adding this to the end of your Lua script to dump the ARP table:

-- Give the LLS block a moment to resolve ARP before dumping its cache.
os.execute("sleep 3")
local llsc = staticlib.c.get_lls_conf()
if llsc == nil then
        return "lls: failed to fetch config to dump caches"
end
-- Dump each ARP entry in the LLS cache.
return dylib.list_lls_arp(llsc, dylib.print_lls_dump_entry, "")

If the gateway entry you're adding (10.255.0.225) doesn't respond to ARP requests, it will show up as stale in the ARP table and you'll get the GATEKEEPER: lpm: IPv4 lookup miss errors in the log.

@dahnevskiy
Author

I added the LLS check to the Lua script, and yes, I get the stale ARP entry:

[root@srv351531 build]# ./gkctl lua/dyn.config
LLS cache entry:: [state: stale, ip: 10.255.0.225, mac: 00:00:00:00:00:00, port: 0]

I found and fixed the problem in my configuration, thanks for the help!

By the way, I can't use a /30 netmask in my net.lua.

If I use it, for example:

        local front_ports = {"enp133s0f0"}
        local front_ips  = {"10.255.0.226/30"}
        local front_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local front_vlan_tag = 0x123
        local front_vlan_insert = false
        local front_mtu = 1500

        local back_ports = {"enp133s0f1"}
        local back_ips  = {"10.255.0.18/30"}
        local back_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local back_vlan_tag = 0x456
        local back_vlan_insert = false
        local back_mtu = 1500

then, when I launch Gatekeeper, I see in my logs:

GATEKEEPER: main: cycles/second = 2200008189, cycles/millisecond = 2200008, picosec/cycle = 454
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: init_ctx_to_cookie
rte_bpf_elf_load(fname="./lua/bpf/granted.bpf", sname="init") successfully creates 0x7f7044475000(jit={.func=0x7f7044474000,.sz=316});
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_cookie
resolve_xsym(15): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_pkt
resolve_xsym(33): EBPF_PSEUDO_CALL to external function: gk_bpf_prep_for_tx
rte_bpf_elf_load(fname="./lua/bpf/granted.bpf", sname="pkt") successfully creates 0x7f7044473000(jit={.func=0x7f7044472000,.sz=181});
rte_bpf_elf_load(fname="./lua/bpf/declined.bpf", sname="init") successfully creates 0x7f7044471000(jit={.func=0x7f7044470000,.sz=4});
rte_bpf_elf_load(fname="./lua/bpf/declined.bpf", sname="pkt") successfully creates 0x7f704446f000(jit={.func=0x7f704446e000,.sz=8});
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: init_ctx_to_cookie
rte_bpf_elf_load(fname="./lua/bpf/grantedv2.bpf", sname="init") successfully creates 0x7f704446d000(jit={.func=0x7f704446c000,.sz=386});
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_cookie
resolve_xsym(4): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_pkt
resolve_xsym(60): EBPF_PSEUDO_CALL to external function: gk_bpf_prep_for_tx
rte_bpf_elf_load(fname="./lua/bpf/grantedv2.bpf", sname="pkt") successfully creates 0x7f704446b000(jit={.func=0x7f704446a000,.sz=350});
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: init_ctx_to_cookie
rte_bpf_elf_load(fname="./lua/bpf/web.bpf", sname="init") successfully creates 0x7f7044469000(jit={.func=0x7f7044468000,.sz=386});
resolve_xsym(1): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_cookie
resolve_xsym(4): EBPF_PSEUDO_CALL to external function: pkt_ctx_to_pkt
resolve_xsym(148): EBPF_PSEUDO_CALL to external function: gk_bpf_prep_for_tx
rte_bpf_elf_load(fname="./lua/bpf/web.bpf", sname="pkt") successfully creates 0x7f7044467000(jit={.func=0x7f7044466000,.sz=960});
GATEKEEPER: net: port 0 (0000:1c:00.0) on the front interface only supports RSS hash functions 0x38d34, but Gatekeeper asks for 0x8c
GATEKEEPER: net: port 1 (0000:1c:00.1) on the back interface only supports RSS hash functions 0x38d34, but Gatekeeper asks for 0x8c
GATEKEEPER: lls: calculate_mempool_config_para: total_pkt_burst = 2144 packets, total_rx_desc = 640 descriptors, total_tx_desc = 256 descriptors, max_num_pkt = 3040 packets, num_mbuf = 4095 packets.
HASH: rte_hash_create has invalid parameters
GATEKEEPER GK: Cannot create hash table for neighbor FIB
GATEKEEPER GK: Failed to setup the FIB entry for the front network prefixes at init_fib_tbl
GATEKEEPER GK: Failed to initialize the FIB table at setup_gk_lpm
Device with port_id=1 already stopped
Device with port_id=0 already stopped

It's not a huge problem for us since we can use a /29 or a less specific network, but I guess it may be a bug you'd want to know about.

@AltraMayor
Owner

Hi @dahnevskiy,

We are going to investigate the /30 bug, but would you mind describing how you solved the configuration issue? Understanding it may help us to conceive ways to make the configuration more robust.

@dahnevskiy
Author

I am continuing my tests, and I am trying to generate a 1 Mpps SYN flood attack toward destination 10.254.71.130.

On my Grantor server I see these logs:

GATEKEEPER LLS: 10.255.0.65: 3c:8a:b0:81:f1:fa (port 0) (0 holds)
GATEKEEPER: lpm: IPv4 lookup miss
GATEKEEPER GT: gt_neigh_get_ether_cache: receiving an IPv4 packet with destination IP address 10.254.71.130, which is not on the same subnet as the GT server
GATEKEEPER: lpm: IPv4 lookup miss
GATEKEEPER GT: gt_neigh_get_ether_cache: receiving an IPv4 packet with destination IP address 10.254.71.130, which is not on the same subnet as the GT server
GATEKEEPER LLS: LLS cache (arp)

I guess this is the problem, because I can't see any traffic on 10.254.71.130.
I guess it's some kind of misconfiguration on my side; what can I check?

@dahnevskiy
Author

Hi @dahnevskiy,

We are going to investigate the /30 bug, but would you mind describing how you solved the configuration issue? Understanding it may help us to conceive ways to make the configuration more robust.

It was just a problem with the links from the Gatekeeper server to the ASR9k; nothing to do with the Gatekeeper software.

@AltraMayor
Owner

The log message GATEKEEPER GT: gt_neigh_get_ether_cache: receiving an IPv4 packet with destination IP address 10.254.71.130, which is not on the same subnet as the GT server reflects a design/implementation decision derived from our first, ongoing deployment: we assumed that Grantor servers are in the same subnet as the protected machines. This assumption holds in our first deployment and simplifies the code of Grantor servers, since they don't need to have a routing table like Gatekeeper servers do. This assumption does not hold in your deployment environment, since your Grantor server has the IP address 10.255.0.66 and the protected host is 10.254.71.130.

This is not a fundamental limitation, and we can think about how to support it. Would it be possible for us to schedule a meeting, so our team can understand your deployment environment?

@cjdoucette
Collaborator

We are going to investigate the /30 bug

Looks like the DPDK cuckoo hash table must have a length of at least 8, and with a /30 network the neighbor FIB hash table would be sized below that, which makes rte_hash_create() fail.

@dahnevskiy
Author

dahnevskiy commented Jul 22, 2020

The log message GATEKEEPER GT: gt_neigh_get_ether_cache: receiving an IPv4 packet with destination IP address 10.254.71.130, which is not on the same subnet as the GT server reflects a design/implementation decision derived from our first, ongoing deployment: we assumed that Grantor servers are in the same subnet as the protected machines. This assumption holds in our first deployment and simplifies the code of Grantor servers, since they don't need to have a routing table like Gatekeeper servers do. This assumption does not hold in your deployment environment, since your Grantor server has the IP address 10.255.0.66 and the protected host is 10.254.71.130.

This is not a fundamental limitation, and we can think about how to support it. Would it be possible for us to schedule a meeting, so our team can understand your deployment environment?

Here is our deployment environment:
(diagram: gt_setup)

We have a lot of hosting switches that terminate many small client subnets (in my example 8.8.8.0/29, 8.8.8.8/29, etc.), so we can't use the current design of Gatekeeper, because in the current design we would need to deploy a Grantor server on each small client subnet and use part of the client subnet for Grantor server addressing.

Instead, we want to deploy Grantor machines in our aggregation layer, where they can protect multiple client subnets on different switches. But in this scheme the Grantor servers will not be in the same subnet as the protected machines.

If you need more information, we can schedule a meeting, why not. Or I can answer here :)

@AltraMayor
Owner

AltraMayor commented Jul 22, 2020

Thank you for the helpful diagram.

Based on your diagram, there's only one gateway for all traffic that is not local to a Grantor server. Is this assumption correct? If so, Grantor servers can support a single gateway for non-local traffic and still avoid having an LPM lookup. Would this solution work for you?

@dahnevskiy
Author

If I understand you correctly, this assumption is correct.
If you mean that the Grantor server can work with a default gateway for all outgoing traffic, that's a good solution for us.

cjdoucette added a commit to cjdoucette/gatekeeper that referenced this issue Jul 23, 2020
The DPDK hash table library requires that hash tables be of at
least size 8, as described in AltraMayor#417.
mengxiang0811 added a commit to mengxiang0811/gatekeeper that referenced this issue Jul 26, 2020
This patch allows Grantor servers to support a single gateway
for non-local traffic and still avoid having an LPM lookup.

It also demonstrates the network configurations
for a deployment scenario, as described in AltraMayor#267 and AltraMayor#417.
mengxiang0811 added a commit to mengxiang0811/gatekeeper that referenced this issue Jul 28, 2020
This patch replaces the requirement that Grantor servers had to be deployed in the same subnet as the protected destination with the requirement that either Grantor servers are deployed in the same subnet, or the last hop on the path from a Gatekeeper server to a Grantor server is a router that can forward the encapsulated packets to their destinations.

This new requirement supports the deployment environment discussed in
issue AltraMayor#417.
@AltraMayor
Owner

Hi @dahnevskiy,

Both issues have been fixed: the 30-bit prefix length and relying on the router to forward non-local traffic. Neither of these improvements changed the configuration files, but you'll have to compile the source to obtain new binaries.

@dahnevskiy
Author

Thanks!
I will test it next week.

@dahnevskiy
Author

I am continuing my tests.
I compiled the sources and launched Gatekeeper, and again I can't inject a prefix into the FIB on the Gatekeeper server.

my environment:

net_config:

        local front_ports = {"enp28s0f0"}
        local front_ips  = {"10.255.0.226/29"}
        local front_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local front_vlan_tag = 0x123
        local front_vlan_insert = false
        local front_mtu = 1500

        local back_ports = {"enp28s0f1"}
        local back_ips  = {"10.255.0.18/29"}
        local back_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local back_vlan_tag = 0x456
        local back_vlan_insert = false
        local back_mtu = 2048

my lua script:

require "gatekeeper/staticlib"

local dy_conf = staticlib.c.get_dy_conf()
if dy_conf == nil then
	error("Failed to allocate dy_conf")
end

local ret = dylib.c.add_fib_entry("10.254.71.130/32", "10.255.0.66", "10.255.0.17", dylib.c.GK_FWD_GRANTOR, dy_conf.gk)
if ret < 0 then
        return "gk: failed to add an FIB entry\n"
end

In my logs I can see ARP entries from both the front and back interfaces of Gatekeeper:

GATEKEEPER LLS: 10.255.0.17: d4:6d:50:55:78:d5 (port 1) (0 holds)
GATEKEEPER LLS: 10.255.0.225: d4:6d:50:57:64:f1 (port 0) (0 holds)

but when I apply my Lua script, Gatekeeper says:

EAL:   probe driver: 8086:158b net_i40e
EAL: PCI device 0000:af:00.1 on NUMA socket 1
EAL:   probe driver: 8086:158b net_i40e
EAL: See files in . for further log
PANIC: unprotected error in call to Lua API (bad argument #3 to '?' (string expected, got nil))

and in logs:

GATEKEEPER: lpm: IPv4 lookup miss
GATEKEEPER: lpm: IPv4 lookup miss

I can't understand why, because now I have ARP entries for both the front and back interfaces.

mengxiang0811 added a commit to mengxiang0811/gatekeeper that referenced this issue Aug 31, 2020
The checksum bug discussed in AltraMayor#417 (comment) is because we had assumed that two's complement subtraction, which is currently being used, is the same as "subtracting complements with borrow" under one's complement as required in RFC1624. While these two operations often produce the same result, they are not always equal. In addition, two's complement subtraction is not endianness preserving!

To solve this issue, we followed the example implementation of RFC1624 [Eqn. 3] in the Linux kernel:
https://elixir.bootlin.com/linux/latest/source/net/ipv4/netfilter/ipt_ECN.c#L38
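For readers following along, here is a minimal sketch of the RFC 1624 [Eqn. 3] update, HC' = ~(~HC + ~m + m'), written against LuaJIT's bit library. It only illustrates the one's-complement arithmetic; it is not Gatekeeper's actual C fix, and the only assumption is LuaJIT's standard bit module.

local bit = require("bit")  -- LuaJIT bit operations

-- Fold a sum into 16 bits with end-around carry (one's-complement addition).
local function csum_fold(sum)
	while bit.rshift(sum, 16) ~= 0 do
		sum = bit.band(sum, 0xffff) + bit.rshift(sum, 16)
	end
	return sum
end

-- RFC 1624 [Eqn. 3]: HC' = ~(~HC + ~m + m'), where HC is the old checksum,
-- m the old 16-bit field value, and m' the new value.
local function csum_update(old_csum, old_field, new_field)
	local sum = bit.band(bit.bnot(old_csum), 0xffff)
		+ bit.band(bit.bnot(old_field), 0xffff)
		+ bit.band(new_field, 0xffff)
	return bit.band(bit.bnot(csum_fold(sum)), 0xffff)
end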
@AltraMayor
Owner

Hi @dahnevskiy,

We have merged pull request #425, which we believe fixes the checksum bug you describe above. However, we currently don't have a proper way to test the code. So, even if it looks fine, we'd appreciate it if you report back.

We have also fixed the issue related to the error message PANIC: unprotected error in call to Lua API (bad argument #3 to '?' (string expected, got nil)) that you've reported earlier.

We are going to take a look at the log entries from gk_del_flow_entry_from_hash().

@dahnevskiy
Author

dahnevskiy commented Sep 2, 2020

It works, thanks!

I guess I found a problem with the KNI interfaces:

If an IPv6 address is NOT defined in net.lua, for example:

        local front_ports = {"enp28s0f0"}
        local front_ips  = {"10.255.0.226/29"}
        local front_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local front_vlan_tag = 0x123
        local front_vlan_insert = false
        local front_mtu = 1500

        local back_ports = {"enp28s0f1"}
        local back_ips  = {"10.255.0.18/29"}
        local back_bonding_mode = staticlib.c.BONDING_MODE_ROUND_ROBIN
        local back_vlan_tag = 0x456
        local back_vlan_insert = false
        local back_mtu = 2048

then after receiving an IPv6 neighbor discovery packet from the ASR9k, the logs show:

GATEKEEPER CPS: KNI for front iface received ND packet, but the interface is not configured for ND

and after that the IPv4 address disappears from the kni_front interface within 15-30 seconds.

In ifconfig it looks like this.

Right after Gatekeeper starts, it's fine:

kni_back: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2048
        inet 10.255.0.18  netmask 255.255.255.248  broadcast 0.0.0.0
        ether 90:e2:ba:2b:4c:b5  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

kni_front: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.255.0.226  netmask 255.255.255.248  broadcast 0.0.0.0
        inet6 fe80::6ba1:9caf:ea9b:a77a  prefixlen 64  scopeid 0x20<link>
        ether 90:e2:ba:2b:4c:b4  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 1184 (1.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

but after 30 seconds:

kni_back: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2048
        inet 10.255.0.18  netmask 255.255.255.248  broadcast 0.0.0.0
        ether 90:e2:ba:2b:4c:b5  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

kni_front: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6ba1:9caf:ea9b:a77a  prefixlen 64  scopeid 0x20<link>
        ether 90:e2:ba:2b:4c:b4  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38  bytes 6660 (6.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The IPv4 address disappeared from the kni_front interface.
Workaround: define an IPv6 address in net.lua (see the sketch below).
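For reference, a sketch of that workaround in net.lua (the IPv6 address below is a placeholder; the assumption is that front_ips, like back_ips, accepts an IPv6 address alongside the IPv4 one):

	-- Placeholder addresses; the point is listing an IPv6 address
	-- next to the IPv4 one on the front interface.
	local front_ips = {"10.255.0.226/29", "fd00:1::2/64"}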

@cjdoucette
Collaborator

The error message GATEKEEPER CPS: KNI for front iface received ND packet, but the interface is not configured for ND only appears if a neighbor discovery packet is received on the KNI, i.e. coming from the host through the KNI and intended to be transmitted on the DPDK interface.

I don't see why this is causing the IPv4 address to be lost on the KNI, but I will look into it. However, I'm curious why there is an IPv6 address on the front KNI if no IPv6 address was configured on Gatekeeper's front interface. The addresses on the KNI should mirror the addresses from its Gatekeeper interface counterpart. Did you separately add an IPv6 address to the KNI?

@cjdoucette
Collaborator

Actually, on testing this, I see that even when IPv6 is not configured on Gatekeeper, the KNI is automatically assigned a link-local IPv6 address from Linux. This should be fine, and it is expected that you'll see the GATEKEEPER CPS: KNI for front iface received ND packet, but the interface is not configured for ND message.

What's not expected, as you pointed out, is that the kni_front IP address would disappear, and also that the kni_back is not assigned a link-local IPv6 address. I'm not sure why, as this didn't happen in my tests.

Are you doing any sort of operations on the KNI devices from Linux, i.e. ethtool, ip link, ip addr, etc?

@dahnevskiy
Author

I am not doing any operations on the KNI devices from Linux, but I will double-check and report back.

@dahnevskiy
Author

Oh, the problem was on my side; it was the NetworkManager service on CentOS:

Sep 03 13:47:15 srv351531 NetworkManager[1731]: <warn>  [1599130035.8860] dhcp4 (kni_front): request timed out
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <info>  [1599130035.8861] dhcp4 (kni_front): state changed unknown -> timeout
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <info>  [1599130035.9182] dhcp4 (kni_front): canceled DHCP transaction, DHCP client pid 6769
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <info>  [1599130035.9182] dhcp4 (kni_front): state changed timeout -> done
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <info>  [1599130035.9184] device (kni_front): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'manage
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <warn>  [1599130035.9189] device (kni_front): Activation: failed for connection 'Wired connection 1'
Sep 03 13:47:15 srv351531 NetworkManager[1731]: <info>  [1599130035.9191] device (kni_front): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')

So I disabled this service, because we don't use it, and everything works fine. Sorry for wasting your time :(

@AltraMayor
Owner

AltraMayor commented Sep 4, 2020

Hi @dahnevskiy,

We haven't been able to reproduce the repeated log entries from gk_del_flow_entry_from_hash(). But in a careful review, we might have found the cause. Could you test the code in the repository to check if this issue has gone away? Notice that we've also changed the log entry to add more information, so, in case the issue is still there, we'll have more clues.

cjdoucette pushed a commit to cjdoucette/gatekeeper that referenced this issue Sep 8, 2020
The checksum bug discussed in AltraMayor#417 (comment) is because we had assumed that two's complement subtraction, which is currently being used, is the same as "subtracting complements with borrow" under one's complement as required in RFC1624. While these two operations often produce the same result, they are not always equal. In addition, two's complement subtraction is not endianness preserving!

To solve this issue, we followed the example implementation of RFC1624 [Eqn. 3] in the Linux kernel:
https://elixir.bootlin.com/linux/latest/source/net/ipv4/netfilter/ipt_ECN.c#L38
@dahnevskiy
Author

dahnevskiy commented Sep 11, 2020

I apologize for such a long response. I am now continuing my tests; I recompiled Gatekeeper yesterday using the current master branch.

I am using a 3 Mpps SYN flood attack with spoofed SRC IP addresses from the subnet 10.161.0.0/24, so as far as I understand, Gatekeeper should have created 255 flows. Gatekeeper created them, but in the logs:

GATEKEEPER GK: The GK block basic measurements at lcore = 2: [tot_pkts_num = 8759291, tot_pkts_size = 525557460, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 5585740, pkts_size_request =  446859200, pkts_num_declined = 3173551, pkts_size_declined =  190413060, tot_pkts_num_dropped = 3173551, tot_pkts_size_dropped =  190413060, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 3: [tot_pkts_num = 8768504, tot_pkts_size = 526110240, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 5594673, pkts_size_request =  447573840, pkts_num_declined = 3173831, pkts_size_declined =  190429860, tot_pkts_num_dropped = 3173831, tot_pkts_size_dropped =  190429860, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 4: [tot_pkts_num = 8757475, tot_pkts_size = 525448500, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 5584012, pkts_size_request =  446720960, pkts_num_declined = 3173463, pkts_size_declined =  190407780, tot_pkts_num_dropped = 3173463, tot_pkts_size_dropped =  190407780, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 5: [tot_pkts_num = 8759121, tot_pkts_size = 525547260, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 5585085, pkts_size_request =  446806800, pkts_num_declined = 3174036, pkts_size_declined =  190442160, tot_pkts_num_dropped = 3174036, tot_pkts_size_dropped =  190442160, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 6: [tot_pkts_num = 3173789, tot_pkts_size = 190427340, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 3173789, pkts_size_declined =  190427340, tot_pkts_num_dropped = 3173789, tot_pkts_size_dropped =  190427340, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 7: [tot_pkts_num = 3173432, tot_pkts_size = 190405920, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 3173432, pkts_size_declined =  190405920, tot_pkts_num_dropped = 3173432, tot_pkts_size_dropped =  190405920, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 8: [tot_pkts_num = 3173253, tot_pkts_size = 190395180, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 3173253, pkts_size_declined =  190395180, tot_pkts_num_dropped = 3173253, tot_pkts_size_dropped =  190395180, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 9: [tot_pkts_num = 3281321, tot_pkts_size = 196879260, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 3281321, pkts_size_declined =  196879260, tot_pkts_num_dropped = 3281321, tot_pkts_size_dropped =  196879260, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 10: [tot_pkts_num = 2813168, tot_pkts_size = 168790080, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2813168, pkts_size_declined =  168790080, tot_pkts_num_dropped = 2813168, tot_pkts_size_dropped =  168790080, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 11: [tot_pkts_num = 2812858, tot_pkts_size = 168771480, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2812858, pkts_size_declined =  168771480, tot_pkts_num_dropped = 2812858, tot_pkts_size_dropped =  168771480, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 30: [tot_pkts_num = 2811816, tot_pkts_size = 168708960, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2811816, pkts_size_declined =  168708960, tot_pkts_num_dropped = 2811816, tot_pkts_size_dropped =  168708960, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 31: [tot_pkts_num = 2812417, tot_pkts_size = 168745020, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2812417, pkts_size_declined =  168745020, tot_pkts_num_dropped = 2812417, tot_pkts_size_dropped =  168745020, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 32: [tot_pkts_num = 2813135, tot_pkts_size = 168788100, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2813135, pkts_size_declined =  168788100, tot_pkts_num_dropped = 2813135, tot_pkts_size_dropped =  168788100, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 33: [tot_pkts_num = 2811204, tot_pkts_size = 168672240, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2811204, pkts_size_declined =  168672240, tot_pkts_num_dropped = 2811204, tot_pkts_size_dropped =  168672240, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 34: [tot_pkts_num = 2812989, tot_pkts_size = 168779340, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 2812989, pkts_size_declined =  168779340, tot_pkts_num_dropped = 2812989, tot_pkts_size_dropped =  168779340, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 35: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 36: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 37: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 38: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 39: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]

On lcore 2, lcore 3, lcore 4, and lcore 5 there is a huge amount of pkts_num_request, but I generate only 255 flows...
I checked that the flows are created:

GATEKEEPER: flow: gk: log the flow state [state: GK_BPF (3), flow hash value: 3852936895, expire_at: 0x42d260d516fb6e, program_index=3, cookie=3c817eff13d24200, 0004000074230f00, f423ffff00000000, 9c09666ccc0b5900, 55b56c5464890100, 0000000000000000, 0000000000000000, 9500000000000000, grantor_ip: 10.255.0.66] in the flow table at print_flow_state with lcore 5 for the flow with IP source address 10.161.0.50, and destination address 10.254.71.130

but it seems that the GK instance running on this lcore just ignores this flow entry and sends GK_REQUEST packets to the Grantor.

Also, in the logs, while my SYN flood test is running, I get multiple messages:

GATEKEEPER: 234091 log entries were suppressed at lcore 0 during the last ratelimit interval
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER: 234496 log entries were suppressed at lcore 0 during the last ratelimit interval
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800

Perhaps these messages are somehow related to this problem...

Also, unfortunately, the problem with gk_del_flow_entry_from_hash() still exists:

GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.105, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.41, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.106, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.194, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.130, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.193, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.115, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.216, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.0.239, and destination address 10.254.71.130
GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory

@dahnevskiy
Author

dahnevskiy commented Sep 11, 2020

I generated a SYN flood with only 5 spoofed SRC addresses in the range 10.161.0.1 - 10.161.0.5, and it works well!

GATEKEEPER GK: The GK block basic measurements at lcore = 2: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 3: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 4: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 5: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 6: [tot_pkts_num = 9999293, tot_pkts_size = 599957580, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 9999293, pkts_size_declined =  599957580, tot_pkts_num_dropped = 9999293, tot_pkts_size_dropped =  599957580, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 7: [tot_pkts_num = 9996390, tot_pkts_size = 599783400, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 9996390, pkts_size_declined =  599783400, tot_pkts_num_dropped = 9996390, tot_pkts_size_dropped =  599783400, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 8: [tot_pkts_num = 9997255, tot_pkts_size = 599835300, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 9997255, pkts_size_declined =  599835300, tot_pkts_num_dropped = 9997255, tot_pkts_size_dropped =  599835300, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 9: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 10: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 11: [tot_pkts_num = 10002396, tot_pkts_size = 600143760, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 10002396, pkts_size_declined =  600143760, tot_pkts_num_dropped = 10002396, tot_pkts_size_dropped =  600143760, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 30: [tot_pkts_num = 10004547, tot_pkts_size = 600272820, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 10004547, pkts_size_declined =  600272820, tot_pkts_num_dropped = 10004547, tot_pkts_size_dropped =  600272820, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 31: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 32: [tot_pkts_num = 10000425, tot_pkts_size = 600025500, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 10000425, pkts_size_declined =  600025500, tot_pkts_num_dropped = 10000425, tot_pkts_size_dropped =  600025500, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 33: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 34: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 35: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 36: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 37: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 38: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]
GATEKEEPER GK: The GK block basic measurements at lcore = 39: [tot_pkts_num = 0, tot_pkts_size = 0, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 0, pkts_size_request =  0, pkts_num_declined = 0, pkts_size_declined =  0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped =  0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed =  0]

There is no pkts_num_request, as expected.

Also, I don't get the log entries
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
and, after the flows expired, I don't get the log entries
gk_del_flow_entry_from_hash: No such file or directory

@dahnevskiy
Author

But if I generate a SYN flood with 15 SRC addresses (10.161.0.1 - 10.161.0.15), I have all of the problems described above.

@dahnevskiy
Author

It seems that I can reproduce this problem just by doing a SYN flood from 1 IP.
I started with SRC 10.161.0.1 - it works fine.
After that I stopped the SYN flood from SRC 10.161.0.1 and started one from SRC 10.161.0.2 - it works badly.

the flow was installed on lcore 5, but it was ignored:

GATEKEEPER GK: The GK block basic measurements at lcore = 5: [tot_pkts_num = 35841216, tot_pkts_size = 2150472960, pkts_num_granted = 0, pkts_size_granted = 0, pkts_num_request = 35841216, pkts_size_request = 2867297280, pkts_num_declined = 0, pkts_size_declined = 0, tot_pkts_num_dropped = 0, tot_pkts_size_dropped = 0, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0]

All packets were encapsulated and sent to the Grantor as pkts_num_request.

After that, while the SYN flood was running, I restarted Gatekeeper, and after the restart it works fine:

GATEKEEPER GK: The GK block basic measurements at lcore = 33: [tot_pkts_num = 49843599, tot_pkts_size = 2990615940, pkts_num_granted = 371, pkts_size_granted = 29680, pkts_num_request = 2576, pkts_size_request = 206080, pkts_num_declined = 49840652, pkts_size_declined = 2990439120, tot_pkts_num_dropped = 49840652, tot_pkts_size_dropped = 2990439120, tot_pkts_num_distributed = 0, tot_pkts_size_distributed = 0]

@dahnevskiy
Author

Perhaps I found the possible problem.

This is a good flow, working as expected:

GATEKEEPER: flow: gk: log the flow state [state: GK_BPF (3), flow hash value: 1777052268, expire_at: 0x42e4829d95ce40, program_index=0, cookie=ce71b511eee34200, 0004000000000000, 1000000000000000, be1fbfe163e44200, 0854648901000000, 0000000000000000, 0000000000000000, 0000000000000000, grantor_ip: 10.255.0.66] in the flow table at print_flow_state with lcore 16 for the flow with IP source address 178.22.89.97, and destination address 94.100.186.227

And this is a bad flow; Gatekeeper doesn't use this flow and sends all traffic to the Grantor as GK_REQUEST:

GATEKEEPER: flow: gk: log the flow state [state: GK_BPF (3), flow hash value: 0, expire_at: 0x42e47de4d127e4, program_index=0, cookie=e6a5c1bce4e34200, 0004000000000000, 0000100000000000, fa76fa285fe44200, 0854648901000000, 0000000000000000, 0000000000000000, 0000000000000000, grantor_ip: 10.255.0.66] in the flow table at print_flow_state with lcore 37 for the flow with IP source address 178.22.89.96, and destination address 94.100.186.227

I repeated this test multiple times, and the bad flows always have flow hash value = 0.
So maybe there is a problem with the flow hash calculation.

@AltraMayor
Owner

Hi @dahnevskiy,

Having a wrong flow hash would explain most of what's going on. But the problem can be subtle because this value is supposed to be computed by the NIC, so it shouldn't be absent. Only when the information comes from the GGU block is the flow hash computed in software.

Based on the information you posted, other things could be going on at the same time. For example, lua/main_config.lua allocates 2 GK blocks per NUMA node, but your log suggests 2 NUMA nodes and 10 GK blocks per NUMA node. Did you make this change yourself, or is it unexpected behavior?

Just to confirm, the SYN flood is the only traffic going toward the Gatekeeper servers, isn't it?

Would it be possible to share the whole log file?

@AltraMayor
Owner

Could you dump a couple of flows that show up in the log entries of gk_del_flow_entry_from_hash()? I'm guessing that they all have an absent flow hash. I'm just trying to connect the dots to see if I'm on the right path to solve this issue.

@dahnevskiy
Author

I allocated 20 CPUs to GK processes in lua/main_config.lua. That is intentional, but it is not working correctly.
If n_gk_lcores < 16, all works as expected.
If n_gk_lcores >= 16, I get the wrong flow hash and other problems.

The SYN flood is the only traffic going toward the Gatekeeper servers; that's correct.

I dumped a flow from the log entries of gk_del_flow_entry_from_hash():

GATEKEEPER: flow: The GK block failed to delete a key from hash table at gk_del_flow_entry_from_hash: No such file or directory
 for the flow with IP source address 10.161.111.97, and destination address 10.254.71.130

GATEKEEPER: flow: gk: log the flow state [state: GK_BPF (3), flow hash value: 0, expire_at: 0x42f56a2fd3ec06, program_index=3, cookie=5b678caaebf44200, 0004000044f80f00, c4f8ffff00000000, 82441fc7d52e5900, 55b56c5464890100, 0000000000000000, 0000000000000000, 3400000000000000, grantor_ip: 10.255.0.66] in the flow table at print_flow_state with lcore 35 for the flow with IP source address 10.161.111.97, and destination address 10.254.71.130

So yes, they all have an absent flow hash.

@AltraMayor
Owner

Is the front interface of the Gatekeeper server in your deployment going to be 10Gbps? In our tests with 10Gbps front interfaces, more than two GK blocks per NUMA node were only needed when the flow tables of the GK blocks were too large and faster scanning for expired entries was needed. What value are you assigning to the variable flow_ht_size in lua/gk.lua? Although the optimal number of GK blocks might be different on different servers, one likely doesn't need more than 4 GK blocks per NUMA node.

If you are using faster NICs as front interfaces, more GK blocks are needed to handle the extra packets. But we don't have guidance for this setup at this point; it's one of our future milestones.

We are going to continue the investigation to figure out what's going wrong when n_gk_lcores >= 16.
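For context, a sketch of the two configuration knobs discussed above, using the names as they appear in this thread (their exact placement and defaults in the config files may differ; the values are illustrative only):

	-- lua/main_config.lua: number of lcores given to GK blocks; in the
	-- 10Gbps tests mentioned above, 2 per NUMA node was usually enough.
	local n_gk_lcores = 4

	-- lua/gk.lua: per-GK-block flow table size; very large tables may need
	-- extra GK blocks just to scan for expired entries quickly enough.
	local flow_ht_size = 1024 * 1024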

@AltraMayor
Owner

Hi @dahnevskiy,

I just issued a patch that solves some of the problems going on, if not all. Could you test it?

There's a possibility that more than one problem is going on, but with one less problem, it should be easier to diagnose what is left.

@dahnevskiy
Author

With n_gk_lcores >= 16:

I used a SYN flood attack from source 10.161.0.4.

And it seems that the problem flows are now not created at all.

In the logs I get many of these messages:

GATEKEEPER: 2752654 log entries were suppressed at lcore 0 during the last ratelimit interval
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800
GATEKEEPER LLS: front interface should not be seeing a packet with EtherType 0x0800

But the flow doesn't exist:

GATEKEEPER: flow: gk: failed to log flow state at log_flow_state with lcore 41 - flow doesn't exist
 for the flow with IP source address 10.161.0.4, and destination address 10.254.71.130

There is no gk_del_flow_entry_from_hash error, but I guess that's because, instead of creating flows with an absent flow hash, this version doesn't create some of the flows at all.

With n_gk_lcores < 16, all works as expected.

@AltraMayor
Owner

AltraMayor commented Sep 11, 2020

As far as I can tell, this last issue is related to how the front NIC is responding to the setup of the RSS function. My hypothesis is that there is something wrong with the initialization of the RSS function that is throwing packets into queue 0, which belongs to the LLS block. This explains why the LLS block is receiving lots of packets that are not destined to it and the flow is not being created in the GK block (i.e. the GK block running on lcore 41 never receives its packets). Would it be possible for us to have an SSH connection to the test Gatekeeper server? We don't need to run the experiment, we only need to run gatekeeper with gdb and to inspect the hardware initialization to make sure that the RSS initialization is correct.

@AltraMayor AltraMayor changed the title from "add_fib_entry not working" to "Beta testing Gatekeeper" on Sep 11, 2020
@dahnevskiy
Author

Unfortunately, there is no possibility of an SSH connection, but we have other options:

  1. If you write here what we need to do with gdb, we can do it and post the results.
  2. We can use zoom.us next week; I can share my screen and give you control, and you can use gdb from my notebook.

@AltraMayor
Owner

AltraMayor commented Sep 17, 2020

Let's take both options. I describe below a couple of things that you can do to gather some general information, and we should schedule an online meeting next week to probe the issue further if needed. I'm going to find some time slots next week for the online meeting. I need to talk to other project members to see if someone can join us.

While optional, it's better to replace -O3 with -O0 in the variable EXTRA_CFLAGS in the Makefile and compile Gatekeeper again, so it's easier to follow along in the debugger.

  1. Call gdb build/gatekeeper;
  2. Set a breakpoint with b gatekeeper_setup_rss; function gatekeeper_setup_rss() is in file lib/net.c;
  3. Set a second breakpoint with b lls_proc; function lls_proc() is in file lls/main.c;
  4. Run gatekeeper with run. Feel free to add any parameters you normally pass to Gatekeeper after the command run;
  5. gdb will eventually stop at gatekeeper_setup_rss(). We need as much information here as possible.

Print the function parameters with p port_id, p num_queues, and p *queues@num_queues. Step through the function with n and print p dev_info.reta_size after calling rte_eth_dev_info_get(), and print p reta_conf before and after calling rte_eth_dev_rss_reta_update() as well as after calling rte_eth_dev_rss_reta_query().

gatekeeper_setup_rss() is going to be called twice, once for the front and another time for the back interface. Although the problem we are investigating is for the front interface, print the content for the back interface as well. The back interface might have some clue to the problem.

Once done with each call to gatekeeper_setup_rss(), issue the command continue to continue the execution.

  6. gdb will eventually stop at lls_proc(). Just print p lls_conf once this variable is assigned.

@AltraMayor
Owner

Our group can join the debugging online meeting on either 24th or 25th at 9am EST. Which date works for you?

@dahnevskiy
Author

Hello.
Sorry for the long response, but for the last 2 weeks we have been doing our final tests of Gatekeeper, and I'm glad to say that all our tests passed and we will deploy a 1 Tbps DDoS solution.

Unfortunately, for now I don't have time for gdb debugging and troubleshooting, but I will return to this issue in 2 weeks, so please don't close this issue, and we will continue our work.

And we really want support for i40e network cards, but as far as I can see, this will require an upgrade of the DPDK version...

@AltraMayor
Owner

That's such great news! We look forward to learning about your deployment.

We'll keep this issue open until we nail the issue of the NIC initialization.

Could you explain the issue you have found with i40e NICs? The version of DPDK that Gatekeeper is currently using already includes the i40e driver.

@AltraMayor
Owner

It's been more than two months now without an update, and this issue is already quite big, so I'm closing it. If the NIC initialization issue is still happening, please open a new issue just for it.

Thank you for all the help testing Gatekeeper.
