Fix L7NP enable logging wrong packet #6651

qiyueyao · 2024-09-04T21:26:30Z

Current logs by Suricata when enableLogging is set, logs the wrong packet of RST instead of the original TCP packet, and remains an existing bug in Suricata.

This solution modifies the design to remove enableLogging in L7, keep it in L4, and instead configure Suricata to log Packet info for all alert events.

Also add Packet test to E2E tests.

Decode the packet dtwWezuaHlOhfWpNgQAAAQgARQAAjbT0QABABm9cCgoBBAoKAQOJUgBQa2w1WZlax6yAGAH7nAIAAAEBCAorcsv8RSTwQkdFVCAvYWRtaW4vaW5kZXguaHRtbCBIVFRQLzEuMQ0KSG9zdDogMTAuMTAuMS4zDQpVc2VyLUFnZW50OiBjdXJsLzcuNzQuMA0KQWNjZXB0OiAqLyoNCg0K

-> 10.10.1.4 → 10.10.1.3 HTTP GET /admin/index.html HTTP/1.1

Fixes #6636.

antoninbas

Do you think we could validate the logged packet in our e2e test?

antoninbas · 2024-09-05T17:14:02Z

docs/antrea-l7-network-policy.md

  "dest_port": 80,
  "proto": "TCP",
-  "pkt_src": "wire/pcap",
-  "tenant_id": 2,
+  "pkt_src": "stream (flow timeout)",


If the pkt src is stream (flow timeout), you are probably looking at the wrong message in the logs. There should be a message with src wire/pcap that shows up immediately, instead of after a flow timeout (which takes 30s to a minute IIRC)

pkg/agent/controller/networkpolicy/l7engine/reconciler.go

antoninbas · 2024-09-05T17:17:35Z

pkg/agent/controller/networkpolicy/l7engine/reconciler.go

@@ -178,7 +178,7 @@ func generateTenantRulesData(policyName string, protoKeywords map[string]sets.Se
 	// Refer to Suricata detect engine in codebase for detailed tag keyword configuration.
 	var tagKeyword string
 	if enableLogging {
-		tagKeyword = " tag: session, 30, seconds;"
+		tagKeyword = " tag:host;"


I wonder why the previous version was not working, it feels like session should also have included the request packet.

BTW, should we use something like tag:host,1,packets as we only need the first packet?

I did some further investigation and found some problems:
Logging the wrong packet is a known bug/expected behavior documented https://redmine.openinfosecfoundation.org/issues/3480 and https://redmine.openinfosecfoundation.org/issues/2069. Basically the idea is that the log advanced past the first packet that triggered the event, so this will be the case for each session. If changed to host, the first alert stays the same, but the second alert would capture the initial packet as required.
Below is the complete log:

-----(trigger 1st time) alert 10.10.1.3 → 10.10.1.4 TCP 80 → 47588 [RST, ACK] -----(trigger 2nd time) 10.10.1.4 → 10.10.1.3 TCP 55344 → 80 [SYN] SACK_PERM 10.10.1.3 → 10.10.1.4 TCP 80 → 55344 [SYN, ACK] SACK_PERM 10.10.1.4 → 10.10.1.3 TCP 55344 → 80 [ACK] alert 10.10.1.4 → 10.10.1.3 HTTP GET /admin/index.html HTTP/1.1 10.10.1.3 → 10.10.1.4 TCP 80 → 55344 [RST, ACK] alert: timeout http: timeout

Given this fact, for the e2e it is possible to test it by triggering twice and look for the original packet. But I'm not sure of the necessity of enableLogging actually if this issue continues in Suricata.

If the first packet is dropped and we have to wait for the next request, then I would argue that this is not a correct implementation of the feature. It's a different request in a different TCP session (source port is different).
So it looks like it is not possible to capture the correct packet at all?

AFAIK if we use the existing Suricata log for this feature, it's not possible to capture that initial packet =(
I can remove the tagged-packet if capturing the initial packet is the only purpose 😂

Do you mean removing the feature completely?

Actually I modified the alert configuration, it was able to log the initial packet

{"timestamp":"2024-09-06T22:56:25.262709+0000","flow_id":561336788716570,"in_iface":"antrea-l7-tap0","event_type":"alert","vlan":[2],"src_ip":"10.10.1.4","src_port":46094,"dest_ip":"10.10.1.3","dest_port":80,"proto":"TCP","pkt_src":"wire/pcap","tenant_id":2,"alert":{"action":"blocked","gid":1,"signature_id":1,"rev":0,"signature":"Reject by AntreaNetworkPolicy:default/allow-privileged-url-to-admin-role","category":"","severity":3,"tenant_id":2},"app_proto":"http","direction":"to_server","flow":{"pkts_toserver":3,"pkts_toclient":1,"bytes_toserver":307,"bytes_toclient":78,"start":"2024-09-06T22:56:25.261768+0000","src_ip":"10.10.1.4","dest_ip":"10.10.1.3","src_port":46094,"dest_port":80},"packet":"dtwWezuaHlOhfWpNgQAAAggARQAAjd+4QABABkSYCgoBBAoKAQO0DgBQH6jA1rH7p7qAGAH7zPsAAAEBCAou2X6GSIuizUdFVCAvYWRtaW4vaW5kZXguaHRtbCBIVFRQLzEuMQ0KSG9zdDogMTAuMTAuMS4zDQpVc2VyLUFnZW50OiBjdXJsLzcuNzQuMA0KQWNjZXB0OiAqLyoNCg0K","packet_info":{"linktype":1}}

But it applies to all rules because it's in the Suricata configuration, so cannot be controlled by enableLogging. Does it sound good to remove enableLogging which relates to tagged-packet, but add the packet support in alert that applies to all rules?
Or we can configure whether we want packet to be logged additionally in alert upon start of Suricata.

Maybe it would be a good idea to discuss this at the community meeting.

BTW, if this is for the alert event type, it only applies to rules that reject traffic, right?
We don't need the packet dump for allowed traffic, as we already have the app-layer metadata.

Yes, tested that it only applies to rules rejecting traffic ( packet will only be added to alert logged for reject, only http will be logged for allow )

We should do some perf testing using ab as @tnqn suggested, just to make sure that the number of rejected requests that we can handle stays roughly the same after enabling packet logging.

qiyueyao · 2024-09-14T00:16:26Z

AB testing done:

With packet: yes for alert:

$ ab -c 10 -n 500 -r http://10.10.1.3/admin/index.html
Concurrency Level:      10
Time taken for tests:   3.151 seconds
Complete requests:      500
Failed requests:        1000
   (Connect: 0, Receive: 500, Length: 0, Exceptions: 500)
Total transferred:      0 bytes
HTML transferred:       0 bytes
Requests per second:    158.70 [#/sec] (mean)
Time per request:       63.011 [ms] (mean)
Time per request:       6.301 [ms] (mean, across all concurrent requests)
Transfer rate:          0.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.5      1       7
Processing:     4    5   1.4      5      23
Waiting:        0    0   0.0      0       0
Total:          4    6   1.6      6      26

With packet: no for alert:

$ ab -c 10 -n 500 -r http://10.10.1.3/admin/index.html
Concurrency Level:      10
Time taken for tests:   3.011 seconds
Complete requests:      500
Failed requests:        1000
   (Connect: 0, Receive: 500, Length: 0, Exceptions: 500)
Total transferred:      0 bytes
HTML transferred:       0 bytes
Requests per second:    166.08 [#/sec] (mean)
Time per request:       60.210 [ms] (mean)
Time per request:       6.021 [ms] (mean, across all concurrent requests)
Transfer rate:          0.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.6      1       8
Processing:     4    5   1.6      4      34
Waiting:        0    0   0.0      0       0
Total:          5    6   1.8      6      37

antoninbas · 2024-09-16T20:56:54Z

Thanks @qiyueyao. It seems that there is a noticeable impact, but nothing too significant (around 5%). Based on these results, I would lean towards enabling this by default, without introducing a new configuration parameter. But I'll wait until @tnqn chimes in, as he requested the benchmarking.

~~Out of curiosity, do you have a baseline (same testbed) for the same traffic but without any L7NP enforced?~~

Complete requests:      500
Failed requests:        1000

I am having trouble understanding these numbers. Shouldn't all requests fail (be rejected by Suricata)?

qiyueyao · 2024-09-16T22:58:17Z

Thanks @qiyueyao. It seems that there is a noticeable impact, but nothing too significant (around 5%). Based on these results, I would lean towards enabling this by default, without introducing a new configuration parameter. But I'll wait until @tnqn chimes in, as he requested the benchmarking.

~~Out of curiosity, do you have a baseline (same testbed) for the same traffic but without any L7NP enforced?~~
Complete requests:      500
Failed requests:        1000
I am having trouble understanding these numbers. Shouldn't all requests fail (be rejected by Suricata)?

I only sent 500 requests for both cases, and kept the socket open on receive errors. Based on online sources, Receive means count of err_recv when status is not APR_SUCCESS, Exceptions means exceptions thrown.

Without specifying -r, the socket exits with just one request, output apr_socket_recv: Connection reset by peer (104).

Without the source code of Apache ab tool, my guess is that for each rejected packet, apr_socket_recv is counted towards Receive and Complete requests, reject is counted towards Exceptions, they in total sum as 1000 Failed requests.

tnqn · 2024-09-20T14:24:06Z

Thanks @qiyueyao. It seems that there is a noticeable impact, but nothing too significant (around 5%). Based on these results, I would lean towards enabling this by default, without introducing a new configuration parameter. But I'll wait until @tnqn chimes in, as he requested the benchmarking.

5% difference seems fine, enabling it without a new configuration parameter sounds good to me. We can add a configuration when there is a real need.