NETOBSERV-613: decrease premature eviction of eBPF hashmap #61
Conversation
New image: ["quay.io/netobserv/netobserv-ebpf-agent:f7e3e07"]. It will expire after two weeks.
*/
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_ANY);
if (ret != 0) {
    // usually error -16 (-EBUSY) or -7 (-E2BIG) is printed here.
This is an interesting observation.
}
atomic.StoreInt32(&m.ringBuf.forwardedFlows, 0)
atomic.StoreInt32(&m.ringBuf.isForwarding, 0)
atomic.StoreInt32(&m.ringBuf.mapFullErrs, 0)
Can these metrics be made accessible to the operator, apart from the logs?
This might be useful to indicate that the map size is too small to handle the volume of traffic, if it's getting full often.
+1, maybe for a later task / PR ?
@@ -184,7 +185,15 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) {
         aggregate_flow->start_mono_time_ts = current_time;
     }

-    bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_EXIST);
+    long ret = bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_ANY);
what's the purpose of changing BPF_EXIST to BPF_ANY ?
I realized that the per-CPU map implementation sometimes complains about the assumed existence (or absence) of flows, and we might lose packets.
E.g. it can't assume that an entry does not exist, because another thread could have inserted the bucket into the map (which is shared across all CPUs despite the "PerCPU" prefix of the map); nor can it assume that the entry exists, because userspace might be deleting it.
lgtm
On the eBPF side, we were wrongly assuming that every failed map insertion/update was caused by the map being full.
The most frequent error is actually not "map full" but "resource busy". This PR checks the error codes and only flushes the eBPF map on a "map full" error.
This change minimizes premature eviction of the eBPF flows' cache, reducing the number of spawned goroutines (which were causing problems in high-load scenarios) and, paradoxically, also reducing the number of "resource busy" errors, since map evictions from userspace require many consecutive accesses.
This PR does not completely fix the NETOBSERV-613 issue.