
NETOBSERV-613: decrease premature eviction of eBPF hashmap #61

Merged
mariomac merged 5 commits into netobserv:main from decrease-eviction-freq on Oct 18, 2022
Conversation

@mariomac commented Oct 11, 2022

On the eBPF side, we were wrongly assuming that every failed map insertion/update was caused by the map being full.

The most frequent error is actually not "map full" but "resource busy". This PR checks the returned error code and flushes the eBPF map only when the error really is "map full".

This change reduces premature eviction of the eBPF flow cache, which in turn reduces the number of spawned goroutines (a source of problems in high-load scenarios) and, paradoxically, also reduces the number of "resource busy" errors, since evicting the map from userspace requires many consecutive accesses.

This PR does not completely fix the NETOBSERV-613 issue.
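
A minimal sketch of the check described above (not the PR's exact code): E2BIG and EBUSY are the standard errno values, request_eviction() is a hypothetical placeholder for whatever mechanism notifies the userspace agent, and aggregated_flows, id and new_flow are the names that appear in the diff excerpts below.

```c
// Sketch only: distinguish "map full" from transient errors before evicting.
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_ANY);
if (ret != 0) {
    if (ret == -E2BIG) {
        // The hashmap is genuinely full: only now ask userspace to flush it.
        request_eviction(); // hypothetical helper, not part of the PR
    }
    // -EBUSY (concurrent access to the same bucket) is transient and must not
    // trigger an eviction; the PR simply stops treating it as "map full".
}
```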

@mariomac added the ok-to-test label ("To set manually when a PR is safe to test. Triggers image build on PR.") on Oct 11, 2022
@github-actions bot commented:

New image: ["quay.io/netobserv/netobserv-ebpf-agent:f7e3e07"]. It will expire after two weeks.

@github-actions bot removed the ok-to-test label on Oct 11, 2022
*/
    long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_ANY);
    if (ret != 0) {
        // usually error -16 (-EBUSY) or -7 (E2BIG) is printed here.
Collaborator:

This is an interesting observation.

}
atomic.StoreInt32(&m.ringBuf.forwardedFlows, 0)
atomic.StoreInt32(&m.ringBuf.isForwarding, 0)
atomic.StoreInt32(&m.ringBuf.mapFullErrs, 0)
Collaborator:

Can these metrics be made accessible to the operator, apart from the logs? This might be useful to indicate that the map size is too small to handle the volume of traffic, if it's getting full often.

Member:

+1, maybe for a later task / PR?

@@ -184,7 +185,15 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) {
aggregate_flow->start_mono_time_ts = current_time;
}

-    bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_EXIST);
+    long ret = bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_ANY);
Member:

What's the purpose of changing BPF_EXIST to BPF_ANY?

Author (@mariomac):

I realized that the PerCPU map implementation sometimes complains about the assumptions on the existence of flow entries, and we might lose packets.

E.g. it can't assume that an entry does not exist, because another thread could have inserted the bucket in the map (which is shared by all CPUs despite the "PerCPU" prefix of the map); and it can't assume that the entry exists, because the userspace might be deleting it.
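
To illustrate that reasoning, a sketch of the bpf_map_update_elem() flag semantics (not the PR's code; aggregated_flows, id and aggregate_flow are the names from the diff above):

```c
// BPF_NOEXIST: fails with -EEXIST if another CPU already created the entry.
// BPF_EXIST:   fails with -ENOENT if userspace deleted the entry in the meantime.
// BPF_ANY:     create-or-update, so neither race loses the flow update.
long ret = bpf_map_update_elem(&aggregated_flows, &id, aggregate_flow, BPF_ANY);
```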

@jotak (Member) left a comment:

lgtm

@mariomac merged commit 516a29c into netobserv:main on Oct 18, 2022
@mariomac deleted the decrease-eviction-freq branch on October 18, 2022 09:21