NETOBSERV-1533: refine metrics for dashboard creation #281
Conversation
jotak
commented
Feb 27, 2024
- Use just two shared metrics for eviction counters: one for eviction events and one for flows; shared across ringbuf/map implementations and labelled as such
- Report more errors via metrics
@jotak: This pull request references NETOBSERV-1533 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@msherif1234 as I'm working on the dashboard, I'm bringing a few changes here. Some text counters in the top bar: (something looks wrong in the eviction rate when it evicts because of a full map)
Codecov Report. Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #281 +/- ##
==========================================
+ Coverage 36.08% 36.26% +0.18%
==========================================
Files 42 42
Lines 3786 3794 +8
==========================================
+ Hits 1366 1376 +10
+ Misses 2342 2340 -2
Partials 78 78
Flags with carried forward coverage won't be shown.
pkg/metrics/metrics.go
Outdated
// MapSize = defineMetric(
// 	"ebpf_map_size",
// 	"size of the eBPF maps",
// 	TypeGauge,
// )
I tried to create a gauge to track the hashmap size but I didn't find a function to get the used size in bpflib...
we could do a dummy iterate like the one we use today, and increment the map size inside it?
I'd prefer to avoid any additional processing... no worries, I'll delete this commented code
me2 :)
pkg/flow/tracer_map.go
Outdated
m.evictionCond.Broadcast()
m.evictionCounter.ForSourceAndReason("hashmap", reason).Inc()
@msherif1234 I think what I did here doesn't work: the eviction counter is incremented every time Flush is called, but Flush seems to be called many times before the sync.Cond makes it proceed to the actual evict. So I end up with a counter value much bigger than it should be. I'm not sure if it is expected that Flush is called so often without an actual eviction. But to "fix" this counter I would need to move the increment back to where you had it before, in evictFlows; the downside is that we then lose the reason label that I wanted to have.
- Use a single metrics object that holds shared metrics
- Create shared metrics at startup
- accounter and deduper buffer gauges
- use eviction metrics for exporters
- eviction from deduper
- limiter drops counter
/lgtm
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=12b1753 make set-agent-image
@msherif1234 I think this task is done; I've got all the metrics needed to build a nice dashboard. Can you take a second look and review, please?
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=26076c3 make set-agent-image
}
// We observed that the eBPF PerCPU map might insert the same key multiple times
// (probably due to race conditions) so we need to re-join metrics again at userspace
// TODO: instrument how many times the keys are repeated in the same eviction
flows[id] = append(flows[id], metrics...)
}
met.BufferSizeGauge.WithBufferName("hashmap-total").Set(float64(count))
met.BufferSizeGauge.WithBufferName("hashmap-unique").Set(float64(len(flows)))
aren't those counters dups of what we already have in tracer_map?
it's different: the counters are used to compute rates, like the number of flows evicted per second, whereas these gauges give the size of the maps. In fact (to answer your next question) they have been incredibly useful to me, as it's thanks to these gauges that I understood exactly what was going wrong here with LookupAndDelete: the "hashmap-total" gauge was growing to 100K elements, i.e. the full map size, even when the map was not supposed to be full, while the "hashmap-unique" gauge was much smaller, something like 2K. That's what led me to find the issue about deleting within the iteration.
}
m.BufferSizeGauge.WithBufferName("deduper-list").Set(float64(cache.entries.Len()))
m.BufferSizeGauge.WithBufferName("deduper-map").Set(float64(len(cache.ifaces)))
do we really need metrics for this? isn't debugging enough?
"Sampling rate seconds",
samplingRate = defineMetric(
	"sampling_rate",
	"Sampling rate",
wasn't adding the units part of the metrics guideline you shared with me?
but this one is the sampling rate (a ratio), like 1:50 or 1:1; it's not measuring time
Overall looks great to me. Small comments/questions inline, up to you to address; nothing critical.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jotak The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing