NETOBSERV-1533: refine metrics for dashboard creation #281

Merged
5 commits merged into netobserv:main from the refine-metrics branch
Mar 4, 2024

Member

@jotak jotak commented Feb 27, 2024

  • Use just two shared metrics for eviction counters: one for eviction events and one for flows; shared across ringbuf/map implementations and labelled as such
  • Report more errors via metrics
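
As a rough illustration of the "two shared, labelled counters" idea, here is a minimal standard-library sketch. It is not the agent's code: the type and function names (`evictionMetrics`, `recordEviction`), the label names and the reason values ("timeout", "full") are assumptions made for this example, and a real implementation would presumably use Prometheus counter vectors rather than plain maps.

```go
package main

import (
	"fmt"
	"sort"
)

// evictionKey identifies one labelled series. Both label names are
// assumptions for this sketch.
type evictionKey struct {
	source string // e.g. "hashmap" or "ringbuf"
	reason string // e.g. "timeout" or "full" (hypothetical values)
}

// evictionMetrics holds two shared counters: one counting eviction
// events and one counting evicted flows, both keyed by the same labels.
type evictionMetrics struct {
	events map[evictionKey]int
	flows  map[evictionKey]int
}

func newEvictionMetrics() *evictionMetrics {
	return &evictionMetrics{
		events: map[evictionKey]int{},
		flows:  map[evictionKey]int{},
	}
}

// recordEviction counts one eviction event and the flows it carried.
func (m *evictionMetrics) recordEviction(source, reason string, nFlows int) {
	k := evictionKey{source: source, reason: reason}
	m.events[k]++
	m.flows[k] += nFlows
}

// summary renders every series in a stable (sorted) order.
func (m *evictionMetrics) summary() []string {
	keys := make([]evictionKey, 0, len(m.events))
	for k := range m.events {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].source != keys[j].source {
			return keys[i].source < keys[j].source
		}
		return keys[i].reason < keys[j].reason
	})
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, fmt.Sprintf("source=%s reason=%s events=%d flows=%d",
			k.source, k.reason, m.events[k], m.flows[k]))
	}
	return out
}

func main() {
	m := newEvictionMetrics()
	m.recordEviction("hashmap", "timeout", 120)
	m.recordEviction("hashmap", "full", 300)
	m.recordEviction("ringbuf", "timeout", 1)
	for _, line := range m.summary() {
		fmt.Println(line)
	}
}
```

Keeping a single pair of counters and distinguishing implementations by label means a dashboard can sum across sources or split by them with the same query.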

Collaborator

openshift-ci-robot commented Feb 27, 2024

@jotak: This pull request references NETOBSERV-1533 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

  • Use just two shared metrics for eviction counters: one for eviction events and one for flows; shared across ringbuf/map implementations and labelled as such
  • Report more errors via metrics

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Member Author

jotak commented Feb 27, 2024

@msherif1234 as I'm working on the dashboard, I'm bringing a few changes here.
Here's a screenshot of what I get with these metrics:

Some text counters in the top bar:
[Screenshot, 2024-02-27 16-11-07]

Some eBPF stats:
[Screenshot, 2024-02-27 16-11-25]

(something looks wrong in the eviction rate when it evicts because of a full map)

codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 69.10569%, with 38 lines in your changes missing coverage. Please review.

Project coverage is 36.26%. Comparing base (b3c02b8) to head (4a58e68).
Report is 2 commits behind head on main.

Files                         Patch %   Lines
pkg/agent/agent.go            35.48%    20 Missing ⚠️
pkg/ebpf/tracer.go             0.00%     9 Missing ⚠️
pkg/flow/tracer_ringbuf.go    50.00%     5 Missing ⚠️
pkg/exporter/kafka_proto.go   50.00%     2 Missing ⚠️
pkg/exporter/grpc_proto.go    90.90%     1 Missing ⚠️
pkg/flow/account.go           93.33%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
+ Coverage   36.08%   36.26%   +0.18%     
==========================================
  Files          42       42              
  Lines        3786     3794       +8     
==========================================
+ Hits         1366     1376      +10     
+ Misses       2342     2340       -2     
  Partials       78       78              
Flag Coverage Δ
unittests 36.26% <69.10%> (+0.18%) ⬆️

Comment on lines 92 to 96
// MapSize = defineMetric(
// "ebpf_map_size",
// "size of the eBPF maps",
// TypeGauge,
// )
Member Author

I tried to create a gauge to track the hashmap size but I didn't find a function to get the used size in bpflib...

Contributor

We could do a dummy iteration like the one we use today, and increment the map size inside it?

Member Author

I'd prefer to avoid any additional processing. No worries, I'll delete this commented-out code.

Contributor

me2 :)

m.evictionCond.Broadcast()
m.evictionCounter.ForSourceAndReason("hashmap", reason).Inc()
Member Author

@jotak jotak Feb 27, 2024

@msherif1234 I think what I did here doesn't work: the eviction counter is incremented every time Flush is called, but Flush seems to be called many times before the sync.Cond lets it proceed to the actual eviction. So I end up with a counter value much bigger than it should be. I'm not sure whether it is expected that Flush is called so often without an actual eviction. To "fix" this counter I would need to move the increment back to where you had it before, in evictFlows, but the downside is that we would then lose the reason label that I wanted to have.
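
To make the suspicion above concrete, here is a minimal sketch of how a counter incremented in a `Flush`-style method can run ahead of the evictions that actually happen when the two are coupled through a `sync.Cond`. The `tracer` type, `evictOnce` and `run` are hypothetical stand-ins written for this example, not the agent's code, and the sketch only shows general `sync.Cond` behaviour; it is not a diagnosis of the agent.

```go
package main

import (
	"fmt"
	"sync"
)

// tracer is a hypothetical stand-in: Flush wakes a goroutine that
// performs the actual eviction.
type tracer struct {
	cond       *sync.Cond
	flushCalls int // incremented on every Flush call
	evictions  int // incremented only when an eviction really runs
}

func newTracer() *tracer {
	return &tracer{cond: sync.NewCond(&sync.Mutex{})}
}

// Flush requests an eviction. Broadcast only wakes goroutines that are
// waiting at that moment, so a call made while nobody waits does nothing.
func (t *tracer) Flush() {
	t.cond.L.Lock()
	t.flushCalls++
	t.cond.L.Unlock()
	t.cond.Broadcast()
}

// evictOnce waits for one wake-up and then counts one eviction.
// ready is closed while the lock is held, right before waiting.
func (t *tracer) evictOnce(ready, done chan<- struct{}) {
	t.cond.L.Lock()
	close(ready)
	t.cond.Wait()
	t.evictions++
	t.cond.L.Unlock()
	close(done)
}

// run calls Flush `early` times with no waiter, then once with a waiter,
// and returns both counters.
func run(early int) (flushCalls, evictions int) {
	t := newTracer()
	for i := 0; i < early; i++ {
		t.Flush() // nobody is waiting: counted, but no eviction follows
	}
	ready, done := make(chan struct{}), make(chan struct{})
	go t.evictOnce(ready, done)
	<-ready
	// The lock becomes available only once Wait has registered the
	// waiter and released it, so the next Broadcast cannot be missed.
	t.cond.L.Lock()
	t.cond.L.Unlock()
	t.Flush()
	<-done
	return t.flushCalls, t.evictions
}

func main() {
	flushCalls, evictions := run(5)
	fmt.Printf("flush calls=%d evictions=%d\n", flushCalls, evictions)
}
```

With `run(5)`, six `Flush` calls produce a single eviction, which is why counting inside the function that performs the eviction gives a truer figure than counting requests.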

Member Author

Look at this: there's a peak at 300K evictions per second due to this increment, while the flow rate is only 3.5K per second. Isn't it fishy?

[Screenshot, 2024-02-27 16-11-25]

Member Author

OK, I understand. In tracer_map, what I'm seeing is not flows but individual packets, so a different metric should be used for that. I'll most likely do it tomorrow.
Also, I fixed a bug in my last commit: not all metrics were showing up correctly. Now it looks like this:
[Screenshot, 2024-02-27 19-40-30]

- Use a single metrics object that holds shared metrics
- Create shared metrics at startup
- accounter and deduper buffer gauges
- use eviction metrics for exporters
- eviction from deduper
- limiter drops counter
@msherif1234
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 28, 2024
@jotak jotak added the ok-to-test label Feb 29, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:12b1753

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=12b1753 make set-agent-image

@openshift-ci openshift-ci bot removed the lgtm label Feb 29, 2024
@github-actions github-actions bot removed the ok-to-test label Feb 29, 2024
Member Author

jotak commented Mar 4, 2024

@msherif1234 I think this task is done: I've got all the metrics needed to build a nice dashboard. Can you take a second look and review, please?

@jotak jotak added the ok-to-test label Mar 4, 2024

github-actions bot commented Mar 4, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:26076c3

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=26076c3 make set-agent-image

}
// We observed that the eBPF PerCPU map might insert the same key multiple times in the map
// (probably due to race conditions), so we need to re-join metrics again in userspace
// TODO: instrument how many times the keys are repeated in the same eviction
flows[id] = append(flows[id], metrics...)
}
met.BufferSizeGauge.WithBufferName("hashmap-total").Set(float64(count))
met.BufferSizeGauge.WithBufferName("hashmap-unique").Set(float64(len(flows)))
Contributor

Aren't those counters duplicates of what we already have in tracer_map?

Member Author

It's different: the counters are used to compute rates, like the number of flows evicted per second, whereas these gauges give the size of the maps. In fact (to answer your next question), they have been incredibly useful to me: it's thanks to these gauges that I understood exactly what was going wrong here with LookupAndDelete. The "hashmap-total" gauge was growing to 100K elements, i.e. the full map size, even when the map was not supposed to be full, while the "hashmap-unique" gauge was much smaller, something like 2K. That's what led me to find the issue with deleting within the iteration.
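
The difference between the two gauges can be sketched with plain Go: count every entry read from the map, re-join entries that share a key, and compare that total with the number of distinct keys. The `entry` type, its fields and the `aggregate` function are assumptions for this example; the real code iterates an eBPF map and appends per-CPU metrics instead.

```go
package main

import "fmt"

// entry is a hypothetical (key, metrics) pair as read from the map.
type entry struct {
	flowID  string
	packets int
}

// aggregate re-joins entries that share a flow ID and returns the two
// values that would feed the gauges: entries read and unique flows.
func aggregate(entries []entry) (flows map[string][]int, total, unique int) {
	flows = make(map[string][]int)
	for _, e := range entries {
		total++ // every entry read counts towards "hashmap-total"
		flows[e.flowID] = append(flows[e.flowID], e.packets)
	}
	return flows, total, len(flows) // len(flows) feeds "hashmap-unique"
}

func main() {
	// Flow "a" appears three times, as if the same key had been
	// returned repeatedly during the iteration.
	entries := []entry{{"a", 3}, {"b", 1}, {"a", 2}, {"a", 7}, {"c", 4}}
	_, total, unique := aggregate(entries)
	fmt.Printf("hashmap-total=%d hashmap-unique=%d\n", total, unique)
}
```

A large gap between the two values, as in the 100K versus 2K observation above, points at repeated keys rather than a genuinely full map.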

}
m.BufferSizeGauge.WithBufferName("deduper-list").Set(float64(cache.entries.Len()))
m.BufferSizeGauge.WithBufferName("deduper-map").Set(float64(len(cache.ifaces)))
Contributor

Do we really need metrics for this? Isn't debugging enough?

"Sampling rate seconds",
samplingRate = defineMetric(
"sampling_rate",
"Sampling rate",
Contributor

Wasn't adding the units part of the metrics guidelines you shared with me?

Member Author

But this one is the sampling rate (a ratio), like 1:50 or 1:1; it's not measuring time.

@msherif1234
Contributor

Overall this looks great to me. Small comments/questions inline, up to you to address; nothing critical.
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 4, 2024
Member Author

jotak commented Mar 4, 2024

/approve

openshift-ci bot commented Mar 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak


@openshift-ci openshift-ci bot added the approved label Mar 4, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 1d85464 into netobserv:main Mar 4, 2024
10 checks passed
@jotak jotak deleted the refine-metrics branch March 21, 2024 07:37
Labels
approved jira/valid-reference lgtm ok-to-test