WIP NETOBSERV-1550: Using batchAPIs to help with CPU and memory resources #256

Closed
msherif1234 wants to merge 2 commits from the batch-apis branch

Conversation

msherif1234
Contributor

@msherif1234 msherif1234 commented Jan 24, 2024

Description

cilium/ebpf recently added batch API support for per-CPU maps. This PR migrates the eBPF agent to use the batch APIs.

cilium/ebpf#1315

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jan 24, 2024

@msherif1234: This pull request references NETOBSERV-559 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot commented Jan 24, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from msherif1234. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jan 24, 2024

Codecov Report

Attention: Patch coverage is 0%, with 90 lines in your changes missing coverage. Please review.

Project coverage is 33.44%. Comparing base (b63f483) to head (edd8134).
Report is 1 commit behind head on main.

❗ Current head edd8134 differs from pull request most recent head 5fdf081. Consider uploading reports for the commit 5fdf081 to get more accurate results

Files | Patch % | Lines
pkg/ebpf/tracer_batchapis.go | 0.00% | 57 Missing ⚠️
pkg/ebpf/tracer.go | 0.00% | 33 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
- Coverage   34.04%   33.44%   -0.61%     
==========================================
  Files          47       48       +1     
  Lines        3836     3905      +69     
==========================================
  Hits         1306     1306              
- Misses       2444     2513      +69     
  Partials       86       86              
Flag Coverage Δ
unittests 33.44% <0.00%> (-0.61%) ⬇️


@jpinsonneau jpinsonneau marked this pull request as draft January 25, 2024 11:24
@msherif1234 msherif1234 force-pushed the batch-apis branch 3 times, most recently from 93db299 to 53cbc3a Compare January 26, 2024 22:41
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 26, 2024
New image:
quay.io/netobserv/netobserv-ebpf-agent:6d184cc

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=6d184cc make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 26, 2024
New image:
quay.io/netobserv/netobserv-ebpf-agent:bfa5ac7

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=bfa5ac7 make set-agent-image

@msherif1234
Contributor Author

Ran scale tests based on 4.14:
https://docs.google.com/spreadsheets/d/14taH8UGgjiLNqjgCRq66mcNBeNCDCnYEFGsur8Qig9I/edit#gid=2136829211

The summary shows an increase in eBPF agent resource usage:

cpuEBPFTotals | cpuEBPFTotals | avg(value) | Fail | 78.57% | 3.405002158 | 6.080411792 |

rssEBPFTotals | rssEBPFTotals | avg(value) | Fail | 53.61% | 3404791063 | 5230041771 |
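
The percentages in the two rows above can be reproduced from the last two columns. The sketch below assumes (the column headers are not shown in the comment) that the second-to-last value is the baseline and the last value is the measurement with this PR; `percentChange` is a hypothetical helper, not part of the agent.

```go
package main

import "fmt"

// percentChange returns the relative change from baseline to current, in percent.
// Assumption: the summary rows list the baseline first and the new value second.
func percentChange(baseline, current float64) float64 {
	return (current - baseline) / baseline * 100
}

func main() {
	// Values copied from the cpuEBPFTotals and rssEBPFTotals rows.
	fmt.Printf("cpuEBPFTotals: %+.2f%%\n", percentChange(3.405002158, 6.080411792))
	fmt.Printf("rssEBPFTotals: %+.2f%%\n", percentChange(3404791063, 5230041771))
}
```

Under that reading, both CPU and RSS went up with the batch API build, which is consistent with the "Fail" status in each row.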

@msherif1234
Contributor Author

msherif1234 commented Jan 29, 2024

(pprof) top10 -cum
Showing nodes accounting for 70ms, 3.14% of 2230ms total
Dropped 56 nodes (cum <= 11.15ms)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
         0     0%     0%     1770ms 79.37%  github.com/netobserv/netobserv-ebpf-agent/pkg/flow.(*MapTracer).evictFlows
         0     0%     0%     1770ms 79.37%  github.com/netobserv/netobserv-ebpf-agent/pkg/flow.(*MapTracer).evictionSynchronization
         0     0%     0%     1760ms 78.92%  github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf.(*FlowFetcher).LookupAndDeleteMap
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).BatchLookupAndDelete (inline)
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).batchLookup
         0     0%     0%     1620ms 72.65%  github.com/cilium/ebpf.(*Map).batchLookupPerCPU
      40ms  1.79%  1.79%     1510ms 67.71%  github.com/cilium/ebpf/internal/sysenc.Unmarshal
      30ms  1.35%  3.14%     1350ms 60.54%  encoding/binary.Read
         0     0%  3.14%     1190ms 53.36%  github.com/cilium/ebpf.unmarshalBatchPerCPUValue
         0     0%  3.14%     1180ms 52.91%  github.com/cilium/ebpf.unmarshalPerCPUValue
(pprof) 

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 6, 2024
@msherif1234
Contributor Author

msherif1234 commented Feb 6, 2024

Added benchmark tests comparing the iterate API with the batch lookup-and-delete API:

$ go test ./pkg/ebpf/ -exec sudo -bench=BenchmarkFlowFetcher_LookupAndDeleteMap -benchmem -count 5 -run=^#
goos: linux
goarch: amd64
pkg: github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf
cpu: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12         	     403	   2507858 ns/op	  757583 B/op	    2943 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12         	     446	   2531754 ns/op	  746563 B/op	    2838 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12         	     488	   2234317 ns/op	  737511 B/op	    2753 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12         	     526	   2209894 ns/op	  730663 B/op	    2688 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12         	     477	   2251203 ns/op	  739670 B/op	    2774 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12       	     386	   2796254 ns/op	  598852 B/op	    4355 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12       	     345	   3105146 ns/op	  613746 B/op	    4492 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12       	     370	   2940347 ns/op	  604619 B/op	    4406 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12       	     304	   3723941 ns/op	  631809 B/op	    4664 allocs/op
BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12       	     326	   3699242 ns/op	  621145 B/op	    4566 allocs/op
PASS
ok  	github.com/netobserv/netobserv-ebpf-agent/pkg/ebpf	70.103s
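
To compare the two variants across the five runs, the result lines can be averaged per benchmark name. The following is a minimal sketch, not part of the PR: `benchLine`, `parseBenchLine`, and `meanByName` are hypothetical names, and the parser assumes the exact `-benchmem` line layout shown above. The `benchstat` tool from `golang.org/x/perf` does this kind of comparison more rigorously.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// benchLine holds the metrics from one `go test -bench -benchmem` result line.
type benchLine struct {
	Name        string
	NsPerOp     float64
	BytesPerOp  float64
	AllocsPerOp float64
}

// parseBenchLine parses a line of the form
// "<name> <iterations> <n> ns/op <n> B/op <n> allocs/op".
func parseBenchLine(line string) (benchLine, error) {
	f := strings.Fields(line)
	if len(f) != 8 || f[3] != "ns/op" || f[5] != "B/op" || f[7] != "allocs/op" {
		return benchLine{}, fmt.Errorf("unexpected benchmark line: %q", line)
	}
	var vals [3]float64
	for i, idx := range []int{2, 4, 6} {
		v, err := strconv.ParseFloat(f[idx], 64)
		if err != nil {
			return benchLine{}, err
		}
		vals[i] = v
	}
	return benchLine{Name: f[0], NsPerOp: vals[0], BytesPerOp: vals[1], AllocsPerOp: vals[2]}, nil
}

// meanByName averages each metric over all runs that share a benchmark name.
func meanByName(lines []string) (map[string]benchLine, error) {
	sums := map[string]benchLine{}
	counts := map[string]float64{}
	for _, l := range lines {
		b, err := parseBenchLine(l)
		if err != nil {
			return nil, err
		}
		s := sums[b.Name]
		s.Name = b.Name
		s.NsPerOp += b.NsPerOp
		s.BytesPerOp += b.BytesPerOp
		s.AllocsPerOp += b.AllocsPerOp
		sums[b.Name] = s
		counts[b.Name]++
	}
	for name, s := range sums {
		n := counts[name]
		s.NsPerOp /= n
		s.BytesPerOp /= n
		s.AllocsPerOp /= n
		sums[name] = s
	}
	return sums, nil
}

func main() {
	// First two runs of each variant, copied from the output above.
	lines := []string{
		"BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12 403 2507858 ns/op 757583 B/op 2943 allocs/op",
		"BenchmarkFlowFetcher_LookupAndDeleteMap/BatchLookupAndDelete-12 446 2531754 ns/op 746563 B/op 2838 allocs/op",
		"BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12 386 2796254 ns/op 598852 B/op 4355 allocs/op",
		"BenchmarkFlowFetcher_LookupAndDeleteMap/IterateLookupAndDelete-12 345 3105146 ns/op 613746 B/op 4492 allocs/op",
	}
	means, err := meanByName(lines)
	if err != nil {
		panic(err)
	}
	for name, m := range means {
		fmt.Printf("%s: %.0f ns/op, %.0f B/op, %.1f allocs/op\n", name, m.NsPerOp, m.BytesPerOp, m.AllocsPerOp)
	}
}
```

In the runs listed above, the batch variant shows lower ns/op and fewer allocs/op than the iterate variant, but higher B/op in every run.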

@msherif1234 msherif1234 force-pushed the batch-apis branch 2 times, most recently from ab946c5 to 4597d54 Compare February 6, 2024 18:20
@msherif1234
Contributor Author

Started a reproducer upstream: cilium/ebpf#1343

@msherif1234
Contributor Author

/ok-to-test

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 22, 2024
New image:
quay.io/netobserv/netobserv-ebpf-agent:baad512

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=baad512 make set-agent-image

@jotak jotak changed the title WIP NETOBSERV-559: Using batchAPIs to help with CPU and memory resources WIP NETOBSERV-1550: Using batchAPIs to help with CPU and memory resources Mar 1, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 1, 2024

@msherif1234: This pull request references NETOBSERV-1550 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@jotak
Member

jotak commented Mar 1, 2024

@msherif1234 I've created a new Jira for this PR, NETOBSERV-1550; the former one (NETOBSERV-559) is used for the non-batched LookupAndDelete in my PR #283.

Comment on lines 424 to 428
for i, id := range ids[:count] {
	for j := 0; j < ebpf.MustPossibleCPU(); j++ {
		flows[id] = append(flows[id], metrics[i*ebpf.MustPossibleCPU()+j])
	}
}
Member

Can you explain this? I'm not sure I understand.

Contributor Author

We are now getting per-CPU metrics, so we need to combine the metrics from all CPUs and assign them to the right flow ID.
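
The regrouping described here can be sketched in isolation. The example assumes, as the indexing in the reviewed loop implies, that the batch call returns one flat values slice laid out key by key, with one entry per possible CPU for each key. `flowID`, `flowMetrics`, `groupPerCPU`, and `aggregate` are hypothetical stand-ins rather than the agent's real types, and summing the per-CPU entries is an assumption about what "combine" means.

```go
package main

import "fmt"

// flowID and flowMetrics are hypothetical stand-ins for the agent's key and value types.
type flowID uint32

type flowMetrics struct{ Packets, Bytes uint64 }

// groupPerCPU splits the flat values slice from a per-CPU batch lookup into one
// slice per flow ID. values is assumed to be key-major: the nCPU entries for
// ids[0] come first, then the nCPU entries for ids[1], and so on. Only the
// first count keys are valid.
func groupPerCPU(ids []flowID, values []flowMetrics, count, nCPU int) (map[flowID][]flowMetrics, error) {
	if count < 0 || nCPU <= 0 || count > len(ids) || count*nCPU > len(values) {
		return nil, fmt.Errorf("inconsistent batch: count=%d nCPU=%d ids=%d values=%d",
			count, nCPU, len(ids), len(values))
	}
	flows := make(map[flowID][]flowMetrics, count)
	for i, id := range ids[:count] {
		flows[id] = append(flows[id], values[i*nCPU:(i+1)*nCPU]...)
	}
	return flows, nil
}

// aggregate sums the per-CPU entries of one flow into a single record
// (an assumed way of combining them, for illustration only).
func aggregate(perCPU []flowMetrics) flowMetrics {
	var total flowMetrics
	for _, m := range perCPU {
		total.Packets += m.Packets
		total.Bytes += m.Bytes
	}
	return total
}

func main() {
	// Two valid keys out of a batch of three, on a machine with 2 possible CPUs.
	ids := []flowID{7, 9, 0}
	values := []flowMetrics{{1, 10}, {2, 20}, {3, 30}, {4, 40}, {}, {}}
	flows, err := groupPerCPU(ids, values, 2, 2)
	if err != nil {
		panic(err)
	}
	for id, perCPU := range flows {
		fmt.Println(id, aggregate(perCPU))
	}
}
```

Taking the CPU count as a parameter, instead of calling `ebpf.MustPossibleCPU()` on every loop iteration, keeps the helper testable without the cilium/ebpf dependency.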

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 25, 2024
New image:
quay.io/netobserv/netobserv-ebpf-agent:f8e7e13

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=f8e7e13 make set-agent-image

@msherif1234
Contributor Author

I will close this PR, as switching to the batch APIs never showed any real value compared with what we have today. Should we ever reconsider, we can reopen it.

Labels
do-not-merge/work-in-progress jira/valid-reference ok-to-test To set manually when a PR is safe to test. Triggers image build on PR.
5 participants