[CHORE] processing discovered targets async #3517

Merged: 12 commits, Dec 19, 2024


@nicolastakashi nicolastakashi commented Dec 5, 2024

Description:

This change addresses issue #3512. The implementation is inspired by the Prometheus Scrape Manager source code.

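Below are the benchstat results comparing the current main branch (serial) with the proposed changes in this PR (concurrent). ProcessTargets runs about 70-73% faster, at the cost of roughly 0.6% more memory and 1.4% more allocations per operation; the relabel-config benchmarks are mostly unchanged.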

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Apple M1 Pro
                                                      │ bench_targets-serial.txt │     bench_targets-concurrent.txt     │
                                                      │          sec/op          │    sec/op     vs base                │
ProcessTargets/consistent-hashing-10                                 3.658 ±  5%    1.087 ±  9%  -70.30% (p=0.000 n=10)
ProcessTargets/per-node-10                                         3586.9m ±  3%   995.7m ± 14%  -72.24% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                   3627.1m ±  3%   985.5m ±  7%  -72.83% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                        1007.7m ± 11%   959.8m ±  8%        ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                   994.0m ±  8%   977.5m ± 10%        ~ (p=0.481 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10              1010.2m ± 10%   927.3m ±  2%   -8.21% (p=0.002 n=10)
geomean                                                              1.907         987.6m        -48.23%

                                                      │ bench_targets-serial.txt │    bench_targets-concurrent.txt     │
                                                      │           B/op           │     B/op      vs base               │
ProcessTargets/consistent-hashing-10                                3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargets/per-node-10                                          3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                    3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                         3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.190 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                   3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10               3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.684 n=10)
geomean                                                             3.169Gi        3.179Gi       +0.32%

                                                      │ bench_targets-serial.txt │    bench_targets-concurrent.txt    │
                                                      │        allocs/op         │  allocs/op   vs base               │
ProcessTargets/consistent-hashing-10                                 4.161M ± 0%   4.221M ± 0%  +1.45% (p=0.002 n=10)
ProcessTargets/per-node-10                                           4.161M ± 0%   4.221M ± 0%  +1.45% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                     4.161M ± 0%   4.220M ± 0%  +1.43% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                          4.236M ± 0%   4.237M ± 0%       ~ (p=0.724 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                    4.237M ± 0%   4.237M ± 0%       ~ (p=0.631 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10                4.236M ± 0%   4.237M ± 0%       ~ (p=0.436 n=10)
geomean                                                              4.198M        4.229M       +0.72%

Link to tracking Issue(s):

Testing: Ran unit tests and benchmark tests

Documentation: N/A

@nicolastakashi nicolastakashi requested a review from a team as a code owner December 5, 2024 15:09
@nicolastakashi nicolastakashi force-pushed the chore/improving-set-targets branch 2 times, most recently from b8ea4df to d219c9d Compare December 5, 2024 15:33
@nicolastakashi

I think tests are failing because master is failing as well, but I'm not sure.

@janumpallykalyan

This comment was marked as spam.

@nicolastakashi

@janumpallykalyan your last comment is broken

…v.valueFrom)" (open-telemetry#3510)

* Revert "Support configuring java runtime from configmap or secret (env.valueF…"

This reverts commit 2b36f0d.

* chlog (open-telemetry#3511)
@nicolastakashi nicolastakashi force-pushed the chore/improving-set-targets branch from d219c9d to 42eec92 Compare December 5, 2024 21:34

@swiatekm swiatekm left a comment


I like the idea of these changes! I've left some questions and comments about the specifics.

More generally, since we're changing the execution model of the Discoverer, I'd love to see a few more unit tests verifying if the triggers work correctly. Also, comments for all the new functions, explaining what purpose they serve in the new design.

Review threads on cmd/otel-allocator/target/discovery.go: 5 (4 outdated), all resolved.

@swiatekm swiatekm left a comment


Looks good to me overall, some final nitpicks. Could you also rerun the benchmark to see if the concurrency changes affected performance?

Review threads: .chloggen/discovering-target-async.yaml (1, outdated) and cmd/otel-allocator/target/discovery.go (2), all resolved.
@nicolastakashi

goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Apple M1 Pro
BenchmarkProcessTargets/per-node-10            	       1	1122291333 ns/op	3431733448 B/op	 4417034 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 985838084 ns/op	3414612288 B/op	 4234554 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 968455875 ns/op	3414472688 B/op	 4234142 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 902708542 ns/op	3414559424 B/op	 4235042 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 914759875 ns/op	3414536968 B/op	 4234955 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 861704500 ns/op	3414609664 B/op	 4235733 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 824441770 ns/op	3414526040 B/op	 4234824 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	1053985854 ns/op	3414310288 B/op	 4232579 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	1014306334 ns/op	3414348072 B/op	 4232986 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 927938792 ns/op	3414307512 B/op	 4232577 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 838161250 ns/op	3414583368 B/op	 4235406 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 950306875 ns/op	3414322416 B/op	 4232724 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 892734354 ns/op	3414382264 B/op	 4233430 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 882699542 ns/op	3414367800 B/op	 4233267 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 918699980 ns/op	3414305280 B/op	 4232622 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 827566146 ns/op	3414596352 B/op	 4234735 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 932759167 ns/op	3414521768 B/op	 4234829 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 835300375 ns/op	3414518352 B/op	 4234696 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 862945354 ns/op	3414511400 B/op	 4234795 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 917244250 ns/op	3414306992 B/op	 4232574 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 908266292 ns/op	3414353496 B/op	 4233097 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 872114500 ns/op	3414446104 B/op	 4233992 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 863686250 ns/op	3414349784 B/op	 4233032 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 884526562 ns/op	3414359912 B/op	 4233131 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 872686104 ns/op	3414537528 B/op	 4234973 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 899104500 ns/op	3414298424 B/op	 4232552 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 856262916 ns/op	3414461864 B/op	 4234234 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 887320250 ns/op	3414394752 B/op	 4233453 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 910004980 ns/op	3414333144 B/op	 4232886 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 880362000 ns/op	3414589832 B/op	 4235591 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 894908562 ns/op	3415972280 B/op	 4250871 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 884152000 ns/op	3415784728 B/op	 4248901 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 880393438 ns/op	3415879584 B/op	 4249947 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	1006655375 ns/op	3415878896 B/op	 4249814 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 875715896 ns/op	3415914608 B/op	 4250269 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 912449271 ns/op	3415745264 B/op	 4248512 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 881933730 ns/op	3415738312 B/op	 4248428 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 851389396 ns/op	3415938952 B/op	 4250570 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 860880292 ns/op	3415879320 B/op	 4249898 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 910038542 ns/op	3415783504 B/op	 4248931 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 846392438 ns/op	3415940752 B/op	 4250580 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 922212542 ns/op	3415764984 B/op	 4248759 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 893194479 ns/op	3415929848 B/op	 4250460 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 888838542 ns/op	3415776144 B/op	 4248804 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 876409104 ns/op	3415805960 B/op	 4249196 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 924206646 ns/op	3415770592 B/op	 4248793 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 850870812 ns/op	3415931992 B/op	 4250439 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 884762875 ns/op	3415814584 B/op	 4249200 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 896129083 ns/op	3415959216 B/op	 4250795 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 900787250 ns/op	3415772504 B/op	 4248838 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 930492312 ns/op	3415914392 B/op	 4250305 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 861972688 ns/op	3415985264 B/op	 4250985 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 872793875 ns/op	3415783328 B/op	 4248945 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 889231812 ns/op	3415923628 B/op	 4250333 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 875701146 ns/op	3415837688 B/op	 4249438 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 880286062 ns/op	3415972176 B/op	 4250887 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 879769188 ns/op	3415780680 B/op	 4248888 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	1264806020 ns/op	3415740260 B/op	 4248454 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 993177812 ns/op	3415891352 B/op	 4250057 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       1	1084294000 ns/op	3416297216 B/op	 4249305 allocs/op
PASS
ok  	github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator	177.920s

Bench results after lock per job
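
For context, one possible reading of "lock per job" is sketched below: each job gets its own mutex, so different jobs are processed concurrently while work on the same job is serialized. This is an assumption about the approach, not the code merged in this PR; `jobLocks`, `forJob`, and `processAll` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// jobLocks hands out one mutex per job name (hypothetical helper).
type jobLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newJobLocks() *jobLocks {
	return &jobLocks{locks: make(map[string]*sync.Mutex)}
}

// forJob returns the mutex for a job, creating it on first use.
func (j *jobLocks) forJob(job string) *sync.Mutex {
	j.mu.Lock()
	defer j.mu.Unlock()
	l, ok := j.locks[job]
	if !ok {
		l = &sync.Mutex{}
		j.locks[job] = l
	}
	return l
}

// processAll handles every job in its own goroutine. Only goroutines
// working on the same job contend for that job's lock.
func processAll(groups map[string][]string, locks *jobLocks) map[string]int {
	var (
		wg     sync.WaitGroup
		resMu  sync.Mutex
		counts = make(map[string]int)
	)
	for job, addrs := range groups {
		wg.Add(1)
		go func(job string, addrs []string) {
			defer wg.Done()
			l := locks.forJob(job)
			l.Lock()
			defer l.Unlock()
			n := len(addrs) // stand-in for relabeling and hashing work
			resMu.Lock()
			counts[job] = n
			resMu.Unlock()
		}(job, addrs)
	}
	wg.Wait()
	return counts
}

func main() {
	groups := map[string][]string{
		"job-a": {"10.0.0.1:9090", "10.0.0.2:9090"},
		"job-b": {"10.0.1.1:8080"},
	}
	counts := processAll(groups, newJobLocks())
	fmt.Println("job-a targets:", counts["job-a"])
	fmt.Println("job-b targets:", counts["job-b"])
}
```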

Signed-off-by: Nicolas Takashi <[email protected]>

@swiatekm swiatekm left a comment


Thank you for taking this up!

@swiatekm swiatekm merged commit b387afd into open-telemetry:main Dec 19, 2024
38 checks passed
@diranged

@nicolastakashi Should we expect the CPU usage of the targetallocator pods to go up significantly in large environments?

image

@nicolastakashi

> @nicolastakashi Should we expect the CPU usage of the targetallocator pods to go up significantly in large environments?

Yes, this is due to the number of goroutines.
If you were previously using Prometheus, you would have noticed the same behaviour.

Development

Successfully merging this pull request may close these issues.

Targets taking too long time to be discovered on large targets environment
5 participants