[CHORE] processing discovered targets async #3517

nicolastakashi · 2024-12-05T15:09:42Z

Description:

This change addresses issue #3512. The implementation is inspired by the Prometheus Scrape Manager source code.

Below are the benchstat results comparing the current main branch with the proposed changes in this PR. These results highlight the performance improvements and trade-offs introduced by this update.

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Apple M1 Pro
                                                      │ bench_targets-serial.txt │     bench_targets-concurrent.txt     │
                                                      │          sec/op          │    sec/op     vs base                │
ProcessTargets/consistent-hashing-10                                 3.658 ±  5%    1.087 ±  9%  -70.30% (p=0.000 n=10)
ProcessTargets/per-node-10                                         3586.9m ±  3%   995.7m ± 14%  -72.24% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                   3627.1m ±  3%   985.5m ±  7%  -72.83% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                        1007.7m ± 11%   959.8m ±  8%        ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                   994.0m ±  8%   977.5m ± 10%        ~ (p=0.481 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10              1010.2m ± 10%   927.3m ±  2%   -8.21% (p=0.002 n=10)
geomean                                                              1.907         987.6m        -48.23%

                                                      │ bench_targets-serial.txt │    bench_targets-concurrent.txt     │
                                                      │           B/op           │     B/op      vs base               │
ProcessTargets/consistent-hashing-10                                3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargets/per-node-10                                          3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                    3.158Gi ± 0%   3.178Gi ± 0%  +0.64% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                         3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.190 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                   3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.529 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10               3.180Gi ± 0%   3.180Gi ± 0%       ~ (p=0.684 n=10)
geomean                                                             3.169Gi        3.179Gi       +0.32%

                                                      │ bench_targets-serial.txt │    bench_targets-concurrent.txt    │
                                                      │        allocs/op         │  allocs/op   vs base               │
ProcessTargets/consistent-hashing-10                                 4.161M ± 0%   4.221M ± 0%  +1.45% (p=0.002 n=10)
ProcessTargets/per-node-10                                           4.161M ± 0%   4.221M ± 0%  +1.45% (p=0.000 n=10)
ProcessTargets/least-weighted-10                                     4.161M ± 0%   4.220M ± 0%  +1.43% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node-10                          4.236M ± 0%   4.237M ± 0%       ~ (p=0.724 n=10)
ProcessTargetsWithRelabelConfig/least-weighted-10                    4.237M ± 0%   4.237M ± 0%       ~ (p=0.631 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing-10                4.236M ± 0%   4.237M ± 0%       ~ (p=0.436 n=10)
geomean                                                              4.198M        4.229M       +0.72%

Link to tracking Issue(s):

Resolves: Targets taking too long time to be discovered on large targets environment #3512

Testing: Ran unit tests and benchmark tests

Documentation: N/A

nicolastakashi · 2024-12-05T17:00:27Z

I think tests are failing because master is failing as well, but I'm not sure.

nicolastakashi · 2024-12-05T20:38:29Z

@janumpallykalyan your last comment is broken

swiatekm

I like the idea of these changes! I've left some questions and comments about the specifics.

More generally, since we're changing the execution model of the Discoverer, I'd love to see a few more unit tests verifying that the triggers work correctly. I'd also like comments on all the new functions, explaining what purpose they serve in the new design.

cmd/otel-allocator/target/discovery.go

Signed-off-by: Nicolas Takashi <[email protected]>

swiatekm

Looks good to me overall, with some final nitpicks. Could you also rerun the benchmark to see if the concurrency changes affected performance?

.chloggen/discovering-target-async.yaml

cmd/otel-allocator/target/discovery.go

Co-authored-by: Mikołaj Świątek <[email protected]>

nicolastakashi · 2024-12-17T20:30:06Z

goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Apple M1 Pro
BenchmarkProcessTargets/per-node-10            	       1	1122291333 ns/op	3431733448 B/op	 4417034 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 985838084 ns/op	3414612288 B/op	 4234554 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 968455875 ns/op	3414472688 B/op	 4234142 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 902708542 ns/op	3414559424 B/op	 4235042 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 914759875 ns/op	3414536968 B/op	 4234955 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 861704500 ns/op	3414609664 B/op	 4235733 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 824441770 ns/op	3414526040 B/op	 4234824 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	1053985854 ns/op	3414310288 B/op	 4232579 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	1014306334 ns/op	3414348072 B/op	 4232986 allocs/op
BenchmarkProcessTargets/per-node-10            	       2	 927938792 ns/op	3414307512 B/op	 4232577 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 838161250 ns/op	3414583368 B/op	 4235406 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 950306875 ns/op	3414322416 B/op	 4232724 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 892734354 ns/op	3414382264 B/op	 4233430 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 882699542 ns/op	3414367800 B/op	 4233267 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 918699980 ns/op	3414305280 B/op	 4232622 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 827566146 ns/op	3414596352 B/op	 4234735 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 932759167 ns/op	3414521768 B/op	 4234829 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 835300375 ns/op	3414518352 B/op	 4234696 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 862945354 ns/op	3414511400 B/op	 4234795 allocs/op
BenchmarkProcessTargets/least-weighted-10      	       2	 917244250 ns/op	3414306992 B/op	 4232574 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 908266292 ns/op	3414353496 B/op	 4233097 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 872114500 ns/op	3414446104 B/op	 4233992 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 863686250 ns/op	3414349784 B/op	 4233032 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 884526562 ns/op	3414359912 B/op	 4233131 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 872686104 ns/op	3414537528 B/op	 4234973 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 899104500 ns/op	3414298424 B/op	 4232552 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 856262916 ns/op	3414461864 B/op	 4234234 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 887320250 ns/op	3414394752 B/op	 4233453 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 910004980 ns/op	3414333144 B/op	 4232886 allocs/op
BenchmarkProcessTargets/consistent-hashing-10  	       2	 880362000 ns/op	3414589832 B/op	 4235591 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 894908562 ns/op	3415972280 B/op	 4250871 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 884152000 ns/op	3415784728 B/op	 4248901 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 880393438 ns/op	3415879584 B/op	 4249947 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	1006655375 ns/op	3415878896 B/op	 4249814 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 875715896 ns/op	3415914608 B/op	 4250269 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 912449271 ns/op	3415745264 B/op	 4248512 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 881933730 ns/op	3415738312 B/op	 4248428 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 851389396 ns/op	3415938952 B/op	 4250570 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 860880292 ns/op	3415879320 B/op	 4249898 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/least-weighted-10         	       2	 910038542 ns/op	3415783504 B/op	 4248931 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 846392438 ns/op	3415940752 B/op	 4250580 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 922212542 ns/op	3415764984 B/op	 4248759 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 893194479 ns/op	3415929848 B/op	 4250460 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 888838542 ns/op	3415776144 B/op	 4248804 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 876409104 ns/op	3415805960 B/op	 4249196 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 924206646 ns/op	3415770592 B/op	 4248793 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 850870812 ns/op	3415931992 B/op	 4250439 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 884762875 ns/op	3415814584 B/op	 4249200 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 896129083 ns/op	3415959216 B/op	 4250795 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/consistent-hashing-10     	       2	 900787250 ns/op	3415772504 B/op	 4248838 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 930492312 ns/op	3415914392 B/op	 4250305 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 861972688 ns/op	3415985264 B/op	 4250985 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 872793875 ns/op	3415783328 B/op	 4248945 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 889231812 ns/op	3415923628 B/op	 4250333 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 875701146 ns/op	3415837688 B/op	 4249438 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 880286062 ns/op	3415972176 B/op	 4250887 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 879769188 ns/op	3415780680 B/op	 4248888 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	1264806020 ns/op	3415740260 B/op	 4248454 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       2	 993177812 ns/op	3415891352 B/op	 4250057 allocs/op
BenchmarkProcessTargetsWithRelabelConfig/per-node-10               	       1	1084294000 ns/op	3416297216 B/op	 4249305 allocs/op
PASS
ok  	github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator	177.920s

Bench results after lock per job

Signed-off-by: Nicolas Takashi <[email protected]>

swiatekm

Thank you for taking this up!

diranged · 2025-01-10T19:08:56Z

@nicolastakashi Should we expect the CPU usage of the targetallocator pods to go up significantly in large environments?

nicolastakashi · 2025-01-10T20:48:38Z

> @nicolastakashi Should we expect the CPU usage of the targetallocator pods to go up significantly in large environments?

Yes, this is due to the number of goroutines.
If you were previously using Prometheus, you may have noticed the same behaviour.
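
As a rough illustration of why CPU usage can rise: if each job's targets are processed in their own goroutine, many jobs are worked on at once instead of one after another, so more cores are busy at the same time. The sketch below assumes a hypothetical `processJobs` helper and a caller-supplied `build` function; neither is part of the operator's API.

```go
package main

import "sync"

// processJobs runs build once per job, each in its own goroutine, and collects
// the results. Names and types are illustrative only.
func processJobs(jobs map[string][]string, build func(job string, targets []string) int) map[string]int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]int, len(jobs))
	)
	for job, targets := range jobs {
		wg.Add(1)
		go func(job string, targets []string) {
			defer wg.Done()
			n := build(job, targets) // the expensive part runs in parallel
			mu.Lock()
			results[job] = n // short critical section for the shared map
			mu.Unlock()
		}(job, targets)
	}
	wg.Wait()
	return results
}
```

Total work stays about the same, but it is compressed into a shorter wall-clock window, which shows up as higher peak CPU.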

nicolastakashi requested a review from a team as a code owner December 5, 2024 15:09

nicolastakashi force-pushed the chore/improving-set-targets branch 2 times, most recently from b8ea4df to d219c9d Compare December 5, 2024 15:33

Revert "Support configuring java runtime from configmap or secret (en…

42eec92

…v.valueFrom)" (open-telemetry#3510) * Revert "Support configuring java runtime from configmap or secret (env.valueF…" This reverts commit 2b36f0d. * chlog (open-telemetry#3511)

nicolastakashi force-pushed the chore/improving-set-targets branch from d219c9d to 42eec92 Compare December 5, 2024 21:34

swiatekm reviewed Dec 6, 2024

nicolastakashi and others added 7 commits December 10, 2024 13:39

Merge branch 'main' into chore/improving-set-targets

0795d54

[CHORE] changing log level

d0e3563

Signed-off-by: Nicolas Takashi <[email protected]>

[CHORE] renaming method

d79d0d1

Signed-off-by: Nicolas Takashi <[email protected]>

[CHORE] adding change log entry

bc95872

Signed-off-by: Nicolas Takashi <[email protected]>

Merge branch 'main' into chore/improving-set-targets

7437bb8

[CHORE] locking targets per job

be48c0f

Signed-off-by: Nicolas Takashi <[email protected]>

Merge branch 'main' into chore/improving-set-targets

9e8264f

swiatekm reviewed Dec 17, 2024

Update .chloggen/discovering-target-async.yaml

2ef992e

Co-authored-by: Mikołaj Świątek <[email protected]>

[REFACTORY] applying comments

28880ab

Signed-off-by: Nicolas Takashi <[email protected]>

nicolastakashi requested a review from swiatekm December 18, 2024 14:16

nicolastakashi and others added 2 commits December 18, 2024 21:14

Merge branch 'main' into chore/improving-set-targets

3e5ae20

[CHORE] adding mutex back

42ae8d0

Signed-off-by: Nicolas Takashi <[email protected]>

swiatekm approved these changes Dec 19, 2024

swiatekm merged commit b387afd into open-telemetry:main Dec 19, 2024
38 checks passed
