feat(serializer.prometheusremote): improve performance #12971

mhoffm-aiven
Contributor

@mhoffm-aiven mhoffm-aiven commented Mar 28, 2023

Required for all PRs

resolves #12974

@mhoffm-aiven mhoffm-aiven force-pushed the mhoffm-promremote-performance-improvements branch from 5fe88ef to 902c410 Compare March 28, 2023 12:32
@mhoffm-aiven mhoffm-aiven changed the title from "Mhoffm promremote performance improvements" to "feat(serializer.prometheusremote): improve performance" Mar 28, 2023
@mhoffm-aiven mhoffm-aiven marked this pull request as ready for review March 28, 2023 12:33
@powersj
Contributor

powersj commented Mar 28, 2023

Hi @mhoffm-aiven,

Thanks for the PR. Is there an issue that goes along with this? If not, could you file one please?

Also, it looks like there are a few lint issues:

  plugins/serializers/prometheus/convert.go:58:6                                 unused    func `isValid` is unused
  plugins/serializers/prometheusremotewrite/prometheusremotewrite.go:56:80       revive    empty-lines: extra empty line at the end of a block
  plugins/serializers/prometheusremotewrite/prometheusremotewrite_test.go:37:3   revive    unhandled-error: Unhandled error in call to function s.SerializeBatch
  plugins/serializers/prometheusremotewrite/prometheusremotewrite_test.go:37:19  errcheck  Error return value of `s.SerializeBatch` is not checked

Thanks!

@mhoffm-aiven mhoffm-aiven force-pushed the mhoffm-promremote-performance-improvements branch from 902c410 to 4632e6a Compare March 28, 2023 15:41
@mhoffm-aiven
Contributor Author

Hey @powersj,

No issue yet (I'll create one in a minute); I was investigating performance issues with our Telegraf promremote output and did some profiling.

@powersj
Contributor

powersj commented Mar 28, 2023

i was investigating performance issues with our telegraf promremote output and did some profiling.

OK, thanks. If you could include some of that data, I would appreciate it! Thanks again!

@mhoffm-aiven
Contributor Author

mhoffm-aiven commented Mar 28, 2023

I ran the benchmark from this PR locally against master:

 fedora  ~  git  …  plugins  serializers  prometheusremotewrite  master 
$ go test -run=^$ -bench=. -benchtime 30s -benchmem -memprofile memprofile.out -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkRemoteWrite-8   	   38106	    924518 ns/op	  648409 B/op	    7005 allocs/op
PASS
ok  	github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite	44.943s

and against this PR:

$ go test -run=^$ -bench=. -benchtime 30s -benchmem -memprofile memprofile.out -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkRemoteWrite-8   	   64338	    539615 ns/op	  344826 B/op	    4007 allocs/op
PASS
ok  	github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite	40.534s

The main offenders were copied slices (I fixed this by allocating once and truncating on each loop iteration) and sort.Slice, which I replaced with an in-place sort.Sort. Surprisingly, unicode.Is was also pretty high up in the profile. Let me run against both branches and upload profiles a bit later on.
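
A minimal sketch of the two techniques named above, not the PR's actual code: a scratch slice allocated once and truncated with `buf[:0]` on each iteration, and a `sort.Interface` implementation passed to `sort.Sort` so that no reflection-based swapper (the `reflectlite.Swapper` that `sort.Slice` uses) is needed. The `label` type, `byName`, and `sortedLabels` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// label is a hypothetical stand-in for a Prometheus label pair.
type label struct {
	Name, Value string
}

// byName implements sort.Interface, so sort.Sort can swap elements
// directly instead of going through a reflection-based swapper.
type byName []label

func (l byName) Len() int           { return len(l) }
func (l byName) Less(i, j int) bool { return l[i].Name < l[j].Name }
func (l byName) Swap(i, j int)      { l[i], l[j] = l[j], l[i] }

// sortedLabels refills buf with the given tags, sorted by name.
// It reuses buf's backing array whenever the capacity is sufficient.
func sortedLabels(buf []label, tags map[string]string) []label {
	buf = buf[:0] // truncate: keep the capacity, drop the old contents
	for k, v := range tags {
		buf = append(buf, label{Name: k, Value: v})
	}
	sort.Sort(byName(buf))
	return buf
}

func main() {
	batch := []map[string]string{
		{"host": "a", "cpu": "0"},
		{"region": "eu", "host": "b"},
	}
	buf := make([]label, 0, 8) // allocated once, outside the loop
	for _, tags := range batch {
		buf = sortedLabels(buf, tags)
		fmt.Println(buf)
	}
}
```

Note that the returned slice aliases the scratch buffer, so each result has to be consumed or copied before the next iteration overwrites it.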

@mhoffm-aiven
Contributor Author

mhoffm-aiven commented Mar 29, 2023

Top allocs with the benchmark on master (notice reflectlite.Swapper and createLabels):

Type: alloc_space
Time: Mar 29, 2023 at 10:37am (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 28871.76MB, 99.93% of 28890.89MB total
Dropped 48 nodes (cum <= 144.45MB)
      flat  flat%   sum%        cum   cum%
14601.41MB 50.54% 50.54% 19582.68MB 67.78%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.getPromTS
 8599.07MB 29.76% 80.30%  8599.07MB 29.76%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).createLabels
 4981.27MB 17.24% 97.55%  4981.27MB 17.24%  internal/reflectlite.Swapper
  688.01MB  2.38% 99.93%   688.01MB  2.38%  github.com/influxdata/telegraf/plugins/serializers/prometheus.MetricName
       2MB 0.0069% 99.93% 28883.26MB   100%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).SerializeBatch
         0     0% 99.93% 28883.26MB   100%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.BenchmarkRemoteWrite
         0     0% 99.93%  4981.27MB 17.24%  sort.Slice
         0     0% 99.93% 28882.76MB   100%  testing.(*B).launch
         0     0% 99.93% 28883.26MB   100%  testing.(*B).runN

top 10 on this PR:

Type: alloc_space
Time: Mar 29, 2023 at 10:40am (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 23.15GB, 99.74% of 23.21GB total
Dropped 52 nodes (cum <= 0.12GB)
      flat  flat%   sum%        cum   cum%
   22.02GB 94.87% 94.87%    22.02GB 94.87%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.getPromTS
    1.12GB  4.81% 99.68%     1.12GB  4.81%  github.com/influxdata/telegraf/plugins/serializers/prometheus.MetricName
    0.01GB 0.055% 99.74%    23.20GB 99.95%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).SerializeBatch
         0     0% 99.74%    23.20GB 99.95%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.BenchmarkRemoteWrite
         0     0% 99.74%    23.20GB 99.95%  testing.(*B).launch
         0     0% 99.74%    23.20GB 99.95%  testing.(*B).runN

cpu profile on master:

Type: cpu
Time: Mar 29, 2023 at 10:37am (CEST)
Duration: 44.63s, Total samples = 63.74s (142.82%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 27.73s, 43.50% of 63.74s total
Dropped 268 nodes (cum <= 0.32s)
Showing top 10 nodes out of 114
      flat  flat%   sum%        cum   cum%
     4.77s  7.48%  7.48%     17.97s 28.19%  runtime.mallocgc
     3.43s  5.38% 12.86%      4.05s  6.35%  runtime.heapBitsSetType
     3.39s  5.32% 18.18%      3.40s  5.33%  hash/fnv.(*sum64a).Write
     3.21s  5.04% 23.22%     10.36s 16.25%  runtime.scanobject
     2.52s  3.95% 27.17%      2.52s  3.95%  runtime.nextFreeFast (inline)
     2.33s  3.66% 30.83%      4.40s  6.90%  unicode.Is
     2.07s  3.25% 34.08%      2.07s  3.25%  unicode.is16
     2.04s  3.20% 37.28%      8.71s 13.66%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).createLabels
     2.03s  3.18% 40.46%      7.27s 11.41%  github.com/influxdata/telegraf/plugins/serializers/prometheus.isValid
     1.94s  3.04% 43.50%      1.94s  3.04%  runtime.memmove

and on this PR:

Type: cpu
Time: Mar 29, 2023 at 10:40am (CEST)
Duration: 40.91s, Total samples = 56.60s (138.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 27.17s, 48.00% of 56.60s total
Dropped 227 nodes (cum <= 0.28s)
Showing top 10 nodes out of 106
      flat  flat%   sum%        cum   cum%
     5.44s  9.61%  9.61%      5.44s  9.61%  hash/fnv.(*sum64a).Write
     4.14s  7.31% 16.93%     15.25s 26.94%  runtime.mallocgc
     3.57s  6.31% 23.23%      4.19s  7.40%  runtime.heapBitsSetType
     2.86s  5.05% 28.29%      3.64s  6.43%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).appendCommonLabels
     2.51s  4.43% 32.72%      7.88s 13.92%  runtime.scanobject
     2.01s  3.55% 36.27%      2.01s  3.55%  runtime.nextFreeFast (inline)
     1.91s  3.37% 39.65%      1.91s  3.37%  runtime.memmove
     1.74s  3.07% 42.72%      1.78s  3.14%  runtime.pageIndexOf (inline)
     1.52s  2.69% 45.41%     42.41s 74.93%  github.com/influxdata/telegraf/plugins/serializers/prometheusremotewrite.(*Serializer).SerializeBatch
     1.47s  2.60% 48.00%      1.47s  2.60%  runtime.futex

So my theory was that getting rid of some of the allocations and using sort.Sort, which avoids reflection, would improve performance.
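
The master CPU profile above also lists unicode.Is and unicode.is16 next to prometheus.isValid. One common way to avoid that cost, sketched here as an assumption rather than as what this PR does, is to validate names with plain byte comparisons. The function `isValidLabelName` is a hypothetical name; it checks the Prometheus label-name pattern `[a-zA-Z_][a-zA-Z0-9_]*` (metric names additionally allow `:`).

```go
package main

import "fmt"

// isValidLabelName reports whether s matches [a-zA-Z_][a-zA-Z0-9_]*.
// It compares bytes directly, so it never consults Unicode range tables.
// Bytes of multi-byte UTF-8 sequences are >= 0x80 and hit the default case.
func isValidLabelName(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c == '_':
			// always allowed
		case c >= '0' && c <= '9':
			if i == 0 {
				return false // digits are not allowed in first position
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"host", "_cpu0", "0bad", "héllo", ""} {
		fmt.Printf("%q valid=%v\n", name, isValidLabelName(name))
	}
}
```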

Contributor

@powersj powersj left a comment

Thanks for the PR. Have you been running this successfully? A couple of questions inline.

plugins/serializers/prometheus/convert.go (review thread, resolved)
plugins/serializers/prometheus/convert.go (review thread, outdated, resolved)
@powersj powersj self-assigned this Mar 30, 2023
@mhoffm-aiven mhoffm-aiven force-pushed the mhoffm-promremote-performance-improvements branch from 4632e6a to 1b4659b Compare March 31, 2023 08:36
@MichaHoffmann

Thanks for the PR. Have you been running this successfully? Couple questions in line.

No, this has not been running in our production yet.

Contributor

@powersj powersj left a comment

Thanks!

@powersj powersj assigned srebhan and unassigned powersj Mar 31, 2023
@powersj powersj added the ready for final review label Mar 31, 2023
Member

@srebhan srebhan left a comment

That's an awesome improvement, @mhoffm-aiven! Thank you for this! I do have one comment, though: the batch serialization will now break if called concurrently. Please also find a suggestion inline...

plugins/serializers/prometheus/convert.go (review thread, outdated, resolved)
plugins/serializers/prometheus/convert.go (review thread, outdated, resolved)
@mhoffm-aiven mhoffm-aiven force-pushed the mhoffm-promremote-performance-improvements branch 2 times, most recently from 5cd25df to a339b41 Compare April 3, 2023 13:38
@mhoffm-aiven mhoffm-aiven force-pushed the mhoffm-promremote-performance-improvements branch from a339b41 to b696d68 Compare April 3, 2023 13:44
@telegraf-tiger
Contributor

telegraf-tiger bot commented Apr 3, 2023

Member

@srebhan srebhan left a comment

Looks good to me. Thanks for your effort @mhoffm-aiven!

@srebhan srebhan requested a review from powersj April 3, 2023 18:24
@srebhan
Member

srebhan commented Apr 3, 2023

@powersj can you please take another look after the changes?

@powersj powersj merged commit 99ea0b1 into influxdata:master Apr 3, 2023
@srebhan srebhan added this to the v1.27.0 milestone Jun 21, 2023
Labels
ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Development

Successfully merging this pull request may close these issues.

prometheusremote: improve performance
4 participants