Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache instruments so repeatedly creating identical instruments doesn't leak memory #4820

Merged
merged 6 commits into from
Jan 24, 2024

Conversation

dashpole
Copy link
Contributor

@dashpole dashpole commented Jan 10, 2024

Resolve #3527
Fixes open-telemetry/opentelemetry-go-contrib#4226.

This is one potential solution to the problem of creating multiple instances of an instrumentation library leaking memory. Alternatively, we could require users to cache the instrumentation library instance itself, as was done in googleapis/google-api-go-client#2329.

This works well for synchronous instruments, as making observations on the instrument from multiple instances of a library correctly aggregates across instances. But for async instruments, this is a bit more confusing. e.g. what should happen if I instantiate two instances of the runtime metrics library? Presumably the last observation should win?

This also only solves the problem of caching instruments themselves. The pattern of making multiple instances of an instrumentation library would still leak memory if each instance created and registered a new Callback. If RegisterCallback is used, the callback can be Unregistered(), but if the callback is passed as an option to the instrument, it will exist forever, which would also leak memory.

TODO:

  • Benchmarks
  • Cache callbacks as well

Copy link

codecov bot commented Jan 10, 2024

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (cef39a1) 82.3% compared to head (55829b3) 82.3%.

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #4820   +/-   ##
=====================================
  Coverage   82.3%   82.3%           
=====================================
  Files        226     226           
  Lines      18481   18557   +76     
=====================================
+ Hits       15222   15286   +64     
- Misses      2973    2983   +10     
- Partials     286     288    +2     
Files Coverage Δ
sdk/metric/cache.go 100.0% <100.0%> (ø)
sdk/metric/meter.go 90.4% <81.6%> (-2.2%) ⬇️

... and 1 file with indirect coverage changes

@MrAlias
Copy link
Contributor

MrAlias commented Jan 11, 2024

I think this looks like a decent proposal. Looking back through the record, we had originally tried to do this.

@dashpole
Copy link
Contributor Author

Ah, cool. I'd definitely like feedback from @MadVikingGod, then

sdk/metric/meter.go Outdated Show resolved Hide resolved
@MadVikingGod
Copy link
Contributor

Ah, cool. I'd definitely like feedback from @MadVikingGod, then

That work was pulled out because of added complexity, trying to get something working, and concerns over memory freeing.

At first glance, this looks to add caching in a simple way. I would suggest putting a benchmark to ensure that it performs better when reused and doesn't affect single-use too much.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 11, 2024

It isn't clear if the aggregators cache is still needed if we are caching instruments.

I think this is still going to be needed to ensure view modified instruments that result in the same aggregators get the same aggregators. But I haven't tested it.

@MrAlias
Copy link
Contributor

MrAlias commented Jan 11, 2024

The pattern of making multiple instances of an instrumentation library would still leak memory if each instance created and registered a new Callback.

This is a bit worrisome. Solving the issue to return the same instrument instance may lead users into thinking they can just make the same calls repeatedly.

Should we document that only first call using With*Callback will be honored? And users should use RegisterCallback after that?

@dashpole
Copy link
Contributor Author

@MrAlias
Copy link
Contributor

MrAlias commented Jan 18, 2024

This should resolve #3527

@dashpole dashpole force-pushed the duplicate_instruments branch from a805a5d to e87d2e5 Compare January 19, 2024 19:28
@MrAlias
Copy link
Contributor

MrAlias commented Jan 19, 2024

Looks like python logs a warning and ignores callbacks after the first: https://github.com/open-telemetry/opentelemetry-python/blob/975733c71473cddddd0859c6fcbd2b02405f7e12/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py#L171

+1 to this behavior being implemented here.

@dashpole dashpole force-pushed the duplicate_instruments branch from d27e99d to fd4d810 Compare January 19, 2024 21:16
sdk/metric/cache.go Outdated Show resolved Hide resolved
@dashpole
Copy link
Contributor Author

dashpole commented Jan 20, 2024

This now ignores callbacks after the first.

For benchmarking, If I run the existing BenchmarkInstrumentCreation, it shows a substantial increase in performance because it is now returning the cached version instead of making a new one each time:

From main

$ go test -bench=BenchmarkInstrumentCreation
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkInstrumentCreation-24    	   98368	     11940 ns/op	    3043 B/op	      69 allocs/op
PASS
ok  	go.opentelemetry.io/otel/sdk/metric	1.379s

This PR:

$ go test -bench=BenchmarkInstrumentCreation
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkInstrumentCreation-24    	  318759	      3187 ns/op	     480 B/op	       6 allocs/op
PASS
ok  	go.opentelemetry.io/otel/sdk/metric	1.144s

If I move NewMeterProvider and Meter() inside the loop (so that it isn't using the cached one), I get

From main

$ go test -bench=BenchmarkInstrumentCreation
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkInstrumentCreation-24    	   32074	     37227 ns/op	   18326 B/op	     225 allocs/op
PASS
ok  	go.opentelemetry.io/otel/sdk/metric	1.651s

This PR:

$ go test -bench=BenchmarkInstrumentCreation
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/otel/sdk/metric
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkInstrumentCreation-24    	   27579	     44411 ns/op	   23228 B/op	     237 allocs/op
PASS
ok  	go.opentelemetry.io/otel/sdk/metric	1.745s

Which is slightly slower, but that is partially because we are doing some extra instantiation of the caches in the MeterProvider.

@dashpole dashpole force-pushed the duplicate_instruments branch from fd4d810 to 80d90cf Compare January 22, 2024 14:58
@MrAlias MrAlias added this to the v1.23.0 milestone Jan 24, 2024
@MrAlias MrAlias merged commit 1978044 into open-telemetry:main Jan 24, 2024
25 checks passed
@dashpole dashpole deleted the duplicate_instruments branch January 24, 2024 15:48
@MrAlias MrAlias mentioned this pull request Feb 5, 2024
dmitryax added a commit to dmitryax/opentelemetry-go that referenced this pull request Jun 28, 2024
dmitryax added a commit to dmitryax/opentelemetry-go that referenced this pull request Jul 2, 2024
MrAlias added a commit that referenced this pull request Jul 12, 2024
…ith callbacks (#5606)

In #4820, I only
added a comment describing the behavior to `Int64ObservableCounter`, but
forgot other instruments. This adds the comment to all observable
instruments.

Fixes #5561

---------

Co-authored-by: Robert Pająk <[email protected]>
Co-authored-by: Tyler Yahn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants