JMX Gatherer collection interval increases with time #926

Closed
akats7 opened this issue Jun 16, 2023 · 20 comments

Comments

@akats7
Contributor

akats7 commented Jun 16, 2023

Component(s)

jmx-metrics

What happened?

Description

Hi all, we are using the JMX Gatherer with a custom Groovy script that has 60-70 instruments. We configure a default interval of 30s, but have noticed that as the gatherer runs for a prolonged period of time, the collection interval increases in steps of 30s (i.e., it becomes 60s -> 90s -> 120s -> ...). We have some more complex instruments for Kafka MBeans that also create multiple metrics per rule. On some of the instances we see about 7k metrics per collection interval.

Has this behavior been observed before, and has any load testing been done on the gatherer?

We're trying to gauge what the limit should be and are exploring a few options, such as adjusting our JVM params or potentially running multiple gatherers with narrowed instrument sets in parallel.

Wanted to get the team's thoughts, thanks.

Component version

1.26.0

Log output

N/A - no additional info in the logs outside of duplicate metric warnings.

Additional context

No response

@trask
Member

trask commented Jun 16, 2023

cc: jmx-metrics component owners @breedx-splk @Mrod1598 @rmfitzpatrick @dehaansa

@dehaansa
Contributor

Took a look at the code, and the runnable is scheduled with scheduleWithFixedDelay, which waits the configured collection interval between the end of one execution of the Groovy script and the start of the next.

We could change to use scheduleAtFixedRate, but in your described situation this would likely cause several overlapping collections to attempt to execute, causing more issues with the JVM, in addition to extra load on your Kafka targets. Given the desire to have the metrics gatherer only run one collection at a time, I don't think we should experiment with scheduleAtFixedRate.
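
For reference, a minimal sketch of the difference between the two scheduling calls (the executor, task, and 30-second period below are illustrative, not the gatherer's actual code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SchedulingSketch {
        public static void main(String[] args) {
            ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
            // Stand-in for one run of the Groovy script against the MBean server.
            Runnable collect = () -> System.out.println("collecting...");

            // scheduleWithFixedDelay: the 30s wait is measured from the END of the previous
            // run, so a collection that takes 20s yields ~50s between successive starts.
            exec.scheduleWithFixedDelay(collect, 0, 30, TimeUnit.SECONDS);

            // scheduleAtFixedRate (shown commented out): starts are nominally 30s apart; if a
            // run overshoots the period, the next one starts as soon as it finishes. On a
            // single-threaded executor runs never overlap, but they can bunch up back-to-back.
            // exec.scheduleAtFixedRate(collect, 0, 30, TimeUnit.SECONDS);
        }
    }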

As far as JVM resources or splitting into multiple gatherers, it's hard to give specific advice without understanding your systems & your custom script. Either might be helpful. If you're consistently seeing these delays, you could also reduce your collection interval to something small like 1 second, and then the collections will restart quickly after the previous one finishes, assuming they are consistently taking 30s+ to execute.

@rmfitzpatrick
Contributor

rmfitzpatrick commented Jun 16, 2023

@akats7 what are you using to determine that the collection interval is increasing periodically? The collection occurs in a single-threaded executor service's scheduleAtFixedRate(). The delay period is sourced from the configured otel.jmx.interval.milliseconds property, so no matter how long the collection runnable takes, it won't initiate another until the previous one has completed and the interval has elapsed.

That value is also used to set the otel.metric.export.interval property for the sdk, which I have a feeling may be related to what you're experiencing.
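
For concreteness, the two properties in play (30000 here is just the interval from this issue's configuration, not a recommendation):

    otel.jmx.interval.milliseconds=30000
    # the gatherer derives the SDK's export interval from the same value:
    otel.metric.export.interval=30000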

edit: Would you be able to provide some of your ~redacted script?

@rmfitzpatrick
Contributor

We could change to use scheduleAtFixedRate, but in your described situation this would likely cause several overlapping collections to attempt to execute, causing more issues with the JVM, in addition to extra load on your Kafka targets. Given the desire to have the metrics gatherer only run one collection at a time, I don't think we should experiment with scheduleAtFixedRate.

I agree this wouldn't be desirable for request load issues and would also create the burden of dealing with RejectedExecutionExceptions at arbitrary points in execution.

@akats7
Contributor Author

akats7 commented Jun 20, 2023

Hi all, thanks for the replies. Here is a Groovy script similar to the ones we've been using here.

Here are our session props (plus some additional ones for SSL):
otel.jmx.service.url=service:jmx:rmi:<JMX_URL>
otel.jmx.interval.milliseconds=30000
otel.jmx.remote.registry.ssl=true
otel.metrics.exporter=otlp
otel.exporter.otlp.timeout=30000
otel.exporter.otlp.endpoint=<ENDPOINT>

Also, we have noticed that the issue may be caused by a memory leak: we've been tracking the memory utilization of the gatherer's Java process and see that it has been incrementally increasing, and it had doubled after running for a couple of days.

@rmfitzpatrick @dehaansa @breedx-splk

@akats7
Contributor Author

akats7 commented Jun 20, 2023

@akats7 what are you using to determine that the collection interval is increasing periodically? The collection occurs in a single-threaded executor service's scheduleAtFixedRate(). The delay period is sourced from the configured otel.jmx.interval.milliseconds property, so no matter how long the collection runnable takes, it won't initiate another until the previous one has completed and the interval has elapsed.

That value is also used to set the otel.metric.export.interval property for the sdk, which I have a feeling may be related to what you're experiencing.

edit: Would you be able to provide some of your ~redacted script?

We look at how often the data is coming in to our observability backend, and we can see that the interval between data points is increasing.

@junhuangli

Glad to see we are not alone. We are seeing the same behavior, and after 10-ish days it stops sending metrics entirely. See details in #861.

@breedx-splk
Contributor

breedx-splk commented Jun 22, 2023

@akats7

We have some more complex instruments for kafka mbeans, that also create multiple metrics per rule. On some of the instances we see about 7k metrics per collection interval.

It sounds like there is a possibility that the longer your system runs, the more data points (metrics) are produced in a given collection. Is that correct? If so, that could account for increases in collection duration -- more metric data takes longer to collect. It feels like some kind of non-scalable metric explosion is happening here.

It also makes me wonder about a few other things:

  • Can you manually run the gatherer outside of your custom Groovy script to see if that changes the behavior?
  • Can you attach JConsole or another monitoring tool to the gatherer and watch the heap usage over time (20-30 minutes, maybe)? If so, send us a screenshot.
  • Can you take a heap dump of a long-running gatherer process to see where memory is being held? (Example commands below.)
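
For the heap dump, one way to capture it from a running gatherer process (the <pid> and output path are placeholders):

    jcmd <pid> GC.heap_dump /tmp/jmx-gatherer.hprof
    # or, limited to live (reachable) objects:
    jmap -dump:live,format=b,file=/tmp/jmx-gatherer.hprof <pid>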

I ran a test locally against petclinic for 15 minutes and couldn't see any obvious memory problems.

(screenshot: heap usage during the 15-minute local test)
java -Xmx128M \
  -Dotel.metrics.exporter=otlp \
  -Dotel.jmx.interval.milliseconds=2000 \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi \
  -Dotel.jmx.target.system=jvm \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar build/libs/opentelemetry-jmx-metrics-1.20.0-alpha-SNAPSHOT.jar

@akats7
Contributor Author

akats7 commented Jul 5, 2023

Hi all,

I looked into this further and it looks like the culprit is the constant creation of CallbackRegistration objects in the SDK. I initially figured that the objects were being created on each run of the script and registered in MeterSharedState, and that the issue was that the list was growing too large. However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

With the right intervals and number of metrics, it looks like the CallbackRegistration objects are created and discarded faster than GC can pick them up. When I did a heap dump, 97% of my heap was CallbackRegistration objects.

I ran a test with an extremely short collection interval to see if the heap would ramp up more quickly, and this is the behavior I saw.
(screenshots: heap usage ramping up under the short collection interval)

It also seems that if the interval is less than the time it takes to collect the metrics, the gatherer abandons the current run and starts the next run before registering all of the callbacks.

A question I have is why the mechanism constantly re-registers the callbacks rather than registering them once and then invoking them as needed. That would seemingly solve this problem.
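
To illustrate the question, here is a rough sketch (the metric name and the readFromMBean() helper are made up for illustration; only the Meter API calls are real):

    import io.opentelemetry.api.metrics.Meter;
    import io.opentelemetry.api.metrics.ObservableLongCounter;

    class CallbackPatterns {
        // What I would expect: register the callback once; the SDK invokes it on every
        // collection, so no new CallbackRegistration objects accumulate over time.
        static ObservableLongCounter registerOnce(Meter meter) {
            return meter.counterBuilder("kafka.request.count")
                .buildWithCallback(measurement -> measurement.record(readFromMBean()));
        }

        // What appears to be happening: this runs on every collection cycle, and each call
        // registers a new callback; unless the previous instrument is closed, every cycle
        // leaves another CallbackRegistration referenced by the SDK.
        static void collectOneCycle(Meter meter) {
            ObservableLongCounter counter = meter.counterBuilder("kafka.request.count")
                .buildWithCallback(measurement -> measurement.record(readFromMBean()));
            // counter.close() is never called here, so the registration is retained.
        }

        // Hypothetical stand-in for reading an attribute from a Kafka MBean.
        static long readFromMBean() {
            return 0L;
        }
    }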

I wanted to get your thoughts on this. Thanks!

@breedx-splk @Mrod1598 @rmfitzpatrick @dehaansa

@dehaansa
Contributor

dehaansa commented Jul 5, 2023

@akats7 that seems reasonable, though this behavior

I was seeing that it would abandon the current run and start the next run before registering all of the callbacks.

is not what I would expect from the code. I know @rmfitzpatrick implemented the change you're describing in PR #253, but it became stale; maybe it should be reopened.

@dehaansa
Contributor

dehaansa commented Jul 11, 2023

I believe this issue should be improved in my PR that addresses several other issues by refactoring how callbacks are created & executed: #949

I've personally seen the heap get GC'd successfully and not display memory leak issues, but it's possible there are other memory leaks present in the receiver that I haven't addressed. I know at least the repeated callbacks issue is no longer present in that PR.

(screenshot: heap usage with the PR #949 changes applied, showing successful GC)

@akats7
Contributor Author

akats7 commented Jul 11, 2023

Awesome, thanks @dehaansa! I'll do some testing as well

@jack-berg
Member

However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

Where in the source code is this happening?

@akats7
Contributor Author

akats7 commented Jul 11, 2023

However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

Where in the source code is this happening?

Hey @jack-berg, I'm now also not seeing that happening... I'm a bit confused myself, tbh. I'm now seeing that the callbackRegistrations list continuously grows, as I originally assumed. Perhaps I modified something myself while testing locally.

@jack-berg
Member

Well, in general a CallbackRegistration is created each time an async instrument callback is registered, and it's not dereferenced until the resulting instrument is closed. For example:

    // Creation of async instrument triggered creation of CallbackRegistration
    ObservableLongCounter asyncCounter = meter.counterBuilder("counter")
        .buildWithCallback(observableLongMeasurement -> observableLongMeasurement.record(10));

    // ... when instrument is eventually closed, the CallbackRegistration is dereferenced
    asyncCounter.close();

In a normal workflow, a small number of async instruments are typically created at application start. They are never closed and observe values from the same callback function for the lifecycle of the application.

I suspect the JMX Metric Gatherer may be continuously registering new callbacks without closing old ones.

@dehaansa
Contributor

#949 was merged and should be available in 1.29.0. It resolves the repeated-callbacks issue, as well as several others; let us know if this behavior persists once the new release is available.

@breedx-splk
Contributor

@akats7 have you had a chance to try this to confirm it's resolved for you?

@breedx-splk added the "needs author feedback" label on Sep 19, 2023
@github-actions
Contributor

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

@github-actions bot added the "Stale" label on Sep 26, 2023
@akats7
Contributor Author

akats7 commented Sep 27, 2023

Hey @breedx-splk, I was able to confirm that the memory leak has been resolved, but I have not had a chance to check that it also resolved the increase in the interval. I'm fairly sure the leak was the cause, but I will validate shortly.

@github-actions bot removed the "needs author feedback" label on Sep 27, 2023
@breedx-splk
Contributor

Hey @akats7 (and others). It looks like this has gone stale, which I think is a pretty good indication that this is no longer an issue for folks. If it is, please feel free to reopen with new testing results, but for now I'm going to close. Thanks!
