JMX Gatherer collection interval increases with time #926

Closed
akats7 opened this issue Jun 16, 2023 · 20 comments

Comments

@akats7
Contributor

akats7 commented Jun 16, 2023

Component(s)

jmx-metrics

What happened?

Description

Hi all, we are using the JMX Gatherer with a custom Groovy script that has 60-70 instruments. We configure a default interval of 30s, but have noticed that as the gatherer runs for a prolonged period of time, the collection interval increases in steps of 30s (i.e., it becomes 60s -> 90s -> 120s -> ...). We have some more complex instruments for Kafka MBeans that also create multiple metrics per rule. On some of the instances we see about 7k metrics per collection interval.

Has this behavior been observed before, and has any load testing been done on the gatherer?

We're trying to gauge what the limit should be and are exploring a few options, such as adjusting our JVM params or potentially running multiple gatherers with narrowed instrument sets in parallel.

Wanted to get the team's thoughts, thanks.

Component version

1.26.0

Log output

N/A - no additional info in the logs outside of duplicate metric warnings.

Additional context

No response

@trask
Member

trask commented Jun 16, 2023

cc: jmx-metrics component owners @breedx-splk @Mrod1598 @rmfitzpatrick @dehaansa

@dehaansa
Contributor

Took a look at the code, and the runnable is scheduled with scheduleWithFixedDelay, which waits the configured collection interval between the end of one execution of the Groovy script and the start of the next.

We could change to use scheduleAtFixedRate, but in your described situation this would likely cause several overlapping collections to attempt to execute, causing more issues with the JVM, in addition to extra load on your Kafka targets. Given the desire to have the metrics gatherer only run one collection at a time, I don't think we should experiment with scheduleAtFixedRate.
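
For reference, a minimal sketch of the difference between the two scheduling calls (the executor, task, and 30-second period below are illustrative, not the gatherer's actual code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SchedulingSketch {
        public static void main(String[] args) {
            ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
            // Stand-in for one run of the Groovy script against the MBean server.
            Runnable collect = () -> System.out.println("collecting...");

            // scheduleWithFixedDelay: the 30s wait is measured from the END of the previous
            // run, so a collection that takes 20s yields ~50s between successive starts.
            exec.scheduleWithFixedDelay(collect, 0, 30, TimeUnit.SECONDS);

            // scheduleAtFixedRate (shown commented out): starts are nominally 30s apart; if a
            // run overshoots the period, the next one starts as soon as it finishes. On a
            // single-threaded executor runs never overlap, but they can bunch up back-to-back.
            // exec.scheduleAtFixedRate(collect, 0, 30, TimeUnit.SECONDS);
        }
    }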

As far as JVM resources or splitting into multiple gatherers, it's hard to give specific advice without understanding your systems & your custom script. Either might be helpful. If you're consistently seeing these delays, you could also reduce your collection interval to something small like 1 second, and then the collections will restart quickly after the previous one finishes, assuming they are consistently taking 30s+ to execute.

@rmfitzpatrick
Contributor

rmfitzpatrick commented Jun 16, 2023

@akats7 what are you using to determine that the collection interval is increasing periodically? The collection occurs in a single-threaded executor service's scheduleAtFixedRate(). The delay period is sourced from the configured otel.jmx.interval.milliseconds property, so no matter how long the collection runnable takes, it won't initiate another until the previous one has completed and the interval has elapsed.

That value is also used to set the otel.metric.export.interval property for the sdk, which I have a feeling may be related to what you're experiencing.
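
For concreteness, the two properties in play (30000 here is just the interval from this issue's configuration, not a recommendation):

    otel.jmx.interval.milliseconds=30000
    # the gatherer derives the SDK's export interval from the same value:
    otel.metric.export.interval=30000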

edit: Would you be able to provide some of your ~redacted script?

@rmfitzpatrick
Contributor

We could change to use scheduleAtFixedRate, but in your described situation this would likely cause several overlapping collections to attempt to execute, causing more issues with the JVM, in addition to extra load on your Kafka targets. Given the desire to have the metrics gatherer only run one collection at a time, I don't think we should experiment with scheduleAtFixedRate.

I agree this wouldn't be desirable for request load issues and would also create the burden of dealing with RejectedExecutionExceptions at arbitrary points in execution.

@akats7
Contributor Author

akats7 commented Jun 20, 2023

Hi all, thanks for the replies. Here is a Groovy script similar to the ones we've been using here.

Here are our session props (plus some additional ones for SSL):
otel.jmx.service.url=service:jmx:rmi:<JMX_URL>
otel.jmx.interval.milliseconds=30000
otel.jmx.remote.registry.ssl=true
otel.metrics.exporter=otlp
otel.exporter.otlp.timeout=30000
otel.exporter.otlp.endpoint=<ENDPOINT>

Also, we have noticed that the issue may be caused by a memory leak: we've been tracking the memory utilization of the gatherer's Java process and see that it has been incrementally increasing, and it had doubled after running for a couple of days.

@rmfitzpatrick @dehaansa @breedx-splk

@akats7
Contributor Author

akats7 commented Jun 20, 2023

@akats7 what are you using to determine that the collection interval is increasing periodically? The collection occurs in a single-threaded executor service's scheduleAtFixedRate(). The delay period is sourced from the configured otel.jmx.interval.milliseconds property, so no matter how long the collection runnable takes, it won't initiate another until the previous one has completed and the interval has elapsed.

That value is also used to set the otel.metric.export.interval property for the sdk, which I have a feeling may be related to what you're experiencing.

edit: Would you be able to provide some of your ~redacted script?

We look at how often the data is coming in to our observability backend, and we can see that the interval between data points is increasing.

@junhuangli

Glad to see we are not alone. We are seeing the same behavior, and after 10-ish days it stops sending metrics entirely. See details in #861.

@breedx-splk
Contributor

breedx-splk commented Jun 22, 2023

@akats7

We have some more complex instruments for kafka mbeans, that also create multiple metrics per rule. On some of the instances we see about 7k metrics per collection interval.

It sounds like there is a possibility that the longer your system runs, the more data points (metrics) are produced in a given collection. Is that correct? If so, that could account for increases in collection duration -- more metric data takes longer to collect. It feels like some kind of non-scalable metric explosion is happening here.

It also makes me wonder about a few other things:

  • Can you manually run the gatherer outside of your custom Groovy script to see if that changes the behavior?
  • Can you attach JConsole or another monitoring tool to the gatherer and watch the heap usage over time (20-30 minutes, maybe)? If so, send us a screenshot.
  • Can you take a heap dump of a long-running gatherer process to see where memory is being held? (Example commands below.)
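
For the heap dump, one way to capture it from a running gatherer process (the <pid> and output path are placeholders):

    jcmd <pid> GC.heap_dump /tmp/jmx-gatherer.hprof
    # or, limited to live (reachable) objects:
    jmap -dump:live,format=b,file=/tmp/jmx-gatherer.hprof <pid>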

I ran a test locally against petclinic for 15 minutes and couldn't see any obvious memory problems.

(screenshot: heap usage during the 15-minute local test)
java -Xmx128M \
  -Dotel.metrics.exporter=otlp \
  -Dotel.jmx.interval.milliseconds=2000 \
  -Dotel.jmx.service.url=service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi \
  -Dotel.jmx.target.system=jvm \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar build/libs/opentelemetry-jmx-metrics-1.20.0-alpha-SNAPSHOT.jar

@akats7
Contributor Author

akats7 commented Jul 5, 2023

Hi all,

I looked into this further and it looks like the culprit is the constant creation of CallbackRegistration objects in the SDK. I initially figured that the objects were being created on each run of the script and registered in MeterSharedState, and that the issue was that the list was growing too large. However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

With the right intervals and number of metrics, it looks like the CallbackRegistration objects are created and discarded faster than GC can pick them up. When I did a heap dump, 97% of my heap was CallbackRegistration objects.

I ran a test with an extremely short collection interval to see if the heap would ramp up more quickly, and this is the behavior I saw.
(screenshots: heap usage ramping up under the short collection interval)

It also seems that if the interval is less than the time it takes to collect the metrics, the gatherer abandons the current run and starts the next run before registering all of the callbacks.

A question I have is why the mechanism constantly re-registers the callbacks rather than registering them once and then invoking them as needed. That would seemingly solve this problem.
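
To illustrate the question, here is a rough sketch (the metric name and the readFromMBean() helper are made up for illustration; only the Meter API calls are real):

    import io.opentelemetry.api.metrics.Meter;
    import io.opentelemetry.api.metrics.ObservableLongCounter;

    class CallbackPatterns {
        // What I would expect: register the callback once; the SDK invokes it on every
        // collection, so no new CallbackRegistration objects accumulate over time.
        static ObservableLongCounter registerOnce(Meter meter) {
            return meter.counterBuilder("kafka.request.count")
                .buildWithCallback(measurement -> measurement.record(readFromMBean()));
        }

        // What appears to be happening: this runs on every collection cycle, and each call
        // registers a new callback; unless the previous instrument is closed, every cycle
        // leaves another CallbackRegistration referenced by the SDK.
        static void collectOneCycle(Meter meter) {
            ObservableLongCounter counter = meter.counterBuilder("kafka.request.count")
                .buildWithCallback(measurement -> measurement.record(readFromMBean()));
            // counter.close() is never called here, so the registration is retained.
        }

        // Hypothetical stand-in for reading an attribute from a Kafka MBean.
        static long readFromMBean() {
            return 0L;
        }
    }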

I wanted to get your thoughts on this. Thanks!

@breedx-splk @Mrod1598 @rmfitzpatrick @dehaansa

@dehaansa
Contributor

dehaansa commented Jul 5, 2023

@akats7 that seems reasonable, though this behavior

I was seeing that it would abandon the current run and start the next run before registering all of the callbacks.

is not what I would expect from the code. I know @rmfitzpatrick implemented the change you're describing in PR #253, but it became stale; maybe it should be reopened.

@dehaansa
Contributor

dehaansa commented Jul 11, 2023

I believe this issue should be improved in my PR that addresses several other issues by refactoring how callbacks are created & executed: #949

I've personally seen the heap get GC'd successfully and not display memory leak issues, but it's possible there are other memory leaks present in the receiver that I haven't addressed. I know at least the repeated callbacks issue is no longer present in that PR.

(screenshot: heap usage with the PR #949 changes applied, showing successful GC)

@akats7
Contributor Author

akats7 commented Jul 11, 2023

Awesome, thanks @dehaansa! I'll do some testing as well

@jack-berg
Member

However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

Where in the source code is this happening?

@akats7
Contributor Author

akats7 commented Jul 11, 2023

However, the callbackRegistrations list is cleared each time the metrics are collected, using the resetForTest method (which likely should either be renamed or not be used in the SdkMeter source).

Where in the source code is this happening?

Hey @jack-berg, I'm now also not seeing that happening... I'm a bit confused myself, tbh. I'm now seeing that the callbackRegistrations list continuously grows, as I originally assumed. Perhaps I modified something myself while testing locally.

@jack-berg
Member

Well, in general a CallbackRegistration is created each time an async instrument callback is registered, and it's not dereferenced until the resulting instrument is closed. For example:

    // Creation of async instrument triggered creation of CallbackRegistration
    ObservableLongCounter asyncCounter = meter.counterBuilder("counter")
        .buildWithCallback(observableLongMeasurement -> observableLongMeasurement.record(10));

    // ... when instrument is eventually closed, the CallbackRegistration is dereferenced
    asyncCounter.close();

In a normal workflow, a small number of async instruments are typically created at application start. They are never closed and observe values from the same callback function for the lifecycle of the application.

I suspect the JMX Metric Gatherer may be continuously registering new callbacks without closing old ones.

@dehaansa
Contributor

#949 was merged and should be available in 1.29.0. It resolves the repeated-callbacks issue, as well as several others; let us know if this behavior persists once the new release is available.

@breedx-splk
Contributor

@akats7 have you had a chance to try this to confirm it's resolved for you?

@breedx-splk added the "needs author feedback" label on Sep 19, 2023
@github-actions
Contributor

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

@github-actions bot added the "Stale" label on Sep 26, 2023
@akats7
Contributor Author

akats7 commented Sep 27, 2023

Hey @breedx-splk, I was able to confirm that the memory leak has been resolved, but I have not had a chance to check that it also resolved the increase in the interval. I'm fairly sure the leak was the cause, but I will validate shortly.

@github-actions bot removed the "needs author feedback" label on Sep 27, 2023
@breedx-splk
Contributor

Hey @akats7 (and others). It looks like this has gone stale, which I think is a pretty good indication that this is no longer an issue for folks. If it is, please feel free to reopen with new testing results, but for now I'm going to close. Thanks!
