
JVM metrics from JMX monitoring stay up after the JMX port goes down on a VM #1183

Open
kaio-ru opened this issue Apr 3, 2023 · 1 comment

kaio-ru commented Apr 3, 2023

I have a few VMs in Compute Engine running Java applications on RHEL 8. I can monitor them over JMX, and I would like to set up an alert so that when one of those processes dies, Google Monitoring sends out an alarm that the JMX port is down.

Describe the bug
I have set up an alerting policy on workload.googleapis.com/jvm.threads.count for this. When I kill one of the processes, the opentelemetry-collector keeps behaving as if the JMX port were still up. A process java -Dorg.slf4j.simpleLogger.defaultLogLevel=info io.opentelemetry.contrib.jmxmetrics.JmxMetrics -config /tmp/jmx-config-3146301998.properties is still running, and it seems to keep feeding this data to GCP. Only when I restart the google-cloud-ops-agent-opentelemetry-collector service on the server does jvm.threads.count drop and the alarm get sent out.

/tmp/jmx-config-3146301998.properties file:

otel.exporter.otlp.endpoint = http://0.0.0.0:42861
otel.exporter.otlp.timeout = 5000
otel.jmx.interval.milliseconds = 60000
otel.jmx.service.url = service:jmx:rmi:///jndi/rmi://localhost:8565/jmxrmi
otel.jmx.target.system = jvm
otel.metrics.exporter = otlp
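
For reference, here is a minimal check that can be run on the VM to confirm that the JMX RMI registry port from otel.jmx.service.url (localhost:8565 above) really is closed once the Java application is stopped. This is only a plain TCP connect sketch, not a full JMX handshake:

import socket

def jmx_port_open(host: str = "localhost", port: int = 8565, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the JMX RMI registry port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # After killing the Java application this prints False, even though
    # jvm.threads.count stays above 1 in Google Monitoring.
    print("JMX port open:", jmx_port_open())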

To Reproduce
Steps to reproduce the behavior:

  1. Set up JMX monitoring and an alert on the jvm.threads.count metric: if the metric is above 1, all is good; if it is not over 1, send out an alarm (a sketch of such a policy follows this list)
  2. Start the Java application; the metric is over 1
  3. Stop the Java application; the JMX port on the VM goes down, but the metric stays above 1 in Google Monitoring
  4. Restart google-cloud-ops-agent-opentelemetry-collector.service
  5. jvm.threads.count drops in Google Monitoring as well and the alarm is triggered
  6. Start the Java application; the metric goes back over 1 and the alarm clears
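
For step 1, the alerting policy looks roughly like the sketch below. This is a hypothetical example using the google-cloud-monitoring Python client (v2+); the exact filter, comparison, and 60s alignment are illustrative, and notification channels are omitted:

from google.cloud import monitoring_v3

PROJECT_ID = "demograft"  # project from the Environment section below

client = monitoring_v3.AlertPolicyServiceClient()

# Fire when jvm.threads.count is not above 1 (illustrative threshold/comparison).
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="jvm.threads.count not above 1",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "workload.googleapis.com/jvm.threads.count" '
            'AND resource.type = "gce_instance"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_LT,
        threshold_value=1,
        duration={"seconds": 0},  # evaluate the most recent aligned point
        aggregations=[
            monitoring_v3.Aggregation(
                {
                    "alignment_period": {"seconds": 60},
                    "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                }
            )
        ],
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="JMX / JVM process down",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print("Created policy:", created.name)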

(Screenshot: Test dashboard, Monitoring, Google Cloud console, Mozilla Firefox)

Expected behavior
The alarm should be triggered without having to restart google-cloud-ops-agent-opentelemetry-collector.service.
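
To check whether the collector is still writing points after the application has been stopped, the raw time series can also be pulled from the Cloud Monitoring API. A rough sketch (the 30-minute window is arbitrary):

import time

from google.cloud import monitoring_v3

PROJECT_ID = "demograft"  # project from the Environment section below

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 1800},  # last 30 minutes
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "workload.googleapis.com/jvm.threads.count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# If points keep arriving with values above 1 after the Java process was
# killed, the collector is still exporting stale JMX data.
for ts in results:
    instance = ts.resource.labels.get("instance_id", "?")
    for point in ts.points:
        print(instance, point.interval.end_time, point.value)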

Environment (please complete the following information):

  • Project ID: demograft
  • VM ID: 197175560824347001
  • VM distro / OS: RHEL 8
  • Ops Agent version: 2.29.0-1.el8
  • Ops Agent configuration
    metrics:
      receivers:
        jvm:
          type: jvm
          endpoint: localhost:8565
          collection_interval: 60s
      service:
        pipelines:
          jvm:
            receivers:
              - jvm
  • Ops Agent log
    health-checks.log
2023/03/31 11:23:38 api_check.go:114: logging client was created successfully
2023/03/31 11:23:38 api_check.go:146: monitoring client was created successfully
2023/03/31 11:23:38 healthchecks.go:78: API Check - Result: PASS
2023/04/03 08:26:44 ports_check.go:60: listening to 0.0.0.0:20202:
2023/04/03 08:26:44 ports_check.go:70: listening to 0.0.0.0:20201:
2023/04/03 08:26:44 ports_check.go:79: listening to [::]:20201:
2023/04/03 08:26:44 healthchecks.go:78: Ports Check - Result: PASS
2023/04/03 08:26:44 healthchecks.go:78: Network Check - Result: ERROR, Detail: Get "https://logging.googleapis.com/$discovery/rest": dial tcp: lookup logging.googleapis.com on 169.254.169.254:53: dial udp 169.254.169.254:53: connect: network is unreachable
2023/04/03 08:26:44 healthchecks.go:78: API Check - Result: ERROR, Detail: can't get GCE metadata: can't get resource metadata: Get "http://169.254.169.254/computeMetadata/v1/instance/zone": dial tcp 169.254.169.254:80: connect: network is unreachable

logging-module.log

[2023/04/03 08:26:47] [ info] [fluent bit] version=2.0.10, commit=, pid=1243
[2023/04/03 08:26:47] [ info] [storage] ver=1.4.0, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=128
[2023/04/03 08:26:47] [ info] [storage] backlog input plugin: storage_backlog.4
[2023/04/03 08:26:47] [ info] [cmetrics] version=0.5.8
[2023/04/03 08:26:47] [ info] [ctraces ] version=0.2.7
[2023/04/03 08:26:47] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] initializing
[2023/04/03 08:26:47] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] storage_strategy='memory' (memory only)
[2023/04/03 08:26:47] [ info] [input:tail:tail.1] initializing
[2023/04/03 08:26:47] [ info] [input:tail:tail.1] storage_strategy='filesystem' (memory + filesystem)
[2023/04/03 08:26:47] [ info] [input:tail:tail.2] initializing
[2023/04/03 08:26:47] [ info] [input:tail:tail.2] storage_strategy='filesystem' (memory + filesystem)
[2023/04/03 08:26:47] [ info] [input:tail:tail.2] multiline core started
[2023/04/03 08:26:47] [ info] [input:tail:tail.3] initializing
[2023/04/03 08:26:47] [ info] [input:tail:tail.3] storage_strategy='memory' (memory only)
[2023/04/03 08:26:47] [ info] [input:storage_backlog:storage_backlog.4] initializing
[2023/04/03 08:26:47] [ info] [input:storage_backlog:storage_backlog.4] storage_strategy='memory' (memory only)
[2023/04/03 08:26:47] [ info] [input:storage_backlog:storage_backlog.4] queue memory limit: 47.7M
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2023/04/03 08:26:47] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2023/04/03 08:26:47] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #2 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #3 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #1 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #4 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #5 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #6 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] metadata_server set to http://metadata.google.internal
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.0] worker #7 started
[2023/04/03 08:26:47] [ warn] [output:stackdriver:stackdriver.1] client_email is not defined, using a default one
[2023/04/03 08:26:47] [ warn] [output:stackdriver:stackdriver.1] private_key is not defined, fetching it from metadata server
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #0 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #2 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #1 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #5 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #6 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #7 started
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #3 started
[2023/04/03 08:26:47] [ info] [output:prometheus_exporter:prometheus_exporter.2] listening iface=0.0.0.0 tcp_port=20202
[2023/04/03 08:26:47] [ info] [output:stackdriver:stackdriver.1] worker #4 started
[2023/04/03 08:26:47] [ info] [input:tail:tail.1] inotify_fs_add(): inode=50547303 watch_fd=1 name=/var/log/messages
[2023/04/03 08:26:47] [ info] [input:tail:tail.2] inotify_fs_add(): inode=426960 watch_fd=1 name=/opt/**-**-******-front/logs/**-*******.log
[2023/04/03 08:26:48] [ info] [input:tail:tail.3] inotify_fs_add(): inode=689166 watch_fd=1 name=/var/log/google-cloud-ops-agent/subagents/logging-module.log
[2023/04/03 09:19:00] [ info] [input:tail:tail.2] inode=426960 handle rotation(): /opt/**-**-******-front/logs/**-*******.log => /opt/**-**-*****-front/logs/**-*******.log.2023-03-31.1
[2023/04/03 09:19:00] [ info] [input:tail:tail.2] inotify_fs_remove(): inode=426960 watch_fd=1
[2023/04/03 09:19:00] [error] [/work/submodules/fluent-bit/plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2023/04/03 09:19:00] [ info] [input:tail:tail.2] inotify_fs_add(): inode=980964 watch_fd=2 name=/opt/**-**-******-front/logs/tv-optimizer.log



This issue was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 17, 2024