[APM] Java agent GC metrics visualization #36320

Closed
graphaelli opened this issue May 8, 2019 · 7 comments · Fixed by #47023
Labels
Team:APM All issues that need APM UI Team support v7.5.0

Comments

@graphaelli
Member

graphaelli commented May 8, 2019

#34708 implemented a metrics endpoint including 3 of the 5 metrics intended for the Java agent metrics UI. This issue is for tracking the other 2 metrics: GC rate and GC time.

GC rate is the number of garbage collection runs per pool
GC time is the amount of time spent in garbage collection per pool

Both of these are monotonically increasing counters, so each metric first requires a calculation per agent instance, followed by some rollup to communicate values across all instances. To support that type of aggregation, agent.ephemeral_id will be stored with metrics per elastic/apm-server#2148.

Considering just GC count, given these 3 samples across 2 instances:

{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 10}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 6}},"agent":{"name":"java","ephemeral_id":"def"}}

Agent abc has per-instance deltas of 1, 1, 8 GCs and def has 1, 0, 5, so the overall service graph would show 2, 1, 13.
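
For illustration, a minimal sketch of that delta-and-rollup calculation in plain TypeScript (the Sample shape and function name are hypothetical, not an existing Kibana API):

interface Sample {
  timestamp: string;
  ephemeralId: string;
  gcCount: number;
}

// Group samples per agent instance, compute per-instance deltas
// (clamping counter resets to zero), then sum the deltas of all
// instances for each timestamp.
function serviceGcDeltas(samples: Sample[]): Map<string, number> {
  const byInstance = new Map<string, Sample[]>();
  for (const s of samples) {
    const list = byInstance.get(s.ephemeralId) ?? [];
    list.push(s);
    byInstance.set(s.ephemeralId, list);
  }

  const totals = new Map<string, number>();
  for (const instanceSamples of byInstance.values()) {
    instanceSamples.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
    let previous = 0;
    for (const s of instanceSamples) {
      const delta = Math.max(s.gcCount - previous, 0); // counter reset -> treat as 0
      previous = s.gcCount;
      totals.set(s.timestamp, (totals.get(s.timestamp) ?? 0) + delta);
    }
  }
  return totals;
}

Fed with the six samples above, this yields 2, 1, 13 for the three timestamps, matching the expected service-level graph. Note that the Elasticsearch derivative used in the query below has no value for an instance's first bucket, whereas this sketch treats the first sample as a delta from zero.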

One way to query per-instance values, including accounting for counter resets:

{
  "size": 0,
  "aggs": {
    "per_agent": {
      "terms": {
        "field": "agent.ephemeral_id.keyword",
        "size": 10
      },
      "aggs": {
        "over_time": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "10s"
          },
          "aggs": {
            "gc_max": {
              "max": {
                "field": "jvm.gc.count"
              }
            },
            "gc_count_all": {
              "derivative": {
                "buckets_path": "gc_max"
              }
            },
            "gc_count": {
              "bucket_script": {
                "buckets_path": {"value": "gc_max"},
                "script": "params.value > 0.0 ? params.value : 0.0"
              }
            }
          }
        }
      }
    }
  }
}

This will only consider the top X agents due to the terms aggregation. Also, I was unable to come up with a single query that calculates the numbers to be graphed. One option is to compute the sums per date histogram bucket after the query returns, similar to how the TSVB series aggregation does it; a rough sketch follows.
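
For example, a minimal sketch of that post-query rollup over the response of the query above (the typings are simplified and hypothetical; agentBuckets would correspond to aggregations.per_agent.buckets in the search response):

// Simplified, hypothetical typings for the response of the query above.
interface GcBucket {
  key_as_string: string;
  gc_count?: { value: number | null };
}
interface AgentBucket {
  key: string;
  over_time: { buckets: GcBucket[] };
}

// Sum the per-instance gc_count values into a single service-level
// series keyed by the date_histogram bucket timestamp.
function rollUpGcCounts(agentBuckets: AgentBucket[]): Record<string, number> {
  const series: Record<string, number> = {};
  for (const agent of agentBuckets) {
    for (const bucket of agent.over_time.buckets) {
      const value = bucket.gc_count?.value ?? 0;
      series[bucket.key_as_string] = (series[bucket.key_as_string] ?? 0) + value;
    }
  }
  return series;
}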

To eliminate the terms aggregation's size limitation, a composite aggregation could be used. Another option is to use the metrics explorer as a backend for these calculations.
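
A hedged sketch of what the composite variant could look like, written as a request-body object (untested; the page size and source names are arbitrary, and as far as I know pipeline aggregations such as derivative are not supported under a composite aggregation, so the delta calculation would stay client-side as in the earlier sketch):

// Hypothetical request body: page over all (instance, time bucket) pairs
// with a composite aggregation instead of a size-limited terms aggregation.
const gcCountCompositeRequest = {
  size: 0,
  aggs: {
    per_agent_time: {
      composite: {
        size: 1000, // page size; pass the returned after_key to fetch the next page
        sources: [
          { agent_id: { terms: { field: 'agent.ephemeral_id.keyword' } } },
          { time: { date_histogram: { field: '@timestamp', interval: '10s' } } },
        ],
      },
      aggs: {
        // Per (instance, bucket) counter value; deltas are computed client-side.
        gc_max: { max: { field: 'jvm.gc.count' } },
      },
    },
  },
};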

@eyalkoren Can you clarify what the pool means and which field that is in the Elasticsearch document?

@sqren all yours, I hope this helps.

@graphaelli graphaelli added Team:APM All issues that need APM UI Team support v7.2.0 labels May 8, 2019
@elasticmachine
Contributor

Pinging @elastic/apm-ui

@graphaelli
Member Author

Following up on the question about GC pools: #34708 (comment) says to use context.tags.name, which is now labels.name in 7.x.

So, new sample data including those labels:

{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 10}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 3}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 5}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 3}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}

The query(ies) will need to take this additional level of aggregation into account, e.g. by grouping on labels.name as in the sketch below.
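
One hedged sketch of how that could look, again as a request-body object (untested; whether labels.name needs a .keyword suffix depends on the index mapping, and the 10s interval is just carried over from the earlier example):

// Hypothetical: group by GC pool first, then by agent instance, and reuse
// the max + derivative + clamp chain per pool/instance combination.
const gcCountPerPoolRequest = {
  size: 0,
  aggs: {
    per_pool: {
      terms: { field: 'labels.name' },
      aggs: {
        per_agent: {
          terms: { field: 'agent.ephemeral_id.keyword', size: 10 },
          aggs: {
            over_time: {
              date_histogram: { field: '@timestamp', interval: '10s' },
              aggs: {
                gc_max: { max: { field: 'jvm.gc.count' } },
                gc_count_all: { derivative: { buckets_path: 'gc_max' } },
                gc_count: {
                  bucket_script: {
                    buckets_path: { value: 'gc_count_all' },
                    script: 'params.value > 0.0 ? params.value : 0.0',
                  },
                },
              },
            },
          },
        },
      },
    },
  },
};

The post-query rollup would then produce one series per pool name rather than a single service-level series.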

@eyalkoren
Contributor

Some input on this:

GC names: Normally there are two garbage collectors in HotSpot and similar JVMs: one that does minor collections and one that does major collections. Minor collections collect young objects (in this case, G1 Young Generation) and are more frequent. Major collections collect older objects as well. As @graphaelli already mentioned, the name of the GC is stored as labels.name in 7.x. This name should be used for aggregations, but also for the graph-line labelling and the legend.

In addition, something I didn't see here is an issue with jvm.memory.non_heap.max, which may have the value -1. Since Java 8 the default is in fact -1, because the "metaspace" introduced in that version is unlimited by default, so -1 is expected to be the common case and we should handle it gracefully when the value is irrelevant (meaning: not show it). The UI should not fail when this metric is valid in some data points and invalid in others on the same graph, but we can generally assume it is either always valid or always invalid (the exception being a JVM that is stopped, reconfigured to limit the metaspace, and restarted within the time range of the metric query).
Here are some options for dealing with that:

  1. Omit this metric from the graph if ALL data points have the value -1; if some have a value > 0, just show the data as is.
  2. Omit this metric from the graph if AT LEAST one document has the value -1.
  3. Filter out documents where the value is < 0.

Which to choose depends on how easy each option is to implement and how it behaves; a minimal sketch of option 3 follows.
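
For reference, option 3 as a query clause (the field name is taken from the comment above; where exactly this clause would be added to the metrics query is left open):

// Hypothetical filter implementing option 3: exclude documents where
// jvm.memory.non_heap.max is reported as -1 (unlimited metaspace).
const nonHeapMaxFilter = {
  bool: {
    filter: [
      { exists: { field: 'jvm.memory.non_heap.max' } },
      { range: { 'jvm.memory.non_heap.max': { gte: 0 } } },
    ],
  },
};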

In order to test with real data, just use the agent on Java 8. Then, to get a valid value for this metric, stop the JVM and restart it with the -XX:MaxMetaspaceSize flag on the command line (e.g. -XX:MaxMetaspaceSize=128M).

@sorenlouv
Member

@roncohen I'm trying to figure out whether this issue is blocked by work needed in the agents.
I talked to @eyalkoren who said it had been discussed that "maybe it would be better if agents (all, not only Java) will switch to reporting deltas instead of monotonically increasing counters".
Since this affects the GC counters, should we wait for that or move forward without it?

@roncohen
Contributor

roncohen commented Jul 30, 2019

This is blocked on design work to come up with next steps, AFAIK. cc @katrin-freihofner @graphaelli @nehaduggal

@katrin-freihofner
Contributor

@roncohen are you referring to #41349?

@roncohen
Contributor

yes, thanks for the link.

@dgieselaar dgieselaar removed their assignment Sep 26, 2019
@sorenlouv sorenlouv changed the title [APM] Java agent GC metrics UI [APM] Java agent GC metrics visualization Sep 27, 2019
@dgieselaar dgieselaar self-assigned this Sep 30, 2019
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Oct 8, 2019
dgieselaar added a commit that referenced this issue Oct 9, 2019
* [APM] Garbage collection metrics charts

Closes #36320.

* Review feedback

* Display average of delta in gc chart
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Oct 9, 2019
dgieselaar added a commit that referenced this issue Oct 10, 2019