[APM] Java agent GC metrics visualization #36320

Closed
graphaelli opened this issue May 8, 2019 · 7 comments · Fixed by #47023
Labels
Team:APM All issues that need APM UI Team support v7.5.0

Comments

@graphaelli
Member

graphaelli commented May 8, 2019

#34708 implemented a metrics endpoint including 3 of the 5 metrics intended for the Java agent metrics UI. This issue is for tracking the other 2 metrics: GC rate and GC time.

GC rate is the number of garbage collection runs per pool
GC time is the amount of time spent in garbage collection per pool

Both of these are monotonically increasing counters, so each metric first requires a calculation per agent instance, followed by some rollup to communicate values across all instances. To support that type of aggregation, agent.ephemeral_id will be stored with metrics per elastic/apm-server#2148.

Considering just GC count, given these 3 samples across 2 instances:

{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 10}},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 1}},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 6}},"agent":{"name":"java","ephemeral_id":"def"}}

Agent abc has per-instance deltas of 1, 1, 8 GCs and def has 1, 0, 5, so the overall service graph would show 2, 1, 13.
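
For illustration, a minimal sketch of that delta-and-rollup calculation in plain TypeScript (the Sample shape and function name are hypothetical, not an existing Kibana API):

interface Sample {
  timestamp: string;
  ephemeralId: string;
  gcCount: number;
}

// Group samples per agent instance, compute per-instance deltas
// (clamping counter resets to zero), then sum the deltas of all
// instances for each timestamp.
function serviceGcDeltas(samples: Sample[]): Map<string, number> {
  const byInstance = new Map<string, Sample[]>();
  for (const s of samples) {
    const list = byInstance.get(s.ephemeralId) ?? [];
    list.push(s);
    byInstance.set(s.ephemeralId, list);
  }

  const totals = new Map<string, number>();
  for (const instanceSamples of byInstance.values()) {
    instanceSamples.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
    let previous = 0;
    for (const s of instanceSamples) {
      const delta = Math.max(s.gcCount - previous, 0); // counter reset -> treat as 0
      previous = s.gcCount;
      totals.set(s.timestamp, (totals.get(s.timestamp) ?? 0) + delta);
    }
  }
  return totals;
}

Fed with the six samples above, this yields 2, 1, 13 for the three timestamps, matching the expected service-level graph. Note that the Elasticsearch derivative used in the query below has no value for an instance's first bucket, whereas this sketch treats the first sample as a delta from zero.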

One way to query per-instance values, including accounting for counter resets:

{
  "size": 0,
  "aggs": {
    "per_agent": {
      "terms": {
        "field": "agent.ephemeral_id.keyword",
        "size": 10
      },
      "aggs": {
        "over_time": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "10s"
          },
          "aggs": {
            "gc_max": {
              "max": {
                "field": "jvm.gc.count"
              }
            },
            "gc_count_all": {
              "derivative": {
                "buckets_path": "gc_max"
              }
            },
            "gc_count": {
              "bucket_script": {
                "buckets_path": {"value": "gc_max"},
                "script": "params.value > 0.0 ? params.value : 0.0"
              }
            }
          }
        }
      }
    }
  }
}

This will only consider the top X agents due to the terms aggregation. Also, I was unable to come up with a single query that calculates the numbers to be graphed. One option is to compute the sums per date histogram bucket after the query returns, similar to how the TSVB series aggregation does it; a rough sketch follows.
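
For example, a minimal sketch of that post-query rollup over the response of the query above (the typings are simplified and hypothetical; agentBuckets would correspond to aggregations.per_agent.buckets in the search response):

// Simplified, hypothetical typings for the response of the query above.
interface GcBucket {
  key_as_string: string;
  gc_count?: { value: number | null };
}
interface AgentBucket {
  key: string;
  over_time: { buckets: GcBucket[] };
}

// Sum the per-instance gc_count values into a single service-level
// series keyed by the date_histogram bucket timestamp.
function rollUpGcCounts(agentBuckets: AgentBucket[]): Record<string, number> {
  const series: Record<string, number> = {};
  for (const agent of agentBuckets) {
    for (const bucket of agent.over_time.buckets) {
      const value = bucket.gc_count?.value ?? 0;
      series[bucket.key_as_string] = (series[bucket.key_as_string] ?? 0) + value;
    }
  }
  return series;
}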

To eliminate the terms aggregation's size limitation, a composite aggregation could be used. Another option is to use the metrics explorer as a backend for these calculations.
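
A hedged sketch of what the composite variant could look like, written as a request-body object (untested; the page size and source names are arbitrary, and as far as I know pipeline aggregations such as derivative are not supported under a composite aggregation, so the delta calculation would stay client-side as in the earlier sketch):

// Hypothetical request body: page over all (instance, time bucket) pairs
// with a composite aggregation instead of a size-limited terms aggregation.
const gcCountCompositeRequest = {
  size: 0,
  aggs: {
    per_agent_time: {
      composite: {
        size: 1000, // page size; pass the returned after_key to fetch the next page
        sources: [
          { agent_id: { terms: { field: 'agent.ephemeral_id.keyword' } } },
          { time: { date_histogram: { field: '@timestamp', interval: '10s' } } },
        ],
      },
      aggs: {
        // Per (instance, bucket) counter value; deltas are computed client-side.
        gc_max: { max: { field: 'jvm.gc.count' } },
      },
    },
  },
};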

@eyalkoren Can you clarify what the pool means and which field that is in the Elasticsearch document?

@sqren all yours, I hope this helps.

@graphaelli graphaelli added Team:APM All issues that need APM UI Team support v7.2.0 labels May 8, 2019
@elasticmachine
Contributor

Pinging @elastic/apm-ui

@graphaelli
Member Author

Following up on the question about GC pools: #34708 (comment) says to use context.tags.name, which is now labels.name in 7.x.

So, new sample data including those labels:

{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 10}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 3}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 5}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"abc"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Old Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:08.215Z","jvm":{"gc":{"count": 1}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:18.215Z","jvm":{"gc":{"count": 2}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}
{"index":{}}
{"@timestamp":"2019-05-08T12:37:28.215Z","jvm":{"gc":{"count": 3}},"labels":{"name":"G1 Young Generation"},"agent":{"name":"java","ephemeral_id":"def"}}

The query(ies) will need to take this additional level of aggregation into account, e.g. by grouping on labels.name as in the sketch below.
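
One hedged sketch of how that could look, again as a request-body object (untested; whether labels.name needs a .keyword suffix depends on the index mapping, and the 10s interval is just carried over from the earlier example):

// Hypothetical: group by GC pool first, then by agent instance, and reuse
// the max + derivative + clamp chain per pool/instance combination.
const gcCountPerPoolRequest = {
  size: 0,
  aggs: {
    per_pool: {
      terms: { field: 'labels.name' },
      aggs: {
        per_agent: {
          terms: { field: 'agent.ephemeral_id.keyword', size: 10 },
          aggs: {
            over_time: {
              date_histogram: { field: '@timestamp', interval: '10s' },
              aggs: {
                gc_max: { max: { field: 'jvm.gc.count' } },
                gc_count_all: { derivative: { buckets_path: 'gc_max' } },
                gc_count: {
                  bucket_script: {
                    buckets_path: { value: 'gc_count_all' },
                    script: 'params.value > 0.0 ? params.value : 0.0',
                  },
                },
              },
            },
          },
        },
      },
    },
  },
};

The post-query rollup would then produce one series per pool name rather than a single service-level series.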

@eyalkoren
Contributor

Some input on this:

GC names: Normally there are two garbage collectors in HotSpot and similar JVMs: one that does minor collections and one that does major collections. Minor collections collect young objects (in this case, G1 Young Generation) and are more frequent. Major collections collect older objects as well. As @graphaelli already mentioned, the name of the GC is stored as labels.name in 7.x. This name should be used for aggregations, but also for the graph-line labelling and the legend.

In addition, something I didn't see here is an issue with jvm.memory.non_heap.max, which may have the value -1. Since Java 8 the default is in fact -1, because the "metaspace" introduced in that version is unlimited by default, so -1 is expected to be the common case and we should handle it gracefully when the value is irrelevant (meaning: not show it). The UI should not fail when this metric is valid in some data points and invalid in others on the same graph, but we can generally assume it is either always valid or always invalid (the exception being a JVM that is stopped, reconfigured to limit the metaspace, and restarted within the time range of the metric query).
Here are some options for dealing with that:

  1. Omit this metric from the graph if ALL data points have the value -1; if some have a value > 0, just show the data as is.
  2. Omit this metric from the graph if AT LEAST one document has the value -1.
  3. Filter out documents where the value is < 0.

Which to choose depends on how easy each option is to implement and how it behaves; a minimal sketch of option 3 follows.
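
For reference, option 3 as a query clause (the field name is taken from the comment above; where exactly this clause would be added to the metrics query is left open):

// Hypothetical filter implementing option 3: exclude documents where
// jvm.memory.non_heap.max is reported as -1 (unlimited metaspace).
const nonHeapMaxFilter = {
  bool: {
    filter: [
      { exists: { field: 'jvm.memory.non_heap.max' } },
      { range: { 'jvm.memory.non_heap.max': { gte: 0 } } },
    ],
  },
};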

In order to test with real data, just use the agent on Java 8. Then, to get a valid value for this metric, stop the JVM and restart it with the -XX:MaxMetaspaceSize flag on the command line (e.g. -XX:MaxMetaspaceSize=128M).

@sorenlouv
Member

@roncohen I'm trying to figure out whether this issue is blocked by work needed in the agents.
I talked to @eyalkoren who said it had been discussed that "maybe it would be better if agents (all, not only Java) will switch to reporting deltas instead of monotonically increasing counters".
Since this affects the GC counters, should we wait for that or move forward without it?

@roncohen
Contributor

roncohen commented Jul 30, 2019

This is blocked on design work to come up with next steps, AFAIK. cc @katrin-freihofner @graphaelli @nehaduggal

@katrin-freihofner
Contributor

@roncohen are you referring to #41349?

@roncohen
Contributor

yes, thanks for the link.

@dgieselaar dgieselaar removed their assignment Sep 26, 2019
@sorenlouv sorenlouv changed the title [APM] Java agent GC metrics UI [APM] Java agent GC metrics visualization Sep 27, 2019
@dgieselaar dgieselaar self-assigned this Sep 30, 2019
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Oct 8, 2019
dgieselaar added a commit that referenced this issue Oct 9, 2019
* [APM] Garbage collection metrics charts

Closes #36320.

* Review feedback

* Display average of delta in gc chart
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Oct 9, 2019
dgieselaar added a commit that referenced this issue Oct 10, 2019