Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to align and cleanup some jmx metrics #11621

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

SylvainJuge
Copy link
Contributor

@SylvainJuge SylvainJuge commented Jun 18, 2024

This is an attempt to fix a few errors and inconsistencies that I've found in the JMX metrics captured with the JMX Metric Insight feature.

I have intentionally limited the scope to tomcat, jetty and wildfly, but similar changes might be applied to other systems as a follow-up.

  • clarify the strategy for pre-defined metrics
  • simplify metric prefix for tomcat to tomcat. for consistency with jetty, wildfly, ...
  • fix tomcat busy/idle threads, as explained in this discussion.
  • rename metrics to use singular form to fit (experimental) metrics semconv recommendations.
  • rename units to use singular form to fit (stable) units semconv recommendations.
  • move tomcat request-related metrics to tomcat.request.* namespace for consistency with wildfly.request.*
  • align tomcat on system.network.io with tomcat.network.io for transferred bytes
  • align wildfly on system.network.io with: wildfly.network.io for transferred bytes (only direction attribute had to be changed).

For the overall strategy, I agree that covering every metric of every platform is not possible nor something we aim to. For example, with Wildfly the db pool exposes more than 50 attributes that could be captured as metrics.

I think one of the important things that could make this type of mapping somehow manageable over time is to use the following strategy for metric names and their attributes:

  • use or align to semconv when it fits
  • keep the MBean attribute name otherwise: it allows to preserve the semantics of the observed system without having to try re-defining common metrics or deal with subtle implementation details.

Checklist & follow-ups

@github-actions github-actions bot requested a review from theletterf June 18, 2024 13:22
metricAttribute:
state: const(used)
db.client.connection.state: const(used)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, the metric will have a metric attribute which will always have the same constant value. What is really the point of having such an attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct here, having a constant metric attribute only makes sense to provide breakdown of the same metric, occurences of wildfly.db.client.connection.usage should be renamed to wildfly.db.client.connection.count. I am currently trying to validate with the implementation if there is an overlap between the connection states or if it's an effective partition (in which case we can provide a breakdown).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking at the implementation and the wildfly test cases it seems that we have:

  • "available" seems close to the definition of "idle"
  • "in use" seems close to the definition of "busy"
  • "active" seems to mostly be the sum of "available" + "in use"
  • for "wait count", I haven't found any proper definition besides the docs

https://github.com/wildfly/wildfly/blob/841fea771567a71d06490a0d7e9a398dc6fdf5c0/testsuite/integration/basic/src/test/java/org/jboss/as/test/integration/jca/statistics/DataSourcePoolClearStatisticsTestCase.java#L69

https://github.com/wildfly/wildfly/blob/841fea771567a71d06490a0d7e9a398dc6fdf5c0/testsuite/integration/basic/src/test/java/org/jboss/as/test/integration/jca/capacitypolicies/ResourceAdapterCapacityPoliciesTestCase.java#L143

Given that both InUseCount and IdleCount refer to physical connection states in documentation, they effectively form a partition and then using a common metric with a constant attribute for idle | used makes sense.

For WaitCount this is about the number of logical connections that are waiting for a physical connection, so it would also make sense to use the same metric with a custom wait constant attribute.

When aggregating and removing attributes this would return the total number of logical connections to the database pool, and only a subset with either idle or used attribute value would be the physical connections.

I have updated this PR to match this in fb71fa1

@PeterF778
Copy link
Contributor

Originally, JMX Metric Insight borrowed the metric definitions from JMX Metric Gatherer, and was bug-for-bug compatible. This was caused equally by our laziness as by the desire to allow users to transition smoothly to in-process metric collection.
I do not know how popular JMX Metric Insight is, but I know from experience that changes to metric names/attributes can sometimes be painful for the users. Perhaps it will be helpful for the customers if we keep the old metric configuration files around for some time as tomcat_old or tomcat_legacy etc.

@SylvainJuge
Copy link
Contributor Author

Originally, JMX Metric Insight borrowed the metric definitions from JMX Metric Gatherer, and was bug-for-bug compatible. This was caused equally by our laziness as by the desire to allow users to transition smoothly to in-process metric collection. I do not know how popular JMX Metric Insight is, but I know from experience that changes to metric names/attributes can sometimes be painful for the users. Perhaps it will be helpful for the customers if we keep the old metric configuration files around for some time as tomcat_old or tomcat_legacy etc.

I completely understand the duplication strategy here, but it's probably time to remove the duplication and simplify things:

Until we have such duplication removed, we will have to backport such changes in the contrib repo. Implementation-wise, a common implementation would likely reside in the contrib repo and be included in the instrumentation agent (there are already similar dependencies for the aws and gcp resource providers).

Regarding compatibility, I really don't know what should be the best approach here, all the JMX metrics are very dependent on implementation details, having any formal definition in semconv and stability status for them is not possible. Maybe keeping previous iterations of the yaml files could provide this.

@SylvainJuge
Copy link
Contributor Author

Status following June 20th SIG meeting:

  • we need to first align the implementations in contrib/instrumentation while preserving current metrics compatibility
  • we can provide updated version of the metrics in the instrumentation side with opt-in to use those new definitions (by default on the current state for compatibility)
  • switching to the new metrics could be aligned with the next major that should be around the stable database semconv.
  • this PR will stay in draft until then, parts of it will of course be reused along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants