Elasticsearch has the metrics API available in server's package
org.elasticsearch.telemetry.metric
.
This package contains base classes/interfaces for creating and working with metrics.
Please refer to the javadocs provided in these classes in that package for more details.
The entry point for working with metrics is MeterRegistry
.
We use elastic's apm-java-agent as an implementation of the API we expose.
the implementation can be found in :modules:apm
The apm-java-agent is responsible for buffering metrics and upon metrics_interval
send them over to apm server.
Metrics_interval is configured via a telemetry.agent.metrics_interval
setting
The agent also collects a number of JVM metrics.
see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm
The choice of the right instrument is not always easy as often differences are subtle. The simplified algorithm could be as follows:
- You want to measure something (absolute value)
- values are non-additive
- use a gauge
- Example: a cpu temperature
- values are additive
- use asynchronous counter
- Example: total number of requests
- values are non-additive
- You want to count something
- values are monotonously increasing
- use a counter
- Example: Recording a failed authentication count
- values can be decreased
- use UpDownCounter
- Example: Number of orders in a queue
- values are monotonously increasing
- You want to record a statistics
- use a histogram
- Example: A statistics about how long it took to access a value from cache
- use a histogram
refer to https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/#instrument-selection for more details
See the naming guidelines for metrics: NAMING GUIDE
if the instrument is correctly chosen, the apm server will be able to determine if the metrics were restarted (i.e. node was restarted) or there was a counter overflow (the metric in ES might use an int internally, but apm backend might have a long )
There are 2 types of usages of an instrument depending on a type.
- For synchronous instrument (counter/UpDownCounter) we need to register an instrument with
MeterRegistry
and use the returned value to increment a value of that instrument
MeterRegistry registry;
LongCounter longCounter = registry.registerLongCounter("es.test.requests.count", "a test counter", "count");
longCounter.increment();
longCounter.incrementBy(1, Map.of("name", "Alice"));
longCounter.incrementBy(1, Map.of("name", "Bob"));
- For asynchronous instrument (gauge/AsynchronousCounter) we register an instrument and have to provide a callback that will report the absolute measured value. This callback has to be provided upon registration and cannot be changed.
MeterRegistry registry;
long someValue = 1;
registry.registerLongGauge("es.test.cpu.temperature", "the current CPU temperature as measured by psensor", "degrees Celsius",
() -> new LongWithAttributes(someValue, Map.of("cpuNumber", 1)));
If we don’t have access to ‘state’ that will be fetched on metric event (when callback is executed) we can use a utility LongGaugeMetric or LongGaugeMetric
MeterRegistry meterRegistry ;
LongGaugeMetric longGaugeMetric = LongGaugeMetric.create(meterRegistry, "es.test.gauge", "a test gauge", "total value");
longGaugeMetric.set(123L);
Each instrument can attach attributes to a reported value. This helps drilling down into the details of value that was reported during the metric event
The quickest way to verify that your metrics are working is to run ./gradlew run --with-apm-server
.
This will run ES node (or nodes in serverless) and also start a mock http server that will act
as an apm server. This fake http server will log all the http messages it receives from apm-agent
You can also run local ES node with an apm server in cloud.
Create a new deployment in cloud, then click the 'hamburger' on the left, scroll to Observability and click APM under it.
At the upper right corner there is Add data
link, then scroll down to ApmAgents
section and pick Java
There you should be able to see elastic.apm.secret_token
and `elastic.apm.server_url. You will use them in the next step.
edit your ~/.gradle/init.d/apm.gradle
and replace the secret_token and the server_url.
rootProject {
if (project.name == 'elasticsearch' && Boolean.getBoolean('metrics.enabled')) {
afterEvaluate {
testClusters.matching { it.name == "runTask" }.configureEach {
setting 'xpack.security.audit.enabled', 'true'
keystore 'telemetry.secret_token', 'TODO-REPLACE'
setting 'telemetry.metrics.enabled', 'true'
setting 'telemetry.agent.server_url', 'https://TODO-REPLACE-URL.apm.eastus2.staging.azure.foundit.no:443'
}
}
}
}
The example use:
./gradlew :run -Dmetrics.enabled=true
with any approach you took to run your ES with APM you will find apm-agent.json file in ES's logs directory. If there are any problems with connecting to APM you will see WARN/ERROR messages. We run apm-agent with logs at WARN level, so normally you should not see any logs there.
When running ES in cloud, logs are being also indexed in a logging cluster, so you will be able to find them
in kibana. The datastream.dataset
is elasticsearch.apm_agent
We currently provide a base TestTelemetryPlugin
which should help you write an integration test.
See an example S3BlobStoreRepositoryTests
https://opentelemetry.io/docs/specs/otel/metrics/supplementary-guidelines/
https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html