kafka.consumer.* metrics with NaN values and no unit #9498

Closed
mviitane opened this issue Sep 19, 2023 · 3 comments
Labels
bug (Something isn't working), needs triage (New issue that requires triage)

Comments

@mviitane
Member

Describe the bug

I’m running the OpenTelemetry demo application with the latest Java agent (v1.30.0). My metrics backend complains about receiving many metrics with NaN values.

It looks like the Fraud Detection service (the Kafka consumer in the demo app) produces these NaN-valued metrics. These metrics are also missing a unit.

Steps to reproduce

  • Download the OpenTelemetry demo application (use main/latest, not the latest released version, to get the latest Java agent)
  • Configure the detailed logging exporter in the Collector to get detailed metric printouts:
    exporters:
      logging/detailed:
        verbosity: detailed
  • Start the demo app
  • After about 2 minutes, search the Collector logs for "NaN" values:
    $ docker compose logs otelcol | grep -B 11 NaN

Expected behavior

I'd expect all exported metrics to have a value and a unit. Metrics with NaN values shouldn't be propagated.

Actual behavior

Lots of metrics with NaN values and no unit:

otel-col  | Metric #72
otel-col  | Descriptor:
otel-col  |      -> Name: kafka.consumer.partition_assigned_latency_max
otel-col  |      -> Description: The max time taken for a partition-assigned rebalance listener callback
otel-col  |      -> Unit: 
otel-col  |      -> DataType: Gauge
otel-col  | NumberDataPoints #0
otel-col  | Data point attributes:
otel-col  |      -> client-id: Str(consumer-frauddetectionservice-1)
otel-col  | StartTimestamp: 2023-09-19 06:14:58.534111254 +0000 UTC
otel-col  | Timestamp: 2023-09-19 06:15:58.53179417 +0000 UTC
otel-col  | Value: NaN
--
otel-col  | Metric #77
otel-col  | Descriptor:
otel-col  |      -> Name: kafka.consumer.join_time_avg
otel-col  |      -> Description: The average time taken for a group rejoin
otel-col  |      -> Unit: 
otel-col  |      -> DataType: Gauge
otel-col  | NumberDataPoints #0
otel-col  | Data point attributes:
otel-col  |      -> client-id: Str(consumer-frauddetectionservice-1)
otel-col  | StartTimestamp: 2023-09-19 06:14:58.534111254 +0000 UTC
otel-col  | Timestamp: 2023-09-19 06:15:58.53179417 +0000 UTC
otel-col  | Value: NaN
--
otel-col  | Metric #84
otel-col  | Descriptor:
otel-col  |      -> Name: kafka.consumer.rebalance_latency_max
otel-col  |      -> Description: The max time taken for a group to complete a successful rebalance, which may be composed of several failed re-trials until it succeeded
otel-col  |      -> Unit: 
otel-col  |      -> DataType: Gauge
otel-col  | NumberDataPoints #0
otel-col  | Data point attributes:
otel-col  |      -> client-id: Str(consumer-frauddetectionservice-1)
otel-col  | StartTimestamp: 2023-09-19 06:14:58.534111254 +0000 UTC
otel-col  | Timestamp: 2023-09-19 06:15:58.53179417 +0000 UTC
otel-col  | Value: NaN
--
otel-col  | Metric #85
otel-col  | Descriptor:
otel-col  |      -> Name: kafka.consumer.reauthentication_latency_avg
otel-col  |      -> Description: The average latency observed due to re-authentication
otel-col  |      -> Unit: 
otel-col  |      -> DataType: Gauge
otel-col  | NumberDataPoints #0
otel-col  | Data point attributes:
otel-col  |      -> client-id: Str(consumer-frauddetectionservice-1)
otel-col  | StartTimestamp: 2023-09-19 06:14:58.534111254 +0000 UTC
otel-col  | Timestamp: 2023-09-19 06:15:58.53179417 +0000 UTC
otel-col  | Value: NaN

Javaagent or library instrumentation version

v1.30.0

Environment

Docker desktop on macOS

Additional context

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#numerical-limits-handling

@mviitane mviitane added bug Something isn't working needs triage New issue that requires triage labels Sep 19, 2023
@mateuszrzeszutek
Member

Hey @mviitane ,

> Also, these metrics are missing the unit.

The Kafka metrics that you see here are just a simple bridge to Kafka's own metrics system; see the OpenTelemetryMetricsReporter for more info. Because Kafka does not report units along with its metrics, neither do we.

> NaN value metrics shouldn't be propagated.

That's a fair point. @jack-berg do you think we should filter out NaN values at some level?

@jack-berg
Member

The metrics API operations are expected to record "numeric values". NaN is not numeric and cannot be aggregated so it seems correct for the SDK to ignore them.
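
To illustrate the point about aggregation, here is a minimal sketch in plain Java of what ignoring NaN at the recording boundary could look like. It is not the SDK's actual implementation: `NanFilterSketch` and `dropNaN` are hypothetical names invented for this example, and a `DoubleConsumer` stands in for a real instrument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

public class NanFilterSketch {

    /**
     * Hypothetical helper: wraps a recorder so that NaN measurements are
     * dropped instead of being passed on to the delegate.
     */
    static DoubleConsumer dropNaN(DoubleConsumer delegate) {
        return value -> {
            if (!Double.isNaN(value)) {
                delegate.accept(value);
            }
        };
    }

    public static void main(String[] args) {
        // A single NaN poisons any arithmetic aggregation it takes part in.
        double poisonedSum = 12.5 + Double.NaN;
        System.out.println(Double.isNaN(poisonedSum)); // prints "true"

        // The list stands in for whatever would receive the measurements.
        List<Double> recorded = new ArrayList<>();
        DoubleConsumer recorder = dropNaN(recorded::add);

        recorder.accept(12.5);
        recorder.accept(Double.NaN); // dropped by the guard
        recorder.accept(0.0);

        System.out.println(recorded); // prints "[12.5, 0.0]"
    }
}
```

The guard uses `Double.isNaN` rather than `==`, because NaN never compares equal to anything, including itself.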

@mateuszrzeszutek
Member

This was resolved in open-telemetry/opentelemetry-java#5859

3 participants