Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect StartTimeUnixNano for ExponentialHistogram with Delta Temporality after gRPC connection drop #3974

Closed
alexchowle opened this issue Jun 15, 2024 · 10 comments · Fixed by #3978
Assignees
Labels
bug Something isn't working

Comments

@alexchowle
Copy link

alexchowle commented Jun 15, 2024

Describe your environment

OS: CentOS 7
Python version: 3.11.1
SDK version: 1.25.0
API version: 1.25.0

What happened?

As part of investigating #3971 I found that the loss and subsequent re-establishment of a gRPC connection to a Collector caused the next metric export to have a very old StartTimeUnixNano value

Steps to Reproduce

metrics_temporality = {
        Histogram: AggregationTemporality.DELTA
    }
otlp_metric_exporter = OTLPMetricExporter(endpoint=grpc_collector_address, insecure=True,
                                              preferred_temporality=metrics_temporality, timeout=30_000)
metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter,
                                                  export_interval_millis=60_000))
meter_provider = MeterProvider(metric_readers=[metric_reader], resource=_create_resources(),
                                   views=[View(instrument_type=Histogram,
                                               aggregation=ExponentialBucketHistogramAggregation())])
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter(__name__)

h = meter.create_histogram(
        name='my_histogram', unit='s', description='just my histogram'
    )

h.record(1.0)
time.sleep(360)
h.record(2.0)

Have the OTel Collector drop the client's gRPC connection by setting the max_connection_age to 5m (as documented at https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configgrpc/README.md#server-configuration)

So the first recording happens, then within the 6-minute sleep the Collector drops the client's gRPC connection within 5 minutes, then the client sends a second observation after the 6 minutes has elapsed.

Expected Result

StartTimeUnixNano has a value that is within 60 seconds of the 2nd recording. This is how the Golang SDK works.

Actual Result

StartTimeUnixNano has the TimeUnixNano of observation # 1 i.e. about 6 minutes in the past.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:40.301137322 +0000 UTC
Timestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Count: 1
Sum: 0.021235
Min: 0.021235
Max: 0.021235
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.021235, 0.021235], Count: 1


Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Timestamp: 2024-06-14 06:31:53.98886037 +0000 UTC
Count: 1
Sum: 0.018851
Min: 0.018851
Max: 0.018851
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.018851, 0.018851], Count: 1

Additional context

No response

Would you like to implement a fix?

None

@alexchowle alexchowle added the bug Something isn't working label Jun 15, 2024
@alexchowle
Copy link
Author

alexchowle commented Jun 18, 2024

I've performed more testing and it seems the gRPC disconnection is irrelevant; the aged StartTimeUnixNano is used regardless.

@xrmx
Copy link
Contributor

xrmx commented Jun 26, 2024

@alexchowle care to give #3978 a run?

@alexchowle
Copy link
Author

alexchowle commented Jun 26, 2024

@alexchowle care to give #3978 a run?

I've ran the reproduction scenario and am still seeing the same issue.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1


Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1

I'm pretty confident I'm using the correct whl files:
image

I'm picking up the new code:
image

I'm thinking I'm doing something wrong with my builds/imports of the wheels.

@ocelotl
Copy link
Contributor

ocelotl commented Jun 27, 2024

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1


Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1

From the data above you are making 2 observations, with the following timestamps:

> Observation # 1
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
> Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
> ...
> Observation # 2
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
> Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
> ...

Observation # 1 Timestamp matches Observation # 2 StartTimestamp:
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC == StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC.

This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.

For this behavior to change to match the Golang behavior the entire _ExponentialBucketHistogramAggregation would need to be reinstantiated after the first point is collected and before the second point is collected. I wonder if that is what Golang is doing, I imagine something like that happens when the gRPC connection drops, but I am not sure if that behavior is defined somewhere in the spec 🤷

I'll ask @jmacd for guidance here.

@alexchowle
Copy link
Author

I only noticed this issue because my vendor backend uses StartTimeUnixNano as the observation timestamp. My service has a very low throughput.

@ocelotl
Copy link
Contributor

ocelotl commented Jun 27, 2024

Sorry, please disregard my previous answer I need to do more thinking about this issue, and consider the export interval of the periodic exporter, will comment here when I have a better answer ✌️

ocelotl added a commit to ocelotl/opentelemetry-python that referenced this issue Jun 27, 2024
@ocelotl
Copy link
Contributor

ocelotl commented Jun 27, 2024

@alexchowle I have created a branch that I think should fix your issue: #4002

I am just adding a commit so that the previous collection time gets updated with every collection regardless of aggregate being called between collection cycles or not.

Please give it a try ✌️

@jack-berg
Copy link
Member

jack-berg commented Jun 28, 2024

This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.

I disagree. The supplementary guidelines has a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If your interpretation was correct, we would see the 3rd export have a start / end time of (T1, T3].

@ocelotl
Copy link
Contributor

ocelotl commented Jun 28, 2024

This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.

I disagree. The supplementary guidelines has a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If you're interpretation was correct, we would see the 3rd export have a start / end time of (T1, T3].

Please take a look here

@alexchowle
Copy link
Author

@alexchowle I have created a branch that I think should fix your issue: #4002

I am just adding a commit so that the previous collection time gets updated with every collection regardless of aggregate being called between collection cycles or not.

Please give it a try ✌️

I've just tried it. This looks very promising:

Observation # 1

2024-06-29T10:12:33.955Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:11:33.902642238 +0000 UTC
Timestamp: 2024-06-29 10:12:33.897396523 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Observation # 2

2024-06-29T10:16:34.014Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-06-29T10:16:34.014Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:16:33.904149745 +0000 UTC
Timestamp: 2024-06-29 10:16:34.00260972 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants