[Prometheus Receiver] Incorrect start_timestamp of Summary and Histogram metrics after a failed scrape #6360

PaurushGarg · 2021-11-17T15:50:34Z

Describe the bug

We want to ensure OpenMetrics / Prometheus compatibility in the OpenTelemetry Collector. We have been building compatibility tests to verify the OpenMetrics spec is fully supported on both the OpenTelemetry Collector Prometheus receiver and PRW exporter as well as in Prometheus itself.

To verify that the Prometheus receiver functions as expected for Prometheus-to-OTLP data transformations, issues #6000 and #6151 create tests that validate Prometheus core metrics. Through these tests we found that, after a failed scrape, the start_timestamp of histogram and summary metrics is not the same as the start_timestamp of the first scrape. The OTLP format uses start_timestamp to record the time at which a metric collection system was started.

Steps to reproduce

Run func TestEndToEnd(t *testing.T)
The validate loop is currently skipped for these tests. Re-enable it by removing or commenting out the following lines in func testEndToEnd(...) (lines 1442-1445):

if true {
	t.Log(`Skipping the "up" metric checks as they seem to be spuriously failing after staleness marker insertions`)
	return
}

Note: the test fails in getValidScrapes because of staleness; inspect the metrics to see the timestamps.

What did you expect to see?

Referring to the table below, the start_timestamps of the Histogram and Summary metrics in Target1Scrape2 should be the same as in Target1Scrape1, regardless of the failed scrape between Target1Scrape1 and Target1Scrape2.

What did you see instead?

Because of the failed scrape between Target1Page1 and Target1Page2, the start_timestamps of Target1Scrape1 and Target1Scrape2 differ for histogram and summary metrics.

Table: start_timestamps observed (each value is timestamp:{seconds, nanos}):

	Target1 Scrape1	Target1 Failed Scrape	Target1 Scrape2	Target1 Failed Scrape	4 nos of: 5 default metrics with up 0
Gauge series start_timestamp	Point Only	Point Only	Point Only	Point Only
Counter series start_timestamp	1634681667, 427000000	1634681667, 427000000	1634681667, 427000000	1634681667, 427000000
Histogram series start_timestamp	1634681667, 427000000		1634681667, 427000000	1634681670, 431000000
Summary series start_timestamp	1634681667, 427000000	1634681668, 432000000	1634681668, 432000000	1634681670, 431000000
Points timestamp	1634681667, 427000000	1634681668, 432000000	1634681669, 431000000	1634681670, 431000000

Possible Solution

Modify the existing adjustMetricTimeseries logic to ensure that:

start_timestamp is reset if the current scrape has a lower value than the previous scrape, even when a failed scrape lies in between.
start_timestamp is not reset if the current scrape has a higher value than the previous scrape, even when a failed scrape lies in between.

The complete solution also depends on the resolution of issue #6400.
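The two rules above could be sketched roughly as follows. This is not the adjustMetricTimeseries implementation: the point and seriesState types, the adjust method, and the sample counts (10 and 12) are assumptions invented for illustration, and a failed scrape is modelled simply as a point flagged stale. Only the point timestamps are taken from the table.

```go
package main

import "fmt"

// point is a hypothetical, simplified cumulative data point (not the pdata API).
type point struct {
	timestampNanos int64   // time of the scrape
	count          float64 // cumulative count of a histogram or summary
	stale          bool    // true when the scrape failed (staleness marker)
}

// seriesState is hypothetical per-series state kept between scrapes.
type seriesState struct {
	startNanos int64
	lastCount  float64
	seen       bool
}

// adjust returns the start timestamp to attach to p and updates the state.
func (s *seriesState) adjust(p point) int64 {
	if p.stale {
		// A failed scrape carries no value, so it must not move the start.
		return s.startNanos
	}
	if !s.seen || p.count < s.lastCount {
		// First observation, or the value went down: start a new interval.
		s.startNanos = p.timestampNanos
		s.seen = true
	}
	s.lastCount = p.count
	return s.startNanos
}

func main() {
	s := &seriesState{}
	points := []point{
		{timestampNanos: 1634681667427000000, count: 10}, // Scrape1
		{timestampNanos: 1634681668432000000, stale: true}, // failed scrape
		{timestampNanos: 1634681669431000000, count: 12}, // Scrape2, value increased
	}
	for _, p := range points {
		fmt.Println(s.adjust(p))
	}
}
```

In this sketch all three calls return the start of Scrape1, because the failed scrape leaves the state untouched and the value in Scrape2 is higher than in Scrape1. A lower value in Scrape2 would instead move the start to the timestamp of Scrape2.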

What version did you use?

Collector Contrib: v0.37.1

Additional context
Related to: open-telemetry/prometheus-interoperability-spec#57
Issue: #6000 #6400
cc: @alolita @Aneurysm9


PaurushGarg · 2021-11-17T15:51:00Z

@alolita please assign this issue to me. I would like to work on this one.

gouthamve · 2022-07-14T14:06:59Z

@PaurushGarg Can this be closed as it looks like #6696 fixes it?

PaurushGarg · 2022-07-14T14:54:50Z

Yes. This issue has been fixed and can be closed now. cc @Aneurysm9


alolita added the comp:prometheus Prometheus related issues label Nov 17, 2021

alolita assigned PaurushGarg Nov 17, 2021

PaurushGarg changed the title ~~[Prometheus reciever] Incorrect start_timestamp of Summary and Histogram metrics after a failed scrape~~ [Prometheus receiver] Incorrect start_timestamp of Summary and Histogram metrics after a failed scrape Nov 17, 2021

alolita mentioned this issue Nov 17, 2021

[Prometheus Receiver] Modify existing tests to validate metrics using OTLP format instead of OpenCensus format #6151

Closed

PaurushGarg changed the title ~~[Prometheus receiver] Incorrect start_timestamp of Summary and Histogram metrics after a failed scrape~~ [Prometheus Receiver] Incorrect start_timestamp of Summary and Histogram metrics after a failed scrape Nov 18, 2021

This was referenced Dec 10, 2021

[Prometheus Receiver] Modifies otlp_metric_adjuster to support datapoint flags for staleness markers #6696

Merged

REQUEST: New membership for @PaurushGarg open-telemetry/community#934

Closed

PaurushGarg closed this as completed Jul 14, 2022

povilasv referenced this issue in coralogix/opentelemetry-collector-contrib Dec 19, 2022

[chore] pdatagen: Use os.WriteFile, avoid unnecessary copy of the data (#6360)

79d5c62
