Remove stale metrics #164

Merged: 8 commits merged into prometheus:master on Dec 20, 2018
Conversation

@diafour (Contributor) commented Nov 13, 2018

This patch implements clearing of metrics with variable labels, with a per-mapping configurable timeout (ttl). Our monitoring installation suffers from the behavior described in #129: there are metrics from Kubernetes pods whose label values change constantly. For example, the metric ingress_nginx_upstream_retries_count has a pod label whose value is an ever-changing pod name. There is no point in storing metrics with old pod names, as they never recur. Removing metrics with these stale values will not cause any damage to subsequent aggregations or to multiple Prometheus servers.

The patch is divided into 3 parts to simplify the review:

  • replace Metrics with Collectors to gain the ability to delete metrics, so Elements in CounterContainer stores CounterVecs instead of Counters (see the sketch after this list)
  • implement label-values storage in Exporter to detect metric staleness
  • implement a configurable timeout (ttl): add a ttl key to the defaults section and to mapping sections
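
To illustrate the first part, here is a minimal Go sketch (illustrative only, not code from this patch; the metric name and label are borrowed from the example above) of what switching to a *prometheus.CounterVec buys: unlike a plain Counter, a vector lets a single label-value combination be deleted again later.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Hypothetical metric for illustration only.
	retries := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "ingress_nginx_upstream_retries_count",
			Help: "Example counter with a per-pod label.",
		},
		[]string{"pod"},
	)
	prometheus.MustRegister(retries)

	// A new time series appears for each pod name seen...
	retries.WithLabelValues("web-5d9c7b-abcde").Inc()

	// ...and can be dropped once that pod name has gone stale.
	retries.DeleteLabelValues("web-5d9c7b-abcde")
}
```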

ttl is a timeout in seconds for stale metrics. The Exporter saves each label-values set and updates the last time it received a metric with those label values. If no metric with the same label values is received for ttl seconds, statsd_exporter stops reporting the metric with that label-values set.
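
The staleness tracking could look roughly like the following sketch (assumed names and structure, not the exporter's actual code): remember when each label-value set was last seen and delete series whose ttl has elapsed; a ttl of 0 means the series never expires.

```go
package staleness

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// seriesEntry remembers when a particular label-value set was last updated.
type seriesEntry struct {
	labelValues []string
	lastSeen    time.Time
	ttl         time.Duration
}

type staleTracker struct {
	vec    *prometheus.CounterVec
	series map[string]*seriesEntry // keyed by a hash of metric name + label values
}

// observe records a sample and refreshes the last-seen timestamp for its labels.
func (t *staleTracker) observe(key string, labelValues []string, ttl time.Duration) {
	t.vec.WithLabelValues(labelValues...).Inc()
	t.series[key] = &seriesEntry{labelValues: labelValues, lastSeen: time.Now(), ttl: ttl}
}

// removeStale drops every series whose ttl has elapsed since it was last seen.
func (t *staleTracker) removeStale(now time.Time) {
	for key, e := range t.series {
		if e.ttl > 0 && now.Sub(e.lastSeen) >= e.ttl {
			t.vec.DeleteLabelValues(e.labelValues...)
			delete(t.series, key)
		}
	}
}
```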

Marking this as WIP. It would be great to hear your thoughts on this.

@diafour (Contributor, Author) commented Nov 13, 2018

Tests are broken for now because of changes to the Elements fields. I will fix them during the day.

@matthiasr (Contributor)

Awesome, thank you for contributing! It will take me a little while to go through the code changes in detail, but I wanted to say that I think the general approach is good.

@R0quef0rt

This is a great idea. Really hoping to see it implemented soon.

I work in an environment where our central infrastructure is in the cloud, but our clients are behind corporate firewalls. We're developing an application-specific plugin that will send metrics to Prometheus - but we keep running into this problem.

The pushgateway - and now, statsd - currently "cache" metrics indefinitely. This makes it difficult to determine whether a client has gone down, because its metrics are persisted indefinitely.

Here's hoping we'll see a TTL feature added soon!

@matthiasr (Contributor) commented Nov 27, 2018 via email

@diafour (Contributor, Author) commented Nov 27, 2018

Tests are good now. The problem was in the TestHistogramUnits test, which mocks a Histogram metric. It is resolved by writing a dto.Metric, as in client_golang: https://github.com/prometheus/client_golang/blob/master/prometheus/examples_test.go#L497-L498

hashNameAndLabels is used to create a key for label values in the intermediate storage (Exporter.labelValues).

labelNames got a proper name.
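
For reference, a rough sketch of that technique in a hypothetical test (not the actual exporter_test code): write the histogram's accumulated state into a dto.Metric and inspect it.

```go
package exporter

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func TestHistogramWriteSketch(t *testing.T) {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "test_histogram_seconds",
		Help: "Histogram used only in this sketch.",
	})
	h.Observe(0.35)

	// Dump the collected state into a dto.Metric, as in the client_golang examples.
	m := &dto.Metric{}
	if err := h.Write(m); err != nil {
		t.Fatal(err)
	}
	if got := m.GetHistogram().GetSampleCount(); got != 1 {
		t.Errorf("expected 1 observation, got %d", got)
	}
}
```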

@diafour (Contributor, Author) commented Dec 7, 2018

@matthiasr Any comments? statsd_exporter with this patch is now running on 5 production clusters and works well.

@matthiasr (Contributor)

Sorry, I got distracted and then forgot. The title still says WIP?

@matthiasr (Contributor) left a review comment

Looks pretty good! Only one concern: The exporter allows live-reloading of the configuration, so the label set or TTL of a given metric can change during the lifetime of the exporter. What happens in that case?

Could you please add tests for the added mapping fields, and tests that demonstrate/verify the expiry behaviour?

Please update the documentation (README) as well.

@matthiasr (Contributor)

What happens when there are two mappings with different TTLs mapping to the same metric name / label? That is a perfectly valid situation, and we need to at least document it.

@diafour (Contributor, Author) commented Dec 13, 2018

Could you please add tests for the added mapping fields, and tests that demonstrate/verify the expiry behaviour?

Done

Please update the documentation (README) as well.

Done

The exporter allows live-reloading of the configuration, so the label set or TTL of a given metric can change during the lifetime of the exporter. What happens in that case?

A modified ttl for a metric will not take effect until handleEvent is called with a new value for that metric. In theory one might want to expire a stale metric that has ttl: 30m without a reload, by decreasing the ttl to 1s; that is not possible now. Should we handle such a situation? I think it would require some kind of signalling about config reloads from main to the exporter.

What happens when there are two mappings with different TTLs mapping to the same metric name / label? That is a perfectly valid situation, and we need to at least document it.

handleEvent uses mapper.GetMapping, which uses FSM.GetMapping, to get a *MetricMapping by metric name and metric type. This mapping defines the ttl to use. As far as I can see, labels are not used to find a mapping. Can you explain the algorithm for finding the mapping in the FSM? Maybe you have an example config to run some tests against?

@diafour changed the title from "WIP: Remove stale metrics" to "Remove stale metrics" on Dec 13, 2018
@diafour (Contributor, Author) commented Dec 17, 2018

@matthiasr Some comments about the tests. The previous approach of getting a Metric instance and calling Write works well for simple cases, such as testing a single value or testing values in one goroutine. So I now use a Gatherer, which is thread-safe and closer to what a scraper gets. The Gather method returns many additional metrics, but for test purposes that is not important.
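
A rough sketch of that approach, with assumed names rather than the actual test code: register the collector with a Registry and call Gather, which is safe for concurrent use and returns the same view a scraper would see.

```go
package exporter

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

func TestGatherSketch(t *testing.T) {
	reg := prometheus.NewRegistry()
	c := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "foo_total", Help: "Sketch counter."},
		[]string{"pod"},
	)
	reg.MustRegister(c)
	c.WithLabelValues("web-1").Inc()

	// Gather is safe to call concurrently and reflects what a scraper would get.
	families, err := reg.Gather()
	if err != nil {
		t.Fatal(err)
	}
	found := false
	for _, mf := range families {
		if mf.GetName() == "foo_total" {
			found = true
			if len(mf.GetMetric()) != 1 {
				t.Errorf("expected 1 series, got %d", len(mf.GetMetric()))
			}
		}
	}
	if !found {
		t.Error("foo_total not found in gathered metrics")
	}
}
```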

@matthiasr (Contributor)

Should we handle such situation? I think it requires some kind of signalization about config reloads from main to exporter.

I think for now it's okay to document this and leave it be. If someone really needs instant expiration, they can restart or implement that signalling.

@matthiasr (Contributor)

Can you explain the algorithm for finding the mapping in the FSM?

The FSM finds exactly one mapping for a given statsd metric name. The Prometheus metric name and labels are an output of that. I think what will happen in this case is that whichever mapping last matched will win. Again, I think that's okay as long as we document it.

@matthiasr (Contributor)

Never mind my last comment, I was confusing myself about what level TTLs are applied at. I don't expect anyone to have multiple mappings to the same Prometheus metric name, labels, and label values, so we don't need to cover that case. If you're really mapping multiple event streams to the same metric, all bets are off, but last-event-wins is what you get either way.

@matthiasr (Contributor) left a review comment

Soooo cloooose …

The go.mod merge conflict is from #171 which fixed the build after Go 1.11.4 was released. Could you rebase once more please?

The README looks almost good, I proposed some edits for language, I would like to strike the mention of Kubernetes (in that case, TTLs are the wrong solution), and I wrote up #164 (comment)

Do you think you can change the tests to not be wall-clock-dependent? That tends to make them flaky in the long run, which I would really like to avoid because it will make future contributors' lives hard.

matthiasr pushed a commit that referenced this pull request on Dec 17, 2018:

    I'd like to add more detail, but I'm not sure I understand the
    implications just yet, and they will probably change with #164.

    Signed-off-by: Matthias Rampke <[email protected]>

Commit messages from this pull request:

  • use MetricVec family instead of Metric
  • dynamic label values instead of ConstLabels
  • use dto.Metric to obtain the histogram value in exporter_test
  • remove hash calculations

    Signed-off-by: Ivan Mikheykin <[email protected]>

  • ttl is hardcoded, should be in mapping.yaml
  • works with metrics without labels

    Signed-off-by: Ivan Mikheykin <[email protected]>
    Signed-off-by: Ivan Mikheykin <[email protected]>
@diafour (Contributor, Author) commented Dec 19, 2018

The branch is rebased. The new Prometheus client brings up an error, "Histogram is not Observer". The Get methods for Histogram and Summary now return prometheus.Observer to fix this issue.
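
For context, a small illustrative sketch (assumed function name, not code from this PR) of why the return type changes: looking up an element of a *prometheus.HistogramVec yields a prometheus.Observer rather than a concrete Histogram, so the container's Get method has to return Observer as well.

```go
package exporter

import "github.com/prometheus/client_golang/prometheus"

// observeTimer is a hypothetical accessor: HistogramVec (and SummaryVec)
// elements are exposed through the prometheus.Observer interface.
func observeTimer(hv *prometheus.HistogramVec, pod string, seconds float64) error {
	obs, err := hv.GetMetricWithLabelValues(pod)
	if err != nil {
		return err
	}
	obs.Observe(seconds)
	return nil
}
```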

All changes to the README are applied, the paragraph about Kubernetes is deleted, and your suggestion is added.

Tests no longer depend on time.Sleep:

  • time.Now and time.NewTicker are wrapped in a clock package so they can be overridden in tests (see the sketch after this list)
  • time.Sleep was used to wait for handleEvent to finish; making the events channel synchronous works just as well
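
A minimal sketch of the wrapping idea (assumed structure; the actual clock package in this PR may differ): route time.Now through a replaceable hook so tests can pin the clock instead of sleeping.

```go
package clock

import "time"

// nowFunc is the hook tests can replace; production code uses the real clock.
var nowFunc = time.Now

// Now returns the current (possibly faked) time.
func Now() time.Time {
	return nowFunc()
}

// SetFixedNow pins the clock to a fixed instant; tests call this instead of sleeping.
func SetFixedNow(t time.Time) {
	nowFunc = func() time.Time { return t }
}
```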

@matthiasr (Contributor)

Thank you very much!

@matthiasr merged commit 7364c6f into prometheus:master on Dec 20, 2018
matthiasr pushed a commit that referenced this pull request on Dec 20, 2018 (Signed-off-by: Matthias Rampke <[email protected]>)