Ever-increasing timeseries count #904
Replies: 7 comments 5 replies
-
You are not violating any Prometheus rules by doing this; in fact, there are entire projects such as mtail and grok_exporter that create metrics from logs. My guess is that there is some sort of condition where there is a high-cardinality label that comes from …
-
Thanks for taking a stab at this. I guessed so, but couldn't find any shenanigans from the … If I look at the number of series in the head block, it keeps increasing. I was expecting it to increase until Prometheus "sees" all the metrics and then stabilise around a particular value (only churning if there are changes in tags: version, etc.), but it keeps on increasing.
-
@csmarchbanks I had one more question. Some of the tags I noticed are not high cardinality in, say, the last 2 days, but have high churn, so over a month they might have ~1000 unique values. I feel that will cause the Prometheus active series count to go up steadily. Is it possible to make Prometheus forget series that it has not seen in the last X hours? I am using Prometheus in agent mode to only remote-write to Grafana. I am using …
-
Hi @csmarchbanks
Does this sound okay? I would appreciate it if you could enlighten me with a better approach.
-
I ran this locally and the POC looks good. Will try this out in the actual cluster. @csmarchbanks
-
Hi @csmarchbanks, I have been running the system with Soln 1 (custom registry, delete + create periodically) and it has been performing well, removing stale metrics and freeing up memory. I am planning to give Soln 2 a try as well, since that would solve the problem without resetting the active time series. I see the custom collector classes here also allow setting a timestamp when adding a metric, which can come in super handy: I would just need to compare it against the current time and delete the samples that are super old. I still need to figure out how to access/delete the samples, but it can look something like this:
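A hypothetical sketch of that idea (the class name, the `inc` interface, and the six-hour default are all invented; `CounterMetricFamily` is the real prometheus_client custom-collector building block, but the pruning logic is only an illustration of comparing stored timestamps against the current time):

```python
import threading
import time

from prometheus_client.core import CounterMetricFamily


class ExpiringLogCollector:
    """Custom collector that forgets label sets not updated recently.

    Hypothetical sketch: names and structure are illustrative, not the
    actual code from this thread.
    """

    def __init__(self, max_age_seconds=6 * 3600):
        self._max_age = max_age_seconds
        self._lock = threading.Lock()
        # tuple of label values -> [count, last_seen_timestamp]
        self._samples = {}

    def inc(self, label_values, amount=1.0):
        now = time.time()
        with self._lock:
            entry = self._samples.setdefault(tuple(label_values), [0.0, now])
            entry[0] += amount
            entry[1] = now

    def collect(self):
        cutoff = time.time() - self._max_age
        family = CounterMetricFamily(
            "log_events", "Events parsed from log lines", labels=["event"]
        )
        with self._lock:
            # Prune samples that are "super old" before exposing the rest.
            stale = [k for k, (_, ts) in self._samples.items() if ts < cutoff]
            for key in stale:
                del self._samples[key]
            for key, (count, _) in self._samples.items():
                family.add_metric(list(key), count)
        yield family
```

The collector would then be registered with `REGISTRY.register(...)` so pruning happens on every scrape.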
Does this look okay'ish to you?
-
Thanks @csmarchbanks. I did end up using the regular Counter/Histogram classes with a separate dict per metric, with key = tuple of label values and value = last-seen timestamp, plus a cron job that iterates over that dict and calls metric.remove(label_value_tuple). It seems to be working, and this approach seems better to me since it doesn't reset ALL the timeseries in the system. I think I will go ahead with this. Thanks for your help here.
-
I am using this Prometheus Python client to convert logs into Prometheus metrics, and I see that the number of timeseries keeps monotonically increasing.
My code roughly looks like this:
Am I violating any Prometheus rule by keeping a local mapping of metrics and dynamically creating them by parsing log lines?