Metrics for periodic jobs are creating too many timeseries due to timestamps in the labels #4061

Closed
sevagh opened this issue Mar 28, 2018 · 8 comments · Fixed by #4392

Comments

@sevagh
Contributor

sevagh commented Mar 28, 2018

Hello,
This is some sample output of the Nomad task_group_* metrics:

nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233600",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233601",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233602",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233603",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233604",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233605",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233606",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233607",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233608",task_group="baz"} 1

This is a misuse of labels in Prometheus: https://prometheus.io/docs/practices/naming/#labels

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

There should be a better way of representing this.
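To make the cardinality concern concrete, here is a minimal Python sketch that counts the distinct series in a scrape like the one above. The parsing is deliberately naive (it assumes no commas, quotes, or braces inside label values), and `count_series` is a hypothetical helper, not part of Nomad or Prometheus.

```python
import re

SAMPLE = """\
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233600",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233601",task_group="baz"} 1
nomad_nomad_job_summary_complete{host="foo",job="bar/periodic-1522233602",task_group="baz"} 1
"""

# Metric name, then the raw label block between braces, then the sample value.
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+\S+$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def count_series(text):
    """Count distinct (metric name, label set) combinations in exposition text."""
    series = set()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank lines and comments
        name, raw_labels = m.groups()
        labels = frozenset(LABEL_RE.findall(raw_labels))
        series.add((name, labels))
    return len(series)


print(count_series(SAMPLE))
```

Each launch of the periodic job differs only in the `job` label, yet it counts as a separate series, so the total grows with the number of runs rather than the number of jobs.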

@dadgar
Contributor

dadgar commented Mar 28, 2018

What is your suggestion?

@sevagh
Contributor Author

sevagh commented Mar 28, 2018

I have nothing right now. Can we have this ticket open for brainstorming? I'll be doing some meditation as well and can post what I come up with.

@sevagh
Contributor Author

sevagh commented Mar 28, 2018

A pattern we use (for totally unrelated metrics in Prometheus) is:

last_success_timestamp{} = 1.52225616e+09

With some elbow-grease maybe something similar can be thought up for this case?
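The appeal of this pattern is that the timestamp becomes the sample value rather than a label value, so the series count stays fixed no matter how often the job runs. A minimal sketch of how a consumer might derive staleness from such a gauge, in plain Python (the function names and the one-hour threshold are illustrative assumptions):

```python
import time


def seconds_since_last_success(last_success_timestamp, now=None):
    """Age of the last successful run, given the gauge value in unix seconds."""
    if now is None:
        now = time.time()
    return now - last_success_timestamp


def is_stale(last_success_timestamp, max_age_seconds=3600, now=None):
    """True if the last success is older than the allowed age."""
    return seconds_since_last_success(last_success_timestamp, now) > max_age_seconds


# Gauge value from the example above, checked against a fixed "now"
# two hours later so the result is reproducible.
print(is_stale(1.52225616e+09, now=1.52225616e+09 + 7200))
```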

@dadgar
Contributor

dadgar commented Mar 28, 2018

Sounds good. Improving metrics is always a worthwhile effort!

@sevagh
Contributor Author

sevagh commented Mar 29, 2018

So I'm worried about doing something like this (a Python-style sketch, not the Go that Nomad would actually need; emit_metric is a placeholder):

job_id = job.ID  # e.g. foojob/periodic-XXXXXXX

# Split on the last '-' only, so dashes inside the job name survive.
adjusted_job_id, timestamp = job_id.rsplit("-", 1)  # "foojob/periodic", "XXXXXXX"

emit_metric(
    name="periodic_job_last_run",
    # Python floats are 64-bit; a float32 cannot hold a current unix timestamp exactly.
    value=float(timestamp),
    labels={"job": adjusted_job_id},
)

This emits the metric, but it butchers the way periodic jobs are named within Nomad in the first place, which smells funny to me.

@dylan-ferreira

dylan-ferreira commented Apr 5, 2018

In the meantime, rewriting the label on ingest is a workable stopgap:

    metric_relabel_configs:
      - source_labels: ['exported_job']
        separator: ;
        regex: "^(.+/periodic)-[0-9]+$" # Drop unix-timestamp on nomad job metrics
        target_label: 'exported_job'
        replacement: '$1'
        action: replace

(in the example above, the job label is already taken by service discovery)
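For anyone who wants to check the regex before deploying it, the rule can be emulated in a few lines of Python. This is only an approximation of Prometheus behaviour: Prometheus uses RE2 and anchors relabel regexes at both ends, which `re.fullmatch` mimics here. `relabel_job` is a hypothetical helper.

```python
import re

# Same pattern as the rule above; fullmatch supplies the anchoring.
PERIODIC_RE = re.compile(r"(.+/periodic)-[0-9]+")


def relabel_job(value):
    """Return the label value with a trailing unix timestamp dropped, if present."""
    m = PERIODIC_RE.fullmatch(value)
    # With action: replace, a value that does not match is left untouched.
    return m.group(1) if m else value


for v in ("bar/periodic-1522233600", "bar/periodic-1522233601", "web-frontend"):
    print(v, "->", relabel_job(v))
```

All runs of the same periodic job collapse onto one label value, while jobs that are not periodic pass through unchanged.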

I do like having the indication of a periodic job, but I agree that altering the original job name can lead to confusion. Adding another label, job_type="periodic", may be helpful here.

Rather than last_success_timestamp{} we could have something like the following that could apply to any job type:

# HELP nomad_job_start_time_seconds Start time of the job since unix epoch in seconds.
# TYPE nomad_job_start_time_seconds gauge
nomad_job_start_time_seconds{job="<job_name>", job_type="<job_type>"} <unixtime64>

For periodic tasks, it would be great to have metrics like:

# HELP nomad_job_last_run_time_seconds Time in seconds the last run of the job took to complete.
# TYPE nomad_job_last_run_time_seconds gauge
nomad_job_last_run_time_seconds{job="<job_name>", job_type="<job_type>"} <seconds>
# HELP nomad_job_last_exit_code Exit code from the last run of the job.
# TYPE nomad_job_last_exit_code gauge
nomad_job_last_exit_code{job="<job_name>", job_type="<job_type>"} <exit_code>
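As a rough illustration of what these proposed series would look like on the wire, the sketch below renders one gauge in the Prometheus text exposition format. The metric name comes from the proposal above and is not an existing Nomad metric; `render_gauge` is a hypothetical helper written without any client library, and it does not escape special characters in label values.

```python
def render_gauge(name, help_text, labels, value):
    """Render a single gauge sample with its HELP and TYPE lines."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "\n".join([
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name}{{{label_str}}} {value}",
    ])


print(render_gauge(
    "nomad_job_start_time_seconds",
    "Start time of the job since unix epoch in seconds.",
    {"job": "bar", "job_type": "periodic"},
    1522233600,
))
```

Because the timestamp is carried in the value, the number of series depends only on the number of jobs, not on how many times each one has run.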

@sevagh
Contributor Author

sevagh commented Apr 5, 2018

Great insights, thanks.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022