Update prometheus helmfiles and rules to better use metric labels #304

Merged
willgraf merged 11 commits into stable from willgraf/prometheus-update
Mar 28, 2020

@willgraf commented Mar 27, 2020

To improve autoscaling, each Prometheus-related chart has been updated to the latest version. While updating the charts, I found that some of the values had become stale (especially the image version), so I removed all values that we do not need or want to set ourselves. Those values now inherit from each chart's defaults, which are linked in the helmfile.

  • Updated prometheus-redis-exporter to 3.3.3
  • Updated prometheus-operator to 8.12.3
  • Updated prometheus-adapter to 2.1.3

Additionally, Prometheus uses labels to match series when doing math on two or more time series. To get our labels to match up, I made a few changes:

  • Update the redis-exporter to have the key name be queue and queue-zip instead of queue_image_keys and queue_zip_keys.
  • Add a metric_relabel_config to the redis-exporter prometheus job to take the queue name and include a new label, deployment="queue-consumer".
  • Rename zip-consumer to segmentation-zip-consumer so that its labels match the queue name.
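
For reviewers less familiar with relabeling, the second change above can be sketched roughly as follows. This is a minimal illustration, not the exact config in this PR: the job name and target address are placeholders, and it assumes the exporter exposes the Redis key name in a `key` label.

```yaml
# Illustrative scrape job; job_name and targets are assumed placeholders.
scrape_configs:
  - job_name: redis-exporter
    static_configs:
      - targets: ["prometheus-redis-exporter:9121"]
    metric_relabel_configs:
      # For any series that carries a non-empty "key" label, add a
      # "deployment" label equal to the key name plus "-consumer".
      # Series without a "key" label do not match and are left unchanged.
      - source_labels: [key]
        regex: (.+)
        target_label: deployment
        replacement: ${1}-consumer
        action: replace
```

With a rule like this, a key named `X` produces `deployment="X-consumer"`, which is the label that deployment-level metrics already carry for a consumer deployed under that name.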

Also, I updated our rules:

  • Added new rules consumer_key_ratio and consumers_per_gpu which use the new labels to calculate stats for all deployed consumers (if their name X-consumer matches the queue X).
  • Changed the GPU metrics to a single tf_serving_gpu_usage metric.
  • Removed the avg_over_time calls, since they were outputting discrete points instead of a smooth line.
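
To illustrate how a shared `deployment` label lets one rule cover every consumer, here is a hedged sketch of a recording rule. It is not the expression used for consumer_key_ratio in this PR: the rule name is hypothetical, and it assumes `redis_key_size` (from the redis exporter) already carries the relabeled `deployment` label and that `kube_deployment_status_replicas_available` is available from kube-state-metrics.

```yaml
groups:
  - name: consumer-autoscaling-example
    rules:
      # Hypothetical rule: queued keys per available consumer pod.
      # Both sides are aggregated down to the "deployment" label, so the
      # division pairs each queue with the consumer of the matching name.
      - record: example_keys_per_consumer
        expr: |
          sum by (deployment) (redis_key_size)
            / on (deployment)
          sum by (deployment) (kube_deployment_status_replicas_available)
```

Because the join is on `deployment`, a newly deployed `X-consumer` is picked up without adding a rule, as long as its queue key is named `X`.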

Using .75 for the backoff coefficient (instead of .9) for now, as it anecdotally seems to work. The ideal number could be higher or lower.

@MekWarrior left a comment

Looks good, well done!

@willgraf merged commit 465f869 into stable Mar 28, 2020
@willgraf deleted the willgraf/prometheus-update branch March 28, 2020 22:23
willgraf added a commit that referenced this pull request May 23, 2020
* Update prometheus-operator, prometheus-adapter, and prometheus-redis-exporter helm charts and remove stale default values

* Relabel redis-exporter with `deployment=$QUEUE-consumer` and change key to be `$QUEUE`

* Rename zip-consumer to segmentation-zip-consumer to match labels.

* Using .75 instead of .9 for backoff coefficient.