Make custom metrics work with gunicorn reload #2873

Closed
anggao opened this issue Jan 22, 2021 · 5 comments · Fixed by #3018

Comments

@anggao
Contributor

anggao commented Jan 22, 2021

Right now the custom metrics are exposed through a process in the model container. Since the metrics contain a worker-id label, each restart/reload of gunicorn creates a new data series. This causes an issue: both the old and the new data series exist and are sent back in the Prometheus scrape response.

With an auto-reload process in our model (in order to avoid potential OOM), this results in a Prometheus response of over 10MB after the model has been running for a while, which causes scrape timeouts and a large memory bump on the Prometheus server.

I think we need a better way to work around this issue, as reloading gunicorn seems to be a common way to avoid OOM in model deployments.
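
For context, a minimal standalone sketch of the mechanism (using prometheus_client directly, with a hypothetical metric name; this is not Seldon Core's actual code): every reload yields a new worker id, hence a new label combination and a new series, while the old series stay in the scrape output.

```python
# Standalone illustration (not Seldon Core's actual code) of why a per-worker
# label multiplies Prometheus series: each generation of gunicorn workers adds
# a new label value, and series from dead workers never disappear.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Hypothetical custom metric carrying a worker-id label, similar in spirit to
# the labels described in this issue.
requests_total = Counter(
    "my_model_requests_total",
    "Requests handled by the model",
    labelnames=["worker_id"],
    registry=registry,
)

# Simulate three generations of workers created by successive reloads; each
# generation reports under its own (fake) pid, so each adds a new series.
for fake_pid in ("101", "202", "303"):
    requests_total.labels(worker_id=fake_pid).inc()

# The scrape payload now contains one series per worker id that has ever
# existed, which is what makes the response grow over time.
print(generate_latest(registry).decode())
```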

anggao added the triage label (Needs to be triaged and prioritised accordingly) on Jan 22, 2021
@anggao
Contributor Author

anggao commented Jan 22, 2021

@axsaucedo @RafalSkolasinski FYI

ukclivecox added the priority/p0 label and removed the triage label on Jan 28, 2021
@anggao
Contributor Author

anggao commented Feb 9, 2021

@axsaucedo @RafalSkolasinski any updates on this ticket?

@axsaucedo
Contributor

@anggao we discussed it yesterday morning as we started revisiting the discussion around persistence, but it seems quite nuanced; we haven't been able to identify a simple way to address this that doesn't end up being a relatively big hack... We could remove the worker ID, or provide an option to disable the worker ID through an env variable, but that may only address this edge case. Not sure if you would have any thoughts on this, @cliveseldon?
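
For illustration only, one possible shape of the env-variable option mentioned above; the variable name INCLUDE_WORKER_ID_LABEL and the metric are hypothetical, not existing Seldon Core settings.

```python
# Sketch of an env-var gate that drops the worker_id label entirely when the
# (hypothetical) INCLUDE_WORKER_ID_LABEL variable is set to "false".
import os

from prometheus_client import CollectorRegistry, Counter

INCLUDE_WORKER_ID = os.environ.get("INCLUDE_WORKER_ID_LABEL", "true").lower() == "true"

registry = CollectorRegistry()

# Only include worker_id among the label names when the option is enabled.
label_names = ["model_name"]
if INCLUDE_WORKER_ID:
    label_names.append("worker_id")

requests_total = Counter(
    "my_model_requests_total",
    "Requests handled by the model",
    labelnames=label_names,
    registry=registry,
)


def record_request(model_name: str) -> None:
    """Increment the counter, attaching worker_id only if the option is on."""
    labels = {"model_name": model_name}
    if INCLUDE_WORKER_ID:
        labels["worker_id"] = str(os.getpid())
    requests_total.labels(**labels).inc()
```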

@anggao
Contributor Author

anggao commented Feb 9, 2021

@axsaucedo Thank you! Can you elaborate on how you plan to get rid of the worker id? Are you planning to do aggregation at the python server layer?
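
As a point of reference only (not necessarily how Seldon Core will solve this), prometheus_client's multiprocess mode is a common way to aggregate custom metrics across gunicorn workers without a per-worker label. The sketch below assumes the PROMETHEUS_MULTIPROC_DIR env variable points at a writable directory.

```python
# Each worker writes its metric values into files under
# PROMETHEUS_MULTIPROC_DIR; the scrape endpoint merges them on demand, so no
# worker_id label is needed and reloads do not create new series.
from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def metrics_endpoint() -> bytes:
    """Return a scrape payload aggregated over all current worker processes."""
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)


# In the gunicorn config file, dead workers can be cleaned up via the
# child_exit hook so their data does not linger:
#
# def child_exit(server, worker):
#     multiprocess.mark_process_dead(worker.pid)
```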

@RafalSkolasinski
Contributor

Following up a bit more on the discussion we had, the current understanding of the issue is as follows:

  • SC stores custom metrics in a map: worker_id -> metrics
  • When workers die, the new workers get new ids, but the old entries are still kept in memory
  • Effectively, the SC python server keeps exposing metrics to Prometheus as if there were more and more live workers (see the sketch below)

The following graph tries to visualise the issue:
[image: arbitrary custom metric values over time for successive sets of workers]

x-axis: time
y-axis: arbitrary metric values (offset for each set of workers)

When workers are killed and new ones are created, the old metrics remain exposed (the blue horizontal lines).
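
A minimal sketch of the failure mode described in the bullets above, assuming worker ids are process ids (Seldon Core's actual data structures will differ), together with one possible pruning step:

```python
# Illustrative worker_id -> metrics map: entries accumulate as workers are
# replaced, so the scrape output grows unless stale entries are pruned.
import os
from typing import Dict

# worker_id -> {metric_name: value}
metrics_by_worker: Dict[int, Dict[str, float]] = {}


def record(worker_id: int, name: str, value: float) -> None:
    """Store a metric value under the reporting worker's id."""
    metrics_by_worker.setdefault(worker_id, {})[name] = value


def prune_dead_workers() -> None:
    """Drop entries whose worker process no longer exists (Unix liveness check)."""
    for worker_id in list(metrics_by_worker):
        try:
            os.kill(worker_id, 0)  # signal 0 only checks that the pid exists
        except OSError:
            del metrics_by_worker[worker_id]


def scrape_response() -> str:
    """Expose one line per (worker, metric); without pruning this grows forever."""
    lines = []
    for worker_id, metrics in sorted(metrics_by_worker.items()):
        for name, value in sorted(metrics.items()):
            lines.append(f'{name}{{worker_id="{worker_id}"}} {value}')
    return "\n".join(lines)
```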
