Make custom metrics work with gunicorn reload #2873

Closed
anggao opened this issue Jan 22, 2021 · 5 comments · Fixed by #3018

Comments

@anggao
Contributor

anggao commented Jan 22, 2021

Right now the custom metrics are exposed through a process in the model container. Since the metrics contain a worker-id label, each restart/reload of gunicorn creates a new data series. This causes an issue: both the old and the new data series exist and are sent back in the Prometheus scrape response.

With an auto-reload process in our model (in order to avoid potential OOM), this results in a Prometheus response of over 10MB after the model has been running for a while, which causes scrape timeouts and a large memory bump on the Prometheus server.

I think we need a better way to work around this issue, as reloading gunicorn seems to be a common way to avoid OOM in model deployments.
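
For context, a minimal standalone sketch of the mechanism (using prometheus_client directly, with a hypothetical metric name; this is not Seldon Core's actual code): every reload yields a new worker id, hence a new label combination and a new series, while the old series stay in the scrape output.

```python
# Standalone illustration (not Seldon Core's actual code) of why a per-worker
# label multiplies Prometheus series: each generation of gunicorn workers adds
# a new label value, and series from dead workers never disappear.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Hypothetical custom metric carrying a worker-id label, similar in spirit to
# the labels described in this issue.
requests_total = Counter(
    "my_model_requests_total",
    "Requests handled by the model",
    labelnames=["worker_id"],
    registry=registry,
)

# Simulate three generations of workers created by successive reloads; each
# generation reports under its own (fake) pid, so each adds a new series.
for fake_pid in ("101", "202", "303"):
    requests_total.labels(worker_id=fake_pid).inc()

# The scrape payload now contains one series per worker id that has ever
# existed, which is what makes the response grow over time.
print(generate_latest(registry).decode())
```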

anggao added the triage label (Needs to be triaged and prioritised accordingly) on Jan 22, 2021
@anggao
Contributor Author

anggao commented Jan 22, 2021

@axsaucedo @RafalSkolasinski FYI

ukclivecox added the priority/p0 label and removed the triage label on Jan 28, 2021
@anggao
Contributor Author

anggao commented Feb 9, 2021

@axsaucedo @RafalSkolasinski any updates on this ticket?

@axsaucedo
Contributor

@anggao we discussed it yesterday morning as we started revisiting the discussion around persistence, but it seems quite nuanced; we haven't been able to identify a simple way to address this that doesn't end up being a relatively big hack... We could remove the worker ID, or provide an option to disable the worker ID through an env variable, but that may only address this edge case. Not sure if you would have any thoughts on this, @cliveseldon?
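
For illustration only, one possible shape of the env-variable option mentioned above; the variable name INCLUDE_WORKER_ID_LABEL and the metric are hypothetical, not existing Seldon Core settings.

```python
# Sketch of an env-var gate that drops the worker_id label entirely when the
# (hypothetical) INCLUDE_WORKER_ID_LABEL variable is set to "false".
import os

from prometheus_client import CollectorRegistry, Counter

INCLUDE_WORKER_ID = os.environ.get("INCLUDE_WORKER_ID_LABEL", "true").lower() == "true"

registry = CollectorRegistry()

# Only include worker_id among the label names when the option is enabled.
label_names = ["model_name"]
if INCLUDE_WORKER_ID:
    label_names.append("worker_id")

requests_total = Counter(
    "my_model_requests_total",
    "Requests handled by the model",
    labelnames=label_names,
    registry=registry,
)


def record_request(model_name: str) -> None:
    """Increment the counter, attaching worker_id only if the option is on."""
    labels = {"model_name": model_name}
    if INCLUDE_WORKER_ID:
        labels["worker_id"] = str(os.getpid())
    requests_total.labels(**labels).inc()
```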

@anggao
Contributor Author

anggao commented Feb 9, 2021

@axsaucedo Thank you! Can you elaborate on how you plan to get rid of the worker id? Are you planning to do aggregation at the python server layer?
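
As a point of reference only (not necessarily how Seldon Core will solve this), prometheus_client's multiprocess mode is a common way to aggregate custom metrics across gunicorn workers without a per-worker label. The sketch below assumes the PROMETHEUS_MULTIPROC_DIR env variable points at a writable directory.

```python
# Each worker writes its metric values into files under
# PROMETHEUS_MULTIPROC_DIR; the scrape endpoint merges them on demand, so no
# worker_id label is needed and reloads do not create new series.
from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def metrics_endpoint() -> bytes:
    """Return a scrape payload aggregated over all current worker processes."""
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)


# In the gunicorn config file, dead workers can be cleaned up via the
# child_exit hook so their data does not linger:
#
# def child_exit(server, worker):
#     multiprocess.mark_process_dead(worker.pid)
```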

@RafalSkolasinski
Contributor

Following up a bit more on the discussion we had, the current understanding of the issue is as follows:

  • SC stores custom metrics in a map: worker_id -> metrics
  • When workers die, the new workers get new ids, but the old entries are still kept in memory
  • Effectively, the SC python server keeps exposing metrics to Prometheus as if there were more and more live workers (see the sketch below)

The following graph tries to visualise the issue:
[image: arbitrary custom metric values over time for successive sets of workers]

x-axis: time
y-axis: arbitrary metric values (offset for each set of workers)

When workers are killed and new ones are created, the old metrics remain exposed (the blue horizontal lines).
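
A minimal sketch of the failure mode described in the bullets above, assuming worker ids are process ids (Seldon Core's actual data structures will differ), together with one possible pruning step:

```python
# Illustrative worker_id -> metrics map: entries accumulate as workers are
# replaced, so the scrape output grows unless stale entries are pruned.
import os
from typing import Dict

# worker_id -> {metric_name: value}
metrics_by_worker: Dict[int, Dict[str, float]] = {}


def record(worker_id: int, name: str, value: float) -> None:
    """Store a metric value under the reporting worker's id."""
    metrics_by_worker.setdefault(worker_id, {})[name] = value


def prune_dead_workers() -> None:
    """Drop entries whose worker process no longer exists (Unix liveness check)."""
    for worker_id in list(metrics_by_worker):
        try:
            os.kill(worker_id, 0)  # signal 0 only checks that the pid exists
        except OSError:
            del metrics_by_worker[worker_id]


def scrape_response() -> str:
    """Expose one line per (worker, metric); without pruning this grows forever."""
    lines = []
    for worker_id, metrics in sorted(metrics_by_worker.items()):
        for name, value in sorted(metrics.items()):
            lines.append(f'{name}{{worker_id="{worker_id}"}} {value}')
    return "\n".join(lines)
```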
