[IMP] queue_job: Recover gracefully when the CPU limit is reached #419
Hi @guewen, |
Seems legit, though I don't know this part deeply 🤷‍♂️
The principle is elegant. One small comment below, to understand how the method is triggered.
        _logger.debug("WorkerJobRunner (%s) starting up", self.pid)
        time.sleep(START_DELAY)
        self.runner.run()

    def signal_time_expired_handler(self, n, stack):
Any chance you could point out how and when this method will be called?
Sure! The function is called once the CPU time limit is reached. It is preconfigured on the Worker here:
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L967
The original value can be found here
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L911-L916
The CPU max time value is defined here.
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L948
Once this time has been reached, the handler is executed every second. With the original handler, an exception was raised.
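To make the mechanism concrete, here is a minimal sketch of how a process can arm a soft CPU limit and be notified through SIGXCPU. It is loosely modeled on the description above and is not the Odoo source; `register_cpu_handler` and `arm_cpu_limit` are hypothetical names. On Linux, once the soft limit is exceeded the kernel sends SIGXCPU once per second until the hard limit is reached.

```python
import resource
import signal


def register_cpu_handler(handler):
    """Install `handler` for SIGXCPU and return the previous handler."""
    return signal.signal(signal.SIGXCPU, handler)


def arm_cpu_limit(extra_seconds):
    """Set the soft CPU limit to (CPU time used so far + extra_seconds).

    Hypothetical helper: the real worker may compute this differently.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    used = usage.ru_utime + usage.ru_stime
    _soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    new_soft = int(used + extra_seconds)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may never exceed the hard limit.
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (new_soft, hard))
    return new_soft
```

A handler that raises an exception interrupts whatever the process is doing at that moment, which is why the place where the signal lands matters for the job runner.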
@etobella that's an interesting bug. It would not have happened with my initial implementation, where the job runner ran as a thread in the main process, which is not subject to CPU limits. Now that the job runner is in a separate worker, this can indeed happen. However, I think this could also happen if the worker receives a SIGTERM or SIGINT between these two lines (which has always been the case). So I wonder if we should protect against this situation by masking signals around https://github.com/OCA/queue/blob/14.0/queue_job/jobrunner/runner.py#L425-L426. As a side note, we are currently experimenting with running the job runner in a simple process (#409), where it is not sensitive to the constraints of Odoo workers.
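For readers unfamiliar with signal masking, the idea can be sketched as follows. This is an illustration under assumptions, not code from the PR or from queue_job: `signals_blocked` and `enqueue_atomically` are hypothetical names, and the two callables stand in for the two lines linked above. It relies on `signal.pthread_sigmask`, which is available on POSIX systems only.

```python
import contextlib
import signal


@contextlib.contextmanager
def signals_blocked(signums):
    """Defer delivery of `signums` while the with-body runs (POSIX only)."""
    previous = signal.pthread_sigmask(signal.SIG_BLOCK, signums)
    try:
        yield
    finally:
        # Restoring the old mask lets any pending signal be delivered now.
        signal.pthread_sigmask(signal.SIG_SETMASK, previous)


def enqueue_atomically(
    mark_enqueued,
    send_request,
    signums=(signal.SIGXCPU, signal.SIGTERM, signal.SIGINT),
):
    """Run both steps without a signal handler firing in between.

    `mark_enqueued` and `send_request` are hypothetical callables.
    """
    with signals_blocked(signums):
        mark_enqueued()
        send_request()
```

A signal that arrives inside the block stays pending and its handler runs only after both steps have completed, so the job cannot be left half-enqueued by the handler.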
So, do you prefer to do the masking there? Or should I keep my original idea?
I prefer the signal masking yes:
SIGTERM and SIGINT are already handled in the code. You can see the masking in the Odoo code.
@etobella you are right. There is nothing else to do for SIGTERM and SIGINT. But then why not simply ignore SIGXCPU instead of restarting the job runner? AFAIK there is no memory leak in the job runner, so there is no need to restart it.
Odoo workers restart after signal_time_expired_handler fires, so the queue worker would raise the error every second once it has reached the time limit. So it seems legit to me to restart the worker, following the same logic defined by Odoo.
Ok, why not.
/ocabot merge patch
Hey, thanks for contributing! Proceeding to merge this for you.
Congratulations, your PR was merged at 3e51d0e. Thanks a lot for contributing to OCA. ❤️
It is a strange case, but it can happen. Workers are reinitialized when they reach the CPU time limit (https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L911-L916).
But what happens if the limit is reached between these lines?
https://github.com/OCA/queue/blob/14.0/queue_job/jobrunner/runner.py#L425-L426
I found the answer on my instance: the job is marked as enqueued, but it is never sent to Odoo and will never be executed. Also, one channel slot will no longer be used, because it is considered already filled.
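The failure mode can be reproduced with a toy model. This is a simplified sketch and not the queue_job runner: `Channel`, `run_once` and `send` are hypothetical names, and an exception raised from `send` stands in for the CPU-limit handler interrupting the runner between the two steps.

```python
class Channel:
    """Toy channel with a fixed number of slots (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.enqueued = set()

    def has_capacity(self):
        return len(self.enqueued) < self.capacity


def run_once(channel, pending, send):
    """Enqueue pending jobs while the channel has free slots."""
    sent = []
    while pending and channel.has_capacity():
        job = pending.pop(0)
        channel.enqueued.add(job)  # step 1: the job now counts as enqueued
        send(job)                  # step 2: an interruption here strands the job
        sent.append(job)
    return sent
```

If `send` raises for the first job on a channel with a capacity of 1, that job stays in `enqueued` without ever having been sent, and every later call to `run_once` finds no free slot.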
How to test it?
root:1
) and cpu_time_limit=5
test.queue.job
With my solution, it will finalize the runner and start it again.
@guewen @simahawk