
[IMP] queue_job: Recover gracefully when the CPU limit is reached #419

Merged · 1 commit merged into OCA:13.0 on Apr 19, 2022

Conversation

@etobella (Member) commented Apr 6, 2022

It is a strange case, but it can happen. Workers are reinitialized when they reach the CPU time limit (https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L911-L916).

But what happens if the limit is hit between these two lines?
https://github.com/OCA/queue/blob/14.0/queue_job/jobrunner/runner.py#L425-L426

I found the answer in my instance: the job is marked as enqueued, but it is never sent to Odoo and never executed... Also, one channel slot can no longer be used, because the runner thinks it is already filled.
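To make the window concrete, here is a hedged paraphrase of the dispatch loop at the linked runner.py lines (simplified, not the verbatim code; names such as get_jobs_to_run and _async_http_get are taken from the runner module):

```python
# Paraphrased from queue_job/jobrunner/runner.py (simplified sketch):
for job in self.channel_manager.get_jobs_to_run(now=_odoo_now()):
    # 1) mark the job as enqueued in the database
    self.db_by_name[job.db_name].set_job_enqueued(job.uuid)
    # <-- if SIGXCPU restarts the worker right here, the job stays
    #     "enqueued" forever and the channel slot is never released
    # 2) ask Odoo to actually run the job over HTTP
    _async_http_get(job.db_name, job.uuid)
```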

How to test it?

  1. I created an instance with 2 workers, the test_queue_job module installed, just 1 channel (root:1) and limit_time_cpu=5 (see the sample configuration after this list).
  2. I created a scheduled action on the model test.queue.job:

```python
for _i in range(0, 1000):
    model.with_delay().testing_method()
```

  3. I executed the action several times to create a lot of jobs. After some minutes (in my case, around 3000 jobs), it happened: a job was marked as enqueued but never executed.
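A minimal odoo.conf matching that setup could look like this (the values come from the steps above; the exact file layout is an assumption, adapt it to your deployment):

```ini
[options]
workers = 2
limit_time_cpu = 5

[queue_job]
channels = root:1
```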

With my solution, the worker finalizes the runner and starts it again.

@guewen @simahawk

@OCA-git-bot (Contributor)

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@simahawk (Contributor) left a comment

Seems legit, I don't know this part deeply tho 🤷‍♂️

@flotho (Member) left a comment

The principle is elegant; just a small comment to understand how the method is triggered.

```python
_logger.debug("WorkerJobRunner (%s) starting up", self.pid)
time.sleep(START_DELAY)
self.runner.run()

def signal_time_expired_handler(self, n, stack):
```
@flotho (Member) commented on the diff:

Any chance you could point out how and when this method will be called?

@etobella (Member, Author):

Sure!! The function is called once the CPU time limit is reached. It is preconfigured on the Worker here:
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L967

The original handler can be found here:
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L911-L916

The CPU max time value is set here:
https://github.com/odoo/odoo/blob/13.0/odoo/service/server.py#L948

Once this time has been reached, the handler is executed every second. With the original function, an exception was raised each time.
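To make the mechanism concrete, here is a hedged paraphrase of what Odoo does at the linked lines (simplified, not the verbatim 13.0 code; self is the worker and config is Odoo's configuration): the worker installs a SIGXCPU handler and sets the soft RLIMIT_CPU just above the CPU time already consumed, so the kernel delivers SIGXCPU once the budget is spent, and again for each further CPU second while the soft limit stays exceeded.

```python
import resource
import signal

# Install the handler that fires when the CPU soft limit is exceeded
signal.signal(signal.SIGXCPU, self.signal_time_expired_handler)

# Soft limit = CPU time already used + limit_time_cpu
r = resource.getrusage(resource.RUSAGE_SELF)
cpu_time = r.ru_utime + r.ru_stime
soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(
    resource.RLIMIT_CPU, (int(cpu_time + config["limit_time_cpu"]), hard)
)
```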

@sbidoul (Member) commented Apr 13, 2022

@etobella that's an interesting bug. It would not have happened with my initial implementation where the job runner was running as a thread in the main process, which is not subject to CPU limits. Now that the job runner is in a separate worker this can indeed happen.

However, I think this could also happen if the worker receives a SIGTERM or SIGINT between these two lines (a risk that has always existed).

So I wonder if we should not protect against this situation by masking signals around https://github.com/OCA/queue/blob/14.0/queue_job/jobrunner/runner.py#L425-L426.
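For illustration, a minimal sketch of what such masking could look like (my own sketch, not code from this PR; it assumes Python 3's signal.pthread_sigmask and the two runner.py calls linked above):

```python
import signal

# Block the signals that could kill the worker between the two operations,
# then restore the previous mask once the job is safely dispatched.
blocked = {signal.SIGXCPU, signal.SIGTERM, signal.SIGINT}
old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
try:
    self.db_by_name[job.db_name].set_job_enqueued(job.uuid)
    _async_http_get(job.db_name, job.uuid)
finally:
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```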

As a side note, we are currently experimenting with running the job runner in a simple process (#409), where it is not sensitive to the constraints of Odoo workers.

@etobella (Member, Author):

So, do you prefer to do the masking there? Or should I keep my original idea?

@sbidoul (Member) commented Apr 14, 2022

I prefer the signal masking, yes:

  • it would cover more situations (such as SIGTERM/SIGINT)
  • that kind of protection is "local" in the code and probably easier to reason about

@etobella (Member, Author) commented Apr 19, 2022

@sbidoul (Member) commented Apr 19, 2022

@etobella you are right. There is nothing else to do for SIGTERM and SIGINT. But then why not simply ignore SIGXCPU instead of restarting the job runner? AFAIK there is no memory leak in the job runner, so there is no need to restart it.
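For reference, ignoring the signal would amount to a one-liner in the worker setup (a sketch of the alternative being discussed here, not code from this PR):

```python
import signal

# Discard SIGXCPU: the kernel's notifications are ignored and the job
# runner simply keeps running past the soft CPU limit.
signal.signal(signal.SIGXCPU, signal.SIG_IGN)
```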

@etobella (Member, Author):

Odoo workers restart via signal_time_expired_handler, so the queue worker would raise the error every second after it has reached the expected time. So it seems legit to me to restart the worker, following the same logic defined by Odoo.
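In other words, the PR overrides the handler on the job runner worker so that it shuts down cleanly instead of raising every second; roughly (a hedged sketch of the idea, not necessarily the exact merged code):

```python
class WorkerJobRunner(server.Worker):
    """Job runner worker."""

    def signal_time_expired_handler(self, n, stack):
        # Unlike the stock Odoo handler, which raises an exception every
        # second, log and stop the runner; the worker then exits cleanly
        # and the main Odoo process respawns it with a fresh CPU budget.
        _logger.info(
            "WorkerJobRunner (%d) CPU time limit (%s) reached, stopping.",
            self.pid,
            config["limit_time_cpu"],
        )
        self.runner.stop()
```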

@sbidoul (Member) left a comment

Ok, why not.

@OCA-git-bot (Contributor)

This PR has the approved label and has been created more than 5 days ago. It should therefore be ready to merge by a maintainer (or a PSC member if the concerned addon has no declared maintainer). 🤖

@sbidoul (Member) commented Apr 19, 2022

/ocabot merge patch

@OCA-git-bot (Contributor)

Hey, thanks for contributing! Proceeding to merge this for you.
Prepared branch 13.0-ocabot-merge-pr-419-by-sbidoul-bump-patch, awaiting test results.

@OCA-git-bot merged commit 2a7cc9c into OCA:13.0 on Apr 19, 2022
@OCA-git-bot (Contributor)

Congratulations, your PR was merged at 3e51d0e. Thanks a lot for contributing to OCA. ❤️
