[IMP] queue_job: detect jobs runned by workers that have been killed #713
Hi @guewen,
0a6df89 to c2f1bd0
This is a recurring problem for me, so I will backport to 15.0 shortly after merge.
queue_job/models/queue_job.py (Outdated)

    for job in jobs:
        if not psutil.pid_exists(job.worker_pid):
            _logger.info("a process with pid %d does not exist" % job.worker_pid)
If I saw this in the logs, I don't think I'd know what to look for. Maybe say "a queue_job process" instead?
Oops, I forgot to remove it; it was only there to ensure that it was working during my tests 😄
I do think there is value in logging unexpected queue_job worker deaths with greater specificity than the generic werkzeug worker timeout message.
I see. I updated the message to include the worker pid and the job uuid.
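For illustration, a more specific log line of the kind discussed above might look like the following. This is only a sketch: the message wording, the logger name, and the helper `log_dead_worker` are hypothetical and are not taken from the PR.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
_logger = logging.getLogger("queue_job.sketch")

# Assumed wording; the actual message in the PR is not shown in this thread.
DEAD_WORKER_MSG = (
    "queue_job worker (pid %d) no longer exists; "
    "job %s is still marked as started"
)


def log_dead_worker(worker_pid, job_uuid):
    """Log a dead queue_job worker and return the formatted message."""
    # Pass arguments lazily so formatting only happens if the record is emitted.
    _logger.warning(DEAD_WORKER_MSG, worker_pid, job_uuid)
    return DEAD_WORKER_MSG % (worker_pid, job_uuid)
```

Naming both the pid and the job uuid gives an operator something concrete to search for in the logs.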
c2f1bd0 to 8aa3e58
8aa3e58 to 9e78112
Hi, sadly the pid may belong to a worker on another host that is still executing the job.
What would you advise?
Record the hostname (or another signature of the host) and filter accordingly? Gate this feature behind a config parameter?
This wasn't a good approach.
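The hostname suggestion raised above could be sketched roughly as follows. This is a minimal illustration, not code from the PR: the `Job` dataclass and its `worker_host` field are hypothetical stand-ins for the job record, and the pid check is injected as a callable (in the PR's context it would be `psutil.pid_exists`).

```python
import socket
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical stand-in for a job record; worker_host is an assumed field.
    uuid: str
    worker_pid: int
    worker_host: str
    state: str = "started"


def find_dead_local_jobs(jobs, local_host, pid_exists):
    """Return started jobs whose worker ran on this host and is now gone.

    Jobs recorded with another hostname are skipped, because a pid lookup
    on this machine says nothing about a process on a different machine.
    """
    return [
        job
        for job in jobs
        if job.state == "started"
        and job.worker_host == local_host
        and not pid_exists(job.worker_pid)
    ]


if __name__ == "__main__":
    # Example call with a fake pid check that treats every pid as dead.
    sample = [Job("a", 100, socket.gethostname()), Job("b", 200, "other-host")]
    print([j.uuid for j in find_dead_local_jobs(sample, socket.gethostname(), lambda pid: False)])
```

A hostname is only a weak host signature (containers can share or reuse names), which may be one reason a config parameter was also suggested.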
Goal
Do not keep jobs in the 'started' state if their associated worker has been killed by a timeout; set them directly to 'failed' instead.
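As a rough sketch of this goal, assuming each job records the pid of its worker: the `Job` dataclass, the `exc_info` text, and `fail_jobs_of_dead_workers` are hypothetical names for illustration, and `pid_exists` stands for a check such as `psutil.pid_exists`, which is only meaningful on a single host.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical stand-in for a job record.
    uuid: str
    worker_pid: int
    state: str = "started"
    exc_info: Optional[str] = None


def fail_jobs_of_dead_workers(jobs, pid_exists):
    """Mark started jobs as failed when their worker process is gone."""
    failed = []
    for job in jobs:
        if job.state != "started":
            continue  # only jobs stuck in 'started' are of interest
        if pid_exists(job.worker_pid):
            continue  # worker still alive, leave the job alone
        job.state = "failed"
        job.exc_info = "worker with pid %d no longer exists" % job.worker_pid
        failed.append(job)
    return failed
```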