add code for detecting (and killing) a hung task manager task #6071
b4f4895
to
d121173
Compare
Build succeeded.
# the task manager to never do more work
current_task = w.current_task
if current_task:
    if current_task.get('task', '').endswith('tasks.run_task_manager'):
I get confused in the dispatcher code about whether these messages are always dicts, or sometimes strings.
Well, if current_task is a string, I don't really care what it does. Just not doing anything would be fine.
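One way to make the dict-or-string question moot is to guard the check explicitly. This is only a sketch: the helper name `is_task_manager_task` is hypothetical and not part of the PR, and it assumes a non-dict message should simply be ignored, as suggested above.

```python
def is_task_manager_task(current_task):
    """Return True only for dict messages that name the task manager task.

    Hypothetical helper: anything that is not a dict (a string, None, ...)
    is treated as "not the task manager" and ignored.
    """
    if not isinstance(current_task, dict):
        return False
    # Guard against a missing or None 'task' value before calling endswith.
    task_name = str(current_task.get('task') or '')
    return task_name.endswith('tasks.run_task_manager')
```

With this shape, a string message falls through to "do nothing" instead of raising `AttributeError` on `.get`.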
Some of these arguments (a transaction inside of an advisory lock), though not necessarily all (sending websockets as an on_commit action), could also apply to Line 155 in 4912cbd
There's no argument against reaping this task after 5 minutes. Maybe the same applies to some other periodic tasks.
if age > (60 * 5):
    logger.error(
        f'run_task_manager has held the advisory lock for >5m, sending SIGTERM to {w.pid}'
    )  # noqa
I cannot figure out why you need the noqa here
Probably just habit; my editor was complaining about > 80 chars
The main thing I wanted to see was clear logging for this event, and I can't think of any way I could improve on what you have, so it's all 👍 from me
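For anyone reading this thread without the diff open, the check being reviewed can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `reap_if_hung`, `MAX_LOCK_AGE`, and the injectable `kill` argument are hypothetical names, and the worker is assumed to expose `pid` and a `current_task` dict with an optional `started` timestamp.

```python
import logging
import os
import signal
import time
from types import SimpleNamespace

logger = logging.getLogger(__name__)

MAX_LOCK_AGE = 60 * 5  # seconds; the 5 minute threshold discussed above


def reap_if_hung(worker, now=None, kill=os.kill):
    """Send SIGTERM to a worker stuck in run_task_manager for too long.

    Returns True if a signal was sent. `kill` is injectable so the logic
    can be exercised without signalling a real process.
    """
    task = worker.current_task
    if not isinstance(task, dict):
        return False
    if not str(task.get('task') or '').endswith('tasks.run_task_manager'):
        return False
    now = time.time() if now is None else now
    # Record when the task was first observed, then measure its age.
    started = task.setdefault('started', now)
    age = now - started
    if age > MAX_LOCK_AGE:
        logger.error(
            'run_task_manager has held the advisory lock for >5m, '
            'sending SIGTERM to %s', worker.pid
        )
        kill(worker.pid, signal.SIGTERM)
        return True
    return False


# Usage with a stand-in worker and a fake kill that only records calls.
sent = []
fake_worker = SimpleNamespace(
    pid=1234,
    current_task={'task': 'tasks.run_task_manager', 'started': 1000.0},
)
reap_if_hung(fake_worker, now=1301.0, kill=lambda pid, sig: sent.append((pid, sig)))
```

The error log is emitted before the signal is sent, so the event stays visible even if delivering the signal fails.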
awx/main/dispatch/pool.py
Outdated
    current_task['uuid']
]['started'] = datetime.datetime.utcnow()
age = (
    datetime.datetime.utcnow() - current_task['started']
Seems like time.time would also be fine. Is it more trustworthy than datetimes?
Yea, that's a good point, I should just do that.
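To make the comparison concrete, here is a small sketch of the two ways of computing the age. The function names are hypothetical. Both give the same number of seconds; the float version just avoids building `timedelta` objects.

```python
import datetime
import time


def age_from_datetimes(started, now):
    """Age in seconds from two naive datetimes (the original approach)."""
    return (now - started).total_seconds()


def age_from_timestamps(started, now=None):
    """Age in seconds from float timestamps such as time.time() returns."""
    now = time.time() if now is None else now
    return now - started
```

Both approaches read the wall clock, so neither is more trustworthy if the system time is adjusted. `time.monotonic()` is an alternative that cannot go backwards, though its values are only meaningful as differences between two readings.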
Deployed this branch and got confirmation that this case is being hit in real installs.
If this instance did not have this patch, it presumably would have hung forever.
d121173
to
8b1806d
Compare
Build succeeded.
Approving it twice because I like it that much
@AlanCoding I'm not quite ready to merge this yet because I want to catch #5617 in action and determine a root cause instead of just papering over it.
LGTM
While that's fine, I'd like for that to be time-boxed because this bug has been preventing proper testing of other features. So I urge an earlier merge. After all, we now have this code that identifies the hangup. Instead of running with This PR is a nice general fallback, and my attitude toward it would not change even if the root cause is identified.
Okay, I'm fine with that. I'll deploy and try to reproduce the hang from 9.2.0.
Build failed (gate pipeline). For information on how to proceed, see
regate
Build succeeded (gate pipeline).
this is a hail-mary, worst-case scenario, save-the-AWX-install-from-being-totally-stuck sort of thing
see: #5617
I tested this by intentionally making the task manager never yield: