Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add code for detecting (and killing) a hung task manager task #6071

Conversation

ryanpetrello
Copy link
Contributor

@ryanpetrello ryanpetrello commented Feb 25, 2020

this is a hail-mary, worst-case scenario, save-the-AWX-install-from-being-totally-stuck sort of thing

see: #5617

I tested this by intentionally making the task manager never yield:

diff --git a/awx/main/scheduler/task_manager.py b/awx/main/scheduler/task_manager.py
index 9cd0aa1db9..854dfbcf74 100644
--- a/awx/main/scheduler/task_manager.py
+++ b/awx/main/scheduler/task_manager.py
@@ -543,6 +543,11 @@ class TaskManager():
     def _schedule(self):
         finished_wfjs = []
         all_sorted_tasks = self.get_tasks()
+        while True:
+            import time
+            time.sleep(1)
+            # locked!

image

@ryanpetrello
Copy link
Contributor Author

bonus:

image

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@ryanpetrello ryanpetrello requested a review from elyezer February 25, 2020 22:45
# the task manager to never do more work
current_task = w.current_task
if current_task:
if current_task.get('task', '').endswith('tasks.run_task_manager'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get confused in the dispatcher code about whether these messages are always dicts, or sometimes strings.

Well if current_task is a string, I don't really care what it does. Just don't do anything would be fine.

@AlanCoding
Copy link
Member

Some of these arguments (a transaction inside of a advisory lock), but not necessarily all (sending websockets as an on_commit action) could also apply to

def apply_cluster_membership_policies():

There's no argument against reaping this task after 5 minutes. Maybe same applies to some other periodic tasks.

if age > (60 * 5):
logger.error(
f'run_task_manager has held the advisory lock for >5m, sending SIGTERM to {w.pid}'
) # noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot figure out why you need the noqa here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just habit; my editor was complaining about > 80 chars

Copy link
Member

@AlanCoding AlanCoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing I wanted to see was clear logging for this event, and I can't think of any way I could improve on what you have, so it's all 👍 from me

current_task['uuid']
]['started'] = datetime.datetime.utcnow()
age = (
datetime.datetime.utcnow() - current_task['started']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like time.time would also be fine. Is it more trustworthy than datetimes??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's a good point, I should just do that.

@AlanCoding
Copy link
Member

deployed this branch and got confirmation that this case is being hit in real installs

2020-02-26 02:06:10,932 ERROR    awx.main.dispatch PID:27850 run_task_manager has held the advisory lock for >5m, sending SIGTERM to 10757
2020-02-26 02:06:10,939 WARNING  awx.main.dispatch PID:10757 received SIGTERM, stopping

If this instance did not have this patch, it presumably would have hung forever.

@ryanpetrello ryanpetrello force-pushed the task-manager-hang-detection branch from d121173 to 8b1806d Compare February 26, 2020 13:02
@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

Copy link
Member

@AlanCoding AlanCoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving it twice because I like it that much

@ryanpetrello
Copy link
Contributor Author

@AlanCoding I'm not quite ready to merge this yet because I want to catch #5617 in action and determine a root cause instead of just papering over it.

Copy link
Member

@elyezer elyezer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AlanCoding
Copy link
Member

I'm not quite ready to merge this yet because I want to catch #5617 in action and determine a root cause instead of just papering over it.

While that's fine, I'd like for that to be time-boxed because this bug has been preventing properly testing other features. So I urge an earlier merge. After all, we now have this code that identifies the hangup. Instead of running with devel, we could merge this, branch off that, remove the SIGTERM, and send an email or something.

This PR is a nice general fallback, and my attitude toward it would not change even if the root cause is identified.

@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Feb 26, 2020

While that's fine, I'd like for that to be time-boxed because this bug has been preventing properly testing other features. So I urge an earlier merge.

Okay, I'm fine with that. I'll deploy and try to reproduce the hang from 9.2.0.

@softwarefactory-project-zuul
Copy link
Contributor

Build failed (gate pipeline). For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@AlanCoding
Copy link
Member

regate

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded (gate pipeline).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants