Job remains in pending #5617
Comments
What info do you need?
What sort of deployment do you have, and what is the topology? If this is in Kubernetes, what is your pod status?
I'm using docker-compose with an external Postgres DB. The containers themselves run fine.
I am starting to see this same problem. Has anyone else encountered this?
Similar issue here, except that my jobs never resume. It started occurring for me with the upgrade to 9.2.0. The job that gets stuck is usually my main workflow that runs every 20 minutes; sometimes it's a subworkflow for job slicing. Clicking cancel in the AWX web UI doesn't do anything. As an example, my last running job (767795) started at log timestamp 19:11:47 and then got stuck. AWX version: 9.2.0
@xTrekStorex any chance you could run this in the task container? ...and report back what it prints?
A bit of an extension on that question, if you already have a shell. First run:

```python
def check_lock():
    import time
    from awx.main.utils.pglock import advisory_lock
    time.sleep(1)
    with advisory_lock('task_manager_lock', wait=False) as acquired:
        return acquired
```

You can copy and paste functions, so you can paste that and hit enter twice. Next run:

```python
[check_lock() for i in range(20)]
```

Expect it to take 20 seconds to run. This assures that we get a full cycle. If any entry returns True, that indicates this lock is not being held perpetually.
If you need anything else I'd be happy to try things or post more logs, etc.
We're actively tracking what we think is a regression, and I think you're encountering a version of it. The next time you see this, could you:
Also, is there any way you could share with us what your workflow looks like?
sosreport-rgogbabteca01-awxhang-2020-02-26-mugqzxz.tar.xz.zip.001.txt (remove the .txt extension to extract). I have the same problem and am sending you my sosreport.
awx.dump.zip.001.txt (remove the .txt extension to extract). I have the same problem and am sending you my database dump.
If you're still encountering a hung install, could you run the commands referenced above and share the output?
Will do as soon as it happens again. For now I've encountered a different but maybe related issue: schedules are not running anymore. I see a few schedulers waiting. Running strace on any of them returns nothing useful. However, I can run JTs manually for now.
Any chance you could run:
Sure can. Maybe some context first:
Done. Will get back to you ASAP.
For posterity, I very much suspect this is actually related to a bug in CPython itself (https://bugs.python.org/issue38884) that we've begun encountering due to this recent PR of mine: ...which moved our periodic heartbeat code (when we removed Celery a few weeks ago) to a thread (instead of a distinct process). Details:
@xTrekStorex I've done some testing on my end and am feeling good about my PR. If I hear back from you in the coming days (that you're unable to reproduce with my patch), I'll probably merge #6093 and incorporate it into the next AWX release.
I have a hunch that our usage of a daemon thread is causing import lock contention related to ansible#5617. We've encountered similar issues before with threads across dispatcher processes at fork time, and CPython has had bugs like this in recent history: https://bugs.python.org/issue38884. My gut tells me this might be related. The prior implementation - based on celerybeat - ran its code in a process (not a thread), and the timing of that merge matches the period of time we started noticing issues. Currently testing it to see if it resolves some of the issues we're seeing.
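To make the thread-versus-process distinction concrete, here is a minimal sketch (not AWX code; names and structure are invented for illustration) of the two designs being compared. The hazard with the daemon-thread variant is that dispatcher workers forked from the parent inherit interpreter state, including the import lock, mid-flight:

```python
# Illustration only; not taken from AWX.
import multiprocessing
import threading
import time

def heartbeat_loop():
    # Stand-in for the periodic heartbeat/schedule code.
    while True:
        print("periodic beat")
        time.sleep(10)

if __name__ == "__main__":
    use_thread = False  # True ~ the 9.2.0-style daemon thread, False ~ a separate process

    if use_thread:
        # Runs inside the main process; forked workers inherit its interpreter state.
        threading.Thread(target=heartbeat_loop, daemon=True).start()
    else:
        # Runs in its own interpreter, closer to the old celerybeat-style behaviour.
        multiprocessing.Process(target=heartbeat_loop, daemon=True).start()

    time.sleep(30)  # stand-in for the dispatcher's main loop
```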
I don't want to jinx it, but in the past 24 hours I have not encountered the issue again. What I noticed, though, is that workers seem to be recreated a lot more often now.
@xTrekStorex do you have something on a schedule that's running over, say, the weekend? With the old 9.2.0 image, were you encountering the issue often (more than, say, once per day)? I'm not ready to call this solved, but if you're convinced by a before/after comparison, I'm likely to merge my PR early next week.
I encountered the issue every 2-6 hours. There are no major changes planned for the weekend; however, my workflow runs continuously 24/7, every 20 minutes with a runtime of ~18 minutes, to enforce configuration state (I moved to AWX from the Puppet world - it's a habit). So it will create and run plenty of jobs, if that is what you're asking.
@xTrekStorex thanks for helping us test this; I'll check back in with you next week 👍
Hello, after Friday's update my AWX is hung again.

```
(InteractiveConsole)
awx-manage run_dispatcher --status
strace: attach: ptrace(PTRACE_ATTACH, 35212): Operation not permitted
```
I've got a PR which I suspect will resolve this issue. Any updates on your end after a weekend of running periodic jobs?
All good. Not a single hung job.
All, is there a way to monitor for the stuck-job issue and restart the stuck task container?
You can monitor it with the API.
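For example, a minimal polling script could look like the sketch below, assuming an OAuth2 token and the standard /api/v2/jobs/ endpoint; the URL, token, and threshold are placeholders, not values from this issue:

```python
# Sketch only: poll the AWX API for jobs stuck in "pending" longer than a threshold.
from datetime import datetime, timedelta, timezone

import requests

AWX_URL = "https://awx.example.com"   # placeholder
TOKEN = "<oauth2-token>"              # placeholder
THRESHOLD = timedelta(minutes=30)     # how long "pending" counts as stuck

resp = requests.get(
    f"{AWX_URL}/api/v2/jobs/",
    params={"status": "pending", "order_by": "created"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

now = datetime.now(timezone.utc)
for job in resp.json()["results"]:
    # "created" is an ISO 8601 timestamp such as 2020-01-09T16:08:59.344123Z
    created = datetime.fromisoformat(job["created"].replace("Z", "+00:00"))
    if now - created > THRESHOLD:
        print(f"job {job['id']} has been pending since {job['created']}")
```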
@fischerdr Instead of an alert-handler-based workaround, you could either apply the above-mentioned patch manually like I did, or downgrade to AWX 9.1.1 if that's an option for you, until the fixed release becomes available.
My recommendation would be what @xTrekStorex said: either apply the patch directly (it's just been merged) or downgrade. I don't suspect that detecting and working around this bug is worth the effort, personally.
So 9.1.1 is OK to deploy until the new version is out?
@fischerdr to the best of my knowledge, this was a regression introduced in 9.2.0, and I've heard several people report that 9.1.1 doesn't suffer from this bug.
I have the same exact issue. What I did to patch my container:

```bash
# Patch awx 9.2.0
awx_version="$(awx-manage --version)"
echo "current awx version is: ${awx_version}"
if [[ "${awx_version}" == "9.2.0" ]]; then
    echo "Patching awx because current version == 9.2.0"
    # patch
    cp -f /patches/9.2.0/awx/main/dispatch/periodic.py /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/dispatch/
    cp -f /patches/9.2.0/awx/main/management/commands/run_dispatcher.py /var/lib/awx/venv/awx/lib/python3.6/site-packages/awx/main/management/commands/
fi
```

I just patched today; I'll see if it works over time.

Edit: Since I patched, I haven't had the problem anymore.
Just a little more feedback:
Any plans yet on a date for the next release that will include this fix? |
As we were reproducing this on a regular basis with internal jobs, and many upstream users were as well, I am going to call this verified by the fact that it has been resolved for all parties. Closing.
ISSUE TYPE
SUMMARY
Job remains in pending. In the log I see a lot of "awx.main.tasks Not running scheduler, another task holds lock".
There are some closed issues regarding this, but I didn't find any solution in them.
Randomly, AWX gets stuck and all jobs remain in "pending".
In the logs I see:
```
task_1 | 2020-01-09 16:08:59,344 DEBUG awx.main.dispatch task a1c42d7f-17b2-4590-b013-b82f76105f0c starting awx.main.scheduler.tasks.run_task_manager(*[])
task_1 | 2020-01-09 16:08:59,347 DEBUG awx.main.scheduler Running Tower task manager.
task_1 | 2020-01-09 16:08:59,359 DEBUG awx.main.scheduler Not running scheduler, another task holds lock
task_1 | 2020-01-09 16:08:59,360 DEBUG awx.main.dispatch task a1c42d7f-17b2-4590-b013-b82f76105f0c is finished
```
After a long time (this time 16 minutes), it suddenly starts to execute the task:
```
task_1 | 2020-01-09 16:24:19,814 DEBUG awx.main.dispatch task b832fea9-c31f-4cfb-8ce3-fda16a1ac5cb starting awx.main.scheduler.tasks.run_task_manager(*[])
task_1 | 2020-01-09 16:24:19,816 DEBUG awx.main.scheduler Running Tower task manager.
task_1 | 2020-01-09 16:24:19,827 DEBUG awx.main.scheduler Not running scheduler, another task holds lock
task_1 | 2020-01-09 16:24:19,828 DEBUG awx.main.dispatch task b832fea9-c31f-4cfb-8ce3-fda16a1ac5cb is finished
task_1 | 2020-01-09 16:24:38,979 DEBUG awx.main.dispatch publish awx.main.tasks.cluster_node_heartbeat(aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399, queue=awx)
task_1 | [2020-01-09 16:24:38,979: DEBUG/Process-1] publish awx.main.tasks.cluster_node_heartbeat(aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399, queue=awx)
rabbitmq_1 | 2020-01-09 16:24:38.989 [info] <0.3775.0> accepting AMQP connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:38.991 [info] <0.3775.0> connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
rabbitmq_1 | 2020-01-09 16:24:38.996 [info] <0.3775.0> closing AMQP connection <0.3775.0> (172.19.0.5:48346 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,003 DEBUG awx.main.dispatch delivered aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 to worker[178] qsize 0
task_1 | 2020-01-09 16:24:39,005 DEBUG awx.main.dispatch task aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 starting awx.main.tasks.cluster_node_heartbeat(*[])
task_1 | 2020-01-09 16:24:39,008 DEBUG awx.main.tasks Cluster node heartbeat task.
task_1 | 2020-01-09 16:24:39,020 DEBUG awx.main.dispatch task aacec6cc-b438-4cb7-b3f8-2a9ba0b3f399 is finished
task_1 | 2020-01-09 16:24:39,090 DEBUG awx.main.dispatch publish awx.main.tasks.awx_k8s_reaper(e33c3d96-163b-487f-8101-00dd80f101a7, queue=awx)
task_1 | [2020-01-09 16:24:39,090: DEBUG/Process-1] publish awx.main.tasks.awx_k8s_reaper(e33c3d96-163b-487f-8101-00dd80f101a7, queue=awx)
rabbitmq_1 | 2020-01-09 16:24:39.100 [info] <0.3788.0> accepting AMQP connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:39.102 [info] <0.3788.0> connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
rabbitmq_1 | 2020-01-09 16:24:39.104 [info] <0.3788.0> closing AMQP connection <0.3788.0> (172.19.0.5:48350 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,105 DEBUG awx.main.dispatch delivered e33c3d96-163b-487f-8101-00dd80f101a7 to worker[178] qsize 0
task_1 | 2020-01-09 16:24:39,108 DEBUG awx.main.dispatch task e33c3d96-163b-487f-8101-00dd80f101a7 starting awx.main.tasks.awx_k8s_reaper(*[])
task_1 | 2020-01-09 16:24:39,117 DEBUG awx.main.dispatch task e33c3d96-163b-487f-8101-00dd80f101a7 is finished
task_1 | 2020-01-09 16:24:39,344 DEBUG awx.main.dispatch publish awx.main.tasks.awx_periodic_scheduler(97554fd6-f2c2-4b5f-a25d-bf09743b9105, queue=awx_private_queue)
task_1 | [2020-01-09 16:24:39,344: DEBUG/Process-1] publish awx.main.tasks.awx_periodic_scheduler(97554fd6-f2c2-4b5f-a25d-bf09743b9105, queue=awx_private_queue)
rabbitmq_1 | 2020-01-09 16:24:39.354 [info] <0.3801.0> accepting AMQP connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672)
rabbitmq_1 | 2020-01-09 16:24:39.356 [info] <0.3801.0> connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672): user 'guest' authenticated and granted access to vhost 'awx'
task_1 | 2020-01-09 16:24:39,359 DEBUG awx.main.dispatch delivered 97554fd6-f2c2-4b5f-a25d-bf09743b9105 to worker[178] qsize 0
rabbitmq_1 | 2020-01-09 16:24:39.359 [info] <0.3801.0> closing AMQP connection <0.3801.0> (172.19.0.5:48354 -> 172.19.0.3:5672, vhost: 'awx', user: 'guest')
task_1 | 2020-01-09 16:24:39,361 DEBUG awx.main.dispatch task 97554fd6-f2c2-4b5f-a25d-bf09743b9105 starting awx.main.tasks.awx_periodic_scheduler(*[])
task_1 | 2020-01-09 16:24:39,377 DEBUG awx.main.tasks Starting periodic scheduler
task_1 | 2020-01-09 16:24:39,380 DEBUG awx.main.tasks Last scheduler run was: 2020-01-09 16:05:39.310989+00:00
task_1 | 2020-01-09 16:24:39,390 DEBUG awx.main.dispatch task 97554fd6-f2c2-4b5f-a25d-bf09743b9105 is finished
```
Is there a way to know which other task blocks the scheduler?
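One way to poke at this is to look at the advisory locks directly in Postgres. Below is a sketch, assuming it is run from `awx-manage shell_plus` on a task node; note that the lock name is hashed into a numeric key, so the query shows key/pid pairs and the holding session's query rather than the lock's name:

```python
# Sketch only: list Postgres advisory locks and the sessions holding them.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        """
        SELECT l.pid, l.classid, l.objid, a.state, a.query_start, a.query
        FROM pg_locks l
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE l.locktype = 'advisory'
        """
    )
    for row in cursor.fetchall():
        print(row)
```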
ENVIRONMENT