ansible_runner cleaning up missing file... Device busy error continuously. #374

Closed
6sossomons opened this issue Oct 21, 2019 · 11 comments

Comments

@6sossomons

I run the AWX-RPM version. After we upgraded to the newest code, which includes the newest release of ansible_runner/runner.py, workflows that previously ran without issue now fail at the first playbook run. Inventory syncs and SCM syncs still work, but playbook runs fail immediately.

The failing code starts at line 253, in the new section that cleans up temp directories. I've tweaked timeouts, retries, everything short of rolling back the code...

All failed.

Most of the time, the directory is already missing at the point of the error. I'm not sure whether it has already been deleted, or whether the cleanup fails to recognize that it can only delete the directory AFTER it has left it (AWX runs inside the /tmp/* folder while a job is running) and so causes the Device busy error itself.

I rolled my code back to https://github.com/ansible/ansible-runner/blob/68d09d87b7941a4c5cd6efd03b344b1a916afa9a/ansible_runner/runner.py, which is the commit just before the clean-up code was added, and that lets us work again.

Any thoughts/input on something else I could have been missing with the newest code?

@matburt
Member

matburt commented Oct 21, 2019

Best thing to do here is probably check for the non-existence and not fail outright if the directory is actually gone.

@MrMEEE

MrMEEE commented Oct 22, 2019

How about something simple like this?

--- runner.py   2019-10-04 22:41:47.000000000 +0200
+++ runner.py.new       2019-10-22 14:11:36.961132180 +0200
@@ -249,15 +249,16 @@
             shutil.rmtree(self.config.directory_isolation_path)
         if self.config.process_isolation and self.config.process_isolation_path_actual:
             def _delete(retries=15):
-                try:
-                    shutil.rmtree(self.config.process_isolation_path_actual)
-                except OSError as e:
-                    res = False
-                    if e.errno == 16 and retries > 0:
-                        time.sleep(1)
-                        res = _delete(retries=retries - 1)
-                    if not res:
-                        raise
+                if os.path.exists(self.config.process_isolation_path_actual):
+                    try:
+                        shutil.rmtree(self.config.process_isolation_path_actual)
+                    except OSError as e:
+                        res = False
+                        if e.errno == 16 and retries > 0:
+                            time.sleep(1)
+                            res = _delete(retries=retries - 1)
+                        if not res:
+                            raise
                 return True
             _delete()
         if self.config.resource_profiling:
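
The same idea (treat an already-missing directory as success, retry only on "Device busy") can also be written without a separate existence check, which avoids a race between the check and the delete. This is a minimal standalone sketch, not the code that went into ansible-runner; the function name `remove_tree_tolerant` and its `delay` parameter are made up for illustration.

```python
import errno
import shutil
import time


def remove_tree_tolerant(path, retries=15, delay=1.0):
    """Recursively remove `path`; return True once it is gone.

    Hypothetical helper: a directory that is already missing counts as
    success, EBUSY (errno 16) is retried up to `retries` times, and any
    other OSError is re-raised to the caller.
    """
    for attempt in range(retries + 1):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            # Someone else already removed it; nothing left to clean up.
            return True
        except OSError as e:
            # Give up on non-EBUSY errors, or once the retries run out.
            if e.errno != errno.EBUSY or attempt == retries:
                raise
            time.sleep(delay)
```

Calling `remove_tree_tolerant(some_path)` twice in a row should succeed both times, since the second call finds nothing to delete.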

@matburt
Member

matburt commented Oct 22, 2019

I'd love to see a root-cause here.

We don't see this in the official AWX or Tower builds that include Runner.

@MrMEEE

MrMEEE commented Oct 22, 2019

Isn't this the same?
ansible/awx#4194

@matburt
Member

matburt commented Oct 23, 2019

Possibly. Does that mean your RPM is using an old version of Runner?

@angystardust

@matburt I see the RPM has a dependency on rh-python36-ansible-runner-1.4.2-1.noarch

@MrMEEE

MrMEEE commented Oct 23, 2019

@matburt No, I'm using the version defined in AWX. But maybe this is related: ansible/awx#4073, which never seems to have been fixed?

@wenottingham
Contributor

Does #380 do anything for this for you?

@angystardust

angystardust commented Oct 23, 2019

@wenottingham the --die-with-parent parameter passed to ansible-runner fixed the job executions for me! Thanks a lot! 🚀
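
For context, `--die-with-parent` is an option of bubblewrap (`bwrap`), assuming that is the process isolation executable in use here. The sketch below is a hypothetical illustration only, not the ansible-runner implementation: the function name `build_isolated_command` and the other flags shown are assumptions, chosen just to show where the option sits on the command line. With the option set, the sandboxed process is killed when its parent exits, which would plausibly stop leftover processes from keeping the isolation directory busy during cleanup.

```python
def build_isolated_command(inner_args, isolation_path, die_with_parent=True):
    """Build a bwrap argv list wrapping `inner_args` (hypothetical helper).

    The exact flag set is an assumption for illustration; only the
    position of --die-with-parent is the point of the example.
    """
    cmd = ["bwrap"]
    if die_with_parent:
        # Kill the sandboxed process when the parent process dies.
        cmd.append("--die-with-parent")
    # Separate PID namespace, host filesystem visible, fresh /proc.
    cmd += ["--unshare-pid", "--dev-bind", "/", "/", "--proc", "/proc"]
    # Make the isolation directory available at the same path inside.
    cmd += ["--bind", isolation_path, isolation_path]
    return cmd + list(inner_args)
```

The returned list can be handed to `subprocess.Popen` as-is; no shell quoting is needed because it is an argv list.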

@6sossomons
Author

This fixes it for me as well. I grabbed raw versions of runner_config.py and runner.py, updated them in the codebase, then started AWX and ran multiple jobs. All succeed or fail as expected.

@wenottingham
Contributor

Closing as fixed, then. Will be in 1.4.4.

MrMEEE added a commit to MrMEEE/awx that referenced this issue Oct 26, 2019