[Bug]: Model connection hangs when migration is in progress #829
Comments
@agileshaw @SimonRichardson had an interesting question about this earlier: does this come back after the migration is complete? The answer might help us think about the problem and the proper solution for it. Thanks!
@agileshaw Also, you marked this as a
@cderici Thanks for reaching out and for helping with the bug. It never comes back after the migration completes or fails; it hangs indefinitely without retrying or terminating. As for the bug status, it is indeed blocking our release. One of our applications, prometheus-juju-exporter, uses libjuju to connect to models and gather machine information. Due to this bug, the application hangs silently in our production environment.
No problem, we got you👍
You do know that even after this bug is fixed it'll fail, right? You're trying to run the
Yes, I'm aware of it. We have exception handling in our application. What we want to see is the exception raised by libjuju being returned to the main program, so that the logic implemented in our application gets triggered. Hanging is the critical problem here because it gives a false sense that the program is running correctly when it is not.
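The behavior the application expects can be sketched with plain asyncio; `connect_model` and `JujuError` below are hypothetical stand-ins for libjuju's `Model.connect()` and `juju.errors.JujuError`, not the real API:

```python
import asyncio

class JujuError(Exception):
    """Stand-in for juju.errors.JujuError (hypothetical, for this sketch only)."""

async def connect_model():
    # Stand-in for Model.connect(); the desired behavior is that a failure
    # during connection propagates to the caller instead of hanging in a
    # background task.
    raise JujuError("model is being migrated")

def main():
    try:
        asyncio.run(connect_model())
    except JujuError as e:
        # Application-level handling: the exception must reach this block.
        print(f"Error handled: {e}")
        return True
    return False
```

The point of the sketch is only the control flow: the error must cross from the coroutine into the caller's `except` block rather than being swallowed.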
#833

#### Description

This one was a bit tricky. Fixes #829.

The `_all_watcher` task is a coroutine that runs the AllWatcher in the background indefinitely; it contains a while loop controlled manually through flags (asyncio events) such as `_watch_stopping` and `_watch_stopped`. The problem is that when `_all_watcher` raises an exception (or receives one from a call like `get_config()`, as in the case of #829), that exception is thrown somewhere in the background in the event loop, neither handled nor re-raised. This is because the coroutine is never `await`ed, and for good reason: it can't be `await`ed, since there won't ever be a result; this method is supposed to run in the background forever, collecting the deltas for us.

As a result, if `_all_watcher` fails, external flags like `_watch_received` are never set, and any caller of `await self._watch_received.wait()` blocks forever (in this case `_after_connect()`). Similarly, `disconnect()` waits for the `_watch_stopped` flag, which won't be set either, so calling `disconnect()` after the all_watcher has failed hangs forever.

This change fixes the problem by waiting, at each wait-for-flag spot, for two things: 1) whichever flag we're waiting for, and 2) the `_all_watcher` task being `done()`. In the latter case we should expect to see an exception, because that task is never supposed to finish. More importantly, if we do see that `_watcher_task.done()` is true, we don't sit and wait forever for the `_all_watcher` event flags to be set, so we won't hang.

A nice side effect is that we should see fewer "Task exception was never handled" messages, since we now call `.exception()` on the `_all_watcher` task. We'll probably continue to get those from tasks like `_pinger` and `_debug_log`, but this is a good first example of how to handle them as well.

#### QA Steps

This should be rigorously tested, as it slightly changes a fundamental mechanism, and we do need to make sure all the tests pass. For manual QA, I changed the body of `model.get_config()` to `raise JujuError("FOO")`, artificially inducing an error that appears to come from the API outside the all_watcher loop. This recreates the exact condition in #829. Alternatively, you can set up two controllers and use pylibjuju while a migration is running in the background, which yields a `migration is in progress` error. Either way, with the error in place, run the following and it should print "Error handled" to stdout:

```python
import asyncio

from juju.errors import JujuError
from juju.model import Model

async def juju_stats():
    m = Model()
    await m.connect()

try:
    asyncio.run(juju_stats())
except JujuError:
    print("Error handled")
```

#### Notes & Discussion

We might want to get this onto the other branches as well, after we carefully test and land it on `2.9`, as requested in #829.
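The wait-for-flag-or-task-done mechanism described above can be sketched with plain asyncio (the names `_all_watcher`, `watch_received`, and `after_connect` mirror the PR's vocabulary but this is a simplified illustration, not the actual libjuju code):

```python
import asyncio

async def _all_watcher():
    # Background watcher that fails before ever setting the flag,
    # mimicking e.g. a "migration is in progress" error from get_config().
    raise RuntimeError("migration is in progress")

async def wait_for_flag_or_task(flag: asyncio.Event, task: asyncio.Task):
    # Wait for whichever finishes first: the flag being set, or the
    # watcher task ending (which should only happen on failure).
    waiter = asyncio.ensure_future(flag.wait())
    done, pending = await asyncio.wait(
        {waiter, task}, return_when=asyncio.FIRST_COMPLETED
    )
    for p in pending:
        p.cancel()
    if task in done:
        # Retrieve and re-raise the watcher's exception instead of
        # blocking forever on a flag that will never be set. Calling
        # .exception() also marks the exception as retrieved.
        exc = task.exception()
        if exc is not None:
            raise exc

async def after_connect():
    watch_received = asyncio.Event()
    watcher_task = asyncio.ensure_future(_all_watcher())
    await wait_for_flag_or_task(watch_received, watcher_task)

def demo():
    try:
        asyncio.run(after_connect())
    except RuntimeError as e:
        return str(e)
```

Here `demo()` returns the watcher's error message instead of hanging, which is exactly the change in behavior the fix aims for.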
Description
When libjuju tries to establish a connection to a model that is in the middle of a migration, it outputs the following error messages and hangs without returning to the main program:
It seems that the exception from the controller is not communicated back to the main libjuju thread. The expected behavior would be to raise the exception to the main program, either triggering exception-handling blocks or exiting with a non-zero code.
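Until a fix lands, an application can at least turn the silent hang into an error it can act on by guarding the connect call with a timeout. This is a generic asyncio sketch; `connect()` below is a hypothetical stand-in for `Model.connect()`, simulated here as waiting on a flag that is never set:

```python
import asyncio

async def connect():
    # Stand-in for Model.connect() hanging during a migration: internally
    # it waits on an event that the failed watcher never sets.
    await asyncio.Event().wait()

async def connect_with_timeout(timeout: float) -> str:
    try:
        await asyncio.wait_for(connect(), timeout=timeout)
        return "connected"
    except asyncio.TimeoutError:
        # Surface the hang as an error the application can handle,
        # instead of blocking indefinitely.
        return "connect timed out"

def demo() -> str:
    return asyncio.run(connect_with_timeout(0.1))
```

This is only a mitigation for the silent hang; it does not recover the underlying `migration is in progress` error.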
Urgency
Blocker for our release
Python-libjuju version
2.9.11, 2.9.42.1, 3.0.4
Juju version
2.9.42
Reproduce / Test