[Bug]: Model connection hangs when migration is in progress #829

Closed
agileshaw opened this issue Apr 17, 2023 · 5 comments
Labels
kind/bug indicates a bug in the project

Comments

@agileshaw

Description

When libjuju tries to establish a connection to a model that is in the middle of a migration, it outputs the following error messages and hangs without returning control to the main program:

```
unknown facade EnvironUpgrader
unexpected facade EnvironUpgrader found, unable to decipher version to use
Task exception was never retrieved
future: <Task finished name='Task-32' coro=<Model._watch.<locals>._all_watcher() done, defined at /home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py:1058> exception=JujuAPIError('model migration in progress')>
Traceback (most recent call last):
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py", line 1061, in _all_watcher
    model_config = await self.get_config()
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py", line 2017, in get_config
    result = await config_facade.ModelGet()
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 483, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/_client2.py", line 10431, in ModelGet
    reply = await self.rpc(msg)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 656, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/connection.py", line 627, in rpc
    raise errors.JujuAPIError(result)
juju.errors.JujuAPIError: model migration in progress
```

It seems that the exception from the controller is not communicated back to the main libjuju thread. The expected behavior would be to raise the exception back to the main program, which would either trigger exception handling blocks or exit with a non-zero code.

Urgency

Blocker for our release

Python-libjuju version

2.9.11, 2.9.42.1, 3.0.4

Juju version

2.9.42

Reproduce / Test

```python
from juju.controller import Controller
import asyncio

async def juju_stats():
    controller = Controller(max_frame_size=2**27)
    await controller.connect()
    # Hangs here when the target model is mid-migration (see traceback above);
    # <model_uuid> is a placeholder for the UUID of the migrating model.
    model = await controller.get_model(<model_uuid>)
    status = await model.get_status()
    print(status)
    await model.disconnect()

asyncio.run(juju_stats())
```
@cderici
Contributor

cderici commented Apr 18, 2023

@agileshaw @SimonRichardson had an interesting question about this earlier: does this come back after the migration is complete? The answer might help us think about the problem and the proper solution for it. Thanks!

@cderici
Contributor

cderici commented Apr 18, 2023

@agileshaw Also, you marked this as "Blocker for our release"; was this by mistake? I don't see how this would block you from making a release. However, if it's actually urgent I'll trust you and prioritize it, no problem, just wanted to double check 👍 thanks!

@agileshaw
Author

@cderici Thanks for reaching out and for helping with the bug. It never comes back after the migration completes or fails; it hangs indefinitely without retrying or terminating.

As for the bug status, it is indeed blocking our release. One of our applications, prometheus-juju-exporter, uses libjuju to connect to models and gather machine information. Due to this bug, the application hangs silently in our production environment.

@cderici
Contributor

cderici commented Apr 18, 2023

No problem, we got you👍

> As for the bug status, it is indeed blocking our release. One of our applications, prometheus-juju-exporter, uses libjuju to connect to models and gather machine information. Due to this bug, the application hangs silently in our production environment.

You do know that even after this bug is fixed it'll still fail, right? You're trying to run `ModelConfig.ModelGet()` from the API during a migration, and it'll always fail with `juju.errors.JujuAPIError: model migration in progress`. You can't run that until the model migration is done.

@agileshaw
Author

Yes, I'm aware of that. We have exception handling in our application. What we want is for the exception raised by libjuju to be returned to the main program so that the logic implemented in our application gets triggered. Hanging is the critical problem here, because it gives a false sense that the program is running correctly when it is not.
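For illustration, a minimal sketch of the application-side handling described above, assuming the connection path raises `JujuAPIError` instead of hanging; the function name and the retry-free flow are illustrative, not prometheus-juju-exporter's actual code:

```python
import asyncio

from juju.controller import Controller
from juju.errors import JujuAPIError


async def collect_model_status(model_uuid):
    controller = Controller()
    await controller.connect()
    try:
        # If the model is mid-migration, we expect JujuAPIError
        # ("model migration in progress") to propagate instead of hanging.
        model = await controller.get_model(model_uuid)
        status = await model.get_status()
        await model.disconnect()
        return status
    finally:
        await controller.disconnect()


try:
    asyncio.run(collect_model_status("<model_uuid>"))
except JujuAPIError as err:
    # Application-level handling: report and move on rather than hang silently.
    print(f"skipping model: {err}")
```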

cderici added a commit to cderici/python-libjuju that referenced this issue Apr 18, 2023
fixes juju#829
cderici added a commit to cderici/python-libjuju that referenced this issue Apr 25, 2023
fixes juju#829
jujubot added a commit that referenced this issue May 2, 2023
#833

#### Description

This one was a bit tricky. Fixes #829.

The `_all_watcher` task is a coroutine that runs the AllWatcher in the background indefinitely, and it involves a while loop that is controlled manually through some flags (asyncio events), e.g. `_watch_stopping` and `_watch_stopped`.

The problem is that when `_all_watcher` raises an exception (or receives one from something like `get_config()`, as in the case of #829), that exception is thrown somewhere in the ether of the event loop and is never handled or re-raised. This is because this coroutine is not `await`ed (for good reason): it can't be `await`ed because there will never be a result; this method is supposed to work in the background forever, fetching the deltas for us. As a result, if `_all_watcher` fails, external flags like `_watch_received` are never set, and whoever is calling `await self._watch_received.wait()` will block forever (in this case `_after_connect()`). Similarly, `disconnect()` waits for the `_watch_stopped` flag, which won't be set either, so calling `disconnect()` after `_all_watcher` has failed will also hang forever.

This change fixes the problem by waiting, at the wait-for-flag spots, for two things: 1) whichever flag we're waiting for, and 2) the `_all_watcher` task to be `done()`. In the latter case we should expect to see an exception, because that task is not supposed to finish. More importantly, if we do see `_watcher_task.done()`, then we don't sit and wait forever for the `_all_watcher` event flags to be set, so we won't hang.

A nice side effect is that we should also get fewer of the extra "Task exception was never retrieved" outputs, since we do call `.exception()` on the `_all_watcher` task. We'll probably continue to get those from tasks like `_pinger` and `_debug_log`, but this is a good first example of how to handle those as well.
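For illustration only, a minimal sketch of the wait-on-two-things pattern described above (not the actual patch); `_wait_for_flag_or_watcher` is a made-up helper name, and `flag` / `watcher_task` stand in for events like `_watch_received` and the `_all_watcher` task:

```python
import asyncio


async def _wait_for_flag_or_watcher(flag: asyncio.Event, watcher_task: asyncio.Task):
    """Block until either the flag is set or the background watcher finishes."""
    waiter = asyncio.ensure_future(flag.wait())
    try:
        # Instead of blocking solely on flag.wait(), also watch the background
        # task so a crash there surfaces to the caller.
        await asyncio.wait(
            {waiter, watcher_task},
            return_when=asyncio.FIRST_COMPLETED,
        )
    finally:
        waiter.cancel()

    if watcher_task.done() and not flag.is_set():
        # The watcher died before setting the flag; retrieving .exception()
        # re-raises the failure here and avoids the
        # "Task exception was never retrieved" warning.
        exc = watcher_task.exception()
        if exc is not None:
            raise exc
```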

#### QA Steps

This should be rigorously tested, as it slightly changes a fundamental mechanism. We need to make sure all the tests pass.

For the manual QA, what I did was replace the body of `model.get_config()` with `raise JujuError("FOO")` (artificially inducing an error that appears to come from the API outside of the all_watcher loop). This creates the exact condition happening in #829. You can also get two controllers and use pylibjuju while running a migration in the background, which produces a `migration is in progress` error. Either way, with the error in place, run the following and it should print "Error handled" to stdout:

```python
import asyncio

from juju.errors import JujuError
from juju.model import Model


async def juju_stats():
    m = Model()
    # With the fix, the all_watcher failure propagates here instead of hanging.
    await m.connect()

try:
    asyncio.run(juju_stats())
except JujuError:
    print("Error handled")
```

#### Notes & Discussion

We might want to also get this onto the other branches after we carefully test and land it on `2.9`, as requested in #829.
jujubot pushed a commit to cderici/python-libjuju that referenced this issue Jun 1, 2023
fixes juju#829