[Bug]: Model connection hangs when migration is in progress #829

Closed
agileshaw opened this issue Apr 17, 2023 · 5 comments
Labels
kind/bug indicates a bug in the project

Comments

@agileshaw

Description

When libjuju tries to establish a connection to a model that is in the middle of a migration, it outputs the following error messages and hangs without returning control to the main program:

```
unknown facade EnvironUpgrader
unexpected facade EnvironUpgrader found, unable to decipher version to use
Task exception was never retrieved
future: <Task finished name='Task-32' coro=<Model._watch.<locals>._all_watcher() done, defined at /home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py:1058> exception=JujuAPIError('model migration in progress')>
Traceback (most recent call last):
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py", line 1061, in _all_watcher
    model_config = await self.get_config()
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/model.py", line 2017, in get_config
    result = await config_facade.ModelGet()
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 483, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/_client2.py", line 10431, in ModelGet
    reply = await self.rpc(msg)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/facade.py", line 656, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/.venv/lib/python3.10/site-packages/juju/client/connection.py", line 627, in rpc
    raise errors.JujuAPIError(result)
juju.errors.JujuAPIError: model migration in progress
```

It seems that the exception from the controller is not communicated back to the main libjuju thread. The expected behavior would be to raise the exception back to the main program, which would either trigger exception handling blocks or exit with a non-zero code.

Urgency

Blocker for our release

Python-libjuju version

2.9.11, 2.9.42.1, 3.0.4

Juju version

2.9.42

Reproduce / Test

```python
from juju.controller import Controller
import asyncio

async def juju_stats():
    controller = Controller(max_frame_size=2**27)
    await controller.connect()
    # Hangs here when the target model is mid-migration (see traceback above);
    # <model_uuid> is a placeholder for the UUID of the migrating model.
    model = await controller.get_model(<model_uuid>)
    status = await model.get_status()
    print(status)
    await model.disconnect()

asyncio.run(juju_stats())
```
@cderici
Contributor

cderici commented Apr 18, 2023

@agileshaw @SimonRichardson had an interesting question about this earlier: does this come back after the migration is complete? The answer might help us think about the problem and the proper solution for it. Thanks!

@cderici
Contributor

cderici commented Apr 18, 2023

@agileshaw Also, you marked this as "Blocker for our release"; was this by mistake? I don't see how this would block you from making a release. However, if it's actually urgent I'll trust you and prioritize it, no problem, just wanted to double check 👍 thanks!

@agileshaw
Author

@cderici Thanks for reaching out and for helping with the bug. It never comes back after the migration completes or fails; it hangs indefinitely without retrying or terminating.

As for the bug status, it is indeed blocking our release. One of our applications, prometheus-juju-exporter, uses libjuju to connect to models and gather machine information. Due to this bug, the application hangs silently in our production environment.

@cderici
Contributor

cderici commented Apr 18, 2023

No problem, we got you👍

> As for the bug status, it is indeed blocking our release. One of our applications, prometheus-juju-exporter, uses libjuju to connect to models and gather machine information. Due to this bug, the application hangs silently in our production environment.

You do know that even after this bug is fixed it'll still fail, right? You're trying to run `ModelConfig.ModelGet()` from the API during a migration, and it'll always fail with `juju.errors.JujuAPIError: model migration in progress`. You can't run that until the model migration is done.

@agileshaw
Author

Yes, I'm aware of that. We have exception handling in our application. What we want is for the exception raised by libjuju to be returned to the main program so that the logic implemented in our application gets triggered. Hanging is the critical problem here, because it gives a false sense that the program is running correctly when it is not.
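For illustration, a minimal sketch of the application-side handling described above, assuming the connection path raises `JujuAPIError` instead of hanging; the function name and the retry-free flow are illustrative, not prometheus-juju-exporter's actual code:

```python
import asyncio

from juju.controller import Controller
from juju.errors import JujuAPIError


async def collect_model_status(model_uuid):
    controller = Controller()
    await controller.connect()
    try:
        # If the model is mid-migration, we expect JujuAPIError
        # ("model migration in progress") to propagate instead of hanging.
        model = await controller.get_model(model_uuid)
        status = await model.get_status()
        await model.disconnect()
        return status
    finally:
        await controller.disconnect()


try:
    asyncio.run(collect_model_status("<model_uuid>"))
except JujuAPIError as err:
    # Application-level handling: report and move on rather than hang silently.
    print(f"skipping model: {err}")
```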

cderici added a commit to cderici/python-libjuju that referenced this issue Apr 18, 2023
fixes juju#829
cderici added a commit to cderici/python-libjuju that referenced this issue Apr 25, 2023
fixes juju#829
jujubot added a commit that referenced this issue May 2, 2023
#833

#### Description

This one was a bit tricky. Fixes #829.

The `_all_watcher` task is a coroutine that runs the AllWatcher in the background indefinitely, and it involves a while loop that is controlled manually through some flags (asyncio events), e.g. `_watch_stopping` and `_watch_stopped`.

The problem is that when `_all_watcher` raises an exception (or receives one from something like `get_config()`, as in the case of #829), that exception is thrown somewhere in the ether of the event loop and is never handled or re-raised. This is because this coroutine is not `await`ed (for good reason): it can't be `await`ed because there will never be a result; this method is supposed to work in the background forever, fetching the deltas for us. As a result, if `_all_watcher` fails, external flags like `_watch_received` are never set, and whoever is calling `await self._watch_received.wait()` will block forever (in this case `_after_connect()`). Similarly, `disconnect()` waits for the `_watch_stopped` flag, which won't be set either, so calling `disconnect()` after `_all_watcher` has failed will also hang forever.

This change fixes the problem by waiting, at the wait-for-flag spots, for two things: 1) whichever flag we're waiting for, and 2) the `_all_watcher` task to be `done()`. In the latter case we should expect to see an exception, because that task is not supposed to finish. More importantly, if we do see `_watcher_task.done()`, then we don't sit and wait forever for the `_all_watcher` event flags to be set, so we won't hang.

A nice side effect is that we should also get fewer of the extra "Task exception was never retrieved" outputs, since we do call `.exception()` on the `_all_watcher` task. We'll probably continue to get those from tasks like `_pinger` and `_debug_log`, but this is a good first example of how to handle those as well.
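For illustration only, a minimal sketch of the wait-on-two-things pattern described above (not the actual patch); `_wait_for_flag_or_watcher` is a made-up helper name, and `flag` / `watcher_task` stand in for events like `_watch_received` and the `_all_watcher` task:

```python
import asyncio


async def _wait_for_flag_or_watcher(flag: asyncio.Event, watcher_task: asyncio.Task):
    """Block until either the flag is set or the background watcher finishes."""
    waiter = asyncio.ensure_future(flag.wait())
    try:
        # Instead of blocking solely on flag.wait(), also watch the background
        # task so a crash there surfaces to the caller.
        await asyncio.wait(
            {waiter, watcher_task},
            return_when=asyncio.FIRST_COMPLETED,
        )
    finally:
        waiter.cancel()

    if watcher_task.done() and not flag.is_set():
        # The watcher died before setting the flag; retrieving .exception()
        # re-raises the failure here and avoids the
        # "Task exception was never retrieved" warning.
        exc = watcher_task.exception()
        if exc is not None:
            raise exc
```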

#### QA Steps

This should be rigorously tested, as it slightly changes a fundamental mechanism. We need to make sure all the tests pass.

For the manual QA, what I did was replace the body of `model.get_config()` with `raise JujuError("FOO")` (artificially inducing an error that appears to come from the API outside of the all_watcher loop). This creates the exact condition happening in #829. You can also get two controllers and use pylibjuju while running a migration in the background, which produces a `migration is in progress` error. Either way, with the error in place, run the following and it should print "Error handled" to stdout:

```python
import asyncio

from juju.errors import JujuError
from juju.model import Model


async def juju_stats():
    m = Model()
    # With the fix, the all_watcher failure propagates here instead of hanging.
    await m.connect()

try:
    asyncio.run(juju_stats())
except JujuError:
    print("Error handled")
```

#### Notes & Discussion

We might want to also get this onto the other branches after we carefully test and land it on `2.9`, as requested in #829.
jujubot pushed a commit to cderici/python-libjuju that referenced this issue Jun 1, 2023
fixes juju#829