Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated asyncio errors #576

Closed
jnsgruk opened this issue Nov 5, 2021 · 1 comment · Fixed by #580
Closed

Repeated asyncio errors #576

jnsgruk opened this issue Nov 5, 2021 · 1 comment · Fixed by #580

Comments

@jnsgruk
Copy link
Member

jnsgruk commented Nov 5, 2021

I've been using pylibjuju through the pytest-operator for testing, and I'm seeing this error a lot:

ERROR    asyncio:base_events.py:1738 Task exception was never retrieved
future: <Task finished name='Task-30' coro=<Connection._pinger() done, defined at /home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/juju/client/connection.py:502> exception=Exception('No facade Pinger in facades {}')>
Traceback (most recent call last):
  File "/home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/juju/client/connection.py", line 516, in _pinger
    pinger_facade = client.PingerFacade.from_connection(self)
  File "/home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/juju/client/_client.py", line 65, in from_connection
    raise Exception('No facade {} in facades {}'.format(facade_name,
Exception: No facade Pinger in facades {}

and

ERROR    asyncio:base_events.py:1738 Task exception was never retrieved
future: <Task finished name='Task-402' coro=<WebSocketCommonProtocol.recv() done, defined at /home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/websockets/protocol.py:443> exception=ConnectionClosedOK('code = 1000 (OK), no reason')>
Traceback (most recent call last):
  File "/home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/websockets/protocol.py", line 509, in recv
    await self.ensure_open()
  File "/home/jon/code/canonical/operator-libs-linux/.tox/integration/lib/python3.9/site-packages/websockets/protocol.py", line 803, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedOK: code = 1000 (OK), no reason

These occur consistently on every run. More examples can be found here: https://github.com/jnsgruk/charm-kubernetes-dashboard/runs/4069754517?check_suite_focus=true#step:4:256

@SimonRichardson
Copy link
Member

We should always have a Pinger facade. So either we're not connected any longer or we're attempting to create a Pinger without first building the facades. There are couple of locations that I think need investigation:

  1. This will override the pinger and not close the old one. We should at least guard against that. https://github.com/juju/python-libjuju/blob/master/juju/client/connection.py#L802-L803
  2. Reconnect doesn't rebuild the facades, it should https://github.com/juju/python-libjuju/blob/master/juju/client/connection.py#L684-L697
  3. When tailing the debug-log we should ensure that we build the facades or disable the pinger functionality https://github.com/juju/python-libjuju/blob/master/juju/client/connection.py#L327

jujubot added a commit that referenced this issue Nov 19, 2021
#580

### Description

A bunch of background noise is happening in the concurrent asyncio events resulting in repeated error logs to build up, despite that nothing in the foreground is failing. These are often:

1. Actual exceptions that are ignored in the main task/thread.
2. Error logs bleeding from other libraries (potentially when trying to handle the ignored exceptions in (1)
(b/c `asyncio` manages their events as well).
3. Tasks that are not gathered properly at teardown.

This PR attempts to streamline all the tasks and events that are happening on our `connection.py` and make sure we're:
* not producing any background exceptions,
* not producing any unnecessary/useless/excess error logs, and 
* tearing down properly.

Hopefully will fix #576 

Jira card [#138](https://warthogs.atlassian.net/browse/JUJU-138)

### QA Steps

In the CI test runs (especially in the ones with failed tests), we should only see the pass info and exception stacktraces that are relevant to the test logic (i.e. there shouldn't be any asyncio related errors in there).

### Notes & Discussion 

* This first commit ensures the `Pinger` facade is available when the `_pinger_task` is created, by moving the `_pinger_task` creation after `_build_facade` calls.

* Also disables the `_pinger_task` completely for `debug_log` connections, because debug_log connections are not doing any RPCs, they directly connect through the WebSocketClientProtocol, so we're not getting any facades from the apiserver (like we do in the case of `_connect_with_login` etc)

* A relevant bug here is https://bugs.python.org/issue36709 : which is about `asyncio.sslproto.SSLProtocol` raising an error whenever the event_loop is closed during an alive websocket connection. So in our `client/connection.py` we avoid keeping the connection alive (by removing `ping_interval=None`).

* The websocket object (_ws) is tightly coupled with the asyncio-managed tasks (e.g. Pinger) as well as the Monitor, so interacting with it directly (e.g. doing conn._ws.close() instead of conn.close()) pushes things out of sync/control. So it shouldn't be used directly (e.g. avoid `conn.ws.close()`, instead use `conn.close()`)

* Functions such as `model.get_controller` returns a newly constructed Controller object that is already connected with the model's own connection parameters. This puts the responsibility of closing out the connection (and destroying the spawned asyncio tasks etc) to the caller. If not done, everyone (asyncio, websockets etc) yells at pylibjuju about lingering tasks and open connections. So we introduce a `ConnectedController` object that when it's used with the `async with` form automatically disconnects after done.

* asyncio loop event_handler is only called on Task exceptions when the Task object is cleared from the memory. But the garbabe collector doesn't clear the Task if we keep a reference for it (e.g. putting it in a neverending Task ---like Pinger) until the very end. So installing our own handler makes sure that the exceptions are retrieved and properly handled and logged regardless of when the Task is being destroyed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants