Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polykey agent stop command not terminating properly #185

Open
CryptoTotalWar opened this issue May 22, 2024 · 14 comments
Open

polykey agent stop command not terminating properly #185

CryptoTotalWar opened this issue May 22, 2024 · 14 comments
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@CryptoTotalWar
Copy link

CryptoTotalWar commented May 22, 2024

Describe the bug

Looking through the logs, looks like the hook for shutting down never triggered. The log cuts out mid line like the process was terminated. That must've been when you had to manually kill it.

This might be a mac specific problem.

To Reproduce

  1. Run the agent
  2. Wait for a few minutes
  3. Run polykey agent stop

Test on Linux version that same way we got the error on Mac to see if it's an OS isolated incident or not.

Expected behavior

We handle most signals to trigger stopping the agent. The only thing that should really kill it like this is a SIGKILL signal.

Screenshots

Platform

  • Device: Mac
  • OS: Sonoma 14 Terminal Version 2.14
  • Version: polykey-cli-0.3.1-darwin-universal

Additional context

  • Polykey logs can be found here: Polykey-CLI#183
  • Propagated problem of connections being leaked causing push error flow being discussed here: Polykey-CLI#198

Notify maintainers

@tegefaulkes

@CryptoTotalWar CryptoTotalWar added the bug Something isn't working label May 22, 2024
@CryptoTotalWar CryptoTotalWar self-assigned this May 22, 2024
Copy link

linear bot commented May 22, 2024

@CryptoTotalWar CryptoTotalWar changed the title Polykey agent stop command not terminating properly Polykey-CLI: agent stop command not terminating properly May 22, 2024
@tegefaulkes
Copy link
Contributor

I think I've seen this while testing recently as well. So it's a certain condition that can happen after running for a while that causes it. That makes it very hard to pin down.

Copy link
Contributor

tegefaulkes commented Jun 20, 2024

I suspect that while the pkAgent.stop() is called, and the NodeConnectionManager is stopping with force. It might be possible that we are still handling incoming connections and RPC requests despite the stopping state.

So I'd check this first. Write a test that fires a lot of connections and RPC requests at a NodeConnectionManager and trigger a NodeConnectionManager.stop() with force set to true and false and just see what happens.

Right now I'm basing this off the fact that I caught it doing it in the act. I had a agent running with verbose for a while and triggered it to stop. It entered the stopping state, however it was still serving incoming connections and streams. If we enter a stopping state, forced or not, we should reject all new connections and streams that are attempted.

@tegefaulkes tegefaulkes assigned amydevs and unassigned tegefaulkes Jun 20, 2024
@tegefaulkes
Copy link
Contributor

tegefaulkes commented Jul 10, 2024

I tried approaching this from the outside in. Where I added a bunch of logging and debugging for testing and just waiting for the problem to happen so I can catch it in the act. This isn't working very well...

The running theory is that connection activity is causing a race condition that deadlocks the stopping procedure for networking. It's hard to say exactly what and where this is happening. So I have to work from the bottom up fixing up any potential problems that would cause it.

The usual suspects are the following libraries. This will create macrotasks and if any of them leak then we will fail to close the process after stopping. Also they could potentially deadlock when cleaning up.

  1. js-quic, Holds a bunch of macro tasks and resources. If it fails to clean up properly then it will deadlock.
  2. js-rpc, Similar but is a layer between transport and application code. Lifecycle is determined by the streams it consumes so it seems unlikely to deadlock from that. However it does maintain macrotasks in the form of timers.
  3. js-mds, Similar to js-quic in that it maintains a socket and timers.
  4. js-ws, Also maintains a socket and timers.

On top of the libraries, we have the the following domains that maintain connection lifecycles. These could potentially deadlock when cleaning up.

  1. client
  2. nodes

The main bit of evidence I have so far is...

  1. When I caught the CLI failing to end when stopping I saw it still serving a bunch of agent-agent RPC calls.
  2. It seems to happen more often with more nodes in the network. So more network traffic.
  3. Seems to trigger more often when calling agent stop to trigger stopping.

So knowing this, the most likely suspect is that endless RPC calls are preventing the connection from stopping fully and deadlocking stop. To fix this, I need to add draining state handling to the client and nodes domain, along with adding it to each of the transport libraries we have. I can't guarantee that it will fully fix the issue but at the very least it will cross that potential problem off as the cause. So moving forward.

  1. Look at the NodeConnectionManager domain and add logic that prevents accepting or creating new connections when stopping. Creating connections should already be prevented but we need to check reverse connections.
  2. The NodeConnections need to reject new streams when in the stopping state. Creating new streams should already be prevented, but we need to prevent reverse streams.
  3. Review client domain for any potential connection or stream leaks. Also add the same draining state logic that rejects new connections and streams.

From here we can work our way downwards as needed for the libraries.

  1. js-quic needs draining state support. We need to reject new streams and connections, both forward and reverse when we're stopping.
  2. js-ws needs the same treatment.
  3. js-rpc doesn't create any connections but timeout timers might leak.
  4. js-mdns I haven't worked on this directly, but it's usage of macrotasks can hold the process open. so we need to investigate it.

This is a fair amount of polish work that may not even solve the issue. But it's all stuff I've been meaning to do eventually. May as well get it done now.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Aug 17, 2024

As per the comment #157 (comment), I'm not a fan of the Stopping Agent STDERR message, it seems unnecessary, and I don't see how it helps in debugging anything. If the prompt is not returned to the terminal, then stopping is broken. And in every case where the agent is running automatically it should be using --verbose anyway.

In the ideal case, the agent should stop perfectly, and also no need for this 3rd case message. In the unideal case, you just have to add an extra concurrent timer function using js-timer that runs, and reports a warning that the agent cannot stop - this should result in a trace of what exactly is holding it. In the case of nodejs, remember as a JS runtime, there's only 2 things that can hold the process open - any open IO fds, or infinite loops. And if we are able to run the concurrent function, then there can be no infinite loops. Then the only thing to trace is open FDs.

Node as a runtime, does not want to stay running. In fact we previously had an issue MatrixAI/Polykey#307 where node would sometimes just stop running, and we had no idea why. That turned out to be due to "promise deadlocks", and in fact we had an issue to try and trace promise deadlocks live, using the new async hooks api https://nodejs.org/api/async_hooks.html, we ended up not using it, but it's an important API for any concurrent trace debugging MatrixAI/js-logger#15, which we eventually want to collect together into a diagnostics domain: MatrixAI/Polykey#635

Whereas debugging the opposite is supposed to MUCH easier. As you can see by this SO issue: https://stackoverflow.com/questions/26057328/node-js-inspect-whats-left-in-the-event-loop-thats-preventing-the-script-from.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Aug 17, 2024

So going forward:

  1. Investigate open fds for causing the process to stay open. (REMEMBER THAT NODE DOES NOT ALLOW PROMISE DEADLOCKS TO KEEP THE PROCESS OPEN!) - @tegefaulkes so I don't think investigating any of the async operations makes sense in debugging this.
  2. Get rid of that Stopping Agent message that was introduced from Add a shutting down message to polykey agent start command #157.
  3. As part of the Polykey conceptual structure/planning we should be starting on Setting up diagnostics Domain for keeping track of some operational metrics Polykey#635 to deal with these situations - I think PK has reached a level of complexity requiring dedicated diagnostics.
  4. The exit handler should be producing exit code of 0 when gracefully handled the signals like SIGINT and SIGTERM. It should be not be 130 and 143 respectively!

@CMCDragonkai
Copy link
Member

If a graceful exit is halted by leaked connections. Then this is likely connected to #198 which appears to potentially be related to a remotely leaked connection that ends up causing a ungraceful exit of the agent.

@CMCDragonkai CMCDragonkai added r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices and removed r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management labels Aug 23, 2024
@CMCDragonkai
Copy link
Member

  1. Get rid of that Stopping Agent message that was introduced from Add a shutting down message to polykey agent start command #157.

This can just be replaced by raising the info level to warning level for PolykeyAgent:Stopping Agent and PolykeyAgent: Stopped Agent.

@tegefaulkes
Copy link
Contributor

I've created an issue at #270 to track that.

@aryanjassal aryanjassal changed the title Polykey-CLI: agent stop command not terminating properly polykey agent stop command not terminating properly Nov 21, 2024
Copy link
Member

My agent has been running in the background for about two days. As I have been working on #832 (RPC cancellation), I haven't been using the agent to run any commands. And when I tried to stop the agent, it was able to stop fairly quickly (under 3 seconds).

This means that most likely the issue is coming from a command and not from any background tasks that are run without input. I guess we can run each command once and try to shut down the agent to see which command causes the leaks, but that will be very time consuming.

Adding proper cancellation to all RPC commands should help circumvent this issue to an extent.

Copy link
Member

All you need to do is to track resource counts using resource counter and you'll see what's leaking instead of trying to blackbox this.

@aryanjassal
Copy link
Member

I was running a temporary local node on my machine for testing purposes, and I encountered this log message before the agent got stuck on shutting down. This is using the latest staging for Polykey CLI. Before I stopped the agent, I attempted a vaults clone operation which failed with a timeout. I'm not sure if the timeout failure is relevant, but could be useful.

The log messages show which task failed to stop, so this could be useful to help pinpoint the issue and finally resolve it.

^CWARN:polykey:Stopping Agent
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms

Copy link
Member

I got something similar again, but this time, on the main node I operate. This time, it had different handlers which failed to stop on time.

WARN:polykey:Stopping Agent
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.checkConnectionsHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms

@CMCDragonkai
Copy link
Member

Debugging these are going to require you to dig inwards and observe the concurrent code, rather than just sitting on the outside of the black box.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices
Development

No branches or pull requests

5 participants