polykey agent stop command not terminating properly #185
Comments
I think I've seen this while testing recently as well. It seems to be triggered by some condition that only arises after the agent has been running for a while, which makes it very hard to pin down. |
I suspect that while the So I'd check this first. Write a test that fires a lot of connections and RPC requests at a Right now I'm basing this off the fact that I caught it in the act. I had an agent running with verbose logging for a while and triggered it to stop. It entered the |
I tried approaching this from the outside in: I added a bunch of logging and debugging for testing and waited for the problem to happen so I could catch it in the act. This isn't working very well... The running theory is that connection activity is causing a race condition that deadlocks the stopping procedure for networking. It's hard to say exactly what is happening and where, so I have to work from the bottom up, fixing any potential problems that could cause it. The usual suspects are the following libraries. These create macrotasks, and if any of them leak then the process will fail to close after stopping. They could also deadlock when cleaning up.
On top of the libraries, we have the following domains that maintain connection lifecycles. These could potentially deadlock when cleaning up.
The main bit of evidence I have so far is...
So knowing this, the most likely suspect is that endless RPC calls are preventing the connection from stopping fully and deadlocking stop. To fix this, I need to add draining state handling to the
From here we can work our way downwards as needed for the libraries.
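The draining idea above could be sketched roughly as follows. This is a minimal illustration, not Polykey's implementation: `DrainableConnection`, `call`, `stop`, `status` and `pending` are hypothetical names. It assumes that a connection refuses new RPC calls once stopping begins, waits for in-flight calls to settle, and gives up after a timeout so that endless calls cannot deadlock the stop.

```typescript
// Hypothetical sketch: DrainableConnection is an illustrative name,
// not part of Polykey's actual API.
type ConnectionState = 'active' | 'draining' | 'stopped';

class DrainableConnection {
  protected state: ConnectionState = 'active';
  protected inFlight = new Set<Promise<unknown>>();

  public get status(): ConnectionState {
    return this.state;
  }

  public get pending(): number {
    return this.inFlight.size;
  }

  // Run an RPC-like call. Once draining has begun, new calls are refused
  // so that a steady stream of requests cannot keep the connection alive.
  public call<T>(handler: () => Promise<T>): Promise<T> {
    if (this.state !== 'active') {
      return Promise.reject(new Error(`connection is ${this.state}`));
    }
    const p = handler();
    this.inFlight.add(p);
    const remove = () => {
      this.inFlight.delete(p);
    };
    // Stop tracking the call whether it resolves or rejects.
    p.then(remove, remove);
    return p;
  }

  // Enter the draining state, then wait for in-flight calls or the timeout,
  // whichever comes first. Resolves true if everything drained in time.
  public async stop(timeoutMs: number): Promise<boolean> {
    if (this.state === 'stopped') return true;
    this.state = 'draining';
    const settled = Promise.all(
      Array.from(this.inFlight).map((p) =>
        p.then(() => undefined, () => undefined),
      ),
    );
    const drained = await new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      settled.then(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
    this.state = 'stopped';
    return drained;
  }
}
```

A `false` result from `stop` would indicate that some call outlived the timeout, which is the point where a real implementation would force-close and log what was still pending.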
This is a fair amount of polish work that may not even solve the issue. But it's all stuff I've been meaning to do eventually. May as well get it done now. |
As per the comment #157 (comment), I'm not a fan of the In the ideal case, the agent should stop perfectly, and there is no need for this 3rd case message. In the non-ideal case, you just have to add an extra concurrent timer function using js-timer that runs and reports a warning that the agent cannot stop. This should result in a trace of what exactly is holding it. In the case of Node.js, remember that as a JS runtime there are only 2 things that can hold the process open: open IO fds or infinite loops. If we are able to run the concurrent function, then there can be no infinite loops, so the only thing left to trace is open FDs. Node as a runtime does not want to stay running. In fact we previously had an issue MatrixAI/Polykey#307 where node would sometimes just stop running, and we had no idea why. That turned out to be due to "promise deadlocks", and we had an issue to try and trace promise deadlocks live using the new async hooks API https://nodejs.org/api/async_hooks.html. We ended up not using it, but it's an important API for any concurrent trace debugging MatrixAI/js-logger#15, which we eventually want to collect together into a Whereas debugging the opposite is supposed to be MUCH easier, as you can see from this SO question: https://stackoverflow.com/questions/26057328/node-js-inspect-whats-left-in-the-event-loop-thats-preventing-the-script-from. |
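The concurrent timer described above could look something like the sketch below. It uses a plain `setTimeout` rather than js-timer, and `startStopWatchdog` and `summarizeResources` are hypothetical helper names. `process.getActiveResourcesInfo()` is a real but experimental Node.js API (added in v17.3.0 and v16.14.0) that returns the type names of resources currently keeping the event loop alive.

```typescript
// Sketch only: startStopWatchdog and summarizeResources are hypothetical
// helpers, and plain setTimeout stands in for js-timer.

// Count how many active resources of each type are still alive.
function summarizeResources(names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of names) {
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return counts;
}

// Start a timer alongside the stop procedure. If stopping has not finished
// after timeoutMs, report what is still alive. The timer is unref'ed so the
// watchdog itself never holds the process open.
function startStopWatchdog(
  timeoutMs: number,
  report: (counts: Record<string, number>) => void,
  getResources: () => string[] = () => process.getActiveResourcesInfo(),
): () => void {
  const timer = setTimeout(() => {
    report(summarizeResources(getResources()));
  }, timeoutMs);
  timer.unref();
  // Call the returned function once stopping has completed.
  return () => clearTimeout(timer);
}

// Usage (the stop call itself is assumed, not shown):
//   const cancel = startStopWatchdog(5000, (counts) =>
//     console.warn('agent has not stopped; active resources:', counts),
//   );
//   ... await the agent's stop procedure ...
//   cancel();
```

Because the report callback runs on the event loop, seeing the warning at all rules out an infinite loop, and the counts narrow down which kind of handle is still open.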
So going forward:
|
If a graceful exit is halted by leaked connections, then this is likely connected to #198, which appears to be related to a remotely leaked connection that ends up causing an ungraceful exit of the agent. |
This can just be replaced by raising the info level to warning level for |
I've created an issue at #270 to track that. |
My agent has been running in the background for about two days. As I have been working on #832 (RPC cancellation), I haven't been using the agent to run any commands. And when I tried to stop the agent, it was able to stop fairly quickly (under 3 seconds). This means that most likely the issue is coming from a command and not from any background tasks that are run without input. I guess we can run each command once and try to shut down the agent to see which command causes the leaks, but that will be very time consuming. Adding proper cancellation to all RPC commands should help circumvent this issue to an extent. |
All you need to do is to track resource counts using resource counter and you'll see what's leaking instead of trying to blackbox this. |
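As an illustration of the resource counting suggested above, a counter might be as simple as the following. `ResourceCounter` and its methods are hypothetical names, not an existing Polykey API. The assumption is that each domain calls `acquire` when it opens a connection, stream, or timer and `release` when it closes one, so whatever remains after stopping points at the leak.

```typescript
// Hypothetical sketch of a resource counter; not an existing Polykey API.
class ResourceCounter {
  protected counts = new Map<string, number>();

  // Record that one resource of the given kind was opened.
  public acquire(kind: string): void {
    this.counts.set(kind, (this.counts.get(kind) ?? 0) + 1);
  }

  // Record that one resource of the given kind was closed.
  public release(kind: string): void {
    const current = this.counts.get(kind) ?? 0;
    if (current <= 0) {
      throw new Error(`release without acquire: ${kind}`);
    }
    if (current === 1) {
      this.counts.delete(kind);
    } else {
      this.counts.set(kind, current - 1);
    }
  }

  // Anything still listed here after the stop procedure is a leak candidate.
  public outstanding(): Record<string, number> {
    const out: Record<string, number> = {};
    this.counts.forEach((n, kind) => {
      out[kind] = n;
    });
    return out;
  }
}
```

Logging `outstanding()` at the end of the stop procedure would show, for example, that one connection was never released, without having to run every command and wait for a hang.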
I was running a temporary local node on my machine for testing purposes, and I encountered this log message before the agent got stuck on shutting down. This is using the latest staging for Polykey CLI. Before I stopped the agent, I attempted a The log messages show which task failed to stop, so this could be useful to help pinpoint the issue and finally resolve it.
|
I got something similar again, but this time, on the main node I operate. This time, it had different handlers which failed to stop on time.
|
Debugging these is going to require you to dig inwards and observe the concurrent code, rather than just sitting on the outside of the black box. |
Describe the bug
Looking through the logs, it looks like the hook for shutting down never triggered. The log cuts out mid-line, as if the process was terminated. That must've been when you had to manually kill it.
This might be a mac specific problem.
To Reproduce
polykey agent stop
Test on the Linux version the same way we got the error on Mac, to see whether it's an OS-specific incident or not.
Expected behavior
We handle most signals to trigger stopping the agent. The only thing that should really kill it like this is a SIGKILL signal.
Screenshots
Platform
polykey-cli-0.3.1-darwin-universal
Additional context
Notify maintainers
@tegefaulkes