Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198
I've seen this twice now. It particularly happens when there's a failure on a QUIC connection originating from the remote node's side. For example, if they force close their node, our node then fails.
Twice when Alexi who is in Greece from the IP:
Who tried to clone my vault with Then he had to force shutdown his agent, because graceful agent stop was not possible. This is likely due to QUIC again as there are possibly open fds or open connections that hold the process open. Finally after force closing their agent, and waiting for some time… my agent ended up failing. You can see the prior connection is exactly the NodeID of the node that was force closed from Greece. So this error is connected to:
|
On the prior attempt:
What's interesting, is that I find that sometimes… even this is also an uncaught exception:
This also causes the agent to exit ungracefully. So the first time what I described happened, this was the exception. But the second time, was the |
I'm going to rename this issue, given that there is now evidence of at least 2 different errors that are going uncaught and causing an ungraceful exit. |
ErrorQUICConnectionIdleTimeout and ErrorNodeConnectionTransportGenericError.
This is connected to #185, because that is likely caused by one side leaking connections. When this occurs, it's a resource/IO/fd leak, and that will cause the graceful shutdown to hang forever. |
The |
Pull error handling always uses standard exception handlers to deal with the problem. This we understand. Push error handling looks underimplemented. It relies on object-system context error handlers. We need to make sure that QUIC errors are fully "captured" by the nodes domain so that they don't just bubble up to the process context error handlers, which are likely being defined in the This is because we're still working out the architecture of push-flow in MatrixAI/Polykey#444. The fact that these exceptions are all bubbling up to the process means that errors are not properly handled within |
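To illustrate what "capturing" a pushed error inside the owning domain could look like, here is a minimal sketch. Every name in it (`EventConnectionError`, `FakeConnection`, `ConnectionManager`) is hypothetical and is not the js-quic or Polykey API; it only assumes that connections report failures by dispatching events on an `EventTarget`.

```typescript
// Hypothetical names throughout; this is not the js-quic or Polykey API.
class EventConnectionError extends Event {
  public readonly detail: Error;
  constructor(detail: Error) {
    super('error');
    this.detail = detail;
  }
}

// Stand-in for a connection object that pushes failures out as events.
class FakeConnection extends EventTarget {
  fail(reason: Error): void {
    this.dispatchEvent(new EventConnectionError(reason));
  }
}

// Stand-in for the owning domain. It subscribes to every connection it
// tracks, so a pushed error is recorded and the connection is dropped
// here instead of reaching a process-level handler.
class ConnectionManager {
  public readonly connections = new Set<FakeConnection>();
  public readonly handled: Array<Error> = [];

  add(conn: FakeConnection): void {
    this.connections.add(conn);
    conn.addEventListener(
      'error',
      (evt: Event) => {
        if (evt instanceof EventConnectionError) {
          this.handled.push(evt.detail);
        }
        this.connections.delete(conn);
      },
      { once: true },
    );
  }
}

const demoManager = new ConnectionManager();
const demoConn = new FakeConnection();
demoManager.add(demoConn);
demoConn.fail(new Error('idle timeout'));
console.log(demoManager.handled.length, demoManager.connections.size);
```

The point of the sketch is only the ownership rule: whichever object creates a connection also registers its error listener, so no connection exists without a handler.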
I had left my PC running for over 2 days with Polykey agent running in the background, and I got this error after about 5 hours of running.
|
I just observed this exact error when I ran this command.
$ npm run polykey -- secrets mkdir vault:aslkfj/a
> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/a
mkdir: cannot create directory aslkfj/a: No such file or directory
ErrorPolykeyCLIMakeDirectory: Failed to create one or more directories - Failed to create one or more directories
This command took about 10 seconds, which is way too long for this, and then the agent crashed with the 0-length message error. I tried this and got an RPC timeout error.
$ npm run polykey -- secrets mkdir vault:aslkfj/asf/asd/fa
> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/asf/asd/fa
ErrorPolykeyCLIUnexpectedError: An unexpected error occured - Thrown 'ErrorRPCTimedOut'
cause: ErrorRPCTimedOut: RPC has timed out
However, this also seems inconsistent, as trying to replicate this issue again yielded the command taking a while, but succeeding in the end without the agent crashing with QUIC. |
This is now happening immediately upon startup. I also sometimes get |
Upon starting a new agent in a new node path, within a few seconds of starting up, there will be a massive amount of logs coming from the node connection manager. Then one of 2 things happens, which causes an immediate ungraceful crash:
Or:
|
To replicate all you need to do:
It will definitely fail. |
I have another problem and it's a regression that's been created from recent changes to PK CLI too. When the agent has crashed, and it is ungraceful, it leaves the When you try to run
Commands should be aware when the agent isn't live and properly report on this. |
I can corroborate this with an initial installation on MacOS. |
I just did this, on my dev laptop. It's working fine. |
It also fails for both me and a friend with an initial installation on MacOS. Even if you don't see it now, you will see it eventually. |
Again I don't think this has anything to do with the node state. As this is the current temporary node state I created: node.tar.gz. After several agent start attempts... eventually it doesn't immediately break. But it happens enough times for this to be repeatable. The event emitted for |
If it's failing on a fresh node then no, it shouldn't be due to state. I'll dig through the |
Is this related to MatrixAI/js-rpc#74? |
It's always obvious when it's about to happen: there's a LARGE amount of logs. |
Read this. |
The logs aren't going to enable you to reproduce it, and the fact that you saw it happen already means you already have the logs if you want. The exception itself shows you what is leaking, just go to the js-quic source code. |
Because this problem occurs even when the process is idling, it has no relationship with the PK CLI's commands. It arises primarily through the node-to-node network, which starts automatically because the P2P network needs to do synchronization. |
This shows the craziness of node connection formation and destruction. It results in over 100% CPU usage over time. (Video: Peek.2024-12-05.18-06.mp4) This is crazy. It should not be this much. |
Why are there so many connections constantly being formed over and over again and being destroyed... After a long time... it finally calms down. |
And after all that busy work... it ends up crashing at the end:
|
Few questions.
|
It happens with both. I have 16 cores. Performance. Plugged in. It calms down but then fails after some time. |
It's always the same 2 errors as above. |
What's the output of |
Why you need to focus on the leak first: https://chatgpt.com/share/e/6752434d-fbec-8004-aa20-87c4d692b1b8 |
|
Brian and I ran a bunch of tests and we added a
This reveals that no errors were thrown from |
Yes, because the event is emitted from js-quic, but it's Polykey that's ultimately encapsulating the objects, so of course it's leaking inside PK. |
This is a blocker for doing any sort of demos. It happens too often. I want to make sure this is investigated and fixed ASAP. |
@tegefaulkes can you write down your debugging discovery and process here? Here's another log of the agent breaking after some hours:
Notice here it's I believe it's the fact that something is still trying to use a particular stream even though the stream object associated with the QUICStream is already destroyed and thus closed. I think there's a lifecycle issue here. |
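As a rough illustration of the suspected lifecycle issue (a late write arriving after the stream has been destroyed), here is a minimal sketch of a status-guarded writer. `GuardedWriter` is a hypothetical wrapper, not the `QUICStream` API, and it is an assumption that the real failure has this shape.

```typescript
// Hypothetical wrapper; not the QUICStream API.
type WriterStatus = 'open' | 'closed';

class GuardedWriter {
  private status: WriterStatus = 'open';
  public readonly written: Array<string> = [];
  public readonly rejected: Array<string> = [];

  // Returns true if the chunk was accepted, false if it arrived too late.
  write(chunk: string): boolean {
    if (this.status === 'closed') {
      // Late write after close: record it for diagnosis rather than
      // throwing from a context where nothing is positioned to catch it.
      this.rejected.push(chunk);
      return false;
    }
    this.written.push(chunk);
    return true;
  }

  close(): void {
    this.status = 'closed';
  }
}

const guardedWriter = new GuardedWriter();
guardedWriter.write('first');
guardedWriter.close();
const lateWriteAccepted = guardedWriter.write('second');
console.log(lateWriteAccepted, guardedWriter.written, guardedWriter.rejected);
```

Recording the rejected write makes the offending caller visible, which is the useful part when the goal is to find who still holds a reference to a closed stream.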
Debugging so far.
I've traced how the error in question gets passed around. I found some potential promise leaks that I patched up, but otherwise nothing stood out as the cause. I did a sanity check, and the error itself is almost certainly getting leaked out of an async context that isn't being awaited properly. We couldn't trace this to anywhere in the quic code, and after patching the promises I think the quic code is good.
I'm going to dig through the Polykey code to see if there are any problems with the code there. I'm adding a shotgun check so that every time we throw an error, I log when and where. It generates a lot of noise, but hopefully if we catch it in the act we can trace it back to the last point we saw the error and narrow it down from there.
I've modified the code to force the emission of the error. Doing so doesn't trigger the problem at all. So it seems to purely be a race condition causing the leak. This lends evidence to improper handling of a promise rejecting somewhere in the code. I've tried messing with timeout configuration to trigger the problem, but no such luck there. I'm continuing with
|
Can you explain how some of your recent commits are addressing this? |
The following changes have been made so far.
|
The |
OK, I've split up all of the remaining tasks here into new issues. Closing this issue now. |
Aren't 1. and 2. sort of problematic? Whenever I see that, it looks like you're discarding exceptions. That may prevent a leak, but then you're just hiding the error. Surely it needs to be bubbled up to a handler for it to be "handled", which may just mean printing it out in some way as a warning or info message. |
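One way to reconcile both concerns is to attach a handler to every background promise, so that nothing leaks to the process, while having that handler report the error instead of discarding it. A minimal sketch follows; `runInBackground` and `report` are hypothetical names, and `report` stands in for whatever logger the application actually uses.

```typescript
// Hypothetical helper; `report` stands in for the application's logger.
type Report = (message: string) => void;

function runInBackground(
  task: Promise<unknown>,
  label: string,
  report: Report,
): Promise<void> {
  return task.then(
    () => undefined,
    (e: unknown) => {
      // The rejection is handled here, so it cannot become an unhandled
      // rejection, but it is surfaced as a warning rather than dropped.
      const reason = e instanceof Error ? e.message : String(e);
      report(`${label} failed: ${reason}`);
    },
  );
}

const backgroundWarnings: Array<string> = [];
const backgroundSettled = runInBackground(
  Promise.reject(new Error('stream closed')),
  'cleanup',
  (message) => {
    backgroundWarnings.push(message);
  },
);
```

The returned promise always resolves, so callers that do not await it cannot leak a rejection, and the warning still records what went wrong.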
I haven't seen this for a while now:
Was this ever addressed @tegefaulkes? |
Its a the |
Describe the bug
The node seems to be randomly crashing with ErrorQUICConnectionIdleTimeout.
It also can produce - any of these things are being leaked up to the process for it to do an ungraceful exit/crash:
TypeError: Invalid state: WritableStream is closed
To Reproduce
Polykey agent.
Expected behavior
Should not crash with ErrorQUICConnectionIdleTimeout
Screenshots
Platform (please complete the following information)
0.4.1.
Additional context
ErrorNodeConnectionTransportGenericError.
ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message. (source: Polykey-CLI#198 (comment))
Notify maintainers
@tegefaulkes