Stream Lifecycle Bugs Cause Process Exit/Panic #128
Comments
Ok, the
I feel like the name of this needs to be specific to connection lifecycle or stream lifecycle bugs. However, I also think there may need to be some refactoring of the draining state system, which currently relies on booleans/locks.
It's also important to mention that these are the equivalent of "panics": we do want the program to crash here because this represents an invalid state of the entire logic, rather than a legitimate exception that just got leaked, as in the case of the connection idle error.
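For reference, a minimal sketch of what treating such an invalid state as a process-level panic could look like in Node (the `panic` helper is hypothetical, not something from this codebase):

```ts
// Hypothetical sketch only: surfacing an invalid internal state as a
// process-crashing error rather than a recoverable exception.
function panic(message: string): never {
  const error = new Error(`Invalid state (panic): ${message}`);
  // Throwing again on the next tick escapes any surrounding promise chain,
  // so even if a caller catches the synchronous throw, the process still exits.
  process.nextTick(() => {
    throw error;
  });
  throw error;
}
```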
If it's a "panic state" I don't even bother wrapping the exception... except maybe converting it to
Progress, here are the facts for the
I've determined the error itself isn't a panic; it gets thrown to whatever is creating the stream. However, at the time it was leaking out the same way as the idle timeout bug. Two things are needed to address this.
I've added a patch version to
I've also confirmed the stream leak. It's slow, but if the connection keeps getting used then it will run out of streams eventually. My running theory is it's one or more RPC handlers or calls that don't clean up their stream properly. I should be able to track it down easily with some monkey patching.
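A rough sketch of the monkey patching meant here: wrap stream creation and cleanup to count streams that get created but never destroyed. The `newStream` and `destroy` method names are assumptions, so adjust them to the real API:

```ts
// Sketch only: instrument stream creation/cleanup to find leaked streams.
// The `newStream`/`destroy` names are assumed, not confirmed API.
function trackStreamLeaks(connectionClass: any, streamClass: any): void {
  const liveStreams = new Set<object>();

  const originalNewStream = connectionClass.prototype.newStream;
  connectionClass.prototype.newStream = async function (this: any, ...args: unknown[]) {
    const stream = await originalNewStream.apply(this, args);
    liveStreams.add(stream);
    console.log(`stream created, live streams: ${liveStreams.size}`);
    return stream;
  };

  const originalDestroy = streamClass.prototype.destroy;
  streamClass.prototype.destroy = async function (this: any, ...args: unknown[]) {
    liveStreams.delete(this);
    console.log(`stream destroyed, live streams: ${liveStreams.size}`);
    return originalDestroy.apply(this, args);
  };
}
```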
Yep, seems to be a stream leak in
Well I've done a bunch of digging, checked a few things, and I can't for the life of me get the remaining stream limit to go back up again. So I went digging through the quiche code, and while the logic for tracking the limit is pretty simple, the counter is never actually decreased when the stream is cleaned up. https://github.com/cloudflare/quiche/blob/0570ab83cc5e46dc7b877765a6c0d7c4a44dd885/quiche/src/stream/mod.rs#L123 So as far as I can tell this limit applies to the total number of streams created for a connection, not just the active ones. This is a little puzzling since the docs explicitly say that the counter is reduced when the stream cleans up. I'll do a little more digging and make extra sure that the streams are being cleaned up properly, but the stream limit is something I don't think I can fix.
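To make the two interpretations concrete, here's a toy model (not quiche's actual code) of the difference between a limit over total streams ever created versus only the currently active ones:

```ts
// Toy model only, illustrating the two readings of the stream limit discussed above.
class StreamLimitModel {
  private created = 0;
  private active = 0;

  constructor(
    private readonly maxStreams: number,
    private readonly countTotal: boolean, // true: total ever created; false: active only
  ) {}

  open(): boolean {
    const used = this.countTotal ? this.created : this.active;
    if (used >= this.maxStreams) return false; // creation would fail with a limit error
    this.created++;
    this.active++;
    return true;
  }

  close(): void {
    this.active = Math.max(0, this.active - 1);
  }
}

// With countTotal = true, opening and closing 100 streams exhausts a limit of 100
// even though no streams remain active; with countTotal = false it never does.
```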
Yeah, I can confirm that the stream states are cleaning up properly. Despite that, the remaining stream limit never goes back up. So given that, the only solution we can apply for now is to increase the stream limit.
Why are we increasing the limit? Why not just prevent new stream creations?
Wait, that seems like a bug. Do you want to report this upstream, if it's just a counter number?
We are preventing new creations; that's what the error is for. We don't want to be hitting the limit so often though, and 100 is a bit small for that.
So what is the expected behaviour? Does it just block new creations, drop them, or throw an exception?
Previously, attempting to create a new stream when hitting the limit would throw an `ErrorQUICStreamLimit`. The other part of the problem, where this was crashing the program, is already solved: the error was leaking the same way the idle timeout error was. I've created an upstream issue at cloudflare/quiche#1883 asking if the limit should apply to active streams or total streams. As for the limit, right now I'm assuming that applying it to the total streams is the current implementation. So I'm going to have to increase the stream limit in Polykey to prevent this from being thrown too often, even if it's handled gracefully now. I'll also add
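A rough sketch of what handling the limit error at the stream creation site could look like (the `newStream()` method and the placeholder error class are stand-ins for the real API):

```ts
// Placeholder standing in for the real error class exported by the QUIC layer.
class ErrorQUICStreamLimit extends Error {}

// Sketch only: catch the limit error where the stream is created, so it can be
// backed off or escalated deliberately instead of leaking as an unhandled rejection.
async function createStreamSafely<T>(
  connection: { newStream: () => Promise<T> },
): Promise<T | undefined> {
  try {
    return await connection.newStream();
  } catch (e) {
    if (e instanceof ErrorQUICStreamLimit) {
      // Limit reached: signal the caller to back off (or treat it as a
      // connection-level error and clean up the connection).
      return undefined;
    }
    throw e;
  }
}
```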
After a discussion with upstream it seems it's very likely we're not cleaning up the streams properly. That's a bit frustrating, since as far as I can tell the conditions for cleaning up the stream are being met.

Actually, I just did a few quick tests, and while creating a small number of streams doesn't increase our limit when they're completed, doing about 40 streams shows the limit increasing after they complete. So the limit is being updated, but in batches of completed streams. So maybe there isn't actually a problem? It's just that the low limit is letting us run into it before the limit increase frame can be sent between connections. I'll just have to up the limit to 1000 or so and see if it still happens. If it does, I need to do a deeper dive into the QUIC stream state machine to see what's up and how things are triggered, because the docs aren't enough to go on.

Here's a log of the remaining limit after each stream completes.
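The kind of check described above can be sketched roughly like this, logging the remaining limit after each batch of streams completes (`newStream()` and `streamsRemaining()` are assumed method names, stand-ins for the real API):

```ts
// Sketch only: probe whether the stream limit is replenished per stream or in batches.
async function probeStreamLimit(connection: {
  newStream: () => Promise<{ destroy: () => Promise<void> }>;
  streamsRemaining: () => number;
}): Promise<void> {
  for (let batch = 0; batch < 5; batch++) {
    const streams: Array<{ destroy: () => Promise<void> }> = [];
    for (let i = 0; i < 40; i++) {
      streams.push(await connection.newStream());
    }
    // Complete and clean up every stream in the batch.
    await Promise.all(streams.map((s) => s.destroy()));
    // If MAX_STREAMS credit is granted in batches, the remaining count should
    // jump back up here rather than after every individual stream.
    console.log(`after batch ${batch}: remaining = ${connection.streamsRemaining()}`);
  }
}
```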
Can you show what's going on here during the next cycle meeting?
Specification
There are some errors leaking out of the internals of quic. These are:

- `TypeError: Invalid state: WritableStream is closed` - Results from trying to use the stream once it has closed. I'll need to sanity check if this is a webstream error or coming directly from quic. Likely from quic.
- `ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message` - This is one of our errors. We create a quic stream by writing a 0-length message to it (see the sketch below). Something is causing this to fail in an unexpected way.
- `ErrorNodeConnectionTransportGenericError: Transport received a generic error`
We'll need to dig deeper into these errors, find the cause of them and fix it.
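A rough sketch of the priming pattern mentioned in the second error above (the `streamSend` signature is modelled on quiche's stream_send and is an assumption, not the exact binding):

```ts
// Sketch only: a local stream's state is initialised by writing a 0-length message.
function primeLocalStream(
  connection: {
    streamSend: (streamId: number, data: Uint8Array, fin: boolean) => number;
  },
  streamId: number,
): void {
  try {
    // A 0-length, non-fin write just establishes the local stream state.
    connection.streamSend(streamId, new Uint8Array(0), false);
  } catch (e) {
    // If this fails, the stream was never usable; surface it as an internal error.
    throw new Error(`Failed to prime local stream state with a 0-length message: ${e}`);
  }
}
```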
Additional context
Tasks
- Find the cause of the `TypeError` and fix it.
- Find the cause of the `ErrorQUICStreamInternal` and fix it.
- Gracefully handle the `ErrorQUICStreamLimit` due to limit being reached.
- Handle the `ErrorQUICStreamLimit` as a connection error to trigger clean up of the connection when it happens.
- Find the cause of the `ErrorNodeConnectionTransportGenericError` and fix it.