Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream Lifecycle Bugs causes Process Exit/Panic #128

Open
5 of 7 tasks
tegefaulkes opened this issue Dec 10, 2024 · 18 comments
Open
5 of 7 tasks

Stream Lifecycle Bugs causes Process Exit/Panic #128

tegefaulkes opened this issue Dec 10, 2024 · 18 comments
Assignees
Labels
development Standard development

Comments

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Dec 10, 2024

Specification

There are some errors leaking out of the internals of quic. These are

  • TypeError: Invalid state: TypeError: Invalid state: WritableStream is closed is closed - Results from trying to use the stream once it has closed. I'll need to sanity check if this is a webstream error or coming directly from quic. Likely from quic.
  • ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message - This is one of our errors. We create a quic stream by writing a 0 length message to it. Something is causing this to fail in an unexpected way.
  • ErrorNodeConnectionTransportGenericError: Transport received a generic error

We'll need to dig deeper into these errors, find the cause of them and fix it.

Additional context

Tasks

  1. 1. Find the source of the TypeError and fix it.
  2. 2. Find the source of the ErrorQUICStreamInternal and fix it.
    • Make the error clearer so it throws ErrorQUICStreamLimit due to limit being reached.
    • Prevent the error from leaking and crashing the process - solved with the idle timeout error crash fixes.
    • Increase the stream limit in Polykey to reduce how often this is thrown.
    • Treat ErrorQUICStreamLimit as a connection error to trigger clean up of the connection when it happens.
  3. 3. Find the source of the ErrorNodeConnectionTransportGenericError and fix it.
@tegefaulkes tegefaulkes added the development Standard development label Dec 10, 2024
@tegefaulkes tegefaulkes self-assigned this Dec 10, 2024
Copy link

linear bot commented Dec 10, 2024

@tegefaulkes
Copy link
Contributor Author

Ok, the TypeError: Invalid state: WritableStream is closed error is specifically from writing to a WritableStream webstream after it has been closed. If the writable stream errored then that error would be thrown when attempting to write to it. Seems like possible improper usage of the stream where we write to it after closing.

@CMCDragonkai
Copy link
Member

I feel like the name of this needs to be specific to connection lifecycle or stream lifecycle bugs.

However I would also like perhaps there needs to be some refactoring of the draining state system relying on booleans/locks.

@CMCDragonkai CMCDragonkai changed the title Fix errors resulting from lifecycle bugs. Stream Lifecycle Bugs causes Process Exit/Panic Dec 10, 2024
@CMCDragonkai
Copy link
Member

Also important to mention that these are the equivalent of "panics" because we do want the program to crash here as this represents an invalid state of the entire logic, rather than a legitimate exception that just got leaked in the case of the connection idle error.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Dec 10, 2024

TypeError: Invalid state: WritableStream is closed comes out of webstreams. We'd have to catch any attempt to write and wrap it if we want a nicer error.

@CMCDragonkai
Copy link
Member

If it's a "panic state" I don't even bother wrapping the exception... except maybe converting it to never(). If it's something where we can "catch all", then it would be possible... where we catch any kind of unexpected state exception... but that is what the never is for. We could enhance never with some additional metadata.

@tegefaulkes
Copy link
Contributor Author

Progress, here are the facts for the ErrorQUICStreamInternal bug.

  1. I found that it's caused by hitting the stream limit.
  2. Sanity checks show that the stream limit is defaulting to 100 and applies to active streams only. When they clean up then it frees up the limit. So this means there's a stream leak in polykey somewhere.
  3. I'm adding a specific error for hitting the stream limit so the error itself is less confusing later down the line.

I've determined the error itself isn't a panic, it gets thrown to whatever is creating the stream. However at the time it was leaking out the same way as the idle timeout bug.

2 things are needed to address this.

  1. Track down the stream leak. Possibly in the polykey code.
  2. Wrap the error in a more useful ErrorQuicStreamLimit error.

@tegefaulkes
Copy link
Contributor Author

I've added a patch version to js-quic that changes the ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message error with a ErrorQUICStreamLimit error that mentions the stream limit in the message.

I've also confirmed the stream leak. Its slow but if the connection keeps getting used then it will run out of streams eventually. My running theory is its one or more RPC handlers or calls that don't clean up their stream properly. I should be able to track it down easily with some monkey patching.

@tegefaulkes
Copy link
Contributor Author

Yep, seems to be a stream leak in js-quic where it's not cleaning up the streams properly. Digging deeper.

WARN:QUICStream 0:closed!
WARN:QUICStream 0:state rc: true, wc: true
WARN:QUICStream 0:state rf: true, wc: false
WARN:QUICStream 0:limit left: 100
WARN:QUICStream 0:---------
WARN:QUICStream 0:closed!
WARN:QUICStream 0:state rc: true, wc: true
WARN:QUICStream 0:state rf: true, wc: InvalidStreamState(0)
WARN:QUICStream 0:limit left: 98
WARN:QUICStream 0:---------
WARN:QUICStream 4:closed!
WARN:QUICStream 4:state rc: true, wc: true
WARN:QUICStream 4:state rf: true, wc: false
WARN:QUICStream 4:limit left: 100
WARN:QUICStream 4:---------
WARN:QUICStream 4:closed!
WARN:QUICStream 4:state rc: true, wc: true
WARN:QUICStream 4:state rf: true, wc: InvalidStreamState(4)
WARN:QUICStream 4:limit left: 97
WARN:QUICStream 4:---------
WARN:QUICStream 8:closed!
WARN:QUICStream 8:state rc: true, wc: true
WARN:QUICStream 8:state rf: true, wc: false
WARN:QUICStream 8:limit left: 100
WARN:QUICStream 8:---------
WARN:QUICStream 8:closed!
WARN:QUICStream 8:state rc: true, wc: true
WARN:QUICStream 8:state rf: true, wc: InvalidStreamState(8)
WARN:QUICStream 8:limit left: 96
WARN:QUICStream 8:---------
WARN:QUICStream 12:closed!
WARN:QUICStream 12:state rc: true, wc: true
WARN:QUICStream 12:state rf: true, wc: false
WARN:QUICStream 12:limit left: 100
WARN:QUICStream 12:---------
WARN:QUICStream 12:closed!
WARN:QUICStream 12:state rc: true, wc: true
WARN:QUICStream 12:state rf: true, wc: InvalidStreamState(12)
WARN:QUICStream 12:limit left: 95
WARN:QUICStream 12:---------
WARN:QUICStream 16:closed!
WARN:QUICStream 16:state rc: true, wc: true
WARN:QUICStream 16:state rf: true, wc: false
WARN:QUICStream 16:limit left: 100
WARN:QUICStream 16:---------
WARN:QUICStream 16:closed!
WARN:QUICStream 16:state rc: true, wc: true
WARN:QUICStream 16:state rf: true, wc: InvalidStreamState(16)
WARN:QUICStream 16:limit left: 94
WARN:QUICStream 16:---------
WARN:QUICStream 20:closed!
WARN:QUICStream 20:state rc: true, wc: true
WARN:QUICStream 20:state rf: true, wc: false
WARN:QUICStream 20:limit left: 100
WARN:QUICStream 20:---------
WARN:QUICStream 20:closed!
WARN:QUICStream 20:state rc: true, wc: true
WARN:QUICStream 20:state rf: true, wc: InvalidStreamState(20)
WARN:QUICStream 20:limit left: 93
WARN:QUICStream 20:---------
WARN:QUICStream 24:closed!
WARN:QUICStream 24:state rc: true, wc: true
WARN:QUICStream 24:state rf: true, wc: false
WARN:QUICStream 24:limit left: 100
WARN:QUICStream 24:---------
WARN:QUICStream 24:closed!
WARN:QUICStream 24:state rc: true, wc: true
WARN:QUICStream 24:state rf: true, wc: InvalidStreamState(24)
WARN:QUICStream 24:limit left: 92
WARN:QUICStream 24:---------
WARN:QUICStream 28:closed!
WARN:QUICStream 28:state rc: true, wc: true
WARN:QUICStream 28:state rf: true, wc: false
WARN:QUICStream 28:limit left: 100
WARN:QUICStream 28:---------
WARN:QUICStream 28:closed!
WARN:QUICStream 28:state rc: true, wc: true
WARN:QUICStream 28:state rf: true, wc: InvalidStreamState(28)
WARN:QUICStream 28:limit left: 91
WARN:QUICStream 28:---------
WARN:QUICStream 32:closed!
WARN:QUICStream 32:state rc: true, wc: true
WARN:QUICStream 32:state rf: true, wc: false
WARN:QUICStream 32:limit left: 100
WARN:QUICStream 32:---------
WARN:QUICStream 32:closed!
WARN:QUICStream 32:state rc: true, wc: true
WARN:QUICStream 32:state rf: true, wc: InvalidStreamState(32)
WARN:QUICStream 32:limit left: 90
WARN:QUICStream 32:---------
WARN:QUICStream 36:closed!
WARN:QUICStream 36:state rc: true, wc: true
WARN:QUICStream 36:state rf: true, wc: false
WARN:QUICStream 36:limit left: 100
WARN:QUICStream 36:---------
WARN:QUICStream 36:closed!
WARN:QUICStream 36:state rc: true, wc: true
WARN:QUICStream 36:state rf: true, wc: InvalidStreamState(36)
WARN:QUICStream 36:limit left: 90
WARN:QUICStream 36:---------

@tegefaulkes
Copy link
Contributor Author

Well I've done a bunch of digging, checked a few things and I can't for the life of me get the remaining stream limit to go back up again.

So I went digging through the quiche code and while the logic is pretty simple for tracking the limit. It's never actually decreased when the stream is cleaned up. https://github.com/cloudflare/quiche/blob/0570ab83cc5e46dc7b877765a6c0d7c4a44dd885/quiche/src/stream/mod.rs#L123

So as far as I can tell this limit applies to the total number of streams created for a connection. Not just the active ones. This is a little puzzling since the docs explicitly say that the counter is reduced when the stream cleans up.

I'll do a little more digging and make extra sure that the streams are being cleaned up properly. But the stream limit is something I don't think I can fix.

@tegefaulkes
Copy link
Contributor Author

Yeah, I can confirm that the stream states are cleaning up properly. Despite that the peerStreamsLeftBidi() number doesn't increase once they have cleaned up. So as of now I'm treating the limit as a cumulative limit of streams for the life of the connection. Rather than a limit of active streams on a connection.

So given that, the only solution we can apply for now is.

  1. Better error message when the limit is reached - already done.
  2. Increase the limit from 100 to a larger number like 1000 or 10000. When that limit is reached we'll just have to rely on recycling the connection for a new one.

Copy link
Member

Why are we increasing the limit? Why not just prevent new stream creations?

Copy link
Member

Wait that seems like a bug. Do you want to report this upstream if it's just a counter number.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Dec 11, 2024

We are preventing new creations, that's what the error is for. We don't want to be hitting the limit so often though and 100 is a bit small for that.

@CMCDragonkai
Copy link
Member

So what is the expected behaviour? It just blocks new creations, drops them or throws an exception?

@tegefaulkes
Copy link
Contributor Author

Previously attempting to create a new stream when hitting the limit would throw a ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message. I've cleaned this up so it now throws ErrorQUICStreamLimit: Stream limit of ${limit} has been reached. This would throw through whatever was calling the stream creation function.

The other part of the problem where this was crashing the program is already solved. The error was leaking the same way the idle timeout error was.

I've created an upstream issue at cloudflare/quiche#1883 asking if the limit should apply to active streams or total streams.

As for the limit, right now I'm assuming it applying to the total streams is the current implementation. So I'm going to have to increase the stream limit in polykey to prevent this from being thrown too often, even if it's handled gracefully now. I'll also add ErrorQUICStreamLimit to the list of connection errors so the connection cleans up once the stream limit has been reached.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Dec 12, 2024

After a discussion with upstream it seems that it's very likely we're not cleaning up the streams properly. That's a bit frustrating since so far as I can tell the conditions for cleaning up the stream are being met.

Actually, I just did a few quick tests and while creating a small number of streams doesn't increase our limit when they're completed. Doing about 40 streams shows the limit increasing after they complete. So the limit is being updated but in batches of completed streams.

SO maybe there isn't actually a problem? Its just that the low limit is letting us run into it before the limit increase frame can be sent between connections?

I'll just have to up the limit to 1000 or so and see if it still happens. If it does I need to do a deeper dive into the QUIC stream state machine and see what's up and how things are triggered because the docs aren't enough to go on.

Here's a log of the remaining limit after each stream completes.

WARN:QUICStream Test:peerStreamsLeftBidi() 99, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 98, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 97, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 96, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 95, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 94, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 93, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 92, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 91, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 90, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 89, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 88, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 87, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 86, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 85, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 84, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 83, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 82, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 81, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 80, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 79, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 78, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 77, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 76, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 75, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 74, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 73, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 72, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 71, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 70, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 69, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 68, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 67, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 66, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 99, -33
WARN:QUICStream Test:peerStreamsLeftBidi() 98, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 97, 1
WARN:QUICStream Test:peerStreamsLeftBidi() 96, 1

@CMCDragonkai
Copy link
Member

Can you show what's going on here during the next cycle meeting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development
Development

No branches or pull requests

2 participants