
Streams not gracefully ending during high usage #20

Closed
tegefaulkes opened this issue May 16, 2023 · 21 comments · Fixed by #53
Assignees: tegefaulkes
Labels: development (Standard development), r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments

@tegefaulkes
Contributor

Specification

While working on #14 we discovered that, during tests with a large number of streams and data being sent concurrently, readable streams would fail to end. As far as we could tell, random streams would fail to process a message from streamRecv that had fin = true. In these cases other streams would send a fin frame, end, and clean up on both sides, all after the problematic stream should've done the same.

Since then we've found a stop-gap solution where we send a single null byte with the finish message. Given that, we're reasonably certain that the problem is related to 0-length messages not being sent or processed even though fin was set to true.

Normally a 0-length message is not sent, but in the case of a finish message it should be. This functionality was added in cloudflare/quiche@cc98af0.

Under normal conditions with low load, sending fin frames with 0-length messages works fine. The problem only happens under higher loads during testing, and even then somewhat randomly. It's been hard to narrow down the specific conditions that cause the problem.
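A minimal sketch of the mechanism under discussion, assuming an already established quiche::Connection on each side; socket setup and packet pumping are omitted and the function names are illustrative, not from the example code:

```rust
/// Sender side: finish a stream with no trailing data. `stream_send` with an
/// empty buffer and `fin = true` is what produces the 0-length finish message.
fn finish_stream(conn: &mut quiche::Connection, stream_id: u64) -> quiche::Result<()> {
    conn.stream_send(stream_id, b"", true)?;
    Ok(())
}

/// Receiver side: drain readable streams. The failure described above shows up
/// here: under load, a stream that only received the 0-length fin sometimes
/// never becomes readable, so its `fin` flag is never observed.
fn drain_readable(conn: &mut quiche::Connection) {
    let mut buf = [0u8; 65535];
    for stream_id in conn.readable() {
        while let Ok((read, fin)) = conn.stream_recv(stream_id, &mut buf) {
            println!("stream {}: read {} bytes, fin = {}", stream_id, read, fin);
            if fin {
                break;
            }
        }
    }
}
```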

To address this problem, multiple things need to be done:

  1. Create a pure Rust example that can reproduce the problem. This will be useful for the upstream issue.
  2. Investigate the quiche source code for the problem. Based on what I understand is happening, the problem is likely located on the receiving side, in marking the stream as readable after it has received the 0-length finish frame, but only under heavy message loads.
  3. Post an upstream issue with the Rust example and a possible solution. The sooner the better, so they can address the problem and fix it ASAP.

For now we're using a work-around where we add data to the fin frame message, in this case a single null byte. This should mark the stream as readable regardless of any potential issues with 0-length fin frames.
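A sketch of that work-around under the same assumptions as above (established connection, illustrative function name); the only difference from a normal finish is the single null byte attached to the fin:

```rust
/// Work-around sketch: attach one null byte to the finish message so the peer
/// always has readable data alongside the fin. The receiver discards the byte.
fn finish_stream_workaround(conn: &mut quiche::Connection, stream_id: u64) -> quiche::Result<()> {
    conn.stream_send(stream_id, &[0u8], true)?;
    Ok(())
}
```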

@tegefaulkes tegefaulkes added the development Standard development label May 16, 2023
@tegefaulkes tegefaulkes self-assigned this May 16, 2023
tegefaulkes added commits that referenced this issue May 16, 2023
@tegefaulkes
Contributor Author

Cool, progress. I have modified the server and client examples and simplified them somewhat.

The server does 3 things.

  1. Closes the sending side immediately; we only want to test a single direction.
  2. Reads data from the streams and drops it; we don't care about the contents.
  3. Keeps a set tracking active streams. When an ID is read it is added to the set; when a fin frame is received it is removed. Every time we remove a stream from the set we print the number remaining (see the sketch after this list).
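
A rough sketch of that server-side bookkeeping, assuming an established quiche::Connection per client; the accept loop and packet I/O are omitted and the function name is illustrative:

```rust
use std::collections::HashSet;

/// Read-and-drop loop with active-stream tracking, as described above.
fn handle_readable(conn: &mut quiche::Connection, active: &mut HashSet<u64>) {
    let mut buf = [0u8; 65535];
    for stream_id in conn.readable() {
        // Any stream ID we see counts as active until its fin arrives.
        active.insert(stream_id);
        while let Ok((_read, fin)) = conn.stream_recv(stream_id, &mut buf) {
            // Contents are dropped; we only care about the stream lifecycle.
            if fin {
                active.remove(&stream_id);
                println!("{} streams remaining", active.len());
                break;
            }
        }
    }
}
```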

The client does 2 things.

  1. Reads data from the streams and drops it.
  2. Creates X streams and sends Y messages per stream before ending with a fin message. X and Y are configurable to fine-tune the load (see the sketch after this list).
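
A sketch of that client send loop, again assuming an established quiche::Connection. `num_streams` (X) and `msgs_per_stream` (Y) are the tunable parameters, the 1000-byte payload is an illustrative assumption, and flow-control back-pressure handling is omitted for brevity:

```rust
/// Create X streams and send Y messages on each, ending every stream with a
/// 0-length fin. Client-initiated bidirectional streams use IDs 0, 4, 8, ...
fn send_load(
    conn: &mut quiche::Connection,
    num_streams: u64,
    msgs_per_stream: u64,
) -> quiche::Result<()> {
    let payload = [42u8; 1000];
    for i in 0..num_streams {
        let stream_id = i * 4;
        for _ in 0..msgs_per_stream {
            conn.stream_send(stream_id, &payload, false)?;
        }
        // Finish the stream once all of its messages are queued.
        conn.stream_send(stream_id, b"", true)?;
    }
    Ok(())
}
```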

What I'm trying to observe with this test is:

  1. Do all of the streams finish on the server side? This is indicated by the output showing 0 streams left after the streams have finished.
  2. Can I affect the number of streams that fail to close by adjusting the number of streams and messages?

What I'm finding is:

  1. The threshold for triggering the problem seems to be about 10,000 streams at 200 messages each. This causes 20-50% of the streams to not finish; more often than not it's just under 50%.
  2. Increasing the load further doesn't seem to affect the number of failures. Hitting a load threshold seems to be the main factor here.

I think at this stage I have enough evidence to post an upstream issue. But first I'll confirm no other errors or factors are involved.

tegefaulkes added a commit that referenced this issue May 17, 2023
@tegefaulkes
Contributor Author

I changed the client code slightly and I can't trigger the problem anymore. The original failures may have been a problem with my example code?

So far there are no failures with 15000 streams at 1000 messages each. Back to the drawing board.

@CMCDragonkai
Member

What did you change?

@tegefaulkes
Contributor Author

It was a slight change to the logic for tracking the number of messages sent and when to send the fin frame.

tegefaulkes added commits that referenced this issue May 17, 2023
@tegefaulkes
Contributor Author

Forcing congestion control by setting set_initial_max_stream_data_bidi_local and set_initial_max_stream_data_bidi_remote to a low value doesn't cause it.
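For context, those two calls are the quiche Config knobs for the per-stream flow-control windows. A sketch of the squeeze, where the 4096-byte window is an illustrative assumption rather than the value used in the test:

```rust
/// Build a config with small per-stream receive windows to force frequent
/// flow-control stalls, while leaving connection-level limits generous.
fn low_stream_window_config() -> quiche::Result<quiche::Config> {
    let mut config = quiche::Config::new(quiche::PROTOCOL_VERSION)?;
    config.set_initial_max_stream_data_bidi_local(4096);
    config.set_initial_max_stream_data_bidi_remote(4096);
    config.set_initial_max_data(10_000_000);
    config.set_initial_max_streams_bidi(100_000);
    Ok(config)
}
```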

@CMCDragonkai
Member

Can you roll back your change and replicate it?

tegefaulkes added a commit that referenced this issue May 17, 2023
@CMCDragonkai
Member

Do you have a branch for this?

@CMCDragonkai
Member

I thought you were replicating it in Rust. How did this result in a change to the client code? Are you talking about JS or Rust? And if you want to replicate the problem, why not just go back to your original client code?

@tegefaulkes
Contributor Author

Yes, I do have a branch for this: rust_example.

I am replicating it in Rust. The change was in the client code of the Rust example.

@CMCDragonkai
Member

Does this problem still occur with the JavaScript tests?

@CMCDragonkai
Member

Can you explain in detail what you changed and how this impacted the test of stream lifecycles on the JavaScript side? Be comprehensive.

@tegefaulkes
Contributor Author

I haven't touched the JavaScript since I applied the temp fix of including data with the end frame.

tegefaulkes added a commit that referenced this issue May 18, 2023
@tegefaulkes
Contributor Author

I've replicated the client example in TS. It's clientTest.ts in the project root.

@tegefaulkes
Contributor Author

I have a working(ish) server example now. Using them together causes the client to fail though. It might be related to initial negotiation.

I tried to have them mimic how the Rust examples work in structure. The main difference is that received data is handled as an event separately from the main processing loop. For the client that's a minor difference, but for the server it's a pretty major one. They're pretty close to the same logic otherwise.

There are 4 files for testing this now:

  1. the Rust examples
    a. src/bin/server_test.rs, run with cargo run --bin server_test
    b. src/bin/client_test.rs, run with cargo run --bin client_test https://127.0.0.1:4433
  2. the TS examples
    a. ./serverTest.ts, run with npm run ts-node -- ./serverTest.ts
    b. ./clientTest.ts, run with npm run ts-node -- ./clientTest.ts

I still need to fix something with the serverTest.ts example.

tegefaulkes added commits that referenced this issue May 18, 2023
@tegefaulkes
Contributor Author

Rebased on staging.

tegefaulkes added a commit that referenced this issue May 19, 2023
@CMCDragonkai
Member

Is this still a problem or fixed by #26? @tegefaulkes

@tegefaulkes
Contributor Author

It's hard to say, it only triggers under certain circumstances. I'm considering it a non-issue for now unless it comes up again.

@CMCDragonkai
Member

Are you still doing the stop gap of sending a single null byte?

@tegefaulkes tegefaulkes changed the title 0-length finish messages not ending stream Streams not gracefully ending during high usage Jul 7, 2023
@tegefaulkes
Contributor Author

No, it was removed to simplify the code a little bit. If the problem happens again it will be simple enough to re-apply it. For now we want to see if all the other changes may have fixed it.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 9, 2023
@CMCDragonkai
Member

This is probably fixed by #53. I'm adding it there. The problem was probably that we just didn't understand how streams were meant to be closed and checked, and you've added the necessary functionality to QUICStream to check such things now. But do you still have the scalability tests?

@tegefaulkes
Contributor Author

It was probably fixed before; we just couldn't confirm it since I never found the exact condition that caused it.

It happened in the concurrency tests when we stressed it a little. AFAIK things should be working, and we can resolve this if we never run into the problem after merging #53 and testing within Polykey.
