
Streams not gracefully ending during high usage #20

Closed
tegefaulkes opened this issue May 16, 2023 · 21 comments · Fixed by #53
Assignees: tegefaulkes
Labels: development (Standard development), r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments

@tegefaulkes
Contributor

Specification

While working on #14 we discovered that, during tests with a large number of streams and data being sent concurrently, readable streams would fail to end. As far as we could tell, random streams would fail to process a message from streamRecv that had fin = true. In these cases other streams would send a fin frame, end, and clean up on both sides, all after the problematic stream should've done the same.

Since then we've found a stop-gap solution where we send a single null byte with the finish message. Given that, we're reasonably certain that the problem is related to 0-length messages not being sent or processed even though fin was set to true.

Normally a 0-length message is not sent, but in the case of a finish message it should be. This functionality was added in cloudflare/quiche@cc98af0.

Under normal conditions with low load, sending fin frames with 0-length messages works fine. The problem only happens under higher loads during testing, and even then somewhat randomly. It's been hard to narrow down the specific conditions that cause the problem.
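A minimal sketch of the mechanism under discussion, assuming an already established quiche::Connection on each side; socket setup and packet pumping are omitted and the function names are illustrative, not from the example code:

```rust
/// Sender side: finish a stream with no trailing data. `stream_send` with an
/// empty buffer and `fin = true` is what produces the 0-length finish message.
fn finish_stream(conn: &mut quiche::Connection, stream_id: u64) -> quiche::Result<()> {
    conn.stream_send(stream_id, b"", true)?;
    Ok(())
}

/// Receiver side: drain readable streams. The failure described above shows up
/// here: under load, a stream that only received the 0-length fin sometimes
/// never becomes readable, so its `fin` flag is never observed.
fn drain_readable(conn: &mut quiche::Connection) {
    let mut buf = [0u8; 65535];
    for stream_id in conn.readable() {
        while let Ok((read, fin)) = conn.stream_recv(stream_id, &mut buf) {
            println!("stream {}: read {} bytes, fin = {}", stream_id, read, fin);
            if fin {
                break;
            }
        }
    }
}
```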

To address this problem, multiple things need to be done:

  1. Create a pure Rust example that can reproduce the problem. This will be useful for the upstream issue.
  2. Investigate the quiche source code for the problem. Based on what I understand is happening, the problem is likely located on the receiving side, in marking the stream as readable after it has received the 0-length finish frame, but only under heavy message loads.
  3. Post an upstream issue with the Rust example and a possible solution. The sooner the better, so they can address the problem and fix it ASAP.

For now we're using a work-around where we add data to the fin frame message, in this case a single null byte. This should mark the stream as readable regardless of any potential issues with 0-length fin frames.
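A sketch of that work-around under the same assumptions as above (established connection, illustrative function name); the only difference from a normal finish is the single null byte attached to the fin:

```rust
/// Work-around sketch: attach one null byte to the finish message so the peer
/// always has readable data alongside the fin. The receiver discards the byte.
fn finish_stream_workaround(conn: &mut quiche::Connection, stream_id: u64) -> quiche::Result<()> {
    conn.stream_send(stream_id, &[0u8], true)?;
    Ok(())
}
```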

@tegefaulkes tegefaulkes added the development Standard development label May 16, 2023
@tegefaulkes tegefaulkes self-assigned this May 16, 2023
tegefaulkes added commits that referenced this issue May 16, 2023
@tegefaulkes
Contributor Author

Cool, progress. I have modified the server and client examples and simplified them somewhat.

The server does 3 things.

  1. Closes the sending side immediately; we only want to test a single direction.
  2. Reads data from the streams and drops it; we don't care about the contents.
  3. Keeps a set tracking active streams. When an ID is read it is added to the set; when a fin frame is received it is removed. Every time we remove a stream from the set we print the number remaining (see the sketch after this list).
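
A rough sketch of that server-side bookkeeping, assuming an established quiche::Connection per client; the accept loop and packet I/O are omitted and the function name is illustrative:

```rust
use std::collections::HashSet;

/// Read-and-drop loop with active-stream tracking, as described above.
fn handle_readable(conn: &mut quiche::Connection, active: &mut HashSet<u64>) {
    let mut buf = [0u8; 65535];
    for stream_id in conn.readable() {
        // Any stream ID we see counts as active until its fin arrives.
        active.insert(stream_id);
        while let Ok((_read, fin)) = conn.stream_recv(stream_id, &mut buf) {
            // Contents are dropped; we only care about the stream lifecycle.
            if fin {
                active.remove(&stream_id);
                println!("{} streams remaining", active.len());
                break;
            }
        }
    }
}
```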

The client does 2 things.

  1. Reads data from the streams and drops it.
  2. Creates X streams and sends Y messages per stream before ending with a fin message. X and Y are configurable to fine-tune the load (see the sketch after this list).
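
A sketch of that client send loop, again assuming an established quiche::Connection. `num_streams` (X) and `msgs_per_stream` (Y) are the tunable parameters, the 1000-byte payload is an illustrative assumption, and flow-control back-pressure handling is omitted for brevity:

```rust
/// Create X streams and send Y messages on each, ending every stream with a
/// 0-length fin. Client-initiated bidirectional streams use IDs 0, 4, 8, ...
fn send_load(
    conn: &mut quiche::Connection,
    num_streams: u64,
    msgs_per_stream: u64,
) -> quiche::Result<()> {
    let payload = [42u8; 1000];
    for i in 0..num_streams {
        let stream_id = i * 4;
        for _ in 0..msgs_per_stream {
            conn.stream_send(stream_id, &payload, false)?;
        }
        // Finish the stream once all of its messages are queued.
        conn.stream_send(stream_id, b"", true)?;
    }
    Ok(())
}
```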

What I'm trying to observe with this test is:

  1. Do all of the streams finish on the server side? This is indicated by the output showing 0 streams left after the streams have finished.
  2. Can I affect the number of streams that fail to close by adjusting the number of streams and messages?

What I'm finding is:

  1. The threshold for triggering the problem seems to be about 10,000 streams at 200 messages each. This causes 20-50% of the streams to not finish; more often than not it's just under 50%.
  2. Increasing the load further doesn't seem to affect the number of failures. Hitting a load threshold seems to be the main factor here.

I think at this stage I have enough evidence to post an upstream issue. But first I'll confirm no other errors or factors are involved.

tegefaulkes added a commit that referenced this issue May 17, 2023
@tegefaulkes
Contributor Author

I changed the client code slightly and I can't trigger the problem anymore. The original failures may have been a problem with my example code?

So far there are no failures with 15000 streams at 1000 messages each. Back to the drawing board.

@CMCDragonkai
Member

What did you change?

@tegefaulkes
Contributor Author

It was a slight change to the logic for tracking the number of messages sent and when to send the fin frame.

tegefaulkes added commits that referenced this issue May 17, 2023
@tegefaulkes
Contributor Author

Forcing congestion control by setting set_initial_max_stream_data_bidi_local and set_initial_max_stream_data_bidi_remote to a low value doesn't cause it.
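For context, those two calls are the quiche Config knobs for the per-stream flow-control windows. A sketch of the squeeze, where the 4096-byte window is an illustrative assumption rather than the value used in the test:

```rust
/// Build a config with small per-stream receive windows to force frequent
/// flow-control stalls, while leaving connection-level limits generous.
fn low_stream_window_config() -> quiche::Result<quiche::Config> {
    let mut config = quiche::Config::new(quiche::PROTOCOL_VERSION)?;
    config.set_initial_max_stream_data_bidi_local(4096);
    config.set_initial_max_stream_data_bidi_remote(4096);
    config.set_initial_max_data(10_000_000);
    config.set_initial_max_streams_bidi(100_000);
    Ok(config)
}
```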

@CMCDragonkai
Member

Can you roll back your change and replicate it?

tegefaulkes added a commit that referenced this issue May 17, 2023
@CMCDragonkai
Member

Do you have a branch for this?

@CMCDragonkai
Member

I thought you were replicating it in Rust. How did this result in a change to the client code? Are you talking about JS or Rust? And if you want to replicate the problem, why not just go back to your original client code?

@tegefaulkes
Contributor Author

Yes, I do have a branch for this: rust_example.

I am replicating it in Rust. The change was in the client code of the Rust example.

@CMCDragonkai
Member

Does this problem still occur with the JavaScript tests?

@CMCDragonkai
Member

Can you explain in detail what you changed and how this impacted the test of stream lifecycles on the JavaScript side? Be comprehensive.

@tegefaulkes
Contributor Author

I haven't touched the JavaScript since I applied the temp fix of including data with the end frame.

tegefaulkes added a commit that referenced this issue May 18, 2023
@tegefaulkes
Contributor Author

I've replicated the client example in TS. It's clientTest.ts in the project root.

@tegefaulkes
Contributor Author

I have a working(ish) server example now. Using them together causes the client to fail though. It might be related to initial negotiation.

I tried to have them mimic how the Rust examples work in structure. The main difference is that received data is handled as an event separately from the main processing loop. For the client that's a minor difference, but for the server it's a pretty major one. They're pretty close to the same logic otherwise.

There are 4 files for testing this now:

  1. the Rust examples
    a. src/bin/server_test.rs, run with cargo run --bin server_test
    b. src/bin/client_test.rs, run with cargo run --bin client_test https://127.0.0.1:4433
  2. the TS examples
    a. ./serverTest.ts, run with npm run ts-node -- ./serverTest.ts
    b. ./clientTest.ts, run with npm run ts-node -- ./clientTest.ts

I still need to fix something with the serverTest.ts example.

tegefaulkes added commits that referenced this issue May 18, 2023
@tegefaulkes
Contributor Author

Rebased on staging.

tegefaulkes added a commit that referenced this issue May 19, 2023
@CMCDragonkai
Member

Is this still a problem or fixed by #26? @tegefaulkes

@tegefaulkes
Contributor Author

It's hard to say, it only triggers under certain circumstances. I'm considering it a non-issue for now unless it comes up again.

@CMCDragonkai
Member

Are you still doing the stop gap of sending a single null byte?

@tegefaulkes tegefaulkes changed the title 0-length finish messages not ending stream Streams not gracefully ending during high usage Jul 7, 2023
@tegefaulkes
Contributor Author

No, it was removed to simplify the code a little bit. If the problem happens again it will be simple enough to re-apply it. For now we want to see if all the other changes may have fixed it.

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 9, 2023
@CMCDragonkai
Member

This is probably fixed by #53. I'm adding it there. The problem was probably that we just didn't understand how streams were meant to be closed and checked, and you've added the necessary functionality to QUICStream to check such things now. But do you still have the scalability tests?

@tegefaulkes
Contributor Author

It was probably fixed before; we just couldn't confirm it since I never found the exact condition that caused it.

It happened in the concurrency tests when we stressed it a little. AFAIK things should be working, and we can resolve this if we never run into the problem after merging #53 and testing within Polykey.
