quicapitest is periodically failing #930
It is not the server but the client that closes the connection. Also, I believe this should not happen after the dropped retry packet, as that should be handled the same way as a dropped initial packet from the server. This really needs to be investigated ASAP, as the CI on the feature/quic-server branch is broken by this.
I'm looking at it now, but it's not always the client that closes the connection. Different events occur depending on exactly what is dropped and when. I'm reproducing it and investigating here:
So, I've managed to reproduce this locally. It takes some time, as the failures are random, but this is what my debug code shows (it instruments the client and server rx paths and enables tracing for the noisy_dgram_test). Note: the log is abbreviated, with annotations, to keep this from being painfully long:
So what I think is happening above is that the noisy dgram filter is corrupting part of a sent frame that, while previously recoverable, now is not, because the addition of the retry token forces a change in the dcid (when it triggers), causing a mismatch between what the server and the client think the "correct" dcid is. In this event, it seems like QUIC is doing what it's supposed to here; the test just isn't written to expect failure, so we either need to:
So, if I reduce the flip_offset in the get_noise function to constrain it to a very narrow window (4-8 bytes, rather than 25-50), I'm able to run this test for 10 minutes without a failure (whereas previously, running it in a loop produced a failure in about 2). So it seems the problem is as described above: the noisy dgram filter is corrupting the connection id when the client sends the second initial packet, leading to confusion between the client and server about what the connection id should be (i.e. the server thinks it has a valid connection for dcid X, whereas the client is attempting to establish a connection on dcid Y), and the connection times out after a period. A sketch of the constrained window is below. @t8m, thoughts on how to fix this? It seems to me that the protocol stack is doing the right things here (the client retries until it gives up, and the server correctly discards initial packets for dcids that don't match the associated token). The test is just corrupting bits that make a connection impossible, but it always expects a successful connection. I'm not sure if constraining the flip_bits mechanism is the right thing to do here. We could also potentially flag this corruption as triggering an expected failure and have the test check for that.
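For illustration, a minimal sketch of the kind of constraint being described, in C. The real get_noise() in test/helpers/noisydgram.c is shaped differently; pick_flip_offset and the exact window bounds here are hypothetical:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper showing the idea: draw the bit-flip offset from
 * a narrow early window instead of the original wider one, so flipped
 * bits cannot land on connection id material deeper in the packet.
 */
static size_t pick_flip_offset(size_t lo, size_t hi)
{
    /* uniform offset in [lo, hi) */
    return lo + (size_t)(rand() % (int)(hi - lo));
}

static void constrain_flip_window(void)
{
    /* was roughly a 25-50 byte window; constrained to 4-8 as above */
    size_t flip_offset = pick_flip_offset(4, 8);

    (void)flip_offset; /* ... flip a bit at this offset ... */
}
```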
Thank you for the analysis, it makes sense. I think the test needs to be temporarily disabled. This kind of corruption will be detectable once #928 (encrypt token) is done, and the test should then be re-enabled. I would not alter the test; we need to fix the QUIC server.
Thank you @Sashan. I may have a solution, at least a temporary one. The patches are in this branch: Basically, I added the new dcid that we reserve in port_send_retry to the token, and in port_validate_token we compare the dcid recorded in the token to the one the client sends in the second initial packet. This gives us some resilience to corruption in the new dcid. Note that doing this necessitated some different marshaling of the token data, since QUIC_CONN_IDs have a static array that isn't fully used, and adding a second QUIC_CONN_ID leads to garbage data when the token integrity is verified on the client (see the sketch below). I agree, however, that this isn't a permanent solution; encrypting the token will do the job better, I think, as you said @Sashan. A combination of the two might be warranted; I'm not sure yet. With the patches referenced in the branch above, however, I no longer receive the shutdown error in the analysis above. I do get another error after a while about a mismatch in the ORIG_DCID, which I expect is a similar problem in a different location. I'm looking at that this weekend.
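A minimal sketch of the marshaling and comparison being described, assuming the QUIC_CONN_ID shape from OpenSSL's internal headers (an id_len byte plus a fixed-size id array); the helper names are mine, not the actual port_send_retry/port_validate_token internals:

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout, mirroring OpenSSL's internal QUIC_CONN_ID so the
 * sketch is self-contained. */
#define MAX_CONN_ID_LEN 20
typedef struct {
    unsigned char id_len;
    unsigned char id[MAX_CONN_ID_LEN];
} CONN_ID;

/*
 * Marshal only the used prefix of the connection id into the token
 * body, not the whole static array, so the unused tail bytes never
 * feed into the token integrity check.
 */
static size_t marshal_conn_id(unsigned char *out, const CONN_ID *cid)
{
    out[0] = cid->id_len;
    memcpy(out + 1, cid->id, cid->id_len);
    return 1 + (size_t)cid->id_len;
}

/*
 * On validation, compare the dcid recorded in the token against the
 * dcid in the client's second initial packet; a mismatch means the
 * dcid was corrupted in flight (or the token is stale), so the packet
 * can be discarded instead of creating a confused connection.
 */
static int conn_id_matches(const CONN_ID *from_token, const CONN_ID *from_pkt)
{
    return from_token->id_len == from_pkt->id_len
           && memcmp(from_token->id, from_pkt->id, from_token->id_len) == 0;
}
```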
yeah, so I'm getting this error (with some debugging added to ch_on_transport_params):
from this:
It seems pretty clear that the noisy dgram filter mangled the next-to-last byte in the initial packets transmitted after the retry got mangled. Again, the protocol is doing the right thing here, because the packet is wrong, but the test doesn't expect a failure. What's more, encrypting or hardening the token won't help, as this is not token related (see the sketch below). I wonder if you were right, @Sashan: perhaps the test should be disabled, or at least the packet mangling should be constrained to a much narrower range, limited to corruption that can be recovered from.
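For context, the failing check is, conceptually, the RFC 9000 section 7.3 requirement that the server's original_destination_connection_id transport parameter match the dcid from the client's first initial packet. A rough sketch, reusing the CONN_ID shape from the earlier sketch; names are illustrative, not OpenSSL's actual ch_on_transport_params code:

```c
/*
 * RFC 9000 s. 7.3: the client must verify that the server's
 * original_destination_connection_id transport parameter equals the
 * dcid the client sent in its very first initial packet. One flipped
 * bit in the transmitted parameter makes this comparison fail, and
 * the connection is (correctly) torn down.
 */
static int orig_dcid_ok(const CONN_ID *received_orig_dcid,
                        const CONN_ID *first_initial_dcid)
{
    return received_orig_dcid->id_len == first_initial_dcid->id_len
           && memcmp(received_orig_dcid->id, first_initial_dcid->id,
                     first_initial_dcid->id_len) == 0;
}
```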
I think we should either ignore failures seen in this test or just disable the test until the port at the server is able to detect such corruption. The port should silently discard corrupted packets.
Agreed, the port in the quic stack is definitely doing the "right" things in response to certain "fatal" corruptions. I'm hesitant to just disable the test, however, as it shows that we can handle malformed packets properly. Ignoring failures makes sense, but it would be really nice if we could predict when failures were expected, and I'm not quite sure how to do that, given that we don't know which corruptions will result in retries that lead to completed connections and which ones are fatal, unless we were to add quic parsing to the noisydgram bio, which doesn't seem feasible. Hmm.
So, there's one more thing we can do here: if I modify the flip_offset in get_noise() to be constrained to the first 12 bytes of a long header, or the first 10 bytes of a short header, we are guaranteed to only corrupt the following values (per the RFC 8999/9000 header layouts): the header form/flag bits, the version, the dcid length, and the connection id itself.
Corrupting any of these fields results in silent drops or server responses that trigger a negotiation restart, meaning the corruption is non-fatal and the handshake should eventually complete. The one exception is the connection id, for which the additional check in token validation from the patches above would still be required. This lets us avoid fatal errors while still testing some level of frame corruption resilience (a sketch follows below). I'm not sure it's worth the effort, but it's an idea.
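A sketch of that constraint, using the header form bit from the QUIC invariants (RFC 8999) to pick the window; the 12/10 byte limits are the ones proposed above, not values taken from the test helper:

```c
#include <stddef.h>

/*
 * Bit 0x80 of the first byte distinguishes long headers (flags,
 * 4-byte version, dcid length, dcid, ...) from short headers (flags,
 * dcid, ...). Limiting flips to these leading bytes confines the
 * corruption to the fields listed above.
 */
static size_t flip_window_limit(const unsigned char *pkt, size_t pkt_len)
{
    if (pkt_len == 0)
        return 0;
    return (pkt[0] & 0x80) ? 12 /* long header */ : 10 /* short header */;
}
```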
Actually, scratch that last bit: even just corrupting that range led to a connection failure after about 30 minutes of running the test. I think you might be right, @Sashan, disabling the test might be the thing to do.
With the addition of server verification retry packets, I've noticed that the quicapi test is occasionally failing, specifically on the noisy_dgram test. It appears that, if the noisy dgram packet filter BIO decides to drop the retry frame or reinject a previous initial packet, the server (correctly) closes the connection, due to an unexpected ORIG_DCID, or the handshake simply fails due to a missing token.
I'm not sure what the exact root cause is here, or what the right fix is (possibly we need to avoid mangling packets until the handshake is done, or simply accept stochastic failures in the test), but some investigation is needed.