Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quicapitest is periodically failing #930

Open
Tracked by #878
nhorman opened this issue Nov 21, 2024 · 12 comments
Open
Tracked by #878

quicapitest is periodically failing #930

nhorman opened this issue Nov 21, 2024 · 12 comments

Comments

@nhorman
Copy link
Contributor

nhorman commented Nov 21, 2024

with the addition of server verification retry packets, I've noticed that the quicapi test is occasionally failing, specifically on the noisy_dgram test. It appears that, if the filter dgram packet filter bios decides to drop the retry frame or reinject a previous initial packet, the server (correctly) closes the connection, due to an unexpected ORIG_DCID, or simply a failed handshake, due to a missing token.

Not sure what the exact root cause is here, or what the right fix is (possibly the need to avoid mangling packets until the handshake is done, or simply accept stochastic failures in the test), but some investigation is needed

@t8m
Copy link
Member

t8m commented Nov 22, 2024

the server (correctly) closes the connection,

It is not the server but the client that closes the connection. Also I believe this should not happen after the dropped retry packet as that should be handled the same way as a dropped initial packet from the server.

This really needs to be investigated ASAP as the CI on feature/quic-server branch is messed up by this.

@nhorman
Copy link
Contributor Author

nhorman commented Nov 22, 2024

I'm looking at it now, but its not always the client that closes the connection. Different events occur depending on exactly what is dropped and when. I'm reproducing it and investigating here:
openssl/openssl#26027

@nhorman
Copy link
Contributor Author

nhorman commented Nov 22, 2024

So, I've managed to reproduce this locally. It takes some time, as the failures are random, but this is what my debug code (which includes instrumentation of the client and server rx path, as well as enabling tracing for the noisy_dgram_test, as well as debugging):

Note: Log is abreviated to prevent this from being painfully long, with annotations

    # Subtest: test_noisy_dgram  <=== Client starts by sending a client hello, normal
    1..2
Sent TLS Record
Header:
  Version = TLS 1.0 (0x301)
  Content Type = Handshake (22)
  Length = 260
    ClientHello, Length=256
      client_version=0x303 (TLS 1.2)
      Random:
        gmt_unix_time=0x23200EB9
        random_bytes (len=28): EB5016ADE4B017C913E1B79B9C22CC82B7D36069102C379045ECF7FE
      session_id (len=0): 
      cipher_suites (len=6)
        {0x13, 0x02} TLS_AES_256_GCM_SHA384
        {0x13, 0x03} TLS_CHACHA20_POLY1305_SHA256
        {0x13, 0x01} TLS_AES_128_GCM_SHA256
      compression_methods (len=1)
        No Compression (0x00)
      extensions, length = 209
        extension_type=UNKNOWN(57), length=49
          0000 - 0c 00 0f 00 01 04 80 00-75 30 03 02 44 b0 0e   ........u0..D..
          000f - 01 02 04 04 80 0c 00 00-05 04 80 08 00 00 06   ...............
          001e - 04 80 08 00 00 07 04 80-08 00 00 08 02 40 64   .............@d
          002d - 09 02 40 64                                    ..@d
        extension_type=ec_point_formats(11), length=4
          uncompressed (0)
          ansiX962_compressed_prime (1)
          ansiX962_compressed_char2 (2)
        extension_type=supported_groups(10), length=22
          ecdh_x25519 (29)
          secp256r1 (P-256) (23)
          ecdh_x448 (30)
          secp521r1 (P-521) (25)
          secp384r1 (P-384) (24)
          ffdhe2048 (256)
          ffdhe3072 (257)
          ffdhe4096 (258)
          ffdhe6144 (259)
          ffdhe8192 (260)
        extension_type=session_ticket(35), length=0
        extension_type=application_layer_protocol_negotiation(16), length=11
          ossltest
        extension_type=encrypt_then_mac(22), length=0
        extension_type=extended_master_secret(23), length=0
        extension_type=signature_algorithms(13), length=36
          ecdsa_secp256r1_sha256 (0x0403)
          ecdsa_secp384r1_sha384 (0x0503)
          ecdsa_secp521r1_sha512 (0x0603)
          ed25519 (0x0807)
          ed448 (0x0808)
          ecdsa_brainpoolP256r1_sha256 (0x081a)
          ecdsa_brainpoolP384r1_sha384 (0x081b)
          ecdsa_brainpoolP512r1_sha512 (0x081c)
          rsa_pss_pss_sha256 (0x0809)
          rsa_pss_pss_sha384 (0x080a)
          rsa_pss_pss_sha512 (0x080b)
          rsa_pss_rsae_sha256 (0x0804)
          rsa_pss_rsae_sha384 (0x0805)
          rsa_pss_rsae_sha512 (0x0806)
          rsa_pkcs1_sha256 (0x0401)
          rsa_pkcs1_sha384 (0x0501)
          rsa_pkcs1_sha512 (0x0601)
        extension_type=supported_versions(43), length=3
          TLS 1.3 (772)
        extension_type=psk_key_exchange_modes(45), length=2
          psk_dhe_ke (1)
        extension_type=key_share(51), length=38
            NamedGroup: ecdh_x25519 (29)
            key_exchange:  (len=32): A37F5B416A8F0FA84DA07180F31935D4EF18B0A957D34D30AA618CF999FFFA26

Sent Frame: Crypto
    Offset: 0
    Len: 260
Sent Frame: Padding
Sent Packet
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x4bd741439b2b1350
  Source Conn Id: <zero length id>
  Payload length: 1178
  Token: <zero length token>
  Packet Number: 0x00000000
Sent Datagram
  Length: 1200
Pre-filter datagram list:
Pre-filter Datagram:  <= noisy filter kicks in, does nothing this round, client hello continues as normal
<datagram dump trimmed for space>

End of pre-filter datagram list
Applying noise filters:
**Back off applied
End of noise filters
Post-filter datagram list:
Post-filter Datagram:
<datagram dump trimmed for space>

End of post-filter datagram list
Port 0x2bacfe70 receiving frame in default handler
no token, send retry and drop <====Server correctly notes that client hello is missing a token, sends retry
Pre-filter datagram list:  <=== Noise filter appears to not modify anything, ok
Pre-filter Datagram:
0000 - f0 00 00 00 01 00 08 57-9e 13 c0 a4 7e 68 a1 6f   .......W....~h.o
0010 - 70 65 6e 73 73 6c 74 6f-6b 65 6e 08 4b d7 41 43   penssltoken.K.AC
0020 - 9b 2b 13 50 9b a7 48 ef-3f 14 6f a7 7b 72 88 67   .+.P..H.?.o.{r.g
0030 - 94 8c 51 c8                                       ..Q.

End of pre-filter datagram list
Applying noise filters:
End of noise filters
Post-filter datagram list:
Post-filter Datagram:
0000 - f0 00 00 00 01 00 08 57-9e 13 c0 a4 7e 68 a1 6f   .......W....~h.o
0010 - 70 65 6e 73 73 6c 74 6f-6b 65 6e 08 4b d7 41 43   penssltoken.K.AC
0020 - 9b 2b 13 50 9b a7 48 ef-3f 14 6f a7 7b 72 88 67   .+.P..H.?.o.{r.g
0030 - 94 8c 51 c8                                       ..Q.

End of post-filter datagram list
Port 0x2ba82500 receiving frame in default handler <== Client receives the retry packet, good
Found channel on client, injecting
Received Datagram
  Length: 52
client Channel 0x2ba8b6f0 handling rx packet
Retry packet
Sent Frame: Crypto
    Offset: 0
    Len: 260
Sent Frame: Padding
Sent Packet
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x579e13c0a47e68a1
  Source Conn Id: <zero length id>
  Payload length: 1157
  Token: 6f70656e73736c746f6b656e084bd741439b2b1350 <=====good, this matches the token in the above dgram
  Packet Number: 0x00000001
Sent Datagram
  Length: 1200
Pre-filter datagram list:
Pre-filter Datagram:  <=======================Client resends the Client Hello, with the retry token inserted, good
<Datagram trimmed for size>

End of pre-filter datagram list
Applying noise filters:
**Flipping bits in a datagram at offset 12 <===================Noisy filter flips some datagram bits
0000 - ce 00 00 00 01 08 57 9e-13 c0 a4 7e 68 a1 00 15   ......W....~h...   <=====68 becomes 74 below
<remainder of dgram trimmed for size>

End of noise filters
Post-filter datagram list:
Post-filter Datagram:
0000 - ce 00 00 00 01 08 57 9e-13 c0 a4 7e 74 e1 00 15   ......W....~t...  <=====This corupts the dcid from the retry token
<remainder of dgram trimmed for size>

End of post-filter datagram list
Port 0x2bacfe70 receiving frame in default handler  <=====Server receives datagram
binding to new channel    <= binds to new channel
Injecting on new bound connection <==== Server sends the frame on to be process by the channel
Received Datagram    <===== Server prints Record layer
  Length: 1200
Received Packet
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x579e13c0a47e74e1    <==== this should have been 579e13c0a47e68a1!!!!!!!
  Source Conn Id: <zero length id>
  Payload length: 1158
  Token: 6f70656e73736c746f6b656e084bd741439b2b1350  <=====Token value looks correct, or at least consistent
  Packet Number: 0xc6479a  <====== Packet number is mangled
Sent Frame: Ping                   <=====Client sends a ping frame to sever
Sent Frame: Padding
Sent Packet                             <=====Client resends its initial frame
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x579e13c0a47e68a1             <===== Note this is the connection id sent from the retry packet (unmangled)
  Source Conn Id: <zero length id>
  Payload length: 1157
  Token: 6f70656e73736c746f6b656e084bd741439b2b1350
  Packet Number: 0x00000002
Sent Datagram
  Length: 1200
Pre-filter datagram list:
Pre-filter Datagram:
<dgram dump trimmed for size>

End of pre-filter datagram list
Applying noise filters:
**Back off applied    <======= No mangling done this round
End of noise filters
Post-filter datagram list:
Post-filter Datagram:
<dgram dump trimmed for noise>

End of post-filter datagram list
Port 0x2bacfe70 receiving frame in default handler   <=====Server receives new datagram
binding to new channel
Failed to bind to channel    <======== Fails to bind to channel, likely due to non-mangled connid not found in lcidm?
Sent Frame: Ping                  <========= Client sends another ping frame, with proper retry token and dcid
Sent Frame: Padding
Sent Packet
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x579e13c0a47e68a1
  Source Conn Id: <zero length id>
  Payload length: 1157
  Token: 6f70656e73736c746f6b656e084bd741439b2b1350
  Packet Number: 0x00000003
Sent Datagram
  Length: 1200
Pre-filter datagram list:
Pre-filter Datagram:
<datagram trimmed for size>

End of pre-filter datagram list
Applying noise filters:
**Back off applied     <============No noise filters applied
End of noise filters
Post-filter datagram list:
Post-filter Datagram:
<datagram trimmed for size>

End of post-filter datagram list
Port 0x2bacfe70 receiving frame in default handler
binding to new channel
Failed to bind to channel
Sent Frame: Ping   <====Client pattern continues trying to send ping and initial packets to the server with the (non-mangled dcid, which the server continually rejects)
Sent Frame: Padding
Sent Packet
  Packet Type: Initial
  Version: 0x00000001
  Destination Conn Id: 0x579e13c0a47e68a1
  Source Conn Id: <zero length id>
  Payload length: 1157
  Token: 6f70656e73736c746f6b656e084bd741439b2b1350
  Packet Number: 0x00000004
Sent Datagram
  Length: 1200
< Ad nausem repetition of the above, until connection attempt is given up on>

End of post-filter datagram list
Port 0x2bacfe70 receiving frame in default handler
binding to new channel
Failed to bind to channel
    # INFO:  @ test/helpers/quictestlib.c:531
    # SSL_connect() failed 0, 1
    # 4067D8FDCF7F0000:error:0A0000CF:SSL routines:quic_do_handshake:protocol is shutdown:ssl/quic/quic_impl.c:1833:
    # ERROR: (bool) 'qtest_create_quic_connection(qtserv, clientquic) == true' failed @ test/quicapitest.c:1515
    # false
    # OPENSSL_TEST_RAND_SEED=1732307879
    not ok 1 - iteration 1

@nhorman
Copy link
Contributor Author

nhorman commented Nov 22, 2024

So I feel like whats happening above is that the noisy dgram filter is corrupting part of a sent frame that, while previously recoverable, is not not, because the addition of the retry token is forcing a change in the dcid (when it triggers), causing a mismatch between what the server and the client think the "correct" dcid is. In this event, it seems like QUIC is doing what its supposed to here, the test just isn't written to expect failure, so we either need to:
a) Fix the test to allow for failure conditions
or
b) fix the noisydgram filter to not corrupt frames in this way

@nhorman
Copy link
Contributor Author

nhorman commented Nov 22, 2024

So, If I reduce the flip_offset in the get_noise function to constrain it to a very narrow window (4-8 bytes, rather that 25-50), I'm able to run this test for 10 minutes without a failure (whereas previously running it in a loop produced a failure in about 2), so it seems the problem is as described above, that the noisy dgram filter is corrupting the connection id when the client sends the second initial packet, leading to confusion between the client and server about what the connection id should be (i.e. the server thinks it has a valid connection for dcid X, where as the client is attempting to establish a connection on dcid Y), and times out after a period.

@t8m thoughts on how to fix this? It seems to me that the protocol stack is doing the right things here (the client retries until it gives up, and the server correctly discards initial packets for dcids that don't match the associated token). The test is just corrupting bits that make a connection impossible, but always expects a successful connection. Not sure if constraining the filp_bits mechanism is the right thing to do here. We could also potentially flag this corruption as triggering an expected failure and have the test check for that.

@Sashan
Copy link

Sashan commented Nov 23, 2024

Thank you for analysis. it makes sense. I think test needs to be temporarily disabled. This kind of corruption
needs to be detected by server when it validates token. The validation should detect packet got mangled/corrupted.
Then server should just discard the packet silently. the client then should either give up (time out) or resend the initial packet with token.

the test then should be enabled once when #928 (encrypt token) will be done. I would not alter the test, we need to fix QUIC server.

@nhorman
Copy link
Contributor Author

nhorman commented Nov 23, 2024

Thank you @Sashan

I may have a solution, at least a temporary one. The patches are in this branch:
https://github.com/nhorman/openssl/tree/quicapi-debug

Basically I added the new dcid that we reserve in port_send_retry to the token, and in port_validate_token, we compare the recorded dcid in the token to the one that the client is sending in the second initial packet. this gives us some resilience to corruption in the new dcid.

Note, doing this necessitated some different marshaling of the token data, since QUIC_CONN_ID's have a static array that isn't fully used, and adding a second QUIC_CONN_ID leads to garbage data when the token integrity is verified on the client.

I agree however, this isn't a permanent solution, encrypting the token will do the job better I think, as you said @Sashan. A combination of the two might be warranted, I'm not sure yet.

with the patches referenced in the branch above however, I no longer receive the shutdown error in the analysis above. I do however get another error after a while about a mismatch in the ORIG_DCID, which I expect is a similar problem in a different location. I'm looking at that this weekend.

@nhorman
Copy link
Contributor Author

nhorman commented Nov 23, 2024

yeah, so I'm getting this error (with some debugging added to ch_on_transport_params):

  # Subtest: test_noisy_dgram
    1..2
    ok 1 - iteration 1
rx dcid
58:f5:ab:d6:04:e0:8c:64:
expected cid
58:f5:ab:d6:04:e0:bc:64:
    # INFO:  @ test/helpers/quictestlib.c:531
    # SSL_connect() failed 0, 1
    # 4067520A667F0000:error:0A00017E:SSL routines:ch_on_transport_params:quic protocol error:ssl/quic/quic_channel.c:1713:QUIC error code: 0x8 (TRANSPORT_PARAMETER_ERROR), reason: "ORIG_DCID does not match expected value"
    # 4067520A667F0000:error:0A0000CF:SSL routines:quic_do_handshake:protocol is shutdown:ssl/quic/quic_impl.c:1901:
    # ERROR: (bool) 'qtest_create_quic_connection(qtserv, clientquic) == true' failed @ test/quicapitest.c:1510
    # false
    # OPENSSL_TEST_RAND_SEED=1732331417
    not ok 2 - iteration 2
# OPENSSL_TEST_RAND_SEED=1732331417
not ok 1 - test_noisy_dgram

from this:

rx dcid
58:f5:ab:d6:04:e0:8c:64:
expected cid
58:f5:ab:d6:04:e0:bc:64:

It seems pretty clear that the noisy dgram filter mangled the next to the last byte in the initial packets transmitted after the retry got mangled.

Again the protocol is doing the right thing here, because the packet is wrong, but the test doesn't expect a failure. Whats more, encrypting or hardening the token here won't help, as this is not token related. Wondering if you were right @Sashan perhaps the test should be disabled, or at least the packet mangling range should be in a much more constrained range, limited to corruption that can be recovered.

@Sashan
Copy link

Sashan commented Nov 23, 2024

I think we should either ignore failures seen in this test or just disable the test until port at server will be able to detect such corruption. The port should silently discard corrupted packets.

@nhorman
Copy link
Contributor Author

nhorman commented Nov 23, 2024

agreed, the port in the quic stack is definitely doing the "right" things in response to certain "fatal" corruptions.

I'm hesitant to just disable the test however, as shows that we can handle malformed packets properly. Ignoring failures makes sense, but it would be really nice if we could predict when failures were expected, and I'm not quite sure how to do that, given that we don't know what corruptions will result in retries that result in passed connections and which ones are fatal, unless we were to add quic parsing to the noisydgram bio, which doesn't seem feasible.

hmm

@nhorman
Copy link
Contributor Author

nhorman commented Nov 23, 2024

So, theres one more thing that we can do here:

If I modify the flip_offset in get_noise() to be constrained to the first 12 bytes of a long header, or the first 10 bytes of a short_header, we are guaranteed to only corrupt the following values:

  • header form / fixed bit / spin bit / reserved bits / key phase
  • packet number length
  • dest conn id
  • dest conn id len
  • packet number
  • long packet type
  • type specific bits
  • version
  • source conn id len
  • source conn id

Corrupting any of these fields results in silent drops or server responses that trigger a negotiation restart, meaning the corruption is non-fatal and should eventually complete. The one exception to that is the connection id, for which the additional check in token validation from the patches above would still be required. This lets us avoid fatal errors, and still allows us to test some level of frame corruption resilience. not sure thats worth the effort, but its an idea.

@Sashan @t8m , what do you think?

@nhorman
Copy link
Contributor Author

nhorman commented Nov 23, 2024

actually scratch that last bit, even just corrupting that range led to a connection failure after about 30 minutes of running the test. Think you might be right @Sashan disabling the test might be the thing to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: New
Status: Refining
Development

No branches or pull requests

3 participants