Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to reject a connection attempt #3690

Closed
marten-seemann opened this issue May 23, 2020 · 26 comments · Fixed by #3694
Closed

How to reject a connection attempt #3690

marten-seemann opened this issue May 23, 2020 · 26 comments · Fixed by #3694

Comments

@marten-seemann
Copy link
Contributor

Assume that a server has a blacklist of IP addresses, that it doesn't accept connections from.

One way of implementing this would be to drop all packets from those IP addresses. But this would lead to clients time out, which is sad. Let's say the server wants to be more helpful and allow the client to quickly recover from this situation.
Sending a CONNECTION_CLOSE would seem like the right thing to do in that case. However, we don't have an error code that fits here. SERVER_BUSY is used to reject connection attempts, but it's a bad fit here. Retrying later is not likely to succeed in this case. APPLICATION_ERROR also seems too broad, as it's used for any application-level error (in Initial packets).

Do we need a new error code to cover this use case? Something like CONNECTION_REFUSED?

@ianswett
Copy link
Contributor

If my reading of our code is correct, we're using PROTOCOL_VIOLATION for that case, but that's not a very informative choice. In gQUIC, we used public reset for this(as you probably remember), but stateless reset is post-handshake only.

CONNECTION_REFUSED sounds better, but I think an important question is wether we expect clients to treat this error code differently than other error codes which occur during the handshake. If we're not going to treat it differently, then it matters less.

That being said, I'm a general fan of more granular error codes for debugging.

@marten-seemann
Copy link
Contributor Author

I think an important question is wether we expect clients to treat this error code differently than other error codes which occur during the handshake. If we're not going to treat it differently, then it matters less.

I guess a reasonable policy regarding errors received during the handshake would be:

  • INVALID_TOKEN: try again immediately
  • SERVER_BUSY: try again after a few minutes
  • CONNECTION_REFUSED: don't try again for a day / week or so
  • other errors: I can see TRANSPORT_PARAMETER_ERROR and FRAME_ENCODING_ERROR (and their generic subsitute, PROTOCOL_VIOLATION) being reasonably sent in response to a ClientHello. So that means that either your or your peer's implementation is broken, so maybe you shouldn't speak QUIC at all?

@kazuho
Copy link
Member

kazuho commented May 24, 2020

I tend to think that adding an error code that affects client behavior is not a good idea, when that error code is intended to indicate a handshake failure. The signal carried by an Initial packet or a Handshake packet is not authenticated, and we should be wary of using that to convey something that affects more than just that connection.

  • INVALID_TOKEN: try again immediately
  • SERVER_BUSY: try again after a few minutes
  • CONNECTION_REFUSED: don't try again for a day / week or so

In case of this example, CONNECTION_REFUSED would become an attractive tool for MITM attackers, because it can be used for preventing the client for reconnecting for as long as a day or a week.

If we need these signal, I think that they have to be carried at the application-level after the QUIC handshake concludes. In case of HTTP, we already have 503 + Retry-After header field.

To summarize, I think my position is:

  • We should not add error codes that affect the client behavior after that connection gets closed.
  • We might want to get rid of SERVER_BUSY, or at least state that a client cannot really tell if the server was really busy, as the signal is not authenticated.

@marten-seemann
Copy link
Contributor Author

In case of this example, CONNECTION_REFUSED would become an attractive tool for MITM attackers, because it can be used for preventing the client for reconnecting for as long as a day or a week.

We accept this risk in QUIC v1. Removing error codes to mitigate the MITM vulnerability seems like a piecemeal "solution" to a problem that can only be solved by decent cryptography.

For the MITM the easiest way is probably injection a Version Negotiation packet that indicates that there are no compatible versions between the client and the server. I'd expect most client implementations to give up a dial attempt in this case, which in effect would be very similar to the actions taken in response to a CONNECTION_REFUSED.

@kazuho
Copy link
Member

kazuho commented May 24, 2020

In case of this example, CONNECTION_REFUSED would become an attractive tool for MITM attackers, because it can be used for preventing the client for reconnecting for as long as a day or a week.

We accept this risk in QUIC v1. Removing error codes to mitigate the MITM vulnerability seems like a piecemeal "solution" to a problem that can only be solved by decent cryptography.

I disagree with that view. Handshake is never going to be disruption resistant, unless there is a shared secret. There are ways of raising the probability of endpoints having shared secret (that we discussed and postponed to v2), but there would always be a case where there is no shared secret.

We always need to limit the impact of attacker injecting signals to an unauthenticated channel.

I'd also reiterate that in HTTP we already have a way of asking clients to revisit the server after certain amount of time. TLS does not have a way of signaling that.

Assuming that HTTP developers are fine with what we have in HTTP + TLS, I do not think there is good justification to add a new feature that has negative impact on security.

For the MITM the easiest way is probably injection a Version Negotiation packet that indicates that there are no compatible versions between the client and the server. I'd expect most client implementations to give up a dial attempt in this case, which in effect would be very similar to the actions taken in response to a CONNECTION_REFUSED.

Giving up one attempt is fine. Having a signal that asks to give up future connection attempts is the problem.

@marten-seemann
Copy link
Contributor Author

I'd also reiterate that in HTTP we already have a way of asking clients to revisit the server after certain amount of time. TLS does not have a way of signaling that.

HTTP has 503 to reject request when a server is busy, and still we have SERVER_BUSY at the QUIC layer. There's a value in rejecting connection attempts before spending resources (and round trips) on a handshake.

Furthermore, there are other application protocols than HTTP. In my use case, nodes in a p2p network want to enforce a local blacklist of IP addresses. In TCP, they can easily do that by resetting the connection, and before spending any resources on the TLS handshake. Having to go through the whole QUIC handshake with a blacklisted peer before being able to reject them would be a serious regression compared to TCP.

Giving up one attempt is fine. Having a signal that asks to give up future connection attempts is the problem.

I'm not sure how browsers handle a Version Negotiation packet that indicates that there are no matching versions. I doubt that they'd try QUIC again for the next request just a few seconds later. Intuitively, I'd say that a longer period combined with an exponential backoff would be a reasonable thing to implement. Maybe @ianswett and @agrover can shed some light on this?

@kazuho
Copy link
Member

kazuho commented May 24, 2020

HTTP has 503 to reject request when a server is busy, and still we have SERVER_BUSY at the QUIC layer. There's a value in rejecting connection attempts before spending resources (and round trips) on a handshake.

Yes, TCP or TLS can reject one attempt. QUIC is currently aligned to that, and that's fine.

Giving up one attempt is fine. Having a signal that asks to give up future connection attempts is the problem.

I'm not sure how browsers handle a Version Negotiation packet that indicates that there are no matching versions. I doubt that they'd try QUIC again for the next request just a few seconds later. Intuitively, I'd say that a longer period combined with an exponential backoff would be a reasonable thing to implement. Maybe @ianswett and @agrover can shed some light on this?

I'd point out that how aggressively browsers try connecting in QUIC depends on if they do happy-eyeballing. When browsers do happy-eyeballing, no matter how less you do in QUIC, the probability of establishing a connection always stays higher (or as high as) just doing TLS.

The question is when browsers start doing just QUIC. Would they prefer to cache previous connection failures, as a way to not even try to reconnect when a user requests doing so?

@kazuho
Copy link
Member

kazuho commented May 25, 2020

As a way of moving forward, I'd propose renaming SERVER_BUSY to CONNECTION_REFUSED, with no recommendation on how the client should react to that.

The rationale behind providing no recommendation is as stated above. QUIC needs to be strong against MITM attacks, at least as TLS over TCP is. TLS over TCP does not have an error code that tells the client to refrain from retrying for some amount of time. To achieve parity, we cannot have that. AFAIK, our experience with HTTPS tells us that there is no need for such an error code.

The rationale behind the renaming is that at the moment we do not have a generic error code for refusing connections, even though we have SERVER_BUSY, which indicates a specific reason of refusing a connection. Therefore, we need CONNECTION_REFUSED. We can rename SERVER_BUSY to CONNECTION_REFUSED. SERVER_BUSY does not need to be preserved, since, as stated, the client behavior cannot vary depending on an unauthenticated error code.

For debugging purposes, endpoints are free to use the reason_phrase field.

WDYT?

@marten-seemann
Copy link
Contributor Author

I don't think that renaming an existing error will solve my problem. Otherwise, I would've just used that error...

The point I was trying to make here is that clients will already make decisions about future connection attempts based on unauthenticated information, both sent in CONNECTION_CLOSE error codes as well as in Version Negotiation packets.

Adding one more error doesn't seem to exacerbate that situation, while at the same time enabling a way of blacklisting IP addresses that was possible with TCP and is currently not possible using QUIC.

@kazuho
Copy link
Member

kazuho commented May 25, 2020

I don't think that renaming an existing error will solve my problem.

As stated, my view is that your problem (of sending a signal that affects future connections) should only be solved by sending something after the handshake succeeds. And for HTTP, we already have that signal (i.e. 503 + Retry-After). Therefore, defining such signal does not have to be done in QUICv1.

The point I was trying to make here is that clients will already make decisions about future connection attempts based on unauthenticated information, both sent in CONNECTION_CLOSE error codes as well as in Version Negotiation packets.

I disagree with that view. If the specification can be read as such, then I think we need to make changes. A client should refrain from using an unauthenticated signal for determining what to do with future connection attempts. It does not matter what that unauthenticated signal is (e.g., an error code in Initial packet, failed version negotiation).

Any unsuccessful handshake attempt should be considered as an error of that particular attempt. Otherwise, QUIC handshake becomes prone to MITM attacks than TLS over TCP.

Adding one more error doesn't seem to exacerbate that situation, while at the same time enabling a way of blacklisting IP addresses that was possible with TCP and is currently not possible using QUIC.

If we are to agree that clients should not use an unauthenticated error code in determining what to do in the future, there is no point in having multiple error codes indicating that the server has deliberately refused a handshake. It is sufficient to have just one error code that let's the server say, hey, nothing is wrong with us, but I refuse to accept a connection.

@ianswett
Copy link
Contributor

@kazuho I believe it's up to the client as to how unauthenticated signals are used.

There are a lot of ways to interrupt a QUIC handshake early on, and this is one of many.

Chrome does exponential backoff with a 5 minute time period anytime the handshake fails, whether it's due to VN, connection close, handshake timeout, etc. In practice, VN and Alt-Svc typically agree, so VN is very rare and connection closes are more common.

I can't argue Chrome's algorithm is optimal, but if handshakes failed in the recent past, it seems sub-optimal to try QUIC again immediately.

@kazuho
Copy link
Member

kazuho commented May 25, 2020

@ianswett I think that is a fine thing to do as long as Chrome is happy-eyeballing with TLS-over-TCP.

But as a specification, it is my understanding that happy-eyeballing is not a prerequisite. I'd assume that we'd want applications that use QUIC only to be as ignorant against unauthenticated signals as those using TLS-over-TCP is.

@ianswett
Copy link
Contributor

If only QUIC is used, then I agree the behavior should match that of TLS over TCP.

@martinthomson
Copy link
Member

I like #3694, but it is a semantic change. It makes the existing code more general, which is appropriate. We would need to make this issue a design issue to do that though.

@mjoras
Copy link

mjoras commented May 26, 2020

I disagree with that view. If the specification can be read as such, then I think we need to make changes. A client should refrain from using an unauthenticated signal for determining what to do with future connection attempts. It does not matter what that unauthenticated signal is (e.g., an error code in Initial packet, failed version negotiation).

This is practically impossible. To function reliably on the internet a client has to use heuristics and unauthenticated information in order to provide a good user experience and not overwhelm server resources.

As a trivial example, if I wrote a client which completely ignores things like VN and immediately retries a QUIC connection, I am locked into supporting that version on my server indefinitely, since my client will now effectively DOS me if I remove support for this version from the server. This is unacceptable for internet scale. Additionally, rejecting a connection post-handshake makes fallback to TCP more complicated -- it is much simpler to use a happy eyeballs mechanism and "race" TCP and QUIC, stopping the TCP connection once the handshake is successful.

I am highly sympathetic to @marten-seemann's point here. As has been said already, we already have 100% reliable and trivially implementable way to interrupt the handshake in v1 with version negotiation. Having an additional signal which is more explicit can only help clients making new connection decisions. We don't necessarily have to prescribe what the client does, but we can say why a server may issue certain errors during the handshake.

@kazuho
Copy link
Member

kazuho commented May 26, 2020

@mjoras I'd emphasize that the specification does not talk about happy eyeballing. We should concentrate on discussing what to do when only QUIC is being used.

Having that said, I'd hope that we can agree on one principle: QUIC cannot respect unauthenticated information any more than TLS over TCP does. Otherwise, QUIC becomes more prone to MITM attacks that TLS over TCP is - a disobedience to our charter.

To function reliably on the internet a client has to use heuristics and unauthenticated information in order to provide a good user experience and not overwhelm server resources.

That might be true, but our experience with TLS over TCP is that we do not need an error code for the purpose.

As has been said already, we already have 100% reliable and trivially implementable way to interrupt the handshake in v1 with version negotiation. Having an additional signal which is more explicit can only help clients making new connection decisions.

FWIW, you do not need a version negotiation packet. Per my understanding, any server can send a private error code. Or a reason phrase explaining things.

We don't necessarily have to prescribe what the client does, but we can say why a server may issue certain errors during the handshake.

Am I correct to understand that the reason you are advocating for having more error codes is to allow a server to control the behavior of client for future connection attempts? Assuming that is the case, I'd oppose to having such error codes regardless of how we would describe them. Because IMO that is effectively about making QUIC more prone to attacks than TLS over TCP is.

@martinthomson martinthomson added the editorial An issue that does not affect the design of the protocol; does not require consensus. label May 26, 2020
@martinthomson
Copy link
Member

I think that @kazuho is entirely right here. The renaming of the error code is useful as it removes the implication of a semantic that would be destructive. That implication is that the code is like a 503, which comes with the further implication that holding back for some time is desirable (due to the common use of Retry-After with 503). QUIC has no Retry-After, so any implied time delay is bogus.

As mentioned, that browsers might sit and wait is a policy decision that makes sense in the current state of the world.

I'm going to merge the PR. But I get the sense that this is not directly addressing the issue. For the issue itself, I think that the only real resolution is: that's tough, but too bad. As numerous people have said, this phase of connection establishment is ripe for DoS and we have chosen to accept that risk in QUIC version 1.

@martinthomson
Copy link
Member

Based on a request on #3964. I have also opened #3704.

@martinthomson martinthomson removed the editorial An issue that does not affect the design of the protocol; does not require consensus. label May 27, 2020
@kazuho
Copy link
Member

kazuho commented May 27, 2020

I can see why this issue being classified as a design change, though at the same time, I think that #3694 was a good editorial change.

#3694 made things better。We now have an error code for server indicating that it rejected the connection even though nothing seemed to be wrong with the handshake. Without having CONNECTION_REFUSED_ERROR, a client had to guess if PROTOCOL_VIOLATION meant a problem in the handshake, or if it was simply a refusal.

From now on, on this issue, I hope that we can concentrate on discussing if we want to have more than one error code indicating the specifics of connection refusals (e.g., SERVER_BUSY). Unsurprisingly, my position is that we cannot have specific codes due to the reasons laid out above.

@martinthomson
Copy link
Member

I agree with @kazuho in that #3694 was good (and I don't want to revert it). I also don't think we need anything more than an appropriately generic error code for the stated problem, so I believe that that PR was adequate.

@janaiyengar
Copy link
Contributor

I agree with @kazuho. I think we should close #3704, and I would argue that we should do a new issue if anyone wants more specific error codes.

@ianswett
Copy link
Contributor

I love more specific error codes, but many in the WG do not, so I still like #3704 given the constraints we're working with.

@mjoras
Copy link

mjoras commented Jun 2, 2020

Having that said, I'd hope that we can agree on one principle: QUIC cannot respect unauthenticated information any more than TLS over TCP does. Otherwise, QUIC becomes more prone to MITM attacks that TLS over TCP is - a disobedience to our charter.

@kazuho No introduction of error codes can possibly make QUIC more prone to this than TCP + TLS. A TCP connection can be completely interrupted at any time with an unauthenticated RST. In that sense I don't think it's possible to make QUIC more prone to MITM attacks than TCP + TLS, since at the very least we only have a small window for the similar MITM interruption.

To function reliably on the internet a client has to use heuristics and unauthenticated information in order to provide a good user experience and not overwhelm server resources.

That might be true, but our experience with TLS over TCP is that we do not need an error code for the purpose.

It's not a fair comparison, as TCP does not have an opportunity to fallback to anything. You posit that we have to view the QUIC specification in isolation from the existence TCP or any "happy eyeballs" mechanism, and I don't totally disagree but I don't think it follows we should make decision ignoring the realities of deploying QUIC on the Internet. That reality is, today, that QUIC will not work 100%, whereas we expect TCP to function. If we can add things to the specification which aid in a seamless coexistence of QUIC and alternatives, why would we not?

Am I correct to understand that the reason you are advocating for having more error codes is to allow a server to control the behavior of client for future connection attempts? Assuming that is the case, I'd oppose to having such error codes regardless of how we would describe them. Because IMO that is effectively about making QUIC more prone to attacks than TLS over TCP is.

I don't think I fully understand your point here. You asserted that the client "should refrain from using an unauthenticated signal for determining what to do with future connection attempts". I maintain that this is fantastical. The client has to do something with future connection attempts whether we prescribe a behavior or not, and as such it is abundantly useful to be able to influence, if not dictate, the behavior of clients with more explicit signals from the server.

As it is I am fine with leaving this discussion with the generic CONNECTION_REFUSED and calling it a day, but I think there's value in discussing the real-world considerations here, which was @marten-seemann's original motivation.

@janaiyengar
Copy link
Contributor

Is anyone arguing for leaving this issue open? If folks still want to continue discussing, I propose that we open a new issue to do it.

@larseggert
Copy link
Member

No response to @janaiyengar's question in ten days; it looks like this issue is closed.

@martinthomson
Copy link
Member

Closing then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants