Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http/3 seems to stop working after any kind of reload #4348

Closed
zierhut-it opened this issue Sep 12, 2021 · 38 comments
Closed

http/3 seems to stop working after any kind of reload #4348

zierhut-it opened this issue Sep 12, 2021 · 38 comments
Labels
bug 🐞 Something isn't working

Comments

@zierhut-it
Copy link

http/3 seems to stop working after any kind of reload.

  • http/3 works as expected before reloading

  • I went as far as v2.0.0 back (thus the two Caddyfiles, see further down) and encountered this issue every time

  • Caddy seems to still listen on udp/443 when it's happening:

    $ ss --listening --udp --processes # -lup
    UNCONN 0 0 *:https *:* users:(("caddy",pid=872477,fd=8))
  • Verbose curl -vvv logs:

     $ curl -vvv --http3 https://localhost
     *   Trying ::1:443...
     * Connect socket 5 over QUIC to ::1:443
     * Sent QUIC client Initial, ALPN: h3-29,h3-28,h3-27
     *   Trying 127.0.0.1:443...
     * Connect socket 6 over QUIC to 127.0.0.1:443
     * Sent QUIC client Initial, ALPN: h3-29,h3-28,h3-27
     * After 150000ms connect time, move on!
     * connect to ::1 port 443 failed: Connection timed out
     * After 150000ms connect time, move on!
     * connect to 127.0.0.1 port 443 failed: Connection timed out
     * Failed to connect to localhost port 443: Connection timed out
     * Closing connection 0
     curl: (28) Failed to connect to localhost port 443: Connection timed out
  • Verbose caddy logs from console with annotations (// 📌):

     // 📌 start caddy
     {"level":"info","ts":1631424276.233273,"msg":"using adjacent Caddyfile"}
     {"level":"warn","ts":1631424276.233972,"msg":"input is not formatted with 'caddy fmt'","adapter":"caddyfile","file":"Caddyfile","line":2}
     {"level":"info","ts":1631424276.2348974,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["[::1]:2019","127.0.0.1:2019","localhost:2019"]}
     {"level":"info","ts":1631424276.2352147,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc00045f030"}
     {"level":"info","ts":1631424276.235942,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"srv0","https_port":443}
     {"level":"info","ts":1631424276.2359629,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
     {"level":"info","ts":1631424276.2525249,"logger":"pki.ca.local","msg":"root certificate is already trusted by system","path":"storage:pki/authorities/local/root.crt"}
     {"level":"info","ts":1631424276.2526367,"logger":"http","msg":"enabling experimental HTTP/3 listener","addr":":443"}
     {"level":"info","ts":1631424276.2526562,"logger":"tls","msg":"cleaning storage unit","description":"FileStorage:/root/.local/share/caddy"}
     {"level":"debug","ts":1631424276.252672,"logger":"http","msg":"starting server loop","address":"[::]:443","http3":true,"tls":true}
     {"level":"debug","ts":1631424276.25269,"logger":"http","msg":"starting server loop","address":"[::]:80","http3":false,"tls":false}
     {"level":"info","ts":1631424276.252697,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["localhost"]}
     {"level":"warn","ts":1631424276.2530873,"logger":"tls","msg":"stapling OCSP","error":"no OCSP stapling for [localhost]: no OCSP server specified in certificate"}
     {"level":"debug","ts":1631424276.2531059,"logger":"tls.cache","msg":"added certificate to cache","subjects":["localhost"],"expiration":1631461064,"managed":true,"issuer_key":"local","hash":"a186539e15337b0547621156c842d1502a09780d05eb5d36684496f530c1fb84"}
     {"level":"info","ts":1631424276.2532187,"msg":"autosaved config (load with --resume flag)","file":"/root/.local/share/caddy/autosave.json"}
     {"level":"info","ts":1631424276.2532284,"msg":"serving initial configuration"}
     {"level":"info","ts":1631424276.2536077,"logger":"tls","msg":"finished cleaning storage units"}
    
     // 📌 test http/3 connection via curl
     {"level":"debug","ts":1631424286.478527,"logger":"tls.handshake","msg":"choosing certificate","identifier":"localhost","num_choices":1}
     {"level":"debug","ts":1631424286.4785457,"logger":"tls.handshake","msg":"default certificate selection results","identifier":"localhost","subjects":["localhost"],"managed":true,"issuer_key":"local","hash":"a186539e15337b0547621156c842d1502a09780d05eb5d36684496f530c1fb84"}
     {"level":"debug","ts":1631424286.4785507,"logger":"tls.handshake","msg":"matched certificate in cache","subjects":["localhost"],"managed":true,"expiration":1631461064,"hash":"a186539e15337b0547621156c842d1502a09780d05eb5d36684496f530c1fb84"}
    
     // 📌 forcefully reload caddy config via curl
     {"level":"info","ts":1631424303.4187894,"logger":"admin.api","msg":"received request","method":"GET","host":"127.0.0.1:2019","uri":"/config/","remote_addr":"127.0.0.1:48794","headers":{"Accept":["*/*"],"User-Agent":["curl/7.78.0"]}}
     {"level":"info","ts":1631424303.4197412,"logger":"admin.api","msg":"received request","method":"POST","host":"127.0.0.1:2019","uri":"/load","remote_addr":"127.0.0.1:48796","headers":{"Accept":["*/*"],"Cache-Control":["must-revalidate"],"Content-Length":["399"],"Content-Type":["application/json"],"User-Agent":["curl/7.78.0"]}}
     {"level":"info","ts":1631424303.4200108,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["localhost:2019","[::1]:2019","127.0.0.1:2019"]}
     {"level":"info","ts":1631424303.4201722,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000276700"}
     {"level":"info","ts":1631424303.4203856,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"srv0","https_port":443}
     {"level":"info","ts":1631424303.4204006,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
     {"level":"debug","ts":1631424303.4204953,"logger":"http","msg":"starting server loop","address":"[::]:80","http3":false,"tls":false}
     {"level":"info","ts":1631424303.4205036,"logger":"http","msg":"enabling experimental HTTP/3 listener","addr":":443"}
     2021/09/12 07:25:03 [DEBUG] udp/:443: Usage counter should not go above 2 or maybe 3, is now: 2
     {"level":"debug","ts":1631424303.420539,"logger":"http","msg":"starting server loop","address":"[::]:443","http3":true,"tls":true}
     {"level":"info","ts":1631424303.420544,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["localhost"]}
     {"level":"warn","ts":1631424303.4207737,"logger":"tls","msg":"stapling OCSP","error":"no OCSP stapling for [localhost]: no OCSP server specified in certificate"}
     {"level":"debug","ts":1631424303.4207897,"logger":"tls.cache","msg":"added certificate to cache","subjects":["localhost"],"expiration":1631461064,"managed":true,"issuer_key":"local","hash":"a186539e15337b0547621156c842d1502a09780d05eb5d36684496f530c1fb84"}
     {"level":"info","ts":1631424303.4208012,"logger":"pki.ca.local","msg":"root certificate is already trusted by system","path":"storage:pki/authorities/local/root.crt"}
     2021/09/12 07:25:03 [DEBUG] Fake-closing underlying packet conn
     {"level":"info","ts":1631424303.4235258,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc00045f030"}
     {"level":"info","ts":1631424303.423664,"msg":"autosaved config (load with --resume flag)","file":"/root/.local/share/caddy/autosave.json"}
     {"level":"info","ts":1631424303.4236767,"logger":"admin.api","msg":"load complete"}
     {"level":"info","ts":1631424303.427598,"logger":"admin","msg":"stopped previous server","address":"tcp/localhost:2019"}
    
     // 📌 test http/3 connection via curl (no result)


Caddyfile (v2.3.0+):

{
  debug
  servers :443 {
    protocol {
      experimental_http3
    }
  }
}

localhost {
  tls internal
  respond localhost
}

Caddyfile (v2.3.0+) as JSON:

{
  "logging": {
    "logs": {
      "default": {
        "level": "DEBUG"
      }
    }
  },
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [
            ":443"
          ],
          "routes": [
            {
              "match": [
                {
                  "host": [
                    "localhost"
                  ]
                }
              ],
              "handle": [
                {
                  "handler": "subroute",
                  "routes": [
                    {
                      "handle": [
                        {
                          "body": "localhost",
                          "handler": "static_response"
                        }
                      ]
                    }
                  ]
                }
              ],
              "terminal": true
            }
          ],
          "experimental_http3": true
        }
      }
    },
    "tls": {
      "automation": {
        "policies": [
          {
            "subjects": [
              "localhost"
            ],
            "issuers": [
              {
                "module": "internal"
              }
            ]
          }
        ]
      }
    }
  }
}

Caddyfile (pre v2.3.0):

{
  debug
  experimental_http3
}

localhost {
  tls internal
  respond localhost
}


Below are 3 different way to reproduce this issue:

  1. Case (collapse)

    • Copy example Caddyfile
    • $ caddy run
    • Test http/3 connection via
       $ curl --http3 https://localhost
       # localhost
    • Force reload same (json) config via API
       $ curl --silent 127.1:2019/config/ | curl -X POST 127.1:2019/load -H "Content-Type: application/json" -d @- -H "Cache-Control: must-revalidate"
    • Test http/3 connection again via
       $ curl --http3 https://localhost
       # curl: (28) Failed to connect to localhost port 443: Connection timed out

  2. Case (expand)

    • Copy example Caddyfile
    • $ caddy run
    • Test http/3 connection via
       $ curl --http3 https://localhost
       # localhost
    • Force reload same config via CLI
       $ caddy reload --force
    • Test http/3 connection again via
       $ curl --http3 https://localhost
       # curl: (28) Failed to connect to localhost port 443: Connection timed out

  3. Case (expand)

    • Copy example Caddyfile
    • $ caddy run --watch
    • Test http/3 connection via
       $ curl --http3 https://localhost
       # localhost
    • Edit config (e.g. respond) and save
    • Test http/3 connection again via
       $ curl --http3 https://localhost
       # curl: (28) Failed to connect to localhost port 443: Connection timed out

Hope this is has not yet been reported. Otherwise feel free to close this issue.

And despite this bug, Caddy is awesome! :)

~ @IndeedNotJames

@francislavoie
Copy link
Member

I think this code is supposed to handle that:

// close the http3 servers; it's unclear whether the bug reported in
// https://github.com/caddyserver/caddy/pull/2727#issuecomment-526856566
// was ever truly fixed, since it seemed racey/nondeterministic; but
// recent tests in 2020 were unable to replicate the issue again after
// repeated attempts (the bug manifested after a config reload; i.e.
// reusing a http3 server or listener was problematic), but it seems
// to be working fine now
for _, s := range app.h3servers {
// TODO: CloseGracefully, once implemented upstream
// (see https://github.com/lucas-clemente/quic-go/issues/2103)
err := s.Close()
if err != nil {
return err
}
}
// closing an http3.Server does not close their underlying listeners
// since apparently the listener can be used both by servers and
// clients at the same time; so we need to manually call Close()
// on the underlying h3 listeners (see lucas-clemente/quic-go#2103)
for _, pc := range app.h3listeners {
err := pc.Close()
if err != nil {
return err
}
}

We'll have to wait until @mholt has time to take a look at this one.

@mholt
Copy link
Member

mholt commented Sep 14, 2021

Hm, I'm able to reproduce it, but this used to work and I know we haven't changed our listener code recently.

Might be worth checking if something upstream changed (inadvertently)? @marten-seemann might know.

@mholt mholt added the bug 🐞 Something isn't working label Sep 14, 2021
@Forza-tng
Copy link
Contributor

Forza-tng commented Sep 27, 2021

Hi,

I'm also having a similar issue Caddy simply stops serving HTTP/3 requests after a while. I have not seen any logged errors using debug mode. Seems hard to trigger on-demand. One thing that always fail seems to be using curl --http3 on a large file. It downloads some data and then receives nothing more. Seems it does not time out and I have no log entry in Caddy until I close curl with ctrl-c

# curl -vvv --http3 https://mirrors.tnonline.net/haiku/release/r1beta3/haiku-r1beta3-x86_gcc2h-anyboot.iso>/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2001:470:28:704::100:443...
* Connect socket 5 over QUIC to 2001:470:28:704::100:443
* Sent QUIC client Initial, ALPN: h3,h3-29,h3-28,h3-27
* Connected to mirrors.tnonline.net () port 443 (#0)
* h3 [:method: GET]
* h3 [:path: /haiku/release/r1beta3/haiku-r1beta3-x86_gcc2h-anyboot.iso]
* h3 [:scheme: https]
* h3 [:authority: mirrors.tnonline.net]
* h3 [user-agent: curl/7.79.1]
* h3 [accept: */*]
* Using HTTP/3 Stream ID: 0 (easy handle 0x5560104c20d0)
> GET /haiku/release/r1beta3/haiku-r1beta3-x86_gcc2h-anyboot.iso HTTP/3
> Host: mirrors.tnonline.net
> user-agent: curl/7.79.1
> accept: */*
> 
< HTTP/3 200
< content-type: application/x-cd-image
< last-modified: Mon, 26 Jul 2021 10:48:42 GMT
< accept-ranges: bytes
< content-length: 748683264
< server: Caddy
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
< etag: "qwum16cdqvpc"
< 
{ [38046 bytes data]
  0  714M    0  438k    0     0   1421      0   6d 02h  0:05:16   6d 02h     0
  • curl 7.79.1 (x86_64-pc-linux-gnu) libcurl/7.79.1 GnuTLS/3.7.2 (OpenSSL/1.1.1l) zlib/1.2.11 brotli/1.0.9 zstd/1.5.0 libidn2/2.3.2 nghttp2/1.44.0 quiche/0.10.0
  • caddy v2.4.5 h1:P1mRs6V2cMcagSPn+NWpD+OEYUYLIf6ecOa48cFGeUg=

I'd be happy to help debug this as my goal with Caddy is specifically testing HTTP/3. I would need some pointers on where to look. For example tcpdump, wireshark or some debugger?

@marten-seemann
Copy link
Contributor

We didn't change anything about the shutdown logic recently, as far as I'm aware. Is there any way to easily reproduce this?

@Forza-tng
Copy link
Contributor

Forza-tng commented Sep 27, 2021

We didn't change anything about the shutdown logic recently, as far as I'm aware. Is there any way to easily reproduce this?

At least the curl issue is reproducible 100%. The other part is that Chrome and Firefox suddenly switch over to HTTP/2 with no regard of the network.dns.httpssvc.http3_fast_fallback_timeout setting. I have not found a reliable, precise way to debug this. I need some help with that.

@Forza-tng
Copy link
Contributor

Forza-tng commented Sep 27, 2021

I think this is odd behaviour:
https://www.http3check.net/?host=wikidev.tnonline.net
image

While https://www.http3check.net/?host=http3.is gives the expected results
image

I've added a QLog output.
wikidev.tnonline.net.qlog-json.txt

@zierhut-it
Copy link
Author

@Forza-tng Please open another issue for this as this is very likely to be unrelated and it is very important to keep this issue on topic to be able to solve it effectively. :)
~ @alexander-zierhut

@Forza-tng
Copy link
Contributor

@Forza-tng Please open another issue for this as this is very likely to be unrelated and it is very important to keep this issue on topic to be able to solve it effectively. :) ~ @alexander-zierhut

You're right. I opened a new issue.

At the moment i just had the same issue that Caddy simply stops responding anything over http3. Restarting Caddy solved it. Unfortunately I couldn't reproduce the issue.

@zierhut-it Out of curiosity, have you checked the issue I linked above? Do you get that same weird response from https://www.http3check.net/

@awfulcooking
Copy link

Reproducing reliably on 2.4.5 - keeps working after first systemctl reload, but stops working after a second.

I believe the confusion on http3checker.net is that they refer to "h3" / "h3-29" as HTTP/3, and "h3-27" (etc?) as QUIC

@vladbondarenko
Copy link

have same issue with http3 stops working, im not rely on checkers, im checking in network console of FF.

did some local fix, if it will work couple of days in production fine i will post it here.

@vladbondarenko
Copy link

vladbondarenko commented Nov 11, 2021

have same issue with http3 stops working, im not rely on checkers, im checking in network console of FF.

did some local fix, if it will work couple of days in production fine i will post it here.

in brief:
you dont need to extra close http3 servers and listeners here: /modules/caddyhttp/app.go L:435 as they already closing in lucas-clemente/[email protected]/http3/server.go L:413.

After i disabled it in app.go http3 still work for me even after each 5 mins reload in production with real traffic users.
@mholt plz take a look

@francislavoie
Copy link
Member

@vladbondarenko feel free to make a pull request with the change you propose!

@vladbondarenko
Copy link

PR #4413

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

Not sure, but I might've figured out what's going on here. Previously (before quic-go/quic-go#2111) the issue occurred due to multiple quic-go listeners (https://github.com/lucas-clemente/quic-go/blob/fa070e585ecb4ef300795f4bf99cb7815218cc1e/packet_handler_map.go) being started by the multiplexer (https://github.com/lucas-clemente/quic-go/blob/fa2e7972156393e505a16f10ac833369d9614c08/multiplexer.go), which results, if I understood correctly, in a race condition where the yet-to-be-killed server eats up packets meant for the newly launched server, which is why some connections fail after the reload.
At that time, the fix was to not stop the actual http3 servers, and I'm not quite sure how it actually worked.

Looking at what we have now, it seems like the issue occurs due to the aforementioned quic-go listeners being closed after the new caddy server starts:

  1. Caddy server launches normally, sets up quic-go listeners which create 1 packetHandlerMap which all operate normally
  2. Reload is triggered
    2.1 New Caddy server launches, sets up new quic-go server with the same quic-go multiplexer key (https://github.com/lucas-clemente/quic-go/blob/fa2e7972156393e505a16f10ac833369d9614c08/multiplexer.go#L68), which is why the previous packetHandlerMap is returned (it is already listening in a different goroutine)
    2.2 Old Caddy server stops http3 servers, which in turn calls packetHandlerMap.CloseServer, sending connection close and removing the session locally. After this, the h3listeners are closed, which in turn closes the packetHandlerMap (https://github.com/lucas-clemente/quic-go/blob/fa070e585ecb4ef300795f4bf99cb7815218cc1e/packet_handler_map.go#L340) and removes it from the multiplexer.

Since during 2.2 the packetHandlerMap is closed, it stops listening on caddy's fakeClosePacketConn and thus stops receiving packets for the actual connection. However, the new caddy server setup during 2.1 still uses this packetHandlerMap without knowing that it has stopped listening for new requests, which is why it simply doesn't receive anything.

Someone pls double check this and if you can, explain how the previous fix (not closing http3 servers) was valid and how it helped with this problem

I haven't had time to actually test this with logs of some kind, so that might also be something that should be done

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

Tested this out - what I've written about is indeed part of the problem. However there is another issue - during step 2.1 when the new quic-go server is created using the previous packetHandlerMap, it calls SetServer, which sets it to packetHandlerMap's underlying server. During step 2.2 when CloseServer is called, packetHandlerMap.server is set to nil, which is why no new packets are routed to the server.

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

Yup, after reload I'm getting "received a packet with an unexpected connection ID" (https://github.com/lucas-clemente/quic-go/blob/fa070e585ecb4ef300795f4bf99cb7815218cc1e/packet_handler_map.go#L381).

I've fixed the first issue by making fakeClosePacketConn public and using

go h3srv.Serve(h3ln.(*caddy.FakeClosePacketConn).PacketConn)

at

go h3srv.Serve(h3ln)
- this makes it so that the multiplexer listener closes only after the actual connection closes (when all fake ones close).

However I do not know of a decent way to fix the second issue I've described, since it requires either changing Caddy's reload order (seems like a no-no) or changing the way quic-go starts servers (maybe make packetHandlerMap support multiple servers?).
upd: I've confirmed the fix by removing the line https://github.com/lucas-clemente/quic-go/blob/fa070e585ecb4ef300795f4bf99cb7815218cc1e/packet_handler_map.go#L280 - the new quic-go server replaces the pointer to the old one (so there shouldn't be a leak) and everything works fine. This still isn't a valid fix though, but confirms that it is correct

Waiting for comments from you guys now

@marten-seemann
Copy link
Contributor

👋, quic-go author here. Sorry I didn't follow the discussion here in detail so far.

In general, I'm not opposed to changing the quic-go API, if that enables a use-case we had not though of before and that's not possible with what we have right now, but I'd like to understand better what Caddy is actually is trying to achieve.

Can we take a step back and answer the following questions about what is supposed to happen during a reload?

  1. What happens to existing connections currently served by Caddy? Are they supposed to be kept alive?
  2. Does a reload change anything QUIC-related (so would keeping the existing QUIC server running be an option)?
  3. If the answer to 1. is "they are closed", would closing and re-opening the UDP socket be an option?

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

Not giving an authoritative answer, however judging by how Caddy works with http1.1 and h2, as well as quic-go/quic-go#2103:

  1. Connections served by the std http server are closed gracefully - the server waits for them to go into idle mode (no unanswered request) after which it closes them. Connections served by quic-go are currently simply closed (by means of quic-go server.Close), meaning they don't wait for an answer to come (if I correctly understand how packetHandlerMap.CloseServer works)
  2. I'm pretty sure keeping the existic QUIC server should work, however some change in quic-go's http3 will need to be made to support this (ideally done by allowing http3 to be started using a ready quic server, i.e. exporting addListener)
  3. This is currently not possible due to the ordering of the reload steps - a new server is launched first (a UDP socket is created or present) and then the old one is stopped (only the fake socket is closed)

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

I think the main difference between http1-2/http3 encountered here is that http1-2 are based on tcp connections, which can be reused by concurrently running servers (as they are by Caddy using a "fake" connection wrapper which doesn't actually close the underlying connection), however for http3 atop quic-go that would be quic-go servers, since the actual listeners aren't exported by the multiplexer (which makes sense since they are not on the session level).

Quic-go servers can be reused like tcp connections can in the sense that multiple servers can concurrently Accept new sessions from them, which is why it makes sense to reuse them for multiple http3 servers on the same address (like in Caddy). This seems to be the exact problem we're looking at right now - Caddy launches multiple http3 servers which all try to launch their own quic-go servers, even though they only need one. So allowing the creation of an http3 server with a given quic-go Listener might be a decent change. (By the way, currently quic-go's Listener's doc doesn't state that it is thread-safe, even though it seems to be)

@marten-seemann
Copy link
Contributor

Quic-go servers can be reused like tcp connections can in the sense that multiple servers can concurrently Accept new sessions from them, which is why it makes sense to reuse them for multiple http3 servers on the same address (like in Caddy). This seems to be the exact problem we're looking at right now - Caddy launches multiple http3 servers which all try to launch their own quic-go servers, even though they only need one. So allowing the creation of an http3 server with a given quic-go Listener might be a decent change. (By the way, currently quic-go's Listener's doc doesn't state that it is thread-safe, even though it seems to be)

Not sure if that's a good idea. If you call Accept from multiple servers (assuming it's thread-safe, I haven't checked), then you can't deterministically know at which server an accepted session might end up.
But if I understand correctly, this is not what we actually need. We want the old http3.Server to stop accepting sessions, and just after that use a new http3.Server to start accepting sessions from the same quic.Listener.

This could relatively easily be achieved by adding a quic.Listener to the http3.Server, and using that one (if set) instead of starting a new listener. We'd also have to make sure to not close the quic.Listener when the http3.Server is closed in that case, so we can reuse it with a new http3.Server. @renbou, @mholt does this sound like it could make sense?

I'm pretty busy at the moment, so I probably won't get around to implementing this any time soon. Would you be interested in adding this feature to quic-go?

@renbou
Copy link
Collaborator

renbou commented Mar 14, 2022

Yes, calling Accept from multiple servers is not a good idea in that sense, what I meant is that logically, two http servers can operate on the same tcp connection in the same way they can operate on the same quic server (even if that means non-deterministic connections). However in the case of Caddy what we need is exactly how you phrased it and it matches what I meant - allowing creation of http3.Server from quic.Listener or addition of an existing quic.Listener to an existing http3.Server (since it already has a listeners map).

I'm up for implementing this in quic-go and updating Caddy accordingly, though I'd like to wait for someone from the Caddy team to approve this as well

@francislavoie
Copy link
Member

To my understanding, #4348 (comment) is correct, except that GraceDuration can be configured to set a timeout at which point the old connections will be force closed so that the old config can finish shutting down. Right now we have the problem that certain kinds of connections like Websockets or long-polling SSE never gracefully close so they need to be force closed.

So yeah, Caddy having control over the listeners sounds like a good plan, so they can be kept around while a config reload happens. Basically, a ref counter is used to determine if the listener should be closed. When Caddy first starts, the counter goes from 0 to 1. When a reload happens, the new config gets loaded in first, so the count goes from 1 to 2, then the old server shuts down and the count goes back from 2 to 1. When Caddy is being shut down completely, the count goes from 1 to 0 and the listeners finally get cleaned up. If a reload happens but the listener is different (imagine a different port number is used, no overlap) then the old listener would also get cleaned up even if Caddy is not shutting down, because the old one's ref count doesn't go up due to reuse.

@mholt
Copy link
Member

mholt commented Mar 14, 2022

Just catching up on this... lots of progress made in the middle of the night here! Thank you for the in-depth exploration, @renbou -- this is super helpful.

Based on my reading of Artem's description of the HTTP/1 and HTTP/2 servers over TCP, I believe that understanding is correct. Upon a reload, existing servers immediately go into "shutdown" mode which stops accepting new connections, waits for existing connections to become idle, then closes the socket; while the new server immediately starts accepting new connections. A GracePeriod can be configured to forcefully close old servers if the connections do not become idle soon enough (as @francislavoie explained).

@marten-seemann Nice to hear from you too! Thanks for chiming in while you're so busy.

This could relatively easily be achieved by adding a quic.Listener to the http3.Server, and using that one (if set) instead of starting a new listener. We'd also have to make sure to not close the quic.Listener when the http3.Server is closed in that case, so we can reuse it with a new http3.Server. @renbou, @mholt does this sound like it could make sense?

Yep, if that's how to do it correctly, let's try it.

@renbou

I'm up for implementing this in quic-go and updating Caddy accordingly, though I'd like to wait for someone from the Caddy team to approve this as well

Thank you 🙏 I'm a bit swamped preparing the v2.5 release, so having a contribution here would be very helpful. If all goes well maybe it could make it into v2.5.0. If not, then maybe v.2.5.1; no biggie either way.

Let me know how I can help going forward! I'm glad there's some traction here, it'll be nice to see HTTP/3 work through reloads.

@renbou
Copy link
Collaborator

renbou commented Mar 16, 2022

I've been figuring out what exactly needs to be implemented in quic-go and Caddy and it seems like quic-go only needs a http3.Server.Serve(listener *quic.EarlyListener) method added as well as http3.Server.Close changed accordingly - listener.Close() removed in the case of Serving with a given listener (pretty much what is described in #4348 (comment)). This will leave already open QUIC sessions running and handling streams which will then be closed either by the client, or due to timeout (this also seems like a decent temporary solution to the lack of an implemented graceful close for http3.Server).

As for Caddy, the changes seem even easier, however I'd like to clarify them first. Since Caddy currently uses caddy.ListenPacket to open udp listeners which are then used to launch http3.Servers, and we need to open quic.EarlyListeners instead, I propose removing caddy.ListenPacket in favor of a new caddy.ListenQUIC since caddy.ListenPacket's only usage is just that - starting a http3.Server (

h3ln, err := caddy.ListenPacket("udp", hostport)
). This would also remove some of the clutter in the current code, such as fakeClosePacketConn.SetReadBuffer and fakeClosePacketConn.SyscallConn which exist for the sole purpose of exporting those methods from the underlying UDP connection to quic-go.

@mholt how does this sound?

@francislavoie
Copy link
Member

Sounds good to me. Maybe caddy.ListenHTTP3 instead though?

@renbou
Copy link
Collaborator

renbou commented Mar 16, 2022

Sounds good to me. Maybe caddy.ListenHTTP3 instead though?

It needs to return quic.EarlyListener though, since that's the underlying Listener being passed to http3.Server.Serve (just like net.Listener gets passed to http.Server.Serve). I think that ListenQUIC would be more suitable here since it signifies the underlying transport (QUIC instead of the usual protocols).

@marten-seemann
Copy link
Contributor

@renbou That makes a lot of sense to me. I opened quic-go/quic-go#3347 in quic-go. Feel free to submit a PR to resolve this issue. :)

@mholt
Copy link
Member

mholt commented Mar 16, 2022

Still need some time to catch up on this (y'all like to work while I'm sleeping, ha). 😁 Thank you for working on it!

Anyway: @renbou ListenPacket() is still important for UDP servers though. Caddy's Layer 4 module uses it, for example: https://github.com/mholt/caddy-l4/blob/56bd7700d889f2ffd52353c241819ddbc7745ff6/layer4/app.go#L68

@mholt
Copy link
Member

mholt commented Mar 17, 2022

@renbou Otherwise the changes sound like a good start -- want to draft up a PR for review? 😃

@renbou
Copy link
Collaborator

renbou commented Mar 17, 2022

I've started a PR with the required changes to quic-go at quic-go/quic-go#3349 and tested them with Caddy locally - everything works, however since it was all done locally, the go.mod points to my local version.

Should I create a PR with the WIP changes then? Supposedly only the go.mod will need to be updated after we've settled on the new quic-go API.

@JeDaYoshi
Copy link

quic-go/quic-go#3349 has been merged 🎉

@renbou
Copy link
Collaborator

renbou commented Mar 21, 2022

@mholt, I've opened #4654

@francislavoie
Copy link
Member

Huge thanks to @renbou, we've just merged #4654! Let's hope this is resolved once and for all!

This should land in the next release of Caddy (likely v2.5.0-beta.2 or whatever it ends up being named)

@QGB
Copy link

QGB commented Jan 14, 2023

v2.6.2 h1:wKoFIxpmOJLGl3QXoo6PNbYvGW4xLEgo32GPBEjWL8o=

caddy_linux_amd64 reload --force --config ./Caddyfile OK

curl http://127.0.0.1:2019/config/|curl -X POST 127.1:2019/load -H "Content-Type: application/json" -d @- -H "Cache-Control: must-revalidate"

@omerXfaruq

This comment was marked as off-topic.

@francislavoie

This comment was marked as off-topic.

@JeDaYoshi

This comment was marked as off-topic.

@omerXfaruq

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants