-
-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v2.7.x: panic when TLS-SNI is received for an unknown domain #5680
Comments
I updated and tested 2.7.2 this morning had to roll back for this. I use a built version with CloudFlareDNS and the Security plugins only.
|
This is where the nil pointer dereference happens: https://github.com/caddyserver/certmagic/blob/v0.19.1/handshake.go#L378 This should have been fixed by quic-go/quic-go#4001. Did you make sure that this patch is included here (released as v0.37.1)? |
Yes @marten-seemann we built and released with 0.37.1 Line 20 in e2fc08b
|
Oh, we built with Go 1.20.6 though 🤔 did we need to build with 1.20.7? Edit: Nevermind that shouldn't have mattered according to the changes in https://github.com/golang/go/issues?q=milestone%3AGo1.20.7 |
That's a different issue for which I'm about to cut a quic-go patch release today (the RSA key size DoS vulnerability fixed in Go 1.20.7, which I need to backport on my crypto/tls fork). I suggested including this in the Caddy patch release in #5671 (comment). |
We should probably figure out why this panic is occurring before I cut v0.37.2. Any hints? |
I installed my first version from a download from the caddy website with the extra modules selected:
My other box has it built from xcaddy with the latest GO version installed, which was 1.20.7.
I've been up for ~10 minutes now without issues and it would panic after 2-3 minutes for me anyway. |
@animosity22 In your build-info output, what is the version of quic-go? |
@otbutz @animosity22 Additionally, can you provide a minimal reproducer for this? At least your config and possibly a request that triggers it? |
Here are the two:
and
My config is small I think as I only use the security plugin to do SSO auth and that protects my site.
I wasn't doing anything specific to generate as the automatic SSL renewal would kick in and that would produce the error as it happened right after startup. |
@animosity22 Thanks. The phoenix instance is clearly using too-old of a quic-go version for some reason: Do they both have the same config? I wonder if a plugin is causing the downgrade. But then again, @otbutz says he gets the error using Caddy from the apt repository... (meaning without plugins) 🤔 We are also deploying that Caddy on our website and I can't find this error in our logs at all. |
Oh sorry, one second, I forgot I downgraded my phoenix back back to the older version, let me update it and reproduce. |
Just installed 2.7.2 on Debian 12 via this repo:
my Caddyfile:
and an example sites/site:
|
Hmm, I've tried a bit more on the reproducing the issue side with a few scenarios.
@raregems-io - if you restart a few times, does it clear up? |
Thanks for the info. I'll try to reproduce it. I should point out that our production website and the forums are both using 2.7.2 with similar configs without issues. |
lmk if there’s anything you need from the quic-go side. I’ll be holding off on cutting the RSA patch release until we’ve reached a conclusion here. |
Yeah, I hate bugs like this as it happened to me twice on one box and zero times on my other box. Same build process, identical config minus 2 different site names.
and since those 2, I've tried restarting ~30 times and I can't get it to reproduce. |
@marten-seemann Thanks. Really appreciate your collaboration and patience. Are there any other code paths that could allow that Conn() or RemoteAddr() to be nil? Given the spurious nature of the reports, it seems like there's either some nondeterminism here or some other code path that we didn't think of that causes them to not be populated. |
Is that codepath always used or only if a on-demand certificate is created? That would explain why the error isn't that common as the mechanism for HTTP2 and 1.1 isn't affected. |
@raregems-io I am unable to reproduce the issue with that config, using HTTP/3, and trying in a loop of about a thousand requests... Tried restarting the server several times and still can't reproduce it. |
I seem to have this issue also, but missed this existing one (sorry!). See #5681 and merge if desired. Apologies! |
@mholt How can I help you reproduce this? It happens every single time I visit the URL without fail. Restart the caddy.service and it's fine (and serving the other sites without issue), but visit Plex or Jellyfin and boom, issue recurs again and again. I'm happy to liaise with you if you want to let me know how I can help? e: I notice the site loads OK in Firefox, no issues. Dev tools shows some items using http 1.1 and others (over half the resources) are loaded with http3. Using Brave, the site refuses to load at all and the Caddy server crashes. I wonder if the agent you're testing from is having an impact (or lack of)? |
@RainmakerRaw I too have a Jellyfin installation. I can try proxying to it? But I feel like it shouldn't matter what kind of content we're serving, since the bug occurs before connections are even established.
If that's the case, then why aren't we seeing this on our production caddyserver.com and caddy.community sites? The bug does seem sporadic though. @marten-seemann I dunno if any of this info is helpful, but I'm starting to wonder if there's another code path we missed. I'm not too familiar with the quic-go library though to know. |
@mholt Aha! I don't know what effect this has, but I can bypass the issue and make Brave work again. On my test desktop client, Firefox was using DoH by default (my own AdGuard Home installation on a VPS). Brave was set to use the local DNS server, and causes Caddy to crash. When I set Brave to also use offsite DoH the Jellyfin server loads fine again and Caddy doesn't crash. I'll be honest, I don't know what this means, but it's progress maybe? |
Further digging: Cloudflare has enabled ECH (encrypted client hello) on my domain. The Caddy server (and Jellyfin server) are running on the same LAN as my test desktop device - i.e. I'm hairpinning back in through the router to connect. The local DNS server and the remote DoH server are both AdGuard Home using the same encrypted upstreams, but I wonder if something's getting mangled and causing the crash? I can connect to Plex and Jellyfin (using their domain not the local address) perfectly now I have enabled DoH in the browser. Both browsers are set to use ECH where available - is yours? In summary, ECH is enabled on both browsers. With DoH and ECH enabled, Firefox connects to Jellyfin fine and Caddy doesn't crash. With Brave, ECH being enabled but DoH not causes Caddy to crash and no page loads. Adding DoH to the ECH on Brave makes everything work OK again. Edit: Disabling ECH in the browser allows the Jellyfin site to load over QUIC with or without DoH enabled. I reloaded dozens of times without incident and Caddy no longer crashes. This seems to be ECH related. |
Good thinking! In fact there is: Fix coming later today. This will be included in the v0.37.2 patch release. |
@marten-seemann This ties in with my last edit above. ECH was enabled in Brave for me, and I needed to also enable DoH to not have Caddy crash. Enabling ECH + DoH meant no crash and the page loads. Disabling ECH in the browser means the page loads (and Caddy doesn't crash) regardless of DoH status. |
Thanks @marten-seemann, I can confirm my reproduce case from #5680 (comment) is fixed by quic-go/quic-go#4016. I think the panic @W0n9 reported is a separate issue though. I don't know how to replicate that problem. A new issue should probably be opened for that. |
@RainmakerRaw @otbutz @animosity22 @raregems-io @bt90 Can some of you also confirm that the patch at quic-go/quic-go#4016 works for you? You will need to build Caddy with that version of quic-go. Here's an easy way using xcaddy: |
@mholt Sorry, per my last reply I fixed the crash (not the underlying cause, just the manifestation of it) by changing my DNS and ECH settings on Cloudflare. I had my root domain proxied by CF but the subdomains in question were CNAMES yet not proxied. This seems to have caused issues with DNS (tagging my Jellyfin server with an ECH https answer that wasn't valid). Once I removed the CF proxy from the root domain, DNS answers changed to reflect the CNAME and root domain, and Caddy no longer crashed. I can replicate the original conditions and wait for DNS to propagate and caches to expire (I chain from DHCP > local DNS > my upstream DNS) to test if you wish; it'll just take me some time. I had already built the binary with that patch, and put it in a dir on the server just in case... |
@RainmakerRaw A confirmation would be helpful. The changes you made are unrelated -- even if correct -- except that they just happen to make it so that the clients send a known/expected SNI to Caddy, which is the real reason it "fixed" it for you, as you alluded to. The bug is exposed when the SNI is not known and Caddy doesn't have a cert. You don't have to revert your changes necessarily, just send a request to Caddy with an unknown SNI/hostname. Thanks for your help thus far :) |
@mholt Ah, I understand what you were asking now. Following the earlier example with Docker and curl-http3 (substituting for I then tried with |
Awesome, thank you. Confirmations of the fix from anyone else experiencing the issue would be valuable as we await the patch release! 💯 |
Testing my server with Caddy crash: 2.7.2 built with |
I did a quick test as well:
2.7.2 fails but the updated build works. |
Excellent, thank you both! @marten-seemann We now have numerous confirmations of the patch working, so I'm confident with your fix. 💯 Thank you thank you. |
For whatever its worth, I tested on windows as well and I no longer receive TLS handshake error spam after building with
|
Oh, as I just checked again I got handshake error now? {"level":"debug","ts":1691191817.8557942,"logger":"http.stdlib","msg":"http: TLS handshake error from 192.168.1.26:56616: no certificate available for '192.168.1.4'"} |
You don't have a certificate for the raw IP, so that is expected and only logged with level debug. |
Right, I should not stay up this late, was just focusing on watching the errors, thats from the server and not cloudflare, so all good. |
I have tried this new commit, and panic again. I think I shoud open a new issue.😥
|
@W0n9 A fix for both panics is in the works, and will be included in the v0.37.2 release. |
@W0n9 can you try a build with quic-go/quic-go#4018 ? You can use |
I have tried caddy 2.7.3, it fixed for me👍 |
Hey @mholt , @marten-seemann I'm not kidding its back with me on windows I built master after a8cc5d1 v2.7.3-0.20230805213002-65e33fc1ee47 h1:6PFhIFWiV7Nfez0pPrk8lnQ0IdbMPnnQUz1clDX+HxY= Is it me being wrong here?, getting massive spams.
|
@wazerstar That's normal. That's not a bug. That's just bots/crawlers trying to connect to your server without knowing the correct domain name (i.e. trying to connect by IP and no domain). Ignore those. Turn off the |
That's odd, why do I not receive those error when I revert back to the version previous to this, the handshake error EOF only recently started to appear after this? I will let this run for another 12 hours and see what happens. |
It's definitely not a new error, handshake errors have always been there. If you only enabled debug logging recently then you'd only be seeing those now. It's also affected by external factors whether bots are actually hitting your server or not. It happens in bursts as bots change their attention from one IP address to another. To further clarify, these are debug logs, not actionable errors. It just means "some client" failed to connect. If you can yourself connect to your site, and Caddy is not crashing, then the problem in this issue is fixed (and we've confirmed that multiple times over). |
No I have debug enabled for about 5-6 days now to figure out some other things, only after this I noticed the errors, perhaps I have been lucky not to get targeted, I don't know, sorry for causing spikes in this thread again then, will just leave and ignore it until I'm done finding cause of other issue. |
caddy v2.7.2 installed via the apt repository:
Might be related to golang/go#61639 ?
The text was updated successfully, but these errors were encountered: