cURL IPv4 issue since alpine 3.19 #366

Open
niconoe- opened this issue Dec 13, 2023 · 52 comments

Comments

@niconoe-

niconoe- commented Dec 13, 2023

Hi, and thank you for your awesome work!

I'm experiencing an issue with alpine 3.19 when using curl: it seems that curl only tries IPv6 instead of switching to whichever IP version can actually connect. This doesn't seem to come from cURL itself, as on alpine 3.18 it works like a charm.

How to reproduce

Alpine 3.19 (curl classic)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.22.1-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r15.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl www.google.com
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

Alpine 3.19 (curl with --ipv4 option)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.22.1-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r15.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl --ipv4 www.google.com
# <html>…</html> # The Google home page

Alpine 3.18 (curl classic)

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.18
# In container:
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
# (1/7) Installing ca-certificates (20230506-r0)
# (2/7) Installing brotli-libs (1.0.9-r14)
# (3/7) Installing libunistring (1.1-r1)
# (4/7) Installing libidn2 (2.3.4-r1)
# (5/7) Installing nghttp2-libs (1.57.0-r0)
# (6/7) Installing libcurl (8.5.0-r0)
# (7/7) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r5.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 22 packages
/ > curl www.google.com
# <html>…</html> # The Google home page

As you can see, the curl version is exactly the same between all tests (8.5.0-r0) but still, there's a difference between Alpine 3.18 and Alpine 3.19.

I expect the curl command on Alpine 3.19 to work without needing to force IPv4.

If you need more info, feel free to ask. Thanks a lot

EDIT: after a very short investigation, I can see that Alpine 3.19 now adds c-ares=1.22.1-r0 as a dependency of cURL, and I just discovered that c-ares/c-ares#652 could be related. AFAIK, when cURL resolves a host name, it tries both IPv4 and IPv6 by default and takes the faster match. c-ares helps cURL run the two lookups in parallel, resulting in faster cURL calls. But according to the issue I just linked, it looks like when several DNS lookups are issued and one fails, the failure status is returned as final. Therefore, as the IPv6 lookup fails faster than the IPv4 one resolves, c-ares wrongly tells cURL that the host is unreachable.
I'm not 100% sure about this, but I think it deserves a look. I wasn't able to remove c-ares and try without it.
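
To make "resolving both families in parallel" concrete, here is a minimal Python sketch. It is not c-ares or curl code: `resolve_both`, `fake_a` and `fake_aaaa` are hypothetical names, and the two fake resolvers only stand in for real A and AAAA queries. It shows one possible policy, where a failure of one family does not discard the answer of the other.

```python
import asyncio
from typing import Awaitable, Callable, List

Resolver = Callable[[str], Awaitable[List[str]]]


async def resolve_both(host: str, resolve_a: Resolver, resolve_aaaa: Resolver) -> dict:
    """Run the A and AAAA lookups concurrently and keep whatever succeeded."""
    results = await asyncio.gather(
        resolve_a(host), resolve_aaaa(host), return_exceptions=True
    )
    merged = {"ipv4": [], "ipv6": []}
    for family, result in zip(("ipv4", "ipv6"), results):
        if not isinstance(result, BaseException):
            merged[family] = result
    if not merged["ipv4"] and not merged["ipv6"]:
        raise OSError(f"could not resolve {host}")
    return merged


async def fake_a(host: str) -> List[str]:
    # Hypothetical stand-in: the A query fails.
    raise OSError("A lookup failed")


async def fake_aaaa(host: str) -> List[str]:
    # Hypothetical stand-in: the AAAA query succeeds.
    return ["2a00:1450:4001:80b::2004"]


if __name__ == "__main__":
    print(asyncio.run(resolve_both("www.google.com", fake_a, fake_aaaa)))
```

With these stand-ins the merged result contains an IPv6 address and an empty IPv4 list, which is only usable on a host that can actually route IPv6.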

@bradh352

The description isn't exactly accurate about the behavior. See c-ares/c-ares#551 for a better description, but basically a change was made in c-ares 1.20.0 to not go through the entire timeout sequence if we had at least a partial reply, as it is very likely that it won't work. It still waits for the other address family to time out or have some other issue on the current request. So if someone has tries=3, timeout=2s and 2 DNS servers, it could take a minimum of 3*2*2 = 12 seconds (it's actually more, as there's an additional penalty per retry to the same server), whereas if one address family returned in 100ms, it would take at most 2s to return the partial result, since it would terminate the other address family's additional attempts.
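
The arithmetic above can be written out as a tiny Python sketch. The function name is made up for illustration, and it deliberately ignores the extra per-retry penalty mentioned above, so the figure is a lower bound and not a model of c-ares' real scheduling.

```python
def min_full_timeout_seconds(tries: int, timeout_s: float, servers: int) -> float:
    """Lower bound on the wait when every attempt against every server times out.

    Hypothetical helper: it mirrors the tries * timeout * servers estimate
    from the comment and ignores any per-retry penalty.
    """
    return tries * timeout_s * servers


if __name__ == "__main__":
    # tries=3, timeout=2s, 2 DNS servers, as in the example above.
    print(min_full_timeout_seconds(tries=3, timeout_s=2, servers=2))  # 12
```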

Now, issues have apparently been reported against glibc, which does something similar, as per https://man7.org/linux/man-pages/man5/resolv.conf.5.html:

single-request (since glibc 2.10)
                     Sets RES_SNGLKUP in _res.options.  By default,
                     glibc performs IPv4 and IPv6 lookups in parallel
                     since glibc 2.9.  Some appliance DNS servers cannot
                     handle these queries properly and make the requests
                     time out.  This option disables the behavior and
                     makes glibc perform the IPv6 and IPv4 requests
                     sequentially (at the cost of some slowdown of the
                     resolving process).

So likely before c-ares 1.20.0, the retries allowed this to eventually succeed in such an environment. Currently c-ares doesn't honor the glibc single-request option.

It would probably be good to know if this is what is really happening in your environment, a tcpdump/pcap would be useful. You should probably open a ticket in https://github.com/c-ares/c-ares/issues with your findings.

@bradh352

I should also mention that we just added alpine linux automated (CI/CD) testing to c-ares to ensure there are no behavioral differences (e.g. due to musl c). All tests are passing, so I'm pretty sure whatever you are experiencing is outside of alpine's scope.

@niconoe-
Author

Thank you very much for your answers.

On my side, I'm not that experienced with networking, so I'm not 100% sure I can handle this. I'll give it a try by looking at tcpdump and pcap.
I really do understand that your automated tests prevent you from releasing something buggy, and I'm glad that's how it works!

Nevertheless, I'm curious about the result you got when simply trying to reproduce my commands. Did it actually work for you? I mean, when running from Docker containers, I expect almost nothing to be fetched from my local environment, as I thought containers were mostly isolated. I'm aware that core libs from the native OS are used, of course, but I wouldn't expect any difference between my OS, a canonical alpine 3.18 on this OS and a canonical alpine 3.19 on this exact same OS, as what's imported into the containers to make them work seems highly generic and kernel-related to me.

Anyway, with your advice and the help of the IT colleagues I'll ask, I'll try to investigate as much as I can to identify why I'm experiencing this issue.

@bradh352

I haven't tried your exact scenario (using curl), just building and running the c-ares test suite on alpine linux.

@Tithugues

I tried and indeed, I've the same issue:

$ docker run --rm -it --entrypoint=/bin/sh alpine:3.19 -c "apk add curl && curl --trace - --trace-time www.google.com"
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/8) Installing ca-certificates (20230506-r0)
(2/8) Installing brotli-libs (1.1.0-r1)
(3/8) Installing c-ares (1.22.1-r0)
(4/8) Installing libunistring (1.1-r2)
(5/8) Installing libidn2 (2.3.4-r4)
(6/8) Installing nghttp2-libs (1.58.0-r0)
(7/8) Installing libcurl (8.5.0-r0)
(8/8) Installing curl (8.5.0-r0)
Executing busybox-1.36.1-r15.trigger
Executing ca-certificates-20230506-r0.trigger
OK: 12 MiB in 23 packages
14:14:49.612106 == Info: Host www.google.com:80 was resolved.
14:14:49.612159 == Info: IPv6: 2a00:1450:4001:80b::2004
14:14:49.612165 == Info: IPv4: (none)
14:14:49.612212 == Info:   Trying [2a00:1450:4001:80b::2004]:80...
14:14:49.612242 == Info: Immediate connect fail for 2a00:1450:4001:80b::2004: Address not available
14:14:49.612258 == Info: Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server
14:14:49.612268 == Info: Closing connection
curl: (7) Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server

@bradh352

bradh352 commented Dec 18, 2023

Well, I am running the current c-ares main, not v1.22, which is a couple of releases behind (the current release is v1.24). Perhaps there is some issue in v1.22?

Anyhow, in our current c-ares CI system, this is the latest alpine build with tests:
https://api.cirrus-ci.com/v1/task/4971237735137280/logs/main.log

If you search for ./ci/test.sh this is where the tests start, the first test is running adig which is similar to BIND's dig and ends with ;; MSG SIZE, then immediately after that is the output of ahost www.google.com and you can see it returns both IPv4 and IPv6 addresses:

www.google.com                  	142.250.1.104
www.google.com                  	142.250.1.99
www.google.com                  	142.250.1.105
www.google.com                  	142.250.1.106
www.google.com                  	142.250.1.147
www.google.com                  	142.250.1.103
www.google.com                  	2607:f8b0:4001:c09::68
www.google.com                  	2607:f8b0:4001:c09::69
www.google.com                  	2607:f8b0:4001:c09::67
www.google.com                  	2607:f8b0:4001:c09::63

In theory, that's exactly what curl should see as curl should internally be using the same function as ahost does (ares_getaddrinfo), and we can see both ipv4 and ipv6 addresses. That said, I don't know what alpine test environment you're using, as it could very well be environmental with what DNS servers you are using.

Everything after that point is just running the whole test suite.
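
As a rough cross-check of what a resolver returns per address family, something like the following Python sketch can be run inside the container. Note the assumption: Python's `socket.getaddrinfo` goes through the C library's resolver (musl on Alpine), not through c-ares' `ares_getaddrinfo`, so it shows what the environment's DNS returns, not what c-ares does. The helper name is made up.

```python
import socket
from collections import defaultdict


def addresses_by_family(host: str, port: int = 80) -> dict:
    """Group the addresses returned by the system resolver by address family."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    grouped = defaultdict(list)
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        label = names.get(family, str(family))
        if sockaddr[0] not in grouped[label]:
            grouped[label].append(sockaddr[0])
    return dict(grouped)


if __name__ == "__main__":
    # A numeric address needs no DNS query, so this works offline.
    print(addresses_by_family("127.0.0.1"))
```

Calling `addresses_by_family("www.google.com")` on an affected machine would show whether the libc resolver sees both families where curl with c-ares reports `IPv4: (none)`.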

@niconoe-
Author

On my side, I just gave it a try today with tcpdump, here are the results:

Preparation

$ > docker run --rm -it --entrypoint=/bin/sh alpine:3.19
# In container:
/ > apk add curl tcpdump
# Downloading…

Logging shell

# Display verbosely, with hexadecimal content and numeric IPs and ports, on interface "eth0" (the default one in a Docker container), where the source or destination is my current IP:
/ > tcpdump -vvXnni eth0 src $(hostname -i) or dst $(hostname -i)
# tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

Calling cURL shell

/ > curl --trace - --trace-time www.google.com

Logging shell updating…

# 16:09:59.631641 IP (tos 0x0, ttl 64, id 18664, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.39924 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4f44!] 4305+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 48e8 4000 4011 a622 0a01 8002  E..GH.@.@.."....
#     0x0010:  ac12 1586 9bf4 0035 0033 4be0 10d1 0100  .......5.3K.....
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 0001 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 16:09:59.631696 IP (tos 0xc0, ttl 64, id 990, offset 0, flags [none], proto ICMP (1), length 99)
#     172.18.21.134 > 10.1.128.2: ICMP 172.18.21.134 udp port 53 unreachable, length 79
#     IP (tos 0x0, ttl 64, id 18664, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.39924 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4f44!] 4305+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  45c0 0063 03de 0000 4001 2a61 ac12 1586  E..c....@.*a....
#     0x0010:  0a01 8002 0303 4c41 0000 0000 4500 0047  ......LA....E..G
#     0x0020:  48e8 4000 4011 a622 0a01 8002 ac12 1586  H.@.@.."........
#     0x0030:  9bf4 0035 0033 4be0 10d1 0100 0001 0000  ...5.3K.........
#     0x0040:  0000 0001 0377 7777 0667 6f6f 676c 6503  .....www.google.
#     0x0050:  636f 6d00 0001 0001 0000 2905 0000 0000  com.......).....
#     0x0060:  0000 00                                  ...
# 16:09:59.631737 IP (tos 0x0, ttl 64, id 48319, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.50287 > 172.18.86.200.53: [bad udp cksum 0x8d22 -> 0xfbd0!] 64107+ [1au] AAAA? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 bcbf 4000 4011 f108 0a01 8002  E..G..@.@.......
#     0x0010:  ac12 56c8 c46f 0035 0033 8d22 fa6b 0100  ..V..o.5.3.".k..
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 16:09:59.633326 IP (tos 0x0, ttl 124, id 58881, offset 0, flags [none], proto UDP (17), length 99)
#     172.18.86.200.53 > 10.1.128.2.50287: [udp sum ok] 64107 q: AAAA? www.google.com. 1/0/1 www.google.com. AAAA 2a00:1450:4001:806::2004 ar: . OPT UDPsize=4000 (71)
#     0x0000:  4500 0063 e601 0000 7c11 cbaa ac12 56c8  E..c....|.....V.
#     0x0010:  0a01 8002 0035 c46f 004f 7427 fa6b 8180  .....5.o.Ot'.k..
#     0x0020:  0001 0001 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 c00c 001c  gle.com.........
#     0x0040:  0001 0000 0050 0010 2a00 1450 4001 0806  .....P..*..P@...
#     0x0050:  0000 0000 0000 2004 0000 290f a000 0000  ..........).....
#     0x0060:  0000 00                                  ...

Calling cURL shell updating…

# 16:10:01.692831 == Info: Host www.google.com:80 was resolved.
# 16:10:01.692920 == Info: IPv6: 2a00:1450:4001:806::2004
# 16:10:01.692949 == Info: IPv4: (none)
# 16:10:01.693032 == Info:   Trying [2a00:1450:4001:806::2004]:80...
# 16:10:01.693118 == Info: Immediate connect fail for 2a00:1450:4001:806::2004: Address not available
# 16:10:01.693162 == Info: Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server
# 16:10:01.693196 == Info: Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

Logging shell updating…

# 16:10:04.765537 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.1 tell 10.1.128.2, length 28
#     0x0000:  0001 0800 0604 0001 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0000 0000 0000 0a01 8001            ............
# 16:10:04.765607 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.2 tell 10.1.128.1, length 28
#     0x0000:  0001 0800 0604 0001 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0000 0000 0000 0a01 8002            ............
# 16:10:04.765616 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.2 is-at 02:42:0a:01:80:02, length 28
#     0x0000:  0001 0800 0604 0002 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0242 0c35 ad1f 0a01 8001            ...B.5......
# 16:10:04.765680 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.1 is-at 02:42:0c:35:ad:1f, length 28
#     0x0000:  0001 0800 0604 0002 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0242 0a01 8002 0a01 8002            ...B........

The only thing I can suspect is ICMP 172.18.21.134 udp port 53 unreachable, but I really don't know why, nor why it works very well when forcing IPv4 during the cURL call…

I'll check with my IT dept. people working on the DNS configuration, maybe 🤷
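
For readers who want to check the hex dumps against tcpdump's one-line summaries, here is a small Python sketch that decodes the DNS header and question from an IPv4/UDP packet. The function name and the `QTYPES` table are illustrative, it assumes an uncompressed question name, and the hex string is the first packet of the capture above (the A query with id 4305).

```python
import struct

QTYPES = {1: "A", 28: "AAAA"}  # only the two record types seen in this thread

# First packet of the capture above: the A? www.google.com. query.
A_QUERY_HEX = """
4500 0047 48e8 4000 4011 a622 0a01 8002
ac12 1586 9bf4 0035 0033 4be0 10d1 0100
0001 0000 0000 0001 0377 7777 0667 6f6f
676c 6503 636f 6d00 0001 0001 0000 2905
0000 0000 0000 00
"""


def decode_dns_question(ip_packet_hex: str) -> dict:
    """Decode the DNS header and first question from an IPv4/UDP packet hex dump."""
    packet = bytes.fromhex(ip_packet_hex)
    ihl = (packet[0] & 0x0F) * 4   # IPv4 header length in bytes
    dns = packet[ihl + 8:]         # skip the 8-byte UDP header
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", dns[:12])
    labels, pos = [], 12
    while dns[pos] != 0:           # uncompressed name: length-prefixed labels
        length = dns[pos]
        labels.append(dns[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    qtype, _qclass = struct.unpack("!2H", dns[pos + 1:pos + 5])
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),
        "name": ".".join(labels),
        "qtype": QTYPES.get(qtype, str(qtype)),
        "answers": ancount,
    }


if __name__ == "__main__":
    print(decode_dns_question(A_QUERY_HEX))
```

The decoded id, name and type match the `4305+ ... A? www.google.com.` summary that tcpdump printed for that packet.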

@bradh352

172.18.21.134 received an ICMP unreachable reply, and 172.18.86.200 works. That said, it's not immediately clear to me why the A request went to 172.18.21.134 and the AAAA request went to 172.18.86.200. Can you share your /etc/resolv.conf ? I wonder if you have rotate enabled for the dns servers.

@bradh352

Is that really the entirety of the tcp dump? Typically an event should be received on an ICMP unreachable which then recv() would be called and then detect the udp destination isn't valid, so we should have seen another "A" record request go out, especially considering the timings shown here.
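
The mechanism referred to here can be observed outside c-ares with a short Python sketch. This is not c-ares code, and `probe_udp_port` is a made-up name; it relies on the behavior (on Linux at least) that an ICMP port-unreachable received for a connected UDP socket surfaces as a "connection refused" error on a later socket call.

```python
import socket


def probe_udp_port(host: str, port: int, timeout_s: float = 1.0) -> str:
    """Send one datagram on a connected UDP socket and report what comes back.

    Returns "reply", "refused" (an ICMP port-unreachable surfaced as
    ECONNREFUSED) or "timeout".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.connect((host, port))  # connect() lets the kernel report ICMP errors
        try:
            sock.send(b"\x00")
            sock.recv(512)
            return "reply"
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            return "timeout"


if __name__ == "__main__":
    # Address taken from the capture; adjust to the server being investigated.
    print(probe_udp_port("172.18.21.134", 53))
```

A "refused" result arrives almost immediately, which is why a resolver that notices it could move on to the next server well before the 2-second timeout.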

@niconoe-
Author

> 172.18.21.134 received an ICMP unreachable reply, and 172.18.86.200 works. That said, it's not immediately clear to me why the A request went to 172.18.21.134 and the AAAA request went to 172.18.86.200. Can you share your /etc/resolv.conf ? I wonder if you have rotate enabled for the dns servers.

The /etc/resolv.conf file contains this:

nameserver 127.0.0.1
nameserver 172.18.86.200
nameserver 172.18.32.204
nameserver 172.18.86.207
options edns0 trust-ad
search ad.XXXXX.com # My company's AD

> Is that really the entirety of the tcp dump? Typically an event should be received on an ICMP unreachable which then recv() would be called and then detect the udp destination isn't valid, so we should have seen another "A" record request go out, especially considering the timings shown here.

That was the full tcpdump output I got when filtering on my IP address. Here is the result without the filter, trying to keep the noise to a minimum:

/ > tcpdump -vvXnni any
# tcpdump: data link type LINUX_SLL2
# tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
# 17:18:01.248975 eth0  Out IP (tos 0x0, ttl 64, id 10392, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.34956 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4ecf!] 9390+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 2898 4000 4011 c672 0a01 8002  E..G(.@.@..r....
#     0x0010:  ac12 1586 888c 0035 0033 4be0 24ae 0100  .......5.3K.$...
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 0001 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 17:18:01.249021 eth0  In  IP (tos 0xc0, ttl 64, id 25710, offset 0, flags [none], proto ICMP (1), length 99)
#     172.18.21.134 > 10.1.128.2: ICMP 172.18.21.134 udp port 53 unreachable, length 79
#     IP (tos 0x0, ttl 64, id 10392, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.34956 > 172.18.21.134.53: [bad udp cksum 0x4be0 -> 0x4ecf!] 9390+ [1au] A? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  45c0 0063 646e 0000 4001 c9d0 ac12 1586  E..cdn..@.......
#     0x0010:  0a01 8002 0303 4bcc 0000 0000 4500 0047  ......K.....E..G
#     0x0020:  2898 4000 4011 c672 0a01 8002 ac12 1586  (.@.@..r........
#     0x0030:  888c 0035 0033 4be0 24ae 0100 0001 0000  ...5.3K.$.......
#     0x0040:  0000 0001 0377 7777 0667 6f6f 676c 6503  .....www.google.
#     0x0050:  636f 6d00 0001 0001 0000 2905 0000 0000  com.......).....
#     0x0060:  0000 00                                  ...
# 17:18:01.249058 eth0  Out IP (tos 0x0, ttl 64, id 4793, offset 0, flags [DF], proto UDP (17), length 71)
#     10.1.128.2.57608 > 172.18.86.200.53: [bad udp cksum 0x8d22 -> 0x7146!] 26717+ [1au] AAAA? www.google.com. ar: . OPT UDPsize=1280 (43)
#     0x0000:  4500 0047 12b9 4000 4011 9b0f 0a01 8002  E..G..@.@.......
#     0x0010:  ac12 56c8 e108 0035 0033 8d22 685d 0100  ..V....5.3."h]..
#     0x0020:  0001 0000 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 0000 2905  gle.com.......).
#     0x0040:  0000 0000 0000 00                        .......
# 17:18:01.251345 eth0  In  IP (tos 0x0, ttl 124, id 887, offset 0, flags [none], proto UDP (17), length 99)
#     172.18.86.200.53 > 10.1.128.2.57608: [udp sum ok] 26717 q: AAAA? www.google.com. 1/0/1 www.google.com. AAAA 2a00:1450:4001:80b::2004 ar: . OPT UDPsize=4000 (71)
#     0x0000:  4500 0063 0377 0000 7c11 ae35 ac12 56c8  E..c.w..|..5..V.
#     0x0010:  0a01 8002 0035 e108 004f e9a0 685d 8180  .....5...O..h]..
#     0x0020:  0001 0001 0000 0001 0377 7777 0667 6f6f  .........www.goo
#     0x0030:  676c 6503 636f 6d00 001c 0001 c00c 001c  gle.com.........
#     0x0040:  0001 0000 0047 0010 2a00 1450 4001 080b  .....G..*..P@...
#     0x0050:  0000 0000 0000 2004 0000 290f a000 0000  ..........).....
#     0x0060:  0000 00                                  ...
# 17:18:06.429450 eth0  Out ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.1 tell 10.1.128.2, length 28
#     0x0000:  0001 0800 0604 0001 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0000 0000 0000 0a01 8001            ............
# 17:18:06.429507 eth0  In  ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.1.128.2 tell 10.1.128.1, length 28
#     0x0000:  0001 0800 0604 0001 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0000 0000 0000 0a01 8002            ............
# 17:18:06.429513 eth0  Out ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.2 is-at 02:42:0a:01:80:02, length 28
#     0x0000:  0001 0800 0604 0002 0242 0a01 8002 0a01  .........B......
#     0x0010:  8002 0242 0c35 ad1f 0a01 8001            ...B.5......
# 17:18:06.429547 eth0  In  ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.1.128.1 is-at 02:42:0c:35:ad:1f, length 28
#     0x0000:  0001 0800 0604 0002 0242 0c35 ad1f 0a01  .........B.5....
#     0x0010:  8001 0242 0a01 8002 0a01 8002            ...B........

In parallel, I also tried calling curl -I -6 www.google.com directly from my machine (not from any container), and I got a host-unreachable error too. For some reason, it looks like I can't make any IPv6 calls, even though every command and config file I know of says IPv6 is enabled. But still, even if IPv6 is broken on my machine, when I run a cURL command without forcing IPv4 or IPv6, it should try both and ignore the failures as long as one attempt succeeds, right?

Plus, even if the IPv6 configuration on my machine is wrong, that doesn't explain why the curl command works on alpine 3.18 but not on alpine 3.19.

I'm a bit lost tbh, so thank you very much for your help about that!

@bradh352

Ok, well that's even more interesting. That means the 10.1.128.2.34956 > 172.18.21.134.53 wasn't generated by c-ares at all, but from your local resolver at 127.0.0.1, as 172.18.21.134 isn't listed in your /etc/resolv.conf at all so there's no way c-ares would try to use that. Can you tcpdump all interfaces on port 53 udp on the machine and try again? I'd expect "lo" listed as an interface with port 53 traffic.
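
The comparison made here (is the destination seen on the wire one of the configured servers, and is rotate set?) can be sketched in Python. `parse_resolv_conf` is a hypothetical helper, the text is the resolv.conf posted earlier in the thread, and treating `#` as the start of a comment anywhere on a line is a simplification.

```python
def parse_resolv_conf(text: str) -> dict:
    """Extract nameserver addresses and options from resolv.conf-style text."""
    servers, options = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: drop trailing comments
        if not line:
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            servers.append(values[0])
        elif keyword == "options":
            options.extend(values)
    return {"nameservers": servers, "options": options}


RESOLV_CONF = """\
nameserver 127.0.0.1
nameserver 172.18.86.200
nameserver 172.18.32.204
nameserver 172.18.86.207
options edns0 trust-ad
search ad.XXXXX.com # My company's AD
"""

if __name__ == "__main__":
    conf = parse_resolv_conf(RESOLV_CONF)
    print("172.18.21.134" in conf["nameservers"])  # False: not a configured server
    print("rotate" in conf["options"])             # False: rotate is not set
```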

@niconoe-
Author

niconoe- commented Dec 18, 2023

When I run sudo tcpdump -vvXnni any port 53 on my machine, the output is flooded by the other docker services currently running for my development workspace.

I'm cleaning up all of them and I'll try again.

After cleaning up, you're right, I can see lots of traces on "lo" interface with port 53 traffic.
I can't display it as is because it contains sensitive information and domains from my company, but it looks like

19:08:12.115747 lo    In  IP (tos 0x0, ttl 64, id 13846, offset 0, flags [DF], proto UDP (17), length 82)
    127.0.0.1.53 > 127.0.0.1.42736: [bad udp cksum 0xfe51 -> 0xe9bd!] 53037 q: AAAA? xxxxxxxxx01.ad.xxxxxx.com. 0/0/1 ar: . OPT UDPsize=1280 (54)
	0x0000:  4500 0052 3616 4000 4011 0683 7f00 0001  E..R6.@.@.......
	0x0010:  7f00 0001 0035 a6f0 003e fe51 cf2d 8180  .....5...>.Q.-..
	0x0020:  0001 0000 0000 0001 0b78 7878 7878 7878  .........xxxxxxx
	0x0030:  7878 3031 0261 6406 7878 7878 7878 0363  xx01.ad.xxxxxx.c
	0x0040:  6f6d 0000 1c00 0100 0029 0500 0000 0000  om.......)......
	0x0050:  0000                                     ..

Could those calls interfere with the curl requests I'm trying to make from my containers?

@niconoe-
Author

I continued to investigate and I found something quite interesting IMO.

I think the issue comes from the fact that I can't use IPv6, either on my machine or in any container it hosts. But I also think that something could be improved in c-ares to mitigate this kind of issue.

I managed to simplify my tests to highlight only the important things, so here are my runs:

# Inside a container from image alpine:3.19, on which I added `curl` via `apk add curl`.
/ > curl -Ivvv www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: 2a00:1450:4025:401::69, 2a00:1450:4025:401::6a, 2a00:1450:4025:401::67, 2a00:1450:4025:401::93
# * IPv4: (none)
# *   Trying [2a00:1450:4025:401::69]:80...
# * Immediate connect fail for 2a00:1450:4025:401::69: Address not available
# *   Trying [2a00:1450:4025:401::6a]:80...
# * Immediate connect fail for 2a00:1450:4025:401::6a: Address not available
# *   Trying [2a00:1450:4025:401::67]:80...
# * Immediate connect fail for 2a00:1450:4025:401::67: Address not available
# *   Trying [2a00:1450:4025:401::93]:80...
# * Immediate connect fail for 2a00:1450:4025:401::93: Address not available
# * Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server
# * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2003 ms: Couldn't connect to server

/ > curl -Ivvv4 www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: (none)
# * IPv4: 142.250.27.105, 142.250.27.106, 142.250.27.99, 142.250.27.103, 142.250.27.147, 142.250.27.104
# *   Trying 142.250.27.105:80...
# * Connected to www.google.com (142.250.27.105) port 80
# > HEAD / HTTP/1.1
# > Host: www.google.com
# > User-Agent: curl/8.5.0
# Blablabla… the response is OK

To me, that means that when I first ran curl without specifying the IP version, c-ares tried to resolve the domain name over both IPv6 and IPv4 and stopped as soon as one resolution was found. That's the point of c-ares, faster DNS resolution: no need to keep resolving once the name is resolved, even in the other family. In the first command, you can see that IPv6 was resolved but not IPv4 (IPv4: (none)). In the second command, I forced IPv4, and fortunately c-ares honors that: it only resolves the name over IPv4, and it succeeds.

To me, that means 2 issues:

  1. on my local configuration, IPv6 should work, as it's enabled
  2. c-ares shouldn't be so confident that the job is done once DNS is resolved. A resolved name (whether in IPv4 or IPv6) doesn't mean the host can reach that address. Or, if that's by design in c-ares, maybe curl shouldn't rely on c-ares blindly and should fall back.

I'll investigate more to make IPv6 work in my environment, which should solve my issue, but I strongly suspect that other people with misconfigured setups will hit the same problem: curl fails because it trusts c-ares, which takes the lazy route and doesn't check whether the host can actually reach the resolved IP.

PS: Looking at the c-ares changelogs, I think this issue might actually be solved in v1.24, but as alpine:3.19 sticks with c-ares 1.22, I can't test it further.

@bradh352

It's impossible to tell what is going on with your system with the information provided. The real issue is you have a local dns resolver running at 127.0.0.1 and other servers configured. We can't tell from what you've provided what c-ares is doing vs your local resolver.

I don't believe your conclusion is accurate based on the information at hand. Really you either need to remove your local resolver from /etc/resolv.conf and test ... or remove all other dns servers and leave only the local resolver.

@aptalca

aptalca commented Jan 3, 2024

This should fix it:
https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154

If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

@Tithugues

Tithugues commented Jan 4, 2024

> This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154
>
> If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

Hi!

Thanks for the information!
Here is the result of my test:

/ # echo "@edge https://dl-cdn.alpinelinux.org/alpine/edge/main" >> /etc/apk/repositories
/ # apk add c-ares@edge curl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
(1/8) Installing c-ares@edge (1.24.0-r0)
(2/8) Installing ca-certificates (20230506-r0)
(3/8) Installing brotli-libs (1.1.0-r1)
(4/8) Installing libunistring (1.1-r2)
(5/8) Installing libidn2 (2.3.4-r4)
(6/8) Installing nghttp2-libs (1.58.0-r0)
(7/8) Installing libcurl (8.5.0-r0)
(8/8) Installing curl (8.5.0-r0)
Executing busybox-1.36.1-r15.trigger
Executing ca-certificates-20230506-r0.trigger
OK: 12 MiB in 23 packages
/ # curl --version
curl 8.5.0 (x86_64-alpine-linux-musl) libcurl/8.5.0 OpenSSL/3.1.4 zlib/1.3 brotli/1.1.0 c-ares/1.24.0 libidn2/2.3.4 nghttp2/1.58.0
Release-Date: 2023-12-06
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM SSL threadsafe TLS-SRP UnixSockets
/ # curl www.google.com
curl: (7) Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server

At least in my environment, it seems to still fail even with the new version of c-ares.

If you see any issue with my test or would like me to test anything else, please let me know. 🙏

Thanks again.

@niconoe-
Author

niconoe- commented Jan 4, 2024

FWIW, I get the exact same result as @Tithugues above: I still can't connect, with the same error.

That means my assumption that the issue was caused by c-ares v1.22 is wrong.

To me, the issue is still related to the fact that I can't connect to anything over IPv6, and the "software" responsible for DNS resolution is failing to do its job properly, i.e. fall back to IPv4. I thought it was c-ares, as it is a new dependency of curl in alpine 3.19 compared to 3.18, but maybe I was wrong, or maybe it is actually c-ares but version 1.24 doesn't fix my problem.

When running this on alpine 3.19

/ > curl -vvv www.google.com --trace-time
# 09:55:27.932169 * Host www.google.com:80 was resolved.
# 09:55:27.932296 * IPv6: 2a00:1450:4001:80b::2004
# 09:55:27.932366 * IPv4: (none)
# 09:55:27.932437 *   Trying [2a00:1450:4001:80b::2004]:80...
# 09:55:27.932526 * Immediate connect fail for 2a00:1450:4001:80b::2004: Network unreachable
# 09:55:27.932587 * Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server
# 09:55:27.932639 * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2001 ms: Couldn't connect to server

I can clearly see that DNS resolves the IPv6 address faster, but as I can't use IPv6, I just can't connect. IPv6 usability should be checked before the IPv6 resolution starts, as there's no point in resolving it otherwise.
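
One common way to test whether IPv6 is usable at all is to ask the kernel for a route without sending anything. The Python sketch below is an illustration of that idea only; it makes no claim about what curl or c-ares do internally. `ipv6_route_available` is a made-up name, and the default probe address is simply the Google address from the trace above.

```python
import socket


def ipv6_route_available(
    probe_addr: str = "2a00:1450:4001:80b::2004", port: int = 80
) -> bool:
    """Return True if the kernel can pick a route and source address for IPv6.

    connect() on a UDP socket sends no packet; it only asks the kernel to
    select a route, so it fails fast when no IPv6 route exists.
    """
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_addr, port))
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print(ipv6_route_available())
```

On a host where curl reports "Network unreachable" or "Address not available" for every IPv6 address, this check would be expected to return False.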

If I run the exact same command under alpine 3.18, we can see the DNS resolution is done on both IPv6 and IPv4:

/ > curl -vvv www.google.com --trace-time
# 10:00:08.889585 * Host www.google.com:80 was resolved.
# 10:00:08.889688 * IPv6: 2a00:1450:4001:80b::2004
# 10:00:08.889758 * IPv4: 142.250.186.164
# 10:00:08.889855 *   Trying 142.250.186.164:80...
# 10:00:08.892884 * Connected to www.google.com (142.250.186.164) port 80
# 10:00:08.893074 > GET / HTTP/1.1
# 10:00:08.893074 > Host: www.google.com
# 10:00:08.893074 > User-Agent: curl/8.5.0
# 10:00:08.893074 > Accept: */*
# 10:00:08.893074 > 
# 10:00:08.938750 < HTTP/1.1 200 OK
# 10:00:08.938830 < Date: Thu, 04 Jan 2024 10:00:08 GMT
# 10:00:08.938886 < Expires: -1
# 10:00:08.938950 < Cache-Control: private, max-age=0
# 10:00:08.939035 < Content-Type: text/html; charset=ISO-8859-1
# 10:00:08.939107 < Content-Security-Policy-Report-Only: object-src 'none';base-uri 'self';script-src 'nonce-btQ4x8jy7FUdjfmcMd8zxQ' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp
# 10:00:08.939171 < Server: gws
# 10:00:08.939236 < X-XSS-Protection: 0
# 10:00:08.939308 < X-Frame-Options: SAMEORIGIN
# 10:00:08.939367 < Set-Cookie: AEC=Ackid1T8i9FSUMjTgdj_cyfnoIvnHWy4Kp6QBB4EJ6ShA1xNuiHoehcWOw; expires=Tue, 02-Jul-2024 10:00:08 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
# 10:00:08.939431 < Accept-Ranges: none
# 10:00:08.939520 < Vary: Accept-Encoding
# 10:00:08.939577 < Transfer-Encoding: chunked
# 10:00:08.939639 < 
# <!doctype html>…</html> # Google's homepage.

So that works thanks to the IPv4 connection.

Whatever "software" is responsible for stopping DNS resolution as soon as either IPv6 or IPv4 is resolved should be improved in one of these ways:

  • best behavior: do not even try to resolve IPv6 or IPv4 addresses if there's no way to use that protocol
  • improved behavior: make the resolution strategy configurable (both, first, force-IPv6, force-IPv4, … any possibility)
  • acceptable behavior: automatically try again with the other protocol when the connection fails.

With my hands tied on this currently, I don't know how to go further here, unfortunately 😢
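
The "acceptable behavior" in the list above can be sketched roughly as follows. This is only an illustration in Python, not how curl or c-ares work internally: `connect_with_fallback` and the injected `resolve` and `connect` callables are hypothetical names, and the fake resolver and connector exist only to mimic a host without IPv6 connectivity.

```python
import socket
from typing import Callable, List, Sequence, Tuple


def connect_with_fallback(
    host: str,
    resolve: Callable[[str, int], List[str]],
    connect: Callable[[str, int], None],
    families: Sequence[int] = (socket.AF_INET6, socket.AF_INET),
) -> Tuple[int, str]:
    """Return (family, address) for the first address that accepts a connection."""
    errors = []
    for family in families:
        # Resolve per family so a missing A or AAAA answer cannot hide the other one.
        for address in resolve(host, family):
            try:
                connect(address, family)
                return family, address
            except OSError as exc:
                errors.append(f"{address}: {exc}")
    raise OSError("all addresses failed: " + "; ".join(errors))


# Hypothetical stand-ins for a real resolver and socket layer
# (the addresses are the ones shown in the logs above).
def fake_resolve(host: str, family: int) -> List[str]:
    table = {
        socket.AF_INET6: ["2a00:1450:4001:80b::2004"],
        socket.AF_INET: ["142.250.186.164"],
    }
    return table[family]


def fake_connect(address: str, family: int) -> None:
    if family == socket.AF_INET6:
        raise OSError("Network unreachable")  # mimics a host without IPv6
```

Calling `connect_with_fallback("www.google.com", fake_resolve, fake_connect)` skips the unreachable IPv6 address and settles on the IPv4 one.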

@bradh352

bradh352 commented Jan 4, 2024

As stated before, you have both a local dns server running at 127.0.0.1 and configurations of other servers which greatly complicates the ability to debug what is going on. I'd need access to a system that's not working in order to have any chance of determining what is really going on.

Likely c-ares/c-ares#551 plays a role in the issue, but it doesn't seem wise to revert that as it will greatly extend DNS resolution times.

@niconoe-
Author

niconoe- commented Jan 4, 2024

As stated before, you have both a local dns server running at 127.0.0.1 and configurations of other servers which greatly complicates the ability to debug what is going on.

When in my container, if I open the /etc/resolv.conf file, I can see there's a line with nameserver <my_local_machine_ip>. As soon as I remove this line, curl is able to resolve the addresses in both IPv6 and IPv4, and the request succeeds.

So, indeed, something is related to the configuration of my local DNS on my local machine. Thank you very much for pointing this out 🤟

This leaves me with 2 questions:

  1. Why is there no issue with my local DNS configuration in alpine 3.18 (or maybe there is one, but it isn't blocking there while it is in alpine:3.19)?
  2. How can I find out what's wrong in my local DNS configuration so I can fix it?

Question 2 is probably for the IT dept. of my company 😆.

@bradh352

bradh352 commented Jan 4, 2024

What is the behavior in the opposite direction, if you leave only that local DNS server in place? Does it still get an IPv6 address (when running curl with -Ivvv)?

@niconoe-
Author

niconoe- commented Jan 4, 2024

What is the behavior in the opposite direction, if you leave only that local DNS server in place? Does it still get an IPv6 address (when running curl with -Ivvv)?

Nope, it makes "www.google.com" unresolvable:

/ > curl -I -vvv www.google.com
# * Could not resolve host: www.google.com
# * Closing connection
# curl: (6) Could not resolve host: www.google.com

So maybe there's a really big issue with my local DNS that used to be masked by the other nameservers I have. However, I think I still need this nameserver on my local machine in order for my containers to communicate with each other.

@bradh352

bradh352 commented Jan 4, 2024

So is your local nameserver meant to only resolve some subset of domains, specific to your internal network? If so, I believe it's supposed to have a # suffix to indicate the base domain it is authoritative for (that said, c-ares doesn't currently support that; we have a ticket on that: c-ares/c-ares#642).

@bradh352

bradh352 commented Jan 4, 2024

By the way, my theory is your local DNS server is configured to be recursive, but since IPv6 is not working on the host the ipv6 fails fast, so c-ares sends the ipv6 query to the next configured server. But the ipv4 query tries to recurse within your local DNS server, and eventually fails and returns that failure to c-ares ... however, by the time it fails, c-ares already received a legitimate reply for ipv6 from the next server so any retries for ipv4 are halted and you get only an ipv6 address back.

If that is really what is happening, this falls within an "undefined behavior" grey zone. Since your local DNS server can't recurse, recursion should be disabled in its configuration, which in theory should fix the issue.
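
To make the theory above easier to follow, here is a toy timeline model in Python. It is not c-ares code: `simulate`, the server names, and every latency value are assumptions invented for illustration. The only rule it encodes is the one described above, namely that a retry for one query type is not started once the other query type already has an answer.

```python
from typing import Dict, List, Optional, Tuple

Outcome = Tuple[int, Optional[str]]  # (latency in ms, answer or None on failure)


def simulate(
    servers: List[str],
    outcomes: Dict[Tuple[str, str], Outcome],
    halt_retries_once_answered: bool,
) -> Dict[str, Optional[str]]:
    """Return the final answer per query type ("AAAA" and "A")."""
    # Plan each query type as if it were always allowed to move to the next server.
    plan: Dict[str, List[Tuple[int, int, Optional[str]]]] = {}
    for qtype in ("AAAA", "A"):
        clock = 0
        plan[qtype] = []
        for server in servers:
            latency, answer = outcomes[(server, qtype)]
            plan[qtype].append((clock, clock + latency, answer))
            clock += latency
            if answer is not None:
                break

    # Time at which the first query type obtains a successful answer.
    finished = [end for tries in plan.values() for (_, end, ans) in tries if ans is not None]
    first_done = min(finished) if finished else None

    result: Dict[str, Optional[str]] = {}
    for qtype, tries in plan.items():
        result[qtype] = None
        for index, (start, _end, answer) in enumerate(tries):
            is_retry = index > 0
            if halt_retries_once_answered and is_retry and first_done is not None and start >= first_done:
                break  # the other family already answered, so this retry never starts
            if answer is not None:
                result[qtype] = answer
                break
    return result


# Invented numbers: the local server fails AAAA quickly but fails A slowly.
servers = ["local", "upstream"]
outcomes = {
    ("local", "AAAA"): (5, None),
    ("upstream", "AAAA"): (20, "2a00:1450:4001:80b::2004"),
    ("local", "A"): (2000, None),
    ("upstream", "A"): (20, "142.250.186.164"),
}
with_halt = simulate(servers, outcomes, halt_retries_once_answered=True)
without_halt = simulate(servers, outcomes, halt_retries_once_answered=False)
```

Under these made-up timings, `with_halt` contains only the IPv6 address while `without_halt` contains both, which mirrors the difference seen between the two curl traces.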

@niconoe-
Author

niconoe- commented Jan 4, 2024

I believe it's supposed to have a # suffix to indicate the base domain it is authoritative for

Thanks for sharing this, I wasn't aware of that. I'll do that soon.

By the way, my theory is your local DNS server is configured to be recursive, but since IPv6 is not working on the host the ipv6 fails fast, so c-ares sends the ipv6 query to the next configured server.

I really do think so, or something like it. My local DNS is dnsmasq and, if I understand what I found online about it, it's not recursive, but it follows everything, and the behavior is probably similar to what you described. The problem is that there is no way to make it not follow unless I add the no-resolv option to the dnsmasq configuration. But if I do that, it will no longer resolve anything, so a docker service called acme-my-service that I can reach today from another container via curl -I acme-my-service.my-company.local will no longer be accessible, as my local DNS will not resolve that domain name.

Or maybe I'm misunderstanding something?

@bradh352

bradh352 commented Jan 4, 2024

Do the domains you're trying to resolve really end in ".local"? If so, ".local" is reserved for multicast DNS (mDNS). That would also mean you're not maintaining any form of internal DNS records within your local resolver.

Perhaps this is a workaround to the fact that the alpine linux musl libc resolver doesn't implement multicast dns, but dnsmasq does, which makes a lot of sense why you might have your configuration this way.

In fact, c-ares doesn't yet support multicast DNS either, but it is something we are aware of and it is on my task list (c-ares/c-ares#171).

The obvious solution here would be to make it so your dnsmasq can fully perform recursive DNS operations properly, and make it the only dns server in your /etc/resolv.conf. That is honestly the only configuration that would make your setup not rely on some undefined behavior (that just so happens to work some or most of the time).

@beroset

beroset commented Jan 4, 2024

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154

If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

FYI, I came across this via a different route. I was using alpine:latest to build some software and discovered that git reported a domain lookup failure:

/tmp # git clone https://github.com/AsteroidOS/asteroidos.org.git
Cloning into 'asteroidos.org'...
fatal: unable to access 'https://github.com/AsteroidOS/asteroidos.org.git/': Could not resolve host: github.com

I can verify that using alpine:edge (202312119) instead of alpine:latest (3.19.0) fixes this problem and that alpine:3.18 (3.18.5) also works.

@niconoe-
Author

niconoe- commented Jan 4, 2024

Do the domains you're trying to resolve really end in ".local"? If so, ".local" is reserved for multicast DNS (mDNS). That would also mean you're not maintaining any form of internal DNS records within your local resolver.

Perhaps this is a workaround to the fact that the alpine linux musl libc resolver doesn't implement multicast dns, but dnsmasq does, which makes a lot of sense why you might have your configuration this way.

Yes, my servers are reachable via .local in my local environment. They used to be reachable via .dev, but I had to change that when Google decided to reserve the .dev TLD 😆. I guess I just picked a bad TLD twice 😆.

The obvious solution here would be to make it so your dnsmasq can fully perform recursive DNS operations properly, and make it the only dns server in your /etc/resolv.conf. That is honestly the only configuration that would make your setup not rely on some undefined behavior (that just so happens to work some or most of the time).

Unfortunately, due to constraints imposed by my company, I can't remove the other DNS servers, otherwise I won't have access to the internal servers my company hosts.
I'm trying to put my local DNS at the end of the list, hoping the other DNS servers are better configured than my own, and letting c-ares go through the list until it reaches mine when appropriate.

@niconoe-
Author

niconoe- commented Jan 4, 2024

This should fix it: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/58154
If you want to test locally, you can install c-ares from the edge repo while on 3.19 and see if it works.

FYI, I came across this via a different route. I was using alpine:latest to build some software and discovered that git reported a domain lookup failure:

/tmp # git clone https://github.com/AsteroidOS/asteroidos.org.git
Cloning into 'asteroidos.org'...
fatal: unable to access 'https://github.com/AsteroidOS/asteroidos.org.git/': Could not resolve host: github.com

I can verify that using alpine:edge (202312119) instead of alpine:latest (3.19.0) fixes this problem and that alpine:3.18 (3.18.5) also works.

I'll give it a try with alpine:edge. Thanks for the info ❤️

EDIT : aaaaaaand, that's a failure 😆

> docker run --rm -it --entrypoint=/bin/sh alpine:edge
# Unable to find image 'alpine:edge' locally
# edge: Pulling from library/alpine
# dcccee43ad5d: Pull complete 
# Digest: sha256:9f867dc20de5aa9690c5ef6c2c81ce35a918c0007f6eac27df90d3166eaa5cc0
# Status: Downloaded newer image for alpine:edge
/ > apk add curl
# fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
# fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
# (1/8) Installing ca-certificates (20230506-r0)
# (2/8) Installing brotli-libs (1.1.0-r1)
# (3/8) Installing c-ares (1.24.0-r0)
# (4/8) Installing libunistring (1.1-r2)
# (5/8) Installing libidn2 (2.3.4-r4)
# (6/8) Installing nghttp2-libs (1.58.0-r0)
# (7/8) Installing libcurl (8.5.0-r0)
# (8/8) Installing curl (8.5.0-r0)
# Executing busybox-1.36.1-r17.trigger
# Executing ca-certificates-20230506-r0.trigger
# OK: 12 MiB in 23 packages
/ > curl -I -vvv www.google.com
# * Host www.google.com:80 was resolved.
# * IPv6: 2a00:1450:4025:401::69, 2a00:1450:4025:401::93, 2a00:1450:4025:401::67, 2a00:1450:4025:401::6a
# * IPv4: (none)
# *   Trying [2a00:1450:4025:401::69]:80...
# * Immediate connect fail for 2a00:1450:4025:401::69: Address not available
# *   Trying [2a00:1450:4025:401::93]:80...
# * Immediate connect fail for 2a00:1450:4025:401::93: Address not available
# *   Trying [2a00:1450:4025:401::67]:80...
# * Immediate connect fail for 2a00:1450:4025:401::67: Address not available
# *   Trying [2a00:1450:4025:401::6a]:80...
# * Immediate connect fail for 2a00:1450:4025:401::6a: Address not available
# * Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server
# * Closing connection
# curl: (7) Failed to connect to www.google.com port 80 after 2002 ms: Couldn't connect to server

@niconoe-
Author

niconoe- commented Jan 4, 2024

The content of /etc/resolv.conf in my containers is actually defined by the configuration in my local /etc/docker/daemon.json file.

EDIT: (by the way, trying to add a # at the end of my local IP as indicated here: #366 (comment) doesn't work in /etc/docker/daemon.json, as Docker detects that the value is not an IP address and refuses to accept this configuration. I'll just ignore this, unfortunately).

If I put my local DNS (nameserver <my_ip>) at the end of the list, I manage to solve my cURL issue, as I let my company's DNS servers resolve the domains instead of mine, and the call works. Somehow, it also works when I ask for local domains, probably because my company's DNS servers are better configured than my dnsmasq, so I think I'll just go with it: treating my local DNS as the last resort for resolving domains, so external calls can be resolved correctly. As a downside, it will slow down my internal calls between my servers a bit (some microseconds to milliseconds), but as this is for my local environment only, I guess that's acceptable.
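
For anyone wanting to script the same reordering, a minimal sketch in Python follows. It assumes the DNS servers are listed under the "dns" key of daemon.json; `move_dns_last` is a hypothetical helper name, and the addresses are placeholders from the documentation ranges, not the real setup.

```python
import json


def move_dns_last(daemon_json_text: str, local_dns: str) -> str:
    """Return daemon.json text with `local_dns` moved to the end of the "dns" list."""
    config = json.loads(daemon_json_text)
    servers = config.get("dns")
    if isinstance(servers, list) and local_dns in servers:
        # Keep the relative order of the other servers; only the local one moves.
        config["dns"] = [s for s in servers if s != local_dns] + [local_dns]
    return json.dumps(config, indent=2)


# Placeholder addresses (assumptions), with the local resolver listed first.
example = '{"dns": ["192.0.2.10", "198.51.100.1", "198.51.100.2"]}'
reordered = json.loads(move_dns_last(example, "192.0.2.10"))
```

The function leaves the configuration untouched when there is no "dns" list or the local address is not in it. The Docker daemon typically has to be restarted before containers pick up a changed list.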

@remiville

remiville commented Oct 22, 2024

FYI this issue seems to be back again with alpine 3.20.
The workaround is still to install c-ares from the 3.19 main repository.

@bradh352

Please be more specific about the exact issue you are having. If you install c-ares 1.34.2, which is the latest version in alpine edge, do you still have your issue?

@remiville

remiville commented Oct 22, 2024

I do not have the issue with c-ares 1.27.0-r0 (alpine 3.19); I have the issue with c-ares 1.33.1 (alpine 3.20) and 1.34.2-r0 (alpine edge).

/ # cat /etc/os-release 
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.20.3
PRETTY_NAME="Alpine Linux v3.20"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
/ # 
/ # 
/ # cat /etc/apk/repositories
https://dl-cdn.alpinelinux.org/alpine/v3.20/main
https://dl-cdn.alpinelinux.org/alpine/v3.20/community
@3.19.main https://dl-cdn.alpinelinux.org/alpine/v3.19/main
@3.18.main https://dl-cdn.alpinelinux.org/alpine/v3.18/main
@edge.main https://dl-cdn.alpinelinux.org/alpine/edge/main
/ # 
/ # apk add curl
OK: 32 MiB in 50 packages
/ # apk add c-ares
OK: 32 MiB in 50 packages

/ # curl --version
curl 8.10.1 (x86_64-alpine-linux-musl) libcurl/8.10.1 OpenSSL/3.3.2 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 c-ares/1.33.1 libidn2/2.3.7 libpsl/0.21.5 nghttp2/1.62.1
Release-Date: 2024-09-18
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd
/ # 
/ # apk list | grep c-ares
...
c-ares-1.33.1-r0 x86_64 {c-ares} (MIT) [installed]
...
/ # 
/ # curl -o /tmp/test.txt -Ssl "https://archive.apache.org/dist/tomcat/tomcat-8/v8.0.1/KEYS"
curl: (6) Could not resolve host: archive.apache.org
/ # 
/ # apk add c-ares@3.19.main
(1/1) Downgrading c-ares@3.19.main (1.33.1-r0 -> 1.27.0-r0)
OK: 32 MiB in 50 packages
/ # apk list | grep c-ares
...
c-ares-1.27.0-r0 x86_64 {c-ares} (MIT) [installed]
...
/ # 
/ # curl -o /tmp/test.txt -Ssl "https://archive.apache.org/dist/tomcat/tomcat-8/v8.0.1/KEYS"
/ # echo $?
0
/ # 
/ # apk add c-ares@edge.main
(1/1) Upgrading c-ares@edge.main (1.27.0-r0 -> 1.34.2-r0)
OK: 32 MiB in 50 packages
/ #
/ # apk list | grep c-ares
...
c-ares-1.34.2-r0 x86_64 {c-ares} (MIT) [installed]
...
/ # 
/ # curl -o /tmp/test.txt -Ssl "https://archive.apache.org/dist/tomcat/tomcat-8/v8.0.1/KEYS"
curl: (6) Could not resolve host: archive.apache.org
/ # 
/ # cat /etc/resolv.conf 
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

search bams.corp
nameserver 10.215.39.1
nameserver 10.215.39.2

# Based on host file: '/etc/resolv.conf' (legacy)
# Overrides: []

@bradh352

Can you install c-ares-utils and try the adig and ahost utilities for DNS resolution? They might provide an error code that could give some insight into the issue. Also, it would be super helpful to get a tcpdump of what is going on.

@remiville

remiville commented Oct 22, 2024

Unfortunately I'm not familiar with these tools, I must be doing something wrong:

/ # /usr/bin/.libs/ahost www.apache.org
/usr/bin/.libs/ahost: line 34: /bin/sed: Argument list too long

@bradh352

Well, that's going to be a packaging issue on the alpine side; it sounds like they're redistributing the libtool wrapper rather than the actual utility. I'll see if I can investigate that and send those guys a PR.

In the meantime, if you can run curl and capture a tcpdump of the communication with your DNS server during the failure, that would be useful.

@bradh352

Merge request to fix the packaging issue: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/73955

@bradh352

Any luck getting a tcpdump of the DNS query? Also, can you provide any more information, such as whether looking up other domains works? If not, what kind of DNS server is running at 10.215.39.1 and 10.215.39.2? If I can replicate the issue it should be easy to solve.

@bradh352

@remiville any luck getting a tcpdump? I've pinged the alpine guys on my merge request to fix the tools; no reply yet.

@bradh352

bradh352 commented Nov 7, 2024

ping

@bradh352

bradh352 commented Nov 7, 2024

@remiville the alpine package for c-ares-utils has been updated to 1.34.2-r1. The utilities ahost and adig should now work; please check whether they produce any more meaningful info.

@remiville

Sorry, busy busy

# apk list | grep c-ares | grep installed
c-ares-1.34.2-r1 x86_64 {c-ares} (MIT) [installed]
c-ares-utils-1.34.2-r1 x86_64 {c-ares} (MIT) [installed]
# curl -o /tmp/test.txt -Ssl "https://archive.apache.org/dist/tomcat/tomcat-8/v8.0.1/KEYS"
curl: (6) Could not resolve host: archive.apache.org


# adig www.apache.org

; <<>> c-ares DiG 1.34.2 <<>> www.apache.org
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 65361
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: 0; udp: 1232
; COOKIE: a82defa12d945a2c (good)
;; QUESTION SECTION:
;www.apache.org.                        IN      A

;; MSG SIZE  rcvd: 55

13c768a71b20:/#
13c768a71b20:/# ahost www.apache.org
www.apache.org: DNS server claims query was misformatted

@bradh352

bradh352 commented Nov 7, 2024

@remiville thanks for the reply. The DNS server is rejecting the query with a FORMERR response which is quite unusual.

Can you try adig +qr +noedns www.apache.org ?

I'd also be interested to see if you could install bind-tools and run dig www.apache.org. I'd honestly expect BIND's dig utility to get a FORMERR response as well.

Finally, if you can provide information about the DNS servers in use so that I can try to reproduce myself, that would be very helpful.
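
For readers unfamiliar with what +noedns changes: a plain query ends after the question section, while an EDNS query appends an OPT pseudo-record that can carry options such as a DNS cookie. The Python sketch below builds both wire formats. `encode_name` and `build_query` are hypothetical helper names, not part of c-ares or adig, and the cookie value and UDP size are taken from the adig output above purely for illustration.

```python
import struct
from typing import Optional


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending with the root label."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int = 0, cookie: Optional[bytes] = None) -> bytes:
    """Build an A/IN query; append an EDNS OPT record with a COOKIE option if given."""
    arcount = 1 if cookie is not None else 0
    # Flags 0x0100 = recursion desired; one question, no answer/authority records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    if cookie is None:
        return header + question
    option = struct.pack("!HH", 10, len(cookie)) + cookie  # option code 10 = COOKIE
    # OPT record: root name, TYPE=41, CLASS=UDP payload size, TTL=0 (EDNS version 0).
    opt_rr = b"\x00" + struct.pack("!HHIH", 41, 1232, 0, len(option)) + option
    return header + question + opt_rr


plain = build_query("www.apache.org")
with_cookie = build_query("www.apache.org", cookie=bytes.fromhex("a82defa12d945a2c"))
```

With these inputs the plain query is 32 bytes long and the EDNS query with an 8-byte client cookie is 55 bytes long; the extra 23 bytes are the OPT record that a server is expected to tolerate even if it does not understand the option.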

@remiville

# adig +qr +noedns www.apache.org

; <<>> c-ares DiG 1.34.2 <<>> www.apache.org
;; Sending:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 0
;; flags: rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.apache.org.                        IN      A

;; MSG SIZE  rcvd: 32

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49770
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.apache.org.                        IN      A

;; ANSWER SECTION:
www.apache.org.         301     IN      A       151.101.2.132

;; MSG SIZE  rcvd: 48
# apk list | grep bind-tools | grep installed
bind-tools-9.18.27-r0 x86_64 {bind} (MPL-2.0) [installed]
13c768a71b20:/# dig www.apache.org

; <<>> DiG 9.18.27 <<>> www.apache.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 1852
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: de6238ce6754c0d9 (echoed)
;; QUESTION SECTION:
;www.apache.org.                        IN      A

;; Query time: 0 msec
;; SERVER: 10.215.39.1#53(10.215.39.1) (UDP)
;; WHEN: Fri Nov 08 08:16:12 UTC 2024
;; MSG SIZE  rcvd: 55

@bradh352

bradh352 commented Nov 8, 2024

@remiville awesome, thanks for that. That confirms my suspicion that whatever DNS server you're using is non-compliant and BIND is affected by the same issue when trying to communicate. The server is not ignoring unrecognized EDNS options like it's supposed to. What is the vendor of the DNS server you use?

I think it should be possible to detect this particular situation and mark the server as incapable of supporting DNS cookies (or possibly even EDNS completely), and requeue any queries.

The drawback of this, of course, is that any applications using c-ares that are not long-lived (say curl on the command line) won't be able to "remember" this, and thus will always have to detect it again, leading to additional queries and latency.

I'd like to report this issue to your upstream DNS server vendor, so do please let me know who that is so they can get their product fixed.
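
The detect-and-requeue idea described above could look roughly like the following Python sketch. It is not the c-ares implementation: `query`, `rcode`, and the injected `send` callable are hypothetical names, and the fake server simply answers FORMERR whenever EDNS is used, as the server in this thread does.

```python
import struct
from typing import Callable, List, Set, Tuple

FORMERR = 1  # DNS RCODE "Format Error"


def rcode(response: bytes) -> int:
    """Extract the 4-bit RCODE from the low bits of the fourth header byte."""
    return response[3] & 0x0F


def query(server: str, send: Callable[[str, bool], bytes], no_edns_servers: Set[str]) -> bytes:
    """Send a query, falling back to a non-EDNS query if the server answers FORMERR."""
    use_edns = server not in no_edns_servers
    response = send(server, use_edns)
    if use_edns and rcode(response) == FORMERR:
        no_edns_servers.add(server)  # remembered only for the lifetime of this process
        response = send(server, False)
    return response


# Fake transport (assumption): FORMERR with EDNS, NOERROR without it.
calls: List[Tuple[str, bool]] = []


def fake_send(server: str, use_edns: bool) -> bytes:
    calls.append((server, use_edns))
    flags = 0x8101 if use_edns else 0x8180  # qr rd + FORMERR, or qr rd ra + NOERROR
    return struct.pack("!HHHHHH", 0x1234, flags, 1, 0, 0, 0)


remembered: Set[str] = set()
first = query("10.215.39.1", fake_send, remembered)
second = query("10.215.39.1", fake_send, remembered)
```

The first lookup costs two round trips and the second only one, because the server is remembered. A short-lived process such as command-line curl starts with an empty set every time, which is the latency drawback mentioned above.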

@remiville

I'm not part of the IT team, but someone there said it's Windows Server AD; I don't have much more information than that.

@bradh352

bradh352 commented Nov 8, 2024

Hrm, I'd sure hope Microsoft's DNS isn't that braindead. Unfortunately that's the system I'm least familiar with for testing, and it's even harder to submit bug reports for.

I'll let you know when we have the detection in place and in a release. Should be less than a week.

@bradh352

bradh352 commented Nov 8, 2024

BTW, for clarification for someone reading this in the future, we sort of took over this ticket. This issue has nothing to do with the original issue reported.

@bradh352

bradh352 commented Nov 8, 2024

In theory the original issue reported by @niconoe- should be worked around in c-ares/c-ares@765d558

bradh352 added a commit to c-ares/c-ares that referenced this issue Nov 9, 2024
Some DNS servers don't properly ignore unknown EDNS options as the spec says they must, and instead will return EFORMERR.

See discussion roughly starting here: alpinelinux/docker-alpine#366 (comment)

In this case the DNS server is known to support EDNS in general (as versions prior to c-ares 1.33, which used EDNS, worked), but when adding the EDNS DNS Cookie extension, they return EFORMERR.  This is in violation of [RFC6891 6.1.2](https://datatracker.ietf.org/doc/html/rfc6891#section-6.1.2):
> Any OPTION-CODE values not understood by a responder or requestor MUST be ignored.

The server in this example actually echoes back the EDNS record, further causing confusion that makes you think they might understand the record.

We need to catch an EFORMERR and re-attempt the query without EDNS completely since they are really non-compliant with EDNS.  We may support additional EDNS extensions in the future and don't want to have to probe each individual extension with a braindead server.

Fixes #911
Authored-By: Brad House (@bradh352)
@bradh352

@remiville please try c-ares 1.34.3 which is now available in edge, hopefully it fixes your issue

@remiville

Thanks, I should find some time tomorrow to do a test.

@remiville

remiville commented Nov 13, 2024

@bradh352 it is working! Thanks a lot!

# apk list | grep c-ares | grep installed
c-ares-1.34.3-r0 x86_64 {c-ares} (MIT) [installed]
c-ares-utils-1.34.2-r1 x86_64 {c-ares} (MIT) [installed]
13c768a71b20:/# curl -o /tmp/test.txt -Ssl "https://archive.apache.org/dist/tomcat/tomcat-8/v8.0.1/KEYS" && echo "OK"
OK

@bradh352

@remiville great. Please do try to ask your IT dept which DNS server is responding there. I'd really like to inform the vendor.
