dashboards-helper container's use of curl fails internal container name resolution when host has invalid DNS settings, prevents Malcolm initialization #499

Closed
mmguero opened this issue Jun 20, 2024 · 7 comments
Labels: bug (Something isn't working), docker (Relating to docker and docker-compose as used by Malcolm)

@mmguero
Collaborator

mmguero commented Jun 20, 2024

I've seen this twice now, once in a Malcolm instance installed via ISO and once on an Ubuntu system.

The symptom in both cases is that the dashboards-helper container can't resolve other containers by name while other containers can, e.g.:

$ docker compose exec dashboards-helper curl opensearch:9200
curl: (6) Could not resolve host: opensearch

$ docker compose exec dashboards curl opensearch:9200
{
  "name" : "opensearch",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "m9rhIkDISvWkX0rwfp9ZTQ",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.14.0",
    "build_type" : "tar",
    "build_hash" : "aaa555453f4713d652b52436874e11ba258d8f03",
    "build_date" : "2024-05-09T18:51:00.973564994Z",
    "build_snapshot" : false,
    "lucene_version" : "9.10.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

This is weird, because as far as I can tell nothing in docker-compose.yml configures networking for dashboards-helper differently from the other containers.

In the most recent place I saw it, I found that the host (external to Docker) had bogus DNS settings in /etc/resolv.conf. Clearing those up fixed it internally for dashboards-helper.

With that information, I need to see if we can reproduce this consistently and if there's anything to be done about it.
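
A minimal sketch for checking which nameservers the host is actually configured with, assuming a POSIX shell and awk on the Docker host. `list_nameservers` is a hypothetical helper for illustration and is not part of Malcolm.

```shell
#!/bin/sh
# Print the address from every active "nameserver" line in a
# resolv.conf-style file. Commented-out lines are skipped because their
# first field is "#" rather than "nameserver".
# Defaults to the host's /etc/resolv.conf; pass another path to inspect a copy.
list_nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example (run on the Docker host, not inside a container):
#   list_nameservers
```

Each address it prints can then be probed with a DNS client such as dig or nslookup to see whether anything answers there.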

@mmguero mmguero added bug Something isn't working docker Relating to docker and docker-compose as used by Malcolm labels Jun 20, 2024
@mmguero mmguero added this to the z.staging milestone Jun 20, 2024
@mmguero mmguero added this to Malcolm Jun 20, 2024
@mmguero mmguero moved this to Todo (investigate) in Malcolm Jun 27, 2024
@mmguero mmguero modified the milestones: z.staging, v24.08.0, v24.09.0 Jun 27, 2024
@blvrkr

blvrkr commented Jul 18, 2024

I have the same issue on Ubuntu 24.04 LTS host.

resolv.conf content (defaults):

nameserver 127.0.0.1:53
options edns0 trust-ad
search .

I think it's controlled by systemd-resolved.

@mmguero mmguero modified the milestones: v24.09.0, v24.07.0 Jul 18, 2024
@mmguero mmguero self-assigned this Jul 24, 2024
@mmguero mmguero modified the milestones: v24.07.0, v24.08.0 Jul 29, 2024
@mmguero mmguero removed their assignment Jul 31, 2024
@mmguero mmguero self-assigned this Aug 16, 2024
@mmguero mmguero moved this from Todo (investigate) to In Progress in Malcolm Aug 16, 2024
@idaholab idaholab deleted a comment from gustavoberman Aug 16, 2024
@mmguero
Collaborator Author

mmguero commented Aug 16, 2024

I've reproduced it in an Ubuntu 24.04 VM:

{
  "name" : "opensearch",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "r8q8n71CQgmz6nVOouWCLw",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.16.0",
    "build_type" : "tar",
    "build_hash" : "f84a26e76807ea67a69822c37b1a1d89e7177d9b",
    "build_date" : "2024-08-06T20:30:45.209655408Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
ubuntu@ubuntu-noble-176:~/Malcolm$ docker compose exec dashboards-helper curl opensearch:9200
curl: (6) Could not resolve host: opensearch

Steps to reproduce:

  1. Clone Malcolm from GitHub and install/configure with install.py/configure/auth_setup
  2. Remove the /etc/resolv.conf symlink and save a new file in its place with a nameserver entry pointing at an IP address that does not provide DNS
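
Step 2 can be sketched as a small shell helper. This is an illustration under assumptions, not the exact commands used here: `write_bogus_resolv_conf` is a hypothetical name, and 192.0.2.1 is chosen because it lies in TEST-NET-1 (RFC 5737), a range reserved for documentation that should not answer DNS queries.

```shell
#!/bin/sh
# Replace a resolv.conf file with one that points at a non-answering
# nameserver. TARGET defaults to /etc/resolv.conf, which needs root.
# The previous file or symlink is moved aside to TARGET.bak first.
write_bogus_resolv_conf() {
    target="${1:-/etc/resolv.conf}"
    # -e is false for a dangling symlink, so test -L as well
    if [ -e "$target" ] || [ -L "$target" ]; then
        mv "$target" "$target.bak" || return 1
    fi
    printf 'nameserver 192.0.2.1\n' > "$target"
}

# Example (as root, on a disposable VM):
#   write_bogus_resolv_conf
# To undo:
#   mv /etc/resolv.conf.bak /etc/resolv.conf
```

Moving the symlink aside rather than deleting it keeps the original systemd-resolved link intact for restoring afterwards.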

At first glance I don't see a difference between the working container (in this case, arkime) and the non-working one (dashboards-helper). Both have identical internal /etc/resolv.conf files generated by Docker. In the docker-compose.yml file both are set to the same default networks value, both depend on the opensearch container, and both have a hostname and a VIRTUAL_HOST environment variable defined.

I'm attaching the result of docker inspect on the two containers:

@mmguero
Collaborator Author

mmguero commented Aug 16, 2024

I don't see anything majorly different between the contents of those two JSON files, other than the obvious things like volumes and environment variables. One interesting difference, though: dashboards-helper is an Alpine-based container, while arkime is a Debian-based one.

@mmguero
Collaborator Author

mmguero commented Aug 16, 2024

Whoa... check this out:

ubuntu@ubuntu-noble-176:~/Malcolm$ docker compose exec dashboards-helper wget -q -O - opensearch:9200
{
  "name" : "opensearch",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "r8q8n71CQgmz6nVOouWCLw",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.16.0",
    "build_type" : "tar",
    "build_hash" : "f84a26e76807ea67a69822c37b1a1d89e7177d9b",
    "build_date" : "2024-08-06T20:30:45.209655408Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

ubuntu@ubuntu-noble-176:~/Malcolm$ docker compose exec dashboards-helper curl opensearch:9200
curl: (6) Could not resolve host: opensearch

It's not a general networking issue: wget works while curl doesn't ?!?

@mmguero
Collaborator Author

mmguero commented Aug 16, 2024

A message I posted on /r/docker on Reddit:

--

I'm scratching my head on this one and am hoping somebody smarter than me can help me pinpoint my issue.

I've got an application orchestrated with compose that has a number of containers. These containers use the regular default non-external bridged networking. Under normal circumstances, the containers have no issues resolving each other internally by name (e.g., curl opensearch:9200 from any of the containers works as it should, resolving the opensearch container and making the request).

This application can run in offline "air gapped" mode; in other words, the host not having any internet connectivity at all. When this is the case, even if the host's /etc/resolv.conf file is empty, containers can still resolve each other consistently internally, even though they naturally can't resolve something like example.com, as expected.

The weird behavior starts when I have something invalid in /etc/resolv.conf, like an IP address that does not respond to DNS. In this case, most of my containers can curl the opensearch container just fine, but one of my containers fails to do so. After a bit more comparison between my "working" and "broken" containers, I determined that the broken container was, in this case, Alpine 3.20-based, while the working containers seemed to be Debian-based.

Then things got really weird. I got a shell in the container, and lo-and-behold:

dashboards-helper:/# curl opensearch:9200
curl: (6) Could not resolve host: opensearch

dashboards-helper:/# wget -q -O - opensearch:9200
{
  "name" : "opensearch",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "r8q8n71CQgmz6nVOouWCLw",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.16.0",
    "build_type" : "tar",
    "build_hash" : "f84a26e76807ea67a69822c37b1a1d89e7177d9b",
    "build_date" : "2024-08-06T20:30:45.209655408Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

What? So it's not a general network issue after all... it's something specific to curl?

Investigating further:

dashboards-helper:/# curl --version
curl 8.9.0 (x86_64-alpine-linux-musl) libcurl/8.9.0 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 c-ares/1.28.1 libidn2/2.3.7 libpsl/0.21.5 nghttp2/1.62.1
Release-Date: 2024-07-24
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp ws wss
Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd

dashboards-helper:/# wget
BusyBox v1.36.1 (2024-06-10 07:11:47 UTC) multi-call binary.

Usage: wget [-cqS] [--spider] [-O FILE] [-o LOGFILE] [--header STR]
    [--post-data STR | --post-file FILE] [-Y on/off]
    [-P DIR] [-U AGENT] [-T SEC] URL...

Retrieve files via HTTP or FTP

    --spider    Only check URL existence: $? is 0 if exists
    --header STR    Add STR (of form 'header: value') to headers
    --post-data STR Send STR using POST method
    --post-file FILE    Send FILE using POST method
    -c      Continue retrieval of aborted transfer
    -q      Quiet
    -P DIR      Save to DIR (default .)
    -S          Show server response
    -T SEC      Network read timeout is SEC seconds
    -O FILE     Save to FILE ('-' for stdout)
    -o LOGFILE  Log messages to FILE
    -U STR      Use STR for User-Agent header
    -Y on/off   Use proxy

So, wget isn't actually wget, it's BusyBox doing what BusyBox does. Okay, that's cool. It looks like curl is musl-linked, which I suppose would be one difference between my Debian-based containers and this one.
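
One detail in the curl --version output above may be worth checking: the library list includes c-ares/1.28.1 and the feature list includes AsynchDNS, which suggests this curl build resolves names through c-ares rather than through the libc resolver that BusyBox wget presumably uses. Whether that accounts for the difference is not established in this thread. The sketch below, with the hypothetical helper name `curl_resolver_backend`, classifies a build from its version text so the working and broken containers can be compared.

```shell
#!/bin/sh
# Read `curl --version` output on stdin and report which name-resolver
# backend the build appears to use:
#   c-ares       the first line lists a c-ares library version
#   threaded     AsynchDNS is listed as a feature but c-ares is not
#   synchronous  neither is present
curl_resolver_backend() {
    awk '
        NR == 1 && /c-ares\// { backend = "c-ares" }
        /^Features:/ && /AsynchDNS/ && backend == "" { backend = "threaded" }
        END {
            if (backend == "") backend = "synchronous"
            print backend
        }
    '
}

# Example (-T disables TTY allocation so the output pipes cleanly):
#   docker compose exec -T dashboards-helper curl --version | curl_resolver_backend
#   docker compose exec -T arkime curl --version | curl_resolver_backend
```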

I did a docker inspect on the working vs. non-working containers, and other than the obvious stuff like environment variables, volumes, etc., I didn't see any differences worth mentioning.

Any ideas on what's going on here? I know the "fix" is for users to not have bogus stuff in their /etc/resolv.conf whether or not it's going to be used, but I'd really like to figure out why this behavior is different and what I can do to work around it.

@mmguero mmguero changed the title from "invalid (?) DNS on host can interfere with container resolution in dashboards-helper container" to "dashboards-helper container's use of curl fails internal container name resolution when host has invalid DNS settings, prevents Malcolm initialization" Aug 16, 2024
@mmguero
Collaborator Author

mmguero commented Aug 16, 2024

As a workaround, I suppose I can just rebase the container off of debian instead of alpine.
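
For comparison, a different workaround that keeps the Alpine base would be to fall back to wget when curl fails to resolve the host. This is only a sketch under assumptions and is not what was done for this issue: `fetch_with_fallback` is a hypothetical helper, and it relies on curl's documented exit code 6 ("could not resolve host") and on the BusyBox wget flags shown earlier in the thread.

```shell
#!/bin/sh
# Fetch a URL with curl and write the body to stdout. If curl exits
# with 6 (could not resolve host), retry once with wget. Any other
# curl failure is returned unchanged.
fetch_with_fallback() {
    url="$1"
    curl -sS "$url" && return 0
    rc=$?
    if [ "$rc" -eq 6 ]; then
        wget -q -O - "$url"
        return $?
    fi
    return "$rc"
}

# Example (inside the container):
#   fetch_with_fallback http://opensearch:9200
```

A fallback like this only papers over the resolver difference, so scripts that need curl-specific options would still have to be adapted.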

@mmguero mmguero moved this from In Progress to Testing in Malcolm Aug 19, 2024
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Aug 19, 2024
…bian:12-slim to avoid weirdness with alpine's busybox curl
@mmguero mmguero moved this from Testing to Done in Malcolm Aug 19, 2024
@mmguero
Collaborator Author

mmguero commented Aug 19, 2024

I've converted the container to use Debian, and all seems to be working fine. The container size difference is negligible (< 10 MB).

@mmguero mmguero closed this as completed Aug 19, 2024
This was referenced Aug 20, 2024
@mmguero mmguero moved this from Done to Review in Malcolm Aug 27, 2024
@mmguero mmguero moved this from Review to Released in Malcolm Aug 27, 2024
Projects
Status: Released
2 participants