Question - DNS Resolution from Consul inside of Mesh Network #8343

Closed
idrennanvmware opened this issue Jul 2, 2020 · 14 comments

@idrennanvmware
Contributor

This may have an obvious answer, but we have been stumped getting the mesh network and Consul DNS resolution to work together.

Here's the scenario:
We have 3 VMs (each running Nomad and Consul):
192.168.50.91
192.168.50.92
192.168.50.93

Originally we had 3 services, all running in Docker with host networking:
fake-service-service1-api
fake-service-service1-backend
fake-service-service2

A call to fake-service-service1-api resulted in:
api->backend->service2

From an SSH session on each machine, the following dig worked fine:

dig fake-service-service2.service.consul

If we logged in to each container, that same command worked as well (since we were on the host network).

That all worked great, we got answer sections, and service teams were happy.

Now we are moving service1 (api and backend) to the mesh with bridge networking. We are accepting and routing calls to the API no problem, and the api happily talks to the backend over the mesh - but the problem we are having is that the backend service, which uses Consul DNS for "fake-service-service2.service.consul", can no longer resolve that name because of the bridge network.
Is there a way to get the container, now running on the bridge network, to resolve that name from the host it's running on?

Thanks!


mocofound commented Jul 2, 2020

Does this network/dns stanza help? For example, try replacing internal.corp with the consul TLD and pointing the servers at your Consul IP(s) for resolution.

#7661

network {
  dns {
    servers = ["10.0.0.1", "10.0.0.1"]
    searches = ["internal.corp"]
    options = ["ndots:2"]
  }
}
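
For instance, a hedged adaptation of that stanza for the scenario above might look something like this (the group name is hypothetical, and it assumes the Consul agents have been configured to serve DNS on port 53, since the servers list can't carry a port and Consul's DNS defaults to 8600):

group "service1" {
  network {
    mode = "bridge"

    dns {
      # hypothetical: the Consul agents on the three hosts from the example above;
      # non-.consul names would need a public resolver appended here as well
      servers  = ["192.168.50.91", "192.168.50.92", "192.168.50.93"]
      searches = ["service.consul"]
      options  = ["ndots:2"]
    }
  }
}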

@idrennanvmware
Contributor Author

Thanks @mocofound - I have been playing with that stanza, but to no avail. The thing is, the Consul agent is running on the VM (say 192.168.50.91), but the Docker container on the bridge has no way (that I have been able to find) to do DNS against the host VM (not the Docker IP).

Since I'm unable to even ping 192.168.50.91 from within the container, I don't seem to have any way of going back down the chain.

BTW, this is what I tried, with no luck:

      config {
        image        = "nicholasjackson/fake-service:v0.12.0"
        dns_servers  = [ "${attr.unique.network.ip-address}", "8.8.8.8" ]
      }

If I look at /etc/resolv.conf I see the 2 values I expect there, but since I can't even hit the VM from the container - it's no dice :(

idrennanvmware changed the title from "Question - DNS Resolution from Consul inside of Bridge or Mesh Network" to "Question - DNS Resolution from Consul inside of Mesh Network" on Jul 2, 2020

mocofound commented Jul 2, 2020

If I'm reading this right, you're kind of trying to use Nomad to access Consul DNS 'indirectly' by relying on the host DNS and going up and down the stack. I want to point out that Nomad has some native Consul integrations via the template stanza, which is built on consul-template. More info is in Nomad's template stanza docs.

This example would pull the address of the fake-service-service2 service tagged with v2 in Consul, write it to a file, and populate it as an environment variable inside your container. Your task could then read the BACKEND_LOCATION environment variable to connect.

template {
  data = <<EOH
# Lines starting with a # are ignored

# Empty lines are also ignored
# take the address and port of the first healthy instance
BACKEND_LOCATION="{{ with service "v2.fake-service-service2" }}{{ with index . 0 }}{{ .Address }}:{{ .Port }}{{ end }}{{ end }}"
EOH

  destination = "secrets/file.env"
  env         = true
}

This is discussed in more detail in this other issue: #8137 (comment)


DhashS commented Jul 3, 2020

Same issue here! We'd really like Consul DNS to work inside a Nomad-created network namespace. All our code depends on Consul DNS working, especially with Connect native integration, since we're using dynamic upstreams - so we don't know the list of service names ahead of time and can't use a template.

@idrennanvmware
Contributor Author

Thanks @mocofound - we actually use templating quite a bit for resolving secrets, nodes, etc. in our jobs. We could ask teams to use a template like you proposed, but that opens up a bunch of other things we're trying to avoid (services restarting as other services move around, watchers on internal files, etc.).

I am starting to "see" the value of ingress and terminating gateways as we move down this path - it's just a new journey for us, so there's a bit of trial and error (which I'd like to avoid as much as possible).


alexhulbert commented Jul 9, 2020

An alternative solution, if you're using a Linux-based container with Consul Connect, is to add a template stanza like

template {
  destination = "local/resolv.conf"
  data = <<EOF
nameserver {{ env "attr.unique.network.ip-address" }}
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF
}

and then add

volumes = [
  "local/resolv.conf:/etc/resolv.conf"
]

to the task config stanza

@idrennanvmware
Contributor Author

Hey @alexhulbert - thanks for that approach - we've also been toying with this:

      driver = "docker"

      config {
        image        = "<IMAGE>"
        dns_servers  = [ "127.0.0.1", "${attr.unique.network.ip-address}", "8.8.8.8" ]
      }

Looks like that had the same result as dropping in the resolv.conf too. Looks like we're both thinking along similar lines.

Thanks!


DhashS commented Aug 24, 2020

Worth noting: since Docker scoops up your host's /etc/resolv.conf and then strips out 127.0.0.1 entries, the (ideal) solution of making Docker pass DNS through while Consul is bound only to localhost doesn't really work.

However, if you're doing something like @idrennanvmware or @alexhulbert, where Consul is serving DNS on "${attr.unique.network.ip-address}", then sticking the result of that into your /etc/resolv.conf is a solution that doesn't require any changes in the HCL. This kinda works, modulo the mess that is DHCP and which process actually owns /etc/resolv.conf (distro dependent) - but if you're fully static, or have a script that initializes /etc/resolv.conf to also contain the IP that Consul DNS is serving on, then this works as well.
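
For example, a host /etc/resolv.conf along these lines (hypothetical addresses, assuming the Consul agent serves DNS on port 53 on the host's routable IP) survives Docker's copy, since nothing in it points at 127.0.0.1:

# Consul DNS on the host's routable address (hypothetical)
nameserver 192.168.50.91
# public fallback for everything else
nameserver 8.8.8.8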


jhitt25 commented Dec 17, 2020

This is also highly problematic for the java and exec drivers when attempting to leverage dnsmasq on the nodes (to merge Consul DNS with public DNS). You get absolutely no DNS resolution, as there are no public DNS servers listed in /etc/resolv.conf. You also cannot use an alternate local IP, as "${attr.unique.network.ip-address}" is unusable in the task group network block (tested in Nomad 0.12.8 and 1.0.0).


DhashS commented Dec 23, 2020

Oh, beautiful. You can do this, which translates to adding

dns {
  servers = []
  options = []
  searches = []
}

to silently copy in the host /etc/resolv.conf, which our software owns.
This works for at least the java driver; for docker we had to finagle it a different way.


jhitt25 commented Dec 23, 2020

Using system DNS is the entire problem. If the node's DNS points to localhost, the mesh network can't see it. I have also verified that even hardcoding an alternate local IP address will not work, as again, the mesh network cannot see it. The only possible option is exposing Consul DNS to your internal network (if that's an option in your environment).


tgross commented Apr 16, 2021

Cross-linking #8900 which may have a related underlying cause.

tgross self-assigned this Jun 4, 2021

tgross commented Jun 17, 2021

Hi folks, I wanted to circle back to this issue as there have been a couple of improvements here since it was opened, and I think we've got a workable situation.

I'm going to provide an example configuration one could use to expose Consul DNS, but we also have #10665 and #10705 open for further enhancements, and I'm sure my colleague @jrasell would love to hear your thoughts there.

This example uses systemd-resolved as a stub resolver for the host, forwarding to Consul DNS. It follows the Consul Learn Guide, and should map reasonably well to other setups like dnsmasq.

Starting with the Vagrantfile at the root of this repo, I've got the following Nomad configuration for Consul:

# for DNS with systemd-resolved
consul {
  address = "10.0.2.15:8500"
}

Consul configuration:

ui               = true
bootstrap_expect = 1
server           = true
log_level        = "DEBUG"
data_dir         = "/var/consul/data"

bind_addr      = "10.0.2.15"
client_addr    = "10.0.2.15"
advertise_addr = "127.0.0.1"

connect = {
  enabled = true
}

ports = {
  dns  = 53
  grpc = 8502
}

My systemd-resolved configuration at /etc/systemd/resolved.conf:

[Resolve]
DNS=10.0.2.15
Domains=~consul.

Resulting resolv.conf on the host:

$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
...
nameserver 127.0.0.53
options edns0
search fios-router.home

Let's verify Consul DNS is working (from the host) as expected by querying it directly for the Nomad client service:

$ dig @10.0.2.15 -p 53 nomad-client.service.dc1.consul ANY
...
;; ANSWER SECTION:
nomad-client.service.dc1.consul. 0 IN   A       10.0.2.15

;; Query time: 5 msec
;; SERVER: 10.0.2.15#53(10.0.2.15)
;; WHEN: Fri Jun 04 13:56:05 UTC 2021
;; MSG SIZE  rcvd: 76

And then verify we have our stub resolver forwarding configured correctly by looking up the same service via the usual path:

$ dig nomad-client.service.dc1.consul
...
;; ANSWER SECTION:
nomad-client.service.dc1.consul. 0 IN   A       10.0.2.15

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Jun 04 14:00:40 UTC 2021
;; MSG SIZE  rcvd: 76


$ dig example.com
...
;; ANSWER SECTION:
example.com.            7187    IN      A       93.184.216.34

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Jun 04 14:06:03 UTC 2021
;; MSG SIZE  rcvd: 56

Ok, now let's run a Nomad Connect job. This job has two groups so that we can verify cross-allocation traffic. The 0x74696d/dnstools container is built from the following Dockerfile so that we can do some simple tests:

FROM debian:buster

RUN apt-get update -qq \
    && apt-get install --no-install-recommends -y \
    dnsutils \
    curl \
    busybox \
    && rm -rf /var/lib/apt/lists/*

And the jobspec:

job "example" {
  datacenters = ["dc1"]

  group "server" {

    network {
      mode = "bridge"
    }

    service {
      name = "www"
      port = "8001"
      connect {
        sidecar_service {}
      }
    }

    task "task" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }
    }
  }


  group "client" {

    network {
      mode = "bridge"
    }

    service {
      name = "client"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "www"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "task" {
      driver = "docker"

      config {
        image   = "0x74696d/dnstools"
        command = "/bin/sh"

        # of course we can just use localhost here as well, but
        # this demonstrates we have working DNS!
        args = ["-c", "sleep 5; while true; do curl -v http://www.service.dc1.consul:8080 ; sleep 10; done"]
      }

    }
  }

}

Run the job:

$ nomad job run ./example.nomad
==> 2021-06-17T14:22:29Z: Monitoring evaluation "65cc7583"
    2021-06-17T14:22:29Z: Evaluation triggered by job "example"
==> 2021-06-17T14:22:30Z: Monitoring evaluation "65cc7583"
    2021-06-17T14:22:30Z: Evaluation within deployment: "3e6ef5e9"
    2021-06-17T14:22:30Z: Allocation "5c39e3fc" created: node "808c50bf", group "client"
    2021-06-17T14:22:30Z: Allocation "86ba2c25" created: node "808c50bf", group "server"
    2021-06-17T14:22:30Z: Evaluation status changed: "pending" -> "complete"
==> 2021-06-17T14:22:30Z: Evaluation "65cc7583" finished with status "complete"
==> 2021-06-17T14:22:30Z: Monitoring deployment "3e6ef5e9"
  ⠙ Deployment "3e6ef5e9" successful

    2021-06-17T14:22:43Z
    ID          = 3e6ef5e9
    Job ID      = example
    Job Version = 2
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    client      1        1       1        0          2021-06-17T14:32:42Z
    server      1        1       1        0          2021-06-17T14:32:41Z

See that we're successfully querying Consul DNS for the Connect endpoint (even though it's just localhost here) and that Connect is working:

$ nomad alloc logs -stderr -task task 5c3
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x559caa168520)
* Connected to www.service.dc1.consul (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: www.service.dc1.consul:8080
> User-Agent: curl/7.64.0
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Date: Thu, 17 Jun 2021 14:22:47 GMT
< Connection: close
< Content-type: text/html
< Accept-Ranges: bytes
< Last-Modified: Thu, 17 Jun 2021 14:22:31 GMT
< Content-Length: 25
<
{ [25 bytes data]
100    25  100    25    0     0   2500      0 --:--:-- --:--:-- --:--:--  2500
* Closing connection 0

Let's curl something on the public internet from that same container:

$ nomad alloc exec -task task 5c3 curl -vso /dev/null example.com
...
*   Trying 93.184.216.34...
* TCP_NODELAY set
* Expire in 149993 ms for 3 (transfer 0x55a2c7b36520)
* Expire in 200 ms for 4 (transfer 0x55a2c7b36520)
* Connected to example.com (93.184.216.34) port 80 (#0)
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Age: 298623
< Cache-Control: max-age=604800
< Content-Type: text/html; charset=UTF-8
< Date: Thu, 17 Jun 2021 14:27:18 GMT
< Etag: "3147526947+gzip+ident"
< Expires: Thu, 24 Jun 2021 14:27:18 GMT
< Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
< Server: ECS (phd/FD6F)
< Vary: Accept-Encoding
< X-Cache: HIT
< Content-Length: 1256
<
{ [1108 bytes data]
* Connection #0 to host example.com left intact

Our /etc/resolv.conf inside the task looks like this:

$ nomad alloc exec -task task 5c3 cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 10.0.2.15
nameserver 10.0.2.3
search fios-router.home

So that all works. But note that if we use dig from that container, the results show we'll have something that works for glibc-based containers but probably not musl ones (e.g. Alpine), because the first nameserver we hit (the Consul agent) has no answer for non-.consul names and isn't recursive. This is where I'd recommend dnsmasq as a more sophisticated resolver (there's a sketch after the dig output below).

$ nomad alloc exec -task task 5c3 dig www.service.dc1.consul

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.service.dc1.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43123
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.service.dc1.consul.                IN      A

;; ANSWER SECTION:
www.service.dc1.consul. 0       IN      A       127.0.0.1

;; Query time: 0 msec
;; SERVER: 10.0.2.15#53(10.0.2.15)
;; WHEN: Thu Jun 17 14:26:08 UTC 2021
;; MSG SIZE  rcvd: 67

$ nomad alloc exec -task task 5c3 dig example.com

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 15919
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;example.com.                   IN      A

;; Query time: 0 msec
;; SERVER: 10.0.2.15#53(10.0.2.15)
;; WHEN: Thu Jun 17 14:25:48 UTC 2021
;; MSG SIZE  rcvd: 29

$ nomad alloc exec -task task 5c3 dig @10.0.2.3 example.com

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @10.0.2.3 example.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38721
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 0f83041e8eb04d8d234505b660cb5e2a708af8ffe4446395 (good)
;; QUESTION SECTION:
;example.com.                   IN      A

;; ANSWER SECTION:
example.com.            8085    IN      A       93.184.216.34

;; Query time: 11 msec
;; SERVER: 10.0.2.3#53(10.0.2.3)
;; WHEN: Thu Jun 17 14:37:30 UTC 2021
;; MSG SIZE  rcvd: 84
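
As a rough sketch of that dnsmasq recommendation (assuming dnsmasq runs on the host, listens on an address reachable from the bridge network, and the containers' resolv.conf points only at it), the forwarding rules might look like:

# /etc/dnsmasq.d/10-consul (hypothetical path)
# send *.consul queries to the local Consul agent's DNS port
server=/consul/10.0.2.15#53
# send everything else to the upstream resolver
server=10.0.2.3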

For more involved setups, we'll probably want to set the network.dns configuration, which landed back in #8600.

I'm going to close this specific issue as resolved, but feel free to open new issues or Discuss posts to talk through this some more.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 18, 2022