This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

Kube UI not working #1367

Closed
andreimc opened this issue Apr 20, 2016 · 15 comments

@andreimc
Contributor

I get this when I spin up a new cluster.

Internal Server Error (500)

Get https://10.254.0.1:443/api/v1/replicationcontrollers: dial tcp 10.254.0.1:443: getsockopt: connection refused

[Screenshot: Kube UI showing the Internal Server Error (500)]

- Ansible version: 1.9.4
- Python version: 2.7.9
- Git commit hash or branch: a86bf60
- Cloud Environment: Vagrant
- Terraform version: 0.6.4.11
@ryane
Contributor

ryane commented Apr 21, 2016

I am also having a problem with the UI on AWS. But, instead of the 500 error, I am getting a 504 Gateway Time-out error from the nginx proxy.

@andreimc
Contributor Author

@ryane I also get that. I think it has something to do with the kubeworker: sometimes it comes up, other times it doesn't.

@ryane ryane modified the milestone: 1.1 Apr 22, 2016
@ryane
Contributor

ryane commented Apr 22, 2016

This appears to be the source of the problem (at least on AWS):

[Service]
ExecStart=/usr/bin/kubelet \
  --api-servers=http://localhost:8085 \
  --register-schedulable=false \
  --hostname-override={{ ansible_hostname }} \
  ...

On AWS, ansible_hostname ends up being something like ip-10-1-1-31. But, this is not resolvable within the cluster:

$ cat /etc/hostname
ip-10-1-1-31.ec2.internal

$ host ip-10-1-1-31.ec2.internal
ip-10-1-1-31.ec2.internal has address 10.1.1.31

$ host ip-10-1-1-31
Host ip-10-1-1-31 not found: 3(NXDOMAIN)
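
For anyone comparing the two variables mentioned here, a quick hypothetical debug playbook (illustrative only, not from this repo) shows the difference: `ansible_hostname` is the short gathered hostname (`ip-10-1-1-31` on AWS), while `inventory_hostname` is whatever name the host has in the Ansible inventory.

```yaml
# Hypothetical sketch: print both variables side by side on every host.
# ansible_hostname comes from gathered facts (the OS short hostname);
# inventory_hostname is the name Ansible addresses the host by.
- hosts: all
  tasks:
    - debug:
        msg: "ansible_hostname={{ ansible_hostname }} inventory_hostname={{ inventory_hostname }}"
```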

Open questions:

  1. Do we need the hostname-override option? What hostname is used without it? If it is needed, is there a safer variable (inventory_hostname, maybe?) that we can use instead of ansible_hostname?

  2. Why isn't ip-10-1-1-31 resolvable in DNS?

    $ sudo yum list installed mantl-dns -q
    Installed Packages
    mantl-dns.x86_64    1.1.0-1.centos    @asteris-mantl-rpm
    
    $ cat /etc/resolv.conf.mantl-dns
    # this config installed by mantl-dns
    
    # search is added here so that the system can address nodes by their simplest
    # name (x.node.consul becomes x.node or simply x.) This will work for service
    # addressing as well, so you can use y.service instead of y.service.consul
    options ndots:2
    search node.consul consul
    nameserver 127.0.0.1
    
    $ cat /etc/resolv.conf.masq
    ; generated by /usr/sbin/dhclient-script
    search ec2.internal
    nameserver 10.1.0.2
    
  3. set persistent, friendly hostname #1374 resolves the problem without any other changes. And, perhaps might fix other things (elasticsearch-executor fails to load on a worker node. #1195). Is setting the hostname this way a good option? Will it cause other problems?
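
If `hostname-override` is kept, the change floated in question 1 would just swap the variable in the unit template. A sketch, assuming the remaining flags stay as in the existing unit (the trailing `...` stands for the flags elided above):

```ini
# Sketch of the proposed change: use inventory_hostname (the name Ansible
# manages the host by) instead of the gathered ansible_hostname, which on
# AWS is the unresolvable short form like ip-10-1-1-31.
[Service]
ExecStart=/usr/bin/kubelet \
  --api-servers=http://localhost:8085 \
  --register-schedulable=false \
  --hostname-override={{ inventory_hostname }} \
  ...
```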

ping @BrianHicks @Zogg @stevendborrelli

@BrianHicks
Contributor

The Kubernetes integration brings in some older code that the core team didn't write. It's entirely possible that I missed something when upgrading it. #1374 seems like the best solution; the hostname does need to be resolvable.

@ryane
Contributor

ryane commented Apr 22, 2016

I'm thinking we should also change the service file to use inventory_hostname so that it is consistent.

Any idea why DNS is not working for the internal AWS hostname?

@BrianHicks
Contributor

BrianHicks commented Apr 22, 2016 via email

@ryane
Contributor

ryane commented Apr 22, 2016

$ sudo /usr/sbin/dnsmasq -d
dnsmasq: started, version 2.66 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth
dnsmasq: using nameserver 127.0.0.1#153 for domain cluster.local
dnsmasq: using nameserver 127.0.0.1#8600 for domain consul
dnsmasq: reading /etc/resolv.conf.masq
dnsmasq: using nameserver 10.1.0.2#53
dnsmasq: using nameserver 127.0.0.1#153 for domain cluster.local
dnsmasq: using nameserver 127.0.0.1#8600 for domain consul
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: time 1461334116
dnsmasq: cache size 150, 0/0 cache insertions re-used unexpired cache entries.
dnsmasq: queries forwarded 2, queries answered locally 1
dnsmasq: server 127.0.0.1#8600: queries sent 2, retried or failed 0
dnsmasq: server 127.0.0.1#153: queries sent 0, retried or failed 0
dnsmasq: server 10.1.0.2#53: queries sent 0, retried or failed 0

Is it that queries are not getting forwarded to the internal dns servers (only consul) for some reason?

On GCE, I see something like this in the logs:

dnsmasq: server 127.0.0.1#8600: queries sent 3, retried or failed 0
dnsmasq: server 127.0.0.1#153: queries sent 0, retried or failed 0
dnsmasq: server 169.254.169.254#53: queries sent 12, retried or failed 0

@BrianHicks
Contributor

Is that what you see when you try to resolve {ip}.ec2.internal? IIRC debugging mode logs every query, its results, and where they came from. Did you try that?

@ryane
Contributor

ryane commented Apr 22, 2016

RE: #1346 (comment)

ah! My cluster is down now but I'll try that in a bit.

@ryane
Contributor

ryane commented Apr 22, 2016

$ sudo /usr/sbin/dnsmasq -d -q
dnsmasq: started, version 2.66 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt DBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth
dnsmasq: using nameserver 127.0.0.1#153 for domain cluster.local
dnsmasq: using nameserver 127.0.0.1#8600 for domain consul
dnsmasq: reading /etc/resolv.conf.masq
dnsmasq: using nameserver 10.1.0.2#53
dnsmasq: using nameserver 127.0.0.1#153 for domain cluster.local
dnsmasq: using nameserver 127.0.0.1#8600 for domain consul
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: read /etc/hosts - 12 addresses
dnsmasq: query[A] ip-10-1-1-125.node.consul from 127.0.0.1
dnsmasq: forwarded ip-10-1-1-125.node.consul to 127.0.0.1
dnsmasq: query[A] ip-10-1-1-125.consul from 127.0.0.1
dnsmasq: forwarded ip-10-1-1-125.consul to 127.0.0.1
dnsmasq: query[A] ip-10-1-1-125 from 127.0.0.1
dnsmasq: config ip-10-1-1-125 is NODATA-IPv4
dnsmasq: query[AAAA] ip-10-1-1-125 from 127.0.0.1
dnsmasq: config ip-10-1-1-125 is NODATA-IPv6
dnsmasq: query[MX] ip-10-1-1-125 from 127.0.0.1
dnsmasq: forwarded ip-10-1-1-125 to 10.1.0.2
dnsmasq: time 1461340612
dnsmasq: cache size 150, 0/0 cache insertions re-used unexpired cache entries.
dnsmasq: queries forwarded 3, queries answered locally 2
dnsmasq: server 127.0.0.1#8600: queries sent 2, retried or failed 0
dnsmasq: server 127.0.0.1#153: queries sent 0, retried or failed 0
dnsmasq: server 10.1.0.2#53: queries sent 1, retried or failed 0
dnsmasq: Host                                     Address                        Flags     Expires
dnsmasq: localhost.localdomain                    ::1                            6F I   H
dnsmasq: localhost.localdomain                    127.0.0.1                      4F I   H
dnsmasq: resching-aws-control-03                  10.1.3.152                     4FRI   H
dnsmasq: resching-aws-kubeworker-001              10.1.1.125                     4FRI   H
dnsmasq: resching-aws-control-02                  10.1.2.170                     4FRI   H
dnsmasq: resching-aws-control-01                  10.1.1.204                     4FRI   H
dnsmasq: resching-aws-worker-002                  10.1.2.208                     4FRI   H
dnsmasq: resching-aws-worker-003                  10.1.3.240                     4FRI   H
dnsmasq: resching-aws-worker-001                  10.1.1.229                     4FRI   H
dnsmasq: resching-aws-worker-004                  10.1.1.36                      4FRI   H
dnsmasq: localhost6.localdomain6                  ::1                            6F I   H
dnsmasq: localhost4                               127.0.0.1                      4F I   H
dnsmasq: localhost                                ::1                            6FRI   H
dnsmasq: localhost                                127.0.0.1                      4FRI   H
dnsmasq: resching-aws-edge-01                     10.1.1.232                     4FRI   H
dnsmasq: localhost6                               ::1                            6F I   H
dnsmasq: localhost4.localdomain4                  127.0.0.1                      4F I   H
dnsmasq: resching-aws-kubeworker-002              10.1.2.243                     4FRI   H
$ dig ip-10-1-1-125 @10.1.0.2

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> ip-10-1-1-125 @10.1.0.2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 21487
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-1-1-125.                 IN      A

;; AUTHORITY SECTION:
.                       49      IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2016042200 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 10.1.0.2#53(10.1.0.2)
;; WHEN: Fri Apr 22 16:00:24 UTC 2016
;; MSG SIZE  rcvd: 117

$ dig ip-10-1-1-125.ec2.internal @10.1.0.2

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> ip-10-1-1-125.ec2.internal @10.1.0.2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59043
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ip-10-1-1-125.ec2.internal.    IN      A

;; ANSWER SECTION:
ip-10-1-1-125.ec2.internal. 20  IN      A       10.1.1.125

;; Query time: 1 msec
;; SERVER: 10.1.0.2#53(10.1.0.2)
;; WHEN: Fri Apr 22 16:00:33 UTC 2016
;; MSG SIZE  rcvd: 71

I guess the internal AWS nameservers just don't resolve the short ip-10-1-1-125 hostnames. Do you interpret that the same way?

@BrianHicks
Contributor

The nameservers don't resolve it because dig doesn't respect search paths. And in this case, even with +search dig would be reading from /etc/resolv.conf instead of /etc/resolv.conf.masq so it's not a fair test.

I do see that dnsmasq forwarded the IP to the upstream nameserver in the log:

dnsmasq: forwarded ip-10-1-1-125 to 10.1.0.2

But it doesn't have any information about how it applied the search paths, which are clearly not being applied if the query is failing.

This is turning into two different issues at this point. We need to change the Kubelet hostname to the inventory name instead of the IP, but we also need to fix this search config in dnsmasq.
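
The expansion order in the query log above (node.consul, then consul, then the bare name) is what a stub resolver would generate from the mantl-dns search list, and ec2.internal is never tried because it isn't in that list. A rough sketch emulating glibc-style search expansion with the values from /etc/resolv.conf.mantl-dns (the logic, not the real resolver code):

```shell
#!/bin/sh
# Sketch: emulate glibc-style search-list expansion using the values from
# /etc/resolv.conf.mantl-dns ("options ndots:2", "search node.consul consul").
# A name with fewer than ndots dots is tried against each search domain
# first and only as-is at the end, matching the dnsmasq query log above.
ndots=2
search_list="node.consul consul"

expand() {
  name=$1
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name"                    # enough dots: try the literal name first
  fi
  for domain in $search_list; do
    echo "$name.$domain"            # then each search domain in order
  done
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name"                    # literal name only as a last resort
  fi
}

expand ip-10-1-1-125
```

Since ec2.internal only appears in /etc/resolv.conf.masq (which dnsmasq reads for upstream servers, not for client search paths), the short AWS names never get the ec2.internal suffix appended; that is the search-config gap being split out into its own issue.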

@ryane
Contributor

ryane commented Apr 22, 2016

ah, makes sense.

We already have #1376 for the kubelet hostname, which can hopefully close out this issue (although the AWS problem looks to be slightly different from the problem @andreimc is experiencing on Vagrant). I opened #1377 for the dnsmasq issue.

@andreimc
Contributor Author

@ryane I am happy for you to close this issue. I think that if the search configuration is updated, it will also work in Vagrant.

@stevendborrelli
Contributor

After applying #1374, I can bring up that endpoint.

@ryane
Contributor

ryane commented May 3, 2016

The AWS problem is resolved in #1374. The DNS fix is tracked in #1377. Kubernetes-on-Vagrant issues are being tracked in #1365. Going to close this one.
