This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Readiness probe failing due to no cluster leader #251

Closed
lkysow opened this issue Oct 15, 2019 · 15 comments
Labels
question Further information is requested

Comments

@lkysow
Member

lkysow commented Oct 15, 2019

@lkysow I need to have this issue reopened. I'm still getting readiness probe failures on both the server and the client:
server:
Controlled By: StatefulSet/consul-consul-server
Containers:
consul:
Container ID: docker://a30ee732808461baae47884154aa981e7d572570d67ab5583c502103a103fa6b
Image: consul:1.6.0
Image ID: docker-pullable://consul@sha256:63e1a07260418ba05be08b6dc53f4a3bb95aa231cd53922f7b5b5ee5fd77ef3f
Ports: 8500/TCP, 8301/TCP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP

client:

Controlled By: DaemonSet/consul-consul
Containers:
consul:
Container ID: docker://fa0792b9918aa615ef364a129eb44b66fe45ef993aff247e9d9f9763ce988f84
Image: consul:1.6.0
Image ID: docker-pullable://consul@sha256:63e1a07260418ba05be08b6dc53f4a3bb95aa231cd53922f7b5b5ee5fd77ef3f
Ports: 8500/TCP, 8502/TCP, 8301/TCP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
Host Ports: 8500/TCP, 8502/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP

There is no port conflict on my VM. Why can't the pods get their ports exposed, and why do they keep logging the following:

2019/10/02 12:45:26 [INFO] serf: EventMemberJoin: k8stest-node1 10.233.106.45
2019/10/02 12:45:32 [ERR] agent: Coordinate update error: No cluster leader
2019/10/02 12:45:39 [ERR] agent: failed to sync remote state: No cluster leader

Originally posted by @HeshamAboElMagd in #169 (comment)

@lkysow
Member Author

lkysow commented Oct 15, 2019

@HeshamAboElMagd the readiness probes are failing because Consul can't elect a leader.
What are the outputs of kubectl get pods | grep consul, and what are the logs of each consul-server Pod?
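For anyone hitting this later, the requested diagnostics can be scripted. A minimal sketch, assuming the Helm release is named "consul" (so the server pods are consul-consul-server-0/1/2, as in this thread):

```shell
#!/usr/bin/env bash
# Sketch: pod status plus recent logs from each Consul server pod.
# Pod names assume this thread's release name "consul" (adjust as needed).
collect_consul_diagnostics() {
  kubectl get pods | grep consul
  for pod in consul-consul-server-0 consul-consul-server-1 consul-consul-server-2; do
    echo "===== logs: $pod ====="
    kubectl logs "$pod" --tail=50
  done
}

# Guarded so the sketch is harmless to run outside a cluster:
if command -v kubectl >/dev/null 2>&1; then
  collect_consul_diagnostics || true
fi
```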

@lkysow lkysow added question Further information is requested waiting-on-response Waiting on the issue creator for a response before taking further action labels Oct 15, 2019
@HeshamAboElMagd

HeshamAboElMagd commented Oct 16, 2019

@lkysow Thanks. Please find the output below:

k get pod | grep consul
consul-consul-hsqw9         0/1     Running   0          43s
consul-consul-server-0      0/1     Running   0          43s
consul-consul-server-1      0/1     Running   0          42s
consul-consul-server-2      0/1     Pending   0          42s
consul-consul-xh5bs         0/1     Running   0          43s

Logs below:

k logs -f consul-consul-server-0
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
           Version: 'v1.6.0'
           Node ID: '99cbd54f-bd22-8e5c-5c75-19226ecc9c5e'
         Node name: 'consul-consul-server-0'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.233.72.77 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2019/10/16 09:16:26 [INFO]  raft: Initial configuration (index=0): []
    2019/10/16 09:16:26 [INFO]  raft: Node at 10.233.72.77:8300 [Follower] entering Follower state (Leader: "")
    2019/10/16 09:16:26 [INFO] serf: EventMemberJoin: consul-consul-server-0.dc1 10.233.72.77
    2019/10/16 09:16:26 [INFO] serf: EventMemberJoin: consul-consul-server-0 10.233.72.77
    2019/10/16 09:16:26 [INFO] consul: Adding LAN server consul-consul-server-0 (Addr: tcp/10.233.72.77:8300) (DC: dc1)
    2019/10/16 09:16:26 [INFO] consul: Handled member-join event for server "consul-consul-server-0.dc1" in area "wan"
    2019/10/16 09:16:26 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2019/10/16 09:16:26 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2019/10/16 09:16:26 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2019/10/16 09:16:26 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s mdns os packet scaleway softlayer triton vsphere
    2019/10/16 09:16:26 [INFO] agent: Joining LAN cluster...
    2019/10/16 09:16:26 [INFO] agent: (LAN) joining: [consul-consul-server-0.consul-consul-server.multicloud.svc consul-consul-server-1.consul-consul-server.multicloud.svc consul-consul-server-2.consul-consul-server.multicloud.svc]
    2019/10/16 09:16:26 [INFO] agent: started state syncer
==> Consul agent running!
    2019/10/16 09:16:26 [WARN] memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.multicloud.svc: lookup consul-consul-server-0.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host
    2019/10/16 09:16:26 [INFO] serf: EventMemberJoin: consul-consul-server-1 10.233.106.52
    2019/10/16 09:16:26 [INFO] consul: Adding LAN server consul-consul-server-1 (Addr: tcp/10.233.106.52:8300) (DC: dc1)
    2019/10/16 09:16:26 [INFO] serf: EventMemberJoin: consul-consul-server-1.dc1 10.233.106.52
    2019/10/16 09:16:26 [INFO] consul: Handled member-join event for server "consul-consul-server-1.dc1" in area "wan"
    2019/10/16 09:16:26 [WARN] memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.multicloud.svc: lookup consul-consul-server-2.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host
    2019/10/16 09:16:26 [INFO] agent: (LAN) joined: 1
    2019/10/16 09:16:26 [INFO] agent: Join LAN completed. Synced with 1 initial agents
    2019/10/16 09:16:27 [INFO] serf: EventMemberJoin: k8stest-node2 10.233.72.76
==> Newer Consul version available: 1.6.1 (currently running: 1.6.0)
    2019/10/16 09:16:33 [ERR] agent: failed to sync remote state: No cluster leader
    2019/10/16 09:16:34 [WARN]  raft: no known peers, aborting election
    2019/10/16 09:16:53 [ERR] agent: Coordinate update error: No cluster leader
    2019/10/16 09:16:55 [INFO] serf: EventMemberJoin: k8stest-node1 10.233.106.51
    2019/10/16 09:17:01 [ERR] agent: failed to sync remote state: No cluster leader
    2019/10/16 09:17:17 [ERR] agent: Coordinate update error: No cluster leader
    2019/10/16 09:17:34 [ERR] agent: failed to sync remote state: No cluster leader
    2019/10/16 09:17:45 [ERR] agent: Coordinate update error: No cluster leader
    2019/10/16 09:17:56 [ERR] agent: failed to sync remote state: No cluster leader
    2019/10/16 09:18:08 [ERR] agent: Coordinate update error: No cluster leader
    2019/10/16 09:18:31 [ERR] agent: failed to sync remote state: No cluster leader
    2019/10/16 09:18:37 [ERR] agent: Coordinate update error: No cluster leader
    2019/10/16 09:18:54 [ERR] agent: failed to sync remote state: No cluster leader

@lkysow
Member Author

lkysow commented Oct 16, 2019

  • I see that consul-consul-server-2 is in Pending. Did it get to Running? Until all three servers are running, they won't be able to elect a leader, because they need 3.
  • If consul-consul-server-2 got to Running, can you send me all the Consul server logs again covering that time period?
  • Also, on each server, can you run consul operator raft list-peers?
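The "need 3" point is just Raft quorum arithmetic: a cluster of n servers requires floor(n/2)+1 votes to elect a leader, so with bootstrap_expect=3 no election can succeed until a third server joins. A quick illustration:

```shell
# Raft quorum: floor(n/2)+1 votes are required, so failure tolerance
# is n minus quorum. With n=3, losing (or never scheduling) one server
# still leaves a quorum of 2 -- but only once all 3 have joined.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "servers=$n quorum=$quorum failure_tolerance=$tolerance"
done
```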

@HeshamAboElMagd

Events:
Type Reason Age From Message


Warning FailedScheduling 76s (x5750 over 5d23h) default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.

Hey @lkysow, sorry for the late reply. Above are the events from the server-2 pod, and I only see that error on that pod (even non-Consul pods don't report it). Would you recommend anything I should look into in order to get it running?
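The FailedScheduling event is the root cause here: the chart's server pods use required pod anti-affinity (one server per node), so with one tainted master and two workers, only two of the three requested servers can ever schedule. A back-of-the-envelope check with this cluster's numbers:

```shell
# 3 nodes total, 1 master tainted NoSchedule, 3 server replicas requested.
# Required anti-affinity means each server pod needs its own untainted node.
nodes=3; tainted=1; servers=3
schedulable=$(( nodes - tainted ))
pending=$(( servers > schedulable ? servers - schedulable : 0 ))
echo "schedulable_nodes=$schedulable pods_stuck_pending=$pending"
```

So server-2 stays Pending forever, and without it the remaining two servers never reach bootstrap_expect=3.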

@HeshamAboElMagd

Warning FailedScheduling 6s (x5978 over 6d4h) default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.

@s3than
Contributor

s3than commented Oct 23, 2019

@HeshamAboElMagd @lkysow This may be related to issues #264 and #265. The easiest way to check is to output the server StatefulSet to YAML and confirm whether retry-join is in the command for the containers. I have PR #266 to resolve it, if that is the case.
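To make that check concrete: the rendered server command should carry one -retry-join flag per server. A sketch using an abridged, hypothetical command string as sample data (in a live cluster you would grep the real kubectl output instead, as noted in the comment):

```shell
# Abridged, hypothetical server command as the chart might render it.
# The real check: kubectl get statefulset consul-consul-server -o yaml | grep retry-join
cmd='consul agent -server -bootstrap-expect=3
  -retry-join=consul-consul-server-0.consul-consul-server.multicloud.svc
  -retry-join=consul-consul-server-1.consul-consul-server.multicloud.svc
  -retry-join=consul-consul-server-2.consul-consul-server.multicloud.svc'
joins=$(printf '%s\n' "$cmd" | grep -c 'retry-join')
echo "retry-join flags found: $joins"
```

If the flags are missing entirely, the servers can only discover each other via gossip from clients, which matches the slow/failed joins seen in the logs above.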

@HeshamAboElMagd

Hi @s3than. I did change the values.yaml file, making the server replicas and updatePartition values strings, but unfortunately I'm still facing the readiness problem.

Events:
Type Reason Age From Message


Normal Scheduled 5m49s default-scheduler Successfully assigned multicloud/consul-consul-server-0 to k8stest-node2
Normal Pulled 5m46s kubelet, k8stest-node2 Container image "consul:1.6.0" already present on machine
Normal Created 5m46s kubelet, k8stest-node2 Created container consul
Normal Started 5m45s kubelet, k8stest-node2 Started container consul
Warning Unhealthy 46s (x99 over 5m40s) kubelet, k8stest-node2 Readiness probe failed:

@s3than
Contributor

s3than commented Oct 23, 2019

What environment are you running your cluster in?

I noticed this

There is no port conflict on my VM. Why can't the pods get their ports exposed, and why do they keep logging the following:

Are you running it in minikube?

@HeshamAboElMagd

That comment from me was related to the closed issue #169.
My cluster is one master and two worker nodes running Ubuntu 18.04.3 LTS with Kubernetes v1.15.3.

@HeshamAboElMagd

@lkysow I was mistaken in running 3 replicas of the Consul server even though I only have two workers. Now I have two servers:

k get pod | grep consul
consul-consul-k55j9 1/1 Running 0 5m9s
consul-consul-sbct2 1/1 Running 0 5m9s
consul-consul-server-0 1/1 Running 0 5m9s
consul-consul-server-1 1/1 Running 0 5m9s

The result of consul operator raft list-peers:

Node ID Address State Voter RaftProtocol
consul-consul-server-1 1e4dc9a2-e459-17d9-c321-02a065c170d2 10.233.106.57:8300 leader true 3
consul-consul-server-0 53a1e74f-5954-b5c3-3b13-7154112b4f2d 10.233.72.84:8300 follower true 3
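A healthy result is exactly one leader among the voters. That can be checked mechanically; a sketch run against the output above, embedded as sample data (on a live server you would pipe consul operator raft list-peers instead):

```shell
# Sample list-peers output from this cluster
# (columns: Node  ID  Address  State  Voter  RaftProtocol).
peers='consul-consul-server-1 1e4dc9a2-e459-17d9-c321-02a065c170d2 10.233.106.57:8300 leader true 3
consul-consul-server-0 53a1e74f-5954-b5c3-3b13-7154112b4f2d 10.233.72.84:8300 follower true 3'

# Column 4 is the Raft state; count how many peers claim leadership.
leaders=$(printf '%s\n' "$peers" | awk '$4 == "leader" { n++ } END { print n+0 }')
echo "leaders=$leaders"
```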

@HeshamAboElMagd

Logs:
bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
bootstrap_expect > 0: expecting 2 servers
==> Starting Consul agent...
Version: 'v1.6.0'
Node ID: '1e4dc9a2-e459-17d9-c321-02a065c170d2'
Node name: 'consul-consul-server-1'
Datacenter: 'dc1' (Segment: '')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.233.106.57 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

2019/10/23 12:37:23 [INFO]  raft: Initial configuration (index=0): []
2019/10/23 12:37:23 [INFO]  raft: Node at 10.233.106.57:8300 [Follower] entering Follower state (Leader: "")
2019/10/23 12:37:23 [INFO] serf: EventMemberJoin: consul-consul-server-1.dc1 10.233.106.57
2019/10/23 12:37:23 [INFO] serf: EventMemberJoin: consul-consul-server-1 10.233.106.57
2019/10/23 12:37:23 [INFO] consul: Adding LAN server consul-consul-server-1 (Addr: tcp/10.233.106.57:8300) (DC: dc1)
2019/10/23 12:37:23 [INFO] consul: Handled member-join event for server "consul-consul-server-1.dc1" in area "wan"
2019/10/23 12:37:23 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2019/10/23 12:37:23 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2019/10/23 12:37:23 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
2019/10/23 12:37:23 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s mdns os packet scaleway softlayer triton vsphere
2019/10/23 12:37:23 [INFO] agent: Joining LAN cluster...
2019/10/23 12:37:23 [INFO] agent: (LAN) joining: [consul-consul-server-0.consul-consul-server.multicloud.svc consul-consul-server-1.consul-consul-server.multicloud.svc]
2019/10/23 12:37:23 [WARN] memberlist: Failed to resolve consul-consul-server-0.consul-consul-server.multicloud.svc: lookup consul-consul-server-0.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host
2019/10/23 12:37:23 [INFO] agent: started state syncer

==> Consul agent running!
2019/10/23 12:37:23 [WARN] memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.multicloud.svc: lookup consul-consul-server-1.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host
2019/10/23 12:37:23 [WARN] agent: (LAN) couldn't join: 0 Err: 2 errors occurred:
* Failed to resolve consul-consul-server-0.consul-consul-server.multicloud.svc: lookup consul-consul-server-0.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host
* Failed to resolve consul-consul-server-1.consul-consul-server.multicloud.svc: lookup consul-consul-server-1.consul-consul-server.multicloud.svc on 169.254.25.10:53: no such host

2019/10/23 12:37:23 [WARN] agent: Join LAN failed: <nil>, retrying in 30s

==> Newer Consul version available: 1.6.1 (currently running: 1.6.0)
2019/10/23 12:37:30 [WARN] raft: no known peers, aborting election
2019/10/23 12:37:30 [ERR] agent: failed to sync remote state: No cluster leader
2019/10/23 12:37:50 [ERR] agent: Coordinate update error: No cluster leader
2019/10/23 12:37:52 [INFO] serf: EventMemberJoin: consul-consul-server-0 10.233.72.84
2019/10/23 12:37:52 [INFO] consul: Adding LAN server consul-consul-server-0 (Addr: tcp/10.233.72.84:8300) (DC: dc1)
2019/10/23 12:37:52 [INFO] serf: EventMemberJoin: k8stest-node1 10.233.106.56
2019/10/23 12:37:52 [INFO] consul: Found expected number of peers, attempting bootstrap: 10.233.106.57:8300,10.233.72.84:8300
2019/10/23 12:37:52 [INFO] serf: EventMemberJoin: consul-consul-server-0.dc1 10.233.72.84
2019/10/23 12:37:52 [INFO] consul: Handled member-join event for server "consul-consul-server-0.dc1" in area "wan"
2019/10/23 12:37:53 [INFO] agent: (LAN) joining: [consul-consul-server-0.consul-consul-server.multicloud.svc consul-consul-server-1.consul-consul-server.multicloud.svc]
2019/10/23 12:37:53 [INFO] agent: (LAN) joined: 2
2019/10/23 12:37:53 [INFO] agent: Join LAN completed. Synced with 2 initial agents
2019/10/23 12:37:54 [INFO] serf: EventMemberJoin: k8stest-node2 10.233.72.83
2019/10/23 12:37:59 [WARN] raft: Heartbeat timeout from "" reached, starting election
2019/10/23 12:37:59 [INFO] raft: Node at 10.233.106.57:8300 [Candidate] entering Candidate state in term 2
2019/10/23 12:37:59 [INFO] raft: Election won. Tally: 2
2019/10/23 12:37:59 [INFO] raft: Node at 10.233.106.57:8300 [Leader] entering Leader state
2019/10/23 12:37:59 [INFO] raft: Added peer 53a1e74f-5954-b5c3-3b13-7154112b4f2d, starting replication
2019/10/23 12:37:59 [INFO] consul: cluster leadership acquired
2019/10/23 12:37:59 [INFO] consul: New leader elected: consul-consul-server-1
2019/10/23 12:37:59 [WARN] raft: AppendEntries to {Voter 53a1e74f-5954-b5c3-3b13-7154112b4f2d 10.233.72.84:8300} rejected, sending older logs (next: 1)
2019/10/23 12:37:59 [INFO] raft: pipelining replication to peer {Voter 53a1e74f-5954-b5c3-3b13-7154112b4f2d 10.233.72.84:8300}
2019/10/23 12:37:59 [INFO] agent: Synced node info
2019/10/23 12:37:59 [INFO] connect: initialized primary datacenter CA with provider "consul"
2019/10/23 12:37:59 [INFO] consul: member 'consul-consul-server-1' joined, marking health alive
2019/10/23 12:37:59 [INFO] consul: member 'consul-consul-server-0' joined, marking health alive
2019/10/23 12:37:59 [INFO] consul: member 'k8stest-node1' joined, marking health alive
2019/10/23 12:37:59 [INFO] consul: member 'k8stest-node2' joined, marking health alive


@s3than
Contributor

s3than commented Oct 23, 2019

I'll configure a cluster with 1 master and 2 workers and get back to you.

@lkysow
Member Author

lkysow commented Oct 25, 2019

@HeshamAboElMagd are you still having problems, then? It looks like the servers are up.

@HeshamAboElMagd

Yes, the servers are up now, the UI service is accessible, and I have integrated Vault successfully. I believe this issue can be closed.
Thanks for the help @lkysow @s3than

@lkysow
Member Author

lkysow commented Oct 28, 2019

Awesome! 🎉

@lkysow lkysow closed this as completed Oct 28, 2019
@lkysow lkysow removed the waiting-on-response Waiting on the issue creator for a response before taking further action label Oct 28, 2019

3 participants