Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduled jobs stop execution after leader re-election #998

Closed
joshshn opened this issue Jun 24, 2021 · 19 comments · Fixed by #1053
Closed

Scheduled jobs stop execution after leader re-election #998

joshshn opened this issue Jun 24, 2021 · 19 comments · Fixed by #1053

Comments

@joshshn
Copy link

joshshn commented Jun 24, 2021

My cluster consists of 3 server node (Dkron 3.1.7):

[opc@nonprod-exa-services-0 ~]$ dkron version
Name: Dkron
Version: 3.1.7
Codename: merichuas
Agent Protocol: 5 (Understands back to: 2)

All nodes are configured as follows:

[opc@nonprod-exa-services-0 ~]$ cat /etc/dkron/dkron.yml
server: true
bootstrap-expect: 3
join:
- nonprod-exa-services-0.nonprodexasvc.nonprodvcn.oraclevcn.com
- nonprod-exa-services-1.nonprodexasvc.nonprodvcn.oraclevcn.com
- nonprod-exa-services-2.nonprodexasvc.nonprodvcn.oraclevcn.com
log-level: debug

This cluster has been running for ~3 weeks and able to execute scheduled jobs

[opc@nonprod-exa-services-0 ~]$ systemctl status dkron
● dkron.service - Dkron Agent
   Loaded: loaded (/usr/lib/systemd/system/dkron.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-06-02 10:19:43 CEST; 3 weeks 1 days ago
     Docs: https://dkron.io
 Main PID: 6940 (dkron)
    Tasks: 95
   Memory: 202.0M
   CGroup: /system.slice/dkron.service
           ├─6940 /usr/bin/dkron agent
           ├─6945 /usr/bin/dkron-processor-files
           ├─6952 /usr/bin/dkron-processor-fluent
           ├─7049 /usr/bin/dkron-processor-log
           ├─7056 /usr/bin/dkron-processor-syslog
           ├─7062 /usr/bin/dkron-executor-gcppubsub
           ├─7069 /usr/bin/dkron-executor-http
           ├─7076 /usr/bin/dkron-executor-kafka
           ├─7083 /usr/bin/dkron-executor-nats
           ├─7088 /usr/bin/dkron-executor-rabbitmq
           └─7096 /usr/bin/dkron-executor-shell

ISSUE Details:

One of the nodes is unable to contact the other 2 nodes (could be an intermittent network issue) causing a re-election. After re-election none of the scheduled jobs are executing

Log snippet from one of the nodes. The complete logs from the 3 nodes are attached.

o msg="2021/06/23 23:28:42 [DEBUG] memberlist: Initiating push/pull sync with: nonprod-exa-services-1 10.106.51.22:8946"
o msg="2021-06-23T23:28:44.338+0200 [WARN]  raft: failed to contact: server-id=nonprod-exa-services-0 time=500.962849ms"
o msg="2021-06-23T23:28:44.390+0200 [WARN]  raft: failed to contact: server-id=nonprod-exa-services-1 time=500.680357ms"
o msg="2021-06-23T23:28:44.390+0200 [WARN]  raft: failed to contact: server-id=nonprod-exa-services-0 time=552.570599ms"
o msg="2021-06-23T23:28:44.390+0200 [WARN]  raft: failed to contact quorum of nodes, stepping down"
o msg="2021-06-23T23:28:44.390+0200 [INFO]  raft: entering follower state: follower=\"Node at 10.106.51.29:6868 [Follower]\" leader="
o msg="2021-06-23T23:28:44.390+0200 [INFO]  raft: aborting pipeline replication: peer=\"{Voter nonprod-exa-services-1 10.106.51.22:6868}\""
o msg="2021-06-23T23:28:44.390+0200 [INFO]  raft: aborting pipeline replication: peer=\"{Voter nonprod-exa-services-0 10.106.51.18:6868}\""
ug msg="dkron: shutting down leader loop" node=nonprod-exa-services-2
ug msg="scheduler: Stopping scheduler" node=nonprod-exa-services-2
o msg="dkron: cluster leadership lost" node=nonprod-exa-services-2
o msg="dkron: monitoring leadership" node=nonprod-exa-services-2
o msg="2021-06-23T23:28:46.085+0200 [WARN]  raft: heartbeat timeout reached, starting election: last-leader="
o msg="2021-06-23T23:28:46.085+0200 [INFO]  raft: entering candidate state: node=\"Node at 10.106.51.29:6868 [Candidate]\" term=54"
o msg="2021-06-23T23:28:46.092+0200 [DEBUG] raft: votes: needed=2"
o msg="2021-06-23T23:28:46.092+0200 [DEBUG] raft: vote granted: from=nonprod-exa-services-2 term=54 tally=1"
o msg="2021-06-23T23:28:46.093+0200 [DEBUG] raft: vote granted: from=nonprod-exa-services-1 term=54 tally=2"
o msg="2021-06-23T23:28:46.093+0200 [INFO]  raft: election won: tally=2"
o msg="2021-06-23T23:28:46.094+0200 [INFO]  raft: entering leader state: leader=\"Node at 10.106.51.29:6868 [Leader]\""
o msg="2021-06-23T23:28:46.094+0200 [INFO]  raft: added peer, starting replication: peer=nonprod-exa-services-1"
o msg="2021-06-23T23:28:46.094+0200 [INFO]  raft: added peer, starting replication: peer=nonprod-exa-services-0"
o msg="dkron: cluster leadership acquired" node=nonprod-exa-services-2
o msg="dkron: monitoring leadership" node=nonprod-exa-services-2
o msg="2021-06-23T23:28:46.095+0200 [INFO]  raft: pipelining replication: peer=\"{Voter nonprod-exa-services-1 10.106.51.22:6868}\""
o msg="2021-06-23T23:28:46.097+0200 [INFO]  raft: pipelining replication: peer=\"{Voter nonprod-exa-services-0 10.106.51.18:6868}\""
o msg="agent: Starting scheduler" node=nonprod-exa-services-2
@joshshn
Copy link
Author

joshshn commented Jun 24, 2021

dkron_logs.zip

Logs Attached:

@paul7511
Copy link

paul7511 commented Aug 9, 2021

I encountered the same situation.
Leader stopping scheduler and vote leader again when can't find follower.
Leader Re-elected after the voting.
The scheduler restarts but many jobs are removed and added again.
Finally, the scheduler did not run properly.

This happens between 2021-08-03T02:34:36 and 2021-08-03T02:34:40
Attached log:
dkron_log_20210803.gz

@seanfulton
Copy link

seanfulton commented Dec 27, 2021

dkron.txt

I have the same issue on a three-node cluster. This is the third-time my dkron cluster has stopped functioning.

This is a pretty major issue; you promote this software as fault-tolerant and mission-critical, yet this issue has been known for months with no response or solution. I started using dkron as a pilot to see if it's worth going with the pro version, but my team is giving me pressure to find another solution because this is unreliable.

Does anyone have a solution or work-around?

We are using 3.1.10. Log file is attached. You can see that the node that failed (not the master) rejoins the cluster later, but no jobs run after that.

@vcastellm
Copy link
Member

vcastellm commented Jan 31, 2022

Could any of you specify if it's all jobs or some of them executes and some other don't?

The problem with this issue is that I have not been able to reproduce it, and that makes it pretty difficult to solve.

@seanfulton
Copy link

No new jobs are started after the election.

@vcastellm vcastellm reopened this Jan 31, 2022
@vcastellm
Copy link
Member

ok, build is currently in progress, please try with v3.1.11-beta1 when the build finish and let me know if that solves the issue.

@seanfulton
Copy link

We used the beta for a few months and did not have an issue. I upgraded to 3.1.11-1 and it has happened again. dcos02 did not fail, it just missed a heartbeat. You can see from the logs that it later comes back online. There is no leader re-election, which I take to mean it was not the leader at the time of failure. Now, 24-hours later, dcos01 is the leader.

This is a big problem. This program is core to our operations, and we can use it if it spontaneously stops working.

Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="store: Setting key" execution="executions:import-queue:1650639421031236137-nj-dcos01-cl01.onecount.net" finished
="0001-01-01 00:00:00 +0000 UTC" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="store: to detele key" execution=1650633421042188795-nj-dcos01-cl01.onecount.net job=import-queue node=nj-dcos01-
cl01.onecount.net
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="2022/04/22 10:57:01 shell: Command output " node=nj-dcos01-cl01.onecount.net subsystem_name=plugins.dkron-execut
or-shell
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="fsm: received command" command=4 node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="fsm: Setting execution" execution=1650639421031236137-nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.n
et output=
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="store: Retrieved job from datastore" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="store: Setting job" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:13 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:13-04:00" level=info msg="2022/04/22 10:57:13 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:8
946"
Apr 22 10:57:14 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:14-04:00" level=info msg="2022/04/22 10:57:14 [DEBUG] memberlist: Stream connection from=10.10.4.173:51640"
Apr 22 10:57:27 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:27-04:00" level=info msg="2022/04/22 10:57:27 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 22 10:57:27 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:27-04:00" level=info msg="2022/04/22 10:57:27 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:55376->10.10.4.172:8946: i/o tim
eout"
Apr 22 10:57:27 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:27-04:00" level=info msg="2022/04/22 10:57:27 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks received"
Apr 22 10:57:29 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:29-04:00" level=info msg="2022-04-22T10:57:29.675-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868 l
eader=10.10.4.173:6868"
Apr 22 10:57:29 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:29-04:00" level=info msg="2022/04/22 10:57:29 [WARN] memberlist: Refuting a suspect message (from: nj-dcos02-cl01.onecount.net)"
Apr 22 10:57:30 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:30-04:00" level=info msg="2022-04-22T10:57:30.967-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
Apr 22 10:57:43 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:43-04:00" level=info msg="2022/04/22 10:57:43 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:8
946"
Apr 22 10:57:44 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:44-04:00" level=info msg="2022/04/22 10:57:44 [DEBUG] memberlist: Stream connection from=10.10.4.173:51724"
Apr 22 10:57:53 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:53-04:00" level=debug msg="fsm: received command" command=4 node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:53 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:53-04:00" level=debug msg="fsm: Setting execution" execution=1650639000074730057-nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.n
et output=
Apr 22 10:57:53 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:53-04:00" level=debug msg="store: Retrieved job from datastore" job=import-scheduler node=nj-dcos01-cl01.onecount.net
Apr 22 10:57:53 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:53-04:00" level=debug msg="store: Setting job" job=import-scheduler node=nj-dcos01-cl01.onecount.net
Apr 22 10:58:13 nj-dcos01-cl01 dkron: time="2022-04-22T10:58:13-04:00" level=info msg="2022/04/22 10:58:13 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:8
946"
Apr 22 10:58:43 nj-dcos01-cl01 dkron: time="2022-04-22T10:58:43-04:00" level=info msg="2022/04/22 10:58:43 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:8
946"
Apr 22 10:59:07 nj-dcos01-cl01 dkron: time="2022-04-22T10:59:07-04:00" level=info msg="2022/04/22 10:59:07 [DEBUG] memberlist: Stream connection from=10.10.4.172:34808"
Apr 22 10:59:13 nj-dcos01-cl01 dkron: time="2022-04-22T10:59:13-04:00" level=info msg="2022/04/22 10:59:13 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:8
946

@vcastellm
Copy link
Member

@seanfulton it's good to know that it was working well before the upgrade, but the fix is included in the latest release, so it should work the same.

Did you use all betas up to beta3?

Did the scheduler started when dcos01 became the new leader?

@seanfulton
Copy link

I don't know that it worked prior, it could have been a coincidence. I only upgraded to the new version last week, and I upgraded to beta1 when it became available. So it could possibly be the upgrade, or it could just be that the trigger didn't happen to happen while I was running the beta.

No, jobs did not start once dcos01 was elected leader. That's the problem. There is a brief network interruption of some kind for the leader, another leader is elected, and then nothing runs... Until I restart the whole cluster. This is a 3 node cluster

@seanfulton
Copy link

seanfulton commented Apr 25, 2022

Happened again this morning. Something about peer having a newer term???? It looks like dcos01 stepped down, then elected itself leader again.

Then no more jobs run after that (it looks like one or two jobs ran in the middle of this, but you can see after 11:38 everything stops.

sean
Apr 25 11:37:14 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:14-04:00" level=info msg="2022/04/25 11:37:14 [DEBUG] memberlist: Stream connection from=10.10.4.172:60992"
Apr 25 11:37:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:38-04:00" level=info msg="2022/04/25 11:37:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:37:52 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:52-04:00" level=info msg="2022-04-25T11:37:52.064-0400 [WARN] raft: failed to contact: server-id=nj-dcos02-cl01.onecount.net time=501.077318
ms"
Apr 25 11:37:52 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:52-04:00" level=info msg="2022-04-25T11:37:52.932-0400 [WARN] raft: failed to contact: server-id=nj-dcos02-cl01.onecount.net time=500.68241m
s"
Apr 25 11:37:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:53-04:00" level=info msg="2022-04-25T11:37:53.432-0400 [WARN] raft: failed to contact: server-id=nj-dcos02-cl01.onecount.net time=1.00057592
7s"
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022-04-25T11:37:54.180-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868 lea
der=10.10.4.171:6868"
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022-04-25T11:37:54.354-0400 [ERROR] raft: peer has newer term, stopping replication: peer="{Voter nj-dcos02-cl01.
onecount.net 10.10.4.172:6868}""
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=debug msg="dkron: shutting down leader loop" node=nj-dcos01-cl01.onecount.net
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=debug msg="scheduler: Stopping scheduler" node=nj-dcos01-cl01.onecount.net
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022-04-25T11:37:54.354-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos02-cl01.onecount.net
10.10.4.172:6868}""
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022-04-25T11:37:54.354-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Follower]
" leader="
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022-04-25T11:37:54.354-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos03-cl01.onecount.net
10.10.4.173:6868}""
Apr 25 11:37:54 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:54-04:00" level=info msg="2022/04/25 11:37:54 [WARN] memberlist: Refuting a suspect message (from: nj-dcos02-cl01.onecount.net)"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.167-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader="
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.167-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candidate]"
term=606"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.178-0400 [DEBUG] raft: votes: needed=2"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.178-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=606 tally=1"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.241-0400 [DEBUG] raft: vote granted: from=nj-dcos03-cl01.onecount.net term=606 tally=2"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.242-0400 [INFO] raft: election won: tally=2"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.242-0400 [INFO] raft: entering leader state: leader="Node at 10.10.4.171:6868 [Leader]""
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.242-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos03-cl01.onecount.net"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.242-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos02-cl01.onecount.net"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022/04/25 11:37:56 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.273-0400 [ERROR] raft: peer has newer term, stopping replication: peer="{Voter nj-dcos03-cl01.
onecount.net 10.10.4.173:6868}""
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.273-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Follower]
" leader="
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.321-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022-04-25T11:37:56.324-0400 [WARN] raft: rejecting vote request since our last term is greater: candidate=10.10.4
.172:6868 last-term=606 last-candidate-term=605"
Apr 25 11:37:56 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:56-04:00" level=info msg="2022/04/25 11:37:56 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:38228->10.10.4.172:8946: i/o timeo
ut"
Apr 25 11:37:57 nj-dcos01-cl01 dkron: time="2022-04-25T11:37:57-04:00" level=info msg="2022-04-25T11:37:57.267-0400 [ERROR] raft: peer has newer term, stopping replication: peer="{Voter nj-dcos02-cl01.
onecount.net 10.10.4.172:6868}""
Apr 25 11:38:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:08-04:00" level=info msg="2022/04/25 11:38:08 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:38308->10.10.4.172:8946: i/o timeo
ut"
Apr 25 11:38:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:08-04:00" level=info msg="2022/04/25 11:38:08 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks received"
Apr 25 11:38:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:08-04:00" level=info msg="2022/04/25 11:38:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:38:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:08-04:00" level=info msg="2022-04-25T11:38:08.956-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.173:6868 lea
der=10.10.4.172:6868"
Apr 25 11:38:09 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:09-04:00" level=info msg="2022/04/25 11:38:09 [WARN] memberlist: Refuting a suspect message (from: nj-dcos02-cl01.onecount.net)"
Apr 25 11:38:14 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:14-04:00" level=info msg="2022/04/25 11:38:14 [DEBUG] memberlist: Stream connection from=10.10.4.172:41460"
Apr 25 11:38:19 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:19-04:00" level=info msg="2022/04/25 11:38:19 [DEBUG] memberlist: Stream connection from=10.10.4.173:39684"
Apr 25 11:38:25 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:25-04:00" level=info msg="2022/04/25 11:38:25 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 11:38:25 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:25-04:00" level=info msg="2022/04/25 11:38:25 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:38434->10.10.4.172:8946: i/o timeo
ut"
Apr 25 11:38:25 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:25-04:00" level=info msg="2022/04/25 11:38:25 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks received"
Apr 25 11:38:26 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:26-04:00" level=info msg="2022/04/25 11:38:26 [WARN] memberlist: Refuting a suspect message (from: nj-dcos02-cl01.onecount.net)"
Apr 25 11:38:26 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:26-04:00" level=info msg="2022-04-25T11:38:26.679-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868 lea
der=10.10.4.173:6868"
Apr 25 11:38:27 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:27-04:00" level=info msg="2022-04-25T11:38:27.802-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868 lea
der=10.10.4.173:6868"
Apr 25 11:38:37 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:37-04:00" level=info msg="2022/04/25 11:38:37 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 11:38:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:38-04:00" level=info msg="2022/04/25 11:38:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:38:40 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:40-04:00" level=info msg="2022-04-25T11:38:40.466-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader=10.10.4.172:68
68"
Apr 25 11:38:40 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:40-04:00" level=info msg="2022-04-25T11:38:40.466-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candidate]"
term=611"
Apr 25 11:38:40 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:40-04:00" level=info msg="2022-04-25T11:38:40.476-0400 [DEBUG] raft: votes: needed=2"
Apr 25 11:38:40 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:40-04:00" level=info msg="2022-04-25T11:38:40.476-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=611 tally=1"
Apr 25 11:38:40 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:40-04:00" level=info msg="2022-04-25T11:38:40.931-0400 [INFO] raft: duplicate requestVote for same term: term=611"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.824-0400 [WARN] raft: Election timeout reached, restarting election"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.824-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candidate]"
term=612"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.832-0400 [DEBUG] raft: votes: needed=2"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.832-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=612 tally=1"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.865-0400 [DEBUG] raft: vote granted: from=nj-dcos03-cl01.onecount.net term=612 tally=2"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.865-0400 [INFO] raft: election won: tally=2"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.866-0400 [INFO] raft: entering leader state: leader="Node at 10.10.4.171:6868 [Leader]""
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.867-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos03-cl01.onecount.net"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.867-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos02-cl01.onecount.net"
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.867-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos03-cl01.onecount.net 10.10.
4.173:6868}""
Apr 25 11:38:41 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:41-04:00" level=info msg="2022-04-25T11:38:41.972-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos02-cl01.onecount.net 10.10.
4.172:6868}""
Apr 25 11:38:44 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:44-04:00" level=info msg="2022-04-25T11:38:44.154-0400 [WARN] raft: failed to contact: server-id=nj-dcos02-cl01.onecount.net time=500.091326
ms"
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="grpc: received execution stream" key=1650900600070058998-nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.
net
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="grpc: Received execution done" from=nj-dcos02-cl01.onecount.net group=16509006000263052
output=
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="store: Retrieved job from datastore" job=import-scheduler node=nj-dcos01-cl01.onecount.net
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="store: Setting job" job=import-scheduler node=nj-dcos01-cl01.onecount.net
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="store: Retrieved job from datastore" job=import-scheduler node=nj-dcos01-cl01.onecount.net
Apr 25 11:38:53 nj-dcos01-cl01 dkron: time="2022-04-25T11:38:53-04:00" level=debug msg="grpc: Response from method" from=nj-dcos01-cl01.onecount.net method=ExecutionDone node=nj-dcos01-cl01.onecount.net
payload=saved server_addr="10.10.4.171:6868"
Apr 25 11:39:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:39:08-04:00" level=info msg="2022/04/25 11:39:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:39:14 nj-dcos01-cl01 dkron: time="2022-04-25T11:39:14-04:00" level=info msg="2022/04/25 11:39:14 [DEBUG] memberlist: Stream connection from=10.10.4.172:50772"
Apr 25 11:39:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:39:38-04:00" level=info msg="2022/04/25 11:39:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:39:49 nj-dcos01-cl01 dkron: time="2022-04-25T11:39:49-04:00" level=info msg="2022/04/25 11:39:49 [DEBUG] memberlist: Stream connection from=10.10.4.173:40252"
Apr 25 11:40:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:40:08-04:00" level=info msg="2022/04/25 11:40:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:40:14 nj-dcos01-cl01 dkron: time="2022-04-25T11:40:14-04:00" level=info msg="2022/04/25 11:40:14 [DEBUG] memberlist: Stream connection from=10.10.4.172:60088"
Apr 25 11:40:19 nj-dcos01-cl01 dkron: time="2022-04-25T11:40:19-04:00" level=info msg="2022/04/25 11:40:19 [DEBUG] memberlist: Stream connection from=10.10.4.173:40818"
Apr 25 11:40:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:40:38-04:00" level=info msg="2022/04/25 11:40:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:40:49 nj-dcos01-cl01 dkron: time="2022-04-25T11:40:49-04:00" level=info msg="2022/04/25 11:40:49 [DEBUG] memberlist: Stream connection from=10.10.4.173:41326"
Apr 25 11:41:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:41:08-04:00" level=info msg="2022/04/25 11:41:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:41:19 nj-dcos01-cl01 dkron: time="2022-04-25T11:41:19-04:00" level=info msg="2022/04/25 11:41:19 [DEBUG] memberlist: Stream connection from=10.10.4.173:41380"
Apr 25 11:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:41:38-04:00" level=info msg="2022/04/25 11:41:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:42:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:42:08-04:00" level=info msg="2022/04/25 11:42:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:42:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:42:38-04:00" level=info msg="2022/04/25 11:42:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:42:44 nj-dcos01-cl01 dkron: time="2022-04-25T11:42:44-04:00" level=info msg="2022/04/25 11:42:44 [DEBUG] memberlist: Stream connection from=10.10.4.172:55974"
Apr 25 11:43:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:43:08-04:00" level=info msg="2022/04/25 11:43:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:43:19 nj-dcos01-cl01 dkron: time="2022-04-25T11:43:19-04:00" level=info msg="2022/04/25 11:43:19 [DEBUG] memberlist: Stream connection from=10.10.4.173:42448"
Apr 25 11:43:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:43:38-04:00" level=info msg="2022/04/25 11:43:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:44:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:44:08-04:00" level=info msg="2022/04/25 11:44:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:44:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:44:38-04:00" level=info msg="2022/04/25 11:44:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:44:49 nj-dcos01-cl01 dkron: time="2022-04-25T11:44:49-04:00" level=info msg="2022/04/25 11:44:49 [DEBUG] memberlist: Stream connection from=10.10.4.173:43492"
Apr 25 11:45:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:45:08-04:00" level=info msg="2022/04/25 11:45:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:45:19 nj-dcos01-cl01 dkron: time="2022-04-25T11:45:19-04:00" level=info msg="2022/04/25 11:45:19 [DEBUG] memberlist: Stream connection from=10.10.4.173:43598"
Apr 25 11:45:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:45:38-04:00" level=info msg="2022/04/25 11:45:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:45:44 nj-dcos01-cl01 dkron: time="2022-04-25T11:45:44-04:00" level=info msg="2022/04/25 11:45:44 [DEBUG] memberlist: Stream connection from=10.10.4.172:55544"
Apr 25 11:45:49 nj-dcos01-cl01 dkron: time="2022-04-25T11:45:49-04:00" level=info msg="2022/04/25 11:45:49 [DEBUG] memberlist: Stream connection from=10.10.4.173:43694"
Apr 25 11:46:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:46:08-04:00" level=info msg="2022/04/25 11:46:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:894
6"
Apr 25 11:46:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:46:38-04:00" level=info msg="2022/04/25 11:46:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:47:08 nj-dcos01-cl01 dkron: time="2022-04-25T11:47:08-04:00" level=info msg="2022/04/25 11:47:08 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:894
6"
Apr 25 11:47:14 nj-dcos01-cl01 dkron: time="2022-04-25T11:47:14-04:00" level=info msg="2022/04/25 11:47:14 [DEBUG] memberlist: Stream connection from=10.10.4.172:41376"
Apr 25 11:47:38 nj-dcos01-cl01 dkron: time="2022-04-25T11:47:38-04:00" level=info msg="2022/04/25 11:47:38 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net

@seanfulton
Copy link

I was able to reproduce it now. Since dcos02 kept getting disconnected, I tried moving it to a less busy load. docs01 was master. So by taking dcos02 down and bringing it up again, the jobs stopped running.:

Apr 25 15:26:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:01-04:00" level=debug msg="store: Retrieved job from datastore" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:01-04:00" level=debug msg="store: Setting job" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:27-04:00" level=info msg="2022/04/25 15:26:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:26:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:39-04:00" level=info msg="2022/04/25 15:26:39 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 15:26:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:39-04:00" level=info msg="2022/04/25 15:26:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:39332"
Apr 25 15:26:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:39-04:00" level=info msg="2022/04/25 15:26:39 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks received"
Apr 25 15:26:42 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:42-04:00" level=info msg="2022/04/25 15:26:42 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 15:26:42 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:42-04:00" level=info msg="2022/04/25 15:26:42 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks received"
Apr 25 15:26:43 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:43-04:00" level=info msg="2022/04/25 15:26:43 [INFO] memberlist: Marking nj-dcos02-cl01.onecount.net as failed, suspect timeout reached (0
peer confirmations)"
Apr 25 15:26:43 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:43-04:00" level=info msg="2022/04/25 15:26:43 [INFO] serf: EventMemberFailed: nj-dcos02-cl01.onecount.net 10.10.4.172"
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=info msg="agent: Received event" event=member-failed node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=debug msg="agent: Member event" event=member-failed member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:26:44 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:44-04:00" level=info msg="removing server nj-dcos02-cl01.onecount.net (Addr: 10.10.4.172:6868) (DC: nj)" node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="fsm: received command" command=0 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: Retrieved job from datastore" job=segment-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: Setting job" job=segment-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="fsm: received command" command=2 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: Setting key" execution="executions:segment-queue:1650914820055065247-nj-dcos03-cl01.onecount.net" finish
ed="0001-01-01 00:00:00 +0000 UTC" job=segment-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: to detele key" execution=1650899520050789028-nj-dcos02-cl01.onecount.net job=segment-queue node=nj-dcos0
1-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="fsm: received command" command=4 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="fsm: Setting execution" execution=1650914820055065247-nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.
net output=
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: Retrieved job from datastore" job=segment-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:00 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:00-04:00" level=debug msg="store: Setting job" job=segment-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=info msg="grpc_agent: Starting job" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="grpc_agent: calling executor plugin" node=nj-dcos01-cl01.onecount.net plugin=shell
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="2022/04/25 15:27:01 shell: going to run /pub/cust/admin/ImportTool/ImportQueue.sh-new >>/pub/log/ImportTool/imp
ort_queue.log 2>>/pub/log/ImportTool/import_queue_err.log 2>&1" node=nj-dcos01-cl01.onecount.net subsystem_name=plugins.dkron-executor-shell
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="fsm: received command" command=0 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: Retrieved job from datastore" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: Setting job" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="fsm: received command" command=2 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: Setting key" execution="executions:import-queue:1650914821024805435-nj-dcos01-cl01.onecount.net" finishe
d="0001-01-01 00:00:00 +0000 UTC" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: to detele key" execution=1650899521024998541-nj-dcos02-cl01.onecount.net job=import-queue node=nj-dcos01
-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="2022/04/25 15:27:01 shell: Command output " node=nj-dcos01-cl01.onecount.net subsystem_name=plugins.dkron-execu
tor-shell
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="fsm: received command" command=4 node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="fsm: Setting execution" execution=1650914821024805435-nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.
net output=
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: Retrieved job from datastore" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:01-04:00" level=debug msg="store: Setting job" job=import-queue node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:07 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:07-04:00" level=info msg="2022/04/25 15:27:07 [DEBUG] memberlist: Stream connection from=10.10.4.172:43152"
Apr 25 15:27:08 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:08-04:00" level=info msg="2022/04/25 15:27:08 [INFO] serf: EventMemberJoin: nj-dcos02-cl01.onecount.net 10.10.4.172"
Apr 25 15:27:08 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:08-04:00" level=info msg="2022/04/25 15:27:08 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:27:08 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:08-04:00" level=info msg="2022/04/25 15:27:08 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:27:08 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:08-04:00" level=info msg="2022/04/25 15:27:08 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=info msg="agent: Received event" event=member-join node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=debug msg="agent: Member event" event=member-join member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:09-04:00" level=warning msg="non-server in gossip pool" member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:10 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:10-04:00" level=info msg="2022/04/25 15:27:10 [INFO] serf: EventMemberUpdate: nj-dcos02-cl01.onecount.net"
Apr 25 15:27:11 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:11-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:11 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:11-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:11 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:11-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:11 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:11-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:27:11 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:11-04:00" level=info msg="2022-04-25T15:27:11.832-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868
leader=10.10.4.173:6868"
Apr 25 15:27:13 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:13-04:00" level=info msg="2022-04-25T15:27:13.331-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868
leader=10.10.4.173:6868"
Apr 25 15:27:14 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:14-04:00" level=info msg="2022-04-25T15:27:14.432-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868
leader=10.10.4.173:6868"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.133-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:6868
leader=10.10.4.173:6868"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.305-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader=10.10.4.173
:6868"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.305-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candidate]
" term=648"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.312-0400 [DEBUG] raft: votes: needed=2"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.312-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=648 tally=1"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.312-0400 [ERROR] raft: failed to make requestVote RPC: target="{Voter nj-dcos02-cl01.onecou
nt.net 10.10.4.172:6868}" error="write tcp 10.10.4.171:50860->10.10.4.172:6868: write: broken pipe""
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.314-0400 [DEBUG] raft: newer term discovered, fallback to follower"
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.315-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Followe
r]" leader="
Apr 25 15:27:16 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:16-04:00" level=info msg="2022-04-25T15:27:16.566-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
Apr 25 15:27:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:27-04:00" level=info msg="2022/04/25 15:27:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:27:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:27:57-04:00" level=info msg="2022/04/25 15:27:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:28:22 nj-dcos01-cl01 dkron: time="2022-04-25T15:28:22-04:00" level=info msg="2022/04/25 15:28:22 [DEBUG] memberlist: Stream connection from=10.10.4.172:51226"
Apr 25 15:28:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:28:27-04:00" level=info msg="2022/04/25 15:28:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:28:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:28:39-04:00" level=info msg="2022/04/25 15:28:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:42098"
Apr 25 15:28:52 nj-dcos01-cl01 dkron: time="2022-04-25T15:28:52-04:00" level=info msg="2022/04/25 15:28:52 [DEBUG] memberlist: Stream connection from=10.10.4.172:54040"
Apr 25 15:28:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:28:57-04:00" level=info msg="2022/04/25 15:28:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:29:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:09-04:00" level=info msg="2022/04/25 15:29:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:42556"
Apr 25 15:29:22 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:22-04:00" level=info msg="2022/04/25 15:29:22 [DEBUG] memberlist: Stream connection from=10.10.4.172:57094"
Apr 25 15:29:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:27-04:00" level=info msg="2022/04/25 15:29:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:29:29 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:29-04:00" level=info msg="2022/04/25 15:29:29 [DEBUG] memberlist: Stream connection from=10.10.4.172:58240"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [INFO] serf: EventMemberUpdate: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:30 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:30-04:00" level=info msg="2022/04/25 15:29:30 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:31 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:31-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:31 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:31-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:31 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:31-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:31 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:31-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:31 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:31-04:00" level=info msg="2022/04/25 15:29:31 [INFO] serf: EventMemberUpdate: nj-dcos02-cl01.onecount.net"
Apr 25 15:29:32 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:32-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:32 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:32-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:32 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:32-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:32 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:32-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:29:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:29:57-04:00" level=info msg="2022/04/25 15:29:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:30:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:30:27-04:00" level=info msg="2022/04/25 15:30:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:30:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:30:39-04:00" level=info msg="2022/04/25 15:30:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:44616"
Apr 25 15:30:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:30:57-04:00" level=info msg="2022/04/25 15:30:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:30:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:30:57-04:00" level=info msg="2022/04/25 15:30:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:40500"
Apr 25 15:31:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:31:09-04:00" level=info msg="2022/04/25 15:31:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:45008"
Apr 25 15:31:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:31:27-04:00" level=info msg="2022/04/25 15:31:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:31:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:31:57-04:00" level=info msg="2022/04/25 15:31:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:31:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:31:57-04:00" level=info msg="2022/04/25 15:31:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:47580"
Apr 25 15:32:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:32:27-04:00" level=info msg="2022/04/25 15:32:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:32:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:32:27-04:00" level=info msg="2022/04/25 15:32:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:50482"
Apr 25 15:32:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:32:57-04:00" level=info msg="2022/04/25 15:32:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:33:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:33:27-04:00" level=info msg="2022/04/25 15:33:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:33:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:33:27-04:00" level=info msg="2022/04/25 15:33:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:57170"
Apr 25 15:33:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:33:57-04:00" level=info msg="2022/04/25 15:33:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:33:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:33:57-04:00" level=info msg="2022/04/25 15:33:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:60190"
Apr 25 15:34:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:34:27-04:00" level=info msg="2022/04/25 15:34:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:34:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:34:27-04:00" level=info msg="2022/04/25 15:34:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:35298"
Apr 25 15:34:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:34:39-04:00" level=info msg="2022/04/25 15:34:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:49924"
Apr 25 15:34:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:34:57-04:00" level=info msg="2022/04/25 15:34:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:34:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:34:57-04:00" level=info msg="2022/04/25 15:34:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:39000"
Apr 25 15:35:29 nj-dcos01-cl01 dkron: time="2022-04-25T15:35:27-04:00" level=info msg="2022/04/25 15:35:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:35:29 nj-dcos01-cl01 dkron: time="2022-04-25T15:35:27-04:00" level=info msg="2022/04/25 15:35:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:43072"
Apr 25 15:35:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:35:39-04:00" level=info msg="2022/04/25 15:35:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:51010"
Apr 25 15:35:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:35:57-04:00" level=info msg="2022/04/25 15:35:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:35:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:35:57-04:00" level=info msg="2022/04/25 15:35:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:46270"
Apr 25 15:36:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:36:09-04:00" level=info msg="2022/04/25 15:36:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:51276"
Apr 25 15:36:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:36:27-04:00" level=info msg="2022/04/25 15:36:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:36:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:36:57-04:00" level=info msg="2022/04/25 15:36:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:36:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:36:57-04:00" level=info msg="2022/04/25 15:36:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:52292"
Apr 25 15:37:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:37:09-04:00" level=info msg="2022/04/25 15:37:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:52832"
Apr 25 15:37:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:37:27-04:00" level=info msg="2022/04/25 15:37:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:37:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:37:27-04:00" level=info msg="2022/04/25 15:37:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:55342"
Apr 25 15:37:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:37:57-04:00" level=info msg="2022/04/25 15:37:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:38:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:38:27-04:00" level=info msg="2022/04/25 15:38:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:38:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:38:57-04:00" level=info msg="2022/04/25 15:38:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:39:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:39:27-04:00" level=info msg="2022/04/25 15:39:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:39:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:39:39-04:00" level=info msg="2022/04/25 15:39:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:55568"
Apr 25 15:39:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:39:57-04:00" level=info msg="2022/04/25 15:39:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:39:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:39:57-04:00" level=info msg="2022/04/25 15:39:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:42844"
Apr 25 15:40:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:40:27-04:00" level=info msg="2022/04/25 15:40:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:40:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:40:27-04:00" level=info msg="2022/04/25 15:40:27 [DEBUG] memberlist: Stream connection from=10.10.4.172:46966"
Apr 25 15:40:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:40:57-04:00" level=info msg="2022/04/25 15:40:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:40:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:40:57-04:00" level=info msg="2022/04/25 15:40:57 [DEBUG] memberlist: Stream connection from=10.10.4.172:50024"
Apr 25 15:41:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:27-04:00" level=info msg="2022/04/25 15:41:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [DEBUG] memberlist: Stream connection from=10.10.4.172:54626"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [INFO] serf: EventMemberUpdate: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:38 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:38-04:00" level=info msg="2022/04/25 15:41:38 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=info msg="2022/04/25 15:41:39 [DEBUG] serf: messageJoinType: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=info msg="2022/04/25 15:41:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:58218"
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:39-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:40 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:40-04:00" level=info msg="2022/04/25 15:41:40 [INFO] serf: EventMemberUpdate: nj-dcos02-cl01.onecount.net"
Apr 25 15:41:41 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:41-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:41 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:41-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:41 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:41-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:41 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:41-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
Apr 25 15:41:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:41:57-04:00" level=info msg="2022/04/25 15:41:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:42:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:42:09-04:00" level=info msg="2022/04/25 15:42:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:58734"
Apr 25 15:42:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:42:27-04:00" level=info msg="2022/04/25 15:42:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:42:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:42:36-04:00" level=info msg="2022/04/25 15:42:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:33916"
Apr 25 15:42:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:42:39-04:00" level=info msg="2022/04/25 15:42:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:60362"
Apr 25 15:42:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:42:57-04:00" level=info msg="2022/04/25 15:42:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:43:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:43:06-04:00" level=info msg="2022/04/25 15:43:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:37214"
Apr 25 15:43:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:43:09-04:00" level=info msg="2022/04/25 15:43:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:60800"
Apr 25 15:43:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:43:27-04:00" level=info msg="2022/04/25 15:43:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:43:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:43:57-04:00" level=info msg="2022/04/25 15:43:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:44:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:44:06-04:00" level=info msg="2022/04/25 15:44:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:44248"
Apr 25 15:44:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:44:09-04:00" level=info msg="2022/04/25 15:44:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:33592"
Apr 25 15:44:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:44:27-04:00" level=info msg="2022/04/25 15:44:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:44:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:44:57-04:00" level=info msg="2022/04/25 15:44:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:45:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:45:06-04:00" level=info msg="2022/04/25 15:45:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:50548"
Apr 25 15:45:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:45:09-04:00" level=info msg="2022/04/25 15:45:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:35430"
Apr 25 15:45:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:45:27-04:00" level=info msg="2022/04/25 15:45:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:45:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:45:36-04:00" level=info msg="2022/04/25 15:45:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:55330"
Apr 25 15:45:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:45:57-04:00" level=info msg="2022/04/25 15:45:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:46:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:46:06-04:00" level=info msg="2022/04/25 15:46:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:58336"
Apr 25 15:46:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:46:27-04:00" level=info msg="2022/04/25 15:46:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:46:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:46:39-04:00" level=info msg="2022/04/25 15:46:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:37326"
Apr 25 15:46:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:46:57-04:00" level=info msg="2022/04/25 15:46:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:47:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:47:27-04:00" level=info msg="2022/04/25 15:47:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:47:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:47:36-04:00" level=info msg="2022/04/25 15:47:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:40968"
Apr 25 15:47:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:47:39-04:00" level=info msg="2022/04/25 15:47:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:38048"
Apr 25 15:47:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:47:57-04:00" level=info msg="2022/04/25 15:47:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:48:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:06-04:00" level=info msg="2022/04/25 15:48:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:44190"
Apr 25 15:48:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:09-04:00" level=info msg="2022/04/25 15:48:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:38490"
Apr 25 15:48:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:27-04:00" level=info msg="2022/04/25 15:48:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:48:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:36-04:00" level=info msg="2022/04/25 15:48:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:47668"
Apr 25 15:48:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:39-04:00" level=info msg="2022/04/25 15:48:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:39678"
Apr 25 15:48:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:48:57-04:00" level=info msg="2022/04/25 15:48:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:49:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:49:27-04:00" level=info msg="2022/04/25 15:49:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:49:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:49:36-04:00" level=info msg="2022/04/25 15:49:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:54958"
Apr 25 15:49:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:49:39-04:00" level=info msg="2022/04/25 15:49:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:40618"
Apr 25 15:49:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:49:57-04:00" level=info msg="2022/04/25 15:49:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:50:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:50:06-04:00" level=info msg="2022/04/25 15:50:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:59042"
Apr 25 15:50:09 nj-dcos01-cl01 dkron: time="2022-04-25T15:50:09-04:00" level=info msg="2022/04/25 15:50:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:41100"
Apr 25 15:50:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:50:27-04:00" level=info msg="2022/04/25 15:50:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
Apr 25 15:50:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:50:57-04:00" level=info msg="2022/04/25 15:50:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:51:06 nj-dcos01-cl01 dkron: time="2022-04-25T15:51:06-04:00" level=info msg="2022/04/25 15:51:06 [DEBUG] memberlist: Stream connection from=10.10.4.172:38598"
Apr 25 15:51:27 nj-dcos01-cl01 dkron: time="2022-04-25T15:51:27-04:00" level=info msg="2022/04/25 15:51:27 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.172:
8946"
Apr 25 15:51:36 nj-dcos01-cl01 dkron: time="2022-04-25T15:51:36-04:00" level=info msg="2022/04/25 15:51:36 [DEBUG] memberlist: Stream connection from=10.10.4.172:42222"
Apr 25 15:51:39 nj-dcos01-cl01 dkron: time="2022-04-25T15:51:39-04:00" level=info msg="2022/04/25 15:51:39 [DEBUG] memberlist: Stream connection from=10.10.4.173:43066"
Apr 25 15:51:57 nj-dcos01-cl01 dkron: time="2022-04-25T15:51:57-04:00" level=info msg="2022/04/25 15:51:57 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos03-cl01.onecount.net 10.10.4.173:
8946"
[root@nj-dcos01-cl01 ~]#

@seanfulton
Copy link

This is a daily occurrence now. Ping fails, leader election, no more jobs.

Help!

Apr 26 03:35:10 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:10-04:00" level=debug msg="store: Setting job" job=wly-mapping node=nj-dcos01-cl01.onecount.net
Apr 26 03:35:16 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:16-04:00" level=info msg="2022/04/26 03:35:16 [DEBUG] memberlist: Stream connection from=10.10.4.172:41898"
Apr 26 03:35:18 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:18-04:00" level=info msg="2022/04/26 03:35:18 [DEBUG] memberlist: Stream connection from=10.10.4.173:57276"
Apr 26 03:35:19 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:19-04:00" level=info msg="2022/04/26 03:35:19 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022-04-26T03:35:20.197-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader
=10.10.4.172:6868"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022-04-26T03:35:20.197-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868
[Candidate]" term=657"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022-04-26T03:35:20.202-0400 [DEBUG] raft: votes: needed=2"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022-04-26T03:35:20.202-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=657 t
ally=1"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022/04/26 03:35:20 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:34076->10.10.4.172:
8946: i/o timeout"
Apr 26 03:35:20 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:20-04:00" level=info msg="2022/04/26 03:35:20 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks recei
ved"
Apr 26 03:35:21 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:21-04:00" level=info msg="2022-04-26T03:35:21.855-0400 [WARN] raft: Election timeout reached, restarting election"
Apr 26 03:35:21 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:21-04:00" level=info msg="2022-04-26T03:35:21.855-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868
[Candidate]" term=658"
Apr 26 03:35:21 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:21-04:00" level=info msg="2022-04-26T03:35:21.858-0400 [DEBUG] raft: votes: needed=2"
Apr 26 03:35:21 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:21-04:00" level=info msg="2022-04-26T03:35:21.858-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=658 t
ally=1"
Apr 26 03:35:21 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:21-04:00" level=info msg="2022/04/26 03:35:21 [DEBUG] memberlist: Failed ping: nj-dcos02-cl01.onecount.net (timeout reached)"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022/04/26 03:35:22 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:34082->10.10.4.172:
8946: i/o timeout"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022/04/26 03:35:22 [INFO] memberlist: Suspect nj-dcos02-cl01.onecount.net has failed, no acks recei
ved"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022-04-26T03:35:22.638-0400 [DEBUG] raft: vote granted: from=nj-dcos02-cl01.onecount.net term=658 t
ally=2"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022-04-26T03:35:22.638-0400 [INFO] raft: election won: tally=2"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022-04-26T03:35:22.638-0400 [INFO] raft: entering leader state: leader="Node at 10.10.4.171:6868
[Leader]""
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022-04-26T03:35:22.638-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos03-cl01.one
count.net"
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="dkron: cluster leadership acquired" node=nj-dcos01-cl01.onecount.net
Apr 26 03:35:22 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:22-04:00" level=info msg="2022-04-26T03:35:22.638-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos02-cl01.one
--More--

@seanfulton
Copy link

Any ideas would be appreciated. I understand the event is being triggered by one of the three nodes becoming un-available, but the cluster has to survive a momentary bounce. Any suggestions or advice would be appreciated.

I have this set if it makes any difference:

Provides the number of expected servers in the datacenter.

Either this value should not be provided or the value must agree with other servers in the cluster.

When provided, Dkron waits until the specified number of servers are available and then bootstraps the cluster.

This allows an initial leader to be elected automatically. This flag requires server mode.

bootstrap-expect: 3

@seanfulton
Copy link

This is happening every day. It has something to do with a peer having a newer term. I upgraded by cluster to 5 nodes, reduced the workload, and it is still happening.I am going to revert to an earlier version to see if I can get this stabilized. This is crazy.
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="fsm: Setting execution" execution=1651495020556679411-nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecou
nt.net output=
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="store: Retrieved job from datastore" job=segment-queue node=nj-dcos01-cl01.onecount.net
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="grpc: Response from method" from=nj-dcos01-cl01.onecount.net method=ExecutionDone node=nj-dcos01-cl01.onecou
nt.net payload=saved server_addr="10.10.4.171:6868"
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="store: Setting job" job=segment-queue node=nj-dcos01-cl01.onecount.net
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="store: Retrieved job from datastore" job=segment-queue node=nj-dcos01-cl01.onecount.net
May 2 08:37:01 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:01-04:00" level=debug msg="grpc: Response from method" from=nj-dcos01-cl01.onecount.net method=ExecutionDone node=nj-dcos01-cl01.onecou
nt.net payload=saved server_addr="10.10.4.171:6868"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.022-0400 [ERROR] raft: peer has newer term, stopping replication: peer="{Voter nj-dcos05
-cl01.onecount.net 10.10.4.175:6868}""
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022/05/02 08:37:09 [DEBUG] memberlist: Stream connection from=10.10.4.173:41664"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=debug msg="dkron: shutting down leader loop" node=nj-dcos01-cl01.onecount.net
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=debug msg="scheduler: Stopping scheduler" node=nj-dcos01-cl01.onecount.net
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022/05/02 08:37:10 [DEBUG] memberlist: Stream connection from=10.10.4.175:57946"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.077-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos05-cl01.onecou
nt.net 10.10.4.175:6868}""
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.093-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos03-cl01.onecou
nt.net 10.10.4.173:6868}""
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.094-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos02-cl01.onecou
nt.net 10.10.4.172:6868}""
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.092-0400 [INFO] raft: aborting pipeline replication: peer="{Voter nj-dcos04-cl01.onecou
nt.net 10.10.4.174:6868}""
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022/05/02 08:37:10 [INFO] memberlist: Suspect nj-dcos04-cl01.onecount.net has failed, no acks received"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.090-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Foll
ower]" leader="
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.138-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022/05/02 08:37:10 [WARN] memberlist: Refuting a suspect message (from: nj-dcos05-cl01.onecount.net)"
May 2 08:37:10 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:10-04:00" level=info msg="2022-05-02T08:37:10.364-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
May 2 08:37:11 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:11-04:00" level=info msg="2022-05-02T08:37:11.682-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
May 2 08:37:24 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:24-04:00" level=info msg="2022/05/02 08:37:24 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos02-cl01.onecount.net 10.10.4.1
72:8946"
May 2 08:37:29 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:29-04:00" level=info msg="2022/05/02 08:37:29 [DEBUG] memberlist: Stream connection from=10.10.4.175:57956"
May 2 08:37:29 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:29-04:00" level=info msg="2022/05/02 08:37:29 [DEBUG] memberlist: Stream connection from=10.10.4.172:38930"
May 2 08:37:29 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:29-04:00" level=info msg="2022/05/02 08:37:29 [WARN] memberlist: Refuting a suspect message (from: nj-dcos05-cl01.onecount.net)"
May 2 08:37:29 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:29-04:00" level=info msg="2022/05/02 08:37:29 [INFO] memberlist: Suspect nj-dcos03-cl01.onecount.net has failed, no acks received"
May 2 08:37:32 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:32-04:00" level=info msg="2022/05/02 08:37:32 [DEBUG] memberlist: Failed ping: nj-dcos03-cl01.onecount.net (timeout reached)"
May 2 08:37:34 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:34-04:00" level=info msg="2022/05/02 08:37:34 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:45610->10.10.4.173:8946: i/o
timeout"
May 2 08:37:34 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:34-04:00" level=info msg="2022/05/02 08:37:34 [INFO] memberlist: Suspect nj-dcos03-cl01.onecount.net has failed, no acks received"
May 2 08:37:34 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:34-04:00" level=info msg="2022-05-02T08:37:34.818-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.173:68
68 leader=10.10.4.175:6868"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.699-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:68
68 leader=10.10.4.175:6868"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.834-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader=10.10.4.
175:6868"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.837-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candida
te]" term=1041"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.856-0400 [DEBUG] raft: votes: needed=3"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.857-0400 [DEBUG] raft: vote granted: from=nj-dcos01-cl01.onecount.net term=1041 tally=1"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.872-0400 [DEBUG] raft: lost leadership because received a requestVote with a newer term"
May 2 08:37:37 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:37-04:00" level=info msg="2022-05-02T08:37:37.877-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Foll
ower]" leader="
May 2 08:37:55 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022/05/02 08:37:52 [DEBUG] memberlist: Stream connection from=10.10.4.173:41856"
May 2 08:37:55 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022/05/02 08:37:52 [DEBUG] memberlist: Stream connection from=10.10.4.172:47816"
May 2 08:37:55 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022/05/02 08:37:52 [DEBUG] memberlist: Stream connection from=10.10.4.175:57972"
May 2 08:37:55 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022-05-02T08:37:52.666-0400 [WARN] raft: heartbeat timeout reached, starting election: last-leader=10.10.4.
173:6868"
May 2 08:37:55 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022-05-02T08:37:52.698-0400 [INFO] raft: entering candidate state: node="Node at 10.10.4.171:6868 [Candida
te]" term=1043"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:52-04:00" level=info msg="2022/05/02 08:37:52 [DEBUG] memberlist: Stream connection from=10.10.4.174:53314"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:53-04:00" level=info msg="2022/05/02 08:37:53 [INFO] memberlist: Suspect nj-dcos04-cl01.onecount.net has failed, no acks received"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:53-04:00" level=info msg="2022/05/02 08:37:53 [WARN] memberlist: Refuting a suspect message (from: nj-dcos02-cl01.onecount.net)"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022/05/02 08:37:54 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos05-cl01.onecount.net 10.10.4.1
75:8946"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022-05-02T08:37:54.244-0400 [DEBUG] raft: votes: needed=3"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022-05-02T08:37:54.246-0400 [INFO] raft: entering follower state: follower="Node at 10.10.4.171:6868 [Foll
ower]" leader=10.10.4.175:6868"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022-05-02T08:37:54.246-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:68
68 leader=10.10.4.175:6868"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022-05-02T08:37:54.844-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.174:68
68 leader=10.10.4.175:6868"
May 2 08:37:56 nj-dcos01-cl01 dkron: time="2022-05-02T08:37:54-04:00" level=info msg="2022-05-02T08:37:54.845-0400 [ERROR] raft-net: failed to flush response: error="write tcp 10.10.4.171:6868->
10.10.4.173:33546: write: broken pipe""
May 2 08:38:11 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:09-04:00" level=info msg="2022/05/02 08:38:09 [DEBUG] memberlist: Failed ping: nj-dcos03-cl01.onecount.net (timeout reached)"
May 2 08:38:11 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:10-04:00" level=info msg="2022/05/02 08:38:10 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:45960->10.10.4.173:8946: i/o
timeout"
May 2 08:38:11 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:10-04:00" level=info msg="2022/05/02 08:38:10 [INFO] memberlist: Suspect nj-dcos03-cl01.onecount.net has failed, no acks received"
May 2 08:38:12 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:12-04:00" level=info msg="2022/05/02 08:38:12 [DEBUG] memberlist: Failed ping: nj-dcos03-cl01.onecount.net (timeout reached)"
May 2 08:38:14 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:14-04:00" level=info msg="2022/05/02 08:38:14 [ERR] memberlist: Failed fallback ping: read tcp 10.10.4.171:45964->10.10.4.173:8946: i/o
timeout"
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:14-04:00" level=info msg="2022/05/02 08:38:14 [INFO] memberlist: Suspect nj-dcos03-cl01.onecount.net has failed, no acks received"
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:14-04:00" level=info msg="2022/05/02 08:38:14 [INFO] memberlist: Marking nj-dcos03-cl01.onecount.net as failed, suspect timeout reached
(2 peer confirmations)"
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:14-04:00" level=info msg="2022/05/02 08:38:14 [INFO] serf: EventMemberFailed: nj-dcos03-cl01.onecount.net 10.10.4.173"
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=info msg="agent: Received event" event=member-failed node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=debug msg="agent: Member event" event=member-failed member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=info msg="removing server nj-dcos03-cl01.onecount.net (Addr: 10.10.4.173:6868) (DC: nj)" node=nj-dcos01-cl01.onecount.n
et
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos05-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos04-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:15 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:15-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:19 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:19-04:00" level=info msg="2022/05/02 08:38:19 [DEBUG] serf: forgoing reconnect for random throttling"
May 2 08:38:24 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:24-04:00" level=info msg="2022/05/02 08:38:24 [DEBUG] memberlist: Initiating push/pull sync with: nj-dcos05-cl01.onecount.net 10.10.4.1
75:8946"
May 2 08:38:24 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:24-04:00" level=info msg="2022-05-02T08:38:24.208-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.173:68
68 leader=10.10.4.175:6868"
May 2 08:38:24 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:24-04:00" level=info msg="2022/05/02 08:38:24 [INFO] serf: EventMemberJoin: nj-dcos03-cl01.onecount.net 10.10.4.173"
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=info msg="agent: Received event" event=member-update node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos01-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos02-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos05-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=debug msg="agent: Member event" event=member-update member=nj-dcos04-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=info msg="agent: Received event" event=member-join node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=debug msg="agent: Member event" event=member-join member=nj-dcos03-cl01.onecount.net node=nj-dcos01-cl01.onecount.net
May 2 08:38:25 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:25-04:00" level=info msg="adding server" node=nj-dcos01-cl01.onecount.net server=nj-dcos03-cl01.onecount.net
May 2 08:38:26 nj-dcos01-cl01 dkron: time="2022-05-02T08:38:26-04:00" level=info msg="2022-05-02T08:38:26.340-0400 [WARN] raft: rejecting vote request since we have a leader: from=10.10.4.172:68
68 leader=10.10.4.175:6868"

@seanfulton
Copy link

May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos01-cl01.onecount.net
"
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos03-cl01.onecount.net
"
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos02-cl01.onecount.net
"
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos04-cl01.onecount.net
"
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.113-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos02-cl01.onecount.net
10.10.4.172:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos03-cl01.oneco
unt.net"
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos02-cl01.oneco
unt.net"
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos04-cl01.oneco
unt.net"
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.113-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos02-cl01.onecou
nt.net 10.10.4.172:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.113-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos04-cl01.onecou
nt.net 10.10.4.174:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.113-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos04-cl01.onecount.net
10.10.4.174:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="agent: Starting scheduler" node=nj-dcos05-cl01.onecount.net
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="agent: Starting scheduler" node=nj-dcos05-cl01.onecount.net
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=error msg="dkron: failed to establish leadership" error="cron is already configured, can not start scheduler" no
de=nj-dcos05-cl01.onecount.net
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=error msg="dkron: failed to establish leadership" error="cron is already configured, can not start scheduler" node=nj-d
cos05-cl01.onecount.net
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.398-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos01-cl01.onecou
nt.net 10.10.4.171:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.398-0400 [INFO] raft: pipelining replication: peer="{Voter nj-dcos01-cl01.onecount.net
10.10.4.171:6868}""
May 2 08:43:22 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.618-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net ti
me=505.807042ms"
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.618-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net time=505.
807042ms"
May 2 08:43:23 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.043-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net ti
me=930.725989ms"
May 2 08:43:23 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.043-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net time=930.
725989ms"
May 2 08:43:23 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.471-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net ti
me=1.358762343s"
May 2 08:43:23 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.471-0400 [WARN] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net time=1.35
8762343s"
May 2 08:43:23 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.891-0400 [DEBUG] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net ti
me=1.779063246s"
May 2 08:43:23 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:23-04:00" level=info msg="2022-05-02T08:43:23.891-0400 [DEBUG] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net time=1.77
9063246s"
May 2 08:43:24 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:24-04:00" level=info msg="2022-05-02T08:43:24.367-0400 [DEBUG] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net ti
me=2.255024783s"
May 2 08:43:24 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:24-04:00" level=info msg="2022-05-02T08:43:24.367-0400 [DEBUG] raft: failed to contact: server-id=nj-dcos03-cl01.onecount.net time=2.25
5024783s"
May 2 08:43:24 nj-dcos05-cl01 dkron[24899]: time="2022-05-02T08:43:24-04:00" level=info msg="2022/05/02 08:43:24 [DEBUG] memberlist: Failed ping: nj-dcos03-cl01.onecount.net (timeout reached)"
May 2 08:43:24 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:24-04:00" level=info msg="2022/05/02 08:43:24 [DEBUG] memberlist: Failed ping: nj-dcos03-cl01.onecount.net (timeout reached)"
--More--

@rorrobriones
Copy link

rorrobriones commented May 2, 2022

We are experiencing the same since version 3.1.11. Every time the leader is down, another node is taking the leadership this is not able to start scheduler and remains sending the following errors:

May 02 21:15:04 rcs-hub-pre-redis-fluentd-2 dkron[14307]: time="2022-05-02T21:15:04Z" level=info msg="2022/05/02 21:15:04 [DEBUG] memberlist: Initiating push/pull sync with: rcs-hub-pre-redis-fluentd 10.128.0.7:8946"
May 02 21:15:20 rcs-hub-pre-redis-fluentd-2 dkron[14307]: time="2022-05-02T21:15:20Z" level=info msg="2022/05/02 21:15:20 [DEBUG] memberlist: Stream connection from=10.128.0.2:51654"
May 02 21:15:28 rcs-hub-pre-redis-fluentd-2 dkron[14307]: time="2022-05-02T21:15:28Z" level=info msg="agent: Starting scheduler" node=rcs-hub-pre-redis-fluentd-2
May 02 21:15:28 rcs-hub-pre-redis-fluentd-2 dkron[14307]: time="2022-05-02T21:15:28Z" level=error msg="dkron: failed to establish leadership" error="cron is already configured, can not start scheduler" node=rcs-hub-pre-redis-fluentd-2

So what we are doing is forcing to the previous leader to acquire ladership doing:

1.- Start or Restart former leader server (rcs-hub-pre-redis-fluentd-1 server)
2.- Shuting down the newest leader (in my case rcs-hub-pre-redis-fluentd-2 server)
3.- Wait for leadership acquisition in node rcs-hub-pre-redis-fluentd-1
4.- Startup the follower server rcs-hub-pre-redis-fluentd-2 again.

This is just our workaround, but the clustering suppose to be give us HA that's why we choose Dkron, but the HA is only working in the API to receive new jobs creation but not for execution.

Please, if you can downgrade to an older version, I'd appreciate it if you could let us know if it worked so we can do the same.

@gamer22026
Copy link

Same for us on a 3-node 3.1.11 cluster in k8s (dkron-server-0,dkron-server-1,dkron-server-2) When this happened, nodes 0 and 1 were followers and node 2 the leader. Then the following log messages:

time="2022-05-16T14:15:32Z" level=debug msg="dkron: shutting down leader loop" node=dkron-server-2
time="2022-05-16T14:15:32Z" level=debug msg="scheduler: Stopping scheduler" node=dkron-server-2
time="2022-05-16T14:15:33Z" level=error msg="dkron: failed to establish leadership" error="cron is already configured, can not start scheduler" node=dkron-server-2

After this nodes 0 and 1 were still listed as followers and node 2 still listed as leader, but all scheduled jobs stopped. Restarting the node 2 pod got it working again as the leader then moved to node 1

@seanfulton
Copy link

So here is an update:
After failing _every single day_I upgraded my cluster from 3 nodes to 5. I commented out the following line:
#bootstrap-expect: 3
It failed again on May 2--just to piss me off, I think. I did a cluster restart.

And it has not failed since.... No idea why. Any thoughts from any one? Five better than three?? Getting rid of bootstrap expect? Both? Neither?

With the three node cluster we were seeing saturation of some nodes, which is what triggered the leader election in the first place. Maybe by expanding the cluster I eliminidated the choke point and the problem still exists. Or maybe having five nodes instead of three resolves the problem (although I don't know how since it did fail on May 2 with a five-node cluster).

Any ideas?

@vcastellm
Copy link
Member

vcastellm commented Jun 2, 2022

Main issue here fixed in #1119

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants