Scheduled jobs stop execution after leader re-election #998
Comments
Logs attached.
I encountered the same situation. This happens between 2021-08-03T02:34:36 and 2021-08-03T02:34:40.
I have the same issue on a three-node cluster. This is the third time my dkron cluster has stopped functioning. This is a pretty major issue; you promote this software as fault-tolerant and mission-critical, yet this issue has been known for months with no response or solution. I started using dkron as a pilot to see if it's worth going with the pro version, but my team is pressuring me to find another solution because this is unreliable. Does anyone have a solution or workaround? We are using 3.1.10. The log file is attached. You can see that the node that failed (not the master) rejoins the cluster later, but no jobs run after that.
Could any of you specify whether it's all jobs, or some of them execute and others don't? The problem with this issue is that I have not been able to reproduce it, and that makes it pretty difficult to solve.
No new jobs are started after the election.
OK, the build is currently in progress, please try with
We used the beta for a few months and did not have an issue. I upgraded to 3.1.11-1 and it has happened again. dcos02 did not fail, it just missed a heartbeat. You can see from the logs that it later comes back online. There is no leader re-election, which I take to mean it was not the leader at the time of failure. Now, 24 hours later, dcos01 is the leader. This is a big problem. This program is core to our operations, and we can't use it if it spontaneously stops working.

Apr 22 10:57:01 nj-dcos01-cl01 dkron: time="2022-04-22T10:57:01-04:00" level=debug msg="store: Setting key" execution="executions:import-queue:1650639421031236137-nj-dcos01-cl01.onecount.net" finished
@seanfulton it's good to know that it was working well before the upgrade, but the fix is included in the latest release, so it should work the same. Did you use all the betas up to beta3? Did the scheduler start when dcos01 became the new leader?
I don't know that it worked prior; it could have been a coincidence. I only upgraded to the new version last week, and I had upgraded to beta1 when it became available. So it could possibly be the upgrade, or it could just be that the trigger condition didn't happen to occur while I was running the beta. No, jobs did not start once dcos01 was elected leader. That's the problem. There is a brief network interruption of some kind for the leader, another leader is elected, and then nothing runs... until I restart the whole cluster. This is a 3-node cluster.
Happened again this morning. Something about a peer having a newer term? It looks like dcos01 stepped down, then elected itself leader again. Then no more jobs run after that (it looks like one or two jobs ran in the middle of this, but you can see that after 11:38 everything stops).
sean
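For anyone else debugging this, one way to check which node the cluster considers leader and whether the scheduler still has runs queued is the HTTP API. This is only a sketch: it assumes the default HTTP port 8080 and the /v1/leader, /v1/members and /v1/jobs endpoints, and the JSON field names used below (such as next) are assumptions.

```sh
# Which node does the cluster think is the leader?
curl -s http://localhost:8080/v1/leader | jq .

# Are all servers still listed as alive members? (field names assumed)
curl -s http://localhost:8080/v1/members | jq '.[] | {Name, Status}'

# Do jobs still have a next run scheduled, or did the scheduler stop?
curl -s http://localhost:8080/v1/jobs | jq '.[] | {name, next, last_success}'
```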
I was able to reproduce it now. Since dcos02 kept getting disconnected, I tried moving it to a less heavily loaded host. dcos01 was master. So by taking dcos02 down and bringing it up again, the jobs stopped running:

Apr 25 15:26:01 nj-dcos01-cl01 dkron: time="2022-04-25T15:26:01-04:00" level=debug msg="store: Retrieved job from datastore" job=import-queue node=nj-dcos01-cl01.onecount.net
This is a daily occurrence now. Ping fails, leader election, no more jobs. Help!

Apr 26 03:35:10 nj-dcos01-cl01 dkron: time="2022-04-26T03:35:10-04:00" level=debug msg="store: Setting job" job=wly-mapping node=nj-dcos01-cl01.onecount.net
Any ideas would be appreciated. I understand the event is being triggered by one of the three nodes becoming unavailable, but the cluster has to survive a momentary bounce. In case it makes any difference, I have this set (the docs describe it as providing the number of expected servers in the datacenter: either the value should not be provided or it must agree with the other servers in the cluster; when provided, Dkron waits until the specified number of servers are available and then bootstraps the cluster, which allows an initial leader to be elected automatically; the flag requires server mode):

bootstrap-expect: 3
This is happening every day. It has something to do with a peer having a newer term. I upgraded my cluster to 5 nodes, reduced the workload, and it is still happening. I am going to revert to an earlier version to see if I can get this stabilized. This is crazy.
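For reference, a rough command-line equivalent of how the servers are started with that setting (a sketch only; the flag names mirror the keys in dkron.yml, and the hostnames are placeholders for the three servers):

```sh
# Sketch: start a dkron server with bootstrap-expect=3.
# Hostnames are placeholders; adjust to your cluster.
dkron agent \
  --server \
  --bootstrap-expect=3 \
  --retry-join=dcos01.example.net \
  --retry-join=dcos02.example.net \
  --retry-join=dcos03.example.net \
  --log-level=debug
```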
May 2 08:43:22 nj-dcos05-cl01 dkron: time="2022-05-02T08:43:22-04:00" level=info msg="2022-05-02T08:43:22.112-0400 [INFO] raft: added peer, starting replication: peer=nj-dcos01-cl01.onecount.net
We are experiencing the same since version 3.1.11. Every time the leader goes down, another node takes over the leadership but is not able to start the scheduler, and it keeps logging the following errors:

May 02 21:15:04 rcs-hub-pre-redis-fluentd-2 dkron[14307]: time="2022-05-02T21:15:04Z" level=info msg="2022/05/02 21:15:04 [DEBUG] memberlist: Initiating push/pull sync with: rcs-hub-pre-redis-fluentd 10.128.0.7:8946"

So what we are doing is forcing the previous leader to acquire leadership again:

1. Start or restart the former leader server (rcs-hub-pre-redis-fluentd-1).

This is just our workaround, but the clustering is supposed to give us HA; that's why we chose Dkron. Right now the HA only works for the API (accepting new job creations), not for execution. Please, if you downgrade to an older version, I'd appreciate it if you could let us know whether it worked so we can do the same.
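To spell the workaround out, this is roughly what we run (a sketch; the hostname, the systemd service name and the default HTTP port 8080 with the /v1/leader and /v1/jobs/.../run endpoints are assumptions based on our setup):

```sh
# Restart the former leader so it can reclaim leadership.
ssh rcs-hub-pre-redis-fluentd-1 'sudo systemctl restart dkron'

# From any node, check which server is leader now.
curl -s http://localhost:8080/v1/leader

# Optionally trigger one job by hand to confirm the scheduler runs again
# (<job-name> is a placeholder).
curl -s -X POST http://localhost:8080/v1/jobs/<job-name>/run
```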
Same for us on a 3-node 3.1.11 cluster in k8s (dkron-server-0, dkron-server-1, dkron-server-2). When this happened, nodes 0 and 1 were followers and node 2 was the leader. Then the following log messages:
After this, nodes 0 and 1 were still listed as followers and node 2 was still listed as leader, but all scheduled jobs stopped. Restarting the node 2 pod got it working again, as the leader then moved to node 1.
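For the k8s case, the recovery was just deleting the stuck leader pod and letting the StatefulSet recreate it (a sketch; the "dkron" namespace, the pod names and the default HTTP port 8080 are from our deployment and may differ):

```sh
# Delete the stuck leader pod; leadership should move to another node.
kubectl delete pod dkron-server-2 -n dkron

# Watch the pod come back.
kubectl get pods -n dkron -w

# Then check who the cluster reports as leader (assumes /v1/leader on port 8080).
kubectl port-forward dkron-server-0 8080:8080 -n dkron &
sleep 2
curl -s http://localhost:8080/v1/leader
```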
So here is an update: and it has not failed since... No idea why. Any thoughts from anyone? Is five better than three? Getting rid of bootstrap-expect? Both? Neither? With the three-node cluster we were seeing saturation of some nodes, which is what triggered the leader election in the first place. Maybe by expanding the cluster I eliminated the choke point and the problem still exists. Or maybe having five nodes instead of three resolves the problem (although I don't know how, since it did fail on May 2 with a five-node cluster). Any ideas?
Main issue here fixed in #1119 |
My cluster consists of 3 server nodes (Dkron 3.1.7):
All nodes are configured as follows:
This cluster has been running for ~3 weeks and has been able to execute scheduled jobs.
Issue details:
One of the nodes is unable to contact the other 2 nodes (could be an intermittent network issue), causing a re-election. After the re-election, none of the scheduled jobs are executing.
Log snippet from one of the nodes. The complete logs from the 3 nodes are attached.
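Since reproduction seems to be the blocker, a crude way to simulate the intermittent partition on one of the server nodes is to briefly drop its cluster traffic so it misses heartbeats and triggers a re-election. This is only a sketch; it assumes the default serf port 8946 (seen in the logs above) and the default gRPC/raft port 6868.

```sh
# Block cluster traffic on one server node for ~30 seconds...
sudo iptables -A INPUT -p tcp --dport 8946 -j DROP
sudo iptables -A INPUT -p udp --dport 8946 -j DROP
sudo iptables -A INPUT -p tcp --dport 6868 -j DROP
sleep 30
# ...then restore it.
sudo iptables -D INPUT -p tcp --dport 8946 -j DROP
sudo iptables -D INPUT -p udp --dport 8946 -j DROP
sudo iptables -D INPUT -p tcp --dport 6868 -j DROP

# Afterwards, check whether scheduled jobs still fire on the (possibly new) leader.
```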