
Cluster missing executions #71

Closed
Luzifer opened this issue Feb 5, 2016 · 3 comments · Fixed by #73


@Luzifer

Luzifer commented Feb 5, 2016

As an administrator of a dkron cluster, I want my users to have a reliable cron system that guarantees "at least once" execution of their cron tasks.

My setup:

Three AWS EC2 instances running CoreOS stable. All three machines are part of an etcd cluster and run dkron 0.6.3 as a cluster. All machines have the role: executor tag and the following cron job:

{
    "name": "date",
    "schedule": "0 */5 * * * *",
    "command": "/usr/bin/date",
    "owner": "Knut Ahlers",
    "owner_email": "knut****.com",
    "run_as_user": "core",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-05T10:45:00.00621402Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "executor:1"
    }
}
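
For completeness, a job definition like the one above can also be registered programmatically. Below is a minimal sketch in Go that POSTs the job to dkron's HTTP API; the endpoint path /v1/jobs and port 8080 are assumptions and may differ by dkron version:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Assumed endpoint; check your dkron version's API docs for the exact path and port.
	url := "http://localhost:8080/v1/jobs"

	// A trimmed-down version of the job definition above.
	job := []byte(`{
		"name": "date",
		"schedule": "0 */5 * * * *",
		"command": "/usr/bin/date",
		"run_as_user": "core",
		"tags": {"role": "executor:1"}
	}`)

	resp, err := http.Post(url, "application/json", bytes.NewBuffer(job))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}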

Expected result:

The cron is executed every 5 minutes and will succeed every single time as it just calls /usr/bin/date (which is the location of the date utility on CoreOS).
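
For reference: this is a six-field cron expression with a leading seconds field, so 0 */5 * * * * fires at second 0 of every fifth minute. A minimal sketch of parsing and evaluating such an expression, using the robfig/cron library (dkron's scheduler appears to be based on a variant of it, but treat the exact library as an assumption):

package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Six-field parser: seconds, minutes, hours, day-of-month, month, day-of-week.
	parser := cron.NewParser(
		cron.Second | cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow,
	)

	schedule, err := parser.Parse("0 */5 * * * *")
	if err != nil {
		panic(err)
	}

	// The next three fire times should be exactly 5 minutes apart.
	next := time.Now()
	for i := 0; i < 3; i++ {
		next = schedule.Next(next)
		fmt.Println(next)
	}
}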

Observed result:

The task is sent to one of the machines every 5 minutes and is eventually executed. However, several executions are missing: they have no end date, no running date utility on the machine, and a failed state.

(Screenshot from 2016-02-05 at 11:43 showing the failed executions.)

Question

Can you help me with this / tell me what I'm doing wrong? Is there anything I can do to prevent this?

@vcastellm
Member

Weird one. Could you please enable the debug flag and send me the logs of the three nodes? Thanks!

@Luzifer
Author

Luzifer commented Feb 8, 2016

Found an interesting part in the logs: I did not have a rule in the firewall configuration for that cluster to allow connections on port 6868 (RPC). After opening that port, I get reports for all runs so far.
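
As an aside, a quick way to verify that the RPC port is reachable from another node is a plain TCP dial. A minimal sketch using Go's net package; the peer address is illustrative:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Illustrative peer address; 6868 is the dkron RPC port mentioned above.
	addr := "10.8.4.212:6868"

	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		fmt.Printf("RPC port unreachable: %v\n", err)
		return
	}
	defer conn.Close()
	fmt.Println("RPC port reachable")
}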

However, jobs are still being assigned to a machine which left the cluster (status left instead of failed), even after it left. In that case the job execution just vanishes and is not recorded in the interface…

Logs from the current leader after its takeover from the node which left the cluster at 13:37:

# journalctl -u dkron | grep -E '(246850c9e0a46d3182e2fb2ad24b0d884d2eca13|10-8-4-212)'
[...]
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Retrieved leader from datastore" key=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Retrieved leader from datastore" key=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Leader Swap" new_leader=e3d721b9473b3cb6a9241f34ce7b0f16000ed331 old_leader=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:50:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:50:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 13:55:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:55:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 14:15:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T14:15:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:10:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:10:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
[...]

Logs from 15:00:

# journalctl -u dkron | grep -E 'Feb 08 15:00'
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="scheduler: Run job" job=date schedule="0 */5 * * * *"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-211.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered tags to run: map[role:executor:1]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Sending query" job_name=date json="{\"job_name\":\"date\",\"started_at\":\"0001-01-01T00:00:00Z\",\"finished_at\":\"0001-01-01T00:00:00Z\",\"group\":1454943600000392357,\"job\":{\"name\":\"date\",\"schedule\":\"0 */5 * * * *\",\"command\":\"/usr/bin/date\",\"owner\":\"Knut Ahlers\",\"owner_email\":\"[email protected]\",\"run_as_user\":\"core\",\"success_count\":1,\"error_count\":0,\"last_success\":\"2016-02-08T13:10:00.479231499Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"executor:1\"}}}" query="run:job"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="scheduler: Run job" job="cluster_health" schedule="0 */5 * * * *"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered tags to run: map[role:executor:1]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Sending query" job_name="cluster_health" json="{\"job_name\":\"cluster_health\",\"started_at\":\"0001-01-01T00:00:00Z\",\"finished_at\":\"0001-01-01T00:00:00Z\",\"group\":1454943600000550408,\"job\":{\"name\":\"cluster_health\",\"schedule\":\"0 */5 * * * *\",\"command\":\"/usr/bin/etcdctl cluster-health\",\"owner\":\"Knut Ahlers\",\"owner_email\":\"[email protected]\",\"run_as_user\":\"core\",\"success_count\":0,\"error_count\":0,\"last_success\":\"0001-01-01T00:00:00Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"executor:1\"}}}" query="run:job"
Feb 08 15:00:03 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:03Z" level=debug msg="agent: Done receiving acks and responses" query="run:job"
Feb 08 15:00:03 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:03Z" level=debug msg="agent: Done receiving acks and responses" query="run:job"
Feb 08 15:00:07 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:07Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:07 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:07Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:27 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:27Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:27 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:27Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:47 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:47Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:47 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:47Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331

Both job executions vanished because they were sent to machines which had already left the cluster (replaced by the autoscaling group).

I think this happens because in agent.go:743 there is no check for the state (left/failed) of the serf member, and the leader still has knowledge of former members of the cluster… (But this is only a quick assumption, because I have not read all of your code yet!)
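
For illustration, the kind of status check that seems to be missing could look like the sketch below, written against the hashicorp/serf API. The helper name and surrounding logic are assumptions, not dkron's actual code:

package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

// aliveMembers keeps only members whose serf status is still alive,
// dropping nodes that have left or failed. A check like this in the
// node filtering would prevent jobs from being assigned to machines
// that are already gone. (Hypothetical helper, not dkron's code.)
func aliveMembers(members []serf.Member) []serf.Member {
	alive := make([]serf.Member, 0, len(members))
	for _, m := range members {
		if m.Status == serf.StatusAlive {
			alive = append(alive, m)
		}
	}
	return alive
}

func main() {
	members := []serf.Member{
		{Name: "ip-10-8-3-141", Status: serf.StatusAlive},
		{Name: "ip-10-8-4-212", Status: serf.StatusLeft}, // node replaced by the autoscaling group
	}
	for _, m := range aliveMembers(members) {
		fmt.Println(m.Name) // only the alive node is printed
	}
}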

@vcastellm
Member

@Luzifer both things make sense.

allow connections on port 6868 (RPC)

This is necessary for the clients to send back execution reports of a tolerable size, as serf doesn't support large messages (yet). I'll add a note about it in the docs.

Regarding the executions vanishing due to autoscaling: good catch! This is indeed a bug; I missed checking the status when getting the cluster members.

Expect a hotfix release in the following days.

vcastellm pushed a commit that referenced this issue Feb 10, 2016
Currently we're including nodes in the node filtering that could potentially be gone.

fixes #71