Fatal Concurrency Issue #299
Another issue I have encountered:

```
ERRO[2017-09-08T21:45:37+08:00] rpc: error dialing. err="dial tcp: missing address" node=node2 server_addr=
ERRO[2017-09-08T21:45:37+08:00] agent: Error invoking job command error="dial tcp: missing address" node=node2
```

Let's look at this code (dkron/queries.go, lines 180-200):

```go
for !qr.Finished() {
    select {
    case ack, ok := <-ackCh:
        if ok {
            log.WithFields(logrus.Fields{
                "query": QueryRPCConfig,
                "from":  ack,
            }).Debug("proc: [" + tidStr + "]Received ack")
        }
    case resp, ok := <-respCh:
        if ok {
            log.WithFields(logrus.Fields{
                "query":   QueryRPCConfig,
                "from":    resp.From,
                "payload": string(resp.Payload),
            }).Debug("proc: [" + tidStr + "]Received response")
            rpcAddr = resp.Payload
        }
    }
}
```

Is it possible that qr.Finished() becomes true just after the "case ack, ok := <-ackCh:" branch is executed? In that case the last line, "rpcAddr = resp.Payload", is never reached, which would result in the "missing address" error.
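For what it's worth, here is a minimal, runnable sketch of the suspected race (hypothetical names and values, not the actual Dkron/serf code): if the query is marked finished right after the ack branch runs, the response that carries the RPC address is never read and rpcAddr stays empty.

```go
// Minimal sketch of the suspected race (hypothetical names, not Dkron code):
// the query "finishes" right after the ack is handled, so the response that
// carries the RPC address is never read and rpcAddr stays empty.
package main

import (
    "fmt"
    "time"
)

func main() {
    ackCh := make(chan string)     // ack from the remote node
    respCh := make(chan string, 1) // response carrying the RPC address
    done := make(chan struct{})    // closed when the query is "finished"

    go func() {
        ackCh <- "node2"                  // 1. ack is delivered
        close(done)                       // 2. query finishes immediately after
        time.Sleep(50 * time.Millisecond) // 3. response arrives too late
        respCh <- "192.168.1.2:6868"
    }()

    finished := func() bool {
        select {
        case <-done:
            return true
        default:
            return false
        }
    }

    var rpcAddr string
    for !finished() {
        select {
        case ack := <-ackCh:
            fmt.Println("received ack from", ack)
            time.Sleep(10 * time.Millisecond) // simulate the logging work
        case resp := <-respCh:
            rpcAddr = resp
        }
    }

    // The loop exits before the response is read, so rpcAddr is still "",
    // which would later surface as "dial tcp: missing address".
    fmt.Printf("rpcAddr=%q\n", rpcAddr)
}
```

If this is indeed what happens, one common pattern is to keep receiving until both channels are closed (checking the ok flag) instead of re-evaluating Finished() between receives, so a response that is already in flight is not dropped.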
And... another one:

```go
for _, member := range a.serf.Members() {
    if member.Status == serf.StatusAlive {
        for mtk, mtv := range member.Tags { // <---------------- LOOK AT HERE!
            if mtk == jtk && mtv == tv {
                if len(nodes) < count {
                    nodes = append(nodes, member.Name)
                }
            }
        }
    }
}
```

In Go, when ranging over a map, the iteration order is "arbitrary", but not technically "random".
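As a side note, here is a tiny standalone snippet (not project code) showing what "arbitrary" means in practice: the Go runtime varies the order between range loops over the same map, but the spec only says the order is unspecified; it does not promise a uniform random distribution.

```go
// Tiny standalone demo: iterating the same map several times usually yields
// different orders, but no particular distribution is guaranteed.
package main

import "fmt"

func main() {
    nodes := map[string]string{"node1": "web", "node2": "web", "node3": "web"}
    for i := 0; i < 3; i++ {
        for name := range nodes {
            fmt.Print(name, " ")
        }
        fmt.Println()
    }
}
```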
@vaporz Regarding the last range [map], I don't get your point; I know about the "arbitrary" order of maps in Go. In this code it doesn't matter if it's ordered, random, or arbitrary. It is comparing elements from the …
Very improbable but not verified.
You tell me. I mean, when you see …, does the master node output …?
Regarding the original issue: I followed the steps but it seems something is missing. In steps 3 and 4, are you deleting jobs? Because otherwise, it works as expected for me.
Hi,
Sorry, I didn't make myself clear. I was testing with 3 nodes and created a job with the tag "web:1". I have observed that the job is not executed on each node at a 1:1:1 rate; the leader node seems to have a higher chance. So I looked into the code and tested "range map". I think "range map" is causing this, because it is not "random".
Yes... maybe I was wrong, but the error occurs often. Here is the log:
No, I can reproduce this issue exactly with those steps. I just start 3 nodes and create 5 jobs, and two of the nodes crash.

```go
// Unlock the "key". Calling unlock while
// not holding the lock will throw an error
func (l *etcdLock) Unlock() error {
    if l.stopLock != nil {
        l.stopLock <- struct{}{}
    }
    if l.last != nil {
        delOpts := &etcd.DeleteOptions{
            PrevIndex: l.last.Node.ModifiedIndex,
        }
        _, err := l.client.Delete(context.Background(), l.key, delOpts)
        if err != nil {
            return err
        }
    }
    return nil
}
```
As reported in #299, this fixes a race condition with the job update after execution. In this PR the lock is refactored into an atomic job put that reads the job, updates it, and performs a CAS operation to write it back to the store. If the update fails, it retries until it succeeds. This ensures there is no race condition between nodes and effectively fixes the reported error.
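For illustration, here is a hedged sketch of the read-modify-CAS retry pattern described above, using libkv's AtomicPut. It shows the approach only; the code merged in the PR may differ, and the Job struct and AtomicJobPut helper are simplified stand-ins.

```go
// Sketch of an atomic job put with CAS retry (illustration only, not the
// PR's actual code). Job is a simplified stand-in for Dkron's job struct.
package jobstore

import (
    "encoding/json"

    "github.com/docker/libkv/store"
)

type Job struct {
    Name         string `json:"name"`
    SuccessCount int    `json:"success_count"`
}

// AtomicJobPut reads the job, applies update, and writes it back with a
// compare-and-swap, retrying until the write succeeds.
func AtomicJobPut(kv store.Store, key string, update func(*Job)) error {
    for {
        pair, err := kv.Get(key)
        if err != nil {
            return err
        }

        var job Job
        if err := json.Unmarshal(pair.Value, &job); err != nil {
            return err
        }
        update(&job)

        value, err := json.Marshal(&job)
        if err != nil {
            return err
        }

        // AtomicPut only succeeds if the stored value still matches pair;
        // if another node updated the job in between, retry from the Get.
        ok, _, err := kv.AtomicPut(key, value, pair, nil)
        if ok {
            return nil
        }
        if err != nil && err != store.ErrKeyModified {
            return err
        }
    }
}
```

The trade-off compared with a distributed lock is that a failed CAS simply retries from a fresh read, so no lock has to be held, or released, across the update.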
Main issue fixed in #355; all other issues were previously fixed in other PRs.
Dkron version: master (commit 14c32d2)
etcd Version: 3.2.7
Git SHA: bb66589
Go Version: go1.8.3
Go OS/Arch: darwin/amd64
Reproduce steps (very simple):
0. Shared dkron config:
1. Start 3 nodes:
2. Create 5 jobs (see the sketch after the job definitions below):
```json
{
    "name": "echo1",
    "command": "/bin/echo `date` >> /Users/xiaozhang/goworkspace/src/test.log",
    "schedule": "@every 4s",
    "shell": true,
    "tags": {
        "role": "web"
    }
}
```

...

```json
{
    "name": "echo5",
    "command": "/bin/echo `date` >> /Users/xiaozhang/goworkspace/src/test.log",
    "schedule": "@every 4s",
    "shell": true,
    "tags": {
        "role": "web"
    }
}
```
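In case it helps reproduce, here is a rough sketch of how step 2 could be scripted. The localhost:8080 address and the /v1/jobs path are assumptions based on Dkron's defaults and may need adjusting for your setup.

```go
// Rough sketch for creating the 5 test jobs via the Dkron HTTP API.
// The address and /v1/jobs path below are assumed defaults.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

func main() {
    for i := 1; i <= 5; i++ {
        job := map[string]interface{}{
            "name":     fmt.Sprintf("echo%d", i),
            "command":  "/bin/echo `date` >> /Users/xiaozhang/goworkspace/src/test.log",
            "schedule": "@every 4s",
            "shell":    true,
            "tags":     map[string]string{"role": "web"},
        }
        body, err := json.Marshal(job)
        if err != nil {
            log.Fatal(err)
        }

        // Assumed endpoint: Dkron's default HTTP API on localhost:8080.
        resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
        fmt.Println(job["name"], "->", resp.Status)
    }
}
```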
3. Wait. One of the servers should crash within 10 seconds (or 10 minutes); log:
4. Wait. Another server should crash within 10 seconds, too:
Root Cause
The error is from "dkron/job.go":
etcd debug log:
Tracing into the etcd client (dkron/vendor/github.com/docker/libkv/store/etcd/etcd.go, lines 488-515):
One question
If a job is created to be executed on multiple nodes, is it necessary to call job.Lock() and job.Unlock()?