Fatal Concurrent Issue #299

Closed
vaporz opened this issue Sep 8, 2017 · 7 comments
vaporz commented Sep 8, 2017

Dkron version: master (commit 14c32d2)
etcd Version: 3.2.7
Git SHA: bb66589
Go Version: go1.8.3
Go OS/Arch: darwin/amd64

Steps to reproduce (very simple):

0. Shared dkron config:

{
  "backend": "etcd",
  "backend_machine": "127.0.0.1:2379",
  "tags": {
    "role": "web",
    "datacenter": "east"
  },
  "keyspace": "dkron",
  "encrypt": "5jmSmVFpMRYA9mQh+sBKpA=="
}

1. Start 3 nodes:

./dkron agent -server=true -node=node1 -bind=0.0.0.0:8946 -join=0.0.0.0:8947 -join=0.0.0.0:8948 -http-addr=:8080 -rpc-port=6868
./dkron agent -server=true -node=node2 -bind=0.0.0.0:8947 -join=0.0.0.0:8946 -join=0.0.0.0:8948 -http-addr=:8081 -rpc-port=6869
./dkron agent -server=true -node=node3 -bind=0.0.0.0:8948 -join=0.0.0.0:8946 -join=0.0.0.0:8947 -http-addr=:8082 -rpc-port=6870

2. Create 5 jobs:

{
  "name": "echo1",
  "command": "/bin/echo date >> /Users/xiaozhang/goworkspace/src/test.log",
  "schedule": "@every 4s",
  "shell": true,
  "tags": {
    "role": "web"
  }
}
...
...
{
  "name": "echo5",
  "command": "/bin/echo date >> /Users/xiaozhang/goworkspace/src/test.log",
  "schedule": "@every 4s",
  "shell": true,
  "tags": {
    "role": "web"
  }
}

3. Wait

One of the servers will crash within 10 seconds (or 10 minutes). Log:

FATA[2017-09-08T19:14:24+08:00] rpc:100: Key not found (/dkron/job_locks/echo5) [94661]  node=node2

4. Wait

Another server will crash within 10 seconds, too:

FATA[2017-09-08T19:14:44+08:00] rpc:100: Key not found (/dkron/job_locks/echo1) [94831]  node=node3

Root Cause

The error is from "dkron/job.go":

// Lock the job in store
func (j *Job) Lock() error {
......
      _, err = j.lock.Lock(nil)  // <------ err returned!
......
}

etcd debug log:

2017-09-08 19:14:44.267965 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=false remote:127.0.0.1:55868
2017-09-08 19:14:44.282722 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=true&prevIndex=94821 remote:127.0.0.1:55858
2017-09-08 19:14:44.329762 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=false remote:127.0.0.1:55807

(HERE: "dkron/job_locks/echo1" is deleted)
2017-09-08 19:14:44.340926 D | etcdserver/api/v2http: [DELETE] /v2/keys/dkron/job_locks/echo1?prevIndex=94822 remote:127.0.0.1:55895

(HERE: PUT on "dkron/job_locks/echo1" with "prevExist=true" against a key that no longer exists, and BOOM!)
2017-09-08 19:14:44.340953 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=true&prevIndex=18446744073709551615 remote:127.0.0.1:55889

Tracing into the etcd client (dkron/vendor/github.com/docker/libkv/store/etcd/etcd.go, lines 488-515):

setOpts.PrevExist = etcd.PrevNoExist
resp, err := l.client.Set(context.Background(), l.key, l.value, setOpts)
if err != nil {
	if etcdError, ok := err.(etcd.Error); ok {
		if etcdError.Code != etcd.ErrorCodeNodeExist {
			return nil, err
		}
		setOpts.PrevIndex = ^uint64(0)
	}
} else {
	setOpts.PrevIndex = resp.Node.ModifiedIndex
}

// !!!!!!!  at this point, another server has unlocked the same job, and deleted node "dkron/job_locks/echo1"

setOpts.PrevExist = etcd.PrevExist // <-----------prevExist=true
l.last, err = l.client.Set(context.Background(), l.key, l.value, setOpts) // set with "prevExist=true", an error is returned

if err == nil {
	// Leader section
	l.stopLock = stopLocking
	go l.holdLock(l.key, lockHeld, stopLocking)
	break
} else {
	// If this is a legitimate error, return
	if etcdError, ok := err.(etcd.Error); ok {
		if etcdError.Code != etcd.ErrorCodeTestFailed {
			return nil, err
		}
	}
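
The second Set is issued with prevExist=true against a key that may have just been deleted, which is exactly the "Key not found" (code 100) error in the fatal log. As a hedged sketch (not the fix the maintainers eventually shipped, see the end of this thread), the error handling could also treat etcd.ErrorCodeKeyNotFound as a retryable condition, so the acquisition loop restarts instead of returning:

// a minimal sketch, assuming the same coreos/etcd client libkv uses:
// tolerate the key vanishing between the two Set calls by retrying.
if err != nil {
	if etcdError, ok := err.(etcd.Error); ok {
		// ErrorCodeTestFailed: another holder still owns the lock, keep waiting.
		// ErrorCodeKeyNotFound: the holder unlocked (deleted the key)
		// between our two Sets; restart the loop instead of failing.
		if etcdError.Code != etcd.ErrorCodeTestFailed &&
			etcdError.Code != etcd.ErrorCodeKeyNotFound {
			return nil, err
		}
	}
}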

One question

If a job is created to be executed on multiple nodes, is it necessary to call job.Lock() and job.Unlock()?


vaporz commented Sep 8, 2017

Another issue I have encountered:

ERRO[2017-09-08T21:45:37+08:00] rpc: error dialing.                           err="dial tcp: missing address" node=node2 server_addr=
ERRO[2017-09-08T21:45:37+08:00] agent: Error invoking job command             error="dial tcp: missing address" node=node2

Let's look at this code (dkron/queries.go, lines 180-200):

for !qr.Finished() {
	select {
	case ack, ok := <-ackCh:
		if ok {
			log.WithFields(logrus.Fields{
				"query": QueryRPCConfig,
				"from":  ack,
			}).Debug("proc: [" + tidStr + "]Received ack")
		}
	case resp, ok := <-respCh:
		if ok {
			log.WithFields(logrus.Fields{
				"query":   QueryRPCConfig,
				"from":    resp.From,
				"payload": string(resp.Payload),
			}).Debug("proc: [" + tidStr + "]Received response")

			rpcAddr = resp.Payload
		}
	}
}

Is it possible that qr.Finished() returns true right after the "case ack, ok := <-ackCh:" branch executes?

In that case, the last line, "rpcAddr = resp.Payload", never executes, which results in the "missing address" error?
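
If that race is real, a hedged sketch of one way to rule it out: stop polling qr.Finished() and instead drain both channels until serf closes them when the query completes, so the final response can never be missed. The function and variable names here are illustrative, not dkron's:

// a minimal sketch, assuming serf closes ackCh and respCh when the
// query finishes or times out; nil-ing a closed channel removes it
// from the select, so the loop exits only after both are drained.
func collectRPCAddr(ackCh <-chan string, respCh <-chan serf.NodeResponse) []byte {
	var rpcAddr []byte
	for ackCh != nil || respCh != nil {
		select {
		case ack, ok := <-ackCh:
			if !ok {
				ackCh = nil
				continue
			}
			_ = ack // log the ack, as in the original loop
		case resp, ok := <-respCh:
			if !ok {
				respCh = nil
				continue
			}
			rpcAddr = resp.Payload // runs for every response, including the last
		}
	}
	return rpcAddr
}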


vaporz commented Sep 8, 2017

And... another one. Here:

for _, member := range a.serf.Members() {
	if member.Status == serf.StatusAlive {
		for mtk, mtv := range member.Tags {  // <---------------- LOOK AT HERE!
			if mtk == jtk && mtv == tv {
				if len(nodes) < count {
					nodes = append(nodes, member.Name)
				}
			}
		}
	}
}

In Go, when you range over a map, the iteration order is "arbitrary", but not technically "random".
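
If the goal is an even spread, a hedged sketch of one alternative: collect every matching member first, then shuffle with math/rand and take count of them, instead of depending on map iteration order (variable names follow the snippet above; rand.Shuffle needs Go 1.10+):

// a minimal sketch, not dkron's code: gather candidates, then pick
// `count` of them uniformly at random instead of relying on map order.
candidates := []string{}
for _, member := range a.serf.Members() {
	if member.Status == serf.StatusAlive && member.Tags[jtk] == tv {
		candidates = append(candidates, member.Name)
	}
}
rand.Shuffle(len(candidates), func(i, j int) {
	candidates[i], candidates[j] = candidates[j], candidates[i]
})
if len(candidates) > count {
	candidates = candidates[:count]
}
nodes = append(nodes, candidates...)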

vcastellm (Member) commented

@vaporz Regarding the last range over a map: I don't get your point. I know about the "arbitrary" order of maps in Go, but in this code it doesn't matter whether it's ordered, random, or arbitrary.

It is comparing elements from the tags map with elements of the member.Tags map; order is not important. What's your point?

vcastellm (Member) commented

> Is it possible that qr.Finished() returns true right after the "case ack, ok := <-ackCh:" branch executes?

Very improbable, but not verified. qr.Finished() is false until all responses have been received or a timeout is reached.

> In that case, the last line, "rpcAddr = resp.Payload", never executes, which results in the "missing address" error?

You tell me. I mean, when you see

agent: Error invoking job command error="dial tcp: missing address" node=node2

does the master node output "proc: Received response" with the payload?

vcastellm (Member) commented

Regarding the original issue: I followed the steps, but it seems something is missing. In steps 3 and 4, are you deleting jobs?

Because otherwise, it works as expected for me.


vaporz commented Sep 12, 2017

Hi,

> It is comparing elements from the tags map with elements of the member.Tags map; order is not important. What's your point?

Sorry, I didn't make myself clear. I was testing with 3 nodes and created a job with the tag "web:1". I observed that the job is not executed across the nodes at a 1:1:1 rate; the leader node seems to get picked more often. So I looked into the code and tested "range map". I think "range map" is causing this, because it is not truly "random".

> Does the master node output "proc: Received response" with the payload?

Yes... maybe I was wrong, but the error occurs often. Here is the log:
https://github.com/vaporz/dkron/blob/master/debug.log

> Regarding the original issue: I followed the steps, but it seems something is missing. In steps 3 and 4, are you deleting jobs?

No, I can reproduce this issue exactly with those steps. I just start 3 nodes, create 5 jobs, and two of the nodes crash by themselves.
Will job.Unlock() delete a key like dkron/job_locks/echo1?
This is etcdLock.Unlock(); doesn't l.client.Delete() delete that key?
Maybe this issue only happens with etcd?

// Unlock the "key". Calling unlock while
// not holding the lock will throw an error
func (l *etcdLock) Unlock() error {
	if l.stopLock != nil {
		l.stopLock <- struct{}{}
	}
	if l.last != nil {
		delOpts := &etcd.DeleteOptions{
			PrevIndex: l.last.Node.ModifiedIndex,
		}
		_, err := l.client.Delete(context.Background(), l.key, delOpts)
		if err != nil {
			return err
		}
	}
	return nil
}
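
Reading that code, l.last is set by the successful Set in Lock(), so Unlock() does delete the lock key. Combined with the etcd debug log above, the crash interleaving reconstructs roughly as follows (an interpretation, not verified):

// a reconstruction from the etcd debug log above (assumed, not verified):
//   node A: Set(key, prevExist=false)   -> fails, key exists, so
//           setOpts.PrevIndex = ^uint64(0) (18446744073709551615)
//   node B: Unlock() -> Delete(key)     -> the lock key is gone
//   node A: Set(key, prevExist=true, prevIndex=^uint64(0))
//                                        -> "Key not found" (code 100)
//   dkron treats the Lock() error as fatal (FATA) and the node exits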

@vcastellm vcastellm added the bug label Apr 25, 2018
@vcastellm vcastellm added this to the v0.10.0 milestone Apr 25, 2018
vcastellm pushed a commit that referenced this issue Apr 27, 2018
As reported in #299.

This fixes a race condition with the job update after execution.

In this PR the lock is refactored into an atomic job put, which reads the job, updates it, and performs a CAS operation to write it back to the store.

If this update fails, it retries until the job is updated.

This ensures no race condition between nodes and effectively fixes the reported error.
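
A hedged sketch of the read-modify-CAS-retry pattern that commit message describes, using libkv's AtomicPut; client, jobKey, decodeJob, encodeJob, and applyExecution are hypothetical names, not the PR's actual code:

// a minimal sketch of the pattern, not the actual PR: read the job,
// update it in memory, and CAS it back, retrying on contention.
for {
	kv, err := client.Get(jobKey) // current value plus its store index
	if err != nil {
		return err
	}
	job := decodeJob(kv.Value) // hypothetical decode helper
	applyExecution(job)        // apply the post-execution update in memory
	ok, _, err := client.AtomicPut(jobKey, encodeJob(job), kv, nil)
	if ok {
		return nil // CAS succeeded; no lock was ever needed
	}
	if err != nil && err != store.ErrKeyModified {
		return err // a genuine store error, give up
	}
	// another node won the CAS; re-read and retry
}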
vcastellm (Member) commented

Main issue fixed in #355; all other issues were previously fixed in other PRs.
