Fatal Concurrent Issue #299

Closed
vaporz opened this issue Sep 8, 2017 · 7 comments
vaporz commented Sep 8, 2017

Dkron version: master (commit 14c32d2)
etcd Version: 3.2.7
Git SHA: bb66589
Go Version: go1.8.3
Go OS/Arch: darwin/amd64

Steps to reproduce (very simple):

0. Shared dkron config:

{
  "backend": "etcd",
  "backend_machine": "127.0.0.1:2379",
  "tags": {
    "role": "web",
    "datacenter": "east"
  },
  "keyspace": "dkron",
  "encrypt": "5jmSmVFpMRYA9mQh+sBKpA=="
}

1. Start 3 nodes:

./dkron agent -server=true -node=node1 -bind=0.0.0.0:8946 -join=0.0.0.0:8947 -join=0.0.0.0:8948 -http-addr=:8080 -rpc-port=6868
./dkron agent -server=true -node=node2 -bind=0.0.0.0:8947 -join=0.0.0.0:8946 -join=0.0.0.0:8948 -http-addr=:8081 -rpc-port=6869
./dkron agent -server=true -node=node3 -bind=0.0.0.0:8948 -join=0.0.0.0:8946 -join=0.0.0.0:8947 -http-addr=:8082 -rpc-port=6870

2. Create 5 jobs:

{
  "name": "echo1",
  "command": "/bin/echo date >> /Users/xiaozhang/goworkspace/src/test.log",
  "schedule": "@every 4s",
  "shell": true,
  "tags": {
    "role": "web"
  }
}
...
...
{
  "name": "echo5",
  "command": "/bin/echo date >> /Users/xiaozhang/goworkspace/src/test.log",
  "schedule": "@every 4s",
  "shell": true,
  "tags": {
    "role": "web"
  }
}

3. Wait

One of the servers will crash within 10 seconds (or 10 minutes). Log:

FATA[2017-09-08T19:14:24+08:00] rpc:100: Key not found (/dkron/job_locks/echo5) [94661]  node=node2

4. Wait

Another server will crash within 10 seconds, too:

FATA[2017-09-08T19:14:44+08:00] rpc:100: Key not found (/dkron/job_locks/echo1) [94831]  node=node3

Root Cause

The error is from "dkron/job.go":

// Lock the job in store
func (j *Job) Lock() error {
......
      _, err = j.lock.Lock(nil)  // <------ err returned!
......
}

etcd debug log:

2017-09-08 19:14:44.267965 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=false remote:127.0.0.1:55868
2017-09-08 19:14:44.282722 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=true&prevIndex=94821 remote:127.0.0.1:55858
2017-09-08 19:14:44.329762 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=false remote:127.0.0.1:55807

(HERE: "dkron/job_locks/echo1" is deleted)
2017-09-08 19:14:44.340926 D | etcdserver/api/v2http: [DELETE] /v2/keys/dkron/job_locks/echo1?prevIndex=94822 remote:127.0.0.1:55895

(HERE: PUT on "dkron/job_locks/echo1" with "prevExist=true" against a key that no longer exists, and BOOM!)
2017-09-08 19:14:44.340953 D | etcdserver/api/v2http: [PUT] /v2/keys/dkron/job_locks/echo1?prevExist=true&prevIndex=18446744073709551615 remote:127.0.0.1:55889

Tracing into the etcd client (dkron/vendor/github.com/docker/libkv/store/etcd/etcd.go, lines 488-515):

setOpts.PrevExist = etcd.PrevNoExist
resp, err := l.client.Set(context.Background(), l.key, l.value, setOpts)
if err != nil {
	if etcdError, ok := err.(etcd.Error); ok {
		if etcdError.Code != etcd.ErrorCodeNodeExist {
			return nil, err
		}
		setOpts.PrevIndex = ^uint64(0)
	}
} else {
	setOpts.PrevIndex = resp.Node.ModifiedIndex
}

// !!!!!!!  at this point, another server has unlocked the same job, and deleted node "dkron/job_locks/echo1"

setOpts.PrevExist = etcd.PrevExist // <-----------prevExist=true
l.last, err = l.client.Set(context.Background(), l.key, l.value, setOpts) // set with "prevExist=true", an error is returned

if err == nil {
	// Leader section
	l.stopLock = stopLocking
	go l.holdLock(l.key, lockHeld, stopLocking)
	break
} else {
	// If this is a legitimate error, return
	if etcdError, ok := err.(etcd.Error); ok {
		if etcdError.Code != etcd.ErrorCodeTestFailed {
			return nil, err
		}
	}
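
The second Set is issued with prevExist=true against a key that may have just been deleted, which is exactly the "Key not found" (code 100) error in the fatal log. As a hedged sketch (not the fix the maintainers eventually shipped, see the end of this thread), the error handling could also treat etcd.ErrorCodeKeyNotFound as a retryable condition, so the acquisition loop restarts instead of returning:

// a minimal sketch, assuming the same coreos/etcd client libkv uses:
// tolerate the key vanishing between the two Set calls by retrying.
if err != nil {
	if etcdError, ok := err.(etcd.Error); ok {
		// ErrorCodeTestFailed: another holder still owns the lock, keep waiting.
		// ErrorCodeKeyNotFound: the holder unlocked (deleted the key)
		// between our two Sets; restart the loop instead of failing.
		if etcdError.Code != etcd.ErrorCodeTestFailed &&
			etcdError.Code != etcd.ErrorCodeKeyNotFound {
			return nil, err
		}
	}
}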

One question

If a job is created to be executed on multiple nodes, is it necessary to call job.Lock() and job.Unlock()?


vaporz commented Sep 8, 2017

Another issue I have encountered:

ERRO[2017-09-08T21:45:37+08:00] rpc: error dialing.                           err="dial tcp: missing address" node=node2 server_addr=
ERRO[2017-09-08T21:45:37+08:00] agent: Error invoking job command             error="dial tcp: missing address" node=node2

Let's look at this code (dkron/queries.go, lines 180-200):

for !qr.Finished() {
	select {
	case ack, ok := <-ackCh:
		if ok {
			log.WithFields(logrus.Fields{
				"query": QueryRPCConfig,
				"from":  ack,
			}).Debug("proc: [" + tidStr + "]Received ack")
		}
	case resp, ok := <-respCh:
		if ok {
			log.WithFields(logrus.Fields{
				"query":   QueryRPCConfig,
				"from":    resp.From,
				"payload": string(resp.Payload),
			}).Debug("proc: [" + tidStr + "]Received response")

			rpcAddr = resp.Payload
		}
	}
}

Is it possible that qr.Finished() returns true right after the "case ack, ok := <-ackCh:" branch executes?

In that case, the last line, "rpcAddr = resp.Payload", never executes, which results in the "missing address" error?
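
If that race is real, a hedged sketch of one way to rule it out: stop polling qr.Finished() and instead drain both channels until serf closes them when the query completes, so the final response can never be missed. The function and variable names here are illustrative, not dkron's:

// a minimal sketch, assuming serf closes ackCh and respCh when the
// query finishes or times out; nil-ing a closed channel removes it
// from the select, so the loop exits only after both are drained.
func collectRPCAddr(ackCh <-chan string, respCh <-chan serf.NodeResponse) []byte {
	var rpcAddr []byte
	for ackCh != nil || respCh != nil {
		select {
		case ack, ok := <-ackCh:
			if !ok {
				ackCh = nil
				continue
			}
			_ = ack // log the ack, as in the original loop
		case resp, ok := <-respCh:
			if !ok {
				respCh = nil
				continue
			}
			rpcAddr = resp.Payload // runs for every response, including the last
		}
	}
	return rpcAddr
}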


vaporz commented Sep 8, 2017

And... another one. Here:

for _, member := range a.serf.Members() {
	if member.Status == serf.StatusAlive {
		for mtk, mtv := range member.Tags {  // <---------------- LOOK AT HERE!
			if mtk == jtk && mtv == tv {
				if len(nodes) < count {
					nodes = append(nodes, member.Name)
				}
			}
		}
	}
}

In Go, when you range over a map, the iteration order is "arbitrary", but not technically "random".
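
If the goal is an even spread, a hedged sketch of one alternative: collect every matching member first, then shuffle with math/rand and take count of them, instead of depending on map iteration order (variable names follow the snippet above; rand.Shuffle needs Go 1.10+):

// a minimal sketch, not dkron's code: gather candidates, then pick
// `count` of them uniformly at random instead of relying on map order.
candidates := []string{}
for _, member := range a.serf.Members() {
	if member.Status == serf.StatusAlive && member.Tags[jtk] == tv {
		candidates = append(candidates, member.Name)
	}
}
rand.Shuffle(len(candidates), func(i, j int) {
	candidates[i], candidates[j] = candidates[j], candidates[i]
})
if len(candidates) > count {
	candidates = candidates[:count]
}
nodes = append(nodes, candidates...)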

vcastellm (Member) commented

@vaporz Regarding the last range over a map: I don't get your point. I know about the "arbitrary" order of maps in Go, but in this code it doesn't matter whether it's ordered, random, or arbitrary.

It is comparing elements from the tags map with elements of the member.Tags map; order is not important. What's your point?

vcastellm (Member) commented

> Is it possible that qr.Finished() returns true right after the "case ack, ok := <-ackCh:" branch executes?

Very improbable, but not verified. qr.Finished() is false until all responses have been received or a timeout is reached.

> In that case, the last line, "rpcAddr = resp.Payload", never executes, which results in the "missing address" error?

You tell me. I mean, when you see

agent: Error invoking job command error="dial tcp: missing address" node=node2

does the master node output "proc: Received response" with the payload?

vcastellm (Member) commented

Regarding the original issue: I followed the steps, but it seems something is missing. In steps 3 and 4, are you deleting jobs?

Because otherwise, it works as expected for me.


vaporz commented Sep 12, 2017

Hi,

> It is comparing elements from the tags map with elements of the member.Tags map; order is not important. What's your point?

Sorry, I didn't make myself clear. I was testing with 3 nodes and created a job with the tag "web:1". I observed that the job is not executed across the nodes at a 1:1:1 rate; the leader node seems to get picked more often. So I looked into the code and tested "range map". I think "range map" is causing this, because it is not truly "random".

> Does the master node output "proc: Received response" with the payload?

Yes... maybe I was wrong, but the error occurs often. Here is the log:
https://github.com/vaporz/dkron/blob/master/debug.log

> Regarding the original issue: I followed the steps, but it seems something is missing. In steps 3 and 4, are you deleting jobs?

No, I can reproduce this issue exactly with those steps. I just start 3 nodes, create 5 jobs, and two of the nodes crash by themselves.
Will job.Unlock() delete a key like dkron/job_locks/echo1?
This is etcdLock.Unlock(); doesn't l.client.Delete() delete that key?
Maybe this issue only happens with etcd?

// Unlock the "key". Calling unlock while
// not holding the lock will throw an error
func (l *etcdLock) Unlock() error {
	if l.stopLock != nil {
		l.stopLock <- struct{}{}
	}
	if l.last != nil {
		delOpts := &etcd.DeleteOptions{
			PrevIndex: l.last.Node.ModifiedIndex,
		}
		_, err := l.client.Delete(context.Background(), l.key, delOpts)
		if err != nil {
			return err
		}
	}
	return nil
}
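
Reading that code, l.last is set by the successful Set in Lock(), so Unlock() does delete the lock key. Combined with the etcd debug log above, the crash interleaving reconstructs roughly as follows (an interpretation, not verified):

// a reconstruction from the etcd debug log above (assumed, not verified):
//   node A: Set(key, prevExist=false)   -> fails, key exists, so
//           setOpts.PrevIndex = ^uint64(0) (18446744073709551615)
//   node B: Unlock() -> Delete(key)     -> the lock key is gone
//   node A: Set(key, prevExist=true, prevIndex=^uint64(0))
//                                        -> "Key not found" (code 100)
//   dkron treats the Lock() error as fatal (FATA) and the node exits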

@vcastellm vcastellm added the bug label Apr 25, 2018
@vcastellm vcastellm added this to the v0.10.0 milestone Apr 25, 2018
vcastellm pushed a commit that referenced this issue Apr 27, 2018
As reported in #299.

This fixes a race condition with the job update after execution.

In this PR the lock is refactored into an atomic job put, which reads the job, updates it, and performs a CAS operation to write it back to the store.

If this update fails, it retries until the job is updated.

This ensures no race condition between nodes and effectively fixes the reported error.
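
A hedged sketch of the read-modify-CAS-retry pattern that commit message describes, using libkv's AtomicPut; client, jobKey, decodeJob, encodeJob, and applyExecution are hypothetical names, not the PR's actual code:

// a minimal sketch of the pattern, not the actual PR: read the job,
// update it in memory, and CAS it back, retrying on contention.
for {
	kv, err := client.Get(jobKey) // current value plus its store index
	if err != nil {
		return err
	}
	job := decodeJob(kv.Value) // hypothetical decode helper
	applyExecution(job)        // apply the post-execution update in memory
	ok, _, err := client.AtomicPut(jobKey, encodeJob(job), kv, nil)
	if ok {
		return nil // CAS succeeded; no lock was ever needed
	}
	if err != nil && err != store.ErrKeyModified {
		return err // a genuine store error, give up
	}
	// another node won the CAS; re-read and retry
}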
vcastellm (Member) commented

Main issue fixed in #355; all other issues were previously fixed in other PRs.
