
concurrency lock error #10756

Closed
yueyuan4 opened this issue May 23, 2019 · 9 comments

yueyuan4 commented May 23, 2019

package test

import (
	"context"
	"fmt"
	"log"
	"sync"
	"testing"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

var endpoints = []string{"172.31.236.172:12379", "172.31.236.53:12379", "172.31.63.49:12379"}
var dialTimeout = 10 * time.Second
var counter int

func TestServer_EtcdLock2(t *testing.T) {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: dialTimeout})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	s1, err := concurrency.NewSession(cli)
	if err != nil {
		panic(err)
	}
	defer s1.Close()

	for j := 0; j < 100; j++ {
		counter = 0

		var wg sync.WaitGroup
		for i := 0; i < 1000; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()

				m1 := concurrency.NewMutex(s1, "/my-lock/")
				if err := m1.Lock(context.TODO()); err != nil {
					panic(err)
				}
				counter++
				fmt.Println(counter, "...", time.Now().String())
				if err := m1.Unlock(context.TODO()); err != nil {
					panic(err)
				}
			}()
		}

		wg.Wait()
		println(counter)
		if counter != 1000 {
			panic("error count")
		}
		time.Sleep(3 * time.Second)
	}
}

I got this error. Thank you for your help.

999
--- FAIL: TestServer_EtcdLock2 (297.02s)
panic: error count [recovered]
	panic: error count

jingyih commented May 23, 2019

I am not familiar with the usage of the concurrency package. If you create the 1000 mutexes before starting the goroutines, will it work?

@yueyuan4 (Author)

Do you mean this? I tried it, but it doesn't work. I got the same error.

	for j := 0; j < 100; j++ {
		counter = 0

		var wg sync.WaitGroup
		for i := 0; i < 1000; i++ {

			wg.Add(1)
			m1 := concurrency.NewMutex(s1, "/my-lock/")
			go func() {
				defer wg.Done()

				//m1 := concurrency.NewMutex(s1, "/my-lock/")
				if err := m1.Lock(context.TODO()); err != nil {
					panic(err)
				}

				println("lock succ, do your business logic")
				A() // business logic (presumably increments counter)
				if err := m1.Unlock(context.TODO()); err != nil {
					panic(err)
				}
				println("unlock succ, finish business logic")
			}()
		}

		wg.Wait()
		println(counter)
		if counter != 1000 {
			panic("error count")
		}
		time.Sleep(3 * time.Second)
	}


jingyih commented May 23, 2019

I meant:

create mutexes
for i := 0; i < 1000; i++ {
  go func() {
    lock mutex
    counter++
    unlock mutex
  }()
}
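
In Go, that sketch might look roughly like this (untested; it reuses the s1 session and counter variable from the test code above):

	// Create all 1000 mutexes up front, before starting the goroutines.
	mutexes := make([]*concurrency.Mutex, 1000)
	for i := range mutexes {
		mutexes[i] = concurrency.NewMutex(s1, "/my-lock/")
	}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(m *concurrency.Mutex) {
			defer wg.Done()
			if err := m.Lock(context.TODO()); err != nil {
				panic(err)
			}
			counter++ // guarded by the etcd lock
			if err := m.Unlock(context.TODO()); err != nil {
				panic(err)
			}
		}(mutexes[i])
	}
	wg.Wait()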

@yueyuan4 (Author)

I am very happy that it works. However, when the number of goroutines is large, such as 1000, there is an error at the lock: "etcdserver: request timed out".


jingyih commented May 27, 2019

It literally means the request timed out. Maybe the QPS is too high and the server cannot keep up. What is the QPS? Are you using SSD or HDD? Reference on performance: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/performance.md

@yueyuan4 (Author)

The etcd write performance is as follows. But I don't know whether it's SSD or HDD; they are virtual machines.

| Number of keys | Key size in bytes | Value size in bytes | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | Average server RSS |
|---|---|---|---|---|---|---|---|---|
| 10,000 | 8 | 256 | 1 | 1 | leader only | 137 | 7.3ms | 335MB |
| 100,000 | 8 | 256 | 100 | 1000 | leader only | 8,095 | 122.4ms | 410MB |
| 100,000 | 8 | 256 | 100 | 1000 | all members | 10,722 | 92.5ms | 430MB |

A warning appeared once during the test:
benchmark --endpoints=172.31.236.53:12379 --target-leader --conns=1 --clients=1 put --key-size=8 --sequential-keys --total=10000 --val-size=256
197 / 10000 Booom ! 1.97% 04m12s{"level":"warn","ts":"2019-05-27T16:16:14.304+0800","caller":"clientv3/retry_interceptor.go:60","msg":"retrying of unary invoker failed","target":"endpoint://client-7111faf1-4daa-4ca1-8963-0df466c3aeb4/172.31.236.53:12379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m50s

Summary:
Total: 110.9975 secs.
Slowest: 1.8903 secs.
Fastest: 0.0035 secs.
Average: 0.0104 secs.
Stddev: 0.0607 secs.
Requests/sec: 90.0832

Response time histogram:
0.0035 [1] |
0.1922 [9954] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.3809 [9] |
0.5696 [11] |
0.7582 [5] |
0.9469 [3] |
1.1356 [6] |
1.3243 [4] |
1.5130 [1] |
1.7017 [3] |
1.8903 [2] |

Latency distribution:
10% in 0.0046 secs.
25% in 0.0049 secs.
50% in 0.0056 secs.
75% in 0.0075 secs.
90% in 0.0106 secs.
95% in 0.0136 secs.
99% in 0.0227 secs.
99.9% in 1.2204 secs.

Error distribution:
[1] etcdserver: request timed out

Thanks a lot.


jingyih commented May 27, 2019

Can you also post the disk metrics? They are accessible via the HTTP endpoint /metrics. The disk-related metrics are:

  • wal_fsync_duration_seconds
  • backend_commit_duration_seconds

Ref: https://github.com/etcd-io/etcd/blob/master/Documentation/metrics.md#disk
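
If it is easier to script, here is a small throwaway Go program that fetches /metrics and prints only those two histograms (the endpoint localhost:12379 is an assumption; adjust it to one of your members):

	package main

	import (
		"bufio"
		"fmt"
		"log"
		"net/http"
		"strings"
	)

	func main() {
		// Assumed endpoint; point this at one of your etcd members.
		resp, err := http.Get("http://localhost:12379/metrics")
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()

		scanner := bufio.NewScanner(resp.Body)
		scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
		for scanner.Scan() {
			line := scanner.Text()
			// Keep only the disk-related histograms mentioned above.
			if strings.Contains(line, "wal_fsync_duration_seconds") ||
				strings.Contains(line, "backend_commit_duration_seconds") {
				fmt.Println(line)
			}
		}
		if err := scanner.Err(); err != nil {
			log.Fatal(err)
		}
	}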

@yueyuan4 (Author)

curl -L http://localhost:12379/metrics
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 144061
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 423542
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 492066
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 500613
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 501933
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 502514
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 502802
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 503072
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 503369
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 503698
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 503845
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 503870
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 503871
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 503871
etcd_disk_wal_fsync_duration_seconds_sum 2158.7388408510105
etcd_disk_wal_fsync_duration_seconds_count 503871
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 0
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 0
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 11
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 26457
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 54237
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 57572
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 58170
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 58442
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 58613
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 58802
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 59010
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 59122
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 59140
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 59142
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 59142
etcd_disk_backend_commit_duration_seconds_sum 1052.4123231940036
etcd_disk_backend_commit_duration_seconds_count 59142

(Graphs attached for: etcd_disk_wal_fsync_duration_seconds_sum, etcd_disk_wal_fsync_duration_seconds_count, etcd_disk_wal_fsync_duration_seconds_bucket, etcd_disk_backend_commit_duration_seconds_sum, etcd_disk_backend_commit_duration_seconds_count, etcd_disk_backend_commit_duration_seconds_bucket.)


jingyih commented May 29, 2019

A couple of thoughts:

  1. I think etcd_disk_backend_commit_duration_seconds_bucket is a bit slow; a faster disk could help here. Normally the disk speed increases with disk size in cloud VMs.

  2. Also, maybe measure the timing in the test code? From your previously shared test code, it creates 1000 Lock() requests (which is basically 1000 Txn requests + 1000 get requests + 1000 watch create requests) in a very short period of time (measure how long this period is?). And these requests all come from 1 client. The QPS might be very high. A rough timing sketch follows this list.

  3. Following point 2, maybe create more clients and try again?
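
For point 2, here is a rough, untested sketch of how that period could be measured, assuming a pre-built mutexes slice as in the earlier sketch and the counter variable from the original test:

	// Time how long it takes to issue and complete the 1000 lock/unlock cycles.
	start := time.Now()

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(m *concurrency.Mutex) {
			defer wg.Done()
			if err := m.Lock(context.TODO()); err != nil {
				panic(err)
			}
			counter++ // guarded by the etcd lock
			if err := m.Unlock(context.TODO()); err != nil {
				panic(err)
			}
		}(mutexes[i])
	}
	wg.Wait()

	elapsed := time.Since(start)
	// Rough client-side rate: 1000 lock/unlock cycles over the elapsed time.
	fmt.Printf("1000 lock/unlock cycles in %v (~%.0f locks/sec)\n", elapsed, 1000/elapsed.Seconds())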
