sync: TestWaitGroupMisuse2 test is flaky #11443

kostya-sh · 2015-06-28T12:00:52Z

Using tip (ca91de7) TestWaitGroupMisuse2 test fails approximately 19 times out of 20 on my PC:

ok      strings 0.104s
--- FAIL: TestWaitGroupMisuse2 (1.41s)
    waitgroup_test.go:110: Should panic
    waitgroup_test.go:80: Unexpected panic: <nil>
FAIL
FAIL    sync    1.558s
ok      sync/atomic 0.

bradfitz · 2015-06-28T15:11:58Z

Could you describe your PC?

bradfitz · 2015-06-28T15:21:57Z

How many CPUs? Which? Which OS, OS version, arch (32-bit or 64-bit)?

Dmitry, maybe 1e6 should be bigger in the test?

/cc @dvyukov

kostya-sh · 2015-06-28T15:27:12Z

OS: 64-bit Debian 8.1 running on VMWare player 7.1 (host OS Windows 8).

Hardware: 4 CPUs (Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz)

kostya-sh · 2015-06-28T15:42:55Z

I've just tested it under Windows 8 (the same hardware). The test is still flaky, though the success rate is higher.

dvyukov · 2015-06-28T16:11:42Z

@kostya-sh What is the iteration count in the test at which it stops being flaky?

bradfitz · 2015-06-28T16:12:27Z

And how long does the test take to run at said iteration count?

marete · 2015-06-28T23:07:39Z

It fails reliably on my Linux laptop in an LXC container running Ubuntu 14.10 (3/3 times):

ok strings 0.156s
--- FAIL: TestWaitGroupMisuse2 (3.25s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 3.410s
ok sync/atomic 0.897s

My CPU's details (this is a Nehalem class mobile CPU):

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i5 CPU M 460 @ 2.53GHz
Stepping: 5
CPU MHz: 2528.000
CPU max MHz: 2528.0000
CPU min MHz: 1197.0000
BogoMIPS: 5056.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3

My OS's details:

Ubuntu Linux 14.10 (x86_64), running kernel 4.0.6 (Linus' mainline) in an unprivileged LXC container within an Ubuntu 15.04 (x86_64) host:

marebri@utopic:~/devel/go.git/src$ uname -a
Linux utopic 4.0.6 #1 SMP Wed Jun 24 01:11:35 EAT 2015 x86_64 x86_64 x86_64 GNU/Linux

Other information:

I reproduced this on the Ubuntu 15.04 (x86_64) host (failed 1/3 times):

ok strings 0.179s
--- FAIL: TestWaitGroupMisuse2 (3.32s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 3.493s
ok sync/atomic 1.018s

vs the passing 2 tests:

ok strings 0.236s
ok sync 2.478s
ok sync/atomic 0.821s
ok syscall 0.131s

ok strings 0.312s
ok sync 0.452s
ok sync/atomic 1.025s
ok syscall 0.151s

EDIT: Both host and LXC container are on the current git tip: d0ed87d

mikioh · 2015-06-28T23:57:58Z

We can see this on the dragonfly buildbot: http://build.golang.org/log/fd18334c684f5cec5c7d4f939c39f26ec7c30741 by 03a48eb.

kostya-sh · 2015-06-29T00:16:10Z

Even with 8e7 iterations the test is still flaky on the x86_64 Debian VMWare VM. With this number of iterations it takes about 100 seconds to fail.

When the test succeeds, it can take anywhere from 0.5 to 50 seconds.

iworker · 2015-06-29T10:25:28Z

I reproduced this on Ubuntu 15.04 vivid (4 times)

--- FAIL: TestWaitGroupMisuse2 (2.77s)
    waitgroup_test.go:110: Should panic
    waitgroup_test.go:80: Unexpected panic: <nil>
FAIL
FAIL    sync    2.917s
....
2015/06/29 13:50:33 Failed: exit status 1

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
Stepping:              4
CPU MHz:               1207.269
CPU max MHz:           3900.0000
CPU min MHz:           1200.0000
BogoMIPS:              7400.11
Virtualisation:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 15.04
Release:    15.04
Codename:   vivid

Linux *** 3.19.0-21-generic #21-Ubuntu SMP Sun Jun 14 18:31:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Even after all 1e6 iterations, the expected panic never occurred.

xoba · 2015-06-29T10:31:32Z

I've also seen the failure on an m3.xlarge AWS EC2 instance (4 vCPU, 15 GiB), running ami-ee793a86 (Ubuntu), golang commit a76c1a5:

ok cmd/nm 1.763s
ok cmd/objdump 4.847s
ok cmd/pack 3.671s
ok cmd/pprof/internal/profile 0.030s
ok cmd/vet 4.370s

GOMAXPROCS=2 runtime -cpu=1,2,4

ok runtime 36.327s

sync -cpu=10

--- FAIL: TestWaitGroupMisuse2 (1.58s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 1.682s
2015/06/29 10:28:14 Failed: exit status 1

iworker · 2015-06-29T11:05:08Z

@dvyukov with an iteration count of 1e9 this test didn't fail:

ok      sync    6.109s

But I don't think increasing the iteration count is a good choice. Maybe it would be better to find a way to trigger the panic with higher probability. Alternatively, the iteration count could depend on the number of CPUs: for example, 1e9 for 8 CPUs and 1e6 for 2 CPUs.

dvyukov · 2015-06-29T12:18:35Z

> Maybe it would be better to find a way to trigger the panic with higher probability.

I don't see any way to increase panic probability.

We probably need to stop running it in short mode. But that will make the test effectively useless...
@bradfitz What do you think about whitelisting it for short mode only on some subset of builders?
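A rough sketch of what such a per-builder whitelist could look like. Everything here is hypothetical: the `allowedBuilders` map, the builder name in it, and the `shouldRun` helper are made up for illustration, and reading the builder name from the `GO_BUILDER_NAME` environment variable is an assumption about how builders identify themselves.

```go
package main

import (
	"fmt"
	"os"
)

// allowedBuilders is a hypothetical whitelist of builders on which the
// test is known to be reliable. The entry is illustrative only.
var allowedBuilders = map[string]bool{
	"linux-amd64": true,
}

// shouldRun reports whether the test should run. Outside short mode it
// always runs; in short mode it runs only on whitelisted builders.
func shouldRun(short bool, builder string) bool {
	if !short {
		return true
	}
	return allowedBuilders[builder]
}

func main() {
	// Assumption: the builder exposes its name via GO_BUILDER_NAME.
	builder := os.Getenv("GO_BUILDER_NAME")
	fmt.Printf("builder %q, short mode: run=%v\n", builder, shouldRun(true, builder))
}
```

Inside a real test the check would be driven by `testing.Short()` and followed by `t.Skip(...)` when `shouldRun` returns false.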

bradfitz · 2015-06-29T16:27:52Z

It sounds like the test just sucks. I think it should be deleted if it can't be made reliable.

bradfitz · 2015-06-29T16:29:31Z

But if you want to whitelist it per builder, we'll need to finish #11346

bradfitz · 2015-06-29T18:36:54Z

This is also failing on my personal Linux server. Physical hardware, Ubuntu vivid 3.16.0-39-generic, 8 CPUs in /proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 30
model name      : Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
stepping        : 5     
microcode       : 0x3
cpu MHz         : 1200.000
cache size      : 8192 KB
physical id     : 0     
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes   
fpu_exception   : yes
cpuid level     : 11    
wp              : yes   
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5617.29
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

bradfitz · 2015-06-29T18:44:13Z

Disabling this for now in https://go-review.googlesource.com/11721

Update #11443
Change-Id: Icb7ea291a837dcf2799a791a2ba780fd2a5e712b
Reviewed-on: https://go-review.googlesource.com/11721
Reviewed-by: Brad Fitzpatrick
Reviewed-by: Dmitry Vyukov

bradfitz · 2016-11-15T22:18:12Z

@dvyukov, do you care to make this test reliable before I delete it?

dvyukov · 2016-11-16T07:16:50Z

The test looks good to me. It just shows that the scheduler sucks and doesn't execute runnable goroutines.

1.4:

$ stress -p 1 ./sync.test -test.run=TestWaitGroupMisuse2
...
7312 runs so far, 0 failures

1.7:

$ stress -p 1 ./sync.test -test.run=TestWaitGroupMisuse2
...
92 runs so far, 63 failures

Part of the problem is the next argument of runtime.runqput. But there may be other problems.

bradfitz · 2016-11-16T17:56:08Z

/cc @aclements because the scheduler reportedly sucks.

dvyukov · 2016-11-16T18:31:34Z

Just in case, this deflakes the test:
https://go-review.googlesource.com/#/c/33272/

gopherbot · 2017-02-13T15:36:02Z

CL https://golang.org/cl/36841 mentions this issue.

bradfitz assigned dvyukov Jun 28, 2015

bradfitz added this to the Go1.5 milestone Jun 28, 2015

bradfitz added the Testing label (an issue that has been verified to require only test changes, not just a test failure) Jun 28, 2015

bradfitz modified the milestones: Unplanned, Go1.5 Jun 29, 2015

ianlancetaylor mentioned this issue Nov 15, 2016

x/build: run standard library tests without -short #17472

Closed

bradfitz modified the milestones: Go1.9Early, Unplanned Nov 15, 2016

gopherbot closed this as completed in 83f95b8 Feb 16, 2017

mundaym mentioned this issue Apr 21, 2017

sync: TestWaitGroupMisuse2 hangs sometimes #20072

Closed

golang locked and limited conversation to collaborators Feb 16, 2018

gopherbot added the FrozenDueToAge label Feb 16, 2018

rsc unassigned dvyukov Jun 23, 2022
