
sync: TestWaitGroupMisuse2 test is flaky #11443

Closed
kostya-sh opened this issue Jun 28, 2015 · 22 comments
Labels
FrozenDueToAge Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@kostya-sh
Contributor

Using tip (ca91de7), the TestWaitGroupMisuse2 test fails approximately 19 times out of 20 on my PC:

ok      strings 0.104s
--- FAIL: TestWaitGroupMisuse2 (1.41s)
    waitgroup_test.go:110: Should panic
    waitgroup_test.go:80: Unexpected panic: <nil>
FAIL
FAIL    sync    1.558s
ok      sync/atomic 0.
@bradfitz
Contributor

Could you describe your PC?

@bradfitz
Contributor

How many CPUs? Which? Which OS, OS version, arch (32-bit or 64-bit)?

Dmitry, maybe 1e6 should be bigger in the test?

/cc @dvyukov

@bradfitz bradfitz added this to the Go1.5 milestone Jun 28, 2015
@bradfitz bradfitz added the Testing An issue that has been verified to require only test changes, not just a test failure. label Jun 28, 2015
@kostya-sh
Contributor Author

OS: 64-bit Debian 8.1 running on VMWare player 7.1 (host OS Windows 8).

Hardware: 4 CPUs (Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz)

@kostya-sh
Contributor Author

I've just tested it under Windows 8 on the same hardware. The test is still flaky, though the success rate is higher.

@dvyukov
Member

dvyukov commented Jun 28, 2015

@kostya-sh What is the iteration count in the test at which it stops being flaky?

@bradfitz
Contributor

And how long does the test take to run at said iteration count?

@marete
Contributor

marete commented Jun 28, 2015

It fails reliably on my Linux laptop in an LXC container running Ubuntu 14.10 (3/3 times):

ok strings 0.156s
--- FAIL: TestWaitGroupMisuse2 (3.25s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 3.410s
ok sync/atomic 0.897s

My CPU's details (this is a Nehalem class mobile CPU):

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i5 CPU M 460 @ 2.53GHz
Stepping: 5
CPU MHz: 2528.000
CPU max MHz: 2528.0000
CPU min MHz: 1197.0000
BogoMIPS: 5056.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3

My OS's details:

Ubuntu Linux 14.10 (x86_64), running kernel 4.0.6 (Linus' mainline) in an unprivileged LXC container within an Ubuntu 15.04 (x86_64) host:

marebri@utopic:~/devel/go.git/src$ uname -a
Linux utopic 4.0.6 #1 SMP Wed Jun 24 01:11:35 EAT 2015 x86_64 x86_64 x86_64 GNU/Linux

Other information:

I reproduced this on the Ubuntu 15.04 (x86_64) host (failed 1/3 times):

ok strings 0.179s
--- FAIL: TestWaitGroupMisuse2 (3.32s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 3.493s
ok sync/atomic 1.018s

vs the passing 2 tests:

ok strings 0.236s
ok sync 2.478s
ok sync/atomic 0.821s
ok syscall 0.131s

ok strings 0.312s
ok sync 0.452s
ok sync/atomic 1.025s
ok syscall 0.151s

EDIT: Both host and LXC container are on the current git tip: d0ed87d

@mikioh
Contributor

mikioh commented Jun 28, 2015

We can see this on the dragonfly buildbot: http://build.golang.org/log/fd18334c684f5cec5c7d4f939c39f26ec7c30741 by 03a48eb.

@kostya-sh
Contributor Author

Even with 8e7 iterations the test is still flaky on x86_64 Debian VMWare VM. It takes about 100 seconds to fail with this number of iterations.

When it succeeds, the test can take anywhere between 0.5 and 50 seconds.

@iworker

iworker commented Jun 29, 2015

I reproduced this on Ubuntu 15.04 vivid (4 times)

--- FAIL: TestWaitGroupMisuse2 (2.77s)
    waitgroup_test.go:110: Should panic
    waitgroup_test.go:80: Unexpected panic: <nil>
FAIL
FAIL    sync    2.917s
....
2015/06/29 13:50:33 Failed: exit status 1
lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
Stepping:              4
CPU MHz:               1207.269
CPU max MHz:           3900.0000
CPU min MHz:           1200.0000
BogoMIPS:              7400.11
Virtualisation:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 15.04
Release:    15.04
Codename:   vivid
Linux *** 3.19.0-21-generic #21-Ubuntu SMP Sun Jun 14 18:31:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

After all 1e6 iterations, the expected panic did not occur even once.

@xoba
Contributor

xoba commented Jun 29, 2015

i've also seen the failure on a m3.xlarge aws ec2 instance (4 vCPU, 15 GiB), running ami-ee793a86 (ubuntu), golang commit a76c1a5:

ok cmd/nm 1.763s
ok cmd/objdump 4.847s
ok cmd/pack 3.671s
ok cmd/pprof/internal/profile 0.030s
ok cmd/vet 4.370s

GOMAXPROCS=2 runtime -cpu=1,2,4

ok runtime 36.327s

sync -cpu=10

--- FAIL: TestWaitGroupMisuse2 (1.58s)
waitgroup_test.go:110: Should panic
waitgroup_test.go:80: Unexpected panic:
FAIL
FAIL sync 1.682s
2015/06/29 10:28:14 Failed: exit status 1

@iworker

iworker commented Jun 29, 2015

@dvyukov with an iteration count of 1e9 this test didn't fail:

ok      sync    6.109s

But I don't think increasing the iteration count is a good choice. It would be better to find a way to trigger the panic with higher probability. Alternatively, the iteration count could depend on the number of CPUs: for example, 1e9 for 8 CPUs and 1e6 for 2 CPUs.

@dvyukov
Member

dvyukov commented Jun 29, 2015

It would be better to find a way to trigger the panic with higher probability.

I don't see any way to increase panic probability.

We probably need to skip it in short mode. But that would make the test effectively useless...
@bradfitz, what do you think about whitelisting it to run in short mode only on some subset of builders?

@bradfitz
Contributor

It sounds like the test just sucks. I think it should be deleted if it can't be made reliable.

@bradfitz
Contributor

But if you want to whitelist it per builder, we'll need to finish #11346

@bradfitz
Contributor

This is also failing on my personal Linux server. Physical hardware, Ubuntu vivid 3.16.0-39-generic, 8 CPUs in /proc/cpuinfo:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 30
model name      : Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
stepping        : 5     
microcode       : 0x3
cpu MHz         : 1200.000
cache size      : 8192 KB
physical id     : 0     
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes   
fpu_exception   : yes
cpuid level     : 11    
wp              : yes   
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5617.29
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

@bradfitz
Contributor

Disabling this for now in https://go-review.googlesource.com/11721

@bradfitz bradfitz modified the milestones: Unplanned, Go1.5 Jun 29, 2015
bradfitz added a commit that referenced this issue Jun 29, 2015
Update #11443

Change-Id: Icb7ea291a837dcf2799a791a2ba780fd2a5e712b
Reviewed-on: https://go-review.googlesource.com/11721
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
@bradfitz
Contributor

@dvyukov, do you care to make this test reliable before I delete it?

@bradfitz bradfitz modified the milestones: Go1.9Early, Unplanned Nov 15, 2016
@dvyukov
Member

dvyukov commented Nov 16, 2016

The test looks good to me. It just shows that the scheduler sucks and doesn't execute runnable goroutines.

1.4:

$ stress -p 1 ./sync.test -test.run=TestWaitGroupMisuse2
...
7312 runs so far, 0 failures

1.7:

$ stress -p 1 ./sync.test -test.run=TestWaitGroupMisuse2
...
92 runs so far, 63 failures

Part of the problem is the next argument of runtime.runqput. But there may be other problems.

@bradfitz
Contributor

/cc @aclements because the scheduler reportedly sucks.

@dvyukov
Member

dvyukov commented Nov 16, 2016

Just in case, this deflakes the test:
https://go-review.googlesource.com/#/c/33272/

@gopherbot
Contributor

CL https://golang.org/cl/36841 mentions this issue.

@golang golang locked and limited conversation to collaborators Feb 16, 2018
@rsc rsc unassigned dvyukov Jun 23, 2022