runtime/trace: trace tests are flaky #10512

Closed
mdempsky opened this issue Apr 20, 2015 · 24 comments
@mdempsky (Contributor)

openbsd/amd64 failures:

--- FAIL: TestTraceFutileWakeup (0.08s)
    trace_test.go:428: failed to parse trace: g 78 is not waiting during syscall exit (offset 323, time 0)
--- FAIL: TestTraceFutileWakeup (0.13s)
    trace_test.go:428: failed to parse trace: g 82 is not waiting during syscall exit (offset 4636, time 0)
--- FAIL: TestTraceFutileWakeup (0.13s)
    trace_test.go:428: failed to parse trace: g 82 is not waiting during syscall exit (offset 1271, time 0)

linux/ppc64le failures:

--- FAIL: TestTraceFutileWakeup (0.07s)
    trace_test.go:441: goroutine 65 blocked on GoBlockSend at 7278802 right after start
--- FAIL: TestTraceFutileWakeup (0.06s)
    trace_test.go:427: failed to parse trace: g 118 is not running while GoSched (offset 5452, time 1881495)
--- FAIL: TestTraceFutileWakeup (0.07s)
    trace_test.go:441: goroutine 99 blocked on GoBlockSend at 3400181 right after start
--- FAIL: TestTraceFutileWakeup (0.10s)
    trace_test.go:440: goroutine 141 blocked on GoBlockSelect at 1140998 right after start
--- FAIL: TestTraceFutileWakeup (0.08s)
    trace_test.go:427: failed to parse trace: g 133 is not running while unpark (offset 16052, time 10507482)
--- FAIL: TestTraceFutileWakeup (0.09s)
    trace_test.go:428: failed to parse trace: g 99 is not running while GoSched (offset 36466, time 12834938)
--- FAIL: TestTraceFutileWakeup (0.08s)
    trace_test.go:441: goroutine 116 blocked on GoBlockSelect at 1975223 right after start

I haven't noticed any failures on other bots.

@dvyukov (Member)

dvyukov commented Apr 21, 2015

Do we have access to a linux/ppc64le machine?

mdempsky added a commit that referenced this issue Apr 21, 2015
Update #10512.

Change-Id: Ifdc59c3a5d8aba420b34ae4e37b3c2315dd7c783
Reviewed-on: https://go-review.googlesource.com/9162
Reviewed-by: Dmitry Vyukov <[email protected]>
@bradfitz (Contributor)

bradfitz commented May 6, 2015

Also failed once on nacl-amd64p32 at ec87dbf:

http://build.golang.org/log/124733943107cda60b12f6742d08ca233e206e14

bradfitz added this to the Go1.5 milestone May 6, 2015
@rjammala

Have seen this on linux/amd64 also:

ok runtime/debug 0.006s
--- FAIL: TestTraceFutileWakeup (0.04s)
trace_test.go:445: goroutine 153 blocked on GoBlockRecv at 3249129 right after start

[rjammalamadaka@localhost src]$ uname -a
Linux localhost.localdomain 2.6.32-504.12.2.el6.x86_64 #1 SMP Wed Mar 11 22:03:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[rjammalamadaka@localhost src]$ go version
go version devel +7e26a2d Fri May 15 13:56:40 2015 +0000 linux/amd64

@bradfitz (Contributor)

Also flaky on Linux on go1.5beta2, per report on golang-nuts:

$ ./all.bash

...

ok      runtime 36.204s
ok      runtime/debug   0.068s
--- FAIL: TestTraceStress (4.73s)
        trace_test.go:226: failed to parse trace: p 4 is running before start (offset 322664, time 65700205)
--- FAIL: TestTraceStressStartStop (1.70s)
        trace_test.go:365: failed to parse trace: g 5 is not waiting during syscall exit (offset 190855, time 1924)
--- FAIL: TestTraceFutileWakeup (0.56s)
        trace_test.go:433: failed to parse trace: p 7 is running before start (offset 119705, time 13515283)
FAIL
FAIL    runtime/pprof   13.658s
ok      sort    0.102s
ok      strconv 0.436s
ok      strings 0.170s
ok      sync    0.179s
ok      sync/atomic     0.610s
ok      syscall 0.075s
ok      testing 2.399s
ok      testing/quick   0.044s
ok      text/scanner    0.010s
ok      text/tabwriter  0.016s
ok      text/template   0.055s
ok      text/template/parse     0.018s
ok      time    2.948s
ok      unicode 0.010s
ok      unicode/utf16   0.007s
ok      unicode/utf8    0.008s
ok      cmd/addr2line   2.795s
ok      cmd/api 0.042s
ok      cmd/asm/internal/asm    0.043s
ok      cmd/asm/internal/lex    0.004s
ok      cmd/compile/internal/big        2.936s
ok      cmd/cover       4.232s
ok      cmd/doc 0.028s
ok      cmd/fix 0.021s
ok      cmd/go  108.378s
ok      cmd/gofmt       0.071s
ok      cmd/internal/goobj      0.005s
ok      cmd/internal/obj        0.014s
ok      cmd/internal/obj/x86    0.021s
ok      cmd/internal/rsc.io/arm/armasm  0.010s
ok      cmd/internal/rsc.io/x86/x86asm  0.501s
ok      cmd/newlink     0.021s
ok      cmd/nm  2.267s
ok      cmd/objdump     8.034s
ok      cmd/pack        5.847s
ok      cmd/pprof/internal/profile      0.005s
ok      cmd/vet 11.152s
2015/07/17 14:40:13 Failed: exit status 1

$ go version
go version devel +cc8f544 Fri Jul 17 03:34:11 2015 +0000 linux/amd64
$ linuxinfo
Linux debianhp64 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24)
Zwei AMD Unknown 1000MHz Prozessoren, 3990.06 Bogomips gesamt, 2891 MB RAM
Systembibliothek 2.19.0

bradfitz changed the title from "runtime/pprof: TestTraceFutileWakeup flaky on openbsd/amd64 and linux/ppc64le" to "runtime/pprof: trace tests are flaky" on Jul 17, 2015
@bradfitz (Contributor)

We're getting lots of dups of this (see the referenced issues above).

We should disable these tests for Go 1.5 if they're not reliable.

@dvyukov?

@dvyukov (Member)

dvyukov commented Jul 17, 2015

Can we exclude them only in short mode?
Ideally we would still run them on the subset of builders where they currently pass; some parts of runtime tracing are subtle, and it would be bad if they silently regressed.
I can try to write a function for Linux that determines whether the CPU ticks on a machine are sane, but it is not completely trivial.
Looking at "Zwei AMD Unknown 1000MHz Prozessoren" (two unknown 1000MHz AMD processors), I guess in this case it is some broken virtual environment.
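
Something like the following is one way such a check could look on Linux. This is only a sketch, not code from the runtime, under the assumption that the kernel's chosen clocksource plus the constant_tsc/nonstop_tsc CPU flags are a reasonable proxy for sane ticks:

```go
// A minimal sketch (not the actual runtime code) of a Linux-only check
// for whether the TSC is likely usable as a trace clock.
package main

import (
	"fmt"
	"os"
	"strings"
)

// tscLooksSane reports whether the TSC appears stable enough for tracing.
func tscLooksSane() bool {
	// If the kernel trusts the TSC, it selects it as the current clocksource.
	cs, err := os.ReadFile("/sys/devices/system/clocksource/clocksource0/current_clocksource")
	if err != nil || strings.TrimSpace(string(cs)) != "tsc" {
		return false
	}
	// constant_tsc: fixed frequency; nonstop_tsc: keeps counting in deep C-states.
	info, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		return false
	}
	s := string(info)
	return strings.Contains(s, "constant_tsc") && strings.Contains(s, "nonstop_tsc")
}

func main() {
	fmt.Println("TSC looks sane:", tscLooksSane())
}
```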

@dvyukov (Member)

dvyukov commented Jul 19, 2015

What if we whitelist known working configurations in short mode?
That would be much simpler. And if those configurations break, that's something we want to look at.

So far the safe configurations seem to be linux/windows on Intel processors. For example, Solaris does not bother to synchronize ticks across CPUs, many AMD processors provide unstable ticks, and on ARM cputicks is unimplemented.
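
A minimal sketch of what such a whitelist could look like in the tests themselves; the helper name and the exact set of "known good" configurations are assumptions for illustration, not the actual test code:

```go
package trace_test

import (
	"runtime"
	"testing"
)

// skipTraceTestIfUnreliable skips timestamp-sensitive trace tests in short
// mode unless we are on a configuration known to have synchronized cputicks.
// (Hypothetical helper; the whitelist here is illustrative.)
func skipTraceTestIfUnreliable(t *testing.T) {
	if !testing.Short() {
		return // in long mode, run everywhere so regressions still surface
	}
	known := (runtime.GOOS == "linux" || runtime.GOOS == "windows") &&
		(runtime.GOARCH == "amd64" || runtime.GOARCH == "386")
	if !known {
		t.Skipf("skipping: cputicks not known to be reliable on %s/%s",
			runtime.GOOS, runtime.GOARCH)
	}
}
```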

@mewmew (Contributor)

mewmew commented Jul 21, 2015

So far the safe configurations seem to be linux/windows on Intel processors.

(On a closer look, maybe this is not the same issue, as the one above references TestTraceFutileWakeup. If that is the case, please ignore this comment.)

This test has been failing on my Intel laptop with Linux for some time. Ref: #10476 (comment)

Including the latest results below:

[~/go/src/runtime/pprof]$ go test
--- FAIL: TestTraceSymbolize (2.99s)
    trace_stack_test.go:130: failed to parse trace: p 1 is running before start (offset 5114, time 1936973701)
--- FAIL: TestTraceStress (3.79s)
    trace_test.go:226: failed to parse trace: g 84 is not runnable before start (offset 259030, time 56554755)
--- FAIL: TestTraceStressStartStop (0.30s)
    trace_test.go:365: failed to parse trace: g 68 is not runnable before start (offset 95155, time 340799)
--- FAIL: TestTraceFutileWakeup (0.61s)
    trace_test.go:433: failed to parse trace: g 90 is not runnable before start (offset 10035, time 1165852)
FAIL
exit status 1
FAIL    runtime/pprof   39.342s

[~/go/src]$ go version
go version devel +0505dfc Tue Jul 21 20:28:23 2015 +0000 linux/amd64

[~/go/src]$ uname -a
Linux x61s 4.1.2-2-ARCH #1 SMP PREEMPT Wed Jul 15 08:30:32 UTC 2015 x86_64 GNU/Linux

[~/go/src]$ cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     L7500  @ 1.60GHz
stepping    : 11
microcode   : 0xba
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm tpr_shadow vnmi flexpriority
bugs        :
bogomips    : 3193.09
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     L7500  @ 1.60GHz
stepping    : 11
microcode   : 0xba
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida dtherm tpr_shadow vnmi flexpriority
bugs        :
bogomips    : 3193.09
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Let me know if I can assist in any way to troubleshoot this issue.

@rsc (Contributor)

rsc commented Jul 23, 2015

We can exclude all the machines where the test fails, but what if users want to use those machines? It is starting to sound like linux/x86 is the only system that works. Maybe darwin/amd64 too.

Fundamentally, the problem seems to be that the trace code assumes synchronized cputicks results across cores, and (as I understand it) modern systems make no such guarantee. @aclements says that Intel may have a soft guarantee about cores on the same chip, but there is no guarantee for cores in different sockets. @ianlancetaylor says that there is complex code in Linux to synchronize RDTSC counters across different cores. So linux/x86 should be okay. But obviously linux/ppc64le is not, so probably linux/non-x86 is not. And non-linux is probably not, once there are multiple sockets.

Tracing is not super high frequency. Perhaps instead of relying on the cpu ticks for sorting, traceEvent could do an xadd64(&traceseq, +1) to get a unique sequence number, and that number could be recorded in the profile and used as the sort key. I realize that would cause contention in the traceseq cache line. But again tracing is not super high frequency, and you only pay for it when it's on, so the slowdown might be entirely tolerable. At least you'd get correct answers.

Thoughts?
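
For illustration, here is a minimal sketch of the sequence-number idea above, using sync/atomic; the traceEvent representation is invented and is not the runtime's real tracer:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var traceseq uint64 // global trace sequence counter

type traceEvent struct {
	seq uint64 // total order across all Ps; the sort key
	ts  int64  // cputicks timestamp, kept for display only
	typ string
}

// emit assigns each event a globally unique, monotonically increasing
// sequence number. The atomic add is the "xadd64(&traceseq, +1)" step: it
// serializes all emitters on one cache line (the contention cost mentioned
// above), but it makes event order unambiguous even when timestamps lie.
func emit(typ string, ts int64) traceEvent {
	return traceEvent{seq: atomic.AddUint64(&traceseq, 1), ts: ts, typ: typ}
}

func main() {
	var wg sync.WaitGroup
	events := make(chan traceEvent, 100)
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 5; j++ {
				// ts may disagree across CPUs; seq never does.
				events <- emit("GoSched", int64(j))
			}
		}()
	}
	wg.Wait()
	close(events)
	for e := range events {
		fmt.Println(e.seq, e.typ)
	}
}
```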

rsc changed the title from "runtime/pprof: trace tests are flaky" to "runtime/trace: trace tests are flaky" on Jul 23, 2015
@dvyukov (Member)

dvyukov commented Jul 23, 2015

@rsc Windows should also synchronize across sockets.
Besides ordering, we also need timestamps; traces without timestamps are not very useful.
Some systems have RDTSC unsynchronized across cores/sockets, and some have RDTSC with variable frequency. The RDTSCP instruction can be used to fix the first problem, but it is not available on @mewmew's processor, an L7500, which dates from 2006.

@rsc (Contributor)

rsc commented Jul 23, 2015

I'm concerned that we're going to launch this tracing feature, that it's not going to work on some large fraction of systems, and that people are going to be mystified. How do you propose to address that? At least if we record both ordering and time stamps we can detect in cmd/trace systems for which time stamps do not respect ordering.

@dvyukov (Member)

dvyukov commented Jul 23, 2015

At least if we record both ordering and time stamps we can detect in cmd/trace systems for which time stamps do not respect ordering.

Inconsistencies are already detected; that's how people notice that something is wrong and report it here. Adding global sequencing just for error detection does not look like a good idea to me, since tracing is not particularly low frequency.
I think we need to improve the error message printed by cmd/trace and describe some common reasons for broken traces (VMs, old processors, broken OSes).

How do you propose to address that?

This is a good question.
Hopefully it is not too bad, because we only see the complaints and we don't hear from the people for whom it works. I personally tried the tracer on Linux, Darwin and Windows laptops and on Linux and Windows desktops; it worked flawlessly everywhere.
But that's not to say that there are no issues.
I see only one way we can improve the situation: for processors that provide stable-frequency ticks and support RDTSCP, we can try to synchronize across CPUs manually (by measuring RTT across different cores). I have never tried that, so I can't say how reliable it would be or what portion of systems it would fix.
Beyond that, I hope that processors manufactured in 2006 will die off over time and that OSes will become better...
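
For illustration, a minimal sketch of that RTT-based estimation, in the style of classic clock-offset measurement (as in NTP). Everything here is an assumption: readTicks stands in for a per-core RDTSCP read, and the channel ping-pong stands in for threads pinned to two different cores:

```go
package main

import (
	"fmt"
	"time"
)

// readTicks simulates reading a core-local tick counter that is offset
// from the reference core by `skew` ticks.
func readTicks(skew int64) int64 { return time.Now().UnixNano() + skew }

func main() {
	const remoteSkew = 12345 // unknown in reality; we try to recover it
	req := make(chan struct{})
	resp := make(chan int64)

	// "Remote core": on each probe, reply with its local tick reading.
	go func() {
		for range req {
			resp <- readTicks(remoteSkew)
		}
	}()

	// "Local core": estimate the remote offset as remote - midpoint(send, recv).
	// Half the RTT bounds the error; keeping the minimum-RTT probe tightens it.
	bestRTT := int64(1 << 62)
	var bestOffset int64
	for i := 0; i < 1000; i++ {
		t0 := readTicks(0)
		req <- struct{}{}
		tr := <-resp
		t1 := readTicks(0)
		if rtt := t1 - t0; rtt < bestRTT {
			bestRTT = rtt
			bestOffset = tr - (t0+t1)/2
		}
	}
	close(req)
	fmt.Printf("estimated offset %d ticks (error bound %d), true skew %d\n",
		bestOffset, bestRTT/2, remoteSkew)
}
```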

@rsc (Contributor)

rsc commented Jul 23, 2015

This may be a viable strategy long term, but for Go 1.5 it's clearly not ready yet. I sent CL 12579, which adds an explicit sequence number. I propose we ship with that for Go 1.5, and if the time-based approach is perfected in a later cycle, that's fine.

@aclements (Member)

I see only one way we can improve the situation: for processors that provide stable-frequency ticks and support RDTSCP, we can try to synchronize across CPUs manually (by measuring RTT across different cores).

You could also do this after the fact, by piecing together constraints from the recorded trace to compute each CPU's offset. I'm not convinced it is actually possible to do this up front in user space, given things like dynamic cgroup CPU masks.
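
A minimal sketch of how such after-the-fact constraints could be accumulated into a feasible offset range for one CPU pair; the event representation and the happens-before pairs are invented for illustration:

```go
package main

import "fmt"

type event struct {
	cpu int
	ts  int64 // raw per-CPU ticks
}

// A causal pair: `before` must happen before `after` (e.g., unblock -> start).
type pair struct{ before, after event }

func main() {
	// Offset model: trueTime(e) = e.ts + off[e.cpu]. Fix off[0] = 0 and
	// solve for off[1]. A pair with `before` on CPU 0 and `after` on CPU 1
	// requires before.ts <= after.ts + off1, i.e. off1 >= before.ts - after.ts;
	// pairs in the other direction give upper bounds on off1.
	pairs := []pair{
		{event{0, 100}, event{1, 40}},  // off1 >= 60
		{event{1, 90}, event{0, 200}},  // off1 <= 110
		{event{0, 300}, event{1, 250}}, // off1 >= 50
	}
	lo, hi := int64(-1<<62), int64(1<<62)
	for _, p := range pairs {
		switch {
		case p.before.cpu == 0 && p.after.cpu == 1:
			if b := p.before.ts - p.after.ts; b > lo {
				lo = b
			}
		case p.before.cpu == 1 && p.after.cpu == 0:
			if b := p.after.ts - p.before.ts; b < hi {
				hi = b
			}
		}
	}
	// Any off1 in [lo, hi] makes the trace causally consistent; the more
	// closely related cross-CPU events, the narrower this range gets.
	fmt.Printf("feasible offset for CPU 1: [%d, %d]\n", lo, hi)
}
```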

@dvyukov (Member)

dvyukov commented Jul 23, 2015

@aclements Do you mean that we still use RDTSCP and attach a processor ID to every event? If we don't, then I am not sure we will be able to restore meaningful timestamps. If we do, then it should be possible to find offsets that make the trace valid. However, we will end up with ranges of offsets that make the trace valid, and we will then need to choose a particular offset from each range somehow. I am not sure how large these ranges will be.

@gopherbot (Contributor)

CL https://golang.org/cl/12579 mentions this issue.

@aclements (Member)

Do you mean that we still use RDTSCP and attach a processor ID to every event?

Yes, this is what I was thinking. I imagine one could avoid repeating processor IDs within a trace buffer to save space.
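
As a sketch of that optimization, one could emit the processor ID only when it changes between consecutive events in a buffer, so runs of events from the same CPU carry it once; the encoding below is invented for illustration:

```go
package main

import "fmt"

type ev struct {
	cpu int
	ts  int64
}

// encode writes each event as either {ts} or {cpu-change marker, cpu, ts}.
// -1 is used as the "CPU follows" marker in this toy encoding.
func encode(events []ev) []int64 {
	var out []int64
	last := -1
	for _, e := range events {
		if e.cpu != last {
			out = append(out, -1, int64(e.cpu))
			last = e.cpu
		}
		out = append(out, e.ts)
	}
	return out
}

func main() {
	events := []ev{{0, 10}, {0, 12}, {0, 15}, {1, 16}, {1, 20}, {0, 25}}
	fmt.Println(encode(events)) // [-1 0 10 12 15 -1 1 16 20 -1 0 25]
}
```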

However, we will end up with ranges of offsets that make the trace valid, and we will then need to choose a particular offset from each range somehow.

That's certainly true. I suspect these ranges will be small, but I don't have anything to support that claim. I would certainly expect them to be small when there are closely causally related events in the trace, which is probably where they're most interesting; the more causally related events, the more precise the alignment would be. I believe Dapper does something like this [Sigelman, 2010, §2.1].

@rsc (Contributor)

rsc commented Jul 28, 2015

Updated the CL to check sequence numbers against time stamps in trace.Parse and to return a recognizable error. This should be good enough that when people do look at a trace in Go 1.5, they'll know it's real (because otherwise it would have been rejected).
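
A minimal sketch of that kind of check; ErrTimeOrder and the event type here are illustrative stand-ins, not the actual trace.Parse internals:

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

var ErrTimeOrder = errors.New("time stamps out of order (broken cputicks?)")

type Event struct {
	Seq uint64 // explicit sequence number: the authoritative order
	Ts  int64  // cputicks timestamp: must agree with that order
}

// checkTimeOrder sorts events by sequence number, then verifies that the
// timestamps are non-decreasing in that order, returning a recognizable
// error if the machine's clock contradicts the causal order.
func checkTimeOrder(events []Event) error {
	sort.Slice(events, func(i, j int) bool { return events[i].Seq < events[j].Seq })
	for i := 1; i < len(events); i++ {
		if events[i].Ts < events[i-1].Ts {
			return fmt.Errorf("event seq %d: %w", events[i].Seq, ErrTimeOrder)
		}
	}
	return nil
}

func main() {
	// The second event happened later (seq 2) but carries an earlier tick.
	err := checkTimeOrder([]Event{{1, 100}, {2, 90}, {3, 120}})
	fmt.Println(err)
}
```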

@gopherbot (Contributor)

CL https://golang.org/cl/12746 mentions this issue.

rsc closed this as completed in 80c98fa on Jul 29, 2015
rsc added a commit that referenced this issue Jul 29, 2015
The skips added in CL 12579, based on incorrect time stamps,
should be sufficient to identify and exclude all the time-related
flakiness on these systems.

If there is other flakiness, we want to find out.

For #10512.

Change-Id: I5b588ac1585b2e9d1d18143520d2d51686b563e3
Reviewed-on: https://go-review.googlesource.com/12746
Reviewed-by: Austin Clements <[email protected]>
@mewmew (Contributor)

mewmew commented Aug 3, 2015

If there is other flakiness, we want to find out.

After updating to the latest version and running the test cases 500 consecutive times, I am getting a consistent failure rate of about 3%.

$ go version
go version devel +5e15e28 Mon Aug 3 04:41:48 2015 +0000 linux/amd64

The laptop specs are the same as mentioned in #10512 (comment).

$ for i in {1..1000}; do echo "test $i" && go test; done 
test 1
PASS
ok      runtime/trace   2.940s
test 11
--- FAIL: TestTraceFutileWakeup (0.37s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 545, time -250749)
FAIL
exit status 1
FAIL    runtime/trace   2.840s
test 59
--- FAIL: TestTraceFutileWakeup (0.32s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 46518, time 243559)
FAIL
exit status 1
FAIL    runtime/trace   3.125s
test 76
--- FAIL: TestTraceFutileWakeup (0.33s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 517, time 213271)
FAIL
exit status 1
FAIL    runtime/trace   3.040s
test 88
--- FAIL: TestTraceFutileWakeup (0.20s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 21367, time 729251)
FAIL
exit status 1
FAIL    runtime/trace   2.659s
test 101
--- FAIL: TestTraceFutileWakeup (0.28s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 44824, time -301874)
FAIL
exit status 1
FAIL    runtime/trace   2.929s
test 111
--- FAIL: TestTraceFutileWakeup (0.23s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 554, time 213621)
FAIL
exit status 1
FAIL    runtime/trace   3.618s
test 165
--- FAIL: TestTraceFutileWakeup (0.23s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 6696, time 726649)
FAIL
exit status 1
FAIL    runtime/trace   3.508s
test 209
--- FAIL: TestTraceFutileWakeup (0.32s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 21670, time 776659)
FAIL
exit status 1
FAIL    runtime/trace   2.931s
test 256
--- FAIL: TestTraceFutileWakeup (0.35s)
    trace_test.go:406: failed to parse trace: g 33 is not runnable before start (offset 31488, time 722734)
FAIL
exit status 1
FAIL    runtime/trace   3.575s
test 286
--- FAIL: TestTraceFutileWakeup (0.22s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 8374, time 703479)
FAIL
exit status 1
FAIL    runtime/trace   3.116s
test 374
--- FAIL: TestTraceFutileWakeup (0.28s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 26718, time 814835)
FAIL
exit status 1
FAIL    runtime/trace   2.881s
test 388
--- FAIL: TestTraceFutileWakeup (0.29s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 6781, time 424414)
FAIL
exit status 1
FAIL    runtime/trace   3.066s
test 402
--- FAIL: TestTraceFutileWakeup (0.28s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 48391, time -260290)
FAIL
exit status 1
FAIL    runtime/trace   3.313s
test 413
--- FAIL: TestTraceFutileWakeup (0.47s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 22107, time -358752)
FAIL
exit status 1
FAIL    runtime/trace   2.978s
test 460
--- FAIL: TestTraceFutileWakeup (0.30s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 31998, time 700030)
FAIL
exit status 1
FAIL    runtime/trace   3.442s
test 473
--- FAIL: TestTraceFutileWakeup (0.41s)
    trace_test.go:406: failed to parse trace: g 19 is not runnable before start (offset 24233, time 688257)
FAIL
exit status 1
FAIL    runtime/trace   3.398s
test 500
--- FAIL: TestTraceFutileWakeup (0.41s)
    trace_test.go:406: failed to parse trace: g 33 is not runnable before start (offset 14657, time -293854)
FAIL
exit status 1
FAIL    runtime/trace   3.289s
test 501
PASS
ok      runtime/trace   2.819s

golang locked and limited conversation to collaborators Aug 5, 2016
rsc unassigned dvyukov Jun 23, 2022