runtime/trace: trace tests are flaky #10512
It affects openbsd/386 too: http://build.golang.org/log/b4604499df4e06974962fb73cd2e6afc071af0bb
Do we have access to a linux/ppc64le machine?
Update #10512.

Change-Id: Ifdc59c3a5d8aba420b34ae4e37b3c2315dd7c783
Reviewed-on: https://go-review.googlesource.com/9162
Reviewed-by: Dmitry Vyukov <[email protected]>
Also failed once on nacl-amd64p32 at ec87dbf: http://build.golang.org/log/124733943107cda60b12f6742d08ca233e206e14
windows-amd64 failure: http://build.golang.org/log/5667475ee0041dcaf3b19714360caa9a9c448173
Have seen this on linux/amd64 also: ok runtime/debug 0.006s [rjammalamadaka@localhost src]$ uname -a
Showed up again as of Jun 9: http://build.golang.org/log/92deee43c1bd04e6be337b307a0b98a0ee818ed1
Also flaky on Linux on go1.5beta2, per a report on golang-nuts.
We're getting lots of dups of this (see the referenced issues above). We should disable these tests for Go 1.5 if they're not reliable.
Can we exclude them only in short mode?
What if we whitelist known working configurations in short mode? So far the safe configurations seem to be Linux and Windows on Intel processors. E.g. Solaris does not bother to synchronize ticks across CPUs, lots of AMD processors provide unstable ticks, and on ARM cputicks is unimplemented.
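For concreteness, a minimal sketch of what such a short-mode whitelist could look like as a test helper. The helper name and the exact GOOS/GOARCH list are illustrative assumptions, not what the runtime tests actually do:

```go
package trace_test

import (
	"runtime"
	"testing"
)

// maybeSkipFlakyTrace is a hypothetical helper: in -short mode it skips
// trace tests except on platforms believed to have cputicks synchronized
// across CPUs. The whitelist below is illustrative only.
func maybeSkipFlakyTrace(t *testing.T) {
	if !testing.Short() {
		return
	}
	switch {
	case runtime.GOOS == "linux" && (runtime.GOARCH == "amd64" || runtime.GOARCH == "386"):
		// believed safe: Linux synchronizes RDTSC across cores
	case runtime.GOOS == "windows" && (runtime.GOARCH == "amd64" || runtime.GOARCH == "386"):
		// believed safe: Windows also synchronizes across sockets
	default:
		t.Skipf("skipping trace test in short mode on %s/%s: cputicks may not be synchronized across CPUs",
			runtime.GOOS, runtime.GOARCH)
	}
}
```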
(On a closer look, maybe this is not the same issue as the one referenced above.) This test has been failing on my Intel laptop with Linux for some time. Ref: #10476 (comment). Including the latest results below:
Let me know if I can assist in any way to troubleshoot this issue.
We can exclude all the machines where the test fails, but what if users want to use those machines? It is starting to sound like linux/x86 is the one system that works. Maybe darwin/amd64 too.

Fundamentally, the problem seems to be that the trace code assumes synchronized cputicks results across cores, and (as I understand it) modern systems make no such guarantee. @aclements says that Intel may have a soft guarantee about cores on the same chip, but there is no guarantee for cores in different sockets. @ianlancetaylor says that there is complex code in Linux to synchronize RDTSC counters across different cores. So linux/x86 should be okay. But obviously linux/ppc64le is not, so probably linux/non-x86 is not. And non-linux is probably not, once there are multiple sockets.

Tracing is not super high frequency. Perhaps instead of relying on the cpu ticks for sorting, traceEvent could do an xadd64(&traceseq, +1) to get a unique sequence number, and that number could be recorded in the profile and used as the sort key. I realize that would cause contention in the traceseq cache line. But again tracing is not super high frequency, and you only pay for it when it's on, so the slowdown might be entirely tolerable. At least you'd get correct answers. Thoughts?
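To make the proposal concrete, here is a minimal sketch of the idea, using sync/atomic as a stand-in for the runtime-internal xadd64. The names (traceSeq, event, emit) are illustrative and this is not the runtime's actual implementation:

```go
package trace // hypothetical sketch, not the real runtime tracer

import "sync/atomic"

// traceSeq is a global event sequence counter. Incrementing it with an
// atomic add yields a total order over events that does not depend on
// cputicks being synchronized across cores.
var traceSeq uint64

// event is a simplified trace record: the sequence number is the sort key
// used when parsing; the raw tick value is kept only for display.
type event struct {
	seq  uint64
	tick int64
	typ  byte
}

// emit stamps an event with the next global sequence number. The atomic
// add serializes all writers on one cache line, which is the contention
// concern mentioned above, but the cost is only paid while tracing is on.
func emit(typ byte, tick int64) event {
	return event{
		seq:  atomic.AddUint64(&traceSeq, 1),
		tick: tick,
		typ:  typ,
	}
}
```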
@rsc Windows should also synchronize across sockets.
I'm concerned that we're going to launch this tracing feature and it's not going to work on some large fraction of systems, and people are going to be mystified. How do you propose to address that? At least if we record both ordering and time stamps, we can detect in cmd/trace the systems for which time stamps do not respect ordering.
Inconsistencies are already detected. That's how people notice that there is something wrong and report it here. Adding global sequencing just for error detection does not look like a good idea to me; tracing is not particularly low frequency.
This is a good question.
This may be a viable strategy long term, but for Go 1.5 it's clearly not ready yet. I sent CL 12579, which adds an explicit sequence number. I propose we ship with that for Go 1.5, and if the time-based approach is perfected in a later cycle, that's fine.
You could also do this after-the-fact by piecing together constraints from the recorded trace to compute each CPU's offset. I'm not convinced that it's actually possible to do this up-front in user space given things like dynamic cgroup CPU masks.
@aclements Do you mean that we still use RDTSCP and attach a processor id to every event? If we don't, then I am not sure that we will be able to restore meaningful timestamps. If we do, then it should be possible to find offsets that make the trace valid. However, we will end up with ranges of offsets that make the trace valid, and then we will need to choose a particular offset from the range somehow. I am not sure how large these ranges will be.
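As a concrete illustration of how causal edges in a recorded trace bound the per-CPU offsets, here is a minimal sketch for two CPUs. The Event/Edge types and the sample numbers are made up for illustration; this is not part of any proposed implementation:

```go
package main

import (
	"fmt"
	"math"
)

// Event is a hypothetical trace record carrying the raw cputicks value
// and the processor it was recorded on (as RDTSCP would provide).
type Event struct {
	CPU  int
	Tick int64
}

// Edge records that event A is causally before event B
// (e.g. an unblock on one CPU followed by the goroutine starting on another).
type Edge struct{ A, B Event }

// offsetRange narrows the feasible range of off[cpu1]-off[cpu0] from causal
// edges between the two CPUs: if A happens-before B, then after applying
// offsets we need B.Tick+off[B.CPU] >= A.Tick+off[A.CPU].
func offsetRange(edges []Edge, cpu0, cpu1 int) (lo, hi float64) {
	lo, hi = math.Inf(-1), math.Inf(1)
	for _, e := range edges {
		switch {
		case e.A.CPU == cpu0 && e.B.CPU == cpu1:
			// off1 - off0 >= A.Tick - B.Tick
			if d := float64(e.A.Tick - e.B.Tick); d > lo {
				lo = d
			}
		case e.A.CPU == cpu1 && e.B.CPU == cpu0:
			// off1 - off0 <= B.Tick - A.Tick
			if d := float64(e.B.Tick - e.A.Tick); d < hi {
				hi = d
			}
		}
	}
	return lo, hi
}

func main() {
	edges := []Edge{
		{Event{0, 100}, Event{1, 40}}, // CPU 1's counter runs ~70 ticks behind
		{Event{1, 90}, Event{0, 165}},
	}
	lo, hi := offsetRange(edges, 0, 1)
	fmt.Printf("off[1]-off[0] must lie in [%v, %v]\n", lo, hi) // [60, 75]
}
```

The more closely causally related events the trace contains, the narrower these ranges get; choosing a single offset from the surviving range is the remaining open question raised above.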
CL https://golang.org/cl/12579 mentions this issue. |
Yes, this is what I was thinking. I imagine one could optimize repeated processor IDs within a trace buffer to save space (see the sketch after this comment).
That's certainly true. I suspect these ranges will be small, but don't have anything to support that claim. I would certainly expect them to be small if there are closely causally related events in the trace, which is probably where they're most interesting, and the more causally related events, the more precise the alignment would get. I believe Dapper does something like this [Sigelman, 2010, §2.1].
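A minimal sketch of the repeated-processor-ID optimization mentioned above, assuming a hypothetical buffer format where a CPU-change marker is emitted only when the processor ID differs from the previous event's. The opcode value and field layout are invented for illustration:

```go
package trace // hypothetical sketch, not the runtime's actual buffer format

// evCPUChange is an invented opcode marking "subsequent events were
// recorded on this CPU"; real trace opcodes are different.
const evCPUChange = 0xff

// buffer is a simplified trace buffer that remembers the CPU of the
// previous event, so runs of events on the same CPU need no CPU byte.
type buffer struct {
	data    []byte
	lastCPU int
}

func newBuffer() *buffer {
	return &buffer{lastCPU: -1} // force the first event to record its CPU
}

// appendEvent writes one event, emitting the processor ID (as obtained
// from RDTSCP) only when it changed since the previous event.
func (b *buffer) appendEvent(typ byte, cpu int, args ...byte) {
	if cpu != b.lastCPU {
		b.data = append(b.data, evCPUChange, byte(cpu))
		b.lastCPU = cpu
	}
	b.data = append(b.data, typ)
	b.data = append(b.data, args...)
}
```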
Updated CL to check sequence numbers vs time stamps in trace.Parse and return a recognizable error. This should be good enough that when people do look at a trace in Go 1.5, they'll know it's real (because otherwise it would have been rejected).
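For illustration, a minimal sketch of the kind of check described, assuming a simplified event type. The names here (Event, ErrTimeOrder, checkTimeOrder) are illustrative and not necessarily the actual trace parser's API:

```go
package trace // hypothetical sketch of a sequence-vs-timestamp check

import (
	"errors"
	"sort"
)

// Event is a simplified parsed trace event carrying both the explicit
// sequence number and the raw timestamp.
type Event struct {
	Seq uint64
	Ts  int64
}

// ErrTimeOrder is a recognizable error for traces whose timestamps
// contradict the sequence order, so tools can report the problem instead
// of rendering a silently wrong trace.
var ErrTimeOrder = errors.New("trace: time stamps out of order (cputicks not synchronized across CPUs?)")

// checkTimeOrder sorts events by sequence number and verifies that the
// timestamps are non-decreasing in that order.
func checkTimeOrder(events []Event) error {
	sort.Slice(events, func(i, j int) bool { return events[i].Seq < events[j].Seq })
	for i := 1; i < len(events); i++ {
		if events[i].Ts < events[i-1].Ts {
			return ErrTimeOrder
		}
	}
	return nil
}
```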
CL https://golang.org/cl/12746 mentions this issue. |
The skips added in CL 12579, based on incorrect time stamps, should be sufficient to identify and exclude all the time-related flakiness on these systems. If there is other flakiness, we want to find out.

For #10512.

Change-Id: I5b588ac1585b2e9d1d18143520d2d51686b563e3
Reviewed-on: https://go-review.googlesource.com/12746
Reviewed-by: Austin Clements <[email protected]>
After updating to the latest version and running the test cases 500 consecutive times, I am getting a consistent failure rate of about 3%.
The laptop specs are the same as mentioned in #10512 (comment).
openbsd/amd64 failures:
linux/ppc64le failures:
I haven't noticed any failures on other bots.