runtime/trace: time stamps out of order #16755
Comments
Oh this might be relevant, too:
Is this running in a VM or directly on physical hardware? Looks like a laptop?
This looks like a buggy processor/OS. @tv42 are you sure that C programs get reasonable data out of RDTSC on this machine?
I'm not sure why you'd call an RDTSC that varies between cores/chips "buggy". This particular BIOS may even fail to synchronize the TSC between the cores at boot, and it may be buggy, but the assumption of a unified TSC across cores is not safe. Invariant TSC was a feature added to the Nehalem generation of Intel chips, and even then you have no guarantee that the computer doesn't have multiple CPUs (not just multiple cores on one CPU). You simply can't rely on a single RDTSC source across all possible cores without further information that the assumption is safe. Outside of a kernel, that sounds like risky business. http://lxr.linux.no/linux+v2.6.38/arch/x86/kernel/tsc.c#L859
For what it's worth, Linux properly diagnoses TSC as unreliable on this host, and uses hpet instead.
Decent summary of TSC issues (talks about hardware, also outside of virtualization):
@tv42 do you have any proposal as to how we can fix this?
I'm hitting this "time stamps out of order" issue as well, using Gentoo's
@mpictor out of curiosity: does the problem go away if you change the clock source to
@kostix it does.
That said, as tv42 notes, assumptions about the TSC break on multi-socket systems. While this worked once, I can't repeat it - writing
@dvyukov I don't know enough about pprof to really have much useful to say. I wonder how big the overhead of
Is it possible to collect (core, tsc) pairs? Times between cores would be essentially incomparable, but for short runs, you could assume that the offsets of the different TSCs don't drift much. For longer runs, perhaps there would be a mechanism to regularly emit a (core, tsc, monotonic_clock) sync point that can be used to correct for drift. Thread start/exit should probably emit one too. That should amortize the (assumed) slower
Of course, that leaves the analysis side stuck with a more difficult problem. But hey, it's no different from trying to analyze similar traces from a distributed system (because it is one).
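The (core, tsc) plus sync-point idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `syncPoint` and `event` types and the `toMono` and `orderEvents` helpers are hypothetical names, not part of the Go runtime or its trace format. The sketch also assumes exactly two sync points per core, with the second taken later than the first, and a TSC rate that stays constant between them.

```go
package main

import (
	"fmt"
	"sort"
)

// syncPoint is a hypothetical (core, tsc, monotonic_clock) record.
type syncPoint struct {
	core   int
	tsc    uint64
	monoNS int64 // monotonic clock reading, in nanoseconds
}

// event is a hypothetical trace event stamped with the local core's TSC.
type event struct {
	name string
	core int
	tsc  uint64
}

// toMono maps a TSC reading onto the monotonic clock by linear
// interpolation (or extrapolation) between two sync points of one core.
// It assumes b was recorded after a on the same core.
func toMono(a, b syncPoint, tsc uint64) int64 {
	ticks := float64(b.tsc - a.tsc)
	nanos := float64(b.monoNS - a.monoNS)
	// The int64 conversion keeps the sign when tsc is earlier than a.tsc.
	delta := float64(int64(tsc - a.tsc))
	return a.monoNS + int64(delta*nanos/ticks)
}

// orderEvents returns a copy of events sorted by corrected time.
// It fails if an event's core has no usable pair of sync points.
func orderEvents(events []event, syncs map[int][2]syncPoint) ([]event, error) {
	type stamped struct {
		ev   event
		mono int64
	}
	out := make([]stamped, 0, len(events))
	for _, ev := range events {
		sp, ok := syncs[ev.core]
		if !ok || sp[0].tsc == sp[1].tsc {
			return nil, fmt.Errorf("no usable sync points for core %d", ev.core)
		}
		out = append(out, stamped{ev, toMono(sp[0], sp[1], ev.tsc)})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].mono < out[j].mono })
	sorted := make([]event, len(out))
	for i, s := range out {
		sorted[i] = s.ev
	}
	return sorted, nil
}

func main() {
	// Two cores whose TSCs tick at the same rate but start far apart.
	syncs := map[int][2]syncPoint{
		0: {{core: 0, tsc: 1000, monoNS: 0}, {core: 0, tsc: 2000, monoNS: 1000}},
		1: {{core: 1, tsc: 500000, monoNS: 0}, {core: 1, tsc: 501000, monoNS: 1000}},
	}
	events := []event{{"A", 0, 1100}, {"C", 0, 1900}, {"B", 1, 500050}}
	sorted, err := orderEvents(events, syncs)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, ev := range sorted {
		fmt.Println(ev.name, "on core", ev.core)
	}
}
```

In this made-up data set, sorting by raw TSC would give A, C, B, while the corrected order is B, A, C. A real implementation would need many sync points per core and some handling of thread migration, which the sketch ignores.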
Does it provide enough precision?
Yes.
It is a lot slower to take a measurement, even with the vDSO-optimized not-real-syscalls. On this system (
For comparison, a laptop with i5-5300U and clocksource TSC says
I just realized CLOCK_MONOTONIC is still subject to NTP time warping; I'll repeat my experiments with CLOCK_MONOTONIC_RAW.
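For readers who want a rough feel for the per-call cost of reading the OS clock from Go, here is a minimal sketch. It is not the measurement program used in this thread. It times `time.Now`, on the assumption that this call is served by the platform's clock-reading path (on Linux typically the vDSO `clock_gettime`), and `avgNowCost` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// avgNowCost returns the average elapsed time per time.Now() call over n
// calls. It is a crude estimate: loop overhead is included and no warm-up
// is done.
func avgNowCost(n int) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	last := start
	for i := 0; i < n; i++ {
		last = time.Now()
	}
	// Sub uses the monotonic reading carried by both values.
	return last.Sub(start) / time.Duration(n)
}

func main() {
	fmt.Println("average cost of time.Now():", avgNowCost(1000000))
}
```

The number depends heavily on the kernel's current clocksource, so comparing runs under different clocksources on the same machine is more informative than any single figure.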
Change https://golang.org/cl/97757 mentions this issue:
runtime/trace test already skips tests in case of the timestamp error.
Moreover, relax TestAnalyzeAnnotationGC test condition to deal with the inaccuracy caused from use of cputicks in tracing.
Fixes #24081
Updates #16755
Change-Id: I708ecc6da202eaec07e431085a75d3dbfbf4cc06
Reviewed-on: https://go-review.googlesource.com/97757
Run-TryBot: Hyang-Ah Hana Kim <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Change https://golang.org/cl/105821 mentions this issue:
All tests involving trace collection and parsing still need handling of failures caused by #16755 (Timestamp issue)
Fixes #24738
Change-Id: I6cd0f9c6f49854a22fad6fce1a00964c168aa614
Reviewed-on: https://go-review.googlesource.com/105821
Reviewed-by: Peter Weinberger <[email protected]>
@hyangah how did you fix the timestamps? I looked into that but didn't have success. I'm really disappointed to see that the milestone is now
I'm experiencing this issue as well. I'm trying to retrieve a trace file from a process on one of our servers and open it to find bottlenecks, and I'm getting the out-of-order error described in this thread. This prevents me from using any of the Go tooling for profiling.
It would be nice to know how many events are out of order and to have the option to ignore them. If only a few events are out of order, I would prefer to still be able to work on my trace, with a warning so that I know to be skeptical about what I see or to look for anomalies. Right now I can't use any tooling, and there may be valuable information for me to look at. Has anyone found a solution for this?
Environment
Our environment is bare metal, not virtualized. Nothing exotic in here, I don't think.
System measurements
An interesting thing I noticed is that the system is really overloaded. It's a 12 core box with a load of 63 and
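The "count the out-of-order events and optionally ignore them" request could look roughly like the sketch below. It works on a plain slice of timestamps, not on the real trace format, and `dropOutOfOrder` is a hypothetical helper; nothing in this thread indicates that `go tool trace` offers such an option.

```go
package main

import "fmt"

// dropOutOfOrder keeps every timestamp that does not go backwards relative
// to the last kept one, and counts the timestamps it drops.
func dropOutOfOrder(ts []int64) (kept []int64, dropped int) {
	for _, t := range ts {
		if len(kept) > 0 && t < kept[len(kept)-1] {
			dropped++
			continue
		}
		kept = append(kept, t)
	}
	return kept, dropped
}

func main() {
	kept, dropped := dropOutOfOrder([]int64{10, 20, 15, 30, 25, 40})
	fmt.Printf("kept %v, dropped %d out-of-order timestamps\n", kept, dropped)
}
```

Reporting the dropped count alongside the filtered data gives the reader the warning asked for above: a small count suggests the trace is mostly usable, a large one suggests it is not.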
Also experiencing this issue.
The same code works fine on another machine (a MacBook).
Same issue.
Hi. What version of Go are you using (go version)? Does this issue reproduce with the latest release? What operating system and processor architecture are you using (go env)?
go env Output
```
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/adeshina/Library/Caches/go-build"
GOENV="/Users/adeshina/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/adeshina/go/pkg/mod"
GONOPROXY="github.com/gettreasure-mother/*,github.com/linuxfoundation-it/*,github.com/LF-Engineering"
GONOSUMDB="github.com/gettreasure-mother/*,github.com/linuxfoundation-it/*,github.com/LF-Engineering"
GOOS="darwin"
GOPATH="/Users/adeshina/go"
GOPRIVATE="github.com/acme-corporation/internal-rpc-client"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/opt/go/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/opt/go/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.19.4"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/adeshina/Workspace/go-apps/go-tracing/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4l/0kgwdc8d0cl8ss8x6__q0gp00000gn/T/go-build3344081618=/tmp/go-build -gno-record-gcc-switches -fno-common"
```
What did you expect to see? What did you see instead?
The same problem, go env Output:
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/zhangjie/Library/Caches/go-build"
GOENV="/Users/zhangjie/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/zhangjie/go/pkg/mod"
GONOPROXY="git.woa.com/*"
GONOSUMDB="git.woa.com/*"
GOOS="darwin"
GOPATH="/Users/zhangjie/go"
GOPRIVATE="git.woa.com/*"
GOPROXY="https://goproxy.woa.com,direct"
GOROOT="/usr/local/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.19.5"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/zhangjie/Github/codemaster/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4m/2_pn1sln1fzcg2zg_4mh95z80000gn/T/go-build2690046038=/tmp/go-build -gno-record-gcc-switches -fno-common"
Same issue in
Uh that is interesting. On Intel cores, this mostly triggered with older generations (before Nehalem, mostly 10+ year old machines by now). I do not know enough about AMD, but I guess the same was true there? However, this triggering on Apple's M1 Pro and M1 Max means new hardware also suffers from this. (Or has symptoms that get confused with this?)
With the new tracer this error can no longer happen -- timestamps are not used for ordering events anymore. (See #60773.)
The RDTSCP event ordering logic talked about in #10512 and #15102 (tricky palindrome bug numbers) doesn't seem to be working, at all.
What version of Go are you using (go version)?
go version go1.7 linux/amd64
What operating system and processor architecture are you using (go env)?
/proc/cpuinfo contents:
Imported net/http/pprof, put load on the application (webserver + wrk), fetched http://localhost:9999/debug/pprof/trace?seconds=1, ran go tool trace trace.out
Web browser opening to the trace viewer.
If I constrain the app to a single core with taskset, the trace viewer works.
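For anyone trying to reproduce this without a web server, a trace can also be captured in-process. The sketch below takes a different route from the net/http/pprof endpoint used in the report: it calls the standard `runtime/trace` package directly. The `captureTrace` and `busyWork` helpers and the workload itself are hypothetical, chosen only to keep several goroutines busy.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// captureTrace runs work with execution tracing enabled and returns the
// raw trace data.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // returns once all trace data has been written to buf
	return buf.Bytes(), nil
}

// busyWork is a stand-in workload that runs a few CPU-bound goroutines so
// that events are likely to be recorded on more than one core.
func busyWork() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 1000000; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	wg.Wait()
}

func main() {
	data, err := captureTrace(busyWork)
	if err != nil {
		fmt.Fprintln(os.Stderr, "trace failed:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to trace.out\n", len(data))
}
```

The resulting file can be opened with go tool trace trace.out, the same command as in the report. Running the binary under taskset, as described above, would be one way to compare single-core and multi-core behavior.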