Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/trace: go tool trace trace time stamps out of order #24663

Closed
bbuchanan-bcs opened this issue Apr 3, 2018 · 8 comments
Closed

cmd/trace: go tool trace trace time stamps out of order #24663

bbuchanan-bcs opened this issue Apr 3, 2018 · 8 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bbuchanan-bcs
Copy link

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.10.1 linux/amd64

Does this issue reproduce with the latest release?

1.10.1 is the latest stable version to my knowledge

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/bo/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/bo/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build507967120=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Starting up a microservice of ours and use our loadtest to generate load. During load I took a trace.
wget http://localhost:6060/debug/pprof/trace?seconds=10

Attempting to view the trace resulted in:
go tool trace trace\?seconds\=10 2018/04/03 16:00:52 Parsing trace... failed to parse trace: time stamps out of order

This happens occasionally. Sometimes I can change GOMAXPROCS and it will work correctly. I attached an example bad trace file.

trace.zip

I can't provide source since it is proprietary code. This also occurred on go 1.9.2 as well. I upgraded to try to fix the issue. I will mention that we are running into other issues that might effect this but I'll open separate issues for them as I get time.

  • Go 1.10.1 appears to have javascript and html errors in chrome if I can get a trace to open. This didn't occur on 1.9.2
  • Our services seem to have a problem with things getting listed as (Did Not Finish) and causing the trace visualizations to grow vertically. This appears to be a bug and makes it almost impossible to get anything out of the trace. This was also something I was trying to fix by upgrading from 1.9.2 to 1.10.1

What did you expect to see?

The trace to parse and the web browser to launch.

What did you see instead?

The trace failed to parse.

@bcmills
Copy link
Contributor

bcmills commented Apr 3, 2018

(CC: @hyangah, @heschik)

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 3, 2018
@bcmills bcmills added this to the Go1.11 milestone Apr 3, 2018
@hyangah
Copy link
Contributor

hyangah commented Apr 3, 2018

Possibly a duplicate of #16755.

  152381622900 ProcStart p=0 g=0 off=133103 thread=12
...
  152381662473 ProcStart p=2 g=0 off=68074 thread=3
  152381662474 GoStart p=2 g=114 off=68079 g=114 seq=105
                                              152381662599 GoUnblock p=0 g=70 off=133185 g=179 seq=48
                                            A 152381663067 GoBlockSend p=0 g=70 off=133193
B 152381662477 GoUnblock p=2 g=114 off=68083 g=70 seq=92
  152381662481 GoBlockSelect p=2 g=114 off=68089
  152381662482 GoStart p=2 g=70 off=68092 g=70 seq=0
  152381662482 GoSysCall p=2 g=70 off=68095

Event A (Wait on channel send) must happen before Event B (Unblock) but A has a larger timestamp.
They are running in different threads and may run on different cores.

  • With what GOMAXPROCS, did you see it always working correctly? GOMAXPROCS <= 6 based on the trace.

  • Can you share the /proc/cpuinfo and /sys/devices/system/clocksource/clocksource0/current_clocksource ? Is your service running in a VM?

For the other two issues (UI related probably), please file separate issues.

@bbuchanan-bcs
Copy link
Author

I've gotten the issue intermittently with it unset (which should be 6), GOMAXPROCS=1 and GOMAXPROCS=2.

It is currently running in a VM using VirtualBox 5.1.10r112026 on a Windows 7 host.

Request information:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 0
cpu cores	: 6
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 1
cpu cores	: 6
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 2
cpu cores	: 6
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 3
cpu cores	: 6
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 4
cpu cores	: 6
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
stepping	: 4
microcode	: 0x19
cpu MHz		: 3691.116
cache size	: 10240 KB
physical id	: 0
siblings	: 6
core id		: 5
cpu cores	: 6
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx rdrand lahf_lm
bogomips	: 7382.23
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
tsc

@hyangah
Copy link
Contributor

hyangah commented Apr 4, 2018

I don't know how VirtualBox handles TSC reads but generally I wouldn't rely on TSC based measurements in VM. Given that the execution tracer depends on TSC, I don't know how to debug further this without fixing the issue #16755.

BTW, I fixed up the timestamps of the problematic record and now I see what you mean by "trace visualizations to grow vertically". It looks like some records are missing which caused slices representing goroutine's execution periods not to complete and made some goroutines look as if they were running on multiple Ps concurrently. I've never seen this kind of failure before. @dvyukov

strange

@hyangah
Copy link
Contributor

hyangah commented Apr 4, 2018

The stacked, weird looking graph is also related to the inaccuracy and insufficient resolution of the timestamps in the trace.

For example, here, the goroutine 85, resumed its execution at time 2113529119 and then, immediately called syscall and decided to wait for network. All these events occurred at time 2113529119. Less than a nanosecond according to the trace.

2113529119 GoStart p=0 g=85 off=136458 g=85 seq=0 link={ts=2113529119 type=GoBlockNet off=136464}
2113529119 GoSysCall p=0 g=85 off=136461 stkid=41
2113529119 GoBlockNet p=0 g=85 off=136464 stkid=52 link={ts=2122591806 type=GoUnblock off=136631}

When formulating the input for trace viewer, the duration for the slice is computed as 0, which trace viewer treated as unfinished time interval. We can make the tracer to make sure not to emit slice with duration 0.

But the more important issue is, again, time info measured through TSC in this platform can't be trusted.

@bbuchanan-bcs Have you tried to run your program on non-VM environment?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/104716 mentions this issue: cmd/trace: avoid emitting traceview slice with 0 duration

gopherbot pushed a commit that referenced this issue Apr 4, 2018
The trace viewer interprets the slice as a non-terminating
time interval which is quite opposit to what trace records indicate
(i.e., almostly immediately terminating time interval).
As observed in the issue #24663 this can result in quite misleading
visualization of the trace.

Work around the trace viewer's issue by setting a small value
(0.0001usec) as the duration if the time interval is not positive.

Change-Id: I1c2aac135c194d0717f5c01a98ca60ffb14ef45c
Reviewed-on: https://go-review.googlesource.com/104716
Reviewed-by: Heschi Kreinick <[email protected]>
@bbuchanan-bcs
Copy link
Author

bbuchanan-bcs commented Apr 4, 2018

I don't have a development environment set up that isn't on a VM. I will be able to pull it from a deployed environment eventually but that will be in a docker container in ECS in AWS, hopefully it doesn't have the same issue.
When I get some time to look at this again, I can try to run it on the host windows machine.

Edit: Actually, we have some linux boxes that I might be able to use for performance testing.

@hyangah
Copy link
Contributor

hyangah commented Apr 4, 2018

Thanks. I will close this issue and we can continue the discussion on the timestamp out of order error in #16755.

@hyangah hyangah closed this as completed Apr 4, 2018
@golang golang locked and limited conversation to collaborators Apr 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants