runtime/pprof: panic in appendLocsForStack on fuchsia-arm64 #40823
I agree that at face value this looks like a regression of #35538, likely in http://golang.org/cl/221577. http://golang.org/cl/221577 was merged into the Fuchsia tree in https://fuchsia-review.googlesource.com/c/third_party/go/+/374435 (April 4). However, https://fxbug.dev/52575 was filed on May 13, just a few hours after https://fuchsia-review.googlesource.com/c/third_party/go/+/389033 (the Go 1.14.3 merge), which IMO strongly implicates something in that merge. @tamird do you know if the May 13 bug was really the first case of this you saw, or was there perhaps some earlier bug that's not attached?
Yes, there were earlier bugs; sorry to have misled you! https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718 was filed on January 24.
Ah, interesting. Regarding https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718: I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug. Regarding https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563: this crash looks like #38605, and the May 13 merge included the fix for it. Though it is not clear to me whether the appendLocsForStack panics were really new after that merge, or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.
TestBlockProfile failed with:
Change https://golang.org/cl/248728 mentions this issue:
With the change above applied, here's what we see:
Ah, this is very helpful. It looks like this might be a bug in runtime.expandFinalInlineFrame. From the PCs, I suspect that nanotime1 and nanotime are inlined into suspendG, so expandFinalInlineFrame should have expanded the stack to include nanotime and suspendG, but it seems that it didn't for some reason. Could you provide a copy of the crashing binary, or tell me how to build it so I can take a closer look?
So I think the problem here is that expandFinalInlineFrame doesn't account for the magic _System, _GC, etc. frames. It should expand the second frame if one of those appears at the bottom. I'll send out a patch for this. That said, I don't understand why the traceback would be failing here: https://fuchsia.googlesource.com/third_party/go/+/refs/heads/master/src/runtime/proc.go#4069
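As a rough illustration of that idea (a minimal sketch under assumed names; isMagicFrame, expandInlineFrames, and the sentinel values below are hypothetical stand-ins, not the runtime's real API), the expansion would simply skip over a trailing sentinel frame and expand the frame before it:

```go
// Sketch of the idea described above, not the actual runtime code.
package main

// Sentinel PCs the runtime uses for stacks it could not unwind
// (stand-ins here; the real values are internal to the runtime).
const (
	pcSystem uintptr = 1
	pcGC     uintptr = 2
)

// isMagicFrame reports whether pc is one of the runtime's sentinel
// frames (_System, _GC, etc.) rather than a real return address.
func isMagicFrame(pc uintptr) bool {
	return pc == pcSystem || pc == pcGC
}

// expandInlineFrames stands in for runtime.expandFinalInlineFrame: it
// would rewrite stk so that inlined calls in the frame at index i are
// expanded into separate entries. Details omitted in this sketch.
func expandInlineFrames(stk []uintptr, i int) []uintptr {
	return stk // placeholder
}

// expandFinalFrame picks which frame to expand: normally the last one,
// but if the last frame is a magic sentinel, the frame before it is the
// one that may contain collapsed inline frames.
func expandFinalFrame(stk []uintptr) []uintptr {
	if len(stk) == 0 {
		return stk
	}
	last := len(stk) - 1
	if isMagicFrame(stk[last]) && last > 0 {
		return expandInlineFrames(stk, last-1)
	}
	return expandInlineFrames(stk, last)
}

func main() {
	_ = expandFinalFrame([]uintptr{0x1000, pcSystem})
}
```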
Regarding why the traceback is failing in the first place: since nanotime1 does call into the VDSO, it is possible there is some interaction with the handling of vdsoPC/vdsoSP, or some weirdness with how profiling signals are delivered on Fuchsia (I haven't looked at that much), though I'll want to take a closer look at the binary. As an aside, you all will want to port http://golang.org/cl/246763 for Fuchsia. I don't think that is the direct cause of these crashes, but it is certainly suspicious.
If you can reproduce this issue, I think it would be valuable to see if making nanotime1 reentrant fixes the issue, to rule that out. I wrote https://fuchsia-review.googlesource.com/c/third_party/go/+/419640, which I believe is what you need, though it is untested since I don't have a Fuchsia environment.
I haven't been able to reproduce locally, but the issue only appears on arm64. The previous file I provided was the amd64 binary; here's the arm64: https://gofile.io/d/xPjFh3.
Reproducing only on ARM64 makes me wonder if this is a memory ordering issue.

VDSO calls set vdsoPC and vdsoSP so that the profiler can unwind past the VDSO call. Normally the profiling traceback runs in a signal handler on the same thread, so it is guaranteed to observe those stores. However, on Fuchsia it is running on a different thread. Thus, given that ARM allows stores to be reordered after stores, the profiling thread may observe vdsoPC and vdsoSP in an inconsistent state (e.g., one updated before calling into the VDSO and the other not). On AMD64, stores are ordered, so without a memory barrier we will still see consistent values (though perhaps stale!).

I see that the Windows port has a very similar structure, though there is no Windows ARM64 port, and I don't know if it would have the same problem.

cc @aclements @mknyszek to see if this sounds right, plus if you know more about Windows.
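To make the hypothesized interleaving concrete, here is a minimal Go sketch of the publication pattern being described (the struct and function names are stand-ins for illustration, not the runtime's actual code): the VDSO-calling thread stores an SP/PC pair with plain stores, and a sampler running on a different thread reads them. On a weakly ordered machine nothing prevents the reader from seeing the PC store without the matching SP store.

```go
// Illustration of the store-reordering hazard described above; these
// names mirror the discussion but this is not the runtime's code.
package main

type mLike struct {
	vdsoSP uintptr // caller's SP, meaningful only while vdsoPC != 0
	vdsoPC uintptr // caller's PC; nonzero means "inside a VDSO call"
}

// enterVDSO models what a nanotime1-style function does before calling
// into the VDSO: publish where the caller can be found. With plain
// stores, ARM64 may make the vdsoPC store visible to another CPU before
// the vdsoSP store, so a remote reader can see a nonzero PC paired with
// a stale SP.
func enterVDSO(m *mLike, sp, pc uintptr) {
	m.vdsoSP = sp
	m.vdsoPC = pc
}

// exitVDSO clears the fields after the VDSO call returns.
func exitVDSO(m *mLike) {
	m.vdsoPC = 0
	m.vdsoSP = 0
}

// sampleRemote models a profiler that suspends the target thread "at a
// distance" (as on Fuchsia or Windows) and then reads the pair from a
// different thread. Without any ordering guarantee it may observe an
// inconsistent pair and hand the unwinder a bogus starting point.
func sampleRemote(m *mLike) (sp, pc uintptr) {
	return m.vdsoSP, m.vdsoPC
}

func main() {
	m := &mLike{}
	enterVDSO(m, 0x1000, 0x2000)
	_, _ = sampleRemote(m)
	exitVDSO(m)
}
```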
@prattmic Your theory seems plausible to me, though I'm still tracking down how a "bad" vdsoPC/vdsoSP pair would lead to this particular panic.

Oh, I see: in an earlier post you posit that the traceback itself is failing, which is what produces the truncated stack ending in a magic frame.

If this is the problem then we need either (1) synchronization of vdsoPC/vdsoSP between the thread writing them and the thread reading them, or (2) to make the profiler tolerate a failed traceback.

Windows is the only platform (AFAIK) we officially support that has the "suspend thread at a distance" behavior; everything else uses signals. We had a
Yes,
Note that this isn't the exact same binary that crashed, but these instructions match up almost perfectly with the stack, so perhaps it is identical.

I think we ultimately need (1) as you suggest. (2) should stop the immediate crashing, but stacks will be misattributed, and I wonder if there are other odd cases we'll hit with odd memory ordering. We could make these vdso fields explicitly atomic, or perhaps the ARM64 assembly should just manually include a memory barrier.

Of course, this is predicated on the assumption that the memory-ordering theory is actually what's going on here.
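As a sketch of what "make these vdso fields explicitly atomic" could look like (assumed names again, with sync/atomic standing in for the runtime's internal atomics; the alternative mentioned above is an explicit barrier in the ARM64 assembly between the two plain stores):

```go
// Sketch of option (1): publish vdsoSP/vdsoPC so that a reader on
// another thread always sees a consistent pair. Assumed names; not the
// runtime's actual code.
package main

import "sync/atomic"

type mLike struct {
	vdsoSP uintptr
	vdsoPC uintptr
}

// enterVDSO stores SP first, then publishes PC with an atomic store.
// Go's sync/atomic operations are sequentially consistent, which is at
// least as strong as the release ordering needed here; in the runtime's
// ARM64 assembly the equivalent would be a store barrier between the
// two plain stores.
func enterVDSO(m *mLike, sp, pc uintptr) {
	atomic.StoreUintptr(&m.vdsoSP, sp)
	atomic.StoreUintptr(&m.vdsoPC, pc)
}

// sampleRemote loads PC first; if it observes a nonzero PC, the
// matching SP store is guaranteed to be visible as well.
func sampleRemote(m *mLike) (sp, pc uintptr, inVDSO bool) {
	pc = atomic.LoadUintptr(&m.vdsoPC)
	if pc == 0 {
		return 0, 0, false
	}
	return atomic.LoadUintptr(&m.vdsoSP), pc, true
}

func main() {
	m := &mLike{}
	enterVDSO(m, 0x1000, 0x2000)
	_, _, _ = sampleRemote(m)
}
```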
w.r.t. Windows, I forgot that Windows has no VDSO and thus doesn't use vdsoPC/vdsoSP.
This makes sense. I believe there are implicit memory barriers in the thread suspension path.
Thanks everyone for the investigations, and for the patience. Punting to Go 1.17.
We stopped seeing this for a while, but having just upgraded to go1.16rc1, it's back. EDIT: correction, we stopped seeing it because we started skipping the benchmark that produced the previous occurrence. It seems that we now have a new occurrence.
Ouch, sorry about that @tamird, and thank you for the report. I shall reinstate this as a release blocker for Go 1.16.
I don't think this needs to block the Go 1.16 release. Given that Go 1.16 RC 1 is already out and the bar for fixes is very high, it doesn't seem that this can make it to Go 1.16. I'll reset the milestone to Backlog, and it can be updated when the timing of the fix is more clear.
The fence was added, and we're still seeing this come up in the same context. I was never able to repro this on a 128-core, multi-node Cavium system with innumerable iterations over several days, even with cpusets that should've guaranteed that at least half of the threads ran on a different node (and thus increased the likelihood of a memory ordering issue). I'm not happy blaming this on qemu without actually understanding the problem, but I'm wondering if this is a similar issue to #9605. These ARM tests do run in a modern descendant of the qemu-based emulator that was being used in that issue, but that issue suggested the failure was flaky on real hardware, too. However, I have a hard time reconciling not being able to reproduce the issue once (in who-knows-how-many thousands of runs) on real hardware with it happening quite regularly (roughly once per hundreds or thousands of runs) in a qemu-ish arm64 environment.
The benchmark that was producing this failure had been disabled until yesterday; having re-enabled it, we're now seeing this even on fuchsia-x64, e.g. this build.
The problem here is two-fold (as previously partially discussed in #40823 (comment)):
I've mailed (untested!) https://fuchsia-review.googlesource.com/c/third_party/go/+/629462, which I hope fixes this. |
We've been seeing this issue for a while; filing this bug now that we've upgraded to go1.15. (See https://fxbug.dev/52575).
This happens sometimes when running the runtime/pprof tests:
There's no diff between Fuchsia's fork and go1.15 in src/runtime/pprof. See https://fuchsia.googlesource.com/third_party/go/+diff/go1.15..master.
Possibly a regression of #35538, related to #37446 and its fix in https://golang.org/cl/221577.
@prattmic