runtime: exit status 0xC0000005 from test process on windows-amd64-longtest builder #38440
2020-04-13T22:38:56-ca017a6/windows-amd64-longtest
|
That makes two in the last month, and none from the entire year before. Absent a specific theory of what's wrong, I think we need to treat this as a potential 1.15 compiler regression. |
Or a runtime regression. The other repeating mystery failure is on windows/386, which has windows in common. I took a quick peek through git history around 1/9/2020 and CL 213837 might be a candidate(?). |
Ooh, CL 213837 is an interesting hypothesis. (CC @aclements @cherrymui @dr2chase @ianlancetaylor). (For reference, the |
There is also the mysterious #36492 bug. Alex |
The same exit status appeared on a 32-bit system in a trybot run, reported as exit status -1073741819. https://storage.googleapis.com/go-build-log/bbdccf35/windows-386-2008_b817f391.log I don't see any in the saved logs, though, so that might have really been corruption of some sort. |
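The exit status shows up in this thread in three spellings: 3221225477, 0xC0000005, and -1073741819. They are the same 32-bit value; 0xC0000005 is the NTSTATUS code STATUS_ACCESS_VIOLATION. A minimal sketch of the conversion follows. The helper name `asSigned32` is made up for illustration and is not part of any Go package.

```go
package main

import "fmt"

// asSigned32 reinterprets an unsigned 32-bit exit status as a signed
// 32-bit integer, which is how some logs print it.
// (Hypothetical helper, for illustration only.)
func asSigned32(status uint32) int32 {
	return int32(status)
}

func main() {
	const status uint32 = 3221225477 // decimal form seen in the test log

	fmt.Printf("hex:    0x%X\n", status)            // 0xC0000005
	fmt.Printf("signed: %d\n", asSigned32(status)) // -1073741819
}
```

Seeing the negative number in a 386 log and the hex number in an amd64 log therefore does not by itself mean the failures differ.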
It's strange to see this exit status. It should be handled as an exception by the runtime package (code in runtime/signal_windows.go). Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2. I don't know under what conditions Windows can still have a 0xc0000005 exit status. |
The error seems to tend to happen in cmd/go |
I don't know what is going on here; not looking at it any further at present. |
The process returned ERROR_ACCESS_DENIED. ERROR_ACCESS_DENIED is just another Windows error. There is no restriction on what error a process can or cannot return.
I agree. Mind you, if something goes wrong in the exception handler, it won't exit the process properly, and Windows itself will terminate the process. Also, maybe we have a bug and the exception handler does not run for some reason.
The exception handler can fail while it is printing the stack trace.
If I run this program https://play.golang.org/p/r_g4sD76FOh it prints
on Windows. I reckon (I did not test) the program will return a 0xc0000005 exit code if I disable the exception handler. Do we know which program is failing? I still don't see which program failed from the test output. Maybe, if we know which program fails, we can add some temporary debugging code to it. Alex |
It's a range of different programs that are failing. The only consistency I've seen so far is that it happens in the cmd/go testsuite and in the top-level test testsuite, both of which run many different Go programs in parallel. Some of the Go programs that they run in parallel fail with exit status I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur. That is pure speculation, though; I have no evidence. |
The way threads are created by the Go runtime is by calling the CreateThread Windows API (in runtime.newosproc). So, if CreateThread fails and returns an error, it should be printed by the runtime. Same when Go allocates memory. I also considered a bug in how we implemented exception handling. Windows exception handling on 386 is very different from amd64. The fact that both the 386 and amd64 builders fail with this error suggests that the problem is not related to the exception handler. Maybe the crashing program does print at least some stack trace, but its output is lost by the parent process. Maybe stdout is not read for some reason by the parent process. The fact that the exit code is 0xc0000005 does not support that theory, but ... Alex |
It seems this became much more common in early April.
|
Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule. |
I'm working on bisecting this (which is a very slow process; given the 13% failure probability, it takes 34 successful runs to drive the chance of a missed repro under 1%). I have been able to reproduce it as far back as da8591b, which is 32 commits earlier than the first observation on the dashboard (ea7126f). |
Reproduced at e31d741, which is 64 commits before the first observation on the dashboard. Given the failure probability and the clear cut-off on the dashboard, I'm starting to wonder if something changed on the builder, rather than in the tree. Assuming it had the same 13% failure probability before the first dashboard observation, there's only a (1-0.13)**64 = 0.01% chance it wouldn't have appeared somewhere in those 64 commits. |
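The two figures quoted in the bisection comments (34 clean runs for under a 1% miss chance, and roughly 0.01% for 64 commits) both follow from treating each run as an independent trial with a 13% failure probability. A small sketch that reproduces the arithmetic is below. The function names are made up for illustration, and independence between runs is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// missProbability returns the chance that n independent runs all pass
// when each run fails with probability p.
func missProbability(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

// runsNeeded returns the smallest n for which missProbability(p, n) is
// strictly below target. It returns -1 for inputs where no such n exists.
func runsNeeded(p, target float64) int {
	if p <= 0 || target <= 0 {
		return -1
	}
	n := 0
	for missProbability(p, n) >= target {
		n++
	}
	return n
}

func main() {
	const p = 0.13 // per-run failure probability quoted in the thread

	fmt.Println("clean runs needed for <1% miss chance:", runsNeeded(p, 0.01))
	fmt.Printf("chance of 64 clean commits in a row: %.4f%%\n",
		100*missProbability(p, 64))
}
```

With p = 0.13, 33 clean runs still leave slightly more than a 1% chance of a missed repro, which is why 34 is the threshold.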
For some reason (I don’t recall why), when I first looked at this in #38440 (comment), I thought the first occurrence was on 1/9/2020. Have you looked around that date range? |
This is definitely not happening at tag go1.14.3. @josharian, I haven't looked around that range yet. Do you have any recollection of where 1/9/2020 came from? :) |
I should note that go1.14.3 included CL 213837, which was your hypothesized culprit, but I wasn't able to reproduce at that tag (I even accidentally ran a lot more iterations than I meant to!) |
Ah, I think it's because of #37360 (comment), which I mentally transposed into this issue. They may be related, in that they are both weird exit statuses on Windows. |
Just a guess (completely unsure): the linker has written the output executable, and another process tries to execute it but fails (to even start running code), so we don't get any stack traces. (It might be similar to what happens on UNIX if we somehow screw up the ELF header or the dynamic loader fails.) |
You're right that hostobjCopy was barking up the wrong tree. So I went to see how
That would seem fine according to what we know of the interaction between Windows IO and the page mapper so far. But then I opened up the function that emitted that, and look what I found:
// moveOrCopyFile is like 'mv src dst' or 'cp src dst'.
func (b *Builder) moveOrCopyFile(dst, src string, perm os.FileMode, force bool) error {
	if cfg.BuildN {
		b.Showcmd("", "mv %s %s", src, dst)
		return nil
	}
	// If we can update the mode and rename to the dst, do it.
	// Otherwise fall back to standard copy.
	// If the source is in the build cache, we need to copy it.
	if strings.HasPrefix(src, cache.DefaultDir()) {
		return b.copyFile(dst, src, perm, force)
	}
	// On Windows, always copy the file, so that we respect the NTFS
	// permissions of the parent folder. https://golang.org/issue/22343.
	// What matters here is not cfg.Goos (the system we are building
	// for) but runtime.GOOS (the system we are building on).
	if runtime.GOOS == "windows" {
		return b.copyFile(dst, src, perm, force)
	}
If we look at
	df, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil && base.ToolIsWindows {
		// Windows does not allow deletion of a binary file
		// while it is executing. Try to move it out of the way.
		// If the move fails, which is likely, we'll try again the
		// next time we do an install of this binary.
		if err := os.Rename(dst, dst+"~"); err == nil {
			os.Remove(dst + "~")
		}
		df, err = os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	}
	if err != nil {
		return fmt.Errorf("copying %s: %w", src, err) // err should already refer to dst
	}
	_, err = io.Copy(df, sf)
	df.Close()
So again, voila. We have the pattern of: write using Actually though, I deleted a step from that
The missing step is the call to
	f, err = os.OpenFile(file, os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	if err := buildid.Rewrite(f, matches, newID); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
So that means that either I remain convinced that CL 235639 is correct and necessary if we take heed of MSDN's warnings. |
Take a normal working .exe, and then write zeros to various swaths of the file. Try to execute it. I would bet that there's at least a few places where writing pages of zeros causes the resultant executable to crash.
The option 3 theory says: Mapping, unmapping, mapping will give a consistent file when reading from the second mapping. But mapping, unmapping, The option 1 theory says: Mapping, unmapping, mapping will not give a consistent file when reading from the second mapping, because the dirty bits from those pages are just dropped entirely when unmapping, and there's no nice pagecache and kswapd keeping things consistent like on Linux. At first, option 1 seemed a tad bit more likely to me, because that would help explain the But now, seeing how many times Go object files are copied around (see this latest #38440 (comment) ), it seems like option 3 would be the most likely. Option 1 would be annoying and surprising behavior, after all, and option 3 is consistent with what's been expressed clearly on MSDN. And now we have Go code that precisely matches the bad edge case that MSDN says to avoid. |
To be clear, I agree that CL 235639 is probably what we should do. I also think that it is not safe to mix mmap and file I/O in the same process. That is why in the past, when the linker itself used both mmap and file I/O for the output binary, we did have a But I'm not clear whether it is safe for another process, after waiting for the linker process to terminate (which unmaps the file and closes the FD before exiting), to do anything with the file, whether reading or executing it. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility. (Anything could happen after the linker process has terminated. Besides what the go command does, the user may open the file, execute it, or read it for inspection. We need to make sure that is safe.) |
I'm also not really sure what |
@aclements If I'm reading you right, you're looking for a crash dump from cmd/link or cmd/go. But from this discussion it seems more likely to me that the executable is corrupt on disk, so running it crashes early, and the Go signal handling is never installed. |
@ianlancetaylor , I'm looking for a crash dump from whatever's crashing. If it were cmd/link or cmd/go, I would expect a Go traceback, and since we don't see that, I suspect it's not cmd/link or cmd/go. If it's the corrupt binary, I would expect the crash dump to implicate some binary loader or something very early in process start up. |
We know that another process using normal boring file I/O isn't safe. But what's unclear is whether remapping after dropping those pages is coherent. That's my option (1) earlier. This seems definitely possible, though it'd be a bit wild. The issue is essentially whether dirty pages are stored with their dirty bit intact in some global page cache. Linux does this. The question is: does Windows? |
I'm not certain about the coherence question (based on the kernel data structures at play, I believe map+unmap+map (even with intervening close+open), will be coherent, but I can't speak from a position of authority). But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately. |
Ah, that's interesting. It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the If it were a problem in the linker itself, we would expect a Go traceback (and possibly user-mode crash dumps). If the main contents of the binary were corrupted (e.g., incoherence led to zeroed pages being mapped in), we would expect a much wider variety of crashes. |
That would be surprising to me. I would expect that However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling |
I'm not sure I understand what you're saying. What do you mean by "image section"? I wasn't saying there are any writable mappings at that point. My theory is that
We've seen failures in situations where the binary does not get copied. |
This is my guess as well. It probably needs to read the header to know the addresses and ranges of the mappings. I guess it is the header that gets corrupted, so it fails very early so that it doesn't print much information. |
I'm not seeing anywhere in ntoskrnl where the process image is being read through means other than its mapping. It looks like NtCreateUserProcess calls createfile and then maps it pretty soon after. I didn't see much in userspace CreateProcessInternalW either, besides reading file metadata. But @jstarks can probably consult the source and find something more definitive than me poking around for 5 minutes in the disassembler. |
Actually, NtCreateUserProcess->MmCreateSpecialImageSection->MiCreateSection->MiCreateImageOrDataSection->MiCreateNewSection->MiCreateImageFileMap->MiReadImageHeaders->MiPageRead->IoPageReadEx->an irp gets queued up. So that might give credence to what @cherrymui just suggested. |
It looks like MiCreateImageFileMap flushes the data section (i.e. waits for modified pages to be written to the backing file) before doing any reads. So if the MiReadImageHeaders read is the first one in the CreateProcess path, then I wouldn't expect any coherence issues in that path. Let me see if I can get someone from the Mm team to chime in. |
Would be very interested. I'd also like to know if the above option (1) is a real possibility. Specifically, will Windows just discard unflushed pages and ignore the dirty bit if you unmap without flushing first? I'd expect the answer to be "no, that would be insane", but ya never know. The whole caveat of fscache/normal I/O being somehow separate from mmap'd I/O is sufficiently weird that anything sort of seems possible there. |
Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached, i.e. without We (well, not me, but others here in Windows) will try to get a local repro under stress. In the meantime, it sounds like flushing the mapping explicitly is a reasonable workaround. Thanks for looping me in. @aclements, on which OS build are you reproducing this? |
Oh, good, that's reassuring. I was worried that these being not coherent implied some very odd IO cache architecture. |
I tried simulating a corrupted file on Windows https://play.golang.org/p/5aZQqi7pS8o and this is what I see
Exactly what we see on the builder. Child process has no output (stack trace or otherwise), because it has not run yet. And parent process prints child's exit code of 0xC0000005, and empty output. Alex |
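The experiment discussed above (take a working executable and overwrite part of it with zeros) can be sketched as follows. This is not the playground program linked in the comment, whose contents are not reproduced here. The helper names `zeroRange` and `demo` are made up, and the demo uses a temporary stand-in file rather than a real .exe so it can run anywhere.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

// zeroRange overwrites n bytes of the file at path, starting at byte
// offset off, with zeros. (Hypothetical helper, for illustration only.)
func zeroRange(path string, off int64, n int) error {
	f, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	if _, err := f.WriteAt(make([]byte, n), off); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// demo builds a 16 KiB stand-in file of 0xFF bytes, zeroes its second
// 4 KiB page, and returns how many zero bytes the file then contains.
func demo() (int, error) {
	tmp, err := os.CreateTemp("", "victim-*.bin")
	if err != nil {
		return 0, err
	}
	defer os.Remove(tmp.Name())

	_, err = tmp.Write(bytes.Repeat([]byte{0xFF}, 16384))
	if cerr := tmp.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		return 0, err
	}

	if err := zeroRange(tmp.Name(), 4096, 4096); err != nil {
		return 0, err
	}
	data, err := os.ReadFile(tmp.Name())
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte{0}), nil
}

func main() {
	n, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("zero bytes after corruption:", n)
}
```

To repeat the experiment on a real binary, one would point `zeroRange` at a copy of an executable (never the original), vary the offset, and then try to run the copy.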
This is on the windows-amd64-longtest builder, which reports as:
$ VM=$(gomote create windows-amd64-longtest); gomote run -system $VM systeminfo
Host Name: SERVER-2016-V7-
OS Name: Microsoft Windows Server 2016 Datacenter
OS Version: 10.0.14393 N/A Build 14393
OS Manufacturer: Microsoft Corporation
OS Configuration: Standalone Server
OS Build Type: Multiprocessor Free
Registered Owner: N/A
Registered Organization: N/A
Product ID: -
Original Install Date: 7/2/2018, 5:50:15 PM
System Boot Time: 6/2/2020, 1:10:45 PM
System Manufacturer: Google
System Model: Google Compute Engine
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
[01]: Intel64 Family 6 Model 63 Stepping 0 GenuineIntel ~2300 Mhz
BIOS Version: Google Google, 1/1/2011
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume1
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC+00:00) Monrovia, Reykjavik
Total Physical Memory: 14,746 MB
Available Physical Memory: 13,750 MB
Virtual Memory: Max Size: 15,770 MB
Virtual Memory: Available: 14,800 MB
Virtual Memory: In Use: 970 MB
Page File Location(s): C:\pagefile.sys
Domain: WORKGROUP
Logon Server: \\SERVER-2016-V7-
Hotfix(s): 5 Hotfix(s) Installed.
[01]: KB3186568
[02]: KB3199986
[03]: KB4049065
[04]: KB4132216
[05]: KB4284880
Network Card(s): 1 NIC(s) Installed.
[01]: Red Hat VirtIO Ethernet Adapter
Connection Name: Ethernet
DHCP Enabled: Yes
DHCP Server: 169.254.169.254
IP address(es)
[01]: 10.240.0.19
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
@alexbrainman, fascinating. Y'know, if the text is zeroes, it disassembles to But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right? Or maybe this is the OS bug @jstarks is sensing. And why would the text be incoherent but not the executable headers (which we write out after the text)? So many questions... |
I made my example up. I don't know if only the .text section is corrupted. And I don't know if the corrupted area is filled with 0. Alex |
The "where" part is as yet undetermined, but the "what" part is almost certainly zeros, as that's what the file contains when you truncate a zero sized file to a non-zero sized file. |
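The point about truncation is easy to check: growing an empty file with a truncate call, without writing any data, yields a file that reads back as all zeros. A minimal sketch is below. The helper names are made up for illustration, and it says nothing about where in the output binary such unwritten regions would end up.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// extendedContents creates an empty temporary file, grows it to size
// bytes with Truncate without writing any data, and returns what a
// reader then sees. (Hypothetical helper, for illustration only.)
func extendedContents(size int64) ([]byte, error) {
	f, err := os.CreateTemp("", "truncate-*.bin")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if err := f.Truncate(size); err != nil {
		return nil, err
	}
	return os.ReadFile(f.Name())
}

// allZero reports whether every byte in b is zero.
func allZero(b []byte) bool {
	for _, c := range b {
		if c != 0 {
			return false
		}
	}
	return true
}

func main() {
	data, err := extendedContents(8192)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(data), allZero(data))
}
```

So any page of the output file that never receives its intended contents would be expected to show up as zeros, which matches the simulated corruption earlier in the thread.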
Thanks for explaining. I suppose I hit the jackpot when I decided to put 0s there. Alex |
Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:
The FAIL int/t1 is unexpected in this test. Exit status 3221225477 is 0xC0000005, which some cursory searching seems to suggest is a generic “access violation” error. That suggests possible memory corruption.
I haven't seen any repeats of this error so far.
CC @alexbrainman @zx2c4