Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: exit status 0xC0000005 from test process on windows-amd64-longtest builder #38440

Closed
bcmills opened this issue Apr 14, 2020 · 84 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Apr 14, 2020

Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:

windows-amd64-longtest …
…
--- FAIL: TestTestCache (4.15s)
    go_test.go:2405: 
        
        INITIAL
        
        
    go_test.go:2413: running testgo [test -x -v -short t/...]
    go_test.go:2413: standard output:
    go_test.go:2413: exit status 3221225477
        FAIL	t/t1	0.486s
        exit status 3221225477
        FAIL	t/t2	0.481s
        === RUN   Test3
            Test3: t3_test.go:6: 1
        --- PASS: Test3 (0.00s)
        PASS
        ok  	t/t3	0.462s
        === RUN   Test4
            Test4: t4_test.go:6: 1
        --- PASS: Test4 (0.00s)
        PASS
        ok  	t/t4	0.467s
        FAIL
        
    go_test.go:2413: standard error:

The FAIL in t/t1 is unexpected in this test.

Exit status 3221225477 is 0xC0000005, which some cursory searching seems to suggest is a generic “access violation” error. That suggests possible memory corruption.

I haven't seen any repeats of this error so far.

CC @alexbrainman @zx2c4

@bcmills bcmills added OS-Windows NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 14, 2020
@bcmills bcmills added this to the Unplanned milestone Apr 14, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Apr 14, 2020

2020-04-13T22:38:56-ca017a6/windows-amd64-longtest

--- FAIL: TestScript (0.01s)
    --- FAIL: TestScript/test_go111module_cache (1.22s)
        script_test.go:205: 
            > env GO111MODULE=on
            > go mod init foo
            [stderr]
            go: creating new go.mod: module foo
            > go test
            [stdout]
            exit status 3221225477
            FAIL	foo	0.267s
            [exit status 1]
            FAIL: testdata\script\test_go111module_cache.txt:3: unexpected command failure
            
FAIL
FAIL	cmd/go	95.006s

@bcmills
Copy link
Contributor Author

bcmills commented Apr 14, 2020

That makes two in the last month, and none from the entire year before. Absent a specific theory of what's wrong, I think we need to treat this as a potential 1.15 compiler regression.

CC @randall77 @josharian

@bcmills bcmills modified the milestones: Unplanned, Go1.15 Apr 14, 2020
@josharian
Copy link
Contributor

Or a runtime regression. The other repeating mystery failure is on windows/386, which has windows in common. I took a quick peek through git history around 1/9/2020 and CL 213837 might be a candidate(?).

@bcmills
Copy link
Contributor Author

bcmills commented Apr 15, 2020

Ooh, CL 213837 is an interesting hypothesis. (CC @aclements @cherrymui @dr2chase @ianlancetaylor).

(For reference, the windows/386 issue is #37360.)

@alexbrainman
Copy link
Member

There is also mysterious #36492 bug.

Alex

@bcmills
Copy link
Contributor Author

bcmills commented Apr 21, 2020

@bcmills
Copy link
Contributor Author

bcmills commented Apr 22, 2020

@bcmills
Copy link
Contributor Author

bcmills commented Apr 30, 2020

2020-04-30T02:29:55-eda6fe3/windows-amd64-longtest
2020-04-29T21:29:17-faafdf5/windows-amd64-longtest
2020-04-29T21:29:08-b3863fb/windows-amd64-longtest
2020-04-29T20:33:31-197a2a3/windows-amd64-longtest
2020-04-29T19:03:10-35dce1d/windows-amd64-longtest
2020-04-29T18:56:19-b13ce66/windows-amd64-longtest
2020-04-29T18:50:32-202c43b/windows-amd64-longtest
2020-04-29T08:57:33-e1a96b8/windows-amd64-longtest
2020-04-28T18:41:07-39380e8/windows-amd64-longtest
2020-04-27T20:06:57-d2f5e4e/windows-amd64-longtest
2020-04-27T18:19:26-a136919/windows-amd64-longtest
2020-04-27T17:49:51-8af8fa9/windows-amd64-longtest
2020-04-27T17:20:00-1c4e9b2/windows-amd64-longtest
2020-04-27T17:19:27-f4e13b8/windows-amd64-2016
2020-04-27T17:19:16-5029c36/windows-amd64-longtest
2020-04-27T15:53:46-9b9556f/windows-amd64-2016
2020-04-27T10:23:01-925d6b3/windows-amd64-longtest
2020-04-25T18:40:45-7ef28cb/windows-amd64-longtest
2020-04-25T02:19:12-49f10f3/windows-amd64-longtest
2020-04-24T23:58:38-396833c/windows-amd64-longtest
2020-04-24T23:12:43-b0a8754/windows-amd64-longtest
2020-04-24T23:12:23-6b5ab20/windows-amd64-longtest
2020-04-24T23:06:59-1c54535/windows-amd64-longtest
2020-04-24T23:06:48-47b5efa/windows-amd64-longtest
2020-04-24T22:01:49-d8d3542/windows-amd64-longtest
2020-04-24T21:12:21-943a0d0/windows-amd64-longtest
2020-04-24T17:13:12-da33f9c/windows-amd64-longtest
2020-04-24T13:26:34-ee8972c/windows-amd64-longtest
2020-04-23T19:40:07-6303c34/windows-amd64-longtest
2020-04-22T16:19:35-1257900/windows-amd64-2016

@ianlancetaylor
Copy link
Member

ianlancetaylor commented May 1, 2020

The same exit status on a 32-bit system in a trybot run as exit status -1073741819.

https://storage.googleapis.com/go-build-log/bbdccf35/windows-386-2008_b817f391.log

I don't see any in the saved logs, though, so that might have really been corruption of some sort.

@ianlancetaylor
Copy link
Member

It's strange to see this exit status. It should be handled as an exception by the runtime package (code in runtime/signal_windows.go). Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2. I don't know under what conditions Windows can still have a 0xc0000005 exit status.

@ianlancetaylor
Copy link
Member

The error seems to tend to happen in cmd/go TestScript or when running the tests in the test directory, which are both cases where we are executing a large number of child processes in parallel.

@ianlancetaylor
Copy link
Member

I don't know what is going on here; not looking at it any further at present.

@alexbrainman
Copy link
Member

It's strange to see this exit status.

The process returned ERROR_ACCESS_DENIED. ERROR_ACCESS_DENIED is just another Windows error. There is no restriction on what error process can and cannot return.

It should be handled as an exception by the runtime package (code in runtime/signal_windows.go).

I agree,

Mind you, if something goes wrong in exception handle, it won't exit the process properly, and Windows itself will exit process.

Also, maybe we have a bug and exception handler does not run for some reason.

Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2.

The exception handler can fail as it is printing stack trace.

I don't know under what conditions Windows can still have a 0xc0000005 exit status.

If I run this program

https://play.golang.org/p/r_g4sD76FOh

it prints

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x9853ca]

goroutine 1 [running]:
main.main()
        /home/a/src/issues/go/38440/a.go:9 +0x2a

on Windows. I reckon (I did not test) the program will return 0xc0000005 exit code, if I disable exception handler.

Do we know which program is failing? I still don't see which program failed from test output. Maybe, if we know which program fails, we can add some temporary debugging code to it.

Alex

Alex

@ianlancetaylor
Copy link
Member

It's a range of different programs that are failing. The only consistency I've seen so far is that it happens in the cmd/go testsuite and in the top-level test testsuite, both of which run many different Go programs in parallel. Some of the Go programs that they run in parallel fail with exit status 0xc0000005 and no other output, implying, as you say, that the exception handler did not take effect for some reason.

I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur. That is pure speculation, though; I have no evidence.

@alexbrainman
Copy link
Member

I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur.

The way threads are created by Go runtime is by calling CreateThread Windows API (in runtime.newosproc). So, if CreateThread fails and returns an error, it should be printed by runtime. Same when Go allocates memory.

I also considered bug in how we implemented exception handling. Windows exception handling on 386 is very different from amd64. The fact that both 368 and amd64 builders fail with this error points that the problem is not exception handler related.

Maybe crashing program does print at least some stack trace, but its output is lost by the parent process. Maybe stdout is not read for some reason by parent process. The fact that exit code is 0xc0000005 does not supports that theory, but ...

Alex

@aclements
Copy link
Member

It seems this became much more common in early April.

$ greplogs -dashboard -E "exit status 3221225477" -l | findflakes -paths
First observed ea7126f 10 Apr 16:24 2020 (455 commits ago)
Last observed  364a05e 14 May 15:18 2020 (2 commits ago)
76% chance failure is still happening
13% failure probability (59 of 454 commits)
Likely culprits:
   13% ea7126f cmd/compile: use a Sym type instead of interface{} for symbolic offsets
   11% 782fcb4 regexp: add (*Regexp).SubexpIndex
   10% 245409e runtime: replace the type of netpollWakeSig from a uintptr to a uint32
    8% 7b90c1c os/exec: extend grace period in TestExtraFiles to 20% of overall deadline
    7% 28157b3 cmd/compile: start implementing strongly typed aux and auxint fields
    6% a1550d3 cmd/compile: use isel with variable shifts on ppc64x
    6% 7c0ee11 test: deflaking measures for runtime gdb test
    5% b191c60 cmd/oldlink: delete darwin/386 host object loading support
    4% a3ef804 cmd/internal/obj/ppc64: leverage AND operation to calculate remainder
    4% c1f0eda net: convert many Close tests to use parallel subtests
Past failures:
  8a958bb 29 Apr 16:24 2018 to ad4e637 04 Jul 00:15 2018
    0.14% failure probability (2 of 695 commits)
  08c4348 10 Apr 18:40 2015 (isolated failure)

@toothrot
Copy link
Contributor

Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.

@aclements
Copy link
Member

I'm working on bisecting this (which is a very slow process; given the 13% failure probability, it takes 34 successful runs to drive the chance of a missed repro under 1%). I have been able to reproduce it as far back as da8591b, which is 32 commits earlier than the first observation on the dashboard (ea7126f).

@aclements
Copy link
Member

Reproduced at e31d741, which is 64 commits before the first observation on the dashboard.

Given the failure probability and the clear cut-off on the dashboard, I'm starting to wonder if something changed on the builder, rather than in the tree. Assuming it had the same 13% failure probability before the first dashboard observation, there's only a (1-0.13)**64 = 0.01% chance it wouldn't have appeared somewhere in those 64 commits.

@josharian
Copy link
Contributor

For some reason (I don’t recall why), when I first looked at this in #38440 (comment), I thought the first occurrence was on 1/9/2020. Have you looked around that date range?

@aclements
Copy link
Member

This is definitely not happening at tag go1.14.3.

@josharian, I haven't looked around that range yet. Do you have any recollection where 1/9/2020 come from? :)

@aclements
Copy link
Member

I should note that go1.14.3 included CL 213837, which was your hypothesized culprit, but I wasn't able to reproduce at that tag (I even accidentally ran a lot more iterations than I meant to!)

@josharian
Copy link
Contributor

Ah, I think it's because of #37360 (comment), which I mentally transposed into this issue. They may be related, in that they are both weird exit statuses on Windows.

@cherrymui
Copy link
Member

Just a guess (completely unsure): the linker has written the output executable, and another process tries to execute it but failed (to even start running code), so we don't get any stack traces. (It might be similar to on UNIX if we somehow screw the ELF header or the dynamic loader fails.)

@zx2c4
Copy link
Contributor

zx2c4 commented May 31, 2020

I don't think hostobjCopy does anything with the output file we wrote using mmap. hostobjCopy copies "host object files", which are non-Go object files we are linking in.

You're right that hostobjCopy was barking up the wrong tree. So I went to see how go build does things and found the -x flag, and saw that it seemed, on Linux at least, to mv /path/to/temporary/exe /path/to/final/exe:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
mv $WORK/b001/exe/a.out a

That would seem fine according to what we know of the interaction between Windows IO and the page mapper so far. But then I opened up the function that emitted that, and look what I found:

// moveOrCopyFile is like 'mv src dst' or 'cp src dst'.
func (b *Builder) moveOrCopyFile(dst, src string, perm os.FileMode, force bool) error {
        if cfg.BuildN {
                b.Showcmd("", "mv %s %s", src, dst)
                return nil
        }

        // If we can update the mode and rename to the dst, do it.
        // Otherwise fall back to standard copy.

        // If the source is in the build cache, we need to copy it.
        if strings.HasPrefix(src, cache.DefaultDir()) {
                return b.copyFile(dst, src, perm, force)
        }

        // On Windows, always copy the file, so that we respect the NTFS
        // permissions of the parent folder. https://golang.org/issue/22343.
        // What matters here is not cfg.Goos (the system we are building
        // for) but runtime.GOOS (the system we are building on).
        if runtime.GOOS == "windows" {
                return b.copyFile(dst, src, perm, force)
        }

If we look at b.copyFile, we see another boring file I/O copy operation:

        df, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        if err != nil && base.ToolIsWindows {
                // Windows does not allow deletion of a binary file
                // while it is executing. Try to move it out of the way.
                // If the move fails, which is likely, we'll try again the
                // next time we do an install of this binary.
                if err := os.Rename(dst, dst+"~"); err == nil {
                        os.Remove(dst + "~")
                }
                df, err = os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        }
        if err != nil {
                return fmt.Errorf("copying %s: %w", src, err) // err should already refer to dst
        }

        _, err = io.Copy(df, sf)
        df.Close()

So again, voila. We have the pattern of: write using MapViewOfFile, forget to flush it, and then read that file using normal boring file I/O. MSDN makes it very clear that if you forget to flush, the contents you made with mapping might be out of sync with that of normal file I/O. That means the final executable that gets executed has some random parts of the file as zeros instead of as the actual file contents. This is, again, option 3 above.

Actually though, I deleted a step from that -x trace. Here's the real one:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
/usr/lib/go/pkg/tool/linux_amd64/buildid -w $WORK/b001/exe/a.out # internal
mv $WORK/b001/exe/a.out a

The missing step is the call to buildid. Guess what this does? More normal boring file I/O:

        f, err = os.OpenFile(file, os.O_WRONLY, 0)
        if err != nil {
                log.Fatal(err)
        }
        if err := buildid.Rewrite(f, matches, newID); err != nil {
                log.Fatal(err)
        }
        if err := f.Close(); err != nil {
                log.Fatal(err)
        }

So that means that either buildid is writing things back inconsistently, or that the mv that is actually a filecopy is writing things back inconsistently. Either way, getting an inconsistent executable is to be expected. With link using mapped I/O, the rest of the toolchain that runs after and doesn't use mapped I/O but rather uses normal boring file I/O is bound to create bogus files some percentage of the time, because its reads won't be consistent with link's writes.

I remain convinced that CL 235639 is correct and necessary if we take heed of MSDN's warnings.

@zx2c4
Copy link
Contributor

zx2c4 commented May 31, 2020

The one thing that's still bothering me is, if this is where the crash is, why don't we see a stack trace?

Take a normal working .exe, and then write zeros to various swaths of the file. Try to execute it. I would bet that there's at least a few places where writing pages of zeros causes the resultant executable to crash.

Is this true even when the file has been unmapped and reopened with a different handle? In UNIX, if you're going to interleave mmap'ed IO and read/write IO, you're advised to msync between them, but once you've closed, unmapped, and reopened the file, it's safe to do read/write IO. (msync is also a giant hammer that you don't want to overuse!)

The option 3 theory says: Mapping, unmapping, mapping will give a consistent file when reading from the second mapping. But mapping, unmapping, ReadFile-ing will not give a consistent file when using the ReadFile. MSDN makes this latter point very clear.

The option 1 theory says: Mapping, unmapping, mapping will not give a consistent file when reading from the second mapping, because the dirty bits from those pages are just dropped entirely when unmapping, and there's no nice pagecache and kswapd keeping things consistent like on Linux.

At first, option 1 seemed a tad bit more likely to me, because that would help explain the ReadFile quirk in a deeper way, as well as explain why we're getting these crashes: that when the PE loader maps the executable file, it's after those pages have been dropped entirely from before, and so its missing previously written but not flushed chunks.

But now, seeing how many times Go object files are copied around (see this latest #38440 (comment) ), it seems like option 3 would be the most likely. Option 1 would be annoying and surprising behavior, after all, and option 3 is consistent with what's been expressed clearly on MSDN. And now we have Go code that precisely matches the bad edge case that MSDN says to avoid.

@cherrymui
Copy link
Member

To be clear, I agree that CL 235639 is probably what we should do.

I also think that it is not safe to mix mmap and file I/O in the same process. That is why in the past, when the linker itself used both mmap and file I/O for the output binary, we did have a msync (FlushViewOfFile on windows, if mapping were used, but it was not due to permission bug) between the two. In this cycle, as we changed to use mmap only, the msync was removed (CL https://golang.org/cl/228385).

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

(Anything could happen after the linker process terminated. Besides what go command does, the user may open the file, execute it, or read it for inspection. We need to make sure that is safe.)

@cherrymui
Copy link
Member

cherrymui commented May 31, 2020

I'm also not really sure what go build does is really the cause. I looked at some of the failures, and they failed at go run or go test, which don't seem to copy the output file or write buildid, but just execute the binary. So it seems only executing the output after the linker process terminated is still not safe.

@ianlancetaylor
Copy link
Member

@aclements If I'm reading you right, you're looking for a crash dump from cmd/link or cmd/go. But from this discussion it seems more likely to me that the executable is corrupt on disk, so running it crashes early, and the Go signal handling is never installed.

@aclements
Copy link
Member

@ianlancetaylor , I'm looking for a crash dump from whatever's crashing. If it were cmd/link or cmd/go, I would expect a Go traceback, and since we don't see that, I suspect it's not cmd/link or cmd/go. If it's the corrupt binary, I would expect the crash dump to implicate some binary loader or something very early in process start up.

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2020

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

We know that another process using normal boring file I/O isn't safe. But what's unclear is whether remapping after dropping those pages is coherent. That's my option (1) earlier. This seems definitely possible, though it'd be a bit wild. The issue is essentially whether dirty pages are stored with their dirty bit in tact in some global page cache. Linux does this. The question is: does Windows?

@jstarks
Copy link

jstarks commented Jun 1, 2020

I'm not certain about the coherence question (based on the kernel data structures at play, I believe map+unmap+map (even with intervening close+open), will be coherent, but I can't speak from a position of authority).

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

@aclements
Copy link
Member

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

Ah, that's interesting.

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

If it were a problem in the linker itself, we would expect a Go traceback (and possibly user-mode crash dumps). If the main contents of the binary were corrupted (e.g., incoherence led to zeroed pages being mapped in), we would expect a much wider variety of crashes.

@jstarks
Copy link

jstarks commented Jun 1, 2020

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

@aclements
Copy link
Member

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

I'm not sure I understand what you're saying. What do you mean by "image section"? I wasn't saying there are any writable mappings at that point.

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

We've seen failures in situations where the binary does not get copied.

@cherrymui
Copy link
Member

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

This is my guess as well. It probably needs to read the header to know the addresses and ranges of the mappings. I guess it is the header that gets corrupted, so it fails very early so that it doesn't print much information.

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2020

I'm not seeing anywhere in ntoskrnl where the process image is being read through means other than its mapping. It looks like NtCreateUserProcess calls createfile and then maps it pretty soon after. I didn't see much in userspace CreateProcessInternalW either, besides reading file metadata. But @jstarks can probably consult the source and find something more definitive than me poking around for 5 minutes in the disassembler.

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2020

Actually, NtCreateUserProcess->MmCreateSpecialImageSection->MiCreateSection->MiCreateImageOrDataSection->MiCreateNewSection->MiCreateImageFileMap->MiReadImageHeaders->MiPageRead->IoPageReadEx->an irp gets queued up. So that might give credence to what @cherrymui just suggested.

@jstarks
Copy link

jstarks commented Jun 1, 2020

It looks like MiCreateImageFileMap flushes the data section (i.e. waits for modified pages to be written to the backing file) before doing any reads. So if the MiReadImageHeaders read is the first one in the CreateProcess path, then I wouldn't expect any coherence issues in that path.

Let me see if I can get someone from the Mm team to chime in.

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 1, 2020

Let me see if I can get someone from the Mm team to chime in.

Would be very interested. I'd also like to know if the above option (1) is a real possibility. Specifically, will Windows just discard unflushed pages and ignore the dirty bit if you unmap without flushing first? I'd expect the answer to be "no, that would be insane", but ya never know. The whole caveat of fscache/normal I/O being somehow separate from mmap'd I/O is sufficiently weird that anything sort of seems possible there.

@jstarks
Copy link

jstarks commented Jun 2, 2020

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached, i.e. without FILE_NO_INTERMEDIATE_BUFFERING. The fact that they appear not to be in this case may be an OS bug.

We (well, not me, but others here in Windows) will try to get a local repro under stress. In the meantime, it sounds like flushing the mapping explicitly is a reasonable workaround.

Thanks for looping me in. @aclements, on which OS build are you reproducing this?

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 2, 2020

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached

Oh, good, that's reassuring. I was worried that these being not coherent implied some very odd IO cache architecture.

@alexbrainman
Copy link
Member

I tried simulating corrupted file on Windows

https://play.golang.org/p/5aZQqi7pS8o

and that is what I see

C:\>test
--- FAIL: TestCorruptedExecutable (0.33s)
    a_test.go:76: Child: exit status 3221225477
        ""
FAIL

C:\>

Exactly what we see on the builder.

Child process has no output (stack trace or otherwise), because it has not run yet.

And parent process prints child's exit code of 0xC0000005, and empty output.

Alex

@aclements
Copy link
Member

@aclements, on which OS build are you reproducing this?

This is on the windows-amd64-longtest builder, which reports as:

$ VM=$(gomote create windows-amd64-longtest); gomote run -system $VM systeminfo
Host Name:                 SERVER-2016-V7-
OS Name:                   Microsoft Windows Server 2016 Datacenter
OS Version:                10.0.14393 N/A Build 14393
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          N/A
Registered Organization:   N/A
Product ID:                -
Original Install Date:     7/2/2018, 5:50:15 PM
System Boot Time:          6/2/2020, 1:10:45 PM
System Manufacturer:       Google
System Model:              Google Compute Engine
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 63 Stepping 0 GenuineIntel ~2300 Mhz
BIOS Version:              Google Google, 1/1/2011
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC+00:00) Monrovia, Reykjavik
Total Physical Memory:     14,746 MB
Available Physical Memory: 13,750 MB
Virtual Memory: Max Size:  15,770 MB
Virtual Memory: Available: 14,800 MB
Virtual Memory: In Use:    970 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              \\SERVER-2016-V7-
Hotfix(s):                 5 Hotfix(s) Installed.
                           [01]: KB3186568
                           [02]: KB3199986
                           [03]: KB4049065
                           [04]: KB4132216
                           [05]: KB4284880
Network Card(s):           1 NIC(s) Installed.
                           [01]: Red Hat VirtIO Ethernet Adapter
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     169.254.169.254
                                 IP address(es)
                                 [01]: 10.240.0.19
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

I tried simulating corrupted file on Windows

@alexbrainman, fascinating.

Y'know, if the text is zeroes, it disassembles to add %al,(%rax), which will indeed cause an access violation for most values of rax.

But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right? Or maybe this is the OS bug @jstarks is sensing. And why would be text be incoherent but not the executable headers (which we write out after the text)? So many questions...

@alexbrainman
Copy link
Member

Y'know, if the text is zeroes, it disassembles to add %al,(%rax), which will indeed cause an access violation for most values of rax.

But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right?

I made my example up. I don't know, if it is only .text section is corrupted. And I don't know, if corrupted area is filled with 0.

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Jun 3, 2020

The "where" part is as yet undetermined, but the "what" part is almost certainly zeros, as that's what the file contains when you truncate a zero sized file to a non-zero sized file.

@alexbrainman
Copy link
Member

but the "what" part is almost certainly zeros, as that's what the file contains when you truncate a zero sized file to a non-zero sized file.

Thanks for explaining. I suppose I hit jackpot when I decided to put 0s there.

Alex

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows release-blocker
Projects
None yet
Development

No branches or pull requests