Get rid of persistent build errors due to non-atomic file writes in GHC and stack #4559

Open
2 of 4 tasks
nh2 opened this issue Feb 4, 2019 · 14 comments

Comments

@nh2
Collaborator

nh2 commented Feb 4, 2019

GHC as of writing does not ensure that files are written atomically.

This means that a Ctrl+C, kill or reboot at the right time can result in truncated files.
GHC does not detect this, so resuming or rerunning the build with ghc --make keeps producing the same error messages.

In such a situation, the only workaround is to wipe all build outputs (e.g. stack/cabal/make clean).

Specifically, we've observed the following to happen:

  • object files being written half-way, resulting in persistent linker errors
  • executable files being written half-way, so they can start executing but then crash
  • users filing bug reports against downstream tooling where only wiping the build directory helped
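
To make the failure mode concrete, here is a minimal sketch (in Python, not GHC code) of an in-place rewrite that gets interrupted part-way. The file name `Example.o`, the `SimulatedKill` exception and the `fail_after` parameter are all made up for illustration; a real Ctrl+C or reboot would hit at an arbitrary point instead.

```python
import os
import tempfile
from typing import Optional, Tuple


class SimulatedKill(Exception):
    """Stands in for a Ctrl+C / kill arriving in the middle of a write."""


def write_in_place(path: str, data: bytes, fail_after: Optional[int] = None) -> None:
    """Overwrite `path` directly; optionally stop after `fail_after` bytes."""
    with open(path, "wb") as f:  # opening with "wb" truncates the old file right away
        if fail_after is not None:
            f.write(data[:fail_after])
            f.flush()
            raise SimulatedKill("interrupted mid-write")
        f.write(data)


def demo_truncation() -> Tuple[bytes, bytes]:
    """Return the file content before and after an interrupted in-place rewrite."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "Example.o")  # hypothetical output file
        write_in_place(path, b"OLD-COMPLETE-OBJECT")
        with open(path, "rb") as f:
            before = f.read()
        try:
            write_in_place(path, b"NEW-COMPLETE-OBJECT", fail_after=4)
        except SimulatedKill:
            pass
        with open(path, "rb") as f:
            after = f.read()
        return before, after
```

After the interrupted rewrite the file holds neither the old nor the new content, only a prefix of the new one. A build tool that decides what to rebuild from the file's existence or timestamp has no way to tell that it is incomplete.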

There may be situations where stack has this problem too, but so far we believe all occurrences of this that we see are in GHC.

GHC issue about this: #14533 - Make GHC more robust against PC crashes by using atomic writes

Repro

@lehins made a repro script that shows the problem in GHC at https://github.com/lehins/exec-kill-loop

Planned solution

  • GHC should use atomic writes (write to temp file, then rename() syscall).
    • Make repro script that shows the problem (done)
    • Decrease repro waiting time from hours to a few minutes by killing just at the right time, doing syscall interception (done with hatrace test)
    • Find all places where GHC does non-atomic file writes, and fix them (ghc issue)
    • If possible, add this as a system test to GHC CI
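
The "write to temp file, then rename()" approach from the first bullet can be sketched as follows. This is a Python illustration of the syscall sequence under stated assumptions, not the patch that went into GHC; the function name `write_file_atomic` is made up here. It assumes the temp file lives in the same directory as the target, so that the rename stays within one filesystem.

```python
import os
import tempfile


def write_file_atomic(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target: rename() is only atomic
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix=os.path.basename(path) + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace maps to rename(2) on POSIX and overwrites the target atomically.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the partial temp file and leave the target untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before the rename, the target still has its previous content (or does not exist yet), and at worst a stray temp file is left behind. Note that `mkstemp` creates the file with restrictive permissions, so a real implementation would also have to set the intended file mode.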

Related issues

Issues for which this may be the reason

@nh2
Collaborator Author

nh2 commented Feb 4, 2019

In case we need to add atomic writes for stack, in commercialhaskell/rio#138 @roman introduced atomic+durable file writes that we can use.

However, as per this:

There is no function exported by this module that provides /only/ atomicity.

We probably want to add that since for many cases, atomic-but-not-durable is enough, and durability (fsync) makes things a lot slower.
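
For reference, the difference between atomic-only and atomic+durable comes down to whether fsync is called. A rough sketch in Python, assuming a POSIX system; the `write_file` function and its `durable` flag are hypothetical and do not reproduce rio's API:

```python
import os
import tempfile


def write_file(path: str, data: bytes, durable: bool = False) -> None:
    """Atomic replace of `path`; with durable=True also fsync the file and its directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            if durable:
                f.flush()
                os.fsync(f.fileno())  # force the file contents to disk (slow)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
        if durable:
            # Also persist the directory entry created by the rename.
            # Opening a directory like this is POSIX-only.
            dir_fd = os.open(directory, os.O_RDONLY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With `durable=False` a killed process still cannot leave a half-written target, which is the guarantee that matters for Ctrl+C and kill. The fsync calls only add protection against losing the new content on power loss, at the cost of waiting for the disk on every write.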

@nh2
Collaborator Author

nh2 commented Feb 17, 2019

New task done:

  • Decrease repro waiting time from hours to a few minutes by killing just at the right time, doing syscall interception

Here comes hatrace, a Haskell API for strace-style syscall interception and scripting.

I made it for my work on https://phabricator.haskell.org/D42, but it's very useful for this problem, too.

@nh2
Collaborator Author

nh2 commented Feb 26, 2019

My GHC patch for writing .o files atomically got merged:

https://gitlab.haskell.org/ghc/ghc/merge_requests/391

This should fix the biggest source of these errors, but the other files that GHC and its subprograms write still need the same treatment.

@snoyberg
Contributor

@nh2 I don't think this issue belongs on the Stack issue tracker, as it's entirely upstream. Any objection to closing it?

@nh2
Collaborator Author

nh2 commented Mar 25, 2019

The main GHC bug has been fixed, but I still wanted to do an investigation with hatrace on

There may be situations where stack has this problem too

to ensure that there's no case where stack may write non-atomic files.

I'd like to write a hatrace filter like --find-nonatomic-writes that listens for all open()+close()s without subsequent rename(), so that it can print a report like

The following files were written nonatomically by the program:
  - path/to/file.o
  - ...
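
The detection logic described above could look roughly like this. It is a simplified Python model, not hatrace's implementation: it assumes the trace has already been reduced to a list of events keyed by path (`("open_write", path)`, `("close", path)`, `("rename", src, dst)`), whereas a real tracer has to follow file descriptors. The event names and both function names are made up for this sketch.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical, simplified event shapes:
#   ("open_write", path) | ("close", path) | ("rename", src, dst)
Event = Tuple[str, ...]


def find_nonatomic_writes(events: Iterable[Event]) -> List[str]:
    """Paths that were opened for writing and closed, but never renamed afterwards."""
    open_for_write: Set[str] = set()
    written: List[str] = []  # candidates, in the order they were closed
    for event in events:
        kind = event[0]
        if kind == "open_write":
            open_for_write.add(event[1])
        elif kind == "close" and event[1] in open_for_write:
            open_for_write.discard(event[1])
            if event[1] not in written:
                written.append(event[1])
        elif kind == "rename" and event[1] in written:
            # Written under a temporary name and then renamed into place:
            # this is the atomic pattern, so drop it from the report.
            written.remove(event[1])
    return written


def format_report(paths: List[str]) -> str:
    """Render the report in the format proposed above."""
    lines = ["The following files were written nonatomically by the program:"]
    lines += ["  - " + p for p in paths]
    return "\n".join(lines)
```

For example, a trace that writes `a.o` directly and writes `b.o.tmp` followed by a rename to `b.o` would report only `a.o`.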

This was referenced Mar 26, 2019
@nh2
Collaborator Author

nh2 commented Mar 28, 2019

In nh2/hatrace#9 @qrilka implemented a --find-nonatomic-writes mode for hatrace that can point out non-atomic writes happening.

In nh2/hatrace#9 (comment) we recorded its output for a stack build invocation. It yields some paths that are written non-atomically, and we should check whether those are benign or should be improved.

Especially interesting are the entries

 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/install/x86_64-linux/lts-8.22/8.0.2/lib/x86_64-linux-ghc-8.0.2/call-haskell-from-anything-1.1.0.0-6Xcajdv7jFLAvwMt571tKe/st4joiWL/DataKinds.o"
 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/install/x86_64-linux/lts-8.22/8.0.2/lib/x86_64-linux-ghc-8.0.2/call-haskell-from-anything-1.1.0.0-6Xcajdv7jFLAvwMt571tKe/st4joiWL/Msgpack.o"
 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/install/x86_64-linux/lts-8.22/8.0.2/lib/x86_64-linux-ghc-8.0.2/call-haskell-from-anything-1.1.0.0-6Xcajdv7jFLAvwMt571tKe/st4joiWL/TH.o"
 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/install/x86_64-linux/lts-8.22/8.0.2/lib/x86_64-linux-ghc-8.0.2/call-haskell-from-anything-1.1.0.0-6Xcajdv7jFLAvwMt571tKe/st4joiWL/TypeUncurry.o"

which disappear after stack build completes.

@nh2
Collaborator Author

nh2 commented Mar 28, 2019

I think the fastest way to investigate it would be to add a command like hatrace --kill-after-write /path/to/Myfile.o, and then we can run it on each file shown by --find-nonatomic-writes, and afterwards run another stack build or stack repl and see if it recovers or crashes.

@snoyberg
Contributor

This is referring to writes in GHC itself, not Stack, correct?

@qrilka
Contributor

qrilka commented Mar 29, 2019

@snoyberg most of the non-atomically written files are from GHC. I think we see some from Cabal, e.g. from the copy operation, but there are also some produced by Stack; I think at least these:

 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/dist/x86_64-linux/Cabal-1.24.2.0/stack-config-cache"
 - "/home/niklas/src/haskell/call-haskell-from-anything/.stack-work/install/x86_64-linux/lts-8.22/8.0.2/flag-cache/call-haskell-from-anything-1.1.0.0-6Xcajdv7jFLAvwMt571tKe"

And during our call Niklas was able to reproduce a Stack failure by truncating one of those.

@snoyberg
Contributor

Awesome, that's good to know @qrilka. I believe both of those are going away with @borsboom's changes to remove store.

@qrilka
Contributor

qrilka commented Mar 30, 2019

We were testing 1.9.3, so it makes total sense to do proper testing with master.

@borsboom
Contributor

borsboom commented Apr 1, 2019

Awesome, that's good to know @qrilka. I believe both of those are going away with @borsboom's changes to remove store.

That is correct.

@nh2
Collaborator Author

nh2 commented Apr 8, 2019

Yes, those look relevant.

The last one suffers from non-atomic writes to .hi files, which is something I haven't fixed yet in GHC (only object files).
