Non-deterministic (i.e. racy) filesystem failures under heavy load #3281

Closed
zjturner opened this issue Jun 6, 2018 · 8 comments

@zjturner

zjturner commented Jun 6, 2018

This bug-tracker is monitored by developers and other technical types. We like detail! So please use this form and tell us, concisely but precisely, what's up. Please fill out ALL THE FIELDS!

  • Your Windows build number: Microsoft Windows [Version 10.0.17134.48]

  • What you're doing and what's happening: Building a large piece of software with gcc using the ninja build system (so many file accesses happen in parallel across 40+ cores), and files randomly fail to be included.

  • What's wrong / what should be happening instead: The compiler generates an error message about not being able to find a particular include file. When GCC encounters this type of error, it retries the same compilation. The second attempt does not fail, so GCC reports a non-deterministic failure that is probably the result of some OS or hardware problem.

  • Strace of the failing command, if applicable: Unfortunately the problem does not reproduce under strace. Presumably this is because it's a race condition and strace changes the timing of the commands (or more specifically, it vastly slows everything down). So some internal buffer that is probably reaching its limit does not get hit under strace, because there is not as much load on the filesystem.

Luckily, it's (relatively) easy to reproduce: you just have to get set up to build an open source project (which thankfully is pretty easy).

  1. Clone an LLVM repository (update: This may need to be on an NTFS mount, with a symlink on your WSL filesystem pointing to the NTFS mount. See the bottom of this report)
% git clone https://github.com/llvm-project/llvm-project-20170507/ llvm-project
  2. Install ninja and cmake
% sudo apt-get install ninja-build
% sudo apt-get install cmake3
  3. Configure a build directory using the default compiler (update: This should be on your native WSL partition, NOT pointing to any NTFS mount).
% mkdir llvmbuild
% cd llvmbuild
% cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang ../llvm-project/llvm
% ninja

After running for some time, you will get an error such as:

[268/2465] Building CXX object lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o
FAILED: /usr/bin/c++   -DGTEST_HAS_RTTI=0 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib/Transforms/Scalar -I/root/llvm/llvm/lib/Transforms/Scalar -I/usr/include/libxml2 -Iinclude -I/root/llvm/llvm/include -fPIC -fvisibility-inlines-hidden -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -ffunction-sections -fdata-sections -O3 -DNDEBUG    -fno-exceptions -fno-rtti -MMD -MT lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o -MF lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o.d -o lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o -c /root/llvm/llvm/lib/Transforms/Scalar/MergeICmps.cpp
In file included from /root/llvm/llvm/include/llvm/IR/Value.h:19:0,
                 from /root/llvm/llvm/include/llvm/IR/Argument.h:20,
                 from /root/llvm/llvm/include/llvm/IR/Function.h:26,
                 from /root/llvm/llvm/include/llvm/IR/CallSite.h:34,
                 from /root/llvm/llvm/include/llvm/Analysis/MemoryLocation.h:21,
                 from /root/llvm/llvm/include/llvm/Analysis/AliasAnalysis.h:44,
                 from /root/llvm/llvm/include/llvm/Analysis/Loads.h:17,
                 from /root/llvm/llvm/lib/Transforms/Scalar/MergeICmps.cpp:29:
/root/llvm/llvm/include/llvm/IR/Use.h:30:43: fatal error: llvm/Support/CBindingWrapping.h: No such file or directory
 #include "llvm/Support/CBindingWrapping.h"
                                           ^
compilation terminated.
The bug is not reproducible, so it is likely a hardware or OS problem.

However, if you try again, the problem will not happen again for the same file, but it will happen again later on a different file.
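Because a rerun gets past the file that failed, one stopgap is to rerun the build a bounded number of times. The following is only a minimal sketch of that idea and not something from the original report: `retry` is a hypothetical helper name and the attempt limit is an arbitrary assumption. It hides the symptom and does nothing about the underlying race.

```shell
#!/bin/sh
# Hypothetical helper (not part of ninja or WSL): run a command until it
# succeeds, giving up after a maximum number of attempts.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
    max=$1
    shift
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the attempt budget is used.
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
    done
}

# Example (assumed limit of 5 attempts):
#   retry 5 ninja
```

Since ninja only rebuilds targets that are missing or out of date, each rerun continues from where the previous one stopped rather than starting over.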

Update: Of (possible) importance here is that the source code is on an NTFS mount. That is, in my particular setup, the llvm-project folder is on an NTFS mount on my D drive, and I've created a symlink from ~/llvm-project to /mnt/d/src/llvm-project. That being said, there should be no writes going to NTFS, only reads.
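Since the failure does not show up under strace, a smaller read-only stress test might help narrow it down without a full LLVM build. This is a rough sketch under assumptions: `stress_reads` is a hypothetical helper, the worker and iteration counts are arbitrary, and I have not confirmed that it triggers the same failure. It reads one file from many parallel workers and prints how many reads failed.

```shell
#!/bin/sh
# Hypothetical stress helper: read one file from many parallel workers and
# print the number of failed reads. Defaults are illustrative assumptions.
# Usage: stress_reads FILE [WORKERS] [ITERATIONS]
stress_reads() {
    file=$1
    workers=${2:-40}
    iterations=${3:-1000}
    log=$(mktemp)
    i=0
    while [ "$i" -lt "$workers" ]; do
        (
            n=0
            while [ "$n" -lt "$iterations" ]; do
                # Record one line per failed read of the file.
                cat "$file" >/dev/null 2>&1 || echo fail >>"$log"
                n=$((n + 1))
            done
        ) &
        i=$((i + 1))
    done
    # Wait for every background worker before counting.
    wait
    failures=$(wc -l <"$log")
    rm -f "$log"
    # Arithmetic expansion strips any padding that wc may add.
    echo $((failures))
}

# Example, using the header from the error above and the symlinked checkout:
#   stress_reads ~/llvm-project/llvm/include/llvm/Support/CBindingWrapping.h
```

A non-zero count for a file that is known to exist would point at the filesystem layer rather than at GCC or ninja.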

@onomatopellan

By the symptoms it looks like the race condition bug of #2712.

It should be fixed in the latest Insider build in the Fast Ring, but if the fix is confirmed to work, it will be released for 17134 too.

@therealkenc
Collaborator

Or #2484 which dangles open and may or may not be a dupe. Either way the OP is 17134 and the #2712 coast isn't clear until 17677. Smart money says try Insiders and see if it resolves the problem.

@ryannathans

I have this same problem - micropython/micropython#3976

@canda

canda commented Mar 18, 2019

Still facing the same issue.
Any way to work around this problem?

@dismirlian

I also have the same issue, while building OpenThread on WSL (Ubuntu 18.04)

@canda

canda commented Mar 23, 2019

Fixed it by installing the Windows Insider build.

@dismirlian

@canda Thanks for the tip. I actually installed build 19H1, and it's fixed. Apparently, build 17677 onwards should contain the fix:

Details here

Thanks!
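For anyone wanting to check whether their build is at or past 17677, here is a small sketch. `build_has_fix` is a hypothetical helper, and the 17677 threshold is just the number quoted in this thread, not something I have verified independently. It takes a version string in the format shown in the original report and compares the build number.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Windows version string reports a build
# number of 17677 or later (threshold taken from this thread, unverified).
# Usage: build_has_fix "Microsoft Windows [Version 10.0.17134.48]"
build_has_fix() {
    # Extract the build number, i.e. the field after "10.0.".
    build=$(printf '%s\n' "$1" |
        sed -n 's/.*[Vv]ersion 10\.0\.\([0-9][0-9]*\).*/\1/p')
    [ -n "$build" ] && [ "$build" -ge 17677 ]
}

# Example from inside WSL, assuming Windows interop is enabled so that
# cmd.exe can be run:
#   build_has_fix "$(cmd.exe /c ver)" && echo "build is 17677 or later"
```

With the version from the original report (10.0.17134.48) the check fails, which matches the build where the problem was seen.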

@therealkenc
Collaborator

Alright let's call this dupe #2712 until proven otherwise.
