Non-deterministic (i.e. racy) filesystem failures under heavy load #3281

zjturner · 2018-06-06T16:11:15Z

Your Windows build number: Microsoft Windows [Version 10.0.17134.48]
What you're doing and what's happening: Building a large piece of software with GCC using the ninja build system (so many file accesses happen in parallel across 40+ cores). At random, includes of header files that exist on disk fail.
What's wrong / what should be happening instead: The compiler reports that it cannot find a particular include file, even though the file exists. When GCC hits this kind of error, it retries the same compilation. The retry succeeds, so GCC reports the failure as non-reproducible and attributes it to a likely hardware or OS problem.
Strace of the failing command, if applicable: Unfortunately the problem does not reproduce under strace. Presumably this is because it is a race condition and strace changes the timing of the commands (more specifically, it vastly slows everything down). My guess is that some internal buffer reaches its limit under full load, and that limit is never hit under strace because there is much less load on the filesystem.

Luckily, it's (relatively) easy to reproduce. You just have to get set up to build an open source project, which thankfully is pretty easy:

Clone an LLVM repository (update: this may need to be on an NTFS mount, with a symlink on your WSL filesystem pointing to the NTFS mount; see the bottom of this report)

% git clone https://github.com/llvm-project/llvm-project-20170507/ llvm-project

Install ninja and cmake

% sudo apt-get install ninja-build
% sudo apt-get install cmake3

Configure a build directory using the default compiler (update: this should be on your native WSL partition, NOT on any NTFS mount) and start the build.

% mkdir llvmbuild
% cd llvmbuild
% cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang ../llvm-project/llvm
% ninja

After running for some time, you will get an error such as:

[268/2465] Building CXX object lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o
FAILED: /usr/bin/c++   -DGTEST_HAS_RTTI=0 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Ilib/Transforms/Scalar -I/root/llvm/llvm/lib/Transforms/Scalar -I/usr/include/libxml2 -Iinclude -I/root/llvm/llvm/include -fPIC -fvisibility-inlines-hidden -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wno-maybe-uninitialized -Wdelete-non-virtual-dtor -Wno-comment -ffunction-sections -fdata-sections -O3 -DNDEBUG    -fno-exceptions -fno-rtti -MMD -MT lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o -MF lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o.d -o lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/MergeICmps.cpp.o -c /root/llvm/llvm/lib/Transforms/Scalar/MergeICmps.cpp
In file included from /root/llvm/llvm/include/llvm/IR/Value.h:19:0,
                 from /root/llvm/llvm/include/llvm/IR/Argument.h:20,
                 from /root/llvm/llvm/include/llvm/IR/Function.h:26,
                 from /root/llvm/llvm/include/llvm/IR/CallSite.h:34,
                 from /root/llvm/llvm/include/llvm/Analysis/MemoryLocation.h:21,
                 from /root/llvm/llvm/include/llvm/Analysis/AliasAnalysis.h:44,
                 from /root/llvm/llvm/include/llvm/Analysis/Loads.h:17,
                 from /root/llvm/llvm/lib/Transforms/Scalar/MergeICmps.cpp:29:
/root/llvm/llvm/include/llvm/IR/Use.h:30:43: fatal error: llvm/Support/CBindingWrapping.h: No such file or directory
 #include "llvm/Support/CBindingWrapping.h"
                                           ^
compilation terminated.
The bug is not reproducible, so it is likely a hardware or OS problem.

However, if you try again, the problem will not happen again for the same file, but it will happen again later on a different file.
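Because rerunning a failed compile succeeds, one stopgap is to rerun ninja a bounded number of times; ninja only rebuilds targets that are still out of date. The following is a minimal sketch, not a fix: `retry_build` and the attempt limit are hypothetical names chosen for illustration, and the loop only masks the race.

```shell
#!/bin/sh
# retry_build MAX CMD [ARGS...]
# Runs CMD until it exits 0 or MAX attempts have been made.
# Hypothetical helper for illustration; it hides the race, it does not fix it.
retry_build() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example: retry_build 5 ninja
```

A genuine compile error fails on every attempt, so the limit keeps the loop from spinning forever on real bugs.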

Update: Of (possible) importance here is that the source code is on an NTFS mount. That is, in my particular setup, the llvm-project folder is on an NTFS mount on my D drive, and I've created a symlink from ~/llvm-project to /mnt/d/src/llvm-project. That being said, there should be no writes going to NTFS, only reads.
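To check whether plain parallel reads are enough to trigger the failure, without a compiler in the loop, something like the following probe could be pointed at the source tree. It is a sketch under assumptions: `probe_reads` is a hypothetical helper, the default of 40 jobs simply mirrors the core count mentioned above, and it relies on the GNU/BSD `-print0`, `-0` and `-P` extensions. Whether it hits the same race as the build is not confirmed.

```shell
#!/bin/sh
# probe_reads DIR [JOBS]
# Reads every regular file under DIR using JOBS parallel workers and prints
# the path of each file that could not be read. With no concurrent writers,
# a healthy filesystem should produce no output.
probe_reads() {
    dir=$1
    jobs=${2:-40}
    find "$dir" -type f -print0 |
        xargs -0 -P "$jobs" -n 64 sh -c '
            for f in "$@"; do
                cat "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
            done
        ' sh
}

# Example: probe_reads ~/llvm-project/llvm/include 40
```

Any path printed for a file that exists would point at the filesystem layer rather than at GCC or ninja.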

onomatopellan · 2018-06-06T19:08:25Z

Judging by the symptoms, this looks like the race condition bug in #2712.

It should be fixed in the latest Insider build in the Fast Ring. If the fix is confirmed to work, it will be released for 17134 too.

therealkenc · 2018-06-06T21:48:02Z

Or #2484 which dangles open and may or may not be a dupe. Either way the OP is 17134 and the #2712 coast isn't clear until 17677. Smart money says try Insiders and see if it resolves the problem.

ryannathans · 2018-07-24T04:49:26Z

I have this same problem - micropython/micropython#3976

canda · 2019-03-18T14:12:49Z

Still facing the same issue.
Any way to work around this problem?

dismirlian · 2019-03-23T17:06:30Z

I also have the same issue, while building OpenThread on WSL (Ubuntu 18.04)

canda · 2019-03-23T18:59:18Z

Fixed it by installing the Windows Insider build.

dismirlian · 2019-03-24T12:26:53Z

@canda Thanks for the tip. I actually installed build 19H1, and it's fixed. Apparently, build 17677 onwards should contain the fix:

Details here

Thanks!
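For anyone who wants to check their build from inside WSL, here is a rough sketch. It assumes that `uname -r` on WSL 1 prints a release string of the form `4.4.0-17134-Microsoft`, with the Windows build number in the middle; `wsl_build_number` and `has_fix` are hypothetical helper names. The 17677 threshold is the one mentioned in this thread, and a servicing update to an older build would not be detected this way.

```shell
#!/bin/sh
# wsl_build_number RELEASE
# Extracts the Windows build number from a WSL 1 kernel release string such
# as "4.4.0-17134-Microsoft". Assumption: this is the format `uname -r` prints.
# Prints nothing if the string does not match.
wsl_build_number() {
    printf '%s\n' "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\)-Microsoft$/\1/p'
}

# has_fix RELEASE
# Succeeds if the build number is at least 17677 (threshold from this thread).
has_fix() {
    build=$(wsl_build_number "$1")
    [ -n "$build" ] && [ "$build" -ge 17677 ]
}

# Example: has_fix "$(uname -r)" && echo "build should include the fix"
```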

therealkenc · 2019-03-24T17:07:44Z

Alright let's call this dupe #2712 until proven otherwise.

ryannathans mentioned this issue Jul 24, 2018

Compilation randomly fails when compiling fast (race condition) micropython/micropython#3976

Closed

therealkenc closed this as completed Mar 24, 2019

therealkenc added the duplicate label Mar 24, 2019
