-
Notifications
You must be signed in to change notification settings - Fork 12.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLD][COFF] LIB-imported symbols seem to have weaker precedence than LIB-defined symbols #82050
Comments
@llvm/issue-subscribers-lld-coff Author: Alexandre Ganea (aganea)
I found an interesting issue in LLD that I was chasing down for more than a year. The problem revolves around `XInput1_4.dll`. I wanted to share how I got there, since others might see this problem before we land a fix.
First signsAt first, we wanted to use
But remotely, it wouldn't work. Our local machines were on Windows 10 while the remote build system was on Windows Server 2019 cloud VMs. We got super-weird errors, and only on certain Rust crates. For example this one: Another time, it was on the
Since our monorepo uses workspaces, feature unification would kick in. We also happened to be experimenting with Bevy, so the the Strangely the build only failed when running headless, through a SSH session. When a user was connected on the VM, the build would succeed. We managed to capture minidumps of the failure, and inspection was showing an error deep down some core Microsoft libraries. At first we suspected an ABI incompatibility on Windows Server 2019. However it seemed that it had to do with I left this aside for a while, since build times weren't that much of an issue. We were planning on upgrading the VMs to Windows Server 2022 later that year, so I postponed the AgainAt some point I wanted to upgrade our C++ codebase to using We ended up upgrading our VMs to Windows Server 2022, and during my spare time between other things, I tried preflighting my Yet, again.Months later (now) I got a new Windows 11 install. I wanted to test LLVM 18.0 and un-shelved again my <img width="1905" alt="crash_xinput_1" src="https://github.com/llvm/llvm-project/assets/37383324/1dd29ca8-5c0b-4c8a-bbd2-48ed4d0314c8"> I did not immediately make the connection with the past issues I had with Rust compilation. There was something strange about this callstack but I couldn't put my finger on it. While looking for clues on the Internet, I stomped on this thread. I immediately made the connection with the Rust build issue I had a year ago! However there wasn't really a resolution, just a suggestion to replace What. Do what?? So yeah. Duplicating I just couldn't leave it to that, I decided to dig a bit more to understand the problem. Putting back <img width="1908" alt="crash_xinput_2" src="https://github.com/llvm/llvm-project/assets/37383324/6aa3e9d4-aa76-448d-9698-95c3425b5ed9"> How could that be? Was it a memory corruption? But the HANDLE looked legit, so that structure must have been already initialized somewhere else. Maybe the DLL was loaded elsewhere? Let's put a breakpoint in that function, see if it was called before. <img width="1901" alt="crash_xinput_3" src="https://github.com/llvm/llvm-project/assets/37383324/9aed535a-85f6-415e-8db8-3b990de3370e"> Looking at the memory window, the location where the HANDLE is hasn't been written yet. Also the callstack shows a different codepath when entering Now if I hit Run again, the program will stop at the same location a second time. That's where the bug happens. Looking at the callstack again, something fishy appears to me: why is the entry point of my DLL calling <img width="1898" alt="crash_xinput_4" src="https://github.com/llvm/llvm-project/assets/37383324/d75de63c-a168-4d02-9c5d-c20b8a48611b"> Looks legit. But then, stepping inside: <img width="560" alt="crash_xinput_5" src="https://github.com/llvm/llvm-project/assets/37383324/8a0d5a58-0ccb-4190-9813-06ac457f2f5a"> Wait a second. Why is that an import symbol?? That location is an IAT entry that points to The bug and a reproEssentially, The repro is quite straightforward:
MSVC somehow seems to prefer a fully defined symbol rather than an imported symbol. Another possibility is that it tends to favor a symbol inside the same .LIB rather than one defined elsewhere. At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be. We could categorize this as the "annoying" class of bugs. However given how annoying it was to find, it might be that other folks will see different inceptions of it, so ideally I'd prefer fixing it in LLD. @mstorsjo @rnk @tru @sylvain-audi @compnerd @MaskRay |
@TartanLlama Could you possibly help here? Thanks! |
@aganea sure, I'll send an email to the linker team |
Thanks a lot! |
Pasting a response from Grant Richins on the linker team:
|
@TartanLlama Thank you to you both! |
Hello @mstorsjo! I've crafted a fix for this issue, but it breaks this test. You did introduce that test in: 28212df I have a feeling that your initial issue was caused by the symbol searching behavior described above which previously wasn't matching what MSVC was doing. I can fix that test, unless we should keep that previous behavior for MinGW mode? |
No, I don't think there's a reason to try to keep/replicate the exact behaviour we had. LLD generally deviates more than this from ld.bfd (which would be the reference for MinGW mode behaviour) overall, so corner case behaviours like this shouldn't really matter, and the exact behaviour we had probably isn't the exact thing we should strive for anyway. I think the previous behaviour was just a product of what the LLD implementation looked like at the time. If it's no longer possible to trigger this case, I guess it can be reasonable to remove the testcase. Thanks for looping me in! |
This issue was also observed when building Unreal Engine 5 game editor using clang-cl and lld-link. Tested LLVM version: 17.0.6, 18.1.0, 18.1.7, 19.1.0 |
I found an interesting issue in LLD that I was chasing down for more than a year. The problem revolves around
XInput1_4.dll
. I wanted to share how I got there, since others might see this problem before we land a fix.First signs
At first, we wanted to use
rust-lld.exe
along with our Rust monorepo. This worked locally......but not remotely. Our local machines were on Windows 10 while the remote build system was on Windows Server 2019 cloud VMs. We got super-weird errors, and only on certain Rust crates. For example this one:
Another time, it was on the
sqlx-macros
crate. Note its dependencies:Since our monorepo uses workspaces, feature unification would kick in. We also happened to be experimenting with Bevy, so the the
winapi
crate would be compiled with all the features used in the workspace, such as"xinput"
even thoughsqlx-macros
didn't really need it here.Strangely the build only failed when running headless, through a SSH session. When a user was connected on the VM, the build would succeed. We managed to capture minidumps of the failure, and inspection was showing an error deep down some core Microsoft libraries. At first we suspected an ABI incompatibility on Windows Server 2019. However it seemed that it had to do with
XINPUT1_4.dll
. Essentially the DLL load graph wasrust.exe
->sqlx_macros-XXX.dll
->XINPUT1_4.DLL
->inputhost.dll
->CoreMessaging.dll
then a call toRaiseFailFastException()
.I left this aside for a while, since build times weren't that much of an issue. We were planning on upgrading the VMs to Windows Server 2022 later that year, so I postponed the
rust-lld
update.Again
At some point I wanted to upgrade our C++ codebase to using
clang-cl
andlld-link
. Things worked locally, although our unit tests failed on the preflight remotely on the VMs. They were failing on startup withcode exit status 0xc0000409
. At first I thought this was a crash in our code caused by UB withclang-cl
(this happened to me at Ubisoft during our firstclang-cl
integration). Strangely I wasn't getting this error locally. I got taken with other things, so I decided to look at this later on.We ended up upgrading our VMs to Windows Server 2022, and during my spare time between other things, I tried preflighting my
clang-cl
upgrade again. Same thing, same error. I was quite puzzled, and I was starting to rule out the ABI incompatibility I though I had with Rust at first. It was inconceivable that the same failure would be carried on several CRT updates and Windows OS upgrades.Yet, again.
Months later (now) I got a new Windows 11 install. I wanted to test LLVM 18.0 and un-shelved again my
clang-cl
upgrade for our C++ codebase. After fixing a number of compilation errors, on the first boot our program complained about not findingXInput1_3.dll
. Installing that inC:\Windows\System
fixed that particular issue. Now it was crashing onXInput1_4.dll
:I did not immediately make the connection with the past issues I had with Rust compilation. There was something strange about this callstack but I couldn't put my finger on it. While looking for clues on the Internet, I stomped on this thread. I immediately made the connection with the Rust build issue I had a year ago! However there wasn't really a resolution, just a suggestion to replace
XInput1_4.dll
withXInput1_3.dll
.What.
Do what??
So yeah. Duplicating
XInput1_3.dll
and renaming it toXInput1_4.dll
fixed the issue. 😒 No more crashes.I just couldn't leave it to that, I decided to dig a bit more to understand the problem. Putting back
XInput1_4.dll
as it was, restarting and stepping through the asm code manually reveled that it had something to do with calls that XInput was doing to the Windows Event Tracing API. It failed because one of the HANDLEs in the structure used to initialize theevntprov.h
API was already initialized (see the highlighted value in the Memory window).How could that be? Was it a memory corruption? But the HANDLE looked legit, so that structure must have been already initialized somewhere else. Maybe the DLL was loaded elsewhere? Let's put a breakpoint in that function, see if it was called before.
Indeed:
Looking at the memory window, the location where the HANDLE is hasn't been written yet. Also the callstack shows a different codepath when entering
DllMain
. I was able to confirm that this was happening after the call toLoadLibrary
, but before the entry point of my DLL was executed. Essentially, at this point the Windows OS loader was initializing the DLL dependency graph.Now if I hit Run again, the program will stop at the same location a second time. That's where the bug happens. Looking at the callstack again, something fishy appears to me: why is the entry point of my DLL calling
XInput1_4.DLL!DllMain()
? Looking at the assembly:Looks legit. But then, stepping inside:
Wait a second. Why is that an import symbol??
That location is an IAT entry that points to
XInput1_4.DLL!DllMain()
. At that point I was pretty sure that I was experiencing the side-effect of a bug in LLD. I recompiled LLD from source, quickly stepped through it while it was linking my DLL, and then it become clear what the problem was.The bug and a repro
Essentially,
XInput1_4.dll
exports itsDllMain
symbol (why?). This seems to be a mistake on Microsoft's end. But luckily that triggers our bug in LLD 😄XInput.lib
appears first on the command-line, then/DEFAULTLIB:msvcrt.lib
where the default (weak)DllMain
function is. Due to the lazy nature of how the inputs are parsed in LLD, theXInput1_4.dll!DllMain
symbol is placed first in the symbol table, while scanning all the inputs. Then comes themsvcrt.lib
DllMain
symbol which does not overwrite the symbol table entry, since it's already there. Later, when another function pulls onDllMain
, LLD pulls on the OBJ insideXInput.lib
, and replaces that symbol table entry with its (XInput import lib) import symbol.The repro is quite straightforward:
MSVC somehow seems to prefer a fully defined symbol rather than an imported symbol. Another possibility is that it tends to favor a symbol inside the same .LIB rather than one defined elsewhere. At this point, if someone from the MSVC team could comment, that would be very helpful. I have a tentative fix, but I'm not sure what the "good" behavior should be.
We could categorize this as the "annoying" class of bugs with possibly a workaround. However given how annoying it was to find, it might be that other folks will see different inceptions of it, so ideally I'd prefer fixing it in LLD.
@mstorsjo @rnk @tru @sylvain-audi @compnerd @MaskRay
The text was updated successfully, but these errors were encountered: