XCC RG-2017.8-linux: generated object code is affected by -g2 level (the default level) + longer source paths #7114
Comments
Some relevant code
sched.s
After assembly and disassembly
Dump WITHOUT the padding. Some offsets are adjusted accordingly.
I asked people in my team to run the […] As soon as he lowered the debug level to […]
I compared all the toolchain files between […]
@marc-hb XCC bugs go to vendor.
The difference between build machines is actually just a difference in (debug) path lengths. I think I can now reproduce the object code change on any system as long as […]
FWIW: I have a setup internally where I have xt-xcc/clang version RI-2021.6 built for all the SOF targets (and a handful of other tool versions, e.g. the original targets for all the devices). If you have a simple-ish test case it wouldn't be much of a hardship to try it across the cross product of the version/device space to try to isolate the problem.
I tried again to reproduce without SOF but still could not. Maybe because the number of ELF sections is much lower? RI-2021.6 does not seem affected but I did not try to "stress" it. Reproducing with […]
I guess what I was thinking of as a minimal test was "here's a preprocessed C file and some other files read by the toolchain in a tarball, unpack it into /opt/foo and it works, but /opt/very___long___name/ fails".
Never a dull moment! sof commit a942f10 ("west.yml: upgrade Zephyr to 0c0d73721ed") added ~1000 new zephyr commits which seem to change everything; now the padding seems always dropped?!? Let's ignore that, too many variables already. Let's focus on the original bug description which is unchanged in sof commit 7fdc623. Also, while I could never catch it "red-handed", I recommended temporarily uninstalling […] The attached […]
The object code padding is found when compiling SOF in the short path […] This cannot be anything but a toolchain bug.
I'm attaching […] Expected debug path differences aside, the two […] The two […] Yet the two […] This very short […]
Nice, that's a smoking gun. I'll run these against all my local toolchain variants tomorrow to see if we can figure out which version(s) and/or core(s) are at fault. Note that the ".byte" directive is the offending bit for sure: that's an instruction to assemble the list of numbers that follow into the stream, and probably that's how the compiler has decided to express padding. The ".file" directive just specifies the name of the file, which obviously shouldn't change the generated code; GNU as documents it as a noop; my guess is xt-as does so too.
The […]
I filed new case # 46682483 at https://support.cadence.com
... and the reply is unsurprisingly "upgrade your toolchain". Which will likely never happen for TGL, @lgirdwood confirm?
After internal discussion we're not going to upgrade the toolchain and perform a full round of validation for a reproducibility issue "only". We hope there is no other side effect. Moreover there is a (funny!) workaround for the reproducibility issue: make your source directory longer or shorter. Closing as wontfix.
I tested again with the same sof commit 64fe2ea and I confirm that there is a […] Then I tested again with the much later sof commit ac91071 and I could not reproduce this bug at all! west topdir lengths below 50 all produce the same output! Go figure...
For v2.5, the west topdir length threshold is between 35 and 36.
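To probe a threshold like the one above, it helps to generate workspace paths of an exact length. The helper below is only a sketch: `path_of_length` is a hypothetical name, and it assumes the length that matters is that of the absolute path string without a trailing slash, which the thread does not spell out.

```python
def path_of_length(parent: str, target_len: int, fill: str = "x") -> str:
    """Return a path under `parent` whose string length is exactly `target_len`.

    Hypothetical helper: it only builds the string, it does not create
    the directory or run any build.
    """
    prefix = parent.rstrip("/") + "/"
    pad = target_len - len(prefix)
    if pad < 1:
        # Need at least one filler character below the parent directory.
        raise ValueError(f"{parent!r} is already too long for length {target_len}")
    return prefix + fill * pad
```

Calling it for a few consecutive lengths (for instance 34 to 37) gives candidate west topdir locations on either side of the suspected threshold; the build itself still has to be run in each one.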
The recent introduction of CONFIG_ASSERT in commit 4d67d2f means that […] Totally different issue and not really a "bug".
tl;dr: if you're failing to reproduce a build, try a longer or shorter `west topdir` source directory. That's the workaround. There seems to be a threshold somewhere around 27 characters.
Summary
A very strange and complex toolchain issue where the `-g` level and debug symbols affect the object code on some Intel CI build system(s) for a totally unknown reason.
Unexpected variations like this break build reproducibility, which is how this issue was discovered in the first place.
Variations like these could also make it harder to reproduce race conditions and other difficult bugs.
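A reproducibility check of this kind can be sketched as a digest comparison between two build trees. This is not the check SOF CI actually runs; `digest_tree`, `differing_files` and the `*.obj` default pattern are assumptions for illustration. Whole-file hashes only make sense for outputs expected to be bit-identical, for example two builds from the same source path, since debug info embeds paths.

```python
import hashlib
from pathlib import Path


def digest_tree(root, pattern="*.obj"):
    """Map each matching file (relative path) under `root` to its SHA-256."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob(pattern)
        if p.is_file()
    }


def differing_files(build_a, build_b, pattern="*.obj"):
    """Return the sorted relative paths whose content differs or is missing."""
    a = digest_tree(build_a, pattern)
    b = digest_tree(build_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list means the two trees match for the chosen pattern; any name in the list is a starting point for a section-level comparison.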
I focused on `.text.k_ticks_to_us_floor64` and RG-2017.8-linux below because it just happened to be the first difference I found, but I noticed that the -g level also affects `.text.z_swap_irqlock` in the same conditions and there could be unexpected differences in other files.
=> Don't trust RG-2017.8-linux to produce consistent object code
This is not a problem with the Zephyr SDK toolchain: it produces identical object code across Linux and Windows, as routinely verified in CI across all targets: https://github.com/thesofproject/sof/actions/runs/4189511798/jobs/7262044804
I haven't observed any problem with RI-2020.5-linux (and MTL) yet, which Zephyr still identifies as GNU 4.2.2. Note there is on-going work to switch to `xt-clang`.
Longer story
Reproduced on SOF commit 64fe2ea + `west update` + `sof/scripts/xtensa-build-zephyr.py -p tgl`
Reproduced with a couple other SOF commits too.
The code generated for k_ticks_to_us_floor64() normally looks like this:
=> Note the five `00` bytes of padding.
@andyross, who answered A LOT of questions during this investigation (thank you!), suspects `xt-as` adds this padding for cache alignment and performance reasons. This padding is there in most but unfortunately not all cases, which is a bug.
You don't need SOF to observe this code and padding, you can produce it with a pure upstream zephyr workspace:
Make sure you use `ZEPHYR_TOOLCHAIN_VARIANT=xcc` and other variables below. If .text.k_ticks_to_us_floor64 is missing then you probably forgot to switch to that toolchain.
I could unfortunately not reproduce this issue with `hello_world` in any configuration; so far reproduction requires SOF.
On at least two of our automated build systems (sofbld07 and 08, Ubuntu 20.04), the padding disappears in the default configuration. This breaks build reproducibility. There are other differences in `sched.c.obj` caused by `-g` (apparently not in other `.c.obj` files, which don't have 150+ ELF sections).
EDIT: this happens because these systems use longer source paths / debug symbols; that's what is triggering this compiler bug.
When unspecified with `-g`, the default debug level is '-g2'. All other things being equal, the high debug level set by default by Zephyr makes the usual padding disappear on these particular systems.
When decreasing the debug level in `cmake/compiler/gcc/compiler_flags.cmake` to -g1 or -g0 or no -g at all and making no other change whatsoever, the usual padding re-appears! In other words, a lower -g debug level makes the object code that runs on the DSP normal again.
For quick testing use:
... then copy/paste and edit the line that compiles sched.c
The difference appears during the `-S sched.i -> sched.s` compilation step.
The compiled `sched-g1.s` and `sched-g2.s` have no actual difference in the assembly code; they only have a lot of `.byte` lines which are different. Somehow these bytes affect the assembly phase and the object code on those systems.
The pre-processed files `sched-g1.i` and `sched-g2.i` are strictly identical.
Another proof that the -g level triggers the issue: "hiding" the zephyr source code during the `-S sched.i -> sched.s` step restores the usual padding. Running strace with `-e openat` on the compiler at that step shows that it reads hundreds of source files when using -g (even at the -g1 level that preserves the usual padding).
I tried to reproduce on a few other systems including another Ubuntu 20.04 but did not: the padding is always there no matter what I tried, only a few systems are affected. I'm of course using the very same toolchain.
I could not reproduce on the "guilty" systems with `samples/hello_world/` either: with hello_world the padding is always there too. Too few debug symbols in `hello_world` to trigger this bug?
I unfortunately couldn't find what's unusual about these CI systems. Whatever makes them special, I don't think that, all other things being equal, the generated object code should EVER differ between -g1 and -g2 in any circumstance. It should even less differ on some systems but not on others when using the very same toolchain. So this qualifies as a compiler bug IMHO and makes that toolchain untrustworthy for reproducible builds.