Memory corruption for newlib-nano with float printf and disabled heap #30055
Comments
IIRC the newlib heap and the Zephyr system heap aren't the same memory, right? Maybe the bug here is a linkage problem where the newlib heap memory is clobbering something adjacent? |
Yes, that could be the case. Do you know how I can find out which memory location newlib uses as its heap? |
I dug around here a little, but obviously without an exerciser there's not a lot I can do. In fact yes, as I (vaguely) remember the newlib malloc() heap is an sbrk-style thing that is a physically separate region of memory from the Zephyr heap. So despite the fact that they both say "malloc" these regions shouldn't interact at all. Can you post the zephyr.map files from your minimally-changed "working" and "failing" cases? I still strongly suspect this is going to turn out to be some kind of linker interaction, plausibly alignment of the heap that interacts with pointer math somewhere in the soft float implementation? |
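For readers unfamiliar with the term, an "sbrk-style" heap can be sketched roughly as below. This is a host-side illustration only, not Zephyr's or newlib's actual code: `demo_heap`, `DEMO_HEAP_SIZE` and `demo_sbrk` are hypothetical names, and the fixed array stands in for whatever RAM region the C library heap is given. The point is that the allocator only moves a "break" inside its own region and signals exhaustion with `(void *)-1`.

```c
#include <stddef.h>

/* Hypothetical arena standing in for the RAM region given to the C library heap. */
#define DEMO_HEAP_SIZE 64
static unsigned char demo_heap[DEMO_HEAP_SIZE];
static ptrdiff_t demo_heap_used; /* current break, as an offset into demo_heap */

/*
 * sbrk-style growth: return the old break and move it by `increment` bytes.
 * Returns (void *)-1 if the request would leave the arena.
 */
static void *demo_sbrk(ptrdiff_t increment)
{
    /* Reject requests that cannot possibly fit, which also avoids overflow below. */
    if (increment > DEMO_HEAP_SIZE || increment < -DEMO_HEAP_SIZE) {
        return (void *)-1;
    }

    ptrdiff_t new_used = demo_heap_used + increment;
    if (new_used < 0 || new_used > DEMO_HEAP_SIZE) {
        return (void *)-1;
    }

    void *old_break = &demo_heap[demo_heap_used];
    demo_heap_used = new_used;
    return old_break;
}
```

A caller that ignores the `(void *)-1` result and writes through it anyway would corrupt unrelated memory, which is the kind of symptom discussed in this thread.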
@andyross sorry for the delayed response. Since my upgrade to Zephyr SDK v0.12.0 I was not able to reproduce the issue anymore. Instead, I get the following error printed to the serial console now:
A colleague compiled our firmware with exactly the same settings on Windows and could still reproduce the issue. So here is a bunch of map files: https://nextcloud.libre.solar/s/CjeXZZt7AYKjEBJ
The issue seems to be independent of the Zephyr version. I get the same behavior with v2.4-branch and v2.5-RC1. |
Some additional information: My colleague's newlib version, where the issue appeared, is 3.1.0, whereas Zephyr SDK v0.12 updated newlib to 3.3.0. This seems to be relevant to the problem: https://census-labs.com/news/2020/01/31/multiple-null-pointer-dereference-vulnerabilities-in-newlib/ So maybe the remaining problem is only to create a linker error if not enough space is left for the newlib heap, if that makes sense? |
Grooming bugs: sorry, can you confirm that a 3.3.0 newlib fixes the problem? Or you just suspect it might? |
And serendipitously, I just ran into #33164 which very well might be the root cause if your application has multiple threads trying to use that heap. |
I would not say it fixes the problem completely. I can confirm that we don't see the junk characters anymore with newlib 3.3.0, but we get the above assertion message instead. Ideally, the linker would claim sufficient memory for newlib such that we see at compile-time that we run out of space. Regarding the potential race condition in newlib: I'm not sure if that's really the root cause. I suspended all threads except for one with |
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
@martinjaeger Can you re-test this to see if #35227 might have helped? |
@galak No, don't think this has brought the solution. I compiled our firmware with the main stack size increased such that 99% of the RAM is consumed. Now, a printf at the beginning of the code doesn't get printed at all. After sending a command to the device via the serial it prints |
I've created a minimal non-working example with the most recent Zephyr branch:
Result on the serial console:
As soon as we decrease the main stack size by a few hundred bytes, everything works fine. |
@martinjaeger thanks for working up a test. |
NOTE: The issue is reproducible with @martinjaeger's sample in |
Our colleague using the older newlib version 3.1.0 can also still confirm the junk characters. With
With
|
Analysis
By default, newlib heap starts at zephyr/lib/libc/newlib/libc-hooks.c Line 82 in 1a2804c
zephyr/lib/libc/newlib/libc-hooks.c Lines 271 to 285 in 1a2804c
The default
The newlib internal This problem was fixed in the recent commit: In summary, this was newlib's failure to NULL-check the pointer returned by |
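To illustrate the class of bug named in the summary above (an allocation result used without a NULL check), here is a generic host-side sketch. It is not newlib's code: `demo_format_float`, `demo_alloc_fn` and `demo_alloc_always_fails` are hypothetical names, and the pluggable allocator exists only so the out-of-memory path can be exercised.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hook; must be compatible with free(). */
typedef void *(*demo_alloc_fn)(size_t size);

/* Simulates an exhausted heap. */
static void *demo_alloc_always_fails(size_t size)
{
    (void)size;
    return NULL;
}

/*
 * Formats `value` with two decimals into `out`, using a heap scratch buffer.
 * Returns the number of characters written, or -1 on allocation failure
 * or if the result does not fit.
 */
static int demo_format_float(demo_alloc_fn alloc, double value,
                             char *out, size_t out_len)
{
    const size_t scratch_len = 32;
    char *scratch = alloc(scratch_len);

    /* The check whose absence turns "out of heap" into memory corruption. */
    if (scratch == NULL) {
        return -1;
    }

    int n = snprintf(scratch, scratch_len, "%.2f", value);
    if (n < 0 || (size_t)n >= scratch_len || (size_t)n >= out_len) {
        free(scratch);
        return -1;
    }

    memcpy(out, scratch, (size_t)n + 1); /* include the terminating NUL */
    free(scratch);
    return n;
}
```

With the check in place, a failed allocation becomes a reportable error rather than a write through an invalid pointer.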
@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using? |
I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0? |
Sorry, I misread. That commit is part of |
@martinjaeger Since this is not a Zephyr-side issue and it has been fixed in the newlib upstream (zephyrproject-rtos/newlib-cygwin@f88aece), shall we close this issue? |
I have the toolchain in Windows installed at C:\gnu_arm_embedded\arm-none-eabi. I can see the _newlib_version.h having the version info as _NEWLIB_VERSION "3.1.0" |
@stephanosio I still think it's a Zephyr-related issue, as we don't consider the heap space required for newlib. Is the newlib |
Memory corruption in itself, which the reported issue is, is not a Zephyr issue and has been fixed upstream. The default Zephyr-newlib integration scheme is designed to make use of whatever RAM is leftover (i.e. not explicitly allocated by the Zephyr kernel and apps) for the newlib heap region, and when this leftover space is insufficient for certain operations, such operations will fail. I can see two potential enhancements we can make in this regard:
|
|
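As a rough illustration of the "leftover RAM" scheme described above, the arithmetic can be sketched on a host as follows. Everything here is an assumption made for the example: the function names, the RAM base address, the end-of-used-RAM address and the minimum heap size are all hypothetical and are not taken from Zephyr's linker scripts or Kconfig.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes left between the end of statically used RAM and the end of RAM.
 * Returns 0 if `used_end` lies outside [ram_base, ram_base + ram_size).
 */
static size_t demo_leftover_heap(uintptr_t ram_base, size_t ram_size,
                                 uintptr_t used_end)
{
    uintptr_t ram_end = ram_base + ram_size;

    if (used_end < ram_base || used_end >= ram_end) {
        return 0;
    }
    return (size_t)(ram_end - used_end);
}

/* True if the leftover region meets a (hypothetical) minimum requirement. */
static bool demo_heap_is_sufficient(size_t leftover, size_t min_required)
{
    return leftover >= min_required;
}
```

For example, with 20k of RAM and 20000 bytes already in use, only 480 bytes remain for the heap, so a check against a larger minimum would fail. Performing such a check at build time is what would turn the runtime failure into a visible error.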
Thanks for the clarification regarding newlib heap management in Zephyr. I think option 1 makes sense. Not sure what a "reasonable" default would be, though. |
Describe the bug
We are facing a strange issue in our project based on STM32L072 with 20k of RAM. If certain features are enabled such that most of the RAM is consumed, float variables in printf statements (using newlib nano) get replaced by random junk characters. Printing of integers works fine. Also printk with recently added float support (cbprintf) works fine.
Example code:
Results in:
To Reproduce
Haven't been able to generate a minimum working example to reproduce the issue, as it disappears if too much of the code is removed. However, it does not seem to be an issue in the application firmware itself. The issue happens in different threads and stack usage is still quite low (because I put all threads immediately into k_sleep(K_FOREVER) to exclude possible application firmware bugs):
Possible root cause and workaround
Our application doesn't use the heap. Since PR #28486, the RAM reserved for the heap seems to be garbage-collected away in that case (independent of the value of CONFIG_HEAP_MEM_POOL_SIZE) and can be reused for the stack.
However, newlib requires malloc if printf is used with %f: http://www.nadler.com/embedded/newlibAndFreeRTOS.html
If I add a line void *mem_test = k_malloc(4); to the code, Zephyr compiles in the heap management again and the issue is gone.
I'm not 100% sure if the above is really the root cause or if it made the issue disappear by coincidence, but it looks plausible to me. Maybe someone with more insight into newlib internals can confirm.
This link posted by @pabigot on Slack might also be relevant: https://stackoverflow.com/questions/28746062/snprintf-prints-garbage-floats-with-newlib-nano
Ping @nashif @dcpleung @andrewboie @carlescufi as you were involved in mentioned PR.