Memory corruption for newlib-nano with float printf and disabled heap #30055

martinjaeger · 2020-11-16T14:48:43Z

Describe the bug

We are facing a strange issue in our project based on STM32L072 with 20k of RAM. If certain features are enabled such that most of the RAM is consumed, float variables in printf statements (using newlib nano) get replaced by random junk characters. Printing of integers works fine. Also printk with recently added float support (cbprintf) works fine.

Example code:

printf("Junk: %.2f and maybe not: %d\n", 123.4F, 1234);

Results in:

Junk: <0xb0>@<break>
. <0x9d> and maybe not: 1234

To Reproduce

Haven't been able to generate a minimum working example to reproduce the issue, as it disappears if too much of the code is removed. However, it does not seem to be an issue in the application firmware itself. The issue happens in different threads and stack usage is still quite low (because I put all threads immediately into k_sleep(K_FOREVER) to exclude possible application firmware bugs):

I: Thread analyze:
I:  thread_analyzer     : unused 80 usage 432 / 512 (84 %)
I:  serial_thread_id    : unused 1160 usage 120 / 1280 (9 %)
I:  leds_thread         : unused 104 usage 152 / 256 (59 %)
I:  gsm                 : unused 1208 usage 216 / 1424 (15 %)
I:  control_thread_id   : unused 912 usage 112 / 1024 (10 %)
I:  idle 00             : unused 188 usage 68 / 256 (26 %)
I:  main                : unused 792 usage 232 / 1024 (22 %)

Possible root cause and workaround

Our application doesn't use the heap. Since PR #28486, the RAM reserved for the heap seems to be garbage-collected away in that case (independent of the value of CONFIG_HEAP_MEM_POOL_SIZE) and can be reused for the stack.

However, newlib requires malloc if printf is used with %f: http://www.nadler.com/embedded/newlibAndFreeRTOS.html

If I add a line void *mem_test = k_malloc(4); to the code, Zephyr compiles in the heap management again and the issue is gone.

I'm not 100% sure if the above is really the root cause or if it made the issue disappear by coincidence, but it looks plausible to me. Maybe someone with more insight into newlib internals can confirm.

This link posted by @pabigot on Slack might also be relevant: https://stackoverflow.com/questions/28746062/snprintf-prints-garbage-floats-with-newlib-nano

Ping @nashif @dcpleung @andrewboie @carlescufi as you were involved in mentioned PR.

The text was updated successfully, but these errors were encountered:

andyross · 2021-01-13T22:32:13Z

IIRC the newlib heap and the Zephyr system heap aren't the same memory, right? Maybe the bug here is a linkage problem where the newlib heap memory is clobbering something adjacent?

martinjaeger · 2021-01-14T07:27:33Z

Yes, that could be the case. Do you know how I can find out which memory location newlib uses as its heap?

andyross · 2021-01-21T21:46:53Z

I dug around here a little, but obviously without an exerciser there's not a lot I can do. In fact yes, as I (vaguely) remember the newlib malloc() heap is an sbrk-style thing that is a physically separate region of memory from the Zephyr heap. So despite the fact that they both say "malloc" these regions shouldn't interact at all.

Can you post the zephyr.map files from your minimally-changed "working" and "failing" cases? I still strongly suspect this is going to turn out to be some kind of linker interaction, plausibly alignment of the heap that interacts with pointer math somewhere in the soft float implementation?

martinjaeger · 2021-01-28T18:02:53Z

@andyross sorry for the delayed response. Since my upgrade to Zephyr SDK v0.12.0 I was not able to reproduce the issue anymore. Instead, I get the following error printed to the serial console now:

assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

A colleage compiled our firmware with exactly the same settings on Windows and could still reproduce the issue. So here is a bunch of map files: https://nextcloud.libre.solar/s/CjeXZZt7AYKjEBJ

zephyr_balloc_assert.map: Compiled with SDK v0.12 and leading to above error
zephyr_junk_char.map: Same firmware configuration compiled under Windows, leading to junk character output
zephyr_no_junk.map: Configuration slightly changed such that the floats are printed correctly (also compiled under Windows)

The issue seems to be independent of the Zephyr version. I get the same behavior with v2.4-branch and v2.5-RC1.

martinjaeger · 2021-01-28T18:21:14Z

Some additional information: The newlib version of my colleague where the issue appeared is 3.1.0, where as Zephyr SDK v0.12 updated newlib to 3.3.0.

This seems to be relevant to the problem: https://census-labs.com/news/2020/01/31/multiple-null-pointer-dereference-vulnerabilities-in-newlib/

So maybe the remaining problem is only to create a linker error if not enough space is left for the newlib heap, if that makes sense?

andyross · 2021-03-08T22:07:32Z

Grooming bugs: sorry, can you confirm that a 3.3.0 newlib fixes the problem? Or you just suspect it might?

andyross · 2021-03-08T23:21:12Z

And serendipitously, I just ran into #33164 which very well might be the root cause if your application has multiple threads trying to use that heap.

martinjaeger · 2021-03-09T07:10:35Z

Grooming bugs: sorry, can you confirm that a 3.3.0 newlib fixes the problem? Or you just suspect it might?

I would not say it fixes the problem completely. I can confirm that we don't see the junk characters anymore with newlib 3.3.0. But we get above assertion message instead. Ideally, the linker would claim sufficient memory for newlib such that we see at compile-time that we run out of space.

Regarding the potential race condition in newlib: I'm not sure if that's really the root cause. I suspended all threads except for one with k_sleep(K_FOREVER). So there should not be multiple threads calling printf at the same time. I'm happy to verify it once a fix for newlib synchronization is in place.

github-actions · 2021-05-09T00:29:35Z

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

galak · 2021-05-18T16:17:33Z

@martinjaeger Can you re-test this to see if #35227 might have helped.

martinjaeger · 2021-05-21T12:13:57Z

@galak No, don't think this has brought the solution. I compiled our firmware with the main stack size increased such that 99% of the RAM is consumed. Now, a printf at the beginning of the code doesn't get printed at all. After sending a command to the device via the serial it prints exit and restarts (without any further fault indication). So there must be still some function that uses memory locations it's not supposed to use.

martinjaeger · 2021-05-25T07:18:59Z

I've created a minimum non-working example with most recent Zephyr branch:
martinjaeger@3bf40cb

$ west build -b nucleo_l073rz samples/hello_world/
[3/9] Linking C executable zephyr/zephyr_prebuilt.elf

[9/9] Linking C executable zephyr/zephyr.elf
Memory region         Used Size  Region Size  %age Used
           FLASH:       33752 B       192 KB     17.17%
            SRAM:       19920 B        20 KB     97.27%
        IDT_LIST:          0 GB         2 KB      0.00%
$ west flash

Result on the serial console:

*** Booting Zephyr OS build v2.6.0-rc1-337-g3bf40cb11342  ***
Hello World! nucleo_l073rz assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

As soon as we decrease the main stack size by a few hundred bytes, everything works fine.

galak · 2021-05-25T12:12:20Z

@martinjaeger thanks for working up a test.

stephanosio · 2021-05-25T12:50:18Z

NOTE: The issue is reproducible with @martinjaeger's sample in qemu_cortex_m3 if you set CONFIG_MAIN_STACK_SIZE=61440.

martinjaeger · 2021-05-25T14:40:28Z

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

With CONFIG_MAIN_STACK_SIZE=16200:

*** Booting Zephyr OS build v2.6.0-rc1  ***
Hello World! nucleo_l073rz Ø.D. -2.

With CONFIG_MAIN_STACK_SIZE=1000:

*** Booting Zephyr OS build v2.6.0-rc1  ***
Hello World! nucleo_l073rz 3.140000

stephanosio · 2021-05-25T14:44:49Z

Analysis

By default, newlib heap starts at USED_RAM_END_ADDR = _end = _image_ram_end, which points to the memory address right after the end of the RAM region occupied by the Zephyr kernel, which includes the "main stack" area:

zephyr/lib/libc/newlib/libc-hooks.c

Line 82 in 1a2804c

#define HEAP_BASE USED_RAM_END_ADDR

printf invokes various functions (e.g. dtoa), which internally call malloc and its friends (i.e. calloc) to allocate buffer memory, which in turn call _sbrk to increase the heap size as necessary:

Breakpoint 1, _sbrk (count=140) at /home/stephanos/Dev/zephyrproject/zephyr/lib/libc/newlib/libc-hooks.c:275
275             ptr = ((char *)HEAP_BASE) + heap_sz;
(gdb) bt
#0  _sbrk (count=140) at /home/stephanos/Dev/zephyrproject/zephyr/lib/libc/newlib/libc-hooks.c:275
#1  0x0000699c in _sbrk_r ()
#2  0x000066f2 in _malloc_r ()
#3  0x000090c0 in _calloc_r ()
#4  0x00006046 in _Balloc ()
#5  0x00006576 in __d2b ()
#6  0x00000e66 in _dtoa_r ()
#7  0x00008c68 in __cvt ()
#8  0x000058f0 in _printf_float ()
#9  0x00006948 in _vfprintf_r ()
#10 0x00005e9a in printf ()
#11 0x00000d62 in main () at ../src/main.c:13

_sbrk checks that the current heap size + requested expansion size (count) does not exceed MAX_HEAP_SIZE to ensure that the newlib heap region does not expand beyond the RAM area -- if it does, it will return -1, which in turn will cause malloc to fail and return NULL:

zephyr/lib/libc/newlib/libc-hooks.c

Lines 271 to 285 in 1a2804c

    
           void *_sbrk(intptr_t count) 
        
           { 
        
           	void *ret, *ptr; 
        
           	ptr = ((char *)HEAP_BASE) + heap_sz; 
        
           	if ((heap_sz + count) < MAX_HEAP_SIZE) { 
        
           		heap_sz += count; 
        
           		ret = ptr; 
        
           	} else { 
        
           		ret = (void *)-1; 
        
           	} 
        
           	return ret; 
        
           }

The default malloc algorithm implemented by newlib has a tendency to request large chunks through _sbrk and this fails when the memory available for the newlib heap is small (similar issue observed in #21167):

*** Booting Zephyr OS build v2.6.0-rc1-337-g3bf40cb11342  ***
_sbrk: SUCC: heap_sz = 0, count = 0
_sbrk: SUCC: heap_sz = 0, count = 420
_sbrk: FAIL: heap_sz = 420, count = 1032, MAX_HEAP_SIZE = 552
Hello World! qemu_cortex_m3 _sbrk: SUCC: heap_sz = 420, count = 24
_sbrk: FAIL: heap_sz = 444, count = 140, MAX_HEAP_SIZE = 552
assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

The newlib internal __d2b function, which is called by printf, previously did not check the NULL return from malloc and this caused the function to access invalid memory address.

This problem was fixed in the recent commit:
zephyrproject-rtos/newlib-cygwin@f88aece

In summary, this was newlib's failure to NULL-check the pointer returned by malloc, which was fixed in the above commit, and not a problem with Zephyr per se.

stephanosio · 2021-05-25T14:53:03Z

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

martinjaeger · 2021-05-25T14:59:55Z

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

stephanosio · 2021-05-25T15:03:44Z

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

Sorry, I misread. That commit is part of cygwin-3_1_0-release, not newlib-3.1.0, so seeing memory corruption with newlib 3.1.0 is to be expected.

stephanosio · 2021-05-25T15:09:46Z

@martinjaeger Since this is not a Zephyr-side issue and it has been fixed in the newlib upstream (zephyrproject-rtos/newlib-cygwin@f88aece), shall we close this issue?

azeemshatp · 2021-05-25T15:14:09Z

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

I have the toolchain in Windows installed at C:\gnu_arm_embedded\arm-none-eabi. I can see the _newlib_version.h having the version info as _NEWLIB_VERSION "3.1.0"

martinjaeger · 2021-05-25T15:17:04Z

@stephanosio I still think it's a Zephyr-related issue, as we don't consider the heap space required for newlib. Is the newlib MAX_HEAP_SIZE from your post above a newlib constant? Can we reduce the amount of RAM allowed to be consumed during linking by that value?

stephanosio · 2021-05-25T15:40:07Z

Memory corruption in itself, which the reported issue is, is not a Zephyr issue and has been fixed upstream.

The default Zephyr-newlib integration scheme is designed to make use of whatever RAM is leftover (i.e. not explicitly allocated by the Zephyr kernel and apps) for the newlib heap region, and when this leftover space is insufficient for certain operations, such operations will fail.

I can see two potential enhancements we can make in this regard:

Add a config that specifies the minimum amount of memory space that must be available for the newlib heap (with a default value that is deemed reasonable for common newlib operations). When MAX_HEAP_SIZE in libc-hooks.c is below the value specified in this config, a build-time error is output so that user is aware of potential problems and make adjustments as necessary.
Make non-critical failures (e.g. printf) less disruptive (this will require submitting patches to the upstream newlib, so this should be deemed a long-term solution).

stephanosio · 2021-05-25T15:42:31Z

Is the newlib MAX_HEAP_SIZE from your post above a newlib constant

zephyr/lib/libc/newlib/libc-hooks.c

Lines 92 to 93 in 1a2804c

#define MAX_HEAP_SIZE (KB(CONFIG_SRAM_SIZE) - (HEAP_BASE - \

CONFIG_SRAM_BASE_ADDRESS))

MAX_HEAP_SIZE in just the size of the RAM area not used (explicitly allocated) by the Zephyr kernel and apps.

martinjaeger · 2021-05-25T15:53:57Z

Thanks for the clarification regarding newlib heap management in Zephyr.

I think option 1 makes sense. Not sure what a "reasonable" default would be, though.

martinjaeger added the bug The issue is a bug, or the PR is fixing a bug label Nov 16, 2020

nashif added the priority: medium Medium impact/importance bug label Nov 17, 2020

carlescufi added the area: C Library C Standard Library label Nov 17, 2020

nashif assigned andyross Jan 12, 2021

github-actions bot added the Stale label May 9, 2021

martinjaeger removed the Stale label May 18, 2021

carlescufi assigned stephanosio and unassigned andyross May 25, 2021

stephanosio added this to the v2.6.0 milestone May 25, 2021

stephanosio mentioned this issue May 28, 2021

lib: newlib: Add minimum required heap size config #35799

Merged

stephanosio linked a pull request May 28, 2021 that will close this issue

lib: newlib: Add minimum required heap size config #35799

Merged

galak closed this as completed in #35799 Jun 2, 2021

Retro-Fitt mentioned this issue Nov 7, 2023

Which MCU model is supported? LibreSolar/bms-15s80-sc#6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory corruption for newlib-nano with float printf and disabled heap #30055

Memory corruption for newlib-nano with float printf and disabled heap #30055

martinjaeger commented Nov 16, 2020 •

edited

Loading

andyross commented Jan 13, 2021

martinjaeger commented Jan 14, 2021

andyross commented Jan 21, 2021

martinjaeger commented Jan 28, 2021

martinjaeger commented Jan 28, 2021

andyross commented Mar 8, 2021

andyross commented Mar 8, 2021

martinjaeger commented Mar 9, 2021

github-actions bot commented May 9, 2021

galak commented May 18, 2021

martinjaeger commented May 21, 2021

martinjaeger commented May 25, 2021

galak commented May 25, 2021

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021 •

edited

Loading

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021

stephanosio commented May 25, 2021

azeemshatp commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

Memory corruption for newlib-nano with float printf and disabled heap #30055

Memory corruption for newlib-nano with float printf and disabled heap #30055

Comments

martinjaeger commented Nov 16, 2020 • edited Loading

andyross commented Jan 13, 2021

martinjaeger commented Jan 14, 2021

andyross commented Jan 21, 2021

martinjaeger commented Jan 28, 2021

martinjaeger commented Jan 28, 2021

andyross commented Mar 8, 2021

andyross commented Mar 8, 2021

martinjaeger commented Mar 9, 2021

github-actions bot commented May 9, 2021

galak commented May 18, 2021

martinjaeger commented May 21, 2021

martinjaeger commented May 25, 2021

galak commented May 25, 2021

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021 • edited Loading

Analysis

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021

stephanosio commented May 25, 2021

azeemshatp commented May 25, 2021

martinjaeger commented May 25, 2021

stephanosio commented May 25, 2021

stephanosio commented May 25, 2021

martinjaeger commented May 25, 2021

martinjaeger commented Nov 16, 2020 •

edited

Loading

stephanosio commented May 25, 2021 •

edited

Loading