Memory optimisations ESP32 #869
Comments
One idea is to look at https://arduinojson.org/news/2022/12/26/arduinojson-6-20-0/#shallow-copy |
@MichaelDvP we could experiment with reducing the UART buffer in
which gives 1k back in heap. I tested this and it seems to work, but it's risky. What do you think is the correct value? |
I've added a |
Not sure if it's related, but when I reduced the task to 1024 bytes the OTA stopped working at around 8% of the upload (on the E32). So I guess we leave it at 2048; it's fine and safe, and also the same as in the example library.

It seems memory is becoming an issue again, and all my nightmares from the ESP8266, writing my own internal std libraries and doing code optimizations into the night, are coming back! In future gateway boards we'll have more RAM and PSRAM to play with, but that doesn't solve the problem for the thousands of existing users on the 4MB ESP32. We can throttle and limit the WebUI and MQTT like you did, by checking the memory before creating those large DynamicJson buffers. We could also do this dynamically, by first calculating the buffer size from the number of entities and then allocating that memory to the JsonBuffer. Just a thought; I'm not sure it can be done in real time. What do you think?

What keeps me awake at night are these crashes. Heap on a cold start with no devices/entities is 160 KB and drops to 100 KB with a few devices and 180 entities, which seems like more than enough memory. I haven't seen signs of memory leakage anymore. BTW I didn't know the mDNS library takes up 5KB, it's a beast. I think the max allocation (the largest available block) is what is hurting us. All the string functions (sprintf/std::string concatenation) are causing fragmentation and reducing the size of that block. I think anything < 40KB is a problem.

The other thing I've been working on is making it more thread-safe, especially when the console, syslog and web log are running. The ESP32 is dual-core and not all functions are single-threaded. Although I don't think that is related to the memory problems or crashes. |
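The two ideas in this comment (checking memory before creating a large JSON buffer, and sizing the buffer from the entity count) could look roughly like the sketch below. The names `estimate_json_capacity` and `can_allocate`, the per-entity cost and the overhead are all hypothetical and not taken from the EMS-ESP code. On the ESP32 the largest free block would have to come from the platform, for example `heap_caps_get_largest_free_block()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: estimate the JSON buffer size from the entity count.
// bytes_per_entity and overhead are illustrative assumptions, not measured values.
std::size_t estimate_json_capacity(std::size_t entity_count,
                                   std::size_t bytes_per_entity = 100,
                                   std::size_t overhead         = 256) {
    assert(bytes_per_entity > 0);
    return overhead + entity_count * bytes_per_entity;
}

// Hypothetical guard: allow the allocation only when the largest free block
// can hold the buffer and still leaves a safety reserve afterwards.
bool can_allocate(std::size_t largest_free_block, std::size_t needed, std::size_t reserve) {
    return largest_free_block >= needed && largest_free_block - needed >= reserve;
}
```

A caller would compute the capacity first and skip or throttle the WebUI/MQTT response when `can_allocate` returns false, instead of attempting the allocation and failing.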
Could you take a look at the POST of the weblog settings? Every time I change a setting I get a reproducible rx error. But usart/rx is a completely different task, and it only happens on the weblog page. Settings in weblog are different from the others: there is no send button and the POST seems to be handled differently. For the weblog buffer I've tested only resetting log_message_id_tail_ and letting the event send the buffer one by one. Then we don't need the large msgpackjson to send the buffer at once.
Works. We only need to start id from 1 ( |
I've created a new branch tech-upgrade for testing these changes and monitoring memory usage and performance. These changes will most likely end up in EMS-ESP 3.6.0. |
Another thing to look at (potentially) is the filesystem, which uses 28K of Flash. Compressing it (JSON MessagePack) or using something completely different like CBOR (Concise Binary Object Representation, see https://www.rfc-editor.org/rfc/rfc7049 and https://github.com/ssilverman/libCBOR) would halve that, and we could use the space for more Web code and translations. |
I've made some minor changes in the new branch. I should have done a PR but forgot I was adding to upstream, so apologies if the changes are difficult to track. One rule I applied is to make any buffer <1024 use a StaticJsonDocument rather than a DynamicJsonDocument, which puts it on the stack instead of the heap, so in theory less fragmentation |
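The "small buffers on the stack, large buffers on the heap" rule can be illustrated in plain standard C++. This is only an analogy under stated assumptions, not ArduinoJson code: `StackBuffer`, `HeapBuffer`, `Buffer` and `kStackLimit` are hypothetical names, and the 1024 threshold is the one mentioned in the comment above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

constexpr std::size_t kStackLimit = 1024; // threshold from the rule above

// Storage lives inside the object, so a local variable sits on the stack.
template <std::size_t N>
struct StackBuffer {
    std::array<char, N>   data{};
    static constexpr bool on_stack = true;
};

// Storage is allocated from the heap by std::vector.
template <std::size_t N>
struct HeapBuffer {
    std::vector<char>     data = std::vector<char>(N);
    static constexpr bool on_stack = false;
};

// Pick the storage type at compile time from the requested size.
template <std::size_t N>
using Buffer = typename std::conditional<(N < kStackLimit), StackBuffer<N>, HeapBuffer<N>>::type;
```

Declaring `Buffer<512> small;` inside a function then costs no heap allocation, while `Buffer<2048> large;` allocates once. The trade-off is that stack-resident buffers count against the task's stack size, which matters given the 2048-byte task discussed earlier in the thread.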
I moved some std::strings over to const char *'s in this branch. I also added the ability to run and debug in Visual Studio Code without using an ESP32, which will help memory-profile the crashes @HansRemmerswaal is experiencing. |
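As a rough illustration of what moving from std::string concatenation to const char * can mean in practice, the sketch below appends C strings into a fixed, caller-owned buffer with no heap allocation. The helper `append` is hypothetical and not part of the EMS-ESP code base; it assumes `dst` already holds a NUL-terminated string within `dst_size` bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Append src to the NUL-terminated string in dst without exceeding dst_size.
// Returns false and leaves dst unchanged when src does not fit.
bool append(char * dst, std::size_t dst_size, const char * src) {
    assert(dst != nullptr && src != nullptr && dst_size > 0);
    std::size_t used = std::strlen(dst);
    std::size_t add  = std::strlen(src);
    if (used + add + 1 > dst_size) {
        return false; // not enough room for the text plus the terminator
    }
    std::memcpy(dst + used, src, add + 1); // also copies the terminating '\0'
    return true;
}
```

For example, starting from `char topic[16] = "boiler";`, a call to `append(topic, sizeof(topic), "/temp")` builds the full string in place, whereas `std::string` concatenation may allocate and free heap blocks of varying size on each step.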
@MichaelDvP for shallowCopy() see bblanchon/ArduinoJson#1849 |
Yes, shallowCopy will help to reduce the large onBlock buffers on the heap. But the crashes are another matter. This should not happen. I think it's a larger buffer that cannot be allocated and returns nullptr. Json never returns nullptr, only a document capacity of zero, which is safe. I think maybe somewhere in mqtt, the boiler_data is large and needs a serialization buffer. Or thermostat_data: Hans has 2 thermostats, both published in a single data. The Mqtts-single option could help. |
I think it's in the MQTT code too. I'm running some tests and can reproduce the crash - I just need to figure out where. I'm using this new tech-upgrade branch, building with Then I turn on MQTT Discovery and ka-boom Also without MQTT |
I currently have at least one customer who also experiences crashes when they specifically turn on MQTT. |
It's something in the MQTT code for sure. I changed the test to only load the boiler (160 entities)
Which points to a defrag problem in the MQTT code and the max allocation not big enough. Investigation continues.... |
another thing to look at is the MALLOC_CAP_32BIT flags https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/mem_alloc.html#bit-accessible-memory |
Do you know what process tries to malloc such a large memory block? I think there is a char * with a missing termination, and something like a serialization buffer reads memory until it finds a null. |
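If the suspicion is a char * without a terminating null, one defensive pattern is to never read past a known maximum length. The sketch below is an assumption-laden illustration, with the hypothetical helper `bounded_copy`; it is not where the bug was found and not code from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Copy at most max_len bytes from a buffer that may lack a terminating '\0'.
// The read stops at the first '\0' or at max_len, whichever comes first.
std::string bounded_copy(const char * src, std::size_t max_len) {
    assert(src != nullptr);
    const void * end = std::memchr(src, '\0', max_len);
    std::size_t  len = end ? static_cast<std::size_t>(static_cast<const char *>(end) - src)
                           : max_len;
    return std::string(src, len);
}
```

With an unterminated `char raw[4] = {'a', 'b', 'c', 'd'}`, `bounded_copy(raw, sizeof(raw))` reads exactly four bytes, where an unbounded `strlen` would keep scanning adjacent memory and could report a huge length to whatever allocates the serialization buffer.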
I found a similar pattern and can simulate the crash without HA (I've been working on the automated test script). But I haven't had time to debug where it is happening. I think it's also related to a nullptr somewhere, as I can't imagine malloc() crashing the system like it does. There is also a memory leak somewhere: I've seen the heap get smaller over a period of days. I'll find time to test this all out soon |
I've been running some memory tests over the last 24hrs looking at how services affect the free heap and max alloc buffer. Some interesting results. I'll summarize below what I've found out so far: the good news:
the bad news:
things we can't fix but should be aware of and document:
I haven't tested SysLog or the Web Log yet. |
This splits into adding devices, a lot of register_telegram calls and the 35 register_device_value calls. Are the entities the registered ones, or only the ones with values? A register_device_value is around 100 bytes, so roughly 3.5k for the entities and 24k for the telegrams and devices. |
It's not the free mem I'm worried about, I think we have enough. It's the max allocation block that goes from 107KB to 54KB and stays there after adding the 35 entities/device values. That is what I'm going to look into when I'm back next week. The frag is mainly due to the std::string concatenations and snprintf()'s. One other memory test we can add is just adding a malloc() somewhere in setup() to force the available heap to say 70KB (leaving max alloc high at 100) and see how that operates. |
I'll also set OTA to be disabled as Default to save 6KB heap&alloc buffer. Any issues with this @MichaelDvP ? |
I see no issues. IMHO most people use web upload; if someone really wants to use OTA they can enable it. I think we can also set this as the default in v3.5. |
added to 3.5.0-dev-16 |
@MichaelDvP I've implemented #911 (show a message in WebUI when there are unsaved changes) and flash is up to 99.8% on 4MB which breaks the OTA (not enough space). I haven't looked into it yet but it should fit the partition? |
The upload checks the file size, which is always a bit bigger than the reported flash size. I think this is the image header, see here: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/app_image_format.html
if (Update.begin()) { then it writes the partition and does not compare the partition size with the file size. This should also be safe, if new data does not fit in partition this EMS-ESP32/lib/framework/UploadFileService.cpp Line 106 in e66e2c4
|
think I'm done here. Profiled the tech-upgrade build to death and the key conclusions are
|
In the tech-upgrade build I made some other modifications along the way, half of which I've forgotten with my Covid brain fog. However one worth noticing is that I removed DEBUG from the log. This is only for coders who want to debug when building (with -DEMSESP_DEBUG flag) and not part of the production build. |
and @MichaelDvP I've added all your changes from your dev branch to tech-upgrade. I'll keep testing for a few more days and then, if you agree, push it to dev as 3.6.0-dev-0 |
Look at ways to optimize memory usage, reduce heap and fragmentation
Things to investigate and try out:
Replace std::string with const char * where possible
Default reference config:
-DEMSESP_DEBUG
Test with:
python3 scripts/run_memory_test.py -i <IP> -w 30 -t <token>
The default test is called general. It loads a boiler and thermostat with 64 entities and waits for 30 seconds before taking a measurement.