OTA compatibility issue #9049

Closed
1 task done
marchingband opened this issue Dec 29, 2023 · 13 comments
Labels
Status: Awaiting triage Issue is waiting for triage

Comments

@marchingband

Board

ESP32 WROVER

Device Description

https://www.sparkfun.com/products/21307

Hardware Configuration

eMMC, DAC, etc

Version

v2.0.14

IDE Name

Arduino IDE

Operating System

macos

Flash frequency

80

PSRAM enabled

yes

Upload speed

115200

Description

I am struggling to debug an issue regarding OTA updates.
For the last few years, I have had seamless OTA update processes running.
I use the Update library to implement a manual update system.
Recently, the .bin files have been incompatible, producing odd behaviour after an update, such as failing to boot (an RTCWDT reset loop) or, sometimes, a rollback with an error code indicating "failed to erase".
I noticed that in this commit 161b167, @lbernstone and @me-no-dev changed the default partition tables, to include coredump and changed the size of the SPIFFS partition.
My project does not have a partitions.csv file, so relies on the default.
I am unfortunately in a position where I am unclear about what versions of arduino-esp32 were previously working for me, and the process of regression testing is getting very hard. I suspect I was previously on 2.0.2.
I am wondering if there is any information you guys could share that could help ... are you aware of any such issues where an update of arduino-esp32 causes incompatible binaries, when using the default.csv partitions?
thanks!

Sketch

https://github.com/marchingband/wvr

Debug Message

see above

Other Steps to Reproduce

  • flash with a binary created using arduino-esp32 version 2.0.2, using the older default.csv partition table
  • use the Update library to install a binary created using arduino-esp32 version 2.0.14, using the newer default.csv partition table
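
To see exactly what differs between the two default partition tables in the steps above, the CSV files can be compared entry by entry. The following is a minimal sketch, not part of the original report. The `OLD_CSV` and `NEW_CSV` contents are illustrative stand-ins based on the change described in this issue (a smaller SPIFFS partition plus a new coredump partition) and should be replaced with the actual `default.csv` from each core version; the function names are hypothetical.

```python
import csv
import io

# Illustrative tables only (assumption): paste in the real default.csv
# from each arduino-esp32 version you want to compare.
OLD_CSV = """# Name, Type, SubType, Offset, Size, Flags
nvs,      data, nvs,     0x9000,   0x5000,
otadata,  data, ota,     0xe000,   0x2000,
app0,     app,  ota_0,   0x10000,  0x140000,
app1,     app,  ota_1,   0x150000, 0x140000,
spiffs,   data, spiffs,  0x290000, 0x170000,
"""

NEW_CSV = """# Name, Type, SubType, Offset, Size, Flags
nvs,      data, nvs,      0x9000,   0x5000,
otadata,  data, ota,      0xe000,   0x2000,
app0,     app,  ota_0,    0x10000,  0x140000,
app1,     app,  ota_1,    0x150000, 0x140000,
spiffs,   data, spiffs,   0x290000, 0x160000,
coredump, data, coredump, 0x3F0000, 0x10000,
"""


def parse_partitions(text):
    """Parse a partitions CSV into {name: (type, subtype, offset, size)}."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in row]
        # Skip blank lines and comment lines.
        if not fields or not fields[0] or fields[0].startswith("#"):
            continue
        name, ptype, subtype, offset, size = fields[:5]
        table[name] = (ptype, subtype, int(offset, 0), int(size, 0))
    return table


def diff_partitions(old, new):
    """Return a sorted list of human-readable differences between two tables."""
    changes = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            changes.append(f"added {name}")
        elif name not in new:
            changes.append(f"removed {name}")
        elif old[name] != new[name]:
            changes.append(f"changed {name}")
    return changes


changes = diff_partitions(parse_partitions(OLD_CSV), parse_partitions(NEW_CSV))
print(changes)
```

With tables like these, only the data partitions at the end of flash differ; the `otadata`, `app0`, and `app1` entries are identical, which is what matters for whether an OTA image fits in its slot.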

I have checked existing issues, online documentation and the Troubleshooting Guide

  • I confirm I have checked existing issues, online documentation and Troubleshooting guide.
@SuGlider
Collaborator

My project does not have a partitions.csv file, so relies on the default.

Is this ESP32 remote and unreachable, or do you have access to it so that you can upload completely new firmware, boot, partition file, etc. using a computer and cable?

@marchingband
Author

marchingband commented Dec 29, 2023

@SuGlider I have some here with me to experiment with, but hundreds of others are remote (in the hands of users). I distribute binaries to users for updates, which they upload using my OTA methods and can store in the device's eMMC memory.

Why do you ask?

@Jason2866
Collaborator

The firmware compile process does not use any information from the partition scheme. The partition scheme is relevant only when flashing over a wire. Your compiled firmware (used for OTA) will not differ between partition schemes.
So the issue you encounter is caused elsewhere.

@Jason2866
Collaborator

Jason2866 commented Dec 30, 2023

What I can imagine is that code compiled with Insights and coredump support (as in current Arduino cores) is not compatible (crashes) when the partitions it needs are not found.
Since I find both (Insights, coredump) a waste of precious flash and useless for 99% of use cases, I removed them completely in my fork. Maybe that's why we do not run into the problem you encounter with Tasmota when doing OTA updates from older versions.

I also dislike (and removed) the change needed for Insights, which requires sha256 (#7566 (comment)); this makes it impossible for esptool.py to adjust the magic firmware file header.
This is needed when the built firmware is flashed to another device whose flash size, speed, and mode differ from the settings chosen at compile time.
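
For readers unfamiliar with the header fields mentioned above, here is a minimal sketch that decodes them from the first 24 bytes of an image. It assumes the ESP32 image header layout as I understand it from ESP-IDF's `esp_image_header_t` (magic `0xE9`, flash mode in byte 2, flash size and frequency packed into byte 3, and a "hash appended" flag in the last byte); the lookup tables cover the common ESP32 values only, and `decode_image_header` is a hypothetical name.

```python
import struct

# Assumed encodings for the ESP32 (other chips may differ).
FLASH_MODES = {0: "qio", 1: "qout", 2: "dio", 3: "dout"}
FLASH_SIZES = {0: "1MB", 1: "2MB", 2: "4MB", 3: "8MB", 4: "16MB"}
FLASH_FREQS = {0x0: "40m", 0x1: "26m", 0x2: "20m", 0xF: "80m"}

HEADER_LEN = 24
IMAGE_MAGIC = 0xE9


def decode_image_header(data):
    """Decode flash settings from the start of a bootloader or app image."""
    if len(data) < HEADER_LEN or data[0] != IMAGE_MAGIC:
        raise ValueError("not an ESP image header")
    _magic, segments, mode, size_freq = struct.unpack_from("<BBBB", data, 0)
    size_code = size_freq >> 4    # high nibble: flash size
    freq_code = size_freq & 0x0F  # low nibble: flash frequency
    return {
        "segments": segments,
        "flash_mode": FLASH_MODES.get(mode, hex(mode)),
        "flash_size": FLASH_SIZES.get(size_code, hex(size_code)),
        "flash_freq": FLASH_FREQS.get(freq_code, hex(freq_code)),
        # When set, a SHA-256 digest of the image follows the checksum, so
        # patching the header bytes afterwards would invalidate that digest.
        "hash_appended": data[HEADER_LEN - 1] == 1,
    }
```

Running this over the first 24 bytes of a `.bin` file shows which settings were baked in at compile time and whether a digest was appended.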

@marchingband
Author

marchingband commented Dec 30, 2023

It would be useful for me, and I imagine many others using OTA in commercial projects, to have a precise understanding of which version of Arduino-ESP32 breaks OTA compatibility, if indeed that is a real thing, and if there are steps required to work around, so we can continue to enjoy new versions in Arduino IDE or Arduino CLI.

My assumption has always been that a bad binary would (worst case scenario) trigger a rollback. This crash I am experiencing bricks the device, which is fairly terrifying.

@lbernstone
Contributor

lbernstone commented Dec 31, 2023

The platformio-build-esp32.py file is the best place to find the build version. To find what it is with your gold image, extract it from a working device with esptool (read_flash is a subcommand, not a flag): ./esptool.py read_flash 0x1000 0x140000 /tmp/dumpfile.img
You actually only need the first 4k of that dumpfile for the version. Open it in just about any editor, and you will see the plaintext version mixed into the header.
I assume @SuGlider asked about whether you have some in hand to be sure you can recreate the issue on another device.
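
Instead of reading the dump in an editor, the version strings can be pulled out programmatically. This is a minimal sketch, assuming the application description block follows ESP-IDF's `esp_app_desc_t` layout as I understand it (a little-endian magic word `0xABCD5432` followed by fixed-width, NUL-padded strings). It scans the whole dump for the magic word rather than relying on a fixed offset; `find_app_descriptions` is a hypothetical helper name.

```python
import struct

# Magic word that opens the app description block (assumed from esp_app_desc_t).
APP_DESC_MAGIC = struct.pack("<I", 0xABCD5432)

# magic, secure_version, reserved[2], version, project_name,
# time, date, idf_ver, app_elf_sha256
APP_DESC_FMT = "<II8s32s32s16s16s32s32s"
APP_DESC_LEN = struct.calcsize(APP_DESC_FMT)


def _cstr(raw):
    """Decode a NUL-padded fixed-width field into a Python string."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def find_app_descriptions(blob):
    """Return every app description found in a flash dump or .bin image."""
    results = []
    pos = blob.find(APP_DESC_MAGIC)
    while pos != -1:
        if pos + APP_DESC_LEN <= len(blob):
            fields = struct.unpack_from(APP_DESC_FMT, blob, pos)
            _magic, _secure, _resv, version, project, _time, date, idf_ver, _sha = fields
            results.append({
                "offset": pos,
                "version": _cstr(version),
                "project": _cstr(project),
                "date": _cstr(date),
                "idf_ver": _cstr(idf_ver),
            })
        pos = blob.find(APP_DESC_MAGIC, pos + 1)
    return results
```

To use it on the dump from the command above, read the file with `open("/tmp/dumpfile.img", "rb").read()` and pass the bytes to `find_app_descriptions`. A chance occurrence of the magic bytes elsewhere in the dump would produce a spurious entry, so treat each result as a candidate and sanity-check the strings.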

@marchingband
Author

marchingband commented Dec 31, 2023

@lbernstone is there meant to be a guarantee that binaries will be compatible for OTA across arduino-esp32 versions? If not is there documentation on which are, or are not, compatible?
Unfortunately I am not certain that I have a device from every production run, I have been incrementally updating Arduino-esp32, assuming compatibility.

@lbernstone
Contributor

This product is LGPL- there are no guarantees on suitability for anything. That said, upgrades from one version to a newer one are expected to work. I understand your frustration, and we will help you identify the problem, but you are going to have to provide the time and resources to troubleshoot it.
What you need is a reproducible process that gets you to the brick state. If you can flash an image that is functional, and then an image that consistently bricks the device when upgraded via OTA, that will go a long way towards a solution. If we can't see it and make some changes to try to mitigate it, we can't fix it.

@marchingband
Author

marchingband commented Jan 1, 2024

@lbernstone thanks. If I can isolate a bug I will definitely make a proper bug report. Right now I am just looking for information.

It sounds like nobody who has responded yet has experienced this, or is aware of any known incompatibility issues for OTA. It also doesn't sound like there are any best practises on the subject that I am not aware of, or not implementing currently.

I would have guessed that changing the default partitions.csv file would be a breaking change for OTA. Is this known to not be the case?

I.e., if a newer version fails to find a coredump partition, or any other newly added partition, is it able to proceed, or will it crash? And if it does crash, will it try to trigger a rollback?

@lbernstone
Contributor

lbernstone commented Jan 1, 2024

OTA only works on the existing partitioning. If you change the size of the OTA, it will either be too small to fit the new image, which will give you an error, or it is even larger, and fits easily.
Read up on OTA. What seems likely to be happening here is that the new image starts, but for some reason fails. The OTA attempts to roll back, but for some reason the old firmware is corrupted (something has changed in the nvs?). Since both OTA partitions are invalid, it ends up in panic.
It will be useful to know what happens if you wipe the otadata or nvs partitions.
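
Before wiping otadata, it can help to dump it and look at what the bootloader would see. The sketch below is a simplified model, not the bootloader's actual code. It assumes the otadata layout as I understand it from ESP-IDF (two 32-byte `esp_ota_select_entry_t` records, one at the start of each 4 KB sector, each holding a sequence number, a state, and a CRC32 over the sequence number), and it assumes the CRC convention used here matches the device's. Function names are hypothetical.

```python
import binascii
import struct

SECTOR = 0x1000         # each otadata entry sits at the start of a flash sector
ENTRY_FMT = "<I20sII"   # ota_seq, seq_label, ota_state, crc (32 bytes)
OTA_STATES = {
    0: "new", 1: "pending_verify", 2: "valid",
    3: "invalid", 4: "aborted", 0xFFFFFFFF: "undefined",
}


def seq_crc(seq):
    """CRC32 over the 4-byte sequence number, seeded with 0xFFFFFFFF (assumed)."""
    return binascii.crc32(struct.pack("<I", seq), 0xFFFFFFFF) & 0xFFFFFFFF


def decode_entry(raw):
    """Decode one otadata entry and check its CRC."""
    seq, _label, state, crc = struct.unpack_from(ENTRY_FMT, raw, 0)
    return {
        "seq": seq,
        "state": OTA_STATES.get(state, hex(state)),
        "crc_ok": seq != 0xFFFFFFFF and crc == seq_crc(seq),
    }


def select_boot_slot(otadata, num_ota_slots=2):
    """Return the OTA slot index that would be chosen, or None if no entry is usable."""
    entries = [decode_entry(otadata[i * SECTOR:]) for i in range(2)]
    usable = [e for e in entries
              if e["crc_ok"] and e["state"] not in ("invalid", "aborted")]
    if not usable:
        return None  # erased or corrupt otadata: no OTA slot is selected
    newest = max(usable, key=lambda e: e["seq"])
    return (newest["seq"] - 1) % num_ota_slots
```

If both entries decode as unusable, that would be consistent with the "both OTA partitions are invalid" scenario described above; an erased otadata partition (all `0xFF`) also yields `None` in this model.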

@20162026

20162026 commented Jan 1, 2024

In general, firmware compiled with a newer IDF should work with an older bootloader/partition scheme, assuming your image does not exceed the partition limits and the Arduino folks don't make drastic changes to the bootloader config.
However, recently there was a bug in IDF nvs that broke the wifi configuration and required erasing the nvs partition to get wifi working again (espressif/esp-idf#12829). Either way, I would strongly advise freezing at a specific git commit rather than shipping a random version with every device/batch.

@marchingband
Author

marchingband commented Jan 2, 2024

If I switch back to Arduino-cli, I seem to be able to get things working again.
It seems that my issues may come from flashing the board from PlatformIO.
Boards that were flashed using Arduino-cli accept binaries produced with PIO, but boards that were flashed with PIO do not accept binaries from Arduino-cli.
I have ended up in a corrupt PIO state before, so it is possible this is my doing somehow.
I used the technique above to determine that I was originally using arduino-esp32 v2.0.1 (esp-idf 4.4.0 is at the top of the binary).
I will continue to run some regression tests for the sake of PIO users out there.
If I use pio run -t clean periodically, does this mean I will be in a good state? Or do I need to delete PIO completely and start from scratch to be certain?
Also, are there any known issues with PIO that would explain this? Small differences in the bootloader, perhaps? I am on PlatformIO Core, version 6.1.11.

@marchingband
Author

After more testing, the issue is entirely with PIO. Infuriating. Sorry, I will close this.


5 participants