-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: config file corruption due to filesystem size #4184
Comments
I might (hopefully will) look into this in a week or two. |
This same corruption probably exposed #4167 a couple of days ago. |
hmm this is less that perfect (though quite possibly unrelated to the problem):
We could eliminate this window of risk by renaming the file.new to be file.good, then remove file, then rename file.good to be filename (a 3 stage commit). Then at load time if we ever see a file.good existing, we know that we lost power during that window and file.good should be used instead of file (and file should be deleted at that point. But this might not actually be the bug, so I'll wait until I look into this and somehow make a reproducable failure. |
I believe I am seeing the same issue on a RAK19007/RAK4631 (2 of them actually). I believe the issue is specific to nRF builds. I have roughly 10 devices (all running the same version - 2.3.13) and only the two nRF based modules have run into this. This board is for use in a solar node (bought the second to troubleshoot and eliminate hardware), when connected via USB-C to a host (capturing serial log), the issue does not occur, but once I close it up and place it outside, the issue occurs after some seemingly random amount of time. I don't believe it is the battery going low (causing a reset), when I reattach the device, the battery is general around 90%, but appears to have experienced a factory-reset. I suspected corruption, but have not been able to capture a log of it occurring. Open to suggestions - considered connecting an ESP32 (on a separate power source) to the serial lines and using it as a uart->wifi bridge, I suspect this is the only way I'll capture it as I have had it occur 6 times as described above, and never when attached via USB-C. |
@ImprobableStudios thank you for your report. I'm eager to help look into this but alas I'm doing powermon/structured logging/power measurements for another weekish. @thebentern and I have been talking about ideas related to this but I don't think anyone has really dove into this yet. |
and some copypasta from our chat so I don't forget it: a slight reformatting of the list of options I could think of (ya'll can think of others possibly!): it seems to me if there is a LFS corruption bug there are four robust possible fixes:
I think options 1, 3 or 4 is what I'd probably lean towards (though I haven't spent any time getting a reproducable case or looking at the implementation). I'm leaning towards quickly implementing 3 and then constructing a test case so that 1 could be acheived. |
Heh, funny... Before reading your post, I was also thinking something 2-like would be a more hardened solution, especially if you unmounted it when not in use. My thinking was store a backup of the core-config files in that partition when the config is updated leaving the NodeDb and other collected data to be wiped in the case of a failure. This, in theory, would be pretty solid as long as the issue isn't something overwriting the partition boundaries or something nasty like that. After seeing the issue I did check and saw that nRF builds use a different underlying file system than ESP, which would help to give credence to this being an issue in the FS itself.... while I agree option 1 would be the absolute best - I also think it will be the most difficult... Something else this makes me think about that would be nice to have is a system-telemetry set for events in the system. I design high-end sensors (think $3000 IMUs) and recently built something similar into the next generation. Here it would require PB work, etc., but keep a small telemetry partition - does not need a file system just a set of 32-bit (?) values for different events such as watchdog reset, radio errors, file-system errors, overall "uptime" since last factory-reset, etc. Would allow for establishing how well a node is performing before sticking it someplace inaccessible... Flash wear is the problem with it and why I think it is something that should only be on for test and node/fw-build eval... I might have to put some thought into that. |
One potential caveat to option 2 is that we'll have to explore is where the BLE bonds are stored by BlueFruit, as restoring the core configuration is only half of the experience. A failed bluetooth auth flow requiring the user to forget and re-pair the device to their phone will still be an issue. |
Hmm - the adafruit fork of littlefs is quite old (last commit to their lfs.c was five years ago and the main littlefs project has many fixes which are not in the adafruit nrf52 tree... hmm... (just adding this link for later reference: https://github.com/littlefs-project/littlefs ) I still think one of the first steps is to make a stress test that shows the problem, so that we can confirm we fixed it (after fixing whatever it is ;-)). Also it might be that we need to be very careful to partition our BLE operations (in particular the FromNum notify call) from other LFS operations? Also see these interesting comment re shared directory structures: adafruit/Adafruit_nRF52_Arduino#372 (comment) |
Connecting to the public MQTT can cause the issue as well. |
Some other misc comments for when someone gets to looking at this:
|
btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done. |
oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V. |
imo this is definitely it. over the next couple of days I'll add some instrumentation (both with the oscope and in sw to print vcc voltage via a ADC). Then I'll figure out why our brown-out detector didn't kill the processor before we ever reached this flash write. Possibly the voltage is barely above the (current) brown-out threshold and the higher current draw of doing the flash writes is enough to drop V below it. Thus almost guaranteeing flash corruption. @garth in my case I was working on getting system-off state working and in that test on the way to deep sleep I'm doing the FS writes (essentially the same code as the normal reboot case). But I'm doing it while hooked to a programmable bench supply that was repeating the test at a series of voltages. Once at 3.3Vish things start to go bad (because the LDO on the rak board requires a min 3.3V input to meet spec - I bet it's output voltage to the processor in that case begins to get real-crummy real-fast). I'll know better once I can read that voltage in sw and with a blue wire. |
Important Note: BatmansWang on discord reports having corruption without low batteries so there might be something else going on (possibly not related to voltage, possibly still related - tbd) |
Loss of config due to filesystem size limits(Any comments/advice on the notes below are apprecated!) I think I found another (far more common in the field probably) cause of filesystem corruption (and caused this in my test config). If you are in an area that can see a lot of nodes it is possible for your local node DB to reach the maximum size of 100 nodes. This means that the final size of db.proto on the disk would be 16600 bytes + a few kb for the other stuff in deviceState. Which is a bummer because on the nrf52 our filesystem size is only 28672. We always write pref files in an atomic fashion (first writing the new file then deleting the old version). Since 2x16600 it guarantees in that case that the write will fail (because bigger than the filesystem). This causes two failures - one less serious and the other which can cause loss of all device state (even at less than 100 nodes). case 1: The less serious case (but still probably catastrophic)
case 2: The more serious case:
For reference here's some current file sizes as my node was slowly learning the mesh (in a busy area that can certainly see 100 nodes)
Fix descriptionI've partially completed the fixes for this 'non low power related' fault. Summary of the fixes I'm making:
docs for SafeFile
|
* Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem)
* Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem)
* generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32)
ok all the stuff needed for nrf52 is done. Tomorrow I'll test on esp32 (and check esp32 filesystem size) then send in the PR. |
* Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem) * generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32) * If we have to format the FS, make sure to preserve the oem.proto if it exists
* Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem) * generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32) * If we have to format the FS, make sure to preserve the oem.proto if it exists
Test results on ESP32ESP32 very unlikely to exhibit this bug, because the filesystem there is huge and empty.
|
* Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem) * generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32) * If we have to format the FS, make sure to preserve the oem.proto if it exists Co-authored-by: Ben Meadors <[email protected]>
* bug #4184: fix config file loss due to filesystem write errors * Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem) * generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32) * If we have to format the FS, make sure to preserve the oem.proto if it exists * add logLegacy() so old C code in libs can log via our logging * move filesList() to a better location (used only in developer builds) * Reformat with "trunk fmt" to match our coding conventions * for #4395: don't use .exists() to before attempting file open If a LFS filesystem is corrupted, .exists() can fail when a mere .open() attempt would have succeeded. Therefore better to do the .open() in hopes that we can read the file (in case we need to reformat to fix the FS). (Seen and confirmed in stress testing) * for #4395 more fixes, see below for details: * check for LFS assertion failures during file operations (needs customized lfs_util.h to provide suitable hooks) * Remove fsCheck() because checking filesystem by writing to it is very high risk, it makes likelyhood that we will be able to read the config protobufs quite low. * Update the LFS inside of adafruitnrf52 to 1.7.2 (from their old 1.6.1) to get the following fix: littlefs-project/littlefs@97d8d5e * use disable_adafruit_usb.py now that we are (temporarily?) using a forked adafruit lib We need to reach inside the adafruit project and turn off USE_TINYUSB, just doing that from platformio.ini is no longer sufficient. Tested on a wio-sdk-wm1110 board (which is the only board that had this problem) --------- Co-authored-by: Ben Meadors <[email protected]>
Category
Other
Hardware
Other
Firmware Version
2.3.14.c67a9dfe
Description
I bet it is not wm1110 specific. Occurred while doing hundreds of power cycles.
I bet the best way to find/fix it is to turn off our "if config read fails or is corrupted completely factory reset and try again". Instead we should spin inside the ICE debugger.
Notes from chat:
Relevant log output
No response
The text was updated successfully, but these errors were encountered: