-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SUPERCEDED by #397] Guard against race conditions in flash FS cache #372
Conversation
Thanks @henrygab for your superb works. Though I am currently in the middle of other works, need to wrap it up asap since I am blocking other people works. I will surely re-check/review all of your excellent work on LFS later on. |
ea2ba7c
to
6704e33
Compare
Thanks @henrygab for your effort and patient :) . I am pull this PR to test locally with https://gist.github.com/henrygab/8fa35df56889d64b240e4f64d164f731 . I will do an overall review to this patch and the whole LFS as well. The current version used by the repo is 1.6, https://github.com/ARMmbed/littlefs LFS is currently at 2.1 . Maybe we could update it as well. |
I would delay the LFS 1.6 to 2.x update for at least two reasons:
It's of course your decision to make, I simply offer my thoughts based on past experiences. Regardless, I'm thrilled to know this sporadic corruption bug might finally be fixed, and look forward to seeing this PR finally integrate, closing the book on this series of reported issues! 🎉 |
Thanks for your opinion, I will make sure the stress test passed with current 1.6 version first. Then do some testing with 2.x . We will upgrade to 2.x (or any later version anyway). I am not really sure how long it takes until I could mange the time to come to this issue now. So I will try to see if I could get the best for this week :). I write a test sketch and submit it as PR to your repo folk henrygab#2 , please check it out. As you recommendation, it uses 4 threads with different priority: loop (low), normal, high, highest. With small enough delay each run for task to preempt. It seems to lead to deadblock. I will do more tests. |
forgot the log at Level 2
|
Based on my below comments, I am recommending to accept this PR as-is.
This is great! The fact that the output indicates blocked parallel write outputs validates that this fixes one cause of corruption! I have not been able to reproduce a deadlock. However, due to the output having an extremely high rate of bad block warnings, I dug deeper. Then I discovered the following two notes:
To summarize... LFS is not re-entrant. LFS is designed to only be used from a single thread / task at a time. More specific rules:
Put another way, it appears LFS has a shared-read / exclusive-write model. If I understand correctly, the softdevice uses LFS, at least to store BLE bonding data. Sketches also want to store data. Thus, it appears that LFS entry points may need a mutex, similar to what was done here, if the goal is to allow multi-threaded concurrent LFS calls. Even understanding the above, this PR prevents one layer of corruption, and it's unlikely that a sketch & the softdevice will both write to the same file. |
Finally, I recommend delaying before updating to v2.x ... it appears they are similarly tracking a power-failure corruption bug that is likely new to v2.x. I've opened a new issue to cover the need to serialize LFS itself. |
// Note: default loop() is running at LOW | ||
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal"); | ||
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal"); | ||
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use threadname as filename for each thread to write, we should name them differently even though they share the priority. Maybe n1, n2, n3. I will made another PR to update the sketch
@henrygab I push another PR to your branch here here is the log, there is 66 writes, and they are all visible from thread's file, not one missing. This looks good enough, I will let the sketch running for longer period to see if there is any issues. Else I think we can safely merge this. Current scenario Bluefruit lib using LFS to store bond in its own subfolder "/adafruit/bond_prph" and "/adafruit/bond_cntr", and this isn't always read/write. It should relatively safe to use together with user sketch, unless yeah for power corruption issue. |
After lots of testing, there is still issue running the stresstest over long period of time (600 seconds) with 2000 writes occasionally. Though, it could be due to lfs internal issue, I am not sure. But it is much more reliable than ever. Thank you very much @henrygab for your great work. |
int _internal_flash_cache_read (flash_cache_t* fc, void* dst, uint32_t addr, uint32_t count); | ||
void _internal_flash_cache_flush (flash_cache_t* fc); | ||
|
||
static inline void _internal_EnsureFlashCacheSemaphoreInitialized() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am doing mostly testing and tweaking so far. Everything is good, though this seems to be over complicated. I will add an Static semaphor struct, it will guarantee an mutex creation is always succeeded in my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, forgot to delete this one
Happy to see it working well. BTW, the reason it's now stable is not so much because the threads were all writing to the root directory per se, but rather that they were all writing to the same directory. At least in LFS 1.6 (and FAT12, FAT16, FAT32, and exFAT), the metadata for each directory tends to be in its own cluster / sector / allocation unit. While there are still shared metadata updates when creating the directories in the root, and when allocating additional pages when the file grows to a new page, the test sketch is no longer likely to cause LFS to simultaneously attempt to update the same internal state, as each file has its own cluster to play in. There is absolutely still a lingering potential problem if two tasks update the file system, as LFS itself is not serialized. However, it seems if a sketch can apply the following guidance, it might reduce risk of corruption in LFS itself:
This should further improve reliability, until (if) the LFS upper layer can be serialized also. |
@henrygab henrygab#4 use static semaphore create which is always succedded. This PR is as good as I could hope, then I think it makes sense to move the mutex into |
Will look soon! |
Take your time, I am always having overdue work to tackle 🤓🤓 |
@hathach -- I realized you indicated this flash cache layer may also be used in other project. If that is true, then even though these changes will not be necessary for the InternalFS on nRF52, perhaps it would be worthwhile to merge these changes anyways, in case any other AdaFruit project wants to use this flash cache layer where it's not guaranteed that higher-level code will synchronize? What are your thoughts on that? |
This is not intended as final production code, but rather for use to validate if the flash code is being pre-empted by another task, which is then also entering one of the flash functions. If so, this could corrupt state.
Also rename serialization functions to avoid collisions with client code.
This allows use of Visual Studio Code, without having to continually exclude the .vscode directory from commits.
This allows internally flushing the cache in write(), without use of a recursive mutex.
c36a8a1
to
2d44af0
Compare
@hathach -- If you use this caching layer for any other adafruit project, I recommend you take these changes, in case other projects do not synchronize the cache layer. Otherwise, it may be OK to close this one entirely now. |
@henrygab thank for reminding, close this for now |
Serialize all the flash_cache read/write/flush functions.
This is necessary because these functions may be called by multiple tasks simultaneously, and because these functions use and modify shared state.
@hathach, @ladyada -- This change compiles, but I recommend your validation, as I have been unable to do so myself. That said, this has a very strong likelihood of resolving the file system corruption issues. Testing should include repeated BLE pairing and main sketch file system use.
After additional testing, this PR should fix the following issues:
Fix #350
Fix #325
Fix #227
Fix #222