
[SUPERSEDED by #397] Guard against race conditions in flash FS cache #372

Closed
wants to merge 20 commits

Conversation

henrygab
Collaborator

@henrygab henrygab commented Nov 4, 2019

Serialize all the flash_cache read/write/flush functions.

This is necessary because these functions may be called by multiple tasks simultaneously, and because these functions use and modify shared state.

@hathach, @ladyada -- This change compiles, but I recommend your validation, as I have been unable to do so myself. That said, this has a very strong likelihood of resolving the file system corruption issues. Testing should include repeated BLE pairing and main sketch file system use.

After additional testing, this PR should fix the following issues:
Fix #350
Fix #325
Fix #227
Fix #222

@hathach
Member

hathach commented Nov 5, 2019

Thanks @henrygab for your superb work. I am currently in the middle of other work that I need to wrap up ASAP, since I am blocking other people's work. I will surely re-check/review all of your excellent work on LFS later on.

@henrygab henrygab changed the title Guard against race conditions in flash FS cache [WIP] Guard against race conditions in flash FS cache Dec 4, 2019
@hathach
Member

hathach commented Dec 11, 2019

Thanks @henrygab for your effort and patience :). I am pulling this PR to test locally with https://gist.github.com/henrygab/8fa35df56889d64b240e4f64d164f731 . I will do an overall review of this patch and of the whole LFS as well. The version currently used by the repo is 1.6; upstream LFS (https://github.com/ARMmbed/littlefs) is currently at 2.1. Maybe we could update it as well.

@henrygab henrygab changed the title [WIP] Guard against race conditions in flash FS cache Guard against race conditions in flash FS cache Dec 12, 2019
@henrygab
Collaborator Author

... The current version [of LFS] used by the repo is 1.6, https://github.com/ARMmbed/littlefs LFS is currently at 2.1 . Maybe we could update it as well.

I would delay the LFS 1.6 to 2.x update for at least two reasons:

  1. Changing the cache layer at the same time as a major file system revision raises the risk of a bug (e.g., due to new requirements or new LFS behavior causing new interactions), and if such a bug does surface, it will be harder to know whether it would have existed without the 2.0 LFS. Changing one thing at a time helps track down problems.

  2. LFS v1.x and LFS v2.x are not compatible, so some thought on how to handle the transition is recommended. Do you include the migration code by default, to preserve data on existing media? Or do you save 11% of memory by skipping it and forcing a reformat / running an intermediate sketch that simply upgrades the FS version? Or something else entirely?

It's of course your decision to make, I simply offer my thoughts based on past experiences.

Regardless, I'm thrilled to know this sporadic corruption bug might finally be fixed, and look forward to seeing this PR finally integrate, closing the book on this series of reported issues! 🎉

@hathach
Member

hathach commented Dec 12, 2019

... I would delay the LFS 1.6 to 2.x update for at least two reasons:
[... snip!]

Thanks for your opinion. I will make sure the stress test passes with the current 1.6 version first, then do some testing with 2.x. We will upgrade to 2.x (or a later version) eventually. I am not really sure how long it will take until I can manage the time to come back to this issue, so I will try to see if I can do my best this week :).

I wrote a test sketch and submitted it as a PR to your repo fork, henrygab#2 -- please check it out.

Per your recommendation, it uses 4 threads with different priorities: loop (low), normal, high, highest, with a small enough delay each run for tasks to preempt one another. It seems to lead to deadlock. I will do more tests.

https://gist.github.com/hathach/859d228204c97f130359b391d2f1ea76#file-internal_stresstest-ino

@hathach
Member

hathach commented Dec 12, 2019

Forgot the log at Level 2:

Internal Stress Test Example
Formatting ... Done
Task normal writing ...
Task high writing ...
Task normal writing ...
Task high writing ...
[IFLASH] Blocked parallel write attempt ... waiting for mutex
Task highest writing ...
Task high writing ...
Task highest writing ...
[IFLASH] Blocked parallel write attempt ... waiting for mutex
Task normal writing ...
Task highest writing ...
[IFLASH] Blocked parallel write attempt ... waiting for mutex
Task high writing ...
Task highest writing ...
[IFLASH] Bl

@henrygab
Collaborator Author

Based on my comments below, I recommend accepting this PR as-is.

forgot the log at Level 2
[... snip!]
[IFLASH] Blocked parallel write attempt ... waiting for mutex

This is great! The fact that the output shows blocked parallel write attempts validates that this fixes one cause of corruption!

I have not been able to reproduce a deadlock. However, because the output showed an extremely high rate of bad-block warnings, I dug deeper and discovered the following two notes:

To summarize: LFS is not re-entrant. LFS is designed to be used from only a single thread / task at a time. More specific rules:

  • LFS allows multiple concurrent readers of a single file.
  • LFS allows a single writer of a file (with no concurrent readers of that file)
  • [conjecture] LFS allows multiple concurrent read-only metadata operations (directory enumeration, file open)
  • [conjecture] LFS allows a single metadata modification operation (metadata reads must wait)

Put another way, it appears LFS has a shared-read / exclusive-write model.

If I understand correctly, the softdevice uses LFS, at least to store BLE bonding data. Sketches also want to store data. Thus, it appears that LFS entry points may need a mutex, similar to what was done here, if the goal is to allow multi-threaded concurrent LFS calls.

Even with the above understood, this PR prevents one layer of corruption, and it's unlikely that a sketch and the softdevice will both write to the same file.

@henrygab
Collaborator Author

henrygab commented Dec 12, 2019

Finally, I recommend delaying before updating to v2.x ... it appears they are similarly tracking a power-failure corruption bug that is likely new to v2.x.

I've opened a new issue to cover the need to serialize LFS itself.

// Note: default loop() is running at LOW
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal");
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal");
Scheduler.startLoop(loop, 1024, TASK_PRIO_NORMAL, "normal");
Member

I use the thread name as the filename for each thread to write, so we should name them differently even though they share the same priority. Maybe n1, n2, n3. I will make another PR to update the sketch.

@hathach
Copy link
Member

hathach commented Dec 13, 2019

@henrygab I pushed another PR to your branch here:
henrygab#3 . It is pretty reliable now when each thread writes to its own directory. I ran a mini test for 20 seconds with a 100 ms delay. There are no bad/corrupted blocks as before, although there is a bunch of waiting for the mutex. Possibly writing concurrently to the root folder was what corrupted it.

Here is the log: there are 66 writes, and they are all visible in each thread's file, not one missing.
https://gist.github.com/hathach/7c3fad9b2119b90e5acd99175ac0b6fe

This looks good enough. I will let the sketch run for a longer period to see if there are any issues; otherwise I think we can safely merge this. In the current scenario, the Bluefruit lib uses LFS to store bonds in its own subfolders "/adafruit/bond_prph" and "/adafruit/bond_cntr", and these are not constantly read/written. It should be relatively safe to use together with a user sketch, except, yeah, for the power-failure corruption issue.

@hathach
Member

hathach commented Dec 13, 2019

After lots of testing, there is still an occasional issue running the stress test over a long period of time (600 seconds, ~2000 writes). It could be due to an LFS-internal issue, I am not sure, but it is much more reliable than ever. Thank you very much @henrygab for your great work.

int _internal_flash_cache_read (flash_cache_t* fc, void* dst, uint32_t addr, uint32_t count);
void _internal_flash_cache_flush (flash_cache_t* fc);

static inline void _internal_EnsureFlashCacheSemaphoreInitialized()
Member

I am mostly doing testing and tweaking so far. Everything is good, though this seems over-complicated. In my PR I will add a static semaphore struct, which guarantees that mutex creation always succeeds.

Member

@hathach hathach left a comment

Great, forgot to delete this one

@henrygab
Collaborator Author

After lots of testing, there is still an occasional issue running the stress test over a long period of time (600 seconds, ~2000 writes). It could be due to an LFS-internal issue, I am not sure, but it is much more reliable than ever. Thank you very much @henrygab for your great work.

Happy to see it working well. BTW, the reason it's now stable is not so much that the threads are no longer writing in the root directory per se, but that they are no longer all writing in the same directory. At least in LFS 1.6 (and in FAT12, FAT16, FAT32, and exFAT), the metadata for each directory tends to live in its own cluster / sector / allocation unit. While there are still shared metadata updates when the directories are created in the root, and when additional pages are allocated as a file grows, the test sketch is no longer likely to cause LFS to simultaneously attempt to update the same internal state, since each file has its own cluster to play in.

There is absolutely still a lingering potential problem if two tasks update the file system, as LFS itself is not serialized. However, if a sketch can apply the following guidance, it might reduce the risk of corruption in LFS itself:

  1. Before initializing BT LE or creating any tasks that would use the file system:
    a. Initialize the internal file system.
    b. Create a distinct subdirectory for each task that might simultaneously use the filesystem.
    c. Where possible, pre-allocate the space for each file (e.g., seek to size needed and write a single byte)
    d. Flush() to ensure it's all stored on the flash
  2. Then, only after the subdirectories are created and files pre-allocated, initialize BT LE

This should further improve reliability, at least until the LFS upper layer can also be serialized.

@hathach
Member

hathach commented Dec 16, 2019

@henrygab henrygab#4 uses static semaphore creation, which always succeeds. This PR is as good as I could hope for. I then think it makes sense to move the mutex into flash_cache_t to make the code more generic. Also, the lack of an init() is a bit troublesome: having the API check the semaphore on every call wastes cycles --> add a flash cache init. Please let me know what you think of the changes; I am open to any suggestions.

@henrygab
Collaborator Author

Will look soon!

@hathach
Member

hathach commented Dec 18, 2019

Take your time; I always have overdue work to tackle 🤓🤓

@henrygab henrygab mentioned this pull request Dec 20, 2019
@henrygab henrygab changed the title Guard against race conditions in flash FS cache [SUPERSEDED?] Guard against race conditions in flash FS cache Dec 20, 2019
@henrygab henrygab changed the title [SUPERSEDED?] Guard against race conditions in flash FS cache [SUPERSEDED by #397] Guard against race conditions in flash FS cache Dec 20, 2019
@henrygab
Collaborator Author

@hathach -- I recall you indicated this flash cache layer may also be used in other projects. If that is true, then even though these changes will not be necessary for the InternalFS on nRF52, perhaps it would be worthwhile to merge them anyway, in case any other Adafruit project wants to use this flash cache layer somewhere it is not guaranteed that higher-level code will synchronize?

What are your thoughts on that?

This is not intended as final production code, but rather for use
to validate if the flash code is being pre-empted by another task,
which is then also entering one of the flash functions.  If so,
this could corrupt state.
Also rename serialization functions to avoid collisions with client code.
This allows use of Visual Studio Code, without having to
continually exclude the .vscode directory from commits.
@henrygab
Collaborator Author

henrygab commented Jan 6, 2020

@hathach -- If you use this caching layer in any other Adafruit project, I recommend you take these changes, in case the other projects do not synchronize access to the cache layer. Otherwise, it may be OK to close this one entirely now.

@hathach
Member

hathach commented Jan 7, 2020

@henrygab thanks for the reminder, closing this for now
