Internal assert on corrupted filesystem caused boot loop #390

Closed
vonnieda opened this issue Mar 4, 2020 · 4 comments


vonnieda commented Mar 4, 2020

LFS: v1.3 9ee112a
Also reproduced with LFS v1.7.2

I have a filesystem that I believe became corrupt during a power cycle. When the device next booted, it went into a boot loop due to an internal assert:

Assertion failed: (block < lfs->cfg->block_count), function lfs_cache_read, file littlefs/lfs.c, line 18.

The assert is this one:

LFS_ASSERT(block < lfs->cfg->block_count);

The corruption seems to have resulted in a stored block number of 1001106, while the disk is only 1024 blocks long.

I can supply the filesystem image privately if that's helpful. It contains proprietary information, so I am not able to post it publicly. I could probably post just the bad block publicly.

I'm not sure how the corruption happened in the first place, but my bigger concern is that lfs_cache_read asserts in this situation instead of returning an error code. If it returned an error code, I could catch it and try to reformat the filesystem.

With the asserts disabled, I am able to get an error code and recover by reformatting instead of crashing: the function tries to read the out-of-range block, my flash subsystem returns an error, and the error bubbles up.
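
For reference, the read hook does something like this (just a sketch; flash_read stands in for the real flash driver):

#include "lfs.h"

// Block-device read hook that rejects out-of-range blocks, so a
// corrupted block pointer surfaces as an error instead of a wild read.
int bd_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size) {
    if (block >= c->block_count) {
        return LFS_ERR_CORRUPT;
    }
    // flash_read is hypothetical: read `size` bytes at a raw offset
    return flash_read(block * c->block_size + off, buffer, size);
}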

So, I have a couple questions:

  1. Are the asserts intended to be used in production, or just debugging?
  2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?

Thanks,
Jason

geky (Member) commented Mar 20, 2020

Hi @vonnieda, sorry for such a late response. Hopefully it's still useful to have your questions answered.

> 1. Are the asserts intended to be used in production, or just debugging?

The asserts in littlefs are used to indicate a situation that should not happen. littlefs isn't programmed to behave itself if an assert fails, so the assert provides a last-minute safeguard to prevent it from corrupting itself.

The asserts should not normally trigger; however, there are still enough bugs in littlefs that seeing an assert failure is not uncommon. I'm working on improving this here: #372, but it's built on the v2 branch.

I would suggest keeping the asserts on if you can afford the code-space in production, since there's no other reason to disable them.
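
For context, the asserts all route through the LFS_ASSERT macro in lfs_util.h, which compiles away when LFS_NO_ASSERT is defined, roughly:

// from lfs_util.h (paraphrased): runtime assertions
#ifdef LFS_NO_ASSERT
#define LFS_ASSERT(test)
#else
#define LFS_ASSERT(test) assert(test)
#endif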

One big thing that's improved in #372 is how littlefs responds to corrupted disks. Currently, littlefs asserts on a corrupted disk, but #372 changes these asserts into returned errors, giving users a chance to recover (the block < block_count assert falls into this category).

Unfortunately, there aren't any plans to backport #372 to v1.

> 2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?

It's the second option. If an assert fails, you can't expect littlefs to work afterwards.

However, one thing you can do is replace the assert with a condition that returns an error:

- LFS_ASSERT(block < lfs->cfg->block_count);
+ if (block >= lfs->cfg->block_count) {
+     return LFS_ERR_CORRUPT;
+ }

This should work in more cases and cause littlefs to propagate the error upwards.

It's important to note you still can't expect littlefs to work in this situation.

This is ok:

int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
if (err == LFS_ERR_CORRUPT) {
    assert(lfs_unmount(&lfs) == 0);
    assert(lfs_format(&lfs, &cfg) == 0);
    assert(lfs_mount(&lfs, &cfg) == 0);
    assert(lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY) == 0);
}

This is not a good idea and will likely break later in surprising ways:

int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
if (err == LFS_ERR_CORRUPT) {
    assert(lfs_file_open(&lfs, &file, "other_path", LFS_O_RDONLY) == 0);
}

If you have threads, another option is killing the thread inside the assert and restarting the "process".
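
A rough sketch of that idea, assuming you can substitute your own LFS_ASSERT definition in lfs_util.h (the jump buffer and supervising task here are hypothetical):

#include <setjmp.h>
#include "lfs.h"

// hypothetical recovery point, set by the supervising task
static jmp_buf lfs_panic_jmp;

// substitute for the stock LFS_ASSERT: jump back to the supervisor
// instead of aborting; littlefs's state is NOT usable afterwards
#define LFS_ASSERT(test) \
    do { if (!(test)) { longjmp(lfs_panic_jmp, 1); } } while (0)

// supervising "process": if an internal assert fires, throw the
// old state away and start over with a fresh format
void storage_task(lfs_t *lfs, const struct lfs_config *cfg) {
    if (setjmp(lfs_panic_jmp) != 0) {
        lfs_format(lfs, cfg);
    }
    lfs_mount(lfs, cfg);
    // ... normal filesystem work ...
}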

vonnieda (Author) commented

> Hi @vonnieda, sorry for such a late response. Hopefully it's still useful to have your questions answered.
>
> 1. Are the asserts intended to be used in production, or just debugging?
>
> The asserts in littlefs are used to indicate a situation that should not happen. littlefs isn't programmed to behave itself if an assert fails, so the assert provides a last-minute safeguard to prevent it from corrupting itself.
>
> The asserts should not normally trigger; however, there are still enough bugs in littlefs that seeing an assert failure is not uncommon. I'm working on improving this here: #372, but it's built on the v2 branch.
>
> I would suggest keeping the asserts on if you can afford the code-space in production, since there's no other reason to disable them.

Thanks for the response @geky!

I can afford the code space, but the issue is that this causes a boot loop that I can't predict. If the filesystem has become corrupt, the only way to find out is to try to use it, and then it crashes due to the assert.

> One big thing that's improved in #372 is how littlefs responds to corrupted disks. Currently, littlefs asserts on a corrupted disk, but #372 changes these asserts into returned errors, giving users a chance to recover (the block < block_count assert falls into this category).
>
> Unfortunately, there aren't any plans to backport #372 to v1.

Okay, thanks. I'll look into upgrading to v2.

> 2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?
>
> It's the second option. If an assert fails, you can't expect littlefs to work afterwards.
>
> However, one thing you can do is replace the assert with a condition that returns an error:
>
> - LFS_ASSERT(block < lfs->cfg->block_count);
> + if (block >= lfs->cfg->block_count) {
> +     return LFS_ERR_CORRUPT;
> + }
>
> This should work in more cases and cause littlefs to propagate the error upwards.
>
> It's important to note you still can't expect littlefs to work in this situation.

> This is ok:
>
> int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
> if (err == LFS_ERR_CORRUPT) {
>     assert(lfs_unmount(&lfs) == 0);
>     assert(lfs_format(&lfs, &cfg) == 0);
>     assert(lfs_mount(&lfs, &cfg) == 0);
>     assert(lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY) == 0);
> }

This is similar to how I've fixed it for now. In my case, the only assert I've ever seen is block < lfs->cfg->block_count, and since that case is handled by my block device read(), I just disabled the asserts and let it through. So now, during boot, I run an lfs_traverse, and if I get any errors I unmount, format, mount, and traverse again (roughly like the sketch below).

Given that that's the only assert I've seen, it would probably be safer to modify the code to return the error as you've mentioned, but I don't want to risk another assert causing a boot loop.
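
In sketch form, using v1's lfs_traverse (simplified; boot_mount and traverse_cb are illustrative names, not my real code):

#include "lfs.h"

// no-op traverse callback: we only care whether traversal errors,
// not which blocks are in use
static int traverse_cb(void *data, lfs_block_t block) {
    (void)data;
    (void)block;
    return 0;
}

// mount and walk the whole filesystem; on any error, reformat and
// start over with a fresh filesystem
int boot_mount(lfs_t *lfs, const struct lfs_config *cfg) {
    int err = lfs_mount(lfs, cfg);
    if (err == 0) {
        err = lfs_traverse(lfs, traverse_cb, NULL);
        if (err == 0) {
            return 0;  // filesystem looks healthy
        }
        lfs_unmount(lfs);
    }
    err = lfs_format(lfs, cfg);
    if (err) {
        return err;
    }
    err = lfs_mount(lfs, cfg);
    if (err) {
        return err;
    }
    return lfs_traverse(lfs, traverse_cb, NULL);
}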

> This is not a good idea and will likely break later in surprising ways:
>
> int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
> if (err == LFS_ERR_CORRUPT) {
>     assert(lfs_file_open(&lfs, &file, "other_path", LFS_O_RDONLY) == 0);
> }
>
> If you have threads, another option is killing the thread inside the assert and restarting the "process".

If I were to turn asserts back on, do you have any suggestions for a robust check I could perform? Or would you recommend going straight to v2?

Thanks,
Jason

geky (Member) commented Mar 26, 2020

> If I were to turn asserts back on, do you have any suggestions for a robust check I could perform? Or would you recommend going straight to v2?

If you're able to, I would recommend adopting v2 (once #372 lands).

Unfortunately, the only check I have in MCU-friendly code is littlefs's mount, which, as you note, doesn't fail gracefully (until #372).
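
For completeness, the usual defensive pattern from the littlefs README reformats when mount fails, though a corruption-triggered assert would still fire before it can help (until #372):

// mount-or-format: reformat on a mount failure, which also
// handles the very first boot on a blank flash
int err = lfs_mount(&lfs, &cfg);
if (err) {
    lfs_format(&lfs, &cfg);
    err = lfs_mount(&lfs, &cfg);
}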

vonnieda (Author) commented

Thanks @geky - appreciate the response. I'll look towards migrating to v2.
