Internal assert on corrupted filesystem caused boot loop #390

Closed
vonnieda opened this issue Mar 4, 2020 · 4 comments


vonnieda commented Mar 4, 2020

LFS: v1.3 9ee112a
Also reproduced with LFS v1.7.2

I have a filesystem that I believe became corrupt during a power cycle. When the device next booted, it went into a boot loop due to an internal assert:

Assertion failed: (block < lfs->cfg->block_count), function lfs_cache_read, file littlefs/lfs.c, line 18.

The assert is this one:

LFS_ASSERT(block < lfs->cfg->block_count);

The corruption seems to have resulted in a stored block number of 1001106, while the disk is only 1024 blocks long.

I can supply the filesystem image privately if that's helpful. It contains proprietary information, so I am not able to post it publicly. I could probably post just the bad block publicly.

I'm not sure how the corruption happened in the first place, but my bigger concern is that lfs_cache_read asserts in this situation instead of returning an error code. If it returned an error code, I could catch it and try to reformat the filesystem.

With the asserts disabled, I am able to get an error code and recover by reformatting instead of crashing: the function tries to read the out-of-range block, my flash subsystem returns an error, and the error bubbles up.
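
For reference, the read hook does something like this (just a sketch; flash_read stands in for the real flash driver):

#include "lfs.h"

// Block-device read hook that rejects out-of-range blocks, so a
// corrupted block pointer surfaces as an error instead of a wild read.
int bd_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size) {
    if (block >= c->block_count) {
        return LFS_ERR_CORRUPT;
    }
    // flash_read is hypothetical: read `size` bytes at a raw offset
    return flash_read(block * c->block_size + off, buffer, size);
}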

So, I have a couple questions:

  1. Are the asserts intended to be used in production, or just debugging?
  2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?

Thanks,
Jason

geky (Member) commented Mar 20, 2020

Hi @vonnieda, sorry for such a late response. Hopefully it's still useful to have your questions answered.

> 1. Are the asserts intended to be used in production, or just debugging?

The asserts in littlefs are used to indicate a situation that should not happen. littlefs isn't programmed to behave itself if an assert fails, so the assert provides a last-minute safeguard to prevent it from corrupting itself.

The asserts should not normally trigger; however, there are still enough bugs in littlefs that seeing an assert failure is not uncommon. I'm working on improving this here: #372, but it's built on the v2 branch.

I would suggest keeping the asserts on if you can afford the code-space in production, since there's no other reason to disable them.
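
For context, the asserts all route through the LFS_ASSERT macro in lfs_util.h, which compiles away when LFS_NO_ASSERT is defined, roughly:

// from lfs_util.h (paraphrased): runtime assertions
#ifdef LFS_NO_ASSERT
#define LFS_ASSERT(test)
#else
#define LFS_ASSERT(test) assert(test)
#endif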

One big thing that's improved in #372 is how littlefs responds to corrupted disks. Currently, littlefs asserts on a corrupted disk, but #372 changes these asserts into returned errors, giving users a chance to recover (the block < block_count assert falls into this category).

Unfortunately, there aren't any plans to backport #372 to v1.

> 2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?

It's the second option. If an assert fails, you can't expect littlefs to work afterwards.

However, one thing you can do is replace the assert with a condition that returns an error:

- LFS_ASSERT(block < lfs->cfg->block_count);
+ if (block >= lfs->cfg->block_count) {
+     return LFS_ERR_CORRUPT;
+ }

This should work in more cases and cause littlefs to propagate the error upwards.

It's important to note you still can't expect littlefs to work in this situation.

This is ok:

int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
if (err == LFS_ERR_CORRUPT) {
    assert(lfs_unmount(&lfs) == 0);
    assert(lfs_format(&lfs, &cfg) == 0);
    assert(lfs_mount(&lfs, &cfg) == 0);
    assert(lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY) == 0);
}

This is not a good idea and will likely break later in surprising ways:

int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
if (err == LFS_ERR_CORRUPT) {
    assert(lfs_file_open(&lfs, &file, "other_path", LFS_O_RDONLY) == 0);
}

If you have threads, another option is killing the thread inside the assert and restarting the "process".
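
A rough sketch of that idea, assuming you can substitute your own LFS_ASSERT definition in lfs_util.h (the jump buffer and supervising task here are hypothetical):

#include <setjmp.h>
#include "lfs.h"

// hypothetical recovery point, set by the supervising task
static jmp_buf lfs_panic_jmp;

// substitute for the stock LFS_ASSERT: jump back to the supervisor
// instead of aborting; littlefs's state is NOT usable afterwards
#define LFS_ASSERT(test) \
    do { if (!(test)) { longjmp(lfs_panic_jmp, 1); } } while (0)

// supervising "process": if an internal assert fires, throw the
// old state away and start over with a fresh format
void storage_task(lfs_t *lfs, const struct lfs_config *cfg) {
    if (setjmp(lfs_panic_jmp) != 0) {
        lfs_format(lfs, cfg);
    }
    lfs_mount(lfs, cfg);
    // ... normal filesystem work ...
}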

vonnieda (Author) commented

> Hi @vonnieda, sorry for such a late response. Hopefully it's still useful to have your questions answered.
>
> 1. Are the asserts intended to be used in production, or just debugging?
>
> The asserts in littlefs are used to indicate a situation that should not happen. littlefs isn't programmed to behave itself if an assert fails, so the assert provides a last-minute safeguard to prevent it from corrupting itself.
>
> The asserts should not normally trigger; however, there are still enough bugs in littlefs that seeing an assert failure is not uncommon. I'm working on improving this here: #372, but it's built on the v2 branch.
>
> I would suggest keeping the asserts on if you can afford the code-space in production, since there's no other reason to disable them.

Thanks for the response @geky!

I can afford the code space, but the issue is that this causes a boot loop that I can't predict. If the filesystem has become corrupt, the only way to find out is to try to use it, and then it crashes due to the assert.

> One big thing that's improved in #372 is how littlefs responds to corrupted disks. Currently, littlefs asserts on a corrupted disk, but #372 changes these asserts into returned errors, giving users a chance to recover (the block < block_count assert falls into this category).
>
> Unfortunately, there aren't any plans to backport #372 to v1.

Okay, thanks. I'll look into upgrading to v2.

> 2. If the asserts are disabled, should I generally expect the code not to crash and return error codes, or do the asserts protect against possible crashes on bad data?
>
> It's the second option. If an assert fails, you can't expect littlefs to work afterwards.
>
> However, one thing you can do is replace the assert with a condition that returns an error:
>
> - LFS_ASSERT(block < lfs->cfg->block_count);
> + if (block >= lfs->cfg->block_count) {
> +     return LFS_ERR_CORRUPT;
> + }
>
> This should work in more cases and cause littlefs to propagate the error upwards.
>
> It's important to note you still can't expect littlefs to work in this situation.

> This is ok:
>
> int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
> if (err == LFS_ERR_CORRUPT) {
>     assert(lfs_unmount(&lfs) == 0);
>     assert(lfs_format(&lfs, &cfg) == 0);
>     assert(lfs_mount(&lfs, &cfg) == 0);
>     assert(lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY) == 0);
> }

This is similar to how I've fixed it for now. In my case, the only assert I've ever seen is block < lfs->cfg->block_count, and since that case is handled by my block device read(), I just disabled the asserts and let it through. So now, during boot, I run an lfs_traverse, and if I get any errors I unmount, format, mount, and traverse again (roughly like the sketch below).

Given that that's the only assert I've seen, it would probably be safer to modify the code to return the error as you've mentioned, but I don't want to risk another assert causing a boot loop.
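
In sketch form, using v1's lfs_traverse (simplified; boot_mount and traverse_cb are illustrative names, not my real code):

#include "lfs.h"

// no-op traverse callback: we only care whether traversal errors,
// not which blocks are in use
static int traverse_cb(void *data, lfs_block_t block) {
    (void)data;
    (void)block;
    return 0;
}

// mount and walk the whole filesystem; on any error, reformat and
// start over with a fresh filesystem
int boot_mount(lfs_t *lfs, const struct lfs_config *cfg) {
    int err = lfs_mount(lfs, cfg);
    if (err == 0) {
        err = lfs_traverse(lfs, traverse_cb, NULL);
        if (err == 0) {
            return 0;  // filesystem looks healthy
        }
        lfs_unmount(lfs);
    }
    err = lfs_format(lfs, cfg);
    if (err) {
        return err;
    }
    err = lfs_mount(lfs, cfg);
    if (err) {
        return err;
    }
    return lfs_traverse(lfs, traverse_cb, NULL);
}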

> This is not a good idea and will likely break later in surprising ways:
>
> int err = lfs_file_open(&lfs, &file, "path", LFS_O_RDONLY);
> if (err == LFS_ERR_CORRUPT) {
>     assert(lfs_file_open(&lfs, &file, "other_path", LFS_O_RDONLY) == 0);
> }
>
> If you have threads, another option is killing the thread inside the assert and restarting the "process".

If I were to turn asserts back on, do you have any suggestions for a robust check I could perform? Or would you recommend going straight to v2?

Thanks,
Jason

geky (Member) commented Mar 26, 2020

> If I were to turn asserts back on, do you have any suggestions for a robust check I could perform? Or would you recommend going straight to v2?

If you're able to, I would recommend adopting v2 (once #372 lands).

Unfortunately, the only check I have in MCU-friendly code is littlefs's mount, which, as you note, doesn't fail gracefully (until #372).
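
For completeness, the usual defensive pattern from the littlefs README reformats when mount fails, though a corruption-triggered assert would still fire before it can help (until #372):

// mount-or-format: reformat on a mount failure, which also
// handles the very first boot on a blank flash
int err = lfs_mount(&lfs, &cfg);
if (err) {
    lfs_format(&lfs, &cfg);
    err = lfs_mount(&lfs, &cfg);
}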

vonnieda (Author) commented

Thanks @geky - appreciate the response. I'll look towards migrating to v2.
