Improve behavior on corrupted checkpoint. #4025

fulmicoton · 2023-10-25T02:10:58Z

If a checkpoint contains an invalid position for ingest (for instance, one that is not a u64),
we currently panic.

Ideally we should:

- log an error
- repair the checkpoint by removing the corrupted partition
- start indexing the faulty partition from the beginning

This is very defensive and hence low priority.

jmintb · 2023-10-26T08:28:09Z

Can I work on this? :)

guilload · 2023-10-26T12:34:45Z

Yes, you can.

Logging an error and skipping the partition is enough. I don't think self-cleanup is helpful because the issue likely comes from either a bug in the source or a user manually editing a checkpoint. In the first case, the bug will reoccur, so self-cleanup will not help much. In the second, I'd rather have users clean up their mess themselves and decide from which point they want to resume indexing the partition. Always restarting from the beginning will yield duplicates.
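
A minimal sketch of the "log an error and skip the partition" behavior, for illustration only. It assumes a checkpoint can be modeled as a map from partition ID to a position stored as a string; the `Checkpoint` alias and the `resumable_offsets` function are hypothetical names and are not taken from the Quickwit codebase, which would presumably use its own checkpoint types and logging macros rather than `eprintln!`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a source checkpoint:
/// partition ID -> position, stored as a string.
type Checkpoint = BTreeMap<String, String>;

/// Parses each position as a `u64` offset.
///
/// Partitions whose position cannot be parsed are logged and skipped
/// instead of triggering a panic. The checkpoint itself is left untouched,
/// so the user can decide where to resume the faulty partition.
fn resumable_offsets(checkpoint: &Checkpoint) -> (BTreeMap<String, u64>, Vec<String>) {
    let mut offsets = BTreeMap::new();
    let mut skipped = Vec::new();

    for (partition_id, position) in checkpoint {
        match position.parse::<u64>() {
            Ok(offset) => {
                offsets.insert(partition_id.clone(), offset);
            }
            Err(error) => {
                // Assumption: real code would use a logging macro here;
                // `eprintln!` keeps the sketch dependency-free.
                eprintln!(
                    "skipping partition `{partition_id}`: invalid position `{position}`: {error}"
                );
                skipped.push(partition_id.clone());
            }
        }
    }
    (offsets, skipped)
}
```

With a checkpoint holding `"shard-0" -> "42"` and `"shard-1" -> "not-a-number"`, this returns the offset 42 for `shard-0` and reports `shard-1` as skipped, without modifying the checkpoint.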

jmintb · 2023-10-26T13:35:31Z

Perfect, do you know whereabouts in the codebase the panic(s) occur? Sounds like it is in quickwit-ingest.

guilload · 2023-10-26T14:02:33Z

If you run rg -e 'expect\(.*offset.*' in quickwit/quickwit-indexing/src/source, you should find them all.

jmintb · 2023-10-27T14:56:31Z

Are there any existing tests or test data that would simulate this scenario?


fulmicoton added the bug label Oct 25, 2023

fulmicoton mentioned this issue Oct 25, 2023

Close shards with EOF record #4021

Merged

fulmicoton added the low-priority label Oct 25, 2023

guilload assigned jmintb Oct 26, 2023

jmintb added a commit to jmintb/quickwit that referenced this issue Oct 27, 2023

Improve error handling for corrupt checkpoints

e4ffd64

Issue: quickwit-oss#4025

jmintb mentioned this issue Oct 27, 2023

Improve error handling for corrupt checkpoints #4039

Draft
