Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely handle flush exceptions in Raft #15463

Closed
Tracked by #14843
npepinpe opened this issue Dec 5, 2023 · 2 comments · Fixed by #24498
Closed
Tracked by #14843

Safely handle flush exceptions in Raft #15463

npepinpe opened this issue Dec 5, 2023 · 2 comments · Fixed by #24498
Assignees
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug likelihood/high A recurring issue likelihood/low Observed rarely / rather unlikely edge-case planning/discuss To be discussed at the next planning. severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2

Comments

@npepinpe
Copy link
Member

npepinpe commented Dec 5, 2023

@deepthidevaki @oleschoenburg - could we implement minimal fault tolerance to flush error with the following:

  1. Follower/passive role truncates the log and retries the full append on flush error
  2. Leader truncates the log to commit index and steps down on flush error

In both cases, truncating should invalidate the buffers/page cache. Since we only update the commit index on successful flush, then truncating to the last known commit index should be safe.

This can cause a lot of disruption, but it will at least avoid inconsistencies. WDYT?

I would be happy to implement at least safety (even if performance/availability is terrible) first, and leave performance to the NFS feature request in the future.

Originally posted by @npepinpe in #14843 (comment)

@npepinpe npepinpe added kind/bug Categorizes an issue or PR as a bug severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround severity/high Marks a bug as having a noticeable impact on the user with no known workaround likelihood/low Observed rarely / rather unlikely edge-case likelihood/high A recurring issue and removed severity/critical Marks a stop-the-world bug, with a high impact and no existing workaround labels Dec 5, 2023
@npepinpe
Copy link
Member Author

npepinpe commented Dec 5, 2023

I've put both high and low likelihood, since:

  1. This has a low likelihood to occur with standard block devices
  2. This has a high likelihood to occur with network file systems (e.g. NFS, SMB, etc.)

@npepinpe npepinpe added the planning/discuss To be discussed at the next planning. label Dec 7, 2023
@megglos
Copy link
Contributor

megglos commented Feb 19, 2024

ZDP-Planning:

  • relates to NFS support, will be handled once we have decision on that
  • moving to backlog because of this

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024
@entangled90 entangled90 self-assigned this Nov 6, 2024
entangled90 added a commit that referenced this issue Nov 7, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
@entangled90 entangled90 linked a pull request Nov 7, 2024 that will close this issue
entangled90 added a commit that referenced this issue Nov 7, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 7, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 12, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 12, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 12, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 13, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 13, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 13, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
entangled90 added a commit that referenced this issue Nov 15, 2024
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class CheckedJournalException has been created, since
JournalException extends RuntimeException.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

closes #15463
github-merge-queue bot pushed a commit that referenced this issue Nov 15, 2024
## Description
All flush() commands in the journal module now can throw a checked
exception FlushException, in order to make sure that all possible
exceptions thrown when flushing are handled.
A new class `CheckedJournalException` has been created, since
`JournalException` extends `RuntimeException`, thus it would not be a
checked exception.

Using `Either<Throwable,Void>` was also taken into consideration as a
first approach, but I was afraid that it was likely that a left either
value could be easily discarded as a return value by mistake, while this
is not possible with checked exception.
This also means that using checked exception "pollutes" a lot the code
so the idea is to use them only in critical parts of the code when we
must ensure that exceptions are handled properly.

When the leader fails to flush, it resets its commitIndex to the
previous value and transitions to the Follower role.

<!-- Describe the goal and purpose of this PR. -->

## Related issues
closes #15463
@npepinpe npepinpe added the version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2 label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug likelihood/high A recurring issue likelihood/low Observed rarely / rather unlikely edge-case planning/discuss To be discussed at the next planning. severity/high Marks a bug as having a noticeable impact on the user with no known workaround version:8.7.0-alpha2 Labal that represents features released with 8.7.0-alpha2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants