[v23.1.x] Fix txn consume group issues leading to undefined behavior #12006

Merged


(cherry picked from commit 7debd01)
An execution of abort_old_txes could span multiple terms, so the
method could modify the new state while assuming it is the old state,
resulting in undefined behavior
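
A minimal sketch of the hazard and one way to guard against it. The `group_state` type, the `suspend` helper and `abort_old_txes_guarded` are hypothetical names for illustration, not Redpanda code. The routine records the term it started in and rechecks it after the point where it may have been suspended:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for per-group transactional state.
struct group_state {
    int64_t term = 0;
    std::map<int64_t, bool> ongoing_txes; // producer id -> has an open tx
};

// Models a suspension point. The callback stands in for whatever else
// runs while the routine is suspended, e.g. a leadership change.
template <typename Fn>
void suspend(Fn&& while_suspended) {
    while_suspended();
}

// Clears old transactions only if the term did not change across the
// suspension. Returns true if state was modified, false if abandoned.
template <typename Fn>
bool abort_old_txes_guarded(group_state& st, Fn&& while_suspended) {
    const int64_t started_in = st.term;
    suspend(while_suspended);
    assert(st.term >= started_in); // terms only move forward
    if (st.term != started_in) {
        return false; // the state now belongs to a newer term
    }
    st.ongoing_txes.clear();
    return true;
}
```

Without the term check after the suspension, the final `clear()` would run against state rebuilt for the newer term.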

(cherry picked from commit f7fc026)
Make group accept a term to reduce the scope in which reset_tx_state
is used, making it easier to track where the write lock is necessary

(cherry picked from commit 93297d5)
@rystsov rystsov added kind/backport PRs targeting a stable branch and removed area/redpanda labels Jul 10, 2023
@rystsov rystsov added this to the v23.1.x-next milestone Jul 10, 2023
When the consumer group log's term changes, we replay the whole log to
reconstruct the state. We used to merge the current and the replayed
state, but that is error-prone. Reset the whole txn state instead to
get more deterministic behavior
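
The reset-then-replay approach can be sketched as follows. The `record` and `tx_state` types and the `reset_and_replay` function are assumptions made up for this example and do not mirror the actual Redpanda structures:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical log record: a producer begins or ends a transaction.
struct record {
    int64_t producer_id;
    bool begin; // true = tx started, false = tx committed or aborted
};

// Hypothetical in-memory transactional state derived from the log.
struct tx_state {
    int64_t term = -1;
    std::set<int64_t> open_txes;
};

// Rebuilds the state from scratch for new_term instead of merging the
// replayed records into whatever the previous term left behind.
void reset_and_replay(tx_state& st, int64_t new_term,
                      const std::vector<record>& log) {
    assert(new_term >= st.term); // terms only move forward
    st = tx_state{};             // drop everything from the old term
    st.term = new_term;
    for (const record& r : log) {
        if (r.begin) {
            st.open_txes.insert(r.producer_id);
        } else {
            st.open_txes.erase(r.producer_id);
        }
    }
}
```

Because the state is cleared first, anything held in memory that never reached the log cannot leak into the new term.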

(cherry picked from commit 69c5392)
Transactions in the Kafka protocol are stateful: the processing of a
request depends on the previous commands executed by the same or even
a different producer. This makes it dangerous when replication fails
with an indecisive error such as a timeout, because the true state is
unknown.

Step down to resolve the uncertainty by replaying the log
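
As a rough illustration of that policy, the sketch below separates decisive outcomes from an indecisive one. The `replicate_result` values and the `leader_model` type are hypothetical, and treating `rejected` as "definitely not written" is an assumption of the example:

```cpp
#include <cassert>

// Hypothetical outcomes of a replication attempt.
enum class replicate_result {
    ok,       // known to be written
    rejected, // assumed here to mean: known not to be written
    timeout   // indecisive: the write may or may not have happened
};

// Hypothetical single-node view of a group leader.
struct leader_model {
    bool is_leader = true;
    int applied = 0;

    // Applies decisive results locally. On an indecisive result it
    // gives up leadership so the next leader rebuilds the state by
    // replaying the log.
    void on_replicate(replicate_result r) {
        assert(is_leader); // only a leader handles replication results
        switch (r) {
        case replicate_result::ok:
            ++applied;
            break;
        case replicate_result::rejected:
            break; // state is known, nothing to resolve
        case replicate_result::timeout:
            is_leader = false; // step down instead of guessing
            break;
        }
    }
};
```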

(cherry picked from commit 7ca5707)
(cherry picked from commit 45675cc)
An execution of abort_old_txes could span multiple terms, so the
method could modify the new state while assuming it is the old state,
resulting in undefined behavior.

This commit is a rewrite of f7fc026, which was reverted in 11474. The
problem was caused by:

  - do_detach_partition got blocked
  - RP ignored the blocked do_detach_partition and attempted the next
    op, leading to double registration of the consumer groups ntp

The op was blocked by a deadlock:

  - do_abort_old_txes was waiting for the read lock while holding _gate
  - do_detach_partition was holding the write lock while waiting for
    the gate to be closed

This version doesn't wait for the read lock to become available; it
exits do_abort_old_txes, releasing the _gate.
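
The non-blocking behavior can be sketched with a single-threaded model of a read/write lock and a gate. `rwlock_model`, `group_model` and `try_abort_old_txes` are hypothetical and only imitate the shape of the fix; the real code uses Seastar primitives:

```cpp
#include <cassert>

// Minimal single-threaded model of a read/write lock (an assumption
// for illustration; it mimics a cooperative, one-thread-per-shard
// setting and is not thread-safe).
struct rwlock_model {
    int readers = 0;
    bool writer = false;

    bool try_read_lock() {
        if (writer) return false;
        ++readers;
        return true;
    }
    void read_unlock() { --readers; }
    bool try_write_lock() {
        if (writer || readers > 0) return false;
        writer = true;
        return true;
    }
    void write_unlock() { writer = false; }
};

// Hypothetical group: a gate holder count, the state lock, and a
// counter of completed abort passes.
struct group_model {
    rwlock_model state_lock;
    int gate_holders = 0;
    int abort_passes = 0;
};

// Enters the gate, tries the read lock without waiting, and always
// leaves the gate. Returns false if the pass was skipped.
bool try_abort_old_txes(group_model& g) {
    ++g.gate_holders; // enter the gate
    bool did_work = false;
    if (g.state_lock.try_read_lock()) {
        ++g.abort_passes; // the real work would happen here
        g.state_lock.read_unlock();
        did_work = true;
    }
    --g.gate_holders; // leave the gate even when the lock was busy
    assert(g.gate_holders >= 0);
    return did_work;
}
```

Because the routine never waits on the lock while inside the gate, a writer that is waiting for the gate to drain cannot be blocked by it.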

It still isn't clear why RP ignored a blocked op

(cherry picked from commit 0b9b9fb)
@rystsov rystsov merged commit 014a23e into redpanda-data:v23.1.x Jul 18, 2023
@BenPope BenPope modified the milestones: v23.1.x-next, v23.1.14 Aug 7, 2023