Recovery from sequence window expiration incident #11228

Open
ImTei opened this issue Jul 24, 2024 · 2 comments

Comments

@ImTei
Collaborator

ImTei commented Jul 24, 2024

High level description

If op-batcher stops submitting batches for long enough that the sequence window expires, op-node starts generating empty blocks. Even after op-batcher becomes operational again, it is hard to recover batch submission and chain derivation because new batches are likely to be dropped. This issue describes the situation in detail and proposes a solution.

Details of the incident

  1. For some reason, batches are not submitted for a while and the sequence window expires.
  2. op-node generates empty batches and advances the safe head with empty blocks. A chain reorg occurs.
  3. The sequencer builds new blocks on top of the generated empty blocks.
  4. op-batcher builds the next batch starting from its current safe head and submits it.
  5. While the new batch is being built and submitted to L1, op-node generates the next empty blocks and the chain reorgs again.
  6. The new batch is dropped for one of the following reasons:
    i. If the batch contains non-empty blocks, those blocks are no longer canonical once the new empty blocks exist.
    ii. If the batch contains only empty blocks and is a span batch, the first block of the span batch is already past the sequence window, so the entire batch is dropped.

Steps 2 to 6 then repeat.
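
To make reason 6.ii concrete, here is a minimal sketch of the timeliness check only, assuming the rule that a batch is dropped when its L1 origin number plus the sequence window size is lower than the L1 block it was included in, and that a span batch is judged by its first block. The function names and the flat list of origin numbers are hypothetical simplifications, not op-node code, and all other batch validity rules are ignored.

```go
package main

// seqWindowExpired reports whether a batch was included on L1 too late.
// epochNum is the L1 origin number of the batch (for a span batch, the
// origin of its first block). Assumed rule for this sketch:
// epochNum + seqWindowSize < inclusionBlock means the batch is dropped.
func seqWindowExpired(epochNum, seqWindowSize, inclusionBlock uint64) bool {
	return epochNum+seqWindowSize < inclusionBlock
}

// acceptedBlocks counts how many L2 blocks pass the timeliness check when
// they are all included in the same L1 block. epochs holds the L1 origin
// number of each L2 block, in order.
func acceptedBlocks(epochs []uint64, asSpanBatch bool, seqWindowSize, inclusionBlock uint64) int {
	if len(epochs) == 0 {
		return 0
	}
	if asSpanBatch {
		// A span batch is checked once, against its first block only,
		// so one stale first block invalidates the whole batch.
		if seqWindowExpired(epochs[0], seqWindowSize, inclusionBlock) {
			return 0
		}
		return len(epochs)
	}
	// Singular batches are checked one by one.
	accepted := 0
	for _, e := range epochs {
		if !seqWindowExpired(e, seqWindowSize, inclusionBlock) {
			accepted++
		}
	}
	return accepted
}
```

With origins 100, 101 and 200, a window of 50 and inclusion at L1 block 160, the span batch loses all three blocks, while singular batches still get the block with origin 200 through. This is the intuition behind running op-batcher in singular batch mode during recovery.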

Currently, we have to do the following manually to recover the chain from the incident:

  1. Block new user TX submissions.
  2. Empty the sequencer's TX pool.
  3. Run op-batcher in singular batch mode until chain derivation has recovered.

We could add these steps to the runbook, but we can also improve the system to automate them.

Solution

  1. Define a new op-node state that means "the sequence window has expired and empty batches are being generated". Let's call this incident mode for now.
  2. incident mode is enabled when op-node generates an empty block, and disabled when op-node derives a new block from an L1 batch.
  3. While op-node is in incident mode, the sequencer builds empty blocks by setting NoTxPool to true.
  4. incident mode is included as a boolean value in the optimism_syncStatus RPC response.
  5. op-batcher can check whether op-node is in incident mode via the syncStatus RPC. If it is, op-batcher builds singular batches even if it is running in span batch mode. (Alternatively, we could make it build a span batch that starts far enough from the current safe head to avoid sequence window expiration.)
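
The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: `IncidentTracker`, `OnSafeBlock`, `BatchType` and `chooseBatchType` are hypothetical names that do not exist in op-node or op-batcher, and the real change would have to plug into derivation, sequencing and the syncStatus response.

```go
package main

// BatchType is a hypothetical stand-in for the batcher's batch format setting.
type BatchType int

const (
	SingularBatch BatchType = iota
	SpanBatch
)

// IncidentTracker is a hypothetical holder for the proposed incident mode flag.
type IncidentTracker struct {
	incident bool
}

// OnSafeBlock records how the latest safe block was produced (step 2):
// generatedEmpty is true when op-node generated an empty block because the
// sequence window expired, false when the block was derived from an L1 batch.
func (t *IncidentTracker) OnSafeBlock(generatedEmpty bool) {
	t.incident = generatedEmpty
}

// IncidentMode is the boolean that step 4 would expose over syncStatus.
func (t *IncidentTracker) IncidentMode() bool {
	return t.incident
}

// NoTxPool is the value the sequencer would use when building blocks
// (step 3): no transactions from the pool while in incident mode.
func (t *IncidentTracker) NoTxPool() bool {
	return t.incident
}

// chooseBatchType is the batcher-side decision (step 5): fall back to
// singular batches in incident mode, otherwise keep the configured type.
func chooseBatchType(configured BatchType, incidentMode bool) BatchType {
	if incidentMode {
		return SingularBatch
	}
	return configured
}
```

In this sketch the flag flips back as soon as one block is derived from an L1 batch, which matches step 2 as written. Whether a real implementation would want some hysteresis before leaving incident mode is an open design question that the sketch does not settle.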

Discussion

This change would automate incident recovery for OP Stack chains, but it may be somewhat risky because it touches important features such as sequencing and batch submission. Because this kind of incident is very unlikely, we could instead consider a more manual way of recovering the system.

@emilianobonassi
Contributor

emilianobonassi commented Jul 25, 2024

Thanks @ImTei for tracking the issue on this edge case we found!

I do agree this might not be handled automatically in the first instance.

I think this could be a good opportunity to spin up a playbook/runbook section in the docs that describes this and other scenarios that might occur and provides best practices (at Conduit we have one internally).

Another example, blob congestion => switch to calldata, make a blob transaction type to cancel the pending one (see LZ airdrop).

@sebastianst
Member

> Another example, blob congestion => switch to calldata, make a blob transaction type to cancel the pending one (see LZ airdrop).

@emilianobonassi This is already done automatically by the batcher since #10941
