High level description
If op-batcher stops submitting batches for a while and the sequence window expires, op-node starts generating empty blocks. Even after op-batcher becomes operational again, it is hard to recover batch submission and chain derivation, because new batches are likely to be dropped. This issue describes the situation in detail and proposes a solution.
Details of the incident
1. For some reason, batches are not submitted for a while and the sequence window expires.
2. op-node generates empty batches, and the safe head advances with empty blocks. A chain reorg occurs.
3. The sequencer builds new blocks on top of the generated empty blocks.
4. op-batcher builds the next batch starting from its current safe head and submits it.
5. While the new batch is being built and submitted to L1, op-node generates further empty blocks and the chain reorgs again.
6. The new batch is dropped for one of the following reasons:
   i. If the batch contains non-empty blocks, those blocks are no longer canonical once the new empty blocks exist.
   ii. If the batch contains only empty blocks and is a span batch, the first block of the span batch is already past the sequence window, so the entire batch is dropped.
7. Repeat steps 2 to 6.
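To make reason ii concrete, here is a minimal sketch of the timeliness check as I understand it from the derivation rules: a batch is rejected when the L1 origin of its first block plus the sequence window size is below the L1 block that includes the batch. The type `spanBatch`, the function `pastSequenceWindow`, and the window size of 3600 are hypothetical names and values chosen for illustration, not op-node code.

```go
package main

import "fmt"

// spanBatch is a hypothetical, simplified view of a span batch. Only the
// L1 origin (epoch) number of its first L2 block matters for this check.
type spanBatch struct {
	StartEpochNum uint64
}

// pastSequenceWindow reports whether a batch whose first block has the given
// L1 origin was included on L1 too late to be accepted.
func pastSequenceWindow(startEpochNum, seqWindowSize, inclusionBlock uint64) bool {
	return startEpochNum+seqWindowSize < inclusionBlock
}

func main() {
	const seqWindowSize = 3600 // assumed value, for illustration only

	// The batcher starts the span at its (stale) safe head, so the first
	// block's L1 origin is old by the time the batch lands on L1.
	b := spanBatch{StartEpochNum: 1000}

	for _, inclusion := range []uint64{4000, 4600, 4601} {
		dropped := pastSequenceWindow(b.StartEpochNum, seqWindowSize, inclusion)
		fmt.Printf("inclusion=%d dropped=%v\n", inclusion, dropped)
	}
}
```

Because the check looks only at the first block of the span, one stale block at the front is enough to lose every block in the batch, which is why the loop in steps 2 to 6 does not resolve on its own.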
Currently, we have to do the following manually to recover the chain from the incident:
Block new user TX submissions.
Empty the sequencer's TX pool.
Run op-batcher in singular batch mode until chain derivation has recovered.
We could add these steps to the runbook, but we could also improve the system to automate them.
Solution
Define a new op-node state that indicates "the sequence window has expired and empty batches are being generated". Let's call this incident mode for now.
incident mode is enabled when op-node generates an empty block, and disabled when op-node derives a new block from an L1 batch.
While op-node is in incident mode, the sequencer builds empty blocks by setting NoTxPool to true.
incident mode is exposed as a boolean value in the optimism_syncStatus RPC response.
op-batcher can check whether op-node is in incident mode via the syncStatus RPC. If it is, op-batcher builds singular batches even when it is running in span batch mode. (Alternatively, we could make it build a span batch that starts far from the current safe head to avoid sequence window expiration.)
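The proposal above can be sketched as a small state machine plus a batch type decision. This is only an illustration under assumptions: `nodeState`, `deriveEvent`, `chooseBatchType`, and the other names below are hypothetical and do not correspond to existing op-node or op-batcher code.

```go
package main

import "fmt"

// deriveEvent is a hypothetical signal emitted by the derivation pipeline.
type deriveEvent int

const (
	emptyBlockGenerated   deriveEvent = iota // op-node auto-generated an empty block
	blockDerivedFromBatch                    // op-node derived a block from an L1 batch
)

// nodeState holds the proposed flag, which would be reported in the
// sync status RPC response.
type nodeState struct {
	IncidentMode bool
}

// onEvent enables incident mode on a generated empty block and disables it
// once a block is derived from a real L1 batch.
func (s *nodeState) onEvent(e deriveEvent) {
	switch e {
	case emptyBlockGenerated:
		s.IncidentMode = true
	case blockDerivedFromBatch:
		s.IncidentMode = false
	}
}

// noTxPool is the value the sequencer would pass when building a block:
// true means no transactions are taken from the TX pool.
func (s *nodeState) noTxPool() bool { return s.IncidentMode }

type batchType string

const (
	singular batchType = "singular"
	span     batchType = "span"
)

// chooseBatchType forces singular batches while the node is in incident
// mode and otherwise keeps the configured batch type.
func chooseBatchType(configured batchType, incidentMode bool) batchType {
	if incidentMode {
		return singular
	}
	return configured
}

func main() {
	s := &nodeState{}
	for _, e := range []deriveEvent{emptyBlockGenerated, blockDerivedFromBatch} {
		s.onEvent(e)
		fmt.Printf("incident=%v noTxPool=%v batch=%s\n",
			s.IncidentMode, s.noTxPool(), chooseBatchType(span, s.IncidentMode))
	}
}
```

Keeping the flag in op-node and letting op-batcher only read it means the batcher needs no knowledge of the sequence window itself, at the cost of one more field that both components must agree on.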
Discussion
This change could automate incident recovery for OP Stack chains, but it is somewhat risky because it touches many critical features such as sequencing and batch submission. Because this situation is very unlikely, we could instead consider a more manual way to recover the system from the incident.
Thanks @ImTei for tracking the issue on this edge case we found!
I agree this might not be handled automatically in the first instance.
I think this could be a good opportunity to spin up a playbook/runbook section in the docs that describes this and other scenarios that might arise, along with best practices (at Conduit we have one internally).
Another example: blob congestion => switch to calldata, and send a blob-type transaction to cancel the pending one (see the LZ airdrop).