diff --git a/pages/builders/chain-operators/tools/_meta.json b/pages/builders/chain-operators/tools/_meta.json index d859c48d9..826e2e1db 100644 --- a/pages/builders/chain-operators/tools/_meta.json +++ b/pages/builders/chain-operators/tools/_meta.json @@ -1,4 +1,5 @@ { "op-challenger": "Configure Challenger For Your Chain", + "op-conductor": "op-conductor", "explorer": "Block Explorer" } diff --git a/pages/builders/chain-operators/tools/op-conductor.mdx b/pages/builders/chain-operators/tools/op-conductor.mdx new file mode 100644 index 000000000..6d4d6cff0 --- /dev/null +++ b/pages/builders/chain-operators/tools/op-conductor.mdx @@ -0,0 +1,727 @@ +--- +title: Conductor +lang: en-US +description: Learn what the op-conductor is and how to use it to create a highly available and reliable sequencer. +--- + +import { Callout, Tabs, Steps } from 'nextra/components' + +# Conductor + +This page will teach you what the `op-conductor` service is and how it works on +a high level. It will also get you started on setting it up in your own +environment. + +## Enhancing Sequencer Reliability and Availability + +The [op-conductor](https://github.com/ethereum-optimism/optimism/tree/develop/op-conductor) +is an auxiliary service designed to enhance the reliability and availability of +a sequencer within high-availability setups. By minimizing the risks +associated with a single point of failure, the op-conductor ensures that the +sequencer remains operational and responsive. + +### Assumptions + +It is important to note that the `op-conductor` does not incorporate Byzantine +fault tolerance (BFT). This means the system operates under the assumption that +all participating nodes are honest and act correctly. 
+ +### Summary of Guarantees + +The design of the `op-conductor` provides the following guarantees: + +* **No Unsafe Reorgs** +* **No Unsafe Head Stall During Network Partition** +* **100% Uptime with No More Than 1 Node Failure** + +## Design + +![op-conductor.](/img/builders/chain-operators/op-conductor.svg) + +**On a high level, `op-conductor` serves the following functions:** + +### Raft Consensus Layer Participation + +* **Leader Determination:** Participates in the Raft consensus algorithm to + determine the leader among sequencers. +* **State Management:** Stores the latest unsafe block, ensuring consistency + across the system. + +### RPC Request Handling + +* **Admin RPC:** Provides administrative RPCs for manual recovery scenarios, + including, but not limited to: stopping the leadership vote and removing itself + from the cluster. +* **Health RPC:** Offers health RPCs for the `op-node` to determine whether it + should allow the publishing of transactions and unsafe blocks. + +### Sequencer Health Monitoring + +* Continuously monitors the health of the sequencer (op-node) to ensure + optimal performance and reliability. + +### Control Loop Management + +* Implements a control loop to manage the status of the sequencer (op-node), + including starting and stopping operations based on different scenarios and + health checks. + +## Conductor State Transition + +The following state machine diagram shows how the op-conductor manages the +sequencers' Raft consensus. + +![op-conductor-state-transition.](/img/builders/chain-operators/op-conductor-state-transition.svg) + +**Helpful tip:** To better understand the diagram, focus on one node at a time: +work out which states can transition into it and which states it can +transition to. This makes it easier to follow how the state transitions are +handled. + +## Setup + +At OP Labs, op-conductor is deployed as a Kubernetes statefulset because it +requires a persistent volume to store the raft log. 
This guide describes +setting up conductor on an existing network without incurring downtime. + +### Assumptions + +This setup guide has the following assumptions: + +* 3 deployed sequencers (sequencer-0, sequencer-1, sequencer-2) that are all + in sync and in the same vpc network +* sequencer-0 is currently the active sequencer +* You can execute a blue/green style sequencer deployment workflow that + involves no downtime (described below) +* conductor and sequencers are running in k8s or some other container + orchestrator (vm-based deployment may be slightly different and not covered + here) + +### Spin up op-conductor + + + {

Deploy conductor

} + + Deploy a conductor instance per sequencer with sequencer-1 as the raft cluster + bootstrap node: + + * suggested conductor configs: + + ```yaml + OP_CONDUCTOR_CONSENSUS_ADDR: '' + OP_CONDUCTOR_CONSENSUS_PORT: '50050' + OP_CONDUCTOR_EXECUTION_RPC: ':8545' + OP_CONDUCTOR_HEALTHCHECK_INTERVAL: '1' + OP_CONDUCTOR_HEALTHCHECK_MIN_PEER_COUNT: '2' # set based on your internal p2p network peer count + OP_CONDUCTOR_HEALTHCHECK_UNSAFE_INTERVAL: '5' # recommend a 2-3x multiple of your network block time to account for temporary performance issues + OP_CONDUCTOR_LOG_FORMAT: logfmt + OP_CONDUCTOR_LOG_LEVEL: info + OP_CONDUCTOR_METRICS_ADDR: 0.0.0.0 + OP_CONDUCTOR_METRICS_ENABLED: 'true' + OP_CONDUCTOR_METRICS_PORT: '7300' + OP_CONDUCTOR_NETWORK: '' + OP_CONDUCTOR_NODE_RPC: ':8545' + OP_CONDUCTOR_RAFT_SERVER_ID: 'unique raft server id' + OP_CONDUCTOR_RAFT_STORAGE_DIR: /conductor/raft + OP_CONDUCTOR_RPC_ADDR: 0.0.0.0 + OP_CONDUCTOR_RPC_ENABLE_ADMIN: 'true' + OP_CONDUCTOR_RPC_ENABLE_PROXY: 'true' + OP_CONDUCTOR_RPC_PORT: '8547' + ``` + + * sequencer-1 op-conductor extra config: + + ```yaml + OP_CONDUCTOR_PAUSED: "true" + OP_CONDUCTOR_RAFT_BOOTSTRAP: "true" + ``` + + {

Pause two conductors

} + + Pause the `sequencer-0` and `sequencer-1` conductors with a [conductor\_pause](#conductor_pause) + RPC request. + + {

Update op-node configuration and switch the active sequencer

} + + Deploy an `op-node` config update to all sequencers that enables conductor. Use + a blue/green style deployment workflow that switches the active sequencer to + `sequencer-1`: + + * all sequencer op-node configs: + + ```yaml + OP_NODE_CONDUCTOR_ENABLED: "true" + OP_NODE_RPC_ADMIN_STATE: "" # this flag can't be used with conductor + ``` + + {

Confirm sequencer switch was successful

} + + Confirm `sequencer-1` is active and successfully producing unsafe blocks. + Because `sequencer-1` was the raft cluster bootstrap node, it is now committing + unsafe payloads to the raft log. + + {

Add voting nodes

} + + Add voting nodes to the cluster using a [conductor\_addServerAsVoter](#conductor_addServerAsVoter) + RPC request to the leader conductor (`sequencer-1`). + + {

Confirm state

} + + Confirm cluster membership and sequencer state: + + * `sequencer-0` and `sequencer-2`: + 1. raft cluster follower + 2. sequencer is stopped + 3. conductor is paused + 4. conductor enabled in op-node config + + * `sequencer-1` + 1. raft cluster leader + 2. sequencer is active + 3. conductor is paused + 4. conductor enabled in op-node config + + {

Resume conductors

} + + Resume all conductors with a [conductor\_resume](#conductor_resume) RPC request to + each conductor instance. + + {

Confirm state

} + + Confirm that all conductors successfully resumed with [conductor\_paused](#conductor_paused). + + {

Transfer leadership

} + + Trigger a leadership transfer to `sequencer-0` using [conductor\_transferLeaderToServer](#conductor_transferLeaderToServer). + + {

Confirm state

} + + * `sequencer-1` and `sequencer-2`: + 1. raft cluster follower + 2. sequencer is stopped + 3. conductor is active + 4. conductor enabled in op-node config + + * `sequencer-0` + 1. raft cluster leader + 2. sequencer is active + 3. conductor is active + 4. conductor enabled in op-node config + + {

Update configuration

} + + Deploy a config change to `sequencer-1` conductor to remove the + `OP_CONDUCTOR_PAUSED: true` flag and `OP_CONDUCTOR_RAFT_BOOTSTRAP` flag. +
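The "Confirm state" checks in the steps above can be scripted rather than run by hand. The following is a minimal sketch, not part of op-conductor itself: the function names are hypothetical, and the conductor RPC URLs are assumptions you must replace with your own (the suggested config above sets `OP_CONDUCTOR_RPC_PORT` to `8547`). It only uses the parameterless `conductor_paused`, `conductor_leader`, and `conductor_sequencerHealthy` methods documented on this page.

```sh
# Hypothetical helpers; names and URLs are illustrative assumptions.

# Build the JSON-RPC 2.0 body for a parameterless conductor method,
# e.g. "paused" -> conductor_paused.
conductor_payload() {
  printf '{"jsonrpc":"2.0","method":"conductor_%s","params":[],"id":1}' "$1"
}

# POST the request to one conductor and print the raw JSON response.
# $1 = conductor RPC URL (assumption: adjust to your deployment)
# $2 = method suffix, e.g. "paused"
conductor_call() {
  curl -s -X POST -H "Content-Type: application/json" \
    --data "$(conductor_payload "$2")" "$1"
}

# Print the paused / leader / sequencerHealthy responses for every
# conductor URL passed as an argument.
conductor_report() {
  for url in "$@"; do
    for method in paused leader sequencerHealthy; do
      printf '%s conductor_%s: %s\n' \
        "$url" "$method" "$(conductor_call "$url" "$method")"
    done
  done
}
```

For example, `conductor_report http://sequencer-0-conductor:8547 http://sequencer-1-conductor:8547 http://sequencer-2-conductor:8547` (hostnames are placeholders) prints one line per conductor and method, which you can compare against the expected state listed in each "Confirm state" step.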
+ +#### Blue/Green Deployment + +To ensure there is no downtime when setting up conductor, you need to +have a deployment script that can update sequencers without network downtime. + +An example of this workflow might look like: + +1. Query the current state of the network and determine which sequencer is + currently active (referred to as the "original" sequencer below). + From the other available sequencers, choose a candidate sequencer. +2. Deploy the change to the candidate sequencer and then wait for it to sync + up to the original sequencer's unsafe head. You may want to check peer counts + and other important health metrics. +3. Stop the original sequencer using `admin_stopSequencer`, which returns the + last inserted unsafe block hash. Wait for the candidate sequencer to sync with + this returned hash in case there is a delta. +4. Start the candidate sequencer at the original's last inserted unsafe block + hash. + 1. Here you can also execute additional checks for unsafe head progression + and decide to roll back the change (stop the candidate sequencer, start the + original, roll back the deployment of the candidate, etc.). +5. Deploy the change to the original sequencer and wait for it to sync to the + chain head. Execute health checks. + +#### Post-Conductor Launch Deployments + +After conductor is live, a similar canary-style workflow is used to ensure +minimal downtime in case there is an issue with deployment: + +1. Choose a candidate sequencer from the raft-cluster followers. +2. Deploy to the candidate sequencer. Run health checks on the candidate. +3. Transfer leadership to the candidate sequencer using + `conductor_transferLeaderToServer`. Run health checks on the candidate. +4. Test if the candidate is still the leader using `conductor_leader` after some + grace period (e.g., 30 seconds). + 1. If not, then there is likely an issue with the deployment. Roll back. +5. Upgrade the remaining sequencers and run health checks. 
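Step 4 of the canary workflow above (checking that the candidate is still the leader after a grace period) could be sketched as follows. This is an illustrative sketch under stated assumptions, not an official script: the function names are hypothetical, the conductor RPC URL is supplied by you, and the check assumes a standard JSON-RPC 2.0 response whose `result` field is `true` when `conductor_leader` reports leadership. The leadership transfer itself is left out because its parameters depend on your cluster.

```sh
# Hypothetical helpers; names and URLs are illustrative assumptions.

# Succeed (exit 0) if a JSON-RPC response carries a boolean true result.
# Tolerates optional whitespace around the colon.
rpc_result_is_true() {
  printf '%s' "$1" | grep -Eq '"result"[[:space:]]*:[[:space:]]*true'
}

# Ask one conductor whether it is currently the raft leader.
# $1 = candidate conductor RPC URL (assumption: supply your own)
conductor_is_leader() {
  rpc_result_is_true "$(curl -s -X POST -H "Content-Type: application/json" \
    --data '{"jsonrpc":"2.0","method":"conductor_leader","params":[],"id":1}' \
    "$1")"
}

# Wait for a grace period, then report whether the candidate kept leadership.
# $1 = candidate conductor RPC URL
# $2 = grace period in seconds (defaults to 30)
check_canary() {
  sleep "${2:-30}"
  if conductor_is_leader "$1"; then
    echo "candidate is still leader; continue with the remaining sequencers"
  else
    echo "candidate lost leadership; roll back the deployment" >&2
    return 1
  fi
}
```

A deployment script might call `check_canary http://candidate-conductor:8547 30` (placeholder hostname) right after the leadership transfer and abort the rollout on a non-zero exit status.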
+ +### Configuration Options + +It is configured via its [flags / environment variables](https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/flags/flags.go) + +#### --consensus.addr (`CONSENSUS_ADDR`) + +* **Usage:** Address to listen for consensus connections +* **Default Value:** 127.0.0.1 +* **Required:** yes + +#### --consensus.port (`CONSENSUS_PORT`) + +* **Usage:** Port to listen for consensus connections +* **Default Value:** 50050 +* **Required:** yes + +#### --raft.bootstrap (`RAFT_BOOTSTRAP`) + + + For bootstrapping a new cluster. This should only be used on the sequencer + that is currently active and can only be started once with this flag, + otherwise the flag has to be removed or the raft log must be deleted before + re-bootstrapping the cluster. + + +* **Usage:** If this node should bootstrap a new raft cluster +* **Default Value:** false +* **Required:** no + +#### --raft.server.id (`RAFT_SERVER_ID`) + +* **Usage:** Unique ID for this server used by raft consensus +* **Default Value:** None specified +* **Required:** yes + +#### --raft.storage.dir (`RAFT_STORAGE_DIR`) + +* **Usage:** Directory to store raft data +* **Default Value:** None specified +* **Required:** yes + +#### --node.rpc (`NODE_RPC`) + +* **Usage:** HTTP provider URL for op-node +* **Default Value:** None specified +* **Required:** yes + +#### --execution.rpc (`EXECUTION_RPC`) + +* **Usage:** HTTP provider URL for execution layer +* **Default Value:** None specified +* **Required:** yes + +#### --healthcheck.interval (`HEALTHCHECK_INTERVAL`) + +* **Usage:** Interval between health checks +* **Default Value:** None specified +* **Required:** yes + +#### --healthcheck.unsafe-interval (`HEALTHCHECK_UNSAFE_INTERVAL`) + +* **Usage:** Interval allowed between unsafe head and now measured in seconds +* **Default Value:** None specified +* **Required:** yes + +#### --healthcheck.safe-enabled (`HEALTHCHECK_SAFE_ENABLED`) + +* **Usage:** Whether to enable safe head 
progression checks +* **Default Value:** false +* **Required:** no + +#### --healthcheck.safe-interval (`HEALTHCHECK_SAFE_INTERVAL`) + +* **Usage:** Interval between safe head progression measured in seconds +* **Default Value:** 1200 +* **Required:** no + +#### --healthcheck.min-peer-count (`HEALTHCHECK_MIN_PEER_COUNT`) + +* **Usage:** Minimum number of peers required to be considered healthy +* **Default Value:** None specified +* **Required:** yes + +#### --paused (`PAUSED`) + + + There is no configuration state, so if you unpause via RPC and then restart, + it will start paused again. + + +* **Usage:** Whether the conductor is paused +* **Default Value:** false +* **Required:** no + +#### --rpc.enable-proxy (`RPC_ENABLE_PROXY`) + +* **Usage:** Enable the RPC proxy to underlying sequencer services +* **Default Value:** true +* **Required:** no + +### RPCs + +Conductor exposes [admin RPCs](https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/rpc/api.go#L17) +on the `conductor` namespace. + +#### conductor\_overrideLeader + +`OverrideLeader` is used to override the leader status, this is only used to +return true for `Leader()` & `LeaderWithID()` calls. It does not impact the +actual raft consensus leadership status. It is supposed to be used when the +cluster is unhealthy and the node is the only one up, to allow batcher to +be able to connect to the node so that it could download blocks from the +manually started sequencer. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_overrideLeader","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_overrideLeader --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_pause + +`Pause` pauses op-conductor. 
+ + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_pause","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_pause --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_resume + +`Resume` resumes op-conductor. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_resume","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_resume --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_paused + +Paused returns true if the op-conductor is paused. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_paused","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_paused --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_stopped + +Stopped returns true if the op-conductor is stopped. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_stopped","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_stopped --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_sequencerHealthy + +SequencerHealthy returns true if the sequencer is healthy. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_sequencerHealthy","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_sequencerHealthy --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_leader + + + API related to consensus. + + +Leader returns true if the server is the leader. 
+ + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_leader","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_leader --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_leaderWithID + + + API related to consensus. + + +LeaderWithID returns the current leader's server info. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_leaderWithID","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_leaderWithID --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_addServerAsVoter + + + API related to consensus. + + +AddServerAsVoter adds a server as a voter to the cluster. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_addServerAsVoter","params":[, , ],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_addServerAsVoter --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_addServerAsNonvoter + + + API related to consensus. + + +AddServerAsNonvoter adds a server as a non-voter to the cluster. +The non-voter will not participate in the leader election. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_addServerAsNonvoter","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_addServerAsNonvoter --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_removeServer + + + API related to consensus. + + +RemoveServer removes a server from the cluster. 
+ + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_removeServer","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_removeServer --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_transferLeader + + + API related to consensus. + + +TransferLeader transfers leadership to another server. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_transferLeader","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_transferLeader --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_transferLeaderToServer + + + API related to consensus. + + +TransferLeaderToServer transfers leadership to a specific server. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_transferLeaderToServer","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_transferLeaderToServer --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_clusterMembership + +ClusterMembership returns the current cluster membership configuration. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_clusterMembership","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_clusterMembership --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_active + + + API called by `op-node`. + + +Active returns true if the op-conductor is active (not paused or stopped). + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_active","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_active --rpc-url http://127.0.0.1:50050 + ``` + + + +#### conductor\_commitUnsafePayload + + + API called by `op-node`. 
+ + +CommitUnsafePayload commits an unsafe payload (latest head) to the consensus +layer. + + + + ```sh + curl -X POST -H "Content-Type: application/json" --data \ + '{"jsonrpc":"2.0","method":"conductor_commitUnsafePayload","params":[],"id":1}' \ + http://127.0.0.1:50050 + ``` + + + + ```sh + cast rpc conductor_commitUnsafePayload --rpc-url http://127.0.0.1:50050 + ``` + + + +## Next Steps + +* Check out [op-conductor-mon](https://github.com/ethereum-optimism/infra), + which monitors multiple op-conductor instances and provides a unified interface + for reporting metrics. diff --git a/public/img/builders/chain-operators/op-conductor-state-transition.svg b/public/img/builders/chain-operators/op-conductor-state-transition.svg new file mode 100644 index 000000000..b9e054452 --- /dev/null +++ b/public/img/builders/chain-operators/op-conductor-state-transition.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/public/img/builders/chain-operators/op-conductor.svg b/public/img/builders/chain-operators/op-conductor.svg new file mode 100644 index 000000000..5b3e50b5f --- /dev/null +++ b/public/img/builders/chain-operators/op-conductor.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/words.txt b/words.txt index 9504ff5e9..50b084d9b 100644 --- a/words.txt +++ b/words.txt @@ -25,7 +25,6 @@ BLOBPOOL blobpool blobspace blockhash -blockheaders blocklists BLOCKLOGS blocklogs @@ -127,6 +126,7 @@ hardfork hardforks HEALTHCHECK healthcheck +healthchecks heartbeating HISTORICALRPC historicalrpc @@ -321,6 +321,7 @@ SRAV SRLV Stablecoins stablecoins +statefulset subcomponents subgame subheaders @@ -340,6 +341,7 @@ therealbytes threadcreate tility timeseries +Tranfer trustlessly trustrpc txfeecap