Improve beacon node failover in validator client [tracking issue] #3613

michaelsproul · 2022-09-27T02:41:30Z

Description

I think the beacon node failover feature is in need of some love, particularly now that we are post-merge and it is complicated by the addition of the execution node.

There are several issues:

Beacon nodes continue to self-report their status as synced when the execution node goes offline. This means they will be used by validator clients to produce sub-optimal attestations.
Beacon nodes self-report their status as synced for eight epochs after their internal sync state switches to syncing. Again this results in sub-optimal attestations.
Beacon nodes return nasty 500 internal server errors when the head is optimistic, e.g.

CRIT Error during attestation routine        slot: 4747474, committee_index: 11, error: "All endpoints failed http://localhost:5052/ => RequestFailed("Failed to produce attestation data: ServerMessage(ErrorMessage { code: 500, message: \"UNHANDLED_ERROR: HeadBlockNotFullyVerified { beacon_block_root: 0xbef1f36d129ca41de0a2da31962f5cb5025262f8813c6be2448524fc75be9947, execution_status: Optimistic(0x93219f971e74377cb48ba8050f1f193390d0313afd23c8cc4b1e8a119fb32fa1) }\", stacktraces: [] })")"

This error is probably handled most gracefully by a VC with redundant beacon nodes, as it should failover to the next BN (need to double check this, the log above is from a node without failover BNs).

Option 1: Fail Fast

One way to address all of these issues would be to add a flag to the beacon node like --fail-fast which makes it report its status as unsynced more readily:

If the execution node is offline, report BN status as unsynced. Extension of [Merged by Bors] - Pause sync when EE is offline #3428.
If the internal sync state is anything but fully synced, report sync status as unsynced (ignore SYNC_TOLERANCE_EPOCHS).

The downside of this approach is that it adds configuration complexity: most beacon nodes used for redundancy should be configured with --fail-fast, but at least one beacon node per cluster should not be, in order to not cripple liveness in case of sync issues (which is what SYNC_TOLERANCE_EPOCHS is trying to guard against).

The --fail-fast solution may also still not fail fast enough for some users, Lighthouse will usually consider itself synced for a few epochs (2?) without any blocks, during which time attestations will still miss.

Option 2: Quality Control

Rather than changing the behaviour of the beacon node, we could change the behaviour of the validator client to make it smarter about which beacon nodes to try first. Instead of using a binary "synced" or "not synced" differentiation the VC could use other data to make its choice, e.g. sync distance, optimistic status, execution layer online/offline status.

The complication with this approach is that we need to work within the confines of standard APIs provided by beacon nodes (which e.g. doesn't include EL status), and we need to avoid opening ourselves to attacks where an attacker tricks us into following their chain by proposing blocks closer to the current wall clock slot than the canonical chain.

I think I have a preference for option 2 currently.

The text was updated successfully, but these errors were encountered:

michaelsproul · 2022-10-03T11:10:56Z

VC to delay before switching back to recovered beacon node #3002. This switchback stickiness could be implemented as part of the "quality" assessment and switching mechanism.

Complementary:

VC: Publish to multiple BNs #3516. Publishing messages to all connected BNs is an orthogonal improvement that we could make alongside improving the current lightweight fallback system.

jmcruz1983 · 2023-05-12T10:30:12Z

@michaelsproul
what about implementing following logic on the /eth/v1/node/health endpoint:

when execution node is syncing then returning unhealthy (something like 4xx or 5xx http error code)
using that as readiness probe in k8s would stop sending traffic to current beacon meanwhile execution node is syncing
wdyt?

## Issue Addressed Closes #4291, part of #3613. ## Proposed Changes - Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if: - The EL's internal status is `Offline` or `AuthFailed`, _or_ - The most recent call to `newPayload` resulted in an error (more on this in a moment). - Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes. - Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated. - Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check. ## Why track `newPayload` errors? Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because: - If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_. - The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms). - If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them. ## Additional Changes - Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too. - Delete some stale comments from when we used to support multiple execution nodes.

## Issue Addressed Closes sigp#4291, part of sigp#3613. ## Proposed Changes - Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if: - The EL's internal status is `Offline` or `AuthFailed`, _or_ - The most recent call to `newPayload` resulted in an error (more on this in a moment). - Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes. - Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated. - Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check. ## Why track `newPayload` errors? Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because: - If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_. - The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms). - If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them. ## Additional Changes - Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too. - Delete some stale comments from when we used to support multiple execution nodes.

Closes sigp#4291, part of sigp#3613. - Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if: - The EL's internal status is `Offline` or `AuthFailed`, _or_ - The most recent call to `newPayload` resulted in an error (more on this in a moment). - Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes. - Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated. - Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check. Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because: - If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_. - The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms). - If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them. - Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too. - Delete some stale comments from when we used to support multiple execution nodes.

macladson · 2024-10-28T17:50:28Z

Since the merge of #4393, we now have more sophisticated fallback behaviour following Option 2 listed in the issue.
I'll close this issue now, and if fallback issues continue to be widespread, new issues should be opened to address them.

michaelsproul added RFC Request for comment enhancement New feature or request val-client Relates to the validator client binary labels Sep 27, 2022

michaelsproul mentioned this issue Oct 3, 2022

VC: Publish to multiple BNs #3516

Closed

michaelsproul mentioned this issue Oct 14, 2022

Validator fallback does not work when the execution engine is offline or unavailable #3641

Closed

This was referenced Nov 9, 2022

VC does not fallback for sync sigs when primary BN is optimistic #3612

Closed

Beacon Node Fallback Issues #3718

Open

michaelsproul assigned macladson Jan 13, 2023

michaelsproul mentioned this issue Jan 25, 2023

Allow for 2s latency on attestation subscriptions #3506

Closed

michaelsproul mentioned this issue Feb 16, 2023

Validator client first_success swallows intermediate errors #3985

Closed

This was referenced May 15, 2023

Update node health endpoint #4292

Closed

[Merged by Bors] - Implement el_offline and use it in the VC #4295

Closed

macladson mentioned this issue Jun 13, 2023

Rework Validator Client fallback mechanism #4393

Merged

5 tasks

macladson closed this as completed Oct 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve beacon node failover in validator client [tracking issue] #3613

Improve beacon node failover in validator client [tracking issue] #3613

michaelsproul commented Sep 27, 2022

michaelsproul commented Oct 3, 2022

jmcruz1983 commented May 12, 2023

macladson commented Oct 28, 2024

Improve beacon node failover in validator client [tracking issue] #3613

Improve beacon node failover in validator client [tracking issue] #3613

Comments

michaelsproul commented Sep 27, 2022

Description

Option 1: Fail Fast

Option 2: Quality Control

michaelsproul commented Oct 3, 2022

jmcruz1983 commented May 12, 2023

macladson commented Oct 28, 2024