[Merged by Bors] - Implement `el_offline` and use it in the VC #4295

michaelsproul · 2023-05-16T07:47:05Z

Issue Addressed

Closes #4291, part of #3613.

Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, or
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).
- Use the `el_offline` field in the VC to mark nodes with offline ELs as unsynced. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the `newPayload` check.
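The rule for setting `el_offline` can be sketched as a small pure function. This is a minimal illustration, not the code from this PR: the `EngineState` enum here is a simplified stand-in, and the free function `el_offline` and its parameters are hypothetical names chosen for the example.

```rust
/// Simplified stand-in for the engine's externally visible state
/// (an assumption for this sketch, not the real Lighthouse type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EngineState {
    Synced,
    Syncing,
    Offline,
    AuthFailed,
}

/// Hypothetical helper: compute the value reported as `el_offline`.
///
/// The EL counts as offline if its status is `Offline` or `AuthFailed`,
/// or if the most recent `newPayload` call returned an error.
pub fn el_offline(state: EngineState, last_new_payload_errored: bool) -> bool {
    let engine_down = matches!(state, EngineState::Offline | EngineState::AuthFailed);
    engine_down || last_new_payload_errored
}
```

Under these assumptions, `el_offline(EngineState::Synced, true)` is `true`: a node whose EL answers the upcheck is still flagged if its last `newPayload` call failed.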

Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out on some calls, it's unlikely to time out on the `upcheck` call, which is just `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status online.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients should prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.
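The fallback behaviour described above (prefer healthy BNs, but still use the rest) can be sketched as a stable sort. This is an illustration under assumptions: `Candidate`, `counts_as_synced` and `order_candidates` are hypothetical names for this example and are not taken from the validator client code.

```rust
/// Hypothetical view of one beacon node as seen by the validator client.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Candidate {
    pub name: String,
    pub is_syncing: bool,
    pub el_offline: bool,
}

impl Candidate {
    /// A node with an offline EL is treated as unsynced.
    fn counts_as_synced(&self) -> bool {
        !self.is_syncing && !self.el_offline
    }
}

/// Order candidates so that synced nodes are tried first.
/// Unsynced nodes are kept, not dropped, so they are still used as a last resort.
pub fn order_candidates(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // `sort_by_key` is stable and `false < true`, so synced nodes move to the
    // front while the configured order is preserved within each group.
    candidates.sort_by_key(|c| !c.counts_as_synced());
    candidates
}
```

If every node reports `el_offline`, the order is left unchanged and all nodes remain usable, which matches the "network as a whole is in trouble" case.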

Additional Changes

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.

beacon_node/execution_layer/src/lib.rs

pawanjay176

LGTM. Just had a small question

beacon_node/execution_layer/src/lib.rs

michaelsproul · 2023-05-17T01:43:31Z

Ready for final review and merge. I've added more aggressive polling to the VC, after noticing that the previous lazy polling strategy didn't work on Goerli

paulhauner · 2023-05-17T04:10:07Z

Flagging as backwards-incompat so we mention the addition of el_offline to node/syncing in the release notes.

I believe v4.2.0 will be backwards compatible with earlier VCs since we don't have #[serde(deny_unknown_fields)] on SyncingData.

paulhauner

Very nice! Approved with a few questions/comments.

paulhauner · 2023-05-17T04:00:20Z

beacon_node/execution_layer/src/engines.rs

@@ -238,6 +238,11 @@ impl Engine {
        **self.state.read().await == EngineStateInternal::Synced
    }

+    /// Returns `true` if the engine has a status other than synced or syncing.
+    pub async fn is_offline(&self) -> bool {
+        EngineState::from(**self.state.read().await) == EngineState::Offline


Do you have thoughts on catching EngineState::AuthFailed as well? I think the newPayload failures would catch an auth failure in practice. I don't feel strongly either way.

I think it's accurate. If the auth is wrong, then the EL is effectively offline.

I imagine this would only be an issue during initial setup, or if resyncing the EE and deleting the JWT secret.

paulhauner · 2023-05-17T04:03:12Z

beacon_node/execution_layer/src/lib.rs

+    ///
+    /// This is used *only* in the informational sync status endpoint, so that a VC using this
+    /// node can prefer another node with a healthier EL.
+    last_new_payload_errored: RwLock<bool>,


I had originally expected that we'd track the last fcU error rather than the last newPayload error.

My reasoning was that it's the last call we'd do in the block import process so it has the most up-to-date information about EE state. However I see now that if we fail a newPayload then we won't end up calling fcU and we'd end up kinda stuck.

If we're choosing just one, then I now prefer newPayload. No changes suggested, just sharing my reasoning.
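A minimal sketch of the tracking idea discussed in this thread: the flag follows the outcome of the most recent `newPayload` call only, so a later successful upcheck cannot mask a failure. The `PayloadTracker` type and its methods are hypothetical, and `std::sync::RwLock` is used here only to keep the example dependency-free (the real field may use a different lock type).

```rust
use std::sync::RwLock;

/// Hypothetical holder for the "last newPayload errored" flag.
pub struct PayloadTracker {
    last_new_payload_errored: RwLock<bool>,
}

impl PayloadTracker {
    pub fn new() -> Self {
        Self { last_new_payload_errored: RwLock::new(false) }
    }

    /// Record the outcome of a `newPayload` call: an error sets the flag,
    /// a success clears it.
    pub fn record_new_payload<T, E>(&self, result: &Result<T, E>) {
        *self.last_new_payload_errored.write().unwrap() = result.is_err();
    }

    /// A successful upcheck deliberately leaves the flag untouched,
    /// so it cannot hide an earlier `newPayload` failure.
    pub fn record_upcheck_ok(&self) {}

    pub fn last_new_payload_errored(&self) -> bool {
        *self.last_new_payload_errored.read().unwrap()
    }
}
```

In this sketch the flag is only cleared by the next successful `newPayload`, which is why tracking that call avoids the "stuck" situation described above for fcU.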

paulhauner · 2023-05-17T04:05:37Z

beacon_node/execution_layer/src/lib.rs

@@ -1116,18 +1131,6 @@ impl<T: EthSpec> ExecutionLayer<T> {
    }

    /// Maps to the `engine_newPayload` JSON-RPC call.
-    ///


Nice comment cleanups 👍

michaelsproul · 2023-05-17T05:36:40Z

I believe v4.2.0 will be backwards compatible with earlier VCs since we don't have #[serde(deny_unknown_fields)] on SyncingData.

Yeah I tested this when deploying to Goerli as well. The old VC doesn't mind the new field, and the new VC doesn't mind if it's not there

michaelsproul · 2023-05-17T05:39:41Z

The doppelganger protection tests are failing because the EL is offline, but I think this is fixed by #3807. Maybe we could (ambitiously) try batching them together, or failing that, merging #3807 and then updating this PR.

michaelsproul · 2023-05-17T05:51:37Z

bors r+


bors · 2023-05-17T08:55:44Z

Pull request successfully merged into unstable.

Build succeeded!


## Issue Addressed

#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after #4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in #4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
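The "log only on an offline-to-online transition" behaviour can be sketched as follows. This is an assumption-laden illustration: `Connectivity`, `NodeStatus`, `mark_online` and `mark_offline` are hypothetical names, and the caller is expected to emit the log line itself when `mark_online` returns `true`.

```rust
/// Hypothetical connectivity state for one beacon node.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Connectivity {
    Offline,
    Online,
}

pub struct NodeStatus {
    connectivity: Connectivity,
}

impl NodeStatus {
    /// Nodes start out offline until the first successful check.
    pub fn new() -> Self {
        Self { connectivity: Connectivity::Offline }
    }

    /// Record a successful check. Returns `true` only when the node was
    /// previously offline, i.e. when `Connected to beacon node` should be logged.
    pub fn mark_online(&mut self) -> bool {
        let was_offline = self.connectivity == Connectivity::Offline;
        self.connectivity = Connectivity::Online;
        was_offline
    }

    /// Record a failed check.
    pub fn mark_offline(&mut self) {
        self.connectivity = Connectivity::Offline;
    }
}
```

With per-slot polling, repeated calls to `mark_online` on an already-online node return `false`, so the message is not repeated every slot.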


michaelsproul added 3 commits May 16, 2023 16:31

Implement el_offline and use it in the VC

2d6e4d6

Overhaul RequireSynced usage

9766347

Clippy fixes

cda5820

michaelsproul added the val-client (Relates to the validator client binary), ready-for-review (The code is ready for review) and v4.2.0 (Q2 2023) labels May 16, 2023

Update beacon_node/execution_layer/src/lib.rs

1723c95

pawanjay176 approved these changes May 16, 2023

michaelsproul added 2 commits May 17, 2023 10:45

Address review comments, fix tests

3248e71

Poll all BNs every slot

a54f96a

paulhauner self-requested a review May 17, 2023 01:59

paulhauner added the backwards-incompat (Backwards-incompatible API change) label May 17, 2023

paulhauner approved these changes May 17, 2023

michaelsproul added the ready-for-merge (This PR is ready to merge) label and removed the ready-for-review (The code is ready for review) label May 17, 2023

bors bot changed the title ~~Implement el_offline and use it in the VC~~ [Merged by Bors] - Implement el_offline and use it in the VC May 17, 2023

bors bot closed this May 17, 2023

This was referenced May 17, 2023

Implement el_offline #4291

Closed

[Merged by Bors] - v4.2.0 #4309

Closed

michaelsproul mentioned this pull request May 22, 2023

[Merged by Bors] - Avoid excessive logging of BN online status #4315

Closed

torfbolt mentioned this pull request May 23, 2023

Nimbus reports el_offline=true even when synced status-im/nimbus-eth2#4987

Closed

yorickdowne mentioned this pull request May 25, 2023

Feature: Support el_offline in eth/v1/node/syncing ChainSafe/lodestar#5542

Closed

michaelsproul deleted the el-offline branch May 31, 2023 00:42

nflaig mentioned this pull request Oct 17, 2024

fix: return el_offline as true in syncing status response if auth failed ChainSafe/lodestar#7175

Merged
