[Merged by Bors] - Implement `el_offline` and use it in the VC #4295

Conversation
LGTM. Just had a small question
Ready for final review and merge. I've added more aggressive polling to the VC, after noticing that the previous lazy polling strategy didn't work on Goerli.

Flagging as I believe v4.2.0 will be backwards compatible with earlier VCs since we don't have […]
Very nice! Approved with a few questions/comments.
```diff
@@ -238,6 +238,11 @@ impl Engine {
         **self.state.read().await == EngineStateInternal::Synced
     }
 
+    /// Returns `true` if the engine has a status other than synced or syncing.
+    pub async fn is_offline(&self) -> bool {
+        EngineState::from(**self.state.read().await) == EngineState::Offline
```
Do you have thoughts on catching `EngineState::AuthFailed` as well? I think the newPayload failures would catch an auth failure in practice. I don't feel strongly either way.
I think it's accurate. If the auth is wrong, then the EL is effectively offline.
I imagine this would only be an issue during initial setup, or if resyncing the EE and deleting the JWT secret.
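For context, a minimal sketch of the state mapping being discussed, assuming the variant sets implied by the diff above (the exact Lighthouse enums differ; this is illustrative, not the real definitions):

```rust
/// Fine-grained internal engine state (illustrative subset).
#[derive(Copy, Clone, PartialEq, Eq)]
enum EngineStateInternal {
    Synced,
    Syncing,
    Offline,
    AuthFailed,
}

/// Coarse state consumed by `is_offline` (hypothetical variant set).
#[derive(Copy, Clone, PartialEq, Eq)]
enum EngineState {
    Synced,
    Syncing,
    Offline,
}

impl From<EngineStateInternal> for EngineState {
    fn from(state: EngineStateInternal) -> Self {
        match state {
            EngineStateInternal::Synced => EngineState::Synced,
            EngineStateInternal::Syncing => EngineState::Syncing,
            // A wrong or missing JWT makes the engine API unusable, so
            // `AuthFailed` collapses to `Offline`, as discussed above.
            EngineStateInternal::Offline | EngineStateInternal::AuthFailed => EngineState::Offline,
        }
    }
}
```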
```rust
    ///
    /// This is used *only* in the informational sync status endpoint, so that a VC using this
    /// node can prefer another node with a healthier EL.
    last_new_payload_errored: RwLock<bool>,
```
I had originally expected that we'd track the last fcU error rather than the last newPayload error.
My reasoning was that it's the last call we'd do in the block import process so it has the most up-to-date information about EE state. However I see now that if we fail a newPayload then we won't end up calling fcU and we'd end up kinda stuck.
If we're choosing just one, then I now prefer newPayload. No changes suggested, just sharing my reasoning.
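A rough sketch of the write and read sides of such a flag, assuming a tokio `RwLock`; the method names here are hypothetical, not the PR's actual API:

```rust
use tokio::sync::RwLock;

struct ExecutionLayer {
    /// Whether the most recent `newPayload` call returned an error.
    last_new_payload_errored: RwLock<bool>,
}

impl ExecutionLayer {
    /// Hypothetical wrapper: record the outcome of every `newPayload` call.
    /// fcU is skipped after a failed `newPayload`, which is why this call
    /// (and not `forkchoiceUpdated`) is the one worth tracking.
    async fn record_new_payload_result<T, E>(&self, result: &Result<T, E>) {
        *self.last_new_payload_errored.write().await = result.is_err();
    }

    /// Read side, consulted by the informational sync status endpoint.
    async fn last_new_payload_errored(&self) -> bool {
        *self.last_new_payload_errored.read().await
    }
}
```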
```diff
@@ -1116,18 +1131,6 @@ impl<T: EthSpec> ExecutionLayer<T> {
     }
 
-    /// Maps to the `engine_newPayload` JSON-RPC call.
-    ///
```
Nice comment cleanups 👍
Yeah I tested this when deploying to Goerli as well. The old VC doesn't mind the new field, and the new VC doesn't mind if it's not there.
The doppelganger protection tests are failing because the EL is offline, but I think this is fixed by #3807. Maybe we could (ambitiously) try batching them together, or failing that, merging #3807 and then updating this PR.
bors r+
## Issue Addressed

Closes #4291, part of #3613.

## Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).
- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes (see the sketch after this description).
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite of the intended effect, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

## Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out on some calls, it's unlikely to time out on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

## Additional Changes

- Add utility method `ForkName::latest`, which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
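On the VC side, the "mark as unsynced" behaviour described above might reduce to something like this sketch, assuming `serde` for deserialization (the struct shape follows the `/eth/v1/node/syncing` response; the health predicate is an assumption, not the PR's exact code):

```rust
use serde::Deserialize;

/// Subset of the `/eth/v1/node/syncing` response (other fields omitted).
#[derive(Deserialize)]
struct SyncingData {
    is_syncing: bool,
    #[serde(default)]
    el_offline: bool, // absent on pre-v4.2.0 BNs; defaults to false
}

/// Hypothetical VC-side check: a node with an offline EL is treated as
/// unsynced, so it sorts after healthy nodes but can still be used.
fn node_is_synced(status: &SyncingData) -> bool {
    !status.is_syncing && !status.el_offline
}
```

The `#[serde(default)]` is one way to get the backwards compatibility noted earlier in the thread: an old BN that omits the field parses cleanly, and an old VC simply ignores the new field.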
Pull request successfully merged into unstable. Build succeeded!
## Issue Addressed

#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after #4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in #4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
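A minimal sketch of the transition-based logging described here; the health enum and function are illustrative stand-ins, not the actual VC types:

```rust
#[derive(Copy, Clone, PartialEq, Eq)]
enum NodeHealth {
    Online,
    Offline,
}

/// Log `Connected to beacon node` only on an Offline -> Online transition,
/// instead of on every per-slot health check.
fn update_health(previous: &mut NodeHealth, current: NodeHealth) {
    if *previous == NodeHealth::Offline && current == NodeHealth::Online {
        println!("Connected to beacon node");
    }
    *previous = current;
}

fn main() {
    let mut health = NodeHealth::Offline;
    update_health(&mut health, NodeHealth::Online); // logs once
    update_health(&mut health, NodeHealth::Online); // silent on later checks
}
```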