-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change getHealth to compare optimistically confirmed slots #33651
Conversation
784126f
to
570f4a5
Compare
90597ed
to
108cb32
Compare
Doing so allows for a health_check_slot_distance of 0. While this may be stricter than many operator would want to actually use on nodes, TestValidator uses a value of 0 so we need this for unit tests.
99fde92
to
df6689a
Compare
Wait, this actually works? That is, a nodes knows the highest optimistically confirmed slot even if the highest bank it's currently replaying is older than that slot? 🤔 |
Yessir! |
via monitoring gossip votes? |
Codecov Report
@@ Coverage Diff @@
## master #33651 +/- ##
=======================================
Coverage 81.8% 81.8%
=======================================
Files 806 806
Lines 217595 217574 -21
=======================================
+ Hits 178065 178093 +28
+ Misses 39530 39481 -49 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for improving this!
can we run a node with this change, minus the removal of the old version, that polls itself ever 5-10s and submits both the health distances via metrics? |
Sure. I tracked
Lastly, I left incremental snapshot interval at 100 slots and dropped the health check slot distance to 50. I left
The mechanism from this PR reported the node as healthy starting just before 02:26 whereas the old mechanism stabilized around 02:27:30 First off, here is a blow up of the legend to see which color is which trace:
Lastly, here are the distance plots; technically the same information as above just with a different way of viewing: So, things are somewhat similar on capable machine. One instance where the new mechanism is night and day better is with an incremental snapshot interval. For example, here is the new vs. old healthy if I use an incremental snapshot (and thus accounts hash) interval of 250 slots: Technically, me using a smaller-than-default (50) health check distance lengthens the duration of "unhealthyness". However, the old health mechanism would still exhibit the toggling between healthy/unhealthy in this case. But, the new mechanism would allow tightening this limit significantly without being dependent on known validators. |
awesome! let's get this in! |
The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c)
…bs#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus.
…ckport of #33651) (#33714) Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) Co-authored-by: steviez <[email protected]>
The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) # Conflicts: # rpc/src/rpc.rs
…ckport of #33651) (#33716) * Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) # Conflicts: # rpc/src/rpc.rs * Resolve merge conflicts --------- Co-authored-by: steviez <[email protected]>
…bs#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus.
* feat: moved common docs to repo * refactor: removed sidebar items * refactor: removed unused images * fix: terminology link * fix: introduction links * fix: developing links * refactor: fixed assorted links * fix: added back the home index * refactor: home page links * refactor: primary links * fix: links * fix: updated existing redirects * feat: added new redirects * refactor: moved cli index file to cli folder * feat: turned breadcrumbs on * feat: auto generated cli sidebar * refactor: page titles * feat: added usage and wallets categories * refactor: moved wallet-guide/cli * style: page titles * refactor: renamed file to install * style: page title * refactor: relocated file to cli/usage/index.md * style: page title * refactor: relocat detailed usage generator for cli commands * refactor: relocated clie usage files * refactor: relocated paper wallet file * refactor: relocated file system wallet doc * feat: added hardware wallet category * refactor: relocated hardware wallet overview * refactor: relocated ledger wallet doc * style: clie wallet titles * refactor(revert): relocated cli usage doc * refactor: relocated to examples * style: cli examples category title * style: usage doc title * refactor: relocated cli intro doc * style: category title * refactor: renamed file * refactor: renamed file * fix: cli links * refactor: relocated file * refactor: relocated files * fix: more cli links * refactor: sidebar order * fix: final cli links? * refactor: proposals * refactor: split sidebars * refactor: removed unused icons * refactor: relocated file * refactor: relocated file * refactor: relocated file * refactor: relocated file * feat: added architecture page * refactor: reloacted filed * refactor: adjusted header links * style: sidebar labels * feat: clusters sidebar details * style: sidebar label * refactor: relocate file * refactor: relocated files * refactor: relocated files * refactor: relocated files * style: validator sidebar * style: sidebar styles * refactor: internal links * style: sidebar order * fix: internal links * feat: master sidebar * refactor: removed unneeded h2 * fix: link redirects * refactor: relocated pages * style: runtime links * refactor: simplified runtime redirects * fix: internal redirect * refactor: moved proposals to dropdown * docs: Removes accounts-on-ramdisk section (#33655) * RPC: update websocket docs (#33460) * [rpc]: update websocket docs * rename rewards to showRewards * add remaining optional fields for slotsUpdates * update block subscription showRewards * Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. * build(deps): bump @babel/traverse from 7.19.6 to 7.23.2 in /docs (#33726) Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.19.6 to 7.23.2. - [Release notes](https://github.com/babel/babel/releases) - [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md) - [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse) --- updated-dependencies: - dependency-name: "@babel/traverse" dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * docs: move rpc info to rpc docs (#33723) docs: link fixes docs: link fixes docs: link fixes * Fix typos in documentation for Secp256k1 native program (#33796) * docs: outline requirement of stake in order to vote (#33842) * docs: outline requirement of stake in order to vote * pr feedback: move stake section up * chore: fix some typos (#33833) * fix spelling of "retrieved" * fix spelling of "should" * fix spelling of "comparisons" * docs: updating apt install to apt upgrade (#33920) * Fix some typo in the documentation (#34058) Co-authored-by: Andrew Fitzgerald <[email protected]> * fix: internal links * refactor: removed rpc api docs * refactor: removed rpc sidebar * fix: updated remaining rpc api links * refactor: removed final rpc /api route * refactor: removed dangling component files * refactor: changed copyright * fix: dangling ordered list * refactor: wording around solana docs * feat: home page content * refactor: updated docs url * Link to latest version of the off-chain message signing proposal in the docs (#34329) * docs: (cli) minor updates to deploy-a-program.md (#34307) * docs: (cli) minor updates to deploy-a-program.md * address review comments * remove unnecessary impl details from the docs about deploy command upgrade flow * clarify program redeploy section --------- Co-authored-by: norwnd <norwnd> * refactor: removed GA --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Brooks <[email protected]> Co-authored-by: Joe C <[email protected]> Co-authored-by: steviez <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jacob Creech <[email protected]> Co-authored-by: Nick Guo <[email protected]> Co-authored-by: Ashwin Sekar <[email protected]> Co-authored-by: Kevin Heavey <[email protected]> Co-authored-by: Max Kaplan <[email protected]> Co-authored-by: hugo-syn <[email protected]> Co-authored-by: Andrew Fitzgerald <[email protected]> Co-authored-by: norwnd <[email protected]>
Problem
The current
getHealth
mechanism checks a local accounts hash slot vs. those of other nodes as specified by--known-validator
. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie--incremental-snapshot-interval 500
) will 100% seegetHealth
return status behind at some point.Summary of Changes
Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. The cluster confirmed slot can be compared against the latest optimistically confirmed bank to determine how far behind the node is. This new comparison is much more granular than the old approach, and also has the benefit of not going haywire if a known validator has issues (no domino effect).
Testing
With this branch, I observed a node reporting a diminishing "slots behind" value as it caught up from a process restart. To be sure, I also added a hack to stall
ReplayStage
for 30 seconds every 1000 slots. Stalling for 30 seconds should result in my node falling behind about 60-75 slots; this hack plus passing--health-check-slot-distance 25
allowed me to see the node fall behind (as reported bysolana-validator monitor
) immediately after the 30s sleep, but then quickly catch up and continue to report a healthy status.Fixes #16957