excludes epoch-slots from nodes with unknown or different shred version #17899

behzadnouri · 2021-06-11T20:38:40Z

Problem

Inspecting the TDS gossip table shows that crds values of nodes with
different shred-versions are creeping in. Their epoch-slots accumulate
in ClusterSlots as bogus slots very far from the current root. These
slots are never purged, so ClusterSlots keeps consuming more
memory:
#17789
#14366 (comment)
#14366 (comment)

Summary of Changes

This commit updates ClusterInfo::get_epoch_slots to discard entries
from nodes with an unknown or different shred-version.

Follow-up commits will patch gossip so that it does not waste bandwidth
and memory on crds values of nodes with a different shred-version.
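
For illustration only, the filtering described above could look roughly like the sketch below. This is not the actual diff: `Pubkey`, `EpochSlots`, `filter_epoch_slots` and the `shred_versions` map are hypothetical stand-ins for the real gossip types, and the sketch assumes that a node's shred version can be looked up by its pubkey.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a node identity; not the real solana_sdk type.
type Pubkey = [u8; 32];

// Hypothetical, simplified epoch-slots record received over gossip.
#[derive(Clone, Debug, PartialEq)]
struct EpochSlots {
    from: Pubkey,
    slots: Vec<u64>,
}

/// Keeps only the epoch-slots whose origin has a known shred version
/// equal to ours. Entries from nodes with an unknown shred version
/// (no entry in the map) or a different one are discarded.
fn filter_epoch_slots(
    entries: Vec<EpochSlots>,
    shred_versions: &HashMap<Pubkey, u16>,
    self_shred_version: u16,
) -> Vec<EpochSlots> {
    entries
        .into_iter()
        .filter(|entry| shred_versions.get(&entry.from) == Some(&self_shred_version))
        .collect()
}
```

With a filter like this, epoch-slots from another cluster never reach ClusterSlots, so they cannot add slots far from the current root.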

carllin · 2021-06-11T21:11:16Z

Ah nice catch! I believe @ryoqun suspected this as well here: #14366 (comment), but as @t-nelson pointed out, #14366 (comment), we thought shred versions should be keeping networks apart.

Are the push/pull shred version filters in gossip missing something?

behzadnouri · 2021-06-11T21:14:36Z

> Ah nice catch! I believe @ryoqun suspected this as well here: #14366 (comment), but as @t-nelson pointed out, #14366 (comment), we thought shred versions should be keeping networks apart.
>
> Are the push/pull shred version filters in gossip missing something?

Yeah, there are some, but in general the shred version checks are not very consistent.
I am working on some changes to purge/drop values if shred versions do not match.

codecov · 2021-06-11T22:21:14Z

Codecov Report

Merging #17899 (e3831da) into master (a501707) will increase coverage by 0.0%.
The diff coverage is 96.9%.

@@           Coverage Diff           @@
##           master   #17899   +/-   ##
=======================================
  Coverage    82.6%    82.6%           
=======================================
  Files         431      431           
  Lines      121184   121217   +33     
=======================================
+ Hits       100137   100177   +40     
+ Misses      21047    21040    -7

…on (#17899) (#17916)

Inspecting the TDS gossip table shows that crds values of nodes with different shred-versions are creeping in. Their epoch-slots accumulate in ClusterSlots as bogus slots very far from the current root. These slots are never purged, so ClusterSlots keeps consuming more memory:
#17789
#14366 (comment)
#14366 (comment)

This commit updates ClusterInfo::get_epoch_slots to discard entries from nodes with an unknown or different shred-version.

Follow-up commits will patch gossip so that it does not waste bandwidth and memory on crds values of nodes with a different shred-version.

(cherry picked from commit 985280e)

Co-authored-by: behzad nouri

Crds values of nodes with different shred versions are creeping into the gossip table, resulting in runtime issues such as the one addressed in solana-labs#17899.

This commit works towards enforcing more checks and filtering based on shred version by adding the necessary mapping and API to the gossip table. Once populated, the pubkey->shred-version mapping persists as long as there are any values associated with the pubkey.
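
A minimal sketch of how such a mapping might behave, assuming a per-pubkey count of stored values decides when the entry is dropped. `ShredVersionTable`, `insert_value`, `remove_value` and `get_shred_version` are hypothetical names for illustration, not the actual gossip table API.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a node identity; not the real solana_sdk type.
type Pubkey = [u8; 32];

/// Hypothetical index from pubkey to shred version. The mapping is kept
/// for as long as at least one value from that pubkey is still stored.
#[derive(Default)]
struct ShredVersionTable {
    shred_versions: HashMap<Pubkey, u16>,
    value_counts: HashMap<Pubkey, usize>,
}

impl ShredVersionTable {
    /// Records one more value from `pubkey`. Values that carry a shred
    /// version (assumed here: only some do) also update the mapping.
    fn insert_value(&mut self, pubkey: Pubkey, shred_version: Option<u16>) {
        *self.value_counts.entry(pubkey).or_insert(0) += 1;
        if let Some(version) = shred_version {
            self.shred_versions.insert(pubkey, version);
        }
    }

    /// Records that one value from `pubkey` was removed. The mapping is
    /// dropped only when no values from that pubkey remain.
    fn remove_value(&mut self, pubkey: &Pubkey) {
        let remaining = match self.value_counts.get_mut(pubkey) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return,
        };
        if remaining == 0 {
            self.value_counts.remove(pubkey);
            self.shred_versions.remove(pubkey);
        }
    }

    fn get_shred_version(&self, pubkey: &Pubkey) -> Option<u16> {
        self.shred_versions.get(pubkey).copied()
    }
}
```

The point of keeping the mapping alive while any value remains is that other values from the same node (epoch-slots, for example) can still be checked against its shred version even after the value that originally carried the version has been replaced or purged.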

When starting a validator, the node initially joins gossip with shred_version = 0, until it adopts the entrypoint's shred-version: https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417

Depending on the load on the entrypoint, adopting the entrypoint's shred-version through gossip is sometimes very slow. This causes several problems in gossip, because we have to partially support shred_version == 0, which is a source of crds values leaking from one cluster to another; e.g. see solana-labs#17899 and the other linked issues there.

In order to remove shred_version == 0 from gossip, this commit adds shred-version to the ip-echo-server response. Once the entrypoints are updated, on validator start-up, if --expected_shred_version is not specified we will obtain the shred-version from the entrypoint using ip-echo-server.
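
The start-up decision described above can be sketched as follows. This is a simplified illustration, not the validator code: `resolve_shred_version` and its parameters are hypothetical, the network call to ip-echo-server is left out, and treating a zero value as "unknown" is an assumption of this sketch.

```rust
/// Assumption of this sketch: shred version 0 means "unknown".
fn is_known(version: &u16) -> bool {
    *version != 0
}

/// Picks the shred version to join gossip with.
///
/// `expected` stands for the value given on the command line, if any.
/// `from_entrypoint` stands for the value reported by the entrypoint's
/// ip-echo-server, if the entrypoint already includes it in its response.
fn resolve_shred_version(expected: Option<u16>, from_entrypoint: Option<u16>) -> Option<u16> {
    expected
        .filter(is_known)
        .or_else(|| from_entrypoint.filter(is_known))
}
```

A caller that gets `None` back has no usable shred version yet, which is exactly the situation this change aims to make rare once the entrypoints are updated.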

Crds values of nodes with different shred versions are creeping into the gossip table, resulting in runtime issues such as the one addressed in #17899.

This commit works towards enforcing more checks and filtering based on shred version by adding the necessary mapping and API to the gossip table. Once populated, the pubkey->shred-version mapping persists as long as there are any values associated with the pubkey.

(cherry picked from commit 5a99fa3)

Co-authored-by: behzad nouri

* adds shred-version to ip-echo-server response

  When starting a validator, the node initially joins gossip with shred_version = 0, until it adopts the entrypoint's shred-version: https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417

  Depending on the load on the entrypoint, adopting the entrypoint's shred-version through gossip is sometimes very slow. This causes several problems in gossip, because we have to partially support shred_version == 0, which is a source of crds values leaking from one cluster to another; e.g. see #17899 and the other linked issues there.

  In order to remove shred_version == 0 from gossip, this commit adds shred-version to the ip-echo-server response. Once the entrypoints are updated, on validator start-up, if --expected_shred_version is not specified we will obtain the shred-version from the entrypoint using ip-echo-server.

  (cherry picked from commit 598093b)

  # Conflicts:
  # Cargo.lock
  # net-utils/Cargo.toml
  # programs/bpf/Cargo.lock

* removes backport merge conflicts

* obtains shred-version from entrypoint's ip-echo-server in validator-main

  (cherry picked from commit 58e1152)

Co-authored-by: behzad nouri

…on (backport #17899) (#19551)

* excludes epoch-slots from nodes with unknown or different shred version (#17899)

  Inspecting the TDS gossip table shows that crds values of nodes with different shred-versions are creeping in. Their epoch-slots accumulate in ClusterSlots as bogus slots very far from the current root. These slots are never purged, so ClusterSlots keeps consuming more memory:
  #17789
  #14366 (comment)
  #14366 (comment)

  This commit updates ClusterInfo::get_epoch_slots to discard entries from nodes with an unknown or different shred-version.

  Follow-up commits will patch gossip so that it does not waste bandwidth and memory on crds values of nodes with a different shred-version.

  (cherry picked from commit 985280e)

  # Conflicts:
  # core/src/cluster_info.rs

* removes backport merge conflicts

Co-authored-by: behzad nouri

behzadnouri linked an issue Jun 11, 2021 that may be closed by this pull request

ClusterSlots uses too much memory #17789

Closed

behzadnouri requested review from carllin, ryoqun and sakridge June 11, 2021 20:40

behzadnouri force-pushed the epoch-slots-shred-version branch from 755718c to e3831da June 11, 2021 20:41

behzadnouri mentioned this pull request Jun 11, 2021

solana-validator leaks memory (but at very slow pace) #14366

Closed

carllin approved these changes Jun 11, 2021

behzadnouri merged commit 985280e into solana-labs:master Jun 13, 2021

behzadnouri deleted the epoch-slots-shred-version branch June 13, 2021 14:08

behzadnouri added the v1.7 label Jun 13, 2021

mergify bot mentioned this pull request Jun 13, 2021

excludes epoch-slots from nodes with unknown or different shred version (backport #17899) #17916

Merged

behzadnouri mentioned this pull request Jun 14, 2021

adds mapping from nodes pubkeys to their shred-version #17940

Merged

behzadnouri mentioned this pull request Jun 18, 2021

adds shred-version to ip-echo-server response #18066

Merged

brooksprumo mentioned this pull request Aug 23, 2021

backport 19361 v17 #19380

Closed

behzadnouri added the v1.6 label Sep 1, 2021

mergify bot mentioned this pull request Sep 1, 2021

excludes epoch-slots from nodes with unknown or different shred version (backport #17899) #19551

Merged
