This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

excludes epoch-slots from nodes with unknown or different shred version #17899

Merged

Conversation

behzadnouri
Contributor

Problem

Inspecting the TDS gossip table shows that crds values from nodes with
different shred versions are creeping in. Their epoch-slots accumulate
in ClusterSlots, producing bogus slots very far from the current root.
These slots are never purged, so ClusterSlots keeps consuming more
memory:
#17789
#14366 (comment)
#14366 (comment)
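
To illustrate why slots far ahead of the root keep memory pinned, here is a minimal sketch. It assumes a purge rule that only drops slots at or below the current root; `ToyClusterSlots` and its methods are hypothetical stand-ins and are not the real ClusterSlots API.

```rust
use std::collections::BTreeMap;

type Slot = u64;

/// Toy stand-in for ClusterSlots: slot -> number of nodes reporting it.
/// Hypothetical type, for illustration only.
#[derive(Default)]
struct ToyClusterSlots {
    slots: BTreeMap<Slot, usize>,
}

impl ToyClusterSlots {
    /// Records that some node reported `slot` in its epoch-slots.
    fn record(&mut self, slot: Slot) {
        *self.slots.entry(slot).or_insert(0) += 1;
    }

    /// Assumed purge rule: keep only slots strictly above the new root.
    fn purge_up_to_root(&mut self, root: Slot) {
        self.slots.retain(|slot, _| *slot > root);
    }

    fn len(&self) -> usize {
        self.slots.len()
    }
}
```

Under that assumption, a slot reported by a node from another cluster that lies millions of slots ahead of the local root survives every purge until the root finally passes it, so such entries only accumulate.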

Summary of Changes

This commit updates ClusterInfo::get_epoch_slots to discard entries
from nodes with an unknown or different shred version.

Follow-up commits will patch gossip so that it does not waste bandwidth
and memory on crds values from nodes with a different shred version.
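
The filtering described in the summary could look roughly like the sketch below. It is not the code from this pull request: `EpochSlots` is a simplified stand-in, `filter_epoch_slots` and the `shred_versions` lookup map are hypothetical names, and the sketch assumes that a node's shred version is known only if it appears in that map.

```rust
use std::collections::HashMap;

// Stand-in for the real public key type.
type Pubkey = [u8; 32];

/// Simplified stand-in for an epoch-slots crds value.
#[derive(Debug, Clone, PartialEq)]
struct EpochSlots {
    from: Pubkey,
    slots: Vec<u64>,
}

/// Keeps only epoch-slots whose origin is known to use `self_shred_version`.
/// A node missing from `shred_versions` has an unknown shred version and
/// its entries are discarded, as are entries from a different version.
fn filter_epoch_slots(
    epoch_slots: Vec<EpochSlots>,
    shred_versions: &HashMap<Pubkey, u16>,
    self_shred_version: u16,
) -> Vec<EpochSlots> {
    epoch_slots
        .into_iter()
        .filter(|entry| shred_versions.get(&entry.from) == Some(&self_shred_version))
        .collect()
}
```

Comparing against `Some(&self_shred_version)` handles both rejected cases in one expression: an unknown node yields `None`, and a node from another cluster yields a different value.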

@behzadnouri behzadnouri linked an issue Jun 11, 2021 that may be closed by this pull request
@carllin
Contributor

carllin commented Jun 11, 2021

Ah nice catch! I believe @ryoqun suspected this as well here: #14366 (comment), but as @t-nelson pointed out, #14366 (comment), we thought shred versions should be keeping networks apart.

Are the push/pull shred version filter in gossip missing something?

@behzadnouri
Contributor Author

> Ah nice catch! I believe @ryoqun suspected this as well here: #14366 (comment), but as @t-nelson pointed out, #14366 (comment), we thought shred versions should be keeping networks apart.
>
> Are the push/pull shred version filter in gossip missing something?

Yeah, there are some, but in general the shred version checks are not very consistent.
I am working on some changes to purge or drop values whose shred versions do not match.

@codecov

codecov bot commented Jun 11, 2021

Codecov Report

Merging #17899 (e3831da) into master (a501707) will increase coverage by 0.0%.
The diff coverage is 96.9%.

@@           Coverage Diff           @@
##           master   #17899   +/-   ##
=======================================
  Coverage    82.6%    82.6%           
=======================================
  Files         431      431           
  Lines      121184   121217   +33     
=======================================
+ Hits       100137   100177   +40     
+ Misses      21047    21040    -7     

@behzadnouri behzadnouri merged commit 985280e into solana-labs:master Jun 13, 2021
@behzadnouri behzadnouri deleted the epoch-slots-shred-version branch June 13, 2021 14:08
mergify bot pushed a commit that referenced this pull request Jun 13, 2021
…on (#17899)

(cherry picked from commit 985280e)
mergify bot added a commit that referenced this pull request Jun 13, 2021
…on (#17899) (#17916)

(cherry picked from commit 985280e)

Co-authored-by: behzad nouri <[email protected]>
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Jun 14, 2021
Crds values of nodes with different shred versions are creeping into
the gossip table, resulting in runtime issues such as the one addressed in:
solana-labs#17899

This commit works towards enforcing more checks and filtering based on
shred version by adding the necessary mapping and API to the gossip table.
Once populated, the pubkey->shred-version mapping persists as long as there
are any values associated with the pubkey.
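
One way to picture a mapping with that lifetime rule is the sketch below. It is an illustration under assumptions, not the gossip table code: `ShredVersionTable`, `Entry`, and the method names are hypothetical, and the sketch assumes the shred version is learned from some inserted values while others carry none.

```rust
use std::collections::HashMap;

// Stand-in for the real public key type.
type Pubkey = [u8; 32];

/// Per-pubkey bookkeeping (hypothetical layout).
struct Entry {
    shred_version: Option<u16>, // None until a value reveals it
    num_values: usize,          // values currently stored for this pubkey
}

#[derive(Default)]
struct ShredVersionTable {
    entries: HashMap<Pubkey, Entry>,
}

impl ShredVersionTable {
    /// Registers one stored value for `pubkey`. `shred_version` is Some
    /// only when the value itself carries the node's shred version.
    fn insert_value(&mut self, pubkey: Pubkey, shred_version: Option<u16>) {
        let entry = self.entries.entry(pubkey).or_insert(Entry {
            shred_version: None,
            num_values: 0,
        });
        entry.num_values += 1;
        if shred_version.is_some() {
            entry.shred_version = shred_version;
        }
    }

    /// Unregisters one value; the mapping is dropped with the last value.
    fn remove_value(&mut self, pubkey: &Pubkey) {
        let now_empty = match self.entries.get_mut(pubkey) {
            Some(entry) => {
                entry.num_values = entry.num_values.saturating_sub(1);
                entry.num_values == 0
            }
            None => false,
        };
        if now_empty {
            self.entries.remove(pubkey);
        }
    }

    fn get_shred_version(&self, pubkey: &Pubkey) -> Option<u16> {
        self.entries.get(pubkey).and_then(|entry| entry.shred_version)
    }
}
```

Counting values per pubkey lets the shred version outlive the value that first revealed it, while still releasing the entry once nothing from that node remains.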
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Jun 18, 2021
When starting a validator, the node initially joins gossip with
shred_version = 0, until it adopts the entrypoint's shred-version:
https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417

Depending on the load on the entrypoint, adopting the entrypoint's
shred-version through gossip is sometimes very slow. This causes
several problems in gossip because we have to partially support
shred_version == 0, which is a source of crds values leaking from one
cluster to another. For example, see
solana-labs#17899
and the other issues linked there.

In order to remove shred_version == 0 from gossip, this commit adds
shred-version to the ip-echo-server response. Once the entrypoints are
updated, on validator start-up, if --expected_shred_version is not
specified, we will obtain the shred-version from the entrypoint using
ip-echo-server.
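
The start-up decision described above might be sketched as follows. The names here are assumptions for illustration: `IpEchoResponse` and `resolve_shred_version` are hypothetical, the optional field models entrypoints that have not been updated yet, and treating a reported version of 0 as unknown is an assumption of this sketch.

```rust
use std::net::IpAddr;

/// Hypothetical shape of the extended ip-echo-server response.
#[derive(Debug, Clone, PartialEq)]
struct IpEchoResponse {
    address: IpAddr,            // the caller's address as seen by the server
    shred_version: Option<u16>, // None if the entrypoint is not yet updated
}

/// Picks the shred version to start gossip with.
/// An explicitly expected version wins; otherwise the entrypoint is queried.
/// Returns None when no usable version could be obtained.
fn resolve_shred_version(
    expected_shred_version: Option<u16>,
    query_entrypoint: impl FnOnce() -> Option<IpEchoResponse>,
) -> Option<u16> {
    expected_shred_version.or_else(|| {
        query_entrypoint()
            .and_then(|response| response.shred_version)
            // Assumption: 0 means "unknown", so it is not accepted here.
            .filter(|&version| version != 0)
    })
}
```

Passing the query as a closure keeps the network call lazy, so nothing is sent to the entrypoint when the operator already supplied a version.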
behzadnouri added a commit that referenced this pull request Jun 18, 2021
mergify bot pushed a commit that referenced this pull request Jun 18, 2021
(cherry picked from commit 5a99fa3)
mergify bot added a commit that referenced this pull request Jun 18, 2021
(cherry picked from commit 5a99fa3)

Co-authored-by: behzad nouri <[email protected]>
behzadnouri added a commit to behzadnouri/solana that referenced this pull request Jun 20, 2021
behzadnouri added a commit that referenced this pull request Jun 21, 2021
mergify bot pushed a commit that referenced this pull request Jun 21, 2021
(cherry picked from commit 598093b)

# Conflicts:
#	Cargo.lock
#	net-utils/Cargo.toml
#	programs/bpf/Cargo.lock
behzadnouri added a commit that referenced this pull request Jun 21, 2021
* adds shred-version to ip-echo-server response

(cherry picked from commit 598093b)

# Conflicts:
#	Cargo.lock
#	net-utils/Cargo.toml
#	programs/bpf/Cargo.lock

* removes backport merge conflicts

* obtains shred-version from entrypoint's ip-echo-server in validator-main

(cherry picked from commit 58e1152)

Co-authored-by: behzad nouri <[email protected]>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
mergify bot pushed a commit that referenced this pull request Sep 1, 2021
…on (#17899)

(cherry picked from commit 985280e)

# Conflicts:
#	core/src/cluster_info.rs
mergify bot added a commit that referenced this pull request Sep 1, 2021
…on (backport #17899) (#19551)

* excludes epoch-slots from nodes with unknown or different shred version (#17899)

(cherry picked from commit 985280e)

# Conflicts:
#	core/src/cluster_info.rs

* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>
Development

Successfully merging this pull request may close these issues.

ClusterSlots uses too much memory
2 participants