This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

checks for duplicate validator instances using gossip #14018

Merged
4 commits merged from behzadnouri:validator-instance into solana-labs:master on Dec 9, 2020

Conversation

behzadnouri (Contributor)

Problem

Two running instances of the same validator are not good.

Summary of Changes

Added a new node-instance crds value which contains a random token and a timestamp of when the instance was created. If a node sees itself in gossip with a different token value and a more recent timestamp, it will stop.
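For context, here is a minimal Rust sketch of what such a node-instance value and its duplicate check could look like. The field and method names are illustrative assumptions, not the exact code merged in this PR.

use rand::Rng;
use solana_sdk::pubkey::Pubkey;

// Illustrative sketch only: a gossip (crds) value identifying one running
// instance of a node.
pub struct NodeInstance {
    from: Pubkey,   // the node's identity pubkey
    wallclock: u64, // refreshed whenever the value is (re)pushed over gossip
    timestamp: u64, // when this instance was started, in milliseconds
    token: u64,     // random value distinguishing instances started in the same ms
}

impl NodeInstance {
    pub fn new(from: Pubkey, now: u64) -> Self {
        Self {
            from,
            wallclock: now,
            timestamp: now,
            token: rand::thread_rng().gen(),
        }
    }

    // True if `other` describes a different instance of the same node that
    // started no earlier than this one; in that case this instance should stop.
    pub fn check_duplicate(&self, other: &Self) -> bool {
        self.from == other.from
            && self.token != other.token
            && self.timestamp <= other.timestamp
    }
}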

self.instance.write().unwrap().update_wallclock(now);
let entries: Vec<_> = vec![
    CrdsData::ContactInfo(self.my_contact_info()),
    CrdsData::NodeInstance(self.instance.read().unwrap().clone()),
Contributor

Do we need to send NodeInstance continually? CrdsData::Version is just added once at startup

Contributor Author (@behzadnouri), Dec 8, 2020

That is a good question. Is there a risk that the value is not propagated if it is only pushed once? Something like the following:

There is an update_record_timestamp which extends the local timestamp of the associated values when we hear from a contact-info (on some, but not all, gossip paths). That may keep the node from getting purged from the crds table as long as its contact-info is propagated, but it seems that if the wallclock is too old, the value will not get propagated through gossip.
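Roughly, the concern is the expiration rule sketched below; this is only an illustration of the idea, not the actual crds code.

// Illustration: a crds value whose wallclock has fallen behind the local clock
// by more than some timeout is treated as stale and is no longer propagated,
// even if the node that originated it is still alive.
fn is_stale(value_wallclock_ms: u64, now_ms: u64, timeout_ms: u64) -> bool {
    now_ms.saturating_sub(value_wallclock_ms) > timeout_ms
}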

Contributor

Is there a risk that the value is not propagated if it is only pushed once?

That sounds like it. I guess version has this same problem then. Maybe not a big deal here though; you're now more knowledgeable in this area than I am, so your call :)

Contributor Author

I am leaning towards sending these periodically, just in case they get dropped because of network partitions, message expiration, etc. It should not add too much overhead.

Contributor

sgtm

@@ -300,6 +300,7 @@ pub struct ClusterInfo {
     socket: UdpSocket,
     local_message_pending_push_queue: RwLock<Vec<(CrdsValue, u64)>>,
     contact_debug_interval: u64,
+    instance: RwLock<NodeInstance>,
Contributor

can instance just be the unique u64 value? Everything else can be constructed at send time, no? And not changing, so doesn't need a lock.

Contributor Author

We at least need:

  • a timestamp of when the instance started, to stop the older one;
  • a randomly generated token, to distinguish the instances if both start in the same millisecond (e.g. programmatically).

That said, we do not need to update the wallclock here, so I can remove the RwLock and only update the wallclock on the value which is pushed through gossip.
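A self-contained sketch of that lock-free arrangement, with hypothetical names: the stored instance is built once and never mutated, and only the copy about to be pushed gets a fresh wallclock.

use solana_sdk::pubkey::Pubkey;

#[derive(Clone, Copy)]
struct NodeInstance {
    from: Pubkey,
    wallclock: u64, // refreshed on the pushed copy, never on the stored value
    timestamp: u64, // instance start time, immutable
    token: u64,     // random token, immutable
}

impl NodeInstance {
    fn with_wallclock(mut self, now: u64) -> Self {
        self.wallclock = now;
        self
    }
}

struct ClusterInfo {
    // No RwLock needed: `instance` is constructed once at startup.
    instance: NodeInstance,
    // ...other fields elided...
}

impl ClusterInfo {
    fn push_instance(&self, now: u64) {
        // Stamp a fresh wallclock on a copy right before pushing it over gossip;
        // the real code would wrap this in CrdsData::NodeInstance and enqueue it.
        let value = self.instance.with_wallclock(now);
        let _ = value;
    }
}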

Contributor

I think if instance were a UTC timestamp at millisecond or even second accuracy, we'd get 99.9% of the way there. If two instances happen to start at exactly the same time, then they both exit.

Contributor Author

The current code no longer has the RwLock, so I guess it should be good.

An alternative is to use std::time::SystemTime::now() as the timestamp of when the validator started (as opposed to milliseconds from timestamp()), and assume that no two validators can start at the exact same SystemTime. Then we would no longer need the random token, and a single immutable SystemTime should suffice.
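For comparison, a sketch of that SystemTime-only alternative (hypothetical names): there is no random token, and the start time alone identifies the instance, assuming no two instances of the same validator start at the exact same SystemTime.

use std::time::SystemTime;

struct NodeInstanceAlt {
    started: SystemTime, // captured once at validator startup
}

impl NodeInstanceAlt {
    fn new() -> Self {
        Self {
            started: SystemTime::now(),
        }
    }

    // The older instance stops: exit if the copy of ourselves seen in gossip
    // reports a more recent start time.
    fn should_exit(&self, other: &Self) -> bool {
        other.started > self.started
    }
}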

Contributor

wfm

codecov bot commented Dec 8, 2020

Codecov Report

Merging #14018 (1c3dc25) into master (e1a4251) will increase coverage by 0.0%.
The diff coverage is 88.2%.

@@           Coverage Diff            @@
##           master   #14018    +/-   ##
========================================
  Coverage    82.1%    82.1%            
========================================
  Files         381      381            
  Lines       93798    93918   +120     
========================================
+ Hits        77022    77135   +113     
- Misses      16776    16783     +7     

mvines (Contributor) commented Dec 9, 2020

@behzadnouri - is this PR ready for another review now?

behzadnouri (Contributor Author)

@behzadnouri - is this PR ready for another review now?

Yes, the changes were:

  • adding std::process::exit to stop all threads (with a TODO to do a clean exit, but that requires a bigger code change to pass ValidatorExit around); see the sketch after this list.
  • removing the RwLock from ClusterInfo.instance in favor of an immutable value; the wallclock is updated on a clone before pushing the value over gossip.
  • pushing the instance value early on in gossip along with version, in addition to periodic pushes with ContactInfo every ~7 seconds.
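A minimal sketch of the exit path from the first bullet; the actual change wires this through gossip and error handling, and the function name and message here are made up.

// Hypothetical helper: once a duplicate instance of this node is detected via
// a NodeInstance value in gossip, log and terminate the whole process so that
// every thread stops. (A clean shutdown via ValidatorExit is left as a TODO.)
fn on_duplicate_instance_detected(identity: &str) -> ! {
    eprintln!(
        "duplicate running instance of validator {} detected; shutting down",
        identity
    );
    std::process::exit(1)
}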

@mvines mvines merged commit 1d267ea into solana-labs:master Dec 9, 2020
@behzadnouri behzadnouri deleted the validator-instance branch December 9, 2020 18:33
@behzadnouri behzadnouri restored the validator-instance branch December 9, 2020 18:33
@behzadnouri behzadnouri deleted the validator-instance branch December 9, 2020 19:29
mergify bot added a commit that referenced this pull request Dec 9, 2020
…4027)

* checks for duplicate validator instances using gossip

(cherry picked from commit 8cd5eb9)

# Conflicts:
#	core/src/cluster_info.rs
#	core/src/crds_value.rs
#	core/src/result.rs

* pushes node-instance along with version early in gossip

(cherry picked from commit 5421981)

# Conflicts:
#	core/src/cluster_info.rs

* removes RwLock on ClusterInfo.instance

(cherry picked from commit 895d7d6)

# Conflicts:
#	core/src/cluster_info.rs

* std::process::exit to kill all threads

(cherry picked from commit 1d267ea)

* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>
mvines pushed a commit that referenced this pull request Dec 9, 2020
…4028)

* checks for duplicate validator instances using gossip

(cherry picked from commit 8cd5eb9)

# Conflicts:
#	core/src/cluster_info.rs

* pushes node-instance along with version early in gossip

(cherry picked from commit 5421981)

* removes RwLock on ClusterInfo.instance

(cherry picked from commit 895d7d6)

# Conflicts:
#	core/src/cluster_info.rs

* std::process::exit to kill all threads

(cherry picked from commit 1d267ea)

* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>