This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

checks for duplicate validator instances using gossip #14018

Merged
4 commits merged from behzadnouri:validator-instance into solana-labs:master on Dec 9, 2020

Conversation

behzadnouri (Contributor)

Problem

Two running instances of the same validator are not good.

Summary of Changes

Added a new node-instance crds value which contains a random token and a timestamp of when the instance was created. If a node sees itself in gossip with a different token value and a more recent timestamp, it will stop.
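For context, here is a minimal Rust sketch of what such a node-instance value and its duplicate check could look like. The field and method names are illustrative assumptions, not the exact code merged in this PR.

use rand::Rng;
use solana_sdk::pubkey::Pubkey;

// Illustrative sketch only: a gossip (crds) value identifying one running
// instance of a node.
pub struct NodeInstance {
    from: Pubkey,   // the node's identity pubkey
    wallclock: u64, // refreshed whenever the value is (re)pushed over gossip
    timestamp: u64, // when this instance was started, in milliseconds
    token: u64,     // random value distinguishing instances started in the same ms
}

impl NodeInstance {
    pub fn new(from: Pubkey, now: u64) -> Self {
        Self {
            from,
            wallclock: now,
            timestamp: now,
            token: rand::thread_rng().gen(),
        }
    }

    // True if `other` describes a different instance of the same node that
    // started no earlier than this one; in that case this instance should stop.
    pub fn check_duplicate(&self, other: &Self) -> bool {
        self.from == other.from
            && self.token != other.token
            && self.timestamp <= other.timestamp
    }
}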

self.instance.write().unwrap().update_wallclock(now);
let entries: Vec<_> = vec![
    CrdsData::ContactInfo(self.my_contact_info()),
    CrdsData::NodeInstance(self.instance.read().unwrap().clone()),
Contributor

Do we need to send NodeInstance continually? CrdsData::Version is just added once at startup

Contributor Author (@behzadnouri), Dec 8, 2020

That is a good question. Is there a risk that the value is not propagated if it is only pushed once? Something like the following:

There is an update_record_timestamp which extends the local timestamp of the associated values when we hear from a contact-info (on some, but not all, gossip paths). That may keep the node from getting purged from the crds table as long as its contact-info is propagated, but it seems that if the wallclock is too old, the value will not get propagated through gossip.
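Roughly, the concern is the expiration rule sketched below; this is only an illustration of the idea, not the actual crds code.

// Illustration: a crds value whose wallclock has fallen behind the local clock
// by more than some timeout is treated as stale and is no longer propagated,
// even if the node that originated it is still alive.
fn is_stale(value_wallclock_ms: u64, now_ms: u64, timeout_ms: u64) -> bool {
    now_ms.saturating_sub(value_wallclock_ms) > timeout_ms
}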

Contributor

Is there a risk that the value is not propagated if it is only pushed once?

That sounds like it. I guess version has this same problem then. Maybe not a big deal here though; you're now more knowledgeable in this area than I am, so your call :)

Contributor Author

I am leaning towards sending these periodically, just in case they get dropped because of network partitions, message expiration, etc. It should not add too much overhead.

Contributor

sgtm

@@ -300,6 +300,7 @@ pub struct ClusterInfo {
     socket: UdpSocket,
     local_message_pending_push_queue: RwLock<Vec<(CrdsValue, u64)>>,
     contact_debug_interval: u64,
+    instance: RwLock<NodeInstance>,
Contributor

can instance just be the unique u64 value? Everything else can be constructed at send time, no? And not changing, so doesn't need a lock.

Contributor Author

We at least need:

  • a timestamp of when the instance started, to stop the older one;
  • a randomly generated token, to distinguish the instances if both start in the same millisecond (e.g. programmatically).

That said, we do not need to update the wallclock here, so I can remove the RwLock and only update the wallclock on the value which is pushed through gossip.
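A self-contained sketch of that lock-free arrangement, with hypothetical names: the stored instance is built once and never mutated, and only the copy about to be pushed gets a fresh wallclock.

use solana_sdk::pubkey::Pubkey;

#[derive(Clone, Copy)]
struct NodeInstance {
    from: Pubkey,
    wallclock: u64, // refreshed on the pushed copy, never on the stored value
    timestamp: u64, // instance start time, immutable
    token: u64,     // random token, immutable
}

impl NodeInstance {
    fn with_wallclock(mut self, now: u64) -> Self {
        self.wallclock = now;
        self
    }
}

struct ClusterInfo {
    // No RwLock needed: `instance` is constructed once at startup.
    instance: NodeInstance,
    // ...other fields elided...
}

impl ClusterInfo {
    fn push_instance(&self, now: u64) {
        // Stamp a fresh wallclock on a copy right before pushing it over gossip;
        // the real code would wrap this in CrdsData::NodeInstance and enqueue it.
        let value = self.instance.with_wallclock(now);
        let _ = value;
    }
}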

Contributor

I think if instance were a UTC timestamp at millisecond or even second accuracy, we'd get 99.9% of the way there. If two instances happen to start at exactly the same time, then they both exit.

Contributor Author

The current code no longer has the RwLock, so I guess it should be good.

An alternative is to use std::time::SystemTime::now() as the timestamp of when the validator started (as opposed to milliseconds from timestamp()), and assume that no two validators can start at the exact same SystemTime. Then we would no longer need the random token, and a single immutable SystemTime should suffice.
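For comparison, a sketch of that SystemTime-only alternative (hypothetical names): there is no random token, and the start time alone identifies the instance, assuming no two instances of the same validator start at the exact same SystemTime.

use std::time::SystemTime;

struct NodeInstanceAlt {
    started: SystemTime, // captured once at validator startup
}

impl NodeInstanceAlt {
    fn new() -> Self {
        Self {
            started: SystemTime::now(),
        }
    }

    // The older instance stops: exit if the copy of ourselves seen in gossip
    // reports a more recent start time.
    fn should_exit(&self, other: &Self) -> bool {
        other.started > self.started
    }
}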

Contributor

wfm

codecov bot commented Dec 8, 2020

Codecov Report

Merging #14018 (1c3dc25) into master (e1a4251) will increase coverage by 0.0%.
The diff coverage is 88.2%.

@@           Coverage Diff            @@
##           master   #14018    +/-   ##
========================================
  Coverage    82.1%    82.1%            
========================================
  Files         381      381            
  Lines       93798    93918   +120     
========================================
+ Hits        77022    77135   +113     
- Misses      16776    16783     +7     

mvines (Contributor) commented Dec 9, 2020

@behzadnouri - is this PR ready for another review now?

behzadnouri (Contributor Author)

@behzadnouri - is this PR ready for another review now?

Yes, the changes were:

  • adding std::process::exit to stop all threads (with a TODO to do a clean exit, but that requires a bigger code change to pass ValidatorExit around); see the sketch after this list.
  • removing the RwLock from ClusterInfo.instance in favor of an immutable value; the wallclock is updated on a clone before pushing the value over gossip.
  • pushing the instance value early on in gossip along with version, in addition to periodic pushes with ContactInfo every ~7 seconds.
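A minimal sketch of the exit path from the first bullet; the actual change wires this through gossip and error handling, and the function name and message here are made up.

// Hypothetical helper: once a duplicate instance of this node is detected via
// a NodeInstance value in gossip, log and terminate the whole process so that
// every thread stops. (A clean shutdown via ValidatorExit is left as a TODO.)
fn on_duplicate_instance_detected(identity: &str) -> ! {
    eprintln!(
        "duplicate running instance of validator {} detected; shutting down",
        identity
    );
    std::process::exit(1)
}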

@mvines mvines merged commit 1d267ea into solana-labs:master Dec 9, 2020
@behzadnouri behzadnouri deleted the validator-instance branch December 9, 2020 18:33
@behzadnouri behzadnouri restored the validator-instance branch December 9, 2020 18:33
@behzadnouri behzadnouri deleted the validator-instance branch December 9, 2020 19:29
mergify bot added a commit that referenced this pull request Dec 9, 2020
…4027)

* checks for duplicate validator instances using gossip

(cherry picked from commit 8cd5eb9)

# Conflicts:
#	core/src/cluster_info.rs
#	core/src/crds_value.rs
#	core/src/result.rs

* pushes node-instance along with version early in gossip

(cherry picked from commit 5421981)

# Conflicts:
#	core/src/cluster_info.rs

* removes RwLock on ClusterInfo.instance

(cherry picked from commit 895d7d6)

# Conflicts:
#	core/src/cluster_info.rs

* std::process::exit to kill all threads

(cherry picked from commit 1d267ea)

* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>
mvines pushed a commit that referenced this pull request Dec 9, 2020
…4028)

* checks for duplicate validator instances using gossip

(cherry picked from commit 8cd5eb9)

# Conflicts:
#	core/src/cluster_info.rs

* pushes node-instance along with version early in gossip

(cherry picked from commit 5421981)

* removes RwLock on ClusterInfo.instance

(cherry picked from commit 895d7d6)

# Conflicts:
#	core/src/cluster_info.rs

* std::process::exit to kill all threads

(cherry picked from commit 1d267ea)

* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>