checks for duplicate validator instances using gossip #14018
Conversation
Force-pushed from 0e6527d to 3a1506d
core/src/cluster_info.rs
Outdated
self.instance.write().unwrap().update_wallclock(now);
let entries: Vec<_> = vec![
    CrdsData::ContactInfo(self.my_contact_info()),
    CrdsData::NodeInstance(self.instance.read().unwrap().clone()),
Do we need to send NodeInstance continually? CrdsData::Version is just added once at startup.
That is a good question. Is there a risk that the value is not propagated if it is only pushed once? Something like:
- The network is partitioned during the push process.
- Or the value becomes outdated and is not pushed any further.
- And the value expires and is purged from the crds table before it is returned in pull-requests.
- Or, similarly, it is not processed in pull-request responses because it is too old.

There is an update_record_timestamp which extends the local-timestamp of the associated values when we hear from a contact-info (on some but not all gossip paths). That may stop the node from getting purged from the crds table as long as its contact-info is propagated, but it seems like if the wallclock is too old, the value will not get propagated through gossip.
Is there a risk that the value is not propagated if it is only pushed once?
That sounds like it. I guess version has this same problem then. Maybe not a big deal here though; you're now more knowledgeable in this area than I am, so your call :)
I am leaning towards sending these periodically, just in case they get dropped because of network partitions, message expiration, etc. It should not add too much overhead.
sgtm
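For illustration only, here is a minimal, std-only sketch of the periodic re-push settled on in this thread: a stand-in value whose wallclock is refreshed each time it is queued for gossip, so it never looks stale enough to be dropped or purged. The names (NodeInstanceValue, refresh_and_push, timestamp_ms) are invented for this sketch and are not the PR's actual API.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Milliseconds since the unix epoch, mirroring gossip "wallclock" values.
fn timestamp_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_millis() as u64
}

/// Stand-in for the crds value carrying the instance identity.
#[derive(Clone, Debug)]
struct NodeInstanceValue {
    wallclock: u64, // refreshed on every push so the value is never "too old" to relay
    payload: u64,   // the immutable instance identity (timestamp/token) would live here
}

/// One iteration of the periodic gossip push. Re-sending with a fresh
/// wallclock guards against the value being lost to partitions or purges.
fn refresh_and_push(value: &mut NodeInstanceValue, push_queue: &mut Vec<NodeInstanceValue>) {
    value.wallclock = timestamp_ms();
    push_queue.push(value.clone());
}

fn main() {
    let mut value = NodeInstanceValue { wallclock: 0, payload: 42 };
    let mut push_queue = Vec::new();
    // In the real code this would run inside the gossip loop at some interval.
    refresh_and_push(&mut value, &mut push_queue);
    println!("queued {} value(s), wallclock = {}", push_queue.len(), push_queue[0].wallclock);
}
```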
core/src/cluster_info.rs
Outdated
@@ -300,6 +300,7 @@ pub struct ClusterInfo {
    socket: UdpSocket,
    local_message_pending_push_queue: RwLock<Vec<(CrdsValue, u64)>>,
    contact_debug_interval: u64,
    instance: RwLock<NodeInstance>,
Can instance just be the unique u64 value? Everything else can be constructed at send time, no? And since it's not changing, it doesn't need a lock.
We at least need:
- a timestamp of when the instance started, to stop the older one,
- and also a randomly generated token, to distinguish the instances if both start at the same millisecond (e.g. programmatically).

Nevertheless, we do not need to update the wallclock here, so I can remove the RwLock and only update the wallclock on the value which is pushed through gossip.
I think if instance was a UTC timestamp at ms or even second accuracy we'd get 99.9% of the way there. If two instances happen to start at exactly the same time, then they both exit.
The current code no longer has the RwLock, so I guess it should be good. An alternative is to use std::time::SystemTime::now() as the timestamp of when the validator started (as opposed to milliseconds from timestamp()), and assume that no two validators can start at the exact same SystemTime. Then we would no longer need the random token, and a single immutable SystemTime should suffice.
wfm
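To make the outcome of this thread concrete, here is a small, std-only sketch of an immutable instance identity built from a start timestamp plus a random token, with no lock. The names are invented for the sketch, and the token is derived from a hash of the clock and pid only to avoid a rand dependency; the real code presumably uses a proper RNG. The SystemTime-only alternative floated above is noted in a comment.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::{SystemTime, UNIX_EPOCH};

/// Immutable identity of one validator instance; no RwLock is needed
/// because nothing here changes after startup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct InstanceId {
    timestamp: u64, // when this instance started, in milliseconds
    token: u64,     // random tie-breaker for instances started in the same millisecond
}

impl InstanceId {
    fn new() -> Self {
        let now = SystemTime::now();
        let timestamp = now.duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
        // Stand-in for a real RNG: hash the wall clock and the pid.
        // (The alternative discussed above would drop the token entirely and
        // rely on a full-resolution SystemTime being unique per instance.)
        let mut hasher = DefaultHasher::new();
        now.hash(&mut hasher);
        std::process::id().hash(&mut hasher);
        let token = hasher.finish();
        Self { timestamp, token }
    }
}

fn main() {
    let id = InstanceId::new();
    println!("instance started at {} ms, token {:#018x}", id.timestamp, id.token);
}
```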
Codecov Report

@@            Coverage Diff            @@
##           master    #14018   +/-   ##
=========================================
  Coverage     82.1%     82.1%
=========================================
  Files          381       381
  Lines        93798     93918    +120
=========================================
+ Hits         77022     77135    +113
- Misses       16776     16783      +7
Force-pushed from e532d50 to 1c3dc25
@behzadnouri - is this PR ready for another review now?
Yes, the changes were:
…4027)
* checks for duplicate validator instances using gossip (cherry picked from commit 8cd5eb9)
  # Conflicts:
  #   core/src/cluster_info.rs
  #   core/src/crds_value.rs
  #   core/src/result.rs
* pushes node-instance along with version early in gossip (cherry picked from commit 5421981)
  # Conflicts:
  #   core/src/cluster_info.rs
* removes RwLock on ClusterInfo.instance (cherry picked from commit 895d7d6)
  # Conflicts:
  #   core/src/cluster_info.rs
* std::process::exit to kill all threads (cherry picked from commit 1d267ea)
* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>
…4028)
* checks for duplicate validator instances using gossip (cherry picked from commit 8cd5eb9)
  # Conflicts:
  #   core/src/cluster_info.rs
* pushes node-instance along with version early in gossip (cherry picked from commit 5421981)
* removes RwLock on ClusterInfo.instance (cherry picked from commit 895d7d6)
  # Conflicts:
  #   core/src/cluster_info.rs
* std::process::exit to kill all threads (cherry picked from commit 1d267ea)
* removes backport merge conflicts

Co-authored-by: behzad nouri <[email protected]>
Problem
Running two instances of the same validator is not good.
Summary of Changes
Added a new node-instance crds value which contains a random token and the timestamp of when the instance was created. If a node sees a node-instance value for its own identity with a different token and a more recent timestamp, it will stop.
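Below is a hedged sketch of the check described in the summary: on seeing a node-instance value that claims the same identity but carries a different token and an at-least-as-recent timestamp, the process exits. The struct, method names, and exact tie-breaking rule are assumptions for illustration, not the PR's exact implementation.

```rust
use std::process;

/// Instance identity as gossiped: which node it claims to be, when that
/// instance started, and a random token to break exact-timestamp ties.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeInstance {
    from: [u8; 32], // the node's public key (stubbed as raw bytes here)
    timestamp: u64, // instance start time in milliseconds
    token: u64,     // random per-instance value
}

impl NodeInstance {
    /// True if `other` claims to be the same node but is a different instance
    /// whose start time is at least as recent as ours. Treating an exact tie
    /// as a duplicate means both instances exit, per the discussion above.
    fn is_duplicate_of_me(&self, other: &NodeInstance) -> bool {
        self.from == other.from
            && self.token != other.token
            && other.timestamp >= self.timestamp
    }
}

/// Stand-in for the gossip handler: on seeing a duplicate of ourselves,
/// exit the whole process (all threads) rather than keep two instances running.
fn handle_gossiped_instance(me: &NodeInstance, seen: &NodeInstance) {
    if me.is_duplicate_of_me(seen) {
        eprintln!("duplicate validator instance detected; exiting");
        process::exit(1);
    }
}

fn main() {
    let me = NodeInstance { from: [7; 32], timestamp: 1_000, token: 1 };
    let other = NodeInstance { from: [7; 32], timestamp: 1_500, token: 2 };
    // With these inputs the process exits; swap the timestamps and this instance survives.
    handle_gossiped_instance(&me, &other);
}
```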