
Make polkadot behave correctly if node changes peer-id at restart #3673

Closed
alexggh opened this issue Mar 13, 2024 · 13 comments

@alexggh
Contributor

alexggh commented Mar 13, 2024

During the investigation for #3314 we discovered that a number of validators change their PeerId at every restart, which does not work properly right now.

Workarounds

  1. Don't change the PeerId at every restart: the peer-id is saved on disk next to the database in the network folder, so you can simply reuse it by always providing the --base-path argument and making sure the network folder is present where you restart the node.
  2. If you have already restarted, a key rotation would fix the issue.

[WIP] Things that need fixing

In general, all the places where we assume there is just one PeerId per AuthorityId are breaking; from a quick scan by me and @bkchr:

With a Kusama-sized validator set, nodes won't be able to connect to nodes that aren't in the reserved set.

in_peers: super::MIN_GOSSIP_PEERS as u32 / 2 - 1,
out_peers: super::MIN_GOSSIP_PEERS as u32 / 2 - 1,

Some places where the assumption that there is just one PeerId per authority exists:

.find_map(|addr| parse_addr(addr).ok().map(|(p, _)| p));

"Ignoring new peer with duplicate authority ID as a bearer of that identity"

Another thing I observed while testing on Versi that is probably off: ConnectToResolvedValidators gets called sometimes with the new address and sometimes with the old address.

@bkchr
Member

bkchr commented Mar 13, 2024

The first link is not really a problem, just more of an explanation of why these people could not connect to all the other validators.

```rust
async fn issue_connection_request<Sender>(
    &mut self,
    sender: &mut Sender,
    authorities: Vec<AuthorityDiscoveryId>,
) where
    Sender: overseer::GossipSupportSenderTrait,
{
    let num = authorities.len();
    let mut validator_addrs = Vec::with_capacity(authorities.len());
    let mut failures = 0;
    let mut resolved = HashMap::with_capacity(authorities.len());
    for authority in authorities {
        if let Some(addrs) =
            self.authority_discovery.get_addresses_by_authority_id(authority.clone()).await
        {
            validator_addrs.push(addrs.clone());
            resolved.insert(authority, addrs);
        } else {
            failures += 1;
            gum::debug!(
                target: LOG_TARGET,
                "Couldn't resolve addresses of authority: {:?}",
                authority
            );
        }
    }
```

This function needs to be called multiple times per session. Currently we call it only once, and if a validator address was not known at that point, we will never resolve it in this session.
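A minimal sketch of what retrying within a session could look like, assuming a tokio timer; `AuthorityId`, `resolve_addresses` and the retry interval below are illustrative stand-ins, not the actual gossip-support or authority-discovery API:

```rust
use std::collections::HashSet;
use std::time::Duration;

type AuthorityId = String; // stand-in for AuthorityDiscoveryId

// Stand-in for a call into authority-discovery.
async fn resolve_addresses(_authority: &AuthorityId) -> Option<Vec<String>> {
    None
}

// Keep re-issuing the resolution for authorities we could not resolve yet,
// instead of giving up after a single attempt at the start of the session.
async fn keep_resolving(mut unresolved: HashSet<AuthorityId>) {
    let mut interval = tokio::time::interval(Duration::from_secs(10 * 60));
    while !unresolved.is_empty() {
        interval.tick().await;
        let mut done = Vec::new();
        for authority in &unresolved {
            if resolve_addresses(authority).await.is_some() {
                done.push(authority.clone());
            }
        }
        for authority in done {
            unresolved.remove(&authority);
        }
    }
}
```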

@alexggh
Contributor Author

alexggh commented Mar 13, 2024

The first link is not really a problem

Messed up the links, fixed it in the description.

@alexggh alexggh self-assigned this Mar 19, 2024
@alexggh alexggh moved this from Backlog to In Progress in parachains team board Mar 19, 2024
@alexggh
Contributor Author

alexggh commented Mar 20, 2024

Looked a bit closer at the Kademlia DHT, and the biggest problem for this use-case seems to be that both records will continue living in the DHT on different nodes, so on any given node any of the following might be true:

  1. When querying the DHT a node discovers only the old record.
  2. When querying the DHT a node discovers only the new record.
  3. When querying the DHT a node discovers both the new record and the old record, but because they are sequentially advertised to authority-discovery only the last one will be saved. We can probably mitigate this by modifying authority-discovery to save both of them and expire the old record at a later date.

Both 1) and 3) will happen until the old record's TTL is met, which currently is 36 hours; after that the new record will be the only one in the network.

Why does this happen?

The way the Kademlia DHT works, the PeerId is used as a key for determining on which nodes records need to be replicated, and the receiving nodes re-replicate as well every 60 minutes. The replication factor is 20.
So, both the old records and the new records will continue living in the network on different nodes until the old one expires.

What can we do?

At least as far as I can tell, this behaviour is entrenched in the Kademlia DHT protocol. We can probably reduce the window the old entry lives for by playing with record_ttl (36h), record_replication_interval (1), record_publication_interval (24h), but there will still be a window where this happens.
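For reference, a minimal sketch of where these knobs live in rust-libp2p, assuming a libp2p version where the type is still called `KademliaConfig` (newer releases rename it to `kad::Config`). The values are only illustrative of the intervals discussed above, not what Substrate actually configures:

```rust
use std::time::Duration;
use libp2p::kad::KademliaConfig;

// Illustrative values only: 36h TTL, hourly re-replication, daily re-publication.
fn tuned_kad_config() -> KademliaConfig {
    let mut config = KademliaConfig::default();
    // How long any node keeps a record before it expires.
    config.set_record_ttl(Some(Duration::from_secs(36 * 60 * 60)));
    // How often a node holding a record re-replicates it to the closest peers.
    config.set_replication_interval(Some(Duration::from_secs(60 * 60)));
    // How often the original publisher re-publishes its own record.
    config.set_publication_interval(Some(Duration::from_secs(24 * 60 * 60)));
    config
}
```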

So, I'm inclined to think that changing the network key (PeerId) while a node is in the active set should not be considered a safe operation to perform, hence what I'm proposing is:

  • Modify the behaviour of the polkadot executable to not auto-generate the network key (PeerId) if none is provided and exit with an error (kudos to @s0me0ne-unkn0wn for giving me the idea).
  • Add an explicit flag for auto-generating the network key at startup, with proper documentation that this is not a safe operation to do while the node is in the active set and what will happen.

@bkchr @dmitry-markin @eskimor Thoughts?

@bkchr
Member

bkchr commented Mar 20, 2024

Here are my notes from last week that I wanted to put into an issue actually:

I think one problem exists in authority discovery:

```rust
let old_addresses = self.authority_id_to_addresses.insert(authority_id.clone(), addresses);
let old_peer_ids = addresses_to_peer_ids(&old_addresses.unwrap_or_default());

// Add the new peer ids
peer_ids.difference(&old_peer_ids).for_each(|new_peer_id| {
    self.peer_id_to_authority_ids
        .entry(*new_peer_id)
        .or_default()
        .insert(authority_id.clone());
});
```
and
```rust
.flatten()
.take(MAX_ADDRESSES_PER_AUTHORITY)
```
If the validator exists with multiple PeerIds in the DHT (because of the restart), we may get multiple DHT events with different PeerIds. We should keep the different PeerIds around, maybe storing when we last received them. We also should not cap the maximum number of addresses based solely on the addresses; we should keep addresses per peer_id and then have a maximum number of peer_ids and addresses.
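A rough sketch of what "addresses per peer_id plus a cap on peer ids" could look like; the names, limits and eviction rule are purely illustrative and not the actual authority-discovery types:

```rust
use std::collections::HashMap;
use std::time::Instant;

const MAX_PEER_IDS_PER_AUTHORITY: usize = 4;
const MAX_ADDRESSES_PER_PEER_ID: usize = 10;

type PeerId = String;    // stand-in for libp2p::PeerId
type Multiaddr = String; // stand-in for libp2p::Multiaddr

/// Addresses known for a single authority, grouped by the peer id the record
/// belongs to, together with the time the record was last seen.
#[derive(Default)]
struct AuthorityAddresses {
    by_peer_id: HashMap<PeerId, (Vec<Multiaddr>, Instant)>,
}

impl AuthorityAddresses {
    fn insert(&mut self, peer_id: PeerId, mut addresses: Vec<Multiaddr>) {
        addresses.truncate(MAX_ADDRESSES_PER_PEER_ID);
        self.by_peer_id.insert(peer_id, (addresses, Instant::now()));

        // Keep the map bounded: evict the peer id we have seen least recently.
        while self.by_peer_id.len() > MAX_PEER_IDS_PER_AUTHORITY {
            let oldest = self
                .by_peer_id
                .iter()
                .min_by_key(|(_, (_, last_seen))| *last_seen)
                .map(|(peer_id, _)| peer_id.clone());
            match oldest {
                Some(peer_id) => {
                    self.by_peer_id.remove(&peer_id);
                }
                None => break,
            }
        }
    }
}
```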

```rust
// Let's directly finish the query, as we are only interested in a
// quorum of 1.
if let Some(kad) = self.kademlia.as_mut() {
    if let Some(mut query) = kad.query_mut(&id) {
        query.finish();
    }
}
```
Another issue is probably this. Here we should not end directly with a quorum of 1; instead we should probably fetch all the values.

This assumes the same as you are writing. However, I would like to try to optimize the current implementation given the points above. We can also go further and extend the record with a timestamp to find out which one is newer. We also don't verify the data in the DHT storage. The verification is only done in authority discovery. This means that the DHT could still contain invalid values.

  • Modify the behaviour of the polkadot executable to not auto-generate the network key (PeerId) if none is provided and exit with an error

Could be done as well, but should only be done for validators. Generally we probably need to improve docs around this stuff. There also exists generate-node-key to generate a key. Maybe we should just require that --node-key is passed when --validator is set.

@alexggh
Contributor Author

alexggh commented Mar 20, 2024

Another issue is this properly. Here we should not end directly with quorum of 1. Instead we should probably fetch all the values.

Ok, I see what you mean. So basically, if we let the query continue, we could probably end up discovering both records more often than not and have them both stored in authority-discovery. Let me check how that behaves.

alexggh added a commit that referenced this issue Mar 21, 2024
…e robust

In the case when nodes don't persist their node-key or they want to
generate a new one while being in the active set, things go wrong
because both the old addresses and the new ones will still be present in
the DHT, and because of the distributed nature of the DHT both will survive
in the network until the old ones expire, which takes 36 hours.
Nodes in the network will randomly resolve the authority-id to the old
address or the new one.

More details in: #3673

This PR proposes we mitigate this problem by:

1. Let the query for a DHT key retrieve all the results, which is usually
   bounded by the replication factor (20); currently we interrupt
   the query on the first result.
2. Modify the authority-discovery service to keep all the discovered
   addresses around for 24h since an address was last seen.
3. Plumb through other subsystems where the assumption was that an
   authority-id will resolve only to one PeerId. Currently,
   authority-discovery keeps just the last record it received from the DHT
   and queries the DHT every 10 minutes. But they could always receive
   only the old address, only the new address or a flip-flop between
   them depending on what node wins the race to provide the record.
4. Update gossip-support to try to resolve authorities more often than
   once per session.

This would give us a lot more chances for the nodes in the network to
discover not only the old address of the node but also the new one,
and should improve the time it takes for a node to be properly connected
in the network. The behaviour won't be deterministic because there is no
guarantee that all nodes will see the new record at least once, since
they could query only nodes that have the old one.

Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh
Contributor Author

alexggh commented Mar 21, 2024

Posted a draft PR with all the plumbing that would need to happen to make this a bit better than it currently is: #3786. Let me know what you think; don't be nitpicky, it is in a very rough form :D.

@dmitry-markin
Contributor

dmitry-markin commented Mar 25, 2024

So, I'm inclined to think that changing the network key (PeerId) while a node is in the active set should not be considered a safe operation to perform, hence what I'm proposing is:

I don't understand why people regenerate the PeerIds of validators in the first place. Obviously, a validator should have the actual database, so the contents of --base-path should be preserved across restarts. Why do they remove the network key file then?

Generally I'm for not regenerating the PeerIds of validators without a sound reason (is there one?).

Modify the behaviour of the polkadot executable to not auto-generate the network key (PeerId) if none is provided and exit with an error (kudos to @s0me0ne-unkn0wn for giving me the idea)

Maybe make this the default for validators only? This would change the interface and force operators to update their procedures, but we can print a nice error message if somebody starts a validator without a network key file present.

Add an explicit flag for auto-generating the network key at startup with proper documentation that this is not a safe operation to do while the node is in the active set and what will happen.

IMO, this should be an explicit cli command like generate-node-key that has to be invoked separately, otherwise the operators would just plug this --auto-generate-node-key option and keep everything as before.

@lexnv
Contributor

lexnv commented Mar 25, 2024

Likewise, we could fix this in the CLI and require validators to use --node-key with some improved documentation.

We can also go further and extend the record with a timestamp to find out which one is newer. We also don't verify the data in the DHT storage.

I think this sounds like a reasonable solution to me. We'd need to extend the DHT::ValueFound event with the kademlia record expiry time:

```rust
/// The DHT yielded results for the record request.
///
/// Returning the result grouped in (key, value) pairs as well as the request duration.
ValueFound(Vec<(RecordKey, Vec<u8>)>, Duration),
```

Then, in the authority-discovery, we could store only the most recent records, since we know that older records might contain stale addresses. While at it, we could also have a look at the MAX_ADDRESSES_PER_AUTHORITY and adjust it as Basti suggested 🙏

@alexggh
Contributor Author

alexggh commented Mar 25, 2024

I think this sounds like a reasonable solution to me. We'd need to extend the DHT::ValueFound event with the kademlia record expiry time:

I've been down that road; it won't work because the expiry is actually a function of the distance from the key to the node answering the request: https://github.com/libp2p/rust-libp2p/blob/master/protocols/kad/src/behaviour.rs#L1816. The expiry doesn't tell you how old the record is, because nodes farther from the key report a closer expiry, since they should not store the record for too long.
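Paraphrasing the linked rust-libp2p logic (this is a sketch, not a verbatim copy of the library code): the configured TTL is roughly halved for every position a storing node is beyond the `k` closest peers to the key, so a short expiry mostly reflects distance, not record age:

```rust
use std::time::Duration;

// Nodes beyond the k closest peers to the key store the record with a TTL
// halved per extra position, so the expiry they report is not a proxy for
// the record's age.
fn effective_ttl(configured_ttl: Duration, positions_beyond_k_closest: u32) -> Duration {
    Duration::from_secs(
        configured_ttl
            .as_secs()
            .checked_shr(positions_beyond_k_closest)
            .unwrap_or(0),
    )
}

fn main() {
    let ttl = Duration::from_secs(36 * 60 * 60);
    // One of the k closest nodes keeps the record for the full 36 hours...
    assert_eq!(effective_ttl(ttl, 0), ttl);
    // ...while a node two positions further away keeps it for only ~9 hours.
    assert_eq!(effective_ttl(ttl, 2).as_secs(), (36 * 60 * 60) >> 2);
}
```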

@alexggh
Contributor Author

alexggh commented Mar 25, 2024

I don't understand why people regenerate the PeerIds of validators in the first place. Obviously, a validator should have the actual database, so the contents of --base-path should be preserved across restarts. Why do they remove the network key file then?

In this case a large validator set (>50) were individually mounting the keystore and db in their pod, so they missed persisting the network folder:

.local/share/polkadot/chains/ksmcc3/db
.local/share/polkadot/chains/ksmcc3/keystore

@alexggh
Contributor Author

alexggh commented Mar 25, 2024

IMO, this should be an explicit cli command like generate-node-key that has to be invoked separately, otherwise the operators would just plug this --auto-generate-node-key option and keep everything as before.

I was thinking of naming it more like --i-know-i-m-getting-0-rewards-but-auto-generate-node-key :D

@alexggh
Contributor Author

alexggh commented Mar 27, 2024

PR to not allow auto-generation of the network key unless it is explicitly requested: #3852.

github-merge-queue bot pushed a commit that referenced this issue Apr 15, 2024
As discovered during the investigation of
#3314 and
#3673, there are active
validators which might accidentally change their network key during a
restart. That's not a safe operation when you are in the active set
because of the distributed nature of the DHT: the old records would still
exist in the network until they expire (36h). So unless they have a good
reason, validators should avoid changing their key when they restart
their nodes.

There is a parallel effort to improve this situation
(#3786), but those changes
are way more intrusive and will need more rigorous testing. Additionally,
they will reduce the time to less than 36h, but the propagation won't be
instant anyway, so not changing your network key during a restart should be
the safest way to run your node, unless you have a really good reason to
change it.

## Proposal
1. Do not auto-generate the network key if the key file does not exist
in the provided path. Nodes where the key file does not exist will get
the following error:
```
Error: 
   0: Starting an authorithy without network key in /home/alexggh/.local/share/polkadot/chains/ksmcc3/network/secret_ed25519.
      
       This is not a safe operation because the old identity still lives in the dht for 36 hours.
      
       Because of it your node might suffer from not being properly connected to other nodes for validation purposes.
      
       If it is the first time running your node you could use one of the following methods.
      
       1. Pass --unsafe-force-node-key-generation and make sure you remove it for subsequent node restarts
      
       2. Separetly generate the key with: polkadot key generate-node-key --file <YOUR_PATH_TO_NODE_KEY>
```

2. Add an explicit parameter, `--unsafe-force-node-key-generation`, for
nodes that do want to change their network key despite the warnings or
that run the node for the first time.

3. For `polkadot key generate-node-key`, add two new mutually exclusive
parameters, `base_path` and `default_base_path`, to help with generating
the key in the same path the main polkadot command would expect it.
 
4. Modify the installation scripts to auto-generate a key in the default
path if one is not already present there; this should help make
the executable work out of the box after an installation.

## Notes

Nodes that do not already have the key persisted will fail to start
after this change; however, I consider that better than the current
situation where they start but silently hide the fact that they might not be
properly connected to their peers.

## TODO
- [x] Make sure only nodes that are authorities on production chains
will be affected by this restriction.
- [x] Proper PRDOC, to make sure node operators are aware this is
coming.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
Co-authored-by: s0me0ne-unkn0wn <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jul 25, 2024
…e robust (#3786)

In the case when nodes don't persist their node-key or they want to
generate a new one while being in the active set, things go wrong
because both the old addresses and the new ones will still be present in
the DHT, and because of the distributed nature of the DHT both will survive
in the network until the old ones expire, which takes 36 hours. Nodes in
the network will randomly resolve the authority-id to the old address
or the new one.

More details in: #3673

This PR proposes we mitigate this problem by:

1. Let the query for a DHT key retrieve more than one result (4), which
is also bounded by the replication factor (20); currently we
interrupt the query on the first result.
~2. Modify the authority-discovery service to keep all the discovered
addresses around for 24h since an address was last seen.~
~3. Plumb through other subsystems where the assumption was that an
authority-id will resolve only to one PeerId. Currently,
authority-discovery keeps just the last record it received from the DHT and
queries the DHT every 10 minutes. But they could always receive only the
old address, only the new address or a flip-flop between them depending
on what node wins the race to provide the record.~

2. Extend the `SignedAuthorityRecord` with a signed creation_time.
3. Modify authority discovery to keep track of nodes that sent us an old
record, and once we are made aware of a new record, update the nodes we
know about with the new record.
4. Update gossip-support to try to resolve authorities more often than
once per session.

~This would give us a lot more chances for the nodes in the network to
discover not only the old address of the node but also the new one,
and should improve the time it takes for a node to be properly connected
in the network. The behaviour won't be deterministic because there is no
guarantee that all nodes will see the new record at least once, since
they could query only nodes that have the old one.~


## TODO
- [x] Add unit tests for the new paths.
- [x] Make sure the implementation is backwards compatible.
- [x] Evaluate if there are any bad consequences of letting the query
continue rather than finishing it at the first record found.
- [x] Bake the new changes in Versi.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
Co-authored-by: Alexandru Vasile <[email protected]>
@alexggh
Contributor Author

alexggh commented Jul 29, 2024

Problem mitigated with: 6720279

@alexggh alexggh closed this as completed Jul 29, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Completed in parachains team board Jul 29, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024