protocols/kad: Implement S-Kademlia's lookup over disjoint paths v2 #1473
Conversation
The extension paper S-Kademlia includes a proposal for lookups over disjoint paths. Within vanilla Kademlia, queries keep track of the closest nodes in a single bucket. Any adversary along the path can thus influence all future paths, in case they can come up with the next-closest (not overall closest) hops. S-Kademlia tries to solve this attack by querying over disjoint paths using multiple buckets. To adjust the libp2p Kademlia implementation accordingly, this change-set introduces an additional peers iterator: `DisjointClosestPeersIter`. This new iterator wraps around a set of `ClosestPeersIter`, giving each of them a share of the allowed parallelism, of the number of results to obtain, and of the set of initial nodes to contact. `DisjointClosestPeersIter` enforces that the `ClosestPeersIter` explore disjoint paths by instantly failing any peer that was already queried by a different iterator.
`into_result` does not return the target, but only the closest peers.
Replace the ClosestPeersIter with a DisjointClosestPeersIter with a disjoint_path config of 1, resulting in the same behavior.
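As a rough sketch of the shape described above (the `iters` and `yielded_peers` fields appear in snippets quoted later in this review; everything else here is simplified and not the exact implementation):

```rust
use std::collections::HashMap;

// Simplified stand-ins; the real types live in libp2p-kad.
type PeerId = u64;
struct ClosestPeersIter; // the existing single-path peers iterator

/// Rough shape of the composite iterator: one inner `ClosestPeersIter` per
/// disjoint path, plus a record of which path has "claimed" which peer, so a
/// peer yielded by one path can be failed immediately on every other path.
struct DisjointClosestPeersIter {
    // One inner iterator per disjoint path, each given a share of the
    // parallelism and of the number of results to obtain.
    iters: Vec<ClosestPeersIter>,
    // Index of the inner iterator a peer was yielded by, used e.g. as
    // `self.iters[self.yielded_peers[peer]].is_waiting(peer)`.
    yielded_peers: HashMap<PeerId, usize>,
}
```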
pub parallelism: usize,

// TODO: Document that it needs to be equal or larger than parallelism?
pub disjoint_paths: usize,
I don't think it makes sense to have this as a separate parameter from parallelism - the whole point of the multipaths approach is to run a query in parallel, there is no separate meaning to the idea of how wide the multipath is. It is just going to confuse people who call this API.
OTOH S/Kademlia does suggest that a value be stored in the `d` nodes around a particular key rather than the single node at a particular key, but this parameter can rather be represented elsewhere in the general Kademlia config.
///
/// The number of closest peers for which the iterator must obtain successful results
/// in order to finish successfully. Defaults to `K_VALUE`.
pub num_results: usize,
I don't see the point of this parameter and it's in neither the S/Kademlia paper nor my multipath version. If a peer fails to reply successfully to a query we just skip it, regardless of whether it is in the middle or at the end of a path.
This is copied from the `ClosestPeersIterConfig`, thus not a new concept but a way of maintaining the status quo.
This concept is part of the original Kademlia protocol:
The lookup terminates when the initiator has queried and gotten responses from the k closest nodes it has seen.
Say we set this equal to `disjoint_paths`, then each path iterator would get exactly one slot. Thus a path iterator would stop once it has successfully queried the closest node it is aware of, instead of the `n` (`K_VALUE / disjoint_paths`) closest nodes it is aware of. Doing so would reduce the overall security guarantees.
Are you suggesting to drop the concept of `num_results`, have each path iterator stop once its closest node was successfully queried, and choose a high value for `disjoint_paths`? (//CC @romanb)
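For concreteness, a small sketch of how a per-path share of `num_results` could be computed (illustrative only; the exact splitting logic in this change-set may differ):

```rust
/// Split `num_results` into one share per disjoint path, distributing the
/// remainder so that the shares sum back up to `num_results`.
fn per_path_num_results(num_results: usize, disjoint_paths: usize) -> Vec<usize> {
    let base = num_results / disjoint_paths;
    let remainder = num_results % disjoint_paths;
    (0..disjoint_paths)
        .map(|i| if i < remainder { base + 1 } else { base })
        .collect()
}

#[test]
fn shares_sum_to_total() {
    // E.g. K_VALUE = 20 split over 3 disjoint paths -> [7, 7, 6].
    assert_eq!(per_path_num_results(20, 3), vec![7, 7, 6]);
    assert_eq!(per_path_num_results(20, 3).iter().sum::<usize>(), 20);
}
```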
parallelism: ALPHA_VALUE.get(),
disjoint_paths: 3,
num_results: K_VALUE.get(),
peer_timeout: Duration::from_secs(10),
Shouldn't this be taken from a higher-level global config, e.g. the overall Kademlia instance config?
This is copied from `ClosestPeersIterConfig`. I don't like the duplication introduced by my current implementation. If the overall design is perceived as good I will look into deduplicating this.
self.iters[self.yielded_peers[peer]].is_waiting(peer)
}

pub fn next(&mut self, now: Instant) -> PeersIterState {
This looks like it's providing an API for the end user to transparently and blockingly iterate through the results as they come in from the network, while also implementing the protocol inside this iterator wrapper.
On the surface it might look like a reasonable thing to do, but it gives you very little control into the low-level details of how the protocol behaves when executed. For example there is no way to make decisions based on whether you received the next hop in path A or path B first. As you noted "[arbitrary ordering] might be unfair". Blocking paradigms such as this make it easier to implement control flows that "look like" multiple unrelated threads, but for more predictable performance we'd like to have control flows that are more interconnected, that don't "look like" multiple unrelated threads.
Implementing the protocol as a non-blocking state machine gives you more control into the details of the process and allows for easier reasoning about what resources are being used at any given time. I would suggest implementing the protocol like this, and then wrapping a blocking iterator around this for the end user, rather than doing the protocol lookups in the iterator-wrapper itself. For example the implementation can just shove the results into a blocking queue, and the end user can consume the queue. Then all the blocking would be localised into the queue implementation, and the performance / resource-usage behaviour of your non-blocking code would be easier to reason about.
This is how my python example is implemented, everything is non-blocking there.
Another advantage of the non-blocking statemachine approach is it's easier to introspect the memory usage; harder to do that when your protocol state is implicitly the local variables of a bunch of green threads fed into a global executor, mixed together with other protocols.
It could be that I'm misunderstanding how the iteration works - e.g. perhaps it is repeatedly polling the child iterators. However this just opens up extra questions - Rust futures are supposed to be woken up when the status changes, but I see nowhere in the current PR where a wake-up of the child iterators would forward the wake-up onto the parent iterator, which would be needed under the polling paradigm.
I am not entirely sure I understand all of your comments. Let me try to explain some parts that might cause some confusion.
The `libp2p-kad` crate is a synchronous state machine. It takes as input events from the network (e.g. new node connected, response from node x, ...) and produces as output requests to be sent out on the network (e.g. send a `FIND_NODE` request to node x).
I am guessing by blocking you mean e.g. blocking the current thread until a response was received on a TCP socket. Under this assumption Kademlia has no way of blocking, given that it does not control any network resources (e.g. a TCP socket), but can only return action requests (e.g. please dial this node, please send this packet to this node, ...).
In regards to any of the `XPeersIterator`: all of their methods are synchronous and single-threaded. While they are blocking in terms of not being asynchronous, they are non-blocking in terms of parking the OS thread (e.g. blocking on a TCP connection). You can see each method call as being instant (only bound by the CPU computation time they need).
Does this make sense to you @infinity0?
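To make that contract concrete, here is a heavily simplified sketch of the methods being discussed, annotated with the direction of information flow (the signatures are approximations, not the exact API):

```rust
use std::time::Instant;

// Simplified stand-ins.
type PeerId = u64;

enum PeersIterState {
    /// Output: "please contact this peer next" (or `None` if the iterator is
    /// waiting but has nothing new to hand out right now).
    Waiting(Option<PeerId>),
    /// The iterator is waiting for as many peers as its parallelism allows.
    WaitingAtCapacity,
    /// The iterator has terminated.
    Finished,
}

/// Approximate shape of a peers iterator, viewed as a pure state machine.
trait PeersIter {
    /// Input from the network (delivered by the behaviour): `peer` replied,
    /// possibly reporting peers that are closer to the target.
    fn on_success(&mut self, peer: &PeerId, closer_peers: Vec<PeerId>);

    /// Input from the network: the request to `peer` failed or timed out.
    fn on_failure(&mut self, peer: &PeerId);

    /// Whether a request to `peer` is currently outstanding.
    fn is_waiting(&self, peer: &PeerId) -> bool;

    /// Called by the consumer: asks the state machine what to do next; the
    /// `now` timestamp is the only other input.
    fn next(&mut self, now: Instant) -> PeersIterState;
}
```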
pub fn next(&mut self, now: Instant) -> PeersIterState {
let mut state = PeersIterState::Finished;

// TODO: Iterating through all iterators in the same order might be unfair.
Is there no way to "select" on a bunch of iterators? That is, block until any iterator unblocks, and give me that one that just unblocked.
None of this is asynchronous. You can think of the iterators as pure state machines with no side effects other than system time. All of their method calls are instant (ignoring CPU time).
In regards to any of the XPeersIterator: All of their methods are synchronous and single-threaded. While they are blocking in terms of not being asynchronous,
OK so it is "logically" blocking from the perspective of the consumer of the iterator, that's what I meant.
However, the state machines need to accept both input from the consumer of the iterator, AND input from the network. I do not see anywhere in the code where the latter occurs.
For Kademlia specifically, we might receive an input from the network, e.g. the result of a previous hop, and then we want to react to this input, independently of whether the consumer of the iterator calls `next()` or not. Especially, we don't want to wait for another `next()` call before we respond to network inputs. But the way this code is written, it is either hiding this part of the logic extremely well, or not performing it at all.
However, the state machines need to accept both input from the consumer of the iterator, AND input from the network. I do not see anywhere in the code where the latter occurs.
The iterator is not aware of what a network is. It gets informed about network events that occurred through its handlers `on_success` and `on_failure`.
`on_success` can be called regardless of whether `next` was called recently, implying that handling of incoming events happens outside the `next` calls.
Deciding when to reply to an input from the network and when to reach out to the next node is up to the `Kademlia` structure, which owns a `QueryPool`, which in turn owns e.g. a `DisjointClosestPeerIter`.
Yes, I meant abstract input from an abstract network. In general we have the following abstract inputs and outputs that (together with the algorithm state) should completely and deterministically affect the running of this algorithm:
- input from the network
- input from the local clock
- output to the network
- output to the consumer
This iterator implementation would seem to add another input:
- input from the consumer, requesting the "next" item in the iteration
This is nice though not a fundamental aspect of how the algorithm must work by definition.
Are you saying that `on_success` and `on_failure` represent the logic to handle input from the network (and presumably the local clock, for timeouts)? That would seem to correspond to the `on_recv` function in my python code (granted, it does not deal with timeouts or failures). But the amount of code in these functions seems far too small to perform what (IMO) should be the logic.
Also, what is `is_waiting`?
It would help my understanding if you could classify each of the existing functions as "input from X" or "output to X" like I did in the list above.
But the amount of code in these functions seems far too small to perform what (IMO) should be the logic.
Specifically, when one hop on one path succeeds, we want to choose the next hop based on (filtering out) stuff that was previously observed on other paths. But the current code doesn't do that. Most of the complex logic is in `next`, which is an input from the consumer of the iterator, not the abstract network.
Thanks for giving this a look @infinity0. Let me know what you think of my replies.
Again apologies for the late reply. The direction looks good to me. Some high-level suggestions / comments:
I hope that all makes sense.
Previous implementation would split up all provided peers only to have the `ClosestDisjointIter` choose a subset of it. This patch only splits up the needed subsets.
Have a user either disable or enable the disjoint paths feature. When enabled always set the amount of disjoint paths to use to the amount of parallelism.
Add test ensuring DisjointClosestPeersIter derives same peer set result as ClosestPeersIter on perfect network conditions.
All [`ClosestPeersIter`] share the same set of peers at initialization. The [`DisjointClosestPeersIter`] ensures a peer is only ever queried by a single [`ClosestPeersIter`].
Don't start with closest iter 0 each time, but instead start off with the iter least queried so far.
@romanb I have addressed all your suggestions in #1473 (comment). Would you mind giving this another review? As of right now
@romanb What does If this is something libp2p added on top of the Kademlia paper, there should be a document on it. It is about as significant of a difference as S/Kademlia itself is on top of regular Kademlia, which has its own paper to describe its design motivations.
@mxinden It is sufficient to test that with the multipaths approach, the result of the lookup, which is a set of nodes, contains the actual closest node - which you know, because you are defining the test. A slightly more strict test would check that the result of the lookup is approximately equal (say >1/2 overlap) with the sibling nodes that S/Kademlia additionally keeps. I am not sure if you implemented this in your PR however.
// iterator, asking it to return another peer in the next loop
// iteration.
let peer = peer.into_owned();
iter.on_failure(&peer);
Why do we have to fail this query, can't we just not make the query in the first place? And save some useless extra network traffic?
Judging by the documentation of `PeersIterState` it does sound like the overall framework doesn't give us this fine-grained control into the query process, is that right? It seems to be a very inelegant abstraction that is also overly restrictive - it only tells us when we are "waiting" (i.e. after it has received a reply and reacted to it already by sending follow-up queries), it doesn't tell us when we actually receive a reply and let us adjust the reaction in response to the reply.
Why do we have to fail this query, can't we just not make the query in the first place? And save some useless extra network traffic?
There is no query being made and no extra network traffic. As @mxinden explained well (I think) in earlier comments (on this and / or an earlier PR), the peer iterators are pure state machines: They yield the candidate peers to query (via `next()`) and expect to get back the results via `on_success` and `on_failure`. The state they manage is the (ordered) list of peers with their corresponding state (not yet queried / success / failed) and it enforces the parallelism and the (regular) termination conditions.
Thus what is being done here is merely that the composite iterator that enforces "disjoint paths" via individual iterators makes sure there is no overlap in the peers of the individual component iterators by immediately calling `on_failure` after `next` if it knows that this peer was already produced (and thus "claimed") by another iterator (i.e. path). It thus effectively just tells this iterator "sorry, that peer is already taken". The query state machines are driven from the I/O in the `Kademlia` `NetworkBehaviour` (`protocols/kad/src/behaviour.rs`), which thus is where you need to look if you want to know how the queries progress w.r.t. I/O.
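A minimal sketch of that "already claimed" check inside the composite iterator's `next` loop, using simplified types rather than the PR's actual ones (the real loop also tracks the aggregate state across all inner iterators):

```rust
use std::collections::HashMap;

type PeerId = u64;

/// Simplified model of one path's iterator: it hands out candidate peers and
/// accepts success / failure reports for them.
struct PathIter {
    candidates: Vec<PeerId>,
    failed: Vec<PeerId>,
}

impl PathIter {
    fn next(&mut self) -> Option<PeerId> {
        self.candidates.pop()
    }
    fn on_failure(&mut self, peer: &PeerId) {
        self.failed.push(*peer);
    }
}

/// When the iterator at `index` yields a peer already claimed by another
/// path, immediately report it as failed to this path ("sorry, that peer is
/// already taken") and ask for another candidate; otherwise record the claim.
fn next_unclaimed(
    iters: &mut [PathIter],
    yielded_peers: &mut HashMap<PeerId, usize>,
    index: usize,
) -> Option<PeerId> {
    while let Some(peer) = iters[index].next() {
        if yielded_peers.contains_key(&peer) {
            iters[index].on_failure(&peer);
            continue;
        }
        yielded_peers.insert(peer, index);
        return Some(peer);
    }
    None
}
```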
the peer iterators are pure state machines
This is not the only way of representing the logic as a pure state machine - I think this was where the previous discussion / explanation stalled. For some reason you & Max thought that I was confusing this approach with a non-pure approach, which I wasn't.
They yield the candidate peers to query (via next()) and expect to get back the results via on_success and on_failure.
OK, now this makes sense. Then this iterator works backwards from the typical iterator pattern, which is to yield results from `next` rather than "new things to query". A name like "RequestDriver" would be less ambiguous and clearer.
It's further confusing, because doing it this way around, `next` is actually redundant - you can simply merge its functionality into `on_success` and `on_failure`. You can make the decision on what the next peers-to-query are as soon as you get the result back from a previous query; there is no need to separate out this functionality into `on_success` and then later `next`, which seems more like shoehorning the required functionality into a pre-chosen abstract design pattern (the "iterator").
This merged `on_success`/`next` is how I have implemented the multipaths algorithm here also as a pure state machine, and `recv_result` would correspond to the combination of `on_success` then `next`.
Thus what is being done here is merely that the composite iterator that enforces "disjoint paths" via individual iterators [..]
This all makes immediate sense given the explanation of what the interface `next`, `on_success` and `on_failure` do. However, I'd suggest that this approach is subpar and leads to exactly the problem I mention here, and that the multipaths approach (treating the whole query across all the peers as a single fat multipath, rather than d disjoint paths) avoids this bad case.
Then this iterator works backwards from the typical iterator pattern, which is to yield results from next rather than "new things to query". A name like "RequestDriver" would be less ambiguous and clearer.
I think this is just a matter of what preconceptions one has (apart from the fact that names are always contentious). When you say "[..] typical iterator pattern, which is to yield results [..]" you seem to have a certain idea of what a "typical iterator pattern" is and what "results" are and what not. To me, an iterator, in the general sense, is just anything that yields a sequence of values. The choice of name "Iterator" here does not stem from a desire to shoehorn the code into some abstract pattern, but only because the state of an (iterative) query is essentially a prioritised list of peers and an iterative query iterates over that list, thereby updating the list accordingly as results come in, eventually terminating.
It's further confusing, because doing it this way around, next is actually redundant - you can simply merge its functionality into on_success and on_failure. You can make the decision on what the next peers-to-query are, as soon as you get the result back from a previous query, there is no need to separate out this functionality into on_success and then later next [..]
The decision on what the next peers-to-query are is essentially made when the callbacks are called, yes, but leaving the request for the next peer to another API call (here `next`) instead of returning it from these functions directly is mainly done because the separation of these steps integrates well with the overall callback-based approach to the libp2p `NetworkBehaviour`: When a response comes in on the network on some connection, a certain callback is invoked, which can in turn invoke `on_success` / `on_failure` on the corresponding query. Notably there is no option at this point to directly tell libp2p that another peer should be queried. Instead, libp2p regularly calls `NetworkBehaviour::poll` to ask the behaviour to "make progress" and as a result tell the underlying libp2p networking layer what to do next (e.g. query another peer). That is why `Kademlia::poll`, which implements `NetworkBehaviour::poll`, calls `QueryPool::next()` which in turn calls `next()` on the underlying peer iterators. So, one reason for this split is simply the easy integration with the I/O layer in this way.
[..] and that the multipaths approach (treating the whole query across all the peers as a single fat multipath, rather than d disjoint paths) avoids this bad case.
I still need to take a closer look at that.
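A compressed, self-contained sketch of the split described above between the response callback and the poll-driven `next()` call (all names here are stand-ins for illustration; the real integration lives in `protocols/kad/src/behaviour.rs`):

```rust
use std::collections::HashMap;
use std::time::Instant;

type PeerId = u64;
type QueryId = u64;

enum PeersIterState { Waiting(Option<PeerId>), WaitingAtCapacity, Finished }

// Stand-in for a query's peer iterator (a pure state machine).
struct PeersIter { to_query: Vec<PeerId>, in_flight: usize }

impl PeersIter {
    // Input from the network, invoked from the response callback below.
    fn on_success(&mut self, _peer: &PeerId, closer: Vec<PeerId>) {
        self.in_flight = self.in_flight.saturating_sub(1);
        self.to_query.extend(closer);
    }
    // Invoked from `poll`: yields the next peer to contact, if any.
    fn next(&mut self, _now: Instant) -> PeersIterState {
        match self.to_query.pop() {
            Some(p) => { self.in_flight += 1; PeersIterState::Waiting(Some(p)) }
            None if self.in_flight > 0 => PeersIterState::WaitingAtCapacity,
            None => PeersIterState::Finished,
        }
    }
}

struct Behaviour { queries: HashMap<QueryId, PeersIter> }

impl Behaviour {
    /// Callback invoked when a response arrives: feed it into the query's
    /// state machine, but do not decide here what to send next.
    fn inject_response(&mut self, query: QueryId, from: PeerId, closer: Vec<PeerId>) {
        if let Some(iter) = self.queries.get_mut(&query) {
            iter.on_success(&from, closer);
        }
    }

    /// Regularly invoked by the runtime: ask each query for the next peer to
    /// contact and translate that into an outgoing request.
    fn poll(&mut self, now: Instant) -> Option<(QueryId, PeerId)> {
        for (id, iter) in self.queries.iter_mut() {
            if let PeersIterState::Waiting(Some(peer)) = iter.next(now) {
                return Some((*id, peer)); // e.g. "send FIND_NODE to `peer`"
            }
        }
        None
    }
}
```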
there is no need to separate out this functionality into on_success and then later next,
Although perhaps doing it this way allows you to work with TCP flow control a bit better, since you only call `next` when the stream unblocks, and still want to receive results in the meantime. It's not clear if that was the purpose behind the original design though. (edit: and since Kademlia is expected to be low-traffic, the stream should normally not be in a blocked state anyway)
When you say "[..] typical iterator pattern, which is to yield results [..]" you seem to have a certain idea of what a "typical iterator pattern" is and what "results" are and what not. To me, an iterator, in the general sense, is just anything that yields a sequence of values.
Yes I can see this perspective too. However when trying to quickly understand new code you often have to take shortcuts, and "iterator" suggests a producer that doesn't care what the consumer does with the yielded value - whereas here the behaviour of the consumer also very much affects the producer, via `on_success` and `on_failure`. This would be more completely described by the term "coroutine" - although you are right that an iterator technically doesn't prohibit this, it does connote it to a human reading the code.
The choice of name "Iterator" here does not stem from a desire to shoehorn the code into some abstract pattern, but only because the state of an (iterative) query is essentially a prioritised list of peers and an iterative query iterates over that list, [..]
Yeah fair enough, I can see the resemblance from this p2p usage of the terminology, esp in terms of "iterative" vs "recursive" queries. The term "iteration" is overloaded a bit.
@@ -55,6 +57,7 @@ pub struct ClosestPeersIterConfig {
/// The `α` parameter in the Kademlia paper. The maximum number of peers that
/// the iterator is allowed to wait for in parallel while iterating towards the closest
/// nodes to a target. Defaults to `ALPHA_VALUE`.
// TODO: Should this be configurable?
Why do you think it shouldn't be?
From past open source maintenance experience I prefer small configuration surfaces because:
- They are usually more difficult to misuse.
- They are easier to maintain.
- They are easier to change in the future.
As far as I know no one asked for this to be configurable in the past. Thus I would not make it configurable now, but only once someone needs it to be configurable.
What do you think?
I'm a bit confused, because:
a) This is not (crate-)public configuration (or is it?).
b) At least for the transition period, these iterators need to be configured with a different parallelism in the context of disjoint vs standard Kademlia queries, no? And furthermore, if we eventually end up only supporting disjoint-path queries, the whole concept of `parallelism` inside a single `ClosestPeersIter` could be removed, not just the configuration option, unless we wish to make that configurable.
Ah, I think my `TODO` comment is confusing, sorry about that. Let me rephrase:
// TODO: Should parallelism be configurable from outside the crate, e.g. by a user?
Do you agree that `parallelism` should not be configurable from outside of the crate @romanb for now?
At least for the transition period, these iterators need to be configured with a different parallelism in the context of disjoint vs standard Kademlia queries, no?
Yes. From within the crate it is configurable and should stay configurable given that `ClosestDisjointPeersIter` makes use of it.
And furthermore, if we eventually end up only supporting disjoint-path queries, the whole concept of parallelism inside a single ClosestPeersIter could be removed, not just the configuration option, unless we wish to make that configurable.
👍 Agreed.
Do you agree that parallelism should not be configurable from outside of the crate @romanb for now?
Well, I think with the introduction of disjoint paths the parallelism should be configurable on the public API (but that is not this parallelism option directly on the `ClosestPeersIter`), because the parallelism controls the number of disjoint paths. What is confusing me is why you put this TODO on the internal configuration of a `ClosestPeersIter`, because clearly a) that is not in the public API and b) as we agreed it needs to be configurable for now at this level until we eventually only support disjoint path queries.
},
PeersIterState::Waiting(None) => {}
PeersIterState::WaitingAtCapacity => {}
PeersIterState::Finished => state = PeersIterState::WaitingAtCapacity,
It took me a while to understand but the following properties should hold, right?
- The composite iterator returns `Waiting(None)` if and only if at least one iterator returns `Waiting(None)` and all others return one of `Waiting(None)`, `WaitingAtCapacity` or `Finished`.
- The composite iterator returns `WaitingAtCapacity` if and only if at least one iterator returns `WaitingAtCapacity` and all others return either `WaitingAtCapacity` or `Finished`.
- The composite iterator returns `Waiting(Some(_))` if and only if one of the iterators returns `Waiting(Some(peer))` and `peer` was not yet claimed by another iterator.
Would it be feasible to add property tests to that end?
but the following properties should hold, right?
Correct.
Would it be feasible to add property tests to that end?
Were you thinking of a mocked `ClosestPeersIter` or an `Arbitrary` implementation for `ClosestPeersIter` that returns `ClosestPeersIter`s in different states?
I didn't have a concrete approach in mind, just saying that given it is not easily checked that these properties hold by looking at the implementation, it may be good to have these tested.
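One way to pin those properties down is to extract the state-merging rule into a small pure function and test it against hand-written cases - a sketch using a simplified state type rather than the real `ClosestPeersIter` (the claim check for `Waiting(Some(_))` is assumed to have happened already):

```rust
type PeerId = u64;

#[derive(Debug, PartialEq)]
enum State {
    Waiting(Option<PeerId>),
    WaitingAtCapacity,
    Finished,
}

/// Merge the (already claim-filtered) states of the inner iterators using
/// the precedence described above: `Waiting(Some(_))` wins outright, then
/// `Waiting(None)` dominates `WaitingAtCapacity`, which dominates `Finished`.
fn merge_states(states: &[State]) -> State {
    let mut merged = State::Finished;
    for s in states {
        match s {
            State::Waiting(Some(peer)) => return State::Waiting(Some(*peer)),
            State::Waiting(None) => merged = State::Waiting(None),
            State::WaitingAtCapacity => {
                if merged == State::Finished {
                    merged = State::WaitingAtCapacity;
                }
            }
            State::Finished => {}
        }
    }
    merged
}

#[test]
fn merge_follows_the_three_properties() {
    use State::*;
    // At least one Waiting(None), the rest WaitingAtCapacity / Finished.
    assert_eq!(merge_states(&[Finished, Waiting(None), WaitingAtCapacity]), Waiting(None));
    // At least one WaitingAtCapacity, the rest Finished.
    assert_eq!(merge_states(&[Finished, WaitingAtCapacity]), WaitingAtCapacity);
    // An unclaimed peer yielded by one of the iterators wins.
    assert_eq!(merge_states(&[Waiting(None), Waiting(Some(42))]), Waiting(Some(42)));
}
```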
///
// NOTE: One can not ensure both iterators yield the *same* result. While [`ClosestPeersIter`]
// always yields the closest peers, [`ClosestDisjointPeersIter`] might not. Imagine a node on
// path x yielding the 22 absolute closest peers, not returned by any other node. In addition
[..] yielding the 22 absolute closest peers, not returned by any other node.
That is a strange assumption though. In a non-adversarial setting that seems unlikely since all disjoint paths converge to the same closest peers by the nature of the iterative query process, only differing in which subsets of these are queried on each path. Thus any "unused" peers on one path should still be seen (and can be queried) on other paths if they are indeed closer than others. So, I still don't see how, in a non-adversarial setting with perfect network connectivity (e.g. in this test setup), the results could possibly diverge? Could you elaborate?
I am not 100% sure I understand your comment, so please bear with me and ask follow up questions.
Thus any "unused" peers on one path should still be seen (and can be queried) on other paths if they are indeed closer than others.
Yes, but only if those other paths have a peer that is aware of these "unused" peers.
The edge case I am describing is unlikely to happen in normal and non-adversarial situations, among other reasons because the number of nodes a node is aware of for a given distance bucket (K_VALUE) is a lot larger than the number of disjoint paths. Said differently: a node is likely known by more than one node.
However, in cases where the graph of nodes is not well connected, e.g. only a single node is aware of the closest X nodes to the target, `ClosestPeersIter` and `ClosestDisjointPeersIter` can yield different results.
In this test I don't want to assume a well connected graph and would rather test the edge cases as well.
Does that make sense?
In this test I don't want to assume a well connected graph and rather as well test the edge cases.
But the weaker the assertions one can make in a test, the less useful the test, because it will accept more implementations (including incorrect ones). Now I'm not saying that it may not be useful to check some properties of the algorithm in less ideal conditions, as long as one can still make some useful assertions, but the reason I suggested the test for exact agreement of the result with the standard Kademlia lookup in the absence of connectivity problems and adversarial behaviour is that it establishes a baseline for correctness: If an implementation does not pass that, there is little hope it will work correctly under less ideal conditions. And the reason I expect agreement of these algorithms on the result is that, as far as my understanding goes, every path in a multi-path lookup should essentially end up seeing the same closest peers to the key being looked up, given that each peer has full knowledge of its close neighbourhood in the overlay (routing table) - again in ideal conditions of network connectivity (but not necessarily assuming every node knows every other, only that every node can reach every other via some path in the overlay routing) and no adversarial nodes in the query.
So, I still don't see how, in a non-adversarial setting with perfect network connectivity (e.g. in this test setup), the results could possibly diverge? Could you elaborate?
After trying to adjust the test accordingly I have found the following edge case where a `ClosestPeersIter` and a `ClosestDisjointPeersIter` yield different results running on a fully connected graph.
Example
There are 21 (1, 2, ..., 21) remote peers in total (excluding the local node). Each node is aware of all other nodes (fully connected). Both `ClosestPeersIter` as well as `ClosestDisjointPeersIter` are initialized with all but the closest node (i.e. with 2, 3, ..., 21). The amount of allowed parallelism and thus the number of disjoint paths is `3`. `num_results` is `3` as well (one per disjoint path).
(This is the simplest example I can come up with. More complex ones are possible.)
`ClosestPeersIter`
- Query node `2` (the closest node the iterator is aware of) and thus learn of node `1`.
- Query node `1`, but don't learn anything new.
- Query node `3`, but don't learn anything new.
- Return nodes `1`, `2`, `3`.

`ClosestDisjointPeersIter`
- Path `1`: Query node `2` (the closest node the iterator is aware of) and thus learn of node `1`.
- Path `2`: Try query node `2`, which was already queried by path `1` and thus `on_failure` is called.
- Path `2`: Query node `3` and thus learn of node `1`.
- Path `3`: Try query node `2`, which was already queried by path `1` and thus `on_failure` is called.
- Path `3`: Try query node `3`, which was already queried by path `2` and thus `on_failure` is called.
- Path `3`: Query node `4` and thus learn of node `1`.
- Path `1`: Query node `1`, but don't learn anything new. (Path `1` is finished, yielding node `1`.)
- Path `2`: Try query node `1`, which was already queried by path `1` and thus `on_failure` is called. (Path `2` is finished, yielding node `3`.)
- Path `3`: Try query node `1`, which was already queried by path `1` and thus `on_failure` is called. (Path `3` is finished, yielding node `4`.)
- Return nodes `1`, `3`, `4`.

In other words: The first path of the `ClosestDisjointPeersIter` tries node `2` and thus finds out about node `1`, which it eventually yields. Given that it queried node `2`, no other path can query / yield node `2`, and thus the final output misses node `2`, which the output of a `ClosestPeersIter` would not.
Does the above make sense @romanb?
While I will put more thought into this, I would suggest moving forward requiring both iterators to find the single closest node and requiring that the two results overlap, like @infinity0 suggested above:
@mxinden It is sufficient to test that with the multipaths approach, the result of the lookup, which is a set of nodes, contains the actual closest node - which you know, because you are defining the test.
A slightly more strict test would check that the result of the lookup is approximately equal (say >1/2 overlap) with the sibling nodes that S/Kademlia additionally keeps. I am not sure if you implemented this in your PR however.
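In test form, those relaxed assertions might look roughly like this (hypothetical helper; the actual test in this PR may be structured differently):

```rust
use std::collections::HashSet;

type PeerId = u64;

/// Relaxed comparison: the disjoint-path result must contain the true
/// closest peer and overlap with the plain Kademlia result in more than
/// half of its entries.
fn check_disjoint_result(
    closest_peer: PeerId,
    plain_result: &HashSet<PeerId>,
    disjoint_result: &HashSet<PeerId>,
) {
    assert!(disjoint_result.contains(&closest_peer));
    let overlap = plain_result.intersection(disjoint_result).count();
    assert!(overlap * 2 > plain_result.len(), "overlap {} too small", overlap);
}
```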
Thanks for the example. It does show that I overlooked an important detail of the current implementation. Let me try to summarise in my own words: We may not end up with all the actual `k` closest nodes to the key, because any one path may contact peers that are among the `k` closest but not among the `k / d` closest that this path is allowed to contribute to the overall `k` results. Thinking then about what guarantees we actually have, it seems that we can only expect at least `1 / d` of the resulting list of peers to be among the actual `k` closest (according to the non-disjoint query). That kind of dispersion in a completely non-adversarial setting does not really seem acceptable and is unlikely to be intended by the S/Kademlia paper, as it directly affects where records are stored and whether they are found.
Intuitively, the problem with this approach then is that every individual path is allowed to "race arbitrarily far ahead" towards the `k` closest peers, thus contacting these but then only contributing its `k / d` fraction of them (naturally as permitted, since no path should be allowed to make a larger contribution to the resulting list of peers than any other).
A first improvement that suggests itself is to enforce fairness in the progression across all paths, i.e. a sort of lock-step execution where, after a path has contacted a new peer, it may only contact another after all other paths have contacted one more peer (or terminated). That would guarantee, I think, an overlap of `k - (k / d)` peers among all lookups for that key in a non-adversarial setting. Still, that is not great. I really think that in the non-adversarial setting the disjoint-path lookup should guarantee to find all of the "true" `k` closest peers to the key.
So I went back to the paper and read "[..] By using the sibling list [..] the lookup doesn’t converge at a single node but terminates on d close-by neighbors, which all know the complete siblings for the destination key." I read this such that as long as at least one path is free of adversarial nodes, the complete list of "true" siblings of the key should be discovered by the query. To realise that in the absence of special handling for sibling lists, I've had the following thought: When a path `a` wants to contact peer `p`, but `p` is already being contacted by another path `b`, then instead of always regarding `p` as "failed" for `a`, it is instead reported to `a` with the same outcome as for `b` (success or failure), however without reporting to `a` the supposed closer peers obtained from `p` in case of success (these are only reported to `b`). Thus `a` does not incorporate any peers from other paths into its own, but it still recognises a successful contact made on another path to a peer that this path would have liked to query as well (because it obtained that peer earlier on in this same path). Furthermore, each path must then collect a full `k` results. When all paths have terminated, the `k` results from all paths are merged into one set. In the completely non-adversarial setting (and no connectivity problems), all these results will be the same and the resulting set of size `k` contains the "true" `k` closest peers. In an adversarial setting and / or with partial network connectivity between some of the nodes, the resulting set will be larger than `k` and if this set is supposed to be used in a follow-up `STORE` operation (e.g. in the context of `put_record` or `add_provider`), that request is nevertheless sent to this full set of peers, which may be more than `k`, but we are then assured that an attempt is made to send it to the "true" `k` closest peers to the key.
Thoughts?
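The final merge step of that proposal is simple to sketch: every path contributes its own full `k` results and the query reports the union (which a follow-up `STORE` would then be sent to); a minimal illustration:

```rust
use std::collections::HashSet;

type PeerId = u64;

/// Union of the per-path result lists. With no adversaries and full
/// connectivity every path returns the same `k` peers, so the union has size
/// `k`; otherwise it may be larger, but it still contains every peer that an
/// unpolluted path considers among the true `k` closest.
fn merge_path_results(per_path: Vec<Vec<PeerId>>) -> HashSet<PeerId> {
    per_path.into_iter().flatten().collect()
}
```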
Actually, this more complex case breaks all of the suggested solutions so far, including my "easy fix" to the multipaths approach:
hop1 hop2 hop3
3 -> 2 -> 1
^ ^
4 ---| |
| |
5 ---| |
|
7 -> 6 ---|
|
9 -> 8 ---|
My suggested way of proceeding backwards on a per-hop basis will give {1, 2, 6, 8, 3} but we want {1, 2, 3, 4, 5}...
Another example:
hop1 hop2 hop3
3 -> 2 -> 1
^
4 ---|
|
5 ---|
7 -> 6
9 -> 8
Here the correct solution is {1, 2, 3, 6, 8} because only 3 sources have "vouched for" 1, and so we can't return {1, 2, 3, 4, 5} since this doesn't give representation for sources 7 and 9.
OTOH if you have a query graph that looks like this:
3 -> 2 -> 1
^
4 ---|
|
5 ---|
|
7 -> 6 ---|
|
9 -> 8 ---|
Then intuitively the correct solution is {1, 2, 3, 4, 5}, since 7 and 9 are all vouching for 2 and 1, so it's OK for 3, 4, 5 to "vouch for" themselves. However this doesn't fit very well into the "hop" columns; previously I've been suggesting to simply ignore (delete) the 6 -> 2 and 8 -> 2 edges, but that would collapse this example into the previous example, even though they have different intuitively "correct" answers...
OK, the algorithm below is "obviously correct" but involves a bit of brute-force, it would be better to make it more efficient:
- From the query graph, figure out what every node's sources are. So for example the graph in this comment becomes the following dictionary, which can be viewed as a graph G:
V1 -> {3,4,5,7,9}
V2 -> {3,4,5}
V3 -> {3}
V4 -> {4}
V5 -> {5}
V6 -> {7}
V7 -> {7}
V8 -> {9}
V9 -> {9}
The V prefix is just to distinguish e.g. V7 on the LHS, from 7 on the RHS, which makes the next step easier:
- Iterate through all the d-combinations of the nodes, in sorted order. So for example this will be {V1,V2,V3,V4,V5}, {V1,V2,V3,V4,V6}, {V1,V2,V3,V4,V7}, etc. Take the subgraph of G induced by the combination, e.g. for {V1-V5} this is:
V1 -> {3,4,5,7,9}
V2 -> {3,4,5}
V3 -> {3}
V4 -> {4}
V5 -> {5}
Is there a maximum bipartite matching? If so, return the LHS as the set of results. If not, continue onto the next combination in the iteration.
The algorithm is guaranteed to terminate since there is always one matching {Vn -> {n} for all n in sources}, but I don't know if it's possible to construct pathological cases that take a long time...
edit: and for the example in question, the algorithm will eventually find:
V1 -> {3,4,5,7,9}
V2 -> {3,4,5}
V3 -> {3}
V6 -> {7}
V8 -> {9}
with the maximum bipartite matching:
V1 -> {4}
V2 -> {5}
V3 -> {3}
V6 -> {7}
V8 -> {9}
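The check in step (2) is a standard maximum bipartite matching. A compact sketch of it (Kuhn's augmenting-path algorithm) applied to one candidate combination, assuming the sources have been mapped to indices; this is an illustration of the proposed check, not code from the PR:

```rust
/// Try to find an augmenting path that matches target `t` to some source.
/// `adj[t]` lists the source indices vouching for target `t`; `match_of[s]`
/// is the target currently matched to source `s`, if any.
fn try_match(
    t: usize,
    adj: &[Vec<usize>],
    visited: &mut [bool],
    match_of: &mut [Option<usize>],
) -> bool {
    for &s in &adj[t] {
        if visited[s] {
            continue;
        }
        visited[s] = true;
        match match_of[s] {
            None => {
                match_of[s] = Some(t);
                return true;
            }
            Some(other) => {
                if try_match(other, adj, visited, match_of) {
                    match_of[s] = Some(t);
                    return true;
                }
            }
        }
    }
    false
}

/// True if every target of the candidate combination can be matched to a
/// distinct source, i.e. the combination passes the step-(2) check.
fn has_perfect_matching(adj: &[Vec<usize>], num_sources: usize) -> bool {
    let mut match_of: Vec<Option<usize>> = vec![None; num_sources];
    (0..adj.len()).all(|t| {
        let mut visited = vec![false; num_sources];
        try_match(t, adj, &mut visited, &mut match_of)
    })
}

#[test]
fn example_combination_has_a_matching() {
    // Sources 3, 4, 5, 7, 9 mapped to indices 0..5; targets V1, V2, V3, V6, V8.
    let adj = vec![
        vec![0, 1, 2, 3, 4], // V1 -> {3, 4, 5, 7, 9}
        vec![0, 1, 2],       // V2 -> {3, 4, 5}
        vec![0],             // V3 -> {3}
        vec![3],             // V6 -> {7}
        vec![4],             // V8 -> {9}
    ];
    assert!(has_perfect_matching(&adj, 5));
}
```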
Instead of doing step (2) we can convert the whole thing into a single maximum-flow problem. First convert G from step (1) into a single flow graph as follows. For each source, connect it to each target with a flow of (1000000 - target node distance). (The number 1000000 is chosen just so the closest `d-1` targets have less flow than the furthest `d` targets; in the real network we might have to choose a bigger number.)
Then create a typical flow graph, with edges from a flow-source to all the sources, and from all the targets to a flow-sink. So the example above will be converted to:
(3) ---(1m - 1)--> V1
^ \
/ \
(1m) [etc] (1m)
/ [etc] \
/ v
flow flow
source sink
\ ^
\ /
(1m) (1m)
\ /
v /
(9) ----(1m - 9)--> V9
Then the problem turns into a simple maximum-flow problem - the maximum flow will select d flows from the source to the sink, going through d of the Vn, with the minimal sum-over-n. Depending on the algorithm used, the complexity is O(V^3) which should be fine for what we need.
The above conversion is not exactly right, but I've figured out the correct version here that also works to select the next-peer-to-query during the running of the algorithm.
This should be due to paritytech/substrate#5993 and mxinden/kademlia-exporter@748adc2. The former has Substrate drop the default DHT which was shared with IPFS and edgeware, and the latter configures the exporter to use the Kusama DHT only. The result is a smaller DHT with fewer incompatible nodes. Both should reduce the query latency. Does that make sense @infinity0?
It does not take 250 requests per query. 95% of the queries take less than 250 requests. The number 250 is only the ceiling.
These are FIND_NODE queries only. For
@romanb I think I addressed all comments of your recent review (#1473 (review)). Would you mind taking another look whenever you have time? (I am sorry for the notification storm.)
Just some minor comments this time around. Overall this looks pretty much ready to me, thanks for your effort!
protocols/kad/src/query.rs
/// A finished query immediately stops yielding new peers to contact and
/// will be reported by [`QueryPool::poll`] via
/// [`QueryPoolState::Finished`].
pub fn try_finish<I>(&mut self, peers: I)
What do you think about returning a boolean here to indicate success or failure (and thus similarly for `finish_paths`, but not for `finish()`), which we could use at the call-site to add some useful `debug` logging to see these finishing attempts?
I am not in favor of altering the function signature for a `debug` statement. Wouldn't that confuse more than it helps?
What do you think of b69f24f instead?
I am not in favor of altering the function signature for a debug statement. Wouldn't that confuse more than it helps?
I tend to view the module APIs also in isolation and when just looking at `try_finish()` the first thing I wonder is how I am supposed to know about the outcome of this attempt, i.e. is the query now finished or not? Of course we already have `is_finished()` and I suppose we could use that as well, but indicating success/failure from `try_finish()` itself via a return value seems a bit more straight-forward. Just because, right now, the only place where we would use this return value is so that we can add some useful logging, because for the program flow at that point we don't actually care whether the query did indeed finish, does not make the fact that `try_finish` returns success/failure less valuable, in my opinion.
What do you think of b69f24f instead?
I don't think these logs are very useful because they lack too much context, like the query ID or the peer ID of the latest response, both of which are available at the call-site of `try_finish()`.
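For illustration, the return-value variant being discussed might look roughly like this (a hypothetical sketch with stand-in types, assuming the `log` crate for the call-site logging; not the code that ended up in the PR):

```rust
type PeerId = u64;
type QueryId = u64;

// Stand-in for the peers iterator's own finishing logic.
struct PeersIter { done: bool }

impl PeersIter {
    fn try_finish<I: IntoIterator<Item = PeerId>>(&mut self, _peers: I) -> bool {
        // Real logic would mark `peers` as succeeded and re-check the
        // termination condition; here we just report a stored flag.
        self.done
    }
}

struct Query { id: QueryId, peers_iter: PeersIter }

impl Query {
    /// Hypothetical variant of `try_finish` reporting whether the query
    /// did finish as a result of this call.
    fn try_finish<I: IntoIterator<Item = PeerId>>(&mut self, peers: I) -> bool {
        self.peers_iter.try_finish(peers)
    }
}

/// Call-site sketch: the boolean lets the caller log the outcome together
/// with context only it has (query id, responding peer, ...).
fn on_response(query: &mut Query, source: PeerId, peers: Vec<PeerId>) {
    let finished = query.try_finish(peers);
    log::debug!("query {}: response from {}; finished: {}", query.id, source, finished);
}
```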
/// See [`crate::query::Query::may_finish`] for details.
pub fn may_finish<I>(&mut self, peers: I)
where
    I: IntoIterator<Item = PeerId>
But if this method does not need owned items, it shouldn't require them. The borrowing can be done at the call-site, can it not?
impl From<usize> for IteratorIndex {
    fn from(i: usize) -> Self {
        IteratorIndex(i)
    }
}

impl From<IteratorIndex> for usize {
    fn from(i: IteratorIndex) -> Self {
        i.0
    }
}

impl Add<usize> for IteratorIndex {
    type Output = Self;

    fn add(self, rhs: usize) -> Self::Output {
        (self.0 + rhs).into()
    }
}

impl Index<IteratorIndex> for Vec<ClosestPeersIter> {
    type Output = ClosestPeersIter;

    fn index(&self, index: IteratorIndex) -> &Self::Output {
        &self[index.0]
    }
}

impl IndexMut<IteratorIndex> for Vec<ClosestPeersIter> {
    fn index_mut(&mut self, index: IteratorIndex) -> &mut Self::Output {
        &mut self[index.0]
    }
}
With the above one can use `IteratorIndex` to directly index into a collection of iterators, thus one does not need to do `iterators[index.0]` and can thereby not confuse it with any other defined newtype pattern (e.g. as in `iterators[not_an_index_into_iterators.0]`).
I would agree if this newtype were not confined to this module and there were not just a single such index (e.g. there is no `SomeOtherIndex`), but here we have a use that is both confined to a single module and a single such index being tracked. The newtype and all the trait impls thus seem a bit overkill to me, but I won't object if you want to keep them in, I just wanted to get clearer on your motivation, which in principle I fully understand. I wish Rust had some form of GeneralizedNewtypeDeriving.
Thank you @romanb for yet another review! (I am counting more than 7 by now.) Would you mind taking another look whenever you have a chance?
Following up on the discussion for
Yes this makes sense.
Looks good to me, I only left some more small suggestions. It would be great to find the cause or an explanation for the perceived anomalies in the request counts, but that needn't hold back a merge of this initial implementation - there will always be changes and improvements to be made later.
I cannot reproduce a situation where disjoint-paths is querying more nodes within the 50th percentile. My guess is that the previous observation was due to different nodes in the two routing tables. With the current deployment both disjoint and non-disjoint paths need roughly the same number of successful requests in order for a query to succeed. (The ramp-up is due to the two DHT clients not yet knowing the entire network.) For anyone interested in more data, kademlia-exporter.max-inden.de displays the data of the most recent version of this pull request (1232c55). With the approval and the data in mind, I will merge this pull request tomorrow, unless someone objects of course.
Thanks for all the help on this pull request @romanb @infinity0. |
Status
This is a follow up (/alternative implementation) to #1412, more particularly to the feedback from @romanb in #1412 (review).
Ready for review.
Like #1412 this is in an early stage, ready for high-level feedback. Reviewing commit by commit should be easiest.
Open questions:
- Different from the suggestion in protocols/kad: Implement S-Kademlia's lookup over disjoint paths #1412 (review), this pull request introduces the disjoint path logic into a separate iterator. The reason is that introducing it into `query.rs` would mess up the abstraction of the `closest` and `fixed` iterator. Should `DisjointClosestPeersIter` and `ClosestPeersIter` be merged? Should this logic live in `query.rs` like suggested in the first place?
- As of today this implementation depends on the amount of allowed parallelism to be equal or greater than the amount of disjoint paths. Is that something that we want to assume? If not this needs some more in-depth changes.
- If one `ClosestPeersIter` returns `PeersIterState::Waiting` and one returns `PeersIterState::WaitingAtCapacity`, `DisjointClosestPeersIter` returns the former. I am not sure this upholds correctness.