-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental update of trust levels in a dynamic blockchain graph #2805
Comments
http://www-sop.inria.fr/members/Konstantin.Avratchenkov/pubs/mc.pdf http://crad.ict.ac.cn/CN/abstract/abstract3238.shtml This is the architecture of the algorithm, every time when the graph is changed, we just need to update the value of VT(x), x belongs to all nodes in the graph, according to the Variation of Graph and the PageRank Value in last time. |
General research description: Excellent work by Harvard University + Zurich: Personalized Hitting Time for Informative Trust Mechanisms Despite Sybils ( thesis version ) |
2005 work: Incremental page rank computation on evolving graphs Why trust and incremental pagerank: |
First code: https://github.com/Peiteng/Incremental-Pagerank !
Key outcome plot :
|
Possible solo "blockchain Engineering" course. ToDo: take trustchain dataset, copy algorithms from state-of-the-art, possible enhance, implement Python code, math beautiful equations on probability & stochastic. @alexander-stannat We establish a surprising connection between the personalized PageRank algorithm and the stochastic block model for random graphs, showing that personalized PageRank, in fact, provides the optimal geometric discriminant function for separating the communities in stochastic block models over a wide class of functions. Building on this result, we develop stronger classifiers that, although scalable, are competitive with computationally much more demanding methods such as belief propagation. |
First sprint:
|
First sprint finished. I implemented a standard Monte Carlo random walk algorithm (R random walks starting from each node without resets, with average length reset_prob^-1). The page rank is determined by the number of visit times of each node divided by the sum over all visit times (see algorithm 4) For nodes that are added to the graph we run R random walks starting at the corresponding nodes. I tested the algorithm on a number of different example graphs, including the graph of the multichain data set and determined the execution times. For the graphs generated by blocks 0-100 and 50-150, it took 24.5 seconds to compute the algorithm. The next step would now be to optimize the algorithm. These guys divide the graph into a partition Q containing all changed nodes/edges and no outgoing links, and a partition P. Seems like the right direction seeing as the multi chain network only grows in 'one direction' making such a partition easy to obtain. I couldn't find an English version of the Chinese paper. so I don't quite know what to do with it. |
For inspiration: https://github.com/vandenheuvel/incremental-pagerank |
Feedback in terms of coding:
Function wise changes: All walks from a single node Continuous updating Think about churn, low priority Think about speed of convergence Think about how to set up sybil resistance experiments |
I have written a class to incrementally compute the personalized page ranks of a directed graph from the perspective of a predetermined node in the graph. Personalized page rank is an alternate version of page rank where the ranks of nodes are determined by their distance from a given seed node. In this class the page ranks are computed using the monte carlo method, whereby a number of random walks of predetermined lengths originate from the seed node and walk along the edges through the graph. They jump back to the seed node with a given probability at every step and the walk is terminated once it reaches a certain length. If a random walk reaches a "dangling node", i.e. a node with no outgoing edges it is reset as well. A vector of visit times is computed containing the number of times the random walk passes through the individual nodes and the page rank is given by the visit times divided by the accumulated visit times of all nodes in the graph. The personalized page rank given below is incremental, meaning that it can be recomputed every time the underlying graph structure is modified, i.e. edges and nodes are added or removed. In order to recompute the page ranks, one doesn't have to recompute the entire set of random walks through the graph. Instead, the given set of random walks is modified. The idea behind this is that a given random walk does not pass through every single edge. Only random walks that reach a node for which the outgoing edges have been modified, i.e. an edge is added or removed, or the weight of an edge is changed, need to be recomputed, starting from the first node for which such changes have occurred. I compare my results to the regular power iteration method whereby i used 500 as a maximum number of iterations. The power iteration method is more accurate than the monte carlo mehtod. However, the goal was to implement an incremental pagerank. Seeing as recomputing the power iteration every time the graph changes, the monte carlo algorithm turns out to be the more efficient option. As the graph sizes increases will become more efficient than the power iteration. I created some unit testing examples where graphs are randomly generated and the page ranks are computed (both monte carlo and power iteration) and then randomly modified, i.e. edges and nodes removed and added. Finally the page ranks are computed again. Once incrementally (monte carlo) and once by power iteration. I assert that both vectors of page rank values are approximately equal. We can see that the distance converges approximately to zero as the length of the walks increases and a walk length ca. 80 is very accurate. We also see that the number of random walks plays an unimportant role. Finally Martijn gave me access to the new data set which I will incorporate into the algorithm. |
Make the implementation scale Normalized error value To cap or not to cap the walk length
We suspect the first one approximates typical personalized PageRank as implemented by NetworkX, while the second one results in slightly different values. The may explain the asymptotic behaviour of the error as measured in the above post: when the cap on the walk length increases, the second method approaches the first method. In the next sprint, also pick up the first method and compare. Sybil resistance For completeness, include a quick comparison of the NetworkX PageRank and NetworkX personalized PageRank. Completing the picture
|
We proposed the EscapeLimit algorithm in the past, there are multiple random walks:
|
Sybil Resistance: I simulated a random graph of 500 nodes with a Sybil region of 50 nodes and a range of attack edges from 0 to 50. For each number of attack edges I compute the ROC curve and determined the area under the ROC curve as a measure of the sybil resistance of the algorithm. I also computed the proportion of false positives and false negatives for each number of attack edges and obtain the following results: |
Please no random walks of 100 hops! Sane values for reset probability = [15%, 30%]. Not 1%. |
Wrapping up with a few pages report in ipynb format (or old-skool 2-column IEEE format paper).
|
Reviewed final project report:
|
In our discussion on September 6, the following argument was made in favor of small hops:
While it is true that it becomes possible to reach the entire network, but the probability of reaching one of the nodes which you can reach in n steps but not in n - 1 steps is extremely small. Indeed: the number of nodes reachable becomes enormous, but there is still only one walk. The probability that a far away node is visited many times, is extremely small, both because the walk isn't often very long and because there are so many nodes within reach. Instead, the walks will more often visit the nodes which are closer. The above doesn't necessarily argue for long walks, but it shows that they can't be dismissed with that argument. When we're approximating personalized pagerank, only the accuracy and computational cost matters. The incremental method using long walks when approximating personalized pagerank with low reset probability is accurate, see the above experiments by @alexander-stannat. Whether the computational costs are acceptable, still needs some discussion in my opinion (see bottom). If one is interested in nodes which are close by, I think it might be worth considering doing sparse matrix-vector multiplication to approximate.
The result will be similar to personalized pagerank with high alpha. Example: Limit the nodes to third degree connections. Choose alpha = 0.5, and do three multiplications. Had we approximated with random walks, only 12.5% of walks could have reached out of our selection, when assuming equal probabilities for every hop. In practice, people will prefer helping those they know, so this number would be significantly smaller in practice (especially after implementing a peer selection algorithm which gives lowest priority to those with a connection degree that is too large). As a bonus, the Tribler users could be given a simple choice they can understand: "do you only trust the friends of your friends, or also their friends?". @devos50 could you weigh in on what, from a tribler design perspective, you think is an acceptable computational cost for such an algorithm? The current implementation would use a single core 100% for less than a minute for each update (updates can be accumulated and processed at once). @synctext mentioned that he prefers an algorithm which instead uses the cpu continuously but never longer than a few ms. @synctext what was the reason for this preference? Do you want to keep these computations on the twisted event loop? |
This week, we brainstormed a bit about modularization of Tribler (splitting Tribler up in separate components, like IPv8, VoD, market etc). I think a dedicated 'trust module' would be an excellent first candidate to adhere to a modularized design. As a result, I would like to see this module being executed as a separate subprocess and communicate with the Tribler core using an IPC protocol (like JSON-RPC). Depending on the scheduler in the OS and the number of available cores, this subprocess should have minimal impact on the performance of the main Tribler process and overall user experience.
If we follow the design I elaborated on in this comment, we have more flexibility regarding trust computations. Yet, it is very hard to come up with reasonable bounds for this. I would suggest playing around with the implementation and get more insight into the scalability and computational requirements of the algorithm first. To answer your question, we would need to know the tradeoff between required computation time and accuracy. I'm not sure whether the performed experiments are sufficient to answer this question yet. Note that @ichorid can also help you with performance optimizations. |
Update: Very basic GUI is now implemented. The idea was to implement an interactive graph visualising the Tribler network, including the trust rankings of all agents, determined by our current Monte Carlo pagerank implementation. The code can be found here. Ideally, it will resemble the current Tribler network explorer in design and layout. So far, a few features are available, such as finding the highest ranking (most trustworthy) node in the network. This is the node that has the largest number of visit times out of all nodes in the network. However, it may not be the node everyone can or wants to interact with, due to the decline in transitive trust over multiple hops. Another feature is the "Show Neighbourhood" button with which a node can get a close-up look of all it's adjacent nodes. The aim for the future is to create a GUI, in which the user can click on an arbitrary node in the network and then zoom into that node's respective neighbourhood. That way one could explore the entire Tribler network, along interaction paths. Finally, we're working on a "Most Likely Path" feature which will highlight the path in the network along which the majority of random walks in the Monte Carlo algorithm have travelled. For our next steps, we will enhance the Interface's interactiveness, i.e. clicking on nodes and edges reveal information about them. More aesthetically pleasing design and layout and some additional features such as viewing the overall net contribution of nodes to the network, colour-coding trustworthiness, etc. This can be discussed in our next meeting. |
Solid progress. Please focus on feature completeness and leave "GUI beauty" to others... To reduce rendering CPU consumption you can use "DrawCircular". Hard-code that there is 1 node at the center, 1 inner-circle and at most 1 outer-circle. Fill the outer-circle with all 2-hop neighbors of the node we currently view. For simplicity, draw the inner-circle with even spacing, to simplify things for now. But the inner-circle spacing is actually very simple, position in in the middle of the connected outer-circle peers. However, our dataset probably has outer-circle nodes connected by multiple 2-hop paths :-) |
Report in IEEE format now ready. Now It's back to the GUI. |
Nice! Side note, are you sure that's IEEE style? Looks a lot like the ACM template to me ;) |
I have updated the Trust GUI to include a double-circle spring layout of the nodes in the network. See below A peer can view its immediate neighbourhood (2 hops) in this GUI and determine how trustworthy the nodes in its vicinity are. The nodes are ranked using our previous Monte Carlo PageRank Algorithm and are colourcoded based on their trustworthiness from red to green. Making the networkx plot of the graph interactive has proven to be challenging and it looks as though this will have to be hard coded, which is rather burdensome. For now, there seems to be no way around this. Here is an example of how that can be done. Note that this example shows how ineffective such hard-coding is. Also @devos50, yes it's actually an ACM template, you're right :) |
Next sprint: connect to the live network; as of now 400+ MByte of Trustchain data, growing with 80k records daily. Just plot all trustchain data you find in real-time. Keep track of discovered Tribler users by you: "discovered 7500 people". On making the bitmap image interactive
Focus on real-time updates. Show partly the fresh information that the random walk discovered. For instance, |
Goal: elegant math model for cooperation within existing spacial relationships (graph distance, network latency or neighborhood; #2541). Cooperators and Defectors. A simple rule for the evolution of cooperation on graphs and social networks. "the following update rule for evolutionary dynamics (Fig. 1): in each time step, a random individual is chosen to die, and the neighbors compete for the empty site proportional to their fitness. We call this mechanism ‘death–birth’ updating". We do not have anything like that in our Tribler micro-economy. Spatial relationships could influence how you vector with policy settings evolves in time; or new installs ('births') obtain setting from the network! For instance, trustchain policies and forgiveness levels. Thought experiment: the spatial region of The Internet you are born in (install IPv8 latency community) determines who you form relationships with and whom you trust (initial bulk Trustchain record download). You can be born in a bad region or a cooperative blooming area with high-trust. |
Update: |
please do not be confused. |
Here is the current state of the report for the deep-dive meeting with the Ministry of Infrastructure tomorrow. At this point the algorithm is still a standalone implementation based on a static database. Starting February I will begin working on this full-time. The next steps from thereon out will be to:
|
@alexander-stannat What exactly is plotted in Figure 13? I think you should invert your metric, because as it is plotted now it looks like your algorithm is performing consistently worse than random guessing. |
@vandenheuvel The graph in figure 13 corresponds to the area under the ROC curve, a measure for determining the proportion of false negatives/positives of a ranking algorithm. You're right. Our current implementation of the PageRank approximation has a very high proportion of false positives/negatives. The reason for this are the very short random walks which we used in order to mitigate sybil attacks. This leads to nodes that haven't actually been discovered by any random walk being ranked as non-trustworthy (Sybil), because they have a PageRank of 0. This, of course shouldn't be the case, because nodes that haven't been discovered yet shouldn't be labelled Sybils. Instead, only nodes that are in the already discovered part of the graph and have very low trust values should be considered Sybils. Others should just be labelled "unknown". If we do that the accuracy of our ranking should improve dramatically. The graphs therefore need to be updated. Thanks for the feedback 😃 |
Official start of master thesisWe have several online reading lists on trust. Start with writing a related work section. Focus on multiple domains, then sharpen to gossip of reputation info and Sybil attacks. The material already on this wiki:
|
Meetings minute: continue intense reading and wrap-up "blockchain engineering" course work. |
Detected a problem with both Alexander's and my implementation of random walks: In 1000 random walks initiated by root r
even though they did the same amount of job. Winning strategy (Possible attack): Solution: |
I would call this a feature: trustworthiness of information is taken into account. In this extreme numerical example it has a nice dramatic effect (10 vs. 990). These are synthetic numbers, the effectiveness of a Sybil attack prevention method should be done with the really-hard-to-get attack datasets. |
I could consider the node 'b' in the example to be more trustworthy that 'c': they have made the same contribution to this part of the graph, however, from the perspective of node 'r', 'b' is closer. |
Large body of related work from the AI agents-community. From 2002 with currently 1607 citations, "the Beta reputation system". Their magic:
|
Latest 2019 TRUST workshop papers
Long running series of workshops from AI multi-agent community: 21 editions! TUDelft faculty in program committee. |
2018 "Network Reciprocity" paper, great overview of the big picture, https://doi.org/10.1073/pnas.1707505115 "Experiments show that network reciprocity promotes cooperation under a restrictive condition: The benefit of cooperation must outweigh the total cost of cooperating with all neighbors (39). Our question, therefore, is whether this condition can be relaxed by combining two cooperation-promoting mechanisms. Specifically, are there any synergies to be gained by aiding network reciprocity with costly punishment?" |
Thesis work moved to the "proximity" Sybil attack mitigation ticket #4481 |
This issue needs to be promoted and renamed. Contains a wealth of info. How are we going to re-organise all these tickets of past 7+ years? |
Within Tribler we aim to calculate and show trust level of our neighbors.
Trust levels evolve over time, as more information is coming in from our blockchain. Restarting the calculation from cratch every time new information comes in is prohibitively expensive.
We require an incremental algorithm which can update our sybil-resilient random walk algorithm. With our dataset from our deployed network it currently takes 6-9 seconds to do a full calculation. When we have a dataset locally of 50000 peers around us, we require an incremental algorithm.
Key related work by Stanford and Twitter engineers "Fast incremental and personalized PageRank":
No decent Python implementation of incremental PageRank seems to exist.
The text was updated successfully, but these errors were encountered: