Bloom filter gossip #266
Conversation
High level comments:
- Improve the gossip to not block on each peer's response.
- Remove the wrapper from the bloom filter
- Wrap the mempool type instead of modifying it if possible.
Also left some inline comments.
@@ -56,6 +62,8 @@ func TestMempoolAtmTxsIssueTxAndGossiping(t *testing.T) {
	return nil
}

assert.NoError(vm.SetState(ctx, snow.NormalOp))
nit: prefer using require over assert for new code.
Just using assert here because the rest of this test uses assert. I think we can pass over these tests to switch to require in a future PR.
plugin/evm/mempool.go
Outdated
if _, exists := m.issuedTxs[txID]; exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
}
if _, exists := m.currentTxs[txID]; exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
}
if _, exists := m.txHeap.Get(txID); exists {
-	return nil
+	return fmt.Errorf("%w: %s", errTxAlreadyIssued, txID)
why are we changing this behavior?
I believe we want something to signal that tx was already issued and we don't need to refresh the bloom filter.
I'd return a boolean + error instead of changing the error semantics.
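A minimal sketch of the suggested boolean + error shape, using a toy mempool (the `mempool`/`addTx` names here are illustrative, not the actual coreth types):

```go
package main

import "fmt"

// mempool is a toy stand-in illustrating the "boolean + error" return:
// the bool reports whether the tx was newly added, so callers can skip
// refreshing the bloom filter without repurposing the error value.
type mempool struct {
	issued map[string]bool
}

// addTx returns (false, nil) when the tx was already issued, and
// (true, nil) when it was newly added.
func (m *mempool) addTx(txID string) (bool, error) {
	if m.issued[txID] {
		return false, nil // already issued: no bloom-filter refresh needed
	}
	m.issued[txID] = true
	return true, nil
}

func main() {
	m := &mempool{issued: map[string]bool{}}
	added, _ := m.addTx("tx1")
	fmt.Println(added) // true: newly added
	added, _ = m.addTx("tx1")
	fmt.Println(added) // false: duplicate, but not an error
}
```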
@joshua-kim I am still reviewing this but I have two general questions:
- Shouldn't the bloom filter mechanism be activated only following a hard fork? What would happen with the current code if it was deployed in a network of nodes not fully supporting bloom gossiping?
- Is this bloom gossip mechanism designed to be portable to the avalanchego mempool? If so, are there PRs on the avalanchego side? Happy to help there if I can
Mostly notes on first pass, really cool work!
Take my comment with a grain of salt, it's honestly more for me than it is for you!
@joshua-kim, feel free to resolve the comment if you don't agree, but I would love it if you could explain why.
And if you do agree... you're welcome 😉 .
We can deploy this without a hard fork, since this introduces all the new gossip over new message types. Nodes without the upgrade will just drop the new messages. Eventually we will want to deprecate the existing push-based gossip, which will need to be done on a hard-fork.
Yeah this should be portable to X/P chains (which is why the code is in its own independent package).
changes overall lgtm, just a few comments. I feel like currently our gossip/network/handler related files are a little too spread across the code base. It's a bit hard to follow and add new handlers. I wonder if we should collect them together in a package and use subpackages to differentiate them from each other.
gossip/bloom.go
Outdated
// ResetBloomFilterIfNeeded resets a bloom filter if it breaches a ratio of
// filled elements. Returns true if the bloom filter was reset.
func ResetBloomFilterIfNeeded(
	bloomFilter **bloomfilter.Filter,
Do we need a pointer to a pointer here?
I wonder if we should just have a NeedReset(...) bool function and recreate the bloom filter anew in the caller?
Yeah so this is a funny situation, we either pass a double pointer here, or we pass a single pointer here but this causes an issue since when we overwrite this pointer's value like so:
fresh, _ := bloomfilter.New((*bloomFilter).M(), (*bloomFilter).K())
*bloomFilter = *fresh // copying a mutex here
it flags the linter as us copying a mutex (I think this is technically safe since it's gonna be unlocked and no one else can modify it). I felt like the double pointer was less evil than this.
"I wonder if we should just have a NeedReset(...) bool function and recreate the bloom filter anew in the caller?"
Yeah an alternative would be just to remove this function entirely, it's only used in two places and it's a small snippet of code so the duplication cost is really not bad.
Ya I'd prefer to either remove this entirely or update this to a function that does not deal with a double pointer. Even though this seems correct, it still seems overly complicated. I think another simpler alternative would be to have a function that returns nil if no reset is needed and returns a non-nil pointer to a new bloom filter if it is needed (or a boolean if that seems cleaner)
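A sketch of the nil-or-new-filter alternative being suggested. The `filter` type here is a stand-in for `bloomfilter.Filter` (and `resetIfNeeded`/`maxRatio` are hypothetical names); the point is that the caller swaps the pointer itself, so no double pointer and no mutex copy are needed:

```go
package main

import "fmt"

// filter is a toy stand-in for bloomfilter.Filter; only the fill ratio
// matters for this sketch.
type filter struct {
	m, n uint64 // capacity and number of inserted elements
}

func (f *filter) fillRatio() float64 { return float64(f.n) / float64(f.m) }

// resetIfNeeded returns a fresh filter when the fill ratio breaches
// maxRatio, and nil otherwise. The caller assigns the returned pointer,
// avoiding **filter entirely.
func resetIfNeeded(f *filter, maxRatio float64) *filter {
	if f.fillRatio() < maxRatio {
		return nil
	}
	return &filter{m: f.m} // same parameters, empty contents
}

func main() {
	f := &filter{m: 100, n: 90}
	if fresh := resetIfNeeded(f, 0.8); fresh != nil {
		f = fresh // caller performs the swap
	}
	fmt.Println(f.n) // 0 after the reset
}
```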
did a pass of the gossip package
gossip/gossip.go
Outdated
// GossipableAny exists to help create non-nil pointers to a concrete Gossipable
type GossipableAny[T any] interface {
	*T
	Gossipable
}
This is such a wild hack... afaict it does work tho... Is there a ref we can link to for this? Essentially it works because:
- The type must be a pointer to something
- The type must implement the interface.
- A pointer to an interface does not implement the interface.
Yeah you summed up how it works. I can add a ref here.
It's basically a decision of whether we prefer this generics black magic (add complexity to the package) vs if we want to add another interface the caller has to return a pointer to their type (add complexity to the caller). I prefer the former personally but can understand if we feel like this is too evil.
Alternative:
type FooFactory struct{}
func (FooFactory) MakePointer() *Foo {
return &Foo{}
}
plugin/evm/gossip_mempool.go
Outdated
type GossipAtomicTx struct {
	Tx *Tx `serialize:"true"`
}

func (tx *GossipAtomicTx) GetHash() gossip.Hash {
	id := tx.Tx.ID()
	return gossip.HashFromBytes(id[:])
}

func (tx *GossipAtomicTx) Marshal() ([]byte, error) {
	return Codec.Marshal(message.Version, tx)
}

func (tx *GossipAtomicTx) Unmarshal(bytes []byte) error {
	_, err := Codec.Unmarshal(bytes, tx)
	return err
}
Would it make sense to implement these functions directly on the *Tx type? I suppose it's pretty nice that we define both messages here.
My thoughts are that this is all gossip-specific code and we should never be using GetHash, Marshal, or Unmarshal outside of the context of our gossip package/implementations, so I don't want people to ever depend on it.
plugin/evm/gossip_mempool.go
Outdated
g.lock.RLock()
defer g.lock.RUnlock()

return g.bloom
This seems racy - this bloom can be reset in the goroutine running Subscribe - so the caller of this function must either have access to the lock that is local to this struct, or be otherwise guaranteed that Subscribe isn't running. Or I'm missing something.
No, this is racy as you correctly mentioned, since we return a reference which can be modified by the Subscribe goroutine once the lock is released. I think this is fixed if we deep-copy here.
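A sketch of the deep-copy fix (the `bloom`/`gossiper` types are stand-ins, not the real coreth structs; the real code would copy the bloom filter's bit array behind the read lock):

```go
package main

import (
	"fmt"
	"sync"
)

// bloom is a stand-in for the real bloom filter; bits is the shared state
// that the Subscribe goroutine may reset concurrently.
type bloom struct {
	bits []uint64
}

type gossiper struct {
	lock  sync.RWMutex
	bloom *bloom
}

// GetBloomFilter returns a deep copy made under the read lock, so the
// caller never observes a concurrent reset from the Subscribe goroutine.
func (g *gossiper) GetBloomFilter() *bloom {
	g.lock.RLock()
	defer g.lock.RUnlock()

	cp := &bloom{bits: make([]uint64, len(g.bloom.bits))}
	copy(cp.bits, g.bloom.bits)
	return cp
}

func main() {
	g := &gossiper{bloom: &bloom{bits: []uint64{1, 2, 3}}}
	snapshot := g.GetBloomFilter()
	g.bloom.bits[0] = 99 // simulated concurrent reset
	fmt.Println(snapshot.bits[0]) // snapshot is unaffected
}
```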
plugin/evm/mempool.go
Outdated
m.lock.RLock()
defer m.lock.RUnlock()

return m.bloom
Similarly to the other comment - this seems racy.
for _, item := range m.txHeap.maxHeap.items {
	gossipTx := &GossipAtomicTx{Tx: item.tx}
	if !filter(gossipTx) {
		continue
	}
	gossipTxs = append(gossipTxs, gossipTx)
}
Also feel like this should have some form of maximum size
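A sketch of what a cap on the loop above could look like (simplified to ints; `collect` and `maxTxs` are hypothetical names, not the actual coreth code):

```go
package main

import "fmt"

// collect gathers items passing the filter, stopping once maxTxs have
// been collected so a gossip response has a bounded size.
func collect(items []int, filter func(int) bool, maxTxs int) []int {
	var out []int
	for _, item := range items {
		if len(out) >= maxTxs {
			break // cap the response size
		}
		if !filter(item) {
			continue
		}
		out = append(out, item)
	}
	return out
}

func main() {
	items := []int{1, 2, 3, 4, 5, 6}
	even := func(n int) bool { return n%2 == 0 }
	fmt.Println(collect(items, even, 2)) // at most 2 matches returned
}
```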
Co-authored-by: Stephen Buttolph <[email protected]> Signed-off-by: Joshua Kim <[email protected]>
small nit
Diff is getting huge, we broke up this PR's diff to make it more reviewable to get this over the finish line: Coreth PR to add router handling + UT: #316
@joshua-kim as there are nodes running older versions, will they get throttled due to the droppedRequests counter or any other mechanisms?
@t-anyu Nodes running older versions will drop incoming new gossip requests because they'll fail to unmarshal against the legacy codec. They will still get transactions because all nodes will support the legacy push-based gossip.
@joshua-kim @StephenButtolph I hope the team isn't planning to remove push gossip entirely.
We will be moving push gossip to the new p2p SDK. We will not remove push gossip entirely, as that would significantly increase the tx propagation times.
Ah, I see. Thanks!
Why this should be merged
Performance optimization to lower the end-to-end issuance-to-acceptance time for a transaction.
How this works
Uses a pull-based approach to gossip instead of a push-based approach. See testing for more details.
How this was tested
Ran some simulations, results are as follows:
With the existing push-based gossip, we see that as more and more nodes learn the new transaction, the rate at which nodes learn the transaction slows down exponentially.
At first, only one node knows about the transaction, so there are few nodes gossiping it. Once it hits the 50% mark, the transaction is gossiped at maximum velocity, since we're at the middle point where half the network is gossiping the transaction and the other half is still willing to forward it.
As a majority of the network learns of the transaction, it starts slowing down again. This is because nodes sample peers to gossip to, but the peers will only forward the gossiped transaction if it hasn't seen it already. Over time, peers effectively "absorb" the gossip and don't forward it, so the last few nodes are very unlikely to ever hear of the transaction until a re-gossip cycle kicks in.
With the pull-based approach, we see that the transaction is gossiped exponentially quickly as more and more nodes learn the transaction. This is because we now poll peers, and as more peers learn the transaction, it's more likely that you're going to poll a set of peers that has it.