Remove entries from wantlists when their related requests are cancelled #3182
Conversation
License: MIT Signed-off-by: Jeromy <[email protected]>
@@ -75,6 +75,7 @@ func (pm *WantManager) WantBlocks(ctx context.Context, ks []key.Key) {
}

func (pm *WantManager) CancelWants(ks []key.Key) {
    log.Infof("cancel wants: %s", ks)
It probably shouldn't be info.
Ehh, I think it's best to keep it at Info for now. It's one of the few things that's accurately reporting what's going on in this ipfs node (hey, we're requesting a block! oh, now we're not looking for that block anymore).
SGTM. The 2nd commit above is failing tests, so it needs to be squashed or fixed.
License: MIT Signed-off-by: Jeromy <[email protected]>
Force-pushed from 2210ef6 to 1548c8a.
Squashed.
A long time ago we agreed not to send wantlists so often, and to instead endeavor to send correct diffs. That is, sending a full wantlist should be a rare event: once on connect, and then maybe every 10 minutes to be safe. But again, diffs alone should work, because they happen over a reliable channel, so if a diff is lost the reliable channel would break, triggering a reconnection and thus a new wantlist. I think re-sending wantlists should in general not be necessary; it's only a safety measure. A good metric to collect is how often a received wantlist differs from our view of it (i.e. when it was useful to have sent a wantlist update).
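For illustration only, here is a minimal sketch of a diff-based wantlist update; the names (WantlistUpdate, Add, Cancel, Full, Apply) are hypothetical and are not the actual Bitswap wire format or go-ipfs API.

package main

import "fmt"

// WantlistUpdate is either a full snapshot of the wantlist or a diff
// (adds and cancels) against the receiver's last-known view.
type WantlistUpdate struct {
    Full   bool     // true: replace the remote view entirely (rare: on connect, plus an occasional safety resend)
    Add    []string // keys newly wanted since the last update
    Cancel []string // keys no longer wanted
}

// Apply merges an update into our view of a peer's wantlist.
func Apply(view map[string]struct{}, u WantlistUpdate) map[string]struct{} {
    if u.Full || view == nil {
        view = make(map[string]struct{})
    }
    for _, k := range u.Add {
        view[k] = struct{}{}
    }
    for _, k := range u.Cancel {
        delete(view, k)
    }
    return view
}

func main() {
    view := Apply(nil, WantlistUpdate{Full: true, Add: []string{"QmA", "QmB"}})
    view = Apply(view, WantlistUpdate{Cancel: []string{"QmA"}})
    fmt.Println(len(view)) // prints 1: only QmB is still wanted
}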
What do you mean by that? We should, and I thought we DID, remove elements from the wantlist after they're retrieved or cancelled.
We weren't removing elements on context cancel.
@Kubuxu thanks
@jbenet My bad, this isn't actually sending wantlists over again... It used to be that code, but now its only purpose is to periodically search for new providers for content we want. So every ten seconds, for each key in our wantlist, we kick off a provider search. Some options here:
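For reference, the every-ten-seconds behavior described above boils down to a loop roughly like the sketch below; the names (providerSearchLoop, wantlistKeys, findProvidersFor) are hypothetical, not the actual go-ipfs code.

package main

import (
    "context"
    "fmt"
    "time"
)

// providerSearchLoop kicks off a provider search for every wanted key on a
// fixed ten-second interval, which is the behavior described above.
func providerSearchLoop(ctx context.Context, wantlistKeys func() []string, findProvidersFor func(context.Context, string)) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            // One provider search per wanted key, every tick; with a wantlist
            // that never shrinks, this is the part that gets expensive.
            for _, k := range wantlistKeys() {
                go findProvidersFor(ctx, k)
            }
        case <-ctx.Done():
            return
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()
    keys := func() []string { return []string{"QmA", "QmB"} }
    search := func(_ context.Context, k string) { fmt.Println("searching providers for", k) }
    providerSearchLoop(ctx, keys, search)
}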
So, since we already look for the root hash of every request when the wanted blocks are first added, that should give us a pretty good amount of DHT walking to connect to peers that will have the blocks (unless we are unlucky, already connected to the best peer, and that peer only has a really small set of the blocks). Doing something like "every 30 seconds, send a query for a random block from the wantlist" would reduce the bandwidth dramatically; the problem is that we can't tell how much that would hurt our ability to search, but it is a good experiment. A well-peered node should not have these issues, so it might be good to have a smaller interval for short-lived nodes and then increase the interval as the node gains more and more connections (i.e. from 10 seconds to 60 seconds). The outcomes of this will change a lot once the network grows bigger; right now it is mostly easy to get well peered (except for the symmetric NAT traversal cases).
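As a toy illustration of scaling the interval with connectivity, assuming invented thresholds and a hypothetical rebroadcastInterval helper (not from go-ipfs):

package main

import (
    "fmt"
    "time"
)

// rebroadcastInterval scales the provider-search interval with how well
// peered the node is: aggressive while poorly connected, backed off once
// the node has plenty of connections.
func rebroadcastInterval(numConns int) time.Duration {
    switch {
    case numConns < 10:
        return 10 * time.Second
    case numConns < 100:
        return 30 * time.Second
    default:
        return 60 * time.Second
    }
}

func main() {
    for _, n := range []int{3, 50, 500} {
        fmt.Printf("%d connections -> rebroadcast every %s\n", n, rebroadcastInterval(n))
    }
}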
Yes, but the downside is also longer downloads. The "urgency of a request" is not currently captured, but could inform these decisions.
Yes, +1 to this. Another option is to use "trickle" (exponential) backoffs. They work really well in networking in general:
Trickle is used differently -- more for broadcasts to regain consistency in routing protocols. But here, "agree" would mean "I searched the DHT but found nothing". Each successive query like that could have an exponential backoff, with a maximum (of, say, 30 min or 1 hr). That way, things that are "not around right now" don't place a strain on the network. The way to implement this here is:

type trickleCfg struct {
    Min time.Duration
    Max time.Duration
}

var bsGetProvsTrickle = trickleCfg{Min: time.Second * 5, Max: time.Hour}

// keep one of these per key in the wantlist
type trickleTimer struct {
    interval time.Duration // initialized to bsGetProvsTrickle.Min
    next     time.Time     // initialized to now + interval
}

func (t *trickleTimer) Reset(cfg trickleCfg) {
    t.interval = cfg.Min
    t.next = time.Now().Add(cfg.Min)
}

// algorithm for updating: collect the keys whose timers have expired,
// back each expired timer off exponentially, and go search for providers.
func updateTimers(keyTimers map[cid.Cid]*trickleTimer, now time.Time, getProviders func([]cid.Cid)) {
    var toSearch []cid.Cid
    for c, timer := range keyTimers {
        if now.After(timer.next) {
            toSearch = append(toSearch, c)
            timer.next = timer.next.Add(timer.interval)
            timer.interval = timer.interval * 2 // exponential backoff (trickle)
            if timer.interval > bsGetProvsTrickle.Max {
                timer.interval = bsGetProvsTrickle.Max
            }
        }
    }
    go getProviders(toSearch)
}

// IMPORTANT:
// - if a new request comes in, search for providers then AND
//   call timer.Reset(bsGetProvsTrickle)
// - no need to ever bring it back down, because if we FIND
//   or cancel the key, the timer goes away entirely.
This PR has been a long time coming. A recent race condition fix has led to us actually rebroadcasting our wantlist every ten seconds as we claim to do (instead of just pretending to). As a result, the gateways have started catching on fire. The reason is that, since we never remove elements from the wantlist, we now rebroadcast thousands of wantlist entries every ten seconds. That's kinda bad.
The solution is to reference count wantlist entries and remove them when no active requests reference them any longer.
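For illustration only, a minimal sketch of reference-counted wantlist entries; the names here (refcntWantlist, Add, Remove) are hypothetical and not necessarily what this PR implements.

package main

import "fmt"

// refcntWantlist tracks how many active requests want each key.
type refcntWantlist struct {
    entries map[string]int
}

func newRefcntWantlist() *refcntWantlist {
    return &refcntWantlist{entries: make(map[string]int)}
}

// Add records another active request for key.
func (w *refcntWantlist) Add(key string) {
    w.entries[key]++
}

// Remove drops one reference and reports whether the entry is now gone,
// i.e. whether it is safe to broadcast a cancel for this key.
func (w *refcntWantlist) Remove(key string) (gone bool) {
    n, ok := w.entries[key]
    if !ok {
        return false
    }
    if n <= 1 {
        delete(w.entries, key)
        return true
    }
    w.entries[key] = n - 1
    return false
}

func main() {
    w := newRefcntWantlist()
    w.Add("QmFoo")
    w.Add("QmFoo")                 // two requests want the same block
    fmt.Println(w.Remove("QmFoo")) // false: still referenced, keep the entry
    fmt.Println(w.Remove("QmFoo")) // true: last reference dropped, send a cancel
}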
A few notable sub-fixes happened to make this work: