Adaptive queue for staging dials #237
Conversation
Currently the DHT is performing dials outside of the Alpha concurrency limit: we are dialling all nodes that peers return in `CloserPeers`, without limit. As a result, we end up flooding the swarm with dial jobs, which trips over the file descriptor limits and brings dialling to a halt under some circumstances. Our current approach is also algorithmically incorrect, and leads to suboptimal query patterns.

This patch introduces an adaptive dial queue that spawns a dynamically sized set of goroutines to preemptively stage dials for later handoff to the DHT protocol for RPC. It identifies backpressure on both ends (dial consumers and dial producers), and takes compensating action by adjusting the worker pool. We start with `DialQueueMinParallelism` workers (6), and scale up and down based on the demand and supply of dialled peers. The following events trigger scaling:

- We scale up when we can't immediately return a successful dial to a new consumer.
- We scale down when we've been idle for a while waiting for new dial attempts.
- We scale down when we complete a dial and realise nobody was waiting for it.

Dialler throttling (e.g. when the FD limit is exceeded) is a concern, because we could easily spin up more workers to compensate and end up adding fuel to the fire. Since we have no deterministic way to detect this for now, we hard-limit concurrency to `DialQueueMaxParallelism` (20).

Testing this patch in a production mirror reduced the dial backlog considerably and showed the adaptiveness in action.
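To make the scaling behaviour concrete, here is a minimal, hypothetical Go sketch of such an adaptive worker pool. Everything in it (the `dialQueue` type, `Consume`, the constants and the channel layout) is an assumption made for illustration, not the code in this patch, and the third trigger (retiring a worker whose completed dial found nobody waiting) is left out for brevity.

```go
package dialqueue

import (
	"context"
	"sync"
	"time"
)

// Assumed constants mirroring DialQueueMinParallelism / DialQueueMaxParallelism.
const (
	minWorkers = 6
	maxWorkers = 20
	maxIdle    = 5 * time.Second // assumed idle period before a worker retires
)

// dialQueue is a hypothetical, simplified adaptive worker pool: producers push
// peer IDs into in, workers dial them, and consumers pull dialled peers from out.
type dialQueue struct {
	in       chan string
	out      chan string
	dial     func(ctx context.Context, p string) error
	mu       sync.Mutex
	nWorkers int
}

func newDialQueue(dial func(context.Context, string) error) *dialQueue {
	dq := &dialQueue{in: make(chan string), out: make(chan string), dial: dial}
	for i := 0; i < minWorkers; i++ {
		dq.addWorker()
	}
	return dq
}

// addWorker grows the pool, respecting the hard concurrency ceiling.
func (dq *dialQueue) addWorker() {
	dq.mu.Lock()
	defer dq.mu.Unlock()
	if dq.nWorkers >= maxWorkers {
		return
	}
	dq.nWorkers++
	go dq.worker()
}

// tryRetire shrinks the pool, never dropping below the minimum.
func (dq *dialQueue) tryRetire() bool {
	dq.mu.Lock()
	defer dq.mu.Unlock()
	if dq.nWorkers > minWorkers {
		dq.nWorkers--
		return true
	}
	return false
}

// Consume returns a dialled peer. If none is ready immediately, that is
// backpressure on the consumer side, so the pool scales up before blocking.
func (dq *dialQueue) Consume(ctx context.Context) (string, error) {
	select {
	case p := <-dq.out:
		return p, nil
	default:
		dq.addWorker() // could not serve the consumer immediately: scale up
	}
	select {
	case p := <-dq.out:
		return p, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func (dq *dialQueue) worker() {
	for {
		select {
		case p := <-dq.in:
			if err := dq.dial(context.Background(), p); err != nil {
				continue // failed dials are simply dropped in this sketch
			}
			dq.out <- p // hand the dialled peer to a consumer
		case <-time.After(maxIdle):
			// No dial requests for a while: shrink the pool.
			if dq.tryRetire() {
				return
			}
		}
	}
}
```

The design point the sketch tries to mirror is that the scale-up signal comes from the consumers themselves: a `Consume` call that cannot be served immediately is direct evidence of under-provisioning, with no central controller needed.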
Very nice and clean.
Future optimisation: cancelling pending dials to worse nodes as we find closer nodes to the target. EDIT: in practice, this is complex, because those theoretically better nodes may never respond, and we would've stopped making progress. The algorithm would have to compensate by backtracking and replaying those dials. Quite a dance.
Addressed the review comments, but I noticed a flaky test on CI along the way. I do deplore depending on time, but I cannot think of another way to test this.
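As an aside, the kind of time-dependent assertion being referred to might look roughly like the sketch below, written against the hypothetical pool sketched earlier rather than the actual test in this PR; the wall-clock sleep past the idle threshold is exactly what makes such tests flaky on loaded CI machines.

```go
package dialqueue

import (
	"context"
	"testing"
	"time"
)

// TestScaleDownWhenIdle is an illustrative, time-dependent test sketch: it
// relies on real sleeps, so a slow CI machine can easily upset its timing.
func TestScaleDownWhenIdle(t *testing.T) {
	dq := newDialQueue(func(ctx context.Context, p string) error { return nil })

	// Ask for a dial that is not ready; this times out but triggers a scale-up.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	_, _ = dq.Consume(ctx)

	// Wait past the idle threshold plus a generous margin, then check that the
	// pool has shrunk back to the minimum.
	time.Sleep(maxIdle + 500*time.Millisecond)

	dq.mu.Lock()
	defer dq.mu.Unlock()
	if dq.nWorkers > minWorkers {
		t.Fatalf("expected the pool to shrink back to %d workers, still have %d", minWorkers, dq.nWorkers)
	}
}
```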
Force-pushed from 4ab14f9 to 74d22f3 (compare).
@Stebalien – up for re-review. I ended up changing the waiting mechanism to a slice, like we discussed in comments.
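For readers skimming the conversation without the diff, one plausible shape for a slice-based waiting mechanism is sketched below; `waiterList`, `wait` and `deliver` are illustrative names, not the identifiers used in the patch.

```go
package dialqueue

import "sync"

// waiterList is an illustrative guess at a slice-based waiting mechanism:
// each blocked consumer registers a channel, and completed dials satisfy
// waiters in FIFO order.
type waiterList struct {
	mu      sync.Mutex
	waiters []chan string
}

// wait registers a consumer and returns the channel it should block on.
func (w *waiterList) wait() chan string {
	ch := make(chan string, 1) // buffered so deliver never blocks
	w.mu.Lock()
	w.waiters = append(w.waiters, ch)
	w.mu.Unlock()
	return ch
}

// deliver hands a dialled peer to the oldest waiter, if any, and reports
// whether anyone was actually waiting (a useful scale-down signal).
func (w *waiterList) deliver(p string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.waiters) == 0 {
		return false // nobody was waiting for this dial
	}
	ch := w.waiters[0]
	w.waiters = w.waiters[1:]
	ch <- p
	return true
}
```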