Use timeout context for NewStream call #994

ligustah · 2024-10-20T17:09:01Z

Hey,

I'm running an IPFS node with the Accelerated DHT Client, but I'm seeing issues with the reprovider system showing reprovider system not ready. I've been debugging the fullrt code a little bit and it seems it's because the initial crawl never completes, because this loop never exits. Specifically, it seems that some of the worker goroutines get stuck trying to query some quic peers. See the below goroutine stacks to follow along.

crawler stuck in queryPeer, trying to open a stream to a peer

goroutine 579 [select, 10 minutes]:
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream(0x14001660480, {0x10676c6d8, 0x1400dbe6ff0}, {0x1401648def0, 0x26}, {0x1400200a8a0, 0x1, 0x1})
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:737 +0xb34
github.com/libp2p/go-libp2p/p2p/host/routed.(*RoutedHost).NewStream(0x14000a2b860, {0x10676c6d8, 0x1400dbe6ff0}, {0x1401648def0, 0x26}, {0x1400200a8a0, 0x1, 0x1})
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/routed/routed.go:214 +0x144
github.com/libp2p/go-libp2p-kad-dht/crawler.(*messageSender).SendRequest(0x1400233d590, {0x10676c6d8, 0x1400dbe6ff0}, {0x1401648def0, 0x26}, 0x1401d97c800)
	/go/pkg/mod/github.com/libp2p/[email protected]/crawler/crawler.go:78 +0x9c
github.com/libp2p/go-libp2p-kad-dht/pb.(*ProtocolMessenger).GetClosestPeers(0x1400200a8b0, {0x10676c6d8, 0x1400dbe6ff0}, {0x1401648def0, 0x26}, {0x1401729bad0, 0x22})
	/go/pkg/mod/github.com/libp2p/[email protected]/pb/protocol_messenger.go:164 +0x514
github.com/libp2p/go-libp2p-kad-dht/crawler.(*DefaultCrawler).queryPeer(0x1400233d5c0, {0x10676c710, 0x14001ac6320}, {0x1401648def0, 0x26})
	/go/pkg/mod/github.com/libp2p/[email protected]/crawler/crawler.go:248 +0x6f4
github.com/libp2p/go-libp2p-kad-dht/crawler.(*DefaultCrawler).Run.func1()
	/go/pkg/mod/github.com/libp2p/[email protected]/crawler/crawler.go:148 +0x124
created by github.com/libp2p/go-libp2p-kad-dht/crawler.(*DefaultCrawler).Run in goroutine 302
	/go/pkg/mod/github.com/libp2p/[email protected]/crawler/crawler.go:145 +0x200

BasicHost stuck trying to negotiate a protocol?

goroutine 295264 [chan receive, 14 minutes]:
github.com/quic-go/quic-go.(*receiveStream).readImpl(0x14016495500, {0x14002800595, 0x1, 0x1})
	/go/pkg/mod/github.com/quic-go/[email protected]/receive_stream.go:180 +0xb50
github.com/quic-go/quic-go.(*receiveStream).Read(0x14016495500, {0x14002800595, 0x1, 0x1})
	/go/pkg/mod/github.com/quic-go/[email protected]/receive_stream.go:94 +0xe4
github.com/libp2p/go-libp2p/p2p/transport/quic.(*stream).Read(0x14017093400, {0x14002800595, 0x1, 0x1})
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/transport/quic/stream.go:22 +0x60
github.com/libp2p/go-libp2p/p2p/net/swarm.(*Stream).Read(0x1401770c000, {0x14002800595, 0x1, 0x1})
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/net/swarm/swarm_stream.go:58 +0x60
github.com/multiformats/go-multistream.(*byteReader).ReadByte(0x140060e0080)
	/go/pkg/mod/github.com/multiformats/[email protected]/multistream.go:322 +0x84
github.com/multiformats/go-varint.ReadUvarint({0x10674f1e0, 0x140060e0080})
	/go/pkg/mod/github.com/multiformats/[email protected]/varint.go:80 +0x58
github.com/multiformats/go-multistream.lpReadBuf({0x1301c4f78, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/multistream.go:286 +0x130
github.com/multiformats/go-multistream.ReadNextTokenBytes({0x1301c4f78, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/multistream.go:269 +0x48
github.com/multiformats/go-multistream.ReadNextToken[...]({0x1301c4f78, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/multistream.go:258 +0x48
github.com/multiformats/go-multistream.readMultistreamHeader({0x1301c4f78, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/client.go:118 +0x44
github.com/multiformats/go-multistream.SelectProtoOrFail[...]({0x105597f50, 0xf}, {0x1301c4f48, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/client.go:56 +0x250
github.com/multiformats/go-multistream.SelectOneOf[...]({0x1400200a8a0, 0x1, 0x1}, {0x1301c4f48, 0x1401770c000})
	/go/pkg/mod/github.com/multiformats/[email protected]/client.go:89 +0x184
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream.func2()
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:734 +0xc4
created by github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream in goroutine 622
	/go/pkg/mod/github.com/libp2p/[email protected]/p2p/host/basic/basic_host.go:733 +0xab8

I feel like this has only been happening recently, so maybe some default timeout somewhere in libp2p was removed. Looking through the code, it seems that no timeout is set on the call to NewStream, which already involves network communication and will not always time out on its own. The timeout is only set later in the function when reading from the stream.

Most NewStream calls complete in less than 1ms, so it seems sensible to just include the call in the same timeout context. I confirmed locally that this makes the initial crawl complete successfully.

lidel · 2024-10-22T23:46:36Z

cc @sukunrt @MarcoPolo for visibility, perhaps this class of quic issue was reported before?

MarcoPolo · 2024-10-23T00:40:33Z

I added an accompanying change to scope the query context down as well: #996

guillaumemichel

Thanks for your contribution @ligustah

testing timeouts from libp2p/go-libp2p-kad-dht#994 and libp2p/go-libp2p-kad-dht#996

Use timeout context for NewStream call

02d373e

ligustah requested review from guillaumemichel and a team as code owners October 20, 2024 17:09

gammazero added the need/analysis Needs further analysis before proceeding label Oct 22, 2024

lidel self-assigned this Oct 22, 2024

gammazero removed the need/analysis Needs further analysis before proceeding label Oct 22, 2024

lidel removed their assignment Oct 22, 2024

This was referenced Oct 23, 2024

Set sane default timeout for BasicHost.NewStream libp2p/go-libp2p#3019

Closed

fix: set context timeout for queryPeer #996

Merged

MarcoPolo approved these changes Oct 23, 2024

View reviewed changes

guillaumemichel approved these changes Oct 23, 2024

View reviewed changes

guillaumemichel merged commit 50b4e99 into libp2p:master Oct 23, 2024
9 checks passed

lidel added a commit to ipfs/kubo that referenced this pull request Nov 13, 2024

test: go-libp2p-kad-dht#996

fd1f673

testing timeouts from libp2p/go-libp2p-kad-dht#994 and libp2p/go-libp2p-kad-dht#996

This was referenced Nov 13, 2024

fix: go-libp2p-kad-dht v0.28.0 ipfs/kubo#10578

Merged

chore: release v0.28.0 #998

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use timeout context for NewStream call #994

Use timeout context for NewStream call #994

ligustah commented Oct 20, 2024

lidel commented Oct 22, 2024

MarcoPolo commented Oct 23, 2024

guillaumemichel left a comment •

edited

Loading

Use timeout context for NewStream call #994

Use timeout context for NewStream call #994

Conversation

ligustah commented Oct 20, 2024

crawler stuck in queryPeer, trying to open a stream to a peer

BasicHost stuck trying to negotiate a protocol?

lidel commented Oct 22, 2024

MarcoPolo commented Oct 23, 2024

guillaumemichel left a comment • edited Loading

Choose a reason for hiding this comment

guillaumemichel left a comment •

edited

Loading