IPFS loses swarm connection while pinning #5977
Comments
This looks like an issue we fixed recently: libp2p/go-libp2p-kad-dht#237 (comment) Would you be able to build IPFS from master and try reproducing?
If you provide me the commands for the Docker IPFS image, then yes, gladly :)
@markg85 you can just fetch the master tag from Docker Hub:
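For reference, pulling a master build might look roughly like this (a sketch only; the exact tag name is an assumption, since the original command was not captured in this thread):

# the tag name here is an assumption; check Docker Hub for the current master tags
docker pull ipfs/go-ipfs:master-latest
# verify which version/commit the container is actually running
docker exec ipfs_host ipfs version --all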
Ehh, okay. The master ipfs doesn't appear to be able to connect:
On which machine are you executing the command?
I'm executing the command on the cloud node (whose ID changed) towards the local one (which remained as-is). I'm trying to build go-ipfs locally now, just to see if that would work, as both would then be from master.
Thanks. Just one note: I think your issue could be the connection manager killing the session. You can try to increase the connection manager limits in the IPFS config: https://github.com/ipfs/go-ipfs/blob/master/docs/config.md
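A rough sketch of what raising those limits could look like (the values are arbitrary examples; the keys are the Swarm.ConnMgr settings documented in docs/config.md, and the daemon needs a restart afterwards for them to take effect):

# raise the connection manager watermarks and grace period (example values)
ipfs config --json Swarm.ConnMgr.LowWater 1000
ipfs config --json Swarm.ConnMgr.HighWater 2000
ipfs config Swarm.ConnMgr.GracePeriod 60s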
No, I won't. It currently is at the defaults, and that already causes the cloud provider to think I got hacked, due to thousands of connections in mere minutes, as if I'm attacking someone. I'm guessing that improved greatly with your p2p fixes and the recent bitswap fixes. At least, I hope it did :)
Note that the connection manager and the swarm dialer limit are distinct. The connection rate (in-flight dials) is governed by the swarm (what your cloud provider may be complaining about); that has improved with the DHT fixes. The connection manager is in charge of keeping open connections within bounds.
I'm sorry, but I can't get this working at all anymore now.
Is there anything I can add in terms of debug logging to help trace this thing?
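One way to get more detail (a sketch; subsystem names and useful levels vary between versions) is to bump the daemon's log level and watch the events while reproducing:

# raise the log level for all subsystems on the running daemon
ipfs log level all debug
# stream log events while retrying the failing swarm connect
ipfs log tail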
@markg85 was kind enough to pair with me on this. The issue is that despite having a static mapping in his router for IPFS on port 4001, current master was discovering a wrong public port (1024, weird). This led to his address in the DHT being incorrect, and dials failing because his NAT dropped the incoming traffic.
@raulk and I paired on IRC to debug this.
While I had port 4001 open and forwarded, it shows port 1025 in this case, which is wrong.
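For anyone comparing against their own node, the announced addresses can be inspected like this (a sketch; the addresses will of course differ per node):

# the Addresses list should contain the forwarded port (4001 here),
# not a randomly discovered port like 1024/1025
ipfs id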
@markg85 can you post the equivalent output from 0.4.18, please? Thanks again.
And as I just tested, 0.4.18 has the same issue.
Just a friendly reminder: both my local and remote machines now run 0.4.19! On my remote machine there is no 1024 port. Good! The local machine has a clean IPFS setup, data and config. Please take a look at this; it causes swarm connections to "sometimes" fail and "sometimes" work.
How can I raise the attention of the right people for this issue? I have a feeling the ones that need to know about this don't, which causes new releases to be shipped with the very same bug still present.
We are working on this, but it's just not the only thing we're working on fixing. @raulk is the right person.
I would suggest marking this as a blocker for the next release.
That's not going to get the problem fixed any faster; it will just delay other fixes.
I understand, but do know that this bug prevents making a connection at all. That little side effect alone should make it quite a high priority. On the other hand, I have it but others don't seem to be bothered by it at all, so it might just be occurring with some router vendors? Or some other special, non-obvious thing. And when just using IPFS (i.e. not running commands, but just using it to browse the "IPFS internet") there seems to be nothing wrong.
@markg85 I have the same issue with advertising wrong ports (ipfs id)
@remmerw That might be something, or perhaps something that makes investigating it easier for the devs. In my case however, I've only ever had one node running behind the router. Never more.
@markg85 @remmerw @raulk Seems like I've got a pretty similar issue (local desktop node failed to swarm connect to a remote cloud node)!
Some clues/findings:
@voidao that's likely unrelated to this issue. "Cloud" nodes don't have NAT issues. WRT this issue, the core problem is that IPFS doesn't know how you've configured your router. It has to guess as well as it can. It does this by:
Unfortunately, it doesn't look like either of those is working in this case. I'm going to close this in favor of libp2p/go-libp2p#559, as that's an actionable solution to this issue.
@Stebalien Thank you for the detailed explanation! It makes sense to me, and I guess it's caused by the router or something else in the NAT environment.
Hi,
I'm playing with IPFS and pinning, and I might have discovered an oddity involving pinning and swarm connections.
The setup is as follows (a sketch of the docker invocation is included after the list):
1 IPFS server on a cloud hosting provider
1 IPFS locally
Both are the latest IPFS version (0.4.18).
Both run with --routing=dhtclient
The server is running with IPFS_PROFILE=server
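A sketch of how such a cloud node could be started with the official image (the image tag, volume path, and port mapping here are illustrative assumptions, not taken from this report):

# illustrative: server profile applied via IPFS_PROFILE, DHT client routing passed to the daemon
docker run -d --name ipfs_host \
  -e IPFS_PROFILE=server \
  -v /data/ipfs:/data/ipfs \
  -p 4001:4001 \
  ipfs/go-ipfs:v0.4.18 daemon --migrate=true --routing=dhtclient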
Locally I added a large folder.
On the cloud I'm pinning that same folder.
On the cloud I'm grepping to see if I'm connected to my local machine:
docker exec ipfs_host ipfs swarm peers | grep CID
Locally, in the WebUI, I'm monitoring traffic to see when it's uploading.
This gives me quite notable gaps: https://i.imgur.com/1whgzx6.png
The server oftentimes quickly reconnects to the peer it is pinning from, but sometimes it takes a LONG while or just doesn't reconnect at all anymore (or so it seems). So long that I manually connect the peer to the swarm again on the server to resume uploading. As you can see in the image linked above, it had a lot of gaps and just ended up doing nothing.
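For reference, the manual reconnect is roughly this (a sketch; the local node's public IP and peer ID are placeholders):

# run on the cloud node; <LOCAL_PUBLIC_IP> and <LOCAL_PEER_ID> are placeholders
docker exec ipfs_host ipfs swarm connect /ip4/<LOCAL_PUBLIC_IP>/tcp/4001/ipfs/<LOCAL_PEER_ID>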
Both locally and on the cloud there were no internet connection issues that might have caused this. Also, it's very much repeatable. Just try the same setup yourself and you will probably see the same thing happening.
Also, most gaps happen to be spaced at roughly 90-second intervals. That might be a coincidence, as I ended up manually reconnecting over and over again till everything was pinned.
Best regards,
Mark