
IPFS loses swarm connection while pinning #5977

Closed
markg85 opened this issue Feb 8, 2019 · 25 comments

Comments

@markg85
Contributor

markg85 commented Feb 8, 2019

Hi,

I'm playing with IPFS and pinning, and I might have discovered an oddity involving pinning and swarm connections.

The setup is as follows.
1 IPFS server on a cloud hosting provider
1 IPFS locally
Both are the latest IPFS version (0.4.18).
Both run with --routing=dhtclient
The server is running with IPFS_PROFILE=server

Locally I added a large folder.
On the cloud I'm pinning that same folder.
On the cloud I'm grepping to see if I'm connected to my local machine:
docker exec ipfs_host ipfs swarm peers | grep CID
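
For completeness, the rough end-to-end sequence looks something like the sketch below (the folder name, CID and peer ID are placeholders, not the real values):

# locally: add the folder and note the resulting root CID
$ ipfs add -r ./large-folder

# on the cloud: pin that CID inside the container
$ docker exec ipfs_host ipfs pin add <CID>

# on the cloud: check whether the local peer is still in the swarm
$ docker exec ipfs_host ipfs swarm peers | grep <local-peer-id>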

Locally, in the WebUI, I'm monitoring traffic to see when it's uploading.
This shows quite noticeable gaps: https://i.imgur.com/1whgzx6.png

The server often reconnects quickly to the peer it is pinning from, but sometimes it takes a LONG while or seemingly doesn't reconnect at all anymore. So long that I have to manually connect the peer to the swarm again on the server to resume uploading. As you can see in the image linked above, it had a lot of gaps and eventually just ended up doing nothing.

Neither locally nor on the cloud were there any internet connection issues that could have caused this. Also, it's very much repeatable: just try the same setup yourself and you will probably see the same thing happen.

Also, most gaps happen to be spaced at roughly 90-second intervals. That might be a coincidence, as I ended up manually reconnecting over and over again until everything was pinned.

Best regards,
Mark

@raulk
Member

raulk commented Feb 8, 2019

This looks like an issue we fixed recently: libp2p/go-libp2p-kad-dht#237 (comment)

Would you be able to build IPFS from master and try reproducing?

@markg85
Contributor Author

markg85 commented Feb 8, 2019

> This looks like an issue we fixed recently: libp2p/go-libp2p-kad-dht#237 (comment)
>
> Would you be able to build IPFS from master and try reproducing?

If you provide me the commands for the Docker IPFS image, then yes, gladly :)

@raulk
Member

raulk commented Feb 8, 2019

@markg85 you can just fetch the master tag from Docker Hub:
https://hub.docker.com/r/ipfs/go-ipfs/tags
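
For example, something along these lines should work (container name, profile and ports copied from the setup described above; volume mounts and any extra daemon flags are omitted and should match your current container):

$ docker pull ipfs/go-ipfs:master
$ docker stop ipfs_host && docker rm ipfs_host
# re-create the container from the :master image
$ docker run -d --name ipfs_host -e IPFS_PROFILE=server -p 4001:4001 ipfs/go-ipfs:master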

@markg85
Contributor Author

markg85 commented Feb 9, 2019

> @markg85 you can just fetch the master tag from Docker Hub:
> https://hub.docker.com/r/ipfs/go-ipfs/tags

Ehh, okay.
The cloud version is now the Docker master one.
My local version (Arch Linux distribution package) is still just the latest release (0.4.18).

The master ipfs doesn't appear to be able to connect:

failure: dial attempt failed: <peer.ID QmSuFCF6> --> <peer.ID Qm5SHS8v> dial attempt failed: context deadline exceeded

@raulk
Member

raulk commented Feb 9, 2019

On which machine are you executing the connect command? Is this local trying to connect to the cloud, or vice versa? Beware that your peer ID could possibly have changed.
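
A quick way to double-check is to print each node's identity; `ipfs id` includes the current peer ID and the addresses it advertises:

# on the local machine
$ ipfs id
# on the cloud, inside the container
$ docker exec ipfs_host ipfs id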

@markg85
Contributor Author

markg85 commented Feb 9, 2019

I'm executing the command on the cloud (whose ID changed) against the local one (which remained as it was). I'm trying to build go-ipfs locally now, just to see if that would work, since then both would be built from master.

@raulk
Member

raulk commented Feb 9, 2019

Thanks. Just one note: I think your issue could be with the connection manager killing the session. You can try to increase the connection manager limits in the IPFS config.

https://github.com/ipfs/go-ipfs/blob/master/docs/config.md
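
The relevant knobs sit under Swarm.ConnMgr; a minimal sketch of raising them (the numbers are purely illustrative):

# inspect the current limits
$ ipfs config Swarm.ConnMgr.LowWater
$ ipfs config Swarm.ConnMgr.HighWater
# raise them (illustrative values), then restart the daemon
$ ipfs config --json Swarm.ConnMgr.LowWater 1000
$ ipfs config --json Swarm.ConnMgr.HighWater 2000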

@markg85
Contributor Author

markg85 commented Feb 9, 2019

No, I won't. It is currently at the defaults, and that already causes the cloud provider to think I got hacked, due to thousands of connections in mere minutes, as if I'm attacking someone. I'm guessing that improved greatly with your p2p fixes and the recent Bitswap fixes. At least, I hope it did :)

@raulk
Member

raulk commented Feb 9, 2019

Note that the connection manager and the swarm dialer limit are distinct. The connection rate (in-flight dials) is governed by the swarm, which is what your cloud provider may be complaining about. That has improved with the DHT fixes. The connection manager is in charge of keeping open connections within bounds.

@markg85
Contributor Author

markg85 commented Feb 9, 2019

I'm sorry, but I can't get this working at all anymore.
Both instances now run on git master. Executing the swarm connect to my local IPFS still gives:

failure: dial attempt failed: <peer.ID QmSuFCF6> --> <peer.ID Qm5SHS8v> dial attempt failed: context deadline exceeded

Is there anything I can add in debug logging to help trace this?
Note: I am online on IRC (markg85) in #ipfs
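
One option that may help is turning up go-ipfs's own logging; a rough sketch (subsystem names vary between versions, so list them first):

# list the available logging subsystems
$ ipfs log ls
# crank everything up to debug (very noisy)
$ ipfs log level all debug
# follow the output while reproducing the failing swarm connect
$ ipfs log tail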

@raulk
Member

raulk commented Feb 9, 2019

@markg85 was kind to pair with me on this. The issue is that despite having a static mapping in his router for IPFS on port 4001, current master was discovering a wrong public port (1024, weird). This led to his address in the DHT being incorrect, and dials failing due to his NAT dropping the incoming traffic. ipfs swarm addrs local shows the incorrect port number. He will post more details shortly. This issue did not happen with 0.4.18.

@markg85
Contributor Author

markg85 commented Feb 9, 2019

@raulk and I paired on IRC to debug this.
It turns out that IPFS is advertising a multiaddr with a bad public port.
For example:
ipfs swarm addrs local
gives (IPs anonymized):

/ip4/127.0.0.1/tcp/4001
/ip4/123.123.123.123/tcp/1025
/ip6/::/tcp/4001
/ip6/::1/tcp/4001

While I had port 4001 open and forwarded, it shows port 1025 in this case, which is wrong.
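
A possible workaround, assuming the port-forward to 4001 itself is correct, is to hard-code the announced address via the Addresses.Announce config key (the IP below is a placeholder); this bypasses the autodetection rather than fixing it:

# force the node to announce only the known-good forwarded address
$ ipfs config --json Addresses.Announce '["/ip4/123.123.123.123/tcp/4001"]'
# restart the daemon; the announced set should then show up in `ipfs id`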

@raulk
Member

raulk commented Feb 9, 2019

@markg85 can you post the equivalent output from 0.4.18, please? Thanks again.

@markg85
Contributor Author

markg85 commented Feb 9, 2019

And as I just tested, 0.4.18 has the same issue.

$ ipfs version
ipfs version 0.4.18

$ ipfs swarm addrs local
/ip4/10.0.3.50/tcp/4001
/ip4/127.0.0.1/tcp/4001
/ip4/123.123.123.123/tcp/1024
/ip6/::1/tcp/4001

@markg85
Contributor Author

markg85 commented Mar 3, 2019

Just a friendly reminder.
A new go-ipfs has been released. I had hoped this bug would magically be fixed, but apparently the new version didn't fix it.

Both my local and remote machine now run 0.4.19!
Both run IPFS in docker from the latest image.

On my remote machine there is no 1024 port. Good!
On my local machine I do still see a 1024 port being present!

The local machine has a clean IPFS setup, data and config.
The remote kept its data and config from the previous version.

Please take a look at this. It causes swarm connections to "sometimes" fail and "sometimes" work.

@markg85
Contributor Author

markg85 commented Apr 4, 2019

How can I get this issue in front of the right people? I have a feeling that the ones who need to know about it don't, which means new releases keep shipping with the very same bug.

@Stebalien
Member

We are working on this, but it's not the only thing we're working on fixing. @raulk is the right person.

@markg85
Contributor Author

markg85 commented Apr 5, 2019

I would suggest marking this a blocker for the next release.

@Stebalien
Member

That's not going to get the problem fixed any faster, just delay other fixes.

@markg85
Contributor Author

markg85 commented Apr 5, 2019

I understand, but do know that this bug prevents making a connection at all. That side effect alone should make it quite a high priority.

On the other hand, I have it but others don't seem to be bothered by it at all, so it might only occur with certain router vendors, or some other non-obvious special case. And when just using IPFS (i.e. not running commands, just using it to browse the "IPFS internet") nothing seems to be wrong.

@remmerw

remmerw commented Apr 6, 2019

@markg85 I have the same issue with wrong ports being advertised (ipfs id).
My observation is: when I run one IPFS node behind a router, after a period of time it reports the public IP of the router with the swarm port (4001) -> this can be correct when doing port forwarding.
But when you run a second node behind the same router (on a different machine), it advertises the public IP address with port 1025 (sometimes 1024) [not yet figured out].
When you run a third node behind the same router (again a different machine), it just increases the port number by one and advertises that.
I am not an expert in NAT, but it looks like an issue.
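
If those incrementing ports come from UPnP mappings handed out by the router (a guess, but the 1024/1025/1026 pattern points that way), one test would be to disable automatic port mapping on one node and rely only on the manual forward:

# turn off UPnP/NAT-PMP port mapping for this node
$ ipfs config --json Swarm.DisableNatPortMap true
# restart the daemon, then compare what it advertises
$ ipfs swarm addrs local
$ ipfs id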

@markg85
Contributor Author

markg85 commented Apr 6, 2019

@remmerw That might be something, or at least something that makes investigating this easier for the devs.

In my case, however, I've only ever had one node running behind the router. Never more.

@voidao

voidao commented Apr 27, 2019

@markg85 @remmerw @raulk Seems like I've run into a pretty similar issue (local desktop node fails to swarm connect to the remote cloud node)!

:~ $ ipfs swarm connect /ip4/1**.10*.6*.1*9/tcp/4009/ipfs/...
failure: context deadline exceeded

~ $ jsipfs swarm connect /ip4/1**.10*.6*.1*9/tcp/4009/ipfs/...
No available transports to dial peer

Some clues/findings:

  1. It used to work pretty well and almost always succeeded, but suddenly ran into trouble without any change.
  2. A newly initialized node (with a new repo location and peer ID, @Local) would succeed in connecting, but ran into the issue later on.
  3. Based on 1 & 2, I guess the cause may be some restriction on the remote/cloud side that is triggered by IPFS-related networking operations.

@Stebalien
Member

@voidao that's likely unrelated to this issue. "Cloud" nodes don't have NAT issues.

WRT this issue, the core problem is that IPFS doesn't know how you've configured your router. It has to guess as well as it can.

It does this by:

  1. Asking the router to forward a port using UPnP (and related protocols).
  2. Opening outbound connections using the same port on which it receives connections, and tracking addresses observed by peers. Many routers will consistently map the same external port to the same internal port, so the external port observed by our peers can often be reused for inbound connections.

Unfortunately, it doesn't look like either of those is working in this case.
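
For reference, the first mechanism corresponds to the Swarm.DisableNatPortMap config key (false by default, i.e. UPnP/NAT-PMP mapping enabled), and when both guesses fail the reachable address can be forced by hand, as mentioned earlier in the thread (the IP is a placeholder):

# mechanism 1 (UPnP / NAT-PMP mapping) is controlled by this flag; false = enabled
$ ipfs config Swarm.DisableNatPortMap
# fallback when both guesses fail: announce the known-good address explicitly
$ ipfs config --json Addresses.Announce '["/ip4/<your-public-ip>/tcp/4001"]'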


I'm going to close this in favor of libp2p/go-libp2p#559 as that's an actionable solution to this issue.

@voidao

voidao commented Jul 18, 2019

@Stebalien Thank you for the detailed explanation! It makes sense to me, and I guess it's caused by the router or something else in the NAT environment.
