Some cabals are not syncing between remote peers #17

Closed
nikolaiwarner opened this issue Sep 11, 2018 · 32 comments
Labels
bug Something isn't working

Comments

@nikolaiwarner
Member

Connecting to a cabal shows yourself but no nicks, no messages, and no channels from remote peers. A peer joining from another client on the same machine appears correctly, however.

[screenshot, 2018-09-11 at 8:48:02 am]

[screenshot, 2018-09-11 at 8:50:40 am]

cc @noffle @cblgh

@cblgh
Member

cblgh commented Sep 11, 2018

this is the cabal nick tried to join:

cabal://3115ddead69876368789e03101ab5136ccd445449024f08b3318986690467905

i created it, so there definitely shouldn't be as many peers as the sidebar is showing haha. i only see myself in it, with my messages

@hackergrrl
Member

I'm able to join & sync down that cabal fyi, but nobody on the public one

@hackergrrl
Member

me and @nikolaiwarner are experimenting:

  • we can't repro issues on the secret core cabal, nor on a fresh one
  • on the public cabal I see nick join, but as a key w/o a name set, and don't seem to get his msgs
  • on the secret core cabal, I see what looks like a new key joining and then leaving, every 3-4 seconds. maybe an old client trying to reseed the cabal or something?

@nikolaiwarner changed the title from "Cabal is not syncing between remote peers" to "Some cabals are not syncing between remote peers" on Sep 11, 2018
@hackergrrl
Member

hackergrrl commented Sep 12, 2018 via email

@hackergrrl
Member

When I look at my cabal directory for 00794539a8ce6bed76e40b9d259666303d39271da66140282bfbce76fd9a4434 I see 776 directories, which means 776 hypercores. The maximum number of hypercores that hypercore-protocol replicates over one stream is 128 (why?). It emits an error, but hypercore doesn't listen for it, so it gets swallowed silently. This answers one question, which was "why can't we replicate each other's messages?".

The other question is, "how did we end up with 776 hypercores on this cabal?". At first I thought it was an old client trying to connect, and multifeed was interpreting the hyperdb protocol data as noise and creating lots of junk feeds, but I can't reproduce that. Maybe somebody has been generating tons of new keys (intentionally or not) on our public cabals.

@hackergrrl
Member

Either way, it seems important that

  1. a cabal can handle more than 128 users across all history, and
  2. generating a bunch of garbage identities can't break a cabal

@fenwick67
Contributor

@noffle FWIW I saw many (116) directories being generated locally, all at once, in a case where the cabal key was private and only 2 clients:

screenshot showing 116 directories all created at the same time

@hackergrrl
Member

@fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.

@cblgh
Member

cblgh commented Sep 23, 2018

> @fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.

which has a wip fix in #22 :D!

@hackergrrl
Member

I can't reproduce this. I created a kappa cabal and had a hyperdb client try to join it. It created empty users in the user sidebar, but my local cabal dir wasn't getting filled with empty hypercores like we see on the broken cabals.

@cinnamon-bun
Member

(I'm using the latest cabal CLI for all of this:)

I cleared out .cabal and tried again with the public cabal cabal://00794539.... I got 43 feeds in .cabal but no chat messages show in the UI. I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?

Those 43 feeds come from out there on the internet somewhere. Doing this experiment with a couple of local peers and no internet connection, I don't get them (nor the mystery peer 00794539).

I created a fresh cabal of my own and added a couple of local peers all running inside my machine. It all works as expected with cabal CLI, and no extra feeds appear. I also joined with the latest cabal-desktop and though it had troubles of its own [1], no extra feeds appeared.

[1] When it joined, it showed preexisting messages as all coming from conspirator. After a couple of restarts and new chat messages, it showed everything correctly.

@hackergrrl
Member

@cinnamon-bun

> (I'm using the latest cabal CLI for all of this:)

I actually just published some multifeed-index fixes very recently (within the hour). I wonder if you tested before I pushed those or since. It might be that these fixes repaired the issues you were seeing before!

> I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?

The peer whose key matches the cabal is the original creator of that cabal. (maybe me?)

@hackergrrl
Member

hackergrrl commented Sep 30, 2018

Hey debug friends! I created a new test cabal for us to try and break with the latest cabal-cli client:

cabal://58dc528ab340938eb66a29f80583ca1b0dcb9034ee78875ac695fbc8359b3581

I pushed some fixes to multifeed-index that explain some weird race conditions around messages not appearing, so I'm keen to see if we can break another cabal and do forensics on it if so.

Feel free to spam the heck outta this & do whatever weird stuff you'd like (as long as you document what you did!)

@makew0rld

makew0rld commented Oct 7, 2018

I'm having the same issue: the only peer I see in the main cabal beyond myself is someone named 00794539, even with two different machines on the same LAN. Would installing new clients direct from git work better? I am using the latest release of the terminal and desktop client, from appimage and npm.

@hackergrrl
Member

hackergrrl commented Oct 7, 2018 via email

@makew0rld

makew0rld commented Oct 7, 2018

Could you elaborate on that? A bug in what part of cabal? Would using an older version fix this, since it seems to only be an issue with the new version?

Also, I'm not sure what you mean by a "new" cabal that "remains private". Like it didn't exist before the protocol update? What does private mean in this context?

@hackergrrl
Member

@makeworld-the-better-one I think it's a bug in a lower level database module, multifeed. Using an older version of cabal would let you sidestep this bug, but there were even worse bugs with hyperdb!

By "new" I mean anything created from now onwards (w/ the latest cabal-cli). By "private" I mean low traffic. The more users hitting the cabal, the higher the chance of the race condition being hit. It doesn't make sense to rely on any cabals right now for anything critical, though, until this bug is fixed.

@makew0rld

@noffle thanks for the info! By latest cabal-cli do you mean from git or from npm? Also, didn't you fix the race condition as you said above? Or it's still having issues I guess.

@hackergrrl
Member

hackergrrl commented Oct 8, 2018 via email

@makew0rld

I tried to create a cabal on one machine and chat between that machine and another one on the same LAN. It didn't work: the cabal was created, but neither machine could see the other. It worked with two clients on the same machine though. This is all with cabal cli from npm, from a day or two ago.

@hackergrrl
Member

hackergrrl commented Oct 9, 2018 via email

@makew0rld

@noffle good call, thanks. That's a cool app... that unfortunately doesn't work. So yes you must be right, it's probably mDNS that's failing. I'm guessing that's an issue with one computer being on wifi and another being wired, as said here. I'll try with two devices connected in the same way, or try and fix my router.

@hackergrrl
Member

OK, I think this is fixed! 👌 You can install [email protected] to get the latest goodies, or git fetch the latest and reinstall deps.

post mortem

what went wrong?

Some cabals "stopped working". The symptoms were that you'd see yourself connect to peers, but new messages wouldn't appear, and others wouldn't see your messages. New peers couldn't download old state and would see an empty-looking cabal. This seemed to happen more frequently with high traffic cabals, like the public cabal on the cabal.chat website.

On a lower level, we noticed that these "broken cabals" had many empty hypercores in them. (cabal is built on multifeed, which manages a set of hypercores; each user maps to one hypercore.) Often many more hypercores than users: one public cabal had over 700.

At first I thought maybe we were getting spammed by someone, but me, @cblgh and @nikolaiwarner were able to reproduce it in a cabal only we had access to. I thought this might be the result of someone using an old version of cabal (like the one with hyperdb) without realizing it (package-lock.json and yarn.lock files and just plain node_modules management can be confusing), but eventually I was able to reproduce it myself using a version of cabal that was definitely on master with the latest deps.

multifeed wraps the hypercore replication protocol, hypercore-protocol. This wrapper has each peer send a formatted header before starting hypercore-protocol replication. The format was <UINT32:NUM_KEYS><LIST(BUFFER(32))> (each BUFFER(32) is a hypercore public key), so that each side knew which hypercores would be synced over the replication stream. Each side looks for keys that they don't already have locally and creates them in preparation for sync.

What was happening, though, is that when garbage / unexpected data was sent, the peer would still interpret the first 4 bytes as "# of keys", even if it was something very big like 1290801. It would then read the next 1290801 * 32 bytes as keys of hypercores to create! Wuh oh.
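To make that failure mode concrete, here is a rough sketch of a parser for the old header layout. This is not multifeed's actual code: the function names are made up for illustration, and the little-endian byte order of the count is an assumption.

```javascript
// Hypothetical sketch of the old header layout: <UINT32:NUM_KEYS><LIST(BUFFER(32))>.
// Not multifeed's real code; little-endian byte order is an assumption.
const KEY_LENGTH = 32

function encodeHeader (keys) {
  const count = Buffer.alloc(4)
  count.writeUInt32LE(keys.length, 0)
  return Buffer.concat([count, ...keys])
}

// Trusts the first 4 bytes blindly, which is the weakness described above.
function naiveDecodeHeader (buf) {
  const numKeys = buf.readUInt32LE(0)
  return { numKeys, bytesExpected: numKeys * KEY_LENGTH }
}

const good = encodeHeader([Buffer.alloc(32, 1), Buffer.alloc(32, 2)])
console.log(naiveDecodeHeader(good)) // { numKeys: 2, bytesExpected: 64 }

// The count from the example above: 1290801 keys means 41305632 bytes of "keys".
const bogus = Buffer.alloc(4)
bogus.writeUInt32LE(1290801, 0)
console.log(naiveDecodeHeader(bogus).bytesExpected) // 41305632

// Any 4 bytes decode to *some* count, so stray data is never rejected.
console.log(naiveDecodeHeader(Buffer.from('hello from another protocol')))
```

Nothing in this format lets the reader tell a real header from noise, since every 4-byte prefix is a valid count.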

how did we find it?

This was very difficult to track down, since we knew nothing in the beginning but "sometimes cabals stop syncing".

@nikolaiwarner wrote a patch to cabal-cli that let you pass in --message and --timeout switches to the client that would let you post a message, sync, and then quit. Then he wrapped this in a bash loop and had two machines over the internet write to the same private cabal for a long time (1+ days) and thousands of messages. Eventually, he noticed that rogue empty hypercores would start to get created.

I added logging to multifeed replication using debug, and asked @cblgh and @nikolaiwarner to try to reproduce the bug again while also capturing log output.

Eventually I was able to reproduce it with logging turned on, and realized that somehow, once in every 1000s of sync connections, one peer would send garbage-looking data on the header handshake / key exchange that would result in 1s, 10s, or 100s of empty hypercores being created locally. Once they were created locally, the node would treat them like real hypercores and sync them to other peers.

The reason this would break cabals and not just result in vestigial hypercores is because hypercore-protocol has a hardcoded limit of 128 hypercores per replication stream. The more rogue hypercores, the lower the chance that you'll replicate with a hypercore related to a real user, and so you end up eventually not getting any new data from anyone, because your replication streams are saturated by the rogue empty hypercores.
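A back-of-the-envelope way to see the saturation effect, under a big assumption: that the 128 slots on a stream are filled by a uniformly random subset of the local feeds. How hypercore-protocol actually picks feeds may well differ, and the "50 real users" figure below is invented purely for illustration (the 776 total is the directory count reported earlier in this thread).

```javascript
// Hypothetical model, not hypercore-protocol code: assumes the slots on a stream
// are filled by a uniformly random subset of the local feeds.
const MAX_FEEDS_PER_STREAM = 128 // limit cited in this thread

function expectedRealFeeds (realFeeds, rogueFeeds) {
  const total = realFeeds + rogueFeeds
  if (total === 0) return 0
  const slots = Math.min(MAX_FEEDS_PER_STREAM, total)
  // Mean of a hypergeometric draw: slots * (real / total).
  return slots * realFeeds / total
}

// 50 real feeds and no rogue ones: everything fits on the stream.
console.log(expectedRealFeeds(50, 0)) // 50

// 50 real feeds among 776 in total (726 rogue): only a handful get a slot.
console.log(expectedRealFeeds(50, 726).toFixed(1)) // 8.2
```

Under this model the share of real feeds on a stream shrinks in proportion to the number of rogue ones, which matches the symptom of cabals slowly going quiet.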

how did we fix it?

I updated the header format in multifeed to send a length-prefixed JSON object as its header. This is much more strongly structured than the old format, and very difficult to produce through sending random bytes, or bytes from some other protocol by mistake. The new header also includes a replication protocol version, so that we can make backwards-incompatible changes (if we need to) in the future.
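For anyone curious what a length-prefixed JSON header can look like, here is a minimal sketch. It is not multifeed's real wire format: the field names (`version`, `keys`), the 4-byte big-endian length prefix, the size cap, and the hex key encoding are all assumptions made for the example.

```javascript
// Hypothetical sketch of a length-prefixed JSON header. Field names, the 4-byte
// big-endian length prefix, and the size cap are assumptions, not multifeed's format.
const PROTOCOL_VERSION = 1
const MAX_HEADER_BYTES = 1024 * 1024

function encodeHeader (keys) {
  const body = Buffer.from(JSON.stringify({ version: PROTOCOL_VERSION, keys }))
  const prefix = Buffer.alloc(4)
  prefix.writeUInt32BE(body.length, 0)
  return Buffer.concat([prefix, body])
}

// Throws on anything that is not a well-formed header, so the caller can
// close the replication channel instead of creating feeds.
function decodeHeader (buf) {
  if (buf.length < 4) throw new Error('header too short')
  const length = buf.readUInt32BE(0)
  if (length > MAX_HEADER_BYTES || buf.length < 4 + length) {
    throw new Error('bad header length')
  }
  const header = JSON.parse(buf.subarray(4, 4 + length).toString('utf8'))
  if (header === null || typeof header !== 'object') throw new Error('header is not an object')
  if (header.version !== PROTOCOL_VERSION) throw new Error('unsupported version')
  const validKey = (k) => typeof k === 'string' && /^[0-9a-f]{64}$/.test(k)
  if (!Array.isArray(header.keys) || !header.keys.every(validKey)) throw new Error('bad keys')
  return header
}

const key = 'ab'.repeat(32) // stand-in for a 32-byte public key in hex
console.log(decodeHeader(encodeHeader([key])).keys.length) // 1

try {
  decodeHeader(Buffer.from('not a header at all'))
} catch (err) {
  console.log('rejected:', err.message)
}
```

The point is that random bytes have to pass several independent checks (length, JSON syntax, version, key shape) before any feed would be created.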

We've tried reproducing the bug with lots of us spamming a private cabal, and so far it seems to be holding up.

What's difficult is not knowing the root cause: why does a peer sometimes send unexpected header data? For now, until the root cause of the race condition is clear, things should work OK, and in the rare case that a garbage header gets sent, the replication channel will shut down and reset itself (and, presumably, work).

@makew0rld

I am really happy to see this fixed, thanks! But I am still unable to talk between two machines on the same LAN, or read from public cabals. I am on cabal (cli) 5.0.0 on both of them, and cannot see any messages or peers. I know I am having mDNS issues, but I figure they should still be able to communicate over bittorrent.

@hackergrrl
Member

@makeworld-the-better-one Have you tried another network, to rule that out?

Also, the new public cabal is bd45fde0ad866d4069af490f0ca9b07110808307872d4b659a4ff7a4ef85315a

@hackergrrl
Member

@makeworld-the-better-one You can try running each peer with DEBUG=* cabal --key KEY --seed to capture a full dump of debug output for mDNS, bittorrent, etc, which might give some clues!

@cblgh
Member

cblgh commented Oct 18, 2018

🔥 🔥 🔥 🔥 🔥 🔥

@neauoire

Congratulations!

@makew0rld

@noffle So as you maybe saw in the new public cabal, I was able to see the messages in the cabal you posted, on both machines. I said something from my pi, and it showed up on my computer, but I couldn't get the reverse to work.

@hackergrrl
Member

@makeworld-the-better-one same network?

@makew0rld

@noffle Yep.

@cinnamon-bun
Member

I found some clues about header corruption and wrote them up in multifeed's issues.

It seems to be losing some bytes at the start of the stream, then beginning to deserialize in the middle of the header's JSON string.
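A small illustration of what that looks like from the receiving side. The framing here (4-byte big-endian length followed by JSON) and the helper names are assumptions for the sake of the example, not a copy of multifeed's code.

```javascript
// Hypothetical illustration; the framing (4-byte big-endian length + JSON) is an assumption.
function frame (obj) {
  const body = Buffer.from(JSON.stringify(obj))
  const prefix = Buffer.alloc(4)
  prefix.writeUInt32BE(body.length, 0)
  return Buffer.concat([prefix, body])
}

function tryParse (buf) {
  try {
    const length = buf.readUInt32BE(0)
    const text = buf.subarray(4, 4 + length).toString('utf8')
    return { ok: true, header: JSON.parse(text) }
  } catch (err) {
    return { ok: false, reason: err.message }
  }
}

const intact = frame({ version: 1, keys: [] })
const truncated = intact.subarray(6) // pretend the first 6 bytes never arrived

console.log(tryParse(intact).ok)    // true
console.log(tryParse(truncated).ok) // false
```

With the leading bytes gone, the reader takes part of the JSON text as the length prefix and then tries to parse from the middle of the string, so the header is rejected rather than misread as keys.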
