Nodes get disconnected from libp2p mesh (on devstack) #950

Closed
enricorotundo opened this issue Oct 26, 2022 · 1 comment
Labels
type/bug Type: Something is not working as expected

Comments

@enricorotundo
Contributor

enricorotundo commented Oct 26, 2022

Context

In the production environment, we've seen that sometimes nodes disconnect from the libp2p network and therefore miss events such as job creation. This leads to errors like Job not found reported in #934. The root cause of such disconnection has yet to be determined.

To collect more information about the above, #938 added an internal events logger to our libp2p node which logs to a JSON file.

#938 also activated WithFloodPublish, which should proactively publish messages to every node.

Moreover, the number of production bootstrap connections was changed from 1 to 3 in #944. This changed the topology of the production network from a star-shaped net to a fully connected mesh. The aim was to keep more routes available for events to propagate.
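To illustrate why more connections should help, here is a toy model (not bacalhau code) comparing a star with a fully connected mesh when one link drops. The node names and the four-node size are hypothetical, chosen only for the example.

```python
from collections import deque


def reachable(edges, start):
    """Return the set of nodes reachable from `start` over undirected edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical node names; n0 plays the role of the single bootstrap node.
nodes = ["n0", "n1", "n2", "n3"]
star = {("n0", n) for n in nodes[1:]}
mesh = {(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]}

# Simulate losing the n0 <-> n3 connection in both topologies.
dropped = ("n0", "n3")
star_after = reachable(star - {dropped}, "n0")
mesh_after = reachable(mesh - {dropped}, "n0")

print("star:", sorted(star_after))
print("mesh:", sorted(mesh_after))
```

In the star, n3 becomes unreachable as soon as its only link is gone. In the mesh, it is still reachable through n1 or n2.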

Release v0.3.6 shipped the changes above. We're now keeping an eye out to see if it happens again in prod.

Similar/same issue on my localhost

The Job not found error appeared with devstack running v0.3.6.

❯ bacalhau --api-port 20002 docker run ubuntu date
Job successfully submitted. Job ID: e47ceec2-564e-4ca8-ab9b-22831834c1a8
❯ bacalhau describe --api-port 20001 e47ceec2-564e-4ca8-ab9b-22831834c1a8
Job not found. ID: e47ceec2-564e-4ca8-ab9b-22831834c1a8

I don't have a repro for this but what happened is:

  1. launched bacalhau devstack in the afternoon
  2. laptop went to sleep overnight (which presumably interrupts network communication)
  3. resumed the following morning
  4. Job not found appeared some time after (3), not immediately

Investigation

Check network shape

Our network viz tool renders a blank page, indicating nodes are disconnected from each other.

Check libp2p tracer log file

The bacalhau-libp2p-tracer.json file (located at ~/.bacalhau) should help shed light on what happened. Here's a link to my file for reference.
Unfortunately, it seems to be malformed because it contains binary data (see screenshot below).

Screenshot 2022-10-26 at 13 59 36
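One way to inspect a trace file despite malformed sections is to read it line by line and count what fails to parse, rather than relying on a tool that may drop data silently. This is a minimal sketch that assumes the tracer writes one JSON object per line. The function name and the sample field names (type, timestamp) are assumptions for illustration.

```python
import json


def read_trace_events(raw: bytes):
    """Parse newline-delimited JSON, returning (events, skipped_line_count).

    Lines that are not valid UTF-8 or not a JSON object are counted, not raised.
    """
    events = []
    skipped = 0
    for line in raw.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            obj = json.loads(line.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            skipped += 1
            continue
        if isinstance(obj, dict):
            events.append(obj)
        else:
            skipped += 1
    return events, skipped


# Made-up sample: two well-formed events around one line of binary junk.
sample = (
    b'{"type": 0, "timestamp": 1}\n'
    b"\xff\xfe\x00junk\n"
    b'{"type": 1, "timestamp": 2}\n'
)
events, skipped = read_trace_events(sample)
print(len(events), "events parsed,", skipped, "lines skipped")
```

To run this on the real file, pass it the bytes of ~/.bacalhau/bacalhau-libp2p-tracer.json. A non-zero skipped count would show how much of the log is unreadable.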

On top of that, the file seems incomplete too: the first timestamp from cat bacalhau-libp2p-tracer.json | jq (whose output is available on this gist) is 1666772852236502000, which falls in this morning, even though I launched devstack yesterday afternoon. This may just be jq silently ignoring the first part of the log file (i.e. logs from before the laptop slept), but in any case there's something wrong with the libp2p tracer file.
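For reference, the timestamp can be decoded as follows. This sketch assumes the value is Unix time in nanoseconds, which its 19-digit length suggests. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ns_to_utc(ns: int) -> datetime:
    """Convert Unix nanoseconds to an aware UTC datetime (microsecond precision)."""
    seconds, remainder_ns = divmod(ns, 1_000_000_000)
    # Integer arithmetic avoids float rounding on 19-digit values.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=remainder_ns // 1000
    )


first_event = ns_to_utc(1666772852236502000)
print(first_event.isoformat())  # 2022-10-26T08:27:32.236502+00:00
```

The result is in UTC. The corresponding local time depends on the machine's timezone.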

Considerations

  1. The libp2p tracer is not working properly, which is unexpected given that it's just a simple tracer.
  2. There's a suspicious score parameter decay in our libp2p package that we don't fully understand but that may be related to the disconnections: its purpose is to kick malicious nodes out of the mesh (see libp2p docs), and this may be what's happening here.
  3. It's worth mentioning that devstack bootstraps via a single node, whereas production now uses 3 nodes, as mentioned above.
  4. (1) and (2) may be unrelated.
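To make consideration (2) more concrete, here is a toy sketch of how decaying score counters could turn a silent but honest peer into one that looks misbehaving. It is loosely modelled on the gossipsub v1.1 peer scoring idea (counters shrink by a decay factor each interval, and a delivery deficit below a threshold is penalised quadratically). It is a simplification and not our actual configuration: the function names, parameter names, and values are all assumptions.

```python
def decay_counter(value: float, decay: float, decay_to_zero: float = 0.01) -> float:
    """Apply one decay tick; values below `decay_to_zero` snap to zero."""
    value *= decay
    return 0.0 if value < decay_to_zero else value


def mesh_delivery_penalty(deliveries: float, threshold: float, weight: float) -> float:
    """Quadratic penalty for delivering fewer messages than `threshold`.

    `weight` is expected to be negative, so the result is <= 0.
    """
    if deliveries >= threshold:
        return 0.0
    deficit = threshold - deliveries
    return weight * deficit ** 2


# Illustrative values only, not the ones configured in bacalhau.
deliveries = 8.0
for _ in range(3):  # three decay intervals with no new deliveries
    deliveries = decay_counter(deliveries, decay=0.5)

penalty = mesh_delivery_penalty(deliveries, threshold=4.0, weight=-1.0)
print(deliveries, penalty)  # 1.0 -9.0
```

If something like this is in play, a node that delivers nothing for a while (e.g. while the laptop sleeps) could accumulate a negative score and be pruned from the mesh. Whether that is what happened here is still unverified.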

Thanks @binocarlos for pairing on this one.

@enricorotundo enricorotundo added the type/bug Type: Something is not working as expected label Oct 26, 2022
@enricorotundo enricorotundo changed the title from "Nodes get disconnected on devstack - libp2p issue?" to "Nodes get disconnected from libp2p mesh (on devstack)" Oct 26, 2022
@aronchick aronchick moved this to Triage in Bacalhau Main Nov 27, 2022
@aronchick
Collaborator

stale?

@github-project-automation github-project-automation bot moved this from Triage (Unprioritized) to To Celebrate in Bacalhau Main Jan 5, 2024

2 participants