Context
In the production environment, we've seen that nodes sometimes disconnect from the libp2p network and therefore miss events such as job creation. This leads to errors like Job not found, as reported in #934. The root cause of these disconnections has yet to be determined.
To collect more information about the above, #938 added an internal events logger to our libp2p node which logs to a JSON file.
#938 also activated WithFloodPublish, which should proactively publish messages to every node.
Moreover, the number of production bootstrap connections was changed from 1 to 3 in #944. This changed the topology of the production network from a star-shaped net to a fully connected mesh; the aim was to keep more routes open for events to propagate.
Release v0.3.6 shipped the changes above. We're now keeping an eye out to see if it happens again in prod.
Similar/same issue on my localhost
The Job not found error appeared with devstack running v0.3.6.
❯ bacalhau --api-port 20002 docker run ubuntu date
Job successfully submitted. Job ID: e47ceec2-564e-4ca8-ab9b-22831834c1a8
❯ bacalhau describe --api-port 20001 e47ceec2-564e-4ca8-ab9b-22831834c1a8
Job not found. ID: e47ceec2-564e-4ca8-ab9b-22831834c1a8
I don't have a repro for this but what happened is:
1. launched bacalhau devstack in the afternoon
2. laptop went to sleep overnight (which interrupts network comms?!)
3. resumed the following morning
4. Job not found popped up after (3), but not immediately after that
Investigation
Check network shape
Our network viz tool renders a blank page, indicating nodes are disconnected from each other.
Check libp2p tracer log file
The bacalhau-libp2p-tracer.json file (located at ~/.bacalhau) should help shed light on what happened. Here's a link to my file for reference.
Unfortunately, it seems to be malformed because it contains binary data (see screenshot below).
On top of that, the file also seems incomplete: the first timestamp in the output of cat bacalhau-libp2p-tracer.json | jq (available on this gist) is 1666772852236502000, which is a time from this morning (remember I launched devstack yesterday afternoon). This may be jq silently ignoring the first part of the log file (i.e. logs from before the laptop went to sleep); in any case, there's something wrong with the libp2p tracer file...
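To narrow down where the tracer file goes wrong, a small script can check each line separately and decode the timestamps. The sketch below is only illustrative and rests on assumptions that are not confirmed in this issue: that the tracer writes one JSON object per line, and that each event has a nanosecond `timestamp` field. The helper names `ns_to_utc` and `scan_trace` are made up for this example.

```python
import json
from datetime import datetime, timezone


def ns_to_utc(ns: int) -> datetime:
    """Convert a Unix timestamp in nanoseconds to an aware UTC datetime."""
    seconds, nanos = divmod(ns, 1_000_000_000)
    # Integer arithmetic avoids float rounding; datetime only keeps microseconds.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


def scan_trace(raw: bytes):
    """Try to parse every non-empty line of the raw trace bytes as a JSON object.

    Returns (events, bad_line_numbers), where bad_line_numbers lists the
    1-based lines that are not valid UTF-8 or not a JSON object.
    Assumption: the tracer emits newline-delimited JSON.
    """
    events, bad = [], []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            event = json.loads(line.decode("utf-8"))
        except ValueError:
            # Covers both UnicodeDecodeError (binary data) and JSONDecodeError.
            bad.append(lineno)
            continue
        if isinstance(event, dict):
            events.append(event)
        else:
            bad.append(lineno)
    return events, bad


# Tiny in-memory sample: one valid event followed by a line of binary junk.
sample = b'{"type": 0, "timestamp": 1666772852236502000}\n\xff\xfe binary junk\n'
events, bad = scan_trace(sample)
print("parsed events:", len(events), "unparsable lines:", bad)
print("first timestamp (UTC):", ns_to_utc(events[0]["timestamp"]).isoformat())
```

To run it against the real file, read it in binary mode (for example `Path.home().joinpath(".bacalhau", "bacalhau-libp2p-tracer.json").read_bytes()`) and pass the bytes to `scan_trace`. The list of bad line numbers shows whether the binary data sits at the start of the file, which would be consistent with the pre-sleep events being the damaged ones, or is scattered throughout.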
Considerations
Thanks @binocarlos for pairing on this one.