"Host unreachable" error message in service server when the client times out #243

peci1 · 2021-05-10T17:33:10Z

Environment

OS Version: Ubuntu 18.04
Source or binary build? binary 9.1.0

Description

I get spurious errors when calling services:

error sending response: Host unreachable

The whole simulation runs on localhost, so it's weird it says host unreachable. Maybe the network card gets micro-resets and when ZMQ hits exactly the micro-reset, it fails sending? (dmesg doesn't show any micro-resets)

I think the service replier should try harder, with e.g. 3 attempts, around the lines

https://github.com/ignitionrobotics/ign-transport/blob/133dcf0379e1d6231e30dc6cb0bf1692fe13c4f1/src/NodeShared.cc#L787-L850

The text was updated successfully, but these errors were encountered:

peci1 · 2021-05-12T10:51:35Z

From time to time, I see this in dmesg:

TCP: request_sock_TCP: Possible SYN flooding on port 43893. Sending cookies.  Check SNMP counters.

However, it's not too frequent and I'm unable to say whether it is from Gazebo or some other traffic...

peci1 · 2021-05-12T10:52:03Z

I tried the suggestions from https://access.redhat.com/solutions/30453 and they do not help preventing these SYN flood messages.

peci1 · 2021-05-12T12:43:54Z

I added some debug prints when it happens.

My test application is SubT artifact validator with some custom nodes running around:

https://github.com/osrf/subt/blob/5023446a033ab6710e966a3ce0582288ed143b11/subt_ign/src/ArtifactValidator.cc#L382

This happens (sometimes) when I call the move_to service:

[Msg] Moving to: backpack_2
        * Connected to [tcp://127.0.0.1:40005] for sending a response
0
err try 0 res 0
err try 1 res 0
err try 2 res 0
err try 3 res 0
err try 4 res 0
1
NodeShared::RecvSrvRequest() error sending response: Host unreachable host tcp://127.0.0.1:40005

You can see I even ran this instance with IGN_IP=127.0.0.1 to rule out any network adapter issues. It did not help. You can also see in the log that it executed 5 retries of the first send (my custom debugging attempt), each with a 1 second delay. It apparently does not help. This is really weird as other networking services are not interrupted (I'm doing it over SSH and no problem). Can it be some ZeroMQ resource exhaustion?

peci1 · 2021-05-12T13:36:22Z

Ahh, I think I got it. The host unreachable error on the service server side means that the client timed out before the service finished and closed the connection. It is kind of expected that the server will have no endpoint to send the response to, but the error message is quite confusing, as the host is pretty normally reachable, but the port is not. The error text however seems to be coming from ZeroMQ, which, in turn, translates it from the EHOSTUNREACH code returned from socket syscall.

I'm not really sure if something can be done about this, but a better error message would definitely be welcome.

peci1 · 2021-05-12T14:07:14Z

Steps to reproduce:

add std::this_thread::sleep_for(std::chrono::milliseconds(10000)); to https://github.com/ignitionrobotics/ign-transport/blob/3ae2925d2a5c6c2b57662d37a64b678ff6af722c/example/responser.cc#L29 .
Build examples
run ./responser
run ./requester

Requester has timeout 5 secs, responser will take 10 secs with the change from step 1.

peci1 · 2021-05-12T14:59:35Z

Oh no, the spurious errors are back now that I've increased all timeouts to 100 seconds (which is way more than should be needed). I've seen it happen that the service finishes in about 1 second (I have a log print right before return true in the callback) and then I get the "host unreachable" error even though the client is still there waiting for 99 more seconds... So back to the beginning :(

peci1 · 2021-05-12T20:09:02Z

I've tried to workaround the service call by a pair of topics, and I had a problem even with those. A lot of messages get lost. So I forced them to repeat (both request messages and response messages) and now it works with a few re-sends on some service calls.

What's interesting is that some subscribers get into an invalid state after some time. I'm spawning ign topic -e -t ... -n 1 via subprocess.call from Python. Sometimes, this gets into an invalid state, so that the call does not exit even though messages are published to the topic and another ign topic -e echoes the messages correctly. Either I'm doing Gazebo something really unpleasant, or there is something really wrong in the transport library. This might be the same problem as SubT sees with breadcrumb dropping commands being lost sometimes.

peci1 · 2021-05-12T20:16:06Z

Here's an example of topic statistics when I substituted the service with repeated topics:

stats.txt

I see lost messages being reported there.

The publishers in this case are ign topic --pub calls from Python via subprocess.check_output().

caguero · 2021-05-12T20:19:05Z

Out of curiosity, are you publishing messages very fast? It's expected to loose messages if there's a fast publisher or a slow subscriber because their buffers will overflow. That's a parameter that we can set in Ignition via environment variables. See https://ignitionrobotics.org/api/transport/10.0/envvars.html (IGN_TRANSPORT_RCVHWM and IGN_TRANSPORT_SNDHWM). Maybe for this test both parameters should be set to 0 (infinite) to rule out buffer overflow situations.

peci1 · 2021-05-12T20:24:33Z

It depends on what you mean by fast. I have 20 hi-res cameras running, each publishing RGB and Depth at about 0.3 Hz (so it's rather large than fast). Simulation is running almost realtime because I'm using a triggered sensors system instead of the continuously producing one. It is the trigger commands for the customized sensor system that get lost often, but I send these only at 0.3 Hz (sometimes also set_pose service calls get lost). So the only fast topics should be transforms (probably at 250 Hz).

Thanks for pointing out the watermarks, I haven't known about them. I'll give them a try.

peci1 · 2021-05-12T20:41:42Z

I still see dropped messages even with infinite buffers (in the topic-based workaround).

peci1 · 2021-05-12T21:01:32Z

I even tried lowering the RTF of the simulation, but even that did not help (though the part calling the triggered sensors system is RTF-independent and goes as fast as the sensors render). Dropped messages are still there.

How is it with ign-transport's threading model? Is it like in ROS, that each subscriber can only process one task at a time? And the per-subscriber queue size from ROS is substituted by the per-process queue configured by watermarks? My trigger-controlling callbacks do not block (as well as set_pose if setting a value to a ECM component is not blocking), so it should be no problem, but maybe detailed knowledge of the threading model could help here...

caguero · 2021-05-12T22:12:55Z

There's a dedicated thread in NodeShared.cc running the RunReceptionTask() function in an infinite loop. This thread is pretty much waiting on a poll() call that wakes up each time there's activity in some of the relevant zmq sockets. Depending on the socket, we'll decide if this is a msg update, a service request or an incoming service response. As an example, if there's activity on the msg socket, we parse all the fields from the socket, unserialize the message, and execute all the subscriber callbacks. This is all happening in this dedicated thread, not in a user thread.

peci1 · 2021-05-12T23:21:51Z

Thanks for the explanation, you could turn it into a tutorial =) Or even better - there should really be a tutorial explaining the differences in topic/service concepts between Ignition and ROS, because this isn't the first time I was confused that something that seems like a ROS concept basically is very different.

If I got it correctly, NodeShared is a per-process singleton. So there's no way of getting multiple independent callback queues like in ROS? Or is it just the deserialization that happens in a single thread and the callbacks are processed on some queues?

If there's really only a single queue for everything, then I see that my original concept with a service call triggering a rendering loop which takes about 2 seconds wasn't really the best idea... Is there a proper way around it? Does it mean blocking services are basically forbidden? I really don't like solutions of type "let's reimplement TCP functionality on a UDP layer"... (which is what I tried with the topic-based service call workaround).

caguero · 2021-05-13T11:29:37Z

We're mixing a lot of issues in this ticket :) but is there any way you could share with me an IgnitionTransport-only example where the messages are dropped?

peci1 · 2021-05-14T15:21:10Z

Yes, we are, but I thought these information could help me understand the problem better.

Do you have an idea on how to convert the whole Gazebo launch file to a transport-only example? Would Transport Log help here, or does it not record service calls?

caguero · 2021-05-20T14:12:12Z

Logs don't record service calls and pretty much anything that let us narrow the issue without the extra complexity of SubT or Gazebo in our way will help.

peci1 added the bug Something isn't working label May 10, 2021

peci1 changed the title ~~Spurious "Host unreachable" error in service calls~~ "Host unreachable" error message in service server when the client times out May 12, 2021

chapulina added the help wanted Extra attention is needed label Jul 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Host unreachable" error message in service server when the client times out #243

"Host unreachable" error message in service server when the client times out #243

peci1 commented May 10, 2021 •

edited

Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

peci1 commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

peci1 commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 13, 2021

peci1 commented May 14, 2021

caguero commented May 20, 2021

"Host unreachable" error message in service server when the client times out #243

"Host unreachable" error message in service server when the client times out #243

Comments

peci1 commented May 10, 2021 • edited Loading

Environment

Description

peci1 commented May 12, 2021

peci1 commented May 12, 2021

peci1 commented May 12, 2021 • edited Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

peci1 commented May 12, 2021 • edited Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 12, 2021 • edited Loading

peci1 commented May 12, 2021 • edited Loading

peci1 commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 12, 2021

peci1 commented May 12, 2021

caguero commented May 13, 2021

peci1 commented May 14, 2021

caguero commented May 20, 2021

peci1 commented May 10, 2021 •

edited

Loading

peci1 commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021 •

edited

Loading

caguero commented May 12, 2021 •

edited

Loading

peci1 commented May 12, 2021 •

edited

Loading