"Host unreachable" error message in service server when the client times out #243
Comments
From time to time, I see this in dmesg:
However, it's not too frequent and I'm unable to say whether it comes from Gazebo or some other traffic...
I tried the suggestions from https://access.redhat.com/solutions/30453, but they do not prevent these SYN flood messages.
I added some debug prints when it happens. My test application is SubT artifact validator with some custom nodes running around: This happens (sometimes) when I call the
You can see I even ran this instance with |
Ahh, I think I got it. The I'm not really sure if something can be done about this, but a better error message would definitely be welcome. |
Steps to reproduce:
The requester has a 5-second timeout; with the change from step 1, the responder takes 10 seconds.
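To make the timing in these steps concrete, here is a minimal sketch in plain C++ (standard library only, no ign-transport calls). `SlowResponder` and `RequestWithTimeout` are hypothetical names invented for this illustration, and the durations are scaled down to milliseconds. It only models the race described above: the requester stops waiting while the responder is still working and will try to reply afterwards.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a service handler that needs `work` to finish.
int SlowResponder(std::chrono::milliseconds work)
{
  std::this_thread::sleep_for(work);
  return 42;  // Arbitrary reply payload.
}

// Returns true if the reply was ready within `timeout`, false if the
// requester gave up first. Note: the future returned by std::async joins
// the task in its destructor, so this function only returns once the
// responder has finished, even when the requester had already timed out.
bool RequestWithTimeout(std::chrono::milliseconds timeout,
                        std::chrono::milliseconds work)
{
  std::future<int> reply =
      std::async(std::launch::async, SlowResponder, work);
  return reply.wait_for(timeout) == std::future_status::ready;
}
```

With a timeout shorter than the work duration (the 5 s versus 10 s case, scaled down), `RequestWithTimeout` reports a timeout even though the responder completes normally a moment later.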
Oh no, the spurious errors are back now that I've increased all timeouts to 100 seconds (which is way more than should be needed). I've seen it happen that the service finishes in about 1 second (I have a log print right before |
I've tried to work around the service call with a pair of topics, and I had a problem even with those. A lot of messages get lost. So I forced them to repeat (both request messages and response messages) and now it works with a few re-sends on some service calls. What's interesting is that some subscribers get into an invalid state after some time. I'm spawning
Here's an example of topic statistics when I substituted the service with repeated topics: I see lost messages being reported there. The publishers in this case are |
Out of curiosity, are you publishing messages very fast? It's expected to lose messages if there's a fast publisher or a slow subscriber because their buffers will overflow. That's a parameter that we can set in Ignition via environment variables. See https://ignitionrobotics.org/api/transport/10.0/envvars.html (
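As a rough illustration of why a full buffer leads to lost messages, here is a toy bounded buffer in plain C++. It is only loosely modeled on the high-water-mark idea; `BoundedBuffer` and `SimulateBurst` are hypothetical names for this sketch and are not part of Ignition Transport or ZeroMQ.

```cpp
#include <cstddef>
#include <deque>

// Toy buffer with a high-water mark: once `hwm` messages are queued,
// further messages are dropped instead of being stored.
class BoundedBuffer
{
 public:
  explicit BoundedBuffer(std::size_t hwm) : hwm_(hwm) {}

  // Returns false when the message was dropped because the buffer is full.
  bool Push(int msg)
  {
    if (queue_.size() >= hwm_)
    {
      ++dropped_;
      return false;
    }
    queue_.push_back(msg);
    return true;
  }

  // Removes the oldest message; returns false if the buffer is empty.
  bool Pop(int &msg)
  {
    if (queue_.empty())
      return false;
    msg = queue_.front();
    queue_.pop_front();
    return true;
  }

  std::size_t Dropped() const { return dropped_; }
  std::size_t Size() const { return queue_.size(); }

 private:
  std::size_t hwm_;
  std::size_t dropped_ = 0;
  std::deque<int> queue_;
};

// Pushes `burst` messages while nothing is consuming and returns how many
// of them were dropped.
std::size_t SimulateBurst(std::size_t hwm, std::size_t burst)
{
  BoundedBuffer buffer(hwm);
  for (std::size_t i = 0; i < burst; ++i)
    buffer.Push(static_cast<int>(i));
  return buffer.Dropped();
}
```

In this model a burst larger than the mark loses exactly the excess, while a slow but steady publisher never hits the limit as long as the consumer keeps up.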
It depends on what you mean by fast. I have 20 hi-res cameras running, each publishing RGB and Depth at about 0.3 Hz (so the data is large rather than fast). Simulation is running almost realtime because I'm using a triggered sensors system instead of the continuously producing one. It is the trigger commands for the customized sensor system that often get lost, but I send these only at 0.3 Hz (sometimes also Thanks for pointing out the watermarks, I didn't know about them. I'll give them a try.
I still see dropped messages even with infinite buffers (in the topic-based workaround).
I even tried lowering the RTF of the simulation, but even that did not help (though the part calling the triggered sensors system is RTF-independent and goes as fast as the sensors render). Dropped messages are still there. How is it with ign-transport's threading model? Is it like in ROS, that each subscriber can only process one task at a time? And the per-subscriber queue size from ROS is substituted by the per-process queue configured by watermarks? My trigger-controlling callbacks do not block (as well as |
There's a dedicated thread in |
Thanks for the explanation, you could turn it into a tutorial =) Or even better - there should really be a tutorial explaining the differences in topic/service concepts between Ignition and ROS, because this isn't the first time I was confused that something that seems like a ROS concept is basically very different. If I got it correctly, NodeShared is a per-process singleton. So there's no way of getting multiple independent callback queues like in ROS? Or is it just the deserialization that happens in a single thread and the callbacks are processed on some queues? If there's really only a single queue for everything, then I see that my original concept with a service call triggering a rendering loop which takes about 2 seconds wasn't really the best idea... Is there a proper way around it? Does it mean blocking services are basically forbidden? I really don't like solutions of the type "let's reimplement TCP functionality on a UDP layer"... (which is what I tried with the topic-based service call workaround).
We're mixing a lot of issues in this ticket :) but is there any way you could share with me an IgnitionTransport-only example where the messages are dropped?
Yes, we are, but I thought this information could help me understand the problem better. Do you have an idea on how to convert the whole Gazebo launch file to a transport-only example? Would Transport Log help here, or does it not record service calls?
Logs don't record service calls. Pretty much anything that lets us narrow down the issue without the extra complexity of SubT or Gazebo in our way will help.
Environment
Description
I get spurious errors when calling services:
The whole simulation runs on localhost, so it's weird it says host unreachable. Maybe the network card gets micro-resets and when ZMQ hits exactly the micro-reset, it fails sending? (dmesg doesn't show any micro-resets)
I think the service replier should try harder, with e.g. 3 attempts, around the lines
https://github.com/ignitionrobotics/ign-transport/blob/133dcf0379e1d6231e30dc6cb0bf1692fe13c4f1/src/NodeShared.cc#L787-L850
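A minimal sketch of the "try harder" idea, in plain C++ with the standard library only. `SendWithRetry` and the `send` callback are hypothetical names for this illustration; how this would map onto the actual send in `NodeShared.cc` is an assumption and is not shown here.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls `send` up to `maxAttempts` times, sleeping `backoff` between
// failed attempts. Returns the 1-based number of the attempt that
// succeeded, or 0 if every attempt failed.
int SendWithRetry(const std::function<bool()> &send,
                  int maxAttempts,
                  std::chrono::milliseconds backoff)
{
  for (int attempt = 1; attempt <= maxAttempts; ++attempt)
  {
    if (send())
      return attempt;

    // No point in sleeping after the final failed attempt.
    if (attempt < maxAttempts)
      std::this_thread::sleep_for(backoff);
  }
  return 0;
}
```

A retry like this would only help with transient send failures. If the requester has already timed out and stopped listening, repeating the send would presumably fail every time, so a clearer error message matters more in that case.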