Reliability #4
Comments
Benedikt: Separate networked (zmq) and local (threaded) modes have been proposed.
Christoph:
Benjamin: Regarding the inproc mode, I am not too sure whether that is necessary if we already have all the ports set up properly - I don't think we need the performance gain it might bring that badly, and making a full parallel comms network with inproc just to substitute the queues, so that everything runs over zmq, might not be worth it. Of course, future contributors would only need to become familiar with one of the two if we only have one of the two, but I guess substituting the queues with a hardened version of zmq inproc comms would then be at the back of the roadmap. |
I think one part of the reliability problems encountered by pymeasure developers with zmq might be about timing: PUB-SUB connections need a bit of time to connect to each other properly.

One case in which messages are dropped is when a ROUTER socket is supposed to send a message to an address which it does not know (it does not exist). This case can be handled on the side of the ROUTER by checking that a message which should be sent is actually sent. For the Coordinators, this would mean that when a request for Component C1 arrives from Component C2, but C1 just died, the Coordinator can catch that (check the return value of the send call).

Another thing in zmq regarding dropped messages is the High-Water-Mark (HWM), which individual sockets (ROUTERs afaik) use to avoid congestion (and essentially a possible memory leak). If a Component is very slow in accepting messages from a Coordinator, but this Coordinator continues to receive commands for this Component, it stores a certain number of those messages, up to the HWM, after which it starts dropping new messages it should actually send, as otherwise it would need to store ever-increasing numbers of messages waiting to be sent to the Component. I guess we should check whether messages have been dropped because the HWM was hit (especially in the command protocol Coordinators, which have the ROUTER sockets as of the current discussion), and think about a contingency, or at least make it understandable under what circumstances messages might get lost, and definitely warn about it once it starts to occur.

As far as I understand it, the reliability benefit of using inproc versus tcp (if the comms stay within a single process, which is necessary for inproc to work) is to cut the round trip across the OS network infrastructure; the above-described cases of "silently" dropped messages would occur in both modes, I think. The great benefit of inproc is speed, cutting away quite a bit of overhead. |
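A minimal sketch of what such send-checking could look like on a Coordinator's ROUTER socket, assuming pyzmq; the port, the HWM value, and the helper name `send_checked` are placeholders, not part of the protocol:

```python
import zmq

# Hedged sketch: a Coordinator-side ROUTER that refuses to drop messages silently.
context = zmq.Context.instance()
router = context.socket(zmq.ROUTER)
# Signal an error instead of dropping silently when the addressed peer is unknown
# (e.g. the Component just died) or when that peer's queue has reached the HWM.
router.setsockopt(zmq.ROUTER_MANDATORY, 1)
router.setsockopt(zmq.SNDHWM, 1000)  # which exact number to use is an implementation detail
router.bind("tcp://*:12345")         # the port is just a placeholder

def send_checked(identity: bytes, frames: list) -> bool:
    """Try to send and report failure instead of losing the message silently."""
    try:
        router.send_multipart([identity, *frames], flags=zmq.NOBLOCK)
        return True
    except zmq.ZMQError as exc:
        if exc.errno == zmq.EHOSTUNREACH:
            return False  # the addressed Component is not connected (anymore)
        if exc.errno == zmq.EAGAIN:
            return False  # the peer's queue hit the HWM; retry or report congestion
        raise
```

With ZMQ_ROUTER_MANDATORY set and a non-blocking send, an unknown peer shows up as EHOSTUNREACH and a full peer queue as EAGAIN, so neither case is lost silently.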
It is great to have you on board @bklebel. Your additional insight into zmq helps a lot. I like the idea of checking for dropped messages, I did not think about that part. Some sockets block when reaching the HWM, others drop messages; I have to look it up.
|
I just looked the HWM up again, very close. The zmq guide says that when a socket reaches its HWM it will either block or drop data depending on the socket type (PUB and ROUTER sockets drop, the other socket types block), and that the default HWM is 1000 messages. But this is then an implementation detail, which number to put in exactly. |
Thanks for your praise! I really enjoy working on this, even though I have a hard time keeping up with all the different conversations here.
Okay, true, so we should keep all channels in mind, not just the control one.
So, generally, the receiving sockets in our current idea of an architecture block, while the sending sockets drop messages. Blocking sockets are not a problem in terms of reliability, dropping sockets are. But if we decide that simply dropping the messages in this case is not sensible, because we are concerned with reliability, we could, whenever we see that a message was not sent because the HWM was reached, just schedule it to be sent again. |
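If we go that route, a hedged sketch of the re-scheduling idea could look like this, reusing the hypothetical `send_checked()` helper from the sketch above:

```python
from collections import deque

# Sketch of the "schedule it to be sent again" idea; the retry policy is an assumption.
retry_queue = deque()

def send_or_requeue(identity: bytes, frames: list) -> None:
    """Send via the Coordinator's ROUTER; keep the message locally if it could not be sent."""
    if not send_checked(identity, frames):
        retry_queue.append((identity, frames))

def flush_retries() -> None:
    """Call this regularly from the Coordinator's event loop to retry queued messages."""
    for _ in range(len(retry_queue)):
        identity, frames = retry_queue.popleft()
        if not send_checked(identity, frames):
            retry_queue.append((identity, frames))  # still congested or unreachable, keep it
```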
When does a PUB socket drop messages? If the recipient cannot keep up. That will happen if the recipient stalled or is really slow; both are problems of the recipient. In the command protocol, a recipient (probably an Actor) might have a high backlog due to slow device communication. In that case, the Actor might send a "Device busy" error to the sender of the request. |
You surely are aware of it already, but the zmq guide has a whole chapter on reliable REQ-REP patterns: https://zguide.zeromq.org/docs/chapter4/ - probably something can be picked up from there? Also, we should probably decide what kind of reliability we want for message receipt: at-most-once, exactly-once, or at-least-once? AFAIK, these all have different trade-offs. If we have a message-id field, we can probably handle at-most-once easily by discarding already-seen messages on the receiving side. Also, should we pause/postpone the design-for-reliability until more of the protocol design has been done (because then we know more about the trade-offs etc.), or do you think it will be important to some central design questions? |
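For the handling via a message-id field (assuming such a field is added, which is still under discussion), a small sketch of receive-side de-duplication:

```python
from collections import OrderedDict

# Hedged sketch of de-duplication by message id; the class and its capacity are assumptions.
class SeenMessages:
    """Remember recently seen message ids and report duplicates."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._capacity = capacity

    def is_duplicate(self, message_id: bytes) -> bool:
        if message_id in self._seen:
            return True
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # forget the oldest id
        return False

# usage on the receiving side (hypothetical):
# if seen.is_duplicate(message_id): ...  # ignore the duplicate, it was already handled
```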
Thanks for linking that chapter. Reading the part regarding heartbeats, I got the impression that it would be good if the Coordinator acknowledged every message received (it serves as a Coordinator heartbeat, and the Component knows that its message is on its way). I think zmq ensures that a message is received exactly once or dropped. EDIT: Mermaid diagram of the message flow:

```mermaid
sequenceDiagram
    Component1 ->> Coordinator: "To:Component2,From:Component1. Give me property A"
    Coordinator ->> Component1: "ACK: I got your message"
    Coordinator ->> Component2: "To:Component2,From:Component1. Give me property A"
    Component2 ->> Coordinator: "To:Component1,From:Component2. Property A has value 5"
    Coordinator ->> Component2: "ACK: I got your message"
    Coordinator ->> Component1: "To:Component1,From:Component2. Property A has value 5"
```
The basic ideas of reliability should enter this discourse, as they might influence the protocol definition. |
Reading the zmq guide (parts of it) again: We do not need a checksum, as zmq ensures that the whole message (even a multipart message) arrives in one piece. |
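A tiny illustration of that guarantee with pyzmq (PUSH/PULL over inproc chosen only for brevity): recv_multipart either returns all frames of a message or nothing, so no application-level checksum is needed for integrity within a message.

```python
import zmq

# zmq delivers multipart messages atomically: the receiver gets every frame or none of them.
ctx = zmq.Context.instance()
push, pull = ctx.socket(zmq.PUSH), ctx.socket(zmq.PULL)
pull.bind("inproc://demo")
push.connect("inproc://demo")

push.send_multipart([b"To:Component2", b"From:Component1", b"Give me property A"])
frames = pull.recv_multipart()  # always the complete list of frames, never a partial message
print(frames)
```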
Should we adopt a heartbeat pattern (as proposed in the zmq guide) and respond to every message, either with content or with an empty message (i.e. a heartbeat)? |
I guess it will be hard to know when to stop ACKing. Every message having a reply sounds nicely symmetric, though! (And it makes it much easier to reason about nested/recursed message flow.) W.r.t. your diagram above, I did not expect the ACK from Coordinator to Component2 -- the message with the value was already the reply. I would expect the Coordinator to wait for a reply, and if none is coming, to ask again. Same with the first ACK from Coordinator to C1 -- the reply should be the value message. If there's just an "ACK", what does the Component do with that? It is not the reply that was requested; does that mean something happened? Note: We might (optionally/later) want to have a separate "WAIT/ACK" exchange to deal with expected long delays, but otherwise I would keep the request-response pattern direct. |
I reasoned that the Coordinator does not know whether any message will go back to Component2, so it just acknowledges that it received a message and hands it on.
That ACK is a heartbeat, stating: I'm still alive, your message is on its way.
In the Message format issue I formulated the idea that each message with content is acknowledged with an empty message. That prevents the infinite ACKing. |
Yeah, but do we need/want that? What happens in zmq if the endpoint/recipient of your message is not alive? Do you notice, or do you not get an error back? |
The message waits happily in the outbound buffer until the endpoint comes back online, then the message is sent. You do not get an error. |
That is the reason I went for the ping-pong heartbeat: https://zguide.zeromq.org/docs/chapter4/#Heartbeating-for-Paranoid-Pirate - we could (to reduce data transfer) make these heartbeats without any content frames (even without names!). |
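A rough Component-side sketch of such a ping-pong heartbeat, assuming a DEALER socket connected to the Coordinator's ROUTER; the intervals are loosely inspired by the Paranoid Pirate example, and the single empty frame stands in for whatever heartbeat format gets agreed on:

```python
import time
import zmq

HEARTBEAT_INTERVAL = 2.5  # seconds; placeholder value
LIVENESS = 3              # missed intervals before the Coordinator counts as dead

def handle_message(frames):
    """Placeholder for the Component's real message handling."""

ctx = zmq.Context.instance()
dealer = ctx.socket(zmq.DEALER)
dealer.connect("tcp://localhost:12345")  # placeholder address of the Coordinator's ROUTER

liveness = LIVENESS
next_beat = time.monotonic() + HEARTBEAT_INTERVAL
while True:
    if dealer.poll(timeout=int(HEARTBEAT_INTERVAL * 1000)):
        frames = dealer.recv_multipart()
        liveness = LIVENESS            # any traffic from the Coordinator is a sign of life
        if frames != [b""]:            # a single empty frame is just a heartbeat/ACK
            handle_message(frames)
    else:
        liveness -= 1
        if liveness == 0:
            print("Coordinator seems dead, reconnect or alert")
            liveness = LIVENESS
    if time.monotonic() >= next_beat:
        dealer.send_multipart([b""])   # empty heartbeat, no content frames
        next_beat = time.monotonic() + HEARTBEAT_INTERVAL
```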
That of course is very valuable context! I'll have to think... |
This depends on the socket, I would think - a ROUTER which wants to send something to a dead connection might drop the message (silently, unless the return value of the sending function is checked), or do I misunderstand something now? What if the recipient is dead but the connection is "more or less still there"? I think the DEALER would store it in the outbound buffer, but the ROUTER would drop it. |
Messages just get dropped if the buffer overflows. |
Mmmmm no, I don't think so; as per the zmq guide, undeliverable messages will get dropped by a ROUTER. We should, however, definitely catch this. Also, for reliability we might want to take a closer look at this flowchart - which, funnily, does not make it into the "Reliable messaging patterns" chapter of the guide, I think because it is more about flaws in the implementation than about catching problems which occur "in the wild". It is still a very valuable resource. |
The guide says that we could (and should, I think) catch non-routable messages instead of dropping them: set the ROUTER socket option ZMQ_ROUTER_MANDATORY to True. "Since ZeroMQ v3.2 there’s a socket option you can set to catch this error: ZMQ_ROUTER_MANDATORY. Set that on the ROUTER socket and then when you provide an unroutable identity on a send call, the socket will signal an EHOSTUNREACH error." A Python sketch follows below. |
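A hedged sketch of how a Coordinator could use that, assuming ZMQ_ROUTER_MANDATORY is set (as in the earlier sketch) and a simple placeholder frame layout of [receiver, sender, payload]; the error text is invented, not part of the protocol:

```python
import zmq

def route(router: zmq.Socket, frames: list) -> None:
    """Forward a message and inform the sender if the receiver is unreachable."""
    receiver, sender, payload = frames  # placeholder frame layout, not the agreed format
    try:
        router.send_multipart([receiver, sender, payload], flags=zmq.NOBLOCK)
    except zmq.ZMQError as exc:
        if exc.errno == zmq.EHOSTUNREACH:
            # The addressed Component is unknown or just died: tell the sender
            # instead of silently dropping the request.
            router.send_multipart([sender, receiver, b"ERROR: receiver not connected"])
        else:
            raise
```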
We should also think about whether - besides logging - dropped messages over a certain time could be used to trigger follow-up actions. |
I agree that it's important to enable that handling, but I'm not sure how much behaviour we should specify in the protocol. I think we should maintain a Status/error register or somesuch for Components ( |
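Purely to make the register idea concrete, a hypothetical sketch; the names and flag values are invented here and not part of any agreed protocol:

```python
from enum import IntFlag

class ComponentStatus(IntFlag):
    """Hypothetical status/error register a Component could report or a Director could track."""
    OK = 0
    BUSY = 1              # e.g. device communication backlog ("Device busy")
    UPSTREAM_LOST = 2     # nothing heard from the controlling Director/Coordinator for a while
    MESSAGES_DROPPED = 4  # the HWM was hit and messages were discarded
    DEVICE_ERROR = 8

status = ComponentStatus.BUSY | ComponentStatus.UPSTREAM_LOST
if ComponentStatus.UPSTREAM_LOST in status:
    # implementation-defined reaction, e.g. continue the running sweep (see the discussion below)
    pass
```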
Yes, this is really more of an implementation topic; only the status definition(s) should be part of the protocol description. |
Thank you @LongnoseRob, this is an important aspect! I too think that the status definition is good and useful and should be part of the protocol, but what Actors and Directors do with the respective statuses should be up to the implementation. For example, for me it would rather mean that if an Actor does not get anything anymore from upstream, it continues to do whatever it has been told, as I would like most of an experiment to continue running and generating data if parts of the network fail. For instance, if the responsible control-Coordinator fails shortly after I started a temperature sweep, most of the time I would like both the sweep and whatever is being measured to continue working. This way, the sweep in one direction can be fully recorded, although maybe the other sweep direction is then blocked because the Director cannot tell the Actor to start the second sweep - and by the time the first sweep finishes, I might have checked on the system and restarted whatever link broke in the middle. |
Yes, what you describe @bklebel is also a good approach. |
@mcdo0486 raises the importance of protocol reliability:
However, I think we need to make sure that a data protocol doesn’t replace the current threaded pymeasure architecture. This discussion is on a proposed data protocol, not implementation, but I think the protocol design can easily creep into fundamental architecture changes.
The most important thing with an experiment design and operation framework is reliable recording of data as fast as possible. I can see synchronization and data loss being potential problems if message passing were moved entirely over to zmq.
Pymeasure used to use zmq for message passing but moved to thread queues instead. If you look through the commits, there are comments like this ominous one related to replacing zmq with a thread queue: “[listens through a thread queue] to ensure no messages are lost”
pymeasure/pymeasure@390abfd
Listeners and workers were moved to a threaded approach. While the Worker will set up a zmq publisher and emit data over it, it doesn’t emit data to anything by default. You could rip out all the zmq logic in the Worker class and your procedure would run just fine.
That isn’t to say we can’t do things better now, we can, but the most important thing is fast and reliable data recording as mentioned.
So in sum, I don’t think there should be a “new measurement paradigm” but an “additional measurement paradigm” that stays compatible with the current thread-based workflow.