Sometimes messages sent over the Project Link nodes do not arrive #68
Comments
This customer reported this issue on Friday (26th April 2024) - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/1956
I'm setting up a test based on my devices demo here - https://app.flowfuse.com/instance/cbdcbf3a-da70-468c-941e-c333ea1a0e43/overview The demo consists of 65 devices (~6000 miles away from FF Cloud) which reply to a 'ping' from a NR instance running on FF Cloud. The pings are sent every 5 seconds. I will update the demo to alert me if any of the pings do not make it back to the instance on FF Cloud.
I am seeing evidence of occasional missing messages. This API returns each ping and a count of the devices that responded; if the count is less than 64, something failed - https://hmi-development.flowfuse.cloud/export I think it would be worth someone validating how I'm producing this data; it's totally possible there is a bug in my flows.
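One way to cross-check the counts is to keep the set of device IDs that answered each ping, rather than only a count, so a shortfall can be tied to specific devices. The following is a minimal sketch of that idea and not the demo's actual flow; the function name, the ID formats, and the shape of the input data are all hypothetical.

```python
from typing import Dict, List, Set


def find_missing_replies(
    responses: Dict[int, Set[str]], devices: Set[str]
) -> Dict[int, List[str]]:
    """Map each incomplete ping ID to the sorted device IDs that did not reply.

    `responses` maps a ping ID to the set of device IDs that answered it.
    `devices` is the full set of devices expected to answer every ping.
    Both structures are assumptions made for illustration.
    """
    missing: Dict[int, List[str]] = {}
    for ping_id, replied in sorted(responses.items()):
        absent = devices - replied
        if absent:
            missing[ping_id] = sorted(absent)
    return missing


if __name__ == "__main__":
    all_devices = {"dev-1", "dev-2", "dev-3"}
    seen = {
        1: {"dev-1", "dev-2", "dev-3"},  # complete
        2: {"dev-1", "dev-3"},           # dev-2 did not reply
    }
    print(find_missing_replies(seen, all_devices))  # {2: ['dev-2']}
```

Knowing which devices missed which ping also makes it easier to line the drops up against connection events in each device's log.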
We need to correlate any message drops with the underlying connectivity of the nodes. My theory is that the nodes' ws-mqtt connection is bouncing, and that the node doesn't do any store/forward whilst disconnected. I'm not sure that's something easy for you to check with the nodes as-is. I'll have a think about how we can debug this.
I was mistaken about my theory - the mqtt library we use does do store/forward by default. I did a quick local test where I dropped the device's mqtt connection whilst continuing to send messages from it. Once it reconnected, the messages were forwarded on without any being dropped. In this test, the project nodes were sending to a hosted instance rather than another device. Next I'm going to look at the receive side of the equation - do the messages get discarded if the subscriber goes offline?
Can confirm the messages are discarded if the subscriber is not connected. Need to pick through our connection settings here. Any changes we make will potentially mean the broker has to start storing messages indefinitely for offline devices - that can become unmanageable. Will need to look at both the project node connection settings (clean session/qos etc) as well as broker configuration around persistent state and queue depths etc.
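To make the trade-off concrete, here is a toy model of one subscriber's session on a broker. It follows the general MQTT 3.1.1 rule that a broker holds QoS 1/2 messages for an offline client only when that client connected with clean session disabled. It is not broker or project node code: the class name, the single `qos` argument (standing in for the effective QoS of publish and subscription), and the drop-oldest policy when the queue is full are assumptions for illustration. Real brokers have their own limits and overflow behaviour.

```python
from collections import deque
from typing import Deque, List


class SubscriberSession:
    """Toy model of how a broker treats one subscriber; not real broker code."""

    def __init__(self, clean_session: bool, max_queued: int = 100) -> None:
        self.clean_session = clean_session
        self.connected = True
        self.delivered: List[str] = []
        # Bounded queue: once full, appending drops the oldest entry.
        # This overflow policy is an assumption made for the sketch.
        self.queue: Deque[str] = deque(maxlen=max_queued)

    def disconnect(self) -> None:
        self.connected = False

    def publish(self, payload: str, qos: int = 1) -> None:
        """Deliver immediately, queue for later, or discard."""
        if self.connected:
            self.delivered.append(payload)
        elif not self.clean_session and qos >= 1:
            self.queue.append(payload)
        # Otherwise the message is discarded: the subscriber is offline and
        # either it used a clean session or the message is QoS 0.

    def reconnect(self) -> None:
        """Flush anything queued while the subscriber was offline."""
        self.connected = True
        self.delivered.extend(self.queue)
        self.queue.clear()


if __name__ == "__main__":
    for clean in (True, False):
        session = SubscriberSession(clean_session=clean, max_queued=2)
        session.disconnect()
        for msg in ("m1", "m2", "m3"):
            session.publish(msg)
        session.reconnect()
        print(clean, session.delivered)
```

With `clean_session=True` nothing sent during the outage is delivered. With `clean_session=False` and a queue depth of 2, only the last two messages survive, which is the "queue depth" concern in miniature.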
This has also been reported by https://app-eu1.hubspot.com/contacts/26586079/record/0-1/3995301
Hi,
We are using a Project Call node (one at this time; we plan to have several) to send messages to a Node-RED instance running on a Windows server under the FlowFuse Agent. A flow on that instance runs PowerShell scripts, whose output is sent back to the calling flow over the Project Out return.
We have been experiencing a problem for some time now where return messages suddenly stop being delivered. This has happened across multiple versions of Node-RED, of the Project nodes, and of Node.js (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or some number of messages through the Project Call node, but I'm hesitant to say so because I've experienced a stall after just 7 messages, while I've also seen it handle 60+ without a problem.
We can see that the PowerShell scripts appear to run start to finish (according to logging), but once a stall has happened, the return message sent after the script completes is not received by the Project Call node. The output of the node that starts the PowerShell script is connected directly (both stdout and stderr) to the Project Out return node.
Last week I implemented a test flow on the instance running on the Windows server. On a (manual) inject from the NR instance that all project calls are made from, it sets a timestamp, sends it over a Project Call node to the Windows server instance, sets a timestamp there, and returns to the calling project node, where the time difference is calculated.
Last Friday, I updated the Node.js version on the server we're running the FlowFuse Agent on (and where the Project In and Out nodes live). It is now running Node.js 20.12.2 and NR 3.1.9, and we have not seen any stalls yet. But it is too early to say anything definitive because usage has been low due to the holiday.
Yesterday evening I saw there was an update to the Project Nodes (version 0.6.4), which I installed on all instances (strangely enough, I had to do this twice on each instance). I hope all this helps with troubleshooting. I'm not sure if there is anything more I can do, but if I can be of any assistance with further information, please let me know. Thanks!
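The timestamp round-trip test described above could be extended to flag a stall automatically by remembering when each call was sent and reporting any call that has not been answered within a timeout. The sketch below shows only that bookkeeping, outside of Node-RED. The class and method names and the timeout value are hypothetical and not part of the Project nodes.

```python
import time
from typing import Dict, List, Optional


class CallTracker:
    """Track outstanding calls and report ones with no reply after a timeout."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.pending: Dict[str, float] = {}  # call ID -> time the call was sent

    def sent(self, call_id: str, now: Optional[float] = None) -> None:
        """Record that a call was sent (`now` is injectable for testing)."""
        self.pending[call_id] = time.monotonic() if now is None else now

    def replied(self, call_id: str, now: Optional[float] = None) -> Optional[float]:
        """Return the round-trip time in seconds, or None for an unknown call."""
        started = self.pending.pop(call_id, None)
        if started is None:
            return None
        current = time.monotonic() if now is None else now
        return current - started

    def stalled(self, now: Optional[float] = None) -> List[str]:
        """Return IDs of calls that have waited longer than the timeout."""
        current = time.monotonic() if now is None else now
        return sorted(
            call_id
            for call_id, started in self.pending.items()
            if current - started > self.timeout_s
        )
```

Logging the first call ID that appears in `stalled()` alongside the Node-RED log timestamps on both instances would give a concrete point in time to look for a disconnect.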
@SynoUser-NL Thanks for the information. The 0.6.4 release included the fix for a specific issue where messages would not be queued up for the nodes if they had temporarily dropped their connection. From your description, that doesn't quite feel like the same symptom - unless the nodes are disconnecting under the covers; it would be good to check the Node-RED logs for any suggestion of a disconnect. Let us know how you get on with 0.6.4 - if the problem persists we'll get a new issue raised to focus on your scenario.
@knolleary Welcome of course. I agree, this doesn't quite feel like the same issue. Thanks!
Hi, I'm sorry to say it appears we're still experiencing problems: Project node replies sometimes stop coming through. The only remedy when that happens is to restart the layer from which the project calls originate. How would we proceed from here to find a permanent solution? Thanks, Den
Current Behavior
It has been reported that sometimes messages sent via the FF Project Link nodes do not arrive at the other end.
Expected Behavior
Provided there are no networking issues and the flows have been configured correctly, all messages should arrive at their intended destination.
Steps To Reproduce
I can't reproduce this bug at this time; I'm raising this issue so we have a place to store customer reports and any theories about what might be causing it.
Environment