Sometimes messages sent over the Project Link nodes do not arrive #68

robmarcer · 2024-05-01T10:51:04Z

Current Behavior

It has been reported that sometimes messages sent via the FF Project Link nodes do not arrive at the other end.

Expected Behavior

Assuming there are no networking issues and the flows have been configured correctly, all messages should arrive at the intended destination.

Steps To Reproduce

I can't reproduce this bug at this time, but I'm raising an issue so we have a place to store customer reports and any theories about what might be causing it.

Environment

FlowFuse version:
Node.js version:
npm version:
Platform/OS:
Browser:

robmarcer · 2024-05-01T10:52:25Z

This customer reported this issue on Friday (26th April 2024) - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/1956

robmarcer · 2024-05-01T10:54:52Z

I'm setting up a test based on my devices demo here - https://app.flowfuse.com/instance/cbdcbf3a-da70-468c-941e-c333ea1a0e43/overview

The demo consists of 65 devices (~6000 miles away from FF Cloud) which reply to a 'ping' from a NR instance running on FF Cloud. The pings are sent every 5 seconds. I will update the demo to alert me if any of the pings do not make it back to the instance on FF Cloud.

robmarcer · 2024-05-01T14:16:56Z

I am seeing evidence of occasional missing messages. This API returns each ping and a count of the devices which responded. If the count is less than 64, something failed - https://hmi-development.flowfuse.cloud/export

I think it would be worth someone validating how I'm producing this data, totally possible there is a bug in my flows.
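
One way to cross-check the exported data is to recompute the per-ping response counts independently of the flow. The sketch below is a minimal Python illustration, not the flow used in the demo: the record shape (`ping_id`, `device_id`) and the function name `find_incomplete_pings` are assumptions, and the expected device count is passed in rather than read from the export.

```python
from collections import defaultdict
from typing import Iterable


def find_incomplete_pings(
    responses: Iterable[tuple[str, str]], expected_devices: int
) -> dict[str, int]:
    """Return {ping_id: responder_count} for pings with too few replies.

    `responses` is an iterable of (ping_id, device_id) pairs; this shape is
    an assumption for illustration, not the format of the real export.
    """
    responders: dict[str, set[str]] = defaultdict(set)
    for ping_id, device_id in responses:
        # A set ignores duplicate replies from the same device.
        responders[ping_id].add(device_id)
    return {
        ping_id: len(devices)
        for ping_id, devices in responders.items()
        if len(devices) < expected_devices
    }


if __name__ == "__main__":
    sample = [
        ("ping-1", "dev-a"), ("ping-1", "dev-b"), ("ping-1", "dev-c"),
        ("ping-2", "dev-a"), ("ping-2", "dev-c"),  # dev-b never replied
        ("ping-2", "dev-a"),  # duplicate reply, counted once
    ]
    print(find_incomplete_pings(sample, expected_devices=3))
```

A ping that received no replies at all never appears in `responses`, so a real check would also need the list of pings that were sent.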

knolleary · 2024-05-01T14:57:37Z

We need to correlate any message drops with the underlying connectivity of the nodes. My theory is that the nodes' ws-mqtt connection is bouncing, and the node doesn't do any store/forward whilst disconnected. I'm not sure that's something easy for you to do with the nodes as-is. I'll have a think on how we can debug this.

knolleary · 2024-05-02T09:09:13Z

I was mistaken about my theory - the mqtt library we use does do store/forward by default. Did a quick local test where I dropped the device's mqtt connection whilst continuing to send messages from it. Once it reconnected, the messages were forwarded on without any dropped.

In this test, the project nodes were sending to a hosted instance rather than another device. Next I'm going to look at the receive side of the equation - do the messages get discarded if the subscriber goes offline?
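
The publisher-side behaviour seen in that test can be pictured with a small simulation. This is a sketch of the general store-and-forward idea only, not the code of the MQTT library the nodes use; the class `StoreForwardPublisher` and its method names are hypothetical.

```python
from collections import deque
from typing import Callable


class StoreForwardPublisher:
    """Toy publisher that queues messages while offline (hypothetical)."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send            # delivers one message to the broker
        self._connected = False
        self._pending: deque[str] = deque()

    def publish(self, message: str) -> None:
        if self._connected:
            self._send(message)
        else:
            # Store instead of dropping while the connection is down.
            self._pending.append(message)

    def on_connect(self) -> None:
        self._connected = True
        # Forward everything stored, oldest first.
        while self._pending:
            self._send(self._pending.popleft())

    def on_disconnect(self) -> None:
        self._connected = False


if __name__ == "__main__":
    delivered: list[str] = []
    pub = StoreForwardPublisher(delivered.append)
    pub.on_connect()
    pub.publish("m1")
    pub.on_disconnect()
    pub.publish("m2")   # queued while offline
    pub.publish("m3")
    pub.on_connect()    # queued messages are flushed in order
    print(delivered)
```

This only covers the sending side. It says nothing about what the broker does when the receiving client is the one that is offline.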

knolleary · 2024-05-02T09:16:12Z

Can confirm the messages are discarded if the subscriber is not connected. Need to pick through our connection settings here. Any changes we make will potentially mean the broker has to start storing messages indefinitely for offline devices - that can become unmanageable. Will need to look at both the project node connection settings (clean session/qos etc) as well as broker configuration around persistent state and queue depths etc.
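
The trade-off described here (hold messages for an offline subscriber, but not indefinitely and not without limit) can be sketched as a toy model. This is an illustration under stated assumptions, not how the broker or the project nodes are actually configured: `SubscriberSession`, its parameters, and the values used in the demo (a 120-second expiry and a queue depth of 2) are hypothetical.

```python
from collections import deque
from typing import Optional


class SubscriberSession:
    """Toy broker-side session for one subscriber (hypothetical model)."""

    def __init__(self, expiry_seconds: float, max_queue: int) -> None:
        self.expiry_seconds = expiry_seconds
        # maxlen caps the queue depth: the oldest message is dropped when full.
        self.queue: deque[str] = deque(maxlen=max_queue)
        self.disconnected_at: Optional[float] = None  # None means online
        self.delivered: list[str] = []

    def disconnect(self, now: float) -> None:
        self.disconnected_at = now

    def publish(self, message: str, now: float) -> None:
        if self.disconnected_at is None:
            self.delivered.append(message)
        elif now - self.disconnected_at <= self.expiry_seconds:
            self.queue.append(message)   # held for the offline subscriber
        else:
            self.queue.clear()           # session expired: state discarded

    def reconnect(self, now: float) -> None:
        expired = (
            self.disconnected_at is not None
            and now - self.disconnected_at > self.expiry_seconds
        )
        if expired:
            self.queue.clear()
        self.delivered.extend(self.queue)  # flush anything still held
        self.queue.clear()
        self.disconnected_at = None


if __name__ == "__main__":
    session = SubscriberSession(expiry_seconds=120, max_queue=2)
    session.publish("a", now=0)
    session.disconnect(now=10)
    for offset, msg in enumerate(["b", "c", "d"]):
        session.publish(msg, now=20 + offset)  # "b" is pushed out by the cap
    session.reconnect(now=60)
    print(session.delivered)
```

A short expiry bounds how long the broker has to keep state for a device that never returns, at the cost of losing messages across longer outages. The depth cap bounds memory per subscriber in the same way.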

robmarcer · 2024-05-07T12:58:12Z

This has also been reported by https://app-eu1.hubspot.com/contacts/26586079/record/0-1/3995301

SynoUser-NL · 2024-05-08T07:47:26Z

Hi,

We are using a Project Call node (1 at this time, we plan to have several) to send messages to a NodeRED instance that is running on a Windows server under the FlowFuse Agent. A flow on the Windows server instance is used to run Powershell scripts, whose output is sent back to the calling flow over the Project Out return.

We have been experiencing a problem of return messages suddenly not being delivered for some time now, across multiple versions of NodeRED, multiple versions of the Project nodes, and multiple versions of NodeJS (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or some number of messages (?) through the Project Call node, but I'm hesitant to say this because I've experienced a stall after just 7 messages, while I've also seen it do 60+ without a problem.
A (manual) restart of the calling instance (where the Project Call node is) solves the problem. Obviously, this isn't desirable in a production environment.

We can see the Powershell scripts appear to run from start to finish (according to logging), but the return message sent after the script is done is not received by the Project Call node when or after a stall happens. The output of the node that starts the Powershell script is connected directly (both stdout and stderr) to the Project Out return node.
The Project Call node times out when no return message is received. Other than that, there is no indication that anything is wrong.

Last week I implemented a test flow on the instance running on the Windows server. On a (manual) inject in the NR instance from which all project calls are made, it sets a timestamp, sends it over a Project Call node to the Windows server instance, sets a timestamp there, and returns to the calling Project node, where the time difference is calculated.
When the Project Call node that is responsible for running the Powershell scripts appears to stall, this test node keeps returning messages. So it appears not all project connections are affected when one stalls. And this also means there is no connection problem (MQTT or otherwise) at the time of a stall.
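
For readers who want to picture the timing test, here is a minimal sketch of the same idea in Python. It is not the actual flow: `round_trip` and `stamp_remote` are hypothetical names, and the `call` argument stands in for the Project Call / Project Out pair.

```python
import time
from typing import Callable


def stamp_remote(msg: dict) -> dict:
    """Stand-in for the remote flow: add its own timestamp and reply."""
    return {**msg, "remote_ts": time.time()}


def round_trip(call: Callable[[dict], dict]) -> dict:
    """Send a timestamped message through `call` and time the reply."""
    sent_ts = time.time()
    reply = call({"sent_ts": sent_ts})
    received_ts = time.time()
    return {
        # These two splits compare clocks on different machines.
        "outbound_s": reply["remote_ts"] - sent_ts,
        "return_s": received_ts - reply["remote_ts"],
        # The total uses only the caller's clock.
        "total_s": received_ts - sent_ts,
    }


if __name__ == "__main__":
    print(round_trip(stamp_remote))
```

The outbound and return figures are only meaningful if the two machines' clocks are synchronised, whereas the total depends on the caller's clock alone.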

Last Friday, I updated the NodeJS version on the server we're running the FlowFuse Agent on (and where the Project In and Out nodes live). It is now running NodeJS 20.12.2, NR 3.1.9, and we have not seen any stalls yet.
I'm also restarting the calling node instance every morning at 6.00 hrs to (hopefully) prevent any issues.

But too early to tell anything definitive because usage hasn't been that much due to holiday.
And to be clear: messages sent to the Windows instance over the Project Call node are always received, it's only the returns we occasionally have problems with.

Yesterday evening I saw there was an update to Project Nodes (version 0.6.4) which I installed on all instances (which I did have to do twice on all instances, strangely enough..).

Hope all this helps with troubleshooting. I'm not sure if there is anything more I can do, but if I can be of any assistance with further information please let me know.

Thanks!

knolleary · 2024-05-08T08:11:41Z

@SynoUser-NL Thanks for the information. The 0.6.4 release included the fix for a specific issue where messages would not be queued up for the nodes if they had temporarily dropped their connection.

From your description, that doesn't quite feel the same symptom - unless the nodes are disconnecting under the covers; would be good to check the Node-RED logs for any suggestion of a disconnect.

Let us know how you get on with 0.6.4 - if the problem persists we'll get a new issue raised to focus on your scenario.

SynoUser-NL · 2024-05-08T08:28:59Z

"check the Node-RED logs for any suggestion of a disconnect."

@knolleary You're welcome, of course.
The logs show no signs of disconnect, on either end.

I agree, this doesn't quite feel like the same issue.
Next week will be a lot busier again, so hope to be able to give some more information on this then.

Thanks!

SynoUser-NL · 2024-05-24T11:42:27Z

Hi,

I'm sorry to say it appears we're still experiencing problems with Project node replies sometimes no longer coming through. The only remedy when that happens is to restart the layer from which the project calls originate.
Meanwhile, messages sent via a second project call (to the same instance as where the other one stalls) keep working perfectly.

How would we proceed from here to find a permanent solution?

Thanks, Den

robmarcer added the needs-triage and customer request labels May 1, 2024

knolleary added this to ☁️ Product Planning May 1, 2024

knolleary added this to 🛠 Development May 2, 2024

knolleary self-assigned this May 2, 2024

knolleary moved this to In Progress in 🛠 Development May 2, 2024

knolleary moved this from In Progress to Up Next in 🛠 Development May 2, 2024

knolleary mentioned this issue May 2, 2024

Add a 2 minute session expiry on the mqtt connection #69

Merged

Steve-Mcl closed this as completed in #69 May 2, 2024

github-project-automation bot moved this to Closed / Done in ☁️ Product Planning May 2, 2024

Steve-Mcl moved this from Up Next to Review in 🛠 Development May 3, 2024

joepavitt added the size:XS - 1 label and removed the needs-triage label May 7, 2024

joepavitt moved this from Review to Done in 🛠 Development May 7, 2024

knolleary mentioned this issue May 28, 2024

Project Call nodes stalling #74

Open
