Glass cyclically disconnects in NT4 mode #5263
Comments
Fixed in #5659. |
Yes, this came back due to a late change. Reopening until that is fixed. |
We are also seeing this. The behavior is a little different from last year in that the connect/disconnect oscillation occurs without ever displaying any data (except the connection status), whereas last year it would show a lot of data and then disconnect. |
I have a wireshark log that I can upload in an hour or so if that would be helpful.
|
As promised - wireshark packet capture log. I haven't tried to analyze it since it appears the cause of the problem may already be known. If that's not true and you'd like me to look at it in more detail, let me know. |
A thought about the websocket keep-alives implemented by WS PING and PONG messages. If I'm understanding the comments on PR #5659 correctly, if a PONG response is not received within a timeout period after a PING is sent, the connection will be closed. Couldn't anything received on the websocket after the PING was sent be considered adequate evidence that the connection is still alive? It seems to me that this would solve the problem of PINGs and PONGs getting queued behind a great deal of other data, leading to the timeouts that we are observing (at least, the wireshark capture seems to support that this is what is going on). |
Good idea; we would need to make it actually look for received bytes not frames (since individual frames may be quite long). The current disconnect issues have something more fundamentally wrong happening (data corruption or incorrect frames being sent). I have a good way to reproduce it now, and am working on isolating the root cause. |
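As an illustration of the byte-level liveness check being discussed, here is a minimal sketch assuming a small tracker object called from the socket read path; the class and method names are hypothetical and not the actual wpinet/ntcore implementation:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sketch: instead of requiring the matching PONG frame to arrive
// before the timeout, treat any bytes received after a PING was sent as
// evidence that the peer is still alive.
class KeepAliveTracker {
 public:
  using Clock = std::chrono::steady_clock;

  void OnPingSent() {
    m_pingSent = Clock::now();
    m_awaitingEvidence = true;
  }

  // Call for every chunk of bytes read from the socket, even partial frames.
  void OnBytesReceived(std::size_t count) {
    if (count > 0) {
      m_awaitingEvidence = false;  // any incoming traffic counts as liveness
    }
  }

  // True only if nothing at all has arrived since the PING within the timeout.
  bool TimedOut(std::chrono::milliseconds timeout) const {
    return m_awaitingEvidence && (Clock::now() - m_pingSent) > timeout;
  }

 private:
  Clock::time_point m_pingSent{};
  bool m_awaitingEvidence = false;
};
```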
What I'm seeing in the Wireshark log is a single Websocket frame of 127892 bytes made up of 88 TCP segments. This is on a bandwidth-limited FRC radio (4 Mbit/s?). Early on, there is a dropped TCP packet that eventually gets resent by TCP fast retransmission; receipt of that packet completes the Websocket frame. It pretty consistently takes 285 ms to receive all those segments. This all pretty much agrees with what I'd expect to see when sending that much data on a 4 Mbit/s link.
It's not clear that just fragmenting at the websocket level would help. The kernel TCP stack would happily accept all of those fragments into its own buffers, and all that data would still be "ahead" of any ping or pong messages. But maybe the intention is to pace the sending of those segments?
|
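As a rough sanity check on the figures above (assuming the radio link really is limited to about 4 Mbit/s), the transmit time for the 127892-byte frame alone works out to

$$
\frac{127892 \times 8\ \text{bit}}{4 \times 10^{6}\ \text{bit/s}} \approx 0.256\ \text{s} \approx 256\ \text{ms},
$$

which leaves roughly 30 ms of the observed 285 ms to be accounted for by per-segment TCP/IP and WebSocket overhead plus the retransmitted packet.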
I just did another experiment, this time connected by USB. I still got the cycling behavior, but it was different in two ways: first, the glass viewer would actually populate with NT data before the disconnection happened; and second, the wireshark logs do not look like a timeout is occurring. Now I wonder if there are two different causes of this behavior: the corruption/incorrect frames that you are looking at, Peter, and timeouts on low-bandwidth connections. |
Is this fixed in 2024.2.1? |
I have not seen it with 2024.2.1 either with our actual robot code or with Peter's stress test code, both of which would always trigger it. |
We occasionally see this happen on all NT4 clients; shuffleboard, a custom dashboard, PhotonVision, etc. |
Are you using clients built against WPILib 2024.2.1? Full benefit of the changes is only achieved if both the robot code and the clients use the new implementations. |
Transferring this from Discussion #5262
We observed several times over the season that glass would enter a mode where it would connect, disconnect within a second, reconnect, etc. When it occurred it would happen on multiple glass clients (2) connected to the robot at the same time. Switching to NT3 mode ended the cycle. This continued even with the latest WPILib release.
At the time there was no opportunity to gather more data about what was happening and I do not know how to reproduce the problem.
I'm putting it here as a discussion rather than an issue to find out if others have noticed the behavior. I'll dig deeper if it happens again, but we are going to be interacting with the robot a lot less so it may not come up again for us this year.
@PeterJohnson replied:
This has been reported a few times. In general it is more likely to happen when there is a large number of topics (~1000) or a large amount of data per topic, combined with a constrained network environment (poor wireless). It was more common in earlier releases, as we've made a number of changes throughout the season to try to address this, but it's been too high risk to make the more significant changes required to completely address the underlying issue mid-season.
Fundamentally, the cause is that the amount of data that needs to be sent in response to the initial subscription creates a backlog in the network connection, which results in the server terminating the connection if the backlog doesn't clear (and there's more data to be sent) within ~1 second. Glass creates a bigger challenge here than other dashboards, because it subscribes not only to the topics, but also to all the "meta" topics (topics that describe the clients, publishers, and subscribers for each of the "real" topics), which roughly triples the number of topics and the amount of initial data being sent.
I have a few strategies in mind for addressing the issue; the first one is the real solution.
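For illustration, here is a minimal sketch of the backlog timeout described above, assuming the server tracks pending outgoing bytes per connection; the names, structure, and the one-second default are illustrative, not the actual ntcore code:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sketch of the failure mode described above: the server drops a
// connection if its outgoing backlog has not drained within roughly one
// second while more data is still queued.
class OutgoingBacklog {
 public:
  using Clock = std::chrono::steady_clock;

  void OnQueued(std::size_t bytes) {
    if (m_pending == 0) {
      m_backlogStart = Clock::now();  // a new backlog is starting
    }
    m_pending += bytes;
  }

  void OnSent(std::size_t bytes) {
    m_pending = (bytes >= m_pending) ? 0 : m_pending - bytes;
  }

  // True if data is still queued and the backlog has persisted longer than
  // the timeout (~1 second in the description above).
  bool ShouldDisconnect(std::chrono::milliseconds timeout =
                            std::chrono::milliseconds{1000}) const {
    return m_pending > 0 && (Clock::now() - m_backlogStart) > timeout;
  }

 private:
  std::size_t m_pending = 0;
  Clock::time_point m_backlogStart{};
};
```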