Glass cyclically disconnects in NT4 mode #5263
Comments
Fixed in #5659. |
Yes, this came back due to a late change. Reopening until that is fixed. |
We are also seeing this. The behavior is a little different from last year in that the connect/disconnect oscillation occurs without ever displaying any data (except the connection status), whereas last year it would show a lot of data and then disconnect. |
I have a wireshark log that I can upload in an hour or so if that would be helpful.
|
As promised - wireshark packet capture log. I haven't tried to analyze it since it appears the cause of the problem may already be known. If that's not true and you'd like me to look at it in more detail, let me know. |
A thought about the websocket keep-alives implemented by WS PING and PONG messages. If I'm understanding the comments on PR #5659 correctly, if a PONG response is not received within a timeout period after a PING is sent, the connection will be closed. Couldn't anything received on the websocket after the PING was sent be considered adequate evidence that the connection is still alive? It seems to me that this would solve the problem of PINGs and PONGs getting queued behind a great deal of other data, leading to the timeouts that we are observing (at least, the wireshark capture seems to support that this is what is going on). |
Good idea; we would need to make it actually look for received bytes not frames (since individual frames may be quite long). The current disconnect issues have something more fundamentally wrong happening (data corruption or incorrect frames being sent). I have a good way to reproduce it now, and am working on isolating the root cause. |
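As an illustration of the byte-level liveness check being discussed, here is a minimal sketch assuming a small tracker object called from the socket read path; the class and method names are hypothetical and not the actual wpinet/ntcore implementation:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sketch: instead of requiring the matching PONG frame to arrive
// before the timeout, treat any bytes received after a PING was sent as
// evidence that the peer is still alive.
class KeepAliveTracker {
 public:
  using Clock = std::chrono::steady_clock;

  void OnPingSent() {
    m_pingSent = Clock::now();
    m_awaitingEvidence = true;
  }

  // Call for every chunk of bytes read from the socket, even partial frames.
  void OnBytesReceived(std::size_t count) {
    if (count > 0) {
      m_awaitingEvidence = false;  // any incoming traffic counts as liveness
    }
  }

  // True only if nothing at all has arrived since the PING within the timeout.
  bool TimedOut(std::chrono::milliseconds timeout) const {
    return m_awaitingEvidence && (Clock::now() - m_pingSent) > timeout;
  }

 private:
  Clock::time_point m_pingSent{};
  bool m_awaitingEvidence = false;
};
```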
What I'm seeing in the Wireshark log is a single Websocket frame of 127892 bytes made up of 88 TCP segments. This is on a bandwidth-limited FRC radio (4 Mbit/s?). Early on, there is a dropped TCP packet that eventually gets resent by TCP fast retransmission; receipt of that packet completes the Websocket frame. It pretty consistently takes 285 ms to receive all those segments. This all pretty much agrees with what I'd expect to see when sending that much data on a 4 Mbit/s link.
It's not clear that just fragmenting at the websocket level would help. The kernel TCP stack would happily accept all of those fragments into its own buffers, and all that data would still be "ahead" of any ping or pong messages. But maybe the intention is to pace the sending of those segments?
|
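As a rough sanity check on the figures above (assuming the radio link really is limited to about 4 Mbit/s), the transmit time for the 127892-byte frame alone works out to

$$
\frac{127892 \times 8\ \text{bit}}{4 \times 10^{6}\ \text{bit/s}} \approx 0.256\ \text{s} \approx 256\ \text{ms},
$$

which leaves roughly 30 ms of the observed 285 ms to be accounted for by per-segment TCP/IP and WebSocket overhead plus the retransmitted packet.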
I just did another experiment, this time connected by USB. I still got the cycling behavior, but it was different in two ways: first, the glass viewer would actually populate with NT data before the disconnection happened; and second, the wireshark logs do not look like a timeout is occurring. Now I wonder if there are two different causes of this behavior: the corruption/incorrect frames that you are looking at, Peter, and timeouts on low-bandwidth connections. |
Is this fixed in 2024.2.1? |
I have not seen it with 2024.2.1 either with our actual robot code or with Peter's stress test code, both of which would always trigger it. |
We occasionally see this happen on all NT4 clients; shuffleboard, a custom dashboard, PhotonVision, etc. |
Are you using clients built against WPILib 2024.2.1? Full benefit of the changes is only achieved if both the robot code and the clients use the new implementations. |
Transferring this from Discussion #5262
We observed several times over the season that glass would enter a mode where it would connect, disconnect within a second, reconnect, etc. When it occurred it would happen on multiple glass clients (2) connected to the robot at the same time. Switching to NT3 mode ended the cycle. This continued even with the latest WPILib release.
At the time there was no opportunity to gather more data about what was happening and I do not know how to reproduce the problem.
I'm putting it here as a discussion rather than an issue to find out if others have noticed the behavior. I'll dig deeper if it happens again, but we are going to be interacting with the robot a lot less so it may not come up again for us this year.
@PeterJohnson replied:
This has been reported a few times. In general it is more likely to happen when there is a large number of topics (~1000) or a large amount of data per topic, combined with a constrained network environment (poor wireless). It was more common in earlier releases, as we've made a number of changes throughout the season to try to address this, but it's been too high risk to make the more significant changes required to completely address the underlying issue mid-season.
Fundamentally, the cause is that the amount of data that needs to be sent in response to the initial subscription creates a backlog in the network connection, which results in the server terminating the connection if the backlog doesn't clear (and there's more data to be sent) within ~1 second. Glass creates a bigger challenge here than other dashboards, because it subscribes not only to the topics, but also to all the "meta" topics (topics that describe the clients, publishers, and subscribers for each of the "real" topics), which roughly triples the number of topics and the amount of initial data being sent.
I have a few strategies in mind for addressing the issue; the first one is the real solution.
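For illustration, here is a minimal sketch of the backlog timeout described above, assuming the server tracks pending outgoing bytes per connection; the names, structure, and the one-second default are illustrative, not the actual ntcore code:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sketch of the failure mode described above: the server drops a
// connection if its outgoing backlog has not drained within roughly one
// second while more data is still queued.
class OutgoingBacklog {
 public:
  using Clock = std::chrono::steady_clock;

  void OnQueued(std::size_t bytes) {
    if (m_pending == 0) {
      m_backlogStart = Clock::now();  // a new backlog is starting
    }
    m_pending += bytes;
  }

  void OnSent(std::size_t bytes) {
    m_pending = (bytes >= m_pending) ? 0 : m_pending - bytes;
  }

  // True if data is still queued and the backlog has persisted longer than
  // the timeout (~1 second in the description above).
  bool ShouldDisconnect(std::chrono::milliseconds timeout =
                            std::chrono::milliseconds{1000}) const {
    return m_pending > 0 && (Clock::now() - m_backlogStart) > timeout;
  }

 private:
  std::size_t m_pending = 0;
  Clock::time_point m_backlogStart{};
};
```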