
Glass cyclically disconnects in NT4 mode #5263

Closed
chauser opened this issue Apr 12, 2023 · 14 comments
Labels
component: glass Glass app and backend type: bug Something isn't working.

Comments

@chauser (Contributor) commented Apr 12, 2023

Transferring this from Discussion #5262
We observed several times over the season that Glass would enter a mode where it would connect, disconnect within a second, reconnect, and so on. When it occurred, it happened on multiple Glass clients (2) connected to the robot at the same time. Switching to NT3 mode ended the cycle. This continued even with the latest WPILib release.

At the time there was no opportunity to gather more data about what was happening and I do not know how to reproduce the problem.

I'm putting it here as a discussion rather than an issue to find out if others have noticed the behavior. I'll dig deeper if it happens again, but we are going to be interacting with the robot a lot less so it may not come up again for us this year.

@PeterJohnson replied:

This has been reported a few times. In general it is more likely to happen when there are a large number of topics (~1000), or a large amount of data per topic, and a constrained network environment (poor wireless). It was more common in earlier releases, as we've made a number of changes to try to address this throughout the season, but it's been too high risk to make the more significant changes required to completely address the underlying issue mid-season.

Fundamentally, the cause is that the amount of data that needs to be sent in response to the initial subscription creates a backlog in the network connection, and the server terminates the connection if the backlog doesn't clear (while there's more data to be sent) within ~1 second. Glass creates a bigger challenge than other dashboards because it subscribes not only to the topics but also to all of the "meta" topics (topics that describe the clients, publishers, and subscribers for each of the "real" topics), which roughly triples the number of topics and the amount of initial data being sent.
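To make the failure mode concrete, here is a minimal sketch (illustrative only, not ntcore's actual code) of the server-side check described above: once the outgoing send queue has stayed non-empty for longer than roughly a second, the connection is dropped.

```python
class ConnectionMonitor:
    """Illustrative sketch of the backlog check described above:
    if queued outgoing data has not drained within ~1 second while
    more data is waiting, the server drops the connection.
    (Hypothetical class; not ntcore's real implementation.)"""

    BACKLOG_TIMEOUT = 1.0  # seconds; approximate, per the description above

    def __init__(self):
        self.backlog_since = None  # when the send queue first became non-empty

    def on_send_queue_size(self, queued_bytes, now):
        """Called periodically; returns True if the connection should be closed."""
        if queued_bytes == 0:
            # Backlog cleared; reset the timer.
            self.backlog_since = None
            return False
        if self.backlog_since is None:
            self.backlog_since = now
            return False
        return (now - self.backlog_since) > self.BACKLOG_TIMEOUT
```

A large initial subscription burst on a slow link keeps `queued_bytes` above zero past the timeout, which is exactly the connect/disconnect cycle reported here.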

I have a few strategies in mind for addressing the issue; the first one is the real solution.

  • Rate limit the initial burst of subscription data (e.g. space the transmits out rather than sending it as a big burst, so that other updates have a chance to make it through). This is a little tricky to do because of ordering concerns--we can't send values for a topic until the publish message is sent, we need to make sure that the current value is sent "eventually" if it's not sent due to some other change, and we also don't want to send the current value if a "newer" value is sent in the interim. It's certainly possible to do with the right flags etc, but the complexity of this is why we didn't make a mid-season change.
  • Change Glass to only subscribe to meta-topics if that information is actually being shown
  • Add a publisher option to not keep the last value (this then won't send any value until a new value is published)
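The first strategy, with its ordering constraints, can be sketched roughly as follows (hypothetical names and structure, purely illustrative of the pacing idea, not ntcore's real code):

```python
from collections import deque

class PacedInitialSender:
    """Sketch of rate-limiting the initial subscription burst: send a
    bounded number of topic announcements and cached values per tick,
    so other updates can interleave. (Hypothetical helper.)"""

    def __init__(self, topics, per_tick=10):
        self.pending = deque(topics)  # topics whose initial data is unsent
        self.per_tick = per_tick      # max topics handled per tick
        self.value_sent = set()       # topics whose current value went out

    def note_live_update(self, topic):
        # A newer value was already sent by normal traffic, so the
        # (now stale) cached value must not be resent later.
        self.value_sent.add(topic)

    def tick(self, send):
        # Ordering constraint: a topic's value can never precede its
        # announce/publish message, so announce first, then the cached
        # value unless a newer one already went out.
        for _ in range(min(self.per_tick, len(self.pending))):
            topic = self.pending.popleft()
            send(("announce", topic))
            if topic not in self.value_sent:
                send(("value", topic))
                self.value_sent.add(topic)
```

This captures the two tricky requirements from the first bullet: the cached value is eventually sent if nothing else sends it, and it is skipped if a newer value arrived in the interim.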
@calcmogul calcmogul added component: glass Glass app and backend type: bug Something isn't working. labels Aug 2, 2023
@PeterJohnson (Member) commented:

Fixed in #5659.

@AngleSideAngle (Contributor) commented:

Similarly to #5817, my team and several others (@Bankst, @shueja) continue to experience this issue in Glass and OutlineViewer 2024.1.1 on Windows 10.

@PeterJohnson (Member) commented:

Yes, this came back due to a late change. Reopening until that is fixed.

@PeterJohnson PeterJohnson reopened this Jan 17, 2024
@chauser (Contributor, Author) commented Jan 17, 2024

We are also seeing this. The behavior is a little different from last year: the connect/disconnect oscillation occurs without ever displaying any data (except the connection status), whereas last year it would show a lot of data and then disconnect.
Further, when I left it running for a while, it eventually succeeded in connecting -- and both Glass and OutlineViewer succeeded within a second or two of each other.

@chauser (Contributor, Author) commented Jan 17, 2024 via email

@chauser (Contributor, Author) commented Jan 17, 2024

As promised -- the Wireshark packet capture log. I haven't tried to analyze it, since it appears the cause of the problem may already be known. If that's not true and you'd like me to look at it in more detail, let me know.
GlassOscillatingConnection.pcapng.gz

@chauser (Contributor, Author) commented Jan 17, 2024

A thought about the WebSocket keep-alives implemented by WS PING and PONG messages: if I'm understanding the comments on PR #5659 correctly, the connection is closed if a PONG response is not received within a timeout period after a PING is sent. Couldn't anything received on the WebSocket after the PING was sent be considered adequate evidence that the connection is still alive? It seems to me this would solve the problem of PINGs and PONGs getting queued behind a great deal of other data, leading to the timeouts we are observing (at least the Wireshark capture seems to support that this is what is going on).
For actually-broken connections this should not significantly delay detection, but for still-working connections it should avoid falsely flagging them as broken.
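The suggested liveness rule can be sketched like this (illustrative only, not the actual WebSocket implementation; all names are hypothetical):

```python
class KeepaliveTracker:
    """Sketch of the suggestion above: instead of requiring a PONG
    within the timeout, treat *any* data received after the PING was
    sent as evidence the connection is alive."""

    def __init__(self, timeout=1.0):
        self.timeout = timeout
        self.ping_sent_at = None  # time of the outstanding PING, if any
        self.last_rx_at = 0.0

    def on_ping_sent(self, now):
        self.ping_sent_at = now

    def on_bytes_received(self, now):
        # Any received data (PONG or otherwise) counts as liveness,
        # so keep-alives queued behind bulk traffic don't produce a
        # false "broken connection" verdict.
        self.last_rx_at = now
        self.ping_sent_at = None

    def is_dead(self, now):
        return (self.ping_sent_at is not None
                and now - self.ping_sent_at > self.timeout)
```

As noted in the reply below this comment in the original thread, tracking received bytes (rather than complete frames) matters, since a single frame can take a long time to arrive.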

@PeterJohnson (Member) commented:

Good idea; we would need to make it actually look for received bytes rather than frames (since individual frames may be quite long).

The current disconnect issues have something more fundamentally wrong happening (data corruption or incorrect frames being sent). I have a good way to reproduce it now, and am working on isolating the root cause.

@chauser (Contributor, Author) commented Jan 18, 2024 via email

@chauser (Contributor, Author) commented Jan 18, 2024

I just did another experiment, this time connected by USB. I still got the cycling behavior, but it was different in two ways: first, the Glass viewer would actually populate with NT data before the disconnection happened; and second, the Wireshark logs do not look like a timeout is occurring. Now I wonder if there are two different causes of this behavior -- the corruption/incorrect frames that you are looking at, Peter, and timeouts on low-bandwidth connections.

@sciencewhiz (Contributor) commented:

Is this fixed in 2024.2.1?

@chauser (Contributor, Author) commented Jan 24, 2024

I have not seen it with 2024.2.1 either with our actual robot code or with Peter's stress test code, both of which would always trigger it.

@ghost commented Feb 8, 2024

We occasionally see this happen on all NT4 clients: Shuffleboard, a custom dashboard, PhotonVision, etc.

@chauser (Contributor, Author) commented Feb 8, 2024

Are you using clients built against WPILib 2024.2.1? The full benefit of the changes is only achieved if both the robot code and the clients use the new implementations.


5 participants