Fix room joining flow #3935
Conversation
Force-pushed from 14d55d1 to 0fc1c65
The PR summary seems to explain the _reconnectionErrorListener that you added to exit the scene, but I don't fully understand the other changes. I've left some comments with questions.
      if (this._onOccupantsChanged) {
        this._onOccupantsChanged(this.occupants);
      }
    } catch (err) {
What was the error that could have been caught here?
Any exception during the join, i.e. an exception in the "join" request. We were catching that error there and finishing the connect flow, which led to "virtually" joining the room: await this._joinRoom(); was succeeding and resolving the promise while you weren't really joined.
src/hub.js
Outdated
  let newHostPollInterval = null;

- // When reconnecting, update the server URL if necessary
+ // We don't need to update the server URL when reconnecting, it's always updated every time we join a room
It looks like you are responding to the comment that was previously here, but it's an odd comment to leave behind without that context. Maybe remove it.
Fixed.
src/naf-dialog-adapter.js
Outdated
        resolve();
      } catch (err) {
        if (isInitial) {
          this.reconnect();
Should this promise be rejected, or should this.reconnect() be returned?
I think we want to try to reconnect until the timer set here expires or we reach the maximum number of retries (which we have never used)
https://github.com/mozilla/hubs/blob/f180d53b4dda27a6f239a65e6220978099286040/src/hub.js#L633
The way it works right now, after the PR, is that we keep retrying for 90 seconds and then show the error page.
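The retry-until-deadline behaviour described here can be sketched roughly as follows. This is illustrative only (the real flow lives in src/hub.js, and the names, delay, and retry cap are assumptions): keep retrying the connect until the watchdog deadline passes or the retry cap is hit, then surface the error.

```javascript
// Retry `connect` until it succeeds, the deadline passes, or the retry
// cap is exhausted; then rethrow so the caller can show the error page.
// (Hypothetical helper; values are illustrative, not the hubs defaults.)
async function connectWithRetries(connect, { timeoutMs = 90000, maxRetries = 5, delayMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let lastErr;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (Date.now() >= deadline) break; // watchdog expired
    try {
      return await connect();
    } catch (err) {
      lastErr = err;
      await new Promise(r => setTimeout(r, delayMs)); // back off before retrying
    }
  }
  throw lastErr || new Error("connect-timeout");
}
```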
It makes sense that we want to reconnect, but shouldn't we still resolve the promise? I don't know the details of how javascript handles this, or if I'm using the correct vocabulary, but won't there be some execution state "parked" at await new Promise(...) indefinitely?
More simply, a code snippet like this seems suspect:
await new Promise((resolve, reject) => {
console.log("Do some work but don't resolve");
});
Hmm, good catch. I guess nothing happens and at some point the Promise object gets garbage collected. But I've been doing some more refactoring, and the reconnection listeners are no longer required now that we handle the server URL during room join, so I've gone ahead and removed that code and handled the promise correctly.
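"Handling the promise correctly" here presumably means settling it on both paths, something like the following sketch (doJoin is a hypothetical stand-in for the real join work): on failure the executor rejects, so the caller's await throws instead of parking forever.

```javascript
// Settle the promise on every path: resolve on success, reject on
// failure, so `await connect(...)` never hangs (illustrative sketch).
function connect(doJoin) {
  return new Promise((resolve, reject) => {
    try {
      doJoin();
      resolve();
    } catch (err) {
      reject(err); // the awaiting caller now sees the error
    }
  });
}
```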
Force-pushed from f7921d0 to 5ccfa60
@johnshaughnessy Looking at this today, I've realized that if signaling gets disconnected randomly, Protoo has an automatic retry mechanism. In that case, when the WSS is connected again we need to make sure that we join the room again; otherwise only signaling gets connected and we don't reconnect at the WebRTC level. That's probably why joining was handled inside the open event.

Also, every time I look at that join-room retry code it makes less and less sense: if signaling is connected, the chances that the room join flow fails are low, and if it does fail, why would we want to retry rather than just show the error page and simplify that flow? I've removed the retry; I think it covers a very rare case and makes the connection flow more confusing. Let me know your thoughts.

I've also updated the code so we correctly handle the case where signaling is closed after retrying: if it's a network blip we reconnect and survive the cut, and if it's permanent we just show the connection error page. Finally, I've removed the janus adapter traces; I don't think that code is used anymore.
I don't know whether the room rejoin flow is frequently used -- it's a little scary to me to remove that retry logic, but on the other hand I'm not against simplifying if our intention is to gradually make the whole thing more robust. We should just be vigilant about looking for bug reports of failing to join the room in the weeks following this deploy.
The only change I would like to see before release is to completely remove the Safari WebRTC hack, if you're going to remove it.
src/hub.js
Outdated
@@ -1353,7 +1314,7 @@ document.addEventListener("DOMContentLoaded", async () => {
   scene.addEventListener("adapter-ready", async ({ detail: adapter }) => {
     // HUGE HACK Safari does not like it if the first peer seen does not immediately
     // send audio over its media stream. Otherwise, the stream doesn't work and stays
-    // silent. (Though subsequent peers work fine.) This only affects naf janus adapter
+    // silent. (Though subsequent peers work fine.) This only affects naf dialog adapter
You removed half the hack, so you should also remove this comment (and presumably the later usage of track, oscillator, and stream).
      this.emitRTCEvent("error", "Adapter", () => `Join room failed: ${error}`);
      error("_joinRoom() failed:%o", err);
      this._connectSuccess(this._clientId);
      this._initialAudioConsumerPromise = Promise.all(audioConsumerPromises);
I know this is not new code, but want to make a note of it anyway -- It is worrying to me that we wait on connecting to all participants. If a peer is about to disconnect when this happens, do we recover?
Force-pushed from 5ccfa60 to a9d0e8e
It's also surprising to me that we'd want to remove this retry logic. Is it actually sufficient to rely solely on protoo's retry mechanism? Doesn't that only account for signalling disconnects, and not dialog/WebRTC disconnects? The previous retry code also handled changing the server URL in case a dialog server went down completely; isn't that logic missing now? You've also removed the 90-second connection timeout. We put that in place specifically to handle cases where the user would otherwise wait on a loading screen forever if something had gone wrong with the connection flow. Are we sure that the "close" event will always fire if something goes wrong, or that connect() will throw an error in every case?
I think if we were to release this as it is, we would:
- Want to make sure to test how it behaves with various disconnect scenarios. For example, a disconnect before signalling, a disconnect after signalling but before webrtc connections are established, and a disconnect after all connections are established. I would expect some retry mechanism to activate in all of those cases.
- Test the case where you have two dialog servers and one of them goes down. Does the code somehow still switch server URLs and reconnect successfully?
- Test to make sure that we never leave the user waiting on a loading screen that will never complete. I don't think we fully understood why this was happening, but it would be good to have this fallback anyway. We have definitely seen users just wait at a loading screen for upwards of 15 minutes because they really wanted to get into an event, and we don't want to mislead them into thinking that the connection will eventually establish itself, so a 90 second timeout seemed more reasonable.
- Add telemetry to ensure that we won't have an increase in the number of cases where we show users the "connect error" screen, effectively ending the session, where previously we might have attempted to reconnect instead.
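The loading-screen timeout the list above argues for is typically a race between the connect flow and a timer. A minimal sketch (names are hypothetical; the discussion mentions 90- and later 30-second values):

```javascript
// Race `promise` against a timer so the user is never parked on a
// loading screen indefinitely; the timer is cleared either way.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("connect-timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller would then await withTimeout(connectFlow, 90000) and show the error page on rejection instead of waiting forever.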
@brianpeiris Thanks for the review. For some context, here's how we are currently handling the adapter connection/disconnection through Protoo signaling events:

Unless the Protoo signaling WSS hasn't been opened yet, the Protoo state always transitions to disconnected and then to failed every time it tries to recover from a failure. If it can't recover, it transitions to closed; otherwise it transitions to open. So we should always either close the connection after the retries or reconnect if we manage to recover. As far as I understand, Protoo should always transition to closed in case of any persisting error, and I haven't been able to reproduce any other case where it doesn't, but I think it won't do any harm to leave the retry safeguard, so I've added a commit to restore it (0c268fd).

In case of a Protoo failure, we were using the default retry configuration, which takes quite a long time to fail if the issue persists; I've updated it to only retry 3 times, which I think is more reasonable (0d2817e).

WebRTC transport disconnection is handled automatically by the WebRTC stack (it retries in a similar fashion to Protoo). For the case where it transitions to a failed state, we handle it through Mediasoup's transport connectionstatechange event and restart ICE or recreate the transport depending on the transport state. We were getting the Dialog host again in that case; I talked about this with John, and I think it's a better idea to use the same host that was used during the adapter connect, so I've updated that (9dbc99b).
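The transport-recovery decision described here, restart ICE or recreate the transport depending on transport state, might look roughly like this (the function and the closed flag are illustrative stand-ins, not the actual Mediasoup handler):

```javascript
// Decide how to recover when the transport's connection state changes.
// Only "failed" needs intervention; the WebRTC stack retries the rest
// on its own. (Hypothetical sketch of the logic described above.)
function handleConnectionStateChange(transport, state) {
  switch (state) {
    case "failed":
      if (transport.closed) {
        return "recreate-transport"; // transport is gone, build a new one
      }
      return "restart-ice";          // transport alive, renegotiate ICE
    default:
      return "noop";                 // connecting/connected/disconnected
  }
}
```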
Thanks for explaining the retry logic. That makes me feel more confident about the changes, so I agree these changes put us in a better place. Keeping that 90-second fallback is good too.

I see how caching the dialog params makes sense during the ICE restarts, but I'm still not sure we're handling dialog server disconnects correctly when you have two or more dialog servers and one of them dies. We use getHost when signalling has succeeded, but shouldn't we also check for a new host when we receive a "failed" signalling event?
Force-pushed from 0c268fd to dc66dc2
@brianpeiris I've updated the PR and added support for server rollovers as we discussed. Now, in case of a connection failure, we check whether the server has changed and reconnect using the new URL. I've also added a watchdog for reconnections after joining; both the joining and reconnection watchdogs now wait 30 seconds (90 felt a little too long).
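The rollover check described here amounts to re-fetching the host on failure and only switching when it changed. A sketch under assumed names (handleConnectionFailure, getCurrentHost, and the serverUrl field are hypothetical, not the hubs API):

```javascript
// On a connection failure, re-fetch the dialog host; if the server
// rolled over to a new URL, adopt it before reconnecting.
// (Illustrative sketch of the behaviour described above.)
async function handleConnectionFailure(adapter, getCurrentHost) {
  const newHost = await getCurrentHost();
  if (newHost !== adapter.serverUrl) {
    adapter.serverUrl = newHost; // server changed: reconnect with the new URL
  }
  return adapter.serverUrl;
}
```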
LGTM!
We are not correctly failing in case of a room join issue, e.g. invalid credentials. This PR fixes the flow so we correctly fail and halt the process.