Fix room joining flow #3935
Conversation
Force-pushed from 14d55d1 to 0fc1c65
The PR summary seems to explain the _reconnectionErrorListener that you added to exit the scene, but I don't fully understand the other changes. I've left some comments with questions.
      if (this._onOccupantsChanged) {
        this._onOccupantsChanged(this.occupants);
      }
    } catch (err) {
What was the error that could have been caught here?
Any exception during the join, i.e. an exception in the "join" request. We were catching that error there and finishing the connect flow, which led to "virtually" joining the room: await this._joinRoom(); was succeeding and resolving the promise while you weren't really joined.
src/hub.js
Outdated
  let newHostPollInterval = null;

- // When reconnecting, update the server URL if necessary
+ // We don't need to update the server URL when reconnecting, it's always updated every time we join a room
It looks like you are responding to the comment that was previously here, but it's an odd comment to leave behind without that context. Maybe remove it.
Fixed.
src/naf-dialog-adapter.js
Outdated
        resolve();
      } catch (err) {
        if (isInitial) {
          this.reconnect();
Should this promise be rejected, or should this.reconnect() be returned?
I think we want to try to reconnect until the timer set here expires or we reach the maximum number of retries (which we have never used)
https://github.com/mozilla/hubs/blob/f180d53b4dda27a6f239a65e6220978099286040/src/hub.js#L633
The way it works right now, after the PR, is that we keep retrying for 90 seconds and then show the error page.
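The retry-until-deadline behaviour described here can be sketched roughly as follows. This is illustrative only (the real flow lives in src/hub.js, and the names, delay, and retry cap are assumptions): keep retrying the connect until the watchdog deadline passes or the retry cap is hit, then surface the error.

```javascript
// Retry `connect` until it succeeds, the deadline passes, or the retry
// cap is exhausted; then rethrow so the caller can show the error page.
// (Hypothetical helper; values are illustrative, not the hubs defaults.)
async function connectWithRetries(connect, { timeoutMs = 90000, maxRetries = 5, delayMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let lastErr;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (Date.now() >= deadline) break; // watchdog expired
    try {
      return await connect();
    } catch (err) {
      lastErr = err;
      await new Promise(r => setTimeout(r, delayMs)); // back off before retrying
    }
  }
  throw lastErr || new Error("connect-timeout");
}
```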
It makes sense that we want to reconnect, but shouldn't we still resolve the promise? I don't know the details of how javascript handles this, or if I'm using the correct vocabulary, but won't there be some execution state "parked" at await new Promise(...) indefinitely?
More simply, a code snippet like this seems suspect:
await new Promise((resolve, reject) => {
console.log("Do some work but don't resolve");
});
Hmm, good catch. I guess nothing happens and at some point the Promise object gets garbage collected. But I've been doing some more refactoring, and the reconnection listeners are no longer required now that we handle the server URL during room join, so I've gone ahead and removed that code and handled the promise correctly.
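"Handling the promise correctly" here presumably means settling it on both paths, something like the following sketch (doJoin is a hypothetical stand-in for the real join work): on failure the executor rejects, so the caller's await throws instead of parking forever.

```javascript
// Settle the promise on every path: resolve on success, reject on
// failure, so `await connect(...)` never hangs (illustrative sketch).
function connect(doJoin) {
  return new Promise((resolve, reject) => {
    try {
      doJoin();
      resolve();
    } catch (err) {
      reject(err); // the awaiting caller now sees the error
    }
  });
}
```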
Force-pushed from f7921d0 to 5ccfa60
@johnshaughnessy Looking at this today, I've realized that if signaling gets disconnected randomly, Protoo has an automatic retry mechanism. In that case, when the WSS is connected again we need to make sure that we join the room again; otherwise only signaling gets connected and we don't reconnect at the WebRTC level. That's probably why joining was handled inside the open event.

Also, every time I look at that join-room retry code it makes less and less sense: if signaling is connected, the chances that the room join flow fails are low, and if it does fail, why would we want to retry rather than just show the error page and simplify that flow? I've removed the retry; I think it covers a very rare case and makes the connection flow more confusing. Let me know your thoughts.

I've also updated the code so we correctly handle the case where signaling is closed after retrying: if it's a network blip we reconnect and survive the cut, and if it's permanent we just show the connection error page. Finally, I've removed the janus adapter traces; I don't think that code is used anymore.
I don't know whether the room rejoin flow is frequently used -- it's a little scary to me to remove that retry logic, but on the other hand I'm not against simplifying if our intention is to gradually make the whole thing more robust. We should just be vigilant about looking for bug reports of failing to join the room in the weeks following this deploy.
The only change I would like to see before release is to completely remove the Safari WebRTC hack, if you're going to remove it.
src/hub.js
Outdated
@@ -1353,7 +1314,7 @@ document.addEventListener("DOMContentLoaded", async () => {
   scene.addEventListener("adapter-ready", async ({ detail: adapter }) => {
     // HUGE HACK Safari does not like it if the first peer seen does not immediately
     // send audio over its media stream. Otherwise, the stream doesn't work and stays
-    // silent. (Though subsequent peers work fine.) This only affects naf janus adapter
+    // silent. (Though subsequent peers work fine.) This only affects naf dialog adapter
You removed half the hack, so you should also remove this comment (and presumably the later usage of track, oscillator, and stream).
      this.emitRTCEvent("error", "Adapter", () => `Join room failed: ${error}`);
      error("_joinRoom() failed:%o", err);
      this._connectSuccess(this._clientId);
      this._initialAudioConsumerPromise = Promise.all(audioConsumerPromises);
I know this is not new code, but want to make a note of it anyway -- It is worrying to me that we wait on connecting to all participants. If a peer is about to disconnect when this happens, do we recover?
Force-pushed from 5ccfa60 to a9d0e8e
It's also surprising to me that we'd want to remove this retry logic. Is it actually sufficient to rely solely on protoo's retry mechanism? Doesn't that only account for signalling disconnects, and not dialog/WebRTC disconnects? The previous retry code also handled changing the server URL in case a dialog server went down completely; isn't that logic missing now? You've also removed the 90-second connection timeout. We put that in place specifically to handle cases where the user would otherwise wait on a loading screen forever if something had gone wrong with the connection flow. Are we sure that the "close" event will always fire if something goes wrong, or that connect() will throw an error in every case?
I think if we were to release this as it is, we would:
- Want to make sure to test how it behaves with various disconnect scenarios. For example, a disconnect before signalling, a disconnect after signalling but before webrtc connections are established, and a disconnect after all connections are established. I would expect some retry mechanism to activate in all of those cases.
- Test the case where you have two dialog servers and one of them goes down. Does the code somehow still switch server URLs and reconnect successfully?
- Test to make sure that we never leave the user waiting on a loading screen that will never complete. I don't think we fully understood why this was happening, but it would be good to have this fallback anyway. We have definitely seen users just wait at a loading screen for upwards of 15 minutes because they really wanted to get into an event, and we don't want to mislead them into thinking that the connection will eventually establish itself, so a 90 second timeout seemed more reasonable.
- Add telemetry to ensure that we won't have an increase in the number of cases where we show users the "connect error" screen, effectively ending the session, where previously we might have attempted to reconnect instead.
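The loading-screen timeout the list above argues for is typically a race between the connect flow and a timer. A minimal sketch (names are hypothetical; the discussion mentions 90- and later 30-second values):

```javascript
// Race `promise` against a timer so the user is never parked on a
// loading screen indefinitely; the timer is cleared either way.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("connect-timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A caller would then await withTimeout(connectFlow, 90000) and show the error page on rejection instead of waiting forever.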
@brianpeiris Thanks for the review. For some context, here's how we are currently handling the adapter connection/disconnection through Protoo signaling events:

Unless the Protoo signaling WSS hasn't been opened yet, the Protoo state always transitions to disconnected and then to failed every time it tries to recover from a failure. If it can't recover, it transitions to closed; otherwise it transitions to open. So we should always either close the connection after the retries or reconnect if we manage to recover. As far as I understand, Protoo should always transition to closed in case of any persisting error, and I haven't been able to reproduce any other case where it doesn't, but I think it won't do any harm to leave the retry safeguard, so I've added a commit to restore it (0c268fd).

In case of a Protoo failure, we were using the default retry configuration, which takes quite a long time to fail if the issue persists; I've updated it to only retry 3 times, which I think is more reasonable (0d2817e).

WebRTC transport disconnection is handled automatically by the WebRTC stack (it retries in a similar fashion to Protoo). For the case where it transitions to a failed state, we handle it through Mediasoup's transport connectionstatechange event and restart ICE or recreate the transport depending on the transport state. We were getting the Dialog host again in that case; I talked about this with John, and I think it's a better idea to use the same host that was used during the adapter connect, so I've updated that (9dbc99b).
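The transport-recovery decision described here, restart ICE or recreate the transport depending on transport state, might look roughly like this (the function and the closed flag are illustrative stand-ins, not the actual Mediasoup handler):

```javascript
// Decide how to recover when the transport's connection state changes.
// Only "failed" needs intervention; the WebRTC stack retries the rest
// on its own. (Hypothetical sketch of the logic described above.)
function handleConnectionStateChange(transport, state) {
  switch (state) {
    case "failed":
      if (transport.closed) {
        return "recreate-transport"; // transport is gone, build a new one
      }
      return "restart-ice";          // transport alive, renegotiate ICE
    default:
      return "noop";                 // connecting/connected/disconnected
  }
}
```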
Thanks for explaining the retry logic. That makes me feel more confident about the changes, so I agree these changes put us in a better place. Keeping that 90-second fallback is good too.

I see how caching the dialog params makes sense during the ICE restarts, but I'm still not sure we're handling dialog server disconnects correctly when you have two or more dialog servers and one of them dies. We use getHost when signalling has succeeded, but shouldn't we also check for a new host when we receive a "failed" signalling event?
Force-pushed from 0c268fd to dc66dc2
@brianpeiris I've updated the PR and added support for server rollovers as we discussed. Now, in case of a connection failure, we check whether the server has changed and reconnect using the new URL. I've also added a watchdog for reconnections after joining; both the joining and reconnection watchdogs now wait 30 seconds (90 felt a little too long).
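The rollover check described here amounts to re-fetching the host on failure and only switching when it changed. A sketch under assumed names (handleConnectionFailure, getCurrentHost, and the serverUrl field are hypothetical, not the hubs API):

```javascript
// On a connection failure, re-fetch the dialog host; if the server
// rolled over to a new URL, adopt it before reconnecting.
// (Illustrative sketch of the behaviour described above.)
async function handleConnectionFailure(adapter, getCurrentHost) {
  const newHost = await getCurrentHost();
  if (newHost !== adapter.serverUrl) {
    adapter.serverUrl = newHost; // server changed: reconnect with the new URL
  }
  return adapter.serverUrl;
}
```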
LGTM!
We are not correctly failing in case of a room join issue, e.g. invalid credentials. This PR fixes the flow so we correctly fail and halt the process.