[WIP] MSC2359: E2EE VoIP Conferencing #2359

Closed
111 changes: 111 additions & 0 deletions proposals/2359-e2ee-voip-conferencing.md
@@ -0,0 +1,111 @@

# E2E Encrypted SFU VoIP conferencing via Matrix


## Background

Matrix has experimented with many different VoIP conferencing approaches over the years:

* Using FreeSWITCH as an MCU (multipoint conferencing unit - i.e. mixer) via
matrix-appservice-verto, where Riot would place a normal Matrix 1:1 VoIP call
to an endpoint on the MCU derived from the conf room ID where the conf call
was being triggered, with the existence of the ongoing conf call tracked in
the conf room’s room state. This predated Matrix E2EE, suffered from
problems with tuning FreeSWITCH to handle low bandwidth connections, had
worse UX than an SFU, and was removed from Riot in ~2017.

* Using Jitsi as an SFU (selective forwarding unit) via widgets augmented by native
support. This provides a much better UX, but doesn’t provide E2EE. It’s
fiddly to get working (particularly screensharing) on Riot/Desktop though, and
the React Native dependencies on Riot/Mobile end up being quite a pain to
maintain. Jitsi occasionally adds unwanted analytics dependencies &
functionality too. It’s also a bit of a shame to rely on embedding a random
“out of band” centralised focal point for conferencing via a widget, rather
than leveraging Matrix as a data transport or signalling layer.

* Using full mesh VoIP calls, where all the clients in a given room initiate
1:1 VoIP calls in DMs in order to establish a conf call. This was done as a
quick hack for
[vrdemo](https://github.com/matrix-org/matrix-vr-demo/blob/master/src/js/components/structures/FullMeshConference.js)
and worked surprisingly well - but has not been evolved due to lack of
braincycles (and because Jitsi was working well enough, with a nice UX). It
provides decentralised E2EE conferencing out of the box, but consumes
significant bandwidth & CPU/GPU/power to handle all the simultaneous 1:1
calls.

This proposal is a sketch of a 4th type of conferencing, providing SFU
semantics but leveraging Matrix’s E2EE to stop the SFU being able to intercept
call media.


## Overview

* You start off with a normal E2EE Matrix room
* All members start a VoIP 1:1 call in a DM with the SFU
* However, the SRTP keys for the media RTP (not RTCP) streams are

@ara4n ara4n Mar 19, 2020

@emcho wisely points out that SDP hasn't included SRTP keys in ~10 years, you just get the a=fingerprint line these days, which is a DTLS fingerprint. WebRTC doesn't currently give you a way in-browser to suppress the DTLS handshake and exchange the keys OOB, so there isn't a way to avoid handing the keys to the bridge endpoint. Therefore this proposal is completely flawed.

Hum.

@ara4n ara4n Mar 22, 2020

However, discussing further, it turns out that libwebrtc may well provide easy hooks these days to get at the compressed frames and encrypt them before transmit (and vice versa). So you'd keep the DTLS handshake to secure the SRTP stream, but would be transferring e2e-encrypted payloads within the SRTP stream.

Apparently Jitsi is then able to switch the streams around, assuming you packetise them with appropriate metadata, for which there's a spec or RFC somewhere (perhaps from PERC?).

In other words, while this may be impossible in browser, it could be done on Electron (especially with our fancy new build process) - and on the mobile apps, given we control the webrtc version they use.

...it is now no longer impossible in browser, as per https://twitter.com/jitsinews/status/1249682190209896448?s=20

So, our plan now is to integrate with Jitsi much as we do today in Riot, but somehow tracking Jitsi’s e2ee keys within megolm on the Matrix side. In future we could swap out their XMPP signalling entirely for Matrix and simplify things loads.

deliberately stripped from the SDP of the m.call.invite and m.call.answer
by the clients, so the SFU can’t decrypt the call media. The call
signalling negotiates typical SFU SRTP streams for:
* Sending audio (if not muted)
* Sending thumbnail video (if not muted)
* Sending full-res video (if requested by the SFU and not muted)
* Receiving 1-n multiplexed audio streams
* Receiving 1-n multiplexed video streams (mix of thumbnail & full-res)
* The 1:1 rooms could/should be E2EE to protect metadata, although this
isn’t strictly necessary to protect the call media.
* The members exchange the SRTP keys via timeline events (ideally state
events, but they’re not E2EE yet) in the main conference rooms, so the
clients can decrypt the forwarded SRTP streams.
* The SFU itself:
* Looks at the bandwidth of the media streams being received from the
various clients, and uses REMB or TMMBR or whatever RTCP congestion
control mechanism to request that the sending client’s full-res bitrate is
clamped to the lowest receive bitrate determined from the clients which
are currently trying to view the full-res streams.
* (Particularly slow receiving clients could be ignored and forced to
use the thumbnail rather than the full-res stream.)
* Tracks which clients are trying to view the full-res streams (via
datachannel?) and forwards the full-res streams to the clients in question
(requesting them via datachannel from the client if needed).
* The SFU could also use the datachannel to determine who’s currently
claiming to talk, to let users control the conference focus.
* Does the same for thumbnails too. (Could assume that everyone wants a copy
of the thumbnail streams).
* Relays the audio streams to everyone.
* We use the datachannel for the SFU control rather than Matrix to minimise
latency (which is really important when rapidly switching focus based on
voice detection in a call).
* This consciously leaks metadata about who was talking and when, but at least
the call data isn’t leaked.
* The fact the SFU can’t decrypt the streams means that some tricks aren’t
available:
* We can’t framedrop when sending to slow clients, as we don’t know where
the frames are. (Unless we provide some custom RTP headers or RTCP
packets outside the SRTP payloads to identify the frame types, but WebRTC
doesn’t support this afaik?)
* We also can’t downsample for slow clients, obviously. We could however
negotiate multiple send streams from the clients to try to support
slower clients better.
* SVC (which is patent encumbered anyway) probably is ruled out, as
exploiting spatial redundancy between the low & high res send streams is
probably impossible between the separated streams.
* However, some tricks are still available?
* We can forward keyframe requests from clients via RTCP.
* This has been written without reference to PERC, so is probably missing insights
from there.
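
As a rough illustration of the rate-control bullet above, here is a minimal sketch of how an SFU might pick the bitrate to request from a full-res sender. It is not part of the proposal: the function name, the per-client bandwidth map and the 300 kbps cut-off for "particularly slow" clients are all assumptions, and the actual request would still go out via REMB, TMMBR or whatever RTCP mechanism is chosen.

```python
from typing import Dict, Optional, Set, Tuple


def clamp_fullres_bitrate(
    receive_kbps: Dict[str, int],
    fullres_viewers: Set[str],
    slow_threshold_kbps: int = 300,
) -> Tuple[Optional[int], Set[str]]:
    """Pick the bitrate to request from a full-res sender.

    receive_kbps: estimated receive bandwidth per client (hypothetical
        input, e.g. derived from RTCP feedback).
    fullres_viewers: clients currently trying to view the full-res stream.
    slow_threshold_kbps: viewers below this are pushed to the thumbnail
        stream instead (300 is an arbitrary placeholder).

    Returns (target_kbps, demoted_viewers). target_kbps is None when no
    viewer is fast enough to receive the full-res stream.
    """
    # Only viewers we have a measurement for take part in the decision.
    measured = {c: receive_kbps[c] for c in fullres_viewers if c in receive_kbps}
    # Particularly slow viewers are ignored and forced onto the thumbnail.
    demoted = {c for c, kbps in measured.items() if kbps < slow_threshold_kbps}
    # Clamp the sender to the slowest remaining full-res viewer.
    eligible = [kbps for c, kbps in measured.items() if c not in demoted]
    target = min(eligible) if eligible else None
    return target, demoted


target, demoted = clamp_fullres_bitrate(
    {"alice": 2500, "bob": 1200, "carol": 150, "dave": 800},
    fullres_viewers={"alice", "bob", "carol"},
)
print(target, sorted(demoted))  # prints: 1200 ['carol']
```

Here dave is not viewing the full-res stream, so his bandwidth is ignored; carol falls below the cut-off and would be given the thumbnail instead.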

TL;DR: it works like a normal SFU, except the SRTP keys for the media streams
are exchanged in the megolm room where the conference was initiated, so the
SFU can never decrypt the media - but can still do rate control and forward
the streams around intelligently.
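
To make the key exchange a little more concrete, here is a sketch of what advertising and collecting per-stream media keys as timeline events could look like. Everything here is an assumption: the proposal does not define an event type or schema, so `org.example.call.media_key`, the content fields and the 30-byte placeholder key are purely illustrative. The megolm encryption of the event is left to the client's normal E2EE path and is not shown.

```python
import base64
import secrets
from typing import Dict, Iterable, Tuple

# Hypothetical event type; the proposal does not define one.
MEDIA_KEY_EVENT_TYPE = "org.example.call.media_key"


def build_media_key_event(call_id: str, stream_id: str, key: bytes) -> dict:
    """Build the (pre-encryption) event a client would send to the conf room."""
    return {
        "type": MEDIA_KEY_EVENT_TYPE,
        "content": {
            "call_id": call_id,
            "stream_id": stream_id,  # hypothetical label, e.g. "audio"
            "key": base64.b64encode(key).decode("ascii"),
        },
    }


def collect_media_keys(events: Iterable[dict]) -> Dict[Tuple[str, str], bytes]:
    """Map (sender, stream_id) to the most recently advertised key.

    Expects decrypted events in timeline order, each carrying the usual
    Matrix "sender" field.
    """
    keys: Dict[Tuple[str, str], bytes] = {}
    for event in events:
        if event.get("type") != MEDIA_KEY_EVENT_TYPE:
            continue
        content = event["content"]
        # Later events overwrite earlier ones, so a re-keyed stream wins.
        keys[(event["sender"], content["stream_id"])] = base64.b64decode(content["key"])
    return keys


key = secrets.token_bytes(30)  # placeholder key material
sent = build_media_key_event("conf1", "audio", key)
received = dict(sent, sender="@alice:example.org")  # as seen by other members
keys = collect_media_keys([received])
```

Since the SFU is not a member of the conference room, it never sees these events and so never learns the keys.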

## Details

Need to specify:

* Matrix timeline events for advertising the SRTP keys for the various streams in the conf room
* Matrix state events for announcing the existence of a conf call in the conf room
* DataChannel API for SFU floor control (or perhaps we could start off with Matrix to keep things a bit simpler?)
* resolution/fps of the pyramid of send streams? ability to let the SFU dynamically negotiate the send stream resolution/fps?
* TMMBR or REMB or whatever folks use for CC these days?
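
As one possible starting point for the floor-control item, the sketch below models the SFU side of a datachannel protocol in which each client says which senders it wants in full-res. The JSON message shape (`op`, `senders`) and the class and method names are assumptions made up for illustration; nothing here is specified by the proposal. Audio and thumbnails go to everyone, as described in the overview.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class FloorControl:
    """Tracks which full-res streams each conference member wants."""

    def __init__(self, members: Iterable[str]) -> None:
        self.members: Set[str] = set(members)
        # receiver -> senders whose full-res stream the receiver wants
        self.fullres_wanted: Dict[str, Set[str]] = defaultdict(set)

    def handle_message(self, receiver: str, raw: str) -> None:
        """Apply one datachannel message sent by `receiver`."""
        msg = json.loads(raw)
        if msg.get("op") == "select_fullres":  # hypothetical opcode
            requested = set(msg.get("senders", []))
            # Ignore unknown users and requests for the receiver's own stream.
            self.fullres_wanted[receiver] = (requested & self.members) - {receiver}

    def plan_for(self, receiver: str) -> Dict[str, List[str]]:
        """List the senders whose streams are forwarded to `receiver`."""
        others = sorted(self.members - {receiver})
        return {
            "audio": others,  # audio is relayed to everyone
            "thumbnail": others,  # assume everyone wants every thumbnail
            "fullres": sorted(self.fullres_wanted.get(receiver, set())),
        }

    def senders_to_request(self) -> Set[str]:
        """Senders that the SFU should ask for a full-res stream."""
        wanted: Set[str] = set()
        for senders in self.fullres_wanted.values():
            wanted |= senders
        return wanted


fc = FloorControl(["alice", "bob", "carol"])
fc.handle_message("alice", json.dumps({"op": "select_fullres", "senders": ["bob"]}))
plan = fc.plan_for("alice")
```

The same table could be driven by Matrix events instead of a datachannel if latency turns out not to matter for a first version.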