With Google's recently released WebRTC Insertable Streams, users of the WebRTC communication API were offered something more reliable and secure for their online communication. While WebRTC by itself has made VoIP calls more accessible to developers and end users since its introduction, it has always had one major weakness: as with other VoIP conferencing tools, true end-to-end encryption (E2EE) for multi-party calls is exceptionally difficult to implement. During a conference call, for example, each participant's video and audio streams must be merged with those of the other participants before being transmitted to each endpoint. For this merging to happen, the server must access the unencrypted media, which means that, at best, these calls can be protected with what is called "hop-by-hop" encryption. In this type of system, a compromised server could eavesdrop on the conversation, which is a major security concern.
One method used previously to address this issue is the Selective Forwarding Unit (SFU). This server-side component allows the media streams to remain separate and encrypted in multi-party conversations by forwarding each stream to the endpoints individually, and by allocating bandwidth to each stream based on which user is speaking. Unfortunately, the use of an SFU typically limits calls to five users or fewer.
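The forwarding behavior described above can be illustrated with a minimal sketch. The `Packet` and `Delivery` types and the `forwardPacket` function are hypothetical names invented for this example; a real SFU also handles RTP/RTCP details, bandwidth allocation, and speaker detection, none of which are modeled here.

```typescript
// Hypothetical, simplified model of selective forwarding (not a real SFU API).
interface Packet {
  senderId: string;
  payload: Uint8Array; // the media payload is passed along as-is, never mixed
}

interface Delivery {
  receiverId: string;
  packet: Packet;
}

// Forward one sender's packet to every other participant, unchanged.
// Unlike a mixing server, nothing is merged: each stream stays separate.
function forwardPacket(packet: Packet, participantIds: string[]): Delivery[] {
  return participantIds
    .filter((id) => id !== packet.senderId)
    .map((receiverId) => ({ receiverId, packet }));
}

// Usage: a three-party call in which "alice" sends one packet.
const deliveries = forwardPacket(
  { senderId: "alice", payload: new Uint8Array([1, 2, 3]) },
  ["alice", "bob", "carol"],
);
console.log(deliveries.map((d) => d.receiverId));
```

The point of the sketch is only that each stream reaches the other endpoints individually; how much bandwidth each stream receives is a separate decision that this example leaves out.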
WebRTC Insertable Streams takes a different approach to keeping conference calls safe from eavesdroppers and man-in-the-middle (MITM) attacks. Since WebRTC already includes DTLS-SRTP encryption by default, Insertable Streams simply adds a step to the usual communication process to further protect the contents of the call. By default, a DTLS handshake is used to derive keys, which are subsequently used to encrypt the payload of each RTP packet. In a direct call between two peers, this is rightfully called E2EE because the negotiated keys do not leave the local devices, so nobody in between the call endpoints can access the encrypted payloads. Once a media server sits between the participants, however, the handshake is performed with that server, which is why an additional layer of protection is needed.
Insertable Streams leaves this process of encryption untouched and simply adds a step between the encoder and the RTP packetizer. In this new step, Insertable Streams does what its name implies: it gives the application access to the encoded media frames and lets it insert a transformation, for example encrypting each frame's payload and attaching any metadata needed to decrypt it, so that the frame is indecipherable to someone eavesdropping in between the call endpoints. The same step is added at the other end of the call, where the transformation is reversed and the inserted metadata is removed before the frame reaches the decoder. This allows the other call participants who hold the E2EE key to decipher the media streams; anyone who does not hold the key is unable to do so.
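As a rough illustration of the kind of per-frame transformation described above, here is a minimal sketch. The function name `transformFrame` and the constant `UNENCRYPTED_HEADER_BYTES` are hypothetical, and XOR is used only as a toy stand-in that is not secure; a real deployment would use an authenticated cipher and a properly distributed key.

```typescript
// Assumption for illustration: leave the first byte readable so that
// intermediaries can still inspect a small frame header.
const UNENCRYPTED_HEADER_BYTES = 1;

// Toy transform: XOR every byte after the header with a repeating key.
// XOR is its own inverse, so the same function encrypts and decrypts.
// This is NOT secure; it only shows where a real cipher would go.
function transformFrame(
  data: Uint8Array,
  key: Uint8Array,
  skip: number = UNENCRYPTED_HEADER_BYTES,
): Uint8Array {
  if (key.length === 0) {
    throw new Error("key must not be empty");
  }
  const out = new Uint8Array(data.length);
  for (let i = 0; i < data.length; i++) {
    out[i] = i < skip ? data[i] : data[i] ^ key[(i - skip) % key.length];
  }
  return out;
}

// Usage: the sender transforms the encoded frame, the receiver reverses it.
const sharedKey = new Uint8Array([0x0f, 0xf0]);
const encodedFrame = new Uint8Array([0x01, 0xaa, 0xbb, 0xcc]);
const protectedFrame = transformFrame(encodedFrame, sharedKey);
const restoredFrame = transformFrame(protectedFrame, sharedKey);
console.log(protectedFrame, restoredFrame);

// In a browser, a function like this would run inside a TransformStream
// placed between the encoder and the packetizer (and, mirrored, between the
// depacketizer and the decoder), roughly:
//   new TransformStream({
//     transform(frame, controller) {
//       frame.data = transformFrame(new Uint8Array(frame.data), sharedKey).buffer;
//       controller.enqueue(frame);
//     },
//   });
// How the stream is attached to a sender or receiver depends on the browser
// and API version, so that wiring is intentionally left out of this sketch.
```

Because the transform runs on encoded frames before DTLS-SRTP is applied, a server that terminates the transport encryption would still see only the transformed payload.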
While promising, Insertable Streams is not yet a fix-all solution for implementing E2EE in multi-party VoIP calls. The major missing pieces are authentication and key distribution, both of which must be done securely to ensure that the media streams can be accessed only by the intended call participants. If a suitable method of implementing these features becomes available, WebRTC Insertable Streams has a good chance of making true E2EE conference calling a reality for a wide range of VoIP applications.