WebRTC primer

As a relatively new technology, WebRTC is still largely unknown, even though it may well be the next big thing. It offers a whole new way to create interactive peer-to-peer multimedia applications in the browser, without requiring any additional plugins. Support is already built in to the newest versions of Chrome, on both desktop and Android. Firefox and Opera also have WebRTC capabilities, but some features are still missing.

It's a bit of a chicken-and-egg problem, really. There is no widespread adoption yet, so development doesn't get the highest priority, and development isn't the highest priority because there is no widespread adoption. While I can't really do much about the APIs, I can still try and present my take on the basic WebRTC connection flow. Hopefully this helps someone create a cool WebRTC application and thereby indirectly contribute to those development priorities.

But please note that this text was written as part of a project I've been working on; it is not meant to be the definitive introduction to WebRTC, nor is it primarily a tutorial. If you are looking for a more thorough introduction, see the great tutorial on HTML5 Rocks. If you have read that tutorial and still feel disoriented, I hope you come back here and read what I've written. Hopefully at least my diagram will clarify something.

WebRTC connection flow

In short, to establish a connection between two peers the following needs to be done:
  • Create a signaling channel between the peers
  • Get local media, and negotiate codecs
  • Perform interactive connection establishment assisted by the signaling channel
  • And finally start streaming data
This is my take on the issue. It is not the one and only way to do things, especially where the signaling channel is concerned. Then again, signaling is not covered by the WebRTC specification at all, even though it's a very important piece of the puzzle. The easiest approach is to roll your own asynchronous server using something like python-gevent or node.js, but you could just as well adapt something like SIP or XMPP for the purpose.
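The core of such a roll-your-own signaling server is tiny: it just forwards opaque messages between the peers of a session. Here is a minimal sketch of that routing logic, independent of any particular transport; all the names (`SignalingRelay`, `join`, `relay`) are my own illustrations, not part of any library or of WebRTC itself.

```javascript
// Minimal in-memory signaling relay: routes opaque messages between the
// peers of a session. In a real server each send() callback would write
// to a WebSocket or long-poll response; here it is just a function.
class SignalingRelay {
  constructor() {
    this.sessions = new Map(); // sessionId -> { peerId: sendCallback }
  }

  // Register a peer in a session with a callback for delivering messages.
  join(sessionId, peerId, send) {
    if (!this.sessions.has(sessionId)) this.sessions.set(sessionId, {});
    this.sessions.get(sessionId)[peerId] = send;
  }

  // Forward a message to every *other* peer in the same session.
  relay(sessionId, fromPeerId, message) {
    const peers = this.sessions.get(sessionId) || {};
    for (const [peerId, send] of Object.entries(peers)) {
      if (peerId !== fromPeerId) send(message);
    }
  }
}
```

The invites, SDP blobs and ICE candidates discussed below would all travel through something like `relay()`; the server never needs to understand their contents.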

At the very beginning both clients connect to this signaling / management server and mutually agree on a session. They then use this session to exchange the messages necessary to build their own direct connection, following the steps illustrated in diagram 1 below (use a state machine, you'll thank yourself later). Most of the functions in the diagram refer to the WebRTC stack, but some are just there to illustrate a point. Also note that some functions might fail due to user actions, and some due to timeouts; application-defined timeouts can also occur while waiting for state transitions. Finally, in the diagram the traffic between the peers passes through the signaling server.
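That state machine advice can be as simple as a table of allowed transitions. A sketch of what I mean (the state and event names are mine, nothing here comes from the WebRTC spec):

```javascript
// Tiny call state machine: only the listed transitions are legal;
// anything else is a bug, or a protocol violation by the other peer.
const TRANSITIONS = {
  idle:        { invite_sent: 'inviting', invite_received: 'ringing' },
  inviting:    { accepted: 'negotiating', rejected: 'idle', timeout: 'idle' },
  ringing:     { accept: 'negotiating', reject: 'idle', timeout: 'idle' },
  negotiating: { connected: 'in_call', failed: 'idle', timeout: 'idle' },
  in_call:     { hangup: 'idle', failed: 'idle' },
};

class CallStateMachine {
  constructor() { this.state = 'idle'; }

  // Apply an event; throw loudly on an illegal transition instead of
  // letting the call wander into an undefined state.
  handle(event) {
    const next = (TRANSITIONS[this.state] || {})[event];
    if (!next) throw new Error(`illegal event "${event}" in state "${this.state}"`);
    this.state = next;
    return next;
  }
}
```

Making illegal transitions throw is the whole point: a stray or duplicated signaling message then fails fast instead of silently corrupting the call state.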

Diagram 1. Basic connection flow



Negotiating the connection

One of the peers functions as the caller, and the other as the callee. The caller first sends an invite to the callee, notifying it of the incoming call. To save some time, it also immediately starts local media capture, attaches the obtained media stream to the RTCPeerConnection, and generates the SDP offer. This offer is then sent via the server to the callee. The process can be interrupted by a negative reply from the callee if it doesn't want to receive the call; it might, for example, already be engaged in another call.
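In browser code the caller's side of this step looks roughly like the sketch below. `RTCPeerConnection` and `getUserMedia` are the standard WebRTC / media capture APIs; `sendToServer` is a placeholder for whatever your signaling channel provides.

```javascript
// Caller side: capture local media, attach it, create and send the
// SDP offer. sendToServer() is a placeholder for the signaling channel.
async function startCall(sendToServer) {
  const pc = new RTCPeerConnection();

  // Start local capture immediately to save time, as described above.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: true,
  });
  for (const track of stream.getTracks()) pc.addTrack(track, stream);

  // Generate the SDP offer and hand it to the signaling server.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToServer({ type: 'offer', sdp: pc.localDescription });

  return pc;
}
```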

If the call is accepted, the callee starts its own local capture, reads the SDP offer, and generates an SDP answer to be sent back to the original caller. The offer and answer are used to negotiate audio and video codecs and other media parameters.
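The callee's side mirrors the caller's, with `createAnswer()` in place of `createOffer()`. Again a sketch, with `sendToServer` standing in for your signaling channel:

```javascript
// Callee side: apply the received offer, start local capture, and
// answer. The offer arrives via signaling; sendToServer() is a
// placeholder for the same channel in the other direction.
async function acceptCall(offer, sendToServer) {
  const pc = new RTCPeerConnection();
  await pc.setRemoteDescription(offer);

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: true,
  });
  for (const track of stream.getTracks()) pc.addTrack(track, stream);

  // createAnswer() picks codecs and parameters compatible with the offer.
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  sendToServer({ type: 'answer', sdp: pc.localDescription });

  return pc;
}
```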

In addition to generating the offer/answer, the RTCPeerConnection also looks for ICE candidates. These are IP/port pairs that can be used for communication with the other peer. Typically some are generated almost immediately, but others might take several seconds, or even tens of seconds, to appear. Each candidate is sent via the server to the other peer as it is generated, even during a call.
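This incremental exchange is known as trickle ICE, and wiring it up takes only a few lines. A sketch, with `sendToServer` once more standing in for the signaling channel:

```javascript
// Trickle ICE: forward each local candidate as it appears, and feed
// candidates received from the other peer into the connection.
function wireIce(pc, sendToServer) {
  pc.onicecandidate = (event) => {
    // A null candidate signals that gathering has finished.
    if (event.candidate) {
      sendToServer({ type: 'candidate', candidate: event.candidate });
    }
  };
}

// Call this for every candidate message arriving from the other peer.
async function onRemoteCandidate(pc, message) {
  await pc.addIceCandidate(message.candidate);
}
```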

The actual connection

After the answer has been processed by the caller, it can notify the callee that the call can finally be established. Both peers then start checking the ICE candidates and try to establish a connection. When a suitable connection channel is finally found, media data starts flowing and the call state can be marked as connected. Typically this initial connection should be established within a second or two, depending on network latency. Testing for better connections still continues in the background; when no more candidates are left to test, the ICE state is marked as completed.
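The place to observe all this is the connection's ICE state. A sketch of how an application might react to it; `onConnected` and `onFailed` are hypothetical application callbacks, not WebRTC APIs:

```javascript
// Watch the ICE connection state to decide when to mark the call as
// connected. onConnected/onFailed are app-level callbacks (placeholders).
function watchConnection(pc, onConnected, onFailed) {
  pc.oniceconnectionstatechange = () => {
    switch (pc.iceConnectionState) {
      case 'connected':   // a working candidate pair was found
      case 'completed':   // and no more candidates are left to test
        onConnected(pc.iceConnectionState);
        break;
      case 'failed':
      case 'disconnected':
        onFailed(pc.iceConnectionState);
        break;
    }
  };
}
```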

After this the peers can happily keep streaming media for as long as they desire, even without the signaling server. Ending a call can be done by simply stopping the transmission, but ideally it should be communicated via the signaling server or some other medium. Otherwise the other peer may have to wait until a timeout expires, which is not very good UX.
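A polite hangup is therefore two steps: tell the other side, then tear down locally. A sketch (the `'bye'` message type is my own convention, since signaling messages are entirely up to the application):

```javascript
// Polite hangup: notify the other peer first, then close locally.
// Without the 'bye' message the remote side only notices via timeout.
function hangUp(pc, sendToServer) {
  sendToServer({ type: 'bye' });
  // Stop the local capture devices (camera light off) before closing.
  pc.getSenders().forEach((sender) => sender.track && sender.track.stop());
  pc.close();
}
```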

Getting around NATs

To facilitate NAT traversal, WebRTC includes support for both STUN and TURN. STUN can be used to open NAT ports unless a symmetric, IP-address-specific NAT is encountered. In that case a TURN server is required; it relays the packet traffic between the clients. It does not do any kind of media transcoding, nor does it decrypt any encrypted traffic. It should also be noted that TURN in WebRTC requires user authentication to work properly. If the configuration doesn't provide the WebRTC middleware with credentials, the TURN server is silently ignored without any kind of error message.
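Both server types are handed to the RTCPeerConnection through its configuration object. The URLs and credentials below are placeholders; the point is the shape of the `iceServers` list, and that the TURN entry carries the username and credential it needs to avoid being silently dropped:

```javascript
// RTCPeerConnection configuration with one STUN and one TURN server.
// Server URLs and credentials are placeholders for your own servers.
const rtcConfig = {
  iceServers: [
    { urls: 'stun:stun.example.org:3478' },
    {
      urls: 'turn:turn.example.org:3478',
      // Without these, the TURN server is silently ignored.
      username: 'user',
      credential: 'secret',
    },
  ],
};

// Usage: const pc = new RTCPeerConnection(rtcConfig);
```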

Closing words

Thanks for reading! Drop a comment below if you found this helpful or want to leave constructive criticism.
