[guardian-dev] Blog post: VoIP Security Architecture

Thu Nov 21 19:11:06 EST 2013

https://guardianproject.info/2013/11/21/voip-security-architecture-in-brief/

VoIP security architecture in brief

Voice over IP (VoIP) has been around for a long time. It’s ubiquitous
in homes, data centers and carrier networks. Despite this ubiquity,
security is rarely a priority. By combining a handful of important
standard protocols, it is possible to provide end-to-end encryption
that resists wiretapping for an established VoIP call.

TLS is the security protocol between the signaling endpoints of the
session. It’s the same technology that protects HTTPS web sites:
ecommerce, secure webmail, Tor and many others use TLS for security.
Unlike web sites, VoIP uses a different protocol called the Session
Initiation Protocol (SIP) for signaling: actions like ringing an
endpoint, answering a call and hanging up. This is the metadata of
calls. SIP-TLS relies on the standard Certificate Authorities to
authenticate the certificates presented during the TLS handshake.
This implies that the calling endpoints trust the certificate issuer.
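
As a rough illustration of the signaling side, the sketch below
prepares a TLS client context that checks a SIP server's certificate
against the system Certificate Authority store and builds a minimal
SIP OPTIONS request. The host name, the helper names and the header
values are assumptions made for this example; port 5061 is the
conventional port for SIP over TLS.

```python
import socket
import ssl

SIPS_PORT = 5061  # conventional port for SIP over TLS


def make_tls_context() -> ssl.SSLContext:
    """Client context that validates certificates against the system CA store."""
    return ssl.create_default_context()


def build_options_request(host: str, user: str = "alice") -> bytes:
    """Build a minimal SIP OPTIONS request (header values are placeholders)."""
    lines = [
        f"OPTIONS sips:{host} SIP/2.0",
        "Via: SIP/2.0/TLS client.invalid;branch=z9hG4bK-sketch",
        "Max-Forwards: 70",
        f"From: <sips:{user}@{host}>;tag=sketch",
        f"To: <sips:{host}>",
        "Call-ID: sketch-call-id@client.invalid",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(host: str, timeout: float = 5.0) -> str:
    """Send OPTIONS over TLS and return the reply's first line (needs network)."""
    context = make_tls_context()
    with socket.create_connection((host, SIPS_PORT), timeout=timeout) as raw:
        # wrap_socket performs the handshake and rejects untrusted certificates
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_options_request(host))
            reply = tls.recv(4096).decode("ascii", "replace")
            return reply.splitlines()[0] if reply else ""
```

Calling probe("sip.example.org") against a real server would fail
with an ssl.SSLCertVerificationError if the certificate does not chain
to a trusted issuer, which is the trust relationship described above.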

To add a little complexity, the content of calls has only a small
relationship to SIP. The key agreement protocol for P2P VoIP content
is called ZRTP. In a true P2P system, all the key agreement and
encryption of a call’s content happens in the endpoint applications.
An important distinction between VoIP and other networked
communications is that all devices are both client and server at once,
so we have only “endpoints” rather than “clients” or “servers”. Once
the endpoints agree on a shared secret, the ZRTP session ends and the
SRTP session begins. When established, all audio and video content
going over the network is encrypted. Only the two peer endpoints that
established a session with ZRTP can decrypt the media stream. This is
the part of the conversation that cannot be wiretapped, nor can the
metadata of sessions in progress be spied on.

To step back a little, let’s review some acronyms. First there is SIP
(Session Initiation Protocol). This protocol is encrypted with TLS.
It contains the IP addresses of the endpoints that wish to
communicate, but it does not interact with the audio or video stream.

Second, there is ZRTP. This protocol enters into the mix after a
successful SIP dialog establishes a call session by locating the two
endpoints. It transmits key agreement information over the still
unencrypted RTP channel between the peers. The peers then read a
short authentication string aloud to each other; matching values
confirm that nobody tampered with the key exchange.
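
To make the idea concrete, here is a toy sketch of the kind of
exchange ZRTP performs: a Diffie-Hellman key agreement followed by a
short string that both peers can compare by voice. It is not the ZRTP
wire protocol. The small prime, the hashing steps and the function
names are assumptions chosen for readability, and real
implementations use much larger standardized groups.

```python
import hashlib
import secrets
from typing import Tuple

# Toy parameters for illustration only; far too small for real use.
PRIME = 2**127 - 1  # a Mersenne prime
GENERATOR = 3


def make_keypair() -> Tuple[int, int]:
    """Return (private, public) for one endpoint."""
    private = secrets.randbelow(PRIME - 3) + 2  # value in [2, PRIME - 2]
    public = pow(GENERATOR, private, PRIME)
    return private, public


def shared_secret(own_private: int, peer_public: int) -> bytes:
    """Combine our private value with the peer's public value."""
    value = pow(peer_public, own_private, PRIME)
    return hashlib.sha256(value.to_bytes(16, "big")).digest()


def short_auth_string(secret: bytes) -> str:
    """Hypothetical spoken check: four hex digits derived from the secret."""
    return hashlib.sha256(b"sas" + secret).hexdigest()[:4]


# Each endpoint only ever sends its public value over the network.
alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

alice_secret = shared_secret(alice_private, bob_public)
bob_secret = shared_secret(bob_private, alice_public)
```

Both endpoints arrive at the same secret without ever transmitting
it, and short_auth_string() gives each side the same four characters
to read aloud. If an attacker had substituted their own public value
in the middle, the two strings would be very unlikely to match.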

Third, enter SRTP. Only after the ZRTP key exchange succeeds is the
call content encrypted with the Secure Real-time Transport Protocol.
From this point forward, all audio and video is secure and uniquely
keyed to each individual session.
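
The ordering of the three stages can be summarized as a small state
machine. This is only a sketch of the sequence described above; the
class and method names are invented for the example and do not come
from any particular VoIP library.

```python
from enum import Enum


class Phase(Enum):
    SIGNALING = "SIP over TLS"   # locate the peer and set up the call
    KEY_AGREEMENT = "ZRTP"       # peers agree on a shared secret
    SECURE_MEDIA = "SRTP"        # audio and video are encrypted


# Each stage must succeed before the next one may begin.
NEXT_PHASE = {
    Phase.SIGNALING: Phase.KEY_AGREEMENT,
    Phase.KEY_AGREEMENT: Phase.SECURE_MEDIA,
}


class Call:
    def __init__(self) -> None:
        self.phase = Phase.SIGNALING

    def advance(self, succeeded: bool) -> Phase:
        """Move to the next stage, or fail the call if this stage failed."""
        if not succeeded:
            raise RuntimeError(f"{self.phase.value} failed; call not established")
        if self.phase not in NEXT_PHASE:
            raise RuntimeError("call is already in its final phase")
        self.phase = NEXT_PHASE[self.phase]
        return self.phase

    @property
    def media_encrypted(self) -> bool:
        """Media is protected only once the SRTP stage has been reached."""
        return self.phase is Phase.SECURE_MEDIA
```

A Call starts in the signaling phase, and media_encrypted stays False
until advance(True) has been called twice, mirroring the rule that
SRTP begins only after the ZRTP exchange succeeds.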

This brief was inspired by the numerous discussions I’ve participated
in online and offline during my ongoing operation of ostel.co, a
secure VoIP service sponsored by The Guardian Project. I understand
that VoIP is complex compared to HTTP, and the mainstream
understanding of its security elements often omits the ZRTP/SRTP
content and focuses only on the SIP-TLS signaling. While signaling
is important, few calls would be useful without content.