Bandwidth requirements for Voice over IP can be a
tricky beast to tame until you look at the method and factors
involved. This guide investigates what bandwidth means for VoIP,
how to calculate bandwidth consumption for a
VoIP network and how bandwidth can be saved by using voice
compression.
Table of contents
- What about bandwidth for VoIP?
-- An introduction to bandwidth issues for Voice over IP and its
different components.
- Calculating bandwidth consumption for VoIP
-- This section discusses how bandwidth can be calculated for VoIP
transmissions and what strategies work best for the majority of
situations.
- How can voice compression save bandwidth?
-- Using voice compression can be one of the best strategies when
trying to save bandwidth. This section discusses how these
'savings' can be achieved.
What about bandwidth for VoIP?
Voice over IP (VoIP) is the descriptor for the technology used to
carry digitised voice over an IP data network. VoIP requires two
classes of protocols: a signaling protocol such as SIP, H.323 or
MGCP that is used to set up, disconnect and control the calls and
telephony features; and a protocol to carry speech packets. The
Real-Time Transport Protocol (RTP) carries the speech itself. RTP
is an IETF standard introduced in 1995 when H.323 was standardised.
RTP works with any signaling protocol and is the protocol most
commonly used by IP PBX vendors.
An IP phone or softphone generates a voice packet every 10, 20,
30 or 40ms, depending on the vendor's implementation. The 10 to
40ms of digitised speech can be uncompressed, compressed and even
encrypted. This does not matter to the RTP protocol. As you have
already figured out, it takes many packets to carry one word.
The shorter the packet, the shorter the delay
End-to-end (phone-to-phone) delay needs to be limited. The
shorter the packet creation delay, the more network delay the VoIP
call can tolerate. Shorter packets cause less of a problem if the
packet is lost. Short packets require more bandwidth, however,
because of increased packet overhead (this is discussed below).
Longer packets that contain more speech bytes reduce the bandwidth
requirements but produce a longer construction delay and are harder
to fix if lost. Many vendors have chosen 20 or 30 ms packet
sizes.
RTP packet format
The RTP header carries a time stamp and a sequence number and
identifies the content of each voice packet. The digitised speech
sample (20 or 30 ms of a word) follows the header. The content
descriptor defines the compression technique (if there is one) used
in the packet. The RTP packet format for VoIP over Ethernet is
shown below.
Ethernet Trailer | Digitised Voice | RTP Header | UDP Header | IP Header | Ethernet Header |
RTP can be carried on frame relay, ATM, PPP and other networks
with only the far right header and left trailer varying by
protocol. The digitised voice field, RTP, UDP and IP headers remain
the same.
Each of these packets will contain part of a digitised spoken
word. The packet rate is 50 packets per second for 20ms and 33.3
packets per second for 30ms voice samples. The voice packets are
transmitted at these fixed rates. The digitised voice field can
contain as few as 10 bytes of compressed voice or as many as 320
bytes of uncompressed voice.
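The packet rates and payload sizes quoted above follow directly from the sample time and the codec bit rate. The short Python sketch below shows the arithmetic; the function names are illustrative only, and the bit rates used (64 Kbps for uncompressed G.711 and 8 Kbps for G.729) are the figures given later in this guide.

```python
def packets_per_second(sample_ms: float) -> float:
    """Packets sent per second when each packet carries sample_ms of speech."""
    return 1000 / sample_ms


def voice_bytes_per_packet(codec_bps: int, sample_ms: float) -> float:
    """Voice payload in bytes: bits produced during sample_ms, divided by 8."""
    return codec_bps * sample_ms / 8000


if __name__ == "__main__":
    for codec, bps in (("G.711", 64_000), ("G.729", 8_000)):
        for ms in (10, 20, 30, 40):
            print(f"{codec} {ms} ms: {packets_per_second(ms):.1f} packets/s, "
                  f"{voice_bytes_per_packet(bps, ms):.0f} voice bytes")
```

The two extremes in the text fall out of this: 10 ms of 8 Kbps voice is 10 bytes, and 40 ms of 64 Kbps voice is 320 bytes.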
The UDP header carries the sending and receiving port numbers
for the call. The IP header carries the sending and receiving IP
addresses for the call plus other control information. The Ethernet
header carries the LAN MAC addresses of the sending and receiving
devices. The Ethernet trailer is used for error detection purposes.
The Ethernet header is replaced with a frame relay, ATM or PPP
header and trailer when the packet enters a WAN.
'Shipping and handling'
In reality, there is no Voice over IP. It is really voice over
RTP, over UDP, over IP and usually over Ethernet. The headers and
trailers are required fields for the networks to carry the packets.
The header and trailer overhead can be called the shipping and
handling cost.
The RTP plus UDP plus IP headers will add on 40 bytes. The
Ethernet header and trailer account for another 18 bytes of
overhead, for a total of at least 58 bytes of overhead before there
are any voice bytes in the packet. This
overhead can range from 20% to 80% of the bandwidth consumed over
the LAN and WAN. Many implementations of RTP have no encryption, or
the vendor has provided its own encryption facilities. An IP PBX
vendor may offer a standardised secure version of RTP (SRTP).
Shorter packets have higher overhead. On Ethernet, the same 58 bytes
of overhead accompany the voice bytes however many there are. As the
size of the voice field
gets larger with longer packets, the percentage of overhead
decreases -- therefore the needed bandwidth decreases. In other
words, bigger packets are more efficient than smaller packets.
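To see how the 'shipping and handling' share shrinks as the voice field grows, the overhead percentage can be worked out per packet. This is a minimal sketch using the 40-byte and 18-byte figures from the text; the function name is made up for illustration.

```python
RTP_UDP_IP_BYTES = 40  # combined RTP + UDP + IP headers (figure from the text)
ETHERNET_BYTES = 18    # Ethernet header plus trailer (figure from the text)


def overhead_fraction(voice_bytes: int,
                      overhead_bytes: int = RTP_UDP_IP_BYTES + ETHERNET_BYTES) -> float:
    """Share of each packet on the wire that is headers and trailer, not voice."""
    return overhead_bytes / (voice_bytes + overhead_bytes)


if __name__ == "__main__":
    for voice_bytes in (20, 40, 160, 320):
        print(f"{voice_bytes:3d} voice bytes -> "
              f"{overhead_fraction(voice_bytes):.0%} overhead")
```

With 20 voice bytes roughly three quarters of the Ethernet frame is overhead; with 160 voice bytes it is a little over a quarter.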
Header compression
Cisco created a header compression technique, RTP header compression
(cRTP), that has since become a standard. This technique actually
compresses the RTP, UDP and IP headers and significantly reduces
the RTP, UDP and IP overhead from 40 bytes to between 4 and 6
bytes. The bandwidth consumption for compressed voice packets can
be reduced by nearly 60%. This technique has less value for large
uncompressed voice packets. The header compression technique is not
recommended for the LAN implementations because there is typically
more than enough bandwidth for voice calls. The header compression
technique should be considered for the WAN implementations, where
bandwidth is limited and much more expensive.
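The difference in benefit between compressed and uncompressed voice can be estimated per packet. The sketch below assumes a 4-byte compressed header and 6 bytes of PPP or Frame Relay overhead, the values used in the notes to the table later in this guide; the function names are illustrative, and real savings depend on the vendor's implementation.

```python
def packet_bytes(voice_bytes: int, header_bytes: int, layer2_bytes: int = 6) -> int:
    """Total bytes on the WAN link for one voice packet."""
    return voice_bytes + header_bytes + layer2_bytes


def crtp_saving(voice_bytes: int, full_header: int = 40,
                compressed_header: int = 4, layer2_bytes: int = 6) -> float:
    """Fraction of bandwidth saved by compressing the RTP/UDP/IP headers.

    The packet rate is unchanged by header compression, so the ratio of
    packet sizes equals the ratio of bandwidths.
    """
    before = packet_bytes(voice_bytes, full_header, layer2_bytes)
    after = packet_bytes(voice_bytes, compressed_header, layer2_bytes)
    return 1 - after / before


if __name__ == "__main__":
    print(f"G.729, 20 bytes:  {crtp_saving(20):.1%} saved")
    print(f"G.711, 160 bytes: {crtp_saving(160):.1%} saved")
```

Under these assumptions a 20-byte compressed payload saves a little over half its WAN bandwidth, while a 160-byte uncompressed payload saves well under a fifth.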
Calculating bandwidth consumption for VoIP
The bandwidth needed for VoIP transmission will depend on a few
factors: the compression technology, packet overhead, network
protocol used and whether silence suppression is used. This tip
investigates the first three considerations. Silence suppression
will be covered in a later tip.
There are two primary strategies for improving IP network
performance for voice: Allocate more VoIP bandwidth (reduce
utilisation) or implement QoS.
How much bandwidth to allocate depends on:
- Packet size for voice (10 to 320 bytes of digital voice)
- CODEC and compression technique (G.711, G.729, G.723.1, G.722,
proprietary)
- Header compression (RTP + UDP + IP), which is optional
- Layer 2 protocols, such as point-to-point protocol (PPP), Frame
Relay and Ethernet
- Silence suppression/voice activity detection
Calculating the bandwidth for a VoIP call is not difficult once
you know the method and the factors to include. The chart below,
"Calculating one-way voice bandwidth," demonstrates the overhead
calculation for 20 and 40 byte compressed voice (G.729) being
transmitted over a Frame Relay WAN connection. Twenty bytes of
G.729 compressed voice is equal to 20 ms of a word. Forty bytes of
G.729 compressed voice is equal to 40 ms of a word.
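The method for the two Frame Relay cases just described can be sketched in a few lines of Python. It assumes 40 bytes of RTP/UDP/IP headers and 6 bytes of Frame Relay overhead per packet, as in the notes to the table that follows; the function name and parameters are illustrative, not part of any vendor tool.

```python
def one_way_bps(voice_bytes: int, sample_ms: float,
                header_bytes: int = 40, layer2_bytes: int = 6) -> float:
    """One-way bandwidth in bits per second for a single voice channel.

    Each packet carries the voice bytes plus the RTP/UDP/IP headers plus
    the layer 2 header and trailer; one packet is sent every sample_ms.
    """
    bits_per_packet = (voice_bytes + header_bytes + layer2_bytes) * 8
    return bits_per_packet * 1000 / sample_ms


if __name__ == "__main__":
    # 20 bytes of G.729 every 20 ms, then 40 bytes every 40 ms, over Frame Relay
    print(one_way_bps(20, 20))  # 26400.0
    print(one_way_bps(40, 40))  # 17200.0
```

Passing `layer2_bytes=18` gives the Ethernet case, and `header_bytes=4` gives the header-compressed case.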

The results of this method of calculation are contained in the
next table, "Packet voice transmission requirements." The table
demonstrates these points:
- Bandwidth requirements reduce with compression, G.711 vs.
G.729.
- Bandwidth requirements reduce when longer packets are used,
thereby reducing overhead.
- Even though the voice compression is an 8 to 1 ratio, the
bandwidth reduction is about 3 or 4 to 1. The overhead negates some
of the voice compression bandwidth savings.
- Compressing the RTP, UDP and IP headers (cRTP) is most valuable
when the packet also carries compressed voice.
Packet voice transmission requirements (bits per second per voice channel)
| Codec | Voice bit rate | Sample time | Voice payload | Packets per second | Ethernet (RTP) | PPP or Frame Relay (RTP) | PPP or Frame Relay (cRTP) |
| G.711 | 64 Kbps | 20 msec | 160 bytes | 50 | 87.2 Kbps | 82.4 Kbps | 68.0 Kbps |
| G.711 | 64 Kbps | 30 msec | 240 bytes | 33.3 | 79.4 Kbps | 76.2 Kbps | 66.6 Kbps |
| G.711 | 64 Kbps | 40 msec | 320 bytes | 25 | 75.6 Kbps | 73.2 Kbps | 66.0 Kbps |
| G.729A | 8 Kbps | 20 msec | 20 bytes | 50 | 31.2 Kbps | 26.4 Kbps | 12.0 Kbps |
| G.729A | 8 Kbps | 30 msec | 30 bytes | 33.3 | 23.4 Kbps | 20.2 Kbps | 10.7 Kbps |
| G.729A | 8 Kbps | 40 msec | 40 bytes | 25 | 19.6 Kbps | 17.2 Kbps | 10.0 Kbps |
Note: RTP assumes 40 octets of RTP/UDP/IP overhead per packet.
Compressed RTP (cRTP) assumes 4 octets of RTP/UDP/IP overhead per
packet. Ethernet overhead adds 18 octets per packet. PPP/Frame
Relay overhead adds 6 octets per packet.
This table provided courtesy of Michael Finneran.
The many combinations of packet size, voice compression choice and
header compression make it difficult to determine the bandwidth
needed for a continuous speech voice call. The IP PBX or IP
phone vendor should be able to provide tables like the one above
for their products. Many vendors have selected 30 ms for the
payload size of their VoIP implementations. A good rule of thumb is
to reserve 24 Kbps of IP network bandwidth per call for 8 Kbps
(G.729-like) compressed voice. If G.711 is used, then reserve 80
Kbps of bandwidth.
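The rule of thumb turns into a quick capacity estimate. The sketch below is a rough planning aid only: the 1.544 Mbps T1 link and the `voice_share` parameter (the fraction of the link set aside for voice) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def calls_supported(link_bps: int, per_call_bps: int, voice_share: float = 1.0) -> int:
    """Whole number of simultaneous calls that fit in the voice share of a link."""
    return int(link_bps * voice_share // per_call_bps)


if __name__ == "__main__":
    T1_BPS = 1_544_000  # assumed link speed, for illustration
    print("G.729-like at 24 Kbps:", calls_supported(T1_BPS, 24_000))
    print("G.711 at 80 Kbps:     ", calls_supported(T1_BPS, 80_000))
    print("G.729-like, half the link for voice:",
          calls_supported(T1_BPS, 24_000, voice_share=0.5))
```

The result counts one direction of each call; since a call normally uses the same bandwidth in each direction on a full-duplex link, the same count applies both ways.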
If silence suppression/voice activity detection is used, the
bandwidth consumption may drop by about 50%, taking the 24 Kbps
rule-of-thumb figure down to roughly 12 Kbps per VoIP
call. But the assumption that everyone will alternate between voice
and silence without conflicting with each other is not always
realistic. Silence suppression will be discussed in a later
tip.
Most enterprise designers do not perform these calculations. The
vendor provides the necessary information. The designer does have
some freedom, such as selecting the compression technique for voice
payloads and headers, and may be able to vary the packet size.
How can voice compression save bandwidth?
The Public Switched Telephone Network
(PSTN) started with the transmission of
analog speech. This worked well for decades until the areas
under city streets became saturated with copper cables, one
copper pair per call. Starting in the 1950s, AT&T Bell Labs
developed a technique to carry more voice calls over copper
wire. They developed digitised voice technology through which 24
digital calls can be carried on two pairs of copper wire,
thereby increasing the carrying capacity of the cables
twelvefold. The voice is digitised into streams of 64,000 bps
per call. The technology is called a T1 circuit and the
bandwidth for the 24 calls is 1.544 Mbps. This worked well for
domestic connections. The T1 technology then became the
mechanism for long-distance domestic transmission.
Most of the early voice compression technologies were designed
for undersea cables, where bandwidth was limited and expensive.
Voice compression technologies were created to reduce this
bandwidth requirement. Voice compression is also used for digital
cell calls, operating at about 8 Kbps instead of 64 Kbps. So voice
compression is not new.
As the PBX market has moved into an IP-based environment, voice
compression has become attractive for WAN transmission. Voice
compression can be used on a LAN, but since LANs have so much
available bandwidth, it is not commonly applied to the LAN.
A PSTN voice call provides enough analog
bandwidth to understand the speaker in any language. It is also
enough bandwidth for speaker recognition. The analog bandwidth
delivered by the PSTN is about 3.4 kHz. This is considered toll
quality. Voice compression can reduce the speech quality and may
affect speaker recognition, so there is a limit to how much
bandwidth reduction is possible before callers complain about voice
quality.
The CODEC (COder/DECoder) is the component in an
IP phone that digitises the voice and converts it back into an
analog stream of speech. The CODEC is the
analog-to-digital-to-analog converter. The CODEC may also
perform the voice compression and decompression.
There are several voice digitisation standards and some
proprietary techniques in use for VoIP transmission. Most vendors
support one or more of the following ITU standards and avoid
proprietary solutions:
- G.711 is the default standard for IP PBX vendors, as
well as for the PSTN. This standard digitises voice into 64 Kbps.
There is no voice compression.
- G.729 is supported by many vendors for compressed voice
operating at 8 Kbps, 8 to 1 compression. With quality just below
that of G.711, it is the second most commonly implemented
standard.
- G.723.1 was once the recommended compression standard.
It operates at 6.3 Kbps and 5.3 Kbps. Although this standard
further reduces bandwidth consumption, voice is noticeably poorer
than with G.729, so it is not very popular for VoIP.
- G.722 operates at 64 Kbps, but offers high-fidelity
speech. Whereas the three previously described standards deliver an
analog sound range of 3.4 kHz, G.722 delivers 7 kHz. This version
of digitised speech has been announced by several vendors and will
become common in the future.
It is important to note that all of the voice digitisation
transmission speeds are for voice only. The actual transmission
speed required must include the packet protocol overhead.
The quality of a voice call is defined by the Mean Opinion Score
(MOS). A score of 4.4 to 4.5 out of a possible 5.0 is considered to
be toll quality. Voice compression will affect the MOS. An MOS
below 4.0 will usually produce complaints from the callers. Cell
phone calls average about 3.8 to 4.0 for the MOS. The following
table presents the voice MOS for different standard CODECs:
| Standard | Speed | MOS | Sampling delay per phone |
| G.711 | 64 Kbps | 4.4 | 0.75 ms |
| G.729 | 8 Kbps | 4.2 | 10 ms |
| G.723.1 | 6.3 Kbps | 4.0 | 30 ms |
| G.723.1 | 5.3 Kbps | 3.5 | 30 ms |
This table illustrates two points. First, as the voice is
compressed, the voice quality (MOS) decreases. The MOS in the table
does not include network impairments such as jitter and packet
loss. These impairments will further reduce the voice quality. The
VoIP network designer should choose a compression technique with a
higher MOS so the network impairments will not reduce the voice
quality to an unacceptable level.
Second, voice compression also adds delay to the end-to-end
call. The table shows the sampling delay for one phone. This delay
is doubled for the two phones of a call. This end-to-end delay
needs to be limited. As compression increases, the delay
experienced in the IP network needs to decrease, which increases
the cost of transmission over the WAN, but not the LAN. The delays
shown in the table are the theoretical minimum. The actual delays
experienced will probably exceed 30 ms, no matter what compression
technology is implemented. This delay will vary by vendor.
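The trade-off between codec delay and network delay can be illustrated with the sampling delays from the table. The 150 ms end-to-end target below is an assumption introduced for this sketch (it is a commonly quoted one-way guideline, not a figure from this guide), and the names are hypothetical. As noted above, the table values are theoretical minimums, so real budgets will be tighter.

```python
# Sampling delay per phone in milliseconds, taken from the table above
SAMPLING_DELAY_MS = {"G.711": 0.75, "G.729": 10.0, "G.723.1": 30.0}


def codec_delay_ms(codec: str) -> float:
    """Sampling delay for the two phones of a call (per-phone delay doubled)."""
    return 2 * SAMPLING_DELAY_MS[codec]


def network_budget_ms(codec: str, end_to_end_target_ms: float = 150.0) -> float:
    """Delay left for the IP network once both phones' sampling delay is spent.

    end_to_end_target_ms is an assumed design target, not a value from the text.
    """
    return end_to_end_target_ms - codec_delay_ms(codec)


if __name__ == "__main__":
    for codec in SAMPLING_DELAY_MS:
        print(f"{codec}: {codec_delay_ms(codec)} ms in the phones, "
              f"{network_budget_ms(codec)} ms left for the network")
```

The more heavily compressed the codec, the smaller the share of the delay target that remains for the WAN.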
The conclusion is that digital voice compression is worth
pursuing for VoIP transmission on a WAN, but it comes with some
costs in voice quality reduction and increased end-to-end
delay.
About the author:
Gary Audin has more than 40 years of computer, communications and
security experience. He has planned, designed, specified,
implemented and operated data, LAN and telephone networks. These
have included local area, national and international networks, as
well as VoIP and IP convergent networks in the U.S., Canada,
Europe, Australia and Asia.