The sound of things to come

PC audio has come a long way since the integration of audio onto computers, but greater changes are looming ahead

2. Auditory localisation cues

3D sound is what we hear in everyday life. Sounds come at us from all directions and distances, and we distinguish each sound by its pitch, tone, loudness and reverberation. The spatial location of a sound is what gives it a three-dimensional aspect, and our ability to locate sounds in space involves a complex human process. Being able to synthesise such spatial sounds would clearly add to the immersiveness of a virtual environment. To synthesise 3D audio effectively and accurately, we must understand the auditory localisation cues that allow us to determine the position of a sound source. The following sections briefly describe them.

Interaural cues

Interaural time delay (ITD) is the difference in arrival time of a sound at the two ears, while interaural level difference (ILD) is the difference in sound level at the two ears. Together they are the primary cues for judging the lateral position of a sound source. Because the ears are set some distance apart, sound from a source on the left takes a longer path to the right ear than to the left ear, and vice versa; the resulting difference in arrival time is the interaural time delay. ITD cues are effective for localising low-frequency sounds. High frequencies are localised using ILD, which lateralises a sound towards the ear receiving the greater intensity.
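As a rough illustration of the time-delay cue, the sketch below uses Woodworth's spherical-head approximation for the ITD of a distant source; the formula is a standard textbook model rather than anything from this article, and the head radius and speed of sound are illustrative assumptions.

```python
import math

# Illustrative assumptions, not measured values.
HEAD_RADIUS_M = 0.0875   # assumed average head radius in metres
SPEED_OF_SOUND = 343.0   # metres per second in air at room temperature

def itd_seconds(azimuth_deg: float) -> float:
    """Approximate ITD for a distant source at the given azimuth
    (0 = straight ahead, 90 = directly to one side), using
    Woodworth's spherical-head formula ITD = (r/c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))

# A source straight ahead produces no delay; one directly to the side
# produces a delay on the order of two thirds of a millisecond.
print(itd_seconds(0.0))
print(itd_seconds(90.0))
```

The delay grows monotonically with azimuth, which is why the auditory system can map it to a lateral angle.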

Head shadow

Sound has to pass through or around the head before reaching the far ear. The head acts as a filter, substantially attenuating higher frequencies in particular, and this shadowing contributes to the interaural level difference.
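The head-shadow filtering effect can be caricatured as a simple lowpass applied to the far-ear signal; the one-pole filter and its coefficient below are illustrative assumptions, not a measured head model.

```python
def head_shadow_lowpass(signal, alpha=0.6):
    """One-pole lowpass as a crude head-shadow sketch: the far ear
    receives a smoothed copy of the sound with its high frequencies
    attenuated. The alpha coefficient is an illustrative assumption."""
    out, prev = [], 0.0
    for s in signal:
        prev = alpha * s + (1 - alpha) * prev
        out.append(prev)
    return out

# A sharp step is smeared out, i.e. its high-frequency content is reduced.
print(head_shadow_lowpass([0.0, 1.0, 1.0, 1.0]))
```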


Pinna response

The external ear, or pinna, acts like an acoustic antenna and provides the primary cue for the elevation of a sound source. The pinna filters incoming sound, amplifying some frequencies and attenuating others, and this spectral shaping determines the perceived elevation of the sound.


Shoulder echo

Frequencies in the range of 1-3kHz are reflected from the upper torso of the body. These reflections produce echoes that arrive slightly after the direct sound, and the reflectivity is frequency dependent, adding a further localisation cue.

Head movement

Naturally, movement of the head is a key factor in determining the location of a sound source. Listeners move their heads more as frequency increases, because higher frequencies are harder to localise (they tend not to bend around objects as much as lower frequencies do).


Reverberation

Sounds in the real world arrive together with their reflections from surrounding surfaces, be they walls, floors, ceilings, furniture and so on. These reverberations affect the perceived distance and direction of a sound in space. Together, all of these cues underpin our ability to locate a sound in 3D space, so 3D sound synthesis must handle them carefully to provide convincing immersion.
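As a minimal sketch of how reverberation carries a distance cue, the snippet below mixes a single delayed, attenuated reflection into a dry signal; the delay and gain values are illustrative assumptions, not measurements from the article.

```python
def add_reflection(dry, delay_samples, gain):
    """Mix one delayed, attenuated copy of the signal into itself.
    A stronger reflection relative to the direct sound tends to be
    perceived as coming from a more distant source."""
    out = list(dry) + [0.0] * delay_samples
    for i, s in enumerate(dry):
        out[i + delay_samples] += gain * s
    return out

# A unit impulse followed by its reflection four samples later at half level.
print(add_reflection([1.0, 0.0], delay_samples=4, gain=0.5))
```

Real reverberation consists of many such reflections with direction-dependent delays and gains, which is why it colours both distance and direction perception.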

3. 3D Sound Synthesis

The technique widely adopted to synthesise 3D sound revolves around a mathematical function called the head-related transfer function (HRTF). It is a complex-valued transfer function that encodes how a sound is modified, across frequency, by its location relative to the listener. HRTFs are measured as a pair, one per ear, for a dual-channel audio system, based on the study of the human auditory cues discussed earlier. Since the ITD corresponds to the measured interaural phase difference and the ILD to the measured level difference, both cues are embedded within the HRTF. Derivatives of the HRTF are then constructed to deliver binaural audio to the ears and to eliminate crosstalk from each speaker to the opposite ear. The following sections outline the fundamentals of HRTF-based synthesis.

Binaural synthesis

Since a single pair of ears suffices to localise spatial 3D sounds in the real world, it is a logical deduction that a pair of speakers or headphones should suffice to emulate expressive 3D audio. The trick is to recreate at both ears the sound pressures (the left and right HRTFs) that would exist if the listener were actually present. The usual approach is to place two microphones in the ear canals of an acoustic mannequin with artificial pinnae and record the encoded positional information that is picked up. A recording made in this way is called a binaural recording, and it sounds realistic because the reproduced binaural signals closely resemble those produced by the human acoustic system. Mathematically, binaural signals can be accurately synthesised by filtering the input signal with the pair of HRTFs, a procedure termed binaural synthesis.
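In the time domain, binaural synthesis amounts to convolving the mono source with the left- and right-ear head-related impulse responses (HRIRs, the time-domain counterparts of the HRTFs). The toy HRIR coefficients below are invented for illustration; real ones come from measured HRTF datasets.

```python
def convolve(signal, ir):
    """Direct-form FIR convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def binaural_synthesis(mono, hrir_left, hrir_right):
    """Filter one mono signal with the per-ear HRIRs to get a stereo pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy HRIRs for a source on the listener's left: the left ear hears the
# sound earlier and louder than the head-shadowed right ear.
hrir_l = [1.0, 0.3]
hrir_r = [0.0, 0.0, 0.5, 0.2]   # delayed and attenuated
left, right = binaural_synthesis([1.0, 0.0, 0.0], hrir_l, hrir_r)
print(left)
print(right)
```

In practice the convolution is performed with a different HRIR pair for each source direction, interpolated from a measured set.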

Headphones are an ideal choice for channel separation and unquestionably simplify the task of delivering one sound to each ear. However, they have a major flaw that conflicts strongly with our area of study here: the reproduced sound field turns with the listener's head, so sources follow head movements instead of staying fixed in space. Headphones are also cumbersome and uncomfortable to wear for long periods of time. Sounds often seem too close, so much so that sources that are supposed to be ahead appear to originate inside the head. Choosing speakers over headphones avoids most of these problems, except that sound from each speaker now crosses over to the opposite ear. The solution is a technique called crosstalk cancellation.

Crosstalk cancellation

HRTF theory has been around and widespread for many years, and the technique has been applied extensively to render realism in sound reproduction. With the growing interest in 3D audio in the PC environment, HRTF processing has extended to this area of application, since measurements are readily available. Nevertheless, HRTF-based rendering has limitations in interactive 3D audio that lessen the intended immersiveness. First, when headphones are used, an undesirable in-head image is created. Second, when speakers are used, crosstalk cancellation only functions at a fixed listener location known as the sweet spot: the one and only spot where the positional information is interpreted correctly relative to the actual position of the listener. When the listener moves away from the sweet spot, the 3D spatial illusion is lost. The worst scenario is when the listener is rotated 90° from the front, as depicted.

In this orientation the determinant of the HRTF matrix, HLL·HRR - HLR·HRL, is close to zero at every frequency. The matrix is therefore singular and the inverse transfer matrix is undefined; in technical terminology, compensation for the high-order crosstalk cancellation becomes unstable. In short, there is no noticeable spatial differentiation for any sound source emitted from the speakers. Interestingly enough, although both sets of HRTFs are quite similar, HLR and HRR give more clues for directional effects than HLL and HRL, owing to head shadowing and shoulder echo. This can be verified from the HRTF plots depicted in the Appendix. A better way to recreate true 3D audio is to place additional rear speakers in the listening environment. The obvious advantage is that the listener can localise sound better, being positioned in a sound field enclosed by enough speakers.
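A crosstalk canceller can be sketched as inverting the 2x2 matrix of speaker-to-ear transfer functions at each frequency; when the determinant collapses, as in the rotated-listener case above, no stable inverse exists. The single-frequency transfer values below are invented examples, not measured HRTFs.

```python
# Crosstalk cancellation at one frequency: the ear signals are
# e = H @ s with H = [[H_LL, H_RL], [H_LR, H_RR]], so pre-filtering the
# speaker feeds with H's inverse delivers each binaural signal to one ear.
def crosstalk_canceller(h_ll, h_rl, h_lr, h_rr):
    """Return the inverse transfer matrix, or None when the determinant
    is (near) zero, the singular case in which cancellation is unstable."""
    det = h_ll * h_rr - h_lr * h_rl
    if abs(det) < 1e-9:
        return None
    return [[h_rr / det, -h_rl / det],
            [-h_lr / det, h_ll / det]]

# Ordinary listening position: direct paths dominate the crosstalk paths,
# so the determinant is comfortably non-zero and an inverse exists.
inv = crosstalk_canceller(1.0, 0.3 - 0.1j, 0.3 + 0.1j, 1.0)

# Listener rotated 90 degrees: both ears see nearly identical transfer
# functions, the determinant collapses and no stable inverse exists.
assert crosstalk_canceller(0.8, 0.8, 0.8, 0.8) is None
```

A full canceller repeats this inversion across the frequency band and converts the result back into a pair of filters for the speaker feeds.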
