Many studies carried out since the late 19th century show that cues other than left-right balance are important in sound positioning. These cues mostly arise from the physical structures, mechanisms, and what might be seen as impediments that combine to make auditory perception as effective and versatile as it is. Put another way, sounds are received differently by the listener depending on the distance and direction of their source, though we do not readily notice such differences at a conscious level. It is thanks to these cues that we are able to assign a source to an unfamiliar sound, and to tell whether a sound source is located above or below us, in front of us or behind us. Our unconscious recognition of such cues is testimony to the highly interpretive nature of auditory perception.
The study of sound positioning is the field from which Surround Sound and Ambisonics are drawn. These systems implement solutions to sound positioning that work considerably better than quadraphonic systems ever managed to. Careful use of perceptual cues based on delays and filtering creates a reasonable likeness to human perception that can work even when the listener is situated outside the area enclosed by the speakers. Thanks to these cues, it is also possible to create aural illusions that suggest contrary positions for any given sound. With proper use of these techniques, even with only two speakers, a better positioning effect may be created than with four speakers using volume balancing alone. This article shall look into some of these techniques to construct a reasonable picture of positional cues in human hearing. You will have heard of Surround Sound, but probably not Ambisonics. A good introductory resource on the subject of Ambisonics can be found at the York University Music Technology Group.
Head Related Transfer Function
HRTF is a collective term encompassing a number of phenomena. These derive from various physical structures of the aural system, and from the acoustical properties of the human body. In the simplest sense, when audio waves reach the listener, they do not all travel directly to the eardrum. They tend to reflect off the contours of the pinna (the outer ear) and the lining of the ear canal. In so doing, certain bands and notches of the audio spectrum are filtered out and resonant frequencies emerge, particularly where waves collide in the confined space of the ear canal. Sounds also pass straight through the head, so they are heard not only by the ear closest to the source, but by the other as well. After passing through a solid object, the frequency spectrum is filtered: frequencies whose wavelength (the speed of sound divided by the frequency) is less than the distance traveled through the object are reduced in amplitude. Thus, the "second copy" of the sound is of a far mellower timbre. Sounds are also reflected off the shoulders towards the ears, and are refracted through the torso towards the head.
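As a rough illustration of this rule of thumb, the relationship between frequency, wavelength, and the size of an obstruction can be sketched in a few lines of Python. The head width and the simple shorter-than-the-path threshold are illustrative assumptions, not measured values:

```python
# Rule of thumb from above: a solid object mainly attenuates
# frequencies whose wavelength is shorter than the distance the
# wave must travel through it.
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def wavelength_m(freq_hz):
    """Wavelength in meters of a sound wave in air."""
    return SPEED_OF_SOUND / freq_hz

def is_attenuated(freq_hz, path_length_m):
    """True if the wavelength is shorter than the path through the object."""
    return wavelength_m(freq_hz) < path_length_m

HEAD_WIDTH_M = 0.18  # assumed average width of a human head

print(is_attenuated(8000.0, HEAD_WIDTH_M))  # treble: True (filtered)
print(is_attenuated(200.0, HEAD_WIDTH_M))   # bass: False (passes largely intact)
```

By this reckoning, a head-sized obstruction begins to matter somewhere below 2kHz, which is consistent with the "mellower second copy" described above.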
The patterns of treble reduction, notches and resonances are unique and consistent for any direction from which a sound may come; they can, therefore, serve as signifiers of direction. Thankfully, HRTF parameters are broadly similar for most people, which makes this a field of study with a practical end. However, they vary sufficiently from one individual to another to make it unwise to apply the findings of any HRTF measurement in the minutest detail. The most worthwhile application, then, will be based more on a moving average showing the more significant features of a frequency response than on raw data and precise details.
The most common means of studying HRTF and related phenomena is with the use of a KEMAR dummy: a sort of mannequin with internal acoustical qualities similar to those of a normal human body. Two microphones are inserted in the head, standing in for the eardrums. The frequency responses of the microphones are known, so that results may be calibrated accordingly. A sound source (a speaker) is then placed at precise, evenly spaced positions in a series of arcs, which are taken to represent every possible direction. One such measurement report, including its raw results, is available from the MIT Media Lab. The experimenters, Bill Gardner and Keith Martin, have provided sound files that they recorded with the microphones in the dummy. You can download them from the site. I have spent some time performing Fourier analyses on each one of the files in the compacted data collection (assembled specially for multimedia developers, etc.). Each analysis has been screen captured and saved as a 16-color GIF. You can link across to the index page for that data here.
The pinna works differently for low and high frequency sounds. For low frequencies, it behaves like a reflector dish, directing sounds toward the ear canal. For high frequencies, its role is more sophisticated. While some of the sounds that enter the ear travel directly to the canal, others reflect off the contours of the pinna first and enter the ear canal after a minute delay. This delay translates into phase cancellation: the frequency component whose wave period is twice the delay period is virtually eliminated, and neighboring frequencies are significantly attenuated. This is known as the pinna notch, since the pinna creates a notch filtering effect. Depending on the source position, and allowing for the differences between listeners' ear shapes, this notch may have a central frequency anywhere from 6kHz to 16kHz. While the pinna notch is more pronounced for sounds coming from in front than for sounds coming from above (because the pinna's reflective capabilities are greater for frontal sounds), it is established that the pinna is the primary source of cues for the perception of elevation. The finer structures of an individual pinna also affect its frequency response, which differs depending on the direction of the sound. So, too, does the actual material of the pinna, whose skin and cartilage absorb some of the energy of higher frequencies.
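The arithmetic behind the pinna notch is simple enough to sketch: cancellation occurs where the reflection's delay equals half the wave period, so the notch frequency follows directly from the extra path length. The 2cm path used below is an illustrative assumption:

```python
SPEED_OF_SOUND = 343.0  # m/s

def pinna_notch_hz(extra_path_m):
    """Central frequency of the first notch caused by a reflection
    whose path is extra_path_m longer than the direct path.
    Cancellation occurs where the delay is half the wave period."""
    delay_s = extra_path_m / SPEED_OF_SOUND
    return 1.0 / (2.0 * delay_s)

# An assumed extra path of 2cm off the pinna puts the notch near
# 8.6kHz, inside the 6kHz-16kHz range mentioned above.
print(round(pinna_notch_hz(0.02)))  # -> 8575
```

Varying the path length over the centimeter scale of a real pinna moves the notch across exactly the range of center frequencies the text describes.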
The earliest study in the field of directional audio was conducted by John Strutt (better known as Lord Rayleigh), on the phenomena of Interaural Time Difference (ITD) and Interaural Intensity Difference (IID). Together, these are known as the Duplex Theory. The concept of ITD is based on the fact that a sound coming from one side of the listener reaches the nearer ear before the other, creating a fractional delay of up to 0.6ms to 0.7ms. This is quite easily perceived subconsciously and allows horizontal positional judgement in many cases.
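A common back-of-the-envelope estimate of ITD (not part of the original Duplex Theory discussion here, but standard in the literature) is Woodworth's spherical-head formula. A sketch in Python, assuming an average head radius of about 8.75cm:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS_M = 0.0875  # assumed average head radius

def itd_seconds(azimuth_rad):
    """Woodworth's spherical-head estimate of the interaural time
    difference for a source at the given azimuth
    (0 = straight ahead, pi/2 = directly to one side)."""
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

# A source directly to one side gives roughly the figure cited above.
print(round(itd_seconds(math.pi / 2) * 1000, 2))  # -> 0.66 (ms)
```

A source straight ahead gives zero delay, and the delay grows smoothly to its maximum as the source moves to the side, which is what makes ITD usable as a continuous left-right cue.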
Sounds with a broad frequency content lend themselves to directional perception by IID as well. Here, the reason for proposing a duplex system becomes apparent. A solid object filters frequencies whose wavelength is equal to or less than the length of the cross-section through which they pass. Higher frequencies passing through a human head can be attenuated by more than 20dB. Thus, what reaches the opposite ear is heavily filtered. IID is the discernment of this filtered version of the sound at the opposite ear.
In the case of a sound with entirely treble content, this filtration acts mainly as attenuation, so ITD is of less use with such sounds. In these cases, however, IID still appears to provide the brain with a reliable guide. Sounds with mostly bass content can be directionally judged by ITD, but not by IID. Any sound with no content above 250Hz cannot be directionally judged at all. This is why a single subwoofer in a stereo or multi-channel system works just as well for bass content as having woofers in all speaker cabinets. It is also the reason for the joint stereo option in MPEG compression, which mixes all bass into a central mono channel as an additional means of data reduction. These strategies save the cost of equipment, the time of data transfer, and the cost of data storage, but all have one disadvantage: the loss of the stereo (or multi-channel) imaging properties lent by low-frequency content. By concentrating all bass in one location, some recordings may obtain a neater, cleaner sound, but will lose some of the original naturalness or ambiguity of sonic image that can distinguish them and add interest for the listener.
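As a minimal sketch of how the two duplex cues can be synthesized together, the following hypothetical function delays the far-ear channel (ITD) and runs it through a one-pole lowpass as a crude stand-in for head shadowing (IID). Real systems use measured HRTF filters; this is only a toy model, and all the parameter values are assumptions:

```python
def head_shadow(signal, itd_samples, alpha):
    """Toy positioning of a mono signal: the near ear hears the signal
    as-is; the far ear hears it delayed by itd_samples (ITD) and
    smoothed by a one-pole lowpass (a crude IID/head-shadow stand-in).
    alpha in (0, 1]: smaller values mean heavier treble loss."""
    near = list(signal)
    delayed = ([0.0] * itd_samples + list(signal))[:len(signal)]
    far, y = [], 0.0
    for x in delayed:
        y = alpha * x + (1.0 - alpha) * y  # one-pole lowpass
        far.append(y)
    return near, far

# An impulse panned hard to one side: the far ear receives it two
# samples later, with its attack softened by the filtering.
near, far = head_shadow([1.0, 0.0, 0.0, 0.0, 0.0], itd_samples=2, alpha=0.5)
print(near)  # [1.0, 0.0, 0.0, 0.0, 0.0]
print(far)   # [0.0, 0.0, 0.5, 0.25, 0.125]
```

Even this crude pair of cues produces a stronger sense of lateral position than amplitude panning alone, which is the point made at the start of this article.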
The means of determining distance are somewhat different from the foregoing. Whereas all of the above has to do with our own physical structure, distance cues have to do with external factors. The most recognizable of these is the amount of reverberation. Sound intensity drops in proportion to the square of the distance from the source. Given the greater distance traveled by reflected sounds, a greater proportion of the "dry" direct sound is heard from a source situated close by. In the case of a sound coming from a great distance in a highly ambient environment, the bulk of what is heard will have been reflected off walls and other surfaces. Thus, the balance of reverberation and "dry" source sound is a valid cue for distance in an enclosed area. The acoustical nature of the environment must, however, play a major role in this interpretive process. This suggests that either the presence of other sounds with known positions, or a familiarity with the behavior of the given type of environment, is important in subconscious calibration for precise judgement.
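The inverse-square relationship translates into a simple decibel figure: roughly 6dB of level drop per doubling of distance in a free field. A quick sketch:

```python
import math

def level_drop_db(near_m, far_m):
    """Drop in sound pressure level, in dB, when the distance to a
    source grows from near_m to far_m in a free field (inverse-square
    law for intensity, i.e. about 6dB per doubling of distance)."""
    return 20.0 * math.log10(far_m / near_m)

print(round(level_drop_db(1.0, 2.0), 1))   # -> 6.0
print(round(level_drop_db(1.0, 10.0), 1))  # -> 20.0
```

In a room, the reverberant field stays roughly constant while the direct sound falls off this way, which is why the dry/reverberant balance shifts so markedly with distance.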
Another cue is the complexity and treble content of a soundwave, relative to what might be expected. Air particles move in all directions, in an uncontrolled fashion. Each particle has its own vector and is already in motion before being incorporated into a soundwave. Thus, particles combine their original motion vectors with those imparted by the impact of particles carrying a soundwave compression. Moreover, the impact of any two particles is rarely direct, and so causes particles to scatter rather than move in a straight line, identically to one another. Thus, the shape of the wave is not the actual movement of all the particles in its path, but only the commonality, or average vector, of all the individual movements.
The integrity of the soundwave, then, is compromised, with tiny compressions and rarefactions being erased, fading into inaudible noise, leaving only the major movements intact. The longer the distance the sound must travel, the greater this effect, as countless trillions of impacts occur with each meter travelled. This means that a soundwave that has traveled hundreds of meters contains appreciably less of the smaller detail than it had when first emitted. These smaller details are partly responsible for the clarity of a sound, which in music lends sonic beauty. This is one of the differences between the sound obtained by close-miking and distant-miking. A similar effect can be observed in the differing qualities of a single violin and an entire violin section. Synthesizers, since the beginning, have used plain, filtered sawtooth waves to mock up the sound of a violin section, yet solo violins were left well alone in the days before multisampled instruments became feasible.
In any sound, the smaller details are mostly in the treble range: a shorter wave period suffers the same fate as a smaller amplitude. Conversely, subsonic waves can travel for kilometers. (Seabirds make use of this phenomenon to know in which direction to find the sea while they are several kilometers inland.) Therefore, a more distant sound is also slightly mellower. In engineering for music, the filtering contour involved in mimicking this seems best set with a cutoff in the lower midrange and an even slope going right up to the maximum frequency. Some flattening out of the frequency response in the high treble range does not appear to reduce the overall effect, and is useful for retaining an amount of brightness.
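A one-pole lowpass is perhaps the simplest way to experiment with this kind of distance-mellowing effect. Its gentle 6dB-per-octave slope is only a rough approximation of the contour described above, and the cutoff value here is an illustrative assumption:

```python
import math

def distance_lowpass(signal, cutoff_hz, sample_rate):
    """One-pole lowpass: a crude stand-in for the treble a sound
    loses over distance. A low cutoff suggests a far source, a
    high cutoff a near one."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # RC time constant
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out

# A "distant" setting: cutoff in the lower midrange (value assumed).
far_version = distance_lowpass([1.0] * 2000, cutoff_hz=500.0, sample_rate=44100.0)
```

Sweeping the cutoff downward as a sound "moves away", combined with the reverberation balance discussed earlier, gives a much stronger distance impression than volume changes alone.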
In this article we have concentrated on a few specifics that account for some of the major aspects of sound positioning. The pinna is used in the judgement of elevation and front-rear positioning, while IID and ITD are involved in left-right positioning. Distance is judged primarily by the physical changes that occur to a sound as it travels. Other sources of positional cues exist, such as shoulder reflection and torso refraction, but we have not addressed them here. A developer or engineer can use these factors to very good effect, including in the creation of auditory illusions, by combining contradictory cues or creating patterns of cues that do not associate with each other in natural circumstances.