An Introduction to Companding: Compressing Speech for Transmission Across Telephone Systems
This article introduces the topic of companding, covering the digitization, transmission, and recovery of human speech across telephone systems and the analog-to-digital conversion that makes it possible.
A Brief Background
Telephone systems have been in high demand since their invention and have evolved from public switched telephone networks (PSTNs) to modern wireless digital mobile systems. Pulse-code modulation (PCM) systems, based on analog-to-digital conversion, have been in use for the past six decades. Irrespective of the encoding used, all telephone systems work by exploiting basic facts about the human speech and hearing mechanisms.
Human Speech and Hearing Mechanism
Speech is a natural communication mechanism among human beings. Words are composed of various phonemes, individual sounds that vary in amplitude, with quieter phonemes occurring more frequently than louder ones. The fundamental frequency of human speech generally falls within the range of 70 Hz to 400 Hz, with harmonics and formants extending higher, while human hearing spans roughly 20 Hz to 20 kHz. Our hearing is selective, offering the highest sensitivity to sounds in the range of 300 Hz to 10 kHz.
These experimentally backed facts lead to the conclusion that when a speech signal is captured within the band of 0.3 to 3.4 kHz, the information conveyed by the speaker remains readily intelligible to the listener.
Figure 1. The "Speech Banana" shows phonemes and their frequencies at various amplitudes required for recognition. Image courtesy of Clear Value Hearing.
When hearing ability is expressed on the decibel scale, it ranges from 0 dB SPL (the threshold of hearing) to 130 dB SPL (the threshold of pain).
This corresponds to an enormous ratio between the lowest and highest audible amplitudes. Loosely speaking, lower-amplitude sounds are whispers and higher-amplitude sounds are shouts; however, even normal conversational speech varies considerably in amplitude because it is composed of different phonemes. Further, the quieter phonemes carry more information, and have higher entropy, than the louder ones.
A PCM-Based Telephone System without Companding
Telephone systems first emerged as analog and have since become digital. Whatever we speak must therefore be digitized and then transmitted, and the original analog speech signal must be recovered at the receiving end. Converting an analog signal into digital form comprises three important phases: sampling, quantization, and coding.
Sampling of a Speech Signal
Sampling is the process of converting an original signal, defined at all instants of time, into a discrete signal defined only at specific instants of time.
How do we decide at which points to define the signal?
We begin by considering the basic but very important fact that we are not only interested in transmitting the signal from the sender, but also in recovering the signal at the receiver.
The result governing this process is the well-known Nyquist theorem, which states that faithful recovery of the transmitted signal is feasible only when it is sampled at a rate of at least twice the highest frequency contained within it.
So, if the highest frequency present is $f$, the rate at which we sample the signal must be greater than or equal to $2f$. This, in turn, means that we must define our signal at time instants spaced no more than $1/(2f)$ apart, since the sampling interval is the reciprocal of the sampling rate.
From the discussion in the previous section, we know that the band of interest for telephone conversation spans 0.3 to 3.4 kHz. Successful transmission also demands guard bands, which extend the overall range to 0 to 4 kHz. Thus, in our case, a sampling rate of 8 kHz (= 2 × 4 kHz) is a good choice.
This indicates that, after sampling, the speech signal is discretized along the time axis, with adjacent samples spaced $$ \frac{1}{8\;\text{kHz}}=125\;\text{µs} $$ apart.
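To make these numbers concrete, here is a minimal Python sketch (the 1 kHz test tone and 10 ms duration are illustrative choices, not values from the article) that samples a tone at the 8 kHz telephony rate and confirms the 125 µs spacing:

```python
# Sampling sketch: discretize a band-limited tone at the 8 kHz telephony rate.
import numpy as np

fs = 8_000                    # sampling rate in Hz (2 x 4 kHz, per Nyquist)
Ts = 1 / fs                   # sampling interval: 125 microseconds
f_tone = 1_000                # a 1 kHz test tone, inside the 0.3-3.4 kHz band
duration = 0.01               # capture 10 ms of signal

t = np.arange(0, duration, Ts)             # sample instants, 125 us apart
samples = np.sin(2 * np.pi * f_tone * t)   # the discrete-time signal

print(f"interval = {Ts * 1e6:.0f} us, samples in 10 ms = {len(samples)}")
```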
Quantization and Coding of Speech Signal
Note that sampling digitizes the signal along the time axis only (refer to the typical example shown in Figure 2, in which a red sinusoidal signal is converted into a blue discrete-valued signal by sampling). However, to make the speech signal completely digital, we must also discretize it along its amplitude axis, a process known as quantization.
Figure 2. Sampling of a sine wave
Now, our next question is very similar to the one we asked about sampling: how do we decide where to define our signal along its amplitude axis? In other words, what should be the spacing between the levels at which we define the amplitude of our signal (technically termed the step size)?
Here too, the step size must be chosen so that the signal arriving at the receiver is minimally distorted. With that in mind, suppose we choose a very small step size to quantize a low-amplitude signal (a sine wave alternating between +1 and -1, shown in pink in Figure 3a). Smaller steps mean the signal is defined at very close intervals along its amplitude axis (Figure 3a), so a very large number of steps is required to define it. That, in turn, requires a large number of bits to code, which demands a large bandwidth.
Figure 3. Quantization of a low-amplitude sine wave with (a) a small step size and (b) a large step size
With bandwidth in mind, suppose instead that we use too few steps to define our signal. A lower number of steps implies large spacing between the amplitude levels at which the signal is defined, so the signal is captured only coarsely (Figure 3b). This creates problems when we reconstruct the signal at the receiver, as much of the information present is lost during quantization. The sketch below puts numbers to this trade-off.
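Here is a minimal Python illustration, assuming a mid-tread uniform quantizer; the step sizes of 0.1 and 1.0 are illustrative stand-ins for the "small" and "large" steps of Figure 3, which the article does not quantify:

```python
# Uniform quantization sketch: round each sample to the nearest multiple
# of the step size. Step values are illustrative, not from the article.
import numpy as np

def quantize(x, step):
    """Mid-tread uniform quantizer: snap samples to multiples of `step`."""
    return step * np.round(x / step)

t = np.linspace(0, 1e-3, 200)                 # one cycle of a 1 kHz tone
low_amp = np.sin(2 * np.pi * 1_000 * t)       # +/-1 sine, as in Figure 3

fine   = quantize(low_amp, step=0.1)          # many small levels (Figure 3a)
coarse = quantize(low_amp, step=1.0)          # few large levels (Figure 3b)

for name, q in [("small step", fine), ("large step", coarse)]:
    rms_err = np.sqrt(np.mean((low_amp - q) ** 2))
    print(f"{name}: RMS quantization error = {rms_err:.3f}")
```

The coarse quantizer's error comes out roughly an order of magnitude larger, mirroring the distortion visible in Figure 3b.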
Next, we analyze the effect of varying the step size for large-amplitude signals. This matters in the present context because, as discussed in the section on the human speech and hearing mechanism, our signal of interest, speech, comprises a wide range of amplitudes.
Figure 4 examines the effect of quantization using the same step sizes as Figure 3 when the amplitude increases by a factor of four (the original sine wave in Figure 4 swings between +4 and -4). Figure 4a re-emphasizes that a smaller step size is always better when we need to replicate the original signal exactly.
Figure 4. Quantization of a large-amplitude sine wave with (a) a small step size and (b) a large step size
Another important point to note is that the quantized signal in Figure 4b is not as distorted as the quantized signal shown in Figure 3b. That is, quantization with a large step size still produces acceptable results when the signal amplitude is higher. The step size that proved 'really large' for a low-amplitude signal is not 'that large' for a large-amplitude one. In other words, the higher the amplitude of the signal, the larger the step size that can be used to quantize it without introducing excessive distortion.
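The same observation can be checked numerically. In this sketch (again with illustrative values: one step size of 1.0 applied to the ±1 and ±4 sines of Figures 3 and 4), the signal-to-quantization-noise ratio improves markedly for the larger signal even though the step size is unchanged:

```python
# Same step size, different amplitudes: the quantization error stays roughly
# the same size, so the larger signal suffers relatively less distortion.
import numpy as np

def quantize(x, step):
    return step * np.round(x / step)

t = np.linspace(0, 1e-3, 200)
for amplitude in (1.0, 4.0):               # the Figure 3 and Figure 4 signals
    x = amplitude * np.sin(2 * np.pi * 1_000 * t)
    err = x - quantize(x, step=1.0)        # identical step size for both
    sqnr_db = 10 * np.log10(np.mean(x**2) / np.mean(err**2))
    print(f"amplitude +/-{amplitude:.0f}: SQNR = {sqnr_db:.1f} dB")
```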
Companding: An Introduction
Any system, no matter how good, can be improved in some way or another. To find out what works better, however, the concepts and methods presently deployed must be carefully reviewed and scrutinized from different perspectives.
To accomplish this in our case, let us retrace our path through the article while pondering two important points.
First, recall that the information contained in human speech is not uniformly distributed: the quieter phonemes occur more frequently and carry more information than the louder ones. Second, note that the step size chosen to quantize a signal can be larger (without degrading quality) for higher-amplitude signals than for lower-amplitude ones.
If this is so, why not quantize the low-amplitude portions of the speech signal using smaller steps and the higher-amplitude portions using larger steps? It can be done. In fact, this technique of quantizing the speech signal using non-uniform levels is known as 'companding', a portmanteau of compressing and expanding.
Companding is the process in which the signal is coded using unequal quantization levels: a large number of small levels codes the low-amplitude parts of the signal, while a small number of large levels codes the high-amplitude parts. By using companding, we can quantize the speech signal with fewer levels overall while maintaining the required fidelity. Fewer levels mean fewer bits to code, which in turn implies a reduced bandwidth requirement.
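As a preview of how this plays out in practice, here is a minimal sketch of the compress-quantize-expand pipeline. The µ-law curve used below is one standard companding characteristic in PCM telephony (µ = 255 in North American systems); the ±0.05 "quiet" test signal and the 16-level quantizer are illustrative choices, not values from the article:

```python
# Companding sketch: compress amplitudes before uniform quantization, then
# expand after, so small signals effectively see small quantization steps.
import numpy as np

MU = 255.0  # mu-law parameter used in North American PCM telephony

def compress(x):
    """mu-law compression: boosts small amplitudes ahead of quantization."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    """Inverse mu-law: restores the original amplitude scale."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

def quantize(x, levels):
    """Mid-tread uniform quantizer over [-1, 1] with roughly `levels` steps."""
    step = 2.0 / levels
    return step * np.round(x / step)

t = np.linspace(0, 1e-3, 200)
quiet = 0.05 * np.sin(2 * np.pi * 1_000 * t)   # a low-amplitude "phoneme"

plain     = quantize(quiet, levels=16)                    # uniform only
companded = expand(quantize(compress(quiet), levels=16))  # compress-quantize-expand

for name, q in [("uniform", plain), ("companded", companded)]:
    sqnr_db = 10 * np.log10(np.mean(quiet**2) / np.mean((quiet - q) ** 2))
    print(f"{name}: SQNR = {sqnr_db:.1f} dB")
```

With only 16 uniform levels spanning ±1, the ±0.05 signal falls entirely inside the quantizer's zero step and is wiped out, while the companded version survives with a usable SQNR: exactly the small-steps-for-small-amplitudes behavior described above.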
Conclusion
This article introduced the concepts related to human speech and its characteristics with respect to PCM-based telephone systems. I hope you've gained a basic understanding of companding and its importance in the field of telecommunications.
The details of companding techniques and their other advantages will be covered in the next article in this series.