### Representing Fractional Numbers on Low-Cost DSPs

To lower the cost of the implementation, many digital signal processors are designed to perform arithmetic operations only on integer numbers. To represent fractional numbers on these processors, we can use an implied binary point.

For example, the eight-bit word $$a=01010110_{2}$$ represents $$86_{10}$$ when interpreted as an integer number; however, we can consider an implied binary point for $$a$$ and interpret it as a fractional number.

Assume that the binary point is between the fourth and fifth bit positions, i.e., $$a=0101.0110_{2}$$. Hence, we obtain the equivalent decimal value of this number as

$$a=0 \times 2^{3}+1 \times 2^{2}+0 \times 2^{1}+1 \times 2^{0}+0 \times 2^{-1}+1 \times 2^{-2}+1 \times 2^{-3}+0 \times 2^{-4}=5.375$$

With this interpretation, we are using four bits to represent the integer part of the number and another four bits to represent the fractional part.

As you can see, the first bit position to the right of the binary point has a weight of 0.5, the second bit position has a weight of 0.25, and so on. Note that this implied binary point is not specified in the hardware. And the programmer needs to consider an appropriate scaling factor to interpret the result of the calculations correctly. For the above example, the hardware uses eight storage elements to represent the eight-bit word $$a=01010110_{2}$$. Now, if the programmer aims to represent $$5.375$$ with $$a$$, he/she should remember that a scaling factor of $$2^{-4}$$ must be appropriately applied to the result of any calculation using the variable $$a$$.

As another example, we can consider the binary point of $$a$$ between the fifth and sixth bit positions, i.e. $$a=010.10110_{2}$$. In this case, the equivalent decimal value will be $$a=2.6875_{10}$$.

### The Q Format

Depending on where the binary point is assumed to be, a given number can be interpreted as several different values. To make programming simpler, we generally use a fixed binary point throughout the algorithm. To easily specify how many bits are used to represent the integer and fractional parts of the number, we use a notation called the Q format. For example, to specify that we are using three bits for the integer part and four bits for the fractional part, we may say that the numbers are in Q3.4 format.

Another possible notation is to specify only the length of the fractional part. This is based on the assumption that the wordlength is known for a given processor. For example, when working with a processor which has a wordlength of 16 bits, we may simply say that we are using the Q15 format to represent the numbers. This means that we are putting 15 bits to the right of the binary point and one bit to its left. In this case, the Q15 format is equivalent to the Q1.15 format.

### Choosing the Position of the Binary Point

To choose the position of the binary point, we need to consider two main factors:

- The largest number that we may need to represent in a given algorithm
- The tolerable quantization noise

The former specifies how many bits must be used for the integer part and the latter determines the length of the fractional part.

Note that, apart from using an implied scale factor, the Q format has nothing new compared to the well-known concept of representing numbers on a digital computer. As a result, we can use the Q format to represent signed two’s complement numbers, too. In this case, we only need to allocate the most significant bit (MSB) to the sign and use the two’s complement form for the negative numbers.

### Examples of the Q Format

Now that we have a basic understanding of what the Q format is, let's see some examples of using it.

#### Example 1:

*Assume that an algorithm tested using floating-point arithmetic involves operations on $$a=9.216957436091198_{10}$$. Now that we are satisfied with the performance of the algorithm in floating-point representation, we have decided to implement it on a low-cost fixed-point processor which has a wordlength of 16 bits. What would be the appropriate Q format to represent $$a$$ on this processor? *

Since the integer part of $$a$$ is between the decimal values $$8$$ and $$16$$, we need at least four bits to represent the integer part of the number. Assuming that we are working with signed numbers, we can consider five bits to the left of the binary point and use the remaining bits for the fractional part. Hence, we can use the Q5.11 format.

In this format, the achieved binary representation will be interpreted with a scaling factor of $$2^{-11}$$. In other words, the Q5.11 representation of $$a$$ without the implied binary point is equal to $$a$$ multiplied by $$2^{11}$$. Hence, to represent $$a$$ in the Q5.11 format, we multiply it by $$2^{11}$$, round it to the nearest integer, and convert the rounded result into the binary form. We obtain

$$a \times 2^{11}=18876.3288 \approx 18876 = 100\;1001\; 1011\; 1100_{(2)}$$

Since $$a$$ is positive, we only need to consider a sign bit of zero. Therefore, the Q5.11 format of the number will be $$01001.00110111100$$. For a negative number, we would have to first find the Q format of the absolute value and, then, convert it to the two’s complement representation to take negative sign into account.

#### Example 2:

*Assume that $$a=10.01_{2}$$ is a signed number in Q2.2 format. What is the decimal equivalent of $$a$$?*

Since the sign bit is one, the equivalent decimal value will be the opposite of the two’s complement of $$a$$ which is $$01.11_{2}$$. Hence, we have $$a=-1.75_{10}$$.

You can find more examples of quantizing numbers and also the effect of this quantization on the performance of a digital filter in this article.

### Sign Extension

When adding two signed numbers, the addend and augend may have different lengths. In this case, we have to extend the sign bit of the shorter number otherwise the result may not be correct.

For example, extending the sign of $$1011_{2}$$ by two bits, we obtain $$111011_{2}$$. How is this replication of the sign bit justified? We know that putting zeros at the left of a positive number will not change its value.

To understand why sign extension does not change the value of a negative number, we should remember that, in the two’s complement representation, a negative number is defined with regard to a complementation constant. Working with k-bit numbers, the complementation constant of the two’s complement representation is $$M=2^{k}$$.

As an example, when working with four-bit numbers, the complementation constant will be $$M=2^{4}$$ and the opposite of a four-bit number, $$b$$, will be represented by $$M-b$$. Let’s see how we can represent this signed four-bit number using six bits.

To represent the opposite of $$b$$ with six bits, we should use $$M'=2^{6}$$ and represent $$-b$$ by $$M'-b$$. Hence, the difference between the six-bit representation and the four-bit representation is $$(M'-b)-(M-b)=M'-M=2^{6}-2^{4}=0110000_{2}$$. This shows that, in order to represent a four-bit negative number with six bits, we only need to add $$0110000_{2}$$ to the four-bit representation—or, equivalently, we could extend the sign by two bits.

In simpler terms, by replicating the sign bit of a number, we are only changing the complementation constant and the value of the number is not changed. You can also verify this statement by calculating the decimal equivalent of both the four-bit and six-bit representations of this example. The decimal equivalent of $$1011_{2}$$ is obtained as the opposite of its two’s complement which gives $$-(0101_{2})=-5_{10}$$. Similarly, the decimal value of $$111011_{2}$$ is $$-(000101_{2})=-5_{10}$$. As you can see, the equivalent decimal value does not change with sign extension.

### Addition in Q Format

To add two numbers in Q format, we should first align the binary point of the two numbers and sign extend the number that has shorter integer part. Let’s see an example:

#### Example 3:

*Calculate $$a+b$$, if $$a=-1.25$$ (note that $$-1.01_{2}$$ becomes $$10.11_{2}$$ in two’s complement notation) and $$b=+3.25$$ (which becomes $$011.010_{2}$$ in two’s complement representation). As you can see $$a$$ and $$b$$ are two signed numbers in Q2.2 and Q3.3 formats, respectively.*

We should first align the binary point of the two numbers, sign extend the number with shorter integer part, and then perform the addition. We obtain

$$\begin{split} \; \; 1 \; 1 \; 0 \; &. \; 1 \; 1 \; \; \; &-1.25 \\

+ \; \; 0 \; 1 \; 1 \; &. \; 0 \; 1 \; 0 \; \; \; &+3.25 \\

\hline \\

\; 1 \; 0 \; 1 \; 0 \; &. \; 0 \; 0 \; 0 \; \; \; \; &+2

\end{split}$$

Apart from an implied scaling factor, the above addition is exactly similar to adding two numbers in two’s complement representation. Since the two’s complement representation is based on modulo-M arithmetic, it’s clear that we should discard the bit of the result which is above the sign bit. Hence, $$a+b=010.000_2=+2_{10}$$ which is consistent with decimal result shown above. Please note that, without sign extension, the result would be incorrect because, in this case, we would be actually adding $$010.11_2=+2.75_{10}$$ to $$+3.25_{10}$$ rather than adding $$10.11_2=-1.25_{10}$$ to $$+3.25_{10}$$.

When using fixed-point representation to perform arithmetic operations, we must be careful about the range of the values that can be represented with a given Q format. As an example, assume that we add $$a$$ in Qna.ma format to $$b$$ in Qnb.mb format. Similar to the above example, we can sign extend the number with a shorter integer part and represent the result in Qnc.mc format where nc=maximum{na, nb} and mc=maximum{ma, mb}.

However, we should note that there are chances for overflow because adding two N-bit numbers can lead to an (N+1)-bit result. With the mentioned Qnc.mc format for the result, we would have to make sure that overflow has not occurred. The following example further clarifies this point.

#### Example 4:

*Calculate $$a+b$$, if $$a=10.11_2$$ and $$b=100.001_2$$ are two signed numbers, respectively, in Q2.2 and Q3.3 formats.*

We should first align the binary point of the two numbers, sign extend the number with shorter integer part, and then perform the addition. We obtain

$$\begin{split} \; \; 1 \; 1 \; 0 \; &. \; 1 \; 1 \; \; \; &-1.25 \\

+ \; \; 1 \; 0 \; 0 \; &. \; 0 \; 0 \; 1 \; \; \; &-3.875 \\

\hline \\

\; 1 \; 0 \; 1 \; 0 \; &. \; 1 \; 1 \; 1 \; \; \; \; &-5.125

\end{split}$$

If we discard the fourth bit in the integer part, we obtain $$010.111_2=2.875_{10}$$. While we are adding two negative numbers, the result is positive and, hence, overflow has occurred. This is due to the fact that the minimum number that can be represented in Q3.3 format is $$100.000_2=-4_{10}$$ and the result of the addition is smaller than $$-4_{10}$$.

To circumvent the overflow from addition, we can either scale the inputs or use an adder which can process numbers in Q4.3 format. For the latter, we would have to sign-extend the numbers to have four bits in the integer part. Note that since adding two N-bit numbers can lead to an (N+1)-bit result, a Q4.3 output format will avoid overflow when adding two Q3.3 numbers together.

#### Example 5:

*Calculate $$a+b$$, if $$a=10.11_2$$ and $$b=100.001_2$$ are two signed numbers, respectively, in Q2.2 and Q3.3 formats. Assume that the adder can process numbers in Q4.3 format.*

We should first align the binary point of the two numbers. Since both $$a$$ and $$b$$ have less than four bits in the integer part, we should sign-extend them to the Q4.3 format and then perform the addition. We obtain

$$\begin{split} \; \; 1 \; 1 \; 1 \; 0 \; &. \; 1 \; 1 \; \; \; &-1.25 \\

+ \; \; 1 \; 1 \; 0 \; 0 \; &. \; 0 \; 0 \; 1 \; \; \; &-3.875 \\

\hline \\

\; 1 \; 1 \; 0 \; 1 \; 0 \; &. \; 1 \; 1 \; 1 \; \; \; \; &-5.125

\end{split}$$

Discarding the bits above the sign bit, we obtain $$a+b=1010.111_2=-5.125_{10}$$.

### Avoiding Overflow with Guard Bits

Many digital signal processors choose the size of the output register of the accumulator larger than its input registers by several bits. These extra bits are called the guard bits. The guard bits allow the programmer to avoid overflow while accumulating a number of values without even scaling the accumulator inputs. You can easily verify that an accumulator with $$n$$ guard bits allows us to accumulate $$2^n$$ values without overflow.

While we can use larger adders to prevent overflow, we cannot generally allow the wordlength to grow without limit. Hence, somewhere in our calculations, we have to truncate or round the addition results to a shorter wordlength. This generally means that we would have to allocate more bits to the integer part of the addition result so that we can represent larger values. In other words, we would have to change the position of the binary point.

### Summary

- Fixed-point representation allows us to use fractional numbers on low-cost integer hardware.
- To lower the cost of the implementation, many digital signal processors are designed to perform arithmetic operations only on integer numbers. To represent fractional numbers on these processors, we can use an implied binary point.
- If we are using n bits for the integer part and m bits for the fractional part, we may say that the numbers are in Qn.m format.
- When adding two signed numbers, the addend and augend may have different lengths. In this case, we have to extend the sign bit of the shorter number otherwise the result may not be correct.
- To add two numbers in Q format, we should first align the binary point of the two numbers, sign-extend the number that has shorter integer part, and then perform addition.
- When using fixed-point representation to perform arithmetic operations, we must be careful about the range of the values that can be represented with a given Q format.

To see a complete list of my DSP-related articles on AAC, please see this page.

0 CommentsLogin