# Introduction to Distributed Arithmetic

## This article will review the distributed arithmetic which is an interesting method of efficiently implementing multiply-and-accumulate operations.

This article will review the distributed arithmetic (DA) which is an interesting method of efficiently implementing multiply-and-accumulate operations.

DA recognizes some frequently used values of a multiply-and-accumulate operation, pre-computes these values, and stores them in a lookup table (LUT). Reading these stored values from ROM rather than calculating them leads to an efficient implementation. It should be noted that the DA method is applicable only to cases where the multiply-and-accumulate operation involves fixed coefficients.

### Distributed Arithmetic

Consider calculating the following expression:

$$y = \sum_{i=1}^{N} c_i x_i$$

where the $$c_i$$ coefficients are fixed-valued and $$x_i$$ represents the inputs of the multiply-and-accumulate operation. Assume that these inputs are in two’s complement format and are represented by b+1 bits. Moreover, assume that they are scaled and are less than 1 in magnitude. To keep things simple, we’ll consider the above equation for N=3. Hence,

$$y = c_1 x_1 + c_2 x_2 + c_3 x_3$$

*Equation 1*

*Equation 1*

Since the inputs are in two’s complement format, we can write

$$x_1 = -x_{1,0} + \sum_{j=1}^{b} x_{1,j} 2^{-j}$$

$$x_2 = -x_{2,0} + \sum_{j=1}^{b} x_{2,j} 2^{-j}$$

$$x_3 = -x_{3,0} + \sum_{j=1}^{b} x_{3,j} 2^{-j}$$

where $$x_{1,0}$$ is the sign bit of $$x_1$$ and $$x_{1,j}$$ is the *j*th bit to the right of the sign bit. The same notation is used for $$x_2$$ and $$x_3$$. If you need help with deriving these equations, read the section entitled “An Important Feature of the Two’s Complement Representation” in my article, "Multiplication Examples Using the Fixed-Point Representation" and note that we have assumed $$ |x_i| <1$$.

Substituting our last three equations into Equation 1 gives

$$\begin{align}

y = &- x_{1,0} c_1 + x_{1,1} c_1 \times 2^{-1} + \dots + x_{1,b} c_1 \times 2^{-b} \\

&- x_{2,0} c_2 + x_{2,1} c_2 \times 2^{-1} + \dots + x_{2,b} c_2 \times 2^{-b} \\

&- x_{3,0} c_3 + x_{3,1} c_3 \times 2^{-1} + \dots + x_{3,b} c_3 \times 2^{-b}

\end{align}$$

*Equation 2*

*Equation 2*

How can we use a LUT to efficiently implement these calculations?

For now, let’s ignore the $$2^{-j}$$ terms of Equation 2 and look at this equation as a summation of some columns rather than a summation of some rows. For example, the second column of Equation 2 is

$$y_1 = x_{1,1} c_1 + x_{2,1} c_2 + x_{3,1} c_3$$

How many distinct values are there for this expression? Note that $$x_{1,1}$$, $$x_{2,1}$$, and $$x_{3,1}$$ are one-bit values. Hence, $$y_1$$ can have only eight distinct values as given in Table 1 below:

*Table 1*

*Table 1*

Ignoring the $$2^{-b}$$ term of the last column, we have

$$y_b = x_{1,b} c_1 + x_{2,b} c_2 + x_{3,b} c_3$$

Again, we can only have the eight distinct values of Table 1. As you can see, the columns of Equation 2 involve calculating the function given by Table 1 (provided that we ignore the minus sign of the first column and the $$2^{-j}$$ terms). Instead of repeatedly calculating this function, we can pre-calculate the values of $$y_1$$ and store them in a LUT, as shown in the following block diagram:

*Figure 1*

*Figure 1*

As shown in the figure, the *j*th bit of all the input signals, $$x_1$$, $$x_2$$, $$x_3$$, will be applied to the LUT, and the output will be $$y_j$$. The output of the ROM is represented by *l* bits. *l* must be large enough to store the values of Table 1 without overflow.

Now that the LUT is responsible for producing the $$y_j$$ terms, we can rewrite Equation 2 as

$$y = - y_0 + 2^{-1} y_1 + 2^{-2} y_2 + \dots + 2^{-b} y_b$$

Therefore, we need to take the $$2^{-j}$$ terms into account and note that the first term must be subtracted from the other terms.

Let’s assume that we are using only five bits to represent the $$x_i$$ signals, i.e., $$b=4$$. Hence,

$$y = - y_0 + 2^{-1} y_1 + 2^{-2} y_2 + 2^{-3} y_3 + 2^{-4} y_4$$

By repeatedly factoring $$2^{-1}$$, we can rewrite the above equation as

$$y = - y_0 + 2^{-1} \Bigg (

y_1 + 2^{-1} \bigg

( y_2 + 2^{-1} \Big ( y_3 + 2^{-1} ( y_4 + 0 \big )

\Big ) \bigg )

\Bigg )$$

Note that a zero is added to the innermost parentheses to further clarify the pattern that exists. The multiply-and-add operation is now written as a repeated pattern consisting of a summation and a multiplication by $$2^{-1}$$. We know that multiplication by $$2^{-1}$$ can be implemented by a one-bit shift to the right. Therefore, we can use the ROM shown in Figure 1 along with a shift register and an adder/subtractor to implement the above equation. The simplified block diagram is shown in Figure 2.

*Figure 2*

*Figure 2*

At the beginning of the calculations, the shift register SR is reset to zero and the other shift registers are loaded with the appropriate inputs. Then, the registers $$x_1$$, $$x_2$$, and $$x_3$$ apply $$x_{1,4}$$, $$x_{2,4}$$, and $$x_{3,4}$$ to the ROM. Hence, the adder will produce $$acc=a+b=y_4+0=y_4$$. This value will be stored in the SR, and a one-bit shift will be applied to take the $$2^{-1}$$ term into account. (As we’ll see in a minute, the output of the adder/subtractor will generate the final result of the algorithm by gradually accumulating the partial results. That’s why we’ve used “acc”, which stands for accumulator, to represent the output of the adder/subtractor.)

So far, $$2^{-1}(y_4+0)$$ has been generated at the output of the SR register. Next, the input registers will apply $$x_{1,3}$$, $$x_{2,3}$$, and $$x_{3,3}$$ to the ROM. Hence, the adder will produce $$acc=a+b= y_3+2^{-1}(y_4+0)$$. Again, this value will be stored in the SR and a one-bit shift will be applied to take the $$2^{-1}$$ term into account, which gives $$2^{-1}(y_3+2^{-1}(y_4+0))$$. In a similar manner, the sum and shift operations will be repeated for the next terms, except that for the last term, the adder/subtractor will be in the subtract mode.

Note that the number of shift-and-add operations in Figure 2 does not depend on the number of input signals N. The number of inputs affects only the size of the ROM’s address input. This is a great advantage of the DA technique over a conventional implementation of a multiply-and-add operation, i.e., an implementation in which partial products are generated and added together. However, a large N can lead to a slow ROM and reduce the efficiency of the technique.

In the DA architecture, the number of shift-and-add operations depends on the number of bits used to represent the input signals, which in turn depends on the precision that the system requires.

### Conclusion

DA recognizes some frequently used values of a multiply-and-accumulate operation, pre-computes these values and stores them in a lookup-table (LUT). Reading these stored values from ROM rather than calculating them leads to an efficient implementation. It should be noted that the DA method is applicable only to cases where the multiply-and-accumulate operation involves fixed coefficients.

1 CommentThank you very much Sir for your valuable explanation. I understood the concept the way u explained in lucid manner.

Thanks…