Programming an FPGA (field-programmable gate array) means configuring its resources to implement a specific logic function. This involves mapping the design onto the FPGA’s basic building blocks, such as configurable logic blocks (CLBs) and dedicated multiplexers, to meet the requirements of the digital system.

During the design process, one important criterion is timing: both the delays inherent in the system and any constraints laid down by the user. One design technique that can help a designer meet these requirements is pipelining.

### What Is Pipelining?

Pipelining is a process which enables parallel execution of program instructions. You can see a visual representation of a pipelined processor architecture below.

**Figure 1.** A visual representation of a pipelined processor architecture. Each square corresponds to an instruction. The use of different colors for the squares conveys the fact that the instructions are independent of one another. Image courtesy of Colin M.L. Burnett [CC-BY-SA-3.0].

In FPGAs, this is achieved by arranging multiple data processing blocks in a particular fashion. For this, we first divide our overall logic circuit into several small parts and then separate them using registers (flip-flops).

Let's look at how an FPGA design is pipelined by considering an example.

### An Example

Let's take a look at a system of two multiplications followed by one addition on four input arrays. Our output *y*_{i} will therefore be equal to (*a*_{i} × *b*_{i} × *c*_{i}) + *d*_{i}.

#### Non-Pipelined Design

The first design that comes to mind to create such a system would be multipliers followed by an adder, as shown in Figure 2a.

**Figure 2a.** An example of non-pipelined FPGA design. Image created by Sneha H.L.

Here, we expect the sequence of operations to be the multiplication of *a*_{i} and *b*_{i} data by multiplier M_{1}, followed by the multiplication of its product with *c*_{i} by multiplier M_{2} and finally the addition of the resultant product with *d*_{i} by adder A_{1}.

Nevertheless, when the system is designed to be synchronous, at the first clock tick, only multiplier M_{1} can produce valid data at its output (*a*_{1} × *b*_{1}). This is because, at this instant, only M_{1} has valid data (*a*_{1} and *b*_{1}) at its input pins, unlike M_{2} and A_{1}.

At the second clock tick, there would be valid data at the input pins of both M_{1} and M_{2}. However, we now need to ensure that only M_{2} operates while M_{1} holds its output. If M_{1} operated at this instant, its output would change to (*a*_{2} × *b*_{2}) instead of the expected (*a*_{1} × *b*_{1}), leading to an erroneous M_{2} output of (*a*_{2} × *b*_{2} × *c*_{1}) rather than (*a*_{1} × *b*_{1} × *c*_{1}).

When the clock ticks for the third time, there would be valid inputs at all three components: M_{1}, M_{2}, and A_{1}. However, we want only the adder to operate, since we expect the output to be *y*_{1} = (*a*_{1} × *b*_{1} × *c*_{1} + *d*_{1}). This means the first output of the system will be available after the third clock tick.

Next, as the fourth clock tick arrives, M_{1} can operate on the next set of data: *a*_{2} and *b*_{2}. At this instant, M_{2} and A_{1} are expected to be idle. This has to be followed by the activation of M_{2} (*only* M_{2}) at the fifth clock tick and the activation of A_{1} (*only* A_{1}) at the sixth clock tick. This yields our next output, *y*_{2} = (*a*_{2} × *b*_{2} × *c*_{2} + *d*_{2}).

When a similar excitation pattern is followed for the components, we can expect the next outputs to occur at clock ticks 9, 12, 15 and so on (Figure 2b).
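The enable pattern described above can be sketched as a small behavioral model in Python. It follows the article's assumption that each block takes one clock tick and that only one block is active per tick; the function name `simulate_non_pipelined` and the sample data are illustrative and not part of the original design.

```python
def simulate_non_pipelined(a, b, c, d):
    """Return (tick, y) pairs for the one-block-per-tick schedule."""
    results = []
    tick = 0
    for ai, bi, ci, di in zip(a, b, c, d):
        tick += 1                 # only M1 active: a_i * b_i
        m1 = ai * bi
        tick += 1                 # only M2 active: (a_i * b_i) * c_i
        m2 = m1 * ci
        tick += 1                 # only A1 active: product + d_i
        results.append((tick, m2 + di))
    return results


# Illustrative data: three input sets
for tick, y in simulate_non_pipelined([1, 2, 3], [5, 6, 7], [2, 2, 2], [1, 1, 1]):
    print(f"tick {tick}: y = {y}")
```

With these inputs the outputs land on ticks 3, 6, and 9, which is the spacing the text describes.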

**Figure 2b.**

#### Pipelined Design

Now, let's suppose that we add registers to this design at the inputs (R_{1} through R_{4}), between M_{1} and M_{2} and between M_{2} and A_{1} (R_{5} and R_{8}, respectively), and along the direct input paths (R_{6}, R_{7}, and R_{9}), as shown in Figure 3a.

**Figure 3a.** An example of pipelined FPGA design. Image and table created by Sneha H.L.

Here, at the first clock tick, valid inputs appear only for registers R_{1} through R_{4} (*a*_{1}, *b*_{1}, *c*_{1} and *d*_{1}, respectively) and for the multiplier M_{1} (*a*_{1} and *b*_{1}). As a result, only these can produce valid outputs. Moreover, once M_{1} produces its output, it is passed on to register R_{5} and stored in it.

At the second clock tick, the values stored in registers R_{5} and R_{6} (*a*_{1} × *b*_{1} and *c*_{1}) appear as inputs to M_{2}, which enables it to produce the output *a*_{1} × *b*_{1} × *c*_{1}, while the output of R_{4} (*d*_{1}) is shifted to register R_{7}. Meanwhile, the second set of data (*a*_{2}, *b*_{2}, *c*_{2}, and *d*_{2}) also enters the system and appears at the outputs of R_{1} through R_{4}.

In this case, M_{1} is allowed to operate on its inputs, so its output line changes from *a*_{1} × *b*_{1} to *a*_{2} × *b*_{2}, unlike in the non-pipelined design. This is acceptable because a change in the output of M_{1} no longer affects the output of M_{2}: the data M_{2} needs was already latched in register R_{5} during the first clock tick (and remains undisturbed during the second clock tick).

This means that inserting register R_{5} has made M_{1} and M_{2} functionally independent, so both can operate on different sets of data at the same time.

Next, when the clock ticks for the third time, the outputs of registers R_{8} and R_{9} ((*a*_{1} × *b*_{1} × *c*_{1}) and *d*_{1}) are passed as inputs to adder A_{1}. As a result, we get our first output *y*_{1} = ((*a*_{1} × *b*_{1} × *c*_{1}) + *d*_{1}). At the same clock tick, M_{1} and M_{2} are free to operate on (*a*_{3}, *b*_{3}) and (*a*_{2}, *b*_{2}, *c*_{2}), respectively. This is feasible because register R_{5} isolates multiplier M_{1} from M_{2} and register R_{8} isolates multiplier M_{2} from adder A_{1}.

Thus, at the third clock tick, we also get (*a*_{3} × *b*_{3}) and (*a*_{2} × *b*_{2} × *c*_{2}) from M_{1} and M_{2}, respectively, in addition to *y*_{1}.

Now when the fourth clock tick arrives, adder A_{1} operates on its inputs to yield the second output, *y*_{2} = ((*a*_{2} × *b*_{2} × *c*_{2}) + *d*_{2}). In addition, the output of M_{1} changes from (*a*_{3} × *b*_{3}) to (*a*_{4} × *b*_{4}) while that of M_{2} changes from (*a*_{2} × *b*_{2} × *c*_{2}) to (*a*_{3} × *b*_{3} × *c*_{3}).

Following the same mode of operation, we can expect one output to appear at each clock tick from then on (Figure 3b), unlike in the non-pipelined design, where we had to wait three clock cycles for each output (Figure 2b).
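The tick-by-tick behavior above can be checked with a minimal register-transfer sketch in Python. It assumes all nine registers load on the same clock edge, so R_{5} captures the result of M_{1} on the edge after the inputs enter; the tick at which each output appears still matches the walkthrough. The function name `simulate_pipelined`, the use of `None` for "not yet valid", and the sample data are assumptions made for illustration.

```python
def simulate_pipelined(a, b, c, d, ticks):
    """Cycle-by-cycle model of the nine-register pipeline.

    Returns (tick, y) pairs; y is None while no valid result is at A1.
    """
    r = dict.fromkeys(range(1, 10))     # R1..R9, all empty (None) at reset
    trace = []
    for tick in range(1, ticks + 1):
        i = tick - 1                    # index of the data set entering now
        nxt = {}
        # Every new value is computed from the OLD register contents,
        # because all registers update on the same clock edge.
        nxt[8] = r[5] * r[6] if r[5] is not None else None   # M2 -> R8
        nxt[9] = r[7]                                        # d: R7 -> R9
        nxt[5] = r[1] * r[2] if r[1] is not None else None   # M1 -> R5
        nxt[6] = r[3]                                        # c: R3 -> R6
        nxt[7] = r[4]                                        # d: R4 -> R7
        if i < len(a):
            nxt[1], nxt[2], nxt[3], nxt[4] = a[i], b[i], c[i], d[i]
        else:
            nxt[1] = nxt[2] = nxt[3] = nxt[4] = None         # inputs exhausted
        r = nxt
        # A1 is modeled as combinational: its output follows R8 and R9.
        y = r[8] + r[9] if r[8] is not None else None
        trace.append((tick, y))
    return trace


# Illustrative data: four input sets, six clock ticks
for tick, y in simulate_pipelined([1, 2, 3, 4], [5, 6, 7, 8], [2, 2, 2, 2], [1, 1, 1, 1], 6):
    print(f"tick {tick}: y = {y}")
```

In this model the first two ticks produce no valid output, and from the third tick onward one result appears per tick until the inputs run out.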

### Consequences of Pipelining

#### Latency

In the example shown, the pipelined design produces one output per clock tick from the third clock cycle onward. This is because each input has to pass through three registers (constituting the pipeline depth) while being processed before it arrives at the output. Similarly, for a pipeline of depth *n*, valid outputs appear one per clock cycle only from the *n*^{th} clock tick onward.

This delay, the number of clock cycles that elapse before the first valid output appears, is referred to as latency. The more pipeline stages a design has, the greater its latency.
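As a rough worked form of this statement, assume one new input set enters per tick starting at tick 1. The symbols *k* (output index) and *t*_{k} (tick at which output *k* appears) are introduced here for illustration; for the non-pipelined case, *n* stands for the number of blocks activated in sequence, which is 3 in the example.

```latex
t_k^{\text{pipelined}} = n + k - 1,
\qquad
t_k^{\text{non-pipelined}} = n \, k
```

For *n* = 3 this gives ticks 3, 4, 5, ... for the pipelined design and ticks 3, 6, 9, ... for the non-pipelined one, so both designs deliver their first output at the same tick and differ only in how quickly the later outputs follow.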

#### Increase in Operational Clock Frequency

The non-pipelined design shown in Figure 2a produces one output for every three clock cycles. That is, if we have a clock of period 1 ns, then the input takes 3 ns (3 × 1 ns) to get processed and to appear as output.

This longest data path would then be the critical path, which decides the maximum operating clock frequency of our design. In other words, the frequency of the designed system must be no greater than 1/(3 ns) = 333.33 MHz to ensure satisfactory operation.

In the pipelined design, once the pipeline fills, there is one output produced for every clock tick. Thus our operating frequency is the same as that of the clock itself (here, 1/(1 ns) = 1000 MHz).

These figures clearly indicate that the pipelined design increases the operational frequency considerably when compared to the non-pipelined one.
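The same comparison can be stated in terms of propagation delays, which is how the reader discussion at the end of this page frames it. In the sketch below, *t*_{M} and *t*_{A} are assumed propagation delays of one multiplier and of the adder (both multipliers are taken to be equally slow), and register setup and clock-to-output times are ignored; these symbols do not appear in the article body.

```latex
f_{\max}^{\text{non-pipelined}} \approx \frac{1}{2\,t_M + t_A},
\qquad
f_{\max}^{\text{pipelined}} \approx \frac{1}{\max(t_M,\; t_A)}
```

When the multiplier is the slowest block, the pipelined limit reduces to 1/*t*_{M}: the clock period is set by the delay of a single stage rather than by the whole chain.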

#### Increase in Throughput

A pipelined design yields one output per clock cycle (once latency is overcome) irrespective of the number of pipeline stages contained in the design. Hence, by designing a pipelined system, we can increase the throughput of an FPGA.
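A short sketch can put numbers on this claim. It assumes, as the example does, a depth of three, one block or stage per tick, and the same clock period for both designs; the helper names `ticks_needed` and `throughput` are hypothetical.

```python
def ticks_needed(num_outputs, depth=3, pipelined=True):
    """Clock ticks required to produce num_outputs results."""
    if num_outputs == 0:
        return 0
    if pipelined:
        return depth + num_outputs - 1   # fill the pipeline once, then 1 per tick
    return depth * num_outputs           # every result costs `depth` ticks


def throughput(num_outputs, clock_period_ns=1.0, depth=3, pipelined=True):
    """Average results per nanosecond over a run of num_outputs results."""
    total_ns = ticks_needed(num_outputs, depth, pipelined) * clock_period_ns
    return num_outputs / total_ns


for n in (1, 10, 1000):
    print(n, round(throughput(n, pipelined=False), 3), round(throughput(n), 3))
```

For a single result the two designs are identical; the pipelined figure approaches one result per clock period only as the run gets longer, while the non-pipelined figure stays at one third of that.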

#### Greater Utilization of Logic Resources

In pipelining, we use registers to store the results of the individual stages of the design. These registers add to the logic resources used by the design and increase its hardware footprint.
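To get a feel for the cost, the sketch below estimates how many flip-flops registers R_{1} through R_{9} would need. It assumes unsigned inputs of equal width, full-precision products with no truncation (so the product of two *W*-bit values occupies 2*W* bits), and the register placement described for Figure 3a. The function name `pipeline_flip_flops` is hypothetical, and a real synthesis tool may report different numbers.

```python
def pipeline_flip_flops(width):
    """Estimate flip-flops for R1..R9 given unsigned `width`-bit inputs."""
    bits = {
        "R1": width, "R2": width, "R3": width, "R4": width,  # a, b, c, d
        "R5": 2 * width,   # a*b at full precision
        "R6": width,       # c delayed by one stage
        "R7": width,       # d delayed by one stage
        "R8": 3 * width,   # a*b*c at full precision
        "R9": width,       # d delayed by two stages
    }
    return sum(bits.values()), bits


total, per_register = pipeline_flip_flops(8)
print(total, per_register)
```

Under these assumptions the total is 12 times the input width, and half of it comes from registers that only delay *c* and *d* or hold the widening products.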

### Conclusion

Pipelining a design is demanding work. You need to divide the overall system into individual stages at appropriate points to ensure optimal performance. Nevertheless, the effort is repaid by the advantages it brings when the design runs.

11 Comments

kolingV (2018-03-26): really comprehensive and concise!!! I have read many articles about pipeline and been instructed by my supervisor. But this is the first time i have understood it totally.

snehahl (2018-04-06): Thanks. I am happy to know that my article served your purpose.

Dubacharla Gyaneshwar (2018-06-01): The best explanation of pipelining concept. thank you for the article

michalcz (2018-06-22): “Nevertheless, when the system is designed to be synchronous, at the first clock tick, only multiplier M1 can produce valid data at its output (a1 × b1). This is because, at this instant, only M1 has valid data (a1 and b1) at its input pins, unlike M2 and A1”

IMHO incorrect, because multiplier produces stable output after a given time delay, dependent on longest combinational path (which in turn is technology and architecture dependent), which has nothing to do with clock frequency.

In other terms, one can supply a clock with such frequency that the circuit in Fig 2a will have steady output in one or more clock cycles. Signal propagation time determines highest applicable clock frequency, not the other way around.

RK37 (2018-06-27): I see your point, but I think you’re overlooking the importance of the phrase “when the system is designed to be synchronous.” Perhaps the author did not intend this description to apply directly to Figure 2a. Rather, it applies to a system that is based on the one in Figure 2a but has been modified to ensure synchronous operation. The synchronous version would have clock-driven storage elements that prevent A1 from producing a valid output during the first clock cycle.

michalcz (2018-06-30): If there were in fact registers (“clock-driven storage elements”) in A1, M1, M2, then why would one attempt to insert R(5-9) into the design?

“The non-pipelined design shown in Figure 2a is shown to produce one output for every three clock cycles. That is, if we have a clock of period 1 ns, then the input takes 3 ns (3 × 1 ns) to get processed and to appear as output.

This longest data path would then be the critical path, which decides the minimum operating clock frequency of our design. In other words, the frequency of the designed system must be no greater than (1/3 ns) = 333.33 MHz to ensure satisfactory operation.”

Clearly, the author meant that the circuit Fig. 2a is combinational, but may have forgotten that multipliers and adders have different propagation time, which should not be called a clock cycle - it is misleading at the minimum.

RK37 (2018-07-02): You’re right, the article implies that the Fig. 2a circuit is combinational. I’m not sure why the author discusses clock cycles in the context of the combinational circuit blocks; the maximum operating frequency would be determined by the total propagation delay. If the inputs are made available simultaneously, M1 will produce a valid output (after its propagation delay), then M2 will produce a valid output, then A1 will produce a valid output. The A1 output could then be stored in a register, and a new multiplication operation could be performed. Only one clock cycle is required (with the clock period chosen according to the propagation delays).

It seems to me that in this particular example pipelining does not offer a major improvement in performance. Would you agree?

michalcz (2018-07-02): Apologies if the formatting messes up.

Au contraire, since maximum frequency for circuit in Fig. 2a is equal roughly to 1/(2*t_M1+t_A1), where t_x denotes delay of block x and t_M1 is equal to t_M2, we introduce the pipeline to increase maximum frequency up to 1/t_M1 and that is the real value of pipelining designs in FPGA’s.

RK37 (2018-07-02): OK, that makes sense. Pipelining allows you to establish the clock frequency according to the propagation delay of just one stage, instead of the total propagation delay. It looks like we need to revise this article.

Would you have any interest in rewriting/expanding this article?

michalcz (2018-07-03): I am swamped with work at the time, so let’s revisit this idea later on. Contact me via e-mail in 3-4 weeks.

Wojciech Zabolotny (2018-09-25): For complex pipeline designs, where information is split into multiple parallel branches and then combined back it may be difficult to keep the same latency in all paths. That problem is addressed in different graphical environments (e.g., Xilinx System Generator or MathWorks HDL Coder) but may be difficult to solve in designs implemented in pure HDL (VHDL or Verilog). I have faced that in a few of my projects, and tried to create an automated solution. You may find the Open Source implementation at https://opencores.org/project/lateq . It is also described in my paper http://dx.doi.org/10.1117/12.2247943