The Why and How of Pipelining in FPGAs

This article explains pipelining and its implications with respect to FPGAs, i.e., latency, throughput, change in operating frequency, and resource utilization.

Technical Article February 15, 2018 by Sneha H.L.

Programming an FPGA (field-programmable gate array) is a process of customizing its resources to implement a specific logic function. This involves mapping the intended behavior onto the FPGA’s basic building blocks, such as configurable logic blocks (CLBs), dedicated multiplexers, and others, to meet the requirements of the digital system.

During the design process, one important criterion to be taken into account is the timing issue inherent in the system, as well as any constraints laid down by the user. One design mechanism which can help a designer achieve this objective is pipelining.

What Is Pipelining?

Pipelining is a process which enables parallel execution of program instructions. You can see a visual representation of a pipelined processor architecture below.

Figure 1. A visual representation of a pipelined processor architecture. Each square corresponds to an instruction. The use of different colors for the squares conveys the fact that the instructions are independent of one another. Image courtesy of Colin M.L. Burnett [CC-BY-SA-3.0].

In FPGAs, this is achieved by arranging multiple data processing blocks in a particular fashion. For this, we first divide our overall logic circuit into several small parts and then separate them using registers (flip-flops).

Let's see how an FPGA design is pipelined by considering an example.

An Example

Let's take a look at a system of two multiplications followed by one addition on four input arrays. Our output y_i will therefore be equal to (a_i × b_i × c_i) + d_i.

Non-Pipelined Design

The first design that comes to mind for such a system is two multipliers followed by an adder, as shown in Figure 2a.

Figure 2a. An example of non-pipelined FPGA design. Image created by Sneha H.L.

Here, we expect the sequence of operations to be the multiplication of a_i and b_i data by multiplier M₁, followed by the multiplication of its product with c_i by multiplier M₂ and finally the addition of the resultant product with d_i by adder A₁.

Nevertheless, when the system is designed to be synchronous, at the first clock tick, only multiplier M₁ can produce valid data at its output (a₁ × b₁). This is because, at this instant, only M₁ has valid data (a₁ and b₁) at its input pins, unlike M₂ and A₁.

At the second clock tick, there would be valid data at the input pins of both M₁ and M₂. However, now we need to ensure that only M₂ operates while M₁ holds its output as it is. This is because, if M₁ operates at this instant, its output line changes to (a₂ × b₂) instead of the expected value (a₁ × b₁), leading to an erroneous M₂ output of (a₂ × b₂ × c₁) rather than (a₁ × b₁ × c₁).

When the clock ticks for the third time, there would be valid inputs at all three components: M₁, M₂, and A₁. Nevertheless, we only want the adder to operate, as we expect the output to be y₁ = (a₁ × b₁ × c₁) + d₁. This means the first output of the system will be available after the third clock tick.

Next, as the fourth clock tick arrives, M₁ can operate on the next set of data: a₂ and b₂. At this instant, M₂ and A₁ are expected to be idle. This has to be followed by the activation of M₂ (and only M₂) at the fifth clock tick and the activation of A₁ (and only A₁) at the sixth clock tick. This gives our next output, y₂ = (a₂ × b₂ × c₂) + d₂.

When a similar excitation pattern is followed for the components, we can expect the next outputs to occur at clock ticks 9, 12, 15 and so on (Figure 2b).

Figure 2b.
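
The activation pattern described above can be sketched in a few lines of Python. This is only an illustration of the article's one-block-per-tick model; the names `BLOCKS`, `active_block`, and `output_ticks` are made up for this sketch and do not come from any FPGA tool.

```python
# Sketch of the non-pipelined schedule described above:
# exactly one block operates at each clock tick, in the order M1, M2, A1.
BLOCKS = ["M1", "M2", "A1"]  # hypothetical labels matching Figure 2a


def active_block(tick):
    """Return the block that operates at a 1-based clock tick."""
    return BLOCKS[(tick - 1) % len(BLOCKS)]


def output_ticks(num_outputs):
    """Ticks at which y1, y2, ... become available (every third tick)."""
    return [len(BLOCKS) * i for i in range(1, num_outputs + 1)]


if __name__ == "__main__":
    for tick in range(1, 7):
        print(f"tick {tick}: only {active_block(tick)} operates")
    print("outputs available at ticks:", output_ticks(5))
```

Under this model, each block is idle for two out of every three ticks, which is the inefficiency that pipelining removes.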

Pipelined Design

Now, let's suppose that we add registers to this design at the inputs (R₁ through R₄), between M₁ and M₂ and between M₂ and A₁ (R₅ and R₈, respectively), and along the direct input paths (R₆, R₇, and R₉), as shown in Figure 3a.

Figure 3a. An example of pipelined FPGA design. Image and table created by Sneha H.L.

Here, at the first clock tick, valid inputs appear only for registers R₁ through R₄ (a₁, b₁, c₁, and d₁, respectively) and for the multiplier M₁ (a₁ and b₁). As a result, only these can produce valid outputs. The product from M₁ settles during this clock cycle and is presented to register R₅, which captures it at the next clock edge.

At the second clock tick, registers R₅ and R₆ hold a₁ × b₁ and c₁ and present them as inputs to M₂, which renders its output as a₁ × b₁ × c₁, while the output of R₄ (d₁) is shifted into register R₇. Meanwhile, the second set of data (a₂, b₂, c₂, and d₂) enters the system and appears at the outputs of R₁ through R₄.

In this case, M₁ is allowed to operate on its new inputs, so its output line changes from a₁ × b₁ to a₂ × b₂, unlike in the non-pipelined design. This is safe because a change in the output of M₁ no longer affects the output of M₂: the value M₂ needs (a₁ × b₁) is held in register R₅, which updates only on a clock edge and therefore stays undisturbed for the rest of the second clock cycle.

This means insertion of register R₅ has made M₁ and M₂ functionally independent due to which they both can operate on different sets of data at the same time.

Next, when the clock ticks for the third time, the outputs of registers R₈ and R₉ (a₁ × b₁ × c₁ and d₁) are passed as inputs to adder A₁. As a result, we get our first output, y₁ = (a₁ × b₁ × c₁) + d₁. At the same clock tick, M₁ and M₂ are free to operate on (a₃, b₃) and (a₂, b₂, c₂), respectively. This is feasible because register R₅ isolates multiplier M₁ from M₂ and register R₈ isolates multiplier M₂ from adder A₁.

Thus, at the third clock tick, we also get (a₃ × b₃) and (a₂ × b₂ × c₂) from M₁ and M₂, respectively, in addition to y₁.

When the fourth clock tick arrives, adder A₁ operates on its inputs to yield the second output, y₂ = (a₂ × b₂ × c₂) + d₂. In addition, the output of M₁ changes from (a₃ × b₃) to (a₄ × b₄), while that of M₂ changes from (a₂ × b₂ × c₂) to (a₃ × b₃ × c₃).

Following the same mode of operation, we can expect one output to appear at each clock tick from then on (Figure 3b), unlike in the non-pipelined design, where we had to wait three clock cycles for every output (Figure 2b).
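
The tick-by-tick behavior walked through above can be reproduced with a small Python model. This is a behavioral sketch, not HDL: the function name `simulate_pipeline` and the grouping of R₁ through R₉ into three stage tuples are assumptions made for illustration, and the adder A₁ is treated as purely combinational on the outputs of R₈ and R₉.

```python
def simulate_pipeline(a, b, c, d):
    """Clock the three-stage pipeline; return (tick, y) for each valid output.

    Assumes the four input lists have equal length.
    """
    n = len(a)
    stage1 = None  # R1..R4 hold (a_i, b_i, c_i, d_i)
    stage2 = None  # R5, R6, R7 hold (a_i*b_i, c_i, d_i)
    stage3 = None  # R8, R9 hold (a_i*b_i*c_i, d_i)
    outputs = []
    tick = 0
    while len(outputs) < n:
        tick += 1
        # All registers load on the same clock edge, so each stage is
        # updated from the PREVIOUS contents of the stage before it
        # (hence the back-to-front order).
        stage3 = (stage2[0] * stage2[1], stage2[2]) if stage2 else None  # M2
        stage2 = (stage1[0] * stage1[1], stage1[2], stage1[3]) if stage1 else None  # M1
        i = tick - 1
        stage1 = (a[i], b[i], c[i], d[i]) if i < n else None
        # Adder A1 works on whatever R8 and R9 currently hold.
        if stage3 is not None:
            outputs.append((tick, stage3[0] + stage3[1]))
    return outputs


if __name__ == "__main__":
    result = simulate_pipeline([1, 2, 3, 4], [5, 6, 7, 8], [2, 2, 2, 2], [10, 20, 30, 40])
    for tick, y in result:
        print(f"tick {tick}: y = {y}")
```

With four input sets, the model yields the first result at tick 3 and one further result at each of ticks 4, 5, and 6, matching the description above.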

Consequences of Pipelining

Latency

In the example shown, the pipelined design produces one output per clock tick starting from the third clock tick. This is because each input has to pass through three registers (constituting the pipeline depth) while being processed before it arrives at the output. Similarly, if we have a pipeline of depth n, valid outputs appear one per clock cycle only from the nth clock tick onward.

This delay associated with the number of clock cycles lost before the first valid output appears is referred to as latency. The greater the number of pipeline stages, the greater the latency that will be associated with it.

Increase in Operational Clock Frequency

The non-pipelined design shown in Figure 2a is shown to produce one output for every three clock cycles. That is, if we have a clock of period 1 ns, then the input takes 3 ns (3 × 1 ns) to get processed and to appear as output.

This longest data path is the critical path, and it sets the maximum operating clock frequency of our design. In other words, the frequency of the designed system must be no greater than 1/(3 ns) = 333.33 MHz to ensure satisfactory operation.

In the pipelined design, once the pipeline fills, one output is produced at every clock tick. Thus the operating frequency is that of the clock itself (here, 1/(1 ns) = 1000 MHz).

These figures clearly indicate that the pipelined design increases the operational frequency considerably when compared to the non-pipelined one.
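
One way to reproduce these figures is to treat the clock limit as set by the slowest register-to-register path. The sketch below assumes, purely for illustration, that M₁, M₂, and A₁ each have a 1 ns propagation delay and that register setup and clock-to-output times are negligible; the function name `fmax_mhz` and the delay variables are hypothetical.

```python
def fmax_mhz(path_delays_ns):
    """Maximum clock frequency in MHz, limited by the slowest path (delays in ns)."""
    return 1000.0 / max(path_delays_ns)


# Assumed propagation delays (ns) for the two multipliers and the adder.
t_m1, t_m2, t_a1 = 1.0, 1.0, 1.0

# Non-pipelined: a single path runs through M1, M2, and A1 in series.
non_pipelined = fmax_mhz([t_m1 + t_m2 + t_a1])

# Pipelined: registers split the path, so only the slowest stage matters.
pipelined = fmax_mhz([t_m1, t_m2, t_a1])

if __name__ == "__main__":
    print(f"non-pipelined: {non_pipelined:.2f} MHz")
    print(f"pipelined:     {pipelined:.2f} MHz")
```

In a real device the stage delays are rarely equal, so the gain is set by the slowest stage rather than by the stage count alone.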

Increase in Throughput

A pipelined design yields one output per clock cycle (once latency is overcome) irrespective of the number of pipeline stages contained in the design. Hence, by designing a pipelined system, we can increase the throughput of an FPGA.
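
To put a number on the throughput gain, the sketch below counts the clock cycles needed to process a batch of inputs under the article's model: three cycles per output without pipelining, and one output per cycle after an initial fill equal to the pipeline depth. The function names are made up for this illustration.

```python
def cycles_non_pipelined(num_inputs, cycles_per_output=3):
    """Each output needs its own full pass through M1, M2, and A1."""
    return num_inputs * cycles_per_output


def cycles_pipelined(num_inputs, depth=3):
    """The first output appears at tick `depth`; one more follows per tick."""
    if num_inputs == 0:
        return 0
    return depth + num_inputs - 1


if __name__ == "__main__":
    for n in (4, 1000):
        print(n, "inputs:", cycles_non_pipelined(n), "vs", cycles_pipelined(n), "cycles")
```

For large batches the fill time becomes negligible, and the pipelined design approaches three times the throughput of the non-pipelined one at the same clock.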

Greater Utilization of Logic Resources

In pipelining, we use registers to store the results of the individual stages of the design. These registers add to the logic resources used by the design and increase its hardware footprint.
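
A rough estimate of this cost for the example design is sketched below. It assumes unsigned inputs of equal width and full-precision products (a product of two w-bit values needs 2w bits, and a product of three needs 3w bits); these widths are assumptions for illustration, since the article does not specify them, and the function name is hypothetical.

```python
def pipeline_register_bits(width):
    """Flip-flops needed by each register R1..R9 for `width`-bit inputs."""
    return {
        "R1": width,      # a
        "R2": width,      # b
        "R3": width,      # c
        "R4": width,      # d
        "R5": 2 * width,  # a * b at full precision
        "R6": width,      # c, delayed one stage
        "R7": width,      # d, delayed one stage
        "R8": 3 * width,  # a * b * c at full precision
        "R9": width,      # d, delayed two stages
    }


if __name__ == "__main__":
    bits = pipeline_register_bits(8)
    print("flip-flops per register:", bits)
    print("total flip-flops:", sum(bits.values()))
```

Note that R₆, R₇, and R₉ perform no computation; they exist only to keep c and d aligned with the products they belong to.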

Conclusion

Pipelining a design is quite demanding. You need to divide the overall system into individual stages at appropriate points to ensure optimal performance. Nevertheless, the hard work that goes into it is repaid by the advantages it brings when the design runs.

kolingV March 26, 2018

really comprehensive and concise!!! I have read many articles about pipeline and been instructed by my supervisor. But this is the first time i have understood it totally.

- snehahl April 06, 2018
  
  Thanks. I am happy to know that my article served your purpose.
Dubacharla Gyaneshwar June 01, 2018

The best explanation of pipelining concept. thank you for the article

michalcz June 22, 2018

“Nevertheless, when the system is designed to be synchronous, at the first clock tick, only multiplier M1 can produce valid data at its output (a1 × b1). This is because, at this instant, only M1 has valid data (a1 and b1) at its input pins, unlike M2 and A1”

IMHO incorrect, because multiplier produces stable output after a given time delay, dependent on longest combinational path (which in turn is technology and architecture dependent), which has nothing to do with clock frequency.
In other terms, one can supply a clock with such frequency that the circuit in Fig 2a will have steady output in one or more clock cycles. Signal propagation time determines highest applicable clock frequency, not the other way around.

- RK37 June 27, 2018
  
  I see your point, but I think you're overlooking the importance of the phrase "when the system is designed to be synchronous." Perhaps the author did not intend this description to apply directly to Figure 2a. Rather, it applies to a system that is based on the one in Figure 2a but has been modified to ensure synchronous operation. The synchronous version would have clock-driven storage elements that prevent A1 from producing a valid output during the first clock cycle.
- - michalcz June 30, 2018
    
    If there were in fact registers ("clock-driven storage elements") in A1, M1, M2, then why would one attempt to insert R(5-9) into the design? "The non-pipelined design shown in Figure 2a is shown to produce one output for every three clock cycles. That is, if we have a clock of period 1 ns, then the input takes 3 ns (3 × 1 ns) to get processed and to appear as output. This longest data path would then be the critical path, which decides the minimum operating clock frequency of our design. In other words, the frequency of the designed system must be no greater than (1/3 ns) = 333.33 MHz to ensure satisfactory operation." Clearly, the author meant that the circuit Fig. 2a is combinational, but may have forgotten that multipliers and adders have different propagation time, which should not be called a clock cycle - it is misleading at the minimum.
  - - RK37 July 02, 2018
      
      You're right, the article implies that the Fig. 2a circuit is combinational. I'm not sure why the author discusses clock cycles in the context of the combinational circuit blocks; the maximum operating frequency would be determined by the total propagation delay. If the inputs are made available simultaneously, M1 will produce a valid output (after its propagation delay), then M2 will produce a valid output, then A1 will produce a valid output. The A1 output could then be stored in a register, and a new multiplication operation could be performed. Only one clock cycle is required (with the clock period chosen according to the propagation delays). It seems to me that in this particular example pipelining does not offer a major improvement in performance. Would you agree?
    - michalcz July 02, 2018
      
      Apologies if the formatting messes up. Au contraire, since maximum frequency for circuit in Fig. 2a is equal roughly to 1/(2*t_M1+t_A1), where t_x denotes delay of block x and t_M1 is equal to t_M2, we introduce the pipeline to increase maximum frequency up to 1/t_M1 and that is the real value of pipelining designs in FPGA's.
    - RK37 July 02, 2018
      
      OK, that makes sense. Pipelining allows you to establish the clock frequency according to the propagation delay of just one stage, instead of the total propagation delay. It looks like we need to revise this article. Would you have any interest in rewriting/expanding this article?
    - michalcz July 03, 2018
      
      I am swamped with work at the time, so let's revisit this idea later on. Contact me via e-mail in 3-4 weeks.
Wojciech Zabolotny September 25, 2018

For complex pipeline designs, where information is split into multiple parallel branches and then combined back it may be difficult to keep the same latency in all paths. That problem is addressed in different graphical environments (e.g., Xilinx System Generator or MathWorks HDL Coder) but may be difficult to solve in designs implemented in pure HDL (VHDL or Verilog). I have faced that in a few of my projects, and tried to create an automated solution. You may find the Open Source implementation at https://opencores.org/project/lateq . It is also described in my paper http://dx.doi.org/10.1117/12.2247943

ghani bourenane February 28, 2024

Great explanation, the simplicity in approaching the topic complexity made it fascinating!

Amos Sitenda December 28, 2024

Great explanation.

SpencerHomes July 09, 2026

Correct me if i am wrong. So in Fig 3a “Here, at the first clock tick, valid inputs appear only for registers R1 through R4 (a1, b1, c1 and d1, respectively) and for the multiplier M1 (a1 and b1). As a result, only these can produce valid outputs. Moreover, once M1 produces its output, it is passed on to register R5 and stored in it.”
How is the multiplier output stored in R5 in the same clock cycle as the inputs are being stored in R1 AND processed by the multiplier at the same instant? R5 will store a garbage value because the multiplier takes some delay to produce its output, which needs to be stable to be stored. But here it's implied that R5 and R1 are toggled in the same clock cycle.
