Boosting Memory Performance in the Age of DDR5: An Intro to DDR Training Modes
November 06, 2020 by Steve Arar
Analysts predict that DDR5 will dominate the DRAM market in coming years. How do you calibrate DDR for peak memory performance?
Cadence Design Systems recently released a silicon-proven IP for the DDR5 and LPDDR5 DRAM memory standards on TSMC N5 process.
The new multi-standard IP is targeted for applications such as data center, storage, artificial intelligence/machine learning (AI/ML), and hyperscale computing. Supporting both DDR5 and LPDDR5 protocols makes the new IP a single-chip solution that can be used in products with different DRAM requirements.
Block diagram of Cadence's LPDDR PHY IP. Image used courtesy of Cadence
DDR5, with its high data rate, is expected to capture as much as 43% of the global DRAM market by 2024, according to SK Hynix. One of the key techniques that make the high data rate of DDR5 a reality is decision feedback equalization (DFE).
In this article, we’ll take a look at another important technique, DDR calibration, which helps this memory interface reach its optimal performance.
We commonly need to employ several memory chips to increase a system's memory capacity. In these cases, the wiring strategy can have a significant impact on the ultimate memory performance. One option is the T-branch connection shown below.
A double-T architecture for DDR layout and routing. Image courtesy of Altium
With this configuration, which is commonly used with DDR2 chips, the CLK/command/address lines are routed to a central point and then distributed from that central node to different DRAM chips. This allows us to have matched trace length for the CLK/command/address lines when communicating with different memory chips in the system.
Having almost the same propagation delay for the CLK/command/address signals simplifies the design procedure. However, a T-branch topology increases the capacitive loading of these signal lines.
An alternative solution is the fly-by topology employed with DDR3 and newer generations of DDR technology. The fly-by topology incorporates a daisy chain structure when routing clock, command, and address lines from the controller to the DRAM chips. This is depicted below.
Fly-by topology. Image courtesy of Altium
Note that the data (DQ) and strobe signals (DQS) are connected in a star configuration as in the case of a T-branch connection. With fly-by configuration, we can more easily deal with the increased capacitive load because the arrival time of signals at different DRAM chips is slightly different.
Since the signals encounter the input capacitance of the DRAM chips at slightly different times, the overall capacitive load appears as a distributed load to these signals. Hence, for a given system memory capacity, the capacitive loading is effectively reduced, and consequently, signal integrity and data rate are improved.
The downside to this technique is that the daisy-chained control and address signals experience a larger delay than the data and strobe signals, which have a shorter point-to-point connection. In addition, the control and address signals arrive at different DRAMs at different times. At speeds greater than 1 GHz, these time skews can make it very challenging to meet the signal setup/hold time requirements.
To address this issue, high-bandwidth memory interfaces, such as DDR4 and DDR5, employ training modes to measure the time skew of PCB traces. Once the skew is known, the controller can add an appropriate delay to the data signals it drives to the DRAMs so that the data arrives with a well-understood timing relationship with respect to the command and address signals.
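To get a feel for the size of these skews, here is a minimal Python sketch that compares the arrival time of a daisy-chained clock with that of a point-to-point strobe at each DRAM. The propagation delay of 6.7 ps/mm and all trace lengths are assumptions chosen for illustration, and the function names are hypothetical; none of these values come from a real board or from the article's figures.

```python
# Assumed propagation delay of an illustrative PCB trace, in ps per mm.
PS_PER_MM = 6.7


def flight_time_ps(length_mm, ps_per_mm=PS_PER_MM):
    """Return the one-way propagation delay of a trace in picoseconds."""
    return length_mm * ps_per_mm


def clock_to_strobe_skew_ps(ck_lengths_mm, dqs_lengths_mm):
    """Return, per DRAM, the CK arrival time minus the DQS arrival time."""
    return [
        flight_time_ps(ck) - flight_time_ps(dqs)
        for ck, dqs in zip(ck_lengths_mm, dqs_lengths_mm)
    ]


# Hypothetical board: four DRAMs on a fly-by clock, each with its own strobe.
ck_lengths = [40.0, 55.0, 70.0, 85.0]   # clock trace length up to each DRAM
dqs_lengths = [30.0, 30.0, 30.0, 30.0]  # point-to-point strobe lengths

skews = clock_to_strobe_skew_ps(ck_lengths, dqs_lengths)
for index, skew in enumerate(skews):
    print(f"DRAM {index}: CK lags DQS by about {skew:.1f} ps")
```

With these assumed numbers, the clock lags the strobe by roughly 67 ps at the nearest DRAM and by roughly 368 ps at the farthest one, so the skew grows along the chain and differs from chip to chip.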
One of these training modes is write leveling.
For a reliable write operation, the edge of the strobe signal (DQS) should be within a predefined vicinity of the clock edge. With fly-by topology, the clock signal that is daisy-chained experiences a larger delay compared to the strobe signal that has a shorter point-to-point connection. To align these two signals, DDR3 and newer DDR generations offer the write leveling training mode.
In this mode, which runs during device initialization, the controller repeatedly sends strobe signals to a particular DRAM. When the DRAM receives the strobe signal, it samples the clock signal and returns its value on the data bus back to the controller.
At the beginning of write leveling, the returned value is zero because the clock signal experiences a larger delay. The controller adds progressively more delay to the DQS signal until it observes a transition from zero to one on the data bus. At this point, the controller locks in this calibrated delay setting and uses it for future write operations.
The controller will introduce this delay to the data and strobe signals when performing write operations. This de-skew will make data and control signals arrive at the DRAM inputs with appropriate timing. The following figure illustrates the write leveling training mode.
A timing diagram that depicts the before and after effects of write leveling. Image courtesy of NXP
Note that the skew between clock and DQS is not the same for the different DRAM chips. Hence, write leveling should be performed for each DRAM in the system.
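The write leveling loop described above can be sketched in a few lines of Python. This is a simplified behavioral model, not a controller implementation: the 1000 ps clock period, the 10 ps delay step, the per-DRAM skew values, and the function names are all assumptions made for illustration. The DRAM is modeled as sampling an ideal 50% duty-cycle clock on the rising edge of DQS.

```python
def dram_samples_clock(dqs_delay_ps, ck_to_dqs_skew_ps, period_ps):
    """Model the DRAM: sample CK on the DQS rising edge and return 0 or 1."""
    # Position of the DQS edge within the clock cycle seen at the DRAM pins.
    phase = (dqs_delay_ps - ck_to_dqs_skew_ps) % period_ps
    # CK is high during the first half of its period.
    return 1 if phase < period_ps // 2 else 0


def write_level(ck_to_dqs_skew_ps, period_ps=1000, step_ps=10):
    """Sweep the DQS delay until the DRAM feedback goes from 0 to 1."""
    previous = dram_samples_clock(0, ck_to_dqs_skew_ps, period_ps)
    for delay in range(step_ps, 2 * period_ps, step_ps):
        current = dram_samples_clock(delay, ck_to_dqs_skew_ps, period_ps)
        if previous == 0 and current == 1:
            return delay  # lock this setting for future write operations
        previous = current
    raise RuntimeError("no 0-to-1 transition found in the sweep range")


# Hypothetical skews (ps) for three DRAMs along the fly-by chain.
skew_per_dram = {"U1": 120, "U2": 250, "U3": 385}
delay_per_dram = {
    name: write_level(skew) for name, skew in skew_per_dram.items()
}
print(delay_per_dram)
```

Because each DRAM sees a different clock-to-strobe skew, the sweep is run once per device and yields a separate delay setting for each. The result is quantized to the delay step, so a skew that falls between two steps resolves to the next step up.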
Training Modes of DDR5
DDR5 supports several different training modes that have a significant impact on its high data rate capability. In addition to the write leveling discussed above, DDR5 includes a new read preamble training mode, command/address training mode, and chip select training mode. DDR5 also has new functionality to compensate for the unmatched DQ-DQS receiver architecture, further enabling faster data rates.
The data patterns associated with DDR5 read training include the default programmable serial pattern, a simple clock pattern, and a linear feedback shift register (LFSR)-generated pattern, which can be used to achieve a more robust timing margin at DDR5's high data rates.
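To illustrate what an LFSR-generated pattern looks like, the following Python sketch implements a generic 8-bit Fibonacci LFSR. The register width, the tap positions (8, 6, 5, 4), and the seed are assumptions picked for illustration; they are not the polynomial or seed defined for DDR5 read training, and the function names are hypothetical.

```python
from itertools import islice

WIDTH = 8
TAPS = (8, 6, 5, 4)  # illustrative maximal-length taps, not from the DDR5 spec


def lfsr_step(state):
    """Advance the register by one shift and return the new state."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (WIDTH - tap)) & 1
    return (state >> 1) | (feedback << (WIDTH - 1))


def lfsr_pattern(seed=0x5A):
    """Yield one pseudo-random bit per shift, starting from a nonzero seed."""
    if not 0 < seed < (1 << WIDTH):
        raise ValueError("seed must be a nonzero value that fits the register")
    state = seed
    while True:
        yield state & 1
        state = lfsr_step(state)


def period(seed=0x5A):
    """Count the shifts needed for the register to return to its seed."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        steps += 1
    return steps


burst = list(islice(lfsr_pattern(), 16))  # first 16 bits of the pattern
print(burst, period())
```

Compared with a simple clock pattern, which only toggles between 0 and 1, a sequence like this mixes long and short runs of identical bits, so it exercises a wider range of bit-to-bit transitions during training.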