QMD: Expediting Core-to-Core Communication in Multicore Processors

December 02, 2016 by Dr. Steve Arar

Researchers have designed a more efficient multicore processor.

Researchers have developed a simple but efficient technique for core-to-core communication in multicore processors.

Increasing the number of cores in a multicore processor can significantly speed up certain operations, provided the software is written to divide the work among multiple cores operating simultaneously.

The Bottleneck in Adding More Cores

Dual- and quad-core processors have shown considerable improvement over single-core ones. However, the rate of improvement has diminished as more and more cores have been added. This is partly because maintaining cache coherency between the cores is challenging. Simply put, cache coherency ensures that the cores are on the same page.

In a multicore processor, each core has a small cache to store its frequently used data. There's also a large shared cache for the whole processor, which all the cores can access. Since several cores may work simultaneously on the shared data, it is necessary to keep track of updates to the data and of which version of the shared data each core holds.
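The bookkeeping described above can be sketched as a toy directory. This is a minimal illustration, not how any real processor implements coherency: the class and method names are hypothetical, each "cache" is a plain dictionary, and writes go straight through to the shared cache to keep the example short.

```python
# Toy directory for a shared cache: it records which cores hold a copy of
# each address. All names here are hypothetical, for illustration only.

class Directory:
    def __init__(self, num_cores):
        self.memory = {}    # address -> latest value in the shared cache
        self.sharers = {}   # address -> set of core ids holding a valid copy
        self.private = [dict() for _ in range(num_cores)]  # per-core caches

    def read(self, core, addr):
        # On a miss, fetch from the shared cache and register as a sharer.
        if addr not in self.private[core]:
            self.private[core][addr] = self.memory.get(addr, 0)
            self.sharers.setdefault(addr, set()).add(core)
        return self.private[core][addr]

    def write(self, core, addr, value):
        # Invalidate every other core's copy so nobody reads a stale value.
        for other in self.sharers.get(addr, set()) - {core}:
            self.private[other].pop(addr, None)
        self.sharers[addr] = {core}
        self.private[core][addr] = value
        self.memory[addr] = value  # write-through, to keep the sketch short


d = Directory(num_cores=4)
d.write(0, 0x40, 1)
print(d.read(1, 0x40))   # core 1 fetches the value written by core 0
d.write(0, 0x40, 2)      # core 1's copy is invalidated here
print(d.read(1, 0x40))   # core 1 misses and re-fetches the new value
```

The `sharers` table is the part that corresponds to the directory: it has to exist for every cached address, which is why its size matters as the core count grows.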

Managing this process requires a relatively large amount of memory and some computational resources. In a 64-core processor, the memory used to maintain cache coherency, called "directory memory", takes up almost 12% of the shared cache. As the number of cores increases, that percentage rises further.

Last year, MIT researchers proposed a method called Tardis (named after the time machine in the British sci-fi show Doctor Who), which could significantly reduce the size of the required directory memory. Unlike conventional methods, in which the size of the directory’s memory is proportional to the number of cores, the memory used in Tardis increases only with the logarithm of the number of cores.
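A rough calculation shows why the difference between linear and logarithmic growth matters. The sketch below assumes a 64-byte cache line and a conventional directory that keeps one presence bit per core for every line. Neither figure comes from the study. The logarithmic case simply stores ceil(log2(N)) bits per line to show the growth rate and is not a description of Tardis's actual storage format.

```python
LINE_BITS = 64 * 8  # assumed 64-byte cache line (not a figure from the article)

def full_map_overhead(num_cores):
    """Directory bits per data bit with one presence bit per core per line."""
    return num_cores / LINE_BITS

def log_overhead(num_cores):
    """Same ratio if per-line state needed only ceil(log2(N)) bits.

    (num_cores - 1).bit_length() equals ceil(log2(num_cores)) for N >= 1.
    Purely illustrative of logarithmic growth.
    """
    return (num_cores - 1).bit_length() / LINE_BITS

for n in (16, 64, 256, 1024):
    print(f"{n:5d} cores: full map {full_map_overhead(n):7.1%}, "
          f"log-sized {log_overhead(n):5.1%}")
```

Under these assumptions, 64 cores cost 64 extra bits per 512-bit line, which is in the same neighborhood as the figure quoted above, and the linear scheme's overhead doubles every time the core count doubles.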

Cache coherency requires not only a sizeable memory but also some computational effort. Because the processor is kept busy maintaining cache coherency, it cannot reach its full potential on the problem at hand. While Tardis aims to reduce the required memory, a recent study by a group of researchers at North Carolina State University and Intel aims to speed up the process by establishing fast, reliable communication between the cores.

Core-to-core communication becomes more and more important as the number of cores increases. For example, communication between the 18 cores of the Intel Haswell-EX Xeon E7 V3 processor is a real challenge. According to Yan Solihin, a professor of electrical and computer engineering at North Carolina State University who was involved in the study, communication between cores is becoming the bottleneck of multicore processors.

The compact layout of today's processors. Image courtesy of IEEE.

Moving from Software to Hardware

Recently, various research teams have proposed designs that achieve speed improvements by moving frequently used functions from software to hardware implementations. For example, with Moore’s law coming to an end, Microsoft has turned to implementing some of the AI algorithms running on its servers in FPGAs rather than in software. The company claims that, with FPGA-based servers, speed improvements remain achievable until 2030 even if transistor performance stays fixed.

While Microsoft has applied this technique to speed up its servers, Solihin's team has proposed a hardware implementation to accelerate core-to-core communication in a multicore processor. According to Solihin, performance now has to come from better energy efficiency, and the way to get there is to move functions from software to hardware. He adds that the main challenge is figuring out whether a particular function is used frequently enough to justify implementing it in hardware.

A Solution: The Hardware Queue

Currently, core-to-core communication is performed by sending and receiving software commands between cores. As a result, the processor has to devote a considerable part of its computational resources to executing these commands.
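To make the cost of a software queue concrete, here is a minimal sketch of one, with two Python threads standing in for two cores. It is not the software baseline used in the study, and the names `SoftwareQueue` and `run` are hypothetical. The point is only that every enqueue and dequeue involves taking a lock and updating indices, all of which is work the cores themselves have to execute.

```python
import threading

class SoftwareQueue:
    """Bounded FIFO shared by a producer and a consumer (hypothetical sketch)."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0    # next slot to read
        self.tail = 0    # next slot to write
        self.count = 0   # number of items currently stored
        self.cond = threading.Condition()

    def enqueue(self, item):
        with self.cond:  # the lock serializes the two sides
            while self.count == len(self.buf):
                self.cond.wait()  # queue full: wait for the consumer
            self.buf[self.tail] = item
            self.tail = (self.tail + 1) % len(self.buf)
            self.count += 1
            self.cond.notify_all()

    def dequeue(self):
        with self.cond:
            while self.count == 0:
                self.cond.wait()  # queue empty: wait for the producer
            item = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.count -= 1
            self.cond.notify_all()
            return item

def run(num_packets=1000):
    """Send num_packets integers from one thread to another, in order."""
    q = SoftwareQueue(capacity=8)
    received = []

    def producer():
        for i in range(num_packets):
            q.enqueue(i)

    def consumer():
        for _ in range(num_packets):
            received.append(q.dequeue())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(len(run(1000)))
```

A hardware queue takes over the locking and index bookkeeping that `enqueue` and `dequeue` do here in software, leaving the cores free for the actual packet processing.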

Solihin’s team has proposed employing a hardware queue instead of a software one. The device, called the Queue Management Device (QMD for short), was tested on a 16-core processor. In that test, packet processing, which is frequently performed on network nodes, ran 20 times faster than with conventional software-based designs. The study shows that the speed improvement from QMD becomes even more pronounced as the number of cores increases.

Srini Devadas, an MIT expert in cache control systems who was involved in Tardis, notes that QMD has near-term potential: Intel would need to add only a small piece of hardware to achieve a significant improvement. In contrast, Tardis is a radical approach that could be used in processors further in the future.

Although Intel researchers have not commented on commercializing QMD in the near future, they are investigating its potential, and we can expect multicore processing to become even more powerful soon.