Graphics processing units (GPUs), and to some extent FPGAs, have generally been deployed for training deep learning neural networks. This computationally intensive training and evaluation process is often done offline in large server farms.
Next, these trained models are moved into production environments running on a hybrid of CPUs and GPUs, or a hybrid of CPUs and FPGAs. But what about embedded systems in automotive, consumer, and industrial environments that are highly sensitive to both cost and power consumption?
Enter DSP-based systems-on-chip (SoCs), which offer high-performance neural processing while providing more affordable, low-power solutions for the embedded environment. Digital signal processors (DSPs) mark the third viable silicon option in a deep learning system, especially for low-power embedded deployments.
A myriad of embedded applications, such as ADAS, virtual reality, and object recognition, are ripe for deep learning technology. Here, DSPs surpass both GPUs and FPGAs on performance-per-watt benchmarks. Moreover, deep learning chips built on DSP cores represent a more specialized and flexible solution than general-purpose GPUs and FPGAs.
DSPs extend deep learning's reach to embedded applications. Image courtesy of CEVA.
Two Case Studies
Take CEVA, a supplier of DSP cores for low-power embedded systems, which recently demonstrated a 24-layer convolutional neural network (CNN) running on its XM4 vision processor. According to CEVA, the DSP-based CNN engine delivered nearly three times the performance of a typical hybrid CPU/GPU processing solution.
Furthermore, apart from consuming 30 times less power than a GPU, CEVA claimed that the DSP engine required roughly one-fifth the memory bandwidth. Alongside the CEVA-XM4 imaging and vision processor, the company offers a network generator that translates a trained network into a cost-effective CNN implementation.
CEVA brings deep learning to the embedded space by taking a neural network that has been tuned and trained on a workstation and converting it to run on its DSP-based XM4 processor. The DSP core supplier converts the floating-point operations from the workstation to fixed-point instructions so that they can run more efficiently on the DSP core.
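The float-to-fixed conversion described above can be sketched in a few lines. The snippet below is an illustration, not CEVA's actual toolchain: it quantizes trained floating-point weights into signed 16-bit Q8.8 fixed-point values (8 integer bits, 8 fractional bits, a format chosen here purely for illustration) and performs an integer multiply-accumulate, the core operation a DSP's MAC units execute efficiently.

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS  # 256: one unit in the last place is 1/256

def to_fixed(x: float) -> int:
    """Round a float to a signed 16-bit Q8.8 fixed-point integer."""
    q = int(round(x * SCALE))
    # Saturate to the signed 16-bit range instead of wrapping around.
    return max(-(1 << 15), min((1 << 15) - 1, q))

def to_float(q: int) -> float:
    """Convert a Q8.8 fixed-point integer back to a float."""
    return q / SCALE

def fixed_dot(weights_q, inputs_q) -> int:
    """Integer multiply-accumulate. The product of two Q8.8 numbers is
    Q16.16, so shift right by FRAC_BITS to return the sum to Q8.8."""
    acc = 0
    for w, x in zip(weights_q, inputs_q):
        acc += w * x
    return acc >> FRAC_BITS

weights = [0.5, -1.25, 0.75]
inputs = [1.0, 2.0, -0.5]
wq = [to_fixed(w) for w in weights]
xq = [to_fixed(x) for x in inputs]
print(to_float(fixed_dot(wq, xq)))  # -2.375, matching the float dot product
```

The accumulator is kept wider than the operands before the final shift, which is the same discipline real fixed-point DSP code follows to avoid overflow in long dot products.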
Toronto-based Phi Algorithm Solutions is using the CEVA deep neural network (CDNN) framework for embedded systems and the XM4 processor to implement deep learning in its Universal Object Detector algorithm. The algorithm is now available for applications such as ADAS, pedestrian detection, and facial recognition.
Cadence is targeting its Tensilica Vision P6 DSP at CNN applications. Image courtesy of Cadence.
Cadence is another notable player in the DSP camp investing heavily in deep learning and CNN applications. The firm claims that a deep learning processor based on its Tensilica Vision P6 DSP core can achieve twice the image frame rate at lower energy usage than commercially available GPUs.
The features of the P6 DSP, including wide vector SIMD processing, VLIW instructions, and fast histogram and scatter/gather intrinsics, make it inherently well suited to the demanding deep learning environment. The processor combines a power-efficient implementation of CNN algorithms with on-the-fly data compression that substantially reduces memory footprint and bandwidth requirements for neural network layers.
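To make the bandwidth argument concrete, the sketch below shows one simple form such compression can take. This is an assumption for illustration, not Cadence's actual scheme: per-layer 8-bit linear quantization, which stores one byte per weight instead of a 32-bit float, cutting the layer's memory footprint (and hence the bandwidth needed to stream it) by a factor of four at the cost of a small, bounded error.

```python
def compress(weights):
    """Map floats to uint8 codes via a per-layer linear scale."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = bytes(round((w - lo) / scale) for w in weights)
    return codes, lo, scale

def decompress(codes, lo, scale):
    """Reconstruct approximate float weights from uint8 codes."""
    return [lo + c * scale for c in codes]

layer = [0.1 * i - 1.6 for i in range(32)]  # 32 example weights
codes, lo, scale = compress(layer)

raw_bytes = len(layer) * 4   # as 32-bit floats
packed_bytes = len(codes)    # one byte per weight
print(raw_bytes, packed_bytes)  # 128 32: a 4x footprint reduction
```

The reconstruction error per weight is at most half the quantization step (`scale / 2`), which is why schemes like this are acceptable for inference even though training was done in floating point.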
The Vision P6 DSP core—based on Tensilica's Xtensa architecture—supports OpenCV and OpenVX libraries.
Below are links to two previous articles about how silicon technologies are shaping the deep learning world.