
The Importance of Reliability Verification in AI/ML Processors

January 21, 2020 by Neel Natekar, Mentor

With the adoption of artificial intelligence and machine learning in a wide variety of applications, reliability verification of AI/ML processors is critical since failures can have major consequences for the validity and legitimacy of AI/ML technology.

In the last few years, there has been a rapid expansion in the number of companies deploying artificial intelligence (AI) and machine learning (ML) in a wide range of applications. In fact, studies show that 2019 was a record year for enterprise adoption of AI and ML, and that companies consider these two capabilities among the most important for achieving their business strategies and goals. This growing adoption is driven primarily by improvements in algorithms, advancements in hardware design, and the increase in data volume created by the digitization of information.

However, to support and sustain the growth of AI/ML, companies must continue to prove to the marketplace that the results they obtain with AI/ML technologies can be trusted. That trust starts with the design and verification of the integrated circuits (ICs) that underlie AI/ML functionality.


Classification of AI and ML

AI processing can be broadly classified as datacenter/cloud-based or embedded, depending on whether it is performed at a cloud/datacenter site or on the end-user side, by embedding a dedicated AI chip or an AI co-processor engine with a system-on-chip (SoC) inside devices or at the edge. Edge in this context refers to a local server or machine that is closer to the device than a datacenter or the cloud.

In terms of the target application, the workload on an edge device can be classified as training or inference. Historically, training was performed in the cloud, with inference handled either in the cloud or on the edge device. With the development of new high-performance edge computing solutions, we are witnessing a paradigm shift as progressively more training activity is transferred to the edge.


AI/ML Chip Design

AI/ML chips in edge computing solutions or embedded inside local devices are designed for use in specific environments, such as enterprise, automotive, industrial, healthcare, Internet of things (IoT), etc. Some of these applications are mission-critical, meaning any failure can have disastrous consequences in the real world. For example, consider the advanced driver assistance systems (ADAS) used in automobiles. If an ADAS processor exceeds its allowed latency while reading data from a sensor and drawing an inference, the delay can cause a collision.

The ICs used in AI/ML applications are characterized by large parallel processing computation units, high power dissipation, and complex circuitry that must deliver maximum performance within a strict power budget. While some companies employ traditional central processing units (CPUs) for AI-related tasks, some industry experts argue that CPUs are not very efficient for this work, given the highly parallel nature of state-of-the-art AI algorithms. These algorithms lend themselves well to parallel computing solutions, like those provided by graphics processing units (GPUs). Owing to their reconfigurable nature, field-programmable gate arrays (FPGAs) have also attracted interest for use as accelerators in AI chips.


ASICs in AI/ML Applications

Overall, there is a growing consensus that the tricky problems of AI and ML don’t lend themselves to a one-size-fits-all design solution. To combat this issue, many companies develop their own application-specific ICs (ASICs), which they optimize in conjunction with the software stack to deliver the best value for a given AI/ML application (Figure 1).

Figure 1. Block diagram for an ASIC AI chip design.

These companies claim various benefits from the use of these ASICs, such as better performance, more operations per cycle, a simpler and more deterministic design compared to a CPU or GPU, area savings (due to the exclusion of complex constructs and mechanisms used in a CPU), lower power usage, and faster development time.


Heterogeneous Computing

There has also been an increase in the use of heterogeneous computing—systems that use a combination of different compute core types in an effort to combine the best of different capabilities. For example, in a system using a combination of a CPU and a GPU, heterogeneous computing can be beneficial by off-loading the parallel tasks to the GPU, while the CPU handles tasks such as process control, which is serial by nature.
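
As a rough illustration of this division of labor (the workload is invented, and a Python process pool stands in for the GPU), the main process in the sketch below handles the serial control flow while offloading the data-parallel portion:

```python
# A rough illustration of the CPU/GPU division of labor described above.
# The workload is invented, and a process pool stands in for the GPU:
# it absorbs the data-parallel batches while the main process handles
# the serial control flow.
from multiprocessing import Pool

def dot_product(pair):
    """Data-parallel kernel: one independent chunk of compute."""
    xs, ys = pair
    return sum(x * y for x, y in zip(xs, ys))

def main():
    # Serial "CPU-side" control: prepare the work and decide what to offload.
    batches = [([i] * 1000, [i + 1] * 1000) for i in range(8)]
    with Pool(processes=4) as gpu_stand_in:
        # Offload the parallel portion, then resume serial post-processing.
        partials = gpu_stand_in.map(dot_product, batches)
    print("accumulated result:", sum(partials))

if __name__ == "__main__":
    main()
```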

One common aspect across different classes of AI processors is that they are optimized for high performance and low latency, often delivering throughput measured in multiples of tera operations per second (TOPS). To gain an edge in this highly competitive market, power efficiency (measured as performance per watt) has become just as important as raw throughput. Power efficiency is often achieved by using a combination of design techniques such as power and clock gating, dynamic voltage and frequency scaling, multi-Vt design, etc.
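
As a simple worked example (the figures are hypothetical and not drawn from any particular product), performance per watt is just sustained throughput divided by power consumption:

\[
\text{power efficiency} \;=\; \frac{\text{throughput}}{\text{power}} \;=\; \frac{100\ \text{TOPS}}{25\ \text{W}} \;=\; 4\ \text{TOPS/W}
\]

By this metric, doubling throughput at constant power and halving power at constant throughput yield the same gain, which is why architectural and circuit-level techniques are pursued in tandem.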

Ensuring the reliable design and verification of these complex ICs is critical since circuit failures in these chips can have major consequences for the validity of the technology and legitimacy of the results they provide.


AI/ML IC Reliability Verification

Reliability verification is a massive challenge in AI/ML chips due to the size and complexity of these designs, with transistor counts on the order of millions, and sometimes billions. For example, NVIDIA's Tesla P100 GPU boasts a staggering transistor count of 15.3 billion, while Intel's Loihi IC contains 128 neuromorphic cores and three x86 cores, with 2.07 billion transistors. And because the reliability requirements for each use environment are different, designers must understand the applicable set of requirements and ensure they are met by testing their designs against well-defined reliability requirement specifications.


Design Reliability Verification Methods

Traditionally, designers used a variety of methods to ensure design reliability, including manual inspection and simulation techniques, relying mainly on the expertise and experience of the design team. However, manual inspection is not feasible for these large and complex AI/ML chips: it is time-consuming, prone to human error, and virtually incapable of providing sufficient coverage. Traditional SPICE-like simulation approaches aren't practical for these ICs either, because they do not scale to large designs.

To overcome capacity and runtime issues, many design teams manually partition a design and verify the different intellectual property (IP) blocks independently through simulation or traditional tools. However, a design contains many interactions between IP blocks (e.g., between compute cores and the bus, link, or high-bandwidth memory), and these interface interactions tend to be overlooked during manual partitioning, as the toy example below illustrates. Traditional IC verification tools also struggle with excessively long runtimes, often taking days to verify large designs and potentially delaying time to market.
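
If each block is verified in isolation, any connection whose endpoints fall in different blocks is never inspected by either block-level run. The netlist, block names, and partitioning below are all hypothetical:

```python
# Toy netlist: (driver pin, receiver pin) pairs. All names are hypothetical.
connections = [
    ("core0.out", "bus.in0"),
    ("core1.out", "bus.in1"),
    ("bus.out",   "hbm.wr"),
    ("core0.clk", "core0.pll"),  # intra-block; covered by block-level checks
]

# Manual partitioning assigns each IP instance to a verification block.
partition = {"core0": "blockA", "core1": "blockA", "bus": "blockB", "hbm": "blockC"}

def block_of(pin):
    return partition[pin.split(".")[0]]

# Connections whose endpoints sit in different blocks are never seen
# by any single block-level verification run.
missed = [(a, b) for a, b in connections if block_of(a) != block_of(b)]
print(f"{len(missed)} of {len(connections)} connections cross partition "
      f"boundaries and escape block-level verification: {missed}")
```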

The deficiencies in each of these processes highlight the need for a comprehensive electronic design automation (EDA) solution that can take advantage of the computing power of multiple CPUs and machines simultaneously. With automated, qualified reliability verification, product design and verification teams can converge more quickly on reliability issues and fixes, reducing their overall turnaround time (TAT) from days to hours. The sketch below illustrates the principle.
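
This is a conceptual sketch only, not any tool's implementation: independent checks are simulated with a fixed delay, and distributing them across CPU cores collapses the wall-clock time.

```python
# Conceptual sketch: independent reliability checks distributed across
# CPU cores. Each check is simulated with a fixed delay; 16 half-second
# checks complete in roughly 1 s on 8 cores instead of 8 s serially.
import time
from multiprocessing import Pool

def run_check(check_name):
    """Stand-in for one independent reliability rule check."""
    time.sleep(0.5)  # simulated analysis work
    return f"{check_name}: clean"

if __name__ == "__main__":
    checks = [f"rule_{i}" for i in range(16)]
    start = time.time()
    with Pool(processes=8) as pool:
        results = pool.map(run_check, checks)
    print(f"{len(results)} checks completed in {time.time() - start:.1f} s")
```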


The Calibre PERC Reliability Platform

Over the last few years, a new class of IC reliability verification tools that solve these process issues has emerged. Tools such as the Calibre™ PERC™ reliability platform leverage a rich set of features and functionality to deliver fast, foundry-qualified reliability verification. For example, the Calibre PERC reliability platform takes advantage of the Calibre platform’s multi-threaded (MT) and multi-threaded flexible (MTflex) scaling, which distributes tasks to multiple CPUs and/or remote machines to provide fast, efficient execution of verification processes on large and complex chips like AI/ML ICs (Figure 2).

Figure 2. Multi-threaded, flexible scaling distributes tasks to multiple remotes for faster overall execution.

Beyond these basic but essential mechanics, the Calibre PERC reliability platform provides innovative processing that combines both netlist and layout information from a design to quickly and precisely evaluate a wide range of potential reliability issues. By enabling designers to efficiently and confidently reduce a design's susceptibility to performance and operational failures, this approach to advanced reliability verification helps support the continued growth and adoption of trusted AI/ML technology.


Transistor-level Reliability

A majority of AI/ML designs use multiple power domains for a variety of purposes, such as providing a clean, noise-free power supply for analog IP, gating or shutting off power to certain areas of the chip, scaling voltages up or down independently for selected IPs, or meeting high current demands using multiple voltage regulators. For example, Intel's Skylake processor contains nine primary power domains.

Implementing a multiple power domain design requires the use of special circuit elements, such as voltage regulators, header and footer switches, level shifters, isolation cells, and state retention cells. These elements present a unique set of challenges for reliability verification. For instance, designers must verify that appropriate level shifter or isolation cells are used at each domain interface and that they are correctly connected (Figure 3).

Figure 3. The use of special elements (such as level shifters, isolation cells, and power gating switches) inside a low-power design requires specialized verification techniques.

They must also ensure that the appropriate device types are used in each power domain, such as thick-oxide devices on high-voltage supplies. Verifying these conditions requires very specific knowledge and processes, as the simplified check below suggests.
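
Here is a minimal sketch of such a rule, assuming a hypothetical device list and an illustrative voltage threshold; a real check operates on the full extracted transistor-level netlist with foundry-defined rules:

```python
# Hypothetical device records; a real check operates on the full
# extracted transistor-level netlist with foundry-defined rules.
THICK_OXIDE_REQUIRED_ABOVE_V = 1.2  # illustrative threshold, volts

devices = [
    {"name": "M1", "domain_voltage": 0.8, "oxide": "thin"},
    {"name": "M2", "domain_voltage": 1.8, "oxide": "thick"},
    {"name": "M3", "domain_voltage": 1.8, "oxide": "thin"},  # violation
]

violations = [
    d["name"] for d in devices
    if d["domain_voltage"] > THICK_OXIDE_REQUIRED_ABOVE_V and d["oxide"] != "thick"
]
print("thin-oxide devices on high-voltage domains:", violations)
```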


The Unified Power Format Technique

The unified power format (UPF) is a widely used standard that enables designers to maintain a consistent description of power intent throughout the design flow. However, traditional UPF-based verification flows validate IP at the logic or gate level; they lack the ability to validate the final transistor-level implementation, particularly the well and bulk connections.

The Calibre PERC reliability platform can read the UPF file for a design and leverage that information to perform various analyses at the transistor level, such as identifying missing or incorrectly connected level shifters, electrical overstress (EOS) conditions, floating wells, and much more. By using the Calibre PERC reliability platform in conjunction with the UPF data, designers can evaluate device interactions programmatically, providing repeatable and deterministic reliability verification.
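
The following toy sketch captures the spirit of such a UPF-driven check. The data model is hypothetical and greatly simplified; it is neither the UPF syntax nor the Calibre PERC implementation. It simply flags signal crossings between domains at different voltages that lack a level shifter:

```python
# Power intent distilled from a (hypothetical) UPF description.
domain_voltage = {"PD_CORE": 0.8, "PD_IO": 1.8}

# Signal crossings observed in the transistor-level implementation:
# (driver domain, receiver domain, level shifter present?)
crossings = [
    ("PD_CORE", "PD_IO", True),
    ("PD_IO", "PD_CORE", False),  # missing shifter -> flagged below
]

for src, dst, has_shifter in crossings:
    if domain_voltage[src] != domain_voltage[dst] and not has_shifter:
        print(f"violation: {src} ({domain_voltage[src]} V) drives "
              f"{dst} ({domain_voltage[dst]} V) without a level shifter")
```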


Lifetime Reliability of AI/ML Chips

Operational safety is a critical aspect for most AI/ML chips, which are expected to operate throughout their designed lifetime without any glitches or failures. Some electrical reliability issues, such as bias temperature instability (BTI) and EOS, may not manifest as immediate failures but can cause rapid degradation and aging over time if not corrected before manufacturing. Reliability verification can help ensure robust operation over an extended period by checking for various issues such as point-to-point resistance, positive and negative BTI, current density, and electromigration (EM), all of which can create performance degradation or catastrophic failure.

Consider the case where a device in a high-voltage domain drives a thin-oxide device that isn't rated to handle the high voltage, because the designer failed to insert a high-to-low level shifter. Even though this condition won't necessarily affect functionality at first, it will stress the thin-oxide device over time, eventually causing failure. The actual time to failure depends on the voltage value, the proportion of time the supply is on versus off, and the process parameters.

EM (the migration of atoms in a conductor due to electrical current) is another major issue that affects the long-term robustness of interconnects used in AI/ML ICs. This migration causes voids and hillocks to form on wires. The voids cause a significant increase in resistance, while the hillocks can create shorts, both of which lead to circuit failures. The EM effect is dependent on many factors, such as the length and width of the metal line, the interconnect material, operating temperature, uni-directional vs. bi-directional currents, etc.
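
Electromigration lifetime is commonly estimated with Black's equation, an empirical model whose constant \(A\), current-density exponent \(n\), and activation energy \(E_a\) are fitted by the foundry for each process and metal layer:

\[
\mathrm{MTTF} \;=\; A \, J^{-n} \exp\!\left(\frac{E_a}{kT}\right)
\]

where \(J\) is the current density, \(T\) is the absolute temperature, and \(k\) is Boltzmann's constant. The \(J^{-n}\) and temperature terms capture the geometry and operating-condition dependencies noted above.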

Foundries provide design companies with EM limits for the maximum current that the wires can handle, based on the expected use conditions for the product. For example, the EM limits for an IC used inside a mobile phone would be considerably lower than for an IC used in an industrial environment. Some companies have dedicated teams who actively engage with the foundry to define appropriate specifications, create test structures, and perform product qualification for EM tolerance. Obviously, it is harder to define these limits for a product that could be used in multiple environments, so designers typically design these chips for the worst-case operating conditions. In all cases, it is crucial to test the design against the foundry-defined EM limits and validate that the design can withstand EM effects.
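
As a back-of-the-envelope sketch of such a check (all values below are illustrative, not actual foundry limits), the average current density in a wire is simply the current divided by the wire's cross-sectional area, compared against the layer's EM limit:

```python
# Back-of-the-envelope EM screen. All numbers are illustrative;
# real limits come from the foundry's rules for each metal layer,
# temperature, and current waveform.
J_LIMIT_MA_PER_UM2 = 1.0        # hypothetical layer limit, mA/um^2
wire_width_um = 0.05
wire_thickness_um = 0.04
avg_current_ma = 0.003

# Average current density = current / cross-sectional area.
j = avg_current_ma / (wire_width_um * wire_thickness_um)
print(f"J = {j:.2f} mA/um^2 ->",
      "OK" if j <= J_LIMIT_MA_PER_UM2 else "EM violation")
```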

Failure to catch and correct the different reliability issues during the pre-silicon verification phase can result in a broad range of impacts, including multiple tape-out spins, delays in getting the product to market, loss of customer trust, significant negative market reaction, product recalls, and even catastrophic consequences, such as physical injury or loss of life. Identifying and fixing reliability violations before tape-out minimizes the chance of circuit malfunctions or failures that can prove to be costly.


Analysis and Management of AI/ML Reliability is Crucial

The recent success of and expansion in AI/ML functionality is largely based on advances in semiconductor technology. As these new designs are developed, the hardware design community must be aware of the need to analyze and manage the reliability aspects of a design, such as the target environment, operating conditions, reliability criteria, etc. Powerful EDA reliability verification tools designed to address the specific reliability issues and requirements of these large, complex chips can help design houses ensure that their products perform as intended throughout their designed lifetime. In turn, that translates to confidence in the results achieved through the use of AI/ML applications in the broader markets, supporting their continued use and expansion.
