A Standards Based Approach to the Reliability Specification of IP Components

Adrian Evans^*, Dan Alexandrescu^*, Enrico Costenaro^*, Michael Nicolaidis⁺
^*iRoC Technologies
⁺TIMA Laboratory (Grenoble INP, UJF, CNRS)

Abstract

Reliability is a growing concern in complex embedded systems. There is an increasing need to understand the failure modes and the overall reliability characteristics of large SoCs which are built using IP components from diverse sources. IP providers must be able to specify the way silicon failures (e.g. single bit upsets, permanent faults) affect the operation of their IP components. This is especially challenging for providers of soft IP because the designs are often extremely configurable and the high-level failure mechanisms (e.g. interrupts, reboots, etc.) must be expressed independently of the underlying implementation technology (e.g. FPGA, ASIC, etc.). In this paper, we show how a standard modelling format called RIIF (Reliability Information Interchange Format) can be used to specify the failure modes of complex IP components including a worked example based on a processor core.

I. INTRODUCTION

Due to the ever finer dimensions of transistors, there is a growing number of technology failure mechanisms (e.g. NBTI, HCI, electro-migration, etc.). Furthermore, the effect of soft-errors in electronics is also increasing as the number of devices integrated on a single die grows with Moore’s Law. At the same time, complex SoCs using these advanced technologies are being used in high-reliability and safety-critical applications such as automobiles and medical devices.

The designers of these silicon devices are required to perform detailed Failure Modes and Effects Analysis (FMEA) to show that the resulting systems comply with the applicable safety and reliability standards (e.g. ISO 26262, DO-254, etc.). It is thus essential to understand how failures at the cell and transistor level propagate upwards through design blocks and what effects they provoke on the overall system. Figure 1 shows examples of how failure modes at the cell/technology level (e.g. a Single Event Upset, SEU) can provoke an effect at the block level (e.g. error interrupt, exception) and how this can then manifest itself at the system level (e.g. a reboot, an engine lamp going on in a car, etc.).

Modern chips integrate billions of transistors; therefore, an exhaustive analysis of failure effects at the transistor, or even cell level, is not possible. The problem of failure analysis can only be solved through new EDA tools, judicious abstractions and effective modelling techniques.

Fig. 1: Propagation of Failures in a Silicon System

When the entire design of a chip occurs within a single organization, it may be possible to use in-house approaches to perform such reliability analysis. Increasingly, however, SoCs are assembled from IP components which are sourced from external providers (or from multiple departments within a large organization). Standard cell library and memory IP are also typically sourced externally. In this complex development environment (shown in figure 2), it becomes difficult for an SoC integrator to perform a detailed reliability or FMEA analysis for an SoC.

The challenge comes from the fact that there is no standard format to specify the failure modes of electronic components or design blocks. Today in industry, it is common to use complicated spreadsheets to compute the effects of silicon errors in a complex SoC [1]. It is often necessary to transcribe a large volume of design data (e.g. gate counts, technology FIT rates, failure modes) from multiple sources into a spreadsheet. This approach is error-prone and does not scale with design size. Spreadsheets are not modular, thus when components of the design are changed (e.g. a new technology library, a new version of an IP block), there is no easy path to updating the reliability model. The problem of reliability modeling for IP blocks has been identified previously [2], primarily in the context of analyzing memory soft error rates (SER). It is now necessary to analyze the effects of multiple silicon failure effects in memories, flip-flops and random logic.

To address this problem, the RIIF modeling language [3] is being developed. RIIF stands for Reliability Information Interchange Format and it is a specialized modelling language that can express the failure modes of generic components. In the first portion of this paper, we provide an introduction to the RIIF language and an overview of its development status. In the second half, we present a worked example showing how the failure modes of an IP component can be expressed in a technology independent fashion and then combined with technology failure rate information to build a model for a full SoC. This is followed by the conclusions.

Fig. 2: SoC Design Flow

II. RELIABILITY INFORMATION INTERCHANGE FORMAT

The RIIF language was first presented in [3] and can be used to write parameterized, object-oriented models that describe the reliability and failure modes of electronic components or systems.

The basic element in the RIIF language is the component. A component-based approach enables scalability and modularity. Complex models are constructed from a hierarchy of simpler components. If one component in the system changes, for example a new SRAM is substituted in the design, it is only necessary to substitute a new RIIF model for this component in the hierarchical model.

The susceptibility of transistors to failures is highly dependent on the operating conditions such as voltage and temperature. For this reason, RIIF components are parameterized. In the RIIF language, parameters are used to specify environmental conditions that vary through the lifetime of the component such as temperature or voltage.

Most IP components are highly customizable through Verilog parameters (VHDL generics) or other means. Examples of such parameters might be the size of a cache or the type of selected memory protection scheme (e.g. parity versus ECC). The corresponding reliability models must reflect these design parameters. There may also be operational aspects of a design that vary through its lifetime, such as the workload or memory utilization, which must also be reflected in the reliability model.

Each component within a reliability model has a set of failure modes. A failure mode can be viewed as the occurrence of an undesirable event within the component. In a memory, the most common failure modes due to soft-errors are single bit errors (SBEs), multiple bit errors (MBEs) and single event functional interrupts (SEFIs). Memories also have semi-permanent failure modes caused by mechanisms such as NBTI or HCI. In a RIIF model, for each failure mode, the rate of occurrence of the failure is expressed as a function of the component’s parameters.

In the RIIF language, complex components can instantiate child components. The complex component must then express how the failure modes in the child components propagate. This is achieved through aggregation functions. The most basic form of aggregation is a simple summing of the failure rates of the lower level failure modes, which corresponds to a series system with no redundancy. In parallel systems with redundancy, certain components can fail without impacting system operation. Practical systems are a combination of parallel and series components and there is a well developed theory for analyzing the reliability of such systems [4].
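The two basic aggregation cases can be illustrated numerically. The following Python fragment is only a sketch and is not RIIF syntax: the function names are hypothetical, a constant failure rate (exponential lifetime) and independent children are assumed, and the parallel case assumes full redundancy.

```python
import math

FIT_HOURS = 1e9  # 1 FIT corresponds to one failure per 10^9 device-hours


def series_fit(child_fits):
    """Series system with no redundancy: any child failure is a system
    failure, so the child failure rates simply add."""
    return sum(child_fits)


def reliability(fit, hours):
    """Probability of surviving `hours`, assuming a constant failure rate."""
    return math.exp(-fit / FIT_HOURS * hours)


def parallel_reliability(child_fits, hours):
    """Parallel system with full redundancy: the system fails only when
    every child has failed (children assumed independent)."""
    prob_all_failed = 1.0
    for fit in child_fits:
        prob_all_failed *= 1.0 - reliability(fit, hours)
    return 1.0 - prob_all_failed
```

For the series case the result is again a failure rate, so it can be fed directly into the next level of the hierarchy. For the parallel case the result is a survival probability over a stated mission time, since a redundant system no longer has a constant failure rate.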

III. WORKED EXAMPLE

In this section, we present a worked example of how a technology independent RIIF model of a soft processor core can be combined with RIIF models for the failure rates in a target technology.

A. Faults in Processor Cores

The high-level failure modes of a processor core are typically identified as Silent Data Corruption (SDC) and Detectable Uncorrectable Errors (DUE). SDC occurs when a soft-error (or other silicon failure) in a processor is not detected and the result of the computation is incorrect, with no indication of the error to the end user. SDC is highly problematic and designers of high-reliability processors make all efforts to minimize the rate of SDC. The end effect of SDC depends on the nature of the application running on the processor and the system-level detection mechanisms. DUE occurs when a processor detects an error but is not able to recover. DUE may occur as the result of parity errors or other detection mechanisms. Typically the processor core must be reset to recover after a DUE, although in some cases an exception handler can gracefully recover from the error.

The SoC integrator using the core is interested in determining the absolute FIT rates of SDC and DUE in their target technology. The SoC integrator will communicate these absolute FIT rates to their customers, where they will feed into the system level reliability analysis (e.g. FMEA, etc.). The SoC integrator, through their relationship with the foundry and the memory IP provider, has access to the technology failure rate (e.g. FIT/Mbit for memories, flip-flops, etc.) but lacks detailed knowledge of the processor core architecture. The technology failure rate data is typically sensitive and cannot be shared with the IP core provider. In this example, we show how the provider of the core can create a technology-independent RIIF model of the core. Users of the core can then extend this model with technology data in order to get a full reliability model for their implementation.

B. Soft CPU Core Model

In this example, we assume there are two sources of errors at the technology level: soft-errors in the cache memory and soft-errors in flip-flops. This assumption is made to simplify the example; however, the RIIF language is not limited to modeling soft-errors.

The code for the reliability model of the CPU core is shown in figure 3. There is a component declaration, various parameter declarations and the definition of the failure modes. The rates of the failure modes are calculated from the parameters.

Fig. 3: RIIF Model of a Soft CPU Core

The model declares parameters to represent the intrinsic technology failure rates (e.g. single bit and multi-bit errors in the cache SRAM and single bit upsets in the flip-flops). However, since the model is technology independent, numeric values are not assigned to these parameters.

The effect of errors in the cache memory depends on the protection scheme the user of the IP has selected in their design: none, parity or SEC-DED ECC. In the RIIF model, two parameters (CACHE ECC ENABLED, CACHE PARITY ENABLED) determine the type of protection. With no protection scheme, all cache errors result in silent data corruption. With parity, single bit errors (SBEs) result in a detectable error (DUE) but multiple bit errors (MBEs) are undetected^1. With SEC-DED ECC, single bit errors are corrected and multi-bit errors (MBEs) produce a DUE. Since the absolute rates of SBEs and MBEs in the cache memory are not known to the IP provider, parameters are declared to represent these values, but no values are assigned. Using the parameters and equations based on the ternary operator, the relationship between the technology failure rates and the system effect is embedded in the RIIF model.
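The mapping from protection scheme to failure mode described above can be written out as a pair of conditional expressions. The Python sketch below is not the RIIF code of figure 3; the function and argument names are hypothetical, and we assume that ECC takes precedence if both protection flags are set.

```python
def cache_failure_rates(sbe_fit, mbe_fit, ecc_enabled, parity_enabled):
    """Return (sdc_fit, due_fit) for the cache.

    sbe_fit and mbe_fit are the technology rates of single bit and
    multiple bit errors; they stay symbolic in the IP provider's model
    and are supplied later by the user of the core.
    """
    # Undetected errors: nothing with ECC, MBEs with parity, all errors
    # when the cache is unprotected.
    sdc_fit = 0.0 if ecc_enabled else (
        mbe_fit if parity_enabled else sbe_fit + mbe_fit)
    # Detected but uncorrected errors: MBEs with ECC, SBEs with parity.
    due_fit = mbe_fit if ecc_enabled else (
        sbe_fit if parity_enabled else 0.0)
    return sdc_fit, due_fit
```

Changing the protection scheme only moves errors between the SDC and DUE categories, or removes corrected errors from both; the technology rates themselves are untouched.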

For memories, the relationship between soft-error events and their effect depends on the memory occupancy and has been studied [5]. For flip-flops, establishing the relationship between faults and their effects is more challenging. In most processor designs, the majority of flip-flop upsets are masked [6]. The notion of Architectural Vulnerability Factor (AVF) is used to express the fraction of faults that manifest themselves as user visible errors. In a processor, units that only provide performance improvement, such as a branch prediction unit, have an AVF that is near zero, since an error does not impact correct program execution. However, units such as the program counter (PC) are highly sensitive and have an AVF that is close to 100% since virtually any upset will produce incorrect execution. The designers of the processor core have knowledge of the micro-architecture and how instructions are executed and can thus calculate the AVF. Separate AVF values can be calculated for SDC and DUE, to express the fraction of faults that produce each type of failure.

There are various techniques for calculating AVF, including the application of Little’s Law, the use of fault models and fault-injection simulations [7]. In this example, we assume that the designers of the core have performed an AVF analysis and that the resulting AVF values are embedded in the processor reliability model as constants. We also present a flat reliability model for the processor. A more complete model would be hierarchical, with each of the functional units in the processor modeled as separate child components.

Because the processor core is targeted to safety-critical applications, the designers have identified a set of critical flip-flops. These flip-flops have a high AVF and are likely to trigger observable errors. This set of critical flip-flops may be determined subjectively based on the designers’ knowledge of the micro-architecture, through fault-injection analysis or through the application of specialized SER analysis tools such as SoCFIT [8], [9]. Users of the core have the option to reduce the failure rate by mapping the critical flip-flops to a hardened, low SER flip-flop cell [10], [11] during synthesis.

The model thus has two flip-flop technology parameters: CPU FF NORMAL FIT RATE and CPU FF CRIT FIT RATE. If hardened cells are used for the critical flip-flops, the lower SER value can be assigned to the parameter CPU FF CRIT FIT RATE. Otherwise, if no hardened flip-flop cells are used, the user assigns the same FIT rate value to both parameters.

Finally, the rates of SDC and DUE for the processor are calculated by summing the contributions from cache errors and flip-flop errors. A full model would also include the contribution of other failure mechanisms. This reliability model for the soft CPU core captures the information about how faults in different parts of the core (e.g. flip-flops, cache memory) propagate and manifest themselves (e.g. DUE, SDC), while the model itself remains technology independent.
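The final summation can be sketched as follows. This Python fragment is illustrative and is not the RIIF code of figure 3. The AVF constants are hypothetical placeholders rather than results of an actual AVF analysis, and we assume that the two flip-flop arguments (standing for CPU FF NORMAL FIT RATE and CPU FF CRIT FIT RATE) are the aggregate raw upset rates of the normal and critical flip-flop groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Avf:
    """Fraction of raw upsets that end up as SDC or as DUE."""
    sdc: float
    due: float


# Hypothetical constants that the core designers would embed in the model.
AVF_NORMAL = Avf(sdc=0.05, due=0.02)
AVF_CRIT = Avf(sdc=0.5, due=0.3)


def cpu_failure_rates(cache_sdc_fit, cache_due_fit, ff_normal_fit, ff_crit_fit):
    """Return (sdc_fit, due_fit) for the whole core.

    Flip-flop upsets are derated by the AVF of their group and then
    summed with the cache contributions (series aggregation).
    """
    ff_sdc = AVF_NORMAL.sdc * ff_normal_fit + AVF_CRIT.sdc * ff_crit_fit
    ff_due = AVF_NORMAL.due * ff_normal_fit + AVF_CRIT.due * ff_crit_fit
    return cache_sdc_fit + ff_sdc, cache_due_fit + ff_due
```

Mapping the critical flip-flops to a hardened cell corresponds to passing a smaller value for `ff_crit_fit`; because that group carries the highest AVF, the reduction shows up almost directly in the core-level SDC and DUE rates.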

C. SRAM and Flip-Flop RIIF Models

Foundries as well as providers of cell libraries and hard IP need to determine the sensitivity of their designs to soft-errors and other failure mechanisms and communicate this data to their customers. The soft-error susceptibility can be calculated using cell-level simulation tools such as TFIT [12], [13]. It is important to note that the rate of soft-errors depends on voltage, particle flux and to a lesser extent on temperature. Thus, it is not possible to simply quote a single numeric value. Rather, a table or function that expresses the SER in terms of these operating parameters must be provided. This is especially true in systems which employ dynamic voltage scaling (DVS), where the voltage is continuously adjusted to minimize energy consumption.

Fig. 4: RIIF Model for Memory SER

In industry, it is common practice to communicate technology SER data through written reports or spreadsheets. An alternative is to capture the SER behaviour in a RIIF model as shown in figure 4. The SEU model shown here is based on [14]. Alternatively, it could have been expressed using a table-lookup based on empirical test data. The key is that a standard template defines the parameters and failure modes that are common to a class of component, such as an SRAM. The provider of a specific SRAM then extends this template to specify the actual FIT rates of the SRAM implemented in a specific process technology. By using a parameterized model, the variation of the FIT rate with voltage and neutron flux is preserved. The SRAM model shown here has been simplified for illustrative purposes. A more complete SER model for an SRAM would include the alpha SER contribution, the dependence on the stored data values and the dependence on memory activity (static vs. dynamic).
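The table-lookup alternative mentioned above can be sketched in a few lines. The Python fragment below is not the model of figure 4 or of [14]: the table values are placeholders and not measured data, the names are hypothetical, and we assume linear interpolation in voltage together with a rate that scales linearly with the neutron flux, expressed relative to the flux at which the table was characterized.

```python
from bisect import bisect_right

# Placeholder characterization table, NOT measured data:
# (supply voltage in V, SBE rate in FIT/Mbit at the reference flux).
SBE_TABLE = [(0.8, 400.0), (0.9, 300.0), (1.0, 240.0), (1.1, 200.0)]


def sram_sbe_fit(voltage, relative_flux, size_mbit, table=SBE_TABLE):
    """SBE rate in FIT of an SRAM instance at the given operating point."""
    volts = [v for v, _ in table]
    if not volts[0] <= voltage <= volts[-1]:
        raise ValueError("voltage outside the characterized range")
    # Index of the upper end of the table segment containing `voltage`.
    i = min(bisect_right(volts, voltage), len(table) - 1)
    (v0, r0), (v1, r1) = table[i - 1], table[i]
    # Linear interpolation between the two neighbouring table entries.
    fit_per_mbit = r0 + (r1 - r0) * (voltage - v0) / (v1 - v0)
    return fit_per_mbit * relative_flux * size_mbit
```

Because the result remains a function of `voltage` and `relative_flux`, a user of the model can evaluate it at each operating point of a DVS scheme instead of relying on a single quoted value.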

Similar to the SRAM model, the provider of the cell library must provide a RIIF model which describes the SER of the flip-flops in the cell library. Assuming the library contains a regular and a hardened (e.g. DICE) flip-flop, then there would be two such models. If the library contains multiple flip-flops, each one can have an associated model.

D. Combining the Models

The SoC integrator needs to calculate the rate of occurrence of high-level failures such as SDC and DUE when the processor core is implemented in the target technology. Fundamentally, these rates remain a function of voltage and other operating parameters, so it is not possible to represent the technology failure rate as a fixed numeric value.

Fig. 5: RIIF Code for CPU Core Implemented in 45nm Technology

The object-oriented RIIF models for the soft IP core (figure 3) and the technology components (figure 4) can be combined succinctly using the notion of multiple inheritance. A new component object (CPU CORE 45NM) is derived from the models of the soft CPU core and the technology elements as shown in figure 5. The derived model has the union of the failure modes and parameters of the base components. In the model of the soft CPU core, the technology FIT rates were declared but no values were assigned. In the derived model, these parameters are assigned the values from the technology models. In this way, the relationship between the rate of high-level failures (e.g. SDC, DUE) and the technology operating parameters (e.g. voltage, neutron flux) is preserved. The resulting model also preserves the application parameters of the CPU core (e.g. CACHE UTILIZATION).

As a result, the user of the resulting model can make tradeoffs between power and reliability based on the operating voltage of the system. It is important to note that each technology provider only needed to provide a model for their IP and that the SoC integrator is able to combine the models with minimal effort to derive a sophisticated reliability model of the IP implemented in the target technology.
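The derivation by multiple inheritance can be mimicked in a general-purpose object-oriented language. The Python sketch below is an analogy only and does not reproduce the RIIF code of figure 5: the class and method names are hypothetical, the cache size and the voltage dependence are placeholders, and only the cache DUE contribution with parity protection is modeled.

```python
class SoftCpuCore:
    """Technology-independent core model (stand-in for figure 3)."""

    cache_mbit = 0.25            # hypothetical cache size in Mbit
    cache_parity_enabled = True  # design parameter chosen by the user

    def sram_sbe_fit_per_mbit(self, voltage):
        # The technology rate is declared but deliberately not assigned.
        raise NotImplementedError("technology rate not assigned")

    def cache_due_fit(self, voltage):
        # With parity, single bit errors are detected but not corrected.
        sbe_fit = self.sram_sbe_fit_per_mbit(voltage) * self.cache_mbit
        return sbe_fit if self.cache_parity_enabled else 0.0


class Sram45nm:
    """Technology model (stand-in for figure 4); placeholder numbers."""

    def sram_sbe_fit_per_mbit(self, voltage):
        # Placeholder trend only: the rate rises as the voltage drops.
        return 200.0 + 400.0 * (1.1 - voltage)


class CpuCore45nm(Sram45nm, SoftCpuCore):
    """Derived model: the technology class is listed first so that its
    rate replaces the unassigned declaration in the core model."""
```

An instance of `CpuCore45nm` exposes `cache_due_fit(voltage)` without either provider having seen the other's model, and the result is still a function of the operating voltage.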

In object-oriented design, inheritance diagrams are used to show class relationships. Figure 6 shows the inheritance diagram for this example. A library of base classes for standard objects (e.g. SRAMs, flip-flops, etc.) ensures consistency in the parameters and failure modes. The object-oriented approach makes it easy to move reliability models to a new process technology while preserving those portions of the model that are technology independent.

Fig. 6: Object Hierarchy of Technology Specific CPU Core

IV. CONCLUSIONS

The number of transistors integrated on a die is increasing. Through the use of HDLs and the increasing use of standard verification IP (VIP) it is possible to design and verify SoCs that integrate complex IP components from multiple sources. When these chips are used in safety critical applications, there is a need to analyze their reliability and how the effects of failures at the transistor and cell level manifest themselves at the system level.

The RIIF language provides a means to specify the reliability of complex systems using a standardized format. Models written in this language can be exchanged between industry partners such as IP providers and their customers. Using this approach, it becomes possible to efficiently assemble a reliability model for a complex SoC.

The RIIF language is being developed through a working group within the IEEE’s TTTC. The working group has a mandate to identify the reliability modeling needs across multiple industry segments and then to standardize the language. The language provides the ability to describe the operating environment of electronic components including how these parameters vary through the course of a mission. There is also work underway to create a means to exchange encrypted reliability models in order to protect sensitive information. Reliability is an increasing constraint in silicon design that needs to be accurately analyzed and optimized. In this paper, we have shown how the RIIF modelling language can be used to exchange information about the reliability of soft IP cores.

ACKNOWLEDGMENT

The authors would like to acknowledge the support of the TIMA Laboratory. A portion of this work was carried out during a research sabbatical at the TIMA Laboratory.

REFERENCES

[1] R. Wong, B. Bhuva, and A. Evans, “System-level reliability using component-level signatures,” in Reliability Physics Symposium (IRPS), 2012 IEEE International, April 2012.

[2] R. Aitken, “Modeling soft-error susceptibility for IP blocks,” in On-Line Testing Symposium (IOLTS), 2005 IEEE 16th International, July 2005, pp. 70–73.

[3] A. Evans, M. Nicolaidis, D. Alexandrescu, and E. Costenaro, “RIIF - reliability information interchange format,” in On-Line Testing Symposium (IOLTS), 2012 IEEE International, July 2012.

[4] B. Johnson, Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley Publishing Company, 1989.

[5] D. Alexandrescu, “A comprehensive soft error analysis methodology for SoCs/ASICs memory instances,” in On-Line Testing Symposium (IOLTS), 2011 IEEE 17th International, July 2011, pp. 175–176.

[6] H. Nguyen, Y. Yagil, N. Seifert, and M. Reitsma, “Chip-level soft error estimation method,” Device and Materials Reliability, IEEE Transactions on, vol. 5, no. 3, pp. 365–381, Sept. 2005.

[7] S. Mukherjee, Architecture Design for Soft Errors. Morgan Kaufmann, 2008.

[8] S. Wen, D. Alexandrescu, and R. Perez, “A systematic method of quantifying SEU FIT,” in On-Line Testing Symposium (IOLTS), 2008 IEEE 14th International, 2008.

[9] D. Alexandrescu and E. Costenaro, “Towards optimized functional evaluation of SEE-induced failures in complex designs,” in On-Line Testing Symposium (IOLTS), 2012 IEEE International, 2012.

[10] T. Calin, M. Nicolaidis, and R. Velazco, “Upset hardened memory design for submicron CMOS technology,” Nuclear Science, IEEE Transactions on, vol. 43, no. 6, pp. 2874–2878, Dec. 1996.

[11] N. Seifert, V. Ambrose, B. Gill, Q. Shi, R. Allmon, C. Recchia, S. Mukherjee, N. Nassif, J. Krause, J. Pickholtz, and A. Balasubramanian, “On the radiation-induced soft error performance of hardened sequential elements in advanced bulk CMOS technologies,” in Reliability Physics Symposium (IRPS), 2010 IEEE International, May 2010, pp. 188–197.

[12] D. Alexandrescu, E. Costenaro, and M. Nicolaidis, “A practical approach to single event transients analysis for highly complex designs,” in Defect and Fault Tolerance in VLSI Systems (DFT), 2011 IEEE 26th International Symposium on, 2011, pp. 155–163.

[13] H. Belhaddad, “Circuit simulations of SEU and SET disruptions by means of an empirical model built thanks to a set of 3D mixed-mode device simulation responses,” in Radiation and Its Effects on Components and Systems, 2006. RADECS 2006. 8th European Conference on, 2006.

[14] N. Seifert, P. Slankard, M. Kirsch, B. Narasimham, V. Zia, C. Brookreson, A. Vo, S. Mitra, B. Gill, and J. Maiz, “Radiation-induced soft error rates of advanced CMOS bulk devices,” in Reliability Physics Symposium Proceedings, 2006. 44th Annual., IEEE International, March 2006, pp. 217–225.

^1. A more refined model could reflect the fact that MBEs that produce an odd number of upsets are detectable with parity.