Easing the Development of Functionally Safe Controllers

James Scobie, ARM
February 2017

Abstract

There is an increasing demand for safety, security and reliability in applications ranging from autonomous cars and engine controllers to industrial or medical robotics. This comes at a time when the complexity of these systems is increasing with a corresponding expansion in their software complexity. This is resource intensive to integrate and validate. Many of these applications control real-time systems where responsiveness and determinism are critical for safe operation; so being able to maintain this, while meeting all other objectives, can be very challenging. Systems with these characteristics are set to enjoy significant growth in demand over the coming years, which brings new processors such as the ARM Cortex-R52 into sharper focus. This paper identifies the need for increased adoption of functional safety and describes how processors can help address the high level of criticality more simply. It details how the provision of fault detection and control features for random errors can be offered together with protection against systematic software errors, while meeting the demanding high-performance and deterministic execution.

Introduction

There is a trend towards increasing complexity in systems across multiple markets, including automotive, industrial control and healthcare. Medical systems provide continuous monitoring and autonomous drug control or provide improve precision in surgical procedures. Industrial robots are becoming increasingly collaborative and vehicles are beginning to drive themselves. More of these systems are demanding functionally safe operation and functional safety to be provided at a higher level.. Some developers are very familiar with creating products to address these requirements, yet for others it’s a new field.

What is functional safety?

Unlike primary safety where the risk of harm is caused by the system itself, functional safety is dependent on the correct operation of the system to prevent an unsafe event, or a hazard, from occurring. In some systems functional safety is immediately obvious. For example, in the braking system of a car, it is clear that the brakes need to work for the car to be safe, applying the correct force without excessive delay.

Antilock Braking Systems (ABS) was introduced more than 40 years ago, and the system has continued to be developed with increasing functionality, placing rising demand on the functionally safe operation. As complexity has increased, systems have taken greater levels of control from the driver, which in turn demands that more and more of the system is functionally safe. In the case of vehicle braking, ABS has moved to Electronic Stability Control (ESC), adjusting individual brakes pressure to maintain directional control. Braking has further advanced to Assisted Braking, where the system is activated and the brakes are primed, ready to deploy if an emergency stop is required. More recently, the vehicle brakes can be applied without any driver action. In Advanced Emergency Braking (AEB) the detection of an imminent collision will result in the brakes being applied automatically to reduce crash severity – if not avoid it.

The demand for functional safety is rising

The automotive industry has been delivering solutions for functional safety for several vehicle generations.

The demand for critical systems, such as airbag, engine and gearbox controllers, continues; but it is added to by new applications also requiring functionally safe capability and placing greater demands on the chips used in the control these systems.

Global demand for electric powered vehicles is rising, where the operation of high-power traction motors and battery control needs to be safely managed. When you consider the lithium-ion battery in the Tesla Model S has the same amount of energy as 77kg of dynamite, it shows the impact an operational error could have. In addition, the growth in connected vehicles and fast planned rollout of autonomous vehicle systems is driving the need for greater functional safety. In addition to the considerable challenges of autonomous operation, there are increased opportunities for errors. Unlike airbags and ABS, which are required to be available to operate on an infrequent and exceptional basis, systems will be depended upon for long periods of time to control the vehicle’s operation. The net result is that the embedded controllers enabling these applications are being pushed for higher performance, while continuing to deliver fault detection and control mechanisms to enable safe operation.

Accelerated adoption of autonomous systems is not limited to automotive applications. Advanced Driver Assistance Systems (ADAS) and Highly Autonomous Drive (HAD) needs have parallels in other market segments, such as industrial and medical robotics.

Easing the path to functionally safe controllers

It is not possible to claim that a processor is safe. This is because the CPU is just one element within the system and it needs to be used in an appropriate manner to enable safe operation. With a programmable system, it is feasible for the processor to work as instructed and without fault, but to be commanding operations without appropriate checks and controls resulting in hazardous operation. In addition, during the development of a new processor there is not full knowledge of how the IP will be used or even the specific systems in which it will be deployed. In order to be suitable for use in a functionally safe system, the processor is developed with assumptions about what requirements are supported by the CPU; those which the wider system will support and the interaction between them. This hierarchical approach to functional safety is a common method employed in the standards and requires careful documentation. In the case of ISO 26262, the CPU is termed as a Safety Element out of Context (SEooC). The processor is able to support functional safety in the way it addresses random errors that occur during its operation and systematic faults. It can achieve this through a combination of its functional safety features and its development process.

There are two types of issues that can occur: random errors and systematic faults. A random error can occur at any time across the device and can be either transient or permanent. Errors can be caused by a range of sources, including electro-migration for hard faults or transient faults from alpha particles. These result in an incorrect value on a cell within the design. Statistical data can be created based on historical information and so predict the type of defect enabling the probability of their occurrence. Typically, counter measures are included in the hardware to detect and protect against these random errors. There are a range of different features which can be used to help detect these errors.

Processors with the ability to support a lockstep copy enable a check on the processor’s execution by using the same code and without the need to modify the software. The results of each execution cycle are compared between both processors with any differences flagged for action to recover. Implementing a short temporal offset between each of the CPU’s operations, such as a two-cycle delay helps eliminate common-cause transient effects. As well as the processor’s operation being checked with a lockstep copy, the protection must often extend to the interrupt controller, and making a lockstep copy of this IP available simplifies the solution in a part of the system where latency is critical.

The availability of core self-test libraries supporting run-time execution enables partners to provide run time diagnostic coverage where dual-core lockstep cores are not needed. Executing the self-test library at start-up or following the detection of errors as well as at run time for latent defect detection will help all system configurations, lockstep and non-lockstep, with error diagnosis and support fault recovery.

Using Error Correcting Codes (ECC) to detect or in some cases enable correction of data stored in memory such as caches, or Tightly Coupled Memory (TCM) within the processor increases fault detection. Applying these checks to the data and address transactions taking place across the bus also improves fault coverage. Providing this capability within the processor enables its own memories to be protected and those external to the core. The processor encodes and decodes the transactions automatically; simplifying the tasks of the chip developer.

Fault management can be simplified by capabilities within the processor that captures and records faults as they occur, to enable more sophisticated error management. This allows the processor to more readily continue operation, or decide to take action to mitigate against the fault such as reset or reconfigure itself if necessary based on the gathered information.

The Cortex-R52 processor is ARM’s most advanced processor for safety, implementing the highest level of features and developed from start to finish within a robust process for safety certification. It is enabled with fault detection and management features described above, as well as the availability of a software self-test library. The software test library together with the fault detection and management features, delivered with the processor, can help implementers develop solutions capable of addressing ASIL D applications.

Some of the most difficult errors to detect are those that occur due to systematic errors. Processors can implement some features to protect against these types of faults, but systematic errors are often produced through human error or the use of tools during the development of the system and can occur at any stage in the product’s life cycle. These faults can be due to the hardware implementation, the software running on that hardware or a combination of both. They must be fundamentally addressed in the way the solution is developed. Unlike random faults, it is difficult to predict their occurrence, but once a systematic error is created, it will always manifest itself given the appropriate conditions. Due to the difficulty in predicting these faults, protection against systematic errors is achieved by the adoption of appropriate processes, procedures and best practices.

ARM processors are tailored based on the target markets’ safety requirements, with two categories used. Standard support provides documentation on the processor, how its capabilities should be used for functional safety, what happens in the presence of a fault and how to integrate the core. Extended support for use in applications where ISO 26262 ASIL D or IEC 61508 SIL 3 would be applied offers the greatest level of detail in the safety package and is matched to the processor’s capabilities to support the highest levels of functionally safe operation. A processor with extended support provides a documentation package that is independently assessed.

In addition to procedures to guard against systematic faults, software errors can also be guarded against by the provision of features in the processor. The software running on the processor may come from a range of different providers and some of the functions may not have a specific safety function. When combined on a processor where safety functions are being performed, such a function could impact safe operation. For example, a rogue task could inadvertently corrupt data used by a safe task, or could take control of a resource such as a peripheral preventing a safety related task being correctly completed. As software complexity increases, these interactions can become harder to identify and test at the same time as the system demands increased levels of functional safety.

A protection mechanism to prevent unintended interactions between software is advantageous when coupled together with the capability of supporting different criticalities on the same controller. However, for real-time systems, this cannot be at the expense of determinism or power. Software protection can be attempted with some existing processors, but this introduces indeterminism. It can be reduced in some cases and for some systems, but always requires careful and time-consuming consideration of resource allocation and the inevitably associated restrictions. Alternatively, this can be attempted using software, but requires robust custom software to be developed. In both cases, there are demands created for additional processing performance, which has an impact on power consumption. For software, precious bandwidth is used to complete these tasks, taking away from the functionality of the application. In real-time systems where determinism is important a variation in this time can result in extending the worst case execution time. Where this results in missing the required system schedules additional performance overhead must be added to accommodate this change.

As part of the ARMv8-R architecture, the Cortex-R52 processor provides hardware support for virtualization. Virtualization enables isolation of different parts of the software by creating multiple safety isolated sandboxes for the functions to exist within. This virtualization is achieved with an additional exception level at which monitor software or a hypervisor can run, allocating resources and managing the software interaction. A Berlin-based firm, OpenSynergy, recently announced their development of the industry’s first hypervisor for the Cortex-R52. The concept of virtualization can sometimes be perceived as complex, and can be misunderstood to have a large penalty for frequent switching of contexts or reduced determinism. This is not the case for the Cortex-R52 processor. The hypervisor can be light-weight with fast context switches. This is due to a second-stage Memory Protection Unit (MPU) providing the mechanism for resource protection, which can be accessed directly and rapidly by the processor providing fast deterministic performance – which is very valuable in these real-time applications. Although address translation is not offered through the MPU, the applications it is targeted to be deployed in have software that is pre-composed and does not demand the need for a virtual address space.

Conclusion

The effort required to develop a device that enables safety critical applications is considerable; it demands effort to validate and certify the product so that it can be subsequently used in an end system. It is increasingly valuable for specific elements of this work to be available for complex IP, such as processors. It eases the integration of the CPU into these devices, while offering solutions within the processor that can be tailored to the needs of the device. Simplifying chip manufacturers task and reducing the effort necessary to develop products for functional safety can be achieved through the availability of a processor where the following capabilities have are supported:

fault detection and management
software self-test libraries
software separation

Furthermore, the rapid increase in software complexity places additional demands on features to protect interactions and allow different software to be integrated together more simply

Industry Articles

Easing the Development of Functionally Safe Controllers