By Gregory Baudet, Sebastien Rabou (Barco Silex)
In this paper, we describe a versatile IP core providing cryptography and security, complemented with a software wrapper including the necessary low-level drivers and communication interfaces between the Linux OS and OpenSSL, the most widely used cryptographic library in embedded systems. The framework has been developed to provide both ASIC- and FPGA-based systems with very fast and scalable hardware cryptography and security. This solution completely offloads cryptographic operations from the processor and significantly increases the cryptographic performance. An additional requirement was to make this hardware layer transparent to the developers of embedded applications, and to provide them with a software bundle that enables a straightforward and fast integration of hardware-based TLS, even in existing SoC applications.
Today, more electronic systems than ever before are developed on programmable logic, such as the new SoC-FPGAs from Altera and Xilinx. One of the drivers of this renaissance is ubiquitous wireless connectivity for the Internet of Things.
For many of these systems, easy-to-integrate security is essential, since they handle and transmit data that should remain confidential (medical, financial, military…), or signals that should not be tampered with (car-to-car or car-to-infrastructure, satellites, industrial equipment, sensors…).
Many of these applications have short development cycles, and are designed and implemented by software experts. Running cryptography algorithms in software, however, may require much of the available computational power, and result in poor performance and less-than-ideal security.
The solution is to implement cryptography in hardware, offloading all authentication and encryption operations from the application processor. A resource-saving solution is adding and reusing well-designed third-party IP cores, under the condition that these blocks are scalable, flexible, and easy to integrate. Additionally, they have to offload these cryptographic operations very efficiently.
So in a first phase, we built a complete set of IP blocks that make a well-dimensioned, flexible, and very fast crypto hardware engine that completely offloads the crypto operations. This solution has already been silicon-proven in a wide range of applications, including financial transaction systems, state-of-the-art video streaming and display, or energy-efficient secured communication ICs.
However, adding hardware blocks and interfacing them requires additional development cycles and hard-to-find expertise, which may seriously impact the time-to-market and profitability of systems. So to bring hardware security within the reach of many more future and existing applications, we have extended our crypto IP blocks with a software framework that makes the hardware layer fully transparent, and that effectively interfaces the full OpenSSL functionality to the power of a cryptographic coprocessor.
II. Embedded security – limits of a software implementation
The cryptographic protocols that provide security and data integrity for communications over TCP/IP networks such as the Internet are the Transport Layer Security (TLS) protocol and its predecessor, the Secure Sockets Layer (SSL). TLS/SSL provide endpoint authentication and confidentiality of the communication using cryptography.
TLS/SSL have grown to be the basis of secure communication on the Internet of computers, e.g. securing home banking, online shopping, and virtual offices. But they have also grown to be standards for communication between embedded systems, now morphing into the Internet of Things.
The requirements and needs of today’s SoC applications vary widely. On the one hand, we may have a hidden sensor that only needs short bursts of encrypted communication in symmetric mode to a server once every minute. On the other hand, an application on the server side may have to handle tens to hundreds of secured connections per second, authenticating and setting up asymmetric encryption. Additionally, typical traffic of 1 Gbit/s may have to be encrypted/decrypted.
If TLS/SSL is provided completely in software and there is frequent authentication and moderate to heavy cryptography, the processor can be loaded at 80-100% with the necessary calculations. This may seriously degrade the performance of the SoC application, halting it altogether at times. For instance, the RSA-2048 operation that is commonly used in the authentication and key exchange phase requires 84 million 16-bit x 16-bit multiplications. Moreover, symmetric operations such as AES-256 may also choke the processor’s performance. An ARM Cortex-A9 core at 800 MHz reaches at most 160 Mbit/s at a processor usage of 100%, meaning that no other application is able to run at the same time.
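To give a feeling for where a figure of this magnitude comes from, the following back-of-the-envelope sketch counts 16-bit word multiplications for one RSA-2048 private-key operation. The cost model is an assumption and not taken from the measurements in this paper: plain square-and-multiply without CRT, and a Montgomery multiplication costing roughly 2n² + n word products for n-word operands. The exact count depends on the algorithms chosen, so the result should only be read as an order of magnitude.

```python
# Rough estimate of 16-bit x 16-bit multiplications in one RSA-2048
# private-key operation. Assumed cost model (not from the paper):
# square-and-multiply, no CRT, Montgomery multiplication at 2*n*n + n
# word products per modular multiplication.

KEY_BITS = 2048
WORD_BITS = 16


def rsa_word_multiplications(key_bits=KEY_BITS, word_bits=WORD_BITS):
    n = key_bits // word_bits          # words per operand (128 here)
    per_modmul = 2 * n * n + n         # word products per modular multiplication
    squarings = key_bits               # one squaring per exponent bit
    multiplies = key_bits // 2         # about half of the exponent bits are set
    return (squarings + multiplies) * per_modmul


if __name__ == "__main__":
    print(f"{rsa_word_multiplications():,} word multiplications")
```

Under these assumptions the estimate is about 100 million word multiplications, the same order of magnitude as the 84 million quoted above.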
A second issue is the true random number generator (RNG), a key concept in cryptography, used to generate keys that must not be predictable or repeatable, even to the most sophisticated attackers. With algorithmically derived random or pseudo-random numbers, there is a risk that hackers manage to predict the keys, and thus compromise the application. A hardware implementation of a true RNG can be based on statistically random physical phenomena to which attackers have no access.
In addition, software cryptography will leak more critical information, making it easier for hackers to set up successful attacks. And even when no direct information is available, software operations are still more prone to side-channel attacks.
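The predictability problem can be illustrated with a few lines of Python. This is only a sketch: `predictable_key` and `os_entropy_key` are hypothetical helper names, and the operating system's entropy source used in the second function is a cryptographically secure generator, not necessarily a hardware true RNG (on Linux it can be fed by one).

```python
import random
import secrets


def predictable_key(seed, nbytes=32):
    """Key from a seeded PRNG: anyone who learns or guesses the seed
    can regenerate exactly the same key."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(nbytes))


def os_entropy_key(nbytes=32):
    """Key drawn from the operating system's entropy source."""
    return secrets.token_bytes(nbytes)


if __name__ == "__main__":
    # Same seed, same key: the output is fully determined by the seed.
    print(predictable_key(2024) == predictable_key(2024))
    # Two draws from the OS source are independent of each other.
    print(os_entropy_key() == os_entropy_key())
```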
III. A flexible crypto hardware platform for SoCs
To address these issues, we have developed a suite of IP blocks for ASIC or FPGA applications, covering all complex cryptographic calculations needed to run TLS/SSL. These include algorithms such as RSA/ECC, AES, SHA, and true random number generation. The RSA/ECC operations are used during the authentication phase between servers and clients but also during key exchanges (Diffie-Hellman algorithms). Once both sides have established a common secret (i.e. a symmetric key), the AES algorithm, which may be combined with a hash function (SHA-1/SHA-2), is used to encrypt/decrypt and authenticate the data on both sides of the communication channel.
Just diverting these algorithms to a hardware accelerator does not necessarily improve the performance. The necessary operations on large numbers are automatically partitioned into elementary operations on smaller values. This involves many data transfers between the accelerator and memory to access all operands and intermediate results. If the hardware accelerator is not able to collect the data and write back the results itself, the embedded CPU must handle these data transfers. Since data processing is continuous, a solution without direct memory access (DMA) capability results in a constant CPU load and an occupied common data bus.
In contrast, we developed our IP cores to result in a near 100% offload of the processor. This includes a built-in scatter/gather DMA and a scalable data path built on a highly pipelined implementation. The IP core for asymmetric operations even has an internal micro-coded sequencer.
At the same time, this technique also allowed us to size the crypto cores according to the requirements of very diverse applications, which would not have been possible with an implementation consisting of straightforward finite-state machines. The result is an optimal trade-off between footprint (silicon cost) and utilization for all applications, a feature that places our IP cores among the most efficient in the industry in terms of speed and performance per footprint.
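The two phases described above can be sketched with the Python standard library. This is a toy illustration under stated assumptions, not the TLS handshake itself: the group parameters `P` and `G` are far too small for real use, the key derivation is a single SHA-256 hash rather than the TLS key schedule, and the AES encryption step is omitted because the standard library has no AES (only the HMAC-SHA-256 authentication step is shown).

```python
import hashlib
import hmac
import secrets

# Toy Diffie-Hellman group, for illustration only (assumption): a 127-bit
# Mersenne prime and a small base. Real deployments use standardized groups
# of 2048 bits or more, or elliptic curves.
P = 2**127 - 1
G = 3


def keypair():
    private = secrets.randbelow(P - 3) + 2     # private value in 2 .. P-2
    public = pow(G, private, P)                # modular exponentiation
    return private, public


def shared_key(own_private, peer_public):
    secret = pow(peer_public, own_private, P)  # same value on both sides
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


def tag(key, message):
    # Authenticates the message with the symmetric key (HMAC-SHA-256).
    return hmac.new(key, message, hashlib.sha256).digest()


# Phase 1: asymmetric key exchange, each side only sends its public value.
client_priv, client_pub = keypair()
server_priv, server_pub = keypair()
client_key = shared_key(client_priv, server_pub)
server_key = shared_key(server_priv, client_pub)

# Phase 2: symmetric protection of the data with the common secret.
message = b"sensor reading: 21.5 C"
client_tag = tag(client_key, message)
```

The expensive step in phase 1 is the modular exponentiation (`pow` with three arguments), which is the kind of operation the asymmetric IP core takes over; phase 2 corresponds to the AES and SHA blocks.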
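The partitioning into elementary operations can be made concrete with a schoolbook multiplication on 16-bit words. The sketch below is a software model for illustration only; it does not describe the data path of the IP core, and the helper names are hypothetical. Every pass through the inner loop reads two operand words and updates one intermediate result word, which is exactly the kind of traffic that must be fed either by the CPU or by a DMA engine.

```python
WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1


def to_words(value, nwords):
    """Split an integer into nwords 16-bit words, least significant first."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(nwords)]


def from_words(words):
    """Reassemble an integer from 16-bit words, least significant first."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def multiply(a, b, nwords):
    """Return (a * b, number of 16-bit x 16-bit multiplications used)."""
    aw, bw = to_words(a, nwords), to_words(b, nwords)
    result = [0] * (2 * nwords)
    count = 0
    for i in range(nwords):
        carry = 0
        for j in range(nwords):
            # One elementary operation: multiply two words, accumulate.
            t = result[i + j] + aw[i] * bw[j] + carry
            result[i + j] = t & MASK
            carry = t >> WORD_BITS
            count += 1
        result[i + nwords] = carry
    return from_words(result), count


if __name__ == "__main__":
    a = (1 << 2048) - 12345
    b = (1 << 2047) + 98765
    product, count = multiply(a, b, 2048 // WORD_BITS)
    print(product == a * b, count)
```

A single 2048-bit multiplication already takes 128 x 128 = 16384 word multiplications in this model, and a modular exponentiation chains thousands of such multiplications.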
In addition, we developed our IP cores for maximum ease of integration, providing industry standard interfaces such as AXI/AHB/APB that are typically used in SoC architectures in both ASIC and FPGA.
The resulting hardware solution offloads 100% of the cryptography operations from the processor to the custom logic. In addition, we also see an increase in cryptography performance, which may be as high as 100X compared to a software-only solution, depending on the application and the cryptography used.
Figure 1 – Schema of a SoC design including a processing unit and a cryptography coprocessor
IV. Interfacing the hardware cryptography transparently
As a second requirement, our crypto hardware block should be transparently callable for the SoC developer, who typically implements the application on top of the Linux OS. One of the most widely used software suites that developers use to implement security is OpenSSL. OpenSSL offers the possibility to support custom implementations of cryptographic algorithms through its engine mechanism. These engines need not be software implementations; they can also be interfaces to hardware-based coprocessors.
We decided to change as little as possible to this mechanism and to the OpenSSL calls, and to hide the interface to the crypto hardware inside OpenSSL, transparent for the developer. The resulting software architecture is a layered stack that mediates between the crypto hardware on the one hand, and OpenSSL on the other hand.
At the lowest level, we developed an OS-independent API for the crypto hardware, which provides access to the hardware registers. One layer above, we wrote the device drivers that the Linux kernel needs to access the hardware. Among other tasks, these drivers translate user-space addresses into physical addresses and handle DMA, including descriptor list generation.
These kernel drivers, however, are not directly accessible from typical user-space programs and libraries such as OpenSSL. To make this possible, we used and reconfigured CryptoDev, a dedicated Linux kernel device that enables user-space applications to access Linux drivers such as the drivers for our crypto hardware. CryptoDev supports all major cipher and hash algorithms that will be called from OpenSSL and that can be supported by the crypto hardware.
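The address translation and descriptor list generation can be modelled in a few lines. The following is a simplified sketch and not the actual driver code: the 4 KiB page size, the `page_table` dictionary (virtual page number to physical page number) and the `(physical address, length)` descriptor format are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


def build_descriptors(user_addr, length, page_table):
    """Split a user-space buffer into physically contiguous chunks.

    page_table is a hypothetical mapping: virtual page number -> physical
    page number. Returns a list of (physical_address, length) descriptors
    for a scatter/gather DMA engine.
    """
    descriptors = []
    addr, remaining = user_addr, length
    while remaining > 0:
        vpn, offset = divmod(addr, PAGE_SIZE)
        chunk = min(PAGE_SIZE - offset, remaining)   # stay inside one page
        phys = page_table[vpn] * PAGE_SIZE + offset
        if descriptors and sum(descriptors[-1]) == phys:
            # Physically adjacent to the previous chunk: extend it.
            start, size = descriptors[-1]
            descriptors[-1] = (start, size + chunk)
        else:
            descriptors.append((phys, chunk))
        addr += chunk
        remaining -= chunk
    return descriptors


if __name__ == "__main__":
    table = {0x10: 0x80, 0x11: 0x81, 0x12: 0x40}
    for phys, size in build_descriptors(0x10F00, 0x1300, table):
        print(hex(phys), hex(size))
```

A buffer that is contiguous in user space is typically scattered over non-adjacent physical pages, which is why the driver hands the DMA engine a list of descriptors rather than a single address and length.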
Figure 2 – Depending on the system architecture, the application may access different abstraction layers to use the Crypto Coprocessor
For the application engineer, calling this engine, and thus the crypto hardware, is as simple as changing the call from
openssl rsa2048 <args>
to
openssl rsa2048 <args> -engine cryptodev
V. Typical sample application using a SoC FPGA
An example of the platforms where an easy-to-integrate crypto-hardware solution brings a big advantage is the so-called SoC FPGA, such as Altera’s SoC, or Xilinx’ Zynq.
These platforms are often presented as turnkey solutions, offering developers an up-and-running Linux OS on which to start developing their applications. To add secure communication, they just have to install the OpenSSL library on top of Linux, and use the security routines in their application.
With the framework presented in this paper, they can very conveniently switch from a software-only solution to hardware, even after their application has been developed. It also allows them to test and benchmark the two approaches next to each other.
Figure 3 shows a typical setup for a SoC FPGA application. On one side, the system is accessible by clients from the network (LAN, WAN or Internet), and is potentially open to attackers. On the other side, it may be connected to, for example, a database with confidential medical information, or to critical machinery. It therefore needs reliable authentication and encryption.
Figure 3 – Use environment of ASIC/FPGA SoC with hardware acceleration
VI. Benchmark results
By adding or omitting the ‘-engine cryptodev’ flag, it is possible to analyze OpenSSL performance and CPU utilization with the cryptographic operations running either in hardware or in software.
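Such a comparison can be scripted around OpenSSL's built-in `speed` benchmark. The helper below is a hypothetical sketch that only builds the two command lines; it is not the benchmark harness used for the results in this paper. It assumes an OpenSSL build in which the `speed` command accepts `-engine`, and it adds `-elapsed` so that wall-clock time is measured, since CPU time alone is misleading when the work is done outside the CPU.

```python
def speed_command(algorithm, engine=None, evp=False):
    """Build an `openssl speed` command line as an argument list.

    algorithm: name understood by `openssl speed`, e.g. "rsa2048",
               or a cipher name such as "aes-256-cbc" when evp=True.
    engine:    OpenSSL engine id, e.g. "cryptodev"; None for software only.
    """
    cmd = ["openssl", "speed", "-elapsed"]
    if engine:
        cmd += ["-engine", engine]
    cmd += ["-evp", algorithm] if evp else [algorithm]
    return cmd


if __name__ == "__main__":
    # Print the software-only and the hardware-offloaded variant side by side.
    # To execute one of them: subprocess.run(cmd, check=True)
    for engine in (None, "cryptodev"):
        print(" ".join(speed_command("rsa2048", engine=engine)))
        print(" ".join(speed_command("aes-256-cbc", engine=engine, evp=True)))
```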
We ran a benchmark using a Cortex-A9 processor at 800 MHz, and one instance of the hardware IP. The IP can easily run at a clock frequency above 400 MHz on ASIC technology. The flexibility of the IP allows for the same performance with lower frequencies, which is well suited for low-end/mid-range FPGAs.
In a first test, we looked at the RSA-2048 operation and ECDSA-256 signature verification. We compared the software implementation against the use of a small IP block and a standard IP block. The small IP block is optimized for minimum hardware resource usage, while the standard IP block is well-balanced for good performance and limited resource usage. The results show that the gain in performance is significant even with the smaller configuration of the hardware block. Another important aspect is the CPU, which remains almost completely in idle state thanks to the hardware offloading (Figure 6).
Figure 4 – Performance of RSA-2048 with OpenSSL comparing software (Cortex-A9) and hardware acceleration.
The performance gain in operations per MHz also implies a much lower latency with the hardware IP blocks. Reducing the latency of asymmetric operations such as RSA-2048 and ECDSA-256 has a direct beneficial impact on the negotiation time of the connection for many protocols.
Figure 5 – Performance of ECDSA-256 verification with OpenSSL comparing software (Cortex-A9) and hardware acceleration.
Figure 6 – CPU utilization while running RSA or ECDSA operations.
In the second test, we selected the AES-256 CBC operation, a state-of-the-art cipher for new applications. In software, with the CPU running at 100%, the maximum achievable throughput was 160 Mbps. This falls short of the 200 Mbps aggregate traffic of a fully loaded full duplex 100 Mbps Ethernet link. Here too, offloading the operation to hardware significantly increases the performance. Depending on the application, a suitable configuration can be selected to achieve the targeted throughput, enabling up to 1 Gbps full duplex (Figure 7). In this second test as well, the hardware IP block results in an extremely low CPU usage.
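The arithmetic behind these numbers is simple enough to check in a few lines. The sketch below uses only the figures quoted above (800 MHz, 160 Mbps) and assumes that a full duplex link must be encrypted in one direction and decrypted in the other at line rate simultaneously; the function names are illustrative.

```python
CPU_HZ = 800e6        # Cortex-A9 clock used in the benchmark
SW_AES_BPS = 160e6    # measured software AES-256 CBC throughput


def cycles_per_byte(cpu_hz, throughput_bps):
    """CPU cycles spent per processed byte at a given throughput."""
    return cpu_hz / (throughput_bps / 8)


def full_duplex_demand_bps(link_bps):
    """Aggregate crypto throughput needed to saturate both directions."""
    return 2 * link_bps


def can_sustain(throughput_bps, link_bps):
    return throughput_bps >= full_duplex_demand_bps(link_bps)


if __name__ == "__main__":
    print(cycles_per_byte(CPU_HZ, SW_AES_BPS))   # cycles per byte in software
    print(can_sustain(SW_AES_BPS, 100e6))        # 100 Mbps full duplex
```

At 160 Mbps the processor spends 40 cycles on every byte, and the 200 Mbps needed for a saturated full duplex 100 Mbps link is out of reach even with the CPU fully dedicated to AES.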
Figure 7 – Maximum achievable throughput with AES-256 CBC cipher.
Our new framework enables SoC developers to easily integrate hardware security and cryptography in their new or existing applications. The solution adds a software layer to an industry-leading crypto IP core. At the base of this layer are a number of drivers for the Linux OS that interface the crypto blocks in hardware with the widely used OpenSSL library. For application and system developers, adding the power of hardware cryptography to their system may now become as easy as recompiling and installing a few software libraries. The result offloads intensive processing from the application processor, offers superior performance at low power consumption, is highly scalable, and is more secure than a software-only implementation.
S. Rabou, D. Galerin, and T. Pauwels, “Smart Engine for Public Key Cryptography,” BarcoSilex, 2012, http://www.barco-silex.com/white-paper/smart-engine-crypto
S. Ravi, A. Raghunathan, and P. Kocher, “Security in Embedded Systems: Design Challenges,” ACM Trans. Embed. Comput. Syst., vol. 3, no. 3, pp. 461–491, 2004
N. Sklavos, K. Touliou, and C. Efstathiou, “Exploiting Cryptographic Architectures over Hardware vs. Software Implementations: Advantages and Trade-Offs,” in Proceedings of the 5th WSEAS International Conference on Applications of Electrical Engineering, 2006, pp. 147–151
P. Sutter, N. Mavrogiannopoulos, M. Weiser, and M. Ludvig, “Cryptodev-linux module,” April 2012, http://home.gna.org/cryptodev-linux/