FPGAs lower costs for RSA cryptography
By John Fry and Martin Langhammer, EE Times
September 29, 2003 (12:44 p.m. EST)
The rapid, continuing expansion of Internet and wireless based communications across open networks is creating an increasing need to encrypt the frequently sensitive or confidential data being transmitted. The RSA public key cryptographic algorithm is frequently used in these applications, either as the primary method of encryption or, more often, to exchange secret information such as keys. The advantage of the RSA algorithm is that it is a secure, high quality, public key algorithm. The disadvantages are that it is very computationally intensive, operating on very large (typically thousands of bits long) integers. Consequently, the throughput is very slow compared with most secret key algorithms, like DES and AES. For this reason, RSA is usually paired with a secret key algorithm for high performance applications; RSA is employed to exchange the keys, and the secret key algorithm is then used to encrypt the bulk of the data.
Until now, system designers who have chosen the RSA algorithm for cryptographic applications have been forced to confront a difficult choice. Although processors are flexible and can be very cost effective, especially when an existing processor in the system also runs the cryptographic application, RSA in processors is inefficient.
For higher throughput, a generally power hungry and expensive high performance processor is required. Application specific standard products (ASSPs) like the security processors offered by several vendors offer a much higher RSA performance, but are inflexible and expensive. As most are designed with specific applications in mind, such as Internet Protocol Secure, the RSA component of the ASSP may not be easily accessible to the user for high performance cryptography in its own right.
In certain applications, the designer of a proprietary system may choose to handle data differently. For example, each individual session may be very short in duration or payload. Alternatively, the da ta rate may be relatively slow. In both of those cases, an ASSP may have the wrong mix of features, and the processor may still be too slow.
Programmable logic offers a third alternative. With the design of a new core, which combines a modified version of an existing modular exponentiation algorithm with the logic structures of a new generation device, it is possible to implement a relatively high performance, user parameterizable and synthesizable core at a very low cost.
Public key cryptography
Public key cryptography allows any party that would like to exchange encrypted data to generate an encryption and decryption key pair, and publicly broadcast the encryption key. (Although the two keys are related, it is very difficult to generate the decryption key from the encryption key). Anyone can encrypt data using the encryption key and the publicly known algorithm, but only the originator of both keys can decrypt the message. In most applications, a key exchange protocol like Di ffie Hellman is used when two parties set up a session. The easiest way to illustrate the operation of RSA is by way of a simple example (see Fig. 1a).
Once M and E have been calculated, they can be publicly disclosed. The algorithm is well known, but the system is considered secure because it is computationally infeasible to work backward and break the code using just the encrypted message, encryption key and M. In practice, the sizes of the A, D and M are often thousands of bits in length. The encryption exponent chosen is generally a very small number, like 3 or 17, which does not affect the security of the algorithm.
The heart of the RSA algorithm is the function AE mod M, which is calculated with every message that is encrypted or decrypted. The AE mod M function also shows another property of RSA in practice the algorithm is asymmetric in performance. As the encryption exponent, E, is much smaller than the decryption exponent, D, encryption can easily be 1,000 times faster than decrypti on, as the message and modulus are often 2,048 bits or longer.
Although other functions such as calculating the primes and keys are complex, the performance bottleneck will be in the modular exponentiation function. The other functions can be calculated in software: If an external processor is used, the RSA core can act as an accelerator to the processor.
Using a modified Montgomery multiplication algorithm, the only arithmetic features required are adders, allowing the use of low cost programmable logic devices designed for consumer applications. The core size and performance are defined by the memory interface widths 16, 32 or 64 bits but consume less than $1 worth of resources in all cases (see Fig. 1b).
The Montgomery multiplication algorithm requires iterative encryption word width additions, as can be seen by the expressions in the inner loop of the algorithm, which calculates Z = ZA mod M:
The qi result can be calculated from just a few bits of the current internal state, Si, but the next state Si 1 requires the addition of three word wide numbers. (Many modular multiplications are required for each modular exponentiation some square the last result, Z, and some multiply the last result, Z, with the message A). The word wide additions can be calculated in multiple steps using a series of smaller additions. As ripple carry adders are very efficient on the current generation of programmable logic devices at the word widths supported by the block memories in the devices, breaking the algorithm into a series of smaller additions allows the core to calculate any size modular multiply at the same system frequency. In fact, this method allows the core to be dynamically programmable for key size, as delay is converted to latency. The overall throughput of the core will then depend on the memory geometry chosen. As can be seen from Fig. 1b, performance will scale linearly with memory size, with a constant system frequency, and only a small logic penalty.
The core interface is designed to provide the simplest connections to processors and bus systems. A memory style interface, where the values of X, E and M are written in and Z is read out by an address bus and read/write line, allows the core to operate with any common DMA or bus standard. Also, the core architecture enables memory access in a round robin fashion that converts read, write and execution times into latency, not degradation in throughput.
This core architecture delivers the scalability and linearly increasing performance required to support a wide range of RSA based applications. Also, silicon efficiency has a significant impact on cost, enabling the core to offer many cost/performance options. These features, in combination with interfaces designed to simplify connections to any common processor or bus, make this core the most cost effective and flexible solution for accelerating the performance of RSA based cryptography systems.
John Fry (email@example.com) is European DSP systems FAE and Martin Langhammer (firstname.lastname@example.org) is chief scientist at Altera Corp. (San Jose, Calif.).
See related chart
See related chart