Data-in-transit Protection for Application Processors

Chapter 1: Introduction

This whitepaper attempts to help designers tasked with building an Application Processor based system that needs to incorporate support for what is typically called 'Data in Transit Protection'. For a given system this often translates into a requirement for high-speed cryptographic data processing. So the emphasis is on high-speed data processing, as opposed to high security, or operations requiring high computational loads but little data - we are going to talk about processing lots of data, fast. Specifically, we mean 'fast' in the context of the resources available to the system. The assumption is that we are dealing with a system that also has other things to do than cryptographic processing. In fact, in the majority of cases, the system was designed and dimensioned with a different task in mind, and Data in Transit protection is only added after the fact, or in a second revision of the product. Thus, the challenge we are going to address here is not just about doing high-speed crypto. It is about doing high-speed crypto while minimizing its impact and footprint on the rest of the system.

Just like Application Processor based systems have evolved over time to become more powerful and more complex, so have the cryptographic coprocessors that accompany these processors. To sketch the broad range of solutions available today, we will use 'a brief history of cryptographic offloading' to build a 'timeline' of cryptographic offloading solutions, with every step along the way adding sophistication to the cryptographic offloading. Most systems today don't need the fullest, most comprehensive solutions that have been introduced recently - but your system is bound to be comparable to a location 'somewhere on this timeline'.

Figure 1: A history of cryptographic acceleration

Chapter 2: Cryptographic offloading - a brief history

2.1 Software only

The simplest form of cryptographic processing is obviously doing it all in software. This solution is the simplest to build and integrate in the system, but it also has the highest impact on the system. To appreciate why cryptographic processing of data is so hard on a processor system, consider the following:

In most efficient software implementations, for instance those including a networking stack, one of the design requirements is to copy the actual data around in the system, as little as possible. With Data in Transit protection, the processor is required to touch every byte of the packet data, twice (most protection schemes require the data to be hashed and encrypted).
Crypto operations typically involve (bit level) data manipulation operations not present on most application processors.
The data being processed tends to be transient - that is, it's processed by the system once, after which it's forwarded out of the system. Thus, having local caches to speed up access to data in external memory doesn't help much - in fact, in most cases, you're better off making sure the data does not end up in the cache system.
With all the data moving around in the system, it is not just the processor that is being taxed, but also the bus system and external memory interface, leaving little room for 'other' applications to make use of these resources.
With all these resources in full swing, power consumption will be affected as well.

2.2 Individual Crypto Engines with DMA support

The first step in cryptographic offloading is adding dedicated hardware to take care of the crypto algorithms. This already provides a significant performance boost as the hardware crypto will be much more efficient in performing the cryptographic transformations than the processor itself. By adding DMA capability to the crypto cores, the processor only needs to set up the key material and DMA parameters, and off goes the accelerator - the processor can spend its cycles on other tasks. This is a relatively easy scheme to support in software since it's a straightforward replacement of the crypto operation in software, with a call to the crypto hardware. The resource utilization on the rest of the system is still high though:

Bus loading/cycle stealing still happens. The processor may have cycles to spare, but the bus system and external memory still don't, starving the processor of instructions or data.
Every data byte processed, still crosses the bus system three times: read (for encryption/decryption), write (after encryption/decryption) and read again (for hashing).
The processor is still 'in the driving seat' for the crypto processing. Interaction with the crypto hardware is inherently synchronous and hardware inefficient.
- Processor sets up the operations, writes the key material - crypto and hash cores idle.
- Ciphering happens - hash core idle.
- Hashing - crypto core idle.
- The processor has to wait for the crypto hardware to complete - requiring a polling operation or interrupt to be serviced.
In a lot of cases, crypto processing can only be scheduled 'one data block at a time' to allow the processor to read status and update key material in the crypto cores, in between data blocks.

In practice this model typically suffers a heavy performance penalty if the crypto processing has to occur on small blocks of data. The high level of processor involvement causes a lot of idle time on the crypto hardware, causing them to never reach their full potential.
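To get a feel for the size of this penalty, the following Python sketch models a crypto core that pays a fixed per-block cost for setup, key loading and completion handling. The function name, the 2 Gbps raw rate and the 2 microsecond overhead are illustrative assumptions, not measurements of any particular device.

```python
def effective_throughput_gbps(block_bytes, raw_gbps, overhead_us):
    """Throughput seen by the system when every block pays a fixed setup cost.

    raw_gbps    : rate of the crypto core once it is streaming (assumed)
    overhead_us : processor-driven setup/completion time per block (assumed)
    """
    bits = block_bytes * 8
    transfer_us = bits / (raw_gbps * 1000.0)   # 1 Gbps equals 1000 bits per microsecond
    return bits / ((transfer_us + overhead_us) * 1000.0)


if __name__ == "__main__":
    for size in (64, 256, 1500, 16384):
        rate = effective_throughput_gbps(size, raw_gbps=2.0, overhead_us=2.0)
        print(f"{size:6d} bytes per block -> {rate:.2f} Gbps")
```

With these assumed numbers the core only approaches its raw rate for large blocks; for small blocks the fixed overhead dominates and the hardware sits idle most of the time.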

The popularity of this acceleration model is confirmed by the fact that a lot of software only supports crypto acceleration in this model - this is for instance how OpenSSL expects to interact with cryptographic hardware.

2.3 Protocol Transform engine

A more advanced form of cryptographic offloading takes the operation to be offloaded from the crypto- to the protocol level. In other words, instead of only accelerating an individual cipher- or hash operation, the hardware takes care of a complete security protocol transformation in a single pass. This generation of crypto acceleration also brings another improvement: it takes control over its DMA capability, making it a bus master, and allowing it to autonomously update state- and data in system memory. Although this new bus mastering capability makes integration with software more complicated, it allows for a huge efficiency increase for cryptographic acceleration:

Instead of requiring the processor to write key material and read hash results before and after every individual crypto operation, the processor sets up the key material and result location for the accelerator in system memory. The crypto accelerator autonomously fetches the data it needs and returns the results to system memory.
Because the crypto accelerator can update protocol related state information in memory autonomously, such as the IPsec sequence number or the SSL/TLS Initialization Vector, the host no longer needs to be involved to carry this state information from packet to packet.
The two items above allow the processor to 'queue up' multiple operations without the need to be involved in between operations. This in turn allows forms of 'batch processing' or 'interrupt coalescing' to lighten the processor's I/O access and interrupt burdens, resulting in a significant improvement in the number of packets per second a system can handle.
The fact that multiple packet operations can be queued also allows the crypto accelerator to work on multiple packets at a time, allowing more efficient use of its processing pipeline, and allowing the 'hiding' of data access latencies by reading data for the next packet while processing on the current packet is still ongoing.
By allowing single-pass hash- and encrypt operations the crypto accelerator can keep both the hash and cipher engines busy at the same time while decreasing the number of times the packet crosses the bus system, saving a packet read operation compared to the previous scenario.

These points allow the crypto hardware to achieve almost 100% utilization, while still reducing the per-packet load on system resources. Because the maximum data throughput and maximum number of packets per second the system can process improve significantly compared to the single-core acceleration scenario in the previous section, overall resource load goes up - there simply is less waiting to be done, and more packet data to be processed.
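The difference in bus traffic between the two models can be put into numbers with a small Python sketch. The crossing counts come from the text (three for individual engines, one read fewer for a single-pass engine); the dictionary and function names are hypothetical, and key material, state and descriptor traffic are deliberately ignored.

```python
# Number of times each payload byte crosses the bus system, per the text.
BUS_CROSSINGS = {
    "individual engines with DMA": 3,   # read, write, read again for hashing
    "protocol transform engine": 2,     # single pass: one read, one write
}


def bus_load_gbps(packet_gbps, crossings):
    """Payload-only bus traffic generated by a given packet data rate."""
    return packet_gbps * crossings


if __name__ == "__main__":
    for model, crossings in BUS_CROSSINGS.items():
        print(f"{model}: {bus_load_gbps(1.0, crossings):.1f} Gbps of bus traffic per 1 Gbps of packets")
```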

2.4 Parallelized protocol transform engines

For some systems, even a protocol transform engine is not fast enough. The simple answer then seems to be to 'just throw more hardware at it' to speed things up. As every system architect knows, reality is hardly ever that simple. To explain why this is the case, we need to dive a little deeper into the world of cryptographic hardware and data in transit protection protocols.

Because of the way most encryption- and message integrity modes are designed, it's not possible to assign multiple cipher- and hash cores to work on the same packet. Almost all encryption and message integrity modes incorporate an internal feedback loop that requires the result of the current step to be used as input to the next step - there is no way of working on multiple 'steps' in parallel. It's the 'size' of this step that determines the maximum throughput a single encryption- or message integrity mode can achieve. Thus, an individual protocol engine can only process a packet as fast as the encryption- and message integrity mode can achieve in the technology used. Only by using multiple transform engines in parallel and processing multiple packets simultaneously is it possible to achieve throughputs beyond this limit. For modern technology and crypto algorithms, the limit is around 4 to 5Gbps, implying that with the arrival of the next generation Ethernet speeds of 10, 40 or even 100Gbps, multiple protocol transform engines will have to be deployed in parallel.

A notable exception to the limitations mentioned above are the algorithm modes specifically designed for speed. AES-GCM is a great example; this mode uses the AES algorithm in 'counter mode', which does not use data feedback and thus allows multiple AES cores to be deployed in parallel to work on the same packet. By also using an integrity mode that allows internal parallelization, AES-GCM can be built to provide throughputs far beyond the limits mentioned earlier. Obviously this also affects the latency that a packet incurs due to the crypto operation. This is one of the reasons why the designers of MACsec have chosen to only allow the use of AES-GCM as data protection mechanism. Unfortunately for all of the 'older' data in transit protection schemes, such as IPsec and SSL, this restriction to a single mode of operation can't be afforded or enforced, simply because connections to legacy systems will have to be supported. For older data in transit protection schemes, multiple protocol engines have to be used in parallel to achieve the higher throughputs required for modern networks.
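The absence of data feedback in counter mode can be illustrated with a toy Python sketch. Note the assumption: the keystream function below uses SHA-256 from the standard library purely as a stand-in for the AES block operation, so this is not AES and not GCM, and it must not be used for real protection. It only shows that each block depends on the key, nonce and block index alone, so blocks can be computed in any order or by different cores.

```python
import hashlib

BLOCK = 16  # bytes per keystream block, matching the AES block size


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """Stand-in for AES(key, nonce || counter). NOT a real cipher."""
    data = key + nonce + counter.to_bytes(4, "big")
    return hashlib.sha256(data).digest()[:BLOCK]


def ctr_xor_block(key, nonce, index, chunk):
    """Process one block; needs no result from any other block."""
    keystream = keystream_block(key, nonce, index)
    return bytes(a ^ b for a, b in zip(chunk, keystream))


def ctr_process(key, nonce, data, order=None):
    """Encrypt or decrypt 'data', visiting the blocks in the given order."""
    chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    out = [None] * len(chunks)
    for i in (order if order is not None else range(len(chunks))):
        out[i] = ctr_xor_block(key, nonce, i, chunks[i])
    return b"".join(out)


if __name__ == "__main__":
    key, nonce, packet = b"k" * 16, b"n" * 12, bytes(range(100))
    in_order = ctr_process(key, nonce, packet)
    reversed_order = ctr_process(key, nonce, packet, order=reversed(range(7)))
    print(in_order == reversed_order)
```

A chained mode such as CBC cannot be restructured this way for encryption, because the input of each block includes the output of the previous one.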

Using multiple transform engines in parallel, however, brings a new challenge. As already indicated in the previous section, protocols like IPsec and SSL maintain 'state information' for a connection (or 'tunnel'). This information, typically referred to as a 'Security Association' or SA, is required before a transform engine can start processing on a packet, and it's updated after processing is done. Processing a packet using old SA data may cause the processing to fail completely, as is the case with SSL or TLS. Or, it may cause certain checks to become critical, such as with the IPsec replay window check where the currently allowed window of packet sequence numbers is kept as connection state. This 'challenge' causes a lot of systems to keep track of the security connection to which a packet belongs, so the packets for a 'single tunnel' can be scheduled for processing on the same transform engine every time. This way the system is making sure the connection state is carried correctly between packets. It obviously also negates the parallel processing capability for packets belonging to the same tunnel; parallel processing is only possible for packets belonging to different connections. In other words, a system specified to support 40Gbps of IPsec traffic may only be capable of handling 5Gbps of IPsec traffic per IPsec tunnel. Such a system will only be capable of achieving the full 40Gbps if that traffic is distributed over multiple IPsec tunnels.
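The effect of pinning tunnels to engines can be sketched in a few lines of Python. The mapping policy (tunnel identifier modulo the number of engines) and all function names are hypothetical; the 8 engines of 5 Gbps each are simply chosen to match the 40Gbps example above.

```python
def engine_for_tunnel(tunnel_id: int, n_engines: int) -> int:
    """Pin a tunnel to one engine so its SA state is updated in packet order.

    A simple modulo is used here; real systems may hash on other fields.
    """
    return tunnel_id % n_engines


def achievable_gbps(offered_gbps_per_tunnel, n_engines, per_engine_gbps):
    """Aggregate throughput when every tunnel is served by exactly one engine."""
    load = [0.0] * n_engines
    for tunnel_id, offered in offered_gbps_per_tunnel.items():
        load[engine_for_tunnel(tunnel_id, n_engines)] += offered
    # Each engine delivers at most its own maximum rate.
    return sum(min(engine_load, per_engine_gbps) for engine_load in load)


if __name__ == "__main__":
    one_tunnel = {0: 40.0}
    eight_tunnels = {tunnel_id: 5.0 for tunnel_id in range(8)}
    print(achievable_gbps(one_tunnel, 8, 5.0))      # limited by a single engine
    print(achievable_gbps(eight_tunnels, 8, 5.0))   # all engines busy
```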

The good news is that for IPsec, where 'single tunnel operation' is common, this limitation can be addressed, provided the protocol acceleration hardware is designed for it. For SSL and TLS, this limitation can't be addressed as easily. Fortunately though, for SSL/TLS the typical usage scenario results in lots of different, short-lived connections, so the limitation is not as serious. The one scenario that may result in a single SSL connection is that of SSL based VPNs. For this reason, SSL based VPNs tend to use a modified version of SSL/TLS called 'Datagram TLS' (DTLS), which is designed to allow operation over UDP instead of TCP. This means the DTLS protocol must be able to deal with datagrams that arrive out of order; the guaranteed packet ordering provided by TCP is not available. As a result, DTLS allows parallel processing of packets belonging to the same connection.
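Tolerating out-of-order arrival is commonly done with a sliding window of sequence numbers, in the spirit of the IPsec replay window check mentioned earlier. The Python class below is a minimal sketch of such a window; the class and method names are hypothetical, and for brevity it combines the check and the update in one call, whereas a real implementation would only update the window after the packet's integrity check has passed.

```python
class ReplayWindow:
    """Sliding anti-replay window over packet sequence numbers (sketch)."""

    def __init__(self, size=64):
        self.size = size
        self.top = 0      # highest sequence number accepted so far
        self.bitmap = 0   # bit i set means sequence number (top - i) was seen

    def check_and_update(self, seq: int) -> bool:
        """Return True if 'seq' is new and inside the window, else False."""
        if seq <= 0:
            return False
        if seq > self.top:
            # Packet advances the window: shift history, mark the new top.
            shift = seq - self.top
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.size:
            return False              # too old, fell out of the window
        if self.bitmap & (1 << offset):
            return False              # already seen: replayed packet
        self.bitmap |= 1 << offset    # late but new: accept and remember
        return True


if __name__ == "__main__":
    window = ReplayWindow(size=4)
    print([window.check_and_update(s) for s in (1, 3, 2, 2, 10, 6, 7)])
```

Because the window state is shared per connection, engines working on the same tunnel in parallel need a consistent view of it, which is exactly the kind of support the acceleration hardware has to be designed for.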

2.5 Moving on

Even the parallelized protocol transform engine from the previous section isn't always sufficient to achieve the data throughput and packets per second a system architect is looking for. Simply adding more crypto hardware doesn't always do the trick; for various reasons, other system bottlenecks may prevent the crypto hardware from reaching its potential. Examples of performance limiting effects are:

Data Bandwidth limitations

Adding data in transit protection to an existing data stream tends to multiply the amount of data that needs to be moved around on the internal bus system. Where originally packet data came in over an external interface (Ethernet, WiFi) and got stored in memory, for use by some application running on the host processor, now the packet needs to be read from memory, get decrypted, and stored back in memory before it can be given to the application. The same obviously holds for outbound traffic. Thus, a gigabit interface that used to consume a single gigabit of internal bandwidth all of a sudden requires 3 gigabits of internal bandwidth. In addition, for every packet processed, the key material and tunnel state (SA) need to be read and updated by the crypto engine. Although in itself not a lot of data, it may still add up to a large data stream if a lot of small packets are processed. This may be alleviated by using an SA cache on systems requiring support for a limited number of tunnels; however, systems dealing with thousands of simultaneous tunnels typically can't afford to provide sufficient local memory to make caching effective.
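The arithmetic behind this can be written down as a short Python sketch. The factor of three follows the text; the 128 bytes of SA traffic per packet (read plus write-back) is an assumed figure for illustration only, and framing overhead on the wire is ignored.

```python
def internal_bandwidth_gbps(line_rate_gbps, packet_bytes, sa_bytes_per_packet,
                            crossings=3):
    """Internal bus bandwidth needed for look-aside protection of a data stream.

    crossings           : times the packet data crosses the bus (3, per the text)
    sa_bytes_per_packet : SA bytes read and written per packet (assumed value)
    """
    packets_per_second = line_rate_gbps * 1e9 / (packet_bytes * 8)
    sa_gbps = packets_per_second * sa_bytes_per_packet * 8 / 1e9
    return crossings * line_rate_gbps + sa_gbps


if __name__ == "__main__":
    for size in (64, 512, 1500):
        total = internal_bandwidth_gbps(1.0, size, sa_bytes_per_packet=128)
        print(f"{size:5d}-byte packets: {total:.2f} Gbps internal bandwidth")
```

Under these assumptions the SA traffic is negligible for large packets but becomes comparable to the packet data itself once packets get small.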

Processor Bandwidth limitations

Every packet arriving in the system requires some attention from the system processor(s), even if the actual data movement and modification is handled by support hardware. This goes for connections that are not encrypted as well: the rule of thumb for terminating a TCP connection on a host processor used to be that for every bit per second of TCP traffic terminated, 1 Hz of processor bandwidth was required. It will be obvious that this does not improve if data in transit protection is added to a data stream.

The key here is that, assuming packet data movement is handled by DMA and cryptographic processing is handled by a crypto accelerator, the processor has to perform all packet handling operations, such as those for TCP described above, for every packet, regardless of the size of the packet. This means that every system has an upper limit for the number of packets it can handle per second, especially if the amount of bandwidth the processor is allowed to spend on packet handling is limited. Most systems don't just move packets along; they actually need to act on them so they reserve, or would like to reserve, the majority of their bandwidth to other tasks, putting a further limit on the maximum number of packets the system can handle per second. Obviously this upper limit will decrease if the processor is given more tasks per packet due to the addition of data in transit security. Some common causes for this are:

Classification workload. Even if the actual cryptographic operations are offloaded to hardware, the processor still has to inspect the packet headers and determine the right key material for the crypto accelerator to use. As a rule of thumb, this 'classification workload' can be considered roughly equal to the workload needed to terminate a TCP connection.
I/O interaction. In order to exchange packets with the crypto accelerator, even in case of a sophisticated type such as the protocol transform engine mentioned earlier, processor bandwidth is required. How much bandwidth depends on the operating system, hardware system and accelerator type used, but the following items tend to require a significant amount of bandwidth if not properly addressed:
- Drivers for cryptographic accelerators tend to exist in kernel space, whereas the application using the crypto accelerator operates from user space. This means that each time the crypto accelerator is used a user/kernel space transition is required.
- Cryptographic accelerators capable of asynchronous operation may rely on interrupts to indicate completion of packet operation(s). Switching into- and out of interrupt service modes may place a heavy burden on the system processor, especially if this has to happen very often.
- Cache sizing, especially for application processors, is often such that the cache can hold all code and data for the task at hand. If that task includes packet I/O, then adding data in transit protection all of a sudden adds the classification and packet I/O code to the code set that should remain in cache. If the cache is not sized to support this, processing may take a significant performance hit as the number of cache misses increases.
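The resulting packet rate ceiling can be estimated with a simple budget calculation, sketched below in Python. All cycle counts are made-up placeholders; the only relationship taken from the text is that the classification workload is roughly equal to the TCP termination workload.

```python
# Hypothetical per-packet processor costs, in clock cycles.
PER_PACKET_CYCLES = {
    "tcp_termination": 2000,
    "classification": 2000,        # roughly equal to TCP termination, per the text
    "driver_and_interrupt": 1000,  # user/kernel transitions, interrupt servicing
}


def max_packets_per_second(cpu_hz, share_for_packets, per_packet_cycles):
    """Upper limit on packet rate given the processor share reserved for it."""
    return cpu_hz * share_for_packets / sum(per_packet_cycles.values())


if __name__ == "__main__":
    without_protection = {"tcp_termination": PER_PACKET_CYCLES["tcp_termination"]}
    print(max_packets_per_second(1e9, 0.25, without_protection))
    print(max_packets_per_second(1e9, 0.25, PER_PACKET_CYCLES))
```

Note that packet size does not appear anywhere in this calculation, which is why the limit is expressed in packets per second rather than bits per second.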

It will be obvious that having to move data around in the system unnecessarily never helps throughput; this is true for any system, not just for cryptographic accelerators. Especially for DMA capable peripherals, data flow and data buffer management should be optimized such that data alignment, data buffer location and management as well as address translation (between the virtual addresses used by applications and physical address used by a DMA engine) allow for optimal use and cooperation by the peripheral and the OS or application.

With the previous paragraphs in mind, we can construct a graph that shows the maximum throughput a system can achieve, as a function of packet size used to transfer the data. This graph clearly shows the two areas where throughput is limited by processor bandwidth and data bandwidth, respectively. Tangent A shows the maximum throughput achievable due to the system's ability to process a maximum number of packets per second (c), thus maximum throughput is c x Packet Size. Tangent B shows the maximum data bandwidth available for the cryptographic accelerator.
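The shape of this graph follows directly from taking the lower of the two limits at each packet size. A minimal Python sketch is given below; the values for c (100,000 packets per second) and for the data bandwidth limit (1 Gbps) are arbitrary examples, and the function names are ours.

```python
def max_throughput_bps(packet_bytes, c_pps, data_limit_bps):
    """Lower of tangent A (c x packet size) and tangent B (data bandwidth)."""
    tangent_a = c_pps * packet_bytes * 8   # packets/s times bits per packet
    return min(tangent_a, data_limit_bps)


def crossover_bytes(c_pps, data_limit_bps):
    """Packet size at which the limit shifts from processor to data bandwidth."""
    return data_limit_bps / (c_pps * 8)


if __name__ == "__main__":
    c, limit = 100_000, 1e9
    for size in (64, 256, 1024, 1500):
        print(f"{size:5d} bytes: {max_throughput_bps(size, c, limit) / 1e6:8.1f} Mbps")
    print(f"crossover at {crossover_bytes(c, limit):.0f} bytes")
```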

Any system deploying look-aside type cryptographic hardware will perform according to this graph, although obviously the exact slope of tangent A, and location of tangent B, will differ.

Figure 2: Typical throughput graph for packet processing systems

Attempting to improve throughput of a system by just adding additional cryptographic acceleration capability will obviously move tangent B up; however without further improvements to the system, the slope of tangent A is not changed, limiting the effect of the additional cryptographic hardware, as illustrated in the following figure.

Figure 3: Effect on throughput when adding HW acceleration capable of 2x the original acceleration performance without improving processor packet handling efficiency
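The limited effect of doubling only the acceleration capability can also be expressed numerically. The Python sketch below computes the throughput gain per packet size when tangent B is doubled while c stays the same; the values of c and the bandwidth limits are arbitrary examples.

```python
def throughput_bps(packet_bytes, c_pps, data_limit_bps):
    """Throughput limited by packet rate (c) or by data bandwidth."""
    return min(c_pps * packet_bytes * 8, data_limit_bps)


def gain_from_faster_crypto(packet_bytes, c_pps, old_limit_bps, new_limit_bps):
    """Ratio of new to old throughput when only the data limit is raised."""
    return (throughput_bps(packet_bytes, c_pps, new_limit_bps)
            / throughput_bps(packet_bytes, c_pps, old_limit_bps))


if __name__ == "__main__":
    for size in (64, 1500, 9000):
        gain = gain_from_faster_crypto(size, 100_000, 1e9, 2e9)
        print(f"{size:5d} bytes: throughput x {gain:.2f}")
```

With these example values, small packets see no gain at all, 1500-byte packets gain only part of the added capacity, and only very large packets benefit from the full factor of two.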

The slope of tangent A, dictated by coefficient c, can be improved by increasing the efficiency of the IP- and cryptographic protocol stacks, and by making sure the interaction with the cryptographic hardware is as efficient as possible. The points mentioned in this section can help to achieve this, up to a certain point; if additional improvement is needed, it becomes necessary to move more functionality from software, directly to hardware. For this reason the class of 'Inline Protocol Acceleration engines' was introduced.

2.6 Inline Protocol Acceleration engines

Originating from the Network Processor world, the concept of 'Inline Operation' has started to be used in the Application Processor world as well. The crypto acceleration architectures discussed so far operate in what is typically referred to as 'Look-Aside mode': packet handling is done completely under software control, and only when the actual cryptographic operation needs to be performed does the software 'Look Aside' to the cryptographic accelerator. After the crypto accelerator has completed its task, the packet is handed back to software and packet processing continues. The conceptual difference introduced by Inline Processing is the fact that software is no longer involved both before and after crypto acceleration: all cryptographic operations are performed on the packet before the software 'sees' the packet for the first time (or vice versa). This form of Inline Operation is typically called the 'Bump in the Stack' processing model. Some systems, especially those targeting networking gateway applications, take this concept one step further and allow a packet to travel from network interface to network interface completely through hardware, without involving software running on the Application Processor at all. This operational model, which is almost a hybrid between the typical Application Processor setup and a dedicated Network Processor setup, is often referred to as the 'Bump in the Wire' processing model. Since we are specifically addressing Application Processors in this whitepaper, we will focus primarily on the 'Bump in the Stack' model. After all, most Application Processors are used in a system that is required to actually use (consume) the packet data it receives (and vice versa); only network gateway applications are typically set up to 'forward' packet data without actually looking at the packet contents.

The following two figures illustrate the difference between the look-aside and inline processing models, from a protocol stack point of view. The first figure shows a 'typical' protocol stack for IP with IPsec. Typical packet flow is from Ethernet, at the bottom, through the IP stack in software, making a brief excursion to the Cryptographic accelerator for decryption, and further up to the application. Outbound packets follow the same flow, in reverse.

Figure 4: Example of (data plane) packet handling operations; on the left a typical IP with IPsec protocol stack, on the right the operations executed in HW by a Flow Through accelerator

When an Inline cryptographic accelerator is used, the picture changes as shown on the right side. All packet operations 'in between' the Ethernet MAC and the Cryptographic accelerator are performed in the hardware of the Inline protocol engine. The packet no longer makes an 'excursion' from the software stack to get processed by the cryptographic accelerator; rather, the software stack only 'sees' the packet after it has been decrypted. With a Bump in the Stack flow, the packet travels from Ethernet to the application, and vice versa. In a Bump in the Wire flow, the protocol accelerator also implements an IP forwarding function, so packets that arrive from Ethernet can be processed all the way up to the IP layer, get decrypted, and are then forwarded 'back down' to Ethernet again, causing the packet to never hit the software part of the IP stack.

Both the Bump in the Stack as well as the Bump in the Wire operational models present some software integration challenges as typical networking stacks and applications are not designed for use in this model. When properly integrated however, major benefits can be achieved:

From a data plane point of view, Inline acceleration makes the system appear as a regular networking system again:
- Only a single packet data transfer from the networking interface to system memory (and vice versa) is required.
- No additional processor involvement with individual packets due to data in transit protection - processor involvement is limited to control plane operations.
- No I/O interaction from both network interface and crypto accelerator hardware. The combined Inline setup looks like a regular Ethernet interface to the system.
Because data in transit protection no longer requires additional system resources for data plane operations, performance becomes predictable - it no longer depends on processor activity or bus- or SDRAM utilization by other system tasks. Basically the system can operate at its normal performance level, as if no data in transit protection had been added, assuming of course the crypto accelerator has 'line rate performance'.

In other words, most or all of the issues raised in the previous section 'go away' when an inline crypto accelerator is deployed.

Using an inline crypto accelerator in 'Bump in the Wire' mode can have an even more dramatic effect. Because the crypto accelerator in this scenario comes with a built-in 'packet forwarding engine', the packet forwarding capability of the system through the crypto pipeline can outstrip the packet forwarding capability of the application processor in the system. The difference can be such that the terms 'fast path', denoting the inline crypto accelerator, and 'slow path', denoting the application processor, are often used - terms typically used in the world of network processors to indicate the optimized data path through the packet processing engines, versus packet handling by the slower general purpose processor.

2.7 Power

Cryptographic accelerators not only bring improved data throughput. They also provide improved power consumption compared to a software-only solution. It will be obvious that an on-chip accelerator, using only the necessary amount of logic gates needed to perform the cipher- and hash operations, consumes significantly less power than a general purpose application processor. The application processor, and the parts of the system it uses to perform the required cryptographic operations, will activate much more internal logic compared to a dedicated crypto accelerator. In addition the application processor typically executes from off-chip SDRAM, increasing the combined power consumed even more.

The most significant power savings are achieved by moving the cryptographic operations to dedicated hardware, preferably a protocol engine (to minimize the amount of data movement in the system). Beyond that, using Bump in the Stack type acceleration provides power optimization compared to a Look-Aside deployment, again because of the fact the packet data is moved in and out of SDRAM less often. Bump-in-the-Wire operation improves power consumption even more because packet data does not necessarily have to enter SDRAM any more at all, combined with the fact that the processor is not spending any cycles on packet processing.

Chapter 3: Efficient Packet Engine design and integration

Up to this point we have been discussing the different cryptographic acceleration architectures found in application processors today. Having established the application and usefulness of cryptographic acceleration, we will now look at what features make an accelerator efficient. This chapter focuses on the protocol-level accelerators from the previous chapter; these are often referred to as 'packet engines', hence you'll see that term used in the following sections as well.

3.1 The 'simple things'

Any peripheral with (high throughput) DMA capability needs to provide certain features to allow easy integration with controlling software; packet engines are no exception. This means that the packet engine DMA subsystem should provide the following features:

Descriptor based operation.

Rather than requiring the processor to manually program the DMA engine on a transfer-by-transfer basis, the packet engine should allow the processor to queue a number of packets, leaving it to the packet engine to set up the individual bus mastering transactions autonomously. This also allows the packet engine to pre-fetch data to hide memory access latencies.
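A descriptor queue of this kind is usually organized as a ring shared between host and engine. The Python sketch below models only the bookkeeping; the class name, the descriptor fields (src, dst, length, sa) and the methods are hypothetical and do not describe any specific packet engine.

```python
class DescriptorRing:
    """Fixed-size ring: the host submits descriptors, the engine fetches them."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0    # next slot the host fills
        self.tail = 0    # next slot the engine consumes
        self.count = 0   # descriptors currently queued

    def submit(self, descriptor) -> bool:
        """Host side: queue one packet; returns False when the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.head] = descriptor
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def engine_fetch(self):
        """Engine side: take the oldest descriptor, or None if the ring is empty."""
        if self.count == 0:
            return None
        descriptor = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return descriptor


if __name__ == "__main__":
    ring = DescriptorRing(size=4)
    for n in range(3):
        ring.submit({"src": 0x1000 * n, "dst": 0x8000 + 0x1000 * n,
                     "length": 1500, "sa": 0x4000})
    print(ring.engine_fetch()["src"], ring.count)
```

Since several descriptors are visible to the engine at once, it can start fetching the data for the next packet while the current one is still being processed.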

Support for data Scatter- and Gather capabilities.

While software often enjoys the services of an MMU to make a buffer scattered in memory look like a contiguous virtual buffer, the lack of IOMMU in a lot of systems means the packet engine DMA will have to be able to deal with the scattering and gathering of data itself.
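What this means for the driver can be sketched as follows: a virtually contiguous buffer is translated, page by page, into a list of physically contiguous segments that the packet engine can gather. The 4 KB page size, the dictionary standing in for a page table and the function name are assumptions made for this example.

```python
PAGE = 4096  # assumed page size in bytes


def build_sg_list(virt_addr, length, page_table):
    """Translate a virtual buffer into (physical address, length) segments.

    page_table maps virtual page number -> physical page number (hypothetical).
    Segments that happen to be physically adjacent are merged.
    """
    segments = []
    while length > 0:
        offset = virt_addr % PAGE
        chunk = min(length, PAGE - offset)            # stay within one page
        phys = page_table[virt_addr // PAGE] * PAGE + offset
        if segments and segments[-1][0] + segments[-1][1] == phys:
            last_addr, last_len = segments[-1]
            segments[-1] = (last_addr, last_len + chunk)
        else:
            segments.append((phys, chunk))
        virt_addr += chunk
        length -= chunk
    return segments


if __name__ == "__main__":
    table = {0: 10, 1: 11, 2: 20}   # pages 0 and 1 are physically adjacent
    print(build_sg_list(100, 9000, table))
```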

Interrupt Coalescing, time-out and polling.

One of the benefits of descriptor-based control for the packet engine is that it allows asynchronous interaction with the packet engine. This implies that the processor either gets interrupted by the packet engine when processing for a packet is completed, or the processor polls on a regular basis to determine if processed packets are available and whether new ones can be queued. While the overhead of dealing with a hardware interrupt may be acceptable for situations with a low 'packet arrival rate', this overhead becomes prohibitive if the number of interrupts rises due to a high packet arrival rate. One way of lightening the interrupt load is to apply 'Interrupt Coalescing', which simply means that an interrupt is fired by the packet engine for every n packets processed, rather than for every single packet. This mechanism works fine if the system is dealing with a continuously high packet arrival rate. If packet arrival rate drops, however, interrupt coalescing may result in some packets not getting serviced for a long time because the 'coalescing limit' isn't reached, preventing the interrupt from being fired. In this case the packet engine must allow a time-out to be set; if processed packets are waiting and no new packets arrive during this time-out period, the packet engine triggers the interrupt to the processor anyway. This mechanism allows packet handling latency to remain under control. Finally, if the system finds itself under such a high packet arrival rate that even interrupt servicing becomes undesirable, the system may want to switch to a form of polling, similar to the behavior of the Linux New API (NAPI) for packet processing. The above implies that the packet engine should provide support for all these mechanisms.
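The interplay between the coalescing limit and the time-out can be illustrated with a small simulation in Python. The function name and the example timestamps are ours; the time-out is modelled as an idle timer that restarts with every completed packet, matching the description above.

```python
def interrupt_times(completions, threshold, timeout):
    """Return the times at which the packet engine raises an interrupt.

    completions : sorted packet completion timestamps
    threshold   : coalescing limit (interrupt after this many packets)
    timeout     : idle time after which waiting packets are reported anyway
    """
    interrupts = []
    pending = 0      # completed packets not yet reported to the processor
    last = None      # completion time of the most recent packet
    for t in completions:
        if pending and t - last > timeout:
            interrupts.append(last + timeout)   # idle timer expired first
            pending = 0
        pending += 1
        last = t
        if pending >= threshold:
            interrupts.append(t)                # coalescing limit reached
            pending = 0
    if pending:
        interrupts.append(last + timeout)       # flush the stragglers
    return interrupts


if __name__ == "__main__":
    burst_then_trickle = [0, 1, 2, 3, 10, 50]
    print(interrupt_times(burst_then_trickle, threshold=4, timeout=5))
    print(interrupt_times(burst_then_trickle, threshold=1, timeout=5))
```

In the first call six packets cost three interrupts instead of six, while no packet waits longer than the time-out before the processor is notified.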

Another thing to look for in a DMA-capable (or bus mastering) peripheral is its ability to interact efficiently with the internal bus system; it must be capable of:

Supporting programmable minimum- and maximum burst sizes.
Dealing with different data alignments:
- Big- versus little-endian.
- 1, 2, 4, 8 byte aligned data transfers (or more).
Supporting the available sideband signaling, for instance to assist with cache coherency and simultaneous data access synchronization.
Setting up multiple simultaneous transactions to allow pipelining and latency hiding.

Other system level considerations apply when using DMA capable peripherals, such as the (data) cache coherency mentioned above. These, however, need to be dealt with at the system level as they can't be alleviated by the peripheral itself (alone). What will also help in this respect is having a software support environment for the device that is aware of these issues and can help deal with them.

3.2 Supporting modern application processor hardware

Application processors, even those in mobile systems, have evolved from single-processor, 32-bit, single OS or RTOS systems into multi-processor, 64-bit systems with support for Virtualization, possibly running multiple OSes and definitely running more applications in parallel. In addition, the presence of MMU and IOMMU functions, and the use of higher throughput, pipelined memory ease the use of a DMA-capable peripheral in the system and provide much higher data throughputs. These features allow higher network throughput and make it easier for a crypto accelerator to be accessed from different applications in the system. On the downside, memory read access times have grown to a point where two or three new packets arrive in the system while the crypto accelerator is waiting for a single read access to get completed by the memory subsystem.

This implies that, to be effective in a modern system, a packet engine has to support a number of features that have nothing to do with the crypto operations themselves but rather allow the packet engine to achieve its maximum potential as part of a bigger, complex system. In a sense these requirements hold for any high-performance peripheral in the system:

The packet engine must be capable of working on multiple packets at the same time. This allows the engine to set up multiple parallel read transactions, in order to hide the high read latencies exhibited by modern systems. We are not talking about one or two packets here; to be effective, the packet engine must be capable of handling tens of packets simultaneously.
With this many packets active in the packet engine simultaneously, it must provide sufficient internal buffering and caching capability to minimize interaction with system memory (read and write a data structure only once). It must also be able to deal with the associated data consistency challenges; for instance, updates to connection state need to be propagated to other packets in the pipeline on the same connection, as well as to system memory.
Data read by the packet engine may reside in different types of system memory (on-chip RAM, off-chip SDRAM, system memory accessible through e.g. a PCIe bus). Data read from on-chip RAM will be returned quicker than data read from off-chip memory, so the system may return data out of order; the packet engine must be able to deal with this efficiently.
Address widths of 64 bits and data widths of 128 bits and more are no longer 'exceptional'.
Data read requests may return corrupted or incomplete data or time out altogether, especially if data originates from across an inter-chip bus (e.g. PCIe).
Additional sideband information must be provided with bus transactions to support Virtualization and cache management.
To optimize power consumption, the engine should support dynamic clocking schemes such that only those parts of the engine that are actually in use are provided with a clock signal.
To support even more rigorous power saving, the packet engine must be capable of transitioning into an idle state in a controlled manner and of signaling to the system that it has arrived in that state, so the system can shut down all clocks, and possibly power, to the engine.
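To get a feel for why the first requirement in this list speaks of tens of packets, it helps to estimate how much data must be 'in flight' to hide read latency. The sketch below applies Little's law (data in flight = throughput x latency). The function name and all numbers in the example are assumptions chosen for illustration, not measurements of a real system.

```python
def reads_in_flight(throughput_bps: int, read_latency_ns: int, read_bytes: int) -> int:
    """Outstanding reads needed to keep the data path busy (Little's law).

    throughput_bps   target data rate in bits per second
    read_latency_ns  time from issuing a read until its data returns
    read_bytes       amount of data returned by one read transaction
    """
    bits_in_flight = throughput_bps * read_latency_ns       # scaled by 1e9
    bits_per_read = read_bytes * 8 * 1_000_000_000           # same scale
    return -(-bits_in_flight // bits_per_read)               # ceiling division


# Assumed example: 10 Gbit/s target, 1 microsecond read latency,
# minimum-size 64-byte packets fetched with one read each.
small_packets = reads_in_flight(10_000_000_000, 1_000, 64)
```

Under these assumed numbers the engine needs 20 reads outstanding, which for minimum-size packets means about 20 packets being worked on at the same time. Context and descriptor reads, which this estimate ignores, only add to that figure.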

The packet engine must be capable of dealing with the system level requirements mentioned above before it can operate efficiently, i.e. achieve the maximum performance that the internal crypto algorithms can achieve. Next we will look at some of the requirements put on a packet engine to allow it to be used efficiently from a modern software perspective.

3.3 Supporting multiple applications and virtualized systems

As indicated earlier, modern application processors support virtualization, either in the classical sense, running multiple operating systems, or from a security perspective, deploying a normal and a secure world, or even both at the same time. In addition, each virtualized environment can run multiple applications that require interaction with the packet engine. On top of that, some of these applications require a kernel component to control the crypto operations (which is typically the case for IPsec, for example), while other applications require access to the crypto accelerator from user space (such as SSL/TLS). This means that, to be used effectively in such an environment, the packet engine should provide the following features:

The ability to separate global initialization and control functionality, affecting the core as a whole, from data- and control interaction required by individual applications. In other words it must be possible to have a single master function in the system that initializes the packet engine and reacts to global error situations.
Every individual application must then be able to interact with the packet engine hardware directly, without requiring access coordination through a driver or a virtualization subsystem.
Furthermore, the packet engine should make sure that data structures, registers, counters etc. used by an application are not visible or accessible by another application. The same goes for data structures in memory and interrupts used just by a single application.
Robust synchronization mechanisms must be provided for data structures that are shared between the application and the packet engine. This is true for data that is located only in packet engine registers (which may be the case, for instance, for classification rules). It is even more relevant for data structures that are created and maintained by the application but can be 'cached', and possibly updated, by the packet engine while processing packets.

These are just a handful of requirements posed on a packet engine as it gets integrated in a modern multiprocessor application processor. In the past, requirements like these used to be applicable to high-end server systems; however there is clearly a shift in system complexity happening with the ever increasing power of application processors.

3.4 Upping the performance

As indicated in the first half of this whitepaper, it may be necessary to use multiple processing pipelines in order to exceed the single-packet throughput limitations imposed by certain cryptographic operations. In addition, we determined that it is beneficial for the packet engine to be able to work on multiple packets in parallel, so that more parallel read transactions and data pre-fetch operations can be set up to hide read latency more efficiently. For this reason high-speed packet engines consist of multiple processing pipelines, with each pipeline consisting of multiple stages and each stage operating on a different packet. Doing this brings improved throughput, but it also brings some additional challenges:

Every packet requires access to a 'packet context' that contains information on how to process the packet (protocol, mode, connection and tunnel state) and the key material to use. With multiple packets active in the system, the packet engine needs high-speed access to these packet contexts, making it necessary for the packet engine to implement a sophisticated internal cache system.
Any updates that are required to this packet context need to be propagated to packets and contexts already present in the packet engine, as well as to the host system.
For certain protocols, packet processing must be strictly serialized because the protocol is built in such a way that the next packet needs the processing result of the previous packet in the connection, before it can be processed. SSL/TLS is like this. Other protocols require special measures to allow parallel processing of multiple packets for the same SA. This is the case for the replay (sequence number) check in IPsec.
Most systems require that packet ordering, at a system level, is also maintained. Thus, a small packet being processed on one processing pipeline must not overtake a larger packet processed on a second processing pipeline. At the same time, stalling of processing pipelines is undesirable as it obviously affects packet engine throughput.
The supporting infrastructure around the multiple processing pipelines must be capable of servicing all the simultaneous data requests from the different pipelines while also maintaining compliance with the SoC bus system and the requirements for efficient system integration mentioned earlier.
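The packet ordering requirement in the list above can be met without stalling a pipeline by letting packets complete in any order and holding back finished results until all earlier packets are done. The following is a minimal sketch of that idea; the class and method names are hypothetical, and a hardware implementation would use a fixed-size structure rather than a heap.

```python
import heapq


class ReorderBuffer:
    """Releases results in admission order, whatever order they finish in."""

    def __init__(self):
        self._issued = 0      # sequence number handed to the next packet
        self._next_out = 0    # sequence number that may leave next
        self._done = []       # min-heap of (sequence number, result)

    def admit(self) -> int:
        """Tag a packet on entry, before it is assigned to a pipeline."""
        seq = self._issued
        self._issued += 1
        return seq

    def complete(self, seq: int, result) -> list:
        """A pipeline finished packet `seq`; return whatever may now leave."""
        heapq.heappush(self._done, (seq, result))
        released = []
        while self._done and self._done[0][0] == self._next_out:
            released.append(heapq.heappop(self._done)[1])
            self._next_out += 1
        return released


rob = ReorderBuffer()
large = rob.admit()                               # arrives first, slow to process
small = rob.admit()                               # arrives second, finishes first
early = rob.complete(small, "small packet")       # held back: nothing released
late = rob.complete(large, "large packet")        # both leave, in arrival order
```

The pipelines themselves never wait in this scheme; only the finished result of the small packet is parked until the large packet ahead of it completes. The cost is buffer space for parked results.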

Chapter 4: The Software Angle

In the previous sections we have focused on the cryptographic hardware and what is needed to allow efficient interaction with software, at a fairly low level and from a predominantly hardware-based perspective. Now let's take a look at the requirements that are put on a protocol stack as a whole, to allow it to make efficient use of two of the more advanced crypto acceleration architectures mentioned above: the Look-Aside Protocol Acceleration model and the Inline Protocol Acceleration model. In general, the different offloading modes represent increasing integration challenges but also yield significant performance and offloading improvements, as illustrated in the following figure for the IPsec scenario.

4.1 Look-Aside Model

The first and most basic item to look at is the capability of the protocol stack to support protocol-level crypto acceleration. If the protocol stack only allows cryptographic offloading on an algorithm level, the added value of a sophisticated protocol acceleration engine is going to be limited.

To unlock the full potential of the look-aside protocol engine, it is necessary that the protocol stack can keep the packet queue populated with packets at all times. It will be obvious that this requires the protocol stack to support asynchronous packet exchange with the protocol core, allowing the protocol stack to handle processed packets and set up new ones, while the crypto accelerator is processing the queued packets. Even more basic, the protocol stack must be built to allow simultaneous processing of multiple packets, either by multiple invocations of the data processing path or by some other form of parallel processing.
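The asynchronous exchange described above can be sketched as a bounded queue that the stack keeps topped up while collecting finished packets. This is a toy model under stated assumptions: `LookAsideQueue`, `engine_step` and `run_stack` are hypothetical names, the 'engine' is a stand-in that merely tags a packet as done, and everything runs in one thread purely to show the control flow.

```python
from collections import deque


class LookAsideQueue:
    """Bounded descriptor queue shared by the stack and the engine (toy model)."""

    def __init__(self, depth: int):
        self.depth = depth
        self.submitted = deque()   # queued, not yet processed
        self.completed = deque()   # processed, not yet collected

    def submit(self, packet) -> bool:
        """Queue a packet; returns False when the queue is full."""
        if len(self.submitted) + len(self.completed) >= self.depth:
            return False
        self.submitted.append(packet)
        return True

    def engine_step(self) -> None:
        """Stand-in for the hardware: finish one queued packet."""
        if self.submitted:
            self.completed.append(("done", self.submitted.popleft()))

    def reap(self) -> list:
        """Collect all finished packets without waiting for the rest."""
        out = list(self.completed)
        self.completed.clear()
        return out


def run_stack(packets, depth=4):
    queue = LookAsideQueue(depth)
    backlog = deque(packets)
    results = []
    while backlog or queue.submitted or queue.completed:
        while backlog and queue.submit(backlog[0]):   # keep the queue populated
            backlog.popleft()
        queue.engine_step()                           # hardware works meanwhile
        results.extend(queue.reap())                  # handle finished packets
    return results


processed = run_stack(["p0", "p1", "p2", "p3", "p4"], depth=2)
```

The point of the loop in `run_stack` is that the stack never blocks on a single packet: it refills the queue whenever there is room and collects whatever has finished.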

It also requires that the protocol stack operates without accessing tunnel context on a per-packet basis. This means it needs to relinquish control over context updates, leaving those to the hardware. By context (or 'tunnel context') we mean any secure tunnel-related state data that needs to be carried between packets, such as cipher engine state or sequence number information. If the software stack 'insists' on updating tunnel context by itself, then it effectively has to hold back each packet for a specific tunnel until any previous packet from the same tunnel has completed, so that it can perform the context update before submitting the next packet for that tunnel.

Similarly, if the protocol stack is designed to read all of the processing parameters from the tunnel context in order to submit them directly to the hardware (as part of the call to invoke hardware acceleration), the protocol stack needs to wait for any updates from the previous packet (for the same tunnel) to be completed before being able to submit the next packet. Unfortunately, most 'standardized' crypto APIs in existence today operate using this model, supplying the key material with the call to the crypto algorithm, since they were not designed for high-speed data throughput applications.
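The cost of this serialization can be estimated with simple arithmetic. The sketch below is a rough model under stated assumptions: the function name is hypothetical, and the packet size, round-trip time and engine rate are illustrative values rather than figures for any real device.

```python
def tunnel_throughput_bps(packet_bytes: int, round_trip_us: int,
                          in_flight: int, engine_bps: int) -> int:
    """Achievable rate for one tunnel, in bits per second.

    round_trip_us  time from submitting a packet until its completion
                   (including any context update) has been handled
    in_flight      packets of this tunnel the stack may have outstanding
    engine_bps     raw throughput limit of the engine
    """
    bits_per_round_trip = packet_bytes * 8 * in_flight
    pipelined_bps = bits_per_round_trip * 1_000_000 // round_trip_us
    return min(pipelined_bps, engine_bps)


ENGINE_BPS = 10_000_000_000   # assumed 10 Gbit/s engine

# Software updates the context itself: one packet per tunnel at a time.
serialized = tunnel_throughput_bps(1500, 10, 1, ENGINE_BPS)
# Hardware owns the context: eight packets of the tunnel in flight.
pipelined = tunnel_throughput_bps(1500, 10, 8, ENGINE_BPS)
```

With these assumed numbers, a single tunnel is limited to 1.2 Gbit/s when every packet has to wait for its predecessor, against 9.6 Gbit/s with eight packets in flight, even though the engine is the same in both cases.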

Thus, chances are that if a protocol stack uses a standard crypto API for hardware offloading, it is not going to be very efficient working together with a protocol engine. This often implies that a protocol stack capable of efficient hardware acceleration comes with its own proprietary crypto acceleration API. This API should be designed to allow efficient interaction with crypto hardware in general. Consider for instance the following three potential bottlenecks:

User/kernel space transitions.

Interaction with hardware typically requires interaction with (kernel) drivers. Obviously, frequent switches between user- and kernel mode require significant processor bandwidth. An efficient protocol stack minimizes these transitions.

Using DMA capable buffers.

A protocol acceleration engine comes with its own DMA capability. This puts a requirement on the software stack to place data to be processed by the engine in a memory location that is accessible and usable by the packet engine DMA. If the protocol stack is unaware of this and puts packet data in memory buffers that are inaccessible for DMA transactions, are unaligned, or cause cache coherency issues, the packet engine driver may be forced to copy the data to a 'DMA safe' location, at the cost of an extra copy for every such packet.

Use of system cache.

In most systems, the performance difference between being able to execute operations from cache, versus external memory, is huge. The presence of hardware acceleration potentially helps, as the cipher and hash code no longer needs to be in processor cache. Still, the protocol stack must make sure the hardware interaction with the crypto accelerator doesn't prevent the cache system from functioning efficiently, for instance because data structures 'owned' by the hardware engine end up in cache, or because large amounts of packet data are 'pulled through' the cache.
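To make the 'DMA capable buffers' bottleneck above more concrete, the sketch below models a driver that checks whether a buffer is usable for DMA and falls back to a bounce copy when it is not. Everything here is an assumption for illustration: the alignment requirement, the reachable address window, and the `Buffer` and `BouncePool` names do not describe any real driver.

```python
from dataclasses import dataclass

DMA_ALIGN = 64                                   # assumed alignment in bytes
DMA_WINDOW = range(0x8000_0000, 0xC000_0000)     # assumed DMA-reachable range


@dataclass
class Buffer:
    addr: int      # start address of the buffer
    data: bytes    # packet data held in the buffer


def is_dma_safe(buf: Buffer) -> bool:
    """True if the buffer is aligned and lies entirely inside the DMA window."""
    last = buf.addr + max(len(buf.data), 1) - 1
    return (buf.addr % DMA_ALIGN == 0
            and buf.addr in DMA_WINDOW
            and last in DMA_WINDOW)


class BouncePool:
    """Hands out DMA-safe copies and counts how often a copy was needed."""

    def __init__(self, base: int):
        self.next_addr = base
        self.copies = 0

    def prepare(self, buf: Buffer) -> Buffer:
        if is_dma_safe(buf):
            return buf                            # zero-copy fast path
        self.copies += 1                          # the cost the stack should avoid
        bounce = Buffer(self.next_addr, bytes(buf.data))
        slots = -(-max(len(buf.data), 1) // DMA_ALIGN)
        self.next_addr += slots * DMA_ALIGN       # keep the next copy aligned
        return bounce


pool = BouncePool(base=0x8000_0000)
aligned = Buffer(0x8000_1000, b"x" * 100)
unaligned = Buffer(0x8000_2004, b"y" * 100)
same = pool.prepare(aligned)       # returned as-is
copied = pool.prepare(unaligned)   # copied to a safe location
```

A protocol stack that allocates its packet buffers with these constraints in mind always takes the zero-copy path; the `copies` counter shows what it costs when it does not.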

In general, the protocol stack itself should be reasonably efficient in handling individual packets, even if it can offload the cryptographic transformation to hardware. If it takes the protocol stack longer to prepare a new packet for crypto processing than it takes the crypto accelerator to process it, then data throughput is still going to be limited by processor bandwidth. Processor bandwidth, protocol stack efficiency, and cryptographic accelerator throughput should be in balance.

4.2 Inline model

Any form of inline processing, either Bump-in-the-Wire or Bump-in-the-Stack, typically requires dedicated integration with the system. The reference to the complete system, as opposed to just the protocol stack, is deliberate: inline protocol accelerators can be connected directly to the Ethernet MAC interface. For that reason, the inline accelerator must take care of a number of non-cryptographic packet operations that are otherwise done by layers 2 and 3 of the IP protocol stack. Depending on the deployment, this may result in a system where the regular Ethernet driver is completely integrated with the driver for the inline protocol accelerator, with the combination acting as an 'advanced Ethernet driver' in the system. In this scenario, modifications to the protocol stack are also significant; rather than actually taking a packet, classifying it, and managing the crypto transform, the protocol stack now just needs to be aware that IPsec processing has already happened (on ingress), even before the protocol stack 'sees' the packet for the first time. Or vice versa, on egress, that IPsec processing can be deferred until after the protocol stack hands off the packet for transmission. In this scenario, the data plane in the protocol stack is 'reduced' to maintaining statistics, error detection, and exception processing.

Due to the fact that Inline protocol acceleration hardware is capable of autonomous packet classification, the protocol stack needs to have support for functionality that is traditionally only found in network processors: it needs to be capable of interacting with hardware classification functions. This implies setting up and maintaining classification rules, shared with and used by the hardware classifiers, and synchronizing access to data structures used simultaneously and, more importantly, autonomously, by the inline protocol accelerator.
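For inbound IPsec, a classification rule can be as simple as an exact-match entry that maps a packet's destination address and SPI to a security association. The sketch below shows the software side of such a table. The class and function names, the SA handle values and the three outcome labels are hypothetical, and a real implementation additionally has to synchronize rule updates with a classifier that reads the table autonomously, which this sketch does not model.

```python
ESP = 50  # IP protocol number of IPsec ESP


class RuleTable:
    """Exact-match classification rules shared with the hardware (toy model)."""

    def __init__(self):
        self._rules = {}                      # (dst_ip, spi) -> SA handle

    def add(self, dst_ip: str, spi: int, sa_handle: int) -> None:
        self._rules[(dst_ip, spi)] = sa_handle

    def remove(self, dst_ip: str, spi: int) -> None:
        self._rules.pop((dst_ip, spi), None)

    def lookup(self, dst_ip: str, spi: int):
        return self._rules.get((dst_ip, spi))


def dispatch(table: RuleTable, dst_ip: str, protocol: int, spi: int):
    """Decide where an ingress packet goes (outcome labels are assumptions)."""
    if protocol != ESP:
        return ("bypass", None)               # not IPsec: passed on untouched
    handle = table.lookup(dst_ip, spi)
    if handle is None:
        return ("exception-path", None)       # no rule: left to software
    return ("inline", handle)                 # decrypted by the accelerator


table = RuleTable()
table.add("192.0.2.1", 0x1001, sa_handle=7)
hit = dispatch(table, "192.0.2.1", ESP, 0x1001)
miss = dispatch(table, "192.0.2.1", ESP, 0x2002)
plain = dispatch(table, "192.0.2.1", 6, 0)     # a TCP packet
```

Packets that match a rule are handled entirely in hardware, while a miss falls through to the software data plane, which is why the protocol stack has to keep its rules and the hardware's copy of them consistent.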

In the case of look-aside operation, every operation on the cryptographic accelerator is initiated and controlled by software, and operation on certain data structures can thus be easily stopped to manage the core or its associated data structures. With inline protocol acceleration, the accelerator hardware receives packets directly without intervention from software, putting additional requirements on the synchronization between hardware and software in order to manage shared data structures.

Another item often overlooked is the fact that a protocol stack supporting the use of an inline protocol accelerator must be capable of working with multiple 'data planes'. This means that the protocol stack must understand that packet data, plus the associated context data, may be handled by one (or more) hardware protocol accelerators. In addition, the protocol stack itself must implement a full data plane in order to deal with exception situations and packets that cannot be handled by hardware.

A final remark related to inline processing is that not every protocol 'lends itself well' to this operational model. Security protocols that are designed for use at the application level, or 'higher up the IP stack', rely on services provided by the lower stack levels. To support inline acceleration for such higher-level protocols, the inline protocol accelerator would have to implement these services in hardware as well, a task that may not always be feasible. Alternatively, in case an operation is required that is not supported by the hardware, the packet can be processed using an 'exception path' in software, bypassing the hardware accelerator. This should of course only occur for a very small percentage of packets processed.

An obvious example of a higher-level service that is not typically supported in inline hardware is the packet ordering feature of TCP. Protocols relying on this feature, such as SSL/TLS, are typically not fully supported by inline protocol accelerators, which are designed to process packets as they arrive. This implies that 'inline' acceleration of SSL/TLS is typically implemented asymmetrically. For egress, where packet data originates from the local host, packet ordering is guaranteed and inline acceleration is feasible. For ingress, where packets can arrive out of order, the protocol acceleration is typically implemented as a look-aside operation, to allow the packets to be ordered by the TCP stack in software before submitting them for decryption to the protocol accelerator hardware. For typical (HTTP) server deployments this works well, since ingress traffic is typically low, with clients requesting data, and egress traffic is high, containing the actual requested data.

An example of a lower-level service that is not typically supported by inline accelerators is fragmentation/reassembly. Since this is 'not supposed to happen' in a well-configured setup anyway, the processing of fragmented packets is left to the software exception path or 'slow path'.

Chapter 5: And then there's this...

Up to this point we have only discussed 'data plane acceleration' for Data in Transit protection. Data plane acceleration assumes that the key material required to perform the encryption/decryption operation is already present. Before a tunnel is created, these keys must be exchanged with the communication partner. Systems dealing with a high connection setup/tear-down rate may be limited in the number of tunnels they can create because of the cryptographic operations required during this key exchange. This is in fact a typical scenario for a web server protected using SSL/TLS. For this case, a different type of cryptographic accelerator is available, specifically designed to offload the very compute-intensive large-number modular exponentiation operations required by typical key exchange protocols. This type of cryptographic accelerator is referred to as a 'Public Key accelerator' (PKA).
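To show what kind of operation a Public Key accelerator offloads, the sketch below implements modular exponentiation with the textbook square-and-multiply method and uses it for a toy Diffie-Hellman exchange. The tiny numbers are for illustration only; real key exchanges use operands of thousands of bits, which is what makes the operation so expensive in software. This is a plain reference version, not a constant-time or otherwise hardened implementation.

```python
def modexp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by square-and-multiply."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:                      # this exponent bit is set
            result = result * base % modulus
        base = base * base % modulus          # square for the next bit
        exponent >>= 1
    return result


# Toy Diffie-Hellman exchange with deliberately small, insecure parameters.
p, g = 23, 5                  # public modulus and generator
a, b = 6, 15                  # private values of the two peers
A = modexp(g, a, p)           # sent by the first peer
B = modexp(g, b, p)           # sent by the second peer
shared_first = modexp(B, a, p)
shared_second = modexp(A, b, p)
```

Both peers arrive at the same shared value, and each of them needed two modular exponentiations to get there. With operands of realistic size, every one of those is a long chain of large-number multiplications, which is the work a Public Key accelerator takes over.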

In addition to a PKA, a system dealing with a high connection setup rate also tends to require access to a large amount of truly random data. True random data is used to make it hard for an attacker to guess the value of the key material; the security of most protocols relies directly on the quality of the random data used. Creating high-quality random data in a digital system is a challenging task in itself; generating a lot of it without compromising its quality is even harder. To help systems with this challenge, hardware true random number generators are available that use inherent quantum-level effects of semiconductor circuits to generate random data.

By now we have seen that a system providing Data in Transit protection deals with a lot of key material, as well as identity information used to establish a trust relationship with a communicating peer during tunnel/connection setup. The typical 'security model' for devices providing data in transit protection is that the device itself is located in a secure environment and that it is therefore not necessary to provide specific protection for the key material and identity information handled by the device. There can be situations where this security model is not valid, for instance if untrusted software is running on the device, or if the device is located in an unprotected environment. In this case, it may be required to provide hardware-based protection for the key material and identity information handled by the device; this also requires cryptographic hardware, such as a hardware key store or a trusted execution environment.

Chapter 6: Conclusion

In this whitepaper we have highlighted some of the challenges for achieving high-throughput data in transit protection for application processors. Different architectural models have been explained, showing the evolution of cryptographic offloading hardware and the effects the different architectures have on the hardware, software and performance of an application processor based system. We have also looked at the features that make for an efficient cryptographic accelerator. Finally, we looked at the requirements that modern and future systems will place on cryptographic accelerators, from both a hardware and a software perspective.

It will be clear that packet engine design and integration is no longer (primarily) related to the ability to provide high 'raw crypto throughput'. The requirements the system poses on the crypto hardware to allow the system to tap the acceleration potential have become much more important.

Another ongoing trend is that the crypto accelerator is pulling in more and more functionality from the surrounding system. Virtualization support in the packet engine hardware allows the software component in the virtualization layer to become smaller. Bump-in-the-Stack and Bump-in-the-Wire operational modes pull OSI layer 2 and 3 functionality, plus parts of the packet forwarding function, into the packet engine hardware. Lastly, packet engines are evolving to overcome limitations imposed by legacy cryptographic modes and protocols that were never designed to go up to the speeds offered by modern network technologies. Although today this development is perhaps of primary use to server deployments, the next generation of application processors may benefit from the lessons learned today.

AuthenTec's latest generation of packet engines, the SafeXcel-IP-97 and SafeXcel-IP-197 IP core series, are built to support all of the presented optimization, acceleration and offloading mechanisms. These IP cores are supported by the DDK-97 and DDK-197 driver development kits as well as AuthenTec's QuickSec and Matrix toolkits.

About AuthenTec

AuthenTec is a leading provider of mobile and network security. The Company's diverse product and technology offering helps protect individuals and organizations through secure networking, content and data protection, access control and strong fingerprint security on PCs and mobile devices. AuthenTec encryption technology, fingerprint sensors and identity management software are deployed by the leading mobile device, networking and computing companies, content and service providers, and governments worldwide. AuthenTec's products and technologies provide security on hundreds of millions of devices, and the Company has shipped more than 100 million fingerprint sensors for integration in a wide range of portable electronics including over 15 million mobile phones. Top tier customers include Alcatel-Lucent, Cisco, Fujitsu, HBO, HP, Lenovo, LG, Motorola, Nokia, Orange, Samsung, Sky, and Texas Instruments. Learn more at www.authentec.com.

AuthenTec offers an extensive selection of silicon IP cores that offer efficient HW acceleration of IPsec, SSL, TLS, DTLS, sRTP, MACsec and HDCP protocols, in Look-Aside, Bump-in-the-Stack and Bump-in-the-Wire architectures, as well as 3DES, AES (ECB, CBC, CTR, CCM, GCM, XTS), RC4, KASUMI, SNOW3G, ZUC, RSA, ECC and DSA ciphers, and MD5, SHA-1, SHA-2 hash and HMAC cores, accompanied by Driver Development Kits and industry-leading toolkits such as QuickSec/IPsec, QuickSec/MACsec, MatrixSSL, MatrixSSH and DRM Fusion/HDCP. Acceleration performance from a few hundred Mbps to 40 and even 100 Gbps can be achieved in today's 90, 65, 45, 40 and 28 nm designs. Please visit AuthenTec's website for more details (http://www.authentec.com/Products/EmbeddedSecurity.aspx).
