John Poremba and Marco Zandonadi, Teja Technologies
Dec 09, 2004 (5:00 AM)
Improvements in silicon process technology — 130 to 90 to 65 nm — have brought system-on-chip (SoC) devices with multiple embedded processors to mainstream applications ranging from consumer-level handsets sporting multimedia functionality to high-density network infrastructure equipment. Multi-core approaches keep hardware design in the low frequency domains (e.g. less power and heat), offering significant price, performance, and flexibility over higher speed single-core designs.
The embedded processor cores in SoC devices are often part of a "fast-path" fabric designed for applications where performance is paramount. These architectures are equipped with nonstandard memory, bus, and interconnect architectures supplemented with specialty logic circuits such as search engines, cryptographic units, and media access controllers (MACs). As a consequence, the performance of each core is intimately tied to the activities being performed by the system as a whole. In order to achieve the highest performance, developers must write code for each core from a global or system-wide perspective, accounting for contention between cores for the same shared resource.
Abstraction is a key enabling technology for multi-core development. Most developers view the key benefit of abstraction as easing code creation. Abstraction prevents mistakes because you can no longer do certain things wrong, like break pointers or cross into another context's memory or register space. It also reduces easy to make mistakes, such as mixing types or violating mutual exclusion.
However, left solely at the level of simplified code creation, abstraction does indeed produce less efficient code than hand-crafted code. It is when code generation smears the application, operating system (OS), and hardware configuration boundaries that new types of system-level optimization become possible that are simply beyond human capacity. It in fact is possible for code generators to create more efficient code than a person can.
Traditional code development separates the OS, compiler/code generator, and application code. Treating the OS and application as distinct systems requires generalizing interactions between the OS and application using inefficient APIs. To achieve the most performance gains, code must be written to eliminate the unnecessary overhead this kind of generalization introduces.
The typical high-performance programming model is to work with a low-level language like assembly that developers hand-optimize according to the constraints of the fast path processing components to achieve a high processing efficiency. However, locking specific hardware and software implementation details into the application code makes it difficult to fine tune the partitioning of a system, limiting the exploration of alternative configurations that could potentially yield higher performance.
The challenge with manual fine-tuning is that if you change even one aspect of the system, potentially all of the interactions between components change as well, requiring a comprehensive retuning. Additionally, such code is tied to the particular processor architecture and a change in processor necessitates a complete software redesign.
Abstracting code using a higher-level language alleviates some fast-path constraints. For example, code can be quickly ported to a new platform. However, languages like C don't have the inherent flexibility required to exploit the detailed functionality of fast path components. For this reason, many processors support some flavor of C that has been modified for a particular processor. Unfortunately it is often a language peppered with pragmas and extensions that requires as much hand-optimization and as effectively ties code to a particular platform as native assembly does.
To bring order to the chaos, developers must be able to abstract the underlying and individual components of the fast path. In this way, developers can focus on designing at the functional level, leaving the details of resolving and optimizing these abstractions to tools that can tune and retune code and hardware assignments from a global perspective.
The foundation behind abstracting both hardware and software components is remapping. Certain mechanisms, such memory allocation, message passing, mutual exclusion, synchronization, and queuing, to name a few, are common across processors and operating systems. What differs is the particular implementation of those mechanisms on each architecture and how each implementation affects overall system performance.
Through remapping, the code generator uses the overall system view (global perspective) and determines the implementation details for the mechanisms in each particular case based on the dynamic configuration of the system. By abstracting this to the code generator, remapping becomes a simple process that optimizes performance at compile time without burdening the processor with the runtime overhead traditionally associated with ease-of-programming.
Many designers rely on embedded OS to schedule tasks and manage shared resources across multiple cores. Such blind assignment at runtime undermines system determinism and cannot take into account real time contention and load issues. Designers must take a trial-and-error approach to fine tuning core interaction that is, at best, a generalized but far from optimal implementation.
To truly optimize resource efficiency, designers need to strip away all layers of abstraction that increase system overhead. For example, complex OSes like Linux abstract memory into a virtual address space, making it possible for programs to see more memory than is physically present on the system and to use memory that is protected from external interferences. These benefits, however, come at a cost of several runtime conversions and translations, resulting in a one-to-several cycle penalty to be paid with every memory access. For fast-path processing, such a penalty is too high. Put another way, designers may find themselves specifying a higher performance processor than they actually need.
To optimize the fast path, a designer needs to resolve abstractions like memory virtualization at compile time, as well as bypass well-meaning OS mechanisms (e.g. preemptive scheduler) that slow code down. This requires consolidated teamwork between the OS, compiler, and application.
Briefly, to bypass the OS, the code generator needs to be able access hardware primitives directly (Figure 1). However, the particular hardware primitives that need to be accessed differ based on a variety of operating circumstances that are application specific. By having access to the overall application (the global perspective), the code generator can specify those (and only those) primitives needed in each instance. As a result, the code generator will actually create multiple versions of the same task, each optimized for where it is placed in hardware and its relative relationship to the other tasks/functions with which it interacts.
Figure 1: To optimize the fast path, the code generator must be able to access hardware primitives directly, bypassing high-overhead API and OS mechanisms.
Consider the example of task handoff. TaskX running on core A1 will pass off data to an instance of TaskY which may be on core A1, core B1, or core C2. The data registers of core A1 are directly accessible by core A1. Core B1, another core on the same die, can access A1 registers using next-neighbor mechanisms. Core C2 on another die cannot access A1 registers directly but it can access a copy of those registers through an interchip communications link or shared memory (Figure 2).
Figure 2: TaskX running on core A will pass off data to an instance of TaskY which may be on core A1, core B1, or core C2.
In an abstracted development environment, the code running on cores A1, B1, and C2 would all be the same and could be as simple as a single line of code: TaskY(parameter1, parameter2,...parameter n). However, the underlying implementation of access to the A1 registers is very different depending on in which core it resides.
In an OS-driven environment, the handoff to TaskY is a virtual function call in all three cases. This single line of code requires a great deal of decoding by the OS, which must first determine the relationship of A1 to the handoff core, resolve which handoff mechanism to use, and then complete the transfer.
In a fast-path optimized system, the programmer still gets the benefit of ease of programming. The difference is that virtualization is resolved at compile time by the code generator, which optimizes runtime transfers between tasks by bypassing the operating system. While the high-level code on the three cores looks the same to the programmer, the actual code generated is quite different.
In the case of a task handoff from core A1 to a task within core A1, the exchange is a simple register to register move. The code generator generates this single opcode. From A1 to B1 the exchange is a simple next-neighbor register move, again a single opcode. From A1 to C2 requires an interchip request and transfer, and the code generator creates x opcodes. Note that in all three cases, the resulting code bypasses OS mechanisms and overhead and goes straight to bare hardware primitives.
Now reconsider the task handoff from core A1 to A1. There may be no need to actually exchange any data at all. Since the code generator creates the code for all tasks in the system (global perspective), it is able to further optimize code. For example, a commonly used value may be stored in a register to avoid multiple load/stores. Usually the code generator will dedicate several consecutive indirect or relative registers as opposed to specific registers. In this way, the code generator merely passes the base register to TaskY and assigns a new sequence of indirect registers to TaskX with almost zero overhead.
To code such an exchange by hand would be onerous, especially considering that any change in the system could, for example, constrain the dedicated use of a register, requiring a substantial rewrite of code. For code generators, such rewriting is not onerous. This makes it possible for each instance of the code to be optimized for its actual operating circumstances. And since the code is generated, any changes in the system that would have a ripple effect on this optimized code are absorbed during the next compile.
Such context-sensitive optimization possible from a global perspective extends to many aspects of a system. Mutual exclusion (mutex) of shared resources such as memory, for example, is used to prevent two or more tasks from simultaneously using the same resource. If a mutex is private to a task — no other task accesses the resource — the code generator can implement the mutex in a trivial fashion.
Likewise, instead of calling OS services or functions in a generic and parameter-bulky fashion, only those parameters that are used are passed. This can be implemented by inlining calls or supporting multiple entry points into a function.
Bypassing the OS
The handoff example above shows how particular fast-path components can be abstracted, easing system design, while at the same time facilitating highly efficient system-level optimizations. Such abstraction can be taken as far as you like, even to the point of enabling completely hardware-independent APIs. Once you reach this point, the particular fast path mechanisms and even processor don't matter. In each instantiation of code, the code is optimized for the resources available. In fact, cores A and B don't even have to be the same kind of processor, such as in heterogeneous NPU+CPU, DSP+CPU, or even FPGA+CPU systems.
Note that abstraction doesn't need to stop at hardware. It is also possible to create OS-independent implementations by abstracting the appropriate OS services. Consider the role of a scheduler that manages a queue of tasks. As fast-path optimization creates more deterministic code, it becomes possible for the code generator to optimize scheduling in the same way.
As an example, consider that as the code generator resolves abstract fast path components, it can determine which components will trigger a context switch. Context switches can be expensive in terms of cycles. Registers may need to be stored and restored, base registers and memory spaces redefined, and system flags adjusted. The OS may also have a number of housekeeping items to take care of, pushing a context switch into potentially hundreds of cycles (Figure 3a). The reason context switches are so costly is because you don't know what code may run or which registers will be used and you have to make sure that none of a task's data gets stepped on while it is waiting.
Figure 3: Diagram showing how to eliminate context switching overhead. 3a) Shows two executing threads and the OS functionality required to manage them. 3b) Shows how traditional OSes require context switching between tasks to perform concurrent execution. Context switching is performed at runtime and may take hundreds of cycles. 3c) Shows the fast path optimized approach which allows tasks to be interleaved without context switching, thereby eliminating this expensive overhead.
A code generator with a global perspective does have this information. Instead of absorbing the overhead of a context switch, the code generator can resolve the context switch inline and only those elements which would actually be affected are stored and restored.
Consider two tasks that share no data or registers (Figure 3b above). There is no context to save between them. By interleaving their code, the code generator achieves a zero overhead context switch. Effectively the core runs a single thread managing multiple tasks in parallel without the OS overhead of managing the tasks; task management is done during compile time rather than at runtime.
For tasks that share registers, the code generator could look ahead and determine (a) what dependencies there are between the tasks and (b) whether any of these dependencies are actually compromised during this particular context switch. This logic scales to several tasks; interleaving two tasks or ten, while more complicated for a person, is a more or less a matter of dependency database management for the code generator.
A More Deterministic Nature
With context switching managed by the code generator, performance takes on an even more deterministic nature. In the case of task handoff using OS-based context switching, chip-to-chip communication may take long enough to warrant a context switch. Whether the task initiates a context switch is determined by the handoff virtualization function only after the type of handoff is resolved. With a fast-path approach, the need for a context switch is determined by the code generator and implemented in a manner that results in zero overhead.
With such a high degree of control over task management, it becomes possible to manage external events via polling as opposed to interrupts or software signals. Interrupts and signals are cycle-heavy mechanisms necessary for architectures running multiple threads because there is no guarantee that the polling mechanism will be called frequently enough to not miss an event. Interrupt mechanisms also add a significant element of non-determinism to a system.
In a context-managed architecture running all tasks as a single thread, the code generator can bypass OS mechanisms to poll for events directly, such as checking if a new packet is available. Polling can take place deterministically based on a set range of cycles, consuming less cycles in the long term than interrupt mechanisms.
Other forms of optimization include intelligent profiling of the fastest path in the fast path. Consider that the fast path focuses on typical processing, not the exception. Ideally, when classifying what action to take on a packet or data, you want to limit the number of conditional branches, which reduce processing efficiency by flushing the pipeline. By designing complex logic in terms of a state machine, it becomes possible to profile how frequently each transition is used. With this information, the code generator can create branchless statements (Figure 4).
Figure 4: Typical conditional branching logic results in a branch if true and a branch if false, flushing the pipeline in either case.
Instead of each conditional resulting in a branch if true and a branch if false (branching is evenly distributed and always triggers a pipeline flush), branchless conditionals do not branch if true and branch twice if false (branching is biased towards the exception, resulting in fewer overall branches). Figure 4 above illustrates a few nested conditionals. Note: Actual code typically involves 10s of conditionals and is feasible only through code generation.
To achieve the performance efficiency required for fast-path applications, developers need to be able to abstract both hardware and software, as well as bypass the operating system wherever feasible. However, ease of programming cannot come at the expense of runtime resolution. By remapping fast-path components at compile-time, code generators can create content-sensitive code optimized with an application-wide (global) perspective of the system. Smearing the application, OS, and hardware configuration boundaries at this level not only makes multi-core designs feasible, it results in performance gains up to an order of magnitude better than the design model using traditional OS scheduling and partitioning.
About the Authors
John Poremba is a senior software engineer with Teja Technologies. He holds a M.S. in Engineering from Santa Clara University and a B.S. in Electrical Engineering from the California Polytechnic Institute. John can be reached at email@example.com.
Marco Zandonadi is a senior software engineer with Teja Technologies. He holds a M.S. in Computer Science from the Universita Degli Studi di Milano. Marco can be reached at firstname.lastname@example.org.
Copyright © 2003 CMP Media, LLC | Privacy Statement
Click here to read more ...