By Lorenzo Di Gregorio, Lantiq
The Protocol Processor (PP) is an application-specific processor employed in several products at Lantiq. In this paper we discuss the limitations of simple application-specific processors within manycore architectures and we report on our work on the PP to eliminate or mitigate these limitations. We show that far more than 50% of the performance can be lost within manycore architectures and that much of it can be recovered by more advanced design techniques.
Within communications and networking chips, most general-purpose processors still operate in single or dual-core architectures, but across the semiconductor industry there is growing evidence that the number of processor cores operating in manycore architectures is doubling every three years.
These cores, increasingly embedded in networking and multimedia products, are very often application-specific designs, tailored to the characteristics of their selected domains.
The principle behind the development of these application-specific processor cores has been that a simple architecture with a specialized instruction set can be far more efficient than a complex architecture with a general-purpose instruction set. While this comparison can hold true on an instruction basis in isolation, it does not hold true within systems where memories are shared among multiple clients and communication is affected by long latencies. In such constrained environments, more complex processor architectures are capable of delivering more performance.
In this paper we argue that the evolution toward multicore systems requires some fundamental changes in the development of application-specific processor cores and we present the work that we have carried out on one application-specific processor to cope with these changes.
Figure 1: four processors sharing four interleaved program memories and other generic devices.
Application-specific instructions demand many operands from multiple different sources and provide results to multiple different destinations. When operating in isolation, the sources and destinations are readily available to the processor, but in manycore architectures the same resources can be occupied by other devices, causing any simple processor architecture to stall. In particular, the program memory can be shared, because processors operating in manycore architectures often execute the same software and it would be prohibitively costly to replicate the whole program contents for every parallel instance of the processors.
We have applied efficient architectural techniques for reducing these stalls to the newest Protocol Processor (PP) of Lantiq. The PP series are multithreaded controllers, enhanced for moving and manipulating data in fast and reactive data planes. “Data plane” is the generic name given to a set of modules dedicated to the handling of multiple heterogeneous data streams, which would otherwise easily overload a standard processor. The PP cores are employed instead of these modules or alongside them, in order to provide flexible functionality and to avoid tight coupling to the standard processor. They provide a fundamental protocol processing technology and have already been successfully employed in many products.
The results of our work are fairly general and our conclusions apply to a broad range of architectures.
Figure 2: the utilization of every processor drops considerably within large manycore architectures.
2. HOW SERIOUS IS THE PROBLEM
One of the most common styles of parallel programming is called “single-program-multiple-data” (SPMD): it consists of spawning the same task on multiple processing cores working on different data sets.
Figure 1 represents this concept by a block diagram of a system with four processors sharing, in a full mesh, one common logical program memory composed of four interleaved physical blocks, as illustrated by the numbers representing the logical addresses mapped within each physical block.
The rationale behind this organization is that the processors tend to fetch code sequentially, so after a transient period they all fetch instructions from different physical blocks. This tendency is disrupted by non-sequential fetches caused by branches, and by stalls due to accesses to local or shared devices, in particular to a shared data memory.
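As an illustration of this organization, the interleaving can be sketched as a simple modulo mapping. This is a hypothetical model for exposition; the actual address layout of the system in figure 1 may differ.

```python
# Hypothetical sketch of low-order address interleaving across four
# physical program memory blocks (not the actual Lantiq memory map).

def bank_of(address: int, num_banks: int = 4) -> int:
    """Consecutive logical addresses map to consecutive physical blocks."""
    return address % num_banks

def offset_in_bank(address: int, num_banks: int = 4) -> int:
    """Row within the selected physical block."""
    return address // num_banks

# Four processors fetching sequentially from staggered start addresses
# occupy four different banks on every cycle: no bank conflicts.
starts = [0, 1, 2, 3]
for cycle in range(4):
    banks = [bank_of(start + cycle) for start in starts]
    assert len(set(banks)) == 4  # all banks busy, none contended
```

A taken branch can land any processor on an arbitrary bank, which is exactly the source of conflicts discussed above.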
Figure 2 shows that a typical workload, achieving more than 80% processor utilization, can fall to nearly 30% when the number of cores and interleaved program memories is raised from 1 to 16.
The lower diagram in figure 2 explains the main reasons behind the performance drop: the stalls are initially caused by the lack of instructions due to non-sequential fetches, and later by the rising traffic congestion on the shared data memory.
The data in figure 2 has been produced for a standard one-cycle program memory access latency, but the depth of the logic for arbitrating and distributing program fetches across the memories is likely to require a higher latency.
Figure 3 shows the effect of this increased latency on the same workload: congestion is dominant in larger configurations, but smaller configurations with less congestion can lose 50% of their utilization solely because of the increased latency.
Figure 3: a manycore not affected by congestion can be very sensitive to the program access latency.
The PP series provide instructions supporting all possible bit-wise data movement between internal registers, the data memory interface and the peripheral interface. Figure 4 presents an overview of the architecture: the white boxes constitute the hardware core and the yellow boxes constitute the environment. The peripherals are addressed by fixed-place decoding, and the program and data memory interfaces are registered, thereby providing one whole cycle of setup time and one additional whole cycle of propagation delay. Both the execution and data alignment units provide bit manipulation features, and the two branch units decide on branches with three or four slots respectively, which can be configured as delay slots, for higher performance, or penalty slots, for higher code density. It is relevant to the discussion of the results in section 4 that transfers from the data memory to the peripherals can be executed by a single instruction which does not employ any general-purpose registers.
Figure 4: schematic overview of the PP.
This pipeline provides two whole cycles on both the program and data memory interfaces (address registered out, memory access, data registered in), hence there is some timing slack available for employing a higher number of physical memory blocks.
Figure 5: gain for multiple physical memories.
The results shown in figure 5 demonstrate that merely by employing multiple physical memory blocks, a utilization gain of between 5% and 10% can be obtained. The breakdown of stall contributors shows that the reduced congestion on the program memory is offset by increased congestion on the data memory, which limits the achievable gain. This effect is not caused by lack of bandwidth, but rather by higher data access latency.
In order to overcome these limits, we employ:
- Multithreading in the form of virtual processors
- Minimalistic access/execution decoupling
- Speculation on linear instruction fetch
Multithreading is a well-known method to tolerate latencies. A simplified version of the multithreaded programming model is sometimes called a “virtual processor”: a virtual processor is a set of threads with at least one non-terminating thread. This set behaves as a reduced-thread version of its underlying physical processor. The architectural registers of one virtual processor are called a “context”, and a context switch operation selects one virtual processor for execution: the typical policy is to make a fair selection among those contexts which are active, i.e. not stalled due to lack of data from the environment.
The environment interacts with the virtual processors as if they were independent physical processors, and a QoS scheduler must switch among the virtual processors to ensure a fair allocation of execution time. In this process, a desirable property is that the cycle budgets are not affected by the context switch activity; consequently the context switch operations must be carried out at zero cycle cost and in random order.
Virtual processors can tolerate latencies because the underlying context switch policy keeps the physical processor busy by selecting an active virtual processor when another one stalls. However, once a number of threads has stalled, there is no further parallelism to exploit; to sustain performance over longer latencies it is necessary to keep the virtual processors running even though the memories have not yet responded to some data requests from the threads.
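A minimal sketch of such a context selection policy, assuming a simple round-robin rotation among active contexts (the PP's actual scheduler is not described here, so this is purely illustrative):

```python
# Illustrative round-robin context selection among virtual processors:
# each cycle, pick the next context that is active (not stalled).
# Names and structure are assumptions for exposition.

def select_context(active, last):
    """Fair round-robin among active contexts; None if all are stalled."""
    n = len(active)
    for i in range(1, n + 1):
        candidate = (last + i) % n
        if active[candidate]:
            return candidate
    return None

# Context 1 is stalled waiting on memory; the rotation skips it.
print(select_context([True, False, True, True], last=0))  # -> 2
```

Because the selection is combinational over the per-context state, it can in principle be performed every cycle at zero cycle cost, as required above.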
A cheap technique for doing so is called decoupling: rather than waiting on a reply from the memory system to finish carrying out a load instruction, a decoupled architecture merely marks the target register of the load as “invalid” and proceeds to subsequent operations. If an operation to be executed has an invalid register as one of its arguments, its execution is stalled until the expected data arrives from the memory and the register is marked valid again. Decoupling emerges naturally out of Tomasulo's scheme for out-of-order execution, but its main purpose here is to provide a cheap scheme for implementation within an in-order pipeline.
Decoupling improves performance because it allows further operations to be issued while a data access is still in flight. In particular, it is possible to issue a burst of loads to a memory system which presents a high latency to the first load, e.g. due to a pipelined memory arbiter, in order to bring in a data line from which the subsequent loads can be served quickly.
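The invalid-register bookkeeping can be sketched as a small scoreboard. This is an illustrative model of the mechanism described above, not the PP's implementation:

```python
# Minimal sketch of access/execute decoupling with a register scoreboard:
# a load marks its target invalid and execution continues; an instruction
# reading an invalid register stalls until the memory reply arrives.

class Scoreboard:
    def __init__(self, num_regs):
        self.valid = [True] * num_regs

    def issue_load(self, target):
        self.valid[target] = False   # proceed without waiting for the reply

    def memory_reply(self, target):
        self.valid[target] = True    # data arrived, register usable again

    def can_issue(self, sources):
        """An instruction may issue only if all its operands are valid."""
        return all(self.valid[r] for r in sources)

sb = Scoreboard(8)
sb.issue_load(3)                 # burst of loads can follow immediately
print(sb.can_issue([1, 3]))      # -> False: consumer of r3 must stall
```

Note that only the consumer stalls; independent instructions keep flowing through the pipeline while the load is outstanding.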
The same effect achieved by decoupling on the data memory can be achieved on the program memory by a form of speculative prefetch which we term “greedy fetch”: if instructions are not being delivered, a processor must normally stall, but by means of some microarchitectural techniques we can insert bubbles at the top of the pipeline, thereby keeping the fetch unit active for some cycles while continuing to execute the instructions already in the pipeline. If a branch diverts the control flow, the instructions fetched in excess must be dismissed.
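A rough behavioral model of greedy fetch, under the assumption of a simple buffer of in-flight fetch addresses (names and structure are illustrative, not the PP's fetch unit):

```python
# Hedged sketch of "greedy fetch": while the program memory has not yet
# delivered an instruction, keep issuing sequential fetch addresses
# (speculating on linear control flow) instead of stalling the fetch
# unit; a taken branch dismisses the over-fetched addresses.

def greedy_fetch(pc, in_flight, max_in_flight):
    """Issue sequential speculative fetches up to the buffer limit."""
    next_addr = in_flight[-1] + 1 if in_flight else pc
    while len(in_flight) < max_in_flight:
        in_flight.append(next_addr)
        next_addr += 1
    return in_flight

def on_taken_branch(in_flight, target):
    """Failed speculation: dismiss the excess fetches, restart at target."""
    return [target]

print(greedy_fetch(100, [], 3))               # -> [100, 101, 102]
print(on_taken_branch([100, 101, 102], 200))  # -> [200]
```

The cost of the speculation is exactly the dismissed fetches, which is why section 4 below measures executed instructions rather than raw utilization.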
Figure 6: Buffers introduced for supporting virtual processors, decoupling and greedy fetch.
Figure 7: overview of the instruction fetch.
The buffers shown in figure 6 provide the fundamental support for enhancing the pipeline of PP.
- Instruction buffer: collects instructions being returned from the program memories to one context while this context is not being executed.
- Pipeline register per context: holds a decoded instruction which has stalled a thread because it is waiting for a valid register value or a valid operand from some peripheral.
- Access buffer: collects memory access operations waiting for memory data to proceed.
- Memory buffer: collects memory data waiting for memory access operations to be processed.
A scheduler can select any context, and if its instruction buffer contains instructions, they get executed. Figure 7 presents an overview of the instruction fetch logic. Once a fetch decision is taken, the outgoing program address is stored along with a few attribute bits in a context-specific FIFO. When the corresponding instruction arrives from the program memory, it can be executed if it belongs to the currently active context. The purpose of the many bypasses is to present only one additional 2:1 MUX gate to the critical path through the decoder.
In the decoding stage, the instruction accesses the register file and peripherals to fetch its operands. These operands can be valid or invalid. An invalid operand is a value from the periphery which has not been provided yet, or a register which has been designated as the target of a load but has not yet received its data from the memory.
On reading an invalid value, the instruction stalls its thread and waits in the pipeline register of the execution stage until all of its operands are valid. Then it proceeds through the execution stage or the first branch decision. In the execution stage, a data access instruction initiates a memory read cycle.
Figure 8: overview of the instruction decoupling.
After the execution stage, if the processor is not frozen, the instruction can proceed to the memory delay block or to the second branch decision stage, where test-and-branch instructions, which performed a test in the execution unit, get decided.
In this stage, memory access instructions get queued in the access buffer if they need to wait for their data from the memory buffer or if other access instructions are queued before them. At the same time, an invalidation command is sent to the register file to mark the target register of the load as invalid. If the load data is available immediately, both the invalidation command and the valid data are sent to the target register; the valid data has higher priority, ultimately setting the register as valid. Execution instructions are sent past the access buffer to writeback.
Write-after-write hazards are avoided by making instructions wait in the pipeline registers of the execution stage until their target register becomes valid. While this is not the highest-performing scheme, it is convenient to implement because many instructions manipulate single bits and do not fully overwrite their destination register.
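The stall rule described above can be condensed into a single predicate (an illustrative model, not the PP's RTL): an instruction may leave the execute-stage pipeline register only when all source operands are valid (read-after-write) and its target register is valid too (write-after-write with an in-flight load).

```python
# Combined RAW/WAW stall check on the register validity bits.
# The target check matters because bit-manipulation instructions only
# partially overwrite their destination, so they also *read* it.

def can_leave_execute(valid, sources, target):
    """True when the instruction no longer needs to wait in the pipeline."""
    return all(valid[r] for r in sources) and valid[target]

valid = [True, True, False, True]    # r2 awaits an in-flight load
print(can_leave_execute(valid, [0, 1], 2))  # -> False: WAW on r2
print(can_leave_execute(valid, [0, 1], 3))  # -> True
```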
The PP can switch context at every clock cycle without restrictions. The register file and the buffers are all banked per context. The program addressing scheme maintains per context:
- Current program address: always incremented by one unit for the next sequential fetch.
- Branch target address: updated when a branch is taken but its target cannot be fetched yet, because the branch decision has not taken place in the context currently selected in decoding.
- Count of the fetched branch slots: when all delay slots of a branch, whose amount is individually configurable, have been fetched, the current program address can be replaced by the branch target address for taking the branch.
Figure 9: block diagram of the switch resolution logic.
- Count of the executed delay slots: after all delay slots have been seen through the execution unit, further instructions get nullified until the branch target reaches the execution unit.
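The fetch-side part of this per-context bookkeeping can be sketched behaviorally as follows. Field names and the exact update order are assumptions for illustration; the PP's real logic (figure 9) is not reproduced here.

```python
# Illustrative per-context fetch-address update: after a taken branch,
# keep fetching the configured number of delay slots, then redirect the
# current program address to the stored branch target.

class FetchState:
    def __init__(self, pc):
        self.pc = pc                 # current program address
        self.branch_target = None    # pending branch target, if any
        self.slots_left = 0          # delay slots still to be fetched

    def take_branch(self, target, delay_slots):
        self.branch_target = target
        self.slots_left = delay_slots

    def next_fetch(self):
        if self.branch_target is not None and self.slots_left == 0:
            self.pc = self.branch_target   # all delay slots fetched: redirect
            self.branch_target = None
        else:
            if self.branch_target is not None:
                self.slots_left -= 1       # fetching a delay slot
            self.pc += 1                   # sequential fetch
        return self.pc

fs = FetchState(10)
fs.take_branch(target=50, delay_slots=2)
print([fs.next_fetch() for _ in range(4)])  # -> [11, 12, 50, 51]
```

The execution-side counter listed above plays the mirror role: it nullifies over-fetched instructions until the branch target reaches the execution unit.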
One difficult problem in implementing virtual processors is how to handle conflicts between thread switch instructions in the software and context switch commands from the hardware without breaking the rule that the switching activity is neutral to the cycle budgets, i.e. without stalling the software instruction or imposing constraints on the hardware. Realizing this feature allows a thread scheduler to run within a virtual processor, as shown in the example of figure 10.
The policy we follow in case of such conflicts is to take the hardware switch and postpone the software one. The latter will be executed in place of the “return” instruction which was supposed to terminate the hardware switch. Figure 9 shows that this concept can be implemented by replacing the return operation with the contents of a “return register” which is properly programmed when a conflict is detected.
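The resolution policy just described can be modeled in a few lines. This is an illustrative sketch of the behavior, with hypothetical names; it is not the logic of figure 9.

```python
# Conflict resolution between a hardware context switch and a software
# thread switch: the hardware switch wins, and the software switch is
# recorded in a "return register" to be executed by the eventual return.

def resolve(hw_switch, sw_switch, return_reg):
    """Return the context to run next; postpone a conflicting sw switch."""
    if hw_switch is not None and sw_switch is not None:
        return_reg["pending"] = sw_switch   # remember the software switch
        return hw_switch
    return hw_switch if hw_switch is not None else sw_switch

def on_return(return_reg, caller_ctx):
    """The 'return' executes the postponed switch if one was recorded."""
    return return_reg.pop("pending", caller_ctx)

rr = {}
print(resolve(hw_switch=2, sw_switch=5, return_reg=rr))  # -> 2 (hw wins)
print(on_return(rr, caller_ctx=0))                       # -> 5 (postponed)
```

Because neither path inserts a stall cycle, the cycle budgets of the contexts remain unaffected, as required.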
A formal proof of this scheme has been carried out.
Figure 10: transparent integration of software thread scheduling (top) within virtual processor switches (mid)
Figure 11: virtual processors achieve a higher utilization until prevented by data memory congestion.
The improvements offered on general-purpose workloads by full-blown multithreading, decoupling and speculation techniques are well known within the high-end computer architecture community, but we need to quantify the improvements achieved by our lightweight variants within manycore configurations with respect to our reference workloads.
Figure 11 presents the gain in processor utilization from employing multithreading. Since it is well known that multithreading is effective in tolerating latencies, we have increased the random latency of the blocking accesses to peripherals, but we have left the data memory access latency at its basic value, because a higher latency would massively penalize the single-thread performance, as we show later, and produce a less clear comparison. Furthermore, we have employed two physical memories rather than one in order to increase the bandwidth and avoid unstable queues. As expected, we observe that multithreading is capable of raising utilization up to 100%, but only as long as the traffic congestion on the data memory allows it. When this congestion raises the data access latency, the lack of timely response from the data memory becomes the major cause of processor stalls during multithreaded operation, as shown in the diagram at the bottom of the figure.
To overcome the data stalls in the threads, we have decoupled access from execution; we report our results in figure 12. Since we now compare two multithreaded versions, we can reproduce a more realistic scenario by increasing the data memory access latency by 6 cycles. The outcome of this comparison is that decoupling is capable of eliminating most data stalls, thereby gaining 5% to 10% of utilization on top of the multithreaded operations, across all configurations. In the diagram at the bottom of the figure we can observe that if too many loads are in flight between the processor and the data memory, the instruction buffers in the pipeline fill up and become a cause of stalls for the threads. In this example the instruction buffers have a depth of 3 instructions per context, and their stall probability equals that of the load-use hazards of the workload.
Figure 12: decoupling eliminates most data stalls until the access buffers (figure 8) get full.
Although an improvement of 5% to 10% might not appear significant, it has to be noted that it is most of what can be achieved by removing the data stalls in multithreaded operation. For comparison, in figure 13 we report the same situation under single-thread operation. While the utilization of a single thread without decoupling is stuck at 40% due to data stalls, the utilization of the same thread with decoupling reaches 60% to 80%, a gain of 20% to 40%, with a large reduction of the data stalls.
Figure 13: demonstration of how effective decoupling can be against data stalls (no thread parallelism).
One important remark in this context concerns the use of instructions which transfer data without requiring a general-purpose register. These data transfer instructions present no load-use hazard; hence simple decoupling can completely remove the data stalls as long as the instruction buffers are deep enough to contain the in-flight loads.
In figure 3 we have shown how the access latency to the program memory can massively affect the overall utilization; we have addressed this problem by developing a speculative (greedy) instruction fetch scheme.
Figure 14: greedy fetches can tolerate latencies.
When the program fetches are speculated, the utilization of the processor is no longer the figure of merit, because it includes an increased amount of penalties: instructions which have been fetched but must be nullified due to failed speculations. In figure 14 we therefore report the percentage of actually executed instructions: from the utilization cycles we have subtracted the cycles in which instructions fetched due to failed speculations were nullified. We observe that the number of executed instructions can be made considerably less sensitive to the access latency by employing greedy fetches: for a latency of 5 cycles (address to memory access plus 3 cycles) the achieved performance can be doubled in all configurations.
We have demonstrated how, when operating in manycore configurations, a simple processor architecture can lose far more than 50% of its performance, and we have shown three techniques which can mitigate or almost eliminate this loss.
We have implemented these techniques in the last version of the PP, which we have briefly described.
Some examples have been provided in figure 7, where only one 2:1 mux gate affects the critical path and the instruction buffers can be sized to trade off area against performance (greedy fetch), and in figure 9, where a rather complex algorithm turns out to require a negligible amount of logic and a few muxes outside of the critical path.
Although many important structures could not be documented in this paper due to space limitations, we can state that our implementation has had only a minimal impact on the achievable frequency, and that the area impact can be traded against performance, being directly related to the chosen depth of the buffers rather than to the logic.