ASIC design flow gives CPU core custom performance

By Naresh Soni, Director, Advanced Designs, Nick Richardson, ST Fellow, Lun-Bin Huang, Senior Principal Engineer, Central R&D, STMicroelectronics, Inc., San Diego, Calif., EE Times
August 19, 2002 (10:35 a.m. EST)
URL: http://www.eetimes.com/story/OEG20020814S0036

Ever-increasing levels of CPU performance demanded by embedded applications, together with product design cycles that have often been reduced to only a few months, have made it important to produce synthesizable processor cores capable of execution speeds typically achievable only by complex custom solutions. In many cases, it is prohibitively expensive to use lengthy custom design flows.

The iCORE project at STMicroelectronics was born out of the need to demonstrate a solution to this problem. The goal was to create a very high performance version of the company's ST20-C2 embedded CPU architecture, but it also had to be shown that the design could be quickly and easily ported as a soft core across existing and future technologies. This ruled out the use of extensive custom circuits, and led to the adoption of a methodology close to a traditional ASIC design flow, but one tuned to the aggressive performance goals demanded by the project.

The ST20-C2 architecture is a superset of the Inmos Transputer, which was first introduced in 1984 — Inmos was acquired by STMicroelectronics in 1989. The ST20-C2 added an exception mechanism (interrupts and traps), extensive software debug capability and improved support for embedded systems. It is now used in high-volume consumer applications such as chips for set-top boxes.

The performance gap between custom and synthesized embedded cores can be closed by using deep, well-balanced pipelines, coupled with careful partitioning, and mechanisms that compensate for increased pipeline latencies. This enables the use of simple and well-structured logic functions in each pipeline stage that are amenable to highly optimal synthesis. The performance gains are consolidated by use of placement-driven synthesis, and careful clock-tree design.

The ST20-C2's basic instruction-set specifies a set of simple, RISC-like operations, making it a good candidate for relatively easy high-frequency implementation. However, it extends conventional RISC technology by having a variable-length instruction word to promote code compactness, and uses some instructions that specify complex operations to control hardware-implemented kernel functions such as task scheduling and inter-process communications. Also, rather than being based on a RISC-like load/store architecture with a large number of machine registers, the ST20-C2 employs a three-deep hardware evaluation stack, also known as the register-stack, on which the majority of instructions operate. This scheme permits excellent code density, ideal for embedded applications, and is carefully matched with an efficient local memory or cache system in order to minimize the frequency of memory references outside the CPU core.

These characteristics required special consideration to devise a design capable of maintaining high instruction throughput per clock cycle, measured in instructions per cycle (IPC), without compromising the CPU's frequency of operation.

For the project to be considered successful, however, we had to demonstrate optimization in all the following areas, in approximate order of importance:

Frequency
Execution efficiency (IPC)
Portability
Core size
Power consumption

To achieve good instruction-execution efficiency without excessive design complexity, previous implementations of the ST20-C2 architecture employed relatively short pipelines to reduce both branch-delays and operand/result feedback penalties, thus producing good IPC counts over a wide range of applications. In the case of iCORE, however, the aggressive frequency target dictated the use of a longer pipeline in order to minimize the stage-to-stage combinatorial delays. The problem was how to add the extra pipeline stages without decreasing instruction execution efficiency, since there would be little point in increasing raw clock frequency at the expense of IPC. Following analysis using a C-based performance model, a relatively conventional pipeline structure was chosen, but one that had some important variations targeted at optimizing instruction flow for the unique ST20-C2 architecture.

The pipeline microarchitecture has four separate units of two pipeline stages each. The instruction fetch unit (IFU) includes the IF1 and IF2 pipeline stages and is responsible for fetching instructions from the instruction cache and performing branch predictions. The instruction decode unit (IDU) is responsible for decoding instructions, generating operand addresses, renaming operand-stack registers and checking operand/result dependencies, as well as maintaining the program counter, known in this architecture as the instruction pointer (IPTR), and a local workspace pointer (WPTR). The operand fetch unit (OFU) is responsible for fetching operands from the data cache, detecting load/store dependencies, and aligning and merging data supplied by the cache and/or buffered stores and data-forwarding buses. The execute unit (EXU) comprises the EXE and WBK pipeline stages and performs all arithmetic operations other than address generation, and stages results for writing back into the register file unit (RFU) or memory.

The RFU contains the 3-deep register-stack, conceptually organized as a last-in, first-out (LIFO) stack, comprising the A, B, and C registers. In practice, it is implemented using a high-speed multi-port register file. The memory-write interface is coupled with a store buffer (SBF) that temporarily holds data while waiting for access to the data cache.

The instruction-fetch portion of the pipeline (the IFU) is coupled to the execution portion of the pipeline (the IDU, OFU, and EXU) via a 12-byte Instruction Fetch Buffer (IFB).

Memory bandwidth

Early on in the analysis of ST20-C2 program behavior it became apparent that the microarchitecture should not only be tuned for high frequency, but also for very efficient memory access, to get best value from the memory bandwidth available in its target environment: typically a highly integrated SoC with multiple processors sharing access to off-chip memory. This was done by including two in-line operand caches in the pipeline, both capable of being accessed in a single pipeline throw.

Together, these two highly integrated operand caches are complementary to the ST20-C2 architecture's A, B, and C registers, effectively acting like a very large register file.

An important effect of the in-line operand caches is to increase the opportunities for instruction folding. This is an instruction decoding technique that combines two or more machine instructions into a single-throw pipeline operation. Since the ST20-C2 has a stack-based architecture, most of its instructions specify very simple operations such as loading and storing operands to and from the top of the 3-deep register-stack or performing arithmetic operations on operands already loaded onto the register-stack.

But for simplicity, memory and ALU operations are never combined in the same instruction. Without such compound instructions that allow memory and arithmetic operations to occur as single-slot pipeline operations, it would not be possible to take advantage of iCORE's in-line memory structure. With folding, however, up to three successive instructions can be merged into one operation that occupies a single execution slot and which fully uses iCORE's pipeline resources.
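The folding idea can be illustrated with a toy decoder. This is only a sketch under stated assumptions: the article does not give iCORE's actual folding rules, so the rule used here (a load, then an ALU operation, then a store may share one slot, in that order) and the small opcode sets are hypothetical, chosen to show how up to three simple instructions collapse into one execution slot.

```python
# Hypothetical instruction classes; a real decoder would cover far more opcodes.
LOADS = {"ldl"}
ALU_OPS = {"add", "sub", "and", "or", "xor"}
STORES = {"stl"}
PHASE = {"load": 0, "alu": 1, "store": 2}


def classify(instr):
    """Return 'load', 'alu', 'store' or 'other' for an assembly line."""
    opcode = instr.split()[0]
    if opcode in LOADS:
        return "load"
    if opcode in ALU_OPS:
        return "alu"
    if opcode in STORES:
        return "store"
    return "other"


def fold(instrs, max_group=3):
    """Greedily group successive instructions into execution slots.

    Assumed rule: a group may grow only while each next instruction
    belongs to a strictly later phase (load -> alu -> store).
    """
    slots = []
    i = 0
    while i < len(instrs):
        group = [instrs[i]]
        phase = PHASE.get(classify(instrs[i]))  # None means "cannot fold"
        i += 1
        while phase is not None and i < len(instrs) and len(group) < max_group:
            next_phase = PHASE.get(classify(instrs[i]))
            if next_phase is None or next_phase <= phase:
                break
            group.append(instrs[i])
            phase = next_phase
            i += 1
        slots.append(group)
    return slots


program = ["ldl 1", "ldl 2", "add", "stl 3", "j 8"]
for slot in fold(program):
    print(" + ".join(slot))
```

In this example five instructions occupy three slots, because the second load, the add and the store are folded into one operation.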

In a deeply pipelined design, it is important to mitigate the effects of delays caused by pipeline latencies. One significant source of latency-based performance degradation is caused by data dependency between successive instructions. If the dependency is a true one, such as when an arithmetic operation uses the result of another instruction immediately preceding it, then there is no way of eliminating it, although microarchitecture techniques can be used to reduce the pipeline stalls it causes. This is generally done by providing data forwarding paths between potentially inter-dependent pipeline stages.

To activate the forwarding paths at the appropriate times, register dependency-checking logic (also known as register score-boarding) must be used. In iCORE, this logic resides in the ID2 stage. The operation is quite straightforward, and works on the basis of comparing the name of a register required as a source operand by one instruction with the names of all registers about to be modified by instructions further ahead in the pipeline.
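The comparison described above can be sketched in a few lines. The data structure, the function name and the choice of stages shown are illustrative assumptions, not iCORE's actual logic; the sketch only shows the name-matching step that selects a forwarding path.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class InFlight:
    """An older instruction further ahead in the pipeline (hypothetical model)."""
    stage: str            # stage currently holding the instruction
    dest: Optional[str]   # register it will modify, or None


def find_forwarding_sources(sources: List[str],
                            older: List[InFlight]) -> Dict[str, str]:
    """Map each source register to the stage that must forward its value.

    `older` is ordered youngest first, so the first match is the most
    recent producer of that register.
    """
    hits = {}
    for src in sources:
        for instr in older:
            if instr.dest == src:
                hits[src] = instr.stage
                break
    return hits


ahead = [InFlight("OF1", "R1"), InFlight("EXE", "R0"), InFlight("WBK", "R1")]
print(find_forwarding_sources(["R0", "R1"], ahead))
```

A register with no match in the returned mapping has no pending writer and can be read directly from the register file.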

For the ST20-C2 architecture, register dependency checking must handle operands constantly changing position between the A, B, and C registers of the three-deep register-stack, since most instructions perform implicit or explicit pushes or pops of the stack. The effect is that almost every instruction appears to have data dependencies on every older instruction in the pipeline, since nearly all of them cause new values to be loaded into A, B, and C, even though most of those dependencies could be considered "false", in the sense that they are caused simply by the movement of data from one register to another rather than by the creation of new results. Nevertheless, they would cause considerable performance degradation if newer instructions were stalled based on those false dependencies.

iCORE solves this problem with a simple renaming mechanism that maps the conceptual A, B, and C registers of the ST20-C2's architecture onto fixed hardware registers, named R0, R1, and R2. The key is that the same hardware register remains allocated to a particular instruction's result throughout its lifetime in the pipeline despite the result apparently being moved between A, B, and C by subsequent instructions pushing and popping the architectural register-stack. This is accomplished by a mapping table for each pipeline stage that indicates which architectural register (A, B, or C) is mapped onto which fixed hardware register (R0, R1, R2) for the instruction in that pipeline stage.

Operations that only move operands between registers do so by simply renaming the R0, R1, and R2 registers to the new A, B, or C mapping, rather than actually moving data between them. The mapping is performed in ID2 and is based on the effect of the current instruction in ID2 on the existing mapping up to but not including that instruction.
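A minimal sketch of such a mapping table follows. The function names and the exact rotation rules are assumptions made for illustration (the article does not spell them out); the point is that pushes, pops and swaps change only the mapping, never the contents of R0, R1 and R2.

```python
# Architectural register -> hardware register (initial mapping is arbitrary).
INITIAL = {"A": "R0", "B": "R1", "C": "R2"}


def push(mapping):
    """New result becomes A; old A and B slide down; old C is overwritten.

    The hardware register that held C is reused for the new value.
    """
    return {"A": mapping["C"], "B": mapping["A"], "C": mapping["B"]}


def pop(mapping):
    """A is consumed; B and C slide up. C's new contents are assumed don't-care."""
    return {"A": mapping["B"], "B": mapping["C"], "C": mapping["A"]}


def swap(mapping):
    """Exchange the top two stack entries by renaming only."""
    return {"A": mapping["B"], "B": mapping["A"], "C": mapping["C"]}


m = INITIAL
for name, op in (("push", push), ("swap", swap), ("pop", pop)):
    m = op(m)
    print(name, m)
```

Because a value written by a push stays in the same hardware register through later swaps and pops, dependency checks on R0, R1 and R2 see only true dependencies.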

Branch predicting

iCORE's relatively long pipeline gives rise to the danger of branches causing significant performance degradation compared to previous ST20-C2 implementations. After performance simulations showed that the effect of the longer pipeline on some benchmarks was significant, a low-cost branch prediction scheme was incorporated into iCORE's microarchitecture.

iCORE implements a branch prediction mechanism that reduces these penalties. A two-bit predictor scheme is used to predict branch and subroutine-call instruction behavior (taken or not taken), while a branch target buffer is used to predict target addresses for taken branch and call instructions. Also, a 4-deep return stack is used to predict the special case of ret instruction return addresses.
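The two simplest pieces of this scheme, a two-bit saturating counter and a small return stack, can be sketched as follows. The table size, the indexing by address modulo table size, the initial counter state and the overflow behavior of the return stack are all assumptions made for the example; only the two-bit scheme and the depth of four come from the text above.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters (0-1 predict not taken, 2-3 taken)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class ReturnStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=4):
        self.depth = depth
        self.slots = []

    def push(self, return_addr):
        if len(self.slots) == self.depth:
            self.slots.pop(0)  # assumed policy: discard the oldest entry
        self.slots.append(return_addr)

    def pop(self):
        return self.slots.pop() if self.slots else None


predictor = TwoBitPredictor()
for outcome in (True, True, False, True):
    print("predicted", predictor.predict(0x40), "actual", outcome)
    predictor.update(0x40, outcome)
```

The two-bit hysteresis means a single not-taken outcome in a mostly taken loop branch does not flip the prediction.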

Features of the cache controller's design were focused on the need for simplicity (thereby enabling high frequency) and the requirement for very tight coupling to the CPU pipeline. These requirements resulted in a two-stage pipelined approach, in which the first stage handled the tag access and tag comparison, and the second stage handled the data access and data alignment. This gave a very good balance of delays, and had the advantage of saving both power and delay by eliminating the need for physically separate data RAMs for each associative cache-bank. To further reduce delays, integrated pipeline input registers were built into both the tag and data RAMs to eliminate input wire delays from their access times.

The cache controller implements a write-back policy, which means that writes that hit the cache update only the cache and not main memory, as opposed to a write-through cache, which would write to both. Thus, the cache can contain a more up-to-date version of a line than main memory, necessitating that the line be written back to memory when it is replaced by a line from a different address. To facilitate this, a dirty bit is kept with each line to indicate when it has been modified and hence requires writing back to main memory on its replacement. The dirty bits are kept in a special field in the data RAM, and there is only one per four word locations, since each word is 4 bytes and there are 16 bytes in a line. The dirty bit for a given line is set when a write to that line hits the cache.
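The write-back behavior can be modeled in a few lines. This is a simplified sketch, not the iCORE controller: it assumes a direct-mapped organization and a write-allocate policy on misses, neither of which is stated in the article, and all names are hypothetical. It keeps the 16-byte line of four 4-byte words and the single dirty bit per line described above.

```python
LINE_BYTES = 16
WORD_BYTES = 4


class WriteBackCache:
    """Toy direct-mapped write-back cache with one dirty bit per line."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = {}      # index -> {"tag", "data", "dirty"}
        self.memory = {}     # line address -> list of four words
        self.writebacks = 0

    def _locate(self, addr):
        line_addr = addr // LINE_BYTES
        word = (addr % LINE_BYTES) // WORD_BYTES
        return line_addr % self.num_lines, line_addr // self.num_lines, word

    def _lookup(self, index, tag):
        line = self.lines.get(index)
        if line is not None and line["tag"] == tag:
            return line                                   # hit
        if line is not None and line["dirty"]:
            # Replacing a modified line: write it back to main memory first.
            self.memory[line["tag"] * self.num_lines + index] = line["data"]
            self.writebacks += 1
        fill = list(self.memory.get(tag * self.num_lines + index, [0] * 4))
        self.lines[index] = {"tag": tag, "data": fill, "dirty": False}
        return self.lines[index]

    def read(self, addr):
        index, tag, word = self._locate(addr)
        return self._lookup(index, tag)["data"][word]

    def write(self, addr, value):
        index, tag, word = self._locate(addr)
        line = self._lookup(index, tag)   # assumed write-allocate on a miss
        line["data"][word] = value
        line["dirty"] = True              # main memory is now stale


cache = WriteBackCache()
cache.write(0x04, 7)
print(cache.memory)       # still empty: the write stayed in the cache
cache.read(0x40)          # maps to the same line, forcing a write-back
print(cache.memory, cache.writebacks)
```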

Writes to the data cache presented a potential performance problem, since write requests are generated by the WBK stage at the end of the CPU pipeline, and so in any given clock-cycle could clash with newer read requests generated by the AGU at the end of the ID2 stage. Since the cache is single-ported, this would cause the CPU pipeline to stall, even though on average the cache controller is capable of dealing with the full read/write bandwidth. Performance simulations showed that most collisions could be avoided by the addition of a special write-buffering pipeline stage, SBF, which is used to store blocked memory write requests from the WBK stage, and which is coupled with the ability to generate write requests from either the WBK stage if the SBF stage is empty and the data cache is not busy, or from the SBF stage if it is not empty and the data cache is not busy. This gives write requests more opportunities to find unused data cache request slots. Additional write buffers would further increase the opportunities for collision avoidance, but were found to give much smaller returns than the addition of the first one, so were not implemented in the demonstration version of iCORE.
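The arbitration between the WBK stage and the single SBF entry can be sketched as a per-cycle decision function. The function name and return convention are hypothetical, and two details are assumptions not stated in the article: a read always wins the cache port, and when the buffer drains in the same cycle that WBK produces a new write, the new write takes the freed buffer slot.

```python
def write_arbiter(sbf, wbk_write, cache_busy):
    """Decide one cycle of write traffic.

    sbf        -- buffered write waiting in the SBF stage, or None if empty
    wbk_write  -- write leaving the WBK stage this cycle, or None
    cache_busy -- True when a read occupies the single cache port

    Returns (issued_write, new_sbf, stall).
    """
    if not cache_busy:
        if sbf is not None:
            # Drain the older buffered write; any new write takes its slot.
            return sbf, wbk_write, False
        # Buffer empty: the WBK write goes straight to the cache.
        return wbk_write, None, False
    if wbk_write is None:
        return None, sbf, False        # nothing new to place
    if sbf is None:
        return None, wbk_write, False  # park the blocked write in SBF
    return None, sbf, True             # buffer full and port busy: stall


print(write_arbiter(None, "w1", True))   # w1 is buffered, no stall
print(write_arbiter("w1", "w2", False))  # w1 issues, w2 is buffered
print(write_arbiter("w1", "w2", True))   # no free slot: pipeline stalls
```

With a single buffer entry, a stall occurs only when two writes are blocked back to back, which is consistent with the diminishing returns reported for additional buffers.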

Finally, due to the high memory utilization of ST20-C2 programs, it was found very beneficial to add a store/load bypass mechanism, whereby data from queued memory writes in the EXE, WBK and SBF stages could be supplied directly to read requests from the same address in OF2, in replacement of stale data obtained from the data cache. This avoids having to stall memory reads from data cache locations that need to be updated with outstanding writes to the same location.

All data alignments and merges with data from the cache and write buffers are catered for, so the pipeline never needs to be stalled for this condition.
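The bypass and merge can be pictured as overlaying queued store bytes on top of the bytes read from the cache. The sketch below is a software analogy with hypothetical names, assuming byte-granular stores listed oldest first (for instance SBF, then WBK, then EXE) so that the youngest store wins where they overlap.

```python
def load_with_bypass(addr, size, cache, pending_stores):
    """Return `size` bytes at `addr`, merging cache data with queued stores.

    cache          -- dict mapping byte address to byte value (possibly stale)
    pending_stores -- list of (store_addr, store_bytes), oldest first
    """
    data = [cache.get(addr + i, 0) for i in range(size)]
    for store_addr, store_bytes in pending_stores:
        for offset, value in enumerate(store_bytes):
            pos = store_addr + offset - addr
            if 0 <= pos < size:
                data[pos] = value   # a younger store overrides older data
    return bytes(data)


stale = {0x100: 0x11, 0x101: 0x22, 0x102: 0x33, 0x103: 0x44}
queued = [(0x100, b"\xaa\xbb"), (0x101, b"\xcc")]
print(load_with_bypass(0x100, 4, stale, queued).hex())
```

Here the first two bytes come from queued stores, with the younger one-byte store replacing part of the older two-byte store, and the last two bytes come from the cache.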

Fast response support

Some of the more complex ST20-C2 instructions are designed to optimize operating system functions such as hardware process scheduling and inter-process communication, to give extremely fast real-time response. One of the iCORE project goals was to ensure that the new microarchitecture could support these instructions efficiently.

The first prototype version of iCORE was designed to demonstrate the execution speed of the regular instructions, so in the interests of expediency it did not support the complex instructions. However, for future versions of the CPU, a configurable solution was selected in which the regular instructions are implemented by hardware decoders and state machines, as in the first version, but the complex instructions are supported by a microcode engine and microinstruction ROM. This interfaces to the main instruction decoder in the ID2 stage, and supplies sequences of regular (non-complex) instructions to forward pipeline stages for execution of the complex instruction-driven algorithms, while stalling backward pipeline stages until the sequence is complete.

For systems that require full backward compatibility with the complex instructions but do not require their full-speed execution, the microcode ROM can be eliminated and replaced with a mechanism that, on encountering a complex instruction in an early pipeline stage, generates a very fast instruction-trap into a specially protected region of main memory that contains pre-loaded emulation code that implements the instruction's functionality. A minimal amount of hardware support is provided, for example, extra hardware registers and status bits for intermediate results and instruction cache lock-down lines for critical sections, to ensure that the instruction emulation runs at reasonable speed. A preliminary paper analysis showed that this approach provides satisfactory performance for the majority of iCORE's applications at very low cost.

The effects of most of the microarchitectural enhancements on IPC were first studied through the use of a C-based statistical performance model, which "executes" various benchmark program traces produced by an ST20-C2 instruction set simulator (ISS). They include basic instruction folding, the local workspace cache (LWC) and its alternative designs, enhanced instruction folding enabled by the use of the LWC, and branch prediction schemes and designs. The architecture of some of the features was fine-tuned during RTL implementation, as more accurate performance and area trade-off analyses were made possible.

Examples of some of the more noticeable implementation-level enhancements include modification of the branch predictor's state transition algorithm to handle the case where multiple predicted branch instructions reside in the same cache-word, and the addition of a dynamic LWC disable feature to minimize the time when it cannot be used during coherency updates. Implementation of the design was based on a synthesis strategy for the control and datapath logic. Only the RAMs used custom design methods.

The synthesis process was divided into two steps. The first step employed a bottom-up approach using Design Compiler in which all top-level blocks were synthesized using estimated constraints. Blocks were defined as logic components generally comprising a single pipeline stage. The execute unit of the processor included a large multiplier, for which Module Compiler was used as it was found to provide better results than Design Compiler. As the blocks were merged into larger modules, which were finally merged into the full processor, several incremental compilations of the design were run to fine-tune the performance across the different design boundaries.

The only regions of the design where synthesis produced unacceptable path delays were the OF2 and IF2 stages. In both cases the problems involved the use of complex multiplexer functions. Custom multiplexer gates were designed for these cases and logic was manually inserted with a "don't touch" attribute to achieve the desired results. Other complex multiplexer problems were resolved by re-writing the HDL to guide the synthesis tool towards the preferred implementation.

The second step of the synthesis strategy employed a top-down approach, with cell placement using Synopsys' Physical Compiler, which optimized gates and drivers based on actual cell placement. Physical Compiler proved to be effective in eliminating the design iterations often encountered in an ASIC flow due to the discrepancy between the gate load assumed by the wire-load model and that produced by the final routed netlist. The delays based on the placement came within 10% of those obtained after final place and route. Physical Compiler can be run with either an "RTL to placed gates" or a "gates to placed gates" flow. The "gates to placed gates" flow was found to provide better results for this design.

A higher performance standard cell library (H12) was developed for iCORE to improve circuit speed compared to a supplied generic standard library. Speed improvement of more than 20% was observed in simulation and on silicon when the H12 library was used.

The conventional CMOS static designs used in the generic library were used in the H12 library, but with the following differences:

Reduced P/N ratio
Larger logic gates (vs. buffered)
More efficient layout
Increased cell height

The P/N ratio for each cell was determined by the best speed obtained when the gate was driving a light load. It was observed that when the gate-load increased, the P/N ratio needed to be increased to get optimal speed. However, the physical-synthesis tool minimizes the fan-out and line loading on the most critical paths, so the H12 library was optimized for the typical output loading of those paths.

Delays were also reduced by creating larger, single level cells for high-drive gates. The original library added an extra level of buffering to its high-drive logic gates to keep their input capacitance low and their layout smaller, but at the expense of increasing their delay. Eliminating that buffer by increasing the cell size also increased the input capacitance, but still the net delay was significantly reduced. These cells were only used in the most critical paths, so their size did not have much impact on the overall layout size.

The H12 cells generally had larger transistors that required the layouts to be taller. However, by using more efficient layout techniques, many of the cell widths were reduced compared to the generic library. Also, the larger height of the cells allowed extra metal tracks to be used by the router, which in turn increased utilization of the cell placement.

Physical Compiler was used to generate the placement for the cells. The data cache and instruction cache locations were frozen during the placement process and the floor planning was data-flow driven. Clock lines were shielded to reduce delay uncertainty by using Wroute in Silicon Ensemble by Cadence. To minimize the power-drops inside the core, a dense Metal5/Metal6 power grid mesh was designed.

The balanced clock tree was generated with CTGen. The maximum simulated skew across the core was 120 picoseconds under the worst-case process and environmental conditions. Typically this would be significantly less. The final routed design was formally compared to the provided netlist using Formality. The delay of the Arcadia-extracted netlist was calculated using PrimeTime.

The iCORE processor was fabricated in STMicroelectronics' 0.18-micron HCMOS8D process. Excluding the memories, the entire chip is seen to have the distinctive random layout of synthesized logic. The large memories on the right and on the top and bottom of the plot are the data and instruction RAMs. The other small memories seen in the plot are the local workspace cache, the branch history table and branch target buffer.

Timing simulation indicated that a good balance of delays between the pipeline stages was achieved. Silicon testing showed functional performance of the design from 475 MHz at 1.7 V to 612 MHz at 2.2 V and 25 degrees Celsius, ambient.

Analysis of the Dhrystone 2.1 benchmark showed that iCORE achieved an IPC count of about 0.7, which met the goal of being the same or greater than that of previous ST20-C2 implementations, indicating that the various pipeline optimizations were functioning correctly.
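As a rough worked example, instruction throughput is the product of clock frequency and IPC. The short calculation below applies the reported IPC of about 0.7 to the two measured operating points. It assumes the Dhrystone IPC figure holds at both frequencies, which the article does not state explicitly, and the result counts native ST20-C2 instructions rather than Dhrystone MIPS.

```python
def native_mips(clock_mhz, ipc):
    """Millions of native instructions per second: clock (MHz) times IPC."""
    return clock_mhz * ipc


IPC = 0.7  # approximate Dhrystone 2.1 figure quoted above

for clock_mhz, volts in ((475, 1.7), (612, 2.2)):
    rate = native_mips(clock_mhz, IPC)
    print(f"{clock_mhz} MHz at {volts} V: about {rate:.0f} million instructions/s")
```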

Other contributors to this article include Razak Hossain, senior principal engineer, Tommy Zounes, senior principal engineer, Central R&D, and Julian Lewis, Architecture Manager, Digital Video Division, STMicroelectronics, Inc.

See related chart
