by Himanshu Sanghavi and Steven Leibson – Tensilica, Inc., Santa Clara, California USA
Abstract:
Because today’s System-On-Chip (SOC) designs contain millions of transistors, design engineers must treat power dissipation as an important design goal for IP blocks and not as just a data-sheet parameter. Early and frequent analysis of an IP core’s power dissipation will help keep power dissipation under control and on target. RTL power analysis is the necessary tool for early, frequent power analysis because it is fast, accurate, and can be used before the design is synthesized. This paper describes how engineers at Tensilica met aggressive power goals for a configurable and extensible microprocessor core design by using RTL power-analysis techniques to measure and reduce core power dissipation.
1. Introduction
Tensilica develops configurable and extensible RISC microprocessor cores for embedded SOC applications. These microprocessor cores are used in a variety of applications such as consumer electronics, networking, wireless communications, multimedia, and imaging. Power dissipation is a major consideration in the design of these systems, and thus is an important criterion for the selection of the microprocessor used in the design. The trend in SOC design is to use multiple microprocessor cores to distribute the processing load and to keep clock frequencies down. In fact, some Tensilica customers use tens of processor cores on a single integrated circuit (IC). It thus becomes very important to pay close attention to the power dissipated in a microprocessor core. With the above considerations in mind, Tensilica engineers set a goal of cutting the power dissipation of a preliminary processor design by 25%. It was clear that we would have to enhance our design flow by focusing on power dissipation as early as possible to meet this goal.
After reviewing our existing design flow, we concluded that monitoring power dissipation on a regular basis during the active design phase of the project would help us meet this aggressive power-dissipation goal. During this phase, you have the best opportunity to influence the design’s power-dissipation characteristics and make changes if the initial design does not meet the goals. The rest of this paper describes how we implemented an RTL-based power-analysis flow to measure and improve the effectiveness of clock gating in our design. This flow allowed us to meet our power-dissipation target and proved useful for further design-characterization efforts.
2. Reducing Power Through Clock Gating
The total power consumed by CMOS circuitry consists of two components: the static or leakage power and the dynamic or switching power. Leakage power is typically addressed through various circuit- and physical-design techniques beyond the scope of this paper. On the other hand, a design’s micro-architectural and structural characteristics significantly impact switching power, which can be effectively addressed at the RTL design level.
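For reference, the decomposition above is commonly written using the standard first-order CMOS switching-power model. The symbols below are conventional textbook notation, not quantities defined in this paper, and short-circuit power is ignored for simplicity:

```latex
P_{\text{total}} = P_{\text{static}} + P_{\text{dynamic}}, \qquad
P_{\text{dynamic}} \approx \alpha \, C_L \, V_{DD}^{2} \, f_{\text{clk}}
```

Here \alpha is the switching-activity factor of a node, C_L its load capacitance, V_{DD} the supply voltage, and f_{\text{clk}} the clock frequency. Under this model, RTL-level techniques act mainly on \alpha: stopping a clock drives the activity of the gated clock net and of the register clock pins it feeds toward zero.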
One of the most effective design techniques for reducing switching power is the use of clock gating, which stops the clock in parts of the design that need not be active during a particular clock cycle. Tensilica’s Xtensa microprocessor core employs a two-level clock-gating scheme (global and functional clock gating) as illustrated in Figure 1. Global clock gating halts most of the processor’s circuitry when a “wait for interrupt” instruction is executed, putting the processor into a sleep mode. An interrupt later awakens the processor from sleep mode. Only a minimal amount of processor logic, the logic needed to respond to interrupts, is clocked while the processor is in sleep mode. This feature is very useful for applications that require the device to be in stand-by mode for long time periods.
Figure 1: Tensilica’s Xtensa microprocessor core employs a 2-level clock gating scheme.
When the processor is actively executing instructions, the second level of clock gating comes into play. Functional clock gating dynamically turns off clocks to different modules within the processor core when these modules are not needed. Clocks to these modules are enabled and disabled automatically to save power whenever possible.
As part of our power-reduction effort, we wanted to measure and improve the effectiveness of our clock gating. By measuring and improving the effectiveness of global clock gating, we would ensure that most modules were indeed consuming a negligible amount of power when the processor was in sleep mode. By measuring and improving the effectiveness of functional clock gating, we would ensure that when the processor was executing instructions unrelated to a particular module, that module would consume negligible power.
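One simple way to quantify the effectiveness of functional clock gating is the fraction of cycles in which a module's clock is actually stopped. The following Python sketch is a toy model only: the module names, the 8-cycle trace, and the `GatedModule` class are hypothetical illustrations, not part of the Xtensa design or of any power-analysis tool.

```python
from dataclasses import dataclass


@dataclass
class GatedModule:
    """Toy model of a module whose clock is gated by an enable signal."""
    name: str
    clocked_cycles: int = 0
    total_cycles: int = 0

    def tick(self, enable: bool) -> None:
        # The gated clock reaches the module's flip-flops only when enabled.
        self.total_cycles += 1
        if enable:
            self.clocked_cycles += 1

    def gating_efficiency(self) -> float:
        """Fraction of cycles in which the module's clock was stopped."""
        if self.total_cycles == 0:
            return 0.0
        return 1.0 - self.clocked_cycles / self.total_cycles


def run_trace(trace, modules):
    """For each cycle, clock only the modules named in that cycle's set."""
    for needed in trace:
        for module in modules.values():
            module.tick(module.name in needed)


# Hypothetical 8-cycle trace: instruction fetch is needed every cycle,
# the load/store unit is needed in only two cycles.
modules = {name: GatedModule(name) for name in ("ifetch", "loadstore")}
trace = [{"ifetch"}] * 6 + [{"ifetch", "loadstore"}] * 2
run_trace(trace, modules)

for module in modules.values():
    print(f"{module.name}: clock stopped {module.gating_efficiency():.0%} of cycles")
```

In this model, a module that is rarely needed but shows a low gating efficiency would be a candidate for closer inspection.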
3. The Case for RTL Power Analysis
The power dissipated by a design is greatly influenced (amongst other things) by the standard-cell library used to synthesize the design and the particular instances of cells within that library that the synthesis tool uses to create the final gate-level net list. As a result, it is common practice in the industry to run power analysis on a design’s gate-level net list. While this may be the perfect check to run just before tapeout to ensure that the design is within its power budget, it is far from ideal as a monitoring tool during the design phase of the project because, for most IC design projects, gate-level net lists become available only late in the design cycle. By that time, it is difficult to make major changes in the design without significantly delaying the project schedule.
Consequently, there is growing interest in analyzing SOC power dissipation at the RTL level because of the many advantages of this approach. In contrast to gate-level net lists, the RTL test bench is generally up and running much earlier in the design cycle, before all the core functionality has been fully implemented and before the start of a long verification cycle. Preliminary power-analysis results can be very useful at this early stage and can be used to effect design changes.
Gate-level net lists are typically flat, so power-analysis numbers obtained on such net lists are hard to analyze. Specifically, it is difficult to start with a power-analysis report of a flattened design and quickly identify the modules in the design that exceed their power budget. RTL power-analysis tools generate reports that preserve design hierarchy from the full-chip level to the lowest-level leaf module, and even to major functional blocks within modules. Having this detailed breakdown of power dissipation across all modules in the design can be invaluable in identifying opportunities for saving power.
In the past, gate-level simulations were an integral part of every IC design flow because they were used to verify a circuit’s functional correctness. Today, formal equivalence checking between the RTL design and the gate-level net list has largely replaced gate-level simulations for functional verification. Designers prefer formal tools because they check for mathematical equivalence between the two designs, whereas gate-level simulations depend on the quality and quantity of test vectors used. Gate-level simulations are also susceptible to false failures from ‘X’-propagation issues and incorrect clock-skew modeling. These failures are hard to debug due, in part, to the flat nature of the gate-level net list and the fact that many of the RTL-level signals may not be preserved in the gate-level representation. Due to the availability of formal equivalence-checking tools and the difficulty of running gate simulations, many design teams have eliminated gate simulations from their design flow. It is thus desirable to have an RTL power analysis flow that is not dependent on running gate-level simulations.
4. RTL Power Analysis of an Xtensa Microprocessor Design
Because of the many advantages of RTL power analysis, we incorporated the technique into our design flow for a next-generation, configurable Xtensa microprocessor design. We started running weekly power regressions shortly after the RTL test environment became operational. We developed a few specific test programs to measure power and ran them every weekend on various Xtensa processor configurations. The analysis results were posted on our company intranet, where all team members could access them through a web browser and see the effects of their work on the design’s overall power dissipation. These reports provided a hierarchical breakdown of power dissipation across all of the design’s modules and sub-modules so that each team member could identify the results for their module. During the week, RTL designers would use the previous weekend’s run to guide their power-reduction efforts. The results of these changes could be seen in the next weekend’s run, thus providing continuous and specific direction to our power-reduction efforts.
We also plotted weekly power numbers on a graph, which provides two benefits. If a design change in a particular week significantly increased power dissipation, the graph visually highlighted the power spike. By comparing the hierarchical power dissipation reports of the previous two weeks, we could immediately identify the module responsible for the power surge. This allowed the appropriate team member to take remedial action quickly and effectively. A second benefit is that the graph serves as a quantitative measure and record of the overall power improvements made during the course of the project. Figure 2 shows a graph of the power dissipated by one particular Xtensa processor configuration, averaged over all the power analysis programs. The graph covers a period during which we actively worked to reduce power dissipation. Using RTL-level power simulation to monitor power dissipation, we met our goal of reducing overall power dissipation by 25% by week 15.
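The week-over-week comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each report is modeled as a dictionary mapping a hierarchical module path to power in milliwatts, and the function name, the 5% threshold, and all numbers are hypothetical rather than taken from our actual regression scripts.

```python
def find_power_regressions(previous, current, min_increase_pct=5.0):
    """Return (module, prev_mw, curr_mw, pct) for modules whose power grew."""
    regressions = []
    for module, curr_mw in current.items():
        prev_mw = previous.get(module)
        if prev_mw is None or prev_mw <= 0.0:
            continue  # new module or no usable baseline: nothing to compare
        pct = 100.0 * (curr_mw - prev_mw) / prev_mw
        if pct >= min_increase_pct:
            regressions.append((module, prev_mw, curr_mw, pct))
    # Largest absolute increase first.
    regressions.sort(key=lambda r: r[2] - r[1], reverse=True)
    return regressions


# Hypothetical reports for two consecutive weekend runs (power in mW).
week_7 = {"core": 40.0, "core.ifetch": 12.0, "core.loadstore": 9.0, "core.dsp": 10.0}
week_8 = {"core": 44.5, "core.ifetch": 12.5, "core.loadstore": 9.0, "core.dsp": 14.0}

for module, prev_mw, curr_mw, pct in find_power_regressions(week_7, week_8):
    print(f"{module}: {prev_mw:.1f} mW -> {curr_mw:.1f} mW (+{pct:.1f}%)")
```

Because the hierarchy is preserved in the keys, both the top-level increase and the sub-module that causes it appear in the same listing, which is what makes the responsible module easy to spot.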
Figure 2: Power Dissipation Improvement Over Time
5. RTL Power-Analysis Test Evolution
Initially, we used a few test programs from our verification suite to establish the power-analysis flow. We had an existing suite of diagnostic test programs for functional verification that ran in this simulation environment, which allowed us to run power regressions almost as soon as the test bench was complete. However, knowing that these existing tests were written primarily for functional verification, we started a parallel effort to develop additional test programs written specifically for power analysis.
The first power-analysis program we wrote put the processor in sleep mode during the entire simulation. In this mode, global clock gating should be in effect and thus the vast majority of the processor’s circuitry should be inactive. Our expectation was that the design would consume an order of magnitude less power under these conditions than it did in the active state.
We were surprised to discover that one of the smallest modules of the design, the module that implements the on-chip debug (OCD) logic, was one of the biggest power consumers during the sleep-mode test. Earlier power-optimization efforts had ignored the OCD module because it consumes a negligible percentage of the total power when the processor is active. Further, this module contains some circuitry that “wakes up” the processor from its sleep mode under the control of a debugger. This logic could thus not be powered down even in sleep mode. However, the RTL-level power analysis results, which can reach down to the level of individual flip-flops in the design, identified clock-gating opportunities in the OCD module that allowed us to quickly reduce its power dissipation by more than 50%. If you halve the power dissipation of enough modules that are dissipating “negligible” amounts of power, you can achieve substantial overall power savings. Thus this test program helped us improve the effectiveness of our global clock gating.
The next power-analysis program we developed causes the processor to execute a large number of NOP (no-operation) instructions. This program exercises parts of the processor such as the instruction-fetch and -decoding blocks but most of the processor’s execution modules are inactive. The power dissipated during NOP execution sets a lower bound for the processor’s active power dissipation. By comparing results of the sleep-mode test with those of the NOP test, we validated some design assumptions.
For example, the instruction-fetch unit of the processor exhibited a noticeable power increase during the NOP test (compared to the sleep-mode test), whereas the load/store unit exhibited a negligible increase. Most of the instruction-fetch logic is active during the NOP program execution, while most of the load/store logic is inactive because no loads or stores are executed. RTL-stage power analysis proved that the existing design of the load/store unit used all or most of the available opportunities for power savings. Thus the NOP test program helped us measure the effectiveness of functional clock gating as implemented in the load/store (and a few other) units.
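The comparison between the sleep-mode test and the NOP test can be expressed as a per-module difference. The sketch below is illustrative only: the module names, power values, and the 0.05 mW "negligible" threshold are hypothetical and were chosen simply to mirror the qualitative behavior described above.

```python
def compare_tests(sleep_mw, nop_mw, negligible_mw=0.05):
    """Classify each module by its power increase from sleep mode to NOP."""
    rows = []
    for module in sorted(nop_mw):
        delta = nop_mw[module] - sleep_mw.get(module, 0.0)
        if delta <= negligible_mw:
            verdict = "negligible increase"
        else:
            verdict = "noticeable increase"
        rows.append((module, delta, verdict))
    return rows


# Hypothetical per-module results (mW) from the two test programs.
sleep_mw = {"ifetch": 0.2, "loadstore": 0.1, "ocd": 0.3}
nop_mw = {"ifetch": 6.0, "loadstore": 0.12, "ocd": 0.3}

for module, delta, verdict in compare_tests(sleep_mw, nop_mw):
    print(f"{module}: +{delta:.2f} mW ({verdict})")
```

A module that is idle during the NOP program yet shows a noticeable increase would suggest that its functional clock gating is not fully effective.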
We wrote a few additional power-analysis programs to exercise different modules in the processor. Each of these programs toggles as much logic as possible in a particular module. For example, highly optimized, hand-written, FIR kernel assembly code is used to analyze power dissipation for the Xtensa processor’s DSP extensions. These power numbers set an upper bound on the power dissipated by a particular module. While these tests did not identify additional power saving opportunities, they gave us added confidence that the processor’s overall worst-case power dissipation meets specifications.
Towards the end of the RTL design stage of the project, the software group requested that a new instruction be added to the DSP extensions to the Xtensa microprocessor core. Because of a mistake in the implementation of this instruction, an extra set of pipeline registers and bypass multiplexers for a 160-bit wide, vector register file was accidentally added to the design. During subsequent RTL power-analysis simulations, the power dissipated by the FIR kernel running on this configuration jumped noticeably while the power dissipated by all the other configurations was unchanged. This test result prompted a review of recent changes made in the DSP extensions. The source of the problem was quickly identified and fixed within a couple of days.
6. Spotting Legacy Power Creep
The NOP diagnostic also uncovered a primitive cell—used by many modules in the design—that did not incorporate clock gating. This cell was initially designed for use in a few very narrow register banks and the designer omitted clock gating to save gates. However, over the course of the design, other designers used this primitive cell a lot more than originally anticipated.
Various instances of this primitive cell, which were spread through different modules, were the top power consumers identified by the NOP test program. Adding clock gating to this one primitive cell significantly reduced the power dissipated during the NOP diagnostic. This result again demonstrates that the ability to zoom down to the lowest hierarchical level in RTL-stage power-analysis reports helps the design team to quickly identify and fix design problems.
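A widely reused primitive cell is easy to overlook in a per-module view, because each instance is small. Grouping leaf-level results by cell type makes the aggregate visible. The following Python sketch assumes a hypothetical report format of (instance path, cell type, power in mW) tuples; the cell names and values are invented for illustration.

```python
from collections import defaultdict


def power_by_cell_type(instances):
    """Sum power over all instances of each primitive cell type."""
    totals = defaultdict(float)
    for _path, cell_type, mw in instances:
        totals[cell_type] += mw
    # Highest total power first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical leaf-level entries from a NOP-test power report.
instances = [
    ("core.ifetch.r0", "reg_nogate", 0.5),
    ("core.loadstore.r3", "reg_nogate", 0.5),
    ("core.dsp.r1", "reg_nogate", 0.5),
    ("core.ifetch.pc", "reg_gated", 0.25),
    ("core.dsp.acc", "reg_gated", 0.25),
]

for cell_type, total_mw in power_by_cell_type(instances):
    print(f"{cell_type}: {total_mw:.2f} mW total")
```

In this made-up data set, no single ungated instance stands out, but the cell type as a whole tops the list.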
7. Speed/Accuracy Simulation Tradeoffs
The RTL description of a design accurately specifies its micro-architectural and logic characteristics. However, behavioral RTL descriptions do not specify the structural and circuit characteristics of the design. Because the power dissipated by a design is affected by all the above parameters, RTL power analysis fundamentally involves making some assumptions about the design’s final circuit representation.
Power analysis on a placed-and-routed gate-level net list with back-annotation data provided by 3-D extraction tools produces the most accurate estimate of a design’s power dissipation. RTL power analysis provides a reasonably good estimate of this number (within about 30%, in our experience) that can be used effectively to make design choices that reduce power. In particular, we observed that the relative numbers (or the percentages of total power that RTL power analysis attributes to each module in the design) are very accurate. In the early design stages, this relative accuracy is sufficient to focus the power reduction efforts in the right direction.
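The distinction between absolute and relative accuracy can be made concrete with a small calculation. In the Python sketch below, all module names and power values are hypothetical and were chosen only so that the absolute RTL estimate is 30% high while the per-module shares stay close; they are not measurements from our design.

```python
def power_shares(report_mw):
    """Return each module's percentage share of the report's total power."""
    total = sum(report_mw.values())
    return {module: 100.0 * mw / total for module, mw in report_mw.items()}


def absolute_error_pct(estimate_mw, reference_mw):
    """Percentage error of the estimated total against the reference total."""
    est_total = sum(estimate_mw.values())
    ref_total = sum(reference_mw.values())
    return 100.0 * (est_total - ref_total) / ref_total


# Hypothetical numbers (mW): an RTL estimate and a post-layout reference.
rtl_mw = {"ifetch": 13.0, "loadstore": 6.5, "dsp": 6.5}
gate_mw = {"ifetch": 10.0, "loadstore": 5.2, "dsp": 4.8}

rtl_share = power_shares(rtl_mw)
gate_share = power_shares(gate_mw)

print(f"absolute error: {absolute_error_pct(rtl_mw, gate_mw):+.1f}%")
for module in rtl_mw:
    diff = rtl_share[module] - gate_share[module]
    print(f"{module}: RTL {rtl_share[module]:.1f}% vs gate {gate_share[module]:.1f}% "
          f"({diff:+.1f} points)")
```

Even with a sizable error in the total, the ranking of modules by share is the same in both reports, which is the property that matters when deciding where to spend power-reduction effort.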
When a higher accuracy level is desired, close correlation can be obtained between the RTL numbers and the post place-and-route numbers by tuning the RTL power analysis flow. This tuning involves setting up appropriate values (instead of using defaults) for various parameters related to process technology and the standard-cell library used by the design. Once tuned, RTL power analysis can provide accurate power numbers for a range of similar designs all targeted towards the same process technology and using the same standard-cell library.
We observed one accuracy deviation that is worth mentioning. RTL power analysis consistently overestimates the power dissipated by modules that undergo substantial logic optimization during the synthesis flow. This is particularly true of designs that, for example, rely on cross-module logic optimization or whose RTL description relies on behavioral retiming to optimize the number of flip-flops in the design. The accuracy deviation arises from the RTL power-analysis tool’s lack of awareness of the advanced logic optimization techniques used by synthesis tools.
8. Special Simulation Challenges for a Configurable IP Core
Most IC design teams work on one particular instance of a design at a time. However, Tensilica engineers deal with the special challenge of developing a configurable microprocessor core. The Xtensa microprocessor can be tailored for a particular application through a selection menu on a web-based GUI. For example, the designer can choose the number of general-purpose registers in a customized core; various cache-memory attributes; the number and priorities of interrupts; and support for specialized operations such as a floating-point execution unit, specialized DSP operations, and an entire vector DSP coprocessor called Vectra. As the designer configures a core, the GUI provides immediate feedback on the processor core’s estimated area, timing, and power dissipation through live bar graphs displayed by the GUI as shown in Figure 3. To provide accurate, real-time feedback, Tensilica extensively characterizes Xtensa microprocessor cores using numerous combinations of the various configuration options. Further, because the core is licensed as “soft” IP, it must be characterized for multiple semiconductor process technologies. This compute-intensive process generates a database of area, timing, and power numbers that drive the GUI estimator.
Figure 3: Live bar graphs on the Xtensa Processor Generator GUI provide real-time feedback about processor speed, area, and power dissipation at the bottom of the GUI page.
It takes weeks of simulation to generate estimation data for the Xtensa GUI. Any procedure that speeds this process is therefore very valuable. We have had a positive experience with RTL-based power-analysis tools and have already tuned our design flow to provide close correlation to post place-and-route numbers for Xtensa microprocessor designs. Because RTL power analysis is much faster than gate-level simulation, we intend to use RTL power analysis to generate the bulk of the estimation data for our next-generation configurable processor.
9. Conclusion
Today’s multimillion-transistor SOC designs create new challenges for engineers to effectively manage the amount of power being dissipated in their design. To meet aggressive power budgets, designers must focus on power consumption throughout the course of the design. RTL power analysis can help in this process and it offers numerous advantages over the traditional flow, which involves analyzing power on gate-level net lists. One significant advantage is that RTL power analysis can be started much earlier in the design cycle when it is still possible to reduce power dissipation through design changes. Another important advantage is the preservation of design hierarchy in the RTL power reports, which makes it easy to pinpoint opportunities for power savings.
We have demonstrated that RTL power analysis is an effective design methodology for monitoring and reducing a microprocessor core’s power dissipation. It is now an integral part of our design flow and has also helped us with our design-characterization efforts. We believe that this methodology has general applicability to the vast majority of today’s SOC designs.