Contributors to this article include Steve Carlson, Jack Erickson, Gurudev Sirsi, and Ashutosh Mauskar.
Designing for low power has always been challenging for engineers. Applications demand low power without compromising overall performance, yet most low-power design techniques affect performance attributes such as timing, area, testability and signal integrity.
Designers need to understand how low-power techniques affect performance attributes, and choose a set of techniques that are consistent with these attributes. This paper outlines the various techniques available for designing low-power chips and their impacts on these attributes.
The power equation
The average power dissipation (Pavg) in an integrated circuit has three basic contributing components:
Pavg = Pshort + Pleakage + Pdynamic
Pshort = Power from stacked P and N devices in a CMOS logic gate that are in the "on" state simultaneously. This happens briefly during switching. This type of power dissipation can be controlled by minimizing the transition times on nets. It usually accounts for 20% of the overall power.
Pleakage = Power dissipation due to spurious currents in the non-conducting state of the transistor.
Leakage current becomes a larger and larger problem as geometries shrink and threshold voltages drop. The leakage current in a 0.13µm process with a threshold voltage of 0.7V is about 10-20pA per transistor. In that same process, if the threshold voltage is lowered to 0.2-0.3V, leakage current skyrockets to 10-20nA per transistor. For a 10M-transistor chip, leakage power can thus increase from 0.15mW to 150mW due to the lower threshold.
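These figures compound quickly across a whole die: total static power is roughly N × I_leak × Vdd. A minimal sanity-check sketch (the 1.0V supply is an illustrative assumption; the text gives only per-transistor currents):

```python
# Back-of-the-envelope leakage power for a 10M-transistor chip.
# Vdd = 1.0 V is an illustrative assumption, not a figure from the text.
NUM_TRANSISTORS = 10_000_000
VDD = 1.0  # volts (assumed)

def leakage_power_watts(i_leak_amps_per_transistor):
    """Total static power = N * I_leak * Vdd."""
    return NUM_TRANSISTORS * i_leak_amps_per_transistor * VDD

high_vt = leakage_power_watts(15e-12)  # ~15 pA/transistor (Vt = 0.7 V)
low_vt = leakage_power_watts(15e-9)    # ~15 nA/transistor (Vt = 0.2-0.3 V)

print(f"high-Vt: {high_vt * 1e3:.2f} mW")  # 0.15 mW
print(f"low-Vt:  {low_vt * 1e3:.1f} mW")   # 150.0 mW
```

A three-decade drop in threshold leakage current translates directly into a three-decade rise in static power, which is why the lower threshold is so costly.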
Leakage current depends on Vdd (specifically, how close it is to the threshold voltage), the threshold voltage itself (Vt), the transistor size (W/L) and the temperature. Leakage power used to be only about 5% of total power for technologies at 0.18µm and above. As voltage scales down with technology, leakage has increased exponentially and has become a serious problem in nanometer technologies. Increasing die area also affects leakage power adversely, as more area means more transistors.
Pdynamic = Dynamic power dissipation, also called switching power. This is the dominant source of power dissipation in CMOS systems-on-chip (SoCs), typically accounting for about 75% of the total.
Pdynamic can be expressed in the form of the following equation:
Pdynamic = k × C × V² × f × p
k = Constant (usually varies from 0 to 1).
C = Overall capacitance that is to be charged and discharged. Technology scaling has resulted in smaller transistors and hence smaller transistor capacitances. But interconnect capacitance has not scaled with the process and has become the dominant component of capacitance. With technology scaling, designers pack more systems on a single chip. This has increased the number of interconnects and hence the overall power dissipated for newer-generation designs.
V = The supply voltage of the component. Voltage scaling has the biggest impact on power dissipation.
f = The switching frequency of the component. The clock network itself switches twice every clock cycle. The average frequencies of designs have increased.
p = Switching probability of signals. For example, in a microprocessor, signals inside instruction RAMs may switch every 2 cycles (p=0.5), while those inside data RAMs may switch only once in 4 cycles (p=0.25). Switching probabilities tend to increase as the need for bandwidth increases and designers resort to time-sharing techniques to save on area.
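Putting the terms together, dynamic power can be estimated directly from the equation above. A minimal sketch with illustrative parameter values (none of the numbers below come from the text):

```python
def dynamic_power(k, c_farads, vdd_volts, freq_hz, p):
    """P_dynamic = k * C * V^2 * f * p."""
    return k * c_farads * vdd_volts**2 * freq_hz * p

# Illustrative: 1 nF of total switched capacitance, 1.2 V supply,
# 500 MHz clock, 25% switching probability.
p_dyn = dynamic_power(k=1.0, c_farads=1e-9, vdd_volts=1.2, freq_hz=500e6, p=0.25)
print(f"{p_dyn * 1e3:.0f} mW")  # 180 mW

# The quadratic voltage dependence: dropping Vdd from 1.2 V to 0.9 V
# alone cuts dynamic power by a factor of (0.9/1.2)^2, i.e. ~44% saved.
p_low = dynamic_power(k=1.0, c_farads=1e-9, vdd_volts=0.9, freq_hz=500e6, p=0.25)
print(f"saving from voltage scaling alone: {1 - p_low / p_dyn:.1%}")
```

The second print illustrates why, as noted above, voltage scaling has the biggest impact on power dissipation: every other term enters the equation only linearly.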
Where power goes
Figure 1 shows how the various components of power have changed with process technologies. Some might find it surprising to see that the display is not the biggest power sink in a PC. For a representative laptop PC today, the power is distributed as follows:
Figure 1 — PC power distribution
With continuing advancements in display and disk technology, the CPU is likely to garner an even larger percentage of the power budget. This is especially true when the trends in CPU power consumption are examined. Each generation of processor gets bigger and more power hungry (Figure 2).¹
Figure 2 — Intel processor power trend
Influencing power in a chip
As processor power goes up with each subsequent generation, it is important to understand the contributing factors of this power before we can suggest a comprehensive power strategy. The breakdown of power inside a typical processor is shown in Figure 3.
Figure 3 — Power dissipation in a typical processor
Most of the power dissipation happens in the memories, I/Os and custom structures, decisions on which are made very early in the design process. The clock network also contributes a significant amount of power. Fundamentally, the earlier in the design process that power is addressed, the bigger the impact that can be made.
For power-sensitive applications, designers may choose to express low-power needs through design intent. Using more specialized hardware offers better energy efficiency than a general-purpose processor. The trade-off being made for this efficiency is diminished flexibility. The hierarchy of energy efficiency for different design approaches shows that dedicated hardware can be 1,000 times or more power efficient than a general-purpose programmable solution (Figure 4).
Figure 4 — Energy efficiency of various implementations
At higher levels of abstraction, there are more degrees of freedom for large changes to the design implementation (both in terms of design intent and constraints). New system and architecture designs can yield implementations that are 10 to 20 times or more power efficient. It is therefore critical to be power aware and able to optimize and analyze power early in the design process, and at high levels of abstraction.
The ability to influence power without diminishing application performance becomes tougher and tougher as the design moves through the stages of realization. However, even at the lower levels of design, where these methods may yield only fractions of a percent in savings, the results can still be very significant to product design teams (Figure 5).
Figure 5 — Power savings and abstraction
The techniques that influence power focus mainly on dynamic power and leakage. Dynamic power reduction techniques consider both voltage scaling and capacitance reduction. Voltage scaling is emphasized at the architectural level, while capacitance reduction is the primary focus of the implementation teams. Reducing switching probability is of lower priority and is treated only at the architectural level.
System and architectural level
Voltage scaling can result in significant power savings at the system and architecture levels, as dependence of power on voltage is quadratic. It is beneficial for power only up to a certain point for a given technology. But beyond this point (critical voltage — Vc), lowering Vdd can actually cause delays to increase and thus cause a reduction in performance.
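The trade-off can be sketched with the widely used alpha-power delay model, delay ∝ Vdd / (Vdd − Vt)^α, which is an assumption introduced here; the Vt and α values below are illustrative:

```python
# Voltage scaling: power drops quadratically, but delay grows rapidly as
# Vdd approaches Vt (alpha-power delay model; Vt and alpha are assumed).
VT, ALPHA, VDD_NOM = 0.35, 1.3, 1.2

def relative_power(vdd):
    return (vdd / VDD_NOM) ** 2  # P ~ V^2 at fixed C, f, p

def relative_delay(vdd):
    delay = lambda v: v / (v - VT) ** ALPHA
    return delay(vdd) / delay(VDD_NOM)

for vdd in (1.2, 1.0, 0.8, 0.6):
    print(f"Vdd={vdd:.1f}V  power x{relative_power(vdd):.2f}  "
          f"delay x{relative_delay(vdd):.2f}")
```

As Vdd falls toward Vt, the delay penalty grows much faster than the quadratic power saving, which is the critical-voltage effect described above.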
To compensate for the reduced performance, designers may choose different architectures, such as parallel and pipelined architectures. Parallel architectures duplicate functional blocks, increasing the number of components, but they allow each block to run at a lower rate (and hence a lower voltage), which can reduce switching power. The increase in the number of components also increases leakage power. The area may also increase, and with it the chance of additional timing violations.
Pipelined architectures provide parallelism differently: the same components are reused by dividing the overall functionality into stages and executing those stages concurrently. Pipelining offers a similar level of power savings to parallel architectures, with little area penalty.
But the drawback is the need for tight control over the stages, so that there are no race conditions (data passing through multiple stages without any checks) or hazards (spurious data transmitted). This control requires additional circuitry, which may diminish the area advantage and increase leakage power. These techniques may also be combined to achieve higher performance.
Another technique is to lower the effective capacitance through more efficient data representation, re-synchronization for glitch minimization, and resource sharing. Designers can choose among many efficient data representations, such as encoding and decoding data or applying 2's complement arithmetic for adder-subtractor functions. 2's complement works best when the range of numbers that undergo transitions during actual operation of the circuit is small.
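One well-known encoding of this kind is bus-invert coding (named here as an illustration; the article does not specify a particular code): when more than half the bus lines would toggle, the inverted word is driven instead, flagged on one extra line, so the number of transitions per transfer is bounded.

```python
def raw_toggles(words, width=8):
    """Bit toggles when driving the raw words on a bus."""
    mask, prev, toggles = (1 << width) - 1, 0, 0
    for d in words:
        toggles += bin((d ^ prev) & mask).count("1")
        prev = d
    return toggles

def bus_invert_toggles(words, width=8):
    """Bit toggles under bus-invert coding (data lines + one invert line)."""
    mask = (1 << width) - 1
    prev_bus, prev_inv, toggles = 0, 0, 0
    for d in words:
        if bin((d ^ prev_bus) & mask).count("1") > width // 2:
            bus, inv = d ^ mask, 1  # send the inverted word instead
        else:
            bus, inv = d, 0
        toggles += bin((bus ^ prev_bus) & mask).count("1") + (inv ^ prev_inv)
        prev_bus, prev_inv = bus, inv
    return toggles

data = [0x00, 0xFF, 0x00, 0xFE, 0x01]  # a toggle-heavy pattern
print(raw_toggles(data), bus_invert_toggles(data))  # 31 5
```

The decoder simply re-inverts the word whenever the invert line is high, so the cost of the saving is one extra wire plus the encode/decode logic.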
Since the gates have finite propagation delays, signal transitions can result in spurious output transitions. For example, in Figure 6, assume that the signals A and B are 0. Assume that the inverter has a delay of 1 and the OR gate has a delay of 2. In this scenario, output of the OR gate (O) will be 1 at time t=0.
Now, let A and B switch from 0 to 1 at time t=0. The final state of "O" should be 1. But "O" will switch to 0 at time t=2 and switch back to 1 at time t=3 due to the delay through the inverter. These two transitions waste switching power. Techniques to prevent this type of power loss amount to balancing logic across branches (so that delay mismatches do not arise) and avoiding long chains of logic gates in a path (minimizing the amount of switching).
Figure 6 — Spurious switching resulting from imbalance in logic structure
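The glitch can be reproduced with a tiny transport-delay simulation. The circuit below is the classic reconvergent structure O = OR(A, NOT(A)), used here as a stand-in since Figure 6 itself is not reproduced, with the delays from the text; the input edge polarity differs from the example above, but the mechanism and timing are the same:

```python
# Transport-delay model of a glitch from unbalanced reconvergent paths:
# O = OR(A, NOT(A)) with inverter delay 1 and OR delay 2 (an assumed
# circuit; the mechanism is one path arriving a cycle after the other).

def not_gate(a, delay=1):
    return lambda t: 1 - a(t - delay)

def or_gate(x, y, delay=2):
    return lambda t: 1 if (x(t - delay) or y(t - delay)) else 0

A = lambda t: 1 if t < 0 else 0  # A falls at t=0
O = or_gate(A, not_gate(A))

for t in range(-1, 5):
    print(f"t={t}: O={O(t)}")
# O is 1 in steady state, but glitches to 0 at t=2 and returns to 1 at
# t=3 -- two wasted transitions, matching the timing described above.
```

Balancing the two paths (e.g., a delay-1 buffer on the direct input) makes both OR inputs change simultaneously and removes the glitch.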
Use of power-hungry circuits may be minimized by reusing the same resource multiple times, reducing the overall capacitance. If a functional unit consumes a lot of power, the data can be time-multiplexed to a single functional unit; if a bus consumes power, multiple functional units can share the bus instead of duplicating the logic. Both cases increase switching activity and must therefore be analyzed to confirm a net power saving.
RTL and synthesis level
Techniques at the synthesis level also target minimizing switching power, but usually within one clock cycle. Using don't care conditions to optimize the logic can result in lower power. Common expressions can also be factored out to reduce the overall capacitance.
When synthesizing state machines, care must be given to the way state assignments are done, as bad state assignments can result in higher switching power by increasing the switching probabilities of the state bits. For example, in the following state machine with 4 states, implementation 2 may result in lower power.
Figure 7 — State machine implementations
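The effect of state assignment on switching can be checked directly. The sketch below compares a natural-binary assignment against a Gray-code assignment for a machine that simply cycles through its four states (the encodings are illustrative; Figure 7's exact assignments are not reproduced here):

```python
# Count state-register bit toggles for a 4-state machine cycling
# S0 -> S1 -> S2 -> S3 -> S0 under two candidate state encodings.
def toggles(encoding, num_transitions=1000):
    count = 0
    for i in range(num_transitions):
        cur, nxt = encoding[i % 4], encoding[(i + 1) % 4]
        count += bin(cur ^ nxt).count("1")  # Hamming distance per step
    return count

binary = [0b00, 0b01, 0b10, 0b11]  # implementation 1: natural binary
gray   = [0b00, 0b01, 0b11, 0b10]  # implementation 2: Gray code

print("binary:", toggles(binary))  # 1500 (two bits flip on 01->10, 11->00)
print("gray:  ", toggles(gray))    # 1000 (exactly one bit per transition)
```

For this access pattern the Gray assignment cuts state-bit switching by a third, which is exactly the kind of saving a power-aware state assignment targets.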
When the logic levels are deep, register retiming may be used to save power. A similar operation, pre-computation, may be used in case of functional units. These techniques spread the switching activity and capacitance to earlier clock cycles.
In many cases, data is loaded into registers infrequently, but the clock signal continues to switch every clock cycle, driving a large capacitive load while the registers evaluate to the same value. The clock may be shut off to these registers using a gating circuit, which prevents the clock from triggering the registers. Typically, clock gating can deliver power savings of about 30% compared to an ungated design. Figure 8 shows a typical clock gating circuit.
Figure 8 — Clock gating circuit
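A first-order sanity check on that saving: if a register bank only needs its clock in a fraction of cycles, gating removes the clock-pin switching in the rest. All numbers below are illustrative assumptions, not figures from the article:

```python
def clock_power(c_clk_farads, vdd, freq_hz, enable_fraction=1.0):
    """Clock power ~ C * V^2 * f. An ungated clock toggles every cycle
    (p = 1); gating scales power by the fraction of enabled cycles."""
    return c_clk_farads * vdd**2 * freq_hz * enable_fraction

# Assumed: 50 pF of clock-pin load, 1.2 V, 500 MHz, loads on 70% of cycles.
ungated = clock_power(50e-12, 1.2, 500e6)
gated = clock_power(50e-12, 1.2, 500e6, enable_fraction=0.7)
print(f"savings: {1 - gated / ungated:.0%}")  # 30%
```

With registers active 70% of the time, gating saves the remaining 30% of clock-pin power, in line with the typical figure cited above (the gating cell itself adds a small overhead not modeled here).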
Sometimes functional units can remain "on" (evaluating) even though the results are not used in the subsequent stages. These functional units may be switched "off" when their results are not needed. This is done by gating the input to that combinational logic when the output is not in use. Figure 9 shows an example of what is called "operand isolation:"
Figure 9 — Operand isolation
Notice that the enable signal ("en" in the example above), which controls the output of the datapath unit, has been pulled up front to gate the input of the datapath logic. When "en" is off, there will be no switching triggered across the datapath unit.
Choosing a different logic family can affect the overall power. For example, implementing a design using dynamic logic eliminates glitches and short-circuit power while reducing the overall parasitic capacitance. But it increases switching activity due to pre-charge cycles, and is not well supported by automated design techniques.
Some designers have resorted to pass-transistor logic to realize functions, which can reduce the overall load capacitance. But there may be speed problems when Vdd is reduced, as well as static power dissipation caused by partial turn-off of the devices. These alternative logic families also have reliability and testability problems.
Choosing optimal sizes for the gates can lower the overall power. The synthesis engines should be aware of the actual "wires" (topology and layers), so that the interconnect capacitances can be accurately computed to do size selection. Buffer removal and slew reduction can reduce the switching power. Pin swapping can help reduce both switching and leakage power.
Design partitioning and floorplanning can play an important part in the overall power reduction. Wires with high switching probabilities must be kept within a partition to reduce the capacitance. Placing the blocks in the right position can dramatically decrease the overall capacitance. Cell placement also affects the overall power dissipation, as placing cells far from each other increases the wire length, and thus increases the switching capacitance.
Accurate placement involves many iterations, as it is difficult to predict which wires contribute most of the capacitance. Silicon virtual prototyping (SVP) lets designers quickly analyze the design with real wires, cutting down the overall design time. Another advantage of SVP is that it reduces local congestion and local hot spots, which improves power efficiency. Designers can also use selective voltage scaling to reduce power on certain blocks (multiple supply voltages). This is useful when designing an SoC from various off-the-shelf IPs.
Designers can accurately analyze leakage power only at the implementation level. Special cells can be designed to control leakage by changing the effective width of the transistors. Designers can also use cells that have high-Vt transistors to reduce leakage power (multi-Vt optimization).
Usually, these cells replace normal cells to reduce the overall leakage power. But the replacement must be done in a controlled manner, as replacing every cell can degrade performance. For example, replacing all cells with high-Vt cells can reduce power by 20% while degrading performance by more than 2X.
The implementation also needs to consider chip temperature (thermal distribution), as higher temperatures cause leakage power to surge. SVP helps designers analyze and correct potential hot spots.
The clock network is usually the biggest network in a chip. The clock network touches all the registers and offers a huge capacitance. It also has the highest switching probability (p=1) as it switches twice in every cycle (2f). Hence the switching power dissipation due to clock signal is enormous, and can account for 30% to 50% of the overall power dissipation.
The clock network must therefore be constructed with great care. Techniques for building a good clock tree include minimizing the overall insertion delay by decreasing the number of levels, minimizing skew by balancing the tree, and reducing overall capacitance by selecting appropriate clock buffers.
In the layout view, the register partitions created by the clock-gating logic may not be localized: a gating instance may sit far from the registers it controls, next to registers controlled by different gating logic. Designers can analyze the placement of the registers, merge clock-gating instances that share the same control signal, and then clone gating logic for the newly grouped registers to minimize the overall capacitance on the clock tree.
Figure 10 — De-clone and clone of the clock gating logic
Another technique used to further save power is to shut down entire clock branches. Sets of clock-gating instances driven by the same clock are identified based on physical proximity, and gating logic is inserted at the root of the clock to save the power consumed by the clock-tree network feeding those instances. The gating signal for the new root-gating logic is the Boolean OR of the gating signals of the downstream clock-gating instances.
Figure 11 — Root gating
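A small behavioral sketch of the idea (signal names and activity rates below are assumptions): the root enable is the OR of the branch enables, so the shared tree segment toggles only when at least one branch actually needs the clock.

```python
import random

random.seed(0)
CYCLES, BRANCHES, ACTIVITY = 1000, 4, 0.2  # illustrative assumptions

# Per-cycle enable signal for each downstream clock-gating instance.
branch_enables = [[random.random() < ACTIVITY for _ in range(CYCLES)]
                  for _ in range(BRANCHES)]

# Root-gating enable = Boolean OR of the branch gating signals.
root_enable = [any(cycle) for cycle in zip(*branch_enables)]

active = sum(root_enable) / CYCLES
print(f"shared clock-tree segment toggles in {active:.0%} of cycles")
```

Without root gating, the segment feeding these four instances toggles every cycle; with it, the segment stays quiet in every cycle where all branches are idle (roughly 1 − (1 − 0.2)⁴ ≈ 59% active for these assumed rates).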
Even as the switching power is minimized, the implementation can result in local areas where there is a lot of switching activity (such as around the sense amp regions in RAMs). These areas can dissipate more power when all the components switch simultaneously.
This can even result in dynamic IR drop. To counter this effect, designers can place a number of decoupling capacitors which act as charge stores around this area. These cells supply the additional power needed during simultaneous switching. But addition of decoupling capacitors increases the overall leakage power. Hence the number of such cells added should be balanced against the overall power dissipation.
Short-circuit power is controlled by controlling the transition times of the signals. Usually, designers impose a limit on the transition time (for example, 10% of the clock cycle), and then optimize the design to meet this limit. This power dissipation cannot be controlled beyond a certain point.
Process and layout level
Process technology offers techniques to control leakage power. Leakage divides into two components: sub-threshold leakage and tunneling leakage through the gate. Sub-threshold leakage results from reverse-biased diode conduction and sub-threshold conduction of the transistor. It can be controlled by using an insulating substrate (silicon-on-insulator, or SOI), lengthening the channel, or using multiple gates.
SOI replaces the bulk silicon substrate with an insulator, largely eliminating junction leakage. Tunneling leakage results from spurious current through the gate, depending on the state of the transistor, and can be controlled by using high-K dielectrics or novel gate materials. A high-K dielectric allows a physically thicker gate insulator for the same capacitance, which reduces the tunneling current.
These techniques involve a number of complicated steps and can be very expensive. Actual power gain may be difficult to measure using conventional analysis tools. Designers usually rely on specialized tools to analyze such devices.
Effect of low-power techniques on timing closure
Since frequency is a major factor in determining power, high-speed designs are power hungry. Capacitance reduction intuitively implies smaller delay and hence higher speed, but voltage scaling works in the opposite direction. Even though low-power techniques aim to reduce power without degrading circuit performance, they inherently affect the timing closure of the design. In this section, we examine some of these effects.
System and architectural level
Parallel and pipelined architectures can help increase the frequency of operation of a design. But they will result in extra area due to duplicated blocks and control logic. These additional components can make the timing closure a challenge during later stages in the design.
Encoding the data lines burdens timing closure because of the additional encoder and decoder circuitry. 2's complement arithmetic itself is as fast as normal arithmetic, but sign computation can become the bottleneck for timing closure. Balancing logic across multiple paths and reducing logic levels aid timing closure, but placing and routing the circuit under such constraints becomes a challenge. Resource-sharing techniques have complex control structures that can become problems during timing closure.
RTL and synthesis level
Using don't care conditions to optimize a circuit will result in simple gates and hence can aid the timing closure. Factoring out common sub-expressions for power savings can result in high fan-out nets, which can complicate timing closure. Simplifying state assignments does not have a major impact on timing closure.
Register retiming can help meet the timing. Clock gating and operand isolation can make timing closure difficult because of the extra stages of logic they impose. Clock gating can also affect the skew as well as insertion delay in the clock network.
Choosing different logic families such as dynamic logic and pass-transistor logic does not have any major effects on timing closure. But automatic timing analysis is very rare for these families, and hence the designers have to rely on circuit simulators and customized tools. Placement and routing for these families can only be done with a lot of restrictions. Physical effects such as crosstalk, IR drop and EM can affect the functionality of such gates.
Choosing optimal sizes for gates and buffering wires at regular intervals help achieve timing closure. In fact, the primary objective of synthesis and optimization tools is to meet timing; power savings follow as a consequence.
Optimal design partitioning and floorplanning are keys to timing closure. Silicon virtual prototyping can help designers tremendously in these areas. The resulting designs are faster, and at the same time, well optimized for power.
Defining multiple supply voltages affects timing closure due to the additional cells (level shifters) and constraints regarding different voltage regions. Multi-Vt optimization is performed only on timing non-critical signals, as high-Vt cells tend to be slower than normal cells. Clock cloning and de-cloning help achieve timing closure and routability by cutting down total wire length (capacitance). Root gating can also have a minor effect on timing closure, because a number of downstream gating elements can be removed, reducing capacitance.
Process and layout level
Using high-K dielectric materials in gates to control power can create transistors that are slow, and can make the timing closure difficult. Hence, such transistors should only be used for non-critical signals. Using SOI does not have any impact on timing closure. However, transistor characteristics can change from cycle to cycle, which makes timing analysis very difficult. The reliability and controllability of the design can become issues when such techniques are used.
It is clear that there are power consequences at every stage of the design process, and that designers can employ many techniques to design low-power circuits. Further, bad choices at any stage of the process can negate gains made at earlier stages.
Looking beyond power, many design transformations that are made to reduce power can affect other attributes of silicon quality — timing, area, signal integrity — both positively and negatively. Complex techniques can also increase the design turnaround time and force the design to miss the market window. The end user application also plays a major role in the number of techniques that are applicable. Choosing the set of techniques for low-power design is truly a balancing act of all these considerations.
The benefits of adopting new tools and methodologies need to be quantified in the context of actual quality-of-silicon measurements. This highly complex issue calls for a holistic methodology spanning from architecture to the final sized and routed transistors. Automated optimization engines are needed at each successive stage of design implementation, accompanied by analysis capabilities that use the best available detailed information.
Power is truly the next frontier in design that needs to be conquered. A new generation of technologies and tools needs to be meshed to create a solution that fits seamlessly into the context of the existing speed and die size constraint-based solutions. Power requirements and power-based differentiation will likely supplant performance as the critical deciding factor in success or failure of both end electronics products and the tools that are used to design them.
1 Trends and challenges in scaling of CMOS devices, circuits and systems — Siva Narendra, Intel Labs, ICCAD 2003
Anand Krishnamoorthy has been with Cadence for the past 4 years. Currently, he is a Senior Product Marketing Manager for the Encounter Digital IC Design Platform. Prior to joining Cadence, Krishnamoorthy worked at HAL Computers as a Design Manager with a focus on high-end microprocessor design.