By Philippe Soulard, Yijun Xu from NXP Semiconductors

Abstract
Low power consumption is becoming a critical factor for System-on-a-Chip (SoC) designs. System-level power estimation for SoCs has gained importance with the increase of SoC design complexity. This paper presents a high-level power estimation methodology for processors in the context of digital SoCs. It relies on SystemC TLM (Transaction Level Modelling) models, including a cycle-accurate ISS (Instruction Set Simulator), for simulation performance, and on fast characterization from gate-level implementations for accuracy. The experiments show that, for both average power and power curves over time, the estimates stay within 5 % of gate-level results for cores and caches, while simulation runs 100 to 1300 times faster than at the gate level.
1. Introduction
In addition to speed and area, low power has long been a crucial design requirement for SoCs. Different power optimization techniques are applied [1] at different abstraction levels in the VLSI design flow. Power estimation techniques are used at each abstraction level to calculate power or energy dissipation with a certain accuracy, and thereby to gain confidence in the power consumption of a design and to evaluate the effects of power optimization.
At the highest abstraction level (the functional level), the current SoC design methodologies define the overall functions and determine the cost metrics, such as power consumption. The power-related design choices made at this level have the most significant impact on power saving. Power estimation techniques used at this level are mostly based on spreadsheet approaches. The drawbacks of those methods are outlined in [2]. A spreadsheet approach is very time consuming and error-prone when it has to cover all the operating scenarios of a very complex SoC, especially where power management techniques are applied. In addition, it cannot accurately estimate the impact of software on power consumption.
At the implementation level (RTL) and the circuit level (gate level), power estimation tools are already available from either commercial EDA vendors (e.g. Synopsys PrimePower or Sequence PowerTheatre) or in-house providers. These tools can estimate power consumption very accurately: gate-level power estimation can reach a 10% deviation from real silicon, and RTL power estimation a 15 to 20% deviation. However, simulation at these levels for an entire SoC is quite slow, and power estimation also comes quite late in the design cycle.
At the architectural level (the level between the functional level and the implementation level), a complete SoC is modelled in a high–level language such as C, C++, SystemC or Java. Based on this target architecture the intended application programs are developed. A lot of Electronic System Level (ESL) design methodologies are being developed to decrease the design productivity gap and to shorten time to market. However, there is not much power estimation tooling available. This leads to a lot of research activities with respect to system level power estimation.
The goal of the methodology described in this paper is to create a high-level power estimation flow that is:
 more accurate than a spreadsheet approach
 much faster than RTL/gate–level power estimation
We present a new methodology that provides power estimation at the early design stage, such that a designer can quickly consider different design alternatives.
The remainder of this paper contains the following sections:
 section 2 discusses related work and highlights our contributions
 section 3 presents our power estimation methodology and its flow; system–level modelling, power modelling and power characterization are also described
 section 4 presents the validation experimental results
 section 5 presents our conclusions and provides an overview of future work
2. Related work
In this section, system-level power estimation techniques for SoCs are discussed and our contributions are highlighted. The authors of [5] proposed a hybrid approach for core-based system-level power modelling. High-level models are used to speed up simulation and low-level core-based characterization is used to improve estimation accuracy. Our approach follows similar ideas. However, we use transactions based on standard SystemC TLM modelling instead of the instructions of each core used in [5]. We also take the power consumption of SoC-level clock trees, interconnect and I/O pads into account.
In [8], SystemC TLM based power estimation techniques have also been proposed. The authors developed a hierarchical organization of the transaction-level characterization data. The data used in the SystemC TLM models depends on the characteristics of each model in a system. In our approach, we explicitly distinguish eight component types and we take all the transactions for each component type into account. Each component type has different power characteristics that are incorporated into its SystemC TLM model.
The power modelling we use is state/mode–based, similar to the one described in [2][4] in combination with transactions. We embed a power model in an existing functional TLM model instead of writing its standalone power state machine as a separate power model. In [11], state–based power models of the individual components have been completely inferred from the datasheet information. However, the datasheet information of each component in a SoC is not always available. In this paper, we also present power characterization methods to derive average power values or energy values for power models. Based on gate–level/analog simulations, we create a power table for each type of component.
In [12] and [13], the instruction set is also characterized in order to obtain energy figures for each instruction, but this is done only through measurements on a board, which means very late in the design trajectory. In [12], no results are given for power curves over time. In [13], the results shown do not cover a very large dynamic range.
In [14], the power estimation results are good and characterization is also done from a gate-level description, but the characterization effort, which takes weeks for the processor for example, is far too large. We want to perform the characterization of all the blocks composing the system in less than a day.
3. Power methodology and flow
In order to obtain power estimation that is both faster than gate-level/RTL estimation and more accurate than static methods, we propose a power estimation methodology for SoCs at the architectural level. Based on power estimates at this level, designers can optimize the architecture of the SoC, take measures to reduce the energy used by the processor running the software on the SoC, or reduce the power consumed by certain hardware parts of the SoC.
Figure 1: Power estimation flow
3.1 Power estimation flow
The power estimation flow consists of the steps illustrated in Figure 1. We implemented this flow in a toolset called SLEEP. Our power model is generic, but the requirements for accuracy and characterization effort depend on the type of component being modelled and on how large its power contribution might be in the whole system. Our power model can therefore be seen as heterogeneous, even if the model itself is generic.
3.2 FSM Power model and parameters
The power modelling is based on a coarse-grain Finite State Machine (FSM) that is incorporated into an existing SystemC TLM/PVT functional model. The states of this FSM are related to the power modes of the component being modelled. Examples are active mode, sleep mode and idle mode, which determine the power consumption of a component. Per mode, it is possible to assign a leakage power dissipation, an average dynamic power dissipation and an energy dissipation per transaction. Between modes, a switch energy can also be given.
Figure 2: FSM power model example
The power consumption for each FSM power model takes the following parameters (as illustrated in Figure 2) into account:
 the set {S_{1}, ..., S_{N}} indicates the set of states in the FSM, N being the total number of states; T(S_{i}) represents the total time spent in state S_{i} over the whole simulation
 in each state S_{i}, the static power dissipation is indicated by L(S_{i}), corresponding mainly to leakage
 in each state S_{i}, the energy per transaction O_{j} is indicated by E(S_{i}, O_{j}); the total number of occurrences of O_{j} over the whole simulation is given by n(S_{i}, O_{j})
 the energy to switch from state S_{i} to state S_{k} is given by M(S_{i}, S_{k}); the total number of occurrences of such a switch over the whole simulation is given by n(S_{i}, S_{k})
 in each state S_{i}, the average power over all transactions can be given by P(S_{i}); it stands in for the E(S_{i}, O_{j}) values, averaged over state S_{i}, when transaction-based energy values are not available; P(S_{i}) is therefore frequency dependent
The total energy E_{tot} of each component can be obtained by summing all the possible contributions over time. It is formulated as the following equation:
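The equation is not reproduced in this version of the text. One form consistent with the parameters defined above (an assumption, not necessarily the authors' exact notation) is given below. For a given state, either the per-transaction term or the average-power term P(S_i) T(S_i) is expected to be non-zero, not both.

```latex
E_{tot} = \sum_{i=1}^{N} \Big[ L(S_i)\,T(S_i) + P(S_i)\,T(S_i)
          + \sum_{j} E(S_i, O_j)\,n(S_i, O_j) \Big]
          + \sum_{i=1}^{N} \sum_{k \neq i} M(S_i, S_k)\,n(S_i, S_k)
```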
From that total energy, we can derive the average power figure for a given time interval. The power curve over time uses the same formula, but in addition, each energy contribution is accurately located in time.
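As an illustration, the accumulation of the energy contributions of one component could be sketched as follows in C++. The type and function names (`StateStats`, `SwitchStats`, `totalEnergy`, `averagePower`) are hypothetical and are not part of SLEEP; the sketch also assumes that P(S_i) is used only for states that have no per-transaction energy values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-state record; field names are illustrative only.
struct StateStats {
    double leakage_w;        // L(S_i): static power, in watts
    double avg_dyn_power_w;  // P(S_i): average dynamic power, in watts
    double time_s;           // T(S_i): total time spent in the state, in seconds
    std::vector<double> tran_energy_j;      // E(S_i, O_j), in joules
    std::vector<unsigned long> tran_count;  // n(S_i, O_j)
};

// Hypothetical record for one kind of mode switch S_i -> S_k.
struct SwitchStats {
    double energy_j;      // M(S_i, S_k), in joules
    unsigned long count;  // n(S_i, S_k)
};

// Sum leakage, dynamic and switch energies over the whole simulation.
double totalEnergy(const std::vector<StateStats>& states,
                   const std::vector<SwitchStats>& switches) {
    double energy_j = 0.0;
    for (const StateStats& s : states) {
        assert(s.tran_energy_j.size() == s.tran_count.size());
        energy_j += s.leakage_w * s.time_s;
        if (s.tran_energy_j.empty()) {
            // Assumption: P(S_i) is the fallback when no E(S_i, O_j) exists.
            energy_j += s.avg_dyn_power_w * s.time_s;
        } else {
            for (std::size_t j = 0; j < s.tran_energy_j.size(); ++j)
                energy_j += s.tran_energy_j[j] * static_cast<double>(s.tran_count[j]);
        }
    }
    for (const SwitchStats& sw : switches)
        energy_j += sw.energy_j * static_cast<double>(sw.count);
    return energy_j;
}

// Average power over a time interval, derived from the total energy.
double averagePower(double total_energy_j, double interval_s) {
    assert(interval_s > 0.0);
    return total_energy_j / interval_s;
}
```

A power curve over time would use the same contributions, but binned by the simulation time at which each one occurs instead of summed into a single figure.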
3.3 Parameter characterization
Characterization is performed at the gate level, but the required accuracy depends on the type of component and on its importance in the system. There are strong requirements on the amount of time needed to perform the characterization. For a processor, the full characterization should not take more than a few days.
3.3.1 Hardware IP
Here we need an average level of power per mode, because the expected contribution of these blocks is quite low in comparison to cores, caches and memory blocks. We first distinguish 3 modes:
 low power mode (LOW)
 idle mode (IDLE)
 active mode (ACTIVE)
The LOW mode corresponds mainly to a leakage value. The IDLE mode adds a dynamic figure for the clock tree on top of leakage.
For the ACTIVE mode, we compute the mean value and the standard deviation of the distribution of average power obtained over a set of representative configurations. The value of the standard deviation is a good indication of the accuracy of our characterization.
Within SLEEP, we have written a tool to perform that process automatically.
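The ACTIVE-mode statistics described above could be computed along the following lines. This is a minimal sketch, not the SLEEP tool itself: the names `ActivePowerStats` and `characterizeActive` are hypothetical, and the population form of the standard deviation (division by the number of configurations) is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type for the ACTIVE-mode characterization of one IP block.
struct ActivePowerStats {
    double mean_w;    // mean of the average power over all configurations
    double stddev_w;  // spread of that distribution, used as an accuracy indicator
};

// Input: one average power figure (in watts) per representative configuration.
ActivePowerStats characterizeActive(const std::vector<double>& avg_power_w) {
    assert(!avg_power_w.empty());
    double sum = 0.0;
    for (double p : avg_power_w) sum += p;
    const double mean = sum / avg_power_w.size();

    double squared_dev = 0.0;
    for (double p : avg_power_w) squared_dev += (p - mean) * (p - mean);
    // Assumption: population standard deviation (divide by N, not N - 1).
    return ActivePowerStats{mean, std::sqrt(squared_dev / avg_power_w.size())};
}
```

A standard deviation that is small relative to the mean suggests that a single ACTIVE power value represents the block well; a large one suggests that the block may deserve a finer model.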
3.3.2 Processor
Here we also have a LOW mode, an IDLE mode and an ACTIVE mode. In ACTIVE mode, we want a power table with an accurate average energy dissipation for each instruction.
In order to achieve this goal, we adopted the following method:

 initial estimation of the energy E(nop) of a NOP (a NOP being an instruction doing nothing and taking one clock cycle), and also of the energy E(bch) of a branch instruction, using simple assembly programs
 initial estimation of the energy consumption of an instruction in the middle of NOPs
 instruction grouping, depending on homogeneity criteria
 computation of a correction factor for each group, based on the execution of relevant applications
The initial estimation of the energy of each instruction is obtained with the following method:
 for each other instruction, we create a small program Pinst repeating N times the following process:
→ write random values to some registers
→ perform a high number of NOPs
→ execute the instruction on those registers
→ perform a high number of NOPs
 we generate a program Pnop from Pinst by replacing the execution of the instruction by one or multiple NOPs, resulting in the same period length for the whole process
 we execute the programs Pinst and Pnop on a gate-level power estimator able to generate a power curve over time for both Pinst and Pnop (see the example in Figure 3)
 we calculate the integral of power between the middle of the first series of NOPs and the middle of the second series of NOPs, and we sum over the N loops to obtain two energies EPinst and EPnop
 we derive the energy E(inst) of each instruction by computing:
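The expression is not reproduced in this version of the text. A form that follows from the procedure above is given below as an assumption: c denotes the number of NOPs that replace the instruction in Pnop (a symbol introduced here for illustration), so that each of the N loops of Pinst differs from the matching loop of Pnop by E(inst) minus c times E(nop).

```latex
E(inst) = \frac{E_{Pinst} - E_{Pnop}}{N} + c \cdot E(nop)
```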
Figure 3: power curves for Padd and Pnop
Instruction grouping consists of creating G groups according to homogeneity criteria. So far we have used 2 kinds of criteria:
 homogeneity in energy
 homogeneity in functionality
Finally, we calculate correction factors with the following method:
 we run a relevant application on the ISS in the SystemC environment to extract a trace file and on the gate–level netlist to generate a power curve
 we compute a vector of G multiplication factors applied to energies of the G groups of instructions so that the power curve reconstructed with SLEEP from the trace file has a minimal difference with the power curve extracted from the gate–level simulation
 we apply each multiplication factor of a group on each of its instructions to obtain its final energy figure
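The fitting step above can be sketched as a least-squares problem. The text only states that the difference between the two power curves should be minimal; the squared-error criterion, the representation of the curves as time bins, and the names `fitCorrectionFactors` and `solveLinear` are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve the small dense system m * x = rhs by Gaussian elimination
// with partial pivoting (m and rhs are taken by value and modified).
std::vector<double> solveLinear(Matrix m, std::vector<double> rhs) {
    const std::size_t n = rhs.size();
    assert(m.size() == n);
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
        assert(m[pivot][col] != 0.0);  // groups must not be linearly dependent
        std::swap(m[col], m[pivot]);
        std::swap(rhs[col], rhs[pivot]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double k = m[r][col] / m[col][col];
            for (std::size_t c = col; c < n; ++c) m[r][c] -= k * m[col][c];
            rhs[r] -= k * rhs[col];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double acc = rhs[i];
        for (std::size_t c = i + 1; c < n; ++c) acc -= m[i][c] * x[c];
        x[i] = acc / m[i][i];
    }
    return x;
}

// group_power[t][g]: power in time bin t reconstructed from the trace file
//                    for instruction group g, using the initial energies.
// gate_power[t]:     power in time bin t from the gate-level simulation.
// Returns the G factors f minimizing sum_t (gate_power[t] - sum_g f[g] * group_power[t][g])^2.
std::vector<double> fitCorrectionFactors(const Matrix& group_power,
                                         const std::vector<double>& gate_power) {
    const std::size_t bins = group_power.size();
    assert(bins > 0 && gate_power.size() == bins);
    const std::size_t groups = group_power[0].size();
    Matrix ata(groups, std::vector<double>(groups, 0.0));  // A^T A
    std::vector<double> atb(groups, 0.0);                  // A^T b
    for (std::size_t t = 0; t < bins; ++t) {
        assert(group_power[t].size() == groups);
        for (std::size_t g = 0; g < groups; ++g) {
            atb[g] += group_power[t][g] * gate_power[t];
            for (std::size_t h = 0; h < groups; ++h)
                ata[g][h] += group_power[t][g] * group_power[t][h];
        }
    }
    return solveLinear(ata, atb);  // normal equations
}
```

Each returned factor would then be multiplied into the initial energy of every instruction of its group. Solving the normal equations directly is adequate for a handful of groups; a larger or poorly conditioned problem would call for a more robust solver.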
3.3.3 Cache
For cache accesses, we use the same kind of technique as for core instructions: we start from an initial value taken directly from the memory blocks composing the cache, and we apply correction factors for:
 instruction cache access energy
 data cache access energy
 instruction cache fetch energy
 data cache fetch energy
3.3.4 Memory
Memory blocks have a simple model, with 2 modes:
 low power (LOW)
 active mode (ACTIVE)
In LOW mode, we have only some leakage. In ACTIVE mode, we also have some clock dissipation, and an energy figure per memory access. We distinguish here read and write operations. We can get those values directly from the gate level memory models of memory blocks.
3.3.5 Network
We use one mode (ACTIVE). In this mode, we want to compute the average energy dissipation of a bit toggle on address or data bits. In order to do that, we run some application examples that exhibit communications on the network, and we measure:
 energy dissipation of a whole network: Etot
 energy dissipation of the clock: Eclk
 number n of bit toggles on address and data bits
The value Emean we look for is equal to:
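The expression is not reproduced in this version of the text. From the three measured quantities above, the natural reading (stated here as an assumption) is that the clock energy is removed from the total and the remainder is spread over the observed bit toggles:

```latex
E_{mean} = \frac{E_{tot} - E_{clk}}{n}
```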
3.3.6 Other components
For the I/O, we directly use the gate-level memory model of the I/O pad.
For the clock tree, we need the average power dissipation P as a function of frequency. We compute it through the estimation of the total capacitance C and the formula:
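The formula is not reproduced in this version of the text. The usual expression for the dynamic power of a net that charges and discharges the capacitance C once per clock period is shown below as an assumption; V (the supply voltage) and f (the clock frequency) are symbols introduced here.

```latex
P(f) = C \cdot V^{2} \cdot f
```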
3.4 SystemC instrumentation
We need to instrument the SystemC description according to our findings during characterization. We achieve that by means of a C++ class called power monitor. The API of this C++ class is the following:
 indicateMode : mode change
 indicateTran : operation executed
 indicateVect : network vector change
 indicateFreq : change of frequency
 indicateVolt : change of voltage
For blocks described with P parameters, we simply make use of indicateMode. For blocks described by E parameters, we make use of indicateTran (memories) and indicateVect (networks). For the power management unit, we use indicateMode for the block itself and we also use indicateFreq and indicateVolt for each frequency domain and for each voltage domain.
For processor core and cache accesses, this is automatically done through the generation of a trace file by the instruction set simulator. This requires, for each type of processor, a post–processing tool to translate the trace file into the event database.
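To make the role of the power monitor more concrete, a possible shape for such a class is sketched below in plain C++, without the SystemC headers. Only the five method names come from the text. Every signature, the `PowerEvent` record, the explicit time argument (a real SystemC model would more likely read the current simulation time itself) and the helper `modeEnergy` are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One entry of a hypothetical event database.
struct PowerEvent {
    enum Kind { Mode, Tran, Vect, Freq, Volt };
    Kind kind;
    double time_s;     // simulation time of the event, in seconds
    std::string name;  // mode or transaction name (empty for other kinds)
    double value;      // toggled bits, frequency in Hz, or voltage in V
};

// Hypothetical power monitor: it only records events, in time order.
class PowerMonitor {
public:
    void indicateMode(double t, const std::string& mode) { record(PowerEvent::Mode, t, mode, 0.0); }
    void indicateTran(double t, const std::string& tran) { record(PowerEvent::Tran, t, tran, 0.0); }
    void indicateVect(double t, unsigned toggled_bits)   { record(PowerEvent::Vect, t, "", toggled_bits); }
    void indicateFreq(double t, double hz)               { record(PowerEvent::Freq, t, "", hz); }
    void indicateVolt(double t, double volts)            { record(PowerEvent::Volt, t, "", volts); }
    const std::vector<PowerEvent>& events() const { return events_; }

private:
    void record(PowerEvent::Kind kind, double t, const std::string& name, double value) {
        assert(events_.empty() || t >= events_.back().time_s);
        events_.push_back(PowerEvent{kind, t, name, value});
    }
    std::vector<PowerEvent> events_;
};

// Energy of a block described with P parameters: the power of each mode
// (in watts) is integrated over the time spent in that mode.
double modeEnergy(const PowerMonitor& monitor,
                  const std::map<std::string, double>& mode_power_w,
                  double end_s) {
    double energy_j = 0.0;
    const PowerEvent* current = nullptr;
    for (const PowerEvent& e : monitor.events()) {
        if (e.kind != PowerEvent::Mode) continue;
        if (current) energy_j += mode_power_w.at(current->name) * (e.time_s - current->time_s);
        current = &e;
    }
    if (current) energy_j += mode_power_w.at(current->name) * (end_s - current->time_s);
    return energy_j;
}
```

Keeping the monitor a pure recorder separates instrumentation from evaluation, so the same event log could be evaluated against different power tables; whether SLEEP is organized this way is not stated in the text.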
4. Experimental results
In order to provide guarantees to system integrators, we validated each part of the system separately. We present here our results for each kind of block.
4.1 Validation for memory
For memories, the power model at the system level is identical to the power model at the gate level, in the sense that each read or write access is recorded. We simply checked that our tools indeed give the same accuracy. Results are within 10 % of the gate-level estimation.
4.2 Validation for network
We took as examples 2 kinds of network:
 AXI network (Advanced eXtensible Interface)
 AHB network (Advanced High–performance Bus)
For each network, we made our characterization and we ran different applications on the netlist, exhibiting some network communications. The values estimated by our approach and the values obtained through a gate–level estimation are within 10 %. Furthermore, power curves over time are very close to each other.
4.3 Validation for hardware IP
We just need to check that the order of magnitude of the power estimation is correct, since those components will not represent an important power contribution. We used here as examples:
 a memory controller for AXI
 an interrupt controller for AHB
We could observe that the power estimation with our method gives a maximum deviation of 30 %.
4.4 Validation for core and caches
For core and caches, we need much more accuracy. We conducted experiments on 2 subsystems:
 experiment 1 is based on an ARM1176 (see [16]) and an AXI bus
 experiment 2 is based on a TriMedia TM3271 (see [17]) and an AHB bus
Each subsystem is modelled in SystemC TLM for performance analysis.
The computation of the initial characterization for each experiment took 10 hours, following our method. In those experiments, the industry standard Dhrystone 2.2 is used to obtain the corrected power table for the processor.
For each experiment, we ran applications on:
 the SystemC virtual platform (where a cycle accurate model is used)
 its corresponding implementation (where a gate–level netlist of the processor is used)
In experiment 1, we used MP3 and MP4 decoding applications; results are shown in Figure 4. Here we obtained a speedup of 1300.
In experiment 2, we used MPEG2DVS and JPEG decoding applications; results are shown in Figure 5. Here we obtained a speedup of 100.
Power estimation results for both experiments are summarized in table 2.
We used the same frequency for the gate-level simulation and for SLEEP. We observe an excellent correlation (within 5 %) between the SystemC power estimation and the gate-level power estimation, for both average power and power curve over time.
exp  software  gate-level (mW)  SLEEP (mW)  Δ (%)
1    MP3       3.16             3.23        +2.1
1    MP4       2.62             2.51        -4.2
2    MP2DVS    256.0            251.0       -1.9
2    JPEG      72.3             70.1        -3.0
Table 2: Power estimation results
Figure 4: Power curves for ARM1176 core and caches
Figure 5: Power curves for TM3271 core and caches
5. Conclusions and future work
We have developed a system-level methodology and flow for digital SoC power estimation. We have shown how power models can be built into existing SystemC TLM models, reusing our existing SystemC TLM design methodology. Using SystemC design methodologies, simulation performance can be significantly increased. We have also shown that existing low-level implementations of components can be used to quickly characterize power values in order to increase the accuracy of power estimation.
The validation experiments show that for both average power estimation and power curve estimation, an excellent accuracy compared to the gate level power estimation has been reached.
In addition, since we already include voltage and frequency dependencies in our flow, we can now study the impact of voltage and frequency scaling at the SystemC level. We are also studying the impact of memory mapping options on power consumption. Our environment, for both characterization and the SystemC flow, therefore opens many opportunities for performing design space exploration for power with confidence.
References
[1] D. Soudris, Ch. Piguet and C. Goutis, “Designing CMOS Circuits for Low Power”, Kluwer Academic Publisher, 2002
[2] R.A. Bergamaschi, Y.W. Jiang, “State–Based Power Analysis for Systems–on–Chip”, DAC2003, June 2–6, 2003, Anaheim, California, USA, pp 638–641
[3] Th. Grötker, S. Liao, G. Martin, S. Swan, “System Design with SystemC”, Kluwer Academic Publishers, 2002
[4] L. Benini, R. Hodgson and P. Siegel, “System–level Power Estimation And Optimization”, ISLPED 98, August 10–12, 1998, Monterey, CA, USA, pp. 173–178
[5] T.D. Givargis, F. Vahid, J. Henkel, “A hybrid approach for core–based system–level power modelling”, Proceedings of the Asia South Pacific Design Automation Conference, January 2000, pp. 141–145
[6] T.D. Givargis, F. Vahid, J. Henkel, “Trace–driven System–level Power Evaluation of System–on–a–chip Peripheral Cores”, Proceedings of the 2001 conference on Asia South Pacific design automation, pp. 306–311, 2001
[7] C. Talarico, J.W. Rozenblit, V. Malhotra, A. Stritter, “A new framework for power estimation of embedded systems”, Computer Volume 38, Issue 2, Feb. 2005 Page(s): 71–78
[8] N. Dhanwada, I.C. Lin, V. Narayanan, “A Power Estimation Methodology for SystemC Transaction Level Models”, CODES+ISSS’05, Sept. 19–21, 2005, Jersey City, USA
[9] J.F. Edmondson et al., “Internal Organization of the Alpha 21164, a 300 MHz 64-bit Quad-issue CMOS RISC Microprocessor”, Digital Technical Journal, Vol. 7, No 1, 1995, pp. 119–135
[10] N. Jouppi et al., “A 300 MHz 115w 32 bit Bipolar ECL microprocessor”, in IEEE Journal of Solid State Circuits, Nov. 1993, pp. 1152–1165
[11] T. Šimunić, L. Benini and G. De Micheli, “Cycle–Accurate Simulation of Energy Consumption in Embedded Systems”, pp.867–872, DAC 99, New Orleans, Louisiana
[12] V. Tiwari, S. Malik and A. Wolfe, “Instruction Level Power Analysis and Optimization of Software”, Journal of VLSI Signal Processing, No 13, pp. 223–233, 1996
[13] H. Shafi et al, “Design and validation of a performance and power simulator for PowerPC systems”, IBM Journal Research and Development, Vol 47, No 5/6, September–November 2003
[14] S. Abrar, “Cycle–Accurate Model and Source–Independent Characterization Methodology for Embedded Processors”, 17th International Conference on VLSI Design, 2004
[15] D. Elleouet, N. Julien, D. Houzet, “A high level SoC power estimation based on IP modeling”, 20th IPDPS, 2006
[16] ARM1176 processor documentation, http://www.arm.com
[17] TM3271 processor documentation, http://www.nxp.com