Power Optimization in Image Superscalar IP
By Akhilesh Mahaja, Naveen Tiwari, Raghuram P.
ASIC Design Group, DS India Labs
Samsung India Software Operations
As process geometries have scaled, design teams have used more and more of the additional silicon real estate available on chips to integrate embedded memories that serve as scratch-pads, FIFOs and caches to store data for the computational cores. These embedded memories allow for significantly better system performance and lower power compared to a solution where off-chip memories are used. As a result, most current designs have over 50% of their area used by embedded memories and these memories account for 50-70% of the total SoC power dissipation. Clearly, any attempt to reduce SoC power is incomplete if it does not attempt to reduce the power consumed by the embedded memories in the design.
Most embedded memory is single-port SRAM, which provides single-clock-cycle read and write operation. For higher-throughput applications, however, single-port RAMs are replaced by dual-port RAMs, which results in larger area and higher power consumption. In addition, IPs are designed for wide image sizes (Full HD), whereas in practice mostly regular-size (QVGA or HD) image applications are used. This leads to under-utilization of the embedded memories, as there is no way to switch off the unused segments of the memory.
In this paper we present an optimized, power-aware architecture named Cluster Memory Architecture. The architecture is implemented in the design and development of a 60 fps Superscalar IP, which can convert 60 VGA frames per second to Full HD frames. It consumes similar or lower power than a single-port SRAM of the same size while delivering performance similar to that of a dual-port RAM. It also makes it possible to switch off the unused segments of the memory.
For designers charged with integrating a wide range of functions into system-on-chip (SoC) solutions, the cost and complexity of incorporating digital logic, processing, memory, and analog functions often prove prohibitive. Designers need a memory architecture capable of delivering bandwidth well beyond traditional solutions in order to fully exploit the escalating capabilities of today's high-performance processor cores.
On the other hand, designers need a memory architecture that offers maximum flexibility to meet the needs of a wide variety of system architectures, including those that incorporate multiple processors. Designers also need a memory architecture that reduces the complexity of their design and, in the process, drives down cost and shortens the design cycle.
Cluster Memory Architecture presents a solution to this problem. It is discussed in a later section of this paper.
The advent of smaller geometries has made it possible and practical to integrate more functionality onto a semiconductor chip. Developers look to incorporate features that will distinguish their products from their competitors', and with these features comes a growing need for embedded memory. Bringing memory onto the ASIC often lowers cost and power consumption, improves performance, and increases the reliability of the system on a chip (SoC). Many of today's chips demand more embedded memory than ever before. Large amounts of SRAM, ROM, EPROM, multi-port RAM and DRAM are finding their way on board. For example, in high-performance microprocessors, 30 to 50 percent of the premium space and 80 percent of the transistors are allocated to memory alone.
Most memories embedded in SoCs are static RAMs or register files. The key sources of power consumption in such memories are:
- Dynamic or switching power, dissipated when read or write operations are performed.
- Static or leakage power, dissipated by the logic in the periphery and the core memory array whenever the memory is powered on.
The dynamic power consumed by a memory when a read or write operation occurs can be broken up into the power consumed by:
- Toggling of the clock network
- Peripheral logic to decode the address
- Bit-lines in the memory array
- Core memory cells changing state
Leakage power is becoming a more significant component of total memory power at process nodes of 65 nm and below, where it can account for 40-50% of the total memory power.
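As a rough first-order model, the two power components listed above can be written with the standard CMOS power relations. These relations are textbook approximations and are not derived from the measurements in this paper. The symbols are introduced here for illustration only: \alpha is the switching activity, C the switched capacitance, V_{DD} the supply voltage, f the clock frequency, and I_{leak} the leakage current.

```latex
\begin{align}
P_{total} &= P_{dyn} + P_{leak} \\
P_{dyn}   &\approx \alpha \, C \, V_{DD}^{2} \, f \\
P_{leak}  &\approx V_{DD} \, I_{leak}
\end{align}
```

Under this model, stopping the clock of an idle block drives its \alpha toward zero and removes the dynamic term, while the leakage term remains for as long as the block is powered on.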
Overall, the power consumption of memories accounts for 50-70% of the total SoC power dissipation.
Clearly, any attempt to reduce SoC power is incomplete if it does not attempt to reduce the power consumed by the embedded memories in the design. Design success truly depends upon efficient and optimized memory design.
In our work we have tried to reduce both dynamic and leakage power consumption by using clock gating logic coupled with an efficient memory architecture.
SINGLE PORT VS DUAL PORT MEMORY
Design engineers face a dilemma in choosing between single-port and dual-port RAMs. Single-port RAMs offer a low-cost, low-complexity solution. Designers would like single-port RAMs to serve both read and write operations at full clock speed. This approach requires either single-port memories with access times equal to one half of a bus cycle time, or that each processor access the RAM only on every other cycle. These requirements are often limiting because of the speeds of available RAMs and the difficulty of interleaving transactions between processors.
By using dual-ported RAMs, the efficiency of memory accesses can essentially be doubled. This inherent performance property proves to be one of the largest benefits of using dual-port RAMs.
Nowadays, however, shared-memory solutions are frequently used as an alternative to dual-port RAMs. Finding a single shared-memory solution capable of meeting all these criteria has not been a simple task. One popular option is a traditional muxed SRAM. Built around a standard, off-the-shelf memory manufactured in high volume, a muxed SRAM offers a very attractive cost structure. However, this advantage is often deceptive: a standard muxed SRAM costs less than a specialized dual-port memory on a per-bit basis, but that advantage loses its shine in the long term.
Cluster Memory Architecture tries to find a middle path between single-port and dual-port RAMs. Using it, we obtain the benefits of both.
RTL CLOCK GATING
In the traditional synchronous design style used for most HDL and synthesis-based designs, the system clock is connected to the clock pin of every flip-flop in the design. This results in three major components of power consumption:
- Power consumed by combinatorial logic whose values change on each clock edge.
- Power consumed by the flip-flops.
- Power consumed by the clock buffer tree in the design.
Power consumption due to combinatorial logic was by far the smallest contributor to the total power consumption. On the other hand, RTL clock gating had the potential of reducing both the power consumed by flip flops and the power consumed by the clock distribution network.
RTL clock gating works by identifying groups of flip-flops that share a common enable term. Traditional methodologies use this enable term to control the select of a multiplexer connected to the D port of the flip-flop, or to control the clock enable pin on a flip-flop with clock enable capability. RTL clock gating instead uses this enable term to control a clock gating circuit connected to the clock ports of all the flip-flops that share the enable term. Therefore, if a bank of flip-flops sharing a common enable term has RTL clock gating implemented, the flip-flops consume zero dynamic power as long as the enable term is false.
To achieve zero dynamic power for unused memory, the unused memory clusters are switched off using clock gating techniques. This results in significant power savings for the SoC.
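The effect of gating on clock activity can be illustrated with a small behavioural sketch. This is a toy activity count, not a power estimate and not the RTL used in the IP; the function name `clock_events` and its parameters are hypothetical and chosen only for this example.

```python
def clock_events(enables, num_flops, gated):
    """Count clock edges delivered to flip-flop clock pins over a run.

    enables:   one boolean per clock cycle (the shared enable term).
    num_flops: number of flip-flops in the bank sharing that enable.
    gated=False models the mux / clock-enable style: every flip-flop
                sees the clock on every cycle, enabled or not.
    gated=True  models RTL clock gating: the clock reaches the bank
                only in cycles where the enable term is true.
    """
    if gated:
        active_cycles = sum(1 for en in enables if en)
    else:
        active_cycles = len(enables)
    return active_cycles * num_flops


# Example: an 8-flop bank that is enabled in 2 of 4 cycles.
pattern = [True, False, False, True]
ungated = clock_events(pattern, 8, gated=False)
gated = clock_events(pattern, 8, gated=True)
```

In this sketch a bank whose enable term is never true receives no clock edges at all when gated, which corresponds to the zero-dynamic-power case described above.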
CLUSTER MEMORY ARCHITECTURE
In the Superscalar IP, processing happens row by row. Both the up-sampler and down-sampler blocks take a 5x5 pixel window as input data, as shown in Figure 1. To obtain the 5x5 pixel data, we need to store 5 consecutive pixel rows as intermediate lines. The maximum row width is fixed at 3840 pixels, so we need a total of 3840 x 5 = 19200 pixels of memory. A single memory of width 3840 can be used to store one row of data, but this may stall one module or the other, since each of them uses the memory for either reading or writing. Dual-port SRAM can be used to avoid this problem, but it increases the area. To avoid address conflicts, dual-port memory and system stalls, the Superscalar IP uses the cluster memory architecture shown in Figure 2. As shown in the figure, a single row memory is divided into 8 memory clusters: 7 clusters of size 512 and a last cluster of size 256, which together sum to 3840 (7 x 512 + 256 = 3840).
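The cluster arithmetic above can be checked with a short sketch. The cluster sizes and the 5-line depth come from the text; the linear column-to-cluster mapping in `locate` is an assumption made for illustration, since the paper does not spell out the address decoding.

```python
# Cluster sizes as described in the text: seven of 512 pixels, one of 256.
CLUSTER_SIZES = [512] * 7 + [256]
ROW_WIDTH = sum(CLUSTER_SIZES)    # pixels in one row memory
LINES = 5                         # rows buffered for the 5x5 window
TOTAL_PIXELS = ROW_WIDTH * LINES  # total line-buffer capacity in pixels


def locate(pixel_x):
    """Map a 0-based pixel column to (cluster index, offset in cluster).

    Assumes columns are laid out linearly across clusters 0..7.
    """
    if not 0 <= pixel_x < ROW_WIDTH:
        raise ValueError("pixel column out of range")
    base = 0
    for index, size in enumerate(CLUSTER_SIZES):
        if pixel_x < base + size:
            return index, pixel_x - base
        base += size
```

With this layout, column 512 is the first pixel of cluster 1 and column 3584 is the first pixel of the final 256-pixel cluster.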
In the cluster memory architecture, only the authorized module has either read or write control of a cluster at any moment. Once the authorized module has written into a cluster, read control of that cluster is passed to the other module. In this way, reads and writes can happen simultaneously without using dual-port memories. The complete read and write operation is shown in Figure 3.
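The hand-off of control between the writing and reading modules can be sketched as a simple schedule. The one-step pipeline below (the reader follows the writer by exactly one cluster) is an assumed timing chosen for illustration; the actual sequencing is the one shown in Figure 3, and the names `schedule` and `has_conflict` are hypothetical.

```python
NUM_CLUSTERS = 8


def schedule(num_clusters=NUM_CLUSTERS):
    """Return, per time step, (cluster being written, cluster being read).

    Assumed pipeline: the writer fills cluster k during step k and then
    hands control over, so the reader drains cluster k during step k + 1.
    None means that module is idle in that step.
    """
    steps = []
    for t in range(num_clusters + 1):
        writing = t if t < num_clusters else None
        reading = t - 1 if t >= 1 else None
        steps.append((writing, reading))
    return steps


def has_conflict(steps):
    """True if any step puts both modules on the same single-port cluster."""
    return any(w is not None and w == r for w, r in steps)


steps = schedule()
```

Because the two modules never address the same cluster in the same step, each cluster only ever needs a single port, even though a read and a write are in flight at the same time.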
Chips are designed for wide image sizes (Full HD), whereas in practice mostly regular-size (QVGA or HD) image applications are used. This leads to under-utilization of the embedded memories, as there is no way to switch off the unused segments of the memory. Dividing the memory into clusters makes it possible to switch off the power supply to unused memory clusters. RTL power gating techniques have been used to implement this functionality. This results in significant power savings, as the switched-off memory clusters consume zero dynamic and static power.
The Superscalar IP has two cluster controllers to control the internal buffer memories. The major tasks of a cluster controller are:
- Activate read/write control for any cluster.
- Switch off cluster(s) using clock gating if the image size is smaller than the maximum limit.
- Change the line status after every processed row, as shown in Table 1.
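The second task in the list above, deciding which clusters can be switched off for a given image width, can be sketched as follows. The sketch assumes that a row fills the clusters in order starting from cluster 0; the function names `clusters_needed` and `gate_mask` are hypothetical and do not come from the IP.

```python
# Cluster sizes as described in the text: seven of 512 pixels, one of 256.
CLUSTER_SIZES = [512] * 7 + [256]


def clusters_needed(image_width):
    """Count the clusters, filled in order, required to hold one row."""
    if not 0 < image_width <= sum(CLUSTER_SIZES):
        raise ValueError("image width outside supported range")
    covered = 0
    for count, size in enumerate(CLUSTER_SIZES, start=1):
        covered += size
        if covered >= image_width:
            return count


def gate_mask(image_width):
    """Per-cluster enable flags: True = kept active, False = switched off."""
    needed = clusters_needed(image_width)
    return [index < needed for index in range(len(CLUSTER_SIZES))]


# Example: a 1920-pixel-wide row keeps the first four clusters active.
mask_1920 = gate_mask(1920)
```

Under this assumption a 320-pixel-wide (QVGA) row needs only one of the eight clusters per line, so the remaining seven are candidates for being switched off.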
Single-port memory architectures in systems-on-chip represent a significant performance bottleneck. Dual-port memories are a common solution to this problem because they allow accesses to be parallelized. However, they are not an area- and power-optimized solution. We propose a cluster memory architecture that can be used as a substitute for dual-port memories. Experiments on a set of parallel benchmarks show power savings of about 40% with respect to a dual-port memory architecture, at a very limited area penalty.
Memory Power Reduction in SoC Designs, by Anmol Mathur
Power Reduction through RTL Clock Gating, by Frank Emnett and Mark Biegel
Developing a Design Methodology for Embedded Memories, by Broadcom
Our sincere thanks to our colleagues and teammates at DS India Labs for their constant support and cooperation.
Figure 1 5x5 Pixel Map
Figure 2 Cluster Memory Organizations
Table 1 Line Status Table
Figure 3 Read and Write Operation