Module Threading Technique to Improve DRAM Power and Performance
By Dinesh Malviya, Arjun Mohan (Rambus Chip Technologies Ltd)
The interface speeds of DRAM (dynamic random access memory) components have improved dramatically with every new generation. However, DRAM core speeds have not seen much improvement. With every new DRAM generation, the access granularity doubles, and timing parameters such as tRRD and tFAW restrict the data throughput. As a consequence, certain classes of applications that need small, fine-grained accesses rather than large blocks of data face performance issues. The results from our studies show that significant performance improvements can be obtained by adopting the module threading technique.
This paper describes the DRAM timing constraints that have a heavy impact on memory system performance and introduces the module threading technique to overcome these limitations. It provides a detailed theoretical analysis of how module threading can offer finer granularity, higher bandwidth and, importantly, lower power consumption. It also presents a board-level analysis in which 25% of the power was saved and higher performance was achieved by adopting the module threading technique.
DRAM components are optimized for low cost per storage bit and not for core performance. DRAM storage arrays are designed to be as large as possible, so that the row and column support circuitry occupies a relatively small fraction of the chip area. As a consequence, the row and column access times are relatively long because of the heavily loaded word lines, bit lines, and column IO lines. This is why the minimum amount of data accessed from memory is large.
A. DRAM Timing Parameters
Read and write accesses to DDR SDRAM are burst oriented, i.e. the DRAM can be read or written only in bursts. The burst length, or prefetch length, determines the minimum number of column locations that can be accessed by a single READ or WRITE command. The burst length is defined by the following formula.
Burst Length = tCC / tBIT
tBIT is the DQ bit time, the interval of time occupied by one bit of information on a data signal.
tCC is the column cycle time, the interval required by a column access to transfer a block of information between a sense amplifier in the DRAM core and a pipeline register in the DRAM interface.
Burst length represents the number of parallel bits that are accessed during a tCC interval and then transferred serially through a DQ signal in sequential tBIT intervals.
Historically, the tBIT parameter has changed much more rapidly than the tCC parameter. The doubling of burst length every three years is due mostly to corresponding reductions in the tBIT parameter. In the case of DDR3 memory, the tCC value is 5 ns and the tBIT value is 0.625 ns, resulting in a burst length of 8.
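The burst length formula can be checked with a few lines of Python. This is only an illustrative sketch: the function name `burst_length` is ours, and the inputs are the DDR3 values quoted above.

```python
def burst_length(t_cc_ns: float, t_bit_ns: float) -> int:
    """Bits transferred serially on one DQ pin during one column cycle."""
    return round(t_cc_ns / t_bit_ns)

# DDR3 values quoted in the text: tCC = 5 ns, tBIT = 0.625 ns
ddr3_bl = burst_length(5.0, 0.625)
print(ddr3_bl)  # 8

# Halving tBIT at a fixed tCC doubles the burst length
print(burst_length(5.0, 0.3125))  # 16
```

The second call shows why burst length grows when the interface gets faster while the core column cycle time stays roughly constant.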
The minimum amount of data that can be read from or written into a memory decides its access granularity. The minimum data that can be accessed by a single column access is called column granularity and the minimum data that can be accessed by a single row access is called row granularity. The column granularity is a product of DQ width and burst length.
Column Granularity = DQ Width * Burst Length
DDR3 Column Granularity = 64 bits * 8 = 512 bits = 64 Bytes
Row granularity, on the other hand, depends on two timing parameters: tRC, the row cycle time, which is the interval between two row accesses within a single bank, and tRRD, the row-to-row delay between accesses to different banks. Traditionally, the minimum tRRD value is twice the tCC value, meaning that two column accesses may be performed during each row access. This leads to the following module row granularity relationship (i.e. the data transferred during a row access):
Row Granularity = 2 * Column Granularity
DDR3 Row Granularity = 2 * 64 Bytes = 128 Bytes
The row and column granularity values for different DDR DIMMs (each with a 64-bit DQ bus) are shown in Table-I.
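The two granularity formulas can be applied to each DDR generation as in the sketch below. The helper names are hypothetical; the burst lengths and the factor of two column accesses per row access are taken from the text.

```python
DQ_WIDTH_BITS = 64  # data bus width of a standard DIMM

def column_granularity_bytes(dq_width_bits: int, burst_length: int) -> int:
    """Minimum data moved by one column access, in bytes."""
    return dq_width_bits * burst_length // 8

def row_granularity_bytes(dq_width_bits: int, burst_length: int,
                          columns_per_row: int = 2) -> int:
    """Minimum data moved by one row access (tRRD = 2 * tCC gives two columns)."""
    return columns_per_row * column_granularity_bytes(dq_width_bits, burst_length)

for name, bl in [("DDR", 2), ("DDR2", 4), ("DDR3", 8)]:
    print(name,
          column_granularity_bytes(DQ_WIDTH_BITS, bl),
          row_granularity_bytes(DQ_WIDTH_BITS, bl))
# DDR 16 32
# DDR2 32 64
# DDR3 64 128
```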
The tRC limitation is generally hidden by a technique called bank interleaving. However, the limitation comes into play under the following condition.
tRC >= number of banks cycled * tRRD
If the memory device does not have enough banks to cycle through, so that the condition above holds, performance suffers. Similarly, tRRD becomes a performance issue if it is greater than BL/2 clocks when each row access performs only a single read or write.
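A minimal sketch of the bank interleaving check follows. The function names and the timing values (tRC = 40 clocks, tRRD = 6 clocks) are illustrative assumptions, not figures from a datasheet. Equality is treated here as just sufficient, because the first bank becomes available again exactly when the controller returns to it.

```python
import math

def banks_needed_to_hide_trc(t_rc: int, t_rrd: int) -> int:
    """Smallest number of banks N for which N * tRRD covers tRC."""
    return math.ceil(t_rc / t_rrd)

def trc_is_hidden(num_banks: int, t_rc: int, t_rrd: int) -> bool:
    """True if cycling through num_banks banks keeps tRC off the critical path."""
    return num_banks * t_rrd >= t_rc

# Hypothetical timings in clocks, chosen only for illustration
T_RC, T_RRD = 40, 6
print(banks_needed_to_hide_trc(T_RC, T_RRD))  # 7
print(trc_is_hidden(8, T_RC, T_RRD))          # True
print(trc_is_hidden(4, T_RC, T_RRD))          # False
```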
TABLE I. TREND OF MODULE ACCESS GRANULARITY
|DRAM ||Burst Length ||Column Granularity ||Row Granularity ||tFAW(Clocks)|
|DDR ||2 ||16 B ||32 B ||NA|
|DDR2 ||4 ||32 B ||64 B ||7.5-16.67|
|DDR3 ||8 ||64 B ||128 B ||16-32|
To make things worse, the DDR protocol imposes another limitation on the DRAM: the tFAW timing. It prohibits activating more than four banks within a rolling time window of length tFAW.
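The rolling-window rule can be expressed as a small checker. This is a sketch under our own naming (`violates_tfaw` is hypothetical): it flags a schedule if any five consecutive activates span less than tFAW clocks. The example cycle numbers are illustrative.

```python
def violates_tfaw(activate_cycles, t_faw, max_activates=4):
    """True if more than max_activates activates fall inside any tFAW window."""
    acts = sorted(activate_cycles)
    for i in range(len(acts) - max_activates):
        # acts[i] and acts[i + max_activates] are the first and last
        # of (max_activates + 1) consecutive activates
        if acts[i + max_activates] - acts[i] < t_faw:
            return True
    return False

# Five activates spaced six clocks apart break a 32-clock tFAW window
print(violates_tfaw([0, 6, 12, 18, 24], t_faw=32))  # True
# Delaying the fifth activate to cycle 32 satisfies it
print(violates_tfaw([0, 6, 12, 18, 32], t_faw=32))  # False
```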
B. Impact on DRAM Performance
The DDR3 column granularity is 64 Bytes, which means that the minimum amount of data accessed from DDR3 memory by a read or write operation is 64 Bytes. Some classes of applications do not need such coarse granularity and pay a performance penalty. On top of this, DRAM timing parameters such as tRRD and tFAW further reduce the available bandwidth. These restrictions have a large effect on DQ bandwidth, as shown in Table-II, which lists the DQ bandwidth loss due to the tRRD and tFAW restrictions for different DDR3 speed grades using the datasheet-specified standard IDD7 patterns. The total losses are in the range of 25% to 50%.
All DRAM timing restrictions apply within a single memory module and not across memory modules. So if a memory module is split into two or more module channels on the same module substrate and these are accessed separately, the access granularity can be reduced and the timing restrictions minimized. This technique of dividing one memory module into two or more independent module channels is referred to as module threading.
II. MODULE THREADING
Threading simply means bringing concurrency into a system. In a memory system, concurrency is achieved by increasing the number of banks. There are many ways of adding banks to a memory system; one of them is module threading. It allows a single memory module to be separated into two or more independently accessible data groups, or threads. It uses standard DRAMs, which are accessed in parallel through time-multiplexed chip selects. Depending on the access granularity requirement, the system can use dual or quad threading. Dual threading uses two memory channels on the same module, and quad threading uses four. Quad and higher threading can be applied if enough RQ bandwidth is available for command scheduling. Module threading offers the following advantages:
- Finer access granularity
- Better DQ utilization by relaxing the tRC, tRRD and tFAW restrictions
- Lower power consumption
TABLE II. EFFECT OF TRRD & TFAW ON DQ BANDWIDTH
III. CLASSIC MODULE SYSTEM
The classic module is the standard DIMM (dual in-line memory module). A DIMM is a small printed circuit board that carries multiple DRAM chips. It has a 64-bit DQ bus that connects to all the DRAMs. It is available in single-rank or dual-rank topology. A single-rank DIMM has one 64-bit DQ module; a dual-rank DIMM has two 64-bit modules in a bussed topology, but only one module can be accessed at a time, selected via the chip-select pin. Inside the DIMM, the DQ is point-to-point, but the clock and CA (command and address bus) are connected in a fly-by topology.
Figure 1. Classic Memory System
Figure 2. Module Threaded Memory System
Figure 3. Classic Memory System Timing Diagram
Figure 4. Module Threading System Timing Diagram
As shown in Figure-1, each component contributes 8 DQ bits to the memory controller data path, adding up to 64 bits. The chip selects of all the components are shorted together and accessible as a single CS, while the DQ lines are point-to-point. Thus all the DRAMs are accessed simultaneously, with common CA and CS, to perform any read or write over the 64-bit DQ lines.
Figure-3 shows memory transactions with a DDR3 memory module in the classic case, where different banks are accessed in a cyclic manner. At cycle-0, an activate command is scheduled to bank 0, and after tRRD, at cycle-4, another activate command is scheduled to bank 1. At cycle-1, the first read command is scheduled using additive latency (AL = CL-1) to access data from bank 0. At cycle-5, tCCD after the first read, another read command is scheduled to access data from bank 1. The first data from memory is received CAS latency (CL) after the first read, at cycle-11, and continues until cycle-14. In a similar fashion, the remaining commands are scheduled to obtain back-to-back data on the DQ bus from different banks. The DDR3 burst length is eight, which is why each read access provides four clocks of data. This defines the minimum granularity of a DDR3 module as 64 Bytes (64 pins x 8 bits).
IV. DUAL THREADED SYSTEM
As the name suggests, a single module is divided into two separate memory channels, each of which has a 32-bit point-to-point DQ and shares the clock and CA. These channels are selected via two separate chip selects. Since the memory topology differs from the classic one, the memory controller needs modification to access both modules concurrently using the chip selects. The memory controller should be designed to keep both memory channels busy and achieve higher data efficiency. This can be achieved by implementing a special address mapping scheme and deep command and data FIFOs in the controller design. The following sections describe in detail the improvements that a dual-threaded system can bring:
A. Finer Granularity
As shown in Figure-2, the standard DIMM is divided into two independent modules, A and B, each with a 32-bit DQ and four memory components. These modules are accessed independently via separate chip selects, CS1 and CS2. As a result, the bank size, column size and DQ width are halved. Hence the access granularity is half that of the classic case, as shown in Table-III.
Column Granularity = 32 bits * 8 = 256 bits = 32 Bytes
Row Granularity = 2 * 32 Bytes = 64 Bytes
TABLE III. ACCESS GRANULARITY WITH MODULE THREADING
|DRAM ||Burst Length ||Column Granularity ||Row Granularity ||tFAW Relax Window (Clocks)|
|DDR ||2 ||8 B ||16 B ||NA|
|DDR2 ||4 ||16 B ||32 B ||15-33.34|
|DDR3 ||8 ||32 B ||64 B ||32-64|
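The granularity values in Table-III follow from dividing the 64-bit module DQ bus among the threads. The sketch below uses a hypothetical helper, `threaded_granularity`, and also shows the quad-threaded DDR3 case mentioned earlier, which the table does not list.

```python
def threaded_granularity(num_threads: int, burst_length: int,
                         module_dq_bits: int = 64, columns_per_row: int = 2):
    """Return (column_bytes, row_bytes) for one thread of a threaded module."""
    dq_bits_per_thread = module_dq_bits // num_threads
    column_bytes = dq_bits_per_thread * burst_length // 8
    return column_bytes, columns_per_row * column_bytes

print(threaded_granularity(1, 8))  # (64, 128)  classic DDR3
print(threaded_granularity(2, 8))  # (32, 64)   dual-threaded DDR3
print(threaded_granularity(4, 8))  # (16, 32)   quad-threaded DDR3
```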
The reduction in access granularity is a boon for accesses of minimum size. In a classic module, a single column access per row access loses bandwidth to tRRD, and an application that demands 32-Byte access granularity has to use the burst-chop option, which supplies data at only 50% efficiency. With dual threading, 100% bandwidth can be achieved in both cases. Hence module threading suits applications that demand lower granularity as well as applications that require higher bandwidth.
B. Better DQ Utilization by Relaxing tRC, tRRD and tFAW Restrictions
In dual-channel module threading, the access granularity and the rate of bank activates are both halved. To access the same amount of data as a classic module, each thread must perform two consecutive column accesses per row access, which takes twice as many clocks. This widens the gap between activates to the same module and relaxes the tRC, tRRD and tFAW limitations.
Figure-4 illustrates memory transactions scheduled to a dual-threaded memory. At cycle-0, module A is accessed with an activate command to bank 0 by asserting CS-1. At cycle-2, module B is accessed with an activate command to bank 1, which is possible because there is no tRRD restriction between bank accesses across modules. At cycles 1 and 5, two read commands are sent to module A, and at cycles 3 and 7, two more read commands are sent to module B. The first data is received from module A at cycle-11 and continues for the burst length duration; similarly, the first data from module B is received at cycle-13. The modules are accessed alternately in a time-multiplexed manner to obtain back-to-back data on DQ1 and DQ2. The data accessed by one read command is 32 Bytes, half that of the classic case, which is why two commands have to be scheduled to get the same amount of data, i.e. 64 Bytes. Effectively, this increases the gap between two activate commands to the same module. As a result, the tRC, tRRD and tFAW restrictions are relaxed and data efficiency improves. This is explained in detail in the theoretical analysis, section [V][C].
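The command sequence described for Figure-4 can be reproduced with a short sketch. The constants are assumptions inferred from the cycle numbers in the text: a two-clock offset between the threads, reads spaced tCCD = 4 clocks apart, and ten clocks from a read command to its first data. The function name is hypothetical.

```python
T_CCD = 4          # read-to-read spacing in clocks (BL/2 for burst length 8)
READ_LATENCY = 10  # read command to first data, inferred from cycle-1 -> cycle-11
THREAD_OFFSET = 2  # clocks between the activates sent to module A and module B

def dual_thread_schedule():
    """Return sorted (cycle, module, event) tuples for one activate per module."""
    events = []
    for index, module in enumerate(("A", "B")):
        act = index * THREAD_OFFSET
        events.append((act, module, "ACT"))
        for k in range(2):  # two reads per activate to fetch 64 Bytes
            read = act + 1 + k * T_CCD
            events.append((read, module, "RD"))
            events.append((read + READ_LATENCY, module, "DATA"))
    return sorted(events)

for cycle, module, event in dual_thread_schedule():
    print(cycle, module, event)
```

In this sketch the activate and read commands land on cycles 0, 1, 2, 3, 5 and 7, so the two threads never compete for the same command slot on the shared CA bus.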
C. Lower Power Consumption
Module threading not only improves data efficiency but also reduces the total power for each transaction. To access the same amount of data, the number of device activations in the threaded case is half that of the classic case.
Typically, the module power required for row accesses (the activate command), referred to here as activate device power, accounts for 25% to 50% of the total power, with the rest consumed by column operations (the read/write commands). Dual threading has the same total number of column accesses as the classic module, but only half as many devices are accessed for each row transaction. Hence the power consumption with module threading is much lower than with the classic module. This is further explained in section [V][B].
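As a rough worked example, the expected saving in total power follows from the share of power spent on activates. The sketch assumes that column power is unchanged (the text states that the number of column accesses is the same) and that activate power is exactly halved. The function name is ours.

```python
def total_power_saving(activate_fraction: float,
                       activate_reduction: float = 0.5) -> float:
    """Fraction of total module power saved when activate power is reduced."""
    return activate_fraction * activate_reduction

# Activate power is 25% to 50% of total power according to the text
print(total_power_saving(0.25))  # 0.125
print(total_power_saving(0.50))  # 0.25
```

Under these assumptions the saving ranges from 12.5% to 25% of total power, and the upper end agrees with the 25% figure reported in the board-level analysis.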
V. THEORETICAL ANALYSIS
To study the effect of dual threading, we selected DDR3 DRAMs of different speed grades. We used the DDR3-specified IDD7 patterns to analyze the impact of dual threading on data efficiency and power consumption. We scheduled the memory transactions in the dual-threaded module in two different ways: first, to obtain the same DQ utilization as the classic module at lower power, and second, to obtain higher DQ utilization at lower power. To compare data efficiency, we accessed the same amount of data in the classic and dual-threaded cases.
A. Classic Module System
Figure-6 shows accesses in a classic DDR3-1600K DRAM with the IDD7 pattern. In this pattern, eight different banks are accessed with eight read commands. The banks are activated with a tRRD gap of six clocks, which is greater than BL/2 (four clocks) and causes a two-clock bubble on DQ. After A3R3, at cycle-19, an additional delay of eight clocks is introduced by the tFAW restriction, and the same eight-clock bubble appears on the DQ lines. The tRRD bubble occurs with every new bank activate, and the tFAW bubble occurs once in every four bank activates. Together these bubbles bring the overall DQ utilization down to 50%.
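The 50% figure can be reproduced by scheduling each activate at the earliest cycle that satisfies both tRRD and tFAW. This is a sketch, not the authors' tool: the function name is hypothetical, and the tFAW value of 32 clocks is an assumption inferred from the eight-clock extra delay described above and the upper bound in Table-I.

```python
def activate_cycles(num_activates: int, t_rrd: int, t_faw: int):
    """Earliest legal activate cycles under the tRRD and tFAW constraints."""
    acts = []
    for i in range(num_activates):
        earliest = 0
        if i >= 1:
            earliest = max(earliest, acts[i - 1] + t_rrd)   # row-to-row delay
        if i >= 4:
            earliest = max(earliest, acts[i - 4] + t_faw)   # four-activate window
        acts.append(earliest)
    return acts

BURST_CLOCKS = 4  # BL/2 for burst length 8

acts = activate_cycles(8, t_rrd=6, t_faw=32)
period = acts[4] - acts[0]               # steady-state window for four activates
dq_utilization = 4 * BURST_CLOCKS / period
print(acts)            # [0, 6, 12, 18, 32, 38, 44, 50]
print(dq_utilization)  # 0.5
```

The fifth activate would be legal at cycle 24 by tRRD alone, but tFAW pushes it to cycle 32, which is the eight-clock bubble seen on the DQ lines.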
B. Low Power Module threading
In this case, the commands are scheduled in the same way as in the classic case, as shown in Figure-7. Since the access granularity is halved in dual threading, two read commands need to be scheduled to each bank to access the same amount of data. As a result, sixteen read commands are generated, but the total number of activate commands remains eight, four to each module. Since each activate reaches only four memory components, compared to eight in the classic case, the power consumption is lower.
This analysis compares only the activate device power. In the classic case, 8 activate commands each propagate to all eight DRAMs and consume 64 (8*8) units of activate device power. In the dual-threaded case, 4 activate commands are propagated to the 4 DRAMs of each module and consume 32 (4*4 + 4*4) units, half that of the classic case.
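The activate device power count is a simple product, sketched below with a hypothetical helper. One unit is taken to mean one DRAM device receiving one activate command.

```python
def activate_device_power(activates_per_thread: int, devices_per_thread: int,
                          num_threads: int) -> int:
    """Total device activations: each activate reaches every device of its thread."""
    return num_threads * activates_per_thread * devices_per_thread

classic = activate_device_power(activates_per_thread=8, devices_per_thread=8,
                                num_threads=1)
threaded = activate_device_power(activates_per_thread=4, devices_per_thread=4,
                                 num_threads=2)
print(classic, threaded)  # 64 32
```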
C. Higher Performance Module Threading
Figure-8 shows the behavior of a dual-threaded DRAM module in the high-performance case, when the same IDD7 pattern is applied to it. The scheduling is optimized for best DQ utilization without violating any DRAM timings. Four activate commands now fit within the tFAW timing window without bubbles. As a result, no bubbles appear in the DQ path and 100% DQ utilization is achieved.
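A simple steady-state model makes the gain visible. It is a sketch under stated assumptions rather than the authors' analysis: each thread issues four activates per window, the window can be no shorter than tFAW, four tRRD gaps, or the time the thread's own data occupies its DQ lanes, and the shared RQ bus carries one command per clock. The timings are the six-clock tRRD from the text and an assumed 32-clock tFAW.

```python
def utilization(t_rrd: int, t_faw: int, num_threads: int,
                reads_per_activate: int, burst_length: int = 8):
    """Return (dq_utilization, rq_utilization) for a steady activate/read stream."""
    data_clocks_per_read = burst_length // 2   # DDR moves two bits per clock per pin
    data_clocks = 4 * reads_per_activate * data_clocks_per_read
    window = max(t_faw, 4 * t_rrd, data_clocks)
    dq = data_clocks / window
    commands = num_threads * 4 * (1 + reads_per_activate)  # each ACT plus its READs
    rq = commands / window
    return dq, rq

classic = utilization(t_rrd=6, t_faw=32, num_threads=1, reads_per_activate=1)
threaded = utilization(t_rrd=6, t_faw=32, num_threads=2, reads_per_activate=2)
print(classic)   # (0.5, 0.25)
print(threaded)  # (1.0, 0.75)
```

With two reads per activate, each thread fills its half of the DQ bus for the entire 32-clock window, so the tFAW window no longer leaves idle time. The resulting values match the DDR3-1600 x16 row of Table-IV.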
The complete performance analysis results are shown in Table-IV and Table-V. Table-IV lists the DQ utilization and RQ utilization for different DDR3 timing bins. In most cases, DQ utilization improved by 25% to 100%, as shown in Figure-5.
Table-V lists the power analysis results in terms of activate power. Activate power is compared because the same number of columns is accessed in both cases, and because page activation accounts for a major share of DRAM power. In all cases, the dual-threaded module consumed half the activate device power of the classic module.
Hence our theoretical analysis clearly shows the data efficiency and power advantage of module threading system over classic module system.
TABLE IV. PERFORMANCE ANALYSIS RESULTS
|Speed Grade ||Org. ||RQ Utilization, Classic (%) ||RQ Utilization, Threaded (%) ||DQ Utilization, Classic (%) ||DQ Utilization, Threaded (%) ||DQ Utilization Improvement (%)|
|DDR3-800 ||x4/x8 ||50 ||75 ||100 ||100 ||0|
|DDR3-800 ||x16 ||40 ||75 ||80 ||100 ||25|
|DDR3-1066 ||x4/x8 ||40 ||75 ||80 ||100 ||25|
|DDR3-1066 ||x16 ||29.62 ||75 ||59.3 ||100 ||68.63|
|DDR3-1333 ||x4/x8 ||40 ||75 ||80 ||100 ||25|
|DDR3-1333 ||x16 ||26.67 ||75 ||53.33 ||100 ||87.5|
|DDR3-1600 ||x4/x8 ||33.33 ||75 ||66.67 ||100 ||50|
|DDR3-1600 ||x16 ||25 ||75 ||50 ||100 ||100|
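The improvement column is the relative gain in DQ utilization over the classic case. The sketch below recomputes it for a few rows; the function name is ours, and small differences arise because the utilization inputs are rounded in the table.

```python
def improvement_percent(classic: float, threaded: float) -> float:
    """Relative DQ utilization gain of the threaded module over the classic one."""
    return (threaded - classic) / classic * 100.0

# (classic %, threaded %) pairs taken from Table-IV
rows = {
    "DDR3-800 x16": (80.0, 100.0),
    "DDR3-1066 x16": (59.3, 100.0),
    "DDR3-1600 x4/x8": (66.67, 100.0),
    "DDR3-1600 x16": (50.0, 100.0),
}
for name, (classic_dq, threaded_dq) in rows.items():
    print(name, round(improvement_percent(classic_dq, threaded_dq), 2))
```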
TABLE V. POWER ANALYSIS RESULTS
|DDR3 IDD7 Pattern ||Classic (Activate Power) ||Threaded (Activate Power)|
|DDR3-800 (All Bins and all Org.) ||64 ||32|
|DDR3-1066 (All Bins and all Org.) ||64 ||32|
|DDR3-1333 (All Bins and all Org.) ||64 ||32|
|DDR3-1600 (All Bins and all Org.) ||64 ||32|
Figure 5. DQ Bandwidth Improvement with Dual Module Threading
Figure 6. Classic Module Timing Diagram for DDR3-1600K, IDD7 Pattern
Figure 7. Dual Threaded Module Timing Diagram for DDR3-1600K, IDD7 Pattern
Figure 8. Dual Threaded Module Timing Diagram for DDR3-1600K, IDD7 Pattern
VI. BOARD LEVEL ANALYSIS
We also performed a board-level analysis, in which a standard DIMM was modified to support dual threading with the following objectives in mind:
- Quantify power savings with measurements on an ATE tester
- Quantify bandwidth improvements through analysis and benchmark simulations
We used the DDR3-1333H IDD7 pattern for the analysis, and the results were promising: we measured not only a 25% improvement in data efficiency but also a 25% reduction in total power consumption.
The column granularity of DDR3 is 64 Bytes, and it is expected to increase to 128 Bytes in DDR4. This affects memory performance for certain classes of applications. The architectural technique of module threading may be applied to conventional memory modules at relatively low incremental cost. This technique overcomes the tRC, tRRD and tFAW restrictions and permits the module to provide greater performance. It also offers smaller access granularity and halves the number of row activations, resulting in a 25% reduction in total memory power.