By Andrew Jones, Mark Hill, Mark Beaumont, James Pascoe, Stuart Ryan, Robert Deaves
STMicroelectronics R&D Ltd, Bristol UK
Abstract
This paper presents the architecture of a high performance level 2 cache capable of use with a large class of embedded RISC CPU cores. The cache has a number of novel features including advanced support for data prefetch, coherency, and performance monitoring. Results are presented showing the performance improvement profile over a large class of applications.
1. Introduction
The use of level 2 (L2) caches is set to proliferate in embedded SoC designs as CPU speed continues to outpace DRAM latency. For example, a CPU with a 2 ns cycle time on a complex SoC using DDR2 memory may waste as much as 50% of the available CPU cycles while stalled waiting on an external memory access.
Some estimates put the gap between processing rate and DRAM speed as growing at 50% per year. There are two main problems with the approach of simply increasing the size of the Level 1 cache to better combat DRAM latency. Firstly, this approach limits the maximum frequency (Fmax) of the processor. Secondly, many cores are conveniently optimised and delivered as hard macros for each technology and so, in practice, a fixed size level 1 cache has advantages in minimizing the resource required for CPU core development and maintenance.
With the progression to finer process geometries, embedded systems have also relied on simple blocks of on-chip RAM close to the CPU to combat latency. Although this tightly coupled memory is typically faster and more area-efficient than an L2 cache, it does complicate the memory architecture presented to software. In this architecture, software has to manage the notion of faster and slower access address ranges in order to make best use of on-chip RAM. This non-uniformity is undesirable in highly complex systems using standard operating systems.
To understand the benefit of the level 2 cache, consider the idealised memory hierarchy depicted below:
Fig.1: Simplified L2 cache model
The standard formula for calculating miss latency is:
$$
\text{Average memory access time} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \left(\text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}\right)
$$
With typical hit rates for L1 and L2 caches the addition of an L2 cache can reduce the average memory access time by a factor of 2-4.
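As a rough illustration of the formula, the following sketch evaluates the average memory access time with and without an L2 cache. The 13-cycle L2 service time and the 144-cycle miss penalty are figures quoted elsewhere in this paper; the L1 hit time and both miss rates are assumed values chosen only for illustration, and the no-L2 case assumes that an L1 miss pays the full 144-cycle penalty.

```python
def amat_with_l2(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2)


def amat_without_l2(hit_l1, miss_rate_l1, miss_penalty_l1):
    """Average memory access time (cycles) when L1 misses go straight to DRAM."""
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


# Illustrative parameters; values marked "assumed" do not come from the paper.
HIT_L1 = 1            # cycles, assumed
MISS_RATE_L1 = 0.05   # assumed
HIT_L2 = 13           # cycles to service an L1 miss from the L2
MISS_RATE_L2 = 0.20   # local L2 miss rate, assumed
DRAM_PENALTY = 144    # cycles

baseline = amat_without_l2(HIT_L1, MISS_RATE_L1, DRAM_PENALTY)
with_l2 = amat_with_l2(HIT_L1, MISS_RATE_L1, HIT_L2, MISS_RATE_L2, DRAM_PENALTY)
speedup = baseline / with_l2
print(f"without L2: {baseline:.2f} cycles, with L2: {with_l2:.2f} cycles, "
      f"ratio: {speedup:.2f}")
```

Under these assumptions the ratio falls inside the 2-4 range quoted above; different miss rates would move it accordingly.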
The average hit rate for an L2 cache depends closely on its size and on the memory footprint of the application. Determining the optimal size for a level 2 cache can be a significant system architecture activity in cost-sensitive embedded systems.
Most Level 2 caches are CPU-family specific. The desire to make coherency and other operations transparent to software means that side band signals are commonly used between the level 1 and level 2 caches to control them. This creates a problem for highly complex SoCs. Many SoCs use multiple cores for different parts of the application. For example, STMicroelectronics set-top box chips commonly use 4 or more different types of CPU core in a single chip. These include a 32-bit RISC core for running general applications code, a VLIW core for audio/video decode assist, and other cores supporting transport processing and DMA functions.
We had the goal of architecting and designing a single L2 cache which would work across these cores. Also, we wanted to limit the amount of design resource required to maintain this IP.
Level 2 caches rely on the same principles of spatial and temporal locality as level 1 caches for their performance gain. However, the allocation relationship between the two cache levels is central to the appropriate sizing of the caches. For small level 2 caches, an exclusive arrangement, in which a line cannot be present in both caches simultaneously, can be advantageous. For larger caches, the simpler loosely (also known as mainly) inclusive model, in which level 1 cache lines are likely but not guaranteed to also be in the level 2 cache, performs well. See  for a treatment of the trade-offs facing the Level 2 cache architect.
2. Architecture
The design topology is illustrated below. The CPU, including its level 1 caches, is interfaced directly to the level 2 cache solely using a high-performance on-chip interconnect known as the STBUS. All memory requests issued by the CPU/L1 cache are thus sent to the Level 2 cache for servicing.
Fig.2: Level 2 cache s/w model
A key aspect of this architecture is how the L2 cache determines that a particular request is cacheable in the L2 in the absence of direct signals from the CPU. One way of achieving this would be to associate cacheability with one or more address ranges which could be configured by software. The problem with such approaches is that they end up with an associative data structure representing the windows, which needs to be managed in software.
The approach that we use is to re-use the level 1 cache's notion of cacheability which, in part, is controlled by the CPU's TLB (translation lookaside buffer). That is to say, a data request is cacheable in the level 2 cache if and only if it is cacheable in the level 1 cache at the time the request is made.
We can determine whether a request is cacheable in the level 1 cache simply by looking at the size of the request: all allocating level 1 cacheable requests are 32-byte read accesses. It is also true for our CPUs that 32-byte requests cannot arise from any activity other than a cacheable access. This gives a simple and fast way of establishing cacheability.
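The size-based rule can be sketched as follows. This is a minimal illustration rather than the actual hardware logic: the `BusRequest` type and its field names are hypothetical, and only the 32-byte size and the read/allocate relationship come from the text.

```python
from dataclasses import dataclass

L1_LINE_BYTES = 32  # size of an L1 cacheable request, as stated in the text


@dataclass(frozen=True)
class BusRequest:
    """Hypothetical view of an interconnect request as seen by the L2."""
    address: int
    size_bytes: int
    is_read: bool


def is_l2_cacheable(req: BusRequest) -> bool:
    """A request is treated as cacheable when it is exactly one L1 line long."""
    return req.size_bytes == L1_LINE_BYTES


def is_l2_allocating(req: BusRequest) -> bool:
    """Only cacheable reads (L1 line fills) allocate a line in the L2."""
    return is_l2_cacheable(req) and req.is_read


line_fill = BusRequest(address=0x8000_0040, size_bytes=32, is_read=True)
uncached_load = BusRequest(address=0xA000_0000, size_bytes=4, is_read=True)
print(is_l2_allocating(line_fill), is_l2_allocating(uncached_load))
```

No address-range table is consulted, which is the point of the approach: the decision needs only fields already present in the request.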
The main cache parameters we decided upon are shown in the table below:
| Feature | Value |
|---|---|
| Type | Unified |
| Size | 256KB (128KB-1MB) |
| Line | 32 Bytes |
| Associativity | 8-way |
| Replacement Policy | Random |
| Addressing | Physical Address & Tags |
| Allocation | Read Miss/Prefetch |
| Write-Policy | Write-Through/Write-back configurable |
| Inclusion | Loosely inclusive |
| Pipelining | 16 outstanding misses on each L2 cache port |
2.1 Dual Port Architecture
In order to support use of the L2 cache by IP with DMA capability, the cache has two ports. The first is dedicated to the CPU; the second is usable by the rest of the SoC. Allowing arbitrary DMA into an L2 cache can significantly complicate its design: mechanisms need to be in place not only to manage conflicts between requests servicing level 1 misses and writebacks, but also to manage DMA traffic. Our system analysis showed that a major component of DMA activity is in prefetching buffers which are to be processed by the CPU. We have deployed a novel mechanism for getting such stream data into an L2 cache, using a special prefetch register accessible through the second port. This enables DMA-capable IP to instruct the L2 cache to prefetch data into the cache before it has been requested by a CPU. This mechanism avoids much of the difficulty of supporting direct DMA into the cache while retaining many of the advantages of getting data into the L2 cache ahead of the time that the L1 cache requests it.
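The effect of the prefetch register can be illustrated with a toy behavioural model. Everything here is a hypothetical sketch: the class, method names and addresses are invented for illustration, and the model tracks only which lines are resident (no data, no eviction, no timing).

```python
LINE_BYTES = 32  # L2 line size from the parameter table


class ToyL2:
    """Toy L2 model: records which line numbers are resident."""

    def __init__(self):
        self.resident = set()
        self.hits = 0
        self.misses = 0

    def write_prefetch_register(self, address):
        """Second-port access: ask the L2 to pull in the line holding `address`."""
        self.resident.add(address // LINE_BYTES)

    def cpu_read(self, address):
        """Primary-port access: a 32-byte L1 line fill request."""
        line = address // LINE_BYTES
        if line in self.resident:
            self.hits += 1
        else:
            self.misses += 1
            self.resident.add(line)  # allocate on read miss


def dma_deliver(l2, base, length):
    """Hypothetical DMA engine: announce each line of a buffer it has written."""
    for address in range(base, base + length, LINE_BYTES):
        l2.write_prefetch_register(address)


l2 = ToyL2()
dma_deliver(l2, base=0x1000, length=256)        # buffer announced via prefetch
for address in range(0x1000, 0x1100, LINE_BYTES):
    l2.cpu_read(address)                        # CPU consumes prefetched buffer
for address in range(0x2000, 0x2100, LINE_BYTES):
    l2.cpu_read(address)                        # buffer that was not prefetched
print(l2.hits, l2.misses)
```

In this model the eight lines of the announced buffer hit in the L2, while the eight lines of the unannounced buffer each incur a miss.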
Fig.3: Attachment of L2 cache to system.
2.2 Coherency Management
The L2 cache is software-managed. There are memory-mapped registers which allow the cache to be flushed, invalidated and purged per address or per entry. In addition, there are operational modes in which both the tag and data arrays are memory-mapped. This allows debugging but also permits creation of a RAM mode in which the data array of the cache can be mapped as a block of RAM and so functionally resemble tightly coupled memory. This enables its use in functionally diverse environments where close management of memory buffer placement is possible.
2.3 Performance monitoring
On large SoCs it is increasingly difficult to gather good profiling information on large real-time application performance. Much of the literature on this subject uses data garnered by RTL simulations or other restricted environments.
The L2 cache includes a number of event and cycle counters which enable the gathering of statistics related to hit/miss rates, bandwidths and latencies observed by the cache. It is possible to determine the number of compulsory, capacity and conflict misses.
2.4 Security
The increased use of conditional access (CA) and digital rights management (DRM) controlled content means that in general SoC architects have to be guarded about which access paths through the chip are allowed. Secure data has to be kept separate and distinguished from insecure interfaces and code. Because only the L2 cache's primary CPU port can read the contents of data in the L2, there is limited scope for eavesdropping on secrets or secure contents. The cache is able to function in partitioned systems of this type solely by restricting access to the secondary port.
Even in a partitioned system it may be advantageous to allow a specific communication channel across the partition in a way that avoids the penalty of external memory latency. In order to make this channel robust we use a source filter on the interconnect. A source filter is attached to an interconnect target and only allows specified initiators to access that target. Initiators without permission will get an error response should they attempt to access the target. See Fig.4.
This allows us to implement a partitioned system with robust high-speed communication channels between partitions when necessary.
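The behaviour of a source filter can be sketched in a few lines. This is a toy model under stated assumptions: the class, the initiator names and the response tuples are all hypothetical and stand in for interconnect-level signalling that the paper does not detail.

```python
class SourceFilter:
    """Toy source filter guarding a single interconnect target."""

    def __init__(self, target, allowed_initiators):
        self.target = target
        self.allowed = frozenset(allowed_initiators)

    def access(self, initiator, address):
        """Forward permitted requests; answer all others with an error response."""
        if initiator not in self.allowed:
            return ("ERROR", None)
        return ("OK", self.target(address))


def l2_second_port(address):
    """Stand-in for the guarded target; returns the 32-byte-aligned address."""
    return address & ~0x1F


guarded = SourceFilter(l2_second_port, allowed_initiators={"secure_dma"})
print(guarded.access("secure_dma", 0x1234))
print(guarded.access("usb_host", 0x1234))
```

The permitted initiator reaches the target, while any other initiator receives an error response without the target ever seeing the request.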
Fig.4: L2 cache with source filtering
3. Implementation
The level 2 cache has been implemented in a 65LP process. Its maximum frequency is 500 MHz and it consumes 21 mW of leakage. Its area is 2.81 mm² for a 256 Kbyte implementation. It is able to service an L1 cache miss in approximately 13 cycles as seen from the CPU core.
4. Performance
We now summarize the benefits of using an L2 cache for various applications.
The primary benefit of embedded L2 caches is to maintain high application performance despite large off-chip memory latency. The secondary benefits come as a reduction in off-chip bandwidth requirement and thus lower power consumption.
In embedded systems the latency experienced by a processor increases as the total system bandwidth requirement increases. Thus the benefit of implementing an L2 cache increases with the memory stress on the system. In the following graphs we present the results of FPGA simulation of the L2 cache performance benefit against SoC memory latency for a wide variety of applications.
Fig.5: L2 cache MPEG Performance graph(1)
Fig.6: L2 cache MPEG Performance graph(2)
Fig.7: L2 cache various applications Performance graph
The above three graphs illustrate the performance advantage gained by using an L2 cache against the background of increasing system latency. For example, an MP3 decode is more than 40% faster with a 256K L2 cache when the cache miss penalty is 144 CPU cycles.
The following graphs show how the L2 cache behaves when performing a block copy. These show that the performance varies with the size of the data block being copied, for 4 different cache modes (WriteThrough, CopyBack, Hidden, and Disabled). For these graphs an average was taken over all system latencies.
The key features are:
- Performance decreases as copy_size increases
- The L2 has a performance sweet spot between 32 and 256 Kbytes of application footprint.
- CopyBack and WriteThrough modes have very similar performance in these tests
As the copy size increases, a smaller proportion of the application data will be held in either the L1 or the L2 cache, and performance will therefore decrease.
However this decrease happens at 2 switch points: 32 and 256 KBytes. Below 32 KBytes all the data is principally held in the L1 cache and the performance is at its highest. Above 32, but below 256 KBytes, the data is held in the L2 cache. Above 256 KBytes the data is not found in either cache and has to be transferred to/from system memory. At this point the system yields the lowest performance.
Note that the knee of the curve around 32K bytes is sharper than that around 256K bytes. This is because the 32K L1 cache uses an LRU line replacement policy on a 4-way cache, whereas the L2 uses a random replacement policy on an 8-way cache.
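The link between replacement policy and knee sharpness can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a model of the actual caches: it uses a single fully associative cache of 64 lines and a purely sequential, repeated sweep, and the function name and parameters are hypothetical. Once the footprint exceeds the capacity, LRU always evicts the line that will be needed soonest, so its hit rate collapses, whereas random replacement retains part of the working set.

```python
import random
from collections import OrderedDict


def sweep_hit_rate(capacity, footprint, policy, passes=20, seed=0):
    """Hit rate for repeated sequential sweeps over `footprint` lines.

    Models one fully associative cache of `capacity` lines, a simplification
    of the set-associative caches discussed in the text.
    """
    rng = random.Random(seed)
    cache = OrderedDict()           # keys are line numbers, order is recency
    hits = accesses = 0
    for _ in range(passes):
        for line in range(footprint):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)       # refresh recency (matters for LRU)
                continue
            if len(cache) >= capacity:
                if policy == "lru":
                    cache.popitem(last=False)         # evict least recently used
                else:
                    victim = rng.choice(list(cache))  # evict a random line
                    del cache[victim]
            cache[line] = True
    return hits / accesses


for footprint in (64, 72, 96):
    lru = sweep_hit_rate(64, footprint, "lru")
    rnd = sweep_hit_rate(64, footprint, "random")
    print(f"footprint={footprint}: lru={lru:.2f} random={rnd:.2f}")
```

With the footprint equal to the capacity both policies hit on every pass after the first; just beyond it, the LRU hit rate drops to zero while the random policy degrades gradually, which is consistent with a sharper knee for the LRU-managed cache.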
Fig.8: L2 cache memory copy performance (1)
Fig.9: L2 cache memory copy performance (2)
5. Conclusions
The major impact of deploying an L2 cache in a high performance embedded device is the performance insulation it provides from the relatively slow speed of system DRAM. This problem is acute now and will worsen in coming generations.
This paper has presented a novel L2 cache architecture which can be used with a wide class of embedded CPUs without complex interfacing but which achieves significant performance advantages.
This flexibility will be key in future systems as the memory hierarchy of embedded SoCs will deepen for many IP blocks, not just the CPUs. With the advent of significant amounts of embedded DRAM we can expect SoCs with several Level 2 caches.
Because the L2 cache has a standard interconnect interface it can be used at several points in the SoC topology to combat latency issues. Of course architects need to be mindful of coherency issues but our system analysis has shown that there is significant opportunity for easy deployment of this IP.
The IP has been designed to permit single or multiple core use, can be of configurable size, associativity and location and will support both big and little endian domains.
On the kind of complex SoC for which this IP is targeted, DMA engines are frequently implemented with CPU cores as this allows an amount of processing of data in flight. This processing can include filtering, endian conversion, encryption or decryption. The architecting of a prefetch register allowing DMA IP to place data in the L2 cache combats the latency problems suffered by the traditional flow of DMA-ing buffers into memory and requiring the CPU to fetch into the L1/L2. The prefetch register is low-cost to implement in terms of design complexity and area, a fact which offsets its disadvantage that it does not reduce external system bandwidth. This appears to be a reasonable trade-off, however, as pure bandwidth problems are typically less acute in many of the SoCs that are being deployed in the set-top box market.
Most application CPUs of the type that we have looked at are required to run multiple applications concurrently. Therefore the active footprint is likely to be larger than that of the set of single application benchmarks we have presented here. In practice this difference will add further complexity to this analysis. We aim to use the performance monitors embedded in the current generation of this IP to gather data to refine the next generation. This data will have the advantage of being collected in-situ rather than in the simplified environments used here.
6. Acknowledgements
Thanks to the STMicroelectronics Bristol CSD design team as a whole and, in particular, Robert Hogg, David Shepherd and Richard Curnow for their experience, insights and pragmatism in implementing this architecture.
References
[1] S. Narita, "SH4 RISC Microprocessor for multimedia, gaming machine", IEEE Design, Automation and Test in Europe, 2001.
[2] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., Morgan Kaufmann Publishers (Elsevier Science USA), Chapter 5.
[3] R. Deaves and A. Jones, "A Toolkit for Rapid Modeling, Analysis and Verification of SoC Designs", IPSOC, Nov. 2003.
[4] R. Deaves and A. Jones, "An IP-based SoC Design Kit for Rapid Time-to-Market", IPSOC, Dec. 2002.
[5] A. Stevens, "Level 2 Cache for High-performance ARM core-based SoC Systems", Arm Ltd White Paper, 2004.
[6] A. Jones and S. Ryan, "A re-usable architecture for functional isolation of SoCs", IP 07 IP Based Electronic System Conference, Dec. 2007.