By Nan Wang and Magdy A. Bayoumi, University of Louisiana at Lafayette, Lafayette, LA, USA
Modern system-on-chip (SOC) designs integrate numerous heterogeneous components onto a single chip (embedded CPUs, dedicated hardware, FPGAs, embedded memories, etc.). On-chip communication is becoming the bottleneck for these designs, most of which employ a shared-bus communication architecture. This paper presents an efficient, scalable communication architecture for shared-bus based SOC systems, the Data Pre-fetch Core Interface (DPCI), which supports scalable, pipelined communication among the IP blocks, the shared memory, and the bus so as to improve system performance and increase system bandwidth and flexibility. The proposed architecture combines hardware simplicity with measurable performance gains. Experiments show that it not only reduces bus idle time and communication overhead, but also improves overall system performance significantly.
1. INTRODUCTION
As technology scales into the deep-submicron regime, the on-chip communication architecture is becoming a critical determinant of system-level metrics such as performance and power consumption. These metrics depend more on efficient communication among the master cores, and on a balanced distribution of the computation among them, than on pure CPU speed.
The most widely adopted interconnect architecture for SOC IP blocks is still the bus. Several semiconductor vendors have developed on-chip bus architectures [3-5] for embedded system designs, which employ a variety of communication schemes [6-12]. However, this approach has several shortcomings that will limit its use in future SOCs: poor scalability, unpredictable wire delay, and large power consumption.
In this paper, a scalable communication architecture for shared-bus based SOC systems, the Data Pre-fetch Core Interface (DPCI), is presented. The architecture of the proposed design is shown in Fig. 1.
Figure 1. The proposed communication architecture
A dedicated DPCI inserted between each master core and the shared bus not only supports regular, scalable communication between the masters and the shared resources, but also serves as an Open Core Protocol (OCP)-style interface that allows third-party IP cores to be plugged into the system; it also supports data pre-fetch operations for the master cores. Bus utilization and system performance are increased significantly by employing the DPCI architecture.
This paper is organized as follows: Section 2 introduces the shared-bus based SOC communication architecture; Section 3 details the new architecture design; Section 4 presents the test results; and Section 5 concludes the paper.
2. SHARED BUS-BASED SOC COMMUNICATION ARCHITECTURE
The shared-bus communication architecture consists of a network of shared and dedicated communication channels to which the various SOC components are connected. These include master components, which can initiate a communication transaction (CPUs, DSPs, DMA controllers, etc.), and slave components, which merely respond to transactions initiated by masters.
For shared-bus based architectures, all master components share the bus bandwidth. Bus access is decided by master priorities (priority bus), by pre-defined orders (Round Robin and TDMA) [7-9], by the number of pending requests and lottery tickets (LotteryBus), or by pre-assigned bus fractions (fraction control buses). When a bus request is rejected by the bus arbiter, the master core has to stall and postpone all operations that depend on that bus transaction until the request is granted, which wastes a significant amount of operation time.
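As an illustration of the ticket-based policy, the following sketch (a simplified model of our own, not an implementation from any of the cited buses; the function name and signature are assumptions) grants the bus to one requester with probability proportional to its ticket count:

```python
import random

def lottery_arbiter(requests, tickets, rng=random):
    """Pick one requesting master, weighted by its ticket count
    (a LotteryBus-style policy). Returns the winner's index, or
    None when no master is requesting."""
    pending = [i for i, req in enumerate(requests) if req]
    if not pending:
        return None  # bus stays idle this cycle
    weights = [tickets[i] for i in pending]
    return rng.choices(pending, weights=weights, k=1)[0]

# With a 1:1:4:6 ticket assignment, master 3 wins most contended cycles.
winner = lottery_arbiter([True, False, False, True], [1, 1, 4, 6])
```

A priority or Round Robin arbiter fits the same interface; only the selection rule inside changes.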
3. PROPOSED ARCHITECTURE DESIGN
To address this problem and offer an efficient solution, we now present the proposed DPCI architecture.
A. Data Pre-fetch Core Interface
Master cores and the shared bus generally operate at different speeds. To avoid metastability, timing, and data-corruption problems, we insert a DPCI between each master and the shared bus, as shown in Fig. 2. First, the DPCI serves as a traditional buffer that handles clock-domain crossing and alleviates communication contention between the master cores and the shared bus. Second, it supports data pre-fetching for the masters so as to further increase system efficiency. Finally, by programming its configuration unit, third-party IP cores can be plugged into the system.
Figure 2. The pipelined communication stages.
The DPCI consists of a configuration unit, a write buffer, and a read buffer. IP cores from third-party suppliers can be integrated into the system by programming the configuration unit (speed, data format, etc.). The write buffer receives data from the master core and passes it to the shared memory. The read buffer pre-fetches data from the shared memory before the master core requests it, while the bus is free and the master is performing other effective tasks, so as to overlap the communication overhead with effective operations. More importantly, embedding the DPCIs decouples system computation from system communication: all communication tasks are handled by the DPCIs.
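The contents of the configuration unit can be pictured as a small record. The sketch below is illustrative only: the paper names speed and data format as configurable items, and the remaining fields and all identifiers are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class DPCIConfig:
    """Illustrative configuration-unit fields for integrating a
    third-party core behind a DPCI (names are assumptions)."""
    core_clock_mhz: int      # operating speed of the attached master core
    bus_clock_mhz: int       # operating speed of the shared bus
    data_width_bits: int     # data format (word width) of the core
    endianness: str = "little"

# Integrating a hypothetical 100 MHz, 32-bit third-party core:
cfg = DPCIConfig(core_clock_mhz=100, bus_clock_mhz=300, data_width_bits=32)
```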
The architecture and signal interfaces of the DPCI are shown in Fig. 3.
Figure 3. The architecture and signal interfaces of the DPCI
B. Functions of the Communication Architectures
The functions of the DPCIs are described as follows:
- Core interface: the DPCIs act as core interfaces between the master cores and the shared bus, reconciling the speed and data-format differences between the master and the bus.
- Parallel and pipelined operation: as shown in Fig. 2 and Fig. 4, the communication between the master cores and the shared memory is divided into two stages: (1) stage 1, communication between the master cores and the DPCIs, and (2) stage 2, communication between the DPCIs and the shared bus. Stage 1 is pipelined with stage 2 so that their communication times overlap, and the communications between the masters and the buffers proceed in parallel.
- Overlapping: as shown in Fig. 4, effective operations of the master cores are overlapped with stage 2 when the bus is granted, or with the bus waiting time when it is not. With DPCIs embedded, master cores do not have to wait when the bus is not granted; they can either get pre-fetched data directly from the read buffers or continue with other tasks, leaving the DPCIs to deal with the shared bus. This decouples the computation tasks from the system communication tasks.
Figure 4. Pipelining communication
- Data pre-fetch: taking advantage of the fact that the data for a given computation task usually exhibits spatial locality in memory, the data stored next to the data most recently read by a master is pre-fetched into the read buffer whenever the bus is available. Pre-fetching allows the DPCI to "look ahead" and fetch data from memory before the master core needs it, resulting in fewer pipeline stalls in the masters and higher overall performance on many applications.
The data pre-fetch scheme has its own cost, however. When the read buffer has been updated with the wrong pre-fetched data, the bus time spent updating the buffer is wasted; this lengthens the waiting time of other pending masters and adds the overhead of accessing the buffer and re-fetching the correct data from the shared memory. Because of the spatial locality noted above, this is nevertheless a worthwhile tradeoff in most cases.
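A back-of-the-envelope model (ours, not the paper's) shows why the overlapping pays off: when stage-2 bus traffic is hidden behind the master's effective computation, only the longer of the two activities determines the per-iteration time.

```python
def iteration_time(compute_ns, comm_ns, overlapped):
    """Toy timing model: without DPCIs the master's computation and its
    bus communication serialize; with DPCI pipelining they overlap and
    only the longer activity is visible per iteration."""
    return max(compute_ns, comm_ns) if overlapped else compute_ns + comm_ns

# e.g. 80 ns of computation and 50 ns of bus traffic per iteration:
print(iteration_time(80, 50, overlapped=False))  # → 130
print(iteration_time(80, 50, overlapped=True))   # → 80
```

The figures 80 ns and 50 ns are arbitrary; the point is that the communication cost disappears whenever it is shorter than the overlapped computation.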
C. The Operation Algorithm
- Write buffer: The operation algorithm for Write Buffer is described in pseudo-code form as follows:
while (task is not finished) do
    if (write request is granted) then
        if (write buffer is empty) then
            write data from master to write buffer only;
        else if (write buffer is full) then
            write data from write buffer to memory through bus only;
        else
            write data from write buffer to memory through bus; and
            write data from master to write buffer;
    else if (write request is rejected) then
        if (write buffer is not full) then
            write data from master to write buffer;
        else
            wait for the bus idle time to write the data from the write
            buffer to memory before the master issues another request;
end while
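The write-buffer algorithm above can be modeled in a few lines of Python. This is a behavioral sketch under our own naming (class, method, and parameter names are assumptions; buffer depth and word values are arbitrary):

```python
from collections import deque

class WriteBuffer:
    """Behavioral model of the DPCI write buffer's per-step algorithm."""

    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()

    def cycle(self, master_data, bus_granted, memory):
        """One operation step: master_data is the word the master wants to
        write (or None), bus_granted reflects the arbiter's decision, and
        memory stands in for the shared memory behind the bus."""
        if bus_granted:
            if not self.buf:                        # buffer empty:
                if master_data is not None:
                    self.buf.append(master_data)    # accept from master only
            elif len(self.buf) == self.depth:       # buffer full:
                memory.append(self.buf.popleft())   # drain to memory only
            else:                                   # partially filled:
                memory.append(self.buf.popleft())   # drain one word, and
                if master_data is not None:
                    self.buf.append(master_data)    # accept from master
        else:                                       # request rejected:
            if len(self.buf) < self.depth and master_data is not None:
                self.buf.append(master_data)        # buffer the write locally
            # else: wait for a bus idle slot before the next request
```

Note how a rejected bus request no longer stalls the master: the write lands in the buffer and the DPCI drains it later.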
- Read buffer: The operation algorithm for Read Buffer is described in pseudo-code form as follows:
while (task is not finished) do
    if (read request is granted) then
        if (read buffer has been updated with correct data) then
            read data from read buffer to master;
            pre-fetch next set of data from memory to read buffer;
        else
            read data from memory to read buffer;
    else if (read request is rejected) then
        if (read buffer has been updated with correct data) then
            read data from read buffer to master only;
        else
            wait for the bus idle time to update the buffer with
            pre-fetched data before the master issues another read request;
end while
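The read-buffer algorithm lends itself to the same kind of behavioral sketch (again with our own naming; the line size and address scheme are assumptions chosen for illustration):

```python
class ReadBuffer:
    """Behavioral model of the DPCI read buffer with sequential pre-fetch."""

    def __init__(self, memory, line_size=4):
        self.memory = memory       # shared memory, modeled as a list of words
        self.line = line_size      # how many words one pre-fetch brings in
        self.buf = {}              # address -> word currently buffered
        self.next_addr = 0         # next address to pre-fetch (spatial locality)

    def _fetch(self, addr):
        """Bring one line starting at addr from memory into the buffer."""
        for a in range(addr, min(addr + self.line, len(self.memory))):
            self.buf[a] = self.memory[a]
        self.next_addr = addr + self.line

    def read(self, addr, bus_granted):
        """One master read request. Returns the word, or None when the
        master must wait for a bus idle slot and retry."""
        if addr in self.buf:           # correct pre-fetched data present
            word = self.buf[addr]
            if bus_granted:            # use the granted bus to pre-fetch
                self._fetch(self.next_addr)
            return word
        if bus_granted:                # miss: fetch the needed line now
            self._fetch(addr)
            return self.buf[addr]
        return None                    # rejected and buffer stale: wait
```

A sequential scan through memory hits the buffer on almost every access, which is exactly the spatial-locality assumption the pre-fetch scheme relies on.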
In summary, the proposed DPCI architecture overlaps communication time with the masters' other effective operation time, reducing communication overhead and improving overall system performance. The data pre-fetch scheme increases performance further.
4. TEST SYSTEM AND RESULTS
A. Design Complexity and Speed
We mapped the DPCI architecture onto a Xilinx Virtex-II Pro FPGA (device xc2vp2, package fg456). The estimated gate count of the communication architecture is 2156 and its maximum delay is 3.3 ns (1/3.3 ns ≈ 303 MHz), so it can operate in a system clocked at up to 300 MHz.
B. Efficiency of the DPCI Architecture
The Increment Priority Bus (IPB), Weighted Round Robin (WRR), TDMA bus, LotteryBus, SFCB, and DFCB were implemented on the generalized shared-bus architecture shown in Fig. 2 to perform 8 × 8 matrix multiplications. Each master core computes two rows of the result matrix, which requires 16 × 8 reads and 8 writes, 16 × 8 word multiplications, and 16 × 7 double-word additions. The master cores were kept busy computing and communicating with the data memory through the shared bus until the job was completed. The arbitration parameters for master cores 1 to 4 were: (1) IPB increments 1, 2, 3, 4; (2) WRR weights 1, 2, 3, 4; (3) TDMA slots 1:2:3:4; (4) LotteryBus tickets 1:1:4:6; (5)-(6) SFCB and DFCB fractions 8%:8%:32%:52%. The read/write ratio and burst size were set to 8:1 and 8 words, respectively. The matrix multiplication was carried out twice on the shared-bus architecture, with and without embedded DPCIs, to test the efficiency of the proposed DPCI architecture. Table 1 reports the average execution time, bus utilization (bus busy time / total execution time), and throughput (matrices processed per second) for the two runs.
Table 1: The efficiency of the DPCIs
| Average                 | Without DPCIs | With DPCIs      |
| Execution time (ns)     | 19950         | 15075 (+32.34%) |
| Bus utilization         | 83.24%        | 91.03%          |
| Throughput (matrices/s) | 16708.438     | 22111.175       |
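The relative improvement in Table 1 follows directly from the two execution times; the +32.34% figure is the speed-up of the with-DPCI run over the baseline, and the throughput ratio scales inversely with execution time:

```python
t_without = 19950   # ns, without DPCIs (Table 1)
t_with = 15075      # ns, with DPCIs (Table 1)

speedup_pct = (t_without - t_with) / t_with * 100
print(f"speed-up: {speedup_pct:.2f}%")   # → speed-up: 32.34%
```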
5. CONCLUSION
The proposed DPCI architecture serves as a core interface that allows IP cores to be easily plugged into the system, and it overlaps communication time with the masters' other effective operation time, reducing communication overhead and improving overall system performance. The data pre-fetch scheme increases system performance further.
A scalable communication architecture for shared-bus based SOC systems has been presented in this paper. The test results demonstrate that the proposed communication architecture reduces bus idle time and communication overhead and improves system performance by providing scalable, pipelined communication, while adding only a reasonable extra cost to the system.
REFERENCES
[1] K. Lahiri, S. Dey, and A. Raghunathan, "On-chip Communication: System-level Architectures and Design Methodologies", http://esdat.ucs.edu/projects/codesign/right-frame.html, 2001.
[2] F. Poletti, D. Bertozzi, L. Benini, and A. Bogliolo, "Performance Analysis of Arbitration Policies for SOC Communication Architecture", in Proc. Design Automation for Embedded Systems, 2003, pp. 189-210.
[3] "IBM On-chip CoreConnect Bus Architecture".
[4] D. Flynn, "AMBA: Enabling Reusable On-chip Designs", IEEE Micro, vol. 17, no. 4, 1997, pp. 20-27.
[5] AMBA 2.0 Specification.
[6] "Peripheral Interconnect Bus Architecture".
[7] "Sonics Integration Architecture", Sonics Inc.
[8] OMI 324 PI Bus, Rev. 0.3d, OMI Standards Drafts, 1994.
[9] "Round Robins and challenges".
[10] K. Lahiri, A. Raghunathan, and G. Lakshminarayana, "LotteryBus: A New High-Performance Communication Architecture for System-on-Chip Designs", in Proc. 38th Design Automation Conference (DAC '01), 2001, pp. 15-20.
[11] S. Lee, C. Lee, and H.-J. Lee, "A New Multi-channel On-chip-bus Architecture for System-on-chips", in Proc. IEEE International SOC Conference, Sep. 2004, pp. 305-308.
[12] N. Wang and M. A. Bayoumi, "Dynamic Fraction Control Bus: New SOC On-chip Communication Architecture Design", in Proc. IEEE Intl. SOCC Conf., Sep. 2005, pp. 199-202.