System on chip (SoC) design in the billion-transistor era will involve the integration of numerous heterogeneous semiconductor intellectual property (IP) blocks. These will range from general-purpose RISC processors, DSPs, MPEG/JPEG processors, and distributed memories to communication cores and specialized instruction-set processors.
One of the major problems with future SoC designs arises from non-scalable global wire delays. Global wires carry signals across a chip, but their lengths do not shrink with technology scaling. Although gate delays scale down with each technology generation, the delay of an unrepeated global wire grows roughly quadratically with its length; inserting repeaters reduces this growth to linear, but the delay still worsens relative to the gate delay with each generation.
Even after repeater insertion, the delay may exceed one clock cycle. Nonrepeated wires under practical constraints are estimated to incur delays of about 120-130 clock cycles across the chip at the 45 nm technology node. In ultra-deep-submicron processes, 80 percent or more of the delay of critical paths will be due to interconnect.
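The trend described above can be illustrated with a first-order Elmore-delay sketch. The per-millimeter resistance and capacitance and the per-repeater delay below are assumed values chosen only to show the scaling behavior, not figures from the article: the unrepeated wire's delay grows quadratically with length, while breaking the wire into repeated segments makes the wire term linear.

```python
# Illustrative sketch of global wire delay scaling. All constants are
# hypothetical, chosen only to show the quadratic-vs-linear trend.
R_PER_MM = 1000.0     # wire resistance, ohms/mm (assumed)
C_PER_MM = 0.2e-12    # wire capacitance, farads/mm (assumed)
T_REPEATER = 20e-12   # delay added by each repeater, seconds (assumed)

def unrepeated_delay(length_mm: float) -> float:
    """Elmore delay of a distributed RC line: 0.38 * r * c * L^2."""
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2

def repeated_delay(length_mm: float, n_segments: int) -> float:
    """Same wire broken into n segments with a repeater driving each."""
    seg = length_mm / n_segments
    return n_segments * (unrepeated_delay(seg) + T_REPEATER)

length = 10.0  # mm, roughly a chip-crossing wire
best = min(repeated_delay(length, n) for n in range(1, 100))
print(f"unrepeated:    {unrepeated_delay(length) * 1e9:.2f} ns")
print(f"best repeated: {best * 1e9:.2f} ns")
```

With these assumed values, repeaters cut the chip-crossing delay by roughly an order of magnitude, yet the repeated delay still grows linearly with length, which is why it eventually exceeds a clock cycle.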
In forthcoming technologies, global synchronization of all IP blocks will become infeasible, because a signal can no longer be sent from one end of the chip to the other within a single clock cycle. Instead of aiming for global control, one attractive option is to let self-synchronous IPs communicate with one another through a network-centric architecture.
Existing on-chip interconnect architectures give rise to further problems. The most frequently used on-chip interconnect is the shared-medium arbitrated bus, in which all attached IP blocks communicate over the same transmission medium.
As the number of connected IP blocks grows, so does the capacitance loading the bus wires, lengthening the propagation delay and, ultimately, the minimum achievable clock period. This limits the number of IP blocks that can be connected to the bus, and thereby the system's scalability. For SoCs consisting of tens or hundreds of IP blocks, bus-based interconnect architectures become a serious bottleneck, because all attached devices share the bandwidth of a single bus.
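A toy first-order RC model makes the scalability argument concrete. The driver resistance and capacitance figures below are assumptions for illustration only, not measurements from the article: each attached IP block adds tap capacitance, so the delay seen by the bus driver, and hence the minimum clock period, grows with the number of blocks.

```python
# Toy model of shared-bus loading. All constants are assumed values,
# used only to show that delay grows linearly with attached blocks.
R_DRIVER = 1e3       # bus driver output resistance, ohms (assumed)
C_WIRE = 1e-12       # fixed bus wire capacitance, farads (assumed)
C_TAP = 0.1e-12      # added capacitance per attached IP block (assumed)

def bus_delay(n_blocks: int) -> float:
    """First-order RC delay (0.69 * R * C) of the loaded bus."""
    return 0.69 * R_DRIVER * (C_WIRE + n_blocks * C_TAP)

for n in (4, 16, 64):
    print(f"{n:3d} blocks -> {bus_delay(n) * 1e9:.2f} ns per transfer")
```

Because every block both loads the wire and contends for the same bandwidth, the per-block share of bus throughput falls even faster than the raw delay numbers suggest.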
A communication-centric approach to integrating multiple heterogeneous IP blocks in complex SoCs overcomes both problems. This model decouples the processing elements from the communication fabric, so the need for global synchronization disappears.
In this model, each group of IPs is connected to a neighboring switch, and the global signals that span a significant portion of the die in current architectures need only travel between switches. The top-level interconnect then consists solely of the wires between switches.
In our network-centric approach, the communication between IPs can take place in the form of packets. The on-chip network resembles the interconnect architecture of high-performance parallel computing systems. The common characteristic of these kinds of architectures is that the functional IP blocks communicate with each other with the help of intelligent switches.
As such, the switches can be considered as infrastructure IPs (I2Ps) providing a robust data transfer medium for the functional IP (FIP) modules. There are different possible interconnect architectures in the parallel processing domain. In the System on Chip Research Laboratory at the University of British Columbia, we use the butterfly fat tree (BFT) as an interconnect template for SoC design.
A butterfly fat tree is a derivative of the fat-tree architecture, which has found extensive use in parallel machines (for example, the Connection Machine CM-5) and has been shown to be hardware-efficient. We evaluate this architecture with respect to communication-centric performance parameters, including throughput, latency, and susceptibility to deadlock, as well as the corresponding silicon area requirements.
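One common formulation of the butterfly fat tree, sketched below, places the N functional IPs at the leaves, attaches four IPs to each lowest-level switch, and halves the switch count at each level toward a single root; this is an illustrative assumption, not necessarily the exact variant the authors implement. Under it, the total switch count converges to about N/2, which is one reason the topology is considered hardware-efficient.

```python
# Illustrative BFT switch-count sketch (assumed formulation: 4 IPs per
# leaf switch, switch count halving per level up to a single root).
def bft_switches_per_level(n_ips: int) -> list[int]:
    """Return the number of switches at each level, leaves first."""
    assert n_ips >= 4 and n_ips & (n_ips - 1) == 0, "power of two expected"
    levels = []
    count = n_ips // 4  # each lowest-level switch serves 4 IPs
    while count >= 1:
        levels.append(count)
        count //= 2
    return levels

levels = bft_switches_per_level(64)
print("switches per level:", levels, "| total:", sum(levels))
```

For 64 IPs this gives levels of 16, 8, 4, 2, and 1 switches, 31 in total, close to the N/2 asymptote.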
A key component of the new design paradigm is an interface that provides consistency for packetized data and control signals traveling through the network. The obvious choice is the OCP-IP socket, which can capture a core's data, test, and control flows without imposing any limitation on the core's interaction with the rest of the system.
The OCP signals emerging from the FIPs are packetized and transported with the help of I2Ps. The I2Ps and the structured wires between them provide a highly pipelined communication medium. Through detailed circuit-level design and analysis, we are able to constrain the delay of each pipeline stage to within the International Technology Roadmap for Semiconductors (ITRS) suggested limit of 15 FO4 delay units. In this way we demonstrate the latency-hiding capability of such a network fabric.
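To get a feel for what a 15-FO4 stage budget implies, the sketch below converts it to a clock-period estimate using the common textbook rule of thumb that one FO4 inverter delay is roughly 500 ps per micron of drawn gate length under typical conditions. Both the coefficient and the process nodes shown are illustrative assumptions, not figures from the article.

```python
# Rough FO4 budget estimate (textbook rule of thumb, not from the
# article): FO4 delay ~ 500 ps per micron of drawn gate length.
PS_PER_UM = 500.0  # assumed rule-of-thumb FO4 coefficient, ps/um

def stage_budget_ps(drawn_length_um: float, fo4_units: int = 15) -> float:
    """Clock-period budget of a pipeline stage limited to fo4_units FO4."""
    return fo4_units * PS_PER_UM * drawn_length_um

for node_um in (0.18, 0.13, 0.09):
    budget = stage_budget_ps(node_um)
    print(f"{node_um} um node: {budget:.0f} ps budget -> "
          f"{1e3 / budget:.2f} GHz max stage clock")
```

Under these assumptions, a 0.13 um process allows roughly a 975 ps stage period, i.e. a pipeline clock on the order of 1 GHz, which is why a heavily pipelined switch fabric can keep per-hop delay within a single cycle.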
Andre Ivanov and Resve Saleh are professors at the University of British Columbia; Partha Pande and Cristian Grecu are graduate students there. The research described here comes from the Network on Chip project at the University of British Columbia's System-on-a-Chip (SoC) Research Lab, which has established itself as a world-class research center for the design, verification, and testing of high-speed mixed-signal systems on chips.