AltiVec power: PCI buses fall short
By Thomas Roberts, Product Marketing Manager, Mercury Computer Systems Inc., Chelmsford, Mass., firstname.lastname@example.org, EE Times
November 15, 2001 (4:45 p.m. EST)
Few would dispute that the AltiVec vector-processing engine in the PowerPC G4 microprocessor continues to lead high-end performance for embedded signal- and image-processing applications. Operating at 500 MHz, the 128-bit vector engine can execute floating-point calculations at up to 4 Gflops and can process 8-bit pixel data at up to 16 billion operations/second (Gops) by performing two operations per cycle on each of 16 bytes of data. The problem is how to feed that engine enough data to keep it busy and avoid unproductive processor time.
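Those peak figures follow from simple arithmetic. A minimal sketch, with the per-cycle throughput assumptions (four single-precision lanes with fused multiply-add; two operations per cycle on each of 16 bytes) taken from the text:

```python
# Back-of-envelope check of the quoted peak rates for a 500-MHz AltiVec engine.
CLOCK_HZ = 500e6
VECTOR_BYTES = 16  # 128-bit vector registers

# Floating point: assuming 4 single-precision lanes, each doing a fused
# multiply-add, i.e. 8 flops per cycle.
peak_flops = CLOCK_HZ * 4 * 2
print(f"peak float: {peak_flops / 1e9:.0f} Gflops")  # 4 Gflops

# Pixel data: two operations per cycle on each of 16 bytes.
peak_ops = CLOCK_HZ * VECTOR_BYTES * 2
print(f"peak pixel: {peak_ops / 1e9:.0f} Gops")  # 16 Gops
```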
AltiVec technology adds a 128-bit vector execution unit to the basic PowerPC architecture, resulting in a voracious demand for data. This vector-processing engine operates in parallel with the other PowerPC execution units, so multiple operations can be performed in each processor cycle. AltiVec processors use a single-instruction, multiple-data (SIMD) model that performs operations on several data elements at the same time. Each AltiVec instruction specifies up to three source operands and a single destination operand. Because more work is done in each clock cycle, more data must be available for the processor to work on. Altogether, this can require up to 48 bytes of input data as well as 16 bytes of output data on each clock cycle.
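The per-cycle operand traffic quoted above follows directly from the instruction format; a quick sketch:

```python
VECTOR_BYTES = 16   # each 128-bit operand is 16 bytes
MAX_SOURCES = 3     # up to three source operands per instruction
CLOCK_HZ = 500e6

bytes_in = MAX_SOURCES * VECTOR_BYTES   # worst-case input per cycle
bytes_out = VECTOR_BYTES                # one destination operand per cycle
print(bytes_in, bytes_out)              # 48 16

# At 500 MHz, that is a worst-case appetite of 24 GB/s in and 8 GB/s out.
print(bytes_in * CLOCK_HZ / 1e9, bytes_out * CLOCK_HZ / 1e9)  # 24.0 8.0
```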
Today's PCI buses typically run at 266 Mbytes/s, roughly 32 times slower than the rate at which the AltiVec engine can consume data. That in itself is not a severe problem, since most signal- and image-processing algorithms perform many operations on each piece of data. The problem arises when the processor has to share that fixed bandwidth with other processors or other devices.
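The factor of roughly 32 can be checked with quick arithmetic; a sketch, where the exact ratio depends on how peak consumption is counted:

```python
CLOCK_HZ = 500e6
VECTOR_BYTES = 16     # the engine can consume one 16-byte vector per cycle
PCI_BPS = 266e6       # 266 Mbytes/s PCI bandwidth

appetite = CLOCK_HZ * VECTOR_BYTES   # 8 GB/s peak consumption rate
print(f"{appetite / PCI_BPS:.0f}x")  # ~30x, in line with the ~32x quoted
```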
A common architecture for commodity signal-processing boards is to place four AltiVec-enabled processors on the same PCI bus. Several examples exist in the market from a number of suppliers. Most use a single PCI bus segment to load and unload all four processors, as well as to pass control information.
Using PCI and a four-processor model, data is fed to one of the processors at up to 266 Mbytes/s, and then the results are read back from that processor at up to 266 Mbytes/s. This procedure must be repeated three more times to move data through each of the four processors. By the time all four have been serviced, some 240 cycles have elapsed in each processor, with up to two operations on each piece of data in each cycle. Few algorithms require 240 to 480 operations on each data point before moving on to another data point. Clearly, the shared bus segment emerges as the chokepoint in this architecture.
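The 240-cycle figure can be reproduced with a simple accounting of bus time versus processor time, a sketch under the article's own assumptions:

```python
CLOCK_HZ = 500e6    # processor clock
PCI_BPS = 266e6     # shared PCI segment bandwidth
VECTOR_BYTES = 16   # one AltiVec vector's worth of data
N_PROC = 4

# One load plus one unload per processor = 8 transfers over the shared bus
# for each 16-byte vector worked on by each processor.
bus_seconds = 2 * N_PROC * VECTOR_BYTES / PCI_BPS
cycles_elapsed = bus_seconds * CLOCK_HZ
print(f"{cycles_elapsed:.0f} cycles")  # ~241 cycles per 16-byte vector
```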
The limitations of PCI solutions become apparent in demanding applications such as real-time signal processing in radar systems. The high data bandwidth of these applications demands multiple processors working in parallel at each of many stages in a signal-processing pipeline. Often data must be transferred from one processor to many others, and this operation must occur in parallel for many separate processors in a stage.
Depending on the number of samples per second to be processed, it is easy to see that the internal data bandwidth can become very large. The multiprocessor architecture must provide enough aggregate bandwidth to satisfy these intense peak demands. In addition, it must be possible for multiple simultaneous transfers to take place without the danger of collisions and delays caused by blocked data paths.
The shortcomings of PCI-based systems in feeding today's high-performance processors are in no way a fault of their original design. PCI was designed to provide system control for CPU-centric communications with components such as disks, networks and less-demanding DSP peripherals including sound cards. It emerged in the days when 33 MHz was considered a fast clock speed. Because of this, PCI is poorly suited to handle the multiple high-bandwidth data streams common to digital signal and image processing, where aggregate I/O data rates can easily exceed hundreds of megabytes per second.
The PCI community has attempted to respond to these limitations by bolstering the PCI specification. The clock rate has increased to 66 MHz, and then to 133 MHz for PCI-X. These upgrades increase the PCI bus data rate, but bump up against physical limitations that result in a smaller number of devices per bus segment. Ultimately, this solution turns out to be no solution at all for multiprocessor architectures: The problem it tries to address is how to feed multiple processors faster, so the answer cannot be to feed fewer processors.
Yet PCI does offer advantages, such as its smaller size and rich set of plug-and-play components. PCI's form factor makes it convenient for laboratory racks or even desktops, places where the ruggedness of VME-based solutions is not required. In addition, graphics cards, Fibre Channel cards, A/D submodules and other off-the-shelf assemblies abound in the PCI world, with capabilities that would benefit DSP system providers if only they could be used in a data-flow or stream-computing manner.
Buses provide a shared communication resource that, unfortunately, can be easily overloaded by high-speed I/O streams or flurries of interprocessor communication. Beyond the limited bandwidth of a single shared resource, contention for use of the bus can induce long latencies that limit real-time operation. The general solution is not to share: Give each processor its own dedicated data interface.
Such a solution can be attempted with a tree of PCI buses, but that generally does not improve the situation. Multiple PCI buses allow a transaction to proceed on each segment, but only if every transaction stays local to its own segment. The problem is that any single transaction that traverses the tree still ties up all of the PCI segments in the tree. A single communication path from one side of a bridge to the other can consume all of the communication resources of both the upstream and downstream PCI bus segments.
Although the PCI bridge specification allows for up to 256 PCI bus segments through a tree hierarchy of bridges, a communication path from one side of the tree to the other would consume all communications resources on all of the PCI segments. The latency of such a data transfer would also be prohibitive for most real-time applications.
These limitations occur because PCI's architectural constraints are incorporated in its buses and bridges. They create bandwidth bottlenecks, increased latency and heightened contention, each of which can cripple a large configuration. Further, PCI-based systems have limited scalability. They are hampered by both physical and architectural constraints. The physical constraints involve electrical and mechanical restrictions imposed to ensure signal integrity.
Limits on loading, trace lengths and connectors restrict the typical PCI bus segment to a total of 10 loads, with each connector counted as two loads. Thus, a typical motherboard or passive backplane can accept only up to four plug-in boards; the on-board chip set or PCI bridge takes up the remaining loads.
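The slot arithmetic above can be sketched as follows; the two loads assigned to the on-board chip set are an assumption consistent with the text:

```python
MAX_LOADS = 10            # electrical limit per PCI bus segment
LOADS_PER_CONNECTOR = 2   # each connector counts as two loads
CHIPSET_LOADS = 2         # assumed: loads consumed by the on-board chip set

slots = (MAX_LOADS - CHIPSET_LOADS) // LOADS_PER_CONNECTOR
print(slots)              # 4 plug-in boards
```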
A more general solution is a switch-fabric interconnect that provides both an interface to each processor and the ability for multiple simultaneous data transfers. In such a model, each processor gets a dedicated interface that runs at least as fast as a PCI connection, thereby eliminating the dilutive effects of sharing the data interface. The switches replace the multidrop bus, enabling many transactions to occur throughout the fabric. Advanced features such as adaptive routing around network hot spots are also possible.
PCI can be combined with switched fabrics in a number of ways to create scalable processor systems. One way to extend PCI with switch-fabric communication is to add a high-speed auxiliary communication network independent of the PCI bus. Such a design uses the PCI bus for basic control information and low-bandwidth I/O. High-bandwidth communication passes though the switch fabric with both high speed and low latency.
As processing requirements of the application grow, more processors are added that interface to both the PCI bus and the switch fabric. The switch fabric adds another point-to-point connection for each additional processing node, thereby scaling bandwidth with processing.
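The scaling contrast can be illustrated with a toy model: per-node bandwidth on a shared bus shrinks as nodes are added, while each fabric link (assumed here, for comparison, to match PCI's 266 Mbytes/s) stays constant and the aggregate grows:

```python
PCI_MBPS = 266  # shared-segment bandwidth; also assumed per fabric link

for n in (1, 2, 4, 8, 16):
    shared_per_node = PCI_MBPS / n   # one bus divided among all nodes
    fabric_aggregate = n * PCI_MBPS  # one dedicated link per node
    print(f"{n:2d} nodes: shared bus {shared_per_node:6.1f} MB/s per node, "
          f"fabric {PCI_MBPS} MB/s per node ({fabric_aggregate} MB/s aggregate)")
```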
Clearly, a switch-fabric interconnect provides only part of the solution, as data typically is not deposited directly into the processor but instead into local memory. That makes the DRAM bandwidth the next stress point. System designers must choose the fastest DRAM the processor can use to prevent the bottleneck at the processor's memory interface from becoming the primary driver of system performance. PowerPC G4 products currently support 133-MHz external interfaces, but many commodity-memory subsystem designs support speeds of 66 or 83 MHz. Many signal- and image-processing algorithms are memory bound, and the effect of the memory bandwidth on those applications can be dramatic.
The Raceway Interlink (ANSI/VITA 5-1994) is a leading example of these powerful auxiliary buses. Raceway provides multiple 160-Mbyte/s pathways within systems, implemented through a switch-fabric crossbar solution that supports networks of up to 1,000 processor nodes and more than 1 Gbyte/s in aggregate bandwidth. However, applications have emerged that swamp such early auxiliary buses. They include semiconductor wafer inspection, multispectral and hyperspectral imagery, dynamic route planning and radar-jamming resistance.
Second-generation switched fabrics address this requirement with balanced performance increases. As an example, Race++ is an evolutionary enhancement to the Race architecture that significantly increases system bandwidth, connectivity and topology. Race++ brings data to each AltiVec-enabled processor at a peak rate of 267 Mbytes/s. It supports up to 4,000 processors in a single system with an aggregate bandwidth of 500 Gbytes/s. Race++ also includes architectural improvements that eliminate bottlenecks and congestion, provide multiple paths between source and destination, and allow data to travel through a richly connected, multidimensional switch fabric. It also provides the determinism required for real-time computing.
As processor clock rates increase further, even the current switch-fabric technologies will be stressed to keep up with the processors' thirst for data. Newer, high-speed embedded fabrics like RapidIO are coming online to fill those high-end needs. The RapidIO specification defines a high-performance interconnect architecture designed for passing data and control information between microprocessors, DSPs, communications and network processors, system memory and peripheral devices within a system.
The initial RapidIO specification defines physical-layer technology suitable for chip-to-chip and board-to-board communications across standard printed-circuit-board technology at throughputs exceeding 10 Gbits/s, utilizing low-voltage differential-signaling technology. These data rates are a good match for gigahertz AltiVec processors that can process pixel data at up to 16 Gops.
The availability of new switch fabrics such as RapidIO represents an important step in assuring overall system balance. Future processors will connect directly to the switch fabric, promoting linear scalability of bandwidth as the number of processors increases. Contention for bus resources will virtually disappear as data transfers migrate from the bus to the high-performance fabric. This will also remove chokepoints such as PCI bridges and the limitations of the underlying bus architecture, either by eliminating buses entirely or by returning them to the uses for which they were originally designed and can still support.
Copyright © 2003 CMP Media, LLC