Mobile phone manufacturers are still unsure about which services end users will look for in emerging 3G products. But one thing they are betting on is wireless video using the MPEG-4 streaming video protocol. When you factor in the complexity of encoding and decoding MPEG-4 video streams, however, designers quickly realize that adding mobile video is no easy task.
How do you solve this dilemma? One answer is by pushing the limits on existing digital signal processors (DSPs) and microprocessor cores already housed in current mobile terminals. Unfortunately, these components are already pushed to their limits handling tasks like 3G baseband processing.
Another solution might be to add dedicated ASICs that simply handle MPEG-4 functionality. But that means another coprocessor taking up silicon or worse, more chips in the architecture.
A new computing alternative is needed to enable video capability for mobile and wireless devices, and it comes in the form of adaptive computing technology.
A New Alternative
So why is adaptive computing such a good option for wireless? The answer lies in its ability to allocate system resources.
Under the adaptive computing approach, the system hardware dynamically adapts its architecture on the fly to handle a specific task. In the case of wireless video, if an MPEG-4 stream is released, the system will poll its resources and then, based on that poll, allocate the resources that are available to handle the MPEG-4 task. The benefit here is that designers can obtain optimal use of their processing resources, while keeping power consumption to a minimum.
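The poll-then-allocate step described above can be sketched in a few lines of Python. This is a hypothetical model for illustration only; the `Matrix` class, the `KERNEL_COST` table, and the cost units are invented here and are not an actual adaptive computing API.

```python
# Hypothetical sketch of polling matrix resources and allocating them to
# MPEG-4 kernels on demand. All names and cost values are illustrative.
KERNEL_COST = {"ME": 60, "MC": 10, "DCT": 10, "IDCT": 10, "QUANT": 10}

class Matrix:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.resident = {}          # kernel name -> units of fabric in use

    def free(self):
        """Poll: how many units of the matrix are currently unused?"""
        return self.capacity - sum(self.resident.values())

    def allocate(self, kernel):
        """Load the kernel's hardware only if enough fabric is free."""
        if KERNEL_COST[kernel] <= self.free():
            self.resident[kernel] = KERNEL_COST[kernel]
            return True
        return False                # caller must time-slice instead

m = Matrix()
assert m.allocate("ME")             # 60 units taken, 40 remain
assert m.allocate("DCT")            # 10 more taken
print(m.free())                     # 30 units still available
```

A failed `allocate` is the cue to fall back on temporal segmentation, discussed below: the kernel waits for a later time slot instead of claiming space now.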
A number of other technologies are on the market to help the mobile phone adjust and allocate resources. Adaptive computing technology differs from conventional ICs, such as DSPs, ASICs, microprocessors, and FPGAs, by bringing into existence, for as long or short a time period as required (clock cycle by clock cycle, if necessary), the exact hardware implementation the software algorithm requires.
IC architectures supporting adaptive computing can adapt hundreds of thousands of times a second, while consuming little power. This permits the adaptive computing chips to implement spatial and temporal segmentation. Spatial and temporal segmentation is the process of adapting dynamic hardware resources to rapidly perform various portions of an algorithm in different segments of time (temporal) and in different locations in the adaptive circuitry matrix (spatial) of an adaptive computing architecture.
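A toy model makes the spatial/temporal distinction concrete. Here a schedule maps a (time slot, matrix region) pair to a kernel; kernels sharing a slot in different regions illustrate spatial segmentation, while the same region hosting different kernels across slots illustrates temporal segmentation. The slot and region numbering is invented for illustration.

```python
# Toy model of spatial and temporal segmentation: kernels occupy different
# regions of the matrix (spatial) during different time slots (temporal).
schedule = {}   # (time_slot, region) -> kernel name

def place(kernel, time_slot, region):
    key = (time_slot, region)
    assert key not in schedule, "region already occupied in this slot"
    schedule[key] = kernel

# Slot 0: ME fills region 0 while DCT runs alongside it in region 1 (spatial).
place("ME",  0, 0)
place("DCT", 0, 1)
# Slot 1: the same two regions are re-adapted for new kernels (temporal).
place("MC",    1, 0)
place("QUANT", 1, 1)
```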
Let's compare an ASIC and an adaptive computing MPEG-4 implementation. The ASIC version uses multiple IP cores with a dedicated core for motion estimation (ME), shape coding, padding, motion compensation (MC), discrete cosine transform (DCT), inverse DCT (IDCT), and quantization.
The advantage of having the computing fabric adapt its architecture for each of the MPEG-4 functions on demand is the elimination of fixed-function silicon ASIC blocks, which take up space when not in use and force a re-spin of silicon when specs change.
Conversely, adaptive computing technology uses a single adaptive circuitry matrix to execute every algorithmic kernel, each stored as a compressed binary. For instance, when an MPEG-4 implementation calls for ME, that kernel's compressed binary is downloaded into the matrix, where it expands and takes on its particular shape and function in hardware.
Let's look at the ME portion of MPEG-4. ME is the most computationally intensive portion of MPEG-4, consuming about 60 to 70% of the total MPEG-4 power budget.
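To see why ME dominates the power budget, consider its inner loop: for every block of the current frame, a search window of candidate blocks in the reference frame is scored, typically by a sum of absolute differences (SAD). The following minimal full-search sketch is not drawn from any particular MPEG-4 encoder; it simply shows the kind of computation the ME kernel performs.

```python
# Minimal full-search block-matching motion estimation using the sum of
# absolute differences (SAD) -- the kind of inner loop ME spends its time in.
def sad(block_a, block_b):
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def motion_search(ref, cur_block, cx, cy, radius, n=4):
    """Find the displacement (dx, dy) within +/-radius of (cx, cy) in the
    reference frame whose n x n block best matches cur_block."""
    best_vec, best_cost = None, float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = cx + dx, cy + dy
            if 0 <= y and y + n <= len(ref) and 0 <= x and x + n <= len(ref[0]):
                cand = [row[x:x + n] for row in ref[y:y + n]]
                cost = sad(cur_block, cand)
                if cost < best_cost:
                    best_vec, best_cost = (dx, dy), cost
    return best_vec, best_cost

# Plant a known 4x4 pattern in the reference frame, then search for it.
ref = [[0] * 10 for _ in range(10)]
cur = [[i * 4 + j + 1 for j in range(4)] for i in range(4)]
for i in range(4):
    for j in range(4):
        ref[2 + i][3 + j] = cur[i][j]
vec, cost = motion_search(ref, cur, cx=2, cy=1, radius=2)
print(vec, cost)   # (1, 1) 0 -- exact match one pixel right and one down
```

The nested loops over candidate displacements explain the 60 to 70% figure: every candidate costs a full block comparison, which is exactly the regular, repetitive arithmetic a hardware kernel accelerates well.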
However, by implementing ME as a compressed binary in the matrix rather than as fixed-function silicon accelerators, adaptive computing can reuse the same silicon for a set of functions. Compared to the alternative of executing the ME algorithm entirely in software on a RISC processor, this approach reduces power consumption because the ME operation completes more rapidly in adaptive hardware than in software. And compared to a pure hardware implementation, adaptive computing is not frozen at design time: it enables the tracking of spec changes, bug fixes, and product updates late in the design cycle and in the field as new features are developed.
Using adaptive computing, a certain percentage of the matrix is taken spatially by the ME algorithm, yet there is a remaining spatial portion that can be used to execute another compute-intensive algorithmic kernel. The hardware function is executed for as long as MPEG-4 requires it to stay in existence in the matrix, and then it gets swapped back out. In the meantime, all other MPEG-4 kernels are executed simultaneously.
Kernels that exist simultaneously on the matrix are examples of spatial segmentation. Any kernels that exist for only brief instances of time are examples of temporal segmentation.
Figure 1 shows an example of total spatial and total temporal segmentation. The granularity of computation for ME is 6X greater than that of any other kernel. Hence, it is necessary to execute all the other kernels six times for each execution of ME in order to process the same amount of data.
Figure 1a shows that all the MPEG-4 algorithmic kernels can be laid out in a completely spatial manner in an adaptive computing matrix. Figure 1b, on the other hand, shows that they can also be segmented temporally. Thus, by using adaptive computing technology, the designer can have ME come into the matrix and have it run very quickly. Next, MC comes in for a period of time, likewise for DCT, then quantization for another period of time, and so on.
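The temporal schedule of Figure 1b can be sketched as a loop. Because ME's granularity is 6X that of the other kernels, one pass runs ME once and the finer-grained kernels six times each; the kernel names and the simplified three-kernel tail are assumptions for illustration.

```python
# Sketch of the Figure 1b temporal schedule: ME runs once per pass, and the
# finer-grained kernels each run six times to process the same amount of data.
OTHER_KERNELS = ["MC", "DCT", "QUANT"]

def temporal_pass():
    trace = ["ME"]                      # ME handles 6 units of data at once
    for _ in range(6):                  # the other kernels each run six
        trace.extend(OTHER_KERNELS)     # times to keep pace with ME
    return trace

t = temporal_pass()
print(len(t))   # 19 kernel invocations per pass: 1 ME + 6 x 3 others
```

Changing the mix as in Figure 1c amounts to holding ME resident across several passes while only the tail of the loop repeats.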
At this point, the complete series of MPEG-4 algorithmic kernels can be repeated over and over again. Or, as shown in Figure 1c, the designer has the option of changing the mix slightly. ME can be left in the matrix for a longer period of time and then MC, DCT, and quantization can be quickly cycled in and those three algorithmic kernels can be repeated over and over.
In addition to providing spatial programming, the adaptive computing method also allows for temporal programming. To do this, the adaptive computing technique uses the matrix to run a particular application over a time period. It then adapts the matrix to run another application for another period of time; after that, it can adapt back to the previous application or to any one of a number of other applications.
Figure 2 shows a temporal programming example. This example is based on the eight most compute-intensive inner code loops of the QCELP speech compression algorithm. These eight inner code loops are code book search, pitch search, line spectral pairs (LSP) computation, recursive convolution, and four filters.
As the figure illustrates, this virtual hardware runs a sequence of different application binaries that are sequentially downloaded into the adaptive computing circuitry matrix. The figure also shows the code book search inner code loop downloaded into the matrix and the hardware created specifically to run that binary.
The algorithmic description of a piece of the code book search algorithm is illustrated in Figure 2b. This particular algorithmic element performs a 36-b integer division, feeding a 16- x 36-b integer multiply followed by a 37-b subtraction (comparison) with the time-delayed result from the previous computation. An efficient mapping between the algorithmic element description and the underlying hardware that must come into existence to closely match this description is extremely important.
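The Figure 2b datapath can be modeled in software. The sketch below mimics the stated bit widths with masks; the operand values, the wrap-around treatment of the multiply result, and the function name are all assumptions made for illustration, not details of the actual QCELP code book search.

```python
# Illustrative model of the Figure 2b datapath: a 36-bit integer division
# feeding a 16 x 36-bit multiply, then a 37-bit subtraction/comparison with
# the delayed result of the previous computation. Bit widths are modeled
# with masks; operand values below are invented for illustration.
M36 = (1 << 36) - 1
M37 = (1 << 37) - 1

def codebook_element(num, den, gain16, prev):
    q = (num // den) & M36          # 36-bit integer division
    p = (gain16 * q) & M36          # 16-bit x 36-bit multiply, kept to 36 bits
    return (p - prev) & M37         # 37-bit subtract/compare with prior result

print(codebook_element(100, 7, 3, 10))   # 100//7 = 14; 3*14 = 42; 42-10 = 32
```

In hardware, the "time-delayed result" is simply a register on the subtractor's second input; the mapping tool's job is to instantiate exactly this divider-multiplier-subtractor chain in the matrix.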
Subsequently, the pitch search code loop is downloaded into the matrix, associated hardware is created, and code book search is taken out. Next, LSP computation is downloaded and taken out when recursive convolution is downloaded, and so on. These binary changes occur 400 times a second, or put another way, each binary is changed 8 times every 20 ms.
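The swap-rate arithmetic above is worth making explicit: eight binary changes per 20-ms speech frame works out to 400 changes per second, or one adaptation every 2.5 ms.

```python
# Checking the binary swap-rate arithmetic: 8 binaries per 20-ms frame.
frame_ms = 20
binaries_per_frame = 8

changes_per_second = binaries_per_frame * (1000 // frame_ms)
ms_between_changes = frame_ms / binaries_per_frame

print(changes_per_second)    # 400 binary changes per second
print(ms_between_changes)    # 2.5 ms between consecutive changes
```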
Space And Time Working Together
Algorithms or compressed binaries can be loaded into an adaptive computing matrix from an on-chip cache memory in a spatial manner. Any number of different features and functions, as well as supporting algorithms can also be implemented via temporal mapping.
The designer can run more than one task spatially on the matrix, as well as program either temporally in time or spatially across an adaptive computing architecture's silicon matrix. The designer can also trade space for time, or vice versa. For example, if designers want to design in only one major task without adapting it, they can spatially place the temporal elements discussed above in Figure 2, and all will simultaneously exist in the matrix space.
However, if a design calls for a considerable number of simultaneous tasks, then the designer can trade space for time. Hence, the designer can run multiple tasks temporally on the silicon matrix. Moreover, there are shades of gray involved in these space/time trade-offs.
Take, for example, a simple three-input adder that is executed in one clock cycle in the matrix. This complete algorithm can be laid out spatially. If the space isn't available, it can be laid out temporally. As shown in Figure 3, the algorithm is divided into two temporal elements: segment 1 adds A and B together, and segment 2 adds C to that result.
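The two layouts of the three-input adder compute the same value; they differ only in whether the work is spread across fabric or across cycles. A minimal sketch:

```python
# The three-input adder of Figure 3, laid out two ways.
def add3_spatial(a, b, c):
    # Spatial layout: two adders exist at once; one clock cycle.
    return a + b + c

def add3_temporal(a, b, c):
    # Temporal layout: one adder is reused across two cycles.
    t = a + b          # segment 1 (cycle 1): A + B
    return t + c       # segment 2 (cycle 2): add C to segment 1's result

print(add3_spatial(1, 2, 3), add3_temporal(1, 2, 3))   # 6 6
```

Either way the answer is identical; the designer chooses based on whether fabric space or clock cycles are the scarcer resource.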
Simplifying The Design Flow
One of the added benefits of turning to adaptive computing technology is the simplification of the overall system design flow. Since the number of chips does not increase, designers do not have to add an additional design flow to their mix, making system design a lot easier.
In conventional IC designs, the designer contends with three different design flows to support the microprocessor, DSP, and ASIC components. In turn, the designer and/or programmer programs each IC with a different language and toolset. Assembly language or C is used for the DSP, C or C++ for the microprocessor, and VHDL or Verilog for the ASIC design. The productivity of an engineer using each of the languages differs widely. Compared to C/C++, programming in assembly language is roughly an order of magnitude slower in terms of lines of code per day. Compared to assembly language, VHDL/Verilog is roughly an order of magnitude slower in terms of lines of code per day. Compared to VHDL/Verilog, schematic entry design is roughly an order of magnitude slower in terms of gates per day.
Hence, three to four languages (at markedly different productivity levels) and a variety of different toolsets are used simultaneously, and at the end of the design, the designer must have them all synchronized. A major design like combining wireless and MPEG-4 functionality thus becomes inordinately challenging.
By using adaptive computing technology, designers can turn to a single high-level language (HLL) and single toolset to handle the entire design flow. Additionally, this language simultaneously represents both hardware and software elements associated with running a particular algorithm, including both the temporal and spatial information.
A key advantage to the HLL is that it allows designers to move portions of a design within an adaptive computing architecture. The language's behavior partitions itself across different portions of adaptive computing silicon, enabling designers to experiment with different configurations to fit their own specific applications.
Real-World Example
Table 1 illustrates the impact that one adaptive computing HLL has on the wireless design flow. This table lists two compute-intensive voice-over-IP (VoIP) speech compressors/decompressors, G.729e and G.729a, and the G.168 echo canceller. The ITU publishes the C code for each vocoder's specification. Unaltered, the G.729e C code requires a 550-MHz workstation to run in real time; a typical target for a DSP implementation is to run it in 20 MIPS.
In a normal DSP tool flow, the programmer profiles the C code. The hot spots (those algorithmic kernels that are the most compute hungry) are identified, and hand-tuned assembly language is developed to speed them up. Next, the assembly language is tuned for the particular DSP hardware structure. A total of 36 labor weeks, or nine months, of DSP development time is needed to reduce the G.729e workload to the targeted 20 MIPS required for a DSP to run this VoIP vocoder.
The G.729a VoIP vocoder is less computationally intensive, but still requires 100 MIPS, unoptimized, to run in real time. To run in a DSP, the designer and programmer target about 21 DSP MIPS. In this case, 28 labor weeks are required to profile the C code and identify hot spots. A majority of the time is spent hand-tweaking the assembly language to fit the particular DSP hardware structures and get the MIPS reduced to a reasonable number.
Conversely, as shown in the table, by using the HLL, adaptive computing development times are four and three labor weeks for the G.729e and G.729a, respectively. This is an eight to 10 times design-time improvement over conventional IC technologies.
Authors' note: For more background information, including some examples, see the white papers and chapters 1 and 2 at http://www.quicksilver.com/white_papers.htm.
Paul Master is vice president of technology at QuickSilver Technology, Inc. He received a BA in computer science from the University of California at Berkeley and an MS in computer science and engineering from Santa Clara University. He can be contacted at email@example.com.
Bjorn Freeman-Benson is the director of compiler tools at QuickSilver Technology, Inc. He received BS, MS, and Ph.D. degrees in computer science from the University of Washington. He can be contacted at firstname.lastname@example.org.