# High-Performance DSPs -> Overcoming hot spots in Viterbi equalization

Overcoming hot spots in Viterbi equalization

By Bob Uvacek, Director of Systems Engineering, QuickSilver Technology Inc., Santa Clara, Calif., EE Times

November 15, 2001 (4:37 p.m. EST)

URL: http://www.eetimes.com/story/OEG20011115S0064

Today's wireless, cellular, and mobile communications devices rely on advanced algorithms, such as Viterbi estimation, to perform voice, video, audio, and data operations. These emerging technologies require optimal computational power efficiency to deliver high computing capability in mobile devices and to overcome the computational hot spots associated with advanced algorithms. However, conventional microprocessor and DSP technologies often prove inefficient because they don't pack the level of computational power required, leading designers to settle for implementing suboptimal algorithms.

A case in point is the signal-processing techniques targeted at improving the link performance in hostile wireless/mobile environments. As consumer demand for novel wireless services increase data rates, it is increasingly important to apply the most powerful algorithms for wireless communication. The mobile-radio channel is particularly destructive to transmis sion information due to multipath fading and Doppler spread. These effects have a strong negative impact on the bit-error rate (BER) of any modulation technique. Mobile-radio channel impairments cause the signal at the receiver to significantly distort or fade. Equalization, diversity, and channel coding are three techniques used independently or in tandem to overcome poor received-signal quality. This article focuses on Viterbåi equalization and bit synchronization as they are hot spots in the processing algorithms needed for the operation of mobile devices with high bit-data rates.

In theory, the best receiver performance with lowest BER can be achieved with soft-decision Viterbi equalization used in demodulation. This algorithm uses the knowledge of the modulator structure to correct received sequences of symbols that are unlikely or impossible to occur, and to pick the most probable one overlooking a longer piece of the sequence. The deviations between the received and the identified most-probably-se nt sequence are used to adapt the impulse response of the channel estimation. The longer the piece of a symbol sequence chosen, the better the performance. As the complex valued symbol sequences get more and more loaded with higher data rate information, higher oversampling must be applied to recover the sent wave forms, and more precise channel equalization must be applied for good demodulation results. The length of the channel impulse response has been found to be up to 12 coefficients long. For lower bit rates in the past, an approximation with only five coefficients turned out to be sufficient. This is no longer the case with the future high bit-rate standards.

In a cellular-phone receiver's baseband section, the three main components are front end, error correction, and vocoder. The front end is further divided into analog-to-digital (A/D) conversion, bit synchronization, and intelligent demodulation with maximum likelihood sequence estimation (MLSE). These last two portions of the front end are particularly compute-intensive. For example, bit synchronization uses oversampled correlation, which, when completed in real-time in one time slot, takes up to 1.5 giga instructions per second.

A second example is the demodulator using the Viterbi algorithm. Unlike other compute-intensive algorithms like VSELP and QCELP in the vocoder section, which are characterized by lots of branching in different inner code loops and subroutines, the computations in the front end are characterized by a uniform flow of data. Yet because the data rates are radically increasing, more frequent sampling and more computations are needed. In fact, computations increase exponentially with longer data sequences in the Viterbi equalizer. These algorithms can be best described as "hot spots" or algorithmic areas that consume inordinate power in the front end.

The Viterbi algorithm is an optimized searching algorithm for finding the most probable path out of several possible paths. In the front end, the Viterbi equalizer i s comprised of a matched filter, a Viterbi (MLSE) estimator, and a channel estimator (FIR filter). The key principle of the Viterbi algorithm is that if the most likely path between point A and point C passes an intermediate point B, then the segment of this path between point A and point B is the most probable path as well. The algorithm consists of two steps: forward probability accumulation and backward tracking of the most likely sequence.

Knowing the structure of the modulator in the sender, a graph called trellis can be constructed of all the possible sequences generated by that modulator. A received symbol sequence can be looked at as a path through the Viterbi algorithm's trellis. Hence, the job of the Viterbi estimator is to pick the most probable path through the trellis graph.

Essentially, the Viterbi estimator performs either hard- or soft-decision estimation. A hard-decision estimator puts any received symbol into two classes, 0 or 1. A soft-decision Viterbi estimator uses more classes between 0 and 1 that are related to the logarithmic likelihood of the received symbol. Soft-decision decoding generally produces better performance at the cost of increased complexity. A soft-decision Viterbi estimator (or MLSE) supports the demodulation inside of the Viterbi equalizer. The MLSE is, in turn, supported by a channel estimator or finite impulse response (FIR) filter and a matched filter. Using the FIR estimate of the channel impulse response within the soft-decision Viterbi algorithm, the MLSE tests all possible data sequences (rather than estimating each received symbol by itself) and chooses the data sequence with the maximum probability as the output. An MLSE typically has a large computational requirement, especially when the delay spread of the channel is large. In this case the impulse response is long, and accordingly the trellis becomes longer which means the Viterbi computations grow exponentially.

The complexity of the channel estimation provided by the Viterbi algorithm depends on the number of coefficients used in the FIR filter. If only five coefficients are involved, the computation is relatively minor. However, at 12 coefficients, the soft-decision Viterbi cannot execute because a DSP lacks sufficient computational power to handle this algorithm.

If the soft-decision Viterbi with at least 12 coefficients isn't used, more errors will be incurred during demodulation. Hence, the cellular-phone user cannot efficiently communicate as he/she moves further away from a basestation, or if greater amounts of distortion are experienced. Plus, the user cannot attain the highest possible data transfer rates if greater numbers of errors are incurred, thereby limiting cell-phone performance.

The soft-decision Viterbi algorithm demands about 2,000 Mips for it to operate optimally. This computational requirement calls for at least 10 latest-generation DSPs, each capable of 100 to 300 Mips. However, this multitude of DSPs for an advanced cellular phone design draws an inordinate amount of power because the power consumption for each DSP ranges from 100 to 200 milliWatts. Total power consumption in this case precludes using this many DSPs in a cellular or wireless design. Therefore, the designer relying on DSPs must settle for suboptimal algorithms and a less effective demodulator design.

Also, DSP architectures are designed to perform such mathematical operations as multiplication and adds. They are good in approximating any mathematical function by the Taylor series. But advanced communications designs call for blocks of bits that cannot be approximated; they must be absolutely exact. Bit-exact block processing is essential for symbol sequences. DSP structures are not acceptable in this regard because they lack the architecture to efficiently perform bit-exact block processing. For example, it takes 10 instructions to perform a simple one-symbol operation in a soft-decision Viterbi estimator.

On the other hand, Adaptive Computing Machine (ACM) technology can perform virtually any compute-intensive, high-precision algorithm more efficiently. The ACM is a new class of IC that can perform a variety of tasks by changing its basic architecture at blazingly fast speeds via algorithmic downloads. As a system is running, the silicon adapts itself to become the best ASIC to perform the given function. It adapts over and over again so that the ACM chip literally changes its architecture hundreds to thousands of times per second.

This fast adaptability allows a small piece of silicon to emulate a much larger ASIC. For example, an ACM can change bus and memory widths, multiplier sizes, and I/O constructs. These are functions and operations not readily adaptive by other methods. The ACM allows software algorithms to build and then embed themselves into the most efficient hardware possible for the application. Since this eliminates much of the code overhead, such as memory fetches and ALU/MAC set-up procedures, these algorithms are freed up to run at hardware speeds rather than the slower s oftware speeds. This constant conversion of software into hardware allows advanced algorithms, such as Viterbi estimation, to operate faster and more efficiently than with conventional microprocessor and DSP technologies.

The ACM closely tracks the necessary computational resources requested by any algorithm and provides the correct number and type of computational units required. For example, if a particular design needs matrix multiplication, all previously adapted functionality on the ACM silicon is erased with an algorithmic download that creates many multiplier units. These units form into a highly parallel operations pipeline that exactly executes a matrix multiplication.

Secondly, the Viterbi soft-decision algorithm can be downloaded onto the ACM silicon. This algorithm doesn't need the previously embedded multiplication hardware. Instead, it requires additions, decision-making comparisons, and selections. After the multiplier units are erased, an algorithmic download creates the necessary ad der, comparison, and selection units, forming into an operations pipeline that exactly executes a Viterbi estimator. Hence, a dedicated group of computational units and interconnections are focused at a given point in time on handling the Viterbi soft-decision algorithm. This way, both algorithmic computational and power consumption requirements are closely tracked.

At a third step, only the FIR filter may need to be computed. It may only need 10 multipliers, an adder, and nothing else. The rest of the silicon can be switched off or just powered down to perform the FIR filter operation.

If for example, in one clock cycle, the function to be performed should be a FIR filter, at that point in time, an ACM will use the exact bit widths for the adders and multipliers required. If in the following clock cycle, the function to be performed is a cascaded integrated comb (CIC) filter, at that point in time an Adaptive Computing Machine creates the exact bit widths for the adders and shifters. The ACM techn ology's extreme flexibility boosts performance because the exact hardware needed can be brought into existence on-demand for specific algorithms, which results in a reduced clock rate.

A further example is the requirement in CDMA QCELP coders to perform certain computations in double-precision integer arithmetic. Since most implementations use a 16-bit DSP, these computations take from two to eight times longer than the computations of an Adaptive Computing Machine. In one cycle, the ACM can bring into existence the 32-bit structures required to optimally perform the calculations.

The advantages of an ACM over a conventional microprocessor or DSP can be summed up in the following:

- The ability to process advanced algorithms, such as Viterbi estimation, in the most power-efficient way, thereby enabling high data rates in mobile devices;
- The ability to bring into existence, for the exact minimum amount of time, the most efficient hardware for an algorithm to run on;
- The abi lity to erase and create, on demand, deep pipelines of variable depth that require minimum power to execute operations;
- The ability to form custom bit widths, which match the native bit widths of the algorithm being executed, thereby enabling efficient use of bits to provide the necessary precision for mathematical operations.

### Related Articles

- Optimizing High Performance CPUs, GPUs and DSPs? Use logic and memory IP - Part II
- Optimizing high-performance CPUs, GPUs and DSPs? Use logic and memory IP - Part I
- Overcoming the embedded CPU performance wall
- DSPs with PCI Express interface extend connectivity while improving performance and power efficiency
- Asynchronous DSPs: Low power, high performance

### New Articles

### Most Popular

E-mail This Article | Printer-Friendly Page |