Solving AMR Speech Codec Porting Challenges
Ethan Bordeaux, Analog Devices
Aug 19, 2004 (6:00 AM)
The adaptive multi-rate (AMR) speech codec was introduced at the end of 1998 as a new codec for the GSM wireless network and as the mandatory speech codec for third-generation (3G) wideband CDMA (W-CDMA) systems. As of the beginning of 2004, AMR was the de facto codec for existing GSM systems and is a key component in the rollout of 3G wireless designs.
But, while widely adopted, implementing the AMR codec is not an easy task. Since AMR was defined by the European Telecommunications Standards Institute, it is supplied with reference C code and test vectors. At first glance it would seem porting this code to a specific processor architecture simply involves compiling the provided code and optimizing key sections until the desired performance metrics are met, using the test vectors to ensure compatibility. However, as with many engineering projects, once a more detailed study of the problem is made, a number of complex factors must be analyzed and balanced to create a robust implementation.
In this article, we'll examine the factors that must be considered when porting the AMR codec to a digital baseband processor featuring a digital signal processor core, an ARM7TDMI core, as well as a variety of memories, buses, and peripherals. While we will look into some of the steps followed for porting the AMR speech codec to this specific processor, they are general enough that they can be applied to most any modern DSP architecture.
Converting ETSI basicops
The basic steps followed in porting the AMR algorithm include:
1. Convert ETSI basicops from function calls to intrinsic DSP instructions.
2. Replace stack usage with a dynamically allocated scratch structure.
3. Build and test the algorithm, profiling for functions which consume the greatest number of processor cycles.
4. Port processor-intensive functions to native assembly language.
5. Repeat steps 3 and 4 until desired performance is met.
6. Enable space-optimizing features in the C compiler and compile the remaining C files.
Instead of using native definitions for basic computational operations, ETSI precisely defines their behavior through explicit function calls. For example, the typical method of adding two 16-bit numbers together in C (held in variables val1 and val2, with the result placed in val3) would be:

val3 = val1 + val2;
However, if this operation were called for in an algorithm defined by ETSI, it would look like this:

val3 = add(val1, val2);
The function add is defined in basicop2.c in the AMR reference code. Inside this function, the addition is performed along with a specific form of fixed-point saturation defined by ETSI. A wide range of basic mathematical operations such as addition, subtraction, multiplication and division, along with bit manipulation and shifting operations are defined and whenever that operation is required by the algorithm, the appropriate function call is made.
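To make the saturation behavior concrete, here is a minimal sketch of what such a saturating 16-bit add looks like. The normative version lives in basicop2.c in the reference code; this is an illustrative approximation written for this article, not the ETSI source itself:

```c
#include <stdint.h>

/* Illustrative sketch of an ETSI-style saturating 16-bit add.
 * The normative implementation is the add() function in basicop2.c. */
static int16_t sat_add(int16_t val1, int16_t val2)
{
    int32_t sum = (int32_t)val1 + (int32_t)val2;  /* widen so the sum cannot overflow */
    if (sum > 32767)  sum = 32767;                /* clamp to INT16_MAX */
    if (sum < -32768) sum = -32768;               /* clamp to INT16_MIN */
    return (int16_t)sum;
}
```

The clamping steps are the crucial difference from a plain C `+`, which would silently wrap on overflow rather than saturate.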
There are two major reasons for ETSI's precision in defining these basic operations. First, by explicitly defining how all mathematical operations are handled, it is possible to guarantee complete compatibility and bit-exactness across all target architectures, whether they are a simple fixed-point DSP or a microprocessor sitting inside a PC. Additionally, by placing every computational operation in a function call, ETSI provides a simple method of counting the number of mathematical operations that are required to process a frame of data. This enables a method of estimating the number of cycles it takes to run the AMR speech codec on a particular architecture before an optimized port begins. ETSI refers to this figure as weighted millions of operations per second (WMOPS).
Each basicop function is weighted independently, according to the complexity of the underlying operation. Additionally, WMOPS weightings can be hand-tuned, allowing the final calculated WMOPS value to more closely reflect the likely performance on a particular processor.
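The counting mechanism itself is simple: each basicop wrapper bumps a global tally by its weight before doing the arithmetic. The sketch below is a hedged illustration of the idea; the weight values shown are placeholders, not ETSI's normative figures:

```c
#include <stdint.h>

/* Hedged sketch of WMOPS-style operation counting on the PC build.
 * The weight below is illustrative, not an ETSI-specified value. */
static double wmops_count = 0.0;

#define WEIGHT_ADD 1.0   /* example weight for the add() basicop */

static int16_t counted_add(int16_t a, int16_t b)
{
    wmops_count += WEIGHT_ADD;              /* tally the weighted operation */
    int32_t s = (int32_t)a + (int32_t)b;    /* then perform the saturating add */
    if (s > 32767)  s = 32767;
    if (s < -32768) s = -32768;
    return (int16_t)s;
}
```

Running a full frame through such instrumented basicops and dividing by the frame duration yields the WMOPS estimate for the algorithm.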
One consequence of ETSI's reliance on basicops is that it is advantageous for a DSP architecture to support ETSI's definitions of basic mathematical operations in hardware. If the hardware does not support the specific definitions of particular computational operations, a software simulation which follows how the basicop function works is necessary, greatly decreasing overall performance.
One very important method of decreasing the processing load of the AMR speech codec is to replace the function calls for all mathematical operations with something more computationally efficient. Depending on the DSP architecture and tool set chosen, there may be support in software and hardware for C compiler intrinsics.
An intrinsic is a compiler-specific feature which allows for direct mapping of C functions into native assembly language. Intrinsics look like normal C function calls, but they are translated directly into assembly language by the compiler.
A good method to follow for this transformation is to use the C preprocessor to substitute function calls to ETSI basicops with the appropriate intrinsic functions, enabling the proper mapping of the function calls to their intrinsic nomenclature via a predefined compiler macro. If this is done, it is possible to use common C source code for both the DSP and the PC, while optimizing the code which runs on the DSP and keeping the important profiling features of basicops on the PC.
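The preprocessor approach can be sketched as follows. Note that `__TARGET_DSP__` and `__builtin_sat_add16` are hypothetical names standing in for a real tool chain's predefined macro and intrinsic; the fallback branch here inlines a portable reference add so the fragment is self-contained, whereas the real project would pull in ETSI's counting add() from basicop2.c:

```c
#include <stdint.h>

/* Hedged sketch of mapping an ETSI basicop to an intrinsic via the
 * preprocessor. __TARGET_DSP__ and __builtin_sat_add16 are hypothetical
 * placeholders for a real compiler's macro and intrinsic names. */
#ifdef __TARGET_DSP__
  /* DSP build: add() compiles directly to a saturating instruction. */
  #define add(a, b) __builtin_sat_add16((a), (b))
#else
  /* PC build: portable reference version (the real project would use
   * ETSI's counting add() from basicop2.c to retain WMOPS profiling). */
  static int16_t add(int16_t a, int16_t b)
  {
      int32_t s = (int32_t)a + (int32_t)b;
      if (s > 32767)  s = 32767;
      if (s < -32768) s = -32768;
      return (int16_t)s;
  }
#endif
```

With this arrangement, a single source tree serves both targets: the DSP build gets single-cycle instructions, while the PC build keeps bit-exact behavior and operation counting.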
Optimizing Memory Usage
The reference code for the AMR speech codec places all of its temporary variables (also known as scratch variables) on the stack. A stack is a common data structure where data and addresses are stored and retrieved sequentially in memory. Variables which are declared and used inside a C function are placed on a stack, and when the function ends, the memory is freed for future usage. While this is a simple method of managing memory, it is very inefficient in embedded designs.
Placing all temporary variables on the stack limits the ability to utilize slow and fast memory, as all stack variables will be in the same memory segment. Additionally, stack usage forces the placing of variables in sequence and does not allow for full use of multiple memories for dual data accesses.
Given the above reasons, it is important to minimize stack usage, especially for large and frequently accessed buffers. One common method of reducing dependence on the stack is through a "dynamic memory allocator" such as the ANSI C standard function malloc.
Malloc allocates memory on a heap, which is similar to a stack. One big advantage of the heap over the stack is that some implementations of malloc allow for multiple heaps, providing a method of placing variables in different memory segments based upon how frequently they are accessed.
One potential problem with relying on the heap is that the programmer must manually free allocated memory when it is no longer needed. If this is not done properly, a memory leak can occur, where a portion of memory is not freed at the end of algorithm execution.
One method of minimizing the possibility of memory leaks is to create a scratch memory structure by analyzing how memory is allocated in each function and then placing those allocations in a C structure. In this case, the high level API interface calls malloc to allocate enough space for the entire algorithm. A simple example of how a program can be converted from a standard stack-based memory allocation to one that uses memory allocation and a scratch structure is given in Code Listing 1 and Listing 2.
The data memory requirements between the two listings are the same. Because the two subfunctions in Listing 2 have their scratch arrays unioned together, they only take up as much memory as the larger of the two arrays. This is similar to how the stack would overlay the local memory allocation of two functions at the same call level.
Listing 2 shows how calls that reside at the same level can safely have their scratch memory unioned together. If the algorithm had another level of calls within it (subfunc1 and/or subfunc2 made function calls) another level of structures (and potentially unions based on the exact details of the function calls) would be placed within the scr_sf1 and scr_sf2 structures.
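The unioning idea can be sketched in C as follows. This is a hedged illustration in the style the article describes (the names scr_sf1, scr_sf2, and the buffer sizes are hypothetical, chosen only to show the overlay):

```c
#include <stdint.h>

/* Hedged sketch of a scratch structure with same-level scratch areas
 * unioned together. Names and sizes are illustrative, not from the
 * AMR reference code. */
typedef struct { int16_t buf[240]; } scr_sf1_t;   /* scratch for subfunc1 */
typedef struct { int16_t buf[160]; } scr_sf2_t;   /* scratch for subfunc2 */

typedef struct {
    int16_t shared[80];   /* scratch used by the top-level function itself */
    union {               /* subfunc1 and subfunc2 never run concurrently,
                           * so their scratch areas overlay one another,
                           * just as their stack frames would */
        scr_sf1_t scr_sf1;
        scr_sf2_t scr_sf2;
    } u;
} scratch_t;
```

A single malloc of `sizeof(scratch_t)` at the top-level API then covers the whole algorithm, and the union costs only as much memory as the larger of the two members.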
On many DSP architectures, multiple data accesses along with computational operations are possible within a single cycle; however, buffer locations must be configured such that they reside in different memory banks. One simple method of enabling a wide range of dual data accesses is to place the scratch structure and the state structure (variables whose values must be maintained from frame to frame as long as the algorithm is operational) in different data banks. While this covers many instances where multiple data accesses are desirable, there are cases where additional optimizations could occur if multiple elements of the scratch structure could be accessed in the same cycle.
For instance, a number of correlation functions call for multiple accesses to different locations within the same scratch memory buffer. To enable dual data accesses in this situation, a secondary buffer is created and forced to reside in the bank opposite the one holding the main scratch structure. At the start of the correlation functions, a portion of the buffer to be processed is copied from the main scratch structure to the alternate scratch structure. Once the copy completes, the correlation function begins and data is accessed far more efficiently, optimizing execution time for the function (Figures 1 and 2).
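A hedged sketch of this copy-then-correlate pattern is shown below. On real hardware the two buffers would be pinned to opposite memory banks through linker directives or pragmas; here a plain static array simply stands in for the alternate-bank buffer:

```c
#include <string.h>
#include <stdint.h>

#define CORR_LEN 40   /* illustrative maximum correlation length */

/* Stand-in for a buffer the linker would place in the opposite bank. */
static int16_t alt_bank_copy[CORR_LEN];

/* Hedged sketch: copy the lagged operand into the alternate bank so
 * that, on a dual-access DSP, x[i] and the copy can be fetched in the
 * same cycle inside the multiply-accumulate loop. */
static int32_t correlate(const int16_t *x, int len, int lag)
{
    memcpy(alt_bank_copy, x + lag, (size_t)(len - lag) * sizeof(int16_t));

    int32_t acc = 0;
    for (int i = 0; i < len - lag; i++)
        acc += (int32_t)x[i] * alt_bank_copy[i];  /* one fetch from each bank */
    return acc;
}
```

The up-front memcpy costs a few cycles per element, but the correlation loop that follows typically runs many lags over the same data, so the dual-access savings dominate.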
Figure 1: Because both data accesses are to the same data bank, only one piece of data can be accessed in a single cycle, slowing algorithmic computations.
Figure 2: By copying the correlation buffer to an alternate bank, dual data accesses are possible.
Porting to Assembly Language
Compilers developed for modern DSP architectures are very powerful and efficient tools. However, reference code is often not written with optimized execution time as a priority. Even the most powerful compilers can miss processor-specific optimizations, meaning that algorithms with hard real-time constraints (such as a speech codec) may require porting some C functions into native assembly language. However, it is wise to port only the minimum number of functions to assembly language, as keeping C portability and maintainability is a highly desired implementation trait.
The question of when to stop porting from C to assembly can be difficult to answer, especially if there are not any hard guidelines on required performance or memory usage. If hard guidelines are available, the procedure of running test vectors, searching for functions which consume the most processor cycles, and then porting these algorithms to assembly and benchmarking again is a good approach.
However, before embarking on the optimization process, it is important to examine the WMOPS value to determine if it is even possible to reach the desired performance. If the WMOPS value is very close to the target millions of cycles per second (MCPS) value, chances are it will either be incredibly difficult or impossible to reach this value, as WMOPS do not take into account any inefficiencies due to limited register sets or instruction/data caches. In this case, it would be wise to reconsider the target DSP architecture or the original required performance.
If a specific performance metric is not known, it is more complicated to determine how far to proceed in porting the C code to assembly. Many modern DSP tool chains include sophisticated profiling features for determining which functions are consuming the most cycles and where the porting effort is best focused. In the case of the AMR speech codec, by hand-optimizing only 15% of the code, total MCPS were reduced by 50%. This illustrates the necessity to focus particular effort on those functions which show the highest cycle consumption in the profiling process. Because of this, the gains in algorithmic performance decrease exponentially as more functions are ported from C to assembly language (Figure 3).
Figure 3: As code is ported from C to assembly language, MCPS decrease exponentially but never reach the ideal WMOPS.
It is also important to consider the WMOPS value when a specific desired performance metric does not exist. In this case, the WMOPS value gives an estimate of the absolute best-case performance from a full custom port of the code to assembly language. For example, if the measured WMOPS value for a particular DSP architecture is 12.5 WMOPS and the currently measured performance is 14.7 MCPS, it will only be possible to gain some value less than 2.2 MCPS through a total port to assembly language.
Estimating Worst-Case Cache Performance
Due to the size and complexity of common DSP algorithms, it is often necessary to use a caching mechanism to handle the increased memory requirement. The main problem with using a cache in DSP algorithms is that it adds a level of indeterminacy in execution time.
There is no longer a simple one-to-one correspondence between the code being executed and the time it takes to execute. Additionally, interrupts or pre-emptive execution of code can flush a previously full cache, increasing execution time in a live system. Determining the worst-case execution time is very important in algorithms with hard real-time constraints (such as the AMR speech codec, which must complete its processing in 20ms or audible artifacts occur).
In older wireless handsets, it was more common that the software modules had a single known priority level and memory mapping at runtime. This was true because handsets typically were single purpose products designed to handle only voice communications.
However, with the advent of mobile multimedia and messaging services, the priority level for the AMR speech codec could potentially change based on real-time requirements. Therefore, it is important to benchmark this algorithm across a wide variety of memory layouts to give the system designer a full understanding of the MCPS requirement through many profiles. Additionally, it is important to mimic the effects of flushing the instruction and data caches while calculating isolated performance metrics.
As the priority level of the speech codec is lowered, it is more likely that the speech codec will be pre-empted by another algorithm, potentially causing a portion of the cache to be invalidated. In testing the AMR speech codec, a baseline of a complete instruction and data cache flush was performed at the start of each frame. This helped provide padding in the MCPS calculations necessary to ensure safe performance in a live application.
The process of porting the AMR speech codec from reference code to target hardware requires a number of transformations beyond porting cycle-intensive functions to assembly language to reach target performance. Reference C code is often not appropriate for wireless systems and requires careful handling so that it both performs efficiently and is easy to integrate into a final complex system. Customizing the memory allocation methods and the location of key buffers, along with rewriting specific functions in assembly language and intelligently using C compiler features, leads to delivering a highly optimized and easily integrated algorithm in the shortest period of time.
About the Author
Ethan Bordeaux is a DSP software engineer at Analog Devices. Ethan has developed a variety of signal processing algorithms pertinent to the wireless handset market. He can be reached at email@example.com.