Integrating High-Quality Audio into Mobile Design
by Dave Sparks, Sonic Network, Inc.
Multi-media functionality is becoming more important in handheld products, and consumers are demanding higher fidelity audio in their wireless devices, PCs, games, software applications, and music synthesizers. In order to meet this demand and enable high-quality music playback on a wide range of consumer devices, product designers must trade-off performance, memory, and power consumption to optimize for user preferences and expectations. As a result, there are some key considerations when integrating audio synthesis into an embedded platform, including the functional components of the systems, related standards, design options, memory issues, and performance trade-offs.
Differentiating to Compete
As consumers continually demand more features and functionality from their mobile handsets, mobile operators continue to look for new ways to differentiate product offerings and expand revenue streams. In turn, these operators drive handset/terminal manufactures to provide them with products that offer the consumers more capabilities, features, and options. At the same time, they look to source handsets from manufactures that can offer their network some differentiation over their wireless competition.
MIDI Audio synthesis is one such feature that has provided an opportunity to both differentiate products and increase revenue by way of polyphonic ringtone capabilities. Therefore, it is becoming more important for embedded designers to understand the major issues and design techniques when integrating a high-quality audio synthesis solution into a mobile platform.
MIDI: The Format Standard
There are dozens of different file formats that have been developed over the years for storing data for audio synthesis. The most widely accepted standards-based format is the Standard MIDI File (SMF), a standard jointly overseen by the MIDI Manufacturers Association (MMA) and the Association of Musical Electronics Industry (AMEI).
The MIDI standard originated as a protocol to transmit a musical performance over a 31.25Kbits/s serial cable. With limited bandwidth available, the protocol comprises primarily control data such as "Note-On," signifying the moment when a performer presses a note on a musical keyboard, and "Note-Off," signifying when the note is released.
The SMF format was created in the 1980s as a means to capture the MIDI data stream into a file for editing on a computer. Events are stored in a stream format with the addition of delta timestamps to mark the amount of time since the last event. The file format is very compact, with a typical files size between 10 and 100KB. A similar file stored in a perceptive audio coder format such as MP3 would be 4MB. While MIDI behaves similar to a codec at the receiver end, an encoder that can encode an arbitrary audio stream is impractical today.
MIDI is able to achieve these levels of "compression" because the audio data itself is not stored in the file, only the actions of the performer, similar to the concept of a player piano roll with its coded instructions of what notes to play when, how hard to play them and for what duration. The audio synthesis engine then interprets this data into audio producing actions using a synthesis algorithm. There are two common forms of synthesis used in mobile platforms today: frequency modulation (FM) and sample-based synthesis or more simply, sampling synthesizer.
FM synthesis uses a purely algorithmic technique of modulating a carrier signal with a modulator. The resulting output is a rich spectrum of sound created by the sums and differences of the two frequencies. By varying the amount of modulation applied to the carrier over time, the spectrum can be manipulated to imitate real instruments, or create new synthetic sounds. This is the synthesis technique popularized by Yamaha, and which is incorporated into their MA series of MIDI synthesis ICs.
In contrast, a sampling synthesizer utilizes recordings of actual instruments as well as synthetic sounds. By varying the playback speed through interpolation, a single recording can be used to synthesis a range of frequencies. The sound is often further manipulated using filters to dynamically vary the output spectrum.
Generally, sampling synthesizers produce more realistic sounding instruments but the realism comes at the cost of additional read-only memory (ROM) for the sample library. FM synthesizers require less memory to store the algorithm parameters for their sounds, but signal processing requirements tend to be much higher.
Audio Synthesis Components
The process of reading a MIDI file and synthesizing an audio output stream from it can be broken into three distinct components: the file parser, the MIDI interpreter, and the synthesis engine.
The file parser reads MIDI data from a file or input stream and reconstructs the timeline from the delta timestamps stored in the file. Timestamps are generally specified relative to the tempo of the musical piece although they can also be specified relative to the SMPTE (Society of Motion Pictures and Television Engineers) time code. The file parser converts the relative timestamps in the file to absolute time so that events can be fed to the MIDI interpreter at the appropriate time.
The MIDI interpreter acts on the performance data in the MIDI stream. For example, when a "Note-On" event is received, the MIDI interpreter must locate the algorithm parameters that characterize the musical instrument to be synthesized, allocate resources ("voice") to synthesize the note and start the process of synthesizing the note.
The performance data may occasionally request more voices than are available, in which case the MIDI interpreter must determine which notes have priority. "Voice stealing" occurs when an active voice is reallocated to synthesize a new note.
The synthesis engine receives control data from the MIDI interpreter and synthesizes the audio based on the supplied parameters ("program") and, in the case of a sampling synthesizer, the sample data. The output of all the voices is mixed together based on the MIDI controls to render the final audio output.
Sweetening the Audio Output
In addition to the basic algorithmic processing required to synthesize a note, other signal processing may take place including audio filters, chorus and reverb, audio exciter, compressor/limiter, and equalization (EQ). These effects are often referred to as "audio sweeteners" and they can greatly enhance the quality of the audio. This is another opportunity for manufacturers to differentiate and add more value to their audio offerings.
Audio filters are used to vary the spectrum of the synthesizer to simulate changes in brightness, such as the natural decay of a piano string. A chorus is a delay line with a variable tap used to simulate multiple voices, providing a richer tone to brass and string section sounds. A reverb is a combination of delay lines and all-pass filters used to simulate the reverberation of different environments such as a concert hall or stadium. All of these effects are normally controlled on an individual instrument level. For example, the brass section can have chorus effects applied without affecting the piano.
An audio exciter brightens the audio by adding harmonics to fill in the upper frequency range, an effect that can help make up for harmonics that may be lacking in the original samples due to compression or size limitations. A compressor/limiter maximizes the output signal level by increasing the output gain when the overall volume of the synthesizer drops, which is a useful effect for a ringtone that needs to be heard in a noisy environment. EQ can be used to compensate for characteristics of the transducer and acoustics of the mobile device itself.
Size is Important
FM synthesizers use a purely algorithmic method of synthesizing instrument sounds, while sampling synthesizers use a mixture of algorithms and recorded audio. As a result, FM synthesizers usually require much less memory for storing instrument programs than a sampling synthesizer.
Since sample-based audio synthesis has at its core a wavetable of recorded sounds to drive the oscillators, the size and quality of the wavetable is crucial to the resulting quality of the synthesized sound. Therefore, the process of wavetable creation or selection is one of the most important aspects of a successful MIDI solution. After all, you can have the most elegant synthesizer design possible but if the samples you are playing back are of poor quality, the entire solution will sound bad.
The samples must be free of any background or player noise, of adequate dynamic range, and consistent in loudness and timbre across the range of notes sampled. It is not enough to have balance across one instruments scale; all instruments must also balance with each other when played together in the context of a musical piece. This requires more than engineering finesse and is process best undertaken by professionally trained musicians with highly discerning ears.
Once the instruments have been sampled, the resulting recordings need to be key-mapped. This is a process in which individual samples are assigned a range of notes they are used for playback on. After the key mapping process is completed, the task of "voicing" or adding the synthesizer control structures is done. This involves musical decisions and programming to take the final set of recordings to a playable state.
Time and velocity variant filters are added, amplitude envelopes to modulate the volume over time, pitch modulation, layering of sounds for synthesizer voices, etc. all are done at this stage. In order for the final wavetable to sound correctly with the standard MIDI files available, careful attention should be given to volume balancing and "mixing" the instrument set so it plays well in a multi-timbral, musical setting.
Small footprint wavetables for mobile handsets and audio players take on an extra set of important tasks that involve several techniques to reduce the size of the wavetable, while maintaining a high quality output. These tasks may involve pitch and time compression techniques, specialized looping and sampling rate reduction, equalization, and many others. To ensure the best results, special consideration should be given with small footprint wavetables to optimize them for the playback synthesizer and final product application.
Unlike a typical audio codec, MIDI does not guarantee a specific output for a specific input. Synthesizers from different vendors will produce different outputs from the same MIDI file based on their own algorithms and samples. Some vendors have addressed this issue with proprietary formats such as SMAF (Yamaha) and CMX (Qualcomm/Faith). While these proprietary standards allow the content author more control over the sound, they limit content to specific platforms that support the proprietary standard.
In contrast, General MIDI (GM) is a joint MIDI Manufacturers Association (MMA)/ Association of Musical Electronics Industry (AMEI) standard that defines a common set of 128 instruments and 47 percussion sounds and the means to select them on any platform that supports it. This gives the author of a music file some assurance that when his or her composition requires a violin, that the platform will attempt to reproduce a violin sound. General MIDI 2 increases the number of sounds available and further defines the behaviors of a compliant platform. However, even the combination of General MIDI and SMF files still cannot assure the quality of the sound that will be reproduced on a particular platform.
To address this limitation, the Downloadable Sounds (DLS) standard was jointly created by the MMA and AMEI to allow content authors to create files of instrument sounds that can be downloaded to a compliant synthesizer. DLS gives the author a standardized method to control the sound of the instruments used to reproduce a musical performance. DLS-2 increases the capability of DLS-compatible synthesizer and provides for both forward- and backward-compatibility. DLS-2 (under the moniker SASBF) was adopted by the MPEG standards body in a joint effort with the MMA as part of MPEG-4 Structured Audio.
Shortly after the DLS-1 standard was ratified, MMA/AMEI released the eXtensible Music Format (XMF) file format, which combines an SMF music file with a DLS file into a single encapsulated file. This format gives the author a way to deliver an audio performance in a single compact file that gives the listener a consistent playback experience on compatible platforms.
Given the push to open the mobile platform to more content, we can expect to see standards-based formats make significant inroads in the near future. Indeed, third-generation project partnership (3GPP) has been working with the MIDI organizations to standardize a new musical file format for mobile devices. To address this issue, a joint task group from the MMA, AMEI, and 3GPP approved the Mobile-DLS standard (mDLS) in September 2004. This is an extension of the Downloadable Sounds (DLS) standard intended for mobile applications.
Mobile-DLS is a subset of the DLS-2 standard that provides for different profiles based on the capabilities of the device. A Mobile-DLS file can be combined with a MIDI music file into a Mobile-XMF file, creating a single file that can be accurately reproduced on a compatible synthesizer. While Mobile-XMF does not fully specify the audio output of the synthesizer down to the bit level, it represents a big step towards giving users a consistent playback experience across different mobile platforms.
Finally, JSR-135 is a Java MP specification that provides a way for Java applications running on a mobile device to access the music synthesizer. Through predefined transport controls, this interface can be used in games to play audio sound tracks, or in music applications that allow the user to compose or "remix" audio.
Software-based audio synthesis requires a considerable amount of processor and memory bandwidth. The actual bandwidth required is highly influenced by the polyphony of the synthesizer (the number of simultaneous notes that can be synthesized), specifics of the algorithm (such as the sample rate), the need for additional signal processing stages (including individual voice filters, chorus or reverb effects), and post-processing enhancements (such as an audio exciter or compressor/limiter).
Perhaps the most important performance trade-off an embedded audio system designer or product manager will need to consider is quality versus quantity. Audio synthesizers are susceptible to the same marketing pressures that affect other technologies in the mobile market, and quite often it is a numbers game, with synthesizer polyphony being the most prevalent number bandied about.
It is possible to get more voices for the same processor bandwidth by reducing the complexity of a voice. As usual, this comes at a cost to quality, but that point may be moot if the user is listening to the audio through an 8mm transducer. Here are some tradeoffs to examine when playing the numbers game:
Sample rate is probably the single biggest contributor to audio quality. However, if the transducer specs are 300 Hz to 4 kHz +/-3dB, there is little point to running the synthesizer at 48 kHz. There is a direct relationship between the processor bandwidth used by the synthesizer engine and the sample rate. Of course, as the sample rate drops, other parts of the synthesizer become larger contributors to the overall performance.
Some synthesizer architectures feature a low-pass filter that can be controlled by the sound designer. This can be used to increase the overall quality of the instrument sounds. The filter uses considerable processor bandwidth in the synthesizer engine and eliminating it may reduce execution time by as much as a third. However, dropping the filter may require additional sample memory to properly synthesize certain types of sounds.
Stereo output can also be costly. While most of the signal path in a mobile synthesizer is monophonic, the final output stage uses a stereo pan control to steer audio output to left and right channels. Eliminating the stereo pan control reduces execution time by eliminating the control logic and MACs in the inner loop, reduces the memory footprint by cutting the buffer size in half, and reduces cache pollution as well.
ARM Performance Optimizations
In this section, we present some specific optimizations that can be used on the various ARM architectures to improve performance of an audio synthesizer, and in signal processing applications in general.
The most critical optimization is the use of registers for temporary results. The ARM architecture provides a modest number of registers for use by the application code. Load/store operations are costly in a processing loop: 3 cycles for an LDR instruction on an ARMv4 plus any memory wait-states; and 1-3 cycles for an LDR instruction on an ARMv5 depending on when the data is needed by the pipeline. Thus, the first order of business is to examine the algorithm to see if the temporary results can be re-arranged to avoid any load/store operations.
In some cases, it is possible to take a single monolithic algorithm that requires multiple load/stores for temporary results in the processing loop and break it up into independent processing loops where all the temporary results can be kept in registers. Even though intermediate results must still be stored in memory, the overall execution time is reduced significantly by avoiding multiple load/stores inside the processing loop. This turns out to be critical for the ARMv4 architecture with its three-stage pipeline because it does not have a memory or write pipeline stage as does the ARMv5 and later technology.
Signal processing typically requires a number of multiply or multiply-accumulate (MAC) instructions. On the ARMv4, the multiplier is a 32x8 bit array with early termination. This means that a 32x32 bit multiply takes place as a series of 32x8 multiplies that terminates when there are no more significant bits in the Rs register (i.e. the remaining bits are the same value as the sign bit). Thus, a 32x8 multiply takes 1 cycle while a 32x32 bit multiply takes 4 cycles.
One simple optimization for ARMv4 is to make sure that the smaller of the two numbers is always in the Rs register so you can take advantage of early multiply termination whenever possible. For example, if you have a filter that uses 16-bit coefficients, you can place the coefficient in the Rs register and the multiply will always take 2 cycles, whereas a perfectly random 32-bit input will average 2.5 cycles (audio is not perfectly random, but is typically normalized to reduce quantization effects). There is also a one cycle penalty is you use the multiplier on back to back instructions, and sometimes a simple instruction reorder will eliminate this penalty.
For ARMv5, the multiplier is a pipelined 32x16 bit array with no early termination. This means that a multiply will always take 2 cycles, however there is a one cycle penalty if the result is needed on the next instruction (other than for a multiply-accumulate). In most cases, a simple instruction reordering will eliminate this penalty.
The ARMv5 five-stage pipeline also allows some optimizations around load/store operations. As mentioned earlier, an LDR instruction on ARMv5 requires 3 cycles from the time the instruction is presented until the data is available. However, the pipelined memory interface allows the processor to continue processing other instructions as long as the target register is not used in the interim. By reordering instructions to place the LDR several cycles ahead of when the data is needed, the cycle count for an LDR instruction can effectively be reduced to a single cycle.
If the target of an LDR instruction is needed early in the loop, it may be necessary to unroll the loop. For example, consider the following code fragment: FilterLoop:
LDRSH r0, [r1], #2 ;load 16-bit input
MLA r0, r2, r0 ;perform filter op
SUBS r3, r3, #1 ;sample count
LDRSH r0, [r1], #2 ;load 16-bit input
The LDR instruction has an effective cost of 3 cycles because the MLA instruction uses the target register of the LDR instruction. The load/store operation needs to be pulled forward in the loop, but it is the first instruction in the loop. To pull it forward, it is necessary to partially unroll the loop:
MLA r0, r2, r0 ;perform filter op
SUBS r3, r3, #1 ;sample count
LDRNESH r0, [r1], #2 ;load 16-bit input
The LDRSH has been replaced with a LDRNESH instruction near the end of the loop. The NE suffix is required so that the instruction doesnt execute the last time through the loop, resulting in an input sample being thrown away. The BNE instruction has a cost of 3-cycles, reducing the effective cost of the original LDRSH instruction from 3 cycles to 1 cycle.
The LDRSH ahead of the loop primes the pump by prefetching the first input sample. The first MLA instruction incurs a 2-cycle penalty, but it may be possible to eliminate this penalty as well if there are other instructions ahead of the loop that can be inserted between the LDRSH and the MLA instructions.
High quality audio is creating another opportunity for mobile operators to differentiate their product offerings. Polyphonic ring tones allow users to personalize their mobile devices such as uniquely identifying callers based by the ringtone that sounds. The audio synthesizer is also providing new content plays including multimedia games and music composition application. Integrating a high-quality audio synthesizer into a mobile platform presents its own unique challenges, but the rewards can greatly boost the bottom line and improve listeners' audio experiences.
About the Author
Dave Sparks is Director of Engineering with Sonic Network, Inc. (www.sonicnetworkinc.com) with more than 25 years experience in embedded systems design. He chaired the MMA committee that drafted the DLS standard, authored the DLS-2 standard and served as liaison to the MPEG standard body during its adoption into the MPEG-4 standard. He can be reached at [dave dot sparks at sonicnetworkinc dot com].