Integrating High-Quality Audio into Mobile Design

Dave Sparks, Sonic Networks
Mar 29, 2005 (5:00 AM)

Multimedia functionality is becoming more important in handheld products, and consumers are demanding higher-fidelity audio in their wireless devices, PCs, games, software applications, and music synthesizers.

To meet this demand and enable high-quality music playback on a wide range of consumer devices, product designers must trade off performance, memory, and power consumption against user preferences and expectations. As a result, there are several key considerations when integrating audio synthesis into an embedded platform, including the functional components of the system, related standards, design options, memory issues, and performance trade-offs.

Differentiating to Compete
As consumers continually demand more features and functionality from their mobile handsets, mobile operators continue to look for new ways to differentiate product offerings and expand revenue streams. In turn, these operators drive handset and terminal manufacturers to provide products that offer consumers more capabilities, features, and options. At the same time, operators look to source handsets from manufacturers that can help differentiate their network from the wireless competition.

MIDI audio synthesis is one such feature: polyphonic ringtone capabilities have provided an opportunity both to differentiate products and to increase revenue. It is therefore becoming more important for embedded designers to understand the major issues and design techniques involved in integrating a high-quality audio synthesis solution into a mobile platform.

MIDI — The Format Standard
There are dozens of different file formats that have been developed over the years for storing data for audio synthesis. The most widely accepted standards-based format is the Standard MIDI File (SMF), a standard jointly overseen by the MIDI Manufacturers Association (MMA) and the Association of Musical Electronics Industry (AMEI).

The MIDI standard originated as a protocol to transmit a musical performance over a 31.25 kbit/s serial cable. With limited bandwidth available, the protocol comprises primarily control data such as "Note-On," signifying the moment when a performer presses a key on a musical keyboard, and "Note-Off," signifying when the key is released.
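
As a rough illustration of how compact this control data is, the sketch below builds the three-byte channel messages for Note-On and Note-Off and estimates how long one takes to send. The status byte values (0x90 and 0x80 plus the channel number) and the 10-bit serial framing come from the MIDI 1.0 specification; the function names are invented for this example.

```python
def _check(channel, note, velocity):
    # MIDI channels are 0-15 on the wire; data bytes are 7-bit values.
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("channel is 0-15; note and velocity are 0-127")


def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note-On message (status byte 0x9n)."""
    _check(channel, note, velocity)
    return bytes([0x90 | channel, note, velocity])


def note_off(channel, note, velocity=0):
    """Build a 3-byte MIDI Note-Off message (status byte 0x8n)."""
    _check(channel, note, velocity)
    return bytes([0x80 | channel, note, velocity])


def wire_time_ms(message, bits_per_second=31250):
    """Time to send a message over the serial link.

    Each byte costs 10 bits: 1 start bit, 8 data bits, 1 stop bit.
    """
    return len(message) * 10 * 1000.0 / bits_per_second
```

At 31.25 kbit/s a single Note-On therefore occupies the link for just under a millisecond, which is why the protocol carries performance gestures rather than audio.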

The SMF format was created in the 1980s as a means to capture the MIDI data stream in a file for editing on a computer. Events are stored in a stream format with the addition of delta timestamps that mark the amount of time since the last event. The file format is very compact, with a typical file size between 10 and 100KB. A similar piece stored in a perceptual audio coder format such as MP3 would be about 4MB. While MIDI behaves similarly to a codec at the receiver end, an encoder that can convert an arbitrary audio stream into MIDI is impractical today.
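
Part of what keeps the files small is that each delta timestamp is stored as a variable-length quantity: seven bits of value per byte, with the high bit set on every byte except the last. The decoder below is a minimal sketch of that encoding as described in the SMF specification; the function name and the (value, next position) return convention are choices made for this example.

```python
def read_delta_time(data, pos=0):
    """Decode one variable-length quantity starting at data[pos].

    Returns a tuple (value, next_pos).
    """
    value = 0
    for _ in range(4):                  # SMF limits a quantity to 4 bytes
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:          # a clear high bit marks the last byte
            return value, pos
    raise ValueError("variable-length quantity longer than 4 bytes")
```

Short gaps between events, which are the common case, cost a single byte.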

Synthesizing Audio
MIDI is able to achieve these levels of "compression" because the audio data itself is not stored in the file, only the actions of the performer. The concept is similar to a player piano roll, with its coded instructions for which notes to play, when to play them, how hard, and for how long. The audio synthesis engine then translates this data into audio-producing actions using a synthesis algorithm. Two forms of synthesis are common in mobile platforms today: frequency modulation (FM) synthesis and sample-based synthesis, or more simply, sampling.

FM synthesis uses a purely algorithmic technique of modulating a carrier signal with a modulator. The resulting output is a rich spectrum of sound created by the sums and differences of the two frequencies. By varying the amount of modulation applied to the carrier over time, the spectrum can be manipulated to imitate real instruments or to create new synthetic sounds. Yamaha popularized this synthesis technique and incorporates it into its MA series of MIDI synthesis ICs.
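
The idea can be sketched with a single carrier and a single modulator. The code below uses the phase-modulation form that is commonly used to implement FM synthesis, and ramps the modulation index linearly so the spectrum changes over the length of the note. It is an illustration only: the function name, the linear ramp, and the floating-point math are assumptions, not a description of any particular vendor's engine.

```python
import math


def fm_tone(carrier_hz, modulator_hz, index_start, index_end,
            sample_rate, num_samples):
    """Generate one FM voice with a modulation index that ramps linearly."""
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        # The modulation index sets how far the modulator pushes the carrier
        # phase, and therefore how bright the resulting spectrum is.
        index = index_start + (index_end - index_start) * n / num_samples
        modulator = index * math.sin(2.0 * math.pi * modulator_hz * t)
        out.append(math.sin(2.0 * math.pi * carrier_hz * t + modulator))
    return out
```

With an index of zero the output is a plain sine wave at the carrier frequency; larger index values add sidebands around it.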

In contrast, a sampling synthesizer uses recordings of actual instruments as well as synthetic sounds. By varying the playback speed through interpolation, a single recording can be used to synthesize a range of frequencies. The sound is often further manipulated using filters to dynamically vary the output spectrum.
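
A minimal sketch of variable-speed playback follows, assuming linear interpolation between neighboring samples (real engines may use higher-order interpolation and looping, which are omitted here). The function and parameter names are illustrative.

```python
def resample_linear(sample, ratio, num_out):
    """Read `sample` at `ratio` times its recorded speed.

    A ratio of 2.0 plays an octave up, 0.5 an octave down. Playback stops
    early if the read position runs off the end of the recording.
    """
    out = []
    phase = 0.0                          # fractional read position
    for _ in range(num_out):
        i = int(phase)
        if i + 1 >= len(sample):
            break
        frac = phase - i
        # Blend the two nearest recorded samples.
        out.append(sample[i] + frac * (sample[i + 1] - sample[i]))
        phase += ratio
    return out
```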

Generally, sampling synthesizers produce more realistic-sounding instruments, but the realism comes at the cost of additional read-only memory (ROM) for the sample library. FM synthesizers require less memory to store the algorithm parameters for their sounds, but their signal processing requirements tend to be much higher.

Audio Synthesis Components
The process of reading a MIDI file and synthesizing an audio output stream from it can be broken into three distinct components: the file parser, the MIDI interpreter, and the synthesis engine.

The file parser reads MIDI data from a file or input stream and reconstructs the timeline from the delta timestamps stored in the file. Timestamps are generally specified relative to the tempo of the musical piece although they can also be specified relative to the Society of Motion Pictures and Television Engineers (SMPTE) time code. The file parser converts the relative timestamps in the file to absolute time so that events can be fed to the MIDI interpreter at the appropriate time.
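
The conversion for tempo-relative timestamps can be sketched as follows. The event tuples and the "tempo" event kind are a simplified, hypothetical representation invented for this example; the default tempo of 500,000 microseconds per quarter note (120 beats per minute) is the SMF default.

```python
def to_absolute_seconds(events, ticks_per_quarter, tempo_us=500000):
    """Convert (delta_ticks, kind, value) events to (seconds, kind, value).

    tempo_us is microseconds per quarter note. A "tempo" event changes it
    for every event that follows.
    """
    now = 0.0
    timeline = []
    for delta_ticks, kind, value in events:
        # The delta is measured at the tempo in force before this event.
        now += delta_ticks * tempo_us / (ticks_per_quarter * 1_000_000)
        if kind == "tempo":
            tempo_us = value
        else:
            timeline.append((now, kind, value))
    return timeline
```

For example, with 480 ticks per quarter note, a delta of 480 ticks lasts 0.5 s at the default tempo and 0.25 s after a tempo event sets 250,000 microseconds per quarter note.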

The MIDI interpreter acts on the performance data in the MIDI stream. For example, when a "Note-On" event is received, the MIDI interpreter must locate the algorithm parameters that characterize the musical instrument to be synthesized, allocate the resources (a "voice") needed to synthesize the note, and start the synthesis process.

The performance data may occasionally request more voices than are available, in which case the MIDI interpreter must determine which notes have priority. "Voice stealing" occurs when an active voice is reallocated to synthesize a new note.
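
The sketch below shows one possible policy, stealing the oldest active voice. This is an assumption made for illustration; practical interpreters often weigh other factors such as channel priority, note volume, or whether a note is already in its release phase. The class and method names are hypothetical.

```python
class VoicePool:
    """Fixed-size pool of voices that steals the oldest one when full."""

    def __init__(self, max_voices):
        self.max_voices = max_voices
        self.active = []                 # active notes, oldest first

    def note_on(self, note):
        """Start a note. Returns the note that was stolen, or None."""
        stolen = None
        if len(self.active) >= self.max_voices:
            stolen = self.active.pop(0)  # reallocate the oldest voice
        self.active.append(note)
        return stolen

    def note_off(self, note):
        """Release the voice playing `note`, if there is one."""
        if note in self.active:
            self.active.remove(note)
```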

The synthesis engine receives control data from the MIDI interpreter and synthesizes the audio based on the supplied parameters ("program") and, in the case of a sampling synthesizer, the sample data. The output of all the voices is mixed together based on the MIDI controls to render the final audio output.

Sweetening the Audio Output
In addition to the basic algorithmic processing required to synthesize a note, other signal processing may take place including audio filters, chorus and reverb (typically called "side-chain effects"), audio exciter, compressor/limiter, and equalization (EQ) (typically called "post-processing effects"). These effects are often referred to as "audio sweeteners" and they can greatly enhance the quality of the audio. This is another opportunity for manufacturers to differentiate and add more value to their audio offerings.

Audio filters are used to vary the spectrum of the synthesizer to simulate changes in brightness, such as the natural decay of a piano string. A chorus is a delay line with a variable tap used to simulate multiple voices, providing a richer tone to brass and string section sounds. A reverb is a combination of delay lines and all-pass filters used to simulate the reverberation of different environments such as a concert hall or stadium. All of these effects are normally controlled on an individual instrument level. For example, the brass section can have chorus effects applied without affecting the piano.
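
As a simple illustration of how a filter changes brightness, the sketch below implements a one-pole low-pass filter. It stands in for whatever filter a given synthesizer actually uses, which is typically more elaborate; the function name and coefficient convention are assumptions made for this example.

```python
def one_pole_lowpass(samples, coeff):
    """Smooth a signal with a one-pole low-pass filter.

    coeff is in the range (0, 1]. A value of 1.0 passes the input through
    unchanged; smaller values remove more high-frequency content.
    """
    out = []
    state = 0.0
    for x in samples:
        # Move the output a fraction of the way toward the input.
        state += coeff * (x - state)
        out.append(state)
    return out
```

Lowering the coefficient over the life of a note darkens the tone gradually, in the manner of a decaying piano string.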

An audio exciter brightens the audio by adding harmonics to fill in the upper frequency range, an effect that can help make up for harmonics that may be lacking in the original samples. A compressor/limiter maximizes the output signal level by increasing the output gain when the overall volume of the synthesizer drops, which is a useful effect for a ringtone that needs to be heard in a noisy environment. EQ can be used to compensate for characteristics of the transducer and acoustics of the mobile device itself.
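
The gain behavior of a compressor/limiter can be approximated very crudely by scaling each block of output toward a target peak and clamping the result. This is a simplified stand-in, not a production design: a real compressor/limiter smooths its gain changes over time to avoid audible pumping. All names and default values below are illustrative assumptions.

```python
def normalize_block(block, target_peak=0.9, max_gain=4.0):
    """Raise the gain of a quiet block, with a hard limit at full scale.

    Samples are floats in the range -1.0 to 1.0.
    """
    peak = max((abs(x) for x in block), default=0.0)
    if peak == 0.0:
        return list(block)               # silence: nothing to boost
    # Quieter blocks get more gain, up to max_gain.
    gain = min(target_peak / peak, max_gain)
    # Clamp as a final safety net against overshoot.
    return [max(-1.0, min(1.0, x * gain)) for x in block]
```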

Performance Optimization
Software-based audio synthesis requires a considerable amount of processor bandwidth. The actual bandwidth required is highly influenced by the polyphony of the synthesizer (the number of simultaneous notes that can be synthesized), specifics of the algorithm (such as the sample rate), the need for additional signal processing stages (including individual voice filters, chorus or reverb effects), and post-processing enhancements (such as an audio exciter or compressor/limiter).

The specifications of the processor architecture are very important. The number of registers, availability of zero wait-state memory such as cache or tightly-coupled memory (TCM), and signal processing capability (such as multiply-accumulate operation [MAC] pipelines and saturating arithmetic) can all significantly influence performance.
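
To show why saturating arithmetic matters, the sketch below models a 16-bit multiply-accumulate in Python, once with saturation and once with the wraparound that plain integer arithmetic would produce. The Q15 fixed-point format and the function names are assumptions chosen for illustration; actual DSP hardware usually accumulates in a wider register before saturating.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(x):
    """Clamp a value to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def wrap16(x):
    """Wrap a value to signed 16 bits, as unchecked arithmetic would."""
    return ((x + 32768) & 0xFFFF) - 32768


def mac_q15(acc, a, b):
    """Saturating multiply-accumulate: acc + a * b in Q15 fixed point."""
    return saturate16(acc + ((a * b) >> 15))


def mac_q15_wrapping(acc, a, b):
    """The same operation without saturation, for comparison."""
    return wrap16(acc + ((a * b) >> 15))
```

When the sum overflows, the saturating version clips to full scale, while the wrapping version flips sign and produces a loud click. Without hardware saturation, software must add explicit range checks to the inner loop.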

The bulk of the code in a software-based synthesizer is the control logic in the file parser and MIDI interpreter. This code represents 5 to 20% of the overall execution time of the synthesizer, runs well on a 32-bit general purpose processor, and benefits from both instruction and data cache.

The code for the synthesizer engine is usually much smaller than the control code, but represents 80 to 95% of the overall execution time of the synthesizer. An engine designed specifically for embedded applications consists of small loops of a few hundred bytes, each executing for tens to hundreds of cycles at a time. Due to its small size, the engine code is not significantly affected by cache pollution, nor does it contribute to it. If no cache is available, locating the synthesizer engine code in TCM will likely double the performance of the synthesizer engine.

Due to the nature of the signal processing in the engine code, it will also benefit from a MAC pipeline and saturating arithmetic. If DSP bandwidth is available, it may make sense to offload this code to a DSP, which is usually more efficient at executing signal processing algorithms. If the control code is to run on a separate general-purpose processor, some consideration will have to be given to moving the processed control data down to the DSP to control the synthesis engine.

Sampling synthesizers also access a large amount of sample data, which is typically stored in ROM, from inside the synthesizer engine inner loop. Access tends to occur in periodic sequential reads. Making the sample data cacheable can result in a significant performance increase, as a typical 32-byte cache line holds enough data to keep the inner loop running at zero wait-states for many iterations. Assuming that instruction and read-write data are already cached, enabling cache for sample data may nearly double the performance. While sample data is not very susceptible to cache pollution, it does contribute to it, as sample data is typically used once or twice in a loop and then it may be many more iterations before it is used again.
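
The cache-line arithmetic can be made concrete with a short calculation. It assumes 16-bit samples unless told otherwise (the sample width is an assumption, not stated above) and that the inner loop advances through the sample data by the pitch ratio on each iteration; the function name is illustrative.

```python
def iterations_per_cache_line(line_bytes=32, bytes_per_sample=2,
                              pitch_ratio=1.0):
    """Average inner-loop iterations served by one cache-line fill.

    pitch_ratio is how many stored samples the loop advances per output
    sample (1.0 means playback at the recorded pitch).
    """
    samples_per_line = line_bytes / bytes_per_sample
    return samples_per_line / pitch_ratio
```

Under these assumptions a 32-byte line serves about 16 iterations at the recorded pitch, and about 32 iterations for a note played an octave lower.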

Performance Trade-offs
Perhaps the most important performance trade-off an embedded audio system designer or product manager will need to consider is quality versus quantity. Audio synthesizers are susceptible to the same marketing pressures that affect other technologies in the mobile market, and quite often it is a numbers game, with synthesizer polyphony being the most prevalent number bandied about.

It is possible to get more voices for the same processor bandwidth by reducing the complexity of a voice. As usual, this comes at a cost to quality, but that point may be moot if the user is listening to the audio through an 8mm transducer. Here are some trade-offs to examine when playing the numbers game:

Sample rate is probably the single biggest contributor to audio quality. However, if the transducer specs are 300 Hz to 3 kHz +/-3dB, there is little point to running the synthesizer at 48 kHz. There is a direct relationship between the processor bandwidth used by the synthesizer engine and the sample rate. Of course, as the sample rate drops, other parts of the synthesizer become larger contributors to the overall performance.

Some synthesizer architectures feature a low-pass filter that can be controlled by the sound designer. This can be used to increase the overall quality of the instrument sounds. The filter uses considerable processor bandwidth in the synthesizer engine and eliminating it may reduce execution time by as much as 35%. However, dropping the filter may require additional sample memory to properly synthesize certain types of sounds.

Stereo output can also be costly. While most of the signal path in a mobile synthesizer is monophonic, the final output stage uses a stereo pan control to steer audio output to left and right channels. Eliminating the stereo pan control reduces execution time by eliminating the control logic and MACs in the inner loop, reduces the memory footprint by cutting the buffer size in half, and reduces cache pollution as well.
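
The cost of the stereo pan stage can be seen by comparing a mono mixer with a stereo one. The sketch below assumes a linear pan law and an interleaved left/right output buffer; both are illustrative choices, as are the function names.

```python
def mix_mono(voices):
    """Sum equal-length voice buffers into a single mono buffer."""
    return [sum(frame) for frame in zip(*voices)]


def mix_stereo(voices, pans):
    """Mix voices into an interleaved left/right buffer.

    pans holds one value per voice, from 0.0 (hard left) to 1.0 (hard
    right). The output is twice the length of the mono buffer.
    """
    n = len(voices[0])
    out = [0.0] * (2 * n)
    for voice, pan in zip(voices, pans):
        left_gain, right_gain = 1.0 - pan, pan
        for i, x in enumerate(voice):
            # Two multiply-accumulates per sample instead of a plain add.
            out[2 * i] += x * left_gain
            out[2 * i + 1] += x * right_gain
    return out
```

The stereo version performs two multiply-accumulates per voice per sample and writes a buffer twice as large, which is where the savings from dropping the pan control come from.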

Size is Important
FM synthesizers use a purely algorithmic method of synthesizing instrument sounds, while sampling synthesizers use a mixture of algorithms and recorded audio. As a result, FM synthesizers usually require much less memory for storing instrument programs than a sampling synthesizer.

Since sample-based audio synthesis has at its core a wavetable of recorded sounds to drive the oscillators, the size and quality of the wavetable are crucial to the resulting quality of the synthesized sound. Therefore, the process of wavetable creation or selection is considered by many to be the most important aspect of a successful MIDI solution. After all, you can have the most elegant synthesizer design possible, but if the samples you are playing back are of poor quality, the entire solution will sound bad.

The samples must be free of any background or player noise, have adequate dynamic range, and be consistent in loudness and timbre across the range of notes sampled. It is not enough to have balance across one instrument's scale; all instruments must balance with each other when played together in the context of a musical piece. This requires more than engineering finesse and is often a process undertaken by professionally trained musicians with highly discerning ears.

Once the instruments have been sampled, the resulting recordings need to be key-mapped. In this process, each sample is assigned the range of notes for which it will be played back. After the key mapping is completed, the task of "voicing," or adding the synthesizer control structures, is done. This involves musical decisions and programming to take the final set of recordings to a playable state.
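
A key map can be sketched as a table of note ranges, each pointing at a recording and the note at which it was recorded. The lookup below returns the recording to use and the playback-speed ratio needed to reach the requested note, using the equal-tempered relationship of a factor of two per twelve semitones. The tuple layout and the sample names are hypothetical.

```python
def find_zone(keymap, note):
    """Look up the sample and playback ratio for a MIDI note number.

    keymap is a list of (low_note, high_note, root_note, sample_name).
    Returns (sample_name, ratio), where ratio is the playback speed
    relative to the speed at which the sample was recorded.
    """
    for low, high, root, name in keymap:
        if low <= note <= high:
            # Each semitone away from the root scales speed by 2^(1/12).
            return name, 2.0 ** ((note - root) / 12.0)
    raise KeyError("no sample mapped for note %d" % note)
```

Wider zones mean fewer recordings and less ROM, at the cost of larger pitch shifts and a less natural sound toward the edges of each zone.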

Several things are done at this stage: time- and velocity-variant filters are added, amplitude envelopes are defined to modulate the volume over time, pitch modulation is applied, and sounds are layered to build synthesizer voices. For the final wavetable to sound correct with the standard MIDI files available, careful attention should be given to volume balancing and "mixing" the instrument set so that it plays well in a multi-timbral, musical setting.
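
An amplitude envelope of the kind added during voicing can be sketched as a sequence of gain values, one per output sample. The example below assumes a simple attack, decay, sustain, and release shape with linear segments; actual synthesizers often use exponential segments, and all names here are illustrative.

```python
def adsr(attack, decay, sustain_level, sustain, release):
    """Build a linear envelope as a list of per-sample gains.

    attack, decay, sustain, and release are segment lengths in samples;
    sustain_level is the gain held during the sustain segment.
    """
    env = []
    # Attack: rise from 0.0 toward 1.0.
    env += [i / attack for i in range(attack)]
    # Decay: fall from 1.0 toward the sustain level.
    env += [1.0 - (1.0 - sustain_level) * i / decay for i in range(decay)]
    # Sustain: hold while the note is held.
    env += [sustain_level] * sustain
    # Release: fall from the sustain level toward silence.
    env += [sustain_level * (1.0 - i / release) for i in range(release)]
    return env
```

Multiplying each output sample of a voice by the matching envelope value shapes its volume over time.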

Small-footprint wavetables for mobile handsets and audio players require an extra set of tasks aimed at reducing the size of the wavetable while maintaining high-quality output. These tasks may involve pitch and time compression techniques, specialized looping, sampling rate reduction, equalization, and many others. To ensure the best results, small-footprint wavetables should be optimized for the playback synthesizer and the final product application.

Related Standards
Unlike a typical codec, MIDI does not guarantee a specific output for a specific input. Synthesizers from different vendors will produce different outputs from the same MIDI file based on their own algorithms and samples. Some vendors have addressed this issue with proprietary formats such as SMAF (Yamaha) and CMX (Qualcomm/Faith). While these proprietary standards allow the content author more control over the sound, they limit content to specific platforms that support the proprietary standard.

In contrast, General MIDI (GM) is a joint MIDI Manufacturers Association (MMA)/Association of Musical Electronics Industry (AMEI) standard that defines a common set of 128 instruments and 47 percussion sounds and the means to select them on any platform that supports it. This gives the author of a music file some assurance that when his or her composition requires a violin, the platform will attempt to reproduce a violin sound. General MIDI 2 increases the number of sounds available and further defines the behaviors of a compliant platform. However, even the combination of General MIDI and SMF files still cannot assure the quality of the sound that will be reproduced on a particular platform.

To address this limitation, the Downloadable Sounds (DLS) standard was jointly created by the MMA and AMEI to allow content authors to create files of instrument sounds that can be downloaded to a compliant synthesizer. DLS gives the author a standardized method to control the sound of the instruments used to reproduce a musical performance. DLS-2 increases the capability of DLS-compatible synthesizers and provides for both forward and backward compatibility. DLS-2 (under the moniker SASBF) was adopted by the MPEG standards body in a joint effort with the MMA as part of MPEG-4 Structured Audio.

Shortly after the DLS-1 standard was ratified, the MMA and AMEI released the eXtensible Music Format (XMF), which combines an SMF music file with a DLS file in a single encapsulated file. This format gives the author a way to deliver an audio performance in a single compact file that gives the listener a consistent playback experience on compatible platforms.

Given the push to open the mobile platform to more content, we can expect to see standards-based formats make significant inroads in the near future. Indeed, the 3rd Generation Partnership Project (3GPP) has been working with the MIDI organizations to standardize a new musical file format for mobile devices. To that end, a joint task group from the MMA, AMEI, and 3GPP approved the Mobile-DLS standard (mDLS) in September 2004. This is an extension of the Downloadable Sounds (DLS) standard intended for mobile applications.

Mobile-DLS is a subset of the DLS-2 standard that provides for different profiles based on the capabilities of the device. A Mobile-DLS file can be combined with a MIDI music file into a Mobile-XMF file, creating a single file that can be accurately reproduced on a compatible synthesizer. While Mobile-XMF does not fully specify the audio output of the synthesizer down to the bit level, it represents a big step towards giving users a consistent playback experience across different mobile platforms.

Finally, JSR-135, the Mobile Media API, is a Java ME specification that provides a way for Java applications running on a mobile device to access the music synthesizer. Through predefined transport controls, this interface can be used in games to play audio soundtracks, or in music applications that allow the user to compose or "remix" audio.

Wrap Up
High-quality audio is creating another opportunity for mobile operators to differentiate their product offerings. Polyphonic ringtones allow users to personalize their mobile devices, for example by identifying callers by the ringtone that sounds. The audio synthesizer is also opening up new content opportunities, including multimedia games and music composition applications. Integrating a high-quality audio synthesizer into a mobile platform presents its own unique challenges, but the rewards can greatly boost the bottom line and improve listeners' audio experiences.

About the Author
Dave Sparks is a senior software architect with Sonic Network, Inc. with more than 25 years of experience in embedded systems design. He chaired the MMA committee that drafted the DLS standard, authored the DLS-2 standard, and served as liaison to the MPEG standards body during its adoption into the MPEG-4 standard. He can be reached at sax_man@pacbell.net.
