Hype has been building around wireless video for the past few years. With 2.5G and 3G systems on the way, many have started to view the delivery of video content to mobile phones as one of the killer apps.
The challenge, however, is making this work. Streaming video to a mobile phone places huge strains on the processing engine within these systems. Designers looking to add applications like MPEG-4 to a mobile product face a tough trade-off between power consumption and performance when building baseband solutions.
One solution to this dilemma could emerge in the form of compression technologies. Sophisticated video compression techniques, such as MPEG-4, are needed to make wireless video happen. But, these compression techniques also bring their share of challenges. The article that follows will explore the benefits and challenges that compression brings to a mobile phone architecture.
The truth about compression
Sophisticated video compression standards, like MPEG-4 and H.263, make real-time video streaming on wireless handsets a reality, but not without some complexity. Video compression standards implement several computationally demanding techniques: motion estimation (ME) between frames, which encodes temporal redundancy; the energy-compacting discrete cosine transform (DCT), which encodes spatial redundancy; quantization; and entropy encoding.
Although these compression techniques help fit video streams into the bandwidth available in wireless communication channels, they also raise a number of issues that affect the memory, computational capabilities, and internal data transfer channels of wireless communication devices. In turn, how these issues are resolved will affect the cost-effectiveness of the device and the useful life of its battery.
In addition, video compression makes an image highly sensitive to errors in the bit stream, and the wireless communication environment is highly prone to interference that introduces such errors. Because compression algorithms remove much of the redundancy in video data, the effects of channel interference ripple through not just the current image being displayed, but also successive images: the predictive coding techniques used in MPEG and other compression algorithms cause any error in a reconstructed video frame to propagate through time into future frames. Errors can also cause the video decoder to lose synchronization.
Newer compression standards like MPEG-4 have devised a number of techniques to compensate and overcome many of the errors encountered in a typical digital video bit stream. These error resilience tools enable detection, containment, and concealment of errors. These tools are:
- Resynchronization markers (RM): With earlier video compression techniques, resynchronization points were inserted at the beginning of each row of macroblocks, the 16 x 16-pixel tiles that make up the image. If a channel error was encountered in the encoded data, a complete row of the image could be lost. And because the most complex portions of an image require the most bits, an error in the bit stream was most likely to fall in, and wipe out, the most complex portion of the image.
With MPEG-4, a new technique places RMs evenly throughout the bit stream. The data between two RMs is known as a video packet (VP), and it corresponds to approximately one row of the image. In this way, an error can only cause the loss of the bits in the one VP that contains it.
- Header extension codes (HEC): In video encoded data, the most critical information, such as image type, time stamps, and coding types, is contained in the headers. With earlier compression techniques, an error in the header would cause the loss of a complete image.
An HEC is a single bit that is placed in every VP. When the HEC is set to one, the header data is repeated in that VP, so every VP can be decoded independently. As a result, an error in a header does not cause the loss of an entire image.
- Data partitioning (DP): VPs contain macroblocks that can be made up of two different kinds of video data: motion data or texture data. The DP technique separates motion from texture data with a certain type of RM, which is called a motion marker. Because of this technique, a decoder can resynchronize and recover uncorrupted data partitions when an error is encountered in either a motion or texture DP without discarding all of the information between the two RMs.
- Reversible variable length coding (RVLC): Texture data is encoded with variable-length code words, and the decoder recovers the texture coefficients from the bit stream. When an inconsistent code word is found, the video packet (or, if DP is being used, the texture data in the video packet) is declared corrupt and discarded. With RVLC, texture data can be decoded in either a forward or a backward direction. When an inconsistent code word is encountered, the decoder can resynchronize at the next RM and decode backwards, recovering as much data as possible.
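The bidirectional property behind RVLC can be illustrated with a toy code, sketched here in Python. The code words below are palindromes, so a single prefix-free table decodes the bit string in either direction; this table is purely hypothetical, not the actual MPEG-4 RVLC table.

```python
# Toy reversible code: palindromic, prefix-free code words, so the same
# greedy decoder works on the bit string read forwards or backwards.
CODE = {"0": "a", "101": "b", "111": "c"}  # hypothetical symbol table

def decode(bits):
    """Greedy prefix decode; returns the recovered symbols."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in CODE:
            out.append(CODE[buf])
            buf = ""
    return out

bits = "0101111"                 # encodes a, b, c
forward = decode(bits)           # ['a', 'b', 'c']
backward = decode(bits[::-1])    # ['c', 'b', 'a']: same data, read from the end
```

On an error, a real decoder keeps the symbols recovered in the forward pass, jumps to the next RM, and runs the backward pass from there, salvaging data on both sides of the corruption.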
In addition to these error resilience tools, MPEG-4 Annex E also provides some guidelines for error detection. For example, a semantic error is detected when more than 64 DCT coefficients are decoded for a block (a block is of size 8 x 8 pixels and hence has exactly 64 DCT coefficients) and when there is inconsistency in resynchronization header information (for example, the quantization parameter in a video packet header is out of range).
Error concealment techniques
The error resilience tools described in the previous section help isolate transmission errors in the video bit stream. Once the errors have been isolated, error concealment must be applied to estimate the macroblocks lost to transmission errors.
Error concealment techniques are outside of the scope of the MPEG-4 specification. They are more properly thought of as post-processing techniques that follow the activity of the video decoder. The combination of error resilience tools and various error concealment techniques is improving the quality of streaming video transmission over wireless communications channels.
System designers must be aware that error concealment techniques can have deleterious effects on performance unless their demands are taken into account when the architecture of the system is developed. Concealment requires significant processing power, and errors in wireless communication channels typically occur in bursts. Some of the most commonly adopted error concealment techniques are described below:
- Temporal prediction based techniques: These techniques make use of temporal information (such as the information in previous video frames) to conceal the lost macroblocks. The simplest temporal prediction technique is macroblock copy, in which corrupted macroblocks are replaced with macroblocks from the previous frame. In practice this technique works quite satisfactorily when the amount of motion in the video sequence is low, such as the head-and-shoulders sequences typical of videoconferencing. More sophisticated techniques use the motion vector of the macroblock to copy the motion-compensated macroblock from the previous frame. If the motion vector of the lost macroblock is not available, it has to be estimated from the motion vectors of the neighboring macroblocks.
- Spatial interpolation: In this technique, the lost blocks are interpolated from neighboring correctly received blocks. Spatial interpolation can be carried out in the pixel domain as well as in the DCT domain. When the lost block is far from the correctly received blocks, it is typically replaced by a constant equal to the mean value of the neighboring correctly received blocks. This operation can be carried out more efficiently in the DCT domain: of a block's 64 DCT coefficients, the first (the DC coefficient) gives the mean value of the block, so the lost block's DC coefficient is simply estimated from the DC coefficients of the neighboring blocks.
- Spatio-temporal techniques: These techniques are more complicated and make use of both the spatial and temporal information to conceal a lost macroblock. More details on error resilience and concealment techniques can be found in Y. Wang's article, "Error resilient video coding techniques."1
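The two simplest strategies above can be sketched as follows, on frames stored as 2-D lists of luma samples. The details (frame representation, median-based motion vector estimate) are illustrative assumptions, not a prescribed implementation.

```python
# Sketches of temporal and spatial (DCT-domain DC) concealment.
from statistics import median

MB = 16  # macroblock size in pixels

def estimate_mv(neighbour_mvs):
    """Estimate a missing motion vector as the per-component median
    of the neighbouring macroblocks' vectors."""
    return (median(v[0] for v in neighbour_mvs),
            median(v[1] for v in neighbour_mvs))

def conceal_temporal(prev, cur, mbx, mby, mv=(0, 0)):
    """Copy the (motion-compensated) macroblock from the previous frame.
    With mv == (0, 0) this reduces to plain macroblock copy."""
    dx, dy = mv
    for y in range(MB):
        for x in range(MB):
            cur[mby * MB + y][mbx * MB + x] = prev[mby * MB + y + dy][mbx * MB + x + dx]

def conceal_spatial_dc(neighbour_dcs):
    """Replace a lost 8x8 block by a flat block: DC coefficient set to the
    mean of the neighbours' DCs, all 63 AC coefficients zero."""
    coeffs = [0.0] * 64
    coeffs[0] = sum(neighbour_dcs) / len(neighbour_dcs)
    return coeffs
```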
Concealing these errors means that the MIPS load placed on the system's processor will rise and fall sharply as errors are encountered. Concealment techniques also increase data transfers from memory, because the macroblocks used to interpolate a replacement for the erroneous ones are stored in memory that is external to the processor.
Migrating to 2.5/3G
The vast majority of today's voice-only (2G) wireless communications devices were originally based on a dual-processor architecture. A digital signal processor (DSP) handled many of the communications tasks, such as modulating and demodulating the bit stream, coding and decoding to maintain the robustness of the communications link despite transmission bit errors, encrypting and decrypting for security, and compressing and decompressing the signal. The second processor was a general-purpose processor, which handled the user interface and the upper layers of the communication protocol stack.
The basic dual-processor architecture of 2G will migrate to data-centric 2.5 and 3G devices, but it will require some significant enhancements to handle demanding multimedia applications like streaming video. As the computational and other capabilities of a wireless system increase drastically to meet the requirements of streaming video applications, the partitioning of tasks between the two processors becomes increasingly critical for several reasons. System throughput is more efficient when tasks are assigned to the processor best suited to them. But, just as important as system throughput, an effective partitioning of tasks will reduce power consumption and extend the system's battery life.
The most effective way to reduce power consumption is to limit the number of processor cycles devoted to every task. If more processor cycles are needed for a particular task, power consumption increases.
New 2.5 and 3G applications, such as streaming video and others, will change the nature of wireless communication devices. Designers of wireless platforms should be concerned about maintaining a high degree of flexibility as consumers will seek to download applications from the Internet onto their new handheld wireless systems. The handset, in a sense, will become an open application platform. It is incumbent upon designers to take this need for flexibility into account when designing next generation platforms.
Meeting the processing requirements
Figure 1 illustrates a proposed baseband architecture for a 2.5/3G mobile phone. The processing involved in streaming video applications can be divided into roughly two types of functions: control and transport (CT), which involves real-time streaming protocol (RTSP) session control and real-time transport protocol (RTP) media transport; and media decode (MD), which involves media decoding, error concealment, and other ancillary signal processing steps such as echo cancellation and others.
The CT and MD functions have different processing requirements. CT is not computationally intense and mainly involves string parsing, data packet manipulation, and finite state machine implementation. A microcontroller (MCU) is best suited for these types of tasks. The MD functionality is much more computationally intense because of the sophisticated signal processing required by audio and video coding algorithms. A high-performance, low-power DSP is better suited for MD functions.
Figure 2 shows an efficient way to partition a streaming video application on a dual-processor platform. Note: Both RTSP and RTP are proposed Internet standards.
In Figure 2, RTSP and RTP are layered on TCP/UDP/IP. RTSP handles the description, setup, control, and tear down of streaming sessions. RTP manages the transport of media and provides sequencing information that is helpful in detecting packet losses. In addition, RTP supplies timestamps and payload identification information as well as a real-time control protocol (RTCP), which is used for QoS feedback and inter-media synchronization information. RTSP can be layered over both TCP and UDP, while RTP is almost always layered only over UDP.
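The sequencing information that makes packet-loss detection possible sits in the fixed RTP header. The sketch below parses the 12-byte header laid out in RFC 1889 (the RTP specification of this era) and derives the loss count from the sequence-number gap; the example payload-type and SSRC values are arbitrary.

```python
# Sketch of RTP fixed-header parsing and sequence-based loss detection.
import struct

def parse_rtp_header(pkt):
    """Unpack the 12-byte RTP fixed header (network byte order)."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", pkt[:12])
    return {"version": b0 >> 6,
            "marker": (b1 >> 7) & 1,
            "payload_type": b1 & 0x7F,
            "seq": seq,
            "timestamp": ts,
            "ssrc": ssrc}

def packets_lost(prev_seq, seq):
    """Packets missing between two received sequence numbers (mod 2**16)."""
    return (seq - prev_seq - 1) % 0x10000
```

The modular arithmetic handles the 16-bit sequence-number wraparound, so loss is counted correctly across the 65535-to-0 boundary.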
The data flow
The data flow in a streaming video application is as follows: the streaming data enters the architecture by way of a 2.5 or 3G modem. The MCU runs the protocol layers (RTP/RTSP and TCP/IP) and demultiplexes the audio and visual data. The compressed audio and video bit streams are extracted from their respective RTP packets and forwarded to the DSP's internal RAM.
The DSP then decodes the images for display. The DSP also stores a copy of the reconstructed frame for use in the decoding of the next frame and so on.
In a video streaming application, previous images are used to predict the current image. The previous image is moved macroblock by macroblock from the video buffer into the DSP's internal RAM, where it is combined with other information and sent to the display screen as the current image.
Because streaming video involves moving a tremendous amount of data in real-time, I/O issues are critical considerations. At least two direct memory access (DMA) channels, and possibly more, will be needed to avoid I/O bottlenecks, which would slow down the system and undercut the effective computational speeds of the DSP and MCU. It is also important to include specific DMA capabilities that simplify the movement of two-dimensional pixel arrays, byte alignments, and byte-by-byte transfers.
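The two-dimensional capability just mentioned amounts to strided address arithmetic. A minimal sketch, modeling the frame buffer as a flat byte array (the geometry here is illustrative, not a specific DMA controller's programming model):

```python
# DMA-style 2-D copy: gather a rectangular pixel region out of a flat,
# line-oriented frame buffer into a compact working buffer, as when
# staging a macroblock in the DSP's internal RAM.
def dma_copy_2d(src, src_stride, x, y, width, height):
    """Copy a width x height region at (x, y) into a contiguous buffer."""
    dst = bytearray(width * height)
    for row in range(height):
        s = (y + row) * src_stride + x  # strided source address per line
        dst[row * width:(row + 1) * width] = src[s:s + width]
    return bytes(dst)
```

A hardware 2-D DMA channel performs exactly this per-line strided transfer without consuming processor cycles.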
As benign as a dual-processor architecture may seem, below the surface lies a myriad of challenging design issues. One of the biggest is memory management.
Shared memory must be managed to avoid conflicts involving both processors accessing the same memory location at the same time. Memory access requests must also be ordered consecutively in time, while ensuring a predetermined access time for both processors.
In addition, to make the most efficient use of the two processors, designers may choose to implement two OSes, because an OS well suited to a DSP will not function effectively on an MCU and vice versa. A designer who implements two OSes must determine how to reconcile the differences between them, as they will certainly handle memory addressing, memory accesses, and housekeeping chores differently.
The structure and size of processor cache memory will also have a decided effect on system performance. For example, with a wireless communication device running the GSM protocol and MPEG-4 video encoding/decoding, simulations have been done to determine the optimum size of cache memory for maximizing cache hits and minimizing processor wait states. The following cache sizes were derived for the system's MCU:
- Instruction cache: 16 KB, two-way set-associative, with a 16-B line
- Data cache: 8 KB, two-way set-associative, with a 16-B line
Research has shown that these sizes and types of memory would have a cache miss-to-hit ratio of just 3.4% for the instruction cache and only 9% for the data cache. When this type of simulation was performed for the DSP, it was found that an instruction cache of 16 KB organized as two-way set-associative with a 16-B line would result in an instruction miss-to-hit ratio of less than 1%.
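The geometry implied by those figures follows from a standard relation: the number of sets is the cache size divided by (ways x line size), and the address splits into tag, set-index, and line-offset bits. A small sketch of that arithmetic:

```python
# Geometry of a set-associative cache: sets = size / (ways * line_size);
# the low address bits select the byte within a line, the next bits the set.
def cache_geometry(size_bytes, ways, line_bytes):
    sets = size_bytes // (ways * line_bytes)
    return {"sets": sets,
            "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
            "index_bits": sets.bit_length() - 1}         # log2(set count)
```

The 16-KB, two-way, 16-B-line instruction cache above thus has 512 sets, indexed by 9 address bits after a 4-bit line offset.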
Wireless video streaming
The architecture described above can be used effectively to perform video decoding in a streaming video application. In a wireless streaming video application, one of the issues that most concerns designers involves the demands placed on the processor in terms of cycles. This relates back to power consumption: reducing the number of processor cycles required for a task reduces the power expended on that task.
Designers must examine the DSP core to determine its suitability for video encoding/decoding. Some high-performance DSPs have reduced the processor cycles needed to perform the inverse DCT (IDCT) and half-pixel interpolation (HPI) on a macroblock from 1,200 and 350 cycles, respectively, for previous-generation DSPs to 147 and 70 cycles.
The size of the display image also affects processor cycles. Using a high-performance, low-power DSP to display a streaming video application in the larger common intermediate format (CIF) at 45 frames per second takes at least 108 million DSP processor cycles per second. But if the smaller quarter CIF (QCIF) were used at 15 frames per second, the processor load would fall to 12 million cycles per second.
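A first-order way to see why the format matters so much: model the load as macroblocks per frame times frame rate times cycles per macroblock. The cycles-per-macroblock figure below is a hypothetical aggregate, and the model deliberately ignores per-frame overheads, so it only approximates the measured figures quoted above (pure macroblock scaling predicts a 12x gap between CIF at 45 frames/s and QCIF at 15 frames/s, where the quoted cycle counts imply roughly 9x).

```python
# First-order decode-load model: cycles/s = MBs/frame * frames/s * cycles/MB.
def mbs_per_frame(width, height):
    return (width // 16) * (height // 16)  # 16x16-pixel macroblocks

def load_cycles_per_sec(width, height, fps, cycles_per_mb):
    return mbs_per_frame(width, height) * fps * cycles_per_mb

# CIF (352 x 288) has four times the macroblocks of QCIF (176 x 144), so
# CIF at 45 frames/s carries 12x the macroblock throughput of QCIF at 15.
ratio = load_cycles_per_sec(352, 288, 45, 1) / load_cycles_per_sec(176, 144, 15, 1)
```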
For many wireless handheld devices the smaller QCIF format will be appropriate. Shifting to QCIF can then lower power consumption and lengthen battery life.
In addition, newer DSPs consume less power no matter which image format is used. For example, a new DSP processing CIF video images at 45 frames per second will consume as little as 110 mW, while QCIF video images at 15 frames per second will consume only 12 mW.
Vendors have also developed new instructions that further reduce the number of processor cycles needed for streaming video. These instructions accomplish more of the video decoding task per cycle than classical DSP instructions do, accelerating the video decoding process by a factor of two and reducing power consumption by requiring fewer processor cycles.
Sebastien De Gregorio joined Texas Instruments (TI) in 1996 in the Wireless Communications Business Unit, where he was involved in DSP algorithms and advanced speech processing. Since 1999, he has served as audio/video project lead. He is based in Nice, France. He can be contacted at email@example.com.
Madhukar Budagavi has been with TI since 1995 as a Member of Technical Staff in the DSP Solutions R&D Center. He works on MPEG-4 and wireless video communications. Prior to TI, he was first a software engineer and then a senior software engineer in Motorola India Electronics Ltd., developing DSP software and algorithms for the Motorola DSP chips. He can be contacted at firstname.lastname@example.org.
Jamil Chaoui joined TI in 1995 and is a Member of the Technical Staff in the European Wireless Application Group of Texas Instruments. Prior to TI, he was with Alcatel as a DSP software and system engineer in Alcatel's mobile phones group. He can be contacted at email@example.com.
1. Wang, Y., et al., "Error resilient video coding techniques," IEEE Signal Processing Magazine, Vol. 17, No. 4, July 2000, pp. 61-82.