David Katz and Rick Gentile, Analog Devices, Inc.
Aug 17, 2005 (10:00 AM)
Selecting a processor for networked multimedia (NM) applications is a complex endeavor. First, a thorough analysis of the processor's core architecture and peripheral set must be prepared, in the context of both present and near-term industry interface needs. Next, it is crucial to understand how multimedia data, such as video, images, audio and packet data, flow through the system in order to prevent bandwidth bottlenecks. Finally, it is helpful to understand the various system attributes that could make the difference between a marginal implementation and a robust solution.
The choice of media processor for multimedia applications is determined by the performance and connectivity requirements of the design. Many applications use both a microcontroller unit (MCU) and a digital signal processor (DSP). The MCU provides the control functionality for the system, while the DSP performs the intensive numeric computation. Today, these distinct roles can be united in a single processor. This type of device presents control code density and intensive signal processing in a single architecture, while offering a wide peripheral set suitable for multimedia connectivity.
Among the first measures that system designers should analyze when selecting a processor are:
- the number of instructions performed each second,
- the number of operations accomplished in each processor clock cycle,
- the efficiency of the computation units.
The merits of each of these metrics can be determined by running a representative set of benchmarks, such as video and audio compression algorithms, on the media processors under evaluation. The results will indicate whether the real-time processing requirements exceed the processor’s capabilities and, equally important, whether there will be sufficient capacity available to handle new or evolving system requirements. Many standard benchmarks assume that pre-processed data already resides within internal memory. This technique allows a more direct comparison between processors from different suppliers, as long as the designer reconciles the I/O considerations separately.
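As a rough illustration of the headroom check described above, the following sketch converts a benchmark cycle count into a fraction of the core's cycle budget. The function name and the sample figures (a 600 MHz core, 12 million cycles per frame, 30 frames per second) are hypothetical and are not measurements from any real device.

```python
def core_load(cycles_per_frame, frames_per_second, core_clock_hz):
    """Fraction of available core cycles consumed by a real-time task."""
    return cycles_per_frame * frames_per_second / core_clock_hz

# Hypothetical benchmark result: 12 million cycles per video frame,
# 30 frames per second, on a 600 MHz core.
load = core_load(12_000_000, 30, 600_000_000)
headroom = 1.0 - load
print(f"core load {load:.0%}, headroom {headroom:.0%}")
```

Because such a figure usually assumes data already sits in internal memory, the I/O cost still has to be budgeted separately.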
Internal data bus architecture
The data bus architecture of a multimedia processor is just as important to the overall system performance as the core clock speed. Figure 1 shows an example of bus interconnections between various processor subsystems. Because there are often multiple data transfers taking place at any one time, the bus structure must support both core and direct memory access (DMA) to all areas of internal and external memory. DMA allows data transfer operations to occur without involving the processor core.
It is critical that arbitration between the DMA controller and the core be handled automatically, or performance will be greatly reduced. Core-to-DMA interaction should only be required to set up the DMA controller and then to respond to interrupts when data is ready to be processed.
Because internal memory is typically constructed in sub-banks, simultaneous access by the DMA controller and the core can be accomplished in a single cycle by placing data in separate banks. For example, the core can be operating on data in one sub-bank while the DMA is filling a new buffer in a second sub-bank. Simultaneous access to the same sub-bank is also possible under some conditions.
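The sub-bank arrangement described above is often used for double ("ping-pong") buffering. The sketch below models only the ordering of events, not true concurrency: on each step the DMA fills one bank while the core processes the block left in the other, and the two banks then swap roles. The two-bank list and the function name are assumptions made for illustration.

```python
def ping_pong(blocks, process):
    """Model double buffering across two memory sub-banks."""
    banks = [None, None]      # two hypothetical L1 sub-banks
    fill = 0                  # index of the bank the DMA is filling
    results = []
    for block in blocks:
        banks[fill] = block               # DMA fills one bank...
        ready = 1 - fill
        if banks[ready] is not None:      # ...while the core works on the other
            results.append(process(banks[ready]))
        fill = ready                      # swap roles for the next block
    last = 1 - fill                       # drain the final filled bank
    if banks[last] is not None:
        results.append(process(banks[last]))
    return results

print(ping_pong([[1, 2], [3, 4], [5, 6]], sum))
```

On real hardware the swap would typically be triggered by a DMA completion interrupt rather than by a loop iteration.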
When access is made to external memory, there is usually only one physical bus available. Considering that on any given cycle, external memory may be accessed to fill an instruction cache line at the same time it serves as a source and destination for incoming and outgoing data, the bus arbitration challenge becomes clear.
Fig. 1 Processor architecture map showing the relationship between core, peripherals, DMA and external memory
One way to alleviate these external bus bottlenecks is to allow more than one SDRAM page to be “open” at a time. An SDRAM page can be 4KB in size, and it is not hard to imagine systems where at least 3 pages are being accessed in any given processing interval (e.g., compressed data in, working buffer, and processed data out). This page management is handled automatically on many high performance processors.
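To see why multiple open pages matter, consider a simplified model in which three data streams in different 4KB pages are accessed in turn. The sketch counts how many accesses must open a new page. The least-recently-used replacement policy, the base addresses and the function name are assumptions; real SDRAM controllers typically keep one row open per internal bank, so this is only an approximation.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes, the 4KB page size used in the text

def count_page_misses(addresses, open_pages):
    """Count accesses that must open a new SDRAM page (LRU policy assumed)."""
    open_set = OrderedDict()
    misses = 0
    for addr in addresses:
        page = addr // PAGE_SIZE
        if page in open_set:
            open_set.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            if len(open_set) >= open_pages:
                open_set.popitem(last=False)  # close the least recently used page
            open_set[page] = True
    return misses

# Hypothetical buffers: compressed data in, working buffer, processed data out.
bases = [0x00000, 0x10000, 0x20000]
stream = [base + 4 * i for i in range(64) for base in bases]

for n in (1, 2, 3):
    print(n, "open page(s):", count_page_misses(stream, n), "misses")
```

In this model, every access misses until all three pages can stay open at once, at which point only the first access to each page pays the penalty.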
The right peripheral mix saves time and money by eliminating the need for external circuitry to support the needed interfaces. Networked multimedia devices draw from a universe of standard peripherals. Prominent among these is, of course, connectivity to the network interface. In wired applications, Ethernet (IEEE 802.3) is the most popular choice for networking over a LAN, whereas IEEE 802.11a/b/g is emerging as the prime choice for wireless LANs. Many Ethernet solutions are available either on-chip or bridged through another peripheral (such as asynchronous memory or USB). In addition, on processors that can support both DSP and MCU functionality equally well, a TCP/IP stack can be managed right onboard.
Also necessary for linking the processor to the multimedia system environment are synchronous and asynchronous (UART) serial ports. In NM systems, audio codec data often streams over synchronous 8- to 32-bit serial ports, whereas audio and video codec control channels are managed via a slower serial interface such as SPI or a 2-wire interface. Furthermore, UARTs can support RS-232 modem implementations, as well as IrDA functionality for close-range infrared transfer.
Many media processors provide a general-purpose interface such as PCI or USB, because these can bridge to several different types of devices via external chips (e.g., PCI to IDE, USB to 802.11, etc.). PCI can offer the extra benefit of providing a separate internal bus that allows the PCI bus master to send or retrieve data from processor memory without loading down the core or other peripherals. Additionally, media processors should include an external memory interface that provides both asynchronous and SDRAM memory controllers. The asynchronous memory interface facilitates connection to FLASH, EEPROM and peripheral bridge chips, whereas SDRAM provides the necessary storage for computationally intensive calculations on large data frames.
A peripheral available on some high-performance processors is known as a parallel peripheral interface (PPI). This port can gluelessly decode ITU-R BT.656 data, as well as act as a general-purpose 8- to 16-bit I/O port for high-speed A/D and D/A converters or ITU-R BT.601 video streams. It can also support a direct connection to an LCD panel. Additional features are available which can reduce system costs and improve data flow within the system. For example, the PPI can connect to a video decoder and automatically ignore everything except active video, effectively reducing an NTSC input video stream rate from 27 MB/s to about 20 MB/s and markedly reducing the amount of off-chip memory needed to handle the video.
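The reduction from 27 MB/s to roughly 20 MB/s can be checked with a short calculation. The sketch assumes the standard 525-line raster of 858 samples per line, 2 bytes per pixel (4:2:2 YCbCr) and an active region of 720 x 480 pixels; some systems treat 486 lines as active, which would raise the second figure slightly.

```python
from fractions import Fraction

FRAME_RATE = Fraction(30000, 1001)  # NTSC frame rate, about 29.97 frames/s
BYTES_PER_PIXEL = 2                 # 4:2:2 YCbCr at 8 bits per sample

def stream_rate(samples_per_line, lines_per_frame):
    """Bytes per second for a raster of the given size."""
    return samples_per_line * lines_per_frame * BYTES_PER_PIXEL * FRAME_RATE

full = stream_rate(858, 525)    # complete raster, including blanking
active = stream_rate(720, 480)  # active video only (480 lines assumed)

print(f"full: {float(full) / 1e6:.1f} MB/s, active: {float(active) / 1e6:.1f} MB/s")
```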
It is instructive to examine some ways that the PPI connects in multimedia systems, to show how the system as a whole depends on each component's data flow. In the top part of Figure 2, an image source sends data to the PPI, and the DMA engine then routes it to L1 memory, which is on-chip memory that runs at the processor speed. In L1 memory, the data is processed to its final form, and then it is sent out through a high-speed serial port (SPORT). This model works very well for low-resolution video processing and for image compression algorithms like JPEG, where small blocks of video (several lines' worth) can be processed and are never needed again. This flow can also work well for some data converter applications.
In the bottom part of Figure 2, the video data is not routed to L1 memory, but instead is directed to external SDRAM. This configuration supports algorithms such as MPEG-2 and MPEG-4, which require storage of intermediate video frames in memory in order to perform temporal compression. In such a scenario, a bidirectional DMA stream between L1 memory and SDRAM allows for transfers of pixel macroblocks and other intermediate data.
Fig. 2 Possible video port data transfer scenarios
As previously mentioned, processors suitable for NM applications must have a DMA engine that is independent of the core processor. Furthermore, the total number of DMA channels available must support the wide range of peripherals and data movement options. Additionally, a flexible DMA controller can save extra data passes in computationally intensive algorithms by allowing direct data transfer between peripherals and internal/external memory, rather than having to stage all data in L1 memory before transferring to external storage.
As data rates and performance demands increase, it becomes crucial to have access to "system performance tuning" controls. For example, the DMA controller might be optimized to transfer a data word on every clock cycle. When there are multiple transfers ongoing in the same direction (e.g. all from internal memory to external memory), this is usually the most efficient way to operate the controller because it prevents idle time on the DMA bus.
In other cases, such as those involving multiple bidirectional data streams, it is important to control the traffic flow to match the application. For instance, if the DMA controller only operated in a mode where each "ready" transfer was immediately granted the DMA bus, overall throughput would actually suffer when connected to a device such as an SDRAM.
In situations where data transfers switch direction on every cycle, the latency associated with turn-around time will cut into throughput significantly. Processors that can adjust the "burst" size of the DMA possess a distinct advantage over those that stick to a more traditional DMA model. Because each DMA channel can be used to connect a peripheral to internal or external memory, it is also important to be able to automatically service a peripheral that may issue an urgent request for the bus.
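The effect of burst size on throughput can be approximated with a one-line model: if the bus changes direction after every burst and each change costs a fixed number of idle cycles, the fraction of cycles that carry data is burst / (burst + turnaround). The 4-cycle turnaround used below is a hypothetical figure, not a specification of any particular memory device.

```python
def bus_efficiency(burst_words, turnaround_cycles):
    """Fraction of bus cycles carrying data when direction flips after each burst."""
    return burst_words / (burst_words + turnaround_cycles)

# Hypothetical turnaround penalty of 4 cycles per direction change.
for burst in (1, 4, 16):
    print(f"burst of {burst:2d} words: {bus_efficiency(burst, 4):.0%} efficient")
```

Larger bursts amortize the turnaround cost, at the price of making other channels wait longer for the bus, which is why an adjustable burst size is useful.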
Another feature, two-dimensional DMA (2D DMA) capability, offers several system-level benefits. For one, it can facilitate transfers of macroblocks to and from external memory, allowing data manipulation as part of the actual transfer. This eliminates the overhead typically associated with transferring noncontiguous data. It can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an image, instead of the entire image.
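A two-dimensional transfer can be thought of as a nested loop over addresses: an inner count and stride walk along a row, and an outer count and stride step from row to row. The sketch below is a generic model of that idea, not the register interface of any specific DMA controller; the function and parameter names are illustrative. It gathers one 16x16 macroblock from a row-major frame with one byte per pixel.

```python
def dma_2d_addresses(start, x_count, x_stride, y_count, y_stride):
    """Yield addresses for y_count rows of x_count elements each."""
    for row in range(y_count):
        for col in range(x_count):
            yield start + row * y_stride + col * x_stride

def copy_block(frame, width, x0, y0, size=16):
    """Gather a size x size block whose top-left pixel is (x0, y0)."""
    start = y0 * width + x0
    return bytes(frame[a] for a in dma_2d_addresses(start, size, 1, size, width))

# Hypothetical 32x32 test frame in which each pixel holds its address modulo 256.
WIDTH = HEIGHT = 32
frame = bytes((y * WIDTH + x) % 256 for y in range(HEIGHT) for x in range(WIDTH))
block = copy_block(frame, WIDTH, x0=16, y0=0)
print(len(block), "bytes moved instead of", len(frame))
```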
2D DMA also allows data to be placed into memory in a sequence more natural to processing. As shown in Figure 3, RGB data may enter a processor’s on-chip memory from a CCD sensor in interleaved RGB format, but using 2D DMA, it can be transferred to external DRAM in separate R, G and B planes. Interleaving or deinterleaving color space components as part of the transfer saves additional data moves prior to processing.
Fig. 3 Two-dimensional DMA allows separation of interleaved data into discrete buffers.
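One way to picture the separation in Figure 3 is a single transfer whose source is read linearly while the destination address follows a two-dimensional pattern: the inner step jumps one full plane ahead for each color component, and the outer step advances one byte per pixel. The sketch below models that pattern in software; the function name and buffer layout are assumptions for illustration and do not describe a particular controller's registers.

```python
def deinterleave_rgb(interleaved):
    """Split RGBRGB... bytes into separate R, G and B planes."""
    pixel_count = len(interleaved) // 3
    dest = bytearray(3 * pixel_count)      # three planes stored back to back
    src = 0
    for pixel in range(pixel_count):       # outer loop: advance one byte per pixel
        for comp in range(3):              # inner loop: jump one plane per component
            dest[pixel + comp * pixel_count] = interleaved[src]
            src += 1                       # source is read in linear order
    r = bytes(dest[:pixel_count])
    g = bytes(dest[pixel_count:2 * pixel_count])
    b = bytes(dest[2 * pixel_count:])
    return r, g, b

r, g, b = deinterleave_rgb(bytes([10, 20, 30, 11, 21, 31, 12, 22, 32]))
print(list(r), list(g), list(b))
```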
Other important DMA features include the ability to prioritize DMA channels to meet current peripheral task requirements, as well as the capacity to configure the corresponding DMA interrupts to match these priority levels. Additionally, it is useful to have separate DMA Error and DMA Completion interrupt vectors to allow for efficient servicing of interrupt routines without having to determine which outcome caused a given DMA interrupt.
System data flow
Before reaching a final decision on the processor choice for an NM design, it is imperative to understand the system-level data flow and how that flow can be implemented on the processor. Specifically, can data be brought in and out of the processor without falling behind on data and signal processing? Can the processor be kept fed with data, and can the data be accessed as needed during any given processing interval? It is crucial to ask these questions when designing a multimedia, network-centric system, in which running algorithms efficiently is not enough by itself. The processor must also handle the complete bidirectional system data flow.
Fig. 4 A video decoder application demonstrates the complexities of data flow that a media processor must manage
Consider the video decoder system depicted in Figure 4. Here, a low-bandwidth, compressed audio/video stream is transferred into the system via the LAN. The compressed data is then decoded into an audio and video stream that can be played in real time by the processor. The Ethernet TCP/IP stack is managed onboard the processor. In addition to the incoming A/V stream, several data movements between internal and external memory are required to complete the decode algorithm. Often, an input buffer streams into SDRAM while the processor core decodes the data from a previously filled buffer.
Since many video compression algorithms operate on one block of data at a time, each block can be transferred as needed from external memory. Some algorithms require multiple image or video frames to complete the desired processing, resulting in multiple bidirectional data transfers between internal and external memory. In an MPEG-based compression scheme, for example, macroblocks of 16x16 pixels are moved between various memory banks as the processor navigates through the data to be decoded.
In the end, a continuous audio stream is sent out to an audio codec and a video stream of up to 27 MB/s is sent to a video encoder. This process must occur in real time, or there will be glitches in the video stream or audio sync problems.
This system scenario is a realistic depiction of the daunting data transfer rates that must occur between several subsystems to support networked multimedia applications. There are at least 5 sets of simultaneous data movements involved in the above example. What’s more, when considering the overall dataflow, it is not sufficient to simply verify that the total byte traffic moving through the system does not exceed the processor’s theoretical internal bandwidth (obtained by multiplying the bus speed by the bus width).
For example, in parts with high core clock rates, the buses between the core processor and the peripherals will typically run at a rate of 133 MHz. With a bus width of 32 bits (4 bytes per transfer), the throughput should ideally approach 133 MHz x 4 bytes = 532 MB/s. In reality, this peak number can only be achieved if exactly one transfer is active and no other transfers are pending. As individual peripherals are added to the application, they must each arbitrate for the internal processor bandwidth. System designers typically allow for arbitration delays by assuming that only 50% of the internal bandwidth is available.
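A first-pass bandwidth budget along these lines can be sketched as follows. The 133 MHz clock, 32-bit width, 50% derating and 27 MB/s video output come from the text; every other flow rate in the table is a hypothetical placeholder that a designer would replace with figures from the actual system.

```python
def peak_bandwidth_mb_s(bus_clock_mhz, bus_width_bits):
    """Theoretical peak: one full-width transfer on every bus clock."""
    return bus_clock_mhz * bus_width_bits // 8

peak = peak_bandwidth_mb_s(133, 32)  # 532 MB/s
usable = peak * 0.5                  # allow 50% for arbitration delays

# Five concurrent flows for a decoder like the one in Figure 4.
# All rates except the video output are hypothetical.
flows_mb_s = {
    "network stream in": 1.0,
    "audio out": 0.2,
    "video out": 27.0,
    "SDRAM to L1 macroblocks": 40.0,
    "L1 to SDRAM macroblocks": 40.0,
}
total = sum(flows_mb_s.values())

print(f"peak {peak} MB/s, usable {usable:.0f} MB/s, required {total:.1f} MB/s")
print("fits" if total <= usable else "over budget")
```

Passing this check is necessary but not sufficient: as noted above, turnaround penalties and arbitration between individual flows still have to be examined.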
Processor selection for networked multimedia applications is a crucial and complex task. However, by taking system-level issues into account at the initial processor selection stage, designers can guarantee their present application data flows will be handled, and they can ensure that processor headroom and peripheral connections exist for straightforward upgradeability as network and multimedia standards evolve.
About the author
David Katz and Rick Gentile are both Senior Applications Engineers in ADI's Blackfin Processing Group. Their book, Embedded Media Processing, will be available in September. They can be reached at: firstname.lastname@example.org and email@example.com.