Feng Niu, Lei Tang, Shaojun Wei
Datang Microelectronics Technology Co., LTD., Beijing, China
Abstract:
Striking the right balance between optimal support for target applications and broad applicability has become a central problem of SoC design. Accordingly, Datang Microelectronics Technology Co., LTD., Beijing, China, is developing an SoC platform that targets not only TD-SCDMA but also other application series such as network multimedia. This paper describes a typical research case: how to extend an existing SoC architecture for a 3G terminal baseband processor to a player for network multimedia compressed with MPEG-4, and in particular how to build an effective multi-layer, shared memory system to cope with high-density computation and transfer tasks. Finally, we use CoCentric System Studio, one of Synopsys' ESL tools, to verify our idea and arrive at the target architecture.
1. Introduction
To meet the opportunities and challenges of SoC design, many companies are developing more universal SoCs that can be adapted to several application products and customers. As more and more different functions are integrated, inherently parallel communication and parallel data stream transactions are increasing rapidly in application systems. With this growing parallelism, using more than one small, specialized processor core becomes a natural architecture for an advanced SoC. As a result, the performance of many applications on an SoC will be limited only by the achievable bandwidth and latency, the communication mode between processors, and the capability of integrating multiple processors on one chip.
As a pioneer in this area, Datang Microelectronics Technology Co., LTD., Beijing, China, is now focusing on SoC-platform-based hardware designs and software applications. The software applications include not only TD-SCDMA, one of the 3G standards, but also other application series such as network multimedia. We have an existing SoC architecture that mainly targets the baseband processor of a 3G telecommunication terminal, but we also want our final SoC to satisfy the processing demands of playing network multimedia compressed with MPEG-4. Extending the current architecture is therefore necessary. Fig. 1 shows the transition between the two applications.
Fig. 1: Extending the 3G terminal SoC to Network Multimedia
In this paper, we describe how to extend the existing SoC architecture of a 3G terminal to a network multimedia player, and how to build an effective multi-layer, shared memory system to cope with high-density computation and mass transfer tasks. Sections 2 and 3 analyze the SoC architecture for 3G and the challenges that the new application poses to the current architecture. Section 4, the main emphasis of the paper, describes how the existing architecture is extended for MPEG-4. Section 5 uses CoCentric System Studio to verify our idea and presents the simulation results, and Section 6 draws a conclusion.
2. Analysis of the current architecture for TD-SCDMA
To develop a baseband processor platform for a TD-SCDMA handset, the following points need attention:
- A data rate of 384 kbps must be supported.
- Hardware macro IP such as a Turbo or Viterbi decoder is needed to accelerate the digital signal decoding process.
- A dedicated TD-SCDMA RFIF or GSM RFIF (for dual-mode terminals) is needed to receive the baseband data.
- On the physical layer, at least two DSPs are needed: one for the computation-heavy joint detector in the downlink and BC in the uplink, the other for channel decoding, data division and AMR decoding in the downlink and for channel coding, data conformity and AMR coding in the uplink.
- On the protocol layer, a powerful RISC processor is needed to handle the protocol stack and application data tasks.
- The LCDC module should be placed in the RISC subsystem because the data to be displayed comes from protocol stack transactions.
- The platform also needs other modules, such as various memories and memory controller interfaces, DMA, SPIF, UART, SSI, GPIO, and a high-performance interconnect bus protocol such as AMBA 2.0.
Fig. 2: Analysis of the architecture of TD-SCDMA baseband for MPEG-4 application
This gives us the architecture for the baseband processor of a TD-SCDMA terminal. A simplified framework, showing the modules that are key to system performance analysis, is given in Fig. 2. The system is divided into two partitions. One is the ARM subsystem, with one ARM926EJ-S core that handles the user's application programs and OS tasks. The other is the DSP subsystem, which processes the baseband digital signal on the physical layer; it includes two ZSP540 cores, each with a peak performance of 450 MIPS at a 100 MHz system clock, and two corresponding tightly coupled memories. The two subsystems are connected by two DMA controllers (one of Synopsys' DesignWare IPs), each with one AHB-Lite bus, and both DMA modules can be put to work for a single subsystem. In addition, a shared memory module stores the data exchanged between the two subsystems. Finally, the memory controller interface module (MemCtl) in the ARM subsystem has only one AHB interface, so an ICM module is needed to arbitrate simultaneous requests from different AHB buses.
3. Challenges to the architecture with multimedia application
Now let us discuss our new application, which is called Multimedia On Network Storage (MONS). It is a home multimedia system based on TV display, in which a large amount of high-quality program material can be obtained over the broadband network through the system navigator. It must also support MPEG-4 decoding in addition to H.263, MP3 and JPEG data decompression and display. Some assumed requirements are as follows:
- 25 to 30 frames per second may need to be decompressed and displayed on the TV.
- Each frame is displayed at a size of 720x576 pels.
- Hard disk data backup should be supported.
- An Ethernet interface with a rate of more than 1.5 Mbit/s is required.
- User identification and remote control are also necessary.
These requirements show that the MONS application has higher throughput and computing complexity than the TD-SCDMA terminal. In this paper we place strong emphasis on MPEG-4 decoding because of its high-density computation, storage and transfer demands. For example:
The frames to be decoded can be classified as I frames, P frames and B frames, and every frame is further divided into 16x16 macro blocks. All compressed frames require IDCT, inverse quantization, zigzag scan, VLD and up-sampling for the format conversion from YUV420 to YUV422, while P frames and B frames also need motion compensation, which uses previously decompressed pictures as reference frames. Furthermore, the ZSP core that performs motion compensation must access the reference frames under strict time limits, which means the reference frames must be stored on chip rather than off chip. From Fig. 3, it can be concluded that at most two reference frames are needed during decoding. Otherwise, the ZSP core would need immediate access to off-chip memory via MemCtl. In addition, the decompressed pictures must be sent to the LCD or TV interface in real time.
Fig. 3: MPEG-4 decoding features
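The storage implied by these requirements can be estimated with a short calculation. The sketch below assumes 8-bit samples and that reference frames are held in YUV420 (the format before up-sampling); neither assumption is stated in the paper, and the function names are illustrative only.

```python
# Rough sizing of MPEG-4 reference frames for a 720x576 picture.
# Assumptions (not stated in the paper): 8-bit samples, reference frames
# stored in YUV420, i.e. 1.5 bytes per pel on average.

WIDTH, HEIGHT = 720, 576      # frame size from the MONS requirements
MB_SIZE = 16                  # macro block edge in pels


def macro_blocks_per_frame(width: int, height: int, mb: int = MB_SIZE) -> int:
    """Number of 16x16 macro blocks covering the frame."""
    return (width // mb) * (height // mb)


def yuv420_frame_bytes(width: int, height: int) -> int:
    """Bytes for one 8-bit YUV420 frame: full-size Y plus quarter-size U and V."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


if __name__ == "__main__":
    mbs = macro_blocks_per_frame(WIDTH, HEIGHT)
    frame = yuv420_frame_bytes(WIDTH, HEIGHT)
    print(f"macro blocks per frame: {mbs}")
    print(f"one reference frame:    {frame} bytes")
    print(f"two reference frames:   {2 * frame} bytes")
```

Under these assumptions a frame contains 1620 macro blocks and two reference frames occupy about 1.2 MB, which gives a sense of the on-chip storage the motion compensation core would need.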
Unfortunately, in the TD-SCDMA SoC architecture the tightly coupled memories in the DSP subsystem are too small to store the reference frames; moreover, the ZSP cores cannot directly access the MemCtl interface in the ARM subsystem. As shown in Fig. 2, the reasons are as follows:
- The two tightly coupled memories of the ZSP cores are separate and independent of each other.
- Both DMAs transfer data from source to sink over only one AHB-Lite bus.
- The MemCtl module in the ARM subsystem also has only one AHB slave interface.
- The LCD or TV interface is located in the ARM subsystem.
4. Extending the existing architecture for MPEG-4
The principal task now is to optimize the current architecture so that it suits the MONS application better, while minimizing the cost of modification and keeping it suitable for the 3G terminal baseband processor. From the performance estimate of the existing architecture for MPEG-4 in Section 3, we know that system storage and transfer capacity is the bottleneck that must be eliminated.
There are two ways to increase the bandwidth of the main memories. One is to share the separate memories so that different processors can access them simultaneously; the other is to use a hierarchical system with multi-level intermediate memories. The intermediate memory is controlled by software, unlike a cache with its automatic lookup. The idea is to make the memories and the pipeline cooperate closely, keeping operands close to the processors and keeping the processors busy.
After comparing several architectures with multi-port or dual-port memory and the ST GreenSIDE main memory, we developed an effective multi-level, shared memory system, as shown in Fig. 4. It brings the following advantages:
- Sharing the separate tightly coupled memories increases the memory capacity accessible to each ZSP core and improves the utilization of each shared memory. It also makes it easy for the two ZSP cores to execute tasks in a macro pipeline mode. At the same time, the on-chip data exchange traffic between the processors is greatly reduced, because DMA no longer needs to transfer the exchanged data once the memories are shared between the ZSP cores. A memory management module controls and arbitrates accesses from the different masters.
- Using a memory hierarchy enhances the execution efficiency of the processors and enlarges the accessible address space. Placing two DMA controllers between the ZSP memory side and the MemCtl and LCD side raises the data transfer capacity between the two subsystems, so that decompression, display and storage of the reference frames can proceed simultaneously. Here the TV interface is merged into the LCD controller module.
Fig. 4: An effective multi-layer and shared memory system
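The benefit of running the two ZSP cores as a macro pipeline over shared memory can be illustrated with a simple timing model. This is a sketch only: the stage times and function names are hypothetical, and the model assumes constant stage times and enough shared buffer space that the first stage is never blocked.

```python
# Timing model of a two-stage macro pipeline versus serial execution.
# Hypothetical illustration: stage times are made-up inputs, and the shared
# memory between the stages is assumed large enough never to block stage A.


def serial_cycles(n_items: int, stage_a: int, stage_b: int) -> int:
    """One core runs both stages for every item, back to back."""
    return n_items * (stage_a + stage_b)


def pipelined_cycles(n_items: int, stage_a: int, stage_b: int) -> int:
    """Core A runs stage A while core B runs stage B on an earlier item.

    After the pipeline fills, one item completes every max(stage_a, stage_b)
    cycles, so the slower stage sets the throughput.
    """
    if n_items == 0:
        return 0
    return stage_a + stage_b + (n_items - 1) * max(stage_a, stage_b)


if __name__ == "__main__":
    n = 1000
    for a, b in [(500, 500), (700, 300)]:
        s = serial_cycles(n, a, b)
        p = pipelined_cycles(n, a, b)
        print(f"stages {a}/{b}: serial {s}, pipelined {p}, speed-up {s / p:.2f}")
```

In this model a balanced split approaches a twofold speed-up, while an unbalanced split is limited by the slower stage, which is why the partitioning of work between the two cores matters.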
In the multi-level, shared memory system, we also modified the configuration of the DMA and MemCtl modules. Each DMA uses two AHB buses, and several AHB slave ports are added to the MemCtl module. The reasons are as follows:
- Typically, it takes two bus cycles for DMA to complete a transfer—one for reading the source and one for writing to the destination. However, when the source and destination peripherals of a DMA transfer are on different AMBA layers, it is possible for the DMA to fetch data from the source and store it in the channel FIFO at the same time that the DMA extracts data from the channel FIFO and writes it to the destination peripheral. This activity is known as pseudo fly-by operation.  In order for this to occur, the master interface for both source and destination layers must win arbitration of their AHB layer.
- Having more than one AHB slave port gives high-bandwidth peripherals direct access to the multi-port memory controller (MPMC), without data having to pass over the main system bus. Providing multiple AHB interfaces improves system performance by enabling several access requests to be presented to the memory controller at the same time. This enables the MPMC to pipeline many of the operations (for example, bank activate and precharge), and so reduce the average system access latency and improve utilization of external memory. The use of multiple AHB interfaces also improves system performance by removing heavy DMA traffic from the main AHB bus.
5. Experiments and Results
An HW/SW co-simulation platform based on Synopsys' CoCentric System Studio (CCSS) and SystemC was developed to test our extended architecture, as shown in Fig. 5. To speed up simulation, we included on the virtual platform only the models that are key to system performance, such as DMA, ARM, ZSP, tightly coupled memory, MemCtl, LCDC, AHB bus, shared AHB memory, and APB bus with I2S interface and ICTL.
Fig. 5: Experimental platform in CCSS
We only wanted to know the capability of our architecture under the heaviest load, so only the MPEG-4 decompression program needed to run on the platform. Likewise, the data flow of the MPEG-4 algorithm needed particular attention during system modeling. We therefore used ZSP models with an ISS, on which the software can be loaded, run and debugged, and a pseudo ARM model acting as a stimulus generator for system initialization, DMA transfer configuration and interrupt response. The system frequency remained 100 MHz.
As for the algorithm program, we rescheduled the MPEG-4 flow to adapt the software to the hardware. For example, each frame of 1620 macro blocks is divided into several groups according to the size of a frame row, and these groups, rather than whole frames, are processed in the motion compensation step. In this way, the memory capacity needed for reference frames is reduced. The real-time requirement of the application can be met as long as sufficient transfer capacity is provided. The software running on the ZSP cores was written in C, with some frequently used computations such as IDCT written in ZSP assembler to improve execution efficiency.
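The memory saving from this kind of rescheduling can be estimated as follows. The sketch assumes 8-bit YUV420 reference data and a hypothetical window of a few macro block rows kept on chip; the paper does not state the actual group size, so `rows_in_window` is purely illustrative.

```python
# Reference memory when only a window of macro block rows is kept on chip.
# Assumptions: 8-bit YUV420 data; the window size is a made-up parameter.

WIDTH, HEIGHT, MB = 720, 576, 16
MB_BYTES = MB * MB * 3 // 2          # 256 luma + 2 * 64 chroma bytes


def row_bytes(width: int = WIDTH) -> int:
    """Bytes in one row of macro blocks across the frame width."""
    return (width // MB) * MB_BYTES


def window_bytes(rows_in_window: int, width: int = WIDTH) -> int:
    """Bytes needed to hold `rows_in_window` macro block rows."""
    return rows_in_window * row_bytes(width)


if __name__ == "__main__":
    full_frame = (HEIGHT // MB) * row_bytes()
    for rows in (1, 3, 5):
        w = window_bytes(rows)
        print(f"{rows} row(s): {w} bytes ({100 * w / full_frame:.1f}% of a frame)")
```

A window of a few macro block rows is a small fraction of a full frame, at the price of more frequent transfers between on-chip and external memory, which is why sufficient transfer capacity becomes the deciding factor.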
From the simulation results, part of which is shown in Fig. 6, we found that both the execution efficiency and the transfer efficiency of the system improved with the extended SoC architecture.
- On average, decoding each macro block took 4147 cycles on the old architecture, far from the MPEG-4 demand of 2400 cycles per macro block. On the new architecture, because the two ZSP cores execute in pipeline mode, the average number of cycles per macro block was reduced by approximately 50%.
- On the old architecture, one DMA transfer of a data burst to MemCtl with external SDRAM attached may take 5.3 cycles per 32-bit word. On the new architecture, even taking the DMA configuration time into account, one DMA with two AHB master ports can transfer one word in only about 2 cycles in burst mode.
The extended architecture was thus verified to satisfy the requirements of the new application.
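The reported figures can be cross-checked with simple arithmetic. The sketch below takes the cycle counts and the 100 MHz clock from the text; the 25 frames/s rate is the lower bound of the stated requirement, and treating "approximately 50%" as exactly one half is an assumption.

```python
# Cross-check of the simulation figures quoted above.
# From the text: 100 MHz clock, 1620 macro blocks per frame, 4147 cycles per
# macro block (old), 2400-cycle demand, 5.3 and about 2 cycles per 32-bit word.
# Assumptions: 25 frames/s, and "approximately 50%" taken as exactly 0.5.

CLOCK_HZ = 100_000_000
MBS_PER_FRAME = 1620
FPS = 25


def cycles_available_per_mb(clock_hz: int, mbs: int, fps: int) -> float:
    """Clock cycles available per macro block at the given frame rate."""
    return clock_hz / (mbs * fps)


def dma_bytes_per_second(clock_hz: int, cycles_per_word: float) -> float:
    """Sustained DMA throughput for 32-bit (4-byte) words."""
    return 4 * clock_hz / cycles_per_word


if __name__ == "__main__":
    old_cycles = 4147
    new_cycles = old_cycles * 0.5
    budget = cycles_available_per_mb(CLOCK_HZ, MBS_PER_FRAME, FPS)
    print(f"cycles available per macro block: {budget:.0f}")
    print(f"old: {old_cycles} cycles, new: {new_cycles:.0f} cycles, demand: 2400")
    for label, cpw in (("old", 5.3), ("new", 2.0)):
        rate = dma_bytes_per_second(CLOCK_HZ, cpw)
        print(f"{label} DMA: {rate / 1e6:.1f} MB/s")
```

At 25 frames/s the clock allows roughly 2469 cycles per macro block, the same order as the 2400-cycle demand; the halved cycle count falls below that demand, while the original 4147 cycles does not.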
Fig. 6: Segment of simulation results
6. Conclusion
This paper has described the process of extending the SoC architecture of a TD-SCDMA baseband processor to a network multimedia application. The emphasis is on an effective multi-layer, shared memory system that meets the high-density computation and transfer demands of MPEG-4, which is useful for building an SoC architecture with high storage and transfer capacity. Moreover, the new architecture remains applicable to the original application.
The HW/SW co-design approach is applied and illustrated in this paper. An optimal SoC architecture must be driven by its applications; at the same time, modifying the schedule of the application algorithm, within feasible limits, to adapt it to the architecture may yield unexpected gains in system performance.
References
ARM Inc., AMBA Specification (Rev. 2.0), www.arm.com
Iain E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd., 2003
Ric Hilderink, Stefan Klostermann, Transaction Level Modeling of SoC platforms using SystemC, Design Automation and Test in Europe, 2002
Remi Francard, Mick Posner, Verification Methods Applied to the ST Microelectronics GreenSIDE Project, www.design-reuse.com
Synopsys Inc., DesignWare DW_ahb_dmac Databook, www.synopsys.com