by Fabien Clermidy, Didier Varreau, Didier Lattard
CEA/LETI, Grenoble, France

Abstract:
In this paper, we present a NoC-based communication framework used to develop complex chips that include a large number of heterogeneous IPs. Synchronization and reconfiguration schemes are proposed to handle this complexity and to efficiently decouple SoC communication from computation. Finally, we describe an 8-Mgate NoC-based chip dedicated to telecommunication applications, which has been developed as a proof of concept.

I. INTRODUCTION
Today, multi-application platforms are mandatory to meet cost and time-to-market requirements. With this growing complexity, current bus-based SoC architectures can lead to major mismatches during the integration phase, where every modification is very costly in time and resources. Moreover, even once the integration problem is resolved, the difficulty of mapping the applications remains.
IP integration in this kind of System on Chip (SoC) raises classical issues, such as functional and performance validation, test and debug facilities, and place and route, but also new problems, such as multi-application programming and platform reconfiguration.
In this paper, we propose a complete communication framework based on a Network on Chip (NoC), together with its associated methodology. The major feature of this system is that it provides both synchronization and reconfiguration schemes adapted to multi-application support.
This framework has been used to develop a multi-application telecommunication platform named FAUST (Flexible Architecture of Unified System for Telecom), aimed at handling multiple OFDM-based systems.
The first mapped applications are IEEE 802.11a and SISO MC-CDMA, within the MATRICE project, and the platform was naturally extended to MIMO for the 4MORE project.

II. FAUST NOC ARCHITECTURE

A. Introduction and Motivation
A NoC is a fully distributed communication system with only local connections, which allows all possible data transfers between the connected IP (Fig. 1). A NoC is typically composed of nodes, links between nodes, and Network Interfaces (NI), which form the layer that allows the IP to communicate with the communication medium.
In current designs, the NI is able to convert between the IP protocol and the NoC protocol. Further intelligence has been added to the NI with a QoS mechanism that splits the available bandwidth between the different IP in a configurable way.
Nevertheless, the communication flow itself is still managed inside the different IP with a higher-level protocol. Moreover, once an IP is integrated in the system, there is no guarantee that its communication matches that of the other IP.
Fig. 1: NoC structure example
The architecture described in this paper proposes a high-level network interface that combines a QoS policy, a synchronization scheme that both secures IP integration and eases the application programming of the final SoC, and a general scheme for dynamically reconfiguring the IP.

B. Overview of the FAUST NoC
The FAUST architecture is a 2-D mesh-based NoC. Each node is connected to at most 5 blocks, which can be neighboring nodes or IP blocks.
The NoC transfers packets from an emitter to a receiver using wormhole switching. Packets are composed of flits. The first flit, named the header, contains the routing path (coded in the emitter IP). Each node decodes this path to determine the next direction to follow (i.e. the output of the node). It then arbitrates between the different requests for this direction using a first-come, first-served algorithm. Two virtual channels are available: the first one carries best-effort traffic, and the second one is dedicated to real-time traffic (guaranteed-latency packets).
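The source-routed header and the per-output arbitration can be sketched in Python as follows. This is a minimal model under stated assumptions, not the FAUST implementation: the 3-bit port codes, the XY ordering used by the emitter to build the path, and the names `encode_path`, `decode_hop` and `FcfsArbiter` are all hypothetical, and the sketch ignores how the two virtual channels share the physical link.

```python
from collections import deque
from enum import IntEnum


class Port(IntEnum):
    """Hypothetical 3-bit codes for the 5 ports of a mesh node."""
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3
    LOCAL = 4  # port towards the attached IP block


BITS_PER_HOP = 3  # assumption: 3 bits are enough to name one of 5 ports


def encode_path(src, dst):
    """Emitter side: code the whole route into the header flit.

    XY (dimension-ordered) routing is an assumption made for this sketch.
    Returns (header, number_of_hops).
    """
    (x, y), (dx, dy) = src, dst
    hops = [Port.EAST if dx > x else Port.WEST] * abs(dx - x)
    hops += [Port.NORTH if dy > y else Port.SOUTH] * abs(dy - y)
    hops.append(Port.LOCAL)  # last node delivers the packet to its IP
    header = 0
    for i, port in enumerate(hops):
        header |= int(port) << (i * BITS_PER_HOP)
    return header, len(hops)


def decode_hop(header):
    """Node side: read the next output port, then shift it out of the path."""
    mask = (1 << BITS_PER_HOP) - 1
    return Port(header & mask), header >> BITS_PER_HOP


class FcfsArbiter:
    """First-come, first-served arbitration for one node output.

    One request queue per virtual channel: VC 0 for best-effort traffic,
    VC 1 for real-time (guaranteed-latency) traffic.
    """

    def __init__(self, num_vcs=2):
        self.queues = [deque() for _ in range(num_vcs)]

    def request(self, vc, input_port):
        """An incoming header on `input_port` asks for this output."""
        self.queues[vc].append(input_port)

    def grant(self, vc):
        """Grant the oldest pending request on `vc`, or None if idle."""
        queue = self.queues[vc]
        return queue.popleft() if queue else None
```

For instance, `encode_path((0, 0), (2, 1))` yields a four-hop path (east, east, north, local); each node consumes one code with `decode_hop`, so the routing decision stays local to the node and no routing table is needed.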
The network interface contains blocks dedicated to QoS communication management, data flow synchronization, reconfiguration management, and interrupt, test and debug management (Fig. 2). In this paper, we focus only on the communication synchronization and reconfiguration aspects.

Fig. 2: Network Interface overview

C. Synchronization scheme
A NoC controller (typically a CPU) is in charge of NoC programming and coarse-grain synchronization (at the application level). Fine-grain synchronization is a feature of the NI.
Data flow control and synchronization rely on a “pull data” scheme. Two NI blocks, the Input Communication Controller (ICC) and the Output Communication Controller (OCC), are used for this operation. The ICC is a programmable credit generator. It is associated with a FIFO and distributes the available places of the FIFO to other IP (the data producers) in a sequential and programmable manner. To manage simultaneous input flows, two or more FIFOs can be used.
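A minimal Python sketch of such a credit generator is given below. The class name `Icc`, the word-granularity credits and the program format (an ordered list of `(producer, words)` entries served one after the other) are assumptions chosen for illustration; the paper does not specify the ICC programming interface.

```python
from dataclasses import dataclass, field


@dataclass
class Icc:
    """Sketch of an ICC: a programmable credit generator for one input FIFO.

    `program` is an ordered list of (producer, words) entries; producers are
    served one after the other, which gives the sequential merge of flows.
    """
    fifo_depth: int
    program: list
    occupied: int = 0      # words currently stored in the FIFO
    outstanding: int = 0   # credits granted whose data has not arrived yet
    step: int = field(default=0, init=False)     # current program entry
    granted: int = field(default=0, init=False)  # credits given in this entry

    def free_places(self):
        # Places neither filled nor already promised to a producer.
        return self.fifo_depth - self.occupied - self.outstanding

    def issue_credits(self):
        """Return (producer, n) for the next credit message, or None."""
        if self.step >= len(self.program):
            return None
        producer, total = self.program[self.step]
        n = min(self.free_places(), total - self.granted)
        if n <= 0:
            return None
        self.outstanding += n
        self.granted += n
        if self.granted == total:  # entry complete: move to the next producer
            self.step += 1
            self.granted = 0
        return producer, n

    def on_data(self, n):
        """n words arrived: promised places become occupied places."""
        self.outstanding -= n
        self.occupied += n

    def on_consume(self, n):
        """The IP core read n words, freeing FIFO places."""
        self.occupied -= n
```

Because credits never exceed the free places of the FIFO, a producer can never overflow it, which is what lets the receiver pull data rather than having it pushed.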
On the output side, depending on the configuration selected by the input data flow, the OCC sends the data to the corresponding consumer(s) according to the available credits. To do so, the OCC forms a packet addressed to the consumer(s).
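The OCC side can be sketched in the same spirit. In this hypothetical model (the `Occ` class, the `max_packet_words` limit and the packet dictionary are illustrative assumptions), words leave the output FIFO only toward a consumer that has granted credits, and each transfer is wrapped in one packet; serving several consumers in turn, to split one flow, is also an assumption.

```python
from collections import deque


class Occ:
    """Sketch of an OCC: sends output-FIFO data only against received credits.

    `consumers` is the active output configuration. Serving several consumers
    in turn (splitting one flow) is an assumption of this sketch.
    """

    def __init__(self, consumers, max_packet_words=8):
        self.consumers = list(consumers)
        self.credits = {c: 0 for c in self.consumers}
        self.max_packet_words = max_packet_words  # hypothetical packet limit
        self.fifo = deque()

    def on_credit(self, consumer, n):
        """A consumer's ICC granted n more places."""
        self.credits[consumer] += n

    def push(self, words):
        """The IP core produced new output words."""
        self.fifo.extend(words)

    def send(self):
        """Form at most one packet per consumer; return the packets sent."""
        packets = []
        for consumer in self.consumers:
            n = min(self.credits[consumer], len(self.fifo), self.max_packet_words)
            if n == 0:
                continue
            payload = [self.fifo.popleft() for _ in range(n)]
            self.credits[consumer] -= n
            packets.append({"dest": consumer, "payload": payload})
        return packets
```

With no credits, `send()` returns an empty list even if the FIFO holds data: the producer waits for the consumer to pull.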
For example, Fig. 3 shows the OFDM input flow control found in a classical transmission process (framing operation). In a first phase (Fig. 3a), “full pilots” have to be transmitted from a RAM unit to the OFDM modulator:
- The ICC of the OFDM block sends n credits to the RAM unit.
- The OCC of the RAM block sends a first packet that contains the beginning of the data block.
- The two previous steps are repeated until the whole data block is received by the OFDM unit.
In a second phase (Fig. 3b), data have to be transmitted from the mapping block and, simultaneously, continuous pilots are coming from a RAM unit:
- The ICCs of the two input FIFOs of the OFDM block send credits to both source units (RAM and mapping).
- The RAM unit sends the continuous pilots to FIFO 1 and the mapping unit sends data to FIFO 0.
- The previous steps are repeated until the whole data block is received by the OFDM unit.
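Both framing phases can be replayed with a small round-based simulation. This is a hypothetical sketch: the function `run_phase`, the block sizes and the fixed number of credits per round are illustrative values, and it assumes the OFDM core drains its FIFOs every round so that each ICC can always grant its full credit quota.

```python
def run_phase(flows, credits_per_round):
    """Simulate the pull scheme for one framing phase.

    flows: {fifo_index: (producer_name, words_to_receive)}
    Returns the packet log as (round, producer, fifo, words) tuples.
    """
    remaining = {fifo: words for fifo, (_, words) in flows.items()}
    log, rnd = [], 0
    while any(remaining.values()):
        for fifo, (producer, _) in sorted(flows.items()):
            if remaining[fifo] == 0:
                continue
            # The ICC of this FIFO grants credits to its producer...
            n = min(credits_per_round, remaining[fifo])
            # ...and the producer's OCC answers with one packet of n words.
            remaining[fifo] -= n
            log.append((rnd, producer, fifo, n))
        rnd += 1
    return log


# Phase 1: full pilots from the RAM unit into a single FIFO.
phase1 = run_phase({0: ("RAM", 10)}, credits_per_round=4)
# Phase 2: mapping data into FIFO 0 and continuous pilots into FIFO 1.
phase2 = run_phase({0: ("MAPPING", 6), 1: ("RAM", 2)}, credits_per_round=4)
```

In phase 2 the two flows progress in parallel, each paced by the credits of its own FIFO, so the OFDM block merges them without any producer-side scheduling.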
Fig. 3: Example of framing flows in OFDM: a) first phase; b) second phase
In summary, this technique requires little hardware overhead to merge two or more sequential or simultaneous flows and/or to split one flow into two or more sequential or simultaneous flows.

D. Reconfiguration scheme
Thanks to the NI features, the computation and communication of an IP are dynamically reconfigurable during application processing.
The Read Write Decoder (RWD) and the Configuration Manager (CFM) are the two blocks dedicated to managing IP configurations. The RWD decodes the configurations needed by both the NI itself and the IP core. As only the NI part is generic, a tool allows designers to add the IP-specific address mapping used to decode the configuration. Depending on application requirements, the RWD can decode and store several configurations.
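The split between a generic NI address map and a designer-supplied IP map can be sketched as follows. The addresses, field names (`icc_program`, `occ_config`, `fft_log2_size`, `cp_length`) and the slot structure are hypothetical; the sketch only illustrates decoding each configuration write into either the NI part or the core part of one of several stored configurations.

```python
# Generic NI part of the address map (hypothetical addresses and names).
NI_MAP = {0x00: "icc_program", 0x04: "occ_config"}


class Rwd:
    """Sketch of a read/write decoder holding several configurations.

    Each configuration slot has an NI part (generic) and an IP-core part
    (decoded with the designer-supplied address map).
    """

    def __init__(self, core_map, num_slots):
        overlap = NI_MAP.keys() & core_map.keys()
        if overlap:
            raise ValueError(f"core map reuses NI addresses: {sorted(overlap)}")
        self.core_map = core_map
        self.slots = [{"ni": {}, "core": {}} for _ in range(num_slots)]

    def write(self, slot, addr, data):
        """Decode one configuration write and store it in the given slot."""
        if addr in NI_MAP:
            self.slots[slot]["ni"][NI_MAP[addr]] = data
        elif addr in self.core_map:
            self.slots[slot]["core"][self.core_map[addr]] = data
        else:
            raise KeyError(f"unmapped configuration address {addr:#x}")


# The IP designer only supplies the core-specific part of the map.
rwd = Rwd({0x10: "fft_log2_size", 0x14: "cp_length"}, num_slots=2)
rwd.write(1, 0x04, 5)   # NI field of configuration 1
rwd.write(1, 0x10, 6)   # core field of configuration 1
```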
The CFM manages the configurations decoded by the RWD.
For a given IP, the whole data flow is split into data blocks, i.e. a set of data using the same computation configuration. The first network packet of a data block contains a specific command (INIT_WRITE) and the configuration identifier to be used for the computation of these data. The following network packets contain classical WRITE commands.
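A minimal Python sketch of this data-block protocol follows, including the CFM reaction to INIT_WRITE: an invalid identifier raises an interrupt and stops the flow, while a valid one is loaded only once the data of the previous block have left the IP. The class, its method names and the word-count bookkeeping are hypothetical; the IT manager is replaced by a plain callback, and the set of valid identifiers stands for configurations already decoded by the RWD.

```python
from collections import deque


class Cfm:
    """Sketch of a configuration manager driven by INIT_WRITE / WRITE packets."""

    def __init__(self, valid_ids, raise_it):
        self.valid_ids = set(valid_ids)  # identifiers decoded and stored by the RWD
        self.raise_it = raise_it         # stands in for the IT manager
        self.active = None               # configuration used by the core
        self.in_flight = 0               # words of the active block still in the IP
        self.waiting = deque()           # [config_id, words] blocks not yet started
        self.stopped = False
        self.history = []                # order in which configurations were loaded

    def init_write(self, config_id, words):
        """First packet of a data block: carries the configuration identifier."""
        if self.stopped:
            return
        if config_id not in self.valid_ids:
            self.raise_it(f"invalid configuration {config_id}")
            self.stopped = True          # the data flow is stopped
            return
        self.waiting.append([config_id, words])
        self._try_load()

    def write(self, words):
        """Following packets of the block: plain data."""
        if self.stopped:
            return
        if self.waiting:                 # data belong to a block not yet loaded
            self.waiting[-1][1] += words
        else:                            # data belong to the active block
            self.in_flight += words

    def core_output(self, words):
        """The core has processed and output `words` words of the active block."""
        self.in_flight -= words
        self._try_load()

    def _try_load(self):
        # A new configuration is loaded only once the previous block has left the IP.
        while self.waiting and self.in_flight == 0:
            config_id, words = self.waiting.popleft()
            self.active, self.in_flight = config_id, words
            self.history.append(config_id)


interrupts = []
cfm = Cfm({0, 1}, interrupts.append)
cfm.init_write(0, 4)   # block 0 starts immediately
cfm.write(4)
cfm.init_write(1, 2)   # block 1 must wait for block 0 to drain
```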
When an INIT_WRITE command occurs, the CFM wakes up. It decodes the configuration identifier and checks whether it is valid. If it is not, an interrupt is sent through the IT manager and the data flow is stopped; otherwise, the CFM waits until the end of the previous configuration (all corresponding data must have been processed and sent out of the IP). The new configuration is then loaded and the computation is launched.

E. Conclusion
The synchronization and reconfiguration mechanisms presented above improve the separation between communication and computation, which eases IP integration. The reconfiguration features, combined with the performance of the NoC architecture (high throughput and high flexibility), make it possible to support multiple applications. Last but not least, since in our framework the synchronization of communications is an NI feature, the impact of communications can be evaluated early, and the integration phase becomes more efficient.

III. INTEGRATION ENVIRONMENT

A. Introduction
A complete environment based on a mixed SystemC/HDL methodology is used to develop and validate a whole application on our NoC-based architecture.
This environment eases the design and integration of IP. It also provides a general framework for the verification and programming of both the individual IP and the whole architecture. Moreover, all the programming done at the design level can be reused directly on the real demonstrator.

B. HDL Lego core
IP core integration follows a data-flow approach. The designer identifies the different parallel input and output flows, and one FIFO is associated with each flow. An ICC block is attached to each input FIFO to merge input flows, and an OCC configuration is attached to each output FIFO to handle the output flows. The RWD block is created by the designer from the address map of the IP core. The designer can also choose the number of configurations handled by the CFM, the complexity of the ICC and the number of OCC configurations, select the interrupts (IT) for the core, and add DUMP and TEST blocks.

C. SYSTEMC/TLM platform
The Transaction Level Modeling (TLM) platform includes a global architecture development framework and the associated programming facilities. It supports both the verification of the global integration and the development of the real application.
The global architecture framework contains the two main components needed to build a NoC-based structure: the nodes (described by an asynchronous event-based model) and the NI.
To address the programming issues, we strictly separate the configuration part of each IP from its computation core.
Thus, each IP is split into a configuration class and a computation class. The configuration class holds the core configurations and the address map of their fields, and provides two methods: a decode method, equivalent to the RWD, and a code method that converts the configurations into packets. The code method is very useful from a programming point of view: once the configuration class has been written for an IP, the IP can be programmed simply by calling this method, without a deep understanding of its internals. The IP designer can also add a method to program or configure the IP. In that case, the IP with its associated NI can be a black box for the integration team.
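The paper's configuration classes are written in SystemC; the following Python analog only illustrates the code/decode pair. The class `OfdmConfig`, its fields, the register layout and the packet dictionary format are hypothetical, chosen so that `code()` and `decode()` are exact inverses.

```python
class OfdmConfig:
    """Hypothetical configuration class for an OFDM modulator IP."""

    # field -> (register address, bit offset, width); illustrative values only
    ADDRESS_MAP = {
        "fft_log2_size": (0x00, 0, 4),
        "cp_length": (0x00, 4, 8),
        "pilot_mode": (0x04, 0, 2),
    }

    def __init__(self, **fields):
        unknown = set(fields) - set(self.ADDRESS_MAP)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        self.fields = {name: 0 for name in self.ADDRESS_MAP}
        self.fields.update(fields)

    def code(self, config_id):
        """Pack the fields into registers and emit one write packet per register."""
        regs = {}
        for name, (addr, offset, width) in self.ADDRESS_MAP.items():
            value = self.fields[name]
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            regs[addr] = regs.get(addr, 0) | (value << offset)
        return [{"config_id": config_id, "addr": a, "data": d}
                for a, d in sorted(regs.items())]

    @classmethod
    def decode(cls, packets):
        """Inverse of code(): rebuild the fields, as the RWD does in hardware."""
        regs = {p["addr"]: p["data"] for p in packets}
        fields = {name: (regs.get(addr, 0) >> offset) & ((1 << width) - 1)
                  for name, (addr, offset, width) in cls.ADDRESS_MAP.items()}
        return cls(**fields)


cfg = OfdmConfig(fft_log2_size=6, cp_length=64, pilot_mode=1)
packets = cfg.code(config_id=3)
```

An integration engineer only calls `code()` with named fields and a configuration identifier; the register layout stays hidden inside the class written by the IP designer.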
The IP core itself can be written in C, C++, SystemC or an HDL. In the latter case, TLM-to-HDL and HDL-to-TLM translators are available for co-simulation, and only the SystemC configuration class has to be developed.
Finally, the SystemC model can be used for different purposes:
- Verifying the IP once wrapped in the NI, using an integration kit (a simple NoC-based testbench).
- Programming the IP.
- Programming the application.
- Analyzing communication performance.
- Developing the application: some IP can be described with a SystemC model, others with an HDL model, and others only with communication flow emulation. All these levels of IP description can coexist in this environment, which allows step-by-step application development.

IV. DEMONSTRATOR DESCRIPTION
In order to validate the concepts described above, we have designed a first NoC-based prototype ASIC dedicated to 4G telecom applications. This architecture contains 23 IP connected to a 20-node network (Fig. 4 and 5), for a total complexity of 8 Mgates (0.13 µm CMOS technology from STMicroelectronics).
Five of the IP come from different partners:
- An ARM946ES core included in an AHB subsystem. The subsystem is connected to the NoC via a specific block composed of an AHB wrapper and an NI.
- Two intensive data processing blocks, a turbo encoder from France Telecom R&D, and a convolutional decoder (Viterbi) from Mitsubishi/ITE.
- A reconfigurable block (DART unit) developed within a collaboration between CEA/LIST and IRISA.
The integration of these IP demonstrates the ability of our framework to handle different kinds of IP, in particular programmable structures (a CPU with a standard protocol) and reconfigurable ones.
For example, the convolutional decoder from ITE/Mitsubishi was integrated into our NoC environment in only one week. The test vectors available before integration were played through the NoC with complete success.

Fig. 4: FAUST chip architecture

V. RESULTS
A typical NI, developed for the demonstrator with both synchronization and reconfiguration capabilities, corresponds to about 10 KGates, excluding the FIFOs. The NI has been synthesized, placed and routed in different units at frequencies above 250 MHz (0.13 µm technology). The NI adds only 2 latency cycles at each of the input and output levels. For the AHB subsystem, the wrapper block costs about 20% of the ARM946 core itself, and proportionally less when the whole subsystem is taken into account.
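A quick worked figure for the NI latency, assuming a transfer crosses the output side of the emitter's NI and the input side of the receiver's NI, both clocked at 250 MHz. Since the paper reports frequencies above 250 MHz, the result is an upper bound on the NI contribution; node hops and packet serialization are not included.

```latex
T_{\mathrm{clk}} = \frac{1}{250\ \mathrm{MHz}} = 4\ \mathrm{ns}, \qquad
L_{\mathrm{NI}} = \underbrace{2}_{\text{output side}} + \underbrace{2}_{\text{input side}} = 4\ \text{cycles}
\;\Rightarrow\; 4 \times 4\ \mathrm{ns} = 16\ \mathrm{ns}
```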
We developed two compatible versions of the nodes: a synchronous one and an asynchronous one that is well suited to Globally Asynchronous Locally Synchronous (GALS) implementation. With a 5-input, 5-output node connected to an IP, the total integration cost is about 18 KGates for the synchronous version (node + NI) and about 45 KGates for the asynchronous version (node + async/sync interfaces + NI).
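As a rough check of what these integration costs mean in relative terms, assume the overhead is measured against the size of the integrated IP, taking for illustration a 100-KGate IP integrated synchronously and a 350-KGate IP integrated asynchronously. The exact basis of the paper's overall percentage is not stated, so the per-IP and pooled ratios are both shown:

```latex
\frac{18\ \mathrm{KGates}}{100\ \mathrm{KGates}} = 18\%, \qquad
\frac{45\ \mathrm{KGates}}{350\ \mathrm{KGates}} \approx 12.9\%, \qquad
\frac{18 + 45}{100 + 350} = \frac{63}{450} = 14\%
```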
It is possible to build a mixed asynchronous/synchronous NoC architecture according to the complexity of the IP and subsystems: integrating a 100-KGate IP synchronously and a 350-KGate IP or subsystem asynchronously corresponds to a 15% area overhead for managing QoS communications, flow synchronization, reconfiguration, IT management, and test and debug aspects.

VI. CONCLUSION
To succeed in designing very complex multi-application SoCs, new high-performance communication structures, coupled with efficient design and programming methodologies, must be set up.
In this paper, we have presented a solution that eases IP integration: both the architecture environment and the integration process have been described. The proposed architecture is organized around a NoC structure that supports communication; this modular backbone brings scalability at the architecture level and flexibility at the application level. A complete development framework based on a TLM methodology was used to help with IP integration and verification, and also to program multiple applications.
A first step toward a fully dynamically reconfigurable platform has been taken thanks to the synchronization and reconfiguration mechanisms.
Finally, a complete chip prototype has been designed to prove the efficiency of the proposed approach; it provides a prototyping platform for multiple telecommunication applications.

Fig. 5: FAUST chip layout and NoC implementation

VII. ACKNOWLEDGMENTS
This work was supported by STMicroelectronics and by the European Commission within FP6, through the IST-2002-507039 4MORE project (4G MC-CDMA multiple antenna System-on-Chip for Radio Enhancements).

VIII. REFERENCES
J. Henkel, W. Wolf, S. Chakradhar, “On-chip networks: a scalable, communication-centric embedded system design paradigm”, in 17th International Conference on VLSI Design, 2004, pp. 845–851.
E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, E. Waterlander, “Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip”, in DATE, 2003, pp. 350–355.
R. Lemaire, F. Clermidy, Y. Durand, D. Lattard, A. Jerraya, “Performance Evaluation of a NoC-Based Design for MC-CDMA Telecommunications using NS-2”, in RSP’05 Int’l Conference, 2005.
“SystemC 2.0.1 Language Reference Manual”, Open SystemC Initiative, http://www.systemc.org
A. Clouard et al., “Using Transaction-Level Models in a SoC Design Flow”, in “SystemC: Methodologies and Applications”, W. Muller, W. Rosenstiel, J. Ruf (eds.), Kluwer Academic Publishers, 2003, pp. 29–63.
E. Beigne, F. Clermidy, P. Vivet, M. Renaudin, A. Clouard, “An Asynchronous NOC Architecture Providing Low Latency Service and its Multi-level Design Framework”, in ASYNC’05 Int’l Conference, 2005.