By Thiago Felski Pereira & Cesar Albenes Zeferino, Univali
The design of a Network-on-Chip (NoC) requires simulation tools to characterize its performance metrics. However, cycle-accurate models are slow, and the simulation of a large system can take several hours of computing. The evaluation time can be reduced significantly by running the performance evaluation experiments on a NoC implemented directly in hardware, typically on an FPGA. This paper presents synthesizable cores for a traffic generator and a traffic meter developed for use in a platform for performance evaluation of NoCs in FPGA. These cores were implemented as single-purpose processors and enable fast performance evaluation of a NoC.
1. INTRODUCTION
The increase in the integration level of silicon components has enabled the building of complex systems on a single chip, with multiple processors, memories, peripherals and I/O controllers. Such systems are named Systems-on-Chip (SoCs). To deal with the complexity of SoC design, designers reuse pre-designed and pre-verified hardware blocks, which are called cores or IPs.
To meet time-to-market requirements, the cores of a SoC are usually interconnected by shared multipoint buses because of their reusability. However, future SoCs, with dozens of cores, will demand communication architectures with better performance. In this scenario, Networks-on-Chip (NoCs) emerge as a solution for intra-chip interconnection. They provide scalable performance and parallelism in communication, meeting the requirements identified for future SoCs.
The design of a NoC involves several phases, such as specification, architectural design, implementation, validation, and performance analysis for the target application. Performance analysis is necessary to explore the design space and tune the network parameters (e.g. channel width and buffer depth) in order to meet the application requirements.
Performance analysis is usually done by simulation at the Transaction Level (TL) or at the Register Transfer Level (RTL). TL offers shorter simulation times, while RTL offers more accurate results, which are needed during the design phase of a NoC. However, characterizing the performance of a large NoC by RTL simulation is time-consuming and requires several hours of computing.
The performance evaluation of a NoC can be sped up by running it directly on hardware instead of on simulation models. This requires synthesizable models for the network routers, traffic generators and traffic meters. Traffic generators inject packets into the network, and the routers forward them to their destinations. Traffic meters monitor the delivery of packets and gather the information needed for performance evaluation.
Two approaches can be used to implement synthesizable traffic generators and traffic meters: software-based or hardware-based. In the first approach, General Purpose Processors (GPPs) are programmed to perform the functions needed to inject packets into the network and collect packets from it. In the second approach, these functions are implemented as non-programmable Single Purpose Processors (SPPs), which are processors specially designed to carry out a specific task. The software-based solution offers more flexibility, while the hardware-based one offers greater throughput, allowing many more cycles of operation to be emulated per second.
This paper presents the design of synthesizable hardware-based cores for on-chip traffic generation and measurement. These cores were implemented for a platform, currently under development, that will speed up the performance evaluation of NoCs. The platform will be composed of traffic generators, traffic meters and the NoC, and the resulting SoC will be implemented and run on an FPGA.
This paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed architecture for on-chip traffic evaluation. Sections 4 and 5 describe the design of the traffic generator and traffic meter cores. Section 6 presents current results, and Section 7 presents the final remarks.
2. RELATED WORKS
Most of the works describing performance evaluation of NoCs are based on simulation. In general, the traffic generator and the traffic meter are modeled at the Transaction Level (TL) using SystemC, and the network components are modeled at the RT Level using SystemC or a hardware description language such as VHDL. Some works use more abstract models, in which all the components are modeled at TL.
The number of works in which NoCs are evaluated in hardware is still small. Wolkotte et al. present a platform for on-chip performance evaluation of a NoC. The traffic generator is implemented in software running on an ARM9 processor, which injects traffic into an FPGA where the NoC is implemented. Only a single router is synthesized on this FPGA, and the full NoC is emulated sequentially. According to the authors, this approach makes it possible to evaluate large NoCs, since the network size is not constrained by the logic density of the FPGA. The on-chip evaluation is 80 to 300 times faster than the evaluation of a SystemC model running on a computer.
Genko et al. use a hardware-based approach for performance evaluation of NoCs in FPGA. Instances of a synthesizable traffic generator (named TG) generate flows of packets and inject them into the NoC. Instances of a traffic meter (named TR, for Traffic Receptor) eject these flows from the NoC and collect data for performance evaluation. TGs can generate flows based on stochastic traffic (Uniform distribution or Burst mode) or on traces of real traffic. As described by the authors, a 1-billion-packet emulation on their platform needs only 3'20" (less than four minutes) to run on chip, while the same experiment would take 6 days to run on a SystemC simulation-driven platform. For these two figures, the ratio is about 2,600, that is, a speed-up of more than three orders of magnitude.
3. ON-CHIP PLATFORM FOR PERFORMANCE EVALUATION
In this work, we present the development of cores for an on-chip platform for fast performance evaluation of the SoCIN NoC, using an approach similar to the hardware-based one discussed in Section 2.
The platform is derived from the one originally used to evaluate the performance of SoCIN with SystemC, as described by Zeferino et al. (WPerformance 2007), and is composed of synthesizable VHDL cores of the following components: (i) a router named ParIS (Parameterizable Interconnect Switch); (ii) a Traffic Generator (TG); and (iii) a Traffic Meter (TM).
As shown in Fig. 1, a TG-TM pair is attached to a terminal of a ParIS router. TG injects packets into the network and collects packets from it, while TM collects data from the packets received by TG, including the information needed to calculate performance metrics such as latencies (minimum, maximum and average) and throughput. In the platform, a set of cores composed of a ParIS router, a TG and a TM can be seen as a system node.
Fig. 1. Platform for on-chip performance evaluation of NoCs.
In the platform, each TG is able to generate several flows with different features, for a single destination or for different destinations. For instance, one flow of a TG can send packets of the same size at a constant bit rate (as in PCM voice channels), and a second flow of the same TG can use a variable bit rate to emulate the sending of compressed video data (as in MPEG flows). Flows with different features are described using a previously proposed traffic model.
Besides the cores described above, the platform will also include a centralized core (named ECB, for Emulation Control Block) responsible for configuring the TG instances, controlling the execution of the experiment (identifying the stop condition) and collecting data from the TM instances. This core has an interface for communication with a software application running on a PC, which is named Supervisor.
Like ECB, Supervisor will be developed in a second phase of this project. It will offer the following functions: (i) configuration of the network; (ii) specification of the traffic pattern; (iii) interface with EDA tools in order to compile the platform and synthesize it into the FPGA; (iv) configuration of TGs; (v) control of experiments; (vi) gathering of data collected by TMs; and (vii) analysis of experiments.
In this paper, we focus on the description of the TG and TM blocks, which are the major cores of the proposed platform. The synthesizable core of the ParIS router is described in the ParIS paper listed in the references.
4. TRAFFIC GENERATOR ORGANIZATION
When a source TG sends packets to a destination TG under the same traffic configuration, these packets are said to be part of a communication flow. A TG can inject more than one communication flow into the network, for the same destination or for different destinations. A flow is described by a set of parameters that defines its descriptor word, shown in Fig. 2, which is composed of seven fields. The first field, the required bandwidth, specifies the injection rate (ranging from 1 to 100%). The second field identifies one flow among several flows of a given source-destination pair. The third field, HLP (Higher Level Protocol), is reserved for future use (for example, to identify classes of traffic). The fourth field is the network address of the destination node. The last three fields describe the number of packets to be sent, their size, and the number of wait cycles between the end of a packet and the beginning of the next packet in the flow.
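The seven-field descriptor can be pictured as a packed bit vector. The sketch below is only an illustration in Python, not the VHDL of the core: the field order follows the description above, the widths are borrowed from the TG configuration reported in Section 6, and `FLOW_ID` width, the MSB-first packing and all names (`FlowDescriptor`, `pack`, `unpack`) are assumptions made for the example.

```python
from dataclasses import dataclass

# Field widths in bits, most significant field first. All values except
# "flow_id" mirror the TG parameters listed in Section 6; the flow
# identifier width is a hypothetical value chosen for this sketch.
FIELD_WIDTHS = {
    "req_bw": 7,        # required bandwidth, 1..100 (%)
    "flow_id": 2,       # assumption: not given in the paper
    "hlp": 9,           # Higher Level Protocol, reserved
    "dest_addr": 8,     # network address of the destination node
    "pck_2send": 20,    # number of packets to be sent
    "pck_size": 10,     # packet size in flits
    "wait_cycles": 12,  # idle cycles between consecutive packets
}
DESCRIPTOR_WIDTH = sum(FIELD_WIDTHS.values())


@dataclass(frozen=True)
class FlowDescriptor:
    req_bw: int
    flow_id: int
    hlp: int
    dest_addr: int
    pck_2send: int
    pck_size: int
    wait_cycles: int

    def pack(self) -> int:
        """Concatenate the fields into one integer, first field in the MSBs."""
        word = 0
        for name, width in FIELD_WIDTHS.items():
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name}={value} does not fit in {width} bits")
            word = (word << width) | value
        return word

    @classmethod
    def unpack(cls, word: int) -> "FlowDescriptor":
        """Split a packed word back into its fields, last field first."""
        fields = {}
        for name, width in reversed(list(FIELD_WIDTHS.items())):
            fields[name] = word & ((1 << width) - 1)
            word >>= width
        return cls(**fields)
```

Packing a descriptor and unpacking the result returns the original fields, and a value that exceeds its field width (for example a packet size of 1024 flits in a 10-bit field) is rejected instead of being silently truncated.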
The organization of TG is shown in Fig. 3. It has a communication interface with a network terminal and a shift in/shift out serial interface, which is used to build a configuration chain through all the TGs and allows ECB to configure the traffic generators. Internally, a TG is composed of an array of Flow Descriptors (FDs), a Flow Generator (FG), a bus interconnecting them, and a bus arbiter.
Each FD is responsible for storing the configuration of a flow and for determining when the cycle to send a new packet has been reached, according to the required bandwidth. This is done by comparing the value of a register (Cycle to Send Next Packet, CSNP) with the value of the Global Cycle Counter (GCC), not shown in the figure, which counts the number of cycles since the experiment was started. When GCC ≥ CSNP, a request is sent to the bus arbiter, which schedules the requests from all FDs and selects one of them to be connected to FG. FG then injects a packet for the selected FD into the network with the following information: (i) the destination and source addresses; (ii) the required bandwidth; (iii) the flow identifier; and (iv) the current value of the CSNP register (i.e. the cycle in which the packet was created). After the packet of the selected FD is sent, its CSNP register is updated to the cycle in which the next packet must be sent, and a register holding the number of packets still to be sent is decremented. When this register reaches 0, the FD stops requesting the sending of new packets.
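The request/grant loop described above can be sketched as a small cycle-level model. This is a behavioral illustration in Python, not the VHDL of the core, and it relies on assumptions that the paper does not state: the arbiter is taken to be round-robin, FG is taken to send one flit per cycle, and CSNP is advanced by the packet size plus the wait cycles. The names `FlowDescriptorState` and `run_tg` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowDescriptorState:
    flow_id: int
    csnp: int         # Cycle to Send Next Packet
    pck_2send: int    # packets still to be sent
    pck_size: int     # packet size in flits
    wait_cycles: int  # idle cycles between consecutive packets

    def requests(self, gcc: int) -> bool:
        # An FD raises a request while it has packets left and GCC >= CSNP.
        return self.pck_2send > 0 and gcc >= self.csnp


def run_tg(fds, max_cycles):
    """Return a list of (send_cycle, flow_id, csnp_stamp), one per packet."""
    sent = []
    last = -1  # index of the FD served last (round-robin pointer)
    gcc = 0    # Global Cycle Counter
    while gcc < max_cycles:
        n = len(fds)
        # Assumed round-robin policy: search starts after the last winner.
        order = [(last + 1 + i) % n for i in range(n)]
        winner = next((i for i in order if fds[i].requests(gcc)), None)
        if winner is None:
            gcc += 1
            continue
        fd = fds[winner]
        # The packet carries the CSNP value, i.e. its creation cycle.
        sent.append((gcc, fd.flow_id, fd.csnp))
        fd.csnp += fd.pck_size + fd.wait_cycles  # assumed update rule
        fd.pck_2send -= 1
        last = winner
        gcc += fd.pck_size  # assumed: FG is busy one cycle per flit
    return sent
```

With two flows that both become ready at cycle 0, the second one is sent later than its CSNP stamp, so the latency measured at the destination also includes the time the packet waited at the source.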
Fig. 2. Descriptor word for communication flows.
Fig. 3. Organization of TG core.
TG can be seen as a subsystem composed of a set of peripherals (FDs) connected to a CPU (FG) by means of a bus, whose access is scheduled by a centralized arbiter. Internally, FG has a control unit based on a Moore FSM and a datapath composed of two adders, two subtractors, two accumulators and five multiplexers. Its datapath also includes a FIFO and a link-level controller for connection to the network. The functional units are used to control the sending of packets and to update registers of the FDs, such as CSNP and the register that stores the number of packets to be sent by the flow.
5. TRAFFIC METER ORGANIZATION
The TM core is attached to a network terminal and monitors the packets arriving at a TG core. As packets arrive, it computes the information needed to determine the performance metrics.
Fig. 4 depicts the datapath of the TM core. It has an input channel to collect data from the network terminal, an interface to the Global Cycle Counter (GCC), and a shift in/shift out serial interface that is used to build the chain of TMs needed to send information to ECB.
When a packet is transferred through the NoC terminal, TM copies its flits to its Input Register. As packets are received, TM computes the packet latency by subtracting CSNP (extracted from the packet trailer) from the current value of GCC. This value is used to compute the maximum latency, the minimum latency and the accumulated latency of the packets received at the network terminal. TM also counts the packets and flits received through the terminal, and accumulates the bandwidth required by each packet (extracted from the packet header). To compute this information, the TM datapath has a number of registers and functional units interconnected as shown in Fig. 4.
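The bookkeeping performed by TM can be sketched as a small accumulator object. This is a behavioral illustration in Python under simplifying assumptions: it works on whole packets instead of individual flits, it ignores the finite register widths of the hardware, and the names `TrafficMeter` and `on_packet` are hypothetical.

```python
class TrafficMeter:
    """Behavioral sketch of the quantities accumulated by a TM."""

    def __init__(self):
        self.min_latency = None  # no packet seen yet
        self.max_latency = 0
        self.latency_acc = 0     # sum of all packet latencies
        self.packets = 0         # packets received
        self.flits = 0           # flits received
        self.req_bw_acc = 0      # sum of the required bandwidth fields

    def on_packet(self, gcc, csnp, n_flits, req_bw):
        """Update the accumulators when a complete packet has arrived.

        gcc     -- Global Cycle Counter value on arrival of the trailer
        csnp    -- creation cycle carried by the packet
        n_flits -- packet size in flits
        req_bw  -- required bandwidth carried in the header (%)
        """
        latency = gcc - csnp
        if self.min_latency is None or latency < self.min_latency:
            self.min_latency = latency
        if latency > self.max_latency:
            self.max_latency = latency
        self.latency_acc += latency
        self.packets += 1
        self.flits += n_flits
        self.req_bw_acc += req_bw
        return latency
```

Keeping only the minimum, the maximum and running sums means the meter needs a fixed number of registers regardless of how many packets are observed.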
When the emulation is finished, ECB collects the information computed by TM and sends it to Supervisor. With this information, Supervisor will be able to determine the accepted traffic (the network throughput) and the average latency for a given offered load.
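How Supervisor could derive these metrics from the values gathered at one terminal is sketched below. Since Supervisor is not yet implemented, this is only an assumption about the post-processing: accepted traffic is expressed in flits per cycle, the offered load is approximated by the mean of the required bandwidth fields, and the function name `summarize` and its parameters are hypothetical.

```python
def summarize(packets, flits, latency_acc, req_bw_acc, cycles):
    """Derive per-terminal metrics from the accumulators of one TM.

    packets     -- number of packets received
    flits       -- number of flits received
    latency_acc -- sum of the packet latencies, in cycles
    req_bw_acc  -- sum of the required bandwidth fields, in percent
    cycles      -- length of the experiment, in clock cycles
    """
    if packets <= 0 or cycles <= 0:
        raise ValueError("at least one packet and one cycle are required")
    return {
        "avg_latency": latency_acc / packets,     # cycles per packet
        "accepted_traffic": flits / cycles,       # flits per cycle
        "avg_required_bw": req_bw_acc / packets,  # percent
    }
```

For example, 3 packets totalling 12 flits with an accumulated latency of 60 cycles over a 100-cycle experiment give an average latency of 20 cycles and an accepted traffic of 0.12 flits per cycle.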
The current version of TM collects data from all the packets arriving at the NoC terminal and is not able to select a given flow for monitoring. This functionality will be included in a future version.
Fig. 4. Organization of TM core.
6. IMPLEMENTATION AND RESULTS
TG and TM were modeled in VHDL and developed using the Altera Quartus II tools. They are fully parameterized cores, a feature that allows models with different capabilities to be generated. For instance, FDs can be customized so that their registers have only the size needed for their flows, in order to save area.
The following VHDL fragment lists the configuration parameters of a TG core instance that can generate up to 4 flows. In this configuration, one flow can generate up to 1 million packets (PCK_2SEND_WIDTH is 20 bits wide), each packet with up to 1 Kflits (PCK_SIZE_WIDTH is 10 bits wide). This configuration takes 937 logic elements of an Altera Cyclone II FPGA with 35K LCs (EP2C35F672C6), that is, 3% of the total logic of the chip.
NB_FLOWS             : NATURAL := 4;
DEST_ADDR_WIDTH      : NATURAL := 8;
HLP_WIDTH            : NATURAL := 9;
REQ_BW_WIDTH         : NATURAL := 7;
NEXT_PCK_CYCLE_WIDTH : NATURAL := 32;
PCK_2SEND_WIDTH      : NATURAL := 20;
PCK_SIZE_WIDTH       : NATURAL := 10;
WAIT_CYCLES_WIDTH    : NATURAL := 12;
CYCLES_COUNTER_WIDTH : NATURAL := 32;
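The limits quoted for this configuration follow directly from the register widths. The short Python check below is only an illustration of that arithmetic: an unsigned register of w bits holds values up to 2^w - 1. The dictionary copies the width parameters of the TG listing, and the helper name `max_value` is hypothetical.

```python
def max_value(width_bits):
    """Largest unsigned value that fits in a register of the given width."""
    return (1 << width_bits) - 1


# Width parameters of the TG configuration listed above.
TG_WIDTHS = {
    "DEST_ADDR_WIDTH": 8,
    "HLP_WIDTH": 9,
    "REQ_BW_WIDTH": 7,
    "NEXT_PCK_CYCLE_WIDTH": 32,
    "PCK_2SEND_WIDTH": 20,
    "PCK_SIZE_WIDTH": 10,
    "WAIT_CYCLES_WIDTH": 12,
    "CYCLES_COUNTER_WIDTH": 32,
}

limits = {name: max_value(width) for name, width in TG_WIDTHS.items()}

for name, limit in limits.items():
    print(f"{name:<22} max = {limit:,}")
```

A 20-bit packet counter therefore allows 1,048,575 packets per flow, a 10-bit size field allows packets of up to 1,023 flits, and the 7-bit bandwidth field is wide enough for the 1 to 100% range.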
The VHDL fragment below lists the configuration parameters of a TM core instance. It allows up to 4 billion clock cycles to be emulated and up to 256 million packets to be counted per TM. This configuration takes 399 logic elements of the same FPGA, that is, about 1% of the total logic of the chip.
DATA_WIDTH           : NATURAL := 32;
REQ_BW_ACC_WIDTH     : NATURAL := 32;
FLITS_ACC_WIDTH      : NATURAL := 32;
PACKETS_ACC_WIDTH    : NATURAL := 28;
LATENCY_ACC_WIDTH    : NATURAL := 32;
MIN_LATENCY_WIDTH    : NATURAL := 8;
MAX_LATENCY_WIDTH    : NATURAL := 16;
CYCLES_COUNTER_WIDTH : NATURAL := 32;
Although TM is inexpensive, the cost of TG in the configuration used is not negligible. Since a 5-port, 32-bit ParIS router with a 4-flit buffer at each input and output channel takes 1,837 LCs, a TG-TM pair (937 + 399 = 1,336 LCs) costs about 73% of a router.
For this configuration, we expect to integrate up to 10 nodes (ParIS + TG + TM = 3,173 LCs each) in an EP2C35F672C6 FPGA running at almost 100 MHz. With more economical configurations, especially for the TGs, this number of nodes can be increased.
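The area figures above can be reproduced with a few lines of arithmetic. The Python sketch below uses the logic cell counts reported in this section; the device capacity of 33,216 logic elements is an assumption taken from the nominal size of the Cyclone II EP2C35 (the text rounds it to 35K), and routing or placement overheads are ignored.

```python
# Logic cell counts reported in this section.
TG_LCS = 937
TM_LCS = 399
ROUTER_LCS = 1837

# Assumption: logic element count of a Cyclone II EP2C35 device.
DEVICE_LCS = 33_216

pair_lcs = TG_LCS + TM_LCS              # one TG-TM pair
pair_vs_router = pair_lcs / ROUTER_LCS  # cost relative to one router
node_lcs = ROUTER_LCS + pair_lcs        # one node: ParIS + TG + TM
max_nodes = DEVICE_LCS // node_lcs      # whole nodes that fit, no overhead

print(f"TG-TM pair: {pair_lcs} LCs ({pair_vs_router:.0%} of a router)")
print(f"Node: {node_lcs} LCs, up to {max_nodes} nodes in {DEVICE_LCS} LCs")
```

Under these assumptions the pair takes 1,336 LCs, a node takes 3,173 LCs, and 10 nodes fit in the device, which matches the estimate given in the text.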
7. FINAL REMARKS
This paper presented a set of cores developed for use in a platform for on-chip performance evaluation of NoCs in FPGA. By using a hardware-based approach, we intend to speed up the performance evaluation of the SoCIN NoC and to widen the alternatives for design space exploration by providing a faster evaluation tool. Currently, we are developing the ECB and implementing a basic version of Supervisor, which will allow the platform to be validated on a development kit.
ACKNOWLEDGMENTS
This project was supported by CNPq (Edital Universal 2006 – Process ID 475768/2006-0) and by University of Vale do Itajaí – Univali (ProBIC research program).
REFERENCES
R. K. Gupta, Y. Zorian, “Introducing core-based system design”. IEEE Design & Test of Computers, v. 14, n. 4, pp. 15-25, Oct.-Dec. 1997.
A. Jantsch, H. Tenhunen (Eds.), Networks on Chip. Kluwer Academic Publishers, Boston, 2003.
C. A. Zeferino, J. V. Bruch, T. F. Pereira, M. E. Kreutz, A. A. Susin, “Avaliação de Desempenho de Redes em Chip Modelada em SystemC”. Proc. WPerformance, pp. 559-578, 2007. (in Portuguese).
L. P. Tedesco, A. V. Mello, D. Garibotti, N. L. V. Calazans, F. G. Moraes, “Traffic Generation and Performance Evaluation for Mesh-based NoCs”. Proc. 18th Symposium on Integrated Circuits and Systems Design - SBCCI, pp. 184-189, 2005.
M. E. Kreutz, C. M. Marcon, L. Carro, F. Wagner, A. A. Susin, “Design Space Exploration Comparing Homogeneous and Heterogeneous Network-on-Chip Architectures”. Proc. SBCCI, 2005.
P. T. Wolkotte, K. F. Hölzenspies, J. M. Smit, “Fast, accurate and detailed NoC simulations”. Proc. International Symposium on Networks-on-Chip, pp. 323-333, 2007.
N. Genko, D. Atienza, G. De Micheli, J. M. Mendias, R. Hermida, F. Catthoor, “A Complete Network-On-Chip Emulation Framework”. Proc. Design, Automation and Test in Europe - DATE, pp. 246-251, 2005.
C. A. Zeferino, A. A. Susin, “SoCIN: A Parametric and Scalable Network-on-Chip”. Proc. SBCCI, pp. 168-174, 2003.
C. A. Zeferino, F. G. M. do Espírito Santo, A. A. Susin, “ParIS: A Parameterizable Interconnect Switch for Networks-on-Chip”. Proc. SBCCI, pp. 204-209, 2004.