Nguyen Huy Nam*, Nguyen Tuan Anh*+, Amir Nakib+, Eric Petit+
* Bull S.A.S.
This paper describes a TLM-based simulation framework developed to support the design, performance estimation and validation of distributed computing systems.
High-Performance Computing (HPC) systems are distributed systems made of tens of thousands of processing nodes communicating across large packet-switched interconnection networks. Dynamic performance evaluation of such systems is essential to assist the micro-architecture design process and to optimize application placement, but it requires specific developments to handle the large amount of resources (data and processing time) involved.
Therefore, end-to-end simulation of HPC systems requires modeling at an abstraction level that speeds up simulation by a significant factor while still exposing enough hardware design detail, and that is flexible enough to scale with the composition of the architecture.
As MPI is the standard used in HPC applications, the input stimuli to our simulation are MPI traces (i.e. traces of the MPI calls made by applications), which can be either generated automatically or extracted from previous real executions.
Those considerations led to the development of a framework named CoSIN (Composition and Simulation of Interconnection Networks) to support TLM modeling of micro-architecture and to provide the ability to execute large benchmarks for the purpose of performance estimation.
Our work can be related to  with basic differences in the TLM router model, the implementation of performance measurements and the choice of SystemC as the simulation environment.
FEATURES OF THE ROUTER MODEL
The CoSIN framework is built around a basic building block: a parameterized SystemC transaction-level model (TLM) of a specific router. The main features of this router include a wormhole routing protocol and the micro-architecture illustrated in Figure 1. The model adopts an approximate time scale and manipulates data at flit granularity (32 B). Other parameters offered by the model for the purpose of micro-architecture investigation include:
- Number of virtual channels;
- Assignment of input to output virtual channels;
- Arbitration strategies based respectively on threshold or age/distance;
- Size of fifos;
- Size of messages.
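As an illustration, the parameters listed above could be grouped into a single configuration record. The Python sketch below is not CoSIN code: the class `RouterConfig`, its field names and default values, and the helper `flits_for_message` are all assumptions made for the example. Only the 32-byte flit size comes from the text.

```python
from dataclasses import dataclass
from math import ceil

FLIT_BYTES = 32  # flit granularity stated in the text


@dataclass(frozen=True)
class RouterConfig:
    """Hypothetical container for the router model parameters."""
    num_virtual_channels: int = 2       # assumed default
    input_fifo_flits: int = 8           # assumed default
    output_fifo_flits: int = 8          # assumed default
    arbitration: str = "threshold"      # or "age_distance"
    max_message_bytes: int = 4096       # assumed default

    def __post_init__(self):
        # Reject configurations that cannot describe a working router.
        if self.num_virtual_channels < 1:
            raise ValueError("at least one virtual channel is required")
        if self.arbitration not in ("threshold", "age_distance"):
            raise ValueError(f"unknown arbitration: {self.arbitration}")


def flits_for_message(size_bytes: int) -> int:
    """Number of 32-byte flits needed to carry a message (at least one)."""
    return max(1, ceil(size_bytes / FLIT_BYTES))
```

Grouping the parameters in an immutable record makes it easy to sweep over combinations when comparing micro-architecture variants.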
The router models can then be assembled automatically, following a topology description (in XML format), to form large network models whose routing strategy is described for each router by means of a routing table.
During simulation, a set of measurements is recorded on the fly (e.g. counters attached to fifos) and used to compute final statistics (e.g. average latency, fifo occupation, etc.). In addition to performance evaluation, the correctness of the routing strategy is also verified on the fly (e.g. complete reception of each message by the appropriate node, absence of deadlock, etc.).
MPI traces input to the simulation can be in either XML or OTF format.
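The assembly step can be pictured with the following Python sketch, which reads a small topology description and builds one routing table per router. The XML schema shown here (elements `router`, `route` and `link` and their attributes) is purely hypothetical, since the actual CoSIN format is not given in this paper, and `load_topology` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical topology description: two routers joined by one link.
TOPOLOGY_XML = """
<topology>
  <router id="r0"><route dest="n1" port="1"/></router>
  <router id="r1"><route dest="n0" port="0"/></router>
  <link from="r0" from_port="1" to="r1" to_port="0"/>
</topology>
"""


def load_topology(xml_text):
    """Return (routing_tables, links) parsed from the assumed XML schema."""
    root = ET.fromstring(xml_text.strip())
    routing_tables = {}
    for router in root.findall("router"):
        # One table per router: destination node -> output port.
        routing_tables[router.get("id")] = {
            route.get("dest"): int(route.get("port"))
            for route in router.findall("route")
        }
    links = [
        ((link.get("from"), int(link.get("from_port"))),
         (link.get("to"), int(link.get("to_port"))))
        for link in root.findall("link")
    ]
    return routing_tables, links
```

Calling `load_topology(TOPOLOGY_XML)` yields the per-router tables and the list of port-to-port links from which a network model could be instantiated.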
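A minimal sketch of such an on-the-fly counter is given below in Python. The class `FifoCounter` and its methods are hypothetical names chosen for the example. The sketch only shows how running sums allow final statistics to be computed without storing every sample, and does not reflect the actual CoSIN instrumentation.

```python
class FifoCounter:
    """Accumulates occupancy samples and flit latencies for one fifo."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = 0
        self.occupancy_sum = 0
        self.saturated_samples = 0
        self.flits = 0
        self.latency_sum = 0

    def sample(self, occupancy):
        """Record the fifo occupancy observed at one sampling point."""
        self.samples += 1
        self.occupancy_sum += occupancy
        if occupancy >= self.capacity:
            self.saturated_samples += 1

    def record_flit(self, enqueue_time, dequeue_time):
        """Record the time one flit spent in the fifo."""
        self.flits += 1
        self.latency_sum += dequeue_time - enqueue_time

    def average_occupancy(self):
        return self.occupancy_sum / self.samples if self.samples else 0.0

    def average_latency(self):
        return self.latency_sum / self.flits if self.flits else 0.0

    def saturation_ratio(self):
        """Fraction of samples at which the fifo was full."""
        return self.saturated_samples / self.samples if self.samples else 0.0
```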
Figure 1: Generic Router Micro-Architecture
THE COSIN FRAMEWORK
As illustrated in Figure 2, the CoSIN environment provides utility components to generate classical regular topologies (3D-mesh, torus, etc.) and to run a variety of simulation strategies automatically (all-to-all, all-to-one or specific traffic patterns).
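To make these utilities concrete, the Python sketch below generates the links of a 3D torus and the source/destination pairs of all-to-all and all-to-one traffic. The function names and data representation are assumptions made for the example and are not taken from CoSIN.

```python
from itertools import product


def torus_3d_links(dims):
    """Return the set of undirected links of a 3D torus of size (x, y, z)."""
    links = set()
    for node in product(*(range(d) for d in dims)):
        for axis, size in enumerate(dims):
            if size < 2:
                continue  # a dimension of size 1 has no neighbour
            neighbour = list(node)
            neighbour[axis] = (node[axis] + 1) % size  # wrap-around link
            links.add(frozenset((node, tuple(neighbour))))
    return links


def all_to_all(nodes):
    """Every node sends to every other node."""
    nodes = list(nodes)
    return [(src, dst) for src in nodes for dst in nodes if src != dst]


def all_to_one(nodes, target):
    """Every node except the target sends to the target."""
    return [(src, target) for src in nodes if src != target]
```

Storing each link as a `frozenset` of its two endpoints avoids counting a link twice when a dimension has only two nodes.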
Figure 2: CoSIN Framework
A FAT-TREE TOPOLOGY CASE STUDY
We describe hereafter an example of using CoSIN to analyze a fat-tree network.
The fat-tree network topology has been generated by a third-party tool and automatically translated into the XML format for CoSIN. The main characteristics of this network topology are: Downlinks [8,8], Uplinks [4,4], Interlinks [1,1] and 16 terminals (cf. Figure 3).
The network features 1024 NIs (Network Interfaces), 112 switch nodes, 20 ports per switch and 2 virtual channels used alternately.
A specific benchmark is used for this study. Only 10% of the benchmark is executed because of the long execution time required (66 h of CPU time). The sequences of messages are input in burst mode at each node (thus rapidly creating saturation). The process placement is performed as a one-to-one assignment of process identifier to NI identifier in the network.
Routing parameters used for the analysis are <qip>_<qop>_<int>_<tresh>[_<age>] with:
<qip> : fifo size / input port
<qop> : fifo size / output port
<int> : Internal fifo size
<tresh> : Threshold for arbitration
<age> : Age for arbitration
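The naming scheme above can be decoded mechanically. The following Python sketch assumes that every field is an integer, which the text does not state explicitly. The names `RoutingParams` and `parse_routing_params` are hypothetical.

```python
from typing import NamedTuple, Optional


class RoutingParams(NamedTuple):
    qip: int                    # fifo size per input port
    qop: int                    # fifo size per output port
    internal: int               # internal fifo size (<int> in the label)
    tresh: int                  # threshold for arbitration
    age: Optional[int] = None   # age for arbitration (optional field)


def parse_routing_params(label: str) -> RoutingParams:
    """Parse a label of the form <qip>_<qop>_<int>_<tresh>[_<age>]."""
    fields = label.split("_")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}: {label!r}")
    return RoutingParams(*(int(field) for field in fields))
```

For instance, a label such as `8_8_16_4` (values chosen arbitrarily for the example) parses with `age` left unset, while a fifth field would populate it.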
This study uses the average latency over all switches, expressed in a unit of time corresponding roughly to a flit transfer through an arbiter, as the criterion for comparing different combinations of parameters.
Figure 3: A fat-tree topology
Table 1: Variation of average latency vs routing parameters
Global counters tracing the status of fifos seem to indicate a relatively high frequency of saturation of those fifos during the benchmark (stimuli fed in burst mode).
In order to observe the traffic related to this benchmark, we plot in Figure 4 the number of flits crossing each router (1-120).
Figure 4: Distribution of flits across routers
The same benchmark has been run over a shorter interval of time in order to assess the influence of placement on performance. The one-to-one assignment between process ID and NI ID used previously has been compared against a random placement, and the results demonstrate the importance of placement.
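The two placements being compared can be expressed as mappings from process identifier to NI identifier. The Python sketch below is illustrative only: the function names and the use of a seeded random permutation are assumptions, not a description of how CoSIN performs placement.

```python
import random


def identity_placement(num_processes):
    """One-to-one placement: process i runs on NI i."""
    return {rank: rank for rank in range(num_processes)}


def random_placement(num_processes, seed=0):
    """Random placement: a seeded permutation of the NI identifiers."""
    nis = list(range(num_processes))
    random.Random(seed).shuffle(nis)  # seeded for reproducible experiments
    return dict(enumerate(nis))
```

Both functions return a bijection, so every process is assigned exactly one NI and no NI is shared. Fixing the seed makes a random placement repeatable across simulation runs.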
We describe hereafter some results obtained from the application of CoSIN to an industrial benchmark:
- Network features
  - Network topology: fat-tree with 448 routers and 4 NIs per terminal router (=> 1024 NIs)
  - Total number of routers: 448, with 8 up-links and 16 down-links at each router
  - 4 virtual channels: 1 dedicated to acknowledgements, 3 VCs (randomly selected) for requests
- Results of the benchmark
  - Random placement of processes
  - Messages: variable maximal size; average distance = 4.8 hops
  - Variation as a function of the number of links (1 and 2 links)
Figure 5a: Latency vs multiple links (Message of 64 MB max)
Figure 5b: Latency vs multiple links (Message of 126 MB max)
Figure 6: Distribution of average latency
This paper describes the virtual prototyping of a router and its integration within the CoSIN simulation framework, which is dedicated to the validation of routing strategies in relation to network topologies and to performance evaluation by dynamic simulation of MPI traces. Preliminary results obtained on real industrial benchmarks are presented to illustrate the features of CoSIN. In order to handle larger configurations (10^4 nodes) with better simulation performance, we are working on parallel execution of SystemC with communication by MPI.
A Framework for End-to-End Simulation of High-performance Computing Systems, Wolfgang E. Denzel et al., SIMUTools’08, March 3-7, 2008, Marseille.
A full system simulation platform, Magnusson et al., IEEE Computer, Vol. 35, No. 2, Feb. 2002.
Predicting MPI Applications Behavior in grid environments, Badia et al., Proceedings of the Workshop on Grid Applications and Programming Tools (GGF’03), 2003.