Store-and-Forward Fat-Tree Architecture for on-Chip Networks

by Filippo Mondinelli, University of Brescia
Brescia, Italy

Abstract :

This paper presents a Network-on-Chip architecture based on a store-and-forward switching mode suitable for System on Chip interconnection. The network includes routers which dispatch messages, interfaces which generate messages, and links between routers and interfaces. A fat-tree topology is used. Fat-tree was found to be the most suitable for interconnecting next-generation VLSI systems. The network system is simulated and analyzed with respect to different parameters like message length and packet scheduling policy. Finally results are shown discussing network latency and throughput.

I. INTRODUCTION

Increasing complexity of electronic systems and increasing demand of computational-intensive, multimedia functionalities are pushing research towards highly-parallel multiprocessor and multicore systems on-chip. This leads to difficulties in achieving design closure and impacts time-to-market. Modular, wide bandwidth on-chip interconnect networks able to overcome communication and synchronization problems in large multiprocessor-multicore systems are object of major interest.

The reusable communication templates normally used in current Systems on Chip (SoCs) are based on the bus approach, using either a single shared bus or a hierarchy of buses. However, such approach has strong drawbacks. Buses do not scale with the system size as the bandwidth is shared by all the components attached to it. Furthermore, as the number of interconnected blocks increases, the load capacitance grows, degrading bus operating frequency and energy efficiency.

Networks on a chip (NoCs) are emerging as an alternative to existing on-chip interconnects [1] They structure and manage global wires in new deep-submicron technologies [2], are reusable in different scenarios, can be energy efficient, reliable [3], and are scalable when compared to traditional buses [1].

NoC is naturally oriented towards packet switching mode. It provides a best-effort service [4] and more flexibility than circuit switching mode, but doesn't guarantee any timely delivery. So packet latency becomes one of the most important constraints in a NoC development.

Previous NoC designs have been developed privileging different aspects:

the flexibility, decoupling the computation from communication [4], so improving reuse in different contexts;
the specificity of the architecture [5] that is focusing on high-efficiency.

In the latter case, the control logic inside each router and the network protocol can be optimized on the chosen topology, leading to a better power efficiency and less reusability.

In this work we introduce a structured analysis and design methodology whose goal is the usage of NoCs in next-generation multiprocessor SoC designs. In particular, we envision the usage of a large NoC to build the communication backbone of complex highly-parallel multi-processor and multi-core SoCs.

Clearly the integration of a complex NoC cannot simply be treated as the integration of an IP. Even starting from the same template architecture, there are a number of different design tradeoffs that need to be properly optimized for the specific application.

Fig. 1 shows the reference flow for the integration of NoCs into integrated systems. At the top level is the definition of the reference network architecture. At this stage network topology and routing mode are defined, while other choices like packet delivery policy, parallelism, etc. are left as parameters.

From the architecture template a network simulator and an optimized parametric implementation in synthesizable HDL are derived. Application specs come from system-level considerations of the target SoC systems and are application specific. The simulator is used to evaluate different possible choices and obtain a set of parameters describing the best solution in the space of possibilities offered by the reference architecture. This set of parameters is then used to drive the implementation of the network.

Fig. 1. Design flow.

This work discusses the architecture of a fat-tree network based on a store-and-forward routing-mode. This is taken as the reference template architecture for systems with many processing elements (PEs) working in parallel. Each PE is supposed to have multitasking capability. It could manage potentially different messages coming from different sources at the same time. Each message is composed by packets. The dimension of a packet is assumed to be 128 bits.

The simulator is an ad-hoc C++ program. It is used to assess network performance by simulating parameterized fat-tree networks under user-defined traffic conditions. This paper is structured as follows. In section II, the architecture is presented, motivating the choices about the topology, the routing mode and the packet deliverance. In section
III the structure of the router is presented. In particular, are those features that directly impact on the global network performance explained as the internal packet latency. The two packets scheduling modes used in simulation are defined. Section IV introduces some network parameters used in the simulation and the network behaviors are measured. Section V shows some experimental results. Section VI presents some concluding remarks.

II. THE ARCHITECTURE

The architecture used in this work is based on two components: routers and interfaces to processing elements (PEs). Routers are modules that route the packets stored in their input buffers towards the output buffers. Routers are connected to other routers and interfaces to PEs by parallel link built of two one-way 32-bit data-paths.

PEs are the leaves of the network and each one is connected to one router only. They generate and receive packets.
Packets traveling through the network consist of three parts:

the header: contains the source address, the destination address, and some control bits;
the payload contains the transported data;
the tail contains the error checking or correction code.

For the purpose of architectural analysis the only useful information comes from source and destination addresses.
The control and the correction code fields are assumed to be small compared to the address field length thanks to the simplicity of the chosen architecture. The size of packets can be increased in future works to achieve better power efficiency.

A. Topology

The network topology determines the complexity of the distributed routing decisions, the complexity of the physical implementation and the power efficiency of the system. Regular topologies like meshes or hypercubes make the routing decision functions simple and the same for each router. Using hypercubes extremely large systems could be prohibitive [6]. The topology used in this work is the Fat-Tree (Fig. 2) because it has been formally proven that it is the most cost-efficient for VLSI realizations [7]. Fat-Trees are represented by a tree with the characteristic that the capacity of the network closer to the root is larger than towards the leaves. They are organized with router nodes at the root and intermediate nodes.
Fat-Trees are deadlock free [8] regardless of the packet routing policy (II B) while meshes and other cyclic topologies are not [9].

Fig. 2. Fat-Tree with 3 levels and 4 ports for each intermediate level router.

B. Routing mode

The routing mode influences the buffer size needed in routers and the performance of the system like packet latency. In packet switching networks data items have to be at least partially buffered at each router before they are sent over.
There are two basic types of routing modes:
store-and-forward and wormhole routing. In store and forward routing a node has to store the entire packet before it can forward any part of it along the next link. In wormhole routing messages are sent as worms, each of which consist of a sequence of fixed size parts called flits. Since only the first flit contains the destination of the worm, in order to avoid confusion at the links the flits must travel in a contiguous fashion. This model has some advantages compared to store-and-forward routing, like smaller input buffers, and some disadvantages.
In this work the wormhole routing is not considered because its primary drawback is the contention that can occur even with moderate traffic leading to higher message latency [10] in absence of virtual channels.

C. Pachet delivery

The main purpose of a routing scheme is to deliver the largest number of packets through the network at the shortest possible delay. This goal is mainly achieved by distributing as much as possible the packets over the network. This strategy limits collisions at the routers reducing latency, and takes advantage of the parallelism of the architecture increasing the throughput. The use of an adaptive routing scheme is a possibility to obtain this result. This is more flexible than an oblivious routing but it is also more complex [10].

The scheme used in this work takes advantage of the simplicity of the chosen topology to deliver packets over the network. It implements an oblivious routing that ensures a good packets distribution in the case of network congestion. Any path, covered by a packet in this hierarchical topology can be split into an ascending phase (i.e. towards the root) and a descending phase (i.e. towards the leaves). In this strategy, the path chosen by a packet in the descending phase is univocally defined by the destination, while on the ascending phase it is fixed by the source. Choosing a particular path in the ascending phase is not compulsory because every path takes to the correct destination.

This strategy reduces also the probability that two packets from different PEs may use the same links. As the source-destination couple defines a unique path along the network, every packet arrives always in the correct order. This is important when a PE wraps a long message in several packets which should arrive in the same order of their creation.

III. THE ROUTER

The router is the active component of the network and is key to achieve good network performances. The router developed in this project is designed to be as simple as possible to reduce area and to increase energy efficiency.

The core of a router basically consists of (a) input ports to negotiate packets from other routers and store them in buffers, of (b) output ports which receive packets from an input port of the same router and send them to the input ports of another router, of (c) routing control logic blocks that decide the route of the packets inside the router and of a (d) crossbar which links the input and output buffers (Fig. 3). Each output port has an independent routing control logic block to control the flow of packets to that port.

A. Latency of internal packets

The latency of internal packets is the time spent by a packet inside a router. This is due to 3 operations: packet buffering, negotiation of an output port and the transmission to the output port. The packet buffering takes always a fixed amount of time if the routing modes mentioned above are considered for a given number of bits per packet.
The time spent in negotiation and transmission depends on the contention of the selected output port and on the availability of the outside link. In the router the negotiation begins during packet buffering. If the input buffer of the next router is available, the internal delay is due to the buffering and transmission activities only. If the packet is stopped after the buffering step, it takes one additional clock cycle to move outside the router.

B. Scheduling of internal packets

During the process of packet buffering, each input port of a router transmits a request of transmission} signal to the appropriate routing control logic block.

This decides, using the scheduling algorithm, which port will transmit the packet to the output sending a grant signal (Fig. 4). In our architecture two scheduling algorithms are considered: Random Scheduling; Priority Scheduling. The first algorithm chooses randomly an input port from those which have requested a packet transfer. The second one chooses the input port containing the packet closer to its destination. The latter policy avoids the deadlock of the network [9]. This algorithm implements also a fair policy to limit the internal latency of packets which are more far from the destination. One of the main differences between the random scheduling and the priority scheduling is the complexity of the implementation. The first is simpler and needs only request lines from input ports to the routing control logic and grant lines from routing control logic to the input ports. The second needs also control lines to transmit the priority of a packet and to update it in case of a wait condition. So the first scheme dissipates less energy per packet and needs less area. Both require the same amount of time to route packets.

Fig. 3. Router structure

Fig. 4. Negotiation step in a router with 4 ports

IV. THE SIMULATION ENVIRONMENT

The reference network customized for the simulations contains 64 PEs and 48 routers distributed on 3 levels. Each router at the root (higher level) has 4 input ports and 4 output ports towards the adjacent lower level. Each intermediate router has 8 input ports and 8 output ports, 4 towards the adjacent lower level and 4 towards the adjacent upper level. Each port of a router is connected by a link to a port of another router or PE. Each link consists of a 32-bit wide data bus and two control lines managing the packet transmissions between routers.

Messages are made of N packets sharing the same destination. Packets of the same message are created as soon as the output link is available for transmission. Packets of different messages are
created with probability Pâ€™:

where P is the probability to generate a single packet message. This formula ensures that the same number of packets is generated in a given time frame independently of the parameter N. In the diagrams all the results are normalized to P.

Both internal packets scheduling algorithms (III. B) are simulated with $N=8$ and $N=32$ varying $P$ from 0.02 to 0.8. The simulation time is 5â€¢105 clock cycles.

The results of interest are:

Throughput (Fig. 5): it is obtained by the total number of bits per clock cycle arrived to all PEs divided by the maximum bit rate that can be produced by all PEs of the system (64 × 32 = 2048 bits/clock cycle)
Network congestion (Fig. 6): the average time period where a PE cannot transmit a packet being busy the linked input port of the router.
Average latency (Fig. 7): the average time required to deliver a packet from the source to the destination.
Maximum latency

Fig. 5. Global throughput.

Fig. 6. Network congestion.

Fig. 7. Network latency.

V. EXPERIMENTAL RESULTS

The results show that the performance of the networks with different internal packet scheduling schemes are nearly the same (Fig. 5). The reason of this is that priority scheduling delivers packets quickly along short paths while long path packets are slow and fill the network. More packets in the network mean more packets in each router which takes more advantage of the parallelism of the routing control logic. The random scheduling treats each packet exactly in the same way so each packet goes along its path at the same average speed and doesn't saturate the network. The congestion diagram shows the same behavior from the sources point of view (Fig. 6), in fact more packets in the network results in less time available to let new packets enter the network. The average packet latency of the priority scheduling network is greater because more packets in the network implies also more collisions. The maximum latency of a packet measured in the priority scheduling network (Fig. 7) is about 900 clock cycles and in the random scheduling network it is about 1000 clock cycles. This shows that fair policy doesn't guarantee the lack of packets with high latency.

It can be noted that increasing N means a degradation of the network performances because each message represents a burst of higher correlated data that the network is requested to manage. The dispersion of packets is reduced (lower throughput) (Fig. 5) because of the increased correlation between packets (more contention and more latency) (Fig. 7). Increasing the message length from 8 packet per message to 32 reduces the throughput by 6% while the packet latency increasing by 4% showing a small degradation of network performance.

VI. CONCLUSIONS

Performance evaluation of a store-and-forward Fat-Tree network for Soc were shown. The system was simulated varying the length of the messages and changing the packet scheduling policy. Each simulation takes 5â€¢105 clock cycles. In the case of priority scheduling and messages made of 8 packets the throughput reaches the value of 28% of the bandwidth with an average latency of 97 clock cycles. Using the random scheduling, throughput is reduced by about 5% (in the worst case) and the latency is improved by about 3%. The throughput degradation obtained using messages made of 32 packets instead of 8 is about 6% while the latency increases by 4%. The results show no critical behaviors in any of the four networks which were investigated in this research and they are considered in a future silicon implementation.

REFERENCES

[1] P. Guerrier, A. Greiner, A generic architecture for on chip packet-switched interconnections, DATEâ€™2000.

[2] L.Benini, G. De Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer, Jan. 2002, pp.70-78.

[3] L.Benini, G. De Micheli, Powering networks on chips. ISSS,2001.

[4] E. Rijpkema, K.G.W. Goossens, A. R¢adulescu, J. Dielissen, J. van Meerbergen, P. Wielage, E. Waterlander, Trade Offs in theDesign of a Router with Both Guaranteed and Best-Effort Services
for Networks on Chip, DATEâ€™2003.

[5] Adrijean Adriahantenaina, Hervé Charlery, Alain Greiner Laurent Mortiez, Cesar Albenes Zeferino, SPIN: a Scalable, Packet Switched, On-chip Micro-network, DATEâ€™2003.

[6] Russ Miller, Quentin F. Stout. Algorithmic techniques fornetworks of processors, CRC Handbook of Algorithms and Theory of
Computation, 1998, pp. 46:1-46:19.

[7] C. Leiserson, Fat-Trees: Universal Networks for
Hardware-Efficient Supercomputing, IEEE Transaction on Computers, vol. C-34, no. 10, pp. 892-901, October 1985.

[8] Weihaw Chuang, Wilson Chin, K-ary n-cubes versus Fat Trees, EE482 Research Paper, Spring 1999.

[9] William J. Dally, Charles L.Seitz, Deadlock Free Routing in Multirocessor Interconnection Networks, 1985.

[10] Christian Scheideler Universal Routing Strategies forInterconnection Networks, LNCS1390, Springer 1991.

Industry Articles

Store-and-Forward Fat-Tree Architecture for on-Chip Networks