Why 400G/800G and Beyond Ethernet for High-Performance Computing Systems

By Jerry Lotto, Sr. Technical Marketing Manager, Synopsys
EETimes (October 11, 2021)

Over the last decade, workflows on high-performance computing (HPC) systems have greatly diversified, often blending AI/ML processing with traditional HPC. In response, a wide variety of specialized HPC computer systems (cluster nodes) have been designed and deployed to optimize performance for specific applications and frameworks. Queues that target these systems separately allow each user to instruct the batch scheduler to dispatch jobs to hardware closely matched to their application’s computational requirements. High-memory nodes, nodes with one or more accelerators, nodes supporting a high-performance parallel filesystem, interactive nodes, and hosts designed to support containerized or virtualized workflows are just a few examples of specialized node groups developed for HPC. Nodes might also be grouped in queues by the way they are interconnected.
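As a rough illustration of how queues map jobs onto specialized node groups, the following is a minimal Python sketch. The queue names, memory sizes, and GPU counts are hypothetical and are not taken from any particular scheduler or site; real batch schedulers expose this through their own configuration and submission options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Queue:
    """A hypothetical queue describing the node group behind it."""
    name: str
    mem_gb: int          # memory available per node
    gpus: int            # accelerators per node
    interactive: bool = False


# Hypothetical queue definitions, for illustration only.
QUEUES = [
    Queue("general", mem_gb=192, gpus=0),
    Queue("highmem", mem_gb=1536, gpus=0),
    Queue("gpu", mem_gb=384, gpus=4),
    Queue("interactive", mem_gb=96, gpus=0, interactive=True),
]


def pick_queue(mem_gb, gpus=0, interactive=False, queues=QUEUES):
    """Return the name of the least specialized queue that fits the job, or None."""
    candidates = [
        q for q in queues
        if q.mem_gb >= mem_gb and q.gpus >= gpus and q.interactive == interactive
    ]
    if not candidates:
        return None
    # Prefer fewer GPUs, then less memory, so specialized nodes stay free
    # for the jobs that actually need them.
    return min(candidates, key=lambda q: (q.gpus, q.mem_gb)).name


if __name__ == "__main__":
    print(pick_queue(mem_gb=64))            # a small CPU-only job
    print(pick_queue(mem_gb=512))           # a memory-hungry job
    print(pick_queue(mem_gb=128, gpus=2))   # an accelerated job
```

In practice the user usually names the queue explicitly at submission time; the sketch only shows the matching logic that choice reflects.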

The density and traffic requirements of the interconnected systems in a data center hosting an HPC cluster require topologies like the spine/leaf architecture, as shown in Figure 1. This picture becomes even more complex if HPC systems grow beyond the capacity of a single location and are distributed among multiple buildings or data centers. Without such a topology, traffic patterns involving inter-process communication, interactive access, shared filesystem I/O, and service traffic like NTP, DNS, and DHCP, some of which exhibit strong latency sensitivity, would have to compete for available bandwidth. Connectivity using the spine/leaf architecture addresses this problem by enabling routing algorithms that can provide a unique and unfettered path for any node-to-node communication.
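To make the path diversity of a spine/leaf fabric concrete, here is a minimal Python sketch. It assumes a simple two-tier fabric in which every leaf switch connects to every spine switch; the device names and fabric sizes are hypothetical, and the sketch only enumerates candidate paths rather than modeling any specific routing protocol.

```python
def build_leaf_spine(num_spines, num_leaves, nodes_per_leaf):
    """Model a two-tier fabric in which every leaf links to every spine.

    Returns the list of spine names and a mapping from node name to its leaf.
    """
    spines = [f"spine{s}" for s in range(num_spines)]
    leaf_of = {}
    for leaf in range(num_leaves):
        for node in range(nodes_per_leaf):
            leaf_of[f"node{leaf}-{node}"] = f"leaf{leaf}"
    return spines, leaf_of


def paths(src, dst, spines, leaf_of):
    """List every shortest path between two nodes as a sequence of hops."""
    src_leaf, dst_leaf = leaf_of[src], leaf_of[dst]
    if src_leaf == dst_leaf:
        # Nodes on the same leaf never need to cross a spine.
        return [[src, src_leaf, dst]]
    # Otherwise there is one equal-length path through each spine.
    return [[src, src_leaf, spine, dst_leaf, dst] for spine in spines]


if __name__ == "__main__":
    spines, leaf_of = build_leaf_spine(num_spines=4, num_leaves=3, nodes_per_leaf=2)
    for path in paths("node0-0", "node2-1", spines, leaf_of):
        print(" -> ".join(path))
```

Under these assumptions, any two nodes on different leaves have as many equal-length paths as there are spines, which is what gives a routing algorithm room to spread flows so they do not contend for the same links.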

HPC is now further evolving from nearly exclusively purpose-built on-premise infrastructure to hybrid or even fully cloud-resident architectures. Over the last couple of decades, the high cost of building, operating, and maintaining infrastructure to host dedicated HPC has challenged many government labs, companies, and universities to rethink the strategy of purpose-built HPC. Instead of purchasing the space, racks, power, cooling, data storage, servers, and networking required to build on-premise HPC clusters, not to mention the staff and expense to maintain and update these systems, all but the largest HPC practitioners are moving to a more usage-based model from cloud providers who offer HPC services. These changes have spurred a refocused investment in the internet connectivity and bandwidth needed to enable cloud bursting, data migration, and interactivity on cloud-resident infrastructure. The move to the cloud also creates new challenges for developers who work to establish custom environments in which to develop and run application frameworks, often engendering complex software version interdependencies. The use of containerization has helped to isolate many of these software and library dependencies, making cloud migration simpler due to relaxed host image constraints.
