By PLDA, Inc.
Modern enterprise workloads in AI and data analytics are driving the need for new compute and storage architectures in IT infrastructures. The growing use of accelerators (GPUs, FPGAs, custom ASICs) and emerging memory technologies (3D XPoint, Storage Class Memory, Persistent Memory), and the need to better distribute and utilize these resources are fueling the transition to composable/disaggregated infrastructures (CDI) in data centers.
With this changing landscape, a number of interconnect protocols have emerged (NVMe-oF, CCIX, Gen-Z, CXL), promising to address the challenges introduced by the composability model. While these interconnect technologies mature and make their way toward mainstream adoption, system vendors still have various options that leverage the well-established PCI Express protocol to enable scale-up and scale-out composable fabrics.
In this article, we describe the most common options and present a new trend that involves combining on-chip PCIe switching and PCIe transport over cable to build intelligent, scalable, high-performance composable systems.
2. Fabric Composition with PCIe Switch ICs
PCIe switch semiconductor integrated circuits (ICs) have been available for over a decade, from vendors like PLX (Avago) and Microsemi (Microchip). These PCIe switch ICs have evolved to allow a variety of use models, ranging from simple PCIe fanout expansion using PCIe transparent switches, to more complex PCIe fabric topologies using Non-Transparent Bridging (NTB) as illustrated in Figure 1.
Figure 1 - Example PCIe Switch topologies
While OEMs, ODMs, and many system vendors widely employ transparent PCIe switches for fanout expansion, non-transparent fabric switches are intrinsically more complex, rely on custom NTB software to operate, and are therefore more difficult to integrate and deploy. Even as companies like Liqid and Dolphin Interconnect Solutions bring PCIe fabric-based solutions to market, using discrete switch ICs to build PCIe interconnect fabrics presents several limitations:
- Application-Specific Standard Products (ASSPs) have a fixed architecture and feature set, which means they are likely to be under-utilized or not an exact fit for the application
- These ICs often trail new revisions of the PCIe Specification by a minimum of 18 to 24 months, and hence may not provide the latest performance and features
- They increase the overall Bill of Materials (BoM), increase the power budget, and introduce new points of failure in the system
- It becomes increasingly difficult for system vendors to build differentiated products when these products are architected around the same ASSPs
For those technology companies that are designing their own chips, it makes sense to look at embedding PCIe switching capabilities into their SoCs as a way to differentiate, future-proof their designs, and implement the exact feature set required by their applications.
3. Fabric Composition with On-Chip Switch IP
SoC architects now have the option of integrating PCIe switch IP into their designs. The main benefits to this approach are:
- The switch IP can be integrated with the SoC’s CPU complex and memory subsystem, thereby reducing the latency and footprint of the solution
- The switch IP can be configured to the exact requirements of the application. Configuration options typically include: number of ports, number of lanes per port, PCIe link speed per port, peer-to-peer communication, port arbitration, low-power support, etc.
- In-line packet processing functions can be added (if supported by the switch IP), such as packet filtering, packet inspection, encryption, and other functions to offload the system’s CPU
- Embedded endpoint functions (such as NIC functions or custom accelerators) can be implemented to optimize costs (BoM, silicon) and to reduce latency and power, because no PCIe PHY is needed between the switch and the embedded endpoint, as shown in Figure 2.
- Architects can further modify the switching logic to tailor or add features, such as custom port arbitration schemes, custom routing tables/rules, etc.
Figure 2 - SoC with transparent switch IP and embedded endpoint
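To make the configuration options listed above more concrete, the following Python sketch models a switch as one upstream port plus a set of downstream ports and estimates how heavily the upstream link is oversubscribed. The class names (`PortConfig`, `SwitchConfig`), the set of allowed link widths, and the example port mix are assumptions made for illustration; they do not describe the configuration interface of any particular switch IP.

```python
from dataclasses import dataclass

# Per-lane signaling rate in GT/s for each PCIe generation.
GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}
# Line-encoding efficiency: 8b/10b for Gen1-2, 128b/130b for Gen3 and later.
ENCODING = {1: 0.8, 2: 0.8, 3: 128 / 130, 4: 128 / 130, 5: 128 / 130}
# Commonly used link widths (an assumption for this sketch).
VALID_WIDTHS = {1, 2, 4, 8, 16}


@dataclass(frozen=True)
class PortConfig:
    """Hypothetical per-port settings: lane count and PCIe generation."""
    lanes: int
    gen: int

    def __post_init__(self):
        if self.lanes not in VALID_WIDTHS:
            raise ValueError(f"unsupported link width x{self.lanes}")
        if self.gen not in GT_PER_S:
            raise ValueError(f"unsupported PCIe generation {self.gen}")

    def gbytes_per_s(self) -> float:
        """Line-rate bandwidth per direction in GB/s, before packet overhead."""
        return GT_PER_S[self.gen] * ENCODING[self.gen] * self.lanes / 8


@dataclass(frozen=True)
class SwitchConfig:
    """Hypothetical switch: one upstream port and several downstream ports."""
    upstream: PortConfig
    downstream: tuple

    def oversubscription(self) -> float:
        """Ratio of total downstream bandwidth to upstream bandwidth."""
        total_down = sum(p.gbytes_per_s() for p in self.downstream)
        return total_down / self.upstream.gbytes_per_s()


cfg = SwitchConfig(
    upstream=PortConfig(lanes=16, gen=4),
    downstream=(
        PortConfig(lanes=4, gen=4),
        PortConfig(lanes=4, gen=4),
        PortConfig(lanes=8, gen=4),
        PortConfig(lanes=8, gen=3),
    ),
)
print(f"upstream: {cfg.upstream.gbytes_per_s():.2f} GB/s per direction")
print(f"oversubscription ratio: {cfg.oversubscription():.2f}")
```

A ratio above 1 means the downstream ports could collectively request more bandwidth than the upstream link provides, which is where the choice of port arbitration scheme starts to matter.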
The capabilities of the endpoint functions can be further expanded with the use of virtualization, allowing resource sharing among multiple Virtual Machines.
Multiple host domains can be supported by instantiating multiple switch IP instances in the SoC, along with an NTB mechanism that allows communication across the different PCIe domains, as shown in Figure 3.
Figure 3 - SoC with two PCIe domains connected via NTB
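At its core, NTB communication between two PCIe domains relies on address translation: an access that falls inside a window in one domain is forwarded to a different base address in the peer domain. The Python sketch below illustrates only that idea. The `NtbWindow` class and the example addresses are hypothetical, and a real NTB implementation also has to handle aspects not modeled here, such as requester ID translation and signaling between hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    """Hypothetical translation window exposed by an NTB endpoint."""
    local_base: int   # start of the window in the local PCIe domain
    size: int         # window length in bytes
    remote_base: int  # where the window lands in the peer PCIe domain

    def translate(self, addr: int) -> int:
        """Map a local-domain address to the matching peer-domain address."""
        offset = addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {addr:#x} is outside the NTB window")
        return self.remote_base + offset


# Example values chosen for illustration only: a 1 MiB window.
window = NtbWindow(
    local_base=0x9000_0000,
    size=0x10_0000,
    remote_base=0x4_0000_0000,
)
print(hex(window.translate(0x9000_1000)))  # prints 0x400001000
```

Accesses outside the window are rejected in this sketch, which mirrors the isolation between domains that distinguishes NTB from a transparent switch.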
4. Expanding Server Reach with PCIe-over-Cable
At 16GT/s, PCIe 4.0 signals can travel only 3 to 5 inches on standard FR4 PCB, and that is without passing through any connector. Moving to MEGTRON 6 PCB material and adding retimer ICs extends the reach, but at a significant cost increase.
With the commoditization of optical communication, it is now possible to deploy, at scale, the infrastructure necessary to transport PCIe 4.0 signals at 16GT/s over hundreds of feet.
Our latest experiment, pictured in Figure 4, has shown a PCIe 4.0 x4 link connecting two peers over 330 feet of optical cable with a slight latency penalty but no impact on data throughput.
Figure 4 - PCIe switching over optical cable demo setup
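A back-of-the-envelope calculation helps put the cable length in perspective. The Python sketch below estimates the line-rate bandwidth of a PCIe 4.0 x4 link and the propagation delay contributed by 330 feet of fiber. The refractive index of 1.47 is an assumed typical value for silica fiber, and the estimate covers the cable only; it does not include transceiver or retimer latency and is not the latency measured in the experiment.

```python
FEET_TO_M = 0.3048
C_VACUUM_M_PER_S = 299_792_458
FIBER_INDEX = 1.47  # assumed typical refractive index of silica fiber


def link_bytes_per_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Line-rate bandwidth per direction in bytes/s, before packet overhead."""
    return gt_per_s * 1e9 * lanes * encoding / 8


def fiber_delay_s(length_ft: float, index: float = FIBER_INDEX) -> float:
    """One-way propagation delay through a fiber of the given length."""
    return length_ft * FEET_TO_M * index / C_VACUUM_M_PER_S


bw = link_bytes_per_s(16, 4)      # PCIe 4.0 x4
one_way = fiber_delay_s(330)      # 330 feet of optical cable
in_flight = bw * 2 * one_way      # bytes in transit during one round trip

print(f"line rate:        {bw / 1e9:.2f} GB/s per direction")
print(f"one-way delay:    {one_way * 1e9:.0f} ns")
print(f"round-trip bytes: {in_flight:.0f}")
```

Under these assumptions the cable adds roughly half a microsecond each way, and the amount of data in flight over one round trip is on the order of 8 KB. One plausible reading of the result above is that throughput stays unaffected as long as the link partners can keep at least that much data outstanding, while each individual transaction still sees the added delay.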
5. Putting the Pieces Together
We are seeing an increase in the number of IC designers looking to integrate intelligent switching capabilities into their PCIe-based SoCs. For the type of architecture outlined in Figure 2 and Figure 3, PLDA XpressSWITCH transparent switch IP is the de facto solution, deployed since 2016. XpressSWITCH key features include:
- Support of PCIe 5.0 Specification at 32GT/s
- Fully configurable solution: number of downstream ports, per-port link width and link speed, low-power modes, Hot Plug, and more
- Support for PIPE-attached embedded endpoints
- NTB support via embedded endpoints
- Advanced mechanisms such as Broadcast, Multicast, DPC
- Data protection using LCRC, ECRC, Parity, and ECC for memories
- Ultra-low latency switching logic
- Seamless implementation on ASIC and FPGA technologies
- Support for a variety of PCIe PHYs
Figure 5 provides an architecture overview of XpressSWITCH IP.
Figure 5 - XpressSWITCH IP architecture
XpressSWITCH IP is at the core of PLDA’s INSPECTOR for PCIe, a host platform with diagnostics capabilities used at PCI-SIG Compliance Workshops since 2016 for PCIe 4.0 FYI interoperability testing.
By coupling XpressSWITCH IP with PCIe transport over optical media, as demonstrated using the Samtec FireFly™ Micro Flyover System™, system builders can further expand the reach of the fabric to build fully disaggregated PCIe-based platforms.
Disaggregated, composable infrastructure is defining the next wave of data center architecture. While new memory semantic fabrics and communication protocols have emerged to enable this paradigm shift, data centers are years away from seeing these technologies deployed in silicon.
Meanwhile, SoC designers are finding ways to leverage the well-established PCIe protocol to build intelligent, high-performance fabrics. With the commoditization of optical communication, SoCs are also able to transport PCIe traffic off-chip across long distances with minimal impact on performance, thus enabling disaggregated PCIe-based architectures.
By using off-the-shelf PCIe switch IP, such as PLDA XpressSWITCH, IC designers have found a flexible way to build composable PCIe systems, allowing them to define and control every aspect of the solution in terms of capabilities and features, and ultimately create differentiated products.