Non-Transparent Bridging Makes PCI-Express HA Friendly
By Akber Kazmi, PLX Technology, CommsDesign.com
August 14, 2003 (1:47 p.m. EST)
Today's router and switch designers face a daunting task. They must build systems that support high availability while also efficiently and cost-effectively guiding network traffic through a mesh of switch fabrics. Coupled with this challenge is the additional mandate of delivering quality of service (QoS) without adding complexity or cost.
Figure 1: PCI Express protocol stack and frame format.
To meet this challenge, some designers have embraced the PCI Express Base and Advanced Switching (AS) architectures as viable options for developing next-generation communication systems. The AS architecture, based on a packet-switching fabric, promises to tunnel any communication protocol through the fabric, using its protocol interface for packet encapsulation (see Merging ATCA, PCI Express Opens Next-Gen Backplane Designs).
Although AS has garnered a lot of attention, its implementation in communication systems may be a few years away because an ecosystem for AS has yet to be developed. That delay, however, should not prevent designers from turning to PCI Express in next-generation switch and router architectures. Through the use of non-transparent bridging, PCI Express allows designers to build equipment that can operate with multiple hosts or masters in a single system and thus provide the reliability and QoS capabilities needed in today's networks. Let's see how.
PCI Express Basics
The PCI Express specification is based on a layered architecture that takes advantage of multi-gigabit per second serial interface technology. The protocol stack provides transaction, data link, and physical layers (Figure 1).
The transaction and data link layers support point-to-point communication between endpoints, end-to-end flow control, error detection, and a robust retransmission mechanism. The physical layer consists of a high-speed serial interface specified for 2.5 Gbit/s operation per lane with 8B/10B encoding and AC-coupled differential signaling. Furthermore, physical interfaces are required to support hot swapping for high-availability (HA) applications. These features make PCI Express suitable as a chip-to-chip and board-to-board interconnect technology for high-performance communication systems.
One of the most important issues that designers face while developing highly available equipment using PCI Express is the presence of multiple hosts or masters in a single system. The PCI Express specification offers a significant improvement over PCI technology but does not address multi-host issues. Fortunately, through the use of non-transparent bridging, designers can resolve this issue. Let's see how non-transparent bridging works.
Non-Transparent Bridges Defined
It is important to review some PCI basics in order to understand how PCI Express non-transparent bridging operates. PCI is a multi-drop bus-based technology that was originally intended for compute applications, with the expectation that the host processor would control the entire system. In the PCI architecture, bridges are used to expand the number of slots possible for the PCI bus.
At power-up, the host performs discovery to learn what devices are present, and then maps them into its memory space. The PCI specification defines standard PCI-to-PCI bridge configurations (transparent bridging), which allow the host to pass through the bridges to discover all the endpoints in its address domain. Various non-standard mechanisms are used to keep the address domains separated if two or more processors are accessing the same bus, memory or endpoints (Figure 2).
Figure 2: Diagram of a typical PCI Express switching system.
A non-transparent bridge is functionally similar to a transparent bridge, with the exception that there is an intelligent device or processor on both sides of the bridge, each with its own independent address domain. The host on one side of the bridge does not have visibility of the complete memory or I/O space on the other side of the bridge. Each processor treats the other side of the bridge as an endpoint and maps it into its own memory space (Figure 3).
Figure 3: Diagram showing direct address translation.
In the non-transparent bridging environment, PCI Express systems need to translate addresses that cross from one memory space to the other. The non-transparent bridge also allows hosts on each side of the bridge to exchange information about the status through scratchpad registers, doorbell registers, and heartbeat messages.
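The direct address translation named in Figure 3 can be pictured as a base-address swap: an access that falls inside a window on one side of the bridge keeps its offset and receives a new base in the other address domain. The following is a minimal sketch under that assumption. The class, its fields, and the example addresses are hypothetical and do not describe the register layout of any particular device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationWindow:
    """Hypothetical model of one non-transparent bridge address window."""
    local_base: int       # start of the window in the initiator's address domain
    size: int             # window size in bytes
    translated_base: int  # start of the target region in the other domain

    def translate(self, addr: int) -> int:
        """Map an address inside the local window into the far-side domain."""
        offset = addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {addr:#x} is outside the bridge window")
        # Direct translation: keep the offset, replace the base.
        return self.translated_base + offset


# Example values are illustrative only.
window = TranslationWindow(local_base=0x8000_0000,
                           size=0x0010_0000,
                           translated_base=0x2000_0000)
far_addr = window.translate(0x8000_1234)
```

Rejecting out-of-window addresses reflects the isolation described above: a host can reach only the region that the bridge exposes, not the full address space on the far side.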
Registers and Messages
For a non-transparent bridge to be successful, three key elements must be provided: scratchpad registers, doorbell registers, and heartbeat messages. Let's look at these three.
1. Scratchpad Registers
The scratchpad registers provide a means of communication between two processors over a non-transparent bridge. They are readable and writeable from both sides of the non-transparent bridge. The number of scratchpad registers may vary across different implementations. These registers can be accessed in either memory or I/O space from both the primary and secondary interfaces of the bridge. They can pass control and status information between the primary and secondary bus devices, or they can be generic R/W registers.
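As a rough illustration of the scratchpad mechanism, the sketch below models a small bank of registers that both hosts can read and write. The register count, the 32-bit width, and the status code are assumptions made for the example, since the text notes that implementations vary.

```python
class ScratchpadRegisters:
    """Hypothetical bank of registers shared by both sides of the bridge."""

    def __init__(self, count: int = 8):
        self._regs = [0] * count

    def write(self, index: int, value: int) -> None:
        # Registers are assumed to be 32 bits wide; extra bits are dropped.
        self._regs[index] = value & 0xFFFF_FFFF

    def read(self, index: int) -> int:
        return self._regs[index]


# Both hosts access the same bank, one through the primary interface
# and one through the secondary interface.
bank = ScratchpadRegisters()
STATUS_REG = 0      # hypothetical register assignment
STATUS_READY = 0x1  # hypothetical status code

bank.write(STATUS_REG, STATUS_READY)       # written by the primary-side host
seen_by_secondary = bank.read(STATUS_REG)  # read by the secondary-side host
```

In a real system the two hosts would agree in software on what each register means; the hardware only provides the shared storage.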
2. Doorbell Registers
The doorbell registers are used to send interrupts from one side of the non-transparent bridge to the other. These are software-controlled interrupt request registers with associated masking registers for each interface on the non-transparent bridge. These registers can be accessed from the primary or the secondary interface of the bridge in either memory or I/O space. An interrupt is asserted on the primary interface whenever one or more of the bits in the request register are set and their corresponding mask bits are zero.
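The interrupt condition described above reduces to a single bitwise test: an interrupt is pending when any request bit is set whose mask bit is zero. The sketch below models one interface's request and mask registers. The 16-bit width, the method names, and the choice to start with every doorbell masked are assumptions for illustration.

```python
class DoorbellInterface:
    """Hypothetical model of one interface's doorbell request and mask registers."""

    def __init__(self, width: int = 16):
        self._all_bits = (1 << width) - 1
        self.request = 0
        self.mask = self._all_bits  # assumption: all doorbells start masked

    def ring(self, bit: int) -> None:
        """Set a request bit (done by the host on the other side)."""
        self.request |= (1 << bit) & self._all_bits

    def clear(self, bit: int) -> None:
        """Clear a request bit once the interrupt has been serviced."""
        self.request &= ~(1 << bit)

    def unmask(self, bit: int) -> None:
        """Allow the given doorbell bit to raise an interrupt."""
        self.mask &= ~(1 << bit)

    def interrupt_asserted(self) -> bool:
        # Asserted when at least one request bit is set and its mask bit is zero.
        return (self.request & ~self.mask & self._all_bits) != 0


primary = DoorbellInterface()
primary.ring(3)
while_masked = primary.interrupt_asserted()    # False: bit 3 is still masked
primary.unmask(3)
after_unmask = primary.interrupt_asserted()    # True: set and unmasked
primary.clear(3)
after_clear = primary.interrupt_asserted()     # False: request cleared
```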
3. Heartbeat Messages
The heartbeat messages are sent from the primary host to the secondary host to indicate that the primary is still alive. The secondary host monitors the state of the primary and, upon detecting a failure, takes over as primary host, continuing system operation from the last valid checkpoint. The doorbell registers, discussed above, may be used for heartbeat messages. Failure of the primary host is declared when the secondary host fails to receive a certain number of the regularly scheduled heartbeat messages.
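The failure-detection rule can be sketched as a counter on the secondary host. The example below assumes that the missed heartbeats must be consecutive and uses a limit of three; both are illustrative choices, as the text does not fix them. Restoring state from the last checkpoint is outside the scope of the sketch.

```python
class HeartbeatMonitor:
    """Hypothetical secondary-host monitor that counts missed heartbeats."""

    def __init__(self, missed_limit: int = 3):
        self.missed_limit = missed_limit  # assumed threshold, not from a spec
        self.missed = 0
        self.role = "secondary"

    def tick(self, heartbeat_received: bool) -> str:
        """Call once per heartbeat interval; returns the current role."""
        if self.role == "primary":
            return self.role  # takeover has already happened
        if heartbeat_received:
            self.missed = 0   # assumption: only consecutive misses count
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                self.role = "primary"  # declare the old primary failed
        return self.role


monitor = HeartbeatMonitor(missed_limit=3)
arrivals = (True, False, False, True, False, False, False)
history = [monitor.tick(received) for received in arrivals]
```

In this run the two early misses are forgiven when a heartbeat arrives, and the takeover occurs only on the third consecutive miss at the end of the sequence.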
HA with PCI Express
An example of a fully redundant HA system is shown in Figure 4. In this example, two control modules and two switch fabric modules are interconnected using PCI Express switches. These PCI Express switches utilize the non-transparent bridging concept.
Figure 4: HA platform built around PCI Express technology and non-transparent bridging.
In many chassis-based systems, a chassis control module monitors the overall operation of the system. In an HA environment, a backup chassis control module is present and is configured to also monitor the system status. Typically, these modules are called primary and secondary, where the primary host is actively controlling the system while the secondary host is only monitoring.
Figure 4 also shows an example of a fully redundant switch or router system. In this example, control module 1 is acting as the primary host while control module 2 is acting as the secondary. The primary host communicates with the secondary host through the non-transparent bridge port of the PCI Express switch on the primary control module.
During the course of normal operation, the control modules exchange status information through doorbells and scratchpad registers. When the primary control module fails, the secondary takes steps to assume control, prevent the failed module from controlling the system, and reconstitute the system state. This example can also be utilized to make both control modules active in a load sharing mode, and to perform failover if one of them becomes non-operational. In that case, both ports connecting them would have to be configured to non-transparent mode in order for them to be isolated from each other during normal operation.
Figure 4 also shows two switch fabric modules and two control modules. Each control module is connected into one switch fabric module via a transparent port for its primary path, and to the other switch fabric module via a non-transparent port for a backup connection. Port adapters are connected into both switch fabrics with one connection defined as "upstream". This causes it to be managed by the control module that has a transparent connection to that switch fabric. In this way, a single control module can be set up to manage the entire system or to share the load with its backup.
In Figure 4, the active links between the control module and the switch fabrics are shown as solid red lines and the links to the back-up control module are shown in broken blue lines. The information about the switch fabric modules and port adapters is stored in the memory space of the primary control module. The primary control module monitors the heartbeat of the switch fabric modules through the PCI Express switch ports. If one of the switch fabric modules fails, the primary control module detects the failure and moves all of the port adapter modules to the surviving switch fabric.
The port adapter modules are connected to both switch fabric modules, as shown in Figure 4, where one PCI Express port on each port adapter is active and the other is in standby mode (non-transparent mode). When a switch fabric module fails, all of the port adapters using the failing switch fabric migrate their traffic, through the back-up PCI Express switch port, to the surviving switch fabric. The control module performs the failure detection and re-routing of the traffic to the back-up switch port on the port adapter module and through the alternate switch fabric.
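The re-routing step performed by the control module can be pictured as an update to a table that records which switch fabric each port adapter is actively using. The sketch below is a simplified model of that bookkeeping only. The function name, the adapter and fabric labels, and the table layout are hypothetical.

```python
def fail_over(active_fabric: dict[str, str], failed: str, survivor: str) -> dict[str, str]:
    """Return a new adapter-to-fabric map with the failed fabric's adapters moved.

    Adapters already using the surviving fabric are left unchanged.
    """
    return {
        adapter: (survivor if fabric == failed else fabric)
        for adapter, fabric in active_fabric.items()
    }


# Hypothetical state held in the primary control module's memory.
assignments = {"adapter1": "fabric1", "adapter2": "fabric2", "adapter3": "fabric1"}
after_failure = fail_over(assignments, failed="fabric1", survivor="fabric2")
```

In hardware, each table change would correspond to activating the standby PCI Express port on the affected port adapter.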
In this example, each port adapter module has an independent processor and its own memory domain. Port-to-port transactions within a port adapter may be switched directly between endpoints or by the switch fabric associated with that port adapter. The transactions across the port adapters may be first routed to the switch fabric or control module, and then to the destination port adapter. Isolation of the processor domains on port adapters and control modules is achieved through the non-transparent ports of the PCI Express switch.
It is important to note that the model presented here does not require separate control and data planes. The isolation of control and data traffic is performed by utilizing the point-to-point connection feature of PCI Express technology. It is also possible to dedicate some ports of the PCI Express switch to data and others to control traffic by utilizing the virtual channel feature of PCI Express. Two or more 2.5 Gbit/s lanes of a PCI Express switch may be aggregated into wider links, which allows designers to create high-bandwidth interconnects for data packets.
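The benefit of lane aggregation follows from simple arithmetic. Assuming the 2.5 figure is the raw signaling rate per lane in each direction, 8B/10B encoding carries 8 data bits in every 10 transmitted bits, so each lane delivers 2.0 Gbit/s before packet and protocol overhead. The sketch below applies that calculation to a few link widths. The function name and the chosen widths are illustrative.

```python
RAW_RATE_MBPS = 2500      # raw signaling rate per lane, per direction
DATA_BITS_PER_SYMBOL = 8  # 8B/10B: 8 data bits ...
LINE_BITS_PER_SYMBOL = 10 # ... per 10 transmitted bits


def link_data_rate_mbps(lanes: int) -> int:
    """Data rate of an aggregated link in Mbit/s, per direction.

    Accounts for 8B/10B encoding only; packet headers and link-layer
    traffic reduce the usable payload rate further.
    """
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    return lanes * RAW_RATE_MBPS * DATA_BITS_PER_SYMBOL // LINE_BITS_PER_SYMBOL


rates = {lanes: link_data_rate_mbps(lanes) for lanes in (1, 2, 4)}
```

Under these assumptions a four-lane link carries 8,000 Mbit/s of encoded data in each direction, four times the single-lane figure.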
The PCI Express architecture offers valuable features as a cost-effective and efficient interconnect technology for telecommunication equipment design. The addition of non-transparent bridging features to PCI Express makes it a very compelling technology for this application. This combination provides a non-proprietary solution to telecom OEMs for implementing high-availability functions with many intelligent port adapters operating in single or dual control plane domains. While it may be possible to achieve high-speed interconnection and high-availability goals using various other solutions, PCI Express switches with non-transparent bridging offer one of the cleanest, simplest and most cost-effective methods.
About the Author
Akber Kazmi is senior product manager at PLX Technology. He holds an MSEE from the University of Cincinnati and an MBA from Golden Gate University. Kazmi also chairs the PCI-SIG PCI Express committee for technical marketing and design enabling support to the communications market segment. Akber can be reached at firstname.lastname@example.org.