Integrating PCI Express IP in a SoC

Ilya Granovsky, Elchanan Perlin - IBM
Haifa, Israel

Abstract:

PCI Express (PCIe) is an emerging protocol for IO device connection that is being rapidly adopted by the industry. An increasing number of applications require PCI Express connection support. A variety of PCI Express intellectual property (IP) solutions exist, facilitating PCI Express integration into ASIC designs. Due to protocol flexibility and the wide range of supported applications, PCI Express IP usually provides extensive configurability options for optimizing the PCI Express solution for the application's needs. This paper elaborates on the PCIe IP parameterization process and provides useful tools for PCIe solution evaluation, specification, and verification.

1. Overview

The PCI Express protocol has become increasingly popular as the PCI bus successor for IO device attachment. While PCI Express utilizes advanced fast serial interconnect technologies for improved performance, it retains the successful configuration and programming model of the existing PCI protocol, allowing integration of PCIe devices without requiring specific software support.

PCI Express offers high flexibility by supporting multiple physical links ("lanes") combined into a single logical link for increased bandwidth. This way, the same protocol provides optimal solutions for a variety of applications from high-end graphics cards to low-end network adapters.

While retaining full backward compatibility with current software, PCI Express introduces many new features, allowing advanced system diagnostics and error recovery, power management, and traffic differentiation.

PCI Express system topology is tree-based; the host is attached to the PCIe hierarchy through one or several root ports, driving multiple endpoint devices. The host may implement multiple root ports, each with a point-to-point connection to an adapter. Alternatively, the root port may implement a single PCIe link with a switch device attached to the downstream side, providing the required fanout for multiple adapters.

2. PCI Express Stack

PCI Express is a layered protocol that differentiates between the physical layer, the data link layer, and the transaction layer. Usually, an IP solution consists of a set of IP cores implementing different protocol layers forming the PCIe stack. An integrator requiring proprietary functions or features may choose to connect to the physical layer through the industry standard PIPE interface, providing the data link and transaction layer logic tailored to the specific application needs. An application using generic PCI Express capabilities may use the entire protocol stack provided by IP vendors with the ability to connect to different standard interconnects (such as AXI, AMBA, PLB, and others), or design application-specific interconnect logic that interfaces with the transaction layer of the PCIe IP. Maximal utilization of available IP keeps development costs down and reduces development risk through reuse of existing silicon-proven, standard building blocks.

3. PCI Express Bandwidth Considerations

The PCI Express link speed is 2.5 Gb/sec per lane for gen1 links. Because of 8b/10b encoding, this bit rate is equivalent to 250 MByte/sec of data per lane in each direction. Multi-lane PCIe links bring the theoretical bandwidth up to 4 GByte/sec for X16 links. The actual bandwidth is lower by 5-20% due to link protocol overhead, buffering efficiency, and other aspects. The PCIe gen2 link (version 0.9 of the specification is available on the PCI-SIG site) doubles the gen1 bit rate, bringing it up to 5 Gb/sec. Link width requirements are typically derived from the application's bandwidth needs. Integrators should keep in mind that the physical layer is sometimes the most area-consuming part of the PCIe stack. Unlike the logical protocol layers, which have common functionality for all link widths, physical layer size grows linearly with the number of lanes. Since physical layer cores are independent of one another, some IP vendors provide port bifurcation capabilities, allowing the HSS cores to be shared between several logical ports. For instance, such an implementation may be statically configured as a single X8 port or two X4 ports.
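The bandwidth arithmetic above can be captured in a few lines. The following is a minimal sketch, not part of any IP deliverable; the function name and the `overhead` parameter are illustrative assumptions, and the overhead fraction is simply the 5-20% range quoted in the text.

```python
# Rough per-direction PCIe bandwidth estimate for gen1/gen2 links.
# Raw bit rates per lane in Gb/sec, as quoted in the text.
RAW_GBPS = {"gen1": 2.5, "gen2": 5.0}


def link_bandwidth_mbytes(gen: str, lanes: int, overhead: float = 0.0) -> float:
    """Return the estimated data bandwidth in MByte/sec for one direction.

    overhead is a hypothetical knob: the fraction of bandwidth lost to
    link protocol overhead and buffering (the text quotes 5-20%).
    """
    if not 0.0 <= overhead < 1.0:
        raise ValueError("overhead must be in [0, 1)")
    raw_bits_per_sec = RAW_GBPS[gen] * 1e9 * lanes
    # 8b/10b encoding: every 10 bits on the wire carry 8 bits of data.
    data_bytes_per_sec = raw_bits_per_sec * 8 / 10 / 8
    return data_bytes_per_sec * (1.0 - overhead) / 1e6
```

With these assumptions, a gen1 X1 link comes out at about 250 MByte/sec and a gen1 X16 link at about 4000 MByte/sec, matching the figures above; passing `overhead=0.2` gives the pessimistic end of the quoted range.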

4. Selecting PCI Express Port Type

Many applications have a very specific use for a PCI Express connection that drives the port type: the root port or endpoint. However, some applications implement several PCIe ports with different types of PCIe connections; for instance, non-transparent PCIe bridges acting as an endpoint device on a primary PCIe hierarchy and controlling another PCIe tree of their own. These applications may want to retain flexibility when specifying the operating port type in the device configuration stage. Such applications need a special PCIe logic implementation to support both root port and endpoint configuration models and different configuration register sets. This dual-mode support increases the size of the transaction layer and configuration register block; however, the additional cell count is usually outweighed by the gain in solution flexibility.

5. Virtual Channels

Virtual Channels (VC) is a mechanism defined by the PCI Express standard for differential bandwidth allocation. Virtual channels have dedicated physical resources (buffering, flow control management, etc.) across the hierarchy. Transactions are associated with one of the supported VCs according to their Traffic Class (TC) attribute through TC-to-VC mapping specified in the configuration block of the PCIe device. Therefore, transactions with a higher priority may be mapped to a separate virtual channel, eliminating resource conflicts with low-priority traffic.
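As an illustration of TC-to-VC mapping, the sketch below builds a lookup table from a per-VC set of traffic classes. The function name and input format are assumptions made for this example and do not reflect the register layout of any particular IP. It relies on two properties of the PCIe standard: there are eight traffic classes (TC0-TC7), and TC0 is always mapped to VC0. Unmapped TCs are left as `None` here.

```python
from typing import Dict, List, Optional, Set

NUM_TCS = 8  # PCIe defines traffic classes TC0..TC7


def build_tc_to_vc(vc_tc_sets: Dict[int, Set[int]]) -> List[Optional[int]]:
    """Return a table indexed by TC that holds the VC assigned to it.

    vc_tc_sets maps a VC number to the set of TCs routed to that VC
    (hypothetical input format chosen for this sketch).
    """
    table: List[Optional[int]] = [None] * NUM_TCS
    for vc, tcs in vc_tc_sets.items():
        for tc in tcs:
            if not 0 <= tc < NUM_TCS:
                raise ValueError(f"invalid traffic class {tc}")
            if table[tc] is not None:
                raise ValueError(f"TC{tc} is mapped to more than one VC")
            table[tc] = vc
    if table[0] != 0:
        raise ValueError("TC0 must be mapped to VC0")
    return table


# Example: high-priority TC7 traffic gets its own virtual channel.
mapping = build_tc_to_vc({0: {0, 1, 2, 3, 4, 5, 6}, 1: {7}})
```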

Effective VC implementation requires a similar level of support across the entire system hierarchy. Applications reusing existing interconnect logic must ensure that traffic differentiation can be preserved or properly handled when data leaves PCIe stack boundaries.

System integrators also need to keep in mind that proper utilization of the VC mechanism requires new PCI software capable of identifying and programming the configuration space capability blocks that control multi-VC operation.

Multi-VC support usually leads to a notable logic area increase due to the additional buffering and separate logic mechanisms required per VC. To support independent queues for different virtual channels, separate logical queues are usually required. One possible multi-VC buffering scheme is implementing separate physical request queues for each VC to allow efficient arbitration between the VCs, while keeping a single physical data buffer with data blocks referenced by the header queue entries.
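A behavioral model can make this buffering scheme concrete. The sketch below is a simplified assumption-laden illustration, not a description of any vendor's implementation: the class and method names are hypothetical, and each payload is assumed to fit in a single data block.

```python
from collections import deque


class MultiVcBuffer:
    """Per-VC header queues that reference blocks in one shared data buffer."""

    def __init__(self, num_vcs: int, num_data_blocks: int):
        # One header queue per VC allows independent arbitration.
        self.header_queues = [deque() for _ in range(num_vcs)]
        # A single physical data buffer shared by all VCs.
        self.data_blocks = [None] * num_data_blocks
        self.free_blocks = deque(range(num_data_blocks))

    def push(self, vc: int, header, payload=None) -> bool:
        """Queue a request; return False if no data block is available."""
        block = None
        if payload is not None:
            if not self.free_blocks:
                return False
            block = self.free_blocks.popleft()
            self.data_blocks[block] = payload
        # The header entry carries the index of its data block, if any.
        self.header_queues[vc].append((header, block))
        return True

    def pop(self, vc: int):
        """Dequeue the oldest request of a VC and release its data block."""
        header, block = self.header_queues[vc].popleft()
        payload = None
        if block is not None:
            payload = self.data_blocks[block]
            self.data_blocks[block] = None
            self.free_blocks.append(block)
        return header, payload
```

Because the header queues are separate, a request on one VC can be popped while older requests on another VC are still waiting; the shared data buffer keeps the total storage cost lower than full per-VC data buffers would.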

As of today, the vast majority of PCIe applications do not support multiple VCs; however, there is increasing interest in multi-VC configurations.

6. PCI Express Performance Considerations

Due to the high flexibility of the PCI Express protocol, many parameters need to be considered by the integrator to ensure optimal PCIe bandwidth utilization. Some of these parameters are:

Link width - as indicated before, the PCI Express protocol supports link bandwidth scaling by combining a number of physical links into a single logical link. The integrator should select the desired link width, based on the target bandwidth. While wide links support training on a lower link width while leaving upper lanes idle, interoperability considerations with other devices planned to be attached to the application must also be taken into account to ensure appropriate link width support and optimize lane utilization.

Replay buffer size - PCIe provides CRC protection for all Transaction Layer Packets (TLPs) and specifies a packet replay mechanism for cases where CRC errors are detected by the receiver. All TLPs are stored in an intermediate replay buffer before being transmitted and are kept there until acknowledged by the remote receiver. In case of a CRC error, the TLP flow resumes, starting from the oldest unacknowledged TLP. When the replay buffer is full, the TLP flow is suspended until sufficient replay buffer space becomes available. The integrator must ensure that the replay buffer is large enough to store a sufficient number of TLPs to allow constant data flow until the TLP acknowledgement. Replay buffer size largely depends on the TLP acknowledgment roundtrip time, which is the period of time from the moment of TLP transmission until ACK DLLP arrival and its processing completion by the TLP originator. Some devices implement ACK DLLP coalescing (issuing a single DLLP to acknowledge several TLPs) by specifying an ACK factor parameter greater than one. Figure 1 illustrates the acknowledgment roundtrip time for an ACK factor of four. Higher ACK factors improve link utilization, but lengthen the TLP acknowledgement roundtrip period, leading to an increase in the replay buffer size required for maximal bandwidth.

Figure 1 - ACK DLLP Roundtrip Latency
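The replay buffer sizing described above can be approximated with a first-order model: the buffer must hold every TLP transmitted during the acknowledgment roundtrip, plus the extra TLPs the receiver waits for when it coalesces ACKs. The sketch below is an estimate under stated assumptions, not a sizing rule from the specification; the function name, the 500 ns roundtrip, and the 148-byte TLP size are illustrative values only.

```python
def replay_buffer_bytes(link_mbytes_per_s: int, ack_roundtrip_ns: int,
                        tlp_bytes: int, ack_factor: int = 1) -> int:
    """Estimate the replay buffer size needed to keep the link busy.

    All arguments are integers so the ceiling division below is exact.
    1 MByte/sec equals 1e-3 bytes/ns, hence the factor of 1000.
    """
    if ack_factor < 1:
        raise ValueError("ack_factor must be at least 1")
    # TLPs sent while the oldest one waits for its ACK (rounded up).
    tlps_in_flight = -(-(link_mbytes_per_s * ack_roundtrip_ns)
                       // (1000 * tlp_bytes))
    # With ACK coalescing the receiver collects ack_factor TLPs before
    # issuing one ACK DLLP, so the oldest TLP waits for that many more.
    tlps_in_flight += ack_factor - 1
    return tlps_in_flight * tlp_bytes


# Hypothetical X8 gen1 link (2000 MByte/sec), 500 ns ACK roundtrip,
# 148-byte TLPs (128-byte payload plus an assumed 20 bytes of overhead).
no_coalescing = replay_buffer_bytes(2000, 500, 148, ack_factor=1)
with_coalescing = replay_buffer_bytes(2000, 500, 148, ack_factor=4)
```

Under these assumptions the model yields 1036 bytes without coalescing and 1480 bytes with an ACK factor of four, which illustrates the trade-off between ACK traffic and replay buffer size.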

Request buffer size - PCIe is a flow-control-based protocol. Receivers advertise the supported number of receive buffers, and transmitters are not allowed to send TLPs without ensuring that sufficient receive buffer space is available. Receivers indicate additional buffer availability through the flow control update mechanism to allow constant data flow. Receive buffers must be large enough to cover the data transmission, processing, and flow control update roundtrip, and to allow constant data buffer availability from the transmitter's perspective to support the desired request rate. Figure 2 illustrates the flow control update roundtrip period from the remote transmitter's standpoint. Particular attention should be paid to the depth of the receive queue for read requests. For optimal performance, the application must be able to return a read request credit after forwarding the request to the application crossbar, without waiting for the read data to return. Moreover, if the read request queue is not deep enough, non-posted header credit updates may arrive at the remote transmitter at an insufficient rate, limiting the ability to forward read requests to the internal crossbar at a rate that results in optimal utilization of the read data bandwidth on the transmit link.

Figure 2 - Flow Control Update Latency
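The receive buffer requirement can be estimated in flow control credit units. In PCIe, one header credit covers one TLP header and one data credit covers 16 bytes (4 DW) of payload. The sketch below is a rough estimate only; the function name, the 600 ns roundtrip, and the 20-byte per-TLP overhead are assumptions chosen for illustration.

```python
def credits_for_roundtrip(link_mbytes_per_s: int, fc_roundtrip_ns: int,
                          payload_bytes: int,
                          tlp_overhead_bytes: int = 20) -> tuple:
    """Return (header_credits, data_credits) needed to avoid stalls.

    The receiver must advertise enough credits to cover all TLPs the
    transmitter can send during one flow control update roundtrip.
    tlp_overhead_bytes is an assumed per-TLP framing/header/CRC cost.
    """
    tlp_bytes = payload_bytes + tlp_overhead_bytes
    # 1 MByte/sec equals 1e-3 bytes/ns; integer ceiling division.
    tlps = -(-(link_mbytes_per_s * fc_roundtrip_ns) // (1000 * tlp_bytes))
    data_credits_per_tlp = -(-payload_bytes // 16)  # 16 bytes per credit
    return tlps, tlps * data_credits_per_tlp


# Hypothetical X8 gen1 link, 600 ns update roundtrip, 128-byte writes.
header_credits, data_credits = credits_for_roundtrip(2000, 600, 128)
```

For this hypothetical setup the estimate is 9 header credits and 72 data credits. A real design would add margin for update DLLP scheduling and internal processing latency.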

Read data buffering - PCIe supports multiple, concurrent, outstanding read transactions uniquely identified across the hierarchy by Requester ID and transaction tags. Transaction initiators are required to allocate buffering resources for read data upon making the request and advertise infinite credits for completions. Read requests are withheld until sufficient data buffer resources have been reserved. Therefore, a typical system's read data return latency must be considered to specify a sufficient number of outstanding reads. The number of reads in conjunction with the supported read request size should be able to compensate for data return latency to allow read data flow at the desired rate. Since PCIe allows a single read request to be completed by multiple completion TLPs, applications are encouraged to utilize large read requests regardless of the configured maximum payload size. Large reads allow achieving the bandwidth target with a smaller number of outstanding transactions, thus simplifying read context management and reducing read request traffic, which imposes an overhead for the data flow on the transmit link.
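The relationship between read latency, read request size, and the number of outstanding reads is a bandwidth-delay product. The following sketch shows the calculation; the function name and the 1000 ns latency are illustrative assumptions, not measured values.

```python
def outstanding_reads(target_mbytes_per_s: int, read_latency_ns: int,
                      read_request_bytes: int) -> int:
    """Return the number of reads that must be in flight to sustain
    the target read bandwidth for a given data return latency."""
    # Bytes that must be "on order" at any time = bandwidth x latency.
    # 1 MByte/sec equals 1e-3 bytes/ns; integer ceiling division.
    return -(-(target_mbytes_per_s * read_latency_ns)
             // (1000 * read_request_bytes))


# Hypothetical target of 2000 MByte/sec with a 1000 ns read latency.
reads_128 = outstanding_reads(2000, 1000, 128)
reads_512 = outstanding_reads(2000, 1000, 512)
```

With these numbers, 128-byte reads need 16 outstanding transactions while 512-byte reads need only 4, which is why large read requests simplify read context management.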

Maximum Payload Size (MPS) - PCIe supports several configurations for the maximum data payload size allowed to be included in a TLP by a specific device. The default maximum payload size is 128 bytes and can only be modified by PCIe-aware configuration software. Additional maximum payload configurations vary from 256 bytes to 4 Kbytes. Smaller payloads require smaller buffers and allow quick flow-control credit turnaround. On the other hand, they result in a higher link overhead. Large payloads require large replay and receive buffers and need to be supported across the entire system for optimal resource utilization. Each application must consider the optimal maximum payload parameter, based on the above criteria. Real-life implementations show that a maximum payload size of 512 bytes lies in the sweet spot, allowing high link utilization with reasonably small data buffers. Larger payloads require significantly larger data buffers that do not justify the minor improvement in link utilization.
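The diminishing return of large payloads can be seen with a simple efficiency estimate. The sketch below assumes a fixed per-TLP overhead of 20 bytes (framing, sequence number, header, and CRC); that figure and the function name are assumptions for illustration, and DLLP traffic is ignored.

```python
def link_efficiency(payload_bytes: int, tlp_overhead_bytes: int = 20) -> float:
    """Fraction of link bytes that carry payload for a given TLP size."""
    return payload_bytes / (payload_bytes + tlp_overhead_bytes)


# Compare the payload sizes a device may be configured with.
for mps in (128, 256, 512, 1024, 2048, 4096):
    print(f"MPS {mps:4d} bytes: {link_efficiency(mps):.1%} payload efficiency")
```

Under this assumption, moving from 128 to 512 bytes improves efficiency by roughly ten percentage points, while moving from 512 bytes to 4 Kbytes adds only about three more, at eight times the buffer size per TLP.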

7. Power Management Support

The PCI Express standard provides support for power management by specifying link power states with different power consumption levels and recovery times. The L0s power state allows automatic transmitter shutdown during idle states and provides short recovery times through a fast training sequence. The L1 link power state is applied when the device is placed in a low device power state by power management software. Only a limited number of transactions that are required for supporting the device's return to an operational state are allowed in this state. PCI Express also specifies Active State Power Management (ASPM) capabilities that allow L1-level power savings, but are controlled through application-specific mechanisms. Another power management technique is dynamic lane downshifting. Wide links that do not require full bandwidth may retrain with a lower number of active lanes, shutting down unused lanes and resulting in significant power savings. The gen2 link protocol also introduces an additional means of power saving by providing a software-controlled capability of retraining the link to a lower speed when full bandwidth is not required.

8. Verification of PCI Express Logic

PCI Express IP vendors usually provide an IP design that has undergone extensive verification. However, this is not enough to ensure a lack of defects in chip-level functionality of the PCI Express partition. PCIe IP parameterization needs to be verified in a chip-level environment to prove the correctness and consistency of the selected configuration. Data flows and protocols on the IP interfaces to the chip interconnect must also be covered in chip-level verification. In addition, particular attention should be paid to configuration sequences, system boot and address space allocation, possible transactions ordering deadlocks, interrupt handling, error handling and reporting, power management, and other system-level scenarios. This part of verification requires extensive knowledge of the real-life software behavior that needs to be translated into simulation test cases.

Multiple PCIe verification IP solutions that complement design IP are available to support the PCIe verification effort. In addition to base PCIe device behavior modeling, verification IP implements protocol checkers that provide real-time indications of standard violations, when they occur. Some PCIe verification solutions also provide a test case suite that covers various PCI Express compliance checks and complements chip-level testing.

A chip-level verification effort should minimize testing of internal IP features associated with the IP implementation, relying on the verification coverage of the IP provider. For instance, the chip integrator may limit testing of the completion timeout mechanism to a single timeout value to validate that the mechanism can be enabled in a chip and that associated error events are handled properly. Testing all the possible timeout values, however, is not necessary, assuming that this testing falls under the responsibility of the IP provider.

The following scenarios should be considered for a chip-level verification test plan:

Configuration sequences - Chip verification should verify the flow that the application performs during boot and the initial configuration sequence. These are critical scenarios that may have chip architecture impact; therefore, early detection of configuration problems is highly important.
Performance - Performance checking is one of the important chip-level tests. This test is supposed to prove the initial assumptions made during the architecture phase, such as core latencies, interconnect bandwidth, etc. Completing performance testing in early project stages allows advance detection of critical chip architecture flaws.
Chip reset sequences - SoC applications usually implement several reset mechanisms, including software-controlled mechanisms. These mechanisms are critical to chip functionality and, if defective, may lead to chip malfunction. The reset level of the PCIe partition should be determined for each reset sequence and then simulated to prove the ability to recover from the reset state and return to operational mode. Root port and endpoint applications should take PCIe-specific aspects of the link reset into account. For instance, the downstream PCIe device has to be reconfigured after link reset, which usually requires system software intervention.
PCIe reset sequences - The PCI Express protocol specifies a hot reset mechanism, where downstream components are reset through link notification. Root ports should validate that this mechanism can be applied and that the hot reset indication is properly detected by a remote device. Endpoint applications should determine the level of chip reset desired in case of hot reset detection on the link to validate proper functionality.
Error events - PCIe specifies various events and error conditions that may occur in the system. These events are registered in the configuration space of the relevant devices and are reported to the root complex through PCIe messages. The root complex collects error reports from the hierarchy and forwards them to the system in an application-specific manner. Error reporting sequences, as well as the ability to resolve the error and clear all the relevant status bits, should be addressed by the chip-level testing.
INTx interrupts - The PCIe standard specifies two interrupt modes: legacy INTx level interrupts mode and MSI mode. In the INTx mode, endpoints report an INTx interrupt to the host, based on the interrupt configuration. The root complex merges interrupt reports from all the devices into four interrupt lines and forwards the interrupt indication to the host. The INTx interrupt resolution includes scanning all the devices for interrupt source determination, clearing the interrupt trigger, and then validating interrupt deassertion. In some cases endpoints cannot assume the MSI interrupt mode is implemented by the host and must provide legacy interrupt support.
MSI interrupts - MSI is the preferred PCIe mechanism for interrupt signaling that uses memory write transactions to communicate all the information on the interrupt source directly to the host. Unlike INTx, the MSI message includes all the relevant interrupt data. In MSI mode, the host is able to access the device causing the interrupt directly to clear the interrupt trigger. Chip-level simulation should validate that the MSI interrupt mode can be properly configured and that MSI messages are properly routed in the system.
Power management - Power management scenarios involve interaction between system software, SoC hardware, and PCIe IP. Endpoints should validate the power management activation and deactivation scenarios, paying particular attention to the ability to generate a power management interrupt from a low power state. Root ports should validate that they are able to configure the power management state of downstream devices and process power management events during the low power state. The PCIe link power management is tightly coupled with the above scenarios. The system powerdown sequence, based on the PME_Turn_Off and PME_TO_Ack messages, should also be validated.
Random testing - Chip integrators should consider random testing that includes randomization of the following parameters: different data flows of random rates, address ranges and mapping, response time, and random error injection. Random testing attempts to cover areas that might be missed by the directed testing. Random testing may also discover system deadlocks due to the random pattern of data flows injected during the test.
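As a sketch of how the randomized parameters listed under random testing might be drawn, the following generates a reproducible scenario from a seed so that a failing run can be replayed. All names, value ranges, and the error probability are hypothetical; a real environment would typically express this in its own testbench language and constraint solver.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class RandomScenario:
    seed: int
    flow_rates: List[float]      # relative rate of each data flow
    base_address: int            # start of the randomized address range
    response_delay_cycles: int   # delay before the model responds
    inject_error: bool           # whether to inject an error in this run


def make_scenario(seed: int, num_flows: int = 4,
                  error_probability: float = 0.1) -> RandomScenario:
    """Draw one random test scenario; the same seed gives the same result."""
    rng = random.Random(seed)  # private generator keeps runs reproducible
    return RandomScenario(
        seed=seed,
        flow_rates=[rng.uniform(0.1, 1.0) for _ in range(num_flows)],
        base_address=rng.randrange(0, 1 << 32, 4096),  # 4 KB aligned
        response_delay_cycles=rng.randint(1, 200),
        inject_error=rng.random() < error_probability,
    )
```

Logging the seed with each regression run is what makes a randomly discovered deadlock debuggable later.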

PCI Express compliance checking is one of the major chip-level verification objectives. Validating PCIe compliance includes a PCIe compliance checklist review by the chip integrator and use of third-party PCIe models and checkers that provide compliance coverage. The chip integrator may also consider additional directed testing for PCIe compliance coverage to improve confidence in the eventual solution.

Random testing environments may consider implementation of PCIe-specific coverage attributes to improve confidence in the PCIe solution. The major risk from the integrator's standpoint is incorrect signal connections. Interface toggle coverage may be useful to validate that all IP interfaces were toggled during testing, assuming that a wrong connection would cause a functional problem. Static configuration inputs may be covered only for desired values, limiting the coverage effort to a specific chip setup.

PCIe IP integrators may also consider reuse of IP verification environment components in chip level verification to improve verification quality and reduce the required effort. This includes reuse of IP assertions, integration of IP-specific offline and online checkers in the chip environment, and reuse of the IP test suite in chip-level verification.

9. Summary

Despite the availability of various reliable and proven PCI Express solutions from different IP providers, protocol complexity and extensive configurability dictate a careful approach to PCIe integration. The system performance target should be kept in mind, starting from the initial architecture stages. Various parameters need to be configured based on system requirements, and then thoroughly simulated to ensure the solution's correctness.
