Managing NVMe Verification Complexity

VIP Expert

Nov 10, 2019 / 3 min read

From inception, NVMe was designed to support multiple hosts accessing shared media. Early implementations included in-the-box PCIe devices such as Endpoint (EP), Root Complex (RC) and Root Complex Integrated Endpoint (RCiEP); over time, cloud and storage infrastructure created a need for remote storage.

NVMe implementations can address the space occupied by both the SATA point-to-point architecture and SAS. Successful adoption in both spaces is due to the promise of low latency and a common interface for storage, regardless of location. Though the verification challenges in these two use cases are similar, each still requires a different thought process.

[Figure: Managing NVMe verification complexity chart]

When NVMe is used in a point-to-point architecture, verification centers on the controller implementation. The number of controllers in this case is fewer than 10, with the logic built in hardware, firmware and application software. Bandwidth and throughput are the key measures in a point-to-point architecture. NVMe controller designers need to make implementation tradeoffs to achieve cost/performance goals, and the key tradeoff is between hardware and software implementation of various features. The details of these tradeoffs won’t be discussed here, but suffice it to say that where the hardware/software line is drawn matters to the verification engineer.

Hardware/software partitioning brings verification complexity. Hardware is traditionally verified in simulation because it requires more rigorous and thorough testing. Software-implemented features are tested more lightly in co-simulation and hardware-accelerated verification environments, since updates are not costly as long as they do not affect the hardware. One verification challenge we see here is verifying implementation-specific hardware used to accelerate various software functions. In this case, software usually needs to set up the hardware and offload work to it. Depending on how complex the software implementation is, simulations can take days to reach the point of verification interest. This startup time in co-simulation is a direct threat to the schedule.

To address the hardware and software issues in simulation, many verification teams use hardware-accelerated platforms such as ZeBu. Hardware acceleration allows NVMe drivers to be booted on a CPU that connects to the emulated device. The biggest challenge here is reusability. Tests written for simulation are traditionally optimized for the simulation testbench and are not completely applicable to the acceleration environment. Synopsys’ ZeBu platform resolves this by enabling the reuse of simulation Verification IP in acceleration and by keeping the same user interface between the simulation and acceleration platforms. Because the ZeBu acceleration platform achieves 100X faster execution, software bootup becomes practical. This approach allows tests to run deeper and uncover functional bugs, so that pipelines, memory bandwidth, rollover conditions, and stuck-at or one-shot faults can be vetted. Acceleration also allows waveform-based debug, which is necessary to resolve hardware-based issues.
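To put the 100X figure in perspective, here is a small back-of-the-envelope calculation. The 72-hour simulation boot time is an assumed number chosen only for illustration (the text above says only that such runs "can take days"); the function and variable names are likewise hypothetical.

```python
# Hypothetical numbers for illustration only; they do not come from a real project.
SIM_BOOT_HOURS = 72.0   # assumed: reaching the point of interest takes three days in simulation
SPEEDUP = 100.0         # the 100X execution speedup quoted in the text


def accelerated_minutes(sim_hours, speedup):
    """Return the wall-clock minutes the same run would take at the given speedup."""
    return sim_hours * 60.0 / speedup


boot_minutes = accelerated_minutes(SIM_BOOT_HOURS, SPEEDUP)
print(f"{SIM_BOOT_HOURS:.0f} h in simulation -> {boot_minutes:.1f} min accelerated")
```

Under these assumptions, a run that would consume three days of simulation finishes in well under an hour, which is what makes booting real driver software part of a routine test practical.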

Other simulation optimizations should also be considered to reduce test run time. For NVMe with PCIe as the transport, the entire PCIe stack can be removed, exposing the proprietary TLP interface between the NVMe logic and the PCIe stack. PCIe stacks tend to be large and require setup time, so removing the stack also removes this specification-based setup time. When the PCIe transport is removed, other things need to be considered, such as buffer management and interrupts. For PCIe design IP that uses an AXI interface (rather than a proprietary TLP interface), removing the PCIe stack is easier because AXI is a public standard. This makes the break at the AXI interface relatively portable.
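One way to picture this break is as a transport seam in the testbench: tests talk to an abstract transport, and the full PCIe path can be swapped for a direct one. The sketch below is only an illustration under that assumption. Every class and function name is hypothetical, memory is modeled as a plain dictionary, and the PCIe setup work is represented only as a list of step names rather than modeled.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical seam between NVMe test code and the layer that moves data."""

    setup_steps = ()  # specification-driven steps required before the first access

    @abstractmethod
    def write(self, addr, data): ...

    @abstractmethod
    def read(self, addr, length): ...


class _MemoryBacked(Transport):
    """Shared toy backing store so both transports behave identically to a test."""

    def __init__(self):
        self._mem = {}

    def write(self, addr, data):
        for i, b in enumerate(data):
            self._mem[addr + i] = b

    def read(self, addr, length):
        return bytes(self._mem.get(addr + i, 0) for i in range(length))


class FullPcieTransport(_MemoryBacked):
    # Stand-in for the complete stack; the steps are listed, not simulated.
    setup_steps = ("link training", "enumeration", "BAR programming")


class DirectTransport(_MemoryBacked):
    # Stand-in for a break at the AXI (or TLP) boundary: no link-level setup.
    setup_steps = ()


def smoke_test(transport, addr=0x1000):
    """The same test body runs unchanged on either transport."""
    payload = (0xCAFE).to_bytes(4, "little")
    transport.write(addr, payload)
    return transport.read(addr, 4) == payload


for t in (FullPcieTransport(), DirectTransport()):
    print(type(t).__name__, len(t.setup_steps), smoke_test(t))
```

The point of the design is that the test itself never changes. Only the transport object does, which is why a break at a public interface such as AXI tends to carry over between projects more easily than a break at a proprietary one.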

Debug in point-to-point is relatively straightforward, although usually tedious. Transaction and simulation logs are used to chase down the memory transactions associated with an NVMe command. Scoreboards can also be used effectively, in both inline and sideband forms. Another key aspect of debug is monitoring the structures constructed and manipulated in memory. Tracking down a completion that never made it into the completion queue can be very difficult, because the controller performs the memory access outside the watchful eye of the host or Verification IP. Having the ability to “watch” this memory, whether that ability is built into the Verification IP or into a separate verification component, will save countless hours of debug. One additional verification tool to consider is tracking the state of controllers, namespaces and other resources that lie on the other side of the link. By tracking this state within the verification environment, a lot of debug time can be saved by the following:

  • Flagging commands improperly formatted by the test writer
  • Flagging commands which are not supported by the controller due to either insufficient version or unsupported features
  • Flagging issues related to prerequisite facilities not yet being set up
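The three checks above can be sketched as a small state tracker. This is a minimal illustration, not the interface of any particular Verification IP: the command is modeled as a plain dictionary, and the field names (`opcode`, `qid`, `nsid`, `min_version`, `feature`) and the `ControllerState` class are assumptions made for the example.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("opcode", "qid", "nsid")  # hypothetical minimum for a well-formed command


@dataclass
class ControllerState:
    """What the environment believes about one controller across the link."""
    version: tuple                                  # assumed (major, minor), e.g. (1, 2)
    features: set = field(default_factory=set)      # feature names the controller supports
    io_queues: set = field(default_factory=set)     # queue IDs created so far
    namespaces: set = field(default_factory=set)    # namespace IDs attached so far


def check_command(tracker, cid, cmd):
    """Return a list of issues found before the command is sent to controller `cid`."""
    missing = [f for f in REQUIRED_FIELDS if f not in cmd]
    if missing:
        # Check 1: improperly formatted by the test writer.
        return [f"malformed: missing {', '.join(missing)}"]

    ctrl = tracker[cid]
    issues = []
    # Check 2: insufficient version or unsupported feature.
    if cmd.get("min_version", (1, 0)) > ctrl.version:
        issues.append("unsupported: controller version too low")
    feature = cmd.get("feature")
    if feature is not None and feature not in ctrl.features:
        issues.append(f"unsupported: feature {feature} not available")
    # Check 3: prerequisite facilities not yet set up.
    if cmd["qid"] not in ctrl.io_queues:
        issues.append(f"prerequisite: queue {cmd['qid']} not created")
    if cmd["nsid"] not in ctrl.namespaces:
        issues.append(f"prerequisite: namespace {cmd['nsid']} not attached")
    return issues


# The tracker is keyed by controller ID, so adding controllers is just adding entries.
tracker = {
    0: ControllerState(version=(1, 2), features={"sgl"}, io_queues={1}, namespaces={1}),
    1: ControllerState(version=(1, 4)),
}
print(check_command(tracker, 0, {"opcode": "write", "qid": 1, "nsid": 1}))
print(check_command(tracker, 0, {"opcode": "write", "qid": 2, "nsid": 1}))
print(check_command(tracker, 1, {"opcode": "read", "qid": 1, "nsid": 1}))
```

Running checks like these before a command reaches the DUT separates test-writer mistakes from real DUT failures early, rather than leaving both to be untangled from a waveform.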

Once the verification environment can track one controller and one namespace, the same tracking extends naturally to environments with multiple controllers and namespaces, providing a multiplier effect on the debug time savings above.
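The memory “watch” idea mentioned earlier, for chasing a completion that never reached the completion queue, can be sketched in a similar spirit. The `MemoryWatch` class and its `on_write` hook are hypothetical; how a real environment observes controller-initiated writes depends on the testbench. The sketch assumes the standard 16-byte NVMe completion queue entry with the command identifier in bytes 12 and 13.

```python
CQ_ENTRY_BYTES = 16  # size of one NVMe completion queue entry


class MemoryWatch:
    """Records every observed write that lands inside a watched address range."""

    def __init__(self):
        self.ranges = {}  # name -> (base address, size in bytes)
        self.hits = []    # (name, offset within range, data written)

    def watch(self, name, base, size):
        self.ranges[name] = (base, size)

    def on_write(self, addr, data):
        """Hypothetical hook: call wherever the testbench sees a controller write."""
        for name, (base, size) in self.ranges.items():
            if base <= addr < base + size:
                self.hits.append((name, addr - base, bytes(data)))


def completed_cids(watch, name):
    """Extract command identifiers from full, aligned entries written to one queue."""
    cids = set()
    for hit_name, offset, data in watch.hits:
        if hit_name == name and offset % CQ_ENTRY_BYTES == 0 and len(data) >= CQ_ENTRY_BYTES:
            cids.add(int.from_bytes(data[12:14], "little"))
    return cids


# Usage: two commands are outstanding, but only one completion is ever written.
w = MemoryWatch()
w.watch("cq1", base=0x1000, size=4 * CQ_ENTRY_BYTES)
entry = bytes(12) + (7).to_bytes(2, "little") + bytes(2)  # completion for CID 7
w.on_write(0x1010, entry)   # lands in the watched completion queue
w.on_write(0x2000, entry)   # outside any watched range, so it is ignored
outstanding = {7, 9}
print("missing completions:", outstanding - completed_cids(w, "cq1"))
```

A report like this answers the first debug question directly: either the controller never wrote the entry, or it wrote it somewhere the host was not looking.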

Designing the most effective verification environment and selecting the best verification components are essential to achieving a “shift left” in the verification timeline. By reusing components, sequences and the like, more time can be spent uncovering and fixing real DUT bugs. Don’t discount the amount of time saved by good debug facilities: preventing bad tests, pointing to DUT issues, flagging DUT misconfigurations, and so on.
