The most challenging and time-consuming step in the design of any chip is verification: the more complex the chip, the greater the verification task. Our design team was well aware of that when we began to develop a highly complex chip for traffic management. The chip, which will be released this quarter, uses OC-192 technology to provide high-density, low-power solutions for the OC-48 market. We needed an emulation system that would let us verify the chip and still get it to market on time. In this article, we describe the rationale for developing our homegrown emulation system. Most emulation systems are FPGA-based. Because of the complexity of our chip, however, such a system would have required us to divide the design into more manageable blocks that could fit into individual FPGAs, and virtually all the commercially available emulation solutions we looked at had serious shortcomings in their partitioning ability. The emulation products also made it difficult to get real-time data in and out to validate the design.
Commercially available hardware acceleration solutions also had significant shortcomings. While those products could accelerate the RTL code, there was little or no support for testbenches. Regression would have been very difficult, and the products could not support real-life traffic. In most cases, by the time hardware acceleration was accomplished, the chip could already be back from fabrication. We did not have the luxury of unlimited time and money, and at the same time the chip had to function properly at first silicon.
After weighing the options, we rejected all commercially available hardware acceleration and emulation solutions in favor of building our own emulation system.
To understand the size and complexity of the chip, it is useful to walk through the block diagram shown in Fig. 1. Packets or cells are received from either the line or the switch fabric at the incoming interfaces. The data proceeds to the lookup engine, where one of several lookups is performed to identify the flow: an ATM lookup, a multiprotocol label switching (MPLS) packet lookup or a lookup on an MPLS-like tag supplied by a network processing unit or classifier.
Next, the segmentation and reassembly engine, with the help of external data memory, segments packets into cells, reassembles cells into packets or propagates packets or cells. External data memory is used to enqueue the packet or cell payload until it is scheduled for dequeue onto an outgoing interface. The header append block adds the outgoing header onto packets or cells, which could be either ATM cells or cells specifically configured for the switch fabric. The header append block also does MPLS processing such as tag translation and push/pop. Finally, the outgoing interfaces send the packets or cells out of the device to the line or switch fabric.
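The segmentation and reassembly step described above can be sketched in software. This is our own illustrative model, not the chip's RTL; the 48-byte cell payload size is an assumption borrowed from ATM convention, and the article does not specify the chip's internal cell format.

```python
# Illustrative sketch of SAR behavior: split a packet into fixed-size cell
# payloads (padding the last cell) and reassemble the original packet.
# The 48-byte payload size is an assumption, not taken from the article.
CELL_PAYLOAD = 48  # bytes per cell payload

def segment(packet: bytes):
    """Split a packet into cell-sized payloads, zero-padding the last one."""
    cells = []
    for off in range(0, len(packet), CELL_PAYLOAD):
        chunk = packet[off:off + CELL_PAYLOAD]
        cells.append(chunk.ljust(CELL_PAYLOAD, b"\x00"))
    return cells

def reassemble(cells, length):
    """Concatenate cell payloads and strip the padding back to packet length."""
    return b"".join(cells)[:length]
```

For a 100-byte packet, `segment` produces three 48-byte cells and `reassemble(cells, 100)` returns the original packet; in the chip, the payloads live in external data memory between enqueue and dequeue rather than in a Python list.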
While the data path performs those tasks, the per-flow queue keeps track of all the data stored in data memory on a per-flow basis. The traffic shaper and output scheduler manage the activity of the per-flow queue and determine how to schedule the packets or cells out to the outgoing interfaces.
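A toy model of the per-flow queue may make this concrete. This is our own sketch, not the ASSP's design: we assume a simple round-robin scheduler purely for illustration, whereas the real chip's shaper and scheduler are far more sophisticated.

```python
from collections import deque

# Toy per-flow queue: cells are enqueued per flow, and a round-robin
# scheduler (our simplifying assumption) picks which flow's head-of-line
# cell goes to the outgoing interface next.
class PerFlowQueue:
    def __init__(self):
        self.queues = {}        # flow_id -> deque of cells
        self.active = deque()   # flows with data pending, in service order

    def enqueue(self, flow_id, cell):
        # A flow becomes active when its queue goes from empty to non-empty.
        if flow_id not in self.queues or not self.queues[flow_id]:
            self.active.append(flow_id)
        self.queues.setdefault(flow_id, deque()).append(cell)

    def dequeue(self):
        """Return (flow_id, cell) for the next scheduled cell, or None."""
        if not self.active:
            return None
        flow_id = self.active.popleft()
        cell = self.queues[flow_id].popleft()
        if self.queues[flow_id]:
            self.active.append(flow_id)  # still backlogged: rejoin the rotation
        return flow_id, cell
```

With cells a, b queued on flow 1 and c on flow 2, dequeue order is (1, a), (2, c), (1, b): the scheduler interleaves backlogged flows rather than draining one flow at a time.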
Supporting 1 million simultaneous flows and 256 Mbytes of external memory, the chip is truly large and complex, and also very difficult to validate in a software verification environment. Our design team wrestled with the verification acceleration issue early in the development process. Designers have traditionally relied on one of two methods for accelerating the verification process: emulation and hardware acceleration.
Rather than create a testbench, which would not have produced a realistic environment, we chose to feed real-life traffic into our emulation system. We built a front-end card that connected directly to an OC-48 line, bringing in packets and cells that were processed by our traffic manager (Fig. 2).
The proprietary emulation system consisted of a CompactPCI chassis with two cards: a front-end card, to bring in traffic from an incoming OC-48 line, and an emulation card housing the FPGAs for emulating our system.
The front-end card included OC-48 optics to convert the optical signal to electrical, and CDR and serdes to recover the clock and produce parallel Sonet data. A framer/mapper extracted ATM cells and packets from the subchannels within the OC-48 signal. The adapter FPGA, which consisted of portions of both the incoming and outgoing interface blocks, adapted PL3 signals from the framer/mapper into our internal interface. Those signals were sent over a semi-flexible cable to the emulation FPGA. The CompactPCI interface consisted of several devices connecting the front-end card to the CPCI chassis, so that the card could be managed by the CPCI computer housed in the same chassis. The emulation FPGA included the same CPCI interface, allowing it to be addressed by the control processor.
To eliminate the partitioning problem, we built the emulation system with a separate FPGA for each block of the chip, avoiding the partitioning issue altogether. The interface of each FPGA remained exactly the same as the corresponding interface within the chip itself, resulting in a high correlation between our FPGA setup and the chip design.
A further risk remained: should the interfaces between the blocks require modification, the partitioning could break down if the number of FPGA-to-FPGA interconnects grew significantly. To mitigate that risk, we overdesigned the interfaces between the blocks, adding 50 percent more I/O than they initially needed.
Where the ASSP and FPGA implementations differed, as with memory blocks, we put wrappers around the memory (one wrapper for the FPGA memory, a different one for the ASSP) so that both looked exactly the same to the RTL. Synthesis switches let us select either the FPGA or the ASSP version.
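The wrapper technique can be sketched as a software pattern: two implementations behind one common interface, chosen by a single switch, so the rest of the design never changes. This is our own analogy in Python, not the actual RTL; all class and variable names are ours, and in the real flow the switch was a synthesis option rather than a runtime flag.

```python
# Software analogue of the memory-wrapper technique: the rest of the
# "design" talks only to MemoryWrapper, and one switch selects which
# implementation sits behind it. Names are illustrative, not from the chip.
class MemoryWrapper:
    """Common read/write interface the rest of the design sees."""
    def read(self, addr):
        raise NotImplementedError
    def write(self, addr, data):
        raise NotImplementedError

class FpgaMemory(MemoryWrapper):
    """Stand-in for an FPGA block-RAM-backed memory."""
    def __init__(self):
        self._ram = {}
    def read(self, addr):
        return self._ram.get(addr, 0)
    def write(self, addr, data):
        self._ram[addr] = data

class AsspMemory(MemoryWrapper):
    """Stand-in for the ASSP's fixed-size memory macro."""
    def __init__(self, words=1024):
        self._ram = [0] * words
    def read(self, addr):
        return self._ram[addr]
    def write(self, addr, data):
        self._ram[addr] = data

TARGET = "fpga"  # the "synthesis switch" of the analogy
mem = FpgaMemory() if TARGET == "fpga" else AsspMemory()
```

Because both classes honor the same `read`/`write` contract, code written against `MemoryWrapper` is oblivious to which memory is underneath, which is exactly what kept the RTL identical across the FPGA and ASSP builds.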
We designed the emulation system to run the 200-MHz (OC-192) design at 25 MHz (OC-24 speed), one-eighth the normal speed, to simplify FPGA place and route. That enabled us to complete the synthesis and place-and-route process and validate the results within a reasonable amount of time. We used the densest FPGAs to maintain utilization below 60 percent, which again simplified the place and route.
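The clock scaling above is a simple ratio, checked here in a few lines of our own arithmetic (the MHz figures are from the article; the variable names are ours):

```python
# Clock scaling for the emulation build: one-eighth of the ASIC clock.
ASIC_CLOCK_MHZ = 200   # OC-192-rate design clock (from the article)
SCALE = 8              # slowdown chosen to ease FPGA place and route
fpga_clock_mhz = ASIC_CLOCK_MHZ / SCALE
assert fpga_clock_mhz == 25.0  # the OC-24-equivalent speed cited above
```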
Among the advantages of this homegrown emulation was the ability to run many millions of bytes of traffic through the system, performing near-real-time emulation using real-life traffic. That allowed us to go far beyond traditional simulation environment testing and not only determine what worked and what didn't, but react to any deficiencies prior to tapeout.
For example, we found a system-level bug associated with traffic shaping that was only evident when incoming test traffic entered the chip at 80 percent or less of the programmed shaper rate. Under those conditions, the design would not function correctly. Such bugs are found only when the ASSP is operating under live traffic and not in a simulation environment. Further, debugging such complicated bugs requires access to internal signals, which is almost impossible on a flip-chip package.
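To see why a shaper's behavior can depend on offered load, consider a minimal rate shaper. This sketch is ours, not the chip's shaper design, and it does not reproduce the actual bug; it only shows that arrivals below the programmed rate exercise a different path (shaper always has credit) than arrivals at or above it (cells queue for credit), which is the kind of condition that live traffic hits and a hand-built testbench may never reach.

```python
# Minimal rate shaper (our illustration): unit-size cells, one departure
# allowed per 1/rate seconds. Below the programmed rate, every cell departs
# on arrival; at or above it, cells wait for shaper credit.
def shape(arrivals, rate):
    """Return departure times for cells arriving at the given times."""
    interval = 1.0 / rate      # minimum spacing between departures
    next_ok, out = 0.0, []
    for t in arrivals:
        depart = max(t, next_ok)   # wait only if credit is not yet available
        out.append(depart)
        next_ok = depart + interval
    return out
```

With a rate of 1 cell/s, arrivals at 80 percent load ([0, 1.25, 2.5]) depart untouched, while a back-to-back burst ([0, 0, 0]) departs at [0, 1.0, 2.0]: two distinct behaviors from the same code, selected purely by the traffic pattern.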
Our emulation system additionally provided quick regression. If we found a bug, we could fix the bug in RTL and resynthesize the FPGAs, regressing on the entire design to make sure the bug was fixed and nothing else was broken. Emulation further aided us in software development, enabling our team to have a "working" chip and test how the chip behaved as part of the system long before tapeout. We found some problems in the interaction among software, chip and system that we were able to fix in the chip. Another benefit of our homegrown emulation environment was its streamlined debugging capability. Whenever we uncovered a problem, either pre- or postsilicon, we could easily repeat the problem in the emulation environment, building a debug environment around our FPGAs to trap the problem and determine precisely what was wrong. We could customize that debug environment for each bug we found.
Finally, the emulation system let customers effectively validate our chip in their own labs using their own algorithms and environment.
It should be noted that, in spite of its many benefits, our homegrown emulation system was not a replacement for verification and simulation; it merely augmented simulation. While emulation is ideal for system-level testing, simulation is a must for performing detailed, feature-level testing, which cannot be done effectively in an emulation environment.
--- Bidyut Parruck is chief technology officer and co-founder of Azanda Network Devices (Sunnyvale, Calif.). Before forming Azanda, Parruck was CTO of Paxonet Communications, where he defined and led development of more than 40 communications IP cores. He has a BS degree in electronics from the Indian Institute of Technology (Kharagpur, India) and an MSEE degree from Virginia Tech.
Copyright © 2002 CMP Media LLC
5/1/02, Issue # 14155, page 16.