Designing a high-performance, low-cost switch fabric chip-set using COT

By Guntram Wolski, Director of VLSI, Tau Networks, Saratoga, Calif., EE Times
November 27, 2002 (10:53 p.m. EST)
URL: http://www.eetimes.com/story/OEG20021122S0028

Tau Networks recently announced the T64 switch fabric chipset, consisting of two multi-million gate chips. The T64 scales transparently from 2.5 Gbit/sec to 640 Gbit/sec of full-duplex user bandwidth. It employs patent-pending queue management to provide up to 64 ports in a fully configured fabric, each with 64 dynamic rate flow-controllable sub-ports and 16 flexible classes of service. Each chip, containing over 34 million transistors along with custom logic and high-speed I/Os, was designed using a Customer Owned Tooling (COT) flow.

The product strategy depended on delivering outstanding performance and integration at one-half to one-fifth the price of competitive offerings. In order to achieve these goals, we paired the architecture with an effective in-house COT methodology that combined custom and semi-custom components.

Through extensive planning, an implementation employing carefully crafted custom blocks, numerous 3.125 GBd serdes (serializer/deserializer), and traditional place and route (P&R) logic was selected. Flip-chip packaging with true area I/Os and power was employed to meet high-speed I/O signal integrity requirements. One key example of this methodology was the implementation of the T64's scheduling algorithm employing a mix of multiple custom circuits, place and route blocks and high-density wiring.

Our engineers were able to implement T64 in a production-proven and cost-effective 0.15 micron technology. The COT flow facilitated the inclusion of diverse internal and external IP, including standard cells, generated RAMs, PLLs and serdes technology.

So, exactly how was our team able to deliver these multi-million gate chips in record time? Choosing the right process technology is key to any successful COT development. While 0.13 micron technology was available at the project's inception, history has shown that early adoption of a leading-edge process entails a great deal of risk to schedule, development cost and manufacturability. Delays, and additional cost, often occur due to process, library and other changes during development. In addition, daunting mask fees at 0.13 micron were a major consideration. By proving early that all size and timing requirements could be met in 0.15 micron technology, the risk and cost of a more aggressive choice could be avoided entirely.

The team was also able to disqualify the use of 0.18 micron technology by evaluating both a semi-custom and full-custom implementation of a timing-difficult key block — neither of which met timing. Given the goal of delivering chips at one-half to one-fifth the cost of competitive offerings, 0.15 micron technology was the appropriate choice.

With over three million placeable objects on just one of the chips, and with logic design and verification running in parallel with physical implementation, a hierarchical design environment was clearly required. The team depended on rapid turnaround through physical design in order to explore different implementation options. Even with substantial hierarchy in the design, many of the blocks were larger than desired, requiring additional computing resources and longer turnaround time through place and route. Because of the hierarchy, it was necessary to efficiently assemble the higher levels of the chip, optimizing global wiring, clocks and power. Many blocks achieved physical closure early while others were still under development. Furthermore, this hierarchical approach allowed the entire design flow, from synthesis to final design rule checking, to be proven and refined as the effort progressed from block to block, finally bringing full chips together.

One of the most important reasons for adopting an in-house COT physical design flow was to provide easy integration of full custom circuitry with traditional P&R logic. A side benefit was the ability to perform logic, circuit and physical design in parallel with close communication between front-end and back-end engineers.

Thanks to careful planning and concurrent logic and physical design, it became apparent early in the design cycle which blocks would require custom development. Based solely on size and wiring, many of these blocks would have been entire chips on their own in a standard ASIC flow. The full custom design flow involved schematic capture, SPICE simulation, and implementation with the company's version of the Magic layout editor. By implementing these blocks using full-custom design techniques, an enormous amount of functionality could be integrated in a small area while meeting cycle-time goals. All blocks adhered to strict guidelines for interfaces and clocking to simplify integration. Since custom block development was performed in parallel with logic development, preliminary models for the blocks with pessimistic timing estimates were made available for the physical team until final models became available. Behavioral models were used for logic and system simulation until refined gate and switch level models were available for final verification.

Close communication between the logic and physical design teams allowed for rapid feedback to the logic designers regarding the physical implementations of their designs. This led to rapid timing and floorplan closure upon logic completion, and would not have been possible (or would have taken far more time) in a standard ASIC flow. By working on the physical implementation concurrently with the logic, logic designers were able to make timely changes in order to meet structural requirements. IC Wizard provided routing and congestion analysis, while Sonar provided rapid feedback on achievable timing that was much more accurate than possible with traditional wire-load models.

I/O integration

The T64 design required integration of two custom I/O technologies. Numerous 3.125 GBd serdes channels were implemented for backplane communication between the two chips in the T64 chipset. The serdes were combined with P&R logic to form a hard macro before integration into the final design. Extensive planning was required to enable post-fabrication testing and to ensure a "quiet" electrical environment for these sensitive analog circuits.

Tau's team also developed a full-custom Network Processor Forum Streaming Interface (NPSI). This highly crafted analog I/O block is capable of over 20 Gbit/sec, and provides fully automatic deskew to simplify system design at these high speeds. A simple clocked interface to the block was maintained to minimize chip integration issues. HSPICE and Antrim's AMS tool were used for custom block simulation. The AMS tool allowed circuit simulations of the complete, extracted, mixed-signal high-speed block with the LVDS I/Os to be run with Verilog stimuli within a complete Verilog system context. I/O design proceeded in parallel with the chip design and was merged in near the end of the design process.

Flip-chip interconnect was used to improve power distribution and to enhance signal integrity for all of the T64's high speed I/O. Full area-I/O design takes maximum advantage of Redistribution Layer (RDL) interconnect and flip-chip capabilities, but creates other unique challenges. A handcrafted RDL and bump layer was created as an overlay to the T64 devices. Power grid and bump redistribution were manually created and merged with the chip database for final verification.

Timing closure was accomplished by synthesizing logic with Synopsys Design Compiler based on minimal wire-load models and a reduced cycle time to create "minimum logic" gate level representations. This minimal logic model was used as the basis for Monterey's physical synthesis tools to re-optimize in place for best results. Sonar was used for physical synthesis and prototyping with actual cycle time goals, and Dolphin for final optimization and P&R.

Multiple experiments demonstrated the importance of guiding synthesis tools to the choice of correct architecture and implementation, but it was more efficient to let Sonar/Dolphin perform technology (re)mapping and drive strength selection as these tools are aware of the physical world. It is not optimal to allow the logic synthesis tool to buffer, restructure and resize logic when it is going to be redone during physical synthesis and prototyping. Maximum fan-out rules were required during synthesis to ensure consistent results. These rules also helped to ensure that the logic design team was aware of the physical implementation challenges of their logic.
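A maximum fan-out rule of the kind described above can be illustrated with a short check. This is only a sketch: the netlist representation (pairs of net name and load pin), the function name and the limit of four loads are all hypothetical, and none of it reflects the actual constraint syntax of Design Compiler, Sonar or Dolphin.

```python
from collections import defaultdict


def find_fanout_violations(connections, max_fanout):
    """Return {net: fanout} for every net driving more than max_fanout loads.

    connections: iterable of (net_name, load_pin) pairs. This is a
    hypothetical, simplified netlist view, not a real tool format.
    """
    loads = defaultdict(set)
    for net, load_pin in connections:
        loads[net].add(load_pin)
    return {net: len(pins) for net, pins in loads.items() if len(pins) > max_fanout}


# Illustrative data only: net "sel" drives five loads, net "d0" drives two.
example = [("sel", f"mux{i}/S") for i in range(5)]
example += [("d0", "mux0/A"), ("d0", "mux1/A")]

violations = find_fanout_violations(example, max_fanout=4)
print(violations)  # {'sel': 5}
```

A report like this gives logic designers an early view of nets that the physical tools will later have to buffer.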

Simplex Fire&Ice QX was used as the "golden" extraction tool for both custom designs and P&R logic, and as validation of Dolphin's timing results. PrimeTime static timing analysis was used for static timing sign-off. Each block was individually "closed" and an abstract model created for use at the next higher level for P&R and to reduce STA run times. Full chip STA, without abstracted models, was completed prior to tapeout for final sign-off verification. A continuous effort was made to correlate the different STA results. One clear outcome from these efforts was the realization that even small extraction discrepancies can result in large differences in STA results.
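The correlation effort can be pictured as a comparison of per-endpoint slack between two timing reports. The sketch below assumes the reports have already been parsed into dictionaries. The endpoint names, slack values and tolerance are invented for illustration, and no real report format is implied.

```python
def correlate_slacks(reference, candidate, tolerance_ns):
    """Return {endpoint: delta_ns} where the two reports disagree.

    reference, candidate: dicts mapping endpoint name to slack in ns
    (hypothetical, pre-parsed data). Endpoints missing from the
    candidate report are skipped rather than flagged.
    """
    mismatches = {}
    for endpoint, ref_slack in reference.items():
        if endpoint not in candidate:
            continue
        delta = candidate[endpoint] - ref_slack
        if abs(delta) > tolerance_ns:
            mismatches[endpoint] = round(delta, 3)
    return mismatches


# Invented numbers: one endpoint looks 0.26 ns better in the P&R report
# than in the sign-off report, which is enough to hide a violation.
signoff = {"sched/q_reg[0]/D": 0.12, "sched/q_reg[1]/D": -0.05, "core/a_reg/D": 0.40}
pnr = {"sched/q_reg[0]/D": 0.10, "sched/q_reg[1]/D": 0.21, "core/a_reg/D": 0.41}

mismatches = correlate_slacks(signoff, pnr, tolerance_ns=0.05)
print(mismatches)  # {'sched/q_reg[1]/D': 0.26}
```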

The "donut" bus

One key T64 block, the scheduler, required routing a very large number of 32-bit buses between internal structures. Multiple attempts to place and route this block with little success in the available area indicated custom design would be required to complete it. The custom solution was to hand craft a large "donut bus" around a block of P&R logic that was then further embedded in P&R logic. This donut bus used all available metal layers in every direction, with non-minimum spacing for signal integrity quality, and provided optimized structural connectivity for the circuitry.

No P&R tool would have been able to build this block with the regularity and structure required. The donut bus was carefully designed to ensure that signals driven onto the bus would meet their required timing at the receiving end. Cross-sections of this donut bus were simulated using a device simulator and extracted parasitic information was merged with the P&R logic for STA. By separating the logic from the bus, design changes could be incorporated rapidly.
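For a rough feel of why long bus wires need this kind of care, a first-order Elmore delay estimate is a common back-of-the-envelope tool. The sketch below is not the method described above (which used device simulation and extracted parasitics), and every resistance, capacitance and length value is a made-up placeholder rather than T64 data.

```python
def elmore_delay(r_driver, r_per_mm, c_per_mm, length_mm, c_load, segments=100):
    """First-order Elmore delay in seconds for driver + uniform RC wire + load.

    The wire is modelled as `segments` equal L-sections, each a series
    resistance followed by a shunt capacitance. Units: ohms, farads, mm.
    """
    r_seg = r_per_mm * length_mm / segments
    c_seg = c_per_mm * length_mm / segments
    delay = 0.0
    r_upstream = r_driver
    for _ in range(segments):
        r_upstream += r_seg             # resistance between source and this node
        delay += r_upstream * c_seg     # each node capacitance sees that resistance
    delay += r_upstream * c_load        # receiver load at the far end
    return delay


# Placeholder values: 100-ohm driver, 50 ohm/mm, 0.2 pF/mm, 4 mm wire, 10 fF load.
delay = elmore_delay(100.0, 50.0, 0.2e-12, 4.0, 10e-15)
print(f"{delay * 1e12:.1f} ps")  # 163.8 ps
```

Because the wire term grows with the square of the length, doubling a route roughly quadruples its intrinsic delay, which is one reason a regular, hand-planned bus structure is easier to close on timing than an arbitrary route.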

Due to the large number of 3.125 GBd serdes I/O macros required in the T64 chipset (88 in all), a highly optimized macro structure, including aspect ratio, pin locations, area I/O, and so on, was required to minimize die area. The resulting structure differed considerably from any known "off the shelf" macro organization, and allowed for easy routing of the large number of wires required to drive the serdes. Tight control over I/O design resulted in minimum die size and allowed I/O interface logic to be placed and routed within the structure of the I/O cells. The resulting macro was re-used many times.
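The scale of the serdes integration is easier to appreciate as an aggregate figure. The arithmetic below uses the channel count and baud rate quoted above. The payload figure additionally assumes binary signalling with 8b/10b line coding, which is typical for 3.125 GBd links but is not stated in this article.

```python
SERDES_COUNT = 88       # from the article: serdes macros across the chipset
BAUD_RATE_GBD = 3.125   # from the article: per-channel signalling rate


def aggregate_rate_gbd(channels, baud_gbd):
    """Total raw signalling rate across all channels, in GBd."""
    return channels * baud_gbd


def payload_rate_gbps(raw_gbd, payload_bits, line_bits):
    """Payload rate in Gbit/sec for a block line code (one bit per symbol)."""
    return raw_gbd * payload_bits / line_bits


raw = aggregate_rate_gbd(SERDES_COUNT, BAUD_RATE_GBD)
# ASSUMPTION: 8b/10b coding (8 payload bits per 10 line bits), not from the article.
payload = payload_rate_gbps(raw, payload_bits=8, line_bits=10)
print(raw, payload)  # 275.0 220.0
```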

The use of flip chip technology and RDL allowed great flexibility in the power scheme. Power was fed directly into the die wherever it was needed through dedicated planes in the multi-layer package, rather than consuming routing resources on the die itself. Careful planning ensured an integral power grid spanning virtually the entire chip. Power grid integrity is also provided through bump connectivity to the planes in the flip chip package. On-chip power distribution was verified with Simplex VoltageStorm. RDL and passivation openings for the bumps were manually drawn with the Magic layout editor and put into a cover macro that was overlaid onto the die after successful LVS runs of the full chip.

Clock tree synthesis (CTS) runs simultaneously with place and route, so a separate CTS effort was not required. No direct control of specific latency within the blocks was required; latency was automatically balanced at the top level during P&R based on awareness of each block's insertion delay.
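The top-level balancing step can be sketched as follows: given each block's clock insertion delay, pad the faster blocks so that every block sees the same total latency. The block names and delays are hypothetical, and the real tools work on the clock network itself rather than on a simple table like this.

```python
def balance_clock_latency(insertion_delay_ns):
    """Return the top-level padding (ns) each block needs to match the slowest.

    insertion_delay_ns: dict mapping block name to its internal clock
    insertion delay in ns (illustrative values, not T64 data).
    """
    target = max(insertion_delay_ns.values())
    return {
        block: round(target - delay, 3)
        for block, delay in insertion_delay_ns.items()
    }


# Hypothetical blocks and delays.
blocks = {"scheduler": 1.20, "queue_mgr": 0.85, "serdes_if": 0.40}
padding = balance_clock_latency(blocks)
print(padding)  # {'scheduler': 0.0, 'queue_mgr': 0.35, 'serdes_if': 0.8}
```

The block with the longest insertion delay sets the target and receives no padding. Every other block is delayed at the top level by the difference.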

Use of a COT physical design flow afforded the company many benefits that were critical to the successful delivery of the T64 switch fabric chipset:

optimized integration of full-custom macros with P&R logic;
incorporation of numerous full-custom, high-speed analog I/Os;
concurrent logic and physical design to minimize iterations required to close timing and area of each block;
and the flexibility to choose the most appropriate process technology based on project requirements.

By employing an in-house COT physical design flow using a suite of state-of-the-art tools, Tau met its performance, cost, and time-to-market goals. The effectiveness of this flow has now been demonstrated by first-pass functional silicon.