Single-mask simplicity needed for SoC
By Laurence H. Cooke, Consultant, and Zeev Wurman, Vice President, Software Development, eASIC Corp., San Jose, Calif., EE Times
March 26, 2001 (2:59 p.m. EST)
The move to multimillion-gate chips has made it necessary to adopt design-reuse strategies for these new system-on-chip devices. This necessity stems both from the cost of logic redesign and reverification and from the increasing expense of dealing with the physical implementation artifacts, such as signal integrity or clock distribution, of deep-submicron processes.
At the same time, an increasing number of metal layers has eliminated traditional gate array technology as a viable option, and shrinking mask lithography has pushed nonrecurring-engineering (NRE) costs above $500,000 per prototype. At the other end of the spectrum, FPGAs, with no NRE and zero prototype-turnaround time, have grown larger thanks to process geometry shrinks. They now include buses, special I/O, memory blocks and processors in addition to the FPGA logic, yet their density and performance remain far behind those of standard cells.
The demise of the gate array market came at a most unfortunate time for the industry. Gate arrays disappeared because they could not deliver value when processes required 10 or 15 metal masks. Yet such a low-cost, high-performance product is desperately needed in a marketplace where the cost of going to silicon is heading upward of $1 million.
A technology that employs single-metal-mask programmable interconnects is a viable alternative that can deliver an efficient solution. The interconnect programming provides a low-NRE option for configuration, with performance closer to standard cell and turnaround time closer to FPGA. The single-mask programming technology can be used for standalone ASIC products as well as for intellectual-property (IP) cores designed to be implemented within a system-on-chip (SoC) platform-based design. Such a solution is described here as an answer to today's design challenges.
Even with a single-mask programming (SMP) solution to implement an embedded design, designing millions of gates from scratch is unacceptable from a cost and time-to-market perspective. Hence the need for kernels. A kernel is defined as a customizable hard core that contains the common IP used for all the derivative designs in a specific marketplace. This kernel consists of the application programming interface, real-time operating system, processors, bus, memory and critical common I/O functions. These are optimized for performance, size or power as required by the market segment for which the platform is designed.
Subsequent derivatives are created by combining hardware and software IP and integrating them, along with the kernel, into an SoC chip. A derivative can be created quickly because the engineering activity is limited to IP selection, integration and verification, rather than designing from scratch. The solution is efficient because the critical timing of the bus and processor has already been solved.
Creating a platform SoC, which consists of a kernel and an SMP fabric that can be customized for each derivative, further reduces the NRE and manufacturing turnaround time compared with a standard-cell implementation.
A number of SMP architectures exist in the marketplace. All of them have a preexisting structure of wire segments, connected by jumpers patterned on the customized metal layers. Those wire segments connect small structures of transistors or simple gates into custom user logic.
The technology works well with two or three layers of interconnect, but becomes via-constrained as technology moves toward six to eight metal layers: contacts to these small device and gate structures must traverse all the way to the upper metal layers, creating large vertical blockages and congestion on the custom layers.
On the other hand, FPGA designers have long recognized that it takes fewer interconnects to wire larger-granularity cells together. That is the reason most commercial FPGAs have coarse cell structures compared with gate arrays or standard cells. Fewer interconnects imply fewer jumpers to connect them, and coarse-grained cells also require proportionally less jumper customization at the top metal layers. Consequently, the silicon area can be used more efficiently.
Based on these observations, a novel fabric has been designed that combines the advantages of large, FPGA-like, SRAM-programmable logic cells, called eCells, connected by segmented pure-metal routing using SMP. The proposed structure has FPGA-like programmability with density and performance closer to standard cells, thanks to its metal interconnection.
The eCell consists of a pair of three-input lookup tables (LUTs), connected to a flip-flop through a mux. One input of each LUT is driven by a two-input NAND gate. It also includes two inverters of different strength, which can be separately connected to any signal to redrive it.
The LUTs can perform any three-input function, with the NAND providing an LUT-4 subset. The complete cell, equivalent to about 12 logic gates, needs only three metal jumpers for configuration. The rest of the area can be dedicated to single-mask programmable routing.
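As a rough behavioral sketch, the eCell's combinational path can be modeled in software. The function and signal names below are illustrative assumptions rather than eASIC's actual netlist; the point is how feeding one LUT input from a two-input NAND turns a three-input LUT into a subset of four-input (LUT-4) functions.

```python
# Behavioral sketch of the eCell's combinational path (illustrative only).

def lut3(table, a, b, c):
    """3-input lookup table: 'table' is an 8-bit truth table."""
    index = (a << 2) | (b << 1) | c
    return (table >> index) & 1

def ecell_lut(table, x0, x1, b, c):
    """One eCell LUT whose first input is driven by a 2-input NAND of
    x0 and x1, giving a subset of 4-input functions from a 3-input LUT."""
    nand = 1 - (x0 & x1)
    return lut3(table, nand, b, c)

# Example: configure the LUT as a 3-input XOR (truth table indexed a<<2|b<<1|c).
XOR3 = 0b10010110
print(ecell_lut(XOR3, 1, 1, 0, 1))  # NAND(1,1)=0, so 0 ^ 0 ^ 1 -> 1
```

Any of the 256 possible 8-bit truth tables can be loaded, which is what "any three-input function" means in practice.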
With conventional metal, this is a two-customized-mask process: one via mask and one wire mask. With the more recent copper process using chemical-mechanical planarization, the vias and metal are patterned from one mask. A standard via mask is used in conjunction with the customized metal mask, and vias are formed only where both masks exist.
If a line is designed to pass over a via site, the via is subtracted from the metal mask. In this way the etching process does not complete the via, but etches only enough to form the groove for the metal line. Metal is then deposited over the whole structure and mechanically polished off the high (nonpatterned) areas.
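The mask logic just described is essentially set intersection and subtraction. The sketch below models the masks as sets of (x, y) grid sites; the coordinates and names are made up purely for illustration.

```python
# Sketch of the single-mask via-formation logic, with sets of (x, y)
# grid sites standing in for lithography masks (illustrative only).

standard_via_mask = {(0, 0), (0, 2), (1, 1), (2, 2)}   # all prefabricated via sites
custom_metal_mask = {(0, 0), (0, 1), (0, 2), (2, 2)}   # wires the design wants
passing_lines     = {(0, 1), (0, 2)}                   # metal that merely crosses a via site

# A via is formed only where both masks exist ...
vias = standard_via_mask & custom_metal_mask
# ... minus the sites subtracted from the metal mask because a line only
# passes over them, so the etch stops at the wire groove instead.
vias -= passing_lines

print(sorted(vias))  # [(0, 0), (2, 2)]
```

The design tool thus customizes only one mask, yet controls both wire routing and via placement.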
The basic cell is tiled in a 16 x 16 array (or eUnit), with no routing channels between the cells; all routing happens over the cell. Eight such units make up the basic configurable embedded core, or eCore.
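Taken together with the roughly 12-gate equivalence of each eCell noted earlier, these figures give a quick back-of-envelope capacity for one eCore:

```python
# Back-of-envelope eCore capacity from the figures in the article.
cells_per_unit = 16 * 16        # eCells in one 16 x 16 eUnit
units_per_core = 8              # eUnits per eCore
gates_per_cell = 12             # approximate gate equivalence of an eCell

gates_per_core = cells_per_unit * units_per_core * gates_per_cell
print(gates_per_core)           # 24576, i.e. roughly 25K gates per eCore
```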
Each eCore has its own built-in scan string. The scan string is a simple, single connection through all flip-flops in the eCore, based on a mux-scan structure, so the system clock is used during scan mode to shift data into and out of the eCore.
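The mux-scan arrangement can be sketched as follows: each flip-flop's input mux selects functional data in normal mode and the previous flop's output in scan mode, so the ordinary system clock shifts the whole chain. The function and signal names are assumptions for illustration, not eASIC's.

```python
# Minimal model of a mux-scan string (illustrative only).

def clock_tick(flops, data_inputs, scan_in, scan_enable):
    """Return the flip-flop states after one system-clock edge."""
    new_state = []
    prev = scan_in
    for q, d in zip(flops, data_inputs):
        new_state.append(prev if scan_enable else d)
        prev = q  # each stage captures the *previous* flop's old output
    return new_state

# Shifting the pattern 1, 0, 1 into a three-flop scan string:
chain = [0, 0, 0]
for bit in (1, 0, 1):
    chain = clock_tick(chain, [0, 0, 0], bit, scan_enable=True)
print(chain)  # [1, 0, 1]
```

With scan_enable low, the same function captures functional data, which is why no separate scan clock is needed.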
Each eCore has a low-skew, low-power clock grid. The clock line is buffered and both the clock and its inversion are distributed as an evenly loaded clock grid. Such a grid is known to have far less skew than a balanced-tree structure, but it usually takes much more power. To compensate for this, the grid is connected between each plus and minus clock driver with a shunting N-channel transistor, which is enabled for a short time during clock transition to allow charge sharing.
This minimizes the noise and power consumption of the clock structure, while keeping the skew to a minimum. Finally, some of the eCells can be configured as dual-port SRAM, providing the small, distributed RAM blocks embedded within the logic.
Each core can be viewed as a black-box hard IP to be connected at the top level. Specific signals can easily be assigned connection locations for subsequent wiring when configuring the core. The actual number of I/O connections is in the thousands, but varies depending on the number of interconnect layers available.
The interconnect is a series of prefabricated segments that run significant distances in the lower layers, in predominantly perpendicular directions between pairs of layers. The underlying cells, clocks and scan take up the first three or four wiring layers. Typically four layers would be used for the SMP routing, split between short and long segments. Jumpers connect the short and long segments, or change routing direction. The longer segments periodically shift over one track and rotate to the other side of the channel, ensuring that aggressor nets do not travel next to victim nets for long before being rotated away. This technique reduces the need to reroute to avoid crosstalk signal-integrity problems.
Because the devices inside the eCell are fine-tuned and the output drive is selectable, the resulting power and performance numbers are much closer to those of standard cells than of FPGAs.
Such an SMP fabric can be used within a platform design to accelerate design turnaround time even further. In general, all the variations that may occur in a derivative design should be kept outside the hardened part of the kernel. These include the interrupt controller, any timers or counters, the protocol for the memory controller, the USB stack, all interface logic for the nondedicated bus ports and the address space on all the bus ports. The bus arbiter should probably be hardened, since it is time-critical.
A platform like this one, with a kernel and 1 million additional gates of user-defined SMP logic, should fit in an 8-mm² die in 0.18-micron technology. This size die is highly manufacturable and would be able to support more than 240 signal pins.
This skinny platform has the advantage of using an existing SMP fabric chip, thereby minimizing platform design costs. Furthermore, the user obtains the entire fixed-platform IP with a single purchase, and therefore avoids the hassle of costly and lengthy IP acquisition.
In addition, the whole derivative design methodology for implementing user logic on the skinny platform can be put online, such that the tools, models and methodology can be accessible on a per-use basis from the platform provider. Both Cadence and Synopsys have announced plans to provide such online access to tools.
Thus, a skinny platform using single-mask programmable cores will provide the next generation of intermediate-volume product designers with quick time-to-market and low NRE costs for their SoC design needs.