The advantages of using massive software parallelism in EDA
By John Lee, Vice President, Magma Design Automation
In the past few years, terms such as multi-threading, multi-processing, and marketing terms derived from these have started to appear as features for existing electronic design automation (EDA) software. Concurrently, availability of cheap compute resources, embodied best in the multicore central processing units (CPUs) available today, can provide a cost-effective way to reduce EDA software runtimes.
Physical design and physical verification software are examples of compute-intensive EDA applications that require such techniques. An example is a typical mask that may include billions of physical geometries. Each geometry in the mask layout must be created, custom libraries designed, then placed and routed, and assembled into a full chip. Each geometry must be verified to match the manufacturing requirements from the foundry, against an ever-increasing number of complex design rules. Single-CPU runtimes can easily exceed hundreds of hours for modern designs. Parallelism is clearly needed.
Two well-known trends, increasing design complexity and increasing manufacturing complexity, make the challenge larger every year. For example, designers migrating from 90-nanometer (nm) to 65-nm process nodes are seeing a 3X increase in design checking complexity. This is compounded with the transition to 45- and 40-nm, where reduced manufacturing process latitude makes design rule checking (DRC) much more complicated for design and verification software. Complex end-of-line and via configuration rules driven by sub-wavelength photolithography issues are good illustrations. What used to be a single measurement is now a complex set of measurements embodied in rules, equations or models.
This computational burden is further complicated as concurrent design and analysis is needed to avoid design schedule delays. It is also no longer sufficient to perform physical design and then run a separate physical verification stage. Because of the manufacturing rule complexity, all-layer, rule and sign-off checking are needed during physical design. If not done, costly design delays will occur as design teams clean up DRC or layout versus schematic (LVS) errors after place and route is done, adding another level of complexity when concurrency is introduced.
Finally, the need for on-time project delivery is paramount, especially today, when there are fewer design starts, higher mask costs and dire financial implications if market windows are missed.
While complexity increases, design times must remain constant, if not improve. All companies are looking to do more with less.
A target design today is at 40-nm with 100-million cell instances at the top level and has a mix of memory and analog/mixed-signal content. The computational complexity to efficiently design and verify this increases 6X, making it extremely time consuming to use a traditional flow that barely managed to complete 65-nm designs.
Delivering a solution requires fresh software architectures with flexibility for linear scalability across a large number of CPU cores, along with a powerful data model that allows concurrency between design and verification.
The Desired Solution
Hardware vendors, faced with the reality that increasing clock frequency is not sustainable due to power constraints, have adopted parallelism to drive improved performance. Consider multicore CPUs in a multi-socket motherboard. Graphics processor units (GPUs) are highly parallelized single instruction, multiple data (SIMD) machines and, with a proper application programming interface (API), such as NVIDIA's CUDA interface or AMD's ATI Stream, are potentially powerful solutions for general computing.
The solution for software vendors is to use such resources effectively. Effectiveness is measured by the ability of the software to use hardware to accelerate computation affordably; that is, with no costly custom hardware solutions. Cost is incurred when a large amount of memory is needed, the software is constrained to run on only a single machine, or a costly custom interconnect is needed.
For example, a reasonably configured eight-core, dual-CPU x86 Linux computer with 64 gigabytes (GB) of memory can be purchased for well under $5,000. Increasing the memory footprint beyond this is costly: going to 128GB or higher can dwarf the cost of this basic configuration, reaching $40,000 or more.
An ideal solution would scale linearly on a standard network of Linux machines, either four or eight core, in standard (64GB) or low-end (32GB or 16GB) configurations. Linear scalability implies that running on four computers will be 4X faster than running on one computer. Fortunately, such Linux compute farms are commonplace these days, with either a load sharing facility (LSF) or GRID distributed computing used to dynamically schedule jobs for a large number of software applications, from synthesis and circuit simulation to physical design and verification.
A Reference Solution for Physical Verification
Physical verification tools perform geometric analysis on a design's layout to verify manufacturability. DRCs are a large subset of these analyses, and a simple example is to check the spacing from one wire to the next.
Traditional physical verification tools rely upon a standard database approach for such computation. The technique is simple: a designer represents the layout in a fast, searchable data structure and queries neighboring geometries to check the distance between wires.
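A minimal sketch of this database-style spacing check follows. It is an illustration only, not how any particular tool works: the rectangle format (x1, y1, x2, y2), the uniform-grid index, the names GridIndex and spacing_violations, and the choice to treat touching or overlapping shapes as connected rather than as violations are all assumptions made for the example.

```python
from collections import defaultdict
from math import floor, hypot


def spacing(a, b):
    """Euclidean gap between two axis-aligned rectangles (0 if they touch or overlap)."""
    dx = max(0, b[0] - a[2], a[0] - b[2])
    dy = max(0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


class GridIndex:
    """Hypothetical in-memory layout 'database': rectangles bucketed on a uniform grid."""

    def __init__(self, rects, cell):
        self.rects = list(rects)
        self.cell = cell
        self.buckets = defaultdict(list)
        for i, r in enumerate(self.rects):
            for key in self._cells(r):
                self.buckets[key].append(i)

    def _cells(self, r, margin=0):
        # Grid cells covered by the rectangle after growing it by `margin` on every side.
        c = self.cell
        for gx in range(floor((r[0] - margin) / c), floor((r[2] + margin) / c) + 1):
            for gy in range(floor((r[1] - margin) / c), floor((r[3] + margin) / c) + 1):
                yield gx, gy

    def neighbors(self, i, margin):
        """Indices of rectangles that share a grid cell with rectangle i grown by `margin`."""
        found = set()
        for key in self._cells(self.rects[i], margin):
            found.update(self.buckets.get(key, ()))
        found.discard(i)
        return found


def spacing_violations(rects, min_space, cell=10):
    """Return index pairs (i, j), i < j, whose gap is positive but below min_space."""
    index = GridIndex(rects, cell)
    violations = set()
    for i in range(len(index.rects)):
        for j in index.neighbors(i, min_space):
            if i < j and 0 < spacing(index.rects[i], index.rects[j]) < min_space:
                violations.add((i, j))
    return sorted(violations)
```

Every check is a lookup into a structure that holds the whole layout, which is why this style depends on the layout fitting in one machine's memory.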
Accelerating this type of computation across multiple CPU cores seems trivial: a designer would compute distances for many wires in parallel. Indeed, this approach can scale well if all CPUs are on the same machine and the machine has sufficient memory to completely represent the layout.
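As a rough sketch of that simple parallel approach, the layout can be cut into vertical strips and each strip checked by a separate worker. Everything here is hypothetical: the strip scheme, the halo equal to the minimum spacing, and the function names are assumptions for illustration. A thread pool stands in for real cores only to keep the example self-contained, and all workers share one machine and one in-memory copy of the layout.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import floor, hypot


def spacing(a, b):
    """Euclidean gap between two axis-aligned rectangles (x1, y1, x2, y2)."""
    dx = max(0, b[0] - a[2], a[0] - b[2])
    dy = max(0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


def split_into_strips(rects, strip_width, halo):
    """Assign each rectangle to every vertical strip it reaches once grown by `halo`."""
    strips = defaultdict(list)
    for r in rects:
        first = floor((r[0] - halo) / strip_width)
        last = floor((r[2] + halo) / strip_width)
        for k in range(first, last + 1):
            strips[k].append(r)
    return strips


def check_strip(job):
    """Brute-force spacing check inside one strip; pairs are returned in sorted order."""
    rects, min_space = job
    return {(a, b) for a, b in combinations(sorted(rects), 2)
            if 0 < spacing(a, b) < min_space}


def parallel_spacing_violations(rects, min_space, strip_width, workers=4):
    # A halo of min_space guarantees that any violating pair lands together
    # in at least one strip; the set union removes pairs reported twice.
    strips = split_into_strips(rects, strip_width, halo=min_space)
    jobs = [(strip, min_space) for strip in strips.values()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_strip, jobs))
    violations = set()
    for found in results:
        violations |= found
    return sorted(violations)
```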
Unfortunately, the size of designs is such that both of these assumptions are now invalid. First, a simple divide-and-conquer approach does not work if the rules are complex. At 65-nm and 45/40-nm, new connectivity-based checks are commonplace, reflecting that electrical connectivity must be considered to ensure manufacturability and reliability.
Second, the number of geometries is such that a single machine does not have a sufficient number of CPU cores to meet turnaround-time requirements. Computationally, it's inefficient to query databases from one machine to another on a standard Linux network.
Finally, the size of layout databases (100GB plus in many 40-nm designs) is such that a database approach requires costly, high-memory hardware to use this computational method. Some tools require 256GB machines.
Recently, computer scientists have evangelized the use of a different approach termed data flow or streaming architecture. In the case of physical verification, the layout is not represented as a database but rather as a stream of geometries, much in the way that an MP3 file is a stream of sounds and not a collection of notes.
An advantage of a streaming approach is that it is friendly to multicore, multi-CPU and multi-machine setups. Because there is no longer a dependency on having an in-memory database, it no longer matters where the processing cores are: they can be on different die, different packages, different motherboards or different machines.
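To make the contrast with the database approach concrete, here is a small sketch of a spacing check written in a streaming style. It assumes, for illustration only, that rectangles (x1, y1, x2, y2) arrive sorted by their left edge; the function names are hypothetical and the article does not describe any tool's internals at this level.

```python
from math import hypot


def spacing(a, b):
    """Euclidean gap between two axis-aligned rectangles (0 if they touch or overlap)."""
    dx = max(0, b[0] - a[2], a[0] - b[2])
    dy = max(0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


def stream_spacing_violations(stream, min_space):
    """Yield violating pairs from rectangles that arrive sorted by left edge x1.

    Only the shapes still close enough to matter are kept in memory;
    the full layout is never stored.
    """
    active = []
    for rect in stream:
        # A shape whose right edge is at least min_space to the left of this
        # one cannot violate against it or against anything arriving later.
        active = [a for a in active if a[2] + min_space > rect[0]]
        for a in active:
            if 0 < spacing(a, rect) < min_space:
                yield a, rect
        active.append(rect)
```

Because the function consumes any iterator, the geometries could just as well come from a file or a network socket as from a list, which is the property that makes the location of the processing core irrelevant.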
Streaming architecture enables parallelism that can go from one core to four cores to 16 cores to 64 cores and beyond. As a guideline for standalone physical verification software, linear scalability on 16 cores is needed for effective turnaround time on 65-nm nodes. For 45/40-nm, 32-core to 64-core scalability is needed, and 128-core scalability will be needed for effective 32/28-nm full-chip verification.
A second advantage is that streaming architectures have low memory usage because there is no central layout database. As most designers know, even a 2X reduction in memory usage can make the difference between a fast run on existing hardware and one that dies and requires costly upgrades to hardware or induces long schedule delays. Streaming gives designers the ability to use big iron for applications that require large memory (sign-off timing analysis programs, for example) and use standard hardware for efficient stream applications.
A third advantage is that streaming architectures can run faster than traditional database approaches even on a single CPU core. Database approaches, by nature, involve a random query from the CPU to the main memory database. Computer algorithms can make the algorithmic complexity of such a query efficient (e.g., O(log n), where n is the number of geometries). The speed of the query, however, is hampered by the CPU waiting while the memory request falls through from L1 cache to L2 cache to main memory, and possibly to disk swap.
Streaming architectures keep all data localized, often within the CPU cache limit, which provides at least a 10X advantage over a main memory fetch. While not all of the runtime of EDA software is spent waiting for memory fetches, it's clear that streaming can be used to improve even single-CPU efficiency.
A disadvantage of streaming architectures is precisely their strength. Because there is no centralized memory database, operations that require non-localized data may be complicated. For example, if the desired computation is to grab two random objects from the design, then a database approach on a large-memory machine will be faster, assuming the data is loaded into memory already.
Physical verification applications, from DRC (including antenna and connectivity-based checking) and LVS to electrical rules checking, are amenable to stream computation. For connectivity-based operations, such as an antenna check or a voltage-dependent net check, each net can be visualized as a stream of geometries connected to each other.
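The idea of a net as a group of connected geometries can be sketched with a small union-find routine. This is a deliberately simplified, hypothetical example: it assumes a single layer, treats rectangles (x1, y1, x2, y2) that touch or overlap as electrically connected, ignores vias, and compares every pair of shapes, which a production tool would not do.

```python
def touches(a, b):
    """True if two axis-aligned rectangles overlap or share an edge or corner."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def extract_nets(rects):
    """Group rectangles into nets, where a net is a set of mutually connected shapes."""
    parent = list(range(len(rects)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if touches(rects[i], rects[j]):
                parent[find(i)] = find(j)

    nets = {}
    for i, r in enumerate(rects):
        nets.setdefault(find(i), []).append(r)
    return sorted(nets.values())
```

Once the shapes are grouped this way, a per-net rule can consume each group as its own short stream of geometries.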
An example of a streaming architecture for physical verification is Magma's Quartz DRC and Quartz LVS. Unlike traditional physical verification tools, Quartz was written to target massively parallel computation. Most other tools rely upon architectures that work well for the compute server environments common in the 1980s and 1990s, when the software was designed.
An illustration of this advantage is shown in the chart below. Here, the number of CPU cores is increased from one, or the baseline, to 64. The relative speedup is plotted on the Y-axis. A linearly scalable application follows the 45-degree line: 4X speedup for 4X the number of CPUs. As shown, a streaming tool can scale linearly in standard LSF environments. To compare, traditional database approaches scale reasonably as long as all computation can be done on the same machine. Past a certain point, however, scalability falters and eventually saturates, in this case between eight and 16 cores.
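For readers who want to place their own runs on such a chart, the arithmetic is short. The helper names below are hypothetical, and the runtimes in the usage note are placeholders chosen for easy arithmetic, not measurements from any tool.

```python
def speedup(baseline_runtime, parallel_runtime):
    """Relative speedup versus the single-core baseline (the chart's Y-axis)."""
    return baseline_runtime / parallel_runtime


def efficiency(baseline_runtime, parallel_runtime, cores):
    """Fraction of ideal linear scaling achieved; 1.0 lies on the 45-degree line."""
    return speedup(baseline_runtime, parallel_runtime) / cores
```

With a placeholder baseline of 64 hours on one core, a 16-hour run on four cores gives a speedup of 4.0 and an efficiency of 1.0, which is linear scaling. An 8-hour run on 16 cores gives a speedup of 8.0 but an efficiency of only 0.5, the kind of shortfall that shows up when scalability begins to saturate.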
Stream computing is not a panacea, however. Because place and route is highly interactive and involves many different services, from parasitic modeling to timing and noise analysis, a centralized data structure/database is most efficient. For such applications, careful attention to algorithm design can enable massive parallelism as well.
An additional challenge for stream computing is to leverage non-traditional cores. In the above example, Quartz is using standard CPU cores common in microprocessors from IBM, Intel and AMD.
Specialized cores found in GPUs from NVIDIA and AMD lack the general-purpose computation capability that CPU cores have. Specialized EDA applications have been ported to GPUs, and the future roadmaps from the graphics providers, including Intel, point toward more powerful GPU cores.
About the author
John Lee joined Magma in 2004 through the acquisition of Mojave Design, where he was a co-founder. Previously, he held senior engineering management positions with Synopsys and Avant!, where he managed market-leading products in circuit simulation and parasitic extraction. He has Bachelor of Science and Master of Science degrees in electrical and computer engineering from Carnegie Mellon University in Pittsburgh, Penn.
Contact Magma Design Automation Inc.