Fremont, Calif. — ARM Ltd. raised few eyebrows with the long-anticipated announcement of its Cortex A8 CPU core a couple of weeks ago. But the launch did raise two important questions: Who really needs a 1-GHz general-purpose CPU core, and if it is needed, who will be capable of implementing it?
Although neither question has an immediate or straightforward answer, industry observers last week held that a blazingly fast CPU core will find applications awaiting. Indeed, at least two other chip makers — PMC-Sierra and Intrinsity — have implemented discrete MIPS microprocessors operating in the 1-GHz range, and at least one system-on-chip (SoC) vendor, Raza Microelectronics, has pushed a 90-nanometer internally developed embedded MIPS core to 1.5 GHz for high-performance networking.
But even ARM is cautious on the fabrication issue. "I would say that a 600-MHz core is feasible for at least five of our silicon partners," said Kerry McGuire, Cortex product-marketing manager. "I think we will see a broad range of power and performance points in the Cortex family."
"Such designs require a design team [to be] much closer to the process technology than has been common in the past," said Avner Goren, director of the cellular systems group at Texas Instruments Inc.
John Cornish, processor division marketing vice president at ARM, laid out the Cortex A8 performance facts. "This is the fastest processor core available that is suitable for mobile and consumer applications," he said. "It is capable of at least 600-MHz implementation within a power envelope of 300 milliwatts."
ARM partners present at the introduction were more bullish on clock frequency. Because Samsung has achieved 800-MHz operation on a 130-nm ARM-11 core, the company is confident it will be able to ship a 1-GHz Cortex A8, in an unspecified, more-advanced process, in 2007, said Sung-bae Park, vice president of SoC R&D in Samsung's LSI division.
To determine whether the mobile and consumer media applications ARM envisions truly need such high-performance general-purpose CPU cores, the first step is to look at current architectural practice in these areas. Today, systems-on-chip in these applications use the general-purpose CPU core primarily in two ways: for control functions and for highly variable, non-performance-critical signal processing. Any inner loop that is even vaguely stable and requires significant processing power is mapped at the least onto a programmable DSP core, where it can run faster and, usually, more efficiently, or into a hardware accelerator, where it gets the optimum combination of energy efficiency and speed.
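That partitioning rule — control and volatile code on the CPU, stable and performance-critical inner loops on a DSP or hardwired block — can be stated as a simple placement policy. The C sketch below is purely illustrative; the type and function names are invented for this example and do not come from any real SoC toolkit.

```c
/* Hypothetical task descriptor; fields mirror the two criteria in the
 * partitioning rule above: algorithmic stability and performance need. */
typedef enum { RUN_ON_CPU, RUN_ON_DSP, RUN_ON_HW_ACCEL } target_t;

typedef struct {
    const char *name;
    int stable_algorithm;   /* inner loop unlikely to change after tapeout */
    int perf_critical;      /* needs sustained signal-processing throughput */
} task_t;

/* Control code and non-critical, highly variable processing stay on the
 * general-purpose CPU; stable heavy loops go to hardware, unstable ones
 * to a programmable DSP. */
target_t place_task(const task_t *t)
{
    if (!t->perf_critical)
        return RUN_ON_CPU;
    return t->stable_algorithm ? RUN_ON_HW_ACCEL : RUN_ON_DSP;
}
```

A faster CPU core shifts only the threshold of this policy, not its shape: more loops can afford to stay in the `RUN_ON_CPU` branch.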
"For functions that are easily fixed to one algorithm, we expect there will be some dedicated hardware used with the Cortex A8," said Cornish.
"But in the market," McGuire said, "you will see quite a variety of closely and loosely coupled accelerators used with cores like these. Among other things, it depends on the SoC designers' skills."
The availability of more CPU power won't diminish the attractiveness of such CPU-accelerator combinations, according to some architects. "We believe architectures that use multiple, specialized cores provide lower-power solutions, even into the foreseeable future," said TI's Goren. "Specialized hardware will always give you more energy efficiency."
But Goren cited "a second consideration that is just as important, even in mobile systems: Because of their performance and the fact that tasks don't contend on them, specialized task processors give the system superior quality-of-service and responsiveness compared with lumping the tasks onto an arbitrarily fast CPU core."
Goren was careful to point out, however, that in his mind "specialized" does not equal "hardwired." He believes that even the most specialized accelerators still benefit from some level of programmability, for reasons of both flexibility and ease of reuse.
So if the hard tasks are going to be offloaded anyway, is there a need for such high performance in the CPU core? Goren suggested there might well be.
For one thing, he observed, access to a very fast CPU core means that in low-end systems, the CPU could be asked to shoulder a larger portion of the task load — even at the cost of lower battery life — in exchange for a smaller die area. That could make it possible for a product line based on a single SoC design — perhaps with multiple tapeouts — to span a larger range of price/performance points.
Moreover, said Goren, having lots of headroom in the CPU — and in the accelerators, for that matter — "provides a safety net for the unexpected."
The surprises can come in two forms, he said. "One, there can be unexpected new tasks imposed by a change in the product definition. Two, there can be unanticipated bursts in the workload with known tasks.
"If you have more headroom, you have more ability to dynamically manage the load. I would say a 1-GHz CPU would give you enough headroom for the flexibility you'd need over the lifetime of today's new products."
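Goren's headroom argument reduces to a back-of-the-envelope cycle budget: if the planned task set consumes only part of the core's capability, an unanticipated burst fits as long as it is smaller than the remainder. The C sketch below models that arithmetic; the structure and the numbers used are illustrative assumptions, not figures from TI or ARM.

```c
/* Toy model of clock-frequency headroom. All MHz figures are illustrative. */
typedef struct {
    int max_mhz;      /* what the silicon can deliver, e.g. 1000 for a 1-GHz core */
    int steady_mhz;   /* equivalent MHz consumed by the planned task set */
} cpu_budget_t;

/* Returns 1 if an unexpected burst of 'burst_mhz' extra work can be
 * absorbed within the remaining headroom, 0 otherwise. */
int burst_fits(const cpu_budget_t *c, int burst_mhz)
{
    return (c->max_mhz - c->steady_mhz) >= burst_mhz;
}
```

By this accounting, a 1-GHz core loaded to 600 MHz absorbs a 300-MHz burst that would stall a core with no margin — which is the flexibility Goren argues products need over their lifetimes.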
ARM's Cornish made an equally interesting point. When tasks are ill-defined and subject to frequent changes in algorithms — as seems always to be the case early in the life of even standard codecs — those tasks start out in software. They migrate to hardware only when they become stable enough to risk a tapeout.
But if the CPU is faster, more of the application can remain in software, and fewer inner loops have to be committed to accelerators. This, along with clever use of programmability or configurability in the accelerators, means an application can be committed to an SoC sooner. And that spells reduced time-to-market, even at the cost of higher power.
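Cornish's migration path — ship the immature algorithm in software, move it to hardware only once it stabilizes — is cheap at the system level if callers reach the inner loop through an indirection. The C sketch below is a hypothetical illustration, not ARM code: the same function signature is bound first to a portable software version and can later be rebound to an accelerator driver without touching any caller.

```c
/* Hypothetical codec step behind a function pointer, so the software
 * implementation can later be swapped for an accelerator driver. */
typedef void (*transform_fn)(const int *in, int *out, int n);

/* Portable C version: ships first, while the algorithm is still in flux.
 * The doubling loop is a stand-in for the real inner loop. */
static void transform_sw(const int *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2;
}

/* Later tapeout: identical signature, backed by hardware. Stubbed here;
 * a real driver would program the accelerator and wait for completion. */
static void transform_hw(const int *in, int *out, int n)
{
    transform_sw(in, out, n);
}

static transform_fn active_transform = transform_sw;  /* software first */
```

Rebinding `active_transform = transform_hw;` is the whole migration from the callers' point of view, which is why a faster CPU that keeps more loops in `transform_sw` longer costs nothing architecturally.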
But if a 1-GHz CPU in an SoC is desirable, can anybody fabricate one? Here, the news is more mixed. "The design challenges associated with aggressive performance without compromising either dynamic or leakage power are very serious," said TI's Goren. "Over time, I'm sure, tools and libraries will rise to the challenge. But today, the design team has to be intimate with the capabilities and problems in the process, and that gives an advantage to teams that are, one way or another, very well-connected to their fabs."
Part of the problem, as Goren suggested, is power management. The Cortex A8 design has been architected with the assumption that it will be implemented using ARM's dynamic voltage island technology — an approach that is just being supported in the necessary Artisan libraries and that is still a generation away for mainstream electronic design automation tools.
"It's the partners' choice how to implement the core," ARM's McGuire said. "Our job has been to provide an architecture that defines the natural boundaries, which in turn define the voltage islands in a useful way."
Another part of the problem is simply achieving the delay budgets necessary for such astronomical clock frequencies. "Today, 600 MHz requires a methodology that mixes traditional RTL synthesis with prestructured netlists and some hard blocks," McGuire said. "The design is still predominantly cell-based. But there are some structured netlists in the more timing-critical areas."
"There are also some custom blocks," Cornish said. "For instance, there are some array-based structures implementing things like the translation lookaside buffers. And if you are really pushing performance, you would use physical IP [intellectual-property] tiles for the basic RAM blocks in the primary caches, for example."
Design-for-manufacturing will also be a serious issue with such a design. At 1 GHz, process variations that affect timing will have to be absolutely minimized if there is to be meaningful yield. That will require a combination of unusual skills in the design team and, as TI's Goren suggested, a highly unusual level of candor between design and process engineers.
It's likely that within a few years, tools and libraries will have caught up to the problem. "We are naturally working to define a design flow and a set of deliverables that will make the performance of the core accessible to a wide range of customers," Cornish said.
For the pioneers, it won't be easy. But in the cases where this kind of CPU performance can mean being first to market or observably better on qualitative in-system performance, it may be a gamble worth taking.