Subtract software costs by adding CPUs

Subtract software costs by adding CPUs
By Jack Ganssle, Embedded.com
Apr 26 2005 (18:59 PM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=161600589

Click here for reader response to this article.

The best way to get a project done on time is to break it into smaller, more manageable chunks. That theory works better than some programmers suspect, extending even into multiprocessing systems. Jack Ganssle shows how breaking one big program into smaller programs that each run on their own processor can work wonders on performance, schedules, and sanity.

In 1946 programmers created software for the ENIAC machine by rewiring plug-boards. Two years later the University of Manchester's Small-Scale Experimental Machine, nicknamed Baby, implemented von Neumann's stored program concept, for the first time supporting a machine language. Assembly language soon became available and flourished. But in 1957 Fortran, the first high-level language, debuted and forever changed the nature of programming.

In 1964, Dartmouth BASIC introduced millions of (comparatively) non-techies to the wonders of computing while forever poisoning their programming skills. Three years later, almost as a counterpoint, OOP (object-oriented programming) appeared in the guise of Simula 67. The C language, still the standard for embedded development, and C++ appeared in 1969 and 1985, respectively.

By the 1990s, a revolt against big, up-front design led to a flood of new "agile" programming methods including eXtreme Programming (XP), SCRUM, Test-Driven Development, Feature-Driven Development, the Rational Unified Process, and dozens more.

In the 50 years since programming first appeared, software engineering has morphed to something that would be utterly alien to the software developer of 1946. That half-century has taught us a few pivotal lessons about building programs. Pundits might argue that the most important might be the elimination of goto, the use of objects, or building from patterns.

They'd be wrong. The fundamental insight of software engineering is to keep things small. Break big problems into little ones.

For instance, we understand beyond a shadow of a doubt the need to minimize function sizes. No one is smart enough to understand, debug, and maintain a 1,000-line routine, at least not in an efficient manner. Consequently, we've learned to limit our functions to around 50 lines of code (LoC). Reams of data prove that restricting functions to a page of code or less reduces bug rates and increases productivity.

But why is partitioning so important?

A person's short-term memory is rather like cache—a tiny cache, actually—that can hold only five to nine things before new data flushes the old. Big functions blow the programmer's mental cache. The programmer can no longer totally understand the code; errors proliferate.

The productivity crash
But there's a more insidious problem. Developers working on large systems and subsystems are much less productive than those building tiny applications.

Table 1: IBM productivity in lines of code per programmer per month

Project size man months	Productivity lines of code/month
1	439
10	220
100	110
1,000	55

Consider the data in Table 1, gathered from a survey of IBM software projects.¹ Programmer productivity plummets by an order of magnitude as projects grow in scope! That is, of course, exactly the opposite of what the boss is demanding, usually quite loudly.

The growth in communications channels between team members sinks productivity on large projects. A small application, one built entirely by a single developer, requires zero communication channels—it's all in the solo guru's head. Two engineers need only one channel.

The number of communication channels between n engineers is:

This means that communication among team members grows at a rate similar to the square of the number of developers (see example in Figure 1). Add more people and pretty soon their days are completely consumed with email, reports, meetings, and memos.

Figure 1: Four workers have six comm channels

Fred Brooks in his seminal (and hugely entertaining) work "The Mythical Man-Month" described how the IBM 360/OS project grew from a projected staffing level of 150 people to over 1,000 developers, all furiously generating memos, reports, and the occasional bit of code.² In 1975, he formulated Brook's Law, which states: adding people to a late project makes it later. Death-march programming projects continue to confirm this maxim, yet management still tosses workers onto troubled systems in the mistaken belief that an N man-month project can be completed in four weeks by N programmers.

Is it any wonder some 80% of embedded systems are delivered late?

Table 2 illustrates Joel Aron's findings at IBM.² Programmer productivity plummets on big systems, mostly because of interactions required among team members.

Table 2: Productivity plummets as interactions increase

Interactions	Productivity
Very few interactions	10,000 LoC/man-year
Some interactions	5,000 LoC/man-year
Many interactions	1,500 LoC/man-year

The Holy Grail of computer science is to understand and ultimately optimize software productivity. Tom DeMarco and Timothy Lister spent a decade on this noble quest, running a yearly "coding war" among some 600 organizations.³ Two independent teams at each company wrote programs to solve a problem posited by the researchers. The resulting scatter plot looked like a random cloud; there were no obvious correlations between productivity (or even bug rates) and any of the usual suspects: experience, programming language used, salary, and so forth. Oddly, at any individual outfit the two teams scored about the same, suggesting some institutional factor that contributed to highly—and poorly—performing developers.

A lot of statistical head-scratching went unrewarded till the researchers reorganized the data as shown in Table 3.

Table 3: Correlating interruptions with productivity

	1st Quartile	4th Quartile
Dedicated workspace	78 sq ft	46 sq ft
Is it quiet?	57% yes	29% yes
Is it private?	62% yes	19% yes
Can you turn off phone?	52% yes	10% yes
Can you divert your calls?	76% yes	19% yes
Frequent interruptions?	38% yes	76% yes

The results? The top 25% were 260% more productive than the bottom quartile!

The lesson here is that interruptions kill software productivity, mirroring Joel Aron's results. Other work has shown it takes the typical developer 15 minutes to get into a state of "flow," where furiously typing fingers create a wide-bandwidth link between the programmer's brain and the computer. Disturb that concentration with an interruption and the link fails. It takes 15 minutes to rebuild that link but, on average, developers are interrupted every 11 minutes.⁴

Interruptions are the scourge of big projects.

A maxim of software engineering is that functions should be strongly cohesive but only weakly coupled. Development teams invariably act in the opposite manner. The large number of communication channels makes the entire group highly coupled. Project knowledge is distributed in many brains. Joe needs information about an API call. Mary is stuck finding a bug in the interface with Bob's driver. Each jumps up and disturbs a concentrating team member, destroying that person's productivity.

COCOMO
Barry Boehm, the father of software estimation, derived what he calls the Constructive Cost Model, or COCOMO, for determining the cost of software projects.⁵ Though far from perfect, COCOMO is predictive, quantitative, and probably the most well-established model extant.

Boehm defines three different development modes: organic, semidetached, and embedded, where "embedded" means a software project built under tight constraints in a complex of hardware, software, and sometimes regulations. Though he wasn't thinking of firmware as we know it, this is a pretty good description of a typical embedded system.

Under COCOMO the number of man-months (MM) required to deliver a system developed in the "embedded" mode is:

where KSLoC is the number of lines of source code in thousands, and F_i are 15 different cost drivers.

Cost drivers include factors such as required reliability, product complexity, real time constraints, and more. Each cost driver is assigned a weight that varies from a little under 1.0 to a bit above. It's reasonable for a first approximation to assume these cost driver figures all null to about 1.0 for typical projects.

This equation's intriguing exponent dooms big projects. Schedules grow faster than code size does. Double the project's size and the schedule will grow by more than a factor of two—sometimes far more.

Despite his unfortunate eponymous use of the word to describe a development mode, Boehm never studied real-time embedded systems as we know them today so there's some doubt about the validity of his exponent of 1.20. Boehm used American Airlines' Sabre reservation system as a prime example of a real-time application. Users wanted an answer "pretty fast" after hitting the enter key. In the real embedded world where missing a deadline by even a microsecond means the 60 Minutes crew appears on the doorstep with multiple indictments, managing time constraints burns through the schedule at a prodigious rate.

Fred Brooks believes the exponent should be closer to 1.5 for real-time systems. Anecdotal evidence from some dysfunctional organizations gives a value closer to 2. The rule of thumb becomes: double the code and multiply man-months by four. Interestingly, that's close to the nearly n2 number of communication channels between engineers noted before.

Let's pick an intermediate and conservative value, 1.35, which sits squarely between Boehm's and Brooks' estimate and is less than most anecdotal evidence suggests. Figure 2 shows how productivity collapses as the size of the program grows.

Figure 2: The productivity collapse

(Most developers rebel at this point. "I can crank out a thousand lines of code over the weekend!" And no doubt that's true. However, these numbers reflect costs over the entire development cycle, from inception to shipping. Maybe you are a superprogrammer and consistently code much faster than, say, 200 lines of code per month. Even so, that only shifts the curve up or down on the graph. The shape of the curve, the exponential loss of productivity, is undeniable.)

Computer science professors show their classes graphs like the one in Figure 2 to terrify their students. The numbers are indeed scary. A million LoC project sentences us to the 32 LoC/month chain gang. We can whip out a small system over the weekend but big ones take years.

Or do they?

Partitioning magic
Figure 2 shows software hell but it also illuminates an opportunity. What if we could somehow cheat the curve, and work at, perhaps, the productivity level for 20-KLoC programs even on big systems? Figure 3 reveals the happy result.

Figure 3: Cheating the schedule's exponential growth

The upper curve in Figure 3 is the COCOMO schedule; the lower one assumes we're building systems at the 20-KLoC/program productivity level. The schedule, and hence costs, grow linearly with increasing program size.

But how can we operate at these constant levels of productivity when programs exceed 20 KLoC? The answer: by partitioning! And by understanding the implications of Brook's Law and DeMarco and Lister's study. The data is stark: if we don't compartmentalize the project, divide it into small chunks that can be created by tiny teams working more or less in isolation, schedules will balloon exponentially.

Professionals look for ways to maximize their effectiveness. As professional software engineers we have a responsibility to find new partitioning schemes. Currently 70 to 80% of a product's development cost is consumed by the creation of code and that ratio is only likely to increase because firmware size doubles about every 10 months. Software will eventually strangle innovation without clever partitioning.

Academics drone endlessly about top-down decomposition (TDD), a tool long used for partitioning programs. Split your huge program into many independent modules; divide these modules into functions, and only then start cranking code.

TDD is the biggest scam ever perpetrated on the software community.

Though TDD is an important strategy that allows wise developers to divide code into manageable pieces, it's not the only tool we available for partitioning programs. TDD is merely one arrow in the quiver, never used to the exclusion of all else. We in the embedded systems industry have some tricks unavailable to IT programmers.

One way to keep productivity levels high—at, say, the 20-KLoC/program number—is to break the program up into a number of discrete entities, each no more than 20KLoC long. Encapsulate each piece of the program into its own processor.

That's right: add CPUs merely for the sake of accelerating the schedule and reducing engineering costs.

Sadly, most 21st-century embedded systems look an awful lot like mainframe computers of yore. A single CPU manages a disparate array of sensors, switches, communications links, PWMs, and more. Dozens of tasks handle many sorts of mostly unrelated activities. A hundred thousand lines of code all linked into a single executable enslaves dozens of programmers all making changes throughout a Byzantine structure no one completely comprehends. Of course development slows to a crawl.

Transistors are cheap. Developers are expensive.

Break the system into small parts, allocate one partition per CPU, and then use a small team to build each subsystem. Minimize interactions and communications between components and between the engineers.

Suppose the monolithic, single-CPU version of the product requires 100K lines of code. The COCOMO calculation gives a 1,403 man-month development schedule.

Segment the same project into four processors, assuming one has 50KLoC and the others 20KLoC each. Communication overhead requires a bit more code so we've added 10% to the 100-KLoC base figure. The schedule collapses to 909 man-months, or 65% of that required by the monolithic version.

Maybe the problem is quite orthogonal and divides neatly into many small chunks, none being particularly large. Five processors running 22 KLoC each will take 1,030 man-months, or 73% of the original, not-so-clever design.

Transistors are cheap—so why not get crazy and toss in lots of processors? One processor runs 20 KLoC and the other nine each run 10-KLoC programs. The resulting 724 man-month schedule is just half of the single-CPU approach. The product reaches consumers' hands twice as fast and development costs tumble. You're promoted and get one of those hot foreign company cars plus a slew of appreciating stock options. Being an engineer was never so good.

Firmware worth its weight in gold

Firmware is the most expensive thing in the universe. In his book Augustine's Laws, Lockheed Martin's Chairman Norman Augustine relates a tale of trouble for manufacturers of fighter aircraft.6 By the late '70s it was no longer possible to add things to these planes because adding new functionality meant increasing the aircraft's weight, which would impair performance. However, the aircraft vendors needed to add something to boost their profits. They searched for something, anything, that weighed nothing but cost a lot. And they found it—firmware! It was the answer to their dreams.

The F-4 was the hot fighter of the '60s. In 2004 dollars these airplanes cost about $20 million each and had essentially no firmware. Today's F-22, just coming into production, runs a cool $257 million per copy. Half of that price tag is firmware.

The DoD contractors succeeded beyond their wildest dreams.

Reduce NRE, save big bucks
Hardware designers will shriek when you propose adding processors just to accelerate the software schedule. Though they know transistors have little or no cost, the EE zeitgeist is to always minimize the bill of materials. Yet since the dawn of the microprocessor age, it has been routine to add parts just to simplify the code. No one today would consider building a software UART, though it's quite easy to do and wasn't terribly uncommon decades ago. Implement asynchronous serial I/O in code and the structure of the entire program revolves around the software UART's peculiar timing requirements. Consequently, the code becomes a nightmare. So today we add a hardware UART without complaint. The same can be said about timers, pulse-width modulators, and more. The hardware and software interact as a synergistic whole orchestrated by smart designers who optimize both product and engineering costs.

Sum the hardware component prices, add labor and overhead, and you still haven't properly accounted for the product's cost. Non-recurring engineering (NRE) is just as important as the price of the printed circuit board. Detroit understands this. It can cost more than $2 billion to design and build the tooling for a new car. Sell one million units and the consumer must pay $2,000 above the component costs to amortize those NRE bills.

Similarly, when we in the embedded world save NRE dollars by delivering products faster, we reduce the system's recurring cost. Everyone wins.

Sure, there's a cost to adding processors, especially when doing so means populating more chips on the printed circuit board. But transistors are particularly cheap inside of an ASIC. A full 32-bit CPU can consume as little as 20,000 to 30,000 gates. Interestingly, users of configurable processor cores average about six 32-bit processors per ASIC, with at least one customer's chip using more than 180 processors! So if time to market is really important to your company (and when isn't it?), if the code naturally partitions well, and if CPUs are truly cheap, what happens when you break all the rules and add lots of processors? Using the COCOMO numbers a one million LoC program divided over 100 CPUs can be completed five times faster than using the traditional monolithic approach, at about one-fifth the cost.

Extreme partitioning is the silver bullet to solving the software productivity crisis.

Intel's pursuit of massive performance gains has given the industry a notion that processors are big, ungainly, transistor-consuming, heat-dissipating chips. That's simply not true; the idea only holds water in the strange byways of the desktop world. An embedded 32-bitter is only about an order of magnitude larger than the 8080, the first really useful processor introduced in 1974.

Obviously, not all projects partition as cleanly as those described here. Some do better, when handling lots of fast data in a uniform manner. The same code might live on many CPUs. Others aren't so orthogonal. But only a very few systems fail to benefit from clever partitioning.

The superprogrammer effect
Developers come in all sorts of flavors, from somewhat competent plodders to miracle workers who effortlessly create beautiful code in minutes. Management's challenge is to recognize the superprogrammers and use them efficiently. Few bosses do; the best programmers get lumped with relative dullards to the detriment of the entire team.

Table 4: The superprogrammer effect

Size in KLoC	Best programmer (months/KLoC)	Worst programmer (months/KLoC)
1	1	6
8	2.5	7
64	6.5	11
512	17.5	21
2,048	30	32

Table 4 summarizes a study done by Capers Jones.⁷ The best developers, the superprogrammers, excel on small jobs. Toss them onto a huge project, and they'll slog along at about the same performance as the very worst team members.

Big projects wear down the superstars. Their glazed eyes reflect the meeting and paperwork burden; their creativity is thwarted by endless discussions and memos.

The moral is clear and critically important: wise managers put their very best people on the small sections partitioned off of the huge project. Divide your system over many CPUs and let the superprogrammers attack the smallest chunks.

Though most developers view themselves as superprogrammers, competency follows a bell curve. Ignore self-assessments. In "Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments," Justin Kruger and David Dunning showed that though the top half of performers were pretty accurate in evaluating their own performance, the bottom half are wildly optimistic when rating themselves.8

Other benefits
Smaller systems contain fewer bugs, of course, but they also tend to have a much lower defect rate. A recent study by Chu, Yang, Chelf, and Hallem evaluated Linux version 2.4.⁹ In this released and presumably debugged code, an automatic static code checker identified many hundreds of mistakes. Error rates for big functions were two to six times higher than for smaller routines. Partition to accelerate the schedule, and ship a higher quality product.

NIST (the National Institute of Standards and Technology) found that poor testing accounts for some $22 billion in software failures each year.¹⁰ Testing is hard; as programs grow the number of execution paths explodes. Robert Glass estimates that for each 25% increase in program size, the program complexity—represented by paths created by function calls, decision statements and the like—doubles.¹¹

Standard software-testing techniques simply don't work. Most studies find that conventional debugging and QA evaluations find only half the program bugs.

Partitioning the system into smaller chunks, distributed over multiple CPUs, with each having a single highly cohesive function, reduces the number of control paths and makes the system inherently more testable.

In no other industry can a company ship a poorly-tested product, often with known defects, without being sued. Eventually the lawyers will pick up the scent of fresh meat in the firmware world.

Partition to accelerate the schedule, ship a higher quality product and one that's been properly tested. Reduce the risk of litigation.

Adding processors increases system performance, not surprisingly simultaneously reducing development time. A rule of thumb states that a system loaded to 90% processor capability doubles development time (versus one loaded at 70% or less).¹² At 95% processor loading, expect the project schedule to triple. When there's only a handful of bytes left over, adding even the most trivial new feature can take weeks as the developers rewrite massive sections of code to free up a bit of ROM. When CPU cycles are in short supply an analogous situation occurs.

More is more
In 1946 only one programmable electronic computer existed in the entire world. A few years later dozens existed; by the 1960s, hundreds. These computers were still so fearsomely expensive that developers worked hard to minimize the resources their programs consumed.

Though the microprocessor caused the price of compute cycles to plummet, individual processors still cost many dollars. By the 1990s, companies such as Microchip and Zilog were already selling complete microcontrollers for sub-dollar prices. For the first time most embedded applications could cost-effectively exploit partitioning into multiple CPUs. However, few developers actually do this; the non-linear schedule/LoC curve is nearly unknown in embedded circles.

Today's cheap transistor-rich ASICs coupled with tiny 32-bit processors provided as "soft" IP means designers can partition their programs over multiple—indeed many—processors to dramatically accelerate development schedules. Faster product delivery means lower engineering costs. When NRE is properly amortized over production quantities, lower engineering costs translate into lower cost of goods. The customer gets that cool new electronic goody faster and cheaper.

Or, you can continue to build huge monolithic programs running on a single processor, turning a win-win situation into one where everyone loses.esp

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at jack@ganssle.com.

End notes

Jones, Capers. Applied Software Measurement: Assuring Productivity and Quality, McGraw Hill, New York, NY, 1991.
Brooks, Frederick. The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition, Addison-Wesley Professional, New York, NY, 1995.
Demarco, Tom and Timothy Lister. Peopleware : Productive Projects and Teams, Dorset House Publishing Company, New York, NY, 1999.
Jones, Capers. Assessment and Control of Software Risks, Prentice Hall, Englewood Cliffs, NJ, 1994.
Boehm, Barry. Software Engineering Economics, Prentice Hall, Upper Saddle River, NJ, 1981.
Augustine, Norman. Augustine's Laws, AIAA (American Institute of Aeronautics & Astronautics), Reston, VA, 1997.
Jones, Capers. Private study conducted for IBM, 1977.
Kruger, Justin, and David Dunning. "Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments," Journal of Personality and Social Psychology, December 1999, volume 77, No. 6.
Chou, Andy, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. "An Empirical Study of Operating System Errors," Proceedings of the 18th ACM Symposium on Operating System Principles, Oct. 2001.
RTI: The Economic Impact of Inadequate Software Testing, Planning Report 02-3, May 2002, http://www.nist.gov/director/prog-ofc/report02-3.pdf.
Glass, Robert. Facts and Fallacies of Software Engineering, Addison-Wesley, New York, NY, 2002.
Davis, Alan. 201 Principles of Software Development, McGraw-Hill, New York, NY, 1995.

That's quite a nice article (as usual), but seems to be taking a great deal for granted.

Mr. Ganssle apparently makes and accepts as 'givens', several assumptions that may or may not be correct, but are critical to his idea (using multiple CPUs to 'partition' projects) succeeding.

First, it's not very clear if Mr. Ganssle is speaking of 'soft core CPU' ASICs or 'stand alone' CPUs/MCUs (exclusively or as a combination). Assuming ASICs, why is it a given that a particular firm would have any experience with ASICs? Or, for that matter, with multiple CPU design (at any level)?

Furthermore, assuming the firm does have such experience, would it be specific (for one particular application) or generalised? Can such experience in fact be generalised (by the firm in question)?

Next, Mr. Ganssle apparently presumes, based upon a careful reading of this article, that there are no additional (and perhaps not only more complex but differently complex, out of the 'knowledge domain' of the firm in question) issues.

After all, if a company/group produces deficient software for single-processor environments, wouldn't it stand to reason that they would produce even more deficient software on more processors? (Raising complexity and interoperability issues by orders-of-magnitude)

So, the question becomes, is it easier (less bug-prone) and quicker to create multi-processor hardware, firmware and software, as opposed to for a single-processor? Where are the considerations of task distribution and inter-processor communication, etc.?

In any case, this seems to be the 'coming thing', Intel and AMD apparently have stopped reaching for higher GHz processors, now they are working on multi-processor chips.

In my opinion, Mr. Ganssle did point out the direction things will most probably be going, but left out some relevant details that should be considered.

- Joshua Newman

Reader Response

Jack,

The statement about the zero cost of the transistor is not valid for mass production volumes. When you sell embedded controllers by the 100000 or the million, every penny counts.

There is another way of looking at the complexity problem. It is to question basic design decisions with the perspective of reducing the size of code. Quite often, when a new program starts, it is reusing large chunks of older ones. When estimates show that the throughput will increase too much or memory gets sparse, we switch to the new gizmo that increase the available throughput by a factor n and that has m times more memory. But at the same time we keep the same old structure that eats up resources like hell.

How many firmware are running on big machine (using quickly all their available power) when the exact same functionality could be achieved on a smaller (cheaper/less complex) machines.

Thanks for your refreshing thoughts,

- Yves Clement

hi jack,

Interesting article. I had a few thoughts about the partitioning exercise. Say we can partition a complex problems into a bunch of "LOW COUPLED" modules running on multiple processors. How is this different from a bunch of low coupled tasks running on the same processor?

- seshadri

When you introduce multiple processors, you also introduce multiple channels of communications assuming the processors will interact (otherwise it is a trivial problem). My experience with dual port ram, FIFO, serial, etc makes me a little wary. But still, many problems are amenable to mult-processor solutions.

- Phil Gillaspy

Have you ever looked at soft processor systems in FPGA devices? (eg. Nios II from Altera)

- stefano

Fantastic article, Jack! Perfect timing, too - I am going to re-evaluate a single-processor system we are developing as a result.

- Andy Kunz

I found this article to be quite interesting. As an EE with some programming background, I have gotten involved in some embedded projects at our small engineering firm. As I have to work with many aspects of a project (mechanical, hardware, software), I am hardly a dedicated programmer. This also means that my programming time can be minimal. My solution was to use multiple 8-bit processors to carry out time-sensitive tasks (communication, motor movement) because I thought it would make programming less time consuming. However, I thought that was because I wasn't an experience programmer (who could easily cram all of my functions into a single chip) and was, to some extent, "cheating". This article definitely made me think that there might be some value to this "cheating" and I shouldn't dismiss it in the future. I also would like to say that this process made programming and debugging the individual microcontrollers a much easier process, letting me focus on function and not interaction. Thanks for the insights.

- Torrey Bievenour

As Joshua mentioned above, the trend even on classic PCs (AMD and INTEL)is multiprocessing. Mainly due to heat (cooling) problems it's with current technology difficult to further increase the CPU frequency significantly.

Semiconductor manufacturers try to make people think that by adding multiprocessors their code would run faster without changing it, like by increasing the CPU frequency, but the software can (and will) run slower, if not rewritten for multiprocessors.

I agree that it makes sense to start partitioning in hardware and not only in software, but we will face new problems. New tools will be needed to be able to help solving this new challanges. (Just think about debugging/tracing a multiprocessor system)

Let's try to extend Brooks law:

1.) adding processors to a late project makes it later. 2.) adding processors to a slow system makes it slower

...instead rethink (rewrite) your algorithms.

Perform your hardware/software partitioning carefully and be ready for new challanges.

- Robert Berger

I wholeheartedly agree with you that productivity and reliability increase when the job is divided into smaller pieces.

I wonder though, if there is a limitation to the benefit of adding processors to facilitate embedded software development time. Increasing the number of processors will make each processor development more productive, but it will have the side effect of making the overall system more complex.

As with the increase in overhead (emails, meetings) with many programmers on a job, there is a price to pay for adding processors that must communicate with each other. Although each processor probably does not need to talk to every other one, I do believe that the system integration time will expand non-linearly, that is exponentially, with the number of processors. Taking the example of a 40 KLoC project split into 2 processors - would I be better off splitting that job into 40 processors with 1K LoC each?

Each processor in a system, even if a microcontroller, adds its own overhead. There is an RTOS of some kind, initialization code, a communication channel, and perhaps external memory or peripherals.

You can still keep those small groups of programmers that were dedicated to a processor. They can be dedicated to the same piece of the system. The communication protocol between teams is now the API without the lower level communication channel. You also save with reduced overhead. You would need some discipline to implement this isolated team approach. I think there would be a very real and perhaps strong tendency to erode the separation of development teams.

Adding processors may certainly help the development cycle. As in most engineering endeavors, I think that there needs to be some analytical thought early in the project that will determine the best balance of partitioning for your system.

- Bruce Kelly

Great article, as usual. But I think you left off one real benefit of migrating to multiple CPUs, and that is power savings. I'm surprised at how few engineers know this, but the bigger a processor gets the less power efficient it is. Simply put, two 50 MIPS processors can do the work of one 100 MIPS processor - but consume less power. In many cases a lot less power.

Case in point: ARM-926 vs. ARM-1026

ARM-926 Performance: 1.1 MIPS/MHz (http://www.arm.com/products/CPUs/families/ARM9Family.html) ARM-926 Power Consumption: 0.5 mW/MHz (on a 0.13u process) (http://www.arm.com/products/CPUs/ARM926EJ-S.html)

ARM-1026 Performance: 1.35 MIPS/MHz (http://www.arm.com/products/CPUs/ARM1026EJS.html) ARM-1026 Power Consumption: 1.05 mW/MHz (on a 0.13u process) (http://www.arm.com/products/CPUs/ARM1026EJS.html)

Of course, taking the MIPS/MHz numbers from a vendor's website is a bit suspect ;-), but it should hold for this comparison.

The ARM-926 gives you 1.1 MIPS/MHz / 0.5 mW/MHz = 2.2 MIPS/mW The ARM-1026 gives you 1.35 MIPS/MHz / 1.05 mW/MHz = 1.28 MIPS/mW

So to deliver a given throughput, say the mythical "Million Instructions Per Second", the ARM-926 would consume 0.45 mW and the ARM-1026 would consume 0.78 mW. The ARM-1026 uses 78% more power to deliver the same performance, although usually it is clocked faster - and burns even more power. Two ARM-926 processors can easily out perform one 1026. In fact, two ARM7 processors can out run a 1026. The ARM7 runs at 0.8 MIPS/MHz and 0.14 mW/MHz, or 2.5 times lower power consumption than the 926 without a cache. Comparing the 7 to the 9 is a bit imprecise, as the 7 specifications on ARM's website do not include a cache. Even though we are not comparing apples to apples here (so be careful drawing conclusions), a simple back of the envelope computation gives the ARM-7 a whopping 5.7 MIPS/mW and can sustain a MIPS at a mere 0.175 mW. This is about 1/4th the power that the 1026 takes to do the same work.

You can do the same computations for your favorite processor family (I just happen to know the ARM family pretty well). Make sure you are comparing numbers for the same silicon process (0.18u or 0.13u, etc), voltages, cache sizes, features, and so on. You'll find that power consumption goes up pretty dramatically as the size of the processor grows. The CPU designers are at a point of diminishing returns when it comes to adding transistors to the processor.

I suspect that CPU vendors don't want designers to know this. As the demand for compute power goes up, they want to sell bigger and bigger processors, which command premium prices. Although, it also seems that engineers want to program biggest processor on the block.

- Anonymous

I regularly read your articles in Embedded Systems and I almost always find myself agreeing with you, but unfortunately, I can't say that about your most recent article "Subtract Software Cost by Adding CPU"". While you made an excellent case for partitioning a larger system into small, loosely coupled, independent sub-systems capable of being independently debugged, I didn't see any argument supporting an implementation using separate hardware. Why can't the independent pieces execute on the same CPU? Asynchronous message passing on a single CPU ( like POSIX mqueues ) is a whole lot easier to implement than robust inter-processor communication. In my experience with multiple CPU', your 10% extra labor for communication is a gross under-estimation. Also, using a single CPU allows much of the integration testing to occur before the hardware is done (and the hardware is easier to build). Imagine trying to update firmware in the field if 6 different processors need the correct update instead of copying six executeables to a single flash device.

The only case I see for HW partitioning is that forces the partitioning. SW partitioning is maintained through discipline since it is easier to violate.

Just another view to think about!

- Jack Rosenbloom

If you could only see me giving you a standing ovation! I must agree with you because of my personal experience, as I have been practicing this approach for years.

As you can see from your fan mail, the opinion on the subject is quite divided. My reasoning for this is based on one of your earlier articles, "The demise of an embedded generalist". The embedded generalists will tend to agree with adding CPUs to reduce cost. Non generalists will never comprehend the beauty of this approach. Obviously, systems requiring more diverse operations are more friendly to multiple CPU/MCU architectures.

My own non-exhaustive list of advantages is as below:

1) Eliminate the need of an OS etc.

2) Easily substitute some subsystems with off-the-shelf modules, than build. Save thousand of dollars in time and resources.

3) Fast time to market. I have saved years in some cases.

4) Performance independence. Take an example of a Bluetooth module, with profiles running in the module itself. Any change in the host application has a zero impact on the performance of Bluetooth profiles.

5) Easily scalable architecture. Changes are extremely localized.

I would compare this approach to Lego blocks... and I love both.

-Irwin Singh

Industry Articles

Subtract software costs by adding CPUs