by John Goodacre, Multiprocessing Program Manager, ARM
The real change now affecting the embedded market is that application software is also being asked to view the general purpose processor element through a multiprocessing paradigm, so that this processor can also benefit from the promises of higher performance and low power. Although multiprocessing and multithreading both expose this multiprocessing complexity to the embedded developer, all is not equal. This article inspects the costs and trade-offs between the two.
The Race for Performance
During the last decade the differentiating factor in desktop processor design was simple: speed. Companies such as Intel and AMD were single-minded in their approach to processor design, both determined to develop and release higher frequency processors before the other.
The race to release the world’s first GHz processor was hotly contested, with AMD emerging as the eventual winner. During that time both organizations were focused on their quest, and the industry slowly became aware of the increased hardware complexity associated with higher MHz processors. The industry also realized that the MHz-only route could not go on indefinitely and that other approaches were needed. In addition to improvements in processor efficiency, the option of raising total performance by supporting thread-level parallelism presented itself through Multiprocessor (MP) and Multithreading (MT) technologies.
Intel was the first to move, with an MT technology known as ‘Hyper-Threading’, while AMD more reservedly positioned itself for what clearly became the dual core race, with both seeking to be the first to offer a true MP solution to the home computing market. What has caused this paradigm shift towards MP from two prominent semiconductor companies?
More recently, this shift to multiprocessing has been carrying many of the software paradigms growing in popularity on the desktop over to embedded designs. For many years, embedded designers have exploited the fact that including multiple processors in a design lets them better provide the required computational performance within a limited power budget. The real change now affecting the embedded market is that application software is also being asked to view the general purpose processor element through a multiprocessing paradigm, so that this processor can also benefit from the promises of higher performance and low power. Although MP and MT both expose this multiprocessing complexity to the software developer, all is not equal when you inspect the costs and complexity trade-offs between the two.
Changing Nature of Processing (long-term)
The rapid advance of processor technology is putting a continued strain on silicon designers. In the desktop space this has truly limited performance growth, with only a few hundred MHz distinguishing today’s part from that of a year ago.
To continue raising performance, silicon designers have therefore had to look to the processor architecture of their next generation designs to provide the flexibility and scalability that consumer demands require. Working against any fundamental shift in processor architecture is the associated software technology, which, because of existing investments, cannot move to a radically different architecture. The history of computing is littered with architectures that, despite a clear computational advantage, failed to reach any level of adoption because of the disruption and demands they placed on the software community. Any architectural move into multiprocessing must take this into consideration, and must therefore find an approach that trades the theoretical ideal offered by multiprocessing against the costs and complexities the technology imposes on existing software paradigms.
Furthermore, this drive towards multiprocessing is becoming all the more attractive as many embedded devices clearly exhibit multiple concurrent activities across their applications and/or operating system. This software concurrency makes the adoption of hardware architectures such as MP or MT (or a combination of the two) all the more inevitable, as they can potentially deliver the performance and efficiency required by the next generation of embedded devices while also promising the historically elusive portability of software investments.
Multiprocessing and Multithreading
Both MP and MT strive to improve total processor performance and therefore position themselves to decrease the processing time for any application that exposes concurrent software threads for execution. The two technologies however take different approaches in the hardware to address these goals and will subsequently offer different levels of success for any particular example of software code.
It is a common misconception that MP and MT are comparable technologies and that they demand of software the same level of complexity to harness the multiprocessing nature of the hardware architecture. As soon as you look beneath the otherwise common multiprocessing programming interface, the differences in the approaches soon show that the programmer must fully understand the consequence of whether their multiprocessing solution is based on MT, MP or a combination of the two.
The basic philosophy of MT is that it strives to increase total processor performance by utilizing the periods of inefficiency in a uniprocessor design, typically caused by pairing a high frequency processor with much slower memory. Unfortunately, history has shown that the benefits of this approach to multiprocessing are far from clear. Fundamentally, MT should be classed as a uniprocessor in which only the minimum level of processor logic is duplicated so as to support additional hardware threads. Typically this is at least the programmer’s register set, and often enough of the CPU’s supervisor state that today’s operating system (OS) can view the hardware thread as a virtual processor. This sharing of the remainder of the processor logic introduces a major problem that increases software complexity. In the simple case of concurrency represented by two existing OS-hosted applications on today’s uniprocessor, the OS shares the processor resources between the two applications by swapping execution between them, typically between 10 and 100 times a second. Each swap is known as a context switch. Each application also carries execution state, held in the processor’s registers and memories, that must be swapped along with it. In an MT system that switches the hardware thread whenever execution stalls on the latency of system memory, a context switch can occur hundreds of thousands of times a second.
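To put the switching rates above in perspective, the following sketch estimates the fraction of each second spent purely on swapping execution state. Only the switch rates are taken from the text; the per-switch costs and the function name are illustrative assumptions, not measurements of any real processor.

```python
def switch_overhead(switches_per_second, cost_per_switch_us):
    """Fraction of each second spent switching rather than doing useful work."""
    return switches_per_second * cost_per_switch_us * 1e-6

# OS time slicing: 100 switches/s (the upper figure in the text), with an
# assumed 10 microseconds to save and restore state in software.
os_overhead = switch_overhead(100, 10.0)

# Hardware thread switching on memory stalls: an assumed 200,000 switches/s
# ("hundreds of thousands"), evaluated at two assumed per-switch costs.
mt_software_cost = switch_overhead(200_000, 10.0)   # state swapped in software
mt_hardware_cost = switch_overhead(200_000, 0.01)   # state duplicated in hardware

print(f"OS time slicing:               {os_overhead:.2%}")
print(f"MT rate, software-style swap:  {mt_software_cost:.2%}")
print(f"MT rate, duplicated registers: {mt_hardware_cost:.2%}")
```

Under these assumptions, a software-style swap at MT switching rates would need more than the whole second, which is why the switched state has to be held in duplicated hardware rather than saved and reloaded.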
This vastly increased level of switching requires a careful design negotiation between the OS and the MT hardware to ensure there is enough duplicated hardware that saving and reloading the execution state does not become the dominant cost of the processor. When an MT processor uses a cache, the cache is seldom duplicated for each hardware thread because of its high system cost. For software writers, this means they need to be very aware of the impact that the higher rate of context switching has on the usefulness of the cache to the application. In this simple example of two independent applications, it has often been shown that an MT machine will execute the two applications more slowly than if they were simply time sliced by the OS on a uniprocessor. To benefit from MT, the software writer must take great care in expressing software threads so that the execution state held in the cache is carefully shared between the threads.
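The cache effect described above can be illustrated with a toy model. The sketch below simulates a small, fully associative LRU cache shared by two independent applications and compares coarse OS-style time slicing with fine-grained, access-by-access switching. The cache size, working set size, slice lengths and LRU policy are all assumptions chosen to make the effect visible; real caches and real workloads behave less starkly.

```python
from collections import OrderedDict

def count_misses(accesses, capacity):
    """Count misses for a fully associative LRU cache holding `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)    # evict the least recently used line
    return misses

def time_sliced(a, b, slice_len):
    """Alternate between two access streams in slices of `slice_len` accesses."""
    out = []
    for start in range(0, len(a), slice_len):
        out.extend(a[start:start + slice_len])
        out.extend(b[start:start + slice_len])
    return out

CAPACITY = 64       # assumed shared cache size, in lines
WORKING_SET = 48    # assumed lines touched repeatedly by each application
PASSES = 100        # times each application loops over its working set

# The "A"/"B" tags keep the two working sets disjoint (independent applications).
app_a = [("A", i) for _ in range(PASSES) for i in range(WORKING_SET)]
app_b = [("B", i) for _ in range(PASSES) for i in range(WORKING_SET)]

# Coarse slicing: switch only after ten full passes over the working set.
coarse_misses = count_misses(time_sliced(app_a, app_b, 10 * WORKING_SET), CAPACITY)

# Fine-grained switching: the two applications alternate on every access.
fine_misses = count_misses(time_sliced(app_a, app_b, 1), CAPACITY)

print("coarse time slicing misses:", coarse_misses)
print("fine-grained switching misses:", fine_misses)
```

In this model each working set fits in the cache on its own, but the two together do not. With coarse slices an application pays to refill the cache once per slice and then runs from cache; with fine-grained switching the combined working set thrashes the LRU cache and far more accesses miss.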
In addition to the MT software complexity that requires the programmer to manage the impact of their threads on the shared processor resources, there are a number of other hardware design implications to consider. Adding hardware threading increases the complexity of the processor and, unless the processor’s microarchitecture is also fundamentally changed, lowers the peak MHz that the design can achieve. This added complexity also increases the overall power consumption, even when executing a single thread. These MT complexities have been observed to reduce overall application performance even when only a single application or thread is running.
When all these costs of MT are taken into account versus the limited performance uplift, it becomes clear why there is a growing momentum in the industry to introduce dual-core and many-core MP solutions.
MP is a technology which uses the design principle of duplicating the majority of a processor’s design so as to realize the maximum performance available from the execution of multitasking software. A further goal of MP is to accomplish this without introducing any of the software complexity cost that MT incurs in managing the shared processor resources. In fact, if you again take the simple case of two independent applications, now being executed concurrently across two independent processors, the overall performance can exceed that of a single uniprocessor running at twice the clock speed. All OS context switching and all cache interference between the two applications have been removed, and each application can continue independently at full speed.
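A simple timing model makes this comparison concrete. The sketch below is a back-of-the-envelope estimate, not a benchmark: it assumes two identical applications, memory stall time that does not shrink as the core clock rises, and a fixed cost per OS context switch. Every number and function name is hypothetical.

```python
def uniprocessor_time(cycles, stall_s, freq_hz, switches, switch_cost_s):
    """Two identical applications time sliced on a single processor."""
    per_app = cycles / freq_hz + stall_s     # stalls do not scale with the clock
    return 2 * per_app + switches * switch_cost_s

def multiprocessor_time(cycles, stall_s, freq_hz):
    """Two identical applications, one per processor, running concurrently."""
    return cycles / freq_hz + stall_s

CYCLES = 400e6     # assumed compute cycles per application
STALL_S = 0.2      # assumed time each application waits on memory, in seconds

# One processor at 800 MHz, with an assumed 140 context switches of 50 us each
# (the cost is taken to include refilling the cache after each switch).
uni = uniprocessor_time(CYCLES, STALL_S, 800e6, switches=140, switch_cost_s=50e-6)

# Two processors at 400 MHz, one application each, no switching between them.
mp = multiprocessor_time(CYCLES, STALL_S, 400e6)

print(f"uniprocessor at 800 MHz:   {uni:.3f} s")
print(f"two processors at 400 MHz: {mp:.3f} s")
```

Under these assumptions the pair of slower processors finishes first, because doubling the clock halves only the compute portion of each application while the memory stalls and switch costs remain.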
The obvious assumption made of MP is that such a design will cost twice as much silicon area, and that MT is therefore a much more cost-effective way to gain the advantages of multiprocessing. However, the overall target performance needs to be considered in any such comparison. Taking silicon implementation techniques and cache working set requirements into account, it is actually reasonable to provide an MP processor offering the same performance point as an MT processor in the same silicon area, but with the added advantage that none of the additional MT software issues are apparent.
A further consideration of multiprocessing is the power consumed by the design. Because MT is fundamentally a more complex uniprocessor, it is limited to the techniques of a uniprocessor for all power management demands, such as clock gating, standby modes, and voltage and frequency scaling. In an MP design, however, each processor can use these uniprocessor techniques, and entire processors can also be turned off to save all of their consumed power while the remaining processors continue to execute less demanding application workloads. This allows MP to provide the maximum performance exposed by the software, with power consumption directly related to the work accomplished.
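The power argument can be sketched with the standard first-order estimate of CMOS dynamic power, P = activity × C × V² × f. The capacitance, voltages and frequencies below are hypothetical, and the model ignores leakage, assumes a powered-off processor draws nothing, and assumes the workload splits perfectly across two processors.

```python
def dynamic_power(capacitance_f, voltage_v, freq_hz, activity=1.0):
    """First-order CMOS dynamic power estimate: activity * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

CAPACITANCE = 1.0e-9   # assumed effective switched capacitance per processor (F)

# One processor pushed to 800 MHz, which is assumed to need 1.2 V.
one_fast_core = dynamic_power(CAPACITANCE, 1.2, 800e6)

# Two processors at 400 MHz, assumed to run at a lower 1.0 V.
two_slow_cores = 2 * dynamic_power(CAPACITANCE, 1.0, 400e6)

# Light workload on the MP design: one processor is switched off entirely.
one_slow_core = dynamic_power(CAPACITANCE, 1.0, 400e6)

print(f"one processor, 800 MHz at 1.2 V:  {one_fast_core:.3f} W")
print(f"two processors, 400 MHz at 1.0 V: {two_slow_cores:.3f} W")
print(f"one processor on, one off:        {one_slow_core:.3f} W")
```

Because power grows with the square of the voltage, two slower processors at a lower voltage can draw less than one fast processor in this model, and switching one of them off halves the figure again for light workloads.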
MT is fundamentally a technology for recovering the performance wasted when the gap between the processor’s frequency and memory speed grows disproportionately. In these cases MT is clearly a ‘Band-Aid’ measure, a quick fix that attempts to hide these growing inefficiencies. However, such a measure has limited longevity and applicability: as desktop processor designs have already demonstrated, growing MHz runs into a hard limit set by power consumption, with or without a supporting MT technology. The challenge faced by today’s designers is to understand whether MT offers enough performance uplift over a faster uniprocessor to justify its additional software costs, compared with simply using a full multiprocessor.
If software concurrency is available, a far more effective solution could well be to move directly from a uniprocessor and deploy MP, as it is a scalable architecture. In MP designs that have taken into consideration the impact of inter-processor communication and the locality of shared data between the distributed MP caches, there is little additional software cost associated with executing multi-threaded software on a full multiprocessor rather than on a uniprocessor. As such, MP can use the advantages of highly efficient uniprocessor design and better power efficiency to achieve design points that exceed the performance and power consumption points at which MT could otherwise also have been beneficial.
Scaling Design and Performance
MP essentially uses a ‘divide and conquer’ approach based on modular design principles, where a single (multi)processor is created by bringing together multiple processing units, each capable of running a separate concurrent thread. This makes the overall design less complex and less risky than that of MT, as it is essentially a ‘plug-and-play’ solution, enabling systems designers to simply plug in additional processors as and when they are needed. This design simplicity allows MP to be far more scalable than an MT solution. The design costs associated with increasing the clock speed of an MT-capable processor can often also limit its scalability, especially when considering the dominant cost associated with any level of cache miss.
An additional option is to deploy both MP and MT in a single design. However, it’s already been demonstrated that the associated software complexity was greatly underestimated by the existing multiprocessor community of OS and software writers. In such designs, there is a fundamental conflict between MT’s requirement for software to carefully manage the access and sharing of the processor’s shared resources and MP’s maxim of highest efficiency when running independent application tasks.
An embedded system designer tasked with evaluating the implementation of a high performance processor should give consideration to all the relative costs, not only in the hardware architecture but also in the complexity associated with achieving the required software functionality. Multiprocessor designs such as the ARM11™ MPCore™ further reduce these costs by delivering a multiprocessor as a single, configurable macro block supported by standard operating systems that are able to fully utilize the MP architecture without complex proprietary considerations. For example, full SMP support for the ARM11 MPCore is currently available on kernel.org as a standard part of Linux.
The key applicability of an MT-capable processor has been demonstrated in data throughput applications, where the indeterminism that a typical general purpose application would otherwise introduce can be tightly controlled so as to keep all shared-resource side effects to a minimum. These applications will often partition a shared resource, for example the common L1 cache memory, and create a software pipeline in which each MT thread processes segments of data in the cache. The policy by which the MT processor triggers a context switch is also very application dependent, and at best will require a modified OS scheduler when running different applications or, at worst, a hardware redesign.
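The shape of such a software pipeline can be sketched as follows. This is only a structural illustration: Python threads cannot be pinned to hardware threads and cannot partition a cache, so the cache size, thread count and stage functions here are hypothetical stand-ins. What the sketch does show is data being cut into segments sized to one thread's share of an assumed L1 cache and handed from stage to stage.

```python
import queue
import threading

L1_BYTES = 16 * 1024               # assumed size of the shared L1 data cache
HW_THREADS = 2                     # assumed number of hardware threads
SEGMENT = L1_BYTES // HW_THREADS   # segment sized to one thread's share of the cache

def stage(func, inbox, outbox):
    """Apply func to each segment from inbox and pass it on; None ends the stream."""
    while True:
        segment = inbox.get()
        if segment is None:
            outbox.put(None)
            return
        outbox.put(func(segment))

def run_pipeline(data, funcs):
    """Push data through one thread per function, one segment at a time."""
    queues = [queue.Queue(maxsize=1) for _ in range(len(funcs) + 1)]
    workers = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]

    def feed():
        for start in range(0, len(data), SEGMENT):
            queues[0].put(data[start:start + SEGMENT])
        queues[0].put(None)                  # end-of-stream marker

    workers.append(threading.Thread(target=feed))
    for w in workers:
        w.start()

    results = []
    while True:
        segment = queues[-1].get()
        if segment is None:
            break
        results.append(segment)
    for w in workers:
        w.join()
    return b"".join(results)

def flip_low_nibble(segment):
    return bytes(b ^ 0x0F for b in segment)

def flip_high_nibble(segment):
    return bytes(b ^ 0xF0 for b in segment)

data = bytes(range(256)) * 200
output = run_pipeline(data, [flip_low_nibble, flip_high_nibble])
```

Each stage only ever holds one segment, so the data in flight stays bounded by the assumed cache share per thread, which is the property a real MT pipeline would rely on to limit interference in the shared cache.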
As always, it is the customer’s demands that dictate how systems are to be designed, and as their performance demands are ever increasing, scalability is an issue designers cannot ignore. Systems designers must also carefully consider their future performance expectations when developing for any new architecture.
Any solution that simply ‘Band-Aids’ around the demonstrated limitations and flawed approach of focusing on an increasing frequency can offer nothing more than a point solution for the customer when considering their roadmap for general purpose processing and their need for scalable performance.
Given that many software applications can be designed specifically with each solution in mind, it is unwise to claim generally that one solution is better than the other. However, as MP is a far more scalable solution that at its heart simply offers a traditional uniprocessor to the software developer, designers can benefit today from a degree of flexibility when choosing their development strategy, comfortable in the knowledge that their future architecture will not need to change for some time.