Jaya Suseela, Wipro TechnologiesJeganath Gandhi R, Wipro TechnologiesAbstract: This paper describes the architectural design of RISC based asynchronous microprocessor as an alternative to clocked design. It explores the design of a low power asynchronous CORDIC coprocessor combined with an asynchronous RISC processor to achieve significant power reduction for applications involving computationally intensive algorithms amenable to CORDIC processing. The design presented here provides the processor with a unified, dual-ported view of its memory subsystem using multiple interleaved blocks. Each block has separate instruction and data line-buffers effectively acting as level-zero (L0) cache, making the cache access time highly variable. The cache employs a copy-back write strategy to support a high-performance embedded processor core. The Reorder Buffer is chosen for exception handling and dependency resolution. Most instructions are conditionally executed and can optionally update the condition code flags according to their result. This architecture uses MARBLE, the Manchester AsynchRonous Bus for Low Energy, used in the AMULET3i microprocessor, a two channel micropipeline bus with centralized arbitration and address decoding to provide the interconnection of asynchronous macrocells. The Asynchronous co-processor CORIDC meets the demands of digital communications, which involves the solution of systems of linear equations, QR factorization or the computation of eigen values, eigenvectors or singular values. CORDIC computes a wide range of functions including certain trigonometric, hyperbolic, linear and logarithmic functions. All these tasks can be efficiently implemented using processing elements performing vector rotations. The COordinate Rotation DIgital Computer (CORDIC) algorithm offers an opportunity to calculate the desired functions in a simple and elegant way.
INTRODUCTION: Many possible advantages have been claimed for asynchronous design, including:
Clock skew avoidance: The difference in time between the clocks reaching various sections of a design is referred to as the clock skew. Since the data must be available to all blocks by the time the clock edge reaches the destination, the clock skew must be subtracted from the total cycle time to give the amount of cycle time available for processing. As designs get more complex the clock must be distributed to increasing numbers of gates in the design; also, as designers simplify pipeline stages so that clock speeds can be increased the fraction of the clock cycle taken up by clock skew increases and thus the impact of clock skew on performance increases. These requirements lead to increasingly large, complex and power hungry clock drivers.
Better than worst case execution time: In the synchronous world the pipeline must be clocked slowly enough for the worst case time of the slowest stage. An asynchronous design, however, can take advantage of variations in the completion times of the stages caused by varying data.
Power considerations: In a conventional clocked system all sections of the design are clocked simultaneously, including those which are not in use on all cycles. This causes many nodes to toggle unnecessarily. Clock gating can be used to reduce dynamic power consumption but at the expense of increased complexity and clock skew. In DSM chips it is noted that, the static consumption of circuit constitutes to almost 80% of total power consumption. In an asynchronous design transitions only occur when a block is being used, thus reducing the power consumption.
Security: In a wide range of computer controlled systems the issue of security is often of vital importance. Security of a particular system could be compromised, simply by manipulating the clock signal of the controller of the system. By inserting spurious signals into the clock line of the controller, secure data could be extracted from the system. This manipulation of the clock signal also enables the hacker to skip over the execution of instructions vital to the security of the system.
System developers invest a vast amount of effort in trying to create a secure system while hackers invest a great deal in trying to breach this security. Several different approaches could be taken to tackle this problem. After some consideration it was decided that eliminating the clock from the system could be a very successful solution to the problem.
The above mentioned advantages are also applicable to Digital Signal processing. The evolving market needs for modern DSP applications have resulted in the need for complex solutions. This has put stringent performance parameter requirements on the System Design.
Different applications incorporate different DSP algorithms in their design to meet the required functionality. These DSP algorithms invariably involve numerous arithmetic operations. The computation of trigonometric functions, coordinate transformations or rotations of complex valued phasors is almost naturally involved with modern DSP algorithms. All these tasks can be efficiently implemented using processing elements performing vector rotations. The Coordinate Rotation Digital Computer algorithm (CORDIC) offers the opportunity to calculate the desired functions in a rather simple and elegant way.
CORDIC ALGORITHM : The COordinate Rotation Digital Computer (CORDIC) algorithm was first introduced by Volder [5] in 1959 for the computation of trigonometric functions, multiplication, division and data type conversion, and later generalized by Walther [6] in 1971. It offers the opportunity to calculate all these complex arithmetic operations mentioned above in a simple and elegant way. Implementation of the algorithm can be considered as an iterative sequence of additions/subtractions and shift operations. Due to the simplicity of the involved operations, CORDIC algorithm is very well suitable for VLSI implementations and is used in various applications such as digital communication technology, adaptive signal processing, computer graphics and robot control. However, the CORDIC iteration is not a perfect rotation which would involve multiplications with sine and cosine. The rotated vector is also scaled making a scale factor correction necessary.
ASYNCHRONOUS PROCESSOR: The Asynchronous Processor architecture explained here is based on ARM Architecture, Compatible with v6 instruction set.
The Processor contains a number of asynchronous modules. The pipeline depth is elastic essentially consists of pre-decode, decode/register read, execution stages, an asynchronous reorder buffer and register write back. The asynchronous RAM is local to the processor. The architecture uses a technique called Last Result Reuse to avoid the penalty of RAW dependency [4]. Branch Target Buffer (BTB) reduces the number of instructions pre-fetched unnecessarily. The fetch is therefore suppressed, thus delivering the instruction faster than that can be done by the memory system and at a lower energy cost.
Fig 1 Core Organisation The cache employs line fetch mechanism, interleaved, fully associative sub-blocking and with separate instruction and data line-buffers, a single line of each being sufficient. The Cache supports copy-back write strategy with a write-allocate policy and a write buffer with a victim cache distributed between the blocks. To reduce the stall during write-allocation, and for good performance on a LFL write hit, the LFL in the architecture allows individual words to be written by the processor. The contents of each block are combined such that any cache access may pass through 1, 2 or all 3 stages (the line-buffers, cache RAM or LFL and the victim cache) depending on where (if at all) a cache hit occurs.
Fig 2 Memory Architecture The Co-processor Interface is a "Bus watching" system. The co-processor is attached through a bus to the ARM processor as the co-processor receives instructions it oves data through the input buffer to its own internal instruction pipeline. As the coprocessor instruction begins execution there is a "hand-shake" between the ARM and co-processor that they are both ready to execute the instruction.
CORDIC CO-PROCESSORFig 3 Vector Rotation The fig 3.1 shows the vector vi = (xi, yi) in the XY plane, which is to be rotated by an angle to the target location vi+1 = (xi+1, yi+1). The next figure, fig 3.2 shows how this problem of vector rotation can be viewed. The vector is assumed to be stationary and the Co-ordinates are rotated by angle ¦È. This gives the target vector (xi+1, yi+1), in terms of the sine and cosine angle multiplication and summation of the input vector, given by,
x
_{i+1} = x
_{i} cos ¦È ¨C y
_{i} sin ¦È ---- (1)
y
_{i+1} = y
_{i} cos ¦È + x
_{i} sin ¦È ---- (2)
This general vector rotation transform, forms the basic principle in the CORDIC algorithm credited to Volder [5].
Fig 4 Co-ordinate Rotation The eqns (1) and (2) rotates the vector in a Cartesian plane by the angle ¦È. This can be rearranged so that:
x
_{i+1} = cos ¦È . [x
_{i }¨C y
_{i }tan ¦È] ---- (3)
y
_{i+1} = cos ¦È . [y
_{i }+ x
_{i }tan ¦È] ---- (4)
The CORDIC algorithm intends to reduce the multiplication involved in these equations.
In rotation mode, the angle accumulator is initialized with the desired rotation angle. The rotation decision at all iterations is made to diminish the magnitude of the residual angle in the angle accumulator. The decision at all iterations is therefore based on the sign of the residual angle after each step. Naturally, if the input angle is already expressed in the binary arctangent base, the angle accumulator may be
Fig: 5 Rotation Mode of CORDIC Algorithm eliminated. For rotation mode, the CORDIC equations are:
x
_{i+1 } = [x
_{i } ¨C d
_{i }.y
_{i }.2
^{-i}] ---- (5)
y
_{i+1} = [y
_{i }+ d
_{i }.y
_{i}.2
^{-i}] ---- (6)
z
_{i+1} = z
_{i } ¨C d
_{i }.(tan
^{-1}(2
^{-i})) ---- (7)
Where,
d
_{i} = -1 if z
_{i}< 0, +1 otherwise ---- (8)
This provides the following result,
x
_{n} = x
_{0} cos ¦È ¨C y
_{0} sin ¦È ---- (9)
yn = y
_{0} cos ¦È + x
_{0} sin ¦È ---- (10)
zn = 0 ---- (11)
An = ¡Çn¡Ì (1 + 2
^{-2i}) ---- (12)
ITERATION STEPS : A rotation by any (within some convergence range) desired rotation angle A0 can be achieved by de_ning a converging sequence of n single rotations. The CORDIC algorithm is formulated given
1. A shift sequence i, defining an angle sequence
¦È
_{i} = tan
^{-1}(2
^{-i}) ---- (13)
which guarantees convergence. Shift sequences will be discussed in section
2. A control scheme generating a sign sequence
d
_{i} = -1 if z
_{i} < 0, +1 otherwises ---- (14)
This steers the direction of the rotations in each iteration sequence and guarantees convergence.
ASYNCHRONOUS CORDIC CO-PROCESSOR : The asynchronous CORDIC uses latches instead of flip-flops for buffering the data. Since latches are smaller and consume less power, this can potentially reduce silicon area and power dissipation
Fig: 6 Computational Module In this design the gain multiplied is scaled down by further CORDIC iterations, to increase accuracy. This further reduces the multiplication required, thereby improving speed.
Fig: 7 Functional Block DiagramINPUT CORRECTION MODULE : This module takes care of the restriction that the CORDIC algorithm can rotate only degrees. For angles greater than the limit, only the sign of the co-ordinates need to be changed.
COMPUTATIONAL MODULE : The asynchronous architecture includes a latch after each stage to allow the next stage to work with previous data of the current stage.
Fig: 8 Input Correction ModuleGAIN CORRECTION MODULE : As explained already, the CORDIC algorithm induces some gain into the processor. This gain has to be taken care of at the final output stage. This will further decrease the system speed, if a multiplier is used. A new gain correction module is added to the CORDIC algorithm, which also uses the CORDIC iterations to multiply the correction factor.
The design was implemented removing the scale constant from the iterative equations to yield a shift-add algorithm for vector rotation. The product of the Ki¡¯s is applied here in the system (treated as part of a system processing gain). This product approaches 0.6073 as the number of iterations goes to infinity. Therefore, the rotation algorithm has a gain, An, of approximately 1.647. The exact gain depends on the number of iterations, and obeys the relation,
This gain factor is negated by this output stage
Processor Co-Processor HANDSHAKE: The Handshake involves the Busy and the Ready from the CORDIC co-processor.
Fig: 8 Gain Correction Module The main processor cannot interrupt the CORDIC co-processor. It can only request the data processing by sending the data. CORDIC responds with the busy and ready signal. The main processor has to interpret the Busy and Ready signal to use the data present at the output of CORDIC.
Fig: 9 MAIN uP -CORDIC Interface CC Busy 0 and CC Data Ready 0: CORDIC Co-processor is ready for data processing.
CC Busy 0 and CC Data Ready 1: CORDIC Co-Processor has completed processing the last data Input.
CC Busy 1 and CC Data Ready 0: CORDIC Co-Processor is still processing the last data Input.
CC Busy 1 and CC Data Ready 1: Invalid condition.
SUMMARY: The asynchronous implementation of the processor has unique advantages in terms of power management ¨Cespecially in image processing. The architecture is well suited for embedded market, as DSP is prevalent with embedded processor in cell phones, cordless phones, base stations, pagers, modems, Smart phones and PDAs. Other embedded applications that take advantage of such processors are: disc drive controllers, automotive engine control and management systems, digital auto surround sound, TV-top boxes and internet appliances. Other products are still being modified to take advantage of it: toys, watches, etc. The possible applications are almost endless. The Architecture resolves many of the dependency and exception handling issues. The whole sub-group of architecture has been dedicated to function strictly as signal processor.
[1] Bainbridge, W.J., Furber, S.B., ¡°Asynchronous Macrocell Interconnect using MARBLE¡± Proc. Async¡¯98, San Diego, April 1998 pp. 122-132
[2] Garside, J.D., Furber, S.B., Chung, S-H. ¡°AMULET3
Revealed¡± Proc. Async ¡¯99, Barcelona, April 1999 pp. 51-59.
[3] Furber, S.B., Garside, J.D., Riocreux, P., Temple, S., Day, P. Liu, J., Paver, N.C. ¡°AMULET2e: An Asynchronous Embedded Controller¡± Proceedings of the IEEE, volume 87, number 2 (February 1999), pp. 243-256 ISSN 0018-9219
[4] Gilbert, D.A ., Garside, J.D. ¡°A Result Forwarding Mechanism for Asynchronous Pipelined Systems¡±, Proc.
Async¡¯97, Eindhoven, April 1997, pp. 2-11.
[5] J. E. Volder, ¡°The CORDIC trigonometric computing technique," IRE Trans. Electronic Computers, vol. EC-8, no. 3, pp. 330-34, September 1959.
[6] J. S. Walther, ¡°A unified algorithm for elementary functions," in AFIPS Spring Joint Computer Conference, vol. 38, pp. 379-85, 1971.
[7] David Brash, ¡°The ARM Architecture Version 6 ¡° 2002.