Tapio Ristimäki, Claudio Brunelli and Jari Nurmi, Tampere University of Technology Tampere Finland
In this paper the configware flow from parallel algorithm description to the binaries for coarse grain reconfigurable RAA (Reprogrammable Algorithm Accelerator) IP is given. The presented configware flow includes a graphical place tool, an automatic place & route tool and an assembler. By using the graphical place tool an application developer can map parallel parts of an accelerator configware to the nodes of RAA by hands. With the automatic place & route tool the place can be done automatically. In addition, the route tool takes advantage of the nodes as a route through blocks such that nodes with a distance greater than 1 can communicate with each other. At the backend the assembler is used to translate assembler language to the binaries and to minimize the size of binaries by using the group addressing mechanism of RAA architecture. The efficiency of each given tool is shown for the implementation of the RSAalgorithm and the GPScorrelation.
The typical starting point of a hardware algorithm accelerator development project is an algorithm description. More specifically, we likely have, the Cmodel of the algorithm planned to be implemented on hardware. Especially, if a coarse grain reconfigurable IP architecture is decided to be used as a platform to accelerator, from a Cmodel we should produce the binary files needed on configuration. In this flow we have two well known major problems: (1) how to parallelize the sequential code and (2) how to place dozens of processes from the phase 1 to the elements of reconfigurable architecture.
In this paper it is introduced the configware flow of MIMD based on coarse grain reconfigurable IP block RAA (Reprogrammable Algorithm Accelerator) . The given flow is an engineering approach to sidestep the problems to achieve functional solutions. To say, because the combination of a HLL (HighLevelLanguage) compiler and a very simple architecture was predicted to be unattainable, the intuitive architecture and backend programming tools were set as a target, such that an ordinary software engineer could create new accelerator configwares to the IP. So while problem (1) is solved with a programming friendly architecture in RAA project, as a solution to problem (2) and to the backend of configware flow a comprehensive chain of tools is introduced. The flow is depicted in figure 1 which also gives an organization of this paper.
Figure 1. RAA configware flow. Corresponding sections in this paper are given in parenthesis.
1.1. Previous work
For MIMD architecture point of view several academic HLL development projects have been done . However, the stress in this paper is on programming coarse grain IP blocks with few recent research papers . The stress on those is in HLL and in mapping, especially in VLIW kind of modulo schedulers [9,10].
On commercial side maybe the most advanced coarse grain HLL product is made by Silicon Hive . In the Silicon Hive project (2.6) the programming flow is broken down into three stages. In the first stage the “partitioning compiler” analyzes Ccode and suggests which parts should be coded on the reconfigurable structure and which ones to the host processor. The second step on the Silicon Hive tool flow is the C compiler HIVECC. At the backend of the flow is the Silicon Hive Array Programming Environment which places the compiled program into the physical cells.
PSDSXPP2 is a suite of software tools for programming and simulation of the massively parallel Pact XPP . The flow consists of three stages. As in Silicon Hive above, the code is first divided into sequential and parallel parts. The second step is to compile the parallel parts with Vectorizing CCompiler. The compiler tries to roll out the nested loops and instantiates an optimized code block handcoded with Native Mapping Language (NML). The third stage of the flow is verification. The verification of code is done with visual simulation tools.
1.2. Brief introduction to RAA architecture
Basic RAA structure is a very traditional MIMD architecture, i.e. RAA is an array of 16bit processors. Each processor has its own instruction and data memories and executes its own program. The combination of one processor and its memories is called a node in RAA.
Despite of RAA’s traditional, software engineer intuitive, basic architecture, it is in practise equipped with several optimisation extensions to ease and to boost application development. Extensions are listed in table 1. Given special features are full functional alone, or can be used together in any combination.
Table 1. a short list of RAA features
|Group addressing ||Write operation can dynamically be done parallel to all nodes in single row, to all nodes within 2*2 box or to all nodes in array |
|FIFO links ||Communications between neighbour nodes is done via FIFOs which breaks timings between nodes, and thus eases the application development |
|Direct addressing ||The node can be forced to execute given instruction immediately, making it easy to use external control of nodes. |
By combining to group addressing the RAA can be used as a SIMD.
|Dimension virtualization ||The actual amount of nodes can be hardware virtualized such that binary of accelerator targeted to bigger arrays can be executed on smaller array.  |
|Yield improve ||With very simple hardware structure the inherent redundancy of RAA can be used to repair defected chip to smaller full functional one. |
2. GUI placement tool
In many cases the engineer who has manually partitioned a sequential program to parallel processes has quite a clear picture of how the processes are connected together and moreover how they should be mapped . Especially in RAA where virtually only communication between neighbours is possible, and where only tens of instructions per node are allowed, the architecture has to be kept closely in mind while the configware is designed.
To make it easier to map the parallel processes in the case where placement is evident to the designer, the GUI mapping tool was developed. In the tool the designer sees the nodes of an array as boxes, and selects from the list the assembler files to place in the nodes. The layout of the RAA GUI placer is given in figure 2. On the left hand side in the figure the tool is presented before the placement is done. The nodes are thus attached with “empty” texts. On the right hand side the placement is done, and the nodes are attached with the names of the assembler files placed on them.
Figure 2. The GUI placer before and after the RSA algorithm placement.
3. Automatic placement tool
If the design has nodes communicating with other nodes over distances greater than 1, or if the mapping is a result of automatic tool or another tool, placement by hand may well be in practise an impossible task. For those kinds of situations the automated placement tool was developed.
The use of 4NN FIFOs for communication between the nodes in RAA seems to simplify the place & route problem compared to more flexible communication networks. With 4NN FIFOs the only requirement for placement is that the nodes that communicate together have to be neighbors and that there is not any route phase. However, in RAA architecture it is easy to use a part of nodes as a route through blocks if communication with nonneighbors has to be achieved. In a heavily engineered flow it is natural that the routes through the blocks are defined explicitly by the developer. In contrast, in flows based on automatic tools the determination of the route through the blocks has to be made by the tools.
3.1 Iterative LinKernighan to solve TSP
The most used way to make automatic placement is iteration i.e. –opt algorithms. The initial placement is iterated be swapping blocks until the result is good enough. To avoid local minimums one can swap 3,4… blocks simultaneously. However the increase of the amount of the simultaneously swaps (n), increases the execution time exponentially.
The Lin and Kernighan algorithm  removed the drawback of preselecting a compromise between the optimality and execution time with optalgorithms by introducing a powerful variable opt algorithm. The algorithm changes the value of n during its execution, deciding at each iteration what the value of n should be. At each iteration step the algorithm examines, for ascending values of n, whether an interchange of n pieces may result a better result. This continues until some stopping conditions are satisfied. The original LinKernighan algorithm was targeted to solve the NPhard traveler salesman shortest path problem. However, the same approach is followed here and applied to a placement problem.
The RAA placement algorithm is a combination of a heuristic greedy algorithm and a modified iterative LinKernighan algorithm. The greedy algorithm is used to establish the initial placement.
In greedy initialization the processes are divided to the available nodes recursively such that the unplaced process with the most connections to the other processes is placed as close to the middle of an array as possible. After that the process having a connection to the process placed in the previous phase and having the most connections to the other processes is placed beside the previous process. The operation is continued in bredfirst manner until all processes having direct or indirect communication with the original process has an initial placement. If there are still unplaced processes after the previous cycle, another one is performed until every node is placed.
After the initial placement the modified LinKernighan iterative optimization phase is started. The basic idea of the iterations is to try to maximize the score by doing swaps of nodes in the current placement. The manhattan shortest path between nodes equipped with processes communicating with each other increase the cost of the current placement if it differs from the maximum allowed distance attached to the connection. Special attention is paid, if there is communicating process with allowed distance more than 1. In these cases placement tool takes care that there is enough empty nodes at the path between the nodes to guarantee the routability.
The outline of the modified LinKernighan placement algorithm used is such that
In the given algorithm the subfunction Greedy returns the initial placement and subfunction Score the score of the given placement. The function r-opt_ move makes the best possible r-opt move. The main loop logic is such that 2-opt moves are done until a local optimum has been reached and an uphill move is needed. Next the 3opt moves are tried. The value of r is increased until the score increases. When an improvement is found with ropt the swaps with 2opt moves are continued. The optimizations are executed until the score equals zero, or the given optimization algorithm cannot decrease the score further.
Because r-opt is an optimal exhaustive search, if r equals the amount of nodes in placement the algorithm is optimal. However, execution time of the ropt move increases exponentially as a function of r. Thus, the algorithm used stops executing if r gets bigger than 5, and starts the algorithm with randomized initial placement.
In the path finding process, the graph is established such that each node in RAA is depicted as an edge and every FIFO link as a directed unit cost vertex between corresponding nodes (i.e. edges). It is easy to see that the shortest path problem to this given graph corresponds to the problem of shortest manhattan path in RAA architecture. Of note, the shortest path between two processes is the shortest path between the two nodes they are placed on.
4. Automate Route Tool
The processes in neighboring nodes do not need any route phase, because those communicate between fixed FIFO links. However, the processes separated by a distance of two or more need a route through blocks to establish the communication link. Premise is that placement is such that the needed empty nodes are in the path. In such an initial state the route phase is simple:
- Find the manhattan path from source to the destination
- Create needed configware for each node.
Phase one is done using Dijkstra’s  algorithm. For each vertex, the algorithm keeps track of which vertex it can be linked to in order to generate the shortest path. From these parent links, the path from the source to the destination can be enumerated by starting from the destination and by following the links to the source. In stage two there is the script which creates needed trivial routing configware in an infinite loop.
5. RAA Assembler
The assembler in RAA has two main functions. The first one is to translate assembler instructions into binary format. From the developer’s point of view it is far more efficient to make programs when one can, e.g., write ‘add fifo<= acc1, fifo2’, instead of using binary opcode format like ‘1010001000010’. However, the translation between assembler words and binary fields is trivial.
The second, and not trivial, purpose of the assembler is to automatically make full use of possibility to parallel write same data to nodes in same row or block or array in RAA. The purpose is to minimize the total code size needed to configure the RAA with spatially oriented codes. The minimal code size not only diminishes the amount of memory needed on the external controller, but also minimizes configuration time. However, manually finding an optimum way to use the addressing mechanism with certain accelerator configware is, for a bigger application, impossible in practice.
5.1 Heuristic to select groups addressing mechanism
To achieve good enough results fast, a heuristic algorithm was constructed. The heuristic is based on a greedy strategy. The new memory operation is selected such that a maximum amount of memory slots is set to the legal state. The operation is selected as many times as needed to set every memory slot to the legal state. Legal state means that the program loaded to the slot is according to the placement solution. According to the greedy nature and regularities above, the algorithm is as follows:
- (1) Select the program that is used most and make global memory write with it.
- (2) If over half of the memory slots in a row have the same program such that the program is not the one used in stage 1, make a row access with that program.
In the previous stage we wanted to leave room for block addressing, such that only rows with the same program written to over half of its nodes were included, instead of selecting simply the most common.
- (3) If two or more memory slots in the block have the same program such that the program is not the one used in stages 1 or 2, make a block access with that program.
- (4) Program all memory slots in illegal state as single access.
It is easy to see that the given algorithm ends with every memory slot in a legal state because of rule 4. On the other hand the algorithm is extremely fast. With proper data structure the running time is only O(n), in where n denotes number of nodes.
The efficiency of the group addressing mechanism and the automatic addressing optimisation algorithm were tested with RSA encryption  and GPS correlation  algorithm configwares.
6.1 GUI and assembler results
The efficiency of the group addressing mechanism and the automatic addressing optimisation algorithm were tested with RSA encryption and GPS correlation algorithm configwares. Both implementations were coded with a text editor and connected together in the RAA GUI environment. The parameterised accelerators were configured such that 512bit was used in RSA and 128bit in GPS correlation. Both accelerators need 8´4 nodes array. The code size of each node is from 25 to 31 words yielding unoptimized code sizes of 992 and 576 to RSA and GPS in respectively. The sizes of configuration data after the optimisation were 97 and 66 to RSA and GPS, in respectively. The results are even better to bigger accelerators. The unoptimized code of 1024bit RSA in an 8´8 array of RAA takes 1728 words vs. to 124 words optimised code.
Table 2. The code sizes of RSA and GPS implementations after code optimization.
6.2 Place & route results
The automatic placement tool was implemented with an interpretative language. Unfortunately the implementation language was wrong for this purpose. Although the implementation was optimized such that in each Dijkstra’s algorithm execution only one memory operation is made to the lookuptable, the execution time of a single iteration step was in order of tens of milliseconds. Thus, the computations for bigger placement problems were impossible to finalize.
However, the algorithm itself given for placement seems to be fast. The inexpensive 2opt moves are used extensively, and ropts are needed only for climbing uphill. In the GPS correlation algorithm implementation the 4´4 process basic group was implemented. The amount of different combinations needed to be tested in an exhaustive search to place that configware to the equally sized RAA array is 16!, making a total of 2.1´1013 different combinations to examine. The placement with the RAA placement tool gave an optimal result after examining 67594 combinations.
With another test case it is demonstrated how the algorithm solves the problem with processes communicating over a maximum distance greater than one. The connectivity graph of a test case is given in figure 3. The numbers in parentheses are maximum distances greater than one. The result for the problem was gotten after examining 1029 combinations
Figure 3. Connection graph of test case with maximum distances greater than one.
The automatic design tools for configware development to the RAA architecture were presented. The chapter includes the flow from process description to the RAA binaries. In the first step of the flow two alternatives were considered: manual and automatic placement. The GUIbased program was implemented for handmade placement. For automatic placement a modified LinKernighan algorithm was implemented. A heuristic was designed to provide an initial placement for the LinKernighan algorithm.
Later it was shown how the RAA group addressing mechanism together with the automatic code size optimizer tool reduce the bottleneck in the reconfiguring phase. It was shown that a remarkable benefit from group addressing can be achieved even with a nonoptimal heuristic optimisation algorithm. With realistic application code size, reconfiguration time cuts with given the algorithm were in the order of 7090 percent of nonoptimized code sizes.
In the future, the placement tool should be implemented as optimized Ccode to get more information about the presented modified LinKernighan placement algorithm. Another task would be to combine the code size optimizer and placement tools so that in the placement phase the possibility to optimize the size of the code is also taken into account.
 T. Ristimäki, J. Nurmi, “Reprogrammable Algorithm Accelerator IP Block”, Proc. of IFIP VLSISOC, 2003, pp. 228232.
 M. Annaratone, C. Pommerell, R Ruhl, “ Interprocessor Communication Speed and Performance in Distributedmemory Parallel Processors”, in Proc. Symposium on Computer Architecture, 1989, pp.315324.
 K.E. Batcher, “MPPA Massively Parallel Processor”, in Proc. International Conference of Parallel Processing, 1979, pp. 249.
 T. Blank, “The MasPar MP1 Architecture”, in Proc. IEEE Computer Society International Conference, 1990, pp 2024.
 G. Demos, “Issues in Applying Massively Parallel Computing Power”, The international Journal of Supercomputer Applications, no.4, 1990, pp. 90105.
 E. Waingold, M.Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe and A. Agarwal, “Baring it all to software: Raw machines”, Journal IEEE Computer, Volume: 30 Issue: 9 , pp. 8693.
 G. Dimitroulakos, M.D. Galanis, C.E. Goutis, “Exploring the design space of an optimized compiler approach for meshlike coarsegrained reconfigurable architectures”, in Proc. Parallel and Distributed Processing Symposium, 2006, pp. 2529
 Choi J,L, Dutt N. “Compilation Approach for CoarseGrained Reconfigurable Architectures”, ApplicationSpecific Microprocessors, 2003.
 M. Ahn J. Yoon Y. Paek, Y. Kim, M. Kiemb, K. Choi, ”A spatial mapping algorithm for heterogeneous coarsegrained reconfigurable architectures”, in Proc DATE, 2006, pp. 363 – 368.
 Bingfeng M. Vernalde, S. Verkest, D. De Man, H. Lauwereins, R.:” DRESC: a retargetable compiler for coarsegrained reconfigurable architectures” in Proc FieldProgrammable Technology, 2002, pp 166173.
 “Silicon Hive”, http://www.siliconhive.com.
 J.M.P. Cardoso, M. Weinhardt, “Fast and guaranteed C compilation onto the PACTXPP reconfigurable computing platform”, in Proc. FieldProgrammable Custom Computing Machines, 2002, pp. 291 – 292
 T.Ristimäki, J. Nurmi, “Virtualizing the Dimensions of a CoarseGrained Reconfigurable Array”, Proc. of Field Programmable Logic and its applications, 2004, pp.11301132.
 R. Hartenstain, M. Herz, T. Hoffman, U. Nageldinger, ” KressArray Xplorer: a new CAD environment to optimize reconfigurable datapath array architectures”, in Proc. Design Automation Conference, 2000, pp. 163168.
 S. Lin, B. W. Kernighan, “An Effective Heuristic Algorithm for the TravelingSalesman Problem”, Oper. Res. 21, 1973, pp. 498516.
 E. W. Dijkstra, “A note on two problems in connexion with graphs”, in Numerische Mathematik. 1, 1595, pp. 269271.
 R. L. Rivest, A Shamir, L Adleman, “A Method of Obtaining Digital Signatures and PublicKey Cryptosystems”, Communications of the ACM, Vol. 21, No. 2, 1978, pp. 120126.
 E. D. Kaplan, “Understanding GPS principles and Applications”, Artech House, Boston, 1996