By Nick Lookin & Vadim Bersenev, Institute of Engineering Science, Urals Division of Russian Academy of Sciences
Michael Trapeznikoff, Science & Production Association of Automatics
Abstract
Highspeed data processing based upon specialpurpose processors is one of the basic directions in the development of computer systems for realtime systems. Reconfigurable processor arrays containing bitlevel processor elements are the sort of specialpurpose processor architectures. This array may be preferable for realtime computations in hard time and space restrictions. Features of uniform array may enable the maximal concurrency for data processing but require the adaptation of arbitrary algorithms to uniform architecture. It is one of the most difficult problems in modern computer science, its successful solution determine the efficiency of their application in systems.
This paper is devoted to the development of one type processor arrays, called MiniTera2 and to the investigations of realization of some realtime algorithms by means of this array.
I. INTRODUCTION
Highspeed data processing based upon specialpurpose processors is one of the basic directions in the development of computer systems (CS) for realtime systems (RTS). The main goal of development of these processors is increasing of CS performance for complicated algorithms. Implementation of high and super high complicated algorithms requires parallel data processing on operation level, word level, or even bit level.
Example. We need processing of some objects images. Images are dynamically changed in space and time dimensions and have 1024 x 1024 pixels. The frame rate is 100 Hz and each frame period contains both intraframe data processing (noise filtering, staining etc.) and interframe data processing (objects recognition, estimation of parameters of objects motion in inertial frame etc.). If number of objects has an order of thousands then only processor architectures providing parallel and independent processing of each pixel supplies fullsize image processing (intraframe and interframe) during one frame period.
There are three main tendencies of CS performance increase. They are:

Increasing of switching frequency of gates. It requires essential enhancement of design and technology of chips, micro assemblies and printed circuit boards. In addition increasing of frequency results in growing of power consumption density and amplifies the impact of electro physical parameters on circuits operation [1].

Parallel data processing. There are two directions of parallelism development:
 universal parallelism that leads to generalpurpose parallel architectures (multiprocessor or multicomputer systems) having restrictions for performance increasing (connectivity of the algorithms graphs, decision time losses because of processors interconnections and memory access conflicts [2]);
 specialized parallelism oriented to applications. It leads to specialpurpose parallel architectures that provide efficient computations of certain functions, procedures or algorithms blocks. These architectures are more effective then generalpurpose architectures for highcomplexity realtime algorithms.

Hardware implementation of algorithms. It is provided by functionaloriented architectures of processors permitting to take into consideration the peculiarities of algorithms or of numerical methods in block sets and its interfaces. Usually RTS realize fixed set of algorithms, that is why functionaloriented processors architecture permits to obtain the maximum of performance for real systems [3].
II. RECONFIGURABLE PROCESSOR ARRAYS
Reconfigurable processor arrays (RPA) containing bitlevel processor elements (PE) belong to the type of specialpurpose processor architectures [4]. According to the opinion of many scientists and designers RPA are preferable for realtime computations in hard time and space restrictions. In the paper we consider special case, locally connected RPA, representing a sort of uniform computing environments [5]. This RPA may be represented by both array of bitlevel PE (finegrained parallelism) and set of clusters consisting of RPA (coarsegrained parallelism).
High efficiency of RPA applications in RTS is enabled due to:

mass parallelism for data processing (total number of data flows has an order of ten thousands), which provides maximum of performance and relatively low switching frequency (and hence low power consumption);

architecture uniformity (repeatability of PE and regularity of interconnections) which provides maximum of VLSI integration level;

locality of information lines in RPA (interconnections between adjacent PE only) which provides minimal length of intercircuits connections in VLSI chip and therefore minimal level of noise;

large number of identical PE in one chip (up to several thousands PE in one chip) which provides high fault tolerance of computations.
Evolution of RPA counts about 40 years. During this period many architectures, a lot of hardware and software were developed. The main areas of RPA applications in RTS are follows:

optical systems for missiles and aircrafts;

data processing in radar guidance;

data processing in hydro acoustic system of underwater vehicles;

preprocessing in inertial sensors of air and spaceborn navigation systems;

adaptive formation of antenna directional diagrams for radio navigation systems.
This paper is devoted to the development of one type of RPA and to the investigations of realization of some RTS algorithms by means of this RPA.
III. RPA MiniTera2
Architecture of RPA MiniTera2 developed in 2002 [4] was a basis for our study. This RPA represents rectangular array of identical cells connected between each other by means of only local data channels. Each cell contains:

one PE performing bitwise data processing of data flows;

shadow memory module for fast rewriting of the PE's operation codes that is needed for dynamically reconfiguration PE array during data processing;

control unit for the execution of instructions written in control register before data processing.
Figure III1: RPA Minitera2 cell architecture
Architecture of one cell is shown on Figure III1.
The main features of RPA MiniTera2 cell are synchronous data flows processing in "onfly" mode and programmable number of digits for addition/subtraction. Sixteen programmable interfaces permit switching of any four channels concurrently with data flow processing. Instruction set consists of 49 instructions: 22 ALU instructions, 14 instructions for comparison operations, remaining instructions are used for data flow commutation, code conversion such as P S and S P.
All cells are combined in rectangular arrays; each cell is connected with four adjacent cells. Data channels of each cell are intended for concurrent interchanges between (i,j)th cell and (i1,j), (i+1,j), (i,j1), (i,j+1)th cells. Special part of op code of each PE is needed for programming these interchanges. All cells operate synchronously with one bit clock from synchronizing generator. The main features of RPA MiniTera2 cell are synchronous data flows processing in "onfly" mode and programmable number of digits for addition/subtraction. Sixteen programmable interfaces permit switching of any four channels concurrently with data flow processing. Instruction set consists of 49 instructions: 22 ALU instructions, 14 instructions for comparison operations, remaining instructions are used for data flow commutation, code conversion such as P S and S P.
All cells are combined in rectangular arrays; each cell is connected with four adjacent cells. Data channels of each cell are intended for concurrent interchanges between (i,j)th cell and (i1,j), (i+1,j), (i,j1), (i,j+1)th cells. Special part of op code of each PE is needed for programming these interchanges. All cells operate synchronously with one bit clock from synchronizing generator.
Before operation mode one instruction and one mode of interchange must be written to control register of each RPA cells. This enables after initial setup all PE's univocal correspondence between algorithm graph and cell array topology. Structure of RPA MiniTera2 is shown on Figure III2.
Figure III2: RPA Minitera2 cell architecture
One chip of VLSI contains RPA. Its size or dimension depends on technology. In 2002 preproduction model of VLSI MiniTera2 was developed and fabricated. This experimental chip was made on 0.6 mm CMOS technology and contained cell matrix 5x5 PE, frequency of operation f = 30 MHz. Simulation by SPICE revealed for 0.09 µm CMOS technology 10000 cells per one chip and 6W power consumption for f = 100 MHz.
In 2003 model of functionaloriented processor (FOP) based upon MiniTera2 was developed. In this year computer workstation for realtime image processing was realized. Some applicationspecific image processing algorithms supported by workstation operated successfully in real time. FOP consisted of 288 VLSI (7200 PE) that permitted to get performance about 20 GOPS for image processing. It allowed realizing HDTV algorithms with frame rate 100 Hz. Workstation and FOP based upon RPA MiniTera2 are shown in Figure III3.
Figure III3: FOP based upon RPA MiniTera2 and computer workstation
In addition in 2003 software IDE MiniTera2 for programming and debugging application tasks was developed. The main principle of software operation is graphical programming and debugging of user programs on the base of simulation model of each PE and then 2Darray. The basic components of graphical part of IDE are:
 Algorithm Graph Editor;
 PE Array Editor;
 Inspector;
 ASM Tool;
 Librarian;
 Debugger;
 Project Panel;
 Console.
User interface IDE MiniTera2 is based upon standard MDI GUI, thus any project may be created and debugged very fast. The main window of IDE is drawn on FigureIII4.
Figure III4: IDE main wind
The main features of RPA MiniTera2 architecture are
 data flow processing based on the uniform computing environment;
 RPA performance may be increased by simple and unlimited expansion of array dimension (by simple splicing additional chips);
 any topologies of PE network or FOP structure may be realized by preliminary setup every PE.
These features enable the maximal concurrency for data processing but require the adaptation of arbitrary algorithms or systolization [6] to uniform RPA architecture. It is one of the most difficult problems in modern computer science, its successful solution determine the efficiency of RPA application in RTS.
IV. REALIZATION OF RTS ALGORITHMS BY MEANS OF RPA MINITERA2
From all the possible RTS application tasks let us choose three: image processing, arithmetic operations on long numbers and realtime sorting. Implementation of algorithms of these tasks by means of MiniTera2 is discussed below, in more details for image processing and briefly for two other tasks.
Intraframe processing of noise dot images is one of the most widely used tasks for mobile RTS such us videochannels for missiles and aircrafts. Firstly, they belong to the most complicated algorithms requiring maximal computational performance. Secondly, these algorithms have considerable internal parallelism therefore its systolic processing are preferable. Dot image filtering is basic algorithm of intraframe image processing; determination and selection of local extremes (most bright pixels) in frame is most complicated procedure of filtering. Usually image matrix in real sensors contains both pixel patterns and pixel noise that is why the main task is a separation of false pixels from real ones. It is impossible to know a number of false and real pixels before image processing therefore we need to check every local area around every pixel. For real conditions total number of pixels to be checked has an order of ten thousands.
Local extreme extraction algorithm is based on the following procedure:
where 0 i n1, 0 j n1.
One of the principal conditions for correct implementation of this algorithm by RPA MiniTera2 is a strong synchronisation of pixel flows during their processing. It is the requirement for a structure of input pixel flows. All lines of pixels pass into PE array on horizontal rows. Every input data flow contains continuous sequence of pixels; each pixel
moves along its line from might digit (see on Figure IV1).
Figure IV1: Format for jth line of pixel matrix
2DRPA is based upon parallel streaming comparison each matrix element U(i, j) with every adjacent element in window of size 3 x 3 as shown on Figure IV2.
i1, j1  i, j1  i+1, j1 
i1, j  i, j  i+1, j 
i1, j+1  i, j+1  i+1, j+1 
Figure IV2: Window 3 x 3 for image matrix
Parallel systolic comparison of central pixel with adjacent ones in window of size 3 x 3 may be realized by rectangular macros that contains 15 PE. A structure of such macros is represented on Figure IV3.
Figure IV3: Systolic macros for local extremes extraction into window size 3 x 3
Macros process three lines in parallel and every pixel in one line in serial. Number of pixels in each line may be unlimited. Continuous flows of pixels with signs of local extremes are output data of macros.
Macro described above is "building brick" for extracting local extremes for any size of image. It is very important that increasing of a number of lines (number of matrix columns) is provided by simple expansion of macros. Increasing of number of pixels in line (number of matrix rows) does not require any increasing at all. The upper bound of hardware complexity estimation is represented as a linear function:
where L_{h} is a number of PE, N is a number of pixels in row of image matrix. This effect is provided due to systolization of procedure (1). Comparison of a central pixel with an adjacent may be performed in four stages as follows:
Extraction of local extremes for fullsize image matrix may be provided by splicing macros between each other. Set of six macros intended for local extreme extraction for eight pixel flows is shown on Figure IV4. Using this methodology one may realize local extremes extraction algorithm for image of any size.
Figure IV4: Set of six màcros for local extremes extraction for eight pixel flows
The upper bound of time complexity for local extreme extraction in window of size 3 x 3 by means of MiniTera2 may be represented as a linear function (similar to Lh) as follows:
where r is a number of bits per one pixel.
RPA MiniTera2 extracts all local extremes in image with size 512 x 512 16bit pixels during 8700 cycles. For example video processor based upon such DSP as TMS320C67XXX needs near 30000 cycles.
Extraction of local extremes is widely used in the various numerical algorithms in mathematical physics. Therefore macros described above may be applied not only to image processing.
Realtime multiplication of very long numbers
Systolic RPA MiniTera2 provides high efficiency for arithmetic algorithms. Pipeline multiplication of very long numbers is more effective than well known parallel schemes. For example, for 8192 bit factors MiniTera2 requires 1.5 million CMOS gates and 66.5K cycles and matrix multiplier requires 400 millions CMOS gates and 32.7K cycles. The main goal of our today researches is decreasing multiplication time for very long numbers by parametrical decomposition of cofactors and partial products.
Realtime sorting of large massive
Using MiniTera2 and Shearsort sorting algorithm it is possible to sort large massives with upper bounds of time complexity as follows:
where M is number of massive elements [7]. Many parallel sorting algorithms on clusters and meshes provide an average upper bound of time complexity as
.
This effect of systolic implementation sorting is based on two steps of sorting algorithm:

Transformation of linear massive to square matrix.

Parallel 2Dsorting concurrently on row of matrix ("serpent" sorting) and on column of matrix (usual sorting, for example, ascending sort).
The main goal of our today researches is fast 2Dsorting on hybrid architectures (CPU + RPA).
V. CONCLUSION
Reconfigurable processor array are effective for high performance data processing in realtime systems. Results of the researches and developments presented in the paper show that for three important tasks: image processing, highprecision arithmetic and parallel sorting RPA MiniTera2 provides advantages on system criteria "cost x performance".
Reference
[1] BENNETT P. The why, where and what of lowpower SoC design// EEdesign.com, Dec 02, 2004.
[2] MANKE J.W., KERLICK G.D., LEVINE D., BANERJEE S., DILLON E. Parallel performance of two applications in the Boeing high performance computing benchmark suite// Parallel Computing, pp. 457 – 475, 27 (2001).
[3] LOOKIN N.A. FunctionOriented Processors As the Basis for Digital Electronics of the Intelligent Sensors for Navigation Systems//9rdSaintPetersburg International Conference on Integrated Navigation Systems, St.Petersburg, pp 407 – 410, May 2729, 2002.
[4] LOOKIN N.A. Reconfigurable Processor Arrays for RealTime Systems: Architectures, Efficiency, Application. //HighPerformance Computer Systems. Proceedings of International Science School. Taganrog, Russia: TSURE, pp. 44 – 59, 2004 (in Russian).
[5] PRANGISHVILI I.V. and oth. Microelectronics and uniform structures for logical and computational devices. – Ìoscow, Russia.: Science, 1967 (in Russian).
[6] KUNG S.Y. VLSI Array Processors. Prentice Hall, New Jersey, 1988.
[7] LOOKIN N.A., SOUVOROVA P.G. Realization of fast 2Dsorting by means of uniform computer environment//HighPerformance Computer Systems. Proceedings of International Science School. Taganrog, Russia: TSURE, pp. 59 – 63, 2004 (in Russian).
[8] MILLER R., BOXER L. Algorithms Sequential & Parallel: A Unified Approach. Pearson Education, Prentice Hall, New Jersey, 2000.