By Nick Lookin & Vadim Bersenev, Institute of Engineering Science, Urals Division of Russian Academy of Sciences
Michael Trapeznikoff, Science & Production Association of Automatics
High-speed data processing based upon special-purpose processors is one of the basic directions in the development of computer systems for real-time systems. Reconfigurable processor arrays containing bit-level processor elements are the sort of special-purpose processor architectures. This array may be preferable for real-time computations in hard time and space restrictions. Features of uniform array may enable the maximal concurrency for data processing but require the adaptation of arbitrary algorithms to uniform architecture. It is one of the most difficult problems in modern computer science, its successful solution determine the efficiency of their application in systems.
This paper is devoted to the development of one type processor arrays, called MiniTera-2 and to the investigations of realization of some real-time algorithms by means of this array.
High-speed data processing based upon special-purpose processors is one of the basic directions in the development of computer systems (CS) for real-time systems (RTS). The main goal of development of these processors is increasing of CS performance for complicated algorithms. Implementation of high and super high complicated algorithms requires parallel data processing on operation level, word level, or even bit level.
Example. We need processing of some objects images. Images are dynamically changed in space and time dimensions and have 1024 x 1024 pixels. The frame rate is 100 Hz and each frame period contains both intraframe data processing (noise filtering, staining etc.) and interframe data processing (objects recognition, estimation of parameters of objects motion in inertial frame etc.). If number of objects has an order of thousands then only processor architectures providing parallel and independent processing of each pixel supplies full-size image processing (intraframe and interframe) during one frame period.
There are three main tendencies of CS performance increase. They are:
Increasing of switching frequency of gates. It requires essential enhancement of design and technology of chips, micro assemblies and printed circuit boards. In addition increasing of frequency results in growing of power consumption density and amplifies the impact of electro physical parameters on circuits operation .
Parallel data processing. There are two directions of parallelism development:
- universal parallelism that leads to general-purpose parallel architectures (multiprocessor or multicomputer systems) having restrictions for performance increasing (connectivity of the algorithms graphs, decision time losses because of processors interconnections and memory access conflicts );
- specialized parallelism oriented to applications. It leads to special-purpose parallel architectures that provide efficient computations of certain functions, procedures or algorithms blocks. These architectures are more effective then general-purpose architectures for high-complexity real-time algorithms.
Hardware implementation of algorithms. It is provided by functional-oriented architectures of processors permitting to take into consideration the peculiarities of algorithms or of numerical methods in block sets and its interfaces. Usually RTS realize fixed set of algorithms, that is why functional-oriented processors architecture permits to obtain the maximum of performance for real systems .
II. RECONFIGURABLE PROCESSOR ARRAYS
Reconfigurable processor arrays (RPA) containing bit-level processor elements (PE) belong to the type of special-purpose processor architectures . According to the opinion of many scientists and designers RPA are preferable for real-time computations in hard time and space restrictions. In the paper we consider special case, locally connected RPA, representing a sort of uniform computing environments . This RPA may be represented by both array of bit-level PE (fine-grained parallelism) and set of clusters consisting of RPA (coarse-grained parallelism).
High efficiency of RPA applications in RTS is enabled due to:
mass parallelism for data processing (total number of data flows has an order of ten thousands), which provides maximum of performance and relatively low switching frequency (and hence low power consumption);
architecture uniformity (repeatability of PE and regularity of interconnections) which provides maximum of VLSI integration level;
locality of information lines in RPA (interconnections between adjacent PE only) which provides minimal length of inter-circuits connections in VLSI chip and therefore minimal level of noise;
large number of identical PE in one chip (up to several thousands PE in one chip) which provides high fault tolerance of computations.
Evolution of RPA counts about 40 years. During this period many architectures, a lot of hardware and software were developed. The main areas of RPA applications in RTS are follows:
optical systems for missiles and aircrafts;
data processing in radar guidance;
data processing in hydro acoustic system of underwater vehicles;
pre-processing in inertial sensors of air- and space-born navigation systems;
adaptive formation of antenna directional diagrams for radio navigation systems.
This paper is devoted to the development of one type of RPA and to the investigations of realization of some RTS algorithms by means of this RPA.
III. RPA MiniTera-2
Architecture of RPA MiniTera-2 developed in 2002  was a basis for our study. This RPA represents rectangular array of identical cells connected between each other by means of only local data channels. Each cell contains:
one PE performing bitwise data processing of data flows;
shadow memory module for fast rewriting of the PE's operation codes that is needed for dynamically reconfiguration PE array during data processing;
control unit for the execution of instructions written in control register before data processing.
Figure III-1: RPA Minitera-2 cell architecture
Architecture of one cell is shown on Figure III-1.
The main features of RPA MiniTera-2 cell are synchronous data flows processing in "on-fly" mode and programmable number of digits for addition/subtraction. Sixteen programmable interfaces permit switching of any four channels concurrently with data flow processing. Instruction set consists of 49 instructions: 22 ALU instructions, 14 instructions for comparison operations, remaining instructions are used for data flow commutation, code conversion such as P S and S P.
All cells are combined in rectangular arrays; each cell is connected with four adjacent cells. Data channels of each cell are intended for concurrent interchanges between (i,j)-th cell and (i-1,j), (i+1,j), (i,j-1), (i,j+1)-th cells. Special part of op code of each PE is needed for programming these interchanges. All cells operate synchronously with one bit clock from synchronizing generator. The main features of RPA MiniTera-2 cell are synchronous data flows processing in "on-fly" mode and programmable number of digits for addition/subtraction. Sixteen programmable interfaces permit switching of any four channels concurrently with data flow processing. Instruction set consists of 49 instructions: 22 ALU instructions, 14 instructions for comparison operations, remaining instructions are used for data flow commutation, code conversion such as P S and S P.
All cells are combined in rectangular arrays; each cell is connected with four adjacent cells. Data channels of each cell are intended for concurrent interchanges between (i,j)-th cell and (i-1,j), (i+1,j), (i,j-1), (i,j+1)-th cells. Special part of op code of each PE is needed for programming these interchanges. All cells operate synchronously with one bit clock from synchronizing generator.
Before operation mode one instruction and one mode of interchange must be written to control register of each RPA cells. This enables after initial setup all PE's univocal correspondence between algorithm graph and cell array topology. Structure of RPA MiniTera-2 is shown on Figure III-2.
Figure III-2: RPA Minitera-2 cell architecture
One chip of VLSI contains RPA. Its size or dimension depends on technology. In 2002 pre-production model of VLSI MiniTera-2 was developed and fabricated. This experimental chip was made on 0.6 mm CMOS technology and contained cell matrix 5x5 PE, frequency of operation f = 30 MHz. Simulation by SPICE revealed for 0.09 µm CMOS technology 10000 cells per one chip and 6W power consumption for f = 100 MHz.
In 2003 model of functional-oriented processor (FOP) based upon MiniTera-2 was developed. In this year computer workstation for real-time image processing was realized. Some application-specific image processing algorithms supported by workstation operated successfully in real time. FOP consisted of 288 VLSI (7200 PE) that permitted to get performance about 20 GOPS for image processing. It allowed realizing HDTV algorithms with frame rate 100 Hz. Workstation and FOP based upon RPA MiniTera-2 are shown in Figure III-3.
Figure III-3: FOP based upon RPA MiniTera-2 and computer workstation
In addition in 2003 software IDE MiniTera-2 for programming and debugging application tasks was developed. The main principle of software operation is graphical programming and debugging of user programs on the base of simulation model of each PE and then 2D-array. The basic components of graphical part of IDE are:
- Algorithm Graph Editor;
- PE Array Editor;
- ASM Tool;
- Project Panel;
User interface IDE MiniTera-2 is based upon standard MDI GUI, thus any project may be created and debugged very fast. The main window of IDE is drawn on Figure-III-4.
Figure III-4: IDE main wind
The main features of RPA MiniTera-2 architecture are
- data flow processing based on the uniform computing environment;
- RPA performance may be increased by simple and unlimited expansion of array dimension (by simple splicing additional chips);
- any topologies of PE network or FOP structure may be realized by preliminary setup every PE.
These features enable the maximal concurrency for data processing but require the adaptation of arbitrary algorithms or systolization  to uniform RPA architecture. It is one of the most difficult problems in modern computer science, its successful solution determine the efficiency of RPA application in RTS.
IV. REALIZATION OF RTS ALGORITHMS BY MEANS OF RPA MINITERA-2
From all the possible RTS application tasks let us choose three: image processing, arithmetic operations on long numbers and real-time sorting. Implementation of algorithms of these tasks by means of MiniTera-2 is discussed below, in more details for image processing and briefly for two other tasks.
Intraframe processing of noise dot images is one of the most widely used tasks for mobile RTS such us video-channels for missiles and aircrafts. Firstly, they belong to the most complicated algorithms requiring maximal computational performance. Secondly, these algorithms have considerable internal parallelism therefore its systolic processing are preferable. Dot image filtering is basic algorithm of intraframe image processing; determination and selection of local extremes (most bright pixels) in frame is most complicated procedure of filtering. Usually image matrix in real sensors contains both pixel patterns and pixel noise that is why the main task is a separation of false pixels from real ones. It is impossible to know a number of false and real pixels before image processing therefore we need to check every local area around every pixel. For real conditions total number of pixels to be checked has an order of ten thousands.
Local extreme extraction algorithm is based on the following procedure:
where 0 i n-1, 0 j n-1.
One of the principal conditions for correct implementation of this algorithm by RPA MiniTera-2 is a strong synchronisation of pixel flows during their processing. It is the requirement for a structure of input pixel flows. All lines of pixels pass into PE array on horizontal rows. Every input data flow contains continuous sequence of pixels; each pixel
moves along its line from might digit (see on Figure IV-1).
Figure IV-1: Format for j-th line of pixel matrix
2D-RPA is based upon parallel streaming comparison each matrix element U(i, j) with every adjacent element in window of size 3 x 3 as shown on Figure IV-2.
Figure IV-2: Window 3 x 3 for image matrix
Parallel systolic comparison of central pixel with adjacent ones in window of size 3 x 3 may be realized by rectangular macros that contains 15 PE. A structure of such macros is represented on Figure IV-3.
Figure IV-3: Systolic macros for local extremes extraction into window size 3 x 3
Macros process three lines in parallel and every pixel in one line in serial. Number of pixels in each line may be unlimited. Continuous flows of pixels with signs of local extremes are output data of macros.
Macro described above is "building brick" for extracting local extremes for any size of image. It is very important that increasing of a number of lines (number of matrix columns) is provided by simple expansion of macros. Increasing of number of pixels in line (number of matrix rows) does not require any increasing at all. The upper bound of hardware complexity estimation is represented as a linear function:
where Lh is a number of PE, N is a number of pixels in row of image matrix. This effect is provided due to systolization of procedure (1). Comparison of a central pixel with an adjacent may be performed in four stages as follows:
Extraction of local extremes for full-size image matrix may be provided by splicing macros between each other. Set of six macros intended for local extreme extraction for eight pixel flows is shown on Figure IV-4. Using this methodology one may realize local extremes extraction algorithm for image of any size.
Figure IV-4: Set of six màcros for local extremes extraction for eight pixel flows
The upper bound of time complexity for local extreme extraction in window of size 3 x 3 by means of MiniTera-2 may be represented as a linear function (similar to Lh) as follows:
where r is a number of bits per one pixel.
RPA MiniTera-2 extracts all local extremes in image with size 512 x 512 16-bit pixels during 8700 cycles. For example video processor based upon such DSP as TMS320C67XXX needs near 30000 cycles.
Extraction of local extremes is widely used in the various numerical algorithms in mathematical physics. Therefore macros described above may be applied not only to image processing.
Real-time multiplication of very long numbers
Systolic RPA MiniTera-2 provides high efficiency for arithmetic algorithms. Pipeline multiplication of very long numbers is more effective than well known parallel schemes. For example, for 8192 bit factors MiniTera-2 requires 1.5 million CMOS gates and 66.5K cycles and matrix multiplier requires 400 millions CMOS gates and 32.7K cycles. The main goal of our today researches is decreasing multiplication time for very long numbers by parametrical decomposition of cofactors and partial products.
Real-time sorting of large massive
Using MiniTera-2 and Shearsort sorting algorithm it is possible to sort large massives with upper bounds of time complexity as follows:
where M is number of massive elements . Many parallel sorting algorithms on clusters and meshes provide an average upper bound of time complexity as
This effect of systolic implementation sorting is based on two steps of sorting algorithm:
Transformation of linear massive to square matrix.
Parallel 2D-sorting concurrently on row of matrix ("serpent" sorting) and on column of matrix (usual sorting, for example, ascending sort).
The main goal of our today researches is fast 2D-sorting on hybrid architectures (CPU + RPA).
Reconfigurable processor array are effective for high performance data processing in real-time systems. Results of the researches and developments presented in the paper show that for three important tasks: image processing, high-precision arithmetic and parallel sorting RPA MiniTera-2 provides advantages on system criteria "cost x performance".
 BENNETT P. The why, where and what of low-power SoC design// EEdesign.com, Dec 02, 2004.
 MANKE J.W., KERLICK G.D., LEVINE D., BANERJEE S., DILLON E. Parallel performance of two applications in the Boeing high performance computing benchmark suite// Parallel Computing, pp. 457 – 475, 27 (2001).
 LOOKIN N.A. Function-Oriented Processors As the Basis for Digital Electronics of the Intelligent Sensors for Navigation Systems//9rdSaint-Petersburg International Confer-ence on Integrated Navigation Systems, St.-Petersburg, pp 407 – 410, May 27-29, 2002.
 LOOKIN N.A. Reconfigurable Processor Arrays for Real-Time Systems: Architectures, Efficiency, Application. //High-Performance Computer Systems. Proceedings of International Science School. Taganrog, Russia: TSURE, pp. 44 – 59, 2004 (in Russian).
 PRANGISHVILI I.V. and oth. Microelectronics and uniform structures for logical and computational devices. – Ìoscow, Russia.: Science, 1967 (in Russian).
 KUNG S.Y. VLSI Array Processors. Prentice Hall, New Jersey, 1988.
 LOOKIN N.A., SOUVOROVA P.G. Realization of fast 2D-sorting by means of uniform computer environment//High-Performance Computer Systems. Proceedings of International Science School. Taganrog, Russia: TSURE, pp. 59 – 63, 2004 (in Russian).
 MILLER R., BOXER L. Algorithms Sequential & Parallel: A Unified Approach. Pearson Education, Prentice Hall, New Jersey, 2000.