A 0.79-mm² 29-mW Real-Time Face Detection IP Core
Yuichi Hori, Yuya Hanai, Tadahiro Kuroda
Keio University, Yokohama, Japan
A 0.79-mm² 29-mW real-time face detection IP core was fabricated in a 0.13-µm CMOS technology and its performance was evaluated. It consists of 75-kgate logic, 58-kbit SRAM, and an ARM AMBA bus interface. Comprehensive optimization of both the algorithm and the hardware design improves performance and reduces area and power dissipation. Two kinds of templates built from facial features are proposed to achieve fast yet accurate face detection. A Steady State Genetic Algorithm is employed for high-speed hardware implementation of template matching. To reduce area and power dissipation, the frame memory is minimized and the detection engine is shared between the two kinds of template matching. The IP core can detect 8 faces per frame at 30 fps. Face detection accuracy is 92 %.
A ubiquitous computing society is not far off. In such a society, tiny computers will be distributed everywhere, communicating with one another and giving us easy access to the network. Some of these computers will play a key role as interfaces between humans and computers. As one such interface, face detection will turn vision systems such as distributed vision networks, digital still cameras, digital movie cameras, surveillance cameras and robot eyes into a brand-new interface for Human Computer Interaction.
To be distributed everywhere, a face detection technique must be not only fast and accurate but also small and low-power. A small, low-power face detector suits portable consumer applications such as automatic focus, automatic exposure and automatic zoom to faces in digital still cameras and camcorders, as well as human detection and head counting in security and survey applications. Because our research targets digital cameras for the consumer market, the first goal is an accuracy of more than 90 % on daily photographs taken with portable digital cameras. Because the cost-performance requirements of this market are strict, the second goal is low cost: reducing power consumption to 40 mW, which is less than 10 % of the total power consumption of the image processing LSI on which the face detector is to be mounted, and implementing the technique within an area of 1 mm².
This paper presents a 0.79-mm² 29-mW real-time face detection core. Table I compares the performance of our work with previously presented image recognition techniques [1-4]. The table shows that, among the previous works, the high-performance techniques are not small in area and the small-area techniques do not achieve high performance. Among the techniques compared, the proposed core is the only one that suits distributed vision systems in a ubiquitous computing society.
Table I. Performance comparison with related works on image recognition.
Fig. 1. Algorithm flow.
Fig. 2. Image preprocessing flow.
Fig. 3. Skin-tone in YCbCr color space.
Fig. 4. Results of image preprocessing.
Face Detection Algorithm
A. Algorithm Overview
The proposed algorithm consists of 3 steps (Fig. 1): the 1st step is skin-tone edge extraction, the 2nd step is coarse face detection and the 3rd step is precise face detection. In the 2nd and 3rd steps, template matching is executed with templates made of facial features. For fast and accurate face detection, 2 types of templates are employed. In the 2nd step, coarse detection, a lines-of-face template [5] is used to find facial edges among the skin-tone edges. The search area is thereby rapidly narrowed down to limited areas where faces may be found. This detection, however, may include non-face objects with skin-tone color and a facial shape, such as hands, arms and trees in the background. In the 3rd step, precise detection, the limited areas are explored intensively with a 2-D facial template to remove the false detections.
B. Image Preprocessing
Fig. 2 shows the flow of image preprocessing. As the first step, the color information of the input image is compressed from 16 bits to 3 bits per pixel to reduce the internal memory of the hardware. To reduce the memory further, a 128x96-pixel area is cropped from the center of the input image, which has a resolution of 160x120 pixels. Together, these steps reduce the frame memory to roughly 1/8 of its original size (307 kbit to 37 kbit). This contributes greatly to reducing chip area and power dissipation, since the frame memory generally occupies a large proportion of the chip area.
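The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation; the helper name `frame_bits` is ours and is not part of the design.

```python
def frame_bits(width: int, height: int, bits_per_pixel: int) -> int:
    """Number of bits needed to store one frame."""
    return width * height * bits_per_pixel


# Input frame: 160x120 pixels at 16 bits per pixel.
full = frame_bits(160, 120, 16)

# Stored frame: 128x96 pixels at 3 bits per pixel.
cropped = frame_bits(128, 96, 3)

# Fraction of the original frame memory that remains.
ratio = cropped / full

print(full, cropped, round(ratio, 2))
```

The two products are 307,200 bits and 36,864 bits, which match the 307 kbit and 37 kbit stated in the text.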
After cropping, skin-tone color is extracted from the image as the most important cue for human faces. Fig. 3(a) plots skin-tone pixels extracted manually from daily photos. The distribution is known to depend nonlinearly on luminance [6]. To extract such a nonlinear skin-tone distribution efficiently, a 3-D color model defined in the YCbCr color space is used. The model consists of 8 rectangular parallelepipeds, one per Y range, fitted to the skin-tone distribution in the YCbCr color space as shown in Fig. 3(b). This filter is simple enough to be implemented in hardware with lower power and smaller area than the equation presented in [6]. To eliminate small noise, a median filter and opening/closing filters are applied. We call the result the skin-tone flag map, which is shown in Fig. 4(b).
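A box-based skin-tone filter of this kind reduces to one table lookup and four comparisons per pixel. The following is a minimal sketch under stated assumptions: the Y axis is assumed to be split into 8 equal ranges of 32 levels, and every Cb/Cr bound in `SKIN_BOXES` is a placeholder chosen for illustration, not a value from the paper. The names `Box`, `SKIN_BOXES` and `is_skin` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Inclusive Cb/Cr bounds of one rectangular parallelepiped."""
    cb_min: int
    cb_max: int
    cr_min: int
    cr_max: int


# PLACEHOLDER bounds, one box per 32-level Y range (assumed equal-width ranges).
# These numbers are illustrative only and are not the paper's model.
SKIN_BOXES = [
    Box(110, 130, 135, 150),  # Y   0..31
    Box(105, 130, 135, 155),  # Y  32..63
    Box(100, 130, 135, 160),  # Y  64..95
    Box(98, 128, 138, 165),   # Y  96..127
    Box(98, 126, 138, 168),   # Y 128..159
    Box(100, 126, 136, 165),  # Y 160..191
    Box(105, 128, 134, 158),  # Y 192..223
    Box(110, 130, 132, 150),  # Y 224..255
]


def is_skin(y: int, cb: int, cr: int) -> bool:
    """Return True if the 8-bit (Y, Cb, Cr) pixel falls inside the box for its Y range."""
    box = SKIN_BOXES[y >> 5]  # 256 luminance levels / 8 boxes = 32 levels per box
    return box.cb_min <= cb <= box.cb_max and box.cr_min <= cr <= box.cr_max
```

Because the test needs only comparators and a small table, it maps naturally onto compact hardware, which is the motivation given in the text.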
For coarse detection, edge extraction and blurring are applied to the skin-tone flag map. This emphasizes skin-tone edges for face candidate detection. A Laplacian filter with a 5x5 window extracts the skin-tone edges. The whole image is then scanned with a 5x5 window for blurring. If there is an edge pixel at the center of the window, the center pixel is set to L = 3, and the surrounding pixels are set to L − i (i = 1, 2, ..., L) depending on their distance i from the center pixel. As a result, a 2-bit blurred image is obtained, as in Fig. 4(c). This blurring gives a margin to the lines-of-face detection described below. The upper 1 bit is the skin-tone flag map.
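The blurring step can be sketched in a few lines. This is an illustration under assumptions, not the hardware implementation: the distance is taken as the Chebyshev distance (the text does not name the metric), overlapping windows are resolved by keeping the largest value, and the function name `blur_edges` is ours.

```python
L = 3        # value written at an edge pixel
RADIUS = 2   # a 5x5 window reaches 2 pixels from its center


def blur_edges(edges):
    """Spread each edge pixel over a 5x5 window with value L - distance.

    `edges` is a list of rows holding 0 (no edge) or 1 (edge).
    Returns an image with values in 0..L, i.e. 2 bits per pixel.
    """
    h, w = len(edges), len(edges[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            for dy in range(-RADIUS, RADIUS + 1):
                for dx in range(-RADIUS, RADIUS + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        d = max(abs(dy), abs(dx))  # assumed distance metric
                        out[yy][xx] = max(out[yy][xx], L - d)
    return out
```

With these settings an isolated edge pixel produces 3 at its own position, 2 in the ring around it and 1 in the outer ring of the window.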
For precise detection, the luminance is converted from 8 bits to 2 bits. During the conversion, the luminance is normalized using the maximum and minimum luminance values of the previous frame. As in the preprocessing for coarse detection, the upper 1 bit is the skin-tone flag map.
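One way to picture this conversion is a linear rescale between the previous frame's extremes followed by quantization to four levels. The sketch below is an assumption about the mapping, since the text gives only the inputs and the output width; `quantize_luma` and `pack_pixel` are hypothetical names.

```python
def quantize_luma(y: int, y_min: int, y_max: int) -> int:
    """Map an 8-bit luminance to 2 bits using the previous frame's min and max."""
    if y_max <= y_min:
        return 0  # degenerate range: nothing to normalize against
    y = min(max(y, y_min), y_max)  # clamp values outside last frame's range
    return (y - y_min) * 4 // (y_max - y_min + 1)  # result is always 0..3


def pack_pixel(skin_flag: bool, level: int) -> int:
    """Build a 3-bit pixel: skin-tone flag in the upper bit, 2-bit luminance below."""
    return (int(skin_flag) << 2) | (level & 0b11)
```

Using the previous frame's statistics avoids a second pass over the current frame, at the cost of clamping when the lighting changes between frames.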
C. Coarse Detection: Lines-of-Face Detection
This section describes a positive-negative lines-of-face template that efficiently detects the edges of faces in the preprocessed image. Fig. 5(a) shows the proposed template. White points form the positive template, which evaluates the existence of the lines of a face. Black points form the negative template, which evaluates their nonexistence.
A semi-ellipse composed of positive template points is used to detect the edge of a face. In addition, negative template points are laid outside the positive ones. There are no negative points at the bottom of the template because this is the neck area. The chin edge sometimes does not appear because the chin and the neck have the same color, so the positive points are placed sparsely there.
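The role of the two point sets can be illustrated with a simple score. The text does not give the scoring rule, so the sketch below assumes one: blurred-edge values under positive points are added and values under negative points are subtracted. The function name `template_fitness` and the tiny point lists in the usage example are hypothetical and are not the template of Fig. 5(a).

```python
def template_fitness(image, positives, negatives, ox, oy):
    """Score a positive-negative template placed at origin (ox, oy).

    `image` is a list of rows of blurred edge values.
    `positives` and `negatives` are lists of (dx, dy) offsets from the origin.
    Points that fall outside the image contribute 0.
    """
    def sample(dx, dy):
        x, y = ox + dx, oy + dy
        if 0 <= y < len(image) and 0 <= x < len(image[0]):
            return image[y][x]
        return 0

    reward = sum(sample(dx, dy) for dx, dy in positives)
    penalty = sum(sample(dx, dy) for dx, dy in negatives)
    return reward - penalty


# Toy example: a ring of edge values around an empty center.
img = [[0, 3, 0],
       [3, 0, 3],
       [0, 3, 0]]
ring = [(0, -1), (-1, 0), (1, 0), (0, 1)]   # illustrative positive points
center = [(0, 0)]                            # illustrative negative point
print(template_fitness(img, ring, center, 1, 1))
```

Under this assumed rule, a placement scores highest when edges lie under the positive points and no edges lie under the negative points.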
Fig. 5. Templates.
D. Precise Detection: Face Judgment
After lines-of-face detection, some noise may remain because the lines-of-face template can detect only skin-tone contours. To remove this noise, a 2-D facial template is introduced (Fig. 5(b)). The 2-D facial template carries much more information (16x11 pixels) than the lines-of-face template (16 pixels), which enables more precise detection. Through template matching with this template, each area detected by lines-of-face detection is judged as face or non-face. The 2-D facial template is generated by averaging frontal face images for generalization and is blurred to remove individual differences.
E. Genetic Algorithm for Template Matching
A genetic algorithm (GA) [7] is employed for high-speed template matching. GAs use techniques inspired by evolutionary biology such as inheritance, mutation, natural selection and crossover. A candidate solution is represented as a gene, and the optimization is performed on a pool of genes. In each generation, genetic operations such as crossover, mutation, fitness calculation, sorting and survival are executed for all genes. The genes improve gradually, generation by generation, and finally converge toward the optimal solution.
To adopt a GA for template matching, the parameters of the transformed template are encoded in a 25-bit gene code as shown in Fig. 6. The fitness of a gene is evaluated by calculating the correlation between the transformed template and the target image. The gene with the highest fitness during the evolution gives the area of the target image that is most similar to the template.
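Packing template parameters into a fixed-width gene is a matter of bit fields. The actual layout is defined in Fig. 6 and is not reproduced in the text, so the field names and widths below (7-bit x, 7-bit y, 5-bit scale, 6-bit angle) are purely hypothetical; they were chosen only because they sum to 25 bits and cover a 128x96 frame.

```python
# HYPOTHETICAL field layout, least significant field first. Not taken from Fig. 6.
FIELDS = [("x", 7), ("y", 7), ("scale", 5), ("angle", 6)]
GENE_BITS = sum(width for _, width in FIELDS)  # 25


def encode_gene(**values: int) -> int:
    """Pack named field values into a single integer gene."""
    gene, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        gene |= value << shift
        shift += width
    return gene


def decode_gene(gene: int) -> dict:
    """Unpack a gene into its named fields."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (gene >> shift) & ((1 << width) - 1)
        shift += width
    return out
```

A flat bit string like this is what allows crossover and mutation to be applied with plain bitwise masks, independent of what each field means.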
Fig. 6. Gene code for template matching.
A. Overview of Real-Time Face Detection Core
A block diagram of the proposed real-time face detection core is shown in Fig. 7. The core consists of 5 components: the frame memory, the face detection engine, the image preprocessor, the best-fitness monitor and the AHB I/F.
As described in Section I, the frame buffer is designed for a resolution of 128x96 pixels instead of the 320x240 pixels generally used by face detection algorithms [8, 9]. Experimental results show that the detection rate decreases by only about 3 % at 128x96 pixels compared with 320x240 pixels. This small loss is outweighed by the reduction of the frame memory to 1/4. The smaller frame memory contributes to reducing the core area.
The face detection engine is the main part of the proposed core and consists of 22-kgate logic and 12-kbit SRAM. This simple implementation also contributes to reducing the core area and power dissipation.
An ARM AHB interface module is also included to provide a general interface to the ARM processor. The input image is sent from the ARM processor via this module and is then processed by the image preprocessing module. After that, the face detection engine searches for faces while accessing the frame buffer. The best-fitness monitor checks the detection results and sends them to the ARM processor.
B. 4-stage Pipelined Architecture of Genetic Algorithm
Section II showed how template matching is realized with a GA, but a problem remains for high-speed operation. In the general flow of a standard GA (sGA), a sort process is performed after fitness calculation. Sorting typically requires a large number of cycles, on the order of N² for a pool of N genes, whereas the other processes require only a few cycles. This causes pipeline stalls.
To resolve this problem, a Steady State Genetic Algorithm (SSGA) [10] is employed. A block diagram of the face detection engine is shown in Fig. 8. In SSGA, the sort process is omitted and a survival process is executed instead. In the survival process, the fitness of the child is compared with that of its parents. If a parent is weaker than the child, the parent is replaced by the child, and the child survives to the next generation. Since the survival process requires only a few cycles, the hardware can be designed as a 4-stage pipeline without the stall problem. Although omitting the sort process may require more generations to reach the optimal solution, the simpler hardware raises the clock frequency and lowers area and power dissipation significantly [11]. SSGA is therefore much better suited to hardware implementation than sGA.
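The survival process can be sketched as a single replacement step on the gene pool. This is an illustration in software, not a model of the pipeline: parents are drawn with Python's `random` module rather than an LFSR, the child replaces the weaker of its two parents (one reading of the rule above), and `ssga_step` and its parameters are hypothetical names.

```python
import random
from typing import Callable, List


def ssga_step(pool: List[int],
              fitness: Callable[[int], int],
              make_child: Callable[[int, int], int],
              rng: random.Random) -> bool:
    """Run one steady-state step in place; return True if the child survived."""
    i = rng.randrange(len(pool))
    j = rng.randrange(len(pool))
    child = make_child(pool[i], pool[j])
    # No sorting of the pool: only the two parents are compared with the child.
    weaker = i if fitness(pool[i]) <= fitness(pool[j]) else j
    if fitness(child) > fitness(pool[weaker]):
        pool[weaker] = child
        return True
    return False
```

Each step touches at most one pool entry and needs a fixed, small number of comparisons, which is why it fits a short pipeline where a full sort would not. Because a gene is only ever replaced by a fitter one, the best fitness in the pool never decreases.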
C. Reproduction Module of Steady State Genetic Algorithm
The reproduction module, which is the core part of the SSGA hardware, consists of quite simple circuits as shown in Fig. 9 (the gray part of Fig. 8). First, random numbers for parent selection, the crossover mask and the mutation mask are generated by Linear Feedback Shift Registers (LFSRs), which are commonly used as pseudo-random number generators. Because the population size defined in the core is 256, an 8-bit LFSR is used for parent selection. The crossover mask and the mutation mask are 25-bit shift registers. The result of comparing the output of a 16-bit LFSR with a configurable threshold (the crossover rate or the mutation rate) is shifted in as the first bit of each mask. A 16-bit LFSR, which provides finer-grained random numbers than an 8-bit LFSR, is used to control the incidence of crossover and mutation more precisely.
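The random-number path described above can be mimicked in software. The sketch below is an assumption-laden illustration: it uses Galois-form LFSRs with commonly published maximal-length tap masks (`0xB8` for 8 bits, `0xB400` for 16 bits), whereas the paper does not state its polynomials or LFSR form, and it models "the first bit" of a mask as the least significant bit. All names are hypothetical.

```python
TAPS_8 = 0xB8      # x^8 + x^6 + x^5 + x^4 + 1 (assumed polynomial)
TAPS_16 = 0xB400   # x^16 + x^14 + x^13 + x^11 + 1 (assumed polynomial)
MASK_BITS = 25     # width of the crossover and mutation masks


def lfsr_step(state: int, taps: int) -> int:
    """Advance a Galois LFSR: shift right, XOR the taps when a 1 is shifted out."""
    lsb = state & 1
    state >>= 1
    return state ^ taps if lsb else state


def lfsr_period(seed: int, taps: int) -> int:
    """Count the steps until the LFSR returns to its (nonzero) seed."""
    state, steps = lfsr_step(seed, taps), 1
    while state != seed:
        state = lfsr_step(state, taps)
        steps += 1
    return steps


def update_mask(mask: int, lfsr16: int, threshold: int) -> int:
    """Shift a new bit into a 25-bit mask: 1 if the 16-bit random value is below the threshold."""
    bit = 1 if lfsr16 < threshold else 0
    return ((mask << 1) | bit) & ((1 << MASK_BITS) - 1)
```

In this model a rate r corresponds to a threshold of about r x 65536, so the 16-bit generator resolves rates in steps of roughly 0.0015 %, compared with about 0.4 % for an 8-bit one. Note that a maximal-length 8-bit LFSR visits 255 nonzero states, so one of the 256 pool indices would need separate handling; the text does not say how the hardware deals with this.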
The gene of parent 1 is loaded from the pool of genes at an index given by a random number from the 8-bit LFSR. The gene of parent 2 is loaded in the same way, using the random number generated in the next cycle. A new gene is then generated from the two parents bit by bit. Where the crossover mask bit is one, the bit of parent 1 is copied to the new gene; where it is zero, the bit of parent 2 is copied. This process is called crossover. In addition, where the mutation mask bit is one, the corresponding bit of the gene obtained by crossover is inverted. This process is called mutation. The gene generated through crossover and mutation is the child gene, which is the candidate to survive into the next generation. By repeating this cycle, the pool of genes gradually evolves.
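Crossover and mutation as described above reduce to three bitwise operations on 25-bit words. The following sketch shows them in software; the function name `reproduce` is ours.

```python
GENE_BITS = 25
GENE_MASK = (1 << GENE_BITS) - 1


def reproduce(parent1: int, parent2: int, crossover_mask: int, mutation_mask: int) -> int:
    """Create a child gene from two parents using bitwise masks.

    Crossover: take the bit from parent1 where the crossover mask is 1,
    and from parent2 where it is 0.
    Mutation: invert every bit where the mutation mask is 1.
    """
    child = (parent1 & crossover_mask) | (parent2 & ~crossover_mask & GENE_MASK)
    return (child ^ mutation_mask) & GENE_MASK


# Example with short values for readability.
child = reproduce(0b1111, 0b0000, crossover_mask=0b1010, mutation_mask=0b0001)
print(bin(child))
```

In the example, crossover selects bits 3 and 1 from parent 1, giving `0b1010`, and mutation then inverts bit 0, giving `0b1011`.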
Fig. 7. Block diagram.
Fig. 8. Face detection engine.
Fig. 9. Circuit schematic of reproduction module.
The proposed core was fabricated together with an ARM9 processor in a 0.13-µm CMOS technology as an internal block of a general-purpose image processing LSI. A chip photomicrograph is shown in Fig. 10 and the measured performance is summarized in Table II. When the proposed algorithm was run on a PC with a 3.8-GHz Pentium 4 and 2 GB of main memory, processing each image took about 400 msec. Since the core completes its processing in 33 msec, it is about 10 times faster than the Pentium 4, even though its area is only 0.79 mm². According to statistical experiments by simulation, the average processing time is 1.5 msec for coarse detection and 2.5 msec for precise detection. These results show that the core can detect 8 faces within a frame.
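The relation between these timings and the figure of 8 faces per frame can be checked with simple arithmetic. The sketch below assumes that each face costs one coarse pass plus one precise pass and ignores preprocessing time; that cost model is our simplification, not a statement from the measurements.

```python
FRAME_RATE_FPS = 30
COARSE_MS = 1.5     # average coarse detection time per face
PRECISE_MS = 2.5    # average precise detection time per face

frame_budget_ms = 1000 / FRAME_RATE_FPS        # about 33.3 msec per frame
per_face_ms = COARSE_MS + PRECISE_MS           # assumed cost of one face
faces_per_frame = int(frame_budget_ms // per_face_ms)

speedup_vs_pc = 400 / 33                       # PC time / core time per image

print(faces_per_frame, round(speedup_vs_pc, 1))
```

Under this simplification the frame budget accommodates 8 faces, and the ratio of the two processing times is a little over 12, consistent with the order-of-magnitude speedup stated above.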
Table II. Performance summary.
Fig. 10. Chip photomicrograph.
To confirm real-time operation of the core, an evaluation board was developed as shown in Fig. 11. The chip is mounted on the main board, to which a front-end board with a Charge Coupled Device (CCD) sensor and a fixed-focal-length lens is connected. With this board, it was confirmed that 8 faces can be detected in each frame of moving pictures at 30 frames/second. In addition, flexible applications can be developed by connecting an ARM debugger and loading code onto the chip.
Fig. 12 summarizes the experimental results on face detection accuracy. As a test image set, 200 photos of daily scenes containing 339 faces were chosen at random. The set covers various lighting environments, including day and night, and indoor and outdoor scenes (sunlight, incandescent lamps, fluorescent lamps, etc.). The face detection rate is 92 % on average. This rate is achieved over ranges of face size, rotation angle and direction angle that each cover 90 % of the faces in the test image set. Faces that are small, or that look upward, downward or sideways, can also be detected in daily photos. The number of detected non-face objects per image is less than 0.5, even for photos that are out of focus, under- or overexposed, or heavily obstructed. Some of the detection results are shown in Fig. 15.
As an example application of the proposed face detection core, head counting is shown in Fig. 13. Two experiments were carried out: (a) counting people walking in a station with a stationary camcorder, and (b) counting people sitting in a stadium while panning a camcorder. In experiment (a), where around 100 people per minute walk across in front of the camera, the counting accuracy is 87 %. In experiment (b), where around 1,500 people per minute appear in the images, the counting accuracy is 86 %.
Fig. 11. Evaluation board.
Fig. 12. Experimental results on face detection accuracy.
Fig. 13. Head counting.
As another application, object detection is shown in Fig. 14. The proposed core is designed flexibly so that the template, the skin-tone filter, the GA parameters and so on can be configured easily from the ARM9. As an example, color ball detection is shown here. By setting the skin-tone filter to the color of the target ball and the lines-of-face template to a circular form, the ball can easily be detected. If the target shape is simple enough, e.g. a circle, a rectangle or an ellipse, object detection can run at 60 frames/second because there is no need to run the judgment process.
Fig. 14. Object detection.
A 0.79-mm² 29-mW real-time face detection core was fabricated in a 0.13-µm CMOS technology and its performance was evaluated. It consists of 75-kgate logic, 58-kbit SRAM, and an ARM AMBA bus interface. Comprehensive optimization of both the algorithm and the hardware design improves performance and reduces area and power dissipation. Two kinds of templates built from facial features are proposed to achieve fast yet accurate face detection. A Steady State Genetic Algorithm is employed for high-speed hardware implementation of template matching. To reduce area and power dissipation, the frame memory is minimized and the detection engine is shared between the two kinds of template matching. The core can detect 8 faces in each frame of moving pictures at 30 frames/second. Face detection accuracy is 92 %. Part of the future work is to replace the current core with a multi-core design so that accuracy and robustness to face size, face direction and face rotation can be further improved in exchange for an increase in chip area.
This work was supported by a grant from NuCORE Technology Inc. The authors are very grateful to the company for helpful comments and valuable discussions.
[1] K. Korekado, T. Morie, O. Nomura, T. Nakano, M. Matsugu and A. Iwata, "An Image Filtering Processor for Face/Object Recognition Using Merged/Mixed Analog Digital Architecture," in IEEE Symp. VLSI Circuits Dig. Tech. Papers, pp. 220-223, June 2005.
[2] T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin and W. Wolf, "Embedded Hardware Face Detection," in Proc. IEEE Int. Conf. VLSI Design, pp. 133-138, Jan. 2004.
[3] T. Kozakaya and H. Nakaia, "Development of a Face Recognition System on an Image Processing LSI Chip," in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshop, vol. 5, p. 86, June 2004.
[4] K. Imagawa, K. Iwasa, T. Kataoka, T. Nishi and H. Matsuo, "Real-Time Face Detection with MPEG4 Codec LSI for a Mobile Multimedia Terminal," in Proc. IEEE Int. Conf. Consumer Electronics, pp. 16-17, June 2003.
[5] Y. Hori, K. Shimizu, Y. Nakamura and T. Kuroda, "A Real-Time Multi Face Detection Technique Using Positive-Negative Lines-of-Face Template," in Proc. IEEE Int. Conf. Pattern Recognition, vol. 1, pp. 765-768, Aug. 2004.
[6] R.-L. Hsu, M. Abdel-Mottaleb and A. K. Jain, "Face Detection in Color Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, May 2002.
[7] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[8] P. Viola and M. Jones, "Robust Real-Time Face Detection," Int. J. Computer Vision, vol. 57, no. 2, pp. 137-154, May 2004.
[9] C. Huang, H. Ai, Y. Li and S. Lao, "Vector Boosting for Rotation Invariant Multi-View Face Detection," in Proc. IEEE Int. Conf. Computer Vision, pp. 446-453, Oct. 2005.
[10] G. Syswerda, "Uniform Crossover in Genetic Algorithms," in Proc. Int. Conf. Genetic Algorithms, pp. 2-9, June 1989.
[11] B. Shackleford, G. Snider, R. J. Carter, E. Okushi, M. Yasuda, K. Seo and H. Yasuura, "A High-Performance, Pipelined, FPGA-Based Genetic Algorithm Machine," J. Genetic Programming and Evolvable Machines, vol. 2, no. 1, pp. 33-60, Mar. 2001.
Fig. 15. Detection results.