Fabio Garzia, Claudio Brunelli, Juha Kylliäinen, Markus Moisio, Jari Nurmi
Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
Abstract -- This paper describes a new approach to the implementation of 3D graphics applications on a SoC architecture. The approach is meant to be flexible enough to be used in different kinds of systems: it is based on software libraries developed in the C programming language. The software implementation can be further optimized by mapping some high-level library functions directly onto hardware, exploiting special-purpose or reconfigurable hardware blocks that may be included in the target system. The flexibility of our approach comes from the fact that the library needs no operating system. When the target system is a small System-on-Chip (SoC), this can also increase performance, since operating system services consume considerable computational power. The proposed library is tested on an actual SoC based on the Coffee RISC core; the Milk coprocessor provides support for floating-point arithmetic.
Nowadays the implementation of 3D graphics on mobile devices is becoming more and more popular. 3D graphics makes it possible to run 3D animations and 3D games, which have recently appeared on systems such as mobile phones and PDAs. These devices are largely based on Systems-on-Chip (SoC).
The system requirements of a 3D graphics application are demanding from the hardware point of view. An accelerator is usually required to reduce the computational load of the main processor when this is a general-purpose processor. Software development for a 3D graphics application is another major issue, since first of all the software has to deal with the target hardware. Because different systems are based on different architectures, there is a risk that a new dedicated programming language has to be developed for each of them. If each system required its own programming language, code portability would not be possible. The problem gets even more complicated in systems that include reconfigurable devices, because these devices require different programming styles depending on their granularity.
A possible solution consists in handling a system with a reconfigurable device as a general-purpose system, where possible. This means that reconfigurability should be hidden from the programmer behind built-in functions provided by the system designer. These built-in functions should be integrated into the operating system, if present, or made available through a high-level programming language.
From this point of view, a common standard is possible and actually exists already: OpenGL ES provides a de facto standard for 3D graphics programming on embedded systems. It is a subset of OpenGL that creates a low-level interface between software applications and hardware or software graphics engines.
The main problem from our point of view is that all the existing implementations of OpenGL ES rely on an operating system, in the same way that standard OpenGL implementations are strictly linked to the X Server, the Unix video server. The Vincent OpenGL ES library is an open source project that aims to provide an OpenGL ES software implementation. It provides a runtime compilation infrastructure that creates optimized code for the current graphics context settings. Currently, however, it can only run on a limited number of operating systems and processors.
Another implementation of OpenGL ES is a J2ME-based library. It provides efficient memory usage and advanced shading functionality, but it requires the Kilobyte Virtual Machine, a Java Virtual Machine designed for systems with limited resources.
Our approach is close to the OpenGL ES interface, in the sense that the developed C library offers similar transformation and shading functions. Unlike all the others, however, it does not depend on any operating system: it only needs a few system properties to be defined, such as the video memory location. Requiring an operating system is a problem for a small SoC. Being able to implement a 3D graphics application without any operating system support is very convenient, because it increases performance and reduces resource usage.
This kind of approach is also convenient during the prototyping phase of a SoC, when a system does not usually have any software support for the application development, except for a cross compiler.
The libraries that we propose were first developed and compiled under a Linux system running on a desktop machine. This phase was useful for performing preliminary tests of their behavior and functionality. The details of this general-purpose implementation are described in section II.
The libraries were then modified, as explained below, to run on a prototyped SoC based on the Coffee RISC core, which was developed by the same project group. The Milk coprocessor is included in order to provide floating-point support, which is important for the implementation of the most complex rendering functions, as described below. Since this support is provided in hardware, the implementation is particularly efficient. A description of the prototype development is given in section III.
II. LIBRARY IMPLEMENTATION
The 3D graphics functions provided by the library are split into several files, according to their functionality. One of these files handles the data type definitions. Arrays of 2, 3 and 4 elements are defined as basic types, and matrices are defined as arrays of arrays. In this way no new struct type is created; the existing types are simply renamed in order to guarantee a standardized usage by the programmer.
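As an illustration, array-based type definitions of this kind could look like the following minimal sketch. The names `vec2`, `vec3`, `vec4`, `mat4` and `mat4_identity`, and the choice of `float` as element type, are assumptions made for the example and are not taken from the library.

```c
#include <stddef.h>

/* Hypothetical type names: the text only states that arrays of 2, 3 and 4
   elements are the basic types and that matrices are arrays of arrays. */
typedef float vec2[2];
typedef float vec3[3];
typedef float vec4[4];
typedef vec4  mat4[4];   /* 4x4 matrix stored as four rows of vec4 */

/* Fill m with the identity matrix. */
static void mat4_identity(mat4 m)
{
    for (size_t r = 0; r < 4; ++r)
        for (size_t c = 0; c < 4; ++c)
            m[r][c] = (r == c) ? 1.0f : 0.0f;
}
```

Because the types are plain arrays, a `mat4` decays to a pointer when passed to a function, so functions such as `mat4_identity` modify the caller's matrix in place.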
An RGB type is also defined as a 32-bit integer; it contains the values of the three primary color components. In the general-purpose implementation each color component is defined as an 8-bit value.
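A possible encoding of such a color value is sketched below. The bit layout (red in bits 23..16, green in bits 15..8, blue in bits 7..0) and the names `rgb_t`, `rgb_pack`, `rgb_red`, `rgb_green` and `rgb_blue` are assumptions; the text only specifies a 32-bit integer holding three 8-bit components.

```c
#include <stdint.h>

typedef uint32_t rgb_t;   /* hypothetical name for the 32-bit color type */

/* Assumed layout: 0x00RRGGBB (the upper 8 bits are left unused). */
static rgb_t rgb_pack(uint8_t r, uint8_t g, uint8_t b)
{
    return ((rgb_t)r << 16) | ((rgb_t)g << 8) | (rgb_t)b;
}

/* Extract the individual 8-bit components again. */
static uint8_t rgb_red(rgb_t c)   { return (uint8_t)((c >> 16) & 0xFFu); }
static uint8_t rgb_green(rgb_t c) { return (uint8_t)((c >> 8) & 0xFFu); }
static uint8_t rgb_blue(rgb_t c)  { return (uint8_t)(c & 0xFFu); }
```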
The definitions of the graphical primitives are in a second file. At present, only one primitive is fully functional and tested: the polygon. It is a flexible object, because it is possible to modify at run time the number of vertices (and edges) of the defined polygon. This is useful because some algorithms can modify these properties during the rendering phase.
NURBS curves and surfaces can be defined as well, but they are not fully tested in the current version of the libraries. All the rendering functions are collected in a single file. This file also handles the frame buffer and the z-buffer, allowing their initialization and reset.
The frame buffer is a memory block that contains the color data of each pixel of the screen. These data are read from this memory and sent to the display at the display refresh frequency.
The z-buffer is a memory block of the same size as the frame buffer. It contains the depth value of each pixel. When a new pixel has to be drawn on the screen, the rendering function first checks the depth value previously stored for that pixel, and the pixel is updated only if the depth comparison succeeds. In this way hidden surface removal is provided. The display resolution can be set at compile time by the programmer. It is strictly dependent on the system. The number of pixels defines the size of the frame buffer and of the z-buffer. In the general-purpose implementation, the frame buffer holds a 32-bit integer value per pixel and the z-buffer a 32-bit floating-point value per pixel. If the resolution is 320x240 pixels (reasonable for a mobile device), the two buffers need 320 x 240 x 8 bytes, so the memory has to be at least 600 KBytes in size.
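The interplay of the two buffers can be sketched as follows. This is a minimal illustration under stated assumptions: a smaller depth value is assumed to mean "closer to the observer", and the names `SCREEN_W`, `SCREEN_H`, `frame_buffer`, `z_buffer`, `buffers_reset` and `put_pixel` are hypothetical.

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

/* Resolution fixed at compile time, as in the text (values from the example). */
#define SCREEN_W 320
#define SCREEN_H 240

static uint32_t frame_buffer[SCREEN_W * SCREEN_H];  /* 32-bit color per pixel */
static float    z_buffer[SCREEN_W * SCREEN_H];      /* 32-bit depth per pixel */

/* Reset both buffers: background color everywhere, depth "infinitely far". */
static void buffers_reset(uint32_t background)
{
    for (size_t i = 0; i < (size_t)SCREEN_W * SCREEN_H; ++i) {
        frame_buffer[i] = background;
        z_buffer[i] = FLT_MAX;
    }
}

/* Write a pixel only if it is closer than what is already stored.
   Assumption: smaller z means closer. Returns 1 if the pixel was written. */
static int put_pixel(int x, int y, float z, uint32_t color)
{
    if (x < 0 || x >= SCREEN_W || y < 0 || y >= SCREEN_H)
        return 0;                       /* outside the screen */
    size_t i = (size_t)y * SCREEN_W + (size_t)x;
    if (z >= z_buffer[i])
        return 0;                       /* hidden by an earlier pixel */
    z_buffer[i] = z;
    frame_buffer[i] = color;
    return 1;
}
```

With 4 bytes per entry in each buffer, the two arrays together occupy 320 x 240 x 8 = 614400 bytes, which matches the 600 KBytes quoted above.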
A particular data type named "drawing list" is used to collect the primitives defined by the programmer. All the rendering functions apply to the drawing list.
The transformation phase is the first stage of the graphics pipeline: a transformation matrix is defined (it can be a composition of rotation, translation and scaling). Each vertex of a polygon and each control point of a NURBS (curve or surface) is multiplied by this matrix. The mathematical functions that define these transformations and the matrix-vertex multiplication are collected in a separate file.
The most interesting stage of the graphics pipeline is the rendering. It is based on a function that draws a line segment between two given points, interpolating both the color and the coordinates. Each point is then scaled according to the viewport size and written to the frame buffer at the corresponding pixel position, using the z-buffer algorithm already described. The color interpolation is quite useful, because it allows the pixels to be shaded by calculating the light influence only at the end points of the segment.
This segment drawing function is used for several purposes. Besides drawing a simple straight line, it can render a NURBS curve from the points calculated with the de Boor algorithm. More importantly, it is the core of the surface rendering: a function that draws a triangle is defined, which uses a scan line algorithm to fill the triangle by means of the segment rendering function.
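The matrix-vertex multiplication can be sketched as below. The example assumes homogeneous 4-element vertices treated as column vectors and a row-major 4x4 matrix; the names `vec4`, `mat4`, `mat4_translation` and `mat4_mul_vec4` are hypothetical and only illustrate the idea.

```c
/* Hypothetical array-based types, in the spirit of the library's data types. */
typedef float vec4[4];
typedef vec4  mat4[4];

/* Build a translation matrix (column-vector convention is an assumption). */
static void mat4_translation(mat4 m, float tx, float ty, float tz)
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            m[r][c] = (r == c) ? 1.0f : 0.0f;
    m[0][3] = tx;
    m[1][3] = ty;
    m[2][3] = tz;
}

/* out = m * v. The output vector must not alias the input vector. */
static void mat4_mul_vec4(mat4 m, const vec4 v, vec4 out)
{
    for (int r = 0; r < 4; ++r)
        out[r] = m[r][0] * v[0] + m[r][1] * v[1]
               + m[r][2] * v[2] + m[r][3] * v[3];
}
```

A composed transformation would be obtained by multiplying the rotation, translation and scaling matrices once and then applying the result to every vertex, so that each vertex costs a single matrix-vertex product.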
The triangle rendering is used to draw a polygon with a generic number of edges, since a polygon can always be divided into several triangles.
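One simple way to split a polygon into triangles is a triangle fan around the first vertex, sketched below. This is only an illustration: it is valid for convex polygons, and the text does not say which decomposition the library actually uses. The name `polygon_to_fan` is hypothetical.

```c
#include <stddef.h>

/* Split a convex polygon with n vertices (indexed 0..n-1) into n-2 triangles
   sharing vertex 0. Each output row holds the three vertex indices of one
   triangle. Returns the number of triangles written (0 if n < 3).
   The caller must provide room for at least n-2 rows. */
static size_t polygon_to_fan(size_t n, size_t (*tri)[3])
{
    if (n < 3)
        return 0;
    for (size_t i = 1; i + 1 < n; ++i) {
        tri[i - 1][0] = 0;
        tri[i - 1][1] = i;
        tri[i - 1][2] = i + 1;
    }
    return n - 2;
}
```

For a pentagon this yields the triangles (0,1,2), (0,2,3) and (0,3,4), each of which can then be passed to the triangle drawing function.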
NURBS surfaces are also rendered using the triangle function, because the points on the surfaces, calculated with the de Boor algorithm, are treated as vertices of a polygon. When the segment is not aligned with the x-axis or the y-axis, the segment drawing function uses a fixed step to choose the next interpolation point. The testing phase showed that a small step slows the execution down, whereas a large step speeds up the application but lowers the quality of the rendered surfaces.
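The fixed-step interpolation can be sketched as follows. This is a simplified illustration, not the library's routine: the segment length is approximated by the larger of its x and y extents, the samples are collected in an array instead of being written to the frame buffer, and the names `sample_t` and `segment_samples` are hypothetical.

```c
#include <stddef.h>

/* One interpolated point: position plus color components (hypothetical type). */
typedef struct { float x, y, z, r, g, b; } sample_t;

static float absf(float v) { return v < 0.0f ? -v : v; }
static float lerpf(float a, float b, float t) { return a + t * (b - a); }

/* Generate the points of the segment p0-p1 using a fixed step in world units.
   Assumption: the number of intervals is derived from the larger of the
   x and y extents. Returns the number of samples written to out. */
static size_t segment_samples(const sample_t *p0, const sample_t *p1,
                              float step, sample_t *out, size_t max_out)
{
    float dx = absf(p1->x - p0->x);
    float dy = absf(p1->y - p0->y);
    float len = dx > dy ? dx : dy;
    size_t n = 1;                        /* number of intervals */
    if (step > 0.0f && len > step) {
        n = (size_t)(len / step);
        if ((float)n * step < len)
            ++n;                         /* round up */
    }
    size_t count = 0;
    for (size_t i = 0; i <= n && count < max_out; ++i) {
        float t = (float)i / (float)n;   /* 0 at p0, 1 at p1 */
        out[count].x = lerpf(p0->x, p1->x, t);
        out[count].y = lerpf(p0->y, p1->y, t);
        out[count].z = lerpf(p0->z, p1->z, t);
        out[count].r = lerpf(p0->r, p1->r, t);
        out[count].g = lerpf(p0->g, p1->g, t);
        out[count].b = lerpf(p0->b, p1->b, t);
        ++count;
    }
    return count;
}
```

Halving the step roughly doubles the number of samples, which is the speed versus quality trade-off described above.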
III. PROTOTYPE DEVELOPMENT
The prototype board for our system is based on an Altera Stratix EP1S40F780C5 device. It features 512 Kbit of on-chip memory and an external flash memory of 512 KB. An additional board makes it possible to drive a standard VGA display.
A. Hardware prototype
The prototyped SoC is based on a Coffee core connected to the Milk floating-point coprocessor. Some memory is directly interfaced to the processor and is used as instruction and data memory. A VGA controller is designed according to the board specifications and is interfaced to the flash memory.
The synthesis values are listed in the table below (tab. I).
TABLE I: SYNTHESIS RESULTS
| Logic elements | 30968 (75%) |
| Pins | 122 |
| Memory bits | 2359296 |
| DSP blocks | 24 |
| PLL | 1 |
| Operating frequency | 30 MHz |
B. Application development
The application written for the prototype is designed to test all the features of the library while fitting the prototype requirements.
For this reason, the libraries have been modified: the version running on the prototype is lighter than the general-purpose one. The on-board memory is small: it is not possible to load more than 32 KBytes of code and 32 KBytes of data. The memory used as frame buffer is a flash memory provided with the board, with a size of 512 KBytes. At a resolution of 320x240 pixels, this leaves room for only one 32-bit word per pixel (300 KBytes in total). The only possible choice was therefore to reduce the color depth from 8 bits to 5 bits per color component and to add the z-buffer data to the same word.
The z-values are coded as 16-bit integers and occupy the upper part of the 32-bit word, while the lower part contains the color data. This choice has a good outcome, because z-buffer and color data can be stored with a single store operation, which means a single assembly instruction and a single memory write cycle.
The application created with this library version draws a cubic box. Each face has a different color. Shading is performed by defining a light source in the same position as the observer.
The box is scaled, translated and rotated. The translation makes the box fit inside the projection space. The rotation is performed with an angle that changes at run time, so that the effect is visible as a movement of the cube.
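A packed pixel word of this kind could be built as in the following sketch. The exact position of the three 5-bit color fields inside the lower half-word (red in bits 14..10, green in bits 9..5, blue in bits 4..0, bit 15 unused) is an assumption, as are the names `pixel_pack`, `pixel_depth`, `pixel_color` and `pixel_store` and the convention that a smaller z is closer.

```c
#include <stdint.h>

/* Pack a 16-bit depth and three 8-bit color components into one 32-bit word.
   Assumed layout: bits 31..16 depth, 14..10 red, 9..5 green, 4..0 blue. */
static uint32_t pixel_pack(uint16_t z, uint8_t r8, uint8_t g8, uint8_t b8)
{
    uint32_t r5 = (uint32_t)(r8 >> 3);   /* reduce 8 bits to 5 bits */
    uint32_t g5 = (uint32_t)(g8 >> 3);
    uint32_t b5 = (uint32_t)(b8 >> 3);
    return ((uint32_t)z << 16) | (r5 << 10) | (g5 << 5) | b5;
}

static uint16_t pixel_depth(uint32_t p) { return (uint16_t)(p >> 16); }
static uint16_t pixel_color(uint32_t p) { return (uint16_t)(p & 0x7FFFu); }

/* Depth test and update on the combined buffer: depth and color are written
   together with one 32-bit store. Returns 1 if the pixel was written. */
static int pixel_store(volatile uint32_t *fb, uint32_t index,
                       uint16_t z, uint8_t r8, uint8_t g8, uint8_t b8)
{
    if (z >= pixel_depth(fb[index]))
        return 0;                        /* hidden: nothing to write */
    fb[index] = pixel_pack(z, r8, g8, b8);
    return 1;
}
```

Initializing every word to `0xFFFF0000` would correspond to a black background at maximum depth under these assumptions.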
After the compilation, the size of the code segment is 28KB (229152 bits) and the size of the data segment is 19KB (155034 bits).
IV. TESTING RESULTS
The demo running on the screen connected to the prototype board demonstrates the correct behavior of the implemented libraries. On the screen it is possible to see a rotating box, as expected.
The execution speed is quite low, because of the limitations posed by the board. The fact that the frame buffer is mapped on a single-port flash memory is a considerable bottleneck. Nevertheless, it is possible to perform some tests to evaluate the best configuration for the rendering algorithm. As described above, the drawing of a line segment is based on a step configured by the user at compile time. The choice of this step is quite critical.
First of all, it is necessary to consider the correspondence between the viewport and the so-called world space. The world space is the three-dimensional coordinate system in which every object to be drawn is placed. In order to simplify the work of the programmer, the world space has the same aspect ratio as the viewport window. This setting guarantees that the shape of the objects defined by the programmer is not modified after the perspective transformation and the viewport mapping.
Because of this setting, the drawing of a line parallel to the x-axis or to the y-axis is characterized by a step equal to the ratio between the world space width and the number of horizontal pixels. In our case (world space width = 2.0, number of horizontal pixels = 320) this value is 0.00625. We call it the pixel width.
In the generic case, the segment forms an arbitrary angle with the x-axis. In this case this step is too coarse: keeping the step equal to the pixel width, the rendering quality degrades and some holes can appear on the surface.
To find the best value, some tests have been performed, changing the step from 10% to 100% of the pixel width. In particular, for five of these values the number of frames per second has been evaluated together with the visual appearance of the object. The results are shown in the following graph (Fig. 1). As expected, the fps rate grows as the step increases, because fewer points are evaluated.
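The arithmetic behind the pixel width, and a step expressed as a fraction of it, can be written down as a short sketch. The function names `pixel_width` and `segment_step` and the idea of passing the fraction as a parameter are assumptions made for illustration; in the library the step is a compile-time setting.

```c
/* Pixel width: world space width divided by the number of horizontal pixels. */
static double pixel_width(double world_width, int horizontal_pixels)
{
    return world_width / (double)horizontal_pixels;
}

/* Step used by the segment drawing, as a fraction (0..1] of the pixel width. */
static double segment_step(double fraction, double world_width,
                           int horizontal_pixels)
{
    return fraction * pixel_width(world_width, horizontal_pixels);
}
```

With a world space width of 2.0 and 320 horizontal pixels, `pixel_width` gives 2.0 / 320 = 0.00625, and a fraction of 0.5 gives a step of about 0.003125.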
Fig. 1. FPS versus Proportional Step Width
The quality is more or less constant up to 50% of the pixel width, then it decreases. The following pictures (Figs. 2, 3 and 4) display three cases of interest. The first one corresponds to 10% (Fig. 2), the second to 50% (Fig. 3) and the third to 90% of the pixel width (Fig. 4). Based on the observed quality, a value equal to 50% has been chosen as the best one.
Fig. 2. 10% of pixel width
Fig. 3. 50% of pixel width
Fig. 4. 100% of pixel width
A basic 3D graphics library has been developed using the C language. It has been designed to run on embedded systems, following the OpenGL ES specifications, and it does not require any operating system to run. Its functionality has been tested first on a desktop Linux machine. The library makes it possible to instantiate some geometric primitives in a 3D space, to scale and rotate them and to perform a perspective projection into a 2D window. The z-buffer implementation guarantees hidden surface removal.
The implementation is completely based on the C language and does not require any operating system. Thus these libraries can be used with prototyped systems or small SoCs that do not support an operating system.
A simple application has been written and successfully executed by the described SoC prototyped on an FPGA board.
"OpenGL ES Common Profile Specification 2.0", Khronos Group, 2005.
C.H. Tu, B.Y. Chen, "The architecture of a J2ME-based OpenGL ES 3D library", in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 49-56, 2000.
J. Kylliainen, J. Nurmi, M. Kuulusa, "COFFEE - a core for free", in Proceedings of the International Symposium on System-on-Chip, 19-21 Nov. 2003, pp. 17-22.
C. Brunelli, F. Campi, J. Kylliainen, J. Nurmi, "A reconfigurable FPU as IP component for SoCs", in Proceedings of the 2004 International Symposium on System-on-Chip, 16-18 Nov. 2004, pp. 103-106.