By R. Selvakumar Rajagopal & Muhammad Mun’im Ahmad Zabidi
Universiti Teknologi Malaysia (UTM)
Abstract
DLX is a free, open source microprocessor that has never been implemented in a commercial ASIC (Application Specific Integrated Circuit) design. The objective of this project is to implement the DLX microprocessor with a Wishbone bus interface for use in a SoC (System-on-Chip) design. The DLX CPU was originally designed by Hennessy and Patterson as a vehicle for teaching principles of computer architecture. It is a simple reduced instruction set computer (RISC), very similar to the MIPS processor, which was one of the first commercially available RISC CPUs. The designers describe it as the distilled essence of the CPU.
The Wishbone System-on-Chip (SoC) Interconnection Architecture is a flexible design methodology for use with semiconductor IP cores. Its purpose is to foster design reuse by alleviating System-on-Chip integration problems. In this project, the DLX core is synthesized, simulated and implemented in an FPGA. To integrate the CPU core with the Wishbone SoC bus, the signals of the processor have to be Wishbone compatible. The DLX processor is therefore modified to support the cycle and strobe signals, which are common external signals in master-slave communication. The CPU core acts as a master on the Wishbone bus, while a memory is attached at the slave interface, as shown in Figure 1. The normal operation of the CPU, which is fetching instructions from memory and executing them, is tested.
The DLX is described in VHDL, while the Wishbone module is in Verilog HDL. Quartus II software is used to synthesize, compile and verify the functionality of the CPU and Wishbone by simulation and timing analysis. The partial SoC system is implemented on an Altera APEX20KE200 FPGA board. NIOS, which is the core processor on the FPGA board, is used as an intermediate processor that communicates with the DLX and the rest of the system via the Avalon bus protocol to verify the operability and functionality of the system in a real hardware environment.
I. INTRODUCTION
Objective
Licensing an IP core for an ASIC design costs a lot of money. An open source IP used in an ASIC implementation requires no license, so no fee has to be paid to anyone, although fabricating the ASIC itself is of course not free of charge. The main objective of the project is to implement the DLX microprocessor with a Wishbone bus interface. These cores are intended to be used for a smart camera SoC. Both cores are totally free and available for any use. At the end of this project, the design can be integrated with any Wishbone compatible IP cores. By adopting a standard interconnection scheme, other IP cores can be integrated with the DLX CPU more quickly and easily by the end user. The main reason for using open source IP cores is their perceived usefulness for industry: no high license costs, no royalties, freedom to modify the core at will, long term supply and maintenance, portability and simplified prototyping.
II. PREVIOUS DLX PROJECTS
Case Study of DLX Computer System
The DLX microprocessor was developed as a case study by Peter J. Ashenden in the book “The Designer’s Guide to VHDL”, Second Edition. The DLX is described there in both behavioral and RTL styles. The VHDL description starts with the data path entities and their behavioral architecture bodies. The data path entities are then used to construct the register-transfer-level architecture body of the CPU, and finally the behavioral architecture of the controller that sequences the data path operations is described. The difference between that DLX CPU and the one implemented in this project is the interrupt handling. In the case study, there is no interrupt handling capability and the interrupt control register is not implemented. In addition, since the model is written to run with a test bench on a simulator, the VHDL code is not synthesizable. The CPU model in the book is tested with a written test bench. Since the function of the CPU is to execute a machine language program held in memory, the CPU is tested by including a memory in the test bench. The memory is preloaded with a small program, and the ports of the CPU are monitored to verify the fetching and execution of the program. A clock generator is also included in the test bench to drive the clock and reset ports of the CPU. The register-transfer-level model of the CPU is tested using the same test bench that is used to test the behavioral model.
Fig. 1. DLX and Wishbone Integration
The ASPIDA Project
The ASPIDA project has implemented an asynchronous IP of the DLX Instruction Set Architecture (ISA) with built-in support for ISA conversion, so that it can easily be converted to any RISC ISA. The DLX architecture is well supported by existing software development tools (compiler, assembler, loader, instruction set simulator and debugger). The single-pipeline architecture that is standard for basic synchronous DLX implementations is identical to the architecture of the asynchronous version.
A suitable Open IP interface (Wishbone) is embedded in the processor to enable it to be integrated into any Open IP SoC system. In addition, the ASPIDA project issues a new Open IP interface standard based on asynchronous technology (CHAIN), and support for this new interface is also embedded in the processor core. A design flow that relies as much as possible on existing EDA tools for all design steps, and which is part of the background technology brought in by the partners, has been used to produce a portable netlist and to distribute all the intermediate HDL files used for high-level and gate-level design.
The final product is technology-independent and timing-independent, and it comes in a form suitable for integration using only standard industrial tools and flows, so potential end users need neither asynchronous tools nor specific knowledge of asynchronous design. The ASPIDA DLX microprocessor is pipelined, and its main highlight is its implementation as an asynchronous circuit. Both features make the ASPIDA design very complex, hence it was not adopted in this project. In addition, the ASPIDA DLX was implemented in a Xilinx FPGA, whereas an Altera FPGA is used in this project.
III. THE DLX MICROPROCESSOR
A few open source processors were evaluated in order to choose the right processor to implement in this project. Table 1 shows a simple comparison between the three open source processors that were evaluated.
TABLE 1. COMPARISON OF OPEN SOURCE PROCESSORS
| DLX | Leon | OpenRISC |
| Non-windowed register | Windowed register | Windowed register |
| Described by Hennessy and Patterson in “Computer Architecture, a Quantitative Approach” and by Peter J. Ashenden in “The Designer’s Guide to VHDL”, 2nd edition. | Based on the SPARC processor, which was designed in the MIPS era. Documentation is similarly based on the SPARC processor. | New, less documentation. |
From the comparison, the DLX microprocessor is chosen as the core CPU for this project. The main reasons are that it is free and that its architecture is less complex than those of the other two open source processors, which use a windowed register architecture. A windowed register architecture uses more than 32 general purpose registers, sometimes up to 256, but only 32 of them are visible to the user at any one time. To use the other registers, the user has to ‘slide’ a pointer up or down to select the window that is currently visible. The advantage is that no stack is needed when calling subroutines, as all information can be kept in registers. The disadvantage is that more decoding circuitry is required to implement the window function, which makes the design more complex than one using a non-windowed register file.
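The window mechanism described above can be illustrated with a minimal Python sketch. The class name, the 256-register physical file and the 16-register slide step are assumptions chosen for illustration only; they are not taken from the Leon or OpenRISC designs.

```python
class WindowedRegisterFile:
    """Toy model: a large physical file of which 32 registers are visible at a time."""

    def __init__(self, physical=256, visible=32, step=16):
        self.regs = [0] * physical
        self.visible = visible
        self.step = step      # assumed slide distance per subroutine call
        self.base = 0         # window pointer into the physical file

    def _index(self, r):
        if not 0 <= r < self.visible:
            raise IndexError("register outside the visible window")
        return (self.base + r) % len(self.regs)

    def read(self, r):
        return self.regs[self._index(r)]

    def write(self, r, value):
        self.regs[self._index(r)] = value & 0xFFFFFFFF

    def slide_up(self):
        self.base = (self.base + self.step) % len(self.regs)

    def slide_down(self):
        self.base = (self.base - self.step) % len(self.regs)


rf = WindowedRegisterFile()
rf.write(20, 123)          # caller leaves an argument in an upper register
rf.slide_up()              # subroutine call: the window moves by 16 registers
callee_view = rf.read(4)   # the same physical register is now visible as r4
rf.slide_down()            # return: the caller's window is restored
```

The extra address arithmetic in `_index` corresponds to the additional decoding logic that the text names as the cost of windowing.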
Today the “arithmetic” organ of von Neumann is called the data path, as shown in Figure 2. It consists of execution units, such as the arithmetic logic unit (ALU) or shifters, the registers, and the communication paths between them. The data path contains most of the state of the processor, that is, the information that must be saved for a program to be suspended and then restored for execution to continue. In addition to the user-visible general purpose registers, the state includes the program counter (PC), the interrupt address register (IAR), the interrupt control register (ICR) and so forth.
The processor uses three buses: S1, S2, and Destination. The fundamental operation of the data path is reading operands from the register file, operating on them in the ALU, and then storing the result back. Since the register file does not need to be read and written on every cycle, most designers follow the advice of making the frequent case fast by breaking this sequence into multiple clock cycles and making the clock cycle shorter. Thus, in this architecture there are two latches on the outputs of the register file (called A and B) and a latch on its input (called C).
The register file contains the 32 general purpose registers of DLX. Register 0 of the register file always has the value 0, matching the definition of register 0 in the DLX instruction set. The program counter (PC), interrupt address register (IAR) and interrupt control register (ICR) are also part of the state of the machine. There are also registers, not part of the state, that are used in the execution of instructions: the memory address register (MAR), memory data register (MDR), instruction register (IR) and temporary register (TBR). The TBR is a scratch register that gives the control unit temporary storage while it performs some DLX instructions. The only path from the S1 and S2 buses to the destination bus is through the ALU.
Fig. 2. The DLX Architecture
The DLX, like most RISC CPUs, has a relatively large number of general purpose registers. These are all shown in figure 3-2. Registers r1 to r31 are general purpose registers that may be used to hold integers or any other 32-bit value. Register r0 is special in that it always has the value 0. Any value written into this register is discarded. The remaining registers have special purposes and are not used to store operands. The program counter (PC) holds the memory address of the next instruction to be executed by the CPU. Each DLX instruction is represented in one 32-bit (four-byte) word. The DLX requires that instructions be aligned at addresses that are a multiple of four. Hence the PC value must always be a multiple of four, and it is incremented by four after each instruction is fetched. Registers TBR, MAR, MDR, IAR and ICR are special registers that serve functions such as a processor status word, exception control signals and so on.
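The two rules stated above, that r0 always reads as 0 and that the PC advances in steps of four, can be captured in a small behavioral model. This is a minimal sketch in Python, not the VHDL used in the project; the class and method names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

class DlxState:
    """Behavioral model of the register file and PC only (illustrative)."""

    def __init__(self):
        self.regs = [0] * 32
        self.pc = 0

    def write_reg(self, n, value):
        if n != 0:                        # writes to r0 are discarded
            self.regs[n] = value & MASK32

    def read_reg(self, n):
        return 0 if n == 0 else self.regs[n]

    def fetch_address(self):
        """Return the address of the next instruction and advance the PC by four."""
        if self.pc % 4 != 0:
            raise ValueError("PC must be a multiple of four")
        addr = self.pc
        self.pc = (self.pc + 4) & MASK32
        return addr


state = DlxState()
state.write_reg(0, 99)              # has no effect
state.write_reg(5, 0x100000001)     # truncated to 32 bits
first = state.fetch_address()
second = state.fetch_address()
```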
The operation of the DLX processor consists of four stages: instruction fetch, instruction decode, execute and memory/write back. During the fetch stage, an instruction stored in memory is loaded into the instruction register. During the decode stage, the instruction in the instruction register is sent to the controller. Based on the instruction, the controller determines the type of instruction (J, I or R), the operation to be performed (from the opcode), the addresses of the source and destination registers, and the immediate type for the operation (16-bit or 26-bit), and finally asserts all the appropriate signals so that the desired operation is performed correctly. The addresses for register file operations come from the control unit. The ALU provides the only means of sending data to the destination bus from either the source1 bus or the source2 bus. Once the fetch stage is completed, the controller checks the opcode to determine the operation.
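The field extraction performed in the decode stage can be sketched as follows. The bit positions assume the standard DLX layout from Hennessy and Patterson (a 6-bit opcode in the top bits, 5-bit register fields, an 11-bit function field for R-type, a 16-bit immediate for I-type and a 26-bit offset for J-type). The function names are hypothetical, and the format is passed in explicitly rather than derived from the opcode, since the opcode-to-format mapping of this particular core is not given in the text.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(word, fmt):
    """Split a 32-bit instruction word into its fields for format 'R', 'I' or 'J'."""
    opcode = (word >> 26) & 0x3F
    if fmt == "R":
        return {"opcode": opcode,
                "rs1": (word >> 21) & 0x1F,
                "rs2": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F,
                "func": word & 0x7FF}
    if fmt == "I":
        return {"opcode": opcode,
                "rs1": (word >> 21) & 0x1F,
                "rd": (word >> 16) & 0x1F,
                "imm": sign_extend(word & 0xFFFF, 16)}
    if fmt == "J":
        return {"opcode": opcode,
                "offset": sign_extend(word & 0x3FFFFFF, 26)}
    raise ValueError("unknown instruction format")


# Opcode and function values below are arbitrary placeholders.
word_i = (0x08 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
word_r = (3 << 21) | (4 << 16) | (5 << 11) | 0x20
word_j = (0x02 << 26) | 0x3FFFFF8
fields_i = decode(word_i, "I")
fields_r = decode(word_r, "R")
fields_j = decode(word_j, "J")
```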
Depending on the operation, data is placed on the internal buses from the IR (immediate value), from the register file (register value), or from the memory data register if the data is required directly from memory. Once the data is placed on the respective buses, the ALU operates on it as required, and the result is finally stored in the register file, or in memory if it is a store operation. Each instruction requires different signals to be asserted from the controller. An add instruction, for example, uses data from the register file, which is placed on the internal buses through the A and B registers. The data is then added when the ALU operation takes place, and during the write back stage the final value is written into the register file at the address specified in the instruction itself. The sequence is slightly different for an add with an immediate operand: only the A register places data on the Source1 bus, and the immediate value to be added is placed on the Source2 bus by the instruction register.
IV. THE WISHBONE SoC BUS
The Wishbone System-on-Chip (SoC) Interconnection Architecture is a flexible design methodology for use with semiconductor IP cores. Its purpose is to foster design reuse by alleviating System-on-Chip integration problems. This is accomplished by creating a common interface between IP cores, which improves the portability and reliability of the system and results in faster time-to-market for the end user. The Wishbone standard is not copyrighted and is in the public domain. It can be used for the design and production of integrated circuit components without royalty or other financial obligation. Figure 3 shows a simple application of the Wishbone SoC bus involving master-slave communication. The CPU acts as the bus master, while the memory and the DMA core are the Wishbone slaves. The CPU accesses the slaves one at a time, and permission to use slave resources is granted only by the arbiter. The arbiter acts like a traffic light controller in the system and determines the data transfer protocol between the cores inside the system. Slaves also need to access the CPU for resource usage, and which slave accesses the CPU at a given time is determined by the arbiter. The arbiter uses priority based scheduling, and if the priorities are equal, a round robin scheduling method is applied. Wishbone defines only a single bus, which addresses almost every need. A system that consists of devices with different speeds can include two Wishbone interfaces: one for high performance blocks and another for low performance peripherals.
Fig. 3. Structure of Wishbone Bus
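The arbitration policy described in this section, priority first with round robin among equal priorities, can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the Verilog arbiter used in the project: the class name is hypothetical, a larger number is assumed to mean a higher priority, and the round robin order is assumed to follow the requester index.

```python
class Arbiter:
    """Priority arbiter with a round robin tie-break (illustrative sketch)."""

    def __init__(self, priorities):
        self.priorities = list(priorities)  # one entry per requester
        self.last = -1                      # index of the most recent grant

    def grant(self, requests):
        """Return the index of the requester that wins the bus, or None."""
        active = [i for i, req in enumerate(requests) if req]
        if not active:
            return None
        top = max(self.priorities[i] for i in active)
        tied = [i for i in active if self.priorities[i] == top]
        n = len(self.priorities)
        # Among equal priorities, pick the first requester after the last grant.
        winner = min(tied, key=lambda i: (i - self.last - 1) % n)
        self.last = winner
        return winner


arb = Arbiter([1, 1, 2])
g1 = arb.grant([True, True, False])   # equal priorities: requester 0 goes first
g2 = arb.grant([True, True, False])   # round robin: requester 1 is next
g3 = arb.grant([True, True, True])    # requester 2 has the higher priority
```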
Wishbone appears to be the simplest among the three buses that have been reviewed. Compared to AMBA, which defines three different buses for peripherals with different speeds, Wishbone defines only one single bus, a high speed bus. With several buses, the work becomes harder when devices of different speed and size have to be interconnected, and bridges might be required to build a complete system. With Wishbone, all cores connect to the same standard interface. A system designer may choose to implement two Wishbone interfaces in a microcontroller core, one for high speed, low latency devices and one for low speed, low performance devices. Wishbone signaling appears to be very intuitive and should be easily adapted to other interfaces when needed.
V. DESIGN METHODOLOGY
Processor Core Synthesis
The DLX processor source code as obtained is not synthesizable. The earlier DLX project used the processor core for study purposes, and the processor was tested successfully on a simulator. Since the final target of this project is to have the processor in an ASIC, the core has to be made synthesizable so that a circuit can be created according to the RTL (Register Transfer Level) description. The simulated processor uses a two-phase non-overlapping clock for the control unit, while no clock is used at all for the data path. The data path uses a generic time value, declared in the entity, to time the data transitions. Normally, in an RTL circuit every register needs a clock and latches its input data on a clock transition, but in this processor model a generic delay is used as the reference for latching data at a specified time. Figure 4 shows the design flow for this project. The core is made synthesizable by introducing a clock signal for every register in the architecture and for the control unit, designing a proper clock distribution in the data path, determining the interconnection of the submodules of the processor, and finally mapping them together to form the top level entity of the processor.
Submodule and Processor Testing
Once the processor is synthesized, each module of the processor entity is tested independently to verify its functionality. The main modules tested are the instruction register (IR), the interrupt control register (ICR), and combinational logic circuits such as the arithmetic and logic unit (ALU).
Fig. 4. Design Flow
All the individual modules and registers are tested before the main processor integration to simplify the simulation once all the modules are integrated. It is better to detect errors at the register level than to trace them while testing the whole processor entity. For example, in the ALU, two latches were originally placed before the adder and the barrel shifter so that data is latched before being processed by the adder or shifter. This increases the delay before the output data can be loaded onto the destination bus, leaving insufficient time for the C register to capture the correct result. By removing the two latches completely, the delay for data to pass through the combinational logic is reduced, and the C register captures the correct result just before the data on the destination bus changes.
All simulations are timing simulations, which give a good picture of the performance of the system and reveal any setup and hold violations. The control unit is also tested to verify that all control signals to the data path are asserted appropriately. Once the module testing is accomplished, the modules are integrated according to the original design to test the processor. To speed up the verification of the processor’s functionality, the control signals from the control unit and the buses from the data path are wired out. In this way, each step of the instruction execution can be monitored, so that any failure in the data path unit can be detected. Checking the state transitions for one instruction execution gives good assurance that the opcode is correct and that the CPU is executing based on the correct opcode.
Memory Integration
A memory is interfaced with the processor to provide storage for data and for the instructions to be executed. A random access memory (RAM) from Altera’s component library (the lpm_ram_dp model) is chosen as the data storage for the DLX, which the processor can store data to and load data from. All the instructions to be executed are stored in the memory at specific locations. The CPU fetches the instruction stored in memory at the address given on the address bus, determines the opcode and finally executes the instruction. A memory with 11 address bits and 32 data bits is integrated with the processor. The address size is limited here to support the computer’s RAM capability. Although the processor generates signals that a memory controller could use to select the data length (word, half word or byte), the Altera memory has no such controller and does not support transfers of different lengths, so all data transfers between memory and processor in this testing are word transfers.
Processor Integration with Wishbone
The processor and the Wishbone module are tested independently. To integrate the two, the processor must be modified again to meet the Wishbone signal requirements. The cycle and strobe signals are mandatory for any master core on the Wishbone bus, while the error and retry signals are optional. Thus, the cycle and strobe signals are implemented for the DLX processor. These signals are only necessary when the processor needs to communicate with slave devices. Internal operations, such as register-to-register operations and ALU operations, do not require these signals to be asserted.
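The role of the cycle and strobe signals in a single master-to-slave transfer can be sketched with a clock-by-clock Python model. This is a simplified illustration of a classic Wishbone read, not the project's VHDL or Verilog: the function name, the dictionary used as slave memory and the `slave_latency` parameter are assumptions. The three values in each trace entry correspond to the master's cycle (CYC_O) and strobe (STB_O) outputs and the slave's acknowledge (ACK_I).

```python
def wishbone_read(memory, address, slave_latency=1):
    """Model one read transfer; return (data, trace of (cyc, stb, ack) per clock)."""
    trace = []
    cyc = stb = True                 # master starts the bus cycle
    wait = slave_latency             # clocks until the slave has data ready
    data = None
    while cyc:
        ack = (wait == 0)            # slave acknowledges once its data is ready
        trace.append((cyc, stb, ack))
        if ack:
            data = memory[address]   # master captures the data with the acknowledge
            cyc = stb = False        # and negates cycle and strobe afterwards
        else:
            wait -= 1
    trace.append((cyc, stb, False))  # idle clock after the transfer
    return data, trace


slave_memory = {0x10: 0xDEADBEEF}    # hypothetical slave contents
value, signals = wishbone_read(slave_memory, 0x10)
```

With a latency of one clock, the trace shows one wait clock, one acknowledged clock and one idle clock in which both master signals are negated.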
Load, store and instruction fetch are the three operations that provide the only means of data transfer with other devices. The cycle and strobe signals are therefore asserted during an instruction fetch and negated during the instruction-decode stage, once the acknowledgement from the slave device is received. The same applies to load and store instructions, where both signals are asserted during the memory state and negated during the second state of the load or store, once the acknowledgement is received from the slave device. Every load and store state, including those for word, half word and byte transfers, follows this protocol. Once the processor is Wishbone compatible, it is connected to the master interface of the Wishbone bus module. For testing purposes, a memory is connected at the slave interface to provide the instructions to be executed and to serve as data storage for the DLX CPU.
VI. DESIGN IMPLEMENTATION
DLX Performance
The performance of the DLX processor is like that of most RISC CPUs. Since the DLX is a simplified version of the MIPS processor, its performance is similar to that of its predecessor. A move between a general purpose register and a special register, in either direction, requires only one clock cycle, while conditional and unconditional branches require four clock cycles to complete. Normal ALU operations, namely logical and arithmetic operations, require only two clock cycles: one to execute in the ALU and another to write back into the register file. The register test instructions require three clock cycles, while load and store instructions require five clock cycles to complete, including the memory and write back cycles. A jump instruction requires two clock cycles, while a jump and link register instruction requires three; the extra clock cycle is needed to save the previous PC in the register file. Interrupt and trap instructions execute in three clock cycles, while the return from interrupt instruction needs two clock cycles.
NIOS-DLX Interface
Before the DLX can be tested in the FPGA, an interface is built to enable communication between the NIOS and DLX processors. The NIOS processor is a soft-core processor embedded in the FPGA development kit. The use of NIOS is necessary in the hardware testing because only NIOS is capable of communicating with the computer through serial communication. Result verification in hardware is important in this project, so NIOS is used as an intermediate processor between the DLX system and the computer. The main job of NIOS here is to control the DLX input signals, such as the clock and reset signals, and to fetch the result after each instruction executed by the DLX processor so that it can be displayed on the computer via the serial link. NIOS provides only a few signals that can be used by user interface logic, in this case the DLX system. The main signals provided to the Avalon slave interface are clock, chipselect, reset, address, readdata and writedata. The interface is shown in Figure 5.
Controls for DLX in FPGA
To simplify the interface operation and the C program, only one DLX signal is controlled by NIOS, the reset_to_DLX signal. NIOS resets the DLX using the writedata signal, and reset_to_DLX takes its value from writedata when the reset/control address (0x00000430) is selected. NIOS asserts the LSB of writedata for a few clock periods and then negates it, which starts the DLX operation. NIOS asserts only the LSB, since a single bit is enough to reset the DLX system. Other DLX inputs, including interrupt and halt, are disabled to simplify the testing and to reduce the complexity of writing instructions manually. Interrupt handling needs a special subroutine, the interrupt service subroutine, and when an interrupt occurs the CPU has to transfer operation to the interrupt state. Once NIOS has reset the DLX, the DLX starts executing the instructions that are included in the memory as a .mif file. After each execution, the DLX sends a status to the NIOS processor to indicate that the required data is ready. The status_from_DLX signal is used for this purpose, and it is active when the status address (0x00000434) is selected. A single status bit is sufficient only for a result that needs to be displayed once. Since the test program includes ten sample instructions, ten status bits are needed to inform the NIOS processor after each instruction is executed.
Fig. 5. NIOS-DLX Interface
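The reset and status polling sequence can be sketched as a small software model. The sketch is written in Python for illustration, whereas the project uses a C program on NIOS. The control and status addresses are those given in the text; everything else is an assumption made to keep the example self-contained: the `FakeDlxSlave` stand-in, the `RESULT_BASE` address, and the idea that one status bit is set per finished instruction in program order.

```python
CONTROL_ADDR = 0x00000430   # reset / control address (from the text)
STATUS_ADDR = 0x00000434    # status address (from the text)
RESULT_BASE = 0x00000440    # hypothetical: the result address is not given in the text


class FakeDlxSlave:
    """Illustrative stand-in for the DLX system behind the Avalon slave port."""

    def __init__(self, results):
        self.results = list(results)
        self.in_reset = True
        self.done = 0                        # number of instructions finished

    def write(self, address, data):
        if address == CONTROL_ADDR:
            self.in_reset = bool(data & 1)   # only the LSB is used for reset
            if self.in_reset:
                self.done = 0

    def tick(self):
        # Assume one instruction completes per tick once reset is released.
        if not self.in_reset and self.done < len(self.results):
            self.done += 1

    def read(self, address):
        if address == STATUS_ADDR:
            return (1 << self.done) - 1      # one status bit per finished instruction
        return self.results[(address - RESULT_BASE) // 4]


def run_test(dlx, count):
    """Reset the DLX, then collect one result per status bit as it goes high."""
    for _ in range(3):
        dlx.write(CONTROL_ADDR, 1)           # hold reset for a few clock periods
        dlx.tick()
    dlx.write(CONTROL_ADDR, 0)               # negate reset: execution begins
    seen = []
    while len(seen) < count:
        dlx.tick()
        status = dlx.read(STATUS_ADDR)
        while len(seen) < count and (status >> len(seen)) & 1:
            seen.append(dlx.read(RESULT_BASE + 4 * len(seen)))
    return seen


expected = list(range(10, 20))               # placeholder results for ten instructions
collected = run_test(FakeDlxSlave(expected), 10)
```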
NIOS reads the status from the readdata signal at the address specified for status (0x00000434). When a status bit is high, NIOS fetches the data from the readdata signal, sends it to the UART and displays it in the NIOS SDK Shell through serial communication. The status signal for each instruction is unique because it is asserted only once, when the predefined result occurs. The source of each status bit is therefore different, which is why a separate status signal is used for each instruction: it identifies the instruction that is currently being executed.
Operating Frequency for DLX
The maximum operating frequency of NIOS is 33.33 MHz, while timing analysis shows that the DLX can run at no more than about half of the frequency provided on the FPGA board. The DLX cannot run at 33.33 MHz on the APEX board, but it can run at that frequency on a Cyclone II. The difference arises because Cyclone II uses 90 nanometer technology while APEX uses 150, 180 and 210 nanometer technologies, and the process technology influences the speed of the devices in the FPGA. As a result, a clock divider is introduced. The clock divider used is provided in Altera’s megafunction library.
NIOS runs at its maximum frequency as usual, while the DLX is made to run at 16.66 MHz, half of the NIOS operating frequency. The in_clock comes from the global clock of the FPGA board (crystal) with an input frequency of 33.33 MHz; out_clock0 goes to the DLX system, while out_clock1 goes to the NIOS system. Out_clock0 divides the original frequency of 33.33 MHz down to 16.66 MHz for DLX operation, while out_clock1 provides the original frequency for NIOS operation. The problem of the clock skew being larger than the data delay is overcome by using the PLL clock divider.
The DLX and Wishbone Implementation on FPGA
The DLX processor is verified in real hardware by comparing the results obtained in simulation with the results displayed on the computer from the hardware computation. Obtaining the same results in both simulation and hardware testing proves the functionality of the processor in a real hardware implementation. The DLX operation in hardware is verified with two different .mif files that are preloaded into the memory module inside the DLX system before system compilation and place and route. The first is a sample of 10 instructions, and the second is a DLX application program, the bubble-sort algorithm. The DLX executes the instructions defined in memory by the .mif file, and each execution result is fetched by NIOS and displayed on the PC screen.
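For the bubble-sort test program, the expected result can also be computed independently in software and compared with the values read back from hardware. The following Python reference is a minimal sketch of that idea; it is not the DLX assembly program used in the project, and the function names and sample data are hypothetical.

```python
def bubble_sort(values):
    """Return a sorted copy of values using the bubble-sort algorithm."""
    data = list(values)
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:          # already sorted: stop early
            break
    return data


def matches_reference(inputs, hardware_output):
    """True if the values reported by hardware equal the software reference."""
    return bubble_sort(inputs) == list(hardware_output)


sample = [5, 1, 4, 2, 8]         # placeholder data, not the project's test vector
reference = bubble_sort(sample)
```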
Finally, the Wishbone bus system is implemented. The DLX and the memory communicate using a bus protocol that is determined by the Wishbone bus controller, or arbiter, and the communication is master-slave communication rather than point-to-point communication. In this implementation of the DLX with the Wishbone SoC bus, only one master-slave communication exists, between the DLX core and the memory. Figure 6.6 shows how the DLX system is implemented in the FPGA. Although the Wishbone module can support up to 16 slaves, only one slave is tested first to verify the bus functionality. In addition, a bigger memory size is needed when more slave devices are added to the bus. In the communication between the processor and the memory, the processor acts as the bus master while the memory acts as the bus slave. If there is more than one master, access to the slave is determined by priority, and if priorities are equal, a round robin scheduling scheme is applied. If more than one slave is added, 32-bit addressing is needed, as slaves are selected by the four most significant bits.
ACKNOWLEDGMENTS
In preparing this paper, I was in contact with many people, researchers and academicians. They have contributed to my understanding and thoughts. In particular, I wish to express my sincere appreciation to Assoc. Professor Muhammad Mun’im Ahmad Zabidi for his encouragement, guidance, criticism and friendship. I would also like to express my gratitude to Dr. Peter J. Ashenden, a researcher from Adelaide University, for providing me with ample information regarding my project. Without their continued support and interest, this work would not have been the same as presented here. Finally, special thanks go to family members, friends and all others who contributed their constructive opinions and assistance in completing this paper.
REFERENCES
Leeser, Miller & Yu, Smart Camera Based on Reconfigurable Hardware Enables Diverse Real-time Applications, Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), 2004.
Peter J. Ashenden, The Designer’s Guide to VHDL, Second Edition, Morgan Kaufmann Publishers, 2004.
John L. Hennessy & David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990.
NIOS Embedded Processor System Development
Asynchronous DLX Demo at ASYNC 2004 Conference