Stefan Reichör, Gleichmann Electronics Research
Martina Zeinzinger, Gleichmann Electronics Research
Markus Pfaff, Fachhochschule Hagenberg
Hagenberg, Austria

Abstract
This paper shows a way to connect an FPGA-based prototyping environment with an HDL simulator. When the pure cosimulation feature is used, speedups in the range from 2 to 50 are achievable. We present a new technique that runs the design in the MHz range for selected time periods. That technique yields higher speedups (> 100). We applied our approach to a leon3 design and obtained a speedup of 130 compared to the RTL VHDL simulation. This factor allows a simulation that formerly took 2 hours to run in one minute.

Introduction
Today's developments in digital hardware are becoming more and more complex. While complexity is going up, development time is precious and tends to shrink for the sake of competitiveness.
To guarantee the functionality of such complex systems, numerous test cases have to be checked in laborious simulation runs. One way to speed up the simulations is to move some parts of the digital design to an FPGA and to run a cosimulation between a testbench written in a hardware description language (such as VHDL, Verilog or SystemC) and the design in the FPGA.
A cosimulator working on this principle yields speedups in the range from 2 to 50. The limiting factors for this kind of simulation are the testbench execution in software and the communication time needed to send data from the simulator to the FPGA and to send the response from the FPGA back to the simulator. The speedup depends on the testbench complexity and on the amount of data that is transferred from the testbench to the FPGA and back. These factors differ considerably from design to design. The test pattern throughput (in patterns per second) is therefore a more predictable figure.
We have compared many different designs and measured a pattern throughput ranging from 100 kHz up to 400 kHz.
When the FPGA is used as a pure prototyping solution, operating frequencies of up to 500 MHz are possible. The prototyping solution therefore allows a test pattern throughput that is faster than the cosimulation by a factor of about 1000 (500 MHz against a cosimulation throughput of 400 kHz is a ratio of 1250).
This paper presents a way to combine the ease of use of the cosimulation (an HDL testbench, the simulator for debugging, ...) with the unmatched performance of the prototyping solution. The starting point is a cosimulation environment. We have extended it to allow the execution of several simulation phases with clock frequencies in the 100 MHz range.
In our case the extended cosimulation system is used to speed up the simulation of a Mandelbrot set calculation and the simulation of the 32-bit processor leon3. We describe the techniques needed to make the HDL testbench compatible with the prototyping extension and to achieve the maximum simulation throughput.
Our experiments showed that we can achieve speedups > 100 in comparison to the RTL simulations. Such dramatic speedups (simulations that took some hours now run in a few minutes) are a great help for the HDL designer, allowing better test coverage and shorter development times.

System Overview
The cosimulation system consists of a simulator and a coupled hardware device. The hardware device is split into the so-called I/O Manager and the Device Under Test (DUT).
Figure 2 shows these three components. The DUT can either be an FPGA that holds the design to be accelerated or an arbitrary system with a digital interface (e.g. a CPU, a card with a PCI interface, ...).
The purpose of the system is to embed the DUT into a running simulation in the simulator. A part of the simulator and the I/O Manager are responsible for incorporating the DUT as a simulation model into the simulation.
Figure 1 shows the PCI extension card that implements the hardware part of the cosimulation system. The screenshot in figure 3 shows the Mentor Modelsim simulator running a cosimulation. More information about the cosimulation system can be found at [wph].
Figure 1: The cosimulator hardware
The extended cosimulation system allows two modes of operation:
- Mode 1: Cosimulation - This is the default operation mode of the cosimulator. The I/O Manager can send requests to the DUT at every clock cycle, and all responses from the DUT are sent back to the I/O Manager.
- Mode 2: Clock Acceleration - This mode is the extension of the cosimulator that exploits the high clock frequencies of the FPGA prototyping solution. The simulation runs faster, because the I/O Manager emits only the high speed FPGA clock (DUTClk) in that phase. The benefit is a shorter simulation time. The drawback is that no stimuli or response data can be exchanged between the simulator and the DUT in that phase.
The extended cosimulation allows seamless switching between the two modes during a simulation. That is the crucial point of this feature.
Figure 2 shows mode 1. In that mode, the simulator can specify a request for every clock cycle. That request is sent to the DUT via the I/O Manager. After every clock cycle a response is calculated and sent back to the simulator by the I/O Manager. After every clock cycle the simulator can decide whether the simulation continues in mode 1 or in mode 2.
Figure 2: Structural view for mode 1: Cosimulation
Figure 3: A screenshot of the running VHDL cosimulation
Figure 4 shows mode 2. In this phase the simulator sends the number of clock cycles to apply (NumOfClks). When the I/O Manager has received that information, it starts to issue clock cycles. The clock cycles in this mode have a higher frequency than in mode 1 (hence the name Clock Acceleration). Optionally, a breakpoint configuration can be specified. The DUTClk is stopped as soon as one of the following conditions is met:
- The specified number of clocks has been sent.
- The condition specified by the breakpoint configuration is met. This condition describes a state that is observed by the I/O Manager during the acceleration phase.
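The stop rule for mode 2 can be sketched as a small behavioural model. This is an illustration only, not the I/O Manager implementation: the function run_clock_acceleration, the step callback and the breakpoint predicate are hypothetical names, and only NumOfClks and the two stop conditions are taken from the text.

```python
from typing import Callable, Optional, Tuple


def run_clock_acceleration(
    num_of_clks: int,
    step: Callable[[], int],
    breakpoint_hit: Optional[Callable[[int], bool]] = None,
) -> Tuple[int, str]:
    """Hypothetical model of one mode 2 phase.

    step() stands in for one DUTClk cycle and returns the DUT state that
    the I/O Manager observes. Returns (cycles issued, stop reason).
    """
    issued = 0
    while issued < num_of_clks:
        state = step()  # issue one fast clock cycle
        issued += 1
        # Optional breakpoint: stop early when the observed state matches.
        if breakpoint_hit is not None and breakpoint_hit(state):
            return issued, "breakpoint"
    return issued, "num_of_clks"


def make_counter() -> Callable[[], int]:
    """Toy DUT (assumption): a counter that increments on every clock cycle."""
    count = 0

    def step() -> int:
        nonlocal count
        count += 1
        return count

    return step


# Without a breakpoint the phase ends after NumOfClks cycles.
print(run_clock_acceleration(1000, make_counter()))
# With a breakpoint on counter value 42 the phase ends early.
print(run_clock_acceleration(1000, make_counter(), lambda s: s == 42))
```

In both cases the model returns control to the caller, which mirrors the point at which the simulator regains control and can choose the next mode.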
Figure 4: Structural view for mode 2: Clock Acceleration
After the DUTClk has stopped, the I/O Manager sends the response data to the simulator. The simulator can then decide whether the simulation continues in mode 1 or in mode 2.

Timing behaviour
Figure 5 shows that mode 1 and mode 2 can alternate as often as needed during a simulation.
Figure 5: Timing view of a cosimulation that exploits Clock Acceleration
This timing behaviour makes it easy to switch from simulation-controlled hardware to emulated hardware.
Normally a clock generation statement like the one in listing 1 is used in VHDL.
Listing 1: VHDL clock generation
-- Normally used clock generation scheme
Clk <= not Clk after 10 ns;
To use the “clock acceleration” feature, a foreign procedure called hac_clk is provided. That procedure takes two parameters:
- num_clocks: The number of clock cycles.
- hw_delay_count: A delay counter that can be used to slow down the emitted clock. If the value is 0, the DUT is clocked directly with the PCI clock. For larger values, the emitted clock frequency is fPCI / (2 * hw_delay_count).
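As a worked example of the formula, the sketch below computes the emitted clock frequency for a few values of hw_delay_count. The 33 MHz PCI clock is an assumption (the bus frequency is not stated here), and accel_clock_hz is a hypothetical helper, not part of the HAC interface.

```python
F_PCI_HZ = 33_000_000  # assumed nominal PCI clock; not stated in the paper


def accel_clock_hz(hw_delay_count: int, f_pci_hz: float = F_PCI_HZ) -> float:
    """Emitted DUTClk frequency for a given hw_delay_count (hypothetical helper)."""
    if hw_delay_count < 0:
        raise ValueError("hw_delay_count must not be negative")
    if hw_delay_count == 0:
        return f_pci_hz  # 0 selects the PCI clock itself
    return f_pci_hz / (2 * hw_delay_count)  # fPCI / (2 * value)


for value in (0, 1, 2, 10):
    print(value, accel_clock_hz(value) / 1e6, "MHz")
```

Under the assumed 33 MHz bus clock, hw_delay_count = 1 already halves the emitted clock to 16.5 MHz, so the parameter gives a coarse way to slow the DUT clock when a design cannot meet timing at the full PCI frequency.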
Listing 2 shows how to use the “clock acceleration” feature in a testbench. The “clock acceleration” is used only for the specified time frame. Otherwise regular clock cycles are generated in the simulation and sent to the DUT.
- Whenever hac_clk is used, there is no communication between the HDL simulator and the DUT. The DUT is clocked by a high frequency clock generator, so the simulation progresses very quickly.
- When the clock is generated via signal assignments, the cosimulation interface is used. The simulation then runs in the 100 kHz range. In that mode the HDL testbench and the DUT communicate with each other.
The use of the hac_clk procedure allows the implementation of a design specific clock generation scheme.
Listing 2: Sample process for clock acceleration
-- Declaration of hac_clk as a foreign procedure
procedure hac_clk(num_clocks : in integer; hw_delay_count : in integer) is
begin
  -- This body only runs if the foreign implementation was not bound
  assert false report
    "ERROR: foreign subprogram not called" severity note;
end procedure hac_clk;
attribute foreign of hac_clk : procedure
  is "hac_clk Hac_Vsim_Interface.dll";

-- Use the accelerated clock if (now > 1 us) and (now < 1000 us);
-- otherwise explicitly toggle the clock signal as before
clk_gen: process is
  variable num_clks : integer := 10000;
begin
  if (now > 1 us) and (now < 1000 us) then
    -- Emit num_clks fast clock cycles; hw_delay_count = 0 is an assumed value
    hac_clk(num_clks, 0);
    -- Advance simulation time by the equivalent of num_clks 10 ns periods
    wait for num_clks * 10 ns;
  else
    wait for 5 ns;
    clk <= '1';
    wait for 5 ns;
    clk <= '0';
  end if;
end process clk_gen;

Benchmark & Conclusion
We ran several tests with the clock acceleration extension and achieved considerable speedups in comparison with both the RTL simulation and the cosimulation. Table 1 shows the results for a hardware design that calculates a 256x256 image of the Mandelbrot set with 10 iterations and for the leon3 processor calculating the prime numbers up to 1000.
Table 1: Benchmarks comparing RTL simulation, cosimulation and clock acceleration
| Design | Sim time: RTL | Speedup: Cosimulation | Speedup: Clock Acceleration | Sim freq.: Clock Acceleration |
| Mandelbrot | 419.29 s | 14.26 | 566.61 | 14488.22 kHz |
| Leon3 | 52.25 s | 37.02 | 137.50 | 317.15 kHz |
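The speedups in table 1 translate into run times by dividing the RTL simulation time by the speedup. The sketch below does this arithmetic for both designs. The input figures are taken from table 1; the derived times are computed here and are not reported measurements.

```python
# (design, RTL simulation time in s, cosimulation speedup, clock acceleration speedup)
TABLE_1 = [
    ("Mandelbrot", 419.29, 14.26, 566.61),
    ("Leon3", 52.25, 37.02, 137.50),
]


def derived_time(rtl_seconds: float, speedup: float) -> float:
    """Run time implied by a speedup relative to the RTL simulation."""
    return rtl_seconds / speedup


for design, rtl, cosim, accel in TABLE_1:
    print(
        f"{design}: cosimulation {derived_time(rtl, cosim):.2f} s, "
        f"clock acceleration {derived_time(rtl, accel):.2f} s"
    )
```

With these figures the Mandelbrot run drops from about 7 minutes to under a second, and the leon3 run from 52 s to roughly 0.4 s.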
Our tests showed that it is quite easy to exploit the clock acceleration feature of our cosimulation extension for computation-intensive tasks that are mainly clock driven.
Fortunately, this is true for many designs that include microprocessors, so we see a broad application range for our new technique.
The FPGA for the accelerated design is an Altera Stratix EP2S180 device. That FPGA can hold designs of up to 1.8 million ASIC gates. We see a strong need to support larger designs, e.g. for ASIC prototyping. Therefore we are working on an extension board that can hold 4 EP2S180 devices. That system will provide an emulation / cosimulation solution for designs of up to 7.2 million ASIC gates.
Additionally, we have specified a mechanism that allows cosimulation/emulation with any kind of digital hardware (such as ASICs, microcontrollers, CPUs or even complete boards with a digital interface, e.g. a PCI board). That functionality extends the normal HDL simulation capabilities, because it allows cosimulation with designs for which no simulation model is available (e.g. ARM processors).
Currently we are working on a cosimulation board that holds a large FPGA device, a large memory device and Ethernet interfaces. That board will allow the cosimulation/emulation of, for example, a Linux system running on a leon3 processor. The planned system will combine full speed execution capabilities with a simulator coupling for debugging purposes.

References
[Lip96] J. Lipman. Chip hardware and software: Why can't they just get along? EDN, 1996.
[Pfa99] Markus Pfaff. Verfahren zur beschleunigten Systemsimulation mit VHDL durch Integration von externen Hardware/Software-Komponenten. Dissertation, Johannes Kepler Universität Linz, Linz/Austria, October 1999.
[Rei04] Stefan Reichör. Entwurf und Implementierung einer HW/SW-Cosimulationsumgebung mit Schwerpunkt auf der Einbindung von interaktiven User-Interfaces. Dissertation, Johannes Kepler Universität Linz, Linz/Austria, July 2004.
[Rei05] Stefan Reichör. Simulationsbeschleunigung durch Cosimulation und Hardware-in-the-loop. Mentor Graphics User Conference 2005, 2005.
[Row94] J. Rowson. Hardware/Software Cosimulation. Proceedings of the 31st Design Automation Conference, pages 439-440, 1994.
[RZP05] Stefan Reichör, Martina Zeinzinger and Markus Pfaff. Speed Up the Digital Design Development by Means of Using the Hardware Accelerator and Cosimulator (HAC). FH Science Day, Oberösterreich, pages 67-73, 2005.
[wph] HAC2 product homepage. http://www.ger-fae.com/HAC_2.html
[Zei04] Martina Zeinzinger. Realisierung eines kostengünstigen stark beschleunigenden Hardware-Cosimulators. Diploma thesis, Fachhochschul-Diplomstudiengang Hardware/Software Systems Engineering, Hagenberg, July 2004.