D32PRO, scalable & royalty free 32-bit CPU

By Digital Core Design

INTRODUCTION

* 32-bit, deeply embedded, royalty free

D32PRO is one of the newest 32-bit CPUs available on the market. It was designed by Digital Core Design (DCD), an IP core provider and SoC design house from Poland, known among other things for the world’s fastest and the world’s smallest 8051 CPUs. Since 1999 DCD has launched more than 70 different architectures, which have been implemented in more than 300,000,000 electronic devices, so considerable experience stands behind the D32PRO.

The D32PRO is a deeply embedded, royalty-free 32-bit CPU. It is a silicon-proven solution (UMC 110 nm, 1.2 V, 150 MHz, 128 kB code) that consumes as little as 7 µW/MHz (90LP process, minimal configuration) and occupies under 10.6 k gates (0.029 mm²).

It is equipped with a floating-point coprocessor and a wide variety of peripherals, such as USB, Ethernet, I2C, SPI, UART, CAN, LIN, RTC, HDLC and Smart Card. That makes the D32PRO an attractive alternative to the well-known ARM Cortex-M0/M0+/M1/M3 and other 32-bit CPUs.

The D32PRO is fully configurable: depending on project requirements, the IP core’s size can be significantly decreased. Combined with an advanced Power Management Unit, this makes the core well suited to embedded and IoT projects. On the other hand, when a project requires the highest computing power, performance can be increased up to 1.48 DMIPS/MHz. That is why the D32PRO is a universal 32-bit CPU that can be deployed in Bluetooth Low Energy, wearables and Internet of Things projects alike. Every IP core is prepared by DCD’s engineers according to the customer’s needs, so along with the D32PRO license the customer gets a silicon-proven IP core that is ready to be integrated with the other blocks of the SoC.

FEATURES

* variable pipeline, debugger, bootloader, ultimate code density

The D32PRO has been designed as a universal and fully configurable CPU, and is therefore suitable for a wide variety of target applications. This would not be possible without an original RISC architecture based on DCD’s know-how. Enhancements pioneered in the world’s fastest 8051 were especially important for the D32PRO’s development; among them, balancing the maximum clock frequency against data-path delays seems to be the most significant. As a result, DCD’s 32-bit CPU copes very well with code containing many jumps, but also (or rather, most of all) it executes homogeneous code, such as arithmetic operations, smoothly. None of that would be possible without the variable pipeline architecture.


Other innovations can be found in the D32PRO’s instruction set. It is built around special instructions that map closely onto higher-level languages such as C. The result is higher code density and a more compact instruction set. Examples include:

Instructions that compare two character strings
Instructions that find the first set bit (the first 1) in a register

As an example, we can take the FZB instruction, which optimizes byte searching. On standard 8-bit CPUs, searching for the NULL character meant iterating over every byte, one at a time.
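
The original listing for this step is not reproduced here. A minimal sketch of such a byte-by-byte search, written in C with a function name chosen only for illustration, could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Byte-by-byte search for the terminating NUL character.
 * Every byte costs one load, one compare and one branch.
 * strlen_bytewise is an illustrative name, not part of any DCD library. */
size_t strlen_bytewise(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

For a string of n characters the loop body runs n times, which is exactly the cost that the wider searches described next try to reduce.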

On 32-bit CPUs a further optimization was introduced: four bytes held in a register can be checked at the same time.
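
One widely used way to test four bytes at once is the following bit trick. This is a sketch of the general technique, not necessarily the code the article originally showed, and the function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic bit trick: the expression is nonzero if and only if at least
 * one of the four bytes of w is 0x00. Returns 1 in that case, else 0. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Index of the first 32-bit word that contains a zero byte,
 * or n if no word in the buffer does. */
size_t first_word_with_zero(const uint32_t *words, size_t n)
{
    assert(words != NULL);
    for (size_t i = 0; i < n; i++) {
        if (has_zero_byte(words[i])) {
            return i;
        }
    }
    return n;
}
```

The test tells us only that a zero byte is present in the word; locating which of the four bytes it is still takes a few extra instructions on a conventional CPU.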

But the real acceleration comes with the FZB instruction implemented in the D32PRO, which searches the register with just one single-cycle instruction.
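
The article does not give the operands or result format of FZB, so the following is only a software model under an assumed behaviour: the operation returns the index of the first zero byte in a 32-bit word (0 is the least significant byte), or 4 if there is none. The names fzb_model and strlen_words are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a "find zero byte" operation.
 * Assumption: returns the index (0 = least significant byte) of the
 * first zero byte in w, or 4 if w contains no zero byte.
 * The real FZB semantics may differ. */
unsigned fzb_model(uint32_t w)
{
    for (unsigned i = 0; i < 4; i++) {
        if (((w >> (8 * i)) & 0xFFu) == 0) {
            return i;
        }
    }
    return 4;
}

/* String length over little-endian packed words, one fzb_model call
 * per word. The buffer must contain a zero byte somewhere. */
size_t strlen_words(const uint32_t *words)
{
    assert(words != NULL);
    size_t i = 0;
    for (;;) {
        unsigned idx = fzb_model(words[i]);
        if (idx < 4) {
            return 4 * i + idx;
        }
        i++;
    }
}
```

In this model the inner loop stands in for what the hardware would do in a single instruction, so the search needs one operation per word instead of one per byte.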

The D32PRO is equipped with 13 general-purpose registers, R0-R12, and most of them are refreshed (restored) automatically on return from an interrupt. Thanks to this, the CPU significantly accelerates interrupt handling and context switching in real-time systems. Any of these registers can be used for arithmetic operations, which is a notable advantage over the classic 8051, where only ACC could be used for that; there is no need to reload or save its content each time. The D32PRO is also equipped with a configurable interrupt controller, which supports 1 non-maskable interrupt and up to 32 maskable interrupts. Interrupt detection is fully configurable: the engineer can choose edge or level triggering and can also configure interrupt priorities.

Modern 32-bit microcontrollers should be designed with special attention to power management. That is why the D32PRO is equipped with a PMU that controls the clock frequency dynamically, so the CPU can save significant energy when the maximum frequency is not needed. The engineer can also configure the CPU clock divider and the peripherals’ clock divider separately, so it is possible to put only the CPU into a low-power mode while all the peripherals keep running from the nominal clock. The CPU itself can also be put into stop mode, in which the clock is completely detached from the CPU; the return to normal mode can be triggered by an interrupt from any of the peripherals. The CPU can switch off unused peripherals as well, which significantly reduces power consumption.
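
To make the clocking options concrete, here is a small model of the behaviour described above. It is only a sketch: the structure, field names and functions are hypothetical and do not describe DCD’s actual PMU registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the clocking options; none of these names
 * come from the D32PRO documentation. */
typedef struct {
    uint32_t nominal_hz;   /* nominal system clock */
    uint32_t cpu_div;      /* CPU clock divider, must be >= 1 */
    uint32_t periph_div;   /* peripheral clock divider, must be >= 1 */
    bool     cpu_stopped;  /* stop mode: clock detached from the CPU */
} pmu_model_t;

/* Effective CPU clock: zero in stop mode, otherwise nominal / divider. */
uint32_t pmu_cpu_hz(const pmu_model_t *p)
{
    assert(p != NULL && p->cpu_div >= 1);
    return p->cpu_stopped ? 0u : p->nominal_hz / p->cpu_div;
}

/* Effective peripheral clock: unaffected by the CPU divider or stop mode. */
uint32_t pmu_periph_hz(const pmu_model_t *p)
{
    assert(p != NULL && p->periph_div >= 1);
    return p->nominal_hz / p->periph_div;
}
```

With a 150 MHz nominal clock, a CPU divider of 4 and a peripheral divider of 1, the model gives a 37.5 MHz CPU clock while the peripherals stay at 150 MHz; setting cpu_stopped drops the CPU clock to zero and leaves the peripherals running, so one of them can raise the wake-up interrupt.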

Last but not least, there are the debugger and the bootloader, the engineer’s best friends during IP core implementation. DCD has experience in 8051 hardware debuggers: the DoCD™ hardware debugger was named an EDN Hot Product of 2013. It is no wonder, then, that the D32PRO is equipped with a hardware debugger, which gives full CPU control from within Eclipse (complete tool chain: Eclipse, GCC => USB 2.0 cable => D32PRO). A clear advantage of DCD’s hardware debugger is that it needs only two lines for communication, whereas other solutions are based on the JTAG interface, which typically needs 5 pins. The bootloader, in turn, loads the firmware into program memory from an external FLASH memory connected through the SPI interface. On top of that, the D32PRO’s bootloader is equipped with a hardware scrambler with the key stored in non-volatile memory, which effectively protects the firmware against reverse engineering.


MODERN 32-BIT CPUs & THEIR EFFICIENCY

* DMIPS/MHz vs tricks & cheats

It is no secret that we are all taking part in a race called “electronics”: faster, better, smarter. The race has intensified in recent years, especially when we look at the benchmark results in datasheets. Even simple 8-bit or 16-bit CPUs claim incomprehensible performance. No wonder, then, that results of 2 or 3 DMIPS/MHz for 32-bit CPUs are more and more common. But are they real? Is a CPU rated at, say, 2.5 DMIPS/MHz really that powerful in real applications? Not necessarily…

The D32PRO’s performance measured with the Dhrystone benchmark is 1.48 DMIPS/MHz. That seems pretty good for a 32-bit CPU, but if we look at other 32-bit CPUs (let’s call them A, B and C), their datasheets claim:

CPU A: 1.55 DMIPS/MHz
CPU B: 1.77 DMIPS/MHz
CPU C: 2.81 DMIPS/MHz
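
For readers who want to relate these figures to raw benchmark output, the usual conversion is sketched below. One DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 reference score); the function names are illustrative, and applying the 1.48 DMIPS/MHz figure at the 150 MHz clock quoted earlier is an assumption made purely for the arithmetic.

```c
#include <assert.h>

/* Dhrystones per second achieved by the VAX 11/780 reference machine. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score (iterations per second) to DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize a raw Dhrystone score to DMIPS/MHz. */
double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips_from_dhrystones(dhrystones_per_sec) / clock_mhz;
}

/* Absolute DMIPS for a given DMIPS/MHz rating and clock frequency. */
double dmips_at_clock(double rating_dmips_per_mhz, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return rating_dmips_per_mhz * clock_mhz;
}
```

Under that assumption, dmips_at_clock(1.48, 150.0) gives about 222 DMIPS. The per-MHz figure is what datasheets compare, because it removes the clock frequency from the picture.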

Let’s dig deeper into CPU A. What do we find there? An extended number of pipeline stages is easy to notice, and it certainly enables higher performance. But this can be achieved only if the code is homogeneous (e.g. arithmetic operations). Once jumps or interrupts are added, the overall performance does not look as good. Of course, one can always use branch prediction circuits or a cache, but these require more circuitry and consequently make the CPU physically larger, more expensive and more power-hungry. So it is better to be careful when we see extremely high Dhrystone results while the size of the CPU is surprisingly small (e.g. 1.55 DMIPS/MHz with 10k gates). Higher performance goes hand in hand with a bigger CPU and vice versa: a smaller CPU means lower performance.

But let’s go further and take a closer look at “CPU C”, where we find… 16-, 24- and 32-bit instructions. It is no secret that an engineer who wants to write a 32-bit constant to a register needs at least two instructions. 24-bit instructions also complicate the instruction fetch unit, which means higher power consumption and area. So it is better to compare the benchmark results presented in datasheets (obtained under the “bigger, better, faster” imperative) against real performance in real applications.

The use of popular compilers that support link-time optimization completely undermines the idea of the dhry21 benchmark. Take GCC for CPU A, CPU B or CPU C, for example, where addition and multiplication have been completely removed from the final code. Yet the dhry21 documentation clearly states that the speed of these operations influences the DMIPS result. Moreover, the latest change in the dhry21 documentation was forced by overly aggressive optimization of the kind we saw not long ago:

As we can see from the above, the 2.5-3 DMIPS/MHz results found in various datasheets have nothing in common with the latest DMIPS specification. If we used that kind of methodology, we could get 4 or even 6 DMIPS/MHz for the D32PRO. But where is the sense in that?

As mentioned above, addition and multiplication have been removed from the final code. Why would anyone do that? The answer is simple: in results above 2 DMIPS/MHz, the compiler optimizes the DHRY code by performing the additions and multiplications at compile time and placing the results in the code as constants. Reality is different, because in the real DHRY, and above all in real applications, multiplication is performed by dedicated CPU instructions. Moreover, in a test prepared this way function calls are not executed, because the body of each function is placed directly in the main loop (inlined). This eliminates the CALL/RET instructions, which usually take 2 or 3 cycles each. Inlining also enables “extra optimization”, because some operations overlap and are merged together. The final result is a much higher DMIPS score. The question (and not the only one) is whether such a synthetic result is of any use in real life. The answer is quite obvious.
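
The effect is easy to reproduce outside Dhrystone. The following C sketch is not taken from the benchmark; the function names are illustrative and the exact code generated depends on the compiler and its options.

```c
#include <assert.h>
#include <stdint.h>

/* Small helper that an optimizing compiler will typically inline. */
static int32_t mul_add(int32_t a, int32_t b, int32_t c)
{
    return a * b + c;
}

/* All operands are compile-time constants, so the compiler is free to
 * replace the whole call with the constant 17: no multiply, no add and
 * no CALL/RET need to be executed at run time. */
int32_t folded(void)
{
    return mul_add(3, 5, 2);
}

/* Reading the operands through volatile objects forces real loads,
 * so the multiply and the add have to be performed at run time. */
int32_t not_folded(void)
{
    volatile int32_t a = 3, b = 5, c = 2;
    return mul_add(a, b, c);
}

/* Both variants compute the same value; only the generated code differs. */
void check_equivalence(void)
{
    assert(folded() == not_folded());
}
```

Comparing the assembly of folded() and not_folded() at a high optimization level shows the difference: the first usually collapses to loading a constant, while the second keeps the arithmetic. A benchmark loop built like the first measures the compiler rather than the CPU.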

SUMMARY

* built-in peripherals vs fully scalable?

The D32PRO is a silicon-proven, deeply embedded 32-bit CPU. It is designed as an IP core tailored to the project’s needs. It is fully scalable, with floating-point support (single-precision IEEE-754 instructions) and a wide variety of peripherals (with drivers) on board. DCD’s latest 32-bit CPU is a technology-independent IP core, so it can be implemented in both ASIC and FPGA. The primary target technology is ASIC, however, as the D32PRO is silicon proven (UMC 110 nm, 1.2 V, 150 MHz, 128 kB code). Moreover, the IP core is royalty-free, so the whole solution is much more “price friendly”: the licensee pays a single license fee and may then produce chips without any per-chip payment. Digital Core Design offers 3 months of free comprehensive technical support, which can be extended under commercial conditions.

As has been said, the D32PRO is a universal and fully scalable 32-bit CPU, so the target applications are countless. If we had to point out the most significant ones, they would be:

IoT (Internet of Things)
Smart Cities
Smart Grid
BLE (Bluetooth Low Energy)
Medical Devices
Embedded Electronics
Smart Electronics & Wearables etc.

But these are only the tip of the iceberg…

The D32PRO offers the engineer full configurability: groups of instructions can easily be switched off, so the total area of the CPU can be significantly reduced. The engineer can also switch off unused peripherals or, on the other hand, use all of them, because the D32PRO includes e.g. USB 2.0, Ethernet MAC, CAN, LIN, I2C, SPI, CF & SD Card etc.

And if that is still not enough, the D32PRO can be delivered along with a complete FPGA evaluation board, which can significantly speed up testing and validation.

The D32PRO is an all-in-one solution. Need we say more?

More information at www.dcd.pl
