Michal Jedrak, Evatronix S.A.
This article presents aspects of building a USB 3.0 application with a low-performance 8-bit microcontroller, taking an 8051 derivative as an example. It first gives a technical overview of USB technology and its performance. The following chapters discuss example architectures and the target applications based on them. Besides the architectural view, the software side and its performance are discussed. To complete the picture, silicon area demands and power consumption are also presented.
USB 3.0, also called SuperSpeed USB, is the new incarnation of the USB standard that brings 5 Gbps connectivity into the daily life of millions of computer users around the world. Its throughput is awaited by many applications prevailing on the market, such as solid-state drives, high-definition entertainment devices and mobile phones. Many vendors also want to move to USB 3.0 purely for marketing reasons, without a real technical need for the speed.
This emerging technology raises a number of questions on the system side. What kind of processor is needed? Is a high-end 32-bit CPU the natural choice? Is there an alternative? This next-generation USB is being integrated into more and more designs, and many engineers face these challenges. The next few chapters discuss USB 3.0 applications built around an 8-bit microcontroller.
SuperSpeed USB offers 5 Gbps connectivity, which translates into about 3.6 Gbps of actual data transfer. Moreover, compared to USB 2.0, USB 3.0 allows simultaneous communication in both directions, so the bandwidth is effectively doubled, as independent traffic can flow into and out of the device. Data packets can even be sent in bursts without intermediate acknowledgements. These features raise additional concerns.
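The relation between the signalling rate and the usable data rate can be sketched with a little arithmetic. The example below assumes only that the 5 Gbps link uses 8b/10b line coding, which leaves 4 Gbps for packets; the 3.6 Gbps figure quoted above is then what remains after protocol overhead. The function names are illustrative.

```c
#include <stdint.h>

/* Payload bit rate of a link that uses 8b/10b line coding:
 * every 10 bits on the wire carry 8 bits of data. */
static uint32_t payload_mbps_8b10b(uint32_t line_rate_mbps)
{
    return (line_rate_mbps / 10u) * 8u;
}

/* Convert megabits per second to megabytes per second. */
static uint32_t mbps_to_mbytes(uint32_t mbps)
{
    return mbps / 8u;
}
```

With these helpers, 5000 Mbps on the wire corresponds to 4000 Mbps (500 MB/s) after line coding, and the quoted 3600 Mbps corresponds to 450 MB/s in each direction.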
Besides bidirectional traffic, the new USB introduces streams, a logical extension of bulk endpoints. With streams, data going back and forth can additionally be routed to or from the appropriate sink or source.
Dual transmission lines and bulk streams require dedicated data paths that can supply data at high speed, usually independently of the host CPU, so that the CPU does not slow down the transmission.
Another point comes from the USB architecture itself: since its invention, it has placed most of the intelligence on the side of the host system, which is usually a PC, laptop or other system with significant processing power.
One more element of USB 3.0 cannot be neglected in a technical introduction: power management. This feature is strongly driven by the rapidly growing mobile application market, where power saving and battery life are extremely important.
From this perspective, some devices require a powerful data channel able to transfer large amounts of data at high speed, yet the processor supervising the system does not need to be that powerful. This job can be done by a relatively low-performance microprocessor such as the 8051 while still achieving satisfactory performance in a USB 3.0 application. The 8051 referenced throughout this document is the Evatronix R8051XC2, a highly optimized derivative of the Intel 8051 that is up to 12 times faster than the original. The term low-performance is still used because its performance is relatively low compared to today's 32-bit RISC engines.
What can and cannot be done
The basic question is which applications can work with USB 3.0 and use a low-performance microcontroller. It will certainly not be an application that needs a lot of data processing power, where the processor must perform highly complicated data operations and USB connectivity is an additional task to handle.
A low-end CPU can be useful in applications where its role is limited to supervising data traffic, USB enumeration and some housekeeping operations. Data processing in such an application should be done by completely separate logic. This is acceptable, since in many applications the heavy data traffic is sourced by or fed into a separate, specialized unit that could hardly be replaced by any processor.
This concept can be realized using different approaches; two of them are presented below.
The first example architecture uses a USB controller with direct data ports, a kind of stream port that forwards what it receives to or from the USB side. The USB controller manages traffic on its own: after receiving a specific portion of data at its input stream port it can execute the USB packet send sequence, and conversely, after receiving a USB packet it can forward it to its output stream port.
The other solution uses a smart first-party DMA engine, that is, a DMA that is aware of the USB protocol and follows the scatter/gather service scheme. In this scheme, data is routed between a group of memory locations and a data source/sink stream according to descriptors defined in system memory. Because the DMA is aware of USB packets, it can automatically divide a data stream into USB packets or assemble a stream from them. The descriptor table must be prepared by the system processor. This approach has drawbacks: it may require a memory buffer large enough to accommodate incoming and outgoing data, and with highly fragmented memory the service overhead can cause bottlenecks, so the managing software must take care of memory management. On the other hand, scatter/gather is arguably the best way to deal with memory fragmentation, so a delicate balance is needed here.
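To make the processor's role in the scatter/gather scheme concrete, here is a minimal sketch of how firmware might fill a descriptor table. The descriptor layout (`sg_desc_t`) and the helper `sg_build` are hypothetical; a real DMA engine defines its own descriptor format, and nothing here is taken from the controller described in this article.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout; a real DMA engine defines its own. */
typedef struct {
    uint32_t addr;  /* start address of the memory fragment       */
    uint16_t len;   /* fragment length in bytes                    */
    uint8_t  last;  /* 1 marks the final descriptor of a transfer  */
} sg_desc_t;

/* Describe a transfer of `total` bytes starting at `base` as a chain of
 * fragments of at most `frag_size` bytes each. Returns the number of
 * descriptors written, or 0 if the arguments are invalid or the table
 * is too small. */
static size_t sg_build(sg_desc_t *tbl, size_t max_desc,
                       uint32_t base, uint32_t total, uint16_t frag_size)
{
    size_t n = 0;

    if (total == 0u || frag_size == 0u) {
        return 0;
    }
    while (total > 0u) {
        uint16_t chunk = (total > frag_size) ? frag_size : (uint16_t)total;

        if (n == max_desc) {
            return 0;               /* table exhausted */
        }
        tbl[n].addr = base;
        tbl[n].len  = chunk;
        tbl[n].last = (uint8_t)(total == chunk);
        base  += chunk;
        total -= chunk;
        n++;
    }
    return n;
}
```

The CPU only writes a few bytes per fragment and never touches the payload. The cost grows with the number of fragments, which is why heavy fragmentation translates directly into descriptor-handling overhead.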
A general example application is a system that already provides most of its functionality and only needs USB connectivity, without going into the details of the protocol. It contains a black box with a source/sink port to the master system on one side and a USB port on the other. Since the processor supervising the USB protocol in such an application only takes care of smooth transmission and enumeration, the task is well within reach of a small microcontroller like the R8051XC2. Moreover, as the black box handles all USB elements, the user does not need to deal with USB 'magic', and is spared time-consuming study of the specification and the pain associated with introducing a new technology into a design.
A specific example of the first architecture proposed in the previous chapter, the one with stream ports, is the HD camera presented below.
In such an application, dedicated logic handles the image-processing path independently of USB connection management and housekeeping operations. A separate DSP can interface to the CCD chip, process data at a very high rate and feed it directly into the SuperSpeed USB controller. The microprocessor just supervises USB transmissions and other internal operations. This processor therefore does not need to be powerful, which saves silicon area and power.
The example of the DMA-based application is the USB-to-Ethernet bridge outlined below.
In this case the CPU needs to control the data flow from one end to the other, yet it never touches the data, as the transfers are performed by the dedicated first-party DMA. This is possible because Ethernet frames are transferred to the USB host encapsulated within USB packets. Ethernet frames are usually bigger than USB packets, so transfers need to be buffered and split into smaller chunks that fit into the packets transmitted over the USB bus. In the opposite direction, packets arriving over the USB bus need to be combined into Ethernet frames and sent over the network. This data manipulation is done by the protocol-aware DMA. Despite the DMA's intelligence, it is the processor that must direct the DMA controller to transfer a given portion of frame data by defining and maintaining the appropriate descriptor tables. Although the processor is engaged not only in supervising USB commands but also in directing DMA transfers, an 8-bit 8051 derivative can still do the job.
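The splitting of a frame into packets can be illustrated with a short calculation. The sketch below assumes a SuperSpeed bulk endpoint with a 1024-byte maximum packet size and a plain encapsulation with no extra per-frame header. The encapsulation format of the bridge is not specified here, so the header-free layout is an assumption, and the names are illustrative.

```c
#include <stdint.h>

/* Maximum packet size of a SuperSpeed bulk endpoint, in bytes. */
#define SS_BULK_MAX_PACKET 1024u

/* Number of USB packets needed to carry one frame of `frame_len` bytes. */
static uint16_t usb_packets_for_frame(uint16_t frame_len)
{
    return (uint16_t)((frame_len + SS_BULK_MAX_PACKET - 1u) / SS_BULK_MAX_PACKET);
}

/* Length of the final packet of that frame (0 for an empty frame). */
static uint16_t usb_last_packet_len(uint16_t frame_len)
{
    uint16_t rem = (uint16_t)(frame_len % SS_BULK_MAX_PACKET);

    if (rem == 0u && frame_len != 0u) {
        return (uint16_t)SS_BULK_MAX_PACKET;
    }
    return rem;
}
```

Under these assumptions, a maximum-size untagged Ethernet frame of 1518 bytes is carried as one full 1024-byte packet followed by a 494-byte packet, while a minimum-size 64-byte frame fits into a single short packet.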
An application quite similar to the one just described is mass storage, which was used in the software optimization stage described in the next chapter.
An example of an application that could probably not be implemented effectively with an 8-bit processor is mass storage based on NAND Flash memory. It is not recommended, because an SSD requires a lot of NAND Flash optimization and management, such as a Flash Translation Layer, bad block management, garbage collection, write amplification reduction and other elements whose discussion is beyond the scope of this article. Of course, this does not mean that a NAND Flash application would not work on an 8-bit processor. It is just a matter of performance: if the application employs only a single memory device, the memory's inherent characteristics limit the application's capability, and in that case even an 8051 derivative could do the job very effectively.
A derivative of the second example, a RAM-based mass storage device, was used as the application for the software discussion. RAM-based storage was selected because it removes one variable from the complicated equation of software performance analysis.
The software of a USB application can be divided into layers, each providing a separate set of functionalities.
The application code is the main application that coordinates the operation of the USB and mass storage sides. Each side handles its specific functionality, so there is the USB software stack with USB Mass Storage Class services on one side and the storage device driver on the other.
The hardware abstraction layer isolates the hardware from the core functions, allowing them to be implemented in the most generic and hardware-agnostic way; it also eases porting a given solution to different controllers.
In most cases such software targets 32-bit machines, so it does not work efficiently when compiled for an 8-bit CPU.
For example, a standard macro implementation of a multi-bit swap operation looks as follows:
It proved highly inefficient in an 8-bit environment and was replaced by the following implementation:
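As a hedged illustration of this kind of replacement (a sketch, not the original code), consider a 32-bit byte swap. The usual macro is built from wide shifts and masks, which suits a 32-bit machine. An 8051 core rotates its accumulator one bit at a time, so depending on the compiler, wide shifts on 32-bit values can expand into long instruction sequences. A function that simply moves bytes avoids shifting altogether. The names `SWAP32_MACRO` and `swap32_bytes` are hypothetical.

```c
#include <stdint.h>

/* Typical shift-and-mask swap as written for a 32-bit CPU. */
#define SWAP32_MACRO(x)                               \
    ((((uint32_t)(x) & 0x000000FFUL) << 24) |         \
     (((uint32_t)(x) & 0x0000FF00UL) <<  8) |         \
     (((uint32_t)(x) & 0x00FF0000UL) >>  8) |         \
     (((uint32_t)(x) & 0xFF000000UL) >> 24))

/* Byte-wise alternative: reverse the four bytes in memory without any
 * shifts. Reversing the storage order swaps the value's byte order
 * regardless of the endianness of the machine. */
static uint32_t swap32_bytes(uint32_t v)
{
    union {
        uint32_t w;
        uint8_t  b[4];
    } in, out;

    in.w = v;
    out.b[0] = in.b[3];
    out.b[1] = in.b[2];
    out.b[2] = in.b[1];
    out.b[3] = in.b[0];
    return out.w;
}
```

Both forms return the same value for any input. On an 8-bit core the second form amounts to four byte moves plus a function call.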
Although it requires an additional function call instead of in-line code execution, it sped up the operation by a factor of about 15 to 20. In terms of throughput, it raised the baseline of 7 MB/s to 30 MB/s.
Further optimization focused on program flow. The most frequently executed cases in switch statements were moved to the top so that they are checked first, saving the time required to find the correct case. Besides these switch statement modifications, key code elements were monitored in hardware to measure their execution time; based on these observations the code was modified to execute more efficiently and therefore in less time. This was a tedious job, but it gave an additional speed-up of 10 MB/s, reaching 40 MB/s, which is already in the throughput range of High-Speed USB.
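The switch reordering can be sketched as follows. The example assumes that the mass storage firmware dispatches on SCSI opcodes and that READ(10) and WRITE(10) dominate the traffic; the handler identifiers are hypothetical. Whether case order affects speed depends on the compiler: it matters when the switch is compiled into a chain of compare-and-branch instructions, and not when a jump table is generated.

```c
#include <stdint.h>

/* Standard SCSI operation codes used by mass storage devices. */
enum {
    SCSI_TEST_UNIT_READY  = 0x00,
    SCSI_REQUEST_SENSE    = 0x03,
    SCSI_INQUIRY          = 0x12,
    SCSI_READ_CAPACITY_10 = 0x25,
    SCSI_READ_10          = 0x28,
    SCSI_WRITE_10         = 0x2A
};

/* Hypothetical handler identifiers for this sketch. */
typedef enum {
    H_READ, H_WRITE, H_STATUS, H_SETUP, H_UNSUPPORTED
} handler_t;

/* Cases are listed by expected frequency, not by opcode value, so the
 * data path commands are matched after the fewest comparisons. */
static handler_t dispatch(uint8_t opcode)
{
    switch (opcode) {
    case SCSI_READ_10:            /* hot path: bulk reads  */
        return H_READ;
    case SCSI_WRITE_10:           /* hot path: bulk writes */
        return H_WRITE;
    case SCSI_TEST_UNIT_READY:    /* periodic polling      */
    case SCSI_REQUEST_SENSE:
        return H_STATUS;
    case SCSI_INQUIRY:            /* rare, mostly at enumeration */
    case SCSI_READ_CAPACITY_10:
        return H_SETUP;
    default:
        return H_UNSUPPORTED;
    }
}
```

The result of the dispatch is the same in any order; only the number of comparisons needed to reach the frequent cases changes.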
Part of the optimization was also done in hardware: to facilitate a throughput increase, we modified the DMA working scheme. This gave a dramatic boost in throughput, to the level of 100 MB/s. Another back-door and rather unfair way of improving hardware operation was to increase the processor clock from 50 MHz to 75 MHz, which gave 150 MB/s of throughput.
Additional optimization of the interrupt execution flow, concerning the most frequently met conditional checks such as commands (CBW, CSW) and data transmissions, brought a further gain, to the level of 170 MB/s. For a clock frequency of 50 MHz the throughput gain is even higher, as performance could be around 140 MB/s.
This is still not the last word: there is room for further hardware tweaks dedicated to better suit the given application, which in this case is mass storage.
Silicon area requirements and power consumption
The example application presented at the beginning of the software chapter, the mass storage device, was targeted at both FPGA and ASIC technologies. Xilinx Virtex-5 was selected as the example FPGA technology, and TSMC 65nm GP was used for the ASIC.
Xilinx Virtex-5 results:
- Number of occupied Slices : 7 176
- Number of Slice LUTs : 16 837
ASIC TSMC 65nm GP results:
- Total cell area : 221 684.4 µm2 (0.22 mm2)
- Total gates : 138 553 NAND2 gates
- Power consumption : 13.9 mW
To give a basis for comparison, results for a bare 32-bit processor [4] in the same process node are given below:
- Core layout area : 0.8 mm2
- Power consumption : 14.36 mW
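The figures above can be put side by side with a few lines of arithmetic. The sketch below only restates the quoted numbers; the constant and function names are illustrative.

```c
/* Figures quoted in the lists above. */
#define MS_CELL_AREA_UM2  221684.4  /* mass storage design, total cell area */
#define MS_GATES          138553.0  /* NAND2-equivalent gates               */
#define MS_POWER_MW       13.9
#define CPU32_AREA_MM2    0.8       /* bare 32-bit core layout area         */
#define CPU32_POWER_MW    14.36

/* Convert square micrometres to square millimetres. */
static double um2_to_mm2(double um2)
{
    return um2 / 1.0e6;
}

/* How many times larger the 32-bit core is than the mass storage design. */
static double area_ratio(void)
{
    return CPU32_AREA_MM2 / um2_to_mm2(MS_CELL_AREA_UM2);
}

/* Average cell area per NAND2-equivalent gate, in square micrometres. */
static double area_per_gate_um2(void)
{
    return MS_CELL_AREA_UM2 / MS_GATES;
}

/* Power of the 32-bit core relative to the mass storage design. */
static double power_ratio(void)
{
    return CPU32_POWER_MW / MS_POWER_MW;
}
```

As the numbers read, the bare 32-bit core occupies roughly 3.6 times the area reported for the mass storage design, while the two power figures differ by only about 3%.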
Consequences and performance bottlenecks
As with any other design, the system needs to be carefully analyzed; with an 8-bit microcontroller, however, all data paths need to be carefully evaluated to avoid any direct CPU participation in data transfers. Besides these hardware architectural aspects, the software, its tasks and its performance demands need to be carefully assessed with the limitations of the CPU in mind. Profiling tools should be used to achieve this goal.
Limitations can come from memory management and fragmentation, which require the preparation of long and complex descriptor tables for DMA transfers. Limitations also come from the type of application and the way it is used. For example, operating on small files in a mass storage application introduces significant control transfer and CPU interaction overhead.
Building a SuperSpeed USB application using available optimized 8051 derivatives such as the R8051XC2 is feasible, and despite some limitations it can be used by developers. Of course, some applications will never use a low-performance processor because of their characteristics, yet there are many others that can benefit in terms of silicon area, power consumption and cost efficiency.
- Ireneusz Sobanski, Evatronix SA, “From Software Based Verification to Firmware Development”, ChipEx2010, May 2010
- “USB 3.0 Specification”, Revision 1.0, November 12, 2008
- Michal Jedrak, Evatronix SA, “The new kid on the USBlock: introducing SuperSpeed 3.0”, EDA Tech Forum, March 2010, p. 50-54
- “Cortex-R4 Processor”, ARM Holdings official website, 7th November 2010