Instruction set architectures tuned specifically for wireless and other network traffic and computing models are now coming to market. They make packet processing more efficient and make applications for networked embedded systems easier to develop and program.
Processing characteristics for a networked design are very different from those required for most traditional embedded and desktop environments. Very small code segments execute against individual network packets that arrive randomly, never to be seen again. In this environment, it is important to minimize the time it takes to process a packet, to respond quickly to interrupts, and to execute common complex packet operations quickly.
In addition, the specifics of the networking topology and the communications technology impose constraints. In wireless embedded applications, a processor and its underlying instruction set need the performance and responsiveness to support several network ports with different protocols, each running at up to 100 megabits per second.
Designers need predictable performance so that they can drive a wide variety of on-board I/O configurations from software. They need efficient operation so that their system cost is competitive, and they need a flexible, easily programmable architecture so they can ship their product on time. They also need a fully functional processor capable of running modern protocol stacks, operating systems and applications.
Designing a processor and an underlying instruction set architecture that make embedded design easier and more cost-effective leads to some interesting hardware innovations.
To get the performance needed, it was necessary for our engineers to optimize around memory-to-memory operations. Most packet data is only touched once by the processor, so it doesn't make sense to load data into registers, perform an operation, and then store the data back again. Instead, the instruction set should operate directly on data in memory.
The instruction set architecture (ISA) that we have developed, codenamed "Mercury," uses deterministic multi-threading. This combines multiple register files, multiple functional units and a hardware thread scheduler to provide efficient resource use and very fast interrupt handling. This architecture, also known as simultaneous multi-threading, is similar to the hyper-threading feature built into modern Pentium 4 processors, but with 8 simultaneous threads compared with the Pentium's 2. That makes it far more efficient at driving a high-speed wireless 802.11g interface, where it is necessary to respond to an incoming packet quickly.
In the software-configured hardware approach we use to adjust to the varying I/O requirements of different networking protocols, deterministic multi-threading is quite effective, allowing very fast responses to signal changes on the I/O lines. Hard real-time (HRT) threads handle software I/O. Each HRT thread is completely deterministic, to meet the stringent real-time needs of software I/O. An HRT thread can respond to an I/O event in fewer than 10 ns, over a thousand times faster than an I/O driver in Linux or VxWorks can. Even though multiple threads share a single pipeline and functional units, a pipeline hazard in one thread will not affect the timing of another thread.
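One way to picture how an HRT thread can keep fixed timing while sharing a pipeline is a repeating slot table. The Python sketch below is hypothetical and is not a description of Mercury's actual scheduler: the table layout, the thread names, and the round-robin policy for the remaining slots are all assumptions made for illustration.

```python
from itertools import cycle

def build_schedule(table, nrt_threads, cycles):
    """Return the name of the thread issued on each cycle.

    table: repeating list of slots; each entry is an HRT thread name,
           or None for a slot left to the non-real-time threads.
    nrt_threads: non-real-time threads that share the None slots round-robin.
    """
    nrt = cycle(nrt_threads)
    schedule = []
    for c in range(cycles):
        slot = table[c % len(table)]
        schedule.append(slot if slot is not None else next(nrt))
    return schedule

def issue_cycles(schedule, name):
    """List the cycles on which the named thread issues an instruction."""
    return [c for c, thread in enumerate(schedule) if thread == name]

# Hypothetical 4-slot table: two HRT threads own fixed slots.
schedule = build_schedule(["hrt0", None, "hrt1", None],
                          ["nrt0", "nrt1", "nrt2"], 12)
```

In this model `issue_cycles(schedule, "hrt0")` is `[0, 4, 8]` no matter how many non-real-time threads compete for the free slots, which is the property the text calls determinism.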
Keeping in mind the development environment in which such a new architecture would be used, the design of an Internet-centric instruction set architecture had to start with RISC-style principles: a small, regular instruction set with fixed-length instructions. The ISA then had to be extended to support memory-to-memory operations, additional bit-based operations to support multiple interfaces and various protocols, multiple addressing modes better suited to networking than conventional modes, fast specific operations for common tasks, and a branch architecture tuned for networking.
The advantage of such an ISA is that it maintains single-cycle instruction execution to enable determinism, regular design to ease programming, and atomic instructions to avoid data corruption across the design.
Network protocol processing is different from desktop computing. A packet may need only an inspection and the change of a single value before it is sent on its way. In this environment, a processor with fast local memory can operate directly on the memory location without moving the data into a register. Each thread has 8 address registers, and one source operand and the destination can be in memory, accessed indirectly through an address register with one of the powerful addressing modes.
This memory-to-memory architecture is particularly useful for memory moves, whether the data is aligned or not.
For example, consider a 32-bit aligned memory move in our Mercury ISA, compared with the same operation in a standard Stanford-derived MIPS instruction set architecture:
move.4 (a6)4++, (a5)4++
Here, a 4-byte move takes data from the memory location pointed to by address register a5 and puts it in memory at the location pointed to by a6. Both a5 and a6 are then incremented by 4. By comparison, the standard MIPS instruction set architecture requires four operations: separate instructions to move the data from and to memory, and to increment each address.
ldw t1, (t5)
stw t1, 0(t6)
addu t5, t5, 4
addu t6, t6, 4
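The effect of the single Mercury move instruction described above can be modeled in a few lines of Python. This is only a behavioral sketch: the `Thread` class, the flat `bytearray` memory and the helper name `move4_postinc` are illustrative assumptions, not part of the Mercury toolchain.

```python
class Thread:
    """Toy model: flat byte-addressable memory plus two address registers."""
    def __init__(self, mem_size=64):
        self.mem = bytearray(mem_size)
        self.a = {"a5": 0, "a6": 0}  # address registers hold byte addresses

def move4_postinc(t, dst, src):
    """Model of move.4 (dst)4++, (src)4++.

    Copies 4 bytes memory-to-memory, then adds 4 to both address registers.
    No data register is involved.
    """
    s, d = t.a[src], t.a[dst]
    t.mem[d:d + 4] = t.mem[s:s + 4]
    t.a[src] += 4
    t.a[dst] += 4

t = Thread()
t.mem[0:8] = bytes(range(1, 9))   # source data at addresses 0..7
t.a["a5"], t.a["a6"] = 0, 32      # a5 = source, a6 = destination
move4_postinc(t, "a6", "a5")
move4_postinc(t, "a6", "a5")      # registers already point at the next word
```

After the two calls, the eight source bytes sit at addresses 32 to 39, and a5 and a6 hold 8 and 40, so a copy loop needs no separate address arithmetic.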
If the data are unaligned, a common occurrence in packet processing, the memory-to-memory architecture is also more efficient: three operations versus eight. The Mercury sequence is shown first, followed by the MIPS equivalent:
shftd d1, 4(a5)++, #shift
move.4 (a6)4++, d1
ldw t1, (t5)
ldw t2, 4(t5)
sll t3, t1, shift
srl t4, t2, (32-shift)
or t7, t3, t4
stw t7, 0(t6)
addu t5, t5, 4
addu t6, t6, 4
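The shift-and-merge technique behind the eight-instruction MIPS sequence can be checked with a short Python model. It assumes big-endian (network order) word loads and derives the shift amount from the low two bits of the source address; the function names are illustrative only.

```python
import struct

def load_word(mem, addr):
    """Big-endian 32-bit load from a word-aligned address."""
    return struct.unpack_from(">I", mem, addr)[0]

def store_word(mem, addr, value):
    """Big-endian 32-bit store to a word-aligned address."""
    struct.pack_into(">I", mem, addr, value & 0xFFFFFFFF)

def unaligned_move4(mem, src, dst):
    """Copy the 4 bytes at unaligned `src` to aligned `dst`.

    Two aligned loads straddle the data; a left shift, a right shift
    and an OR merge them into one word, as in the MIPS listing.
    """
    base = src & ~3             # aligned word holding the first byte
    shift = (src & 3) * 8       # misalignment, in bits
    hi = load_word(mem, base)
    lo = load_word(mem, base + 4)
    # Python integers make a shift by 32 well defined (shift == 0 case);
    # real hardware may need that case handled separately.
    merged = ((hi << shift) | (lo >> (32 - shift))) & 0xFFFFFFFF
    store_word(mem, dst, merged)

mem = bytearray(range(16)) + bytearray(16)
unaligned_move4(mem, 1, 16)     # bytes 1..4 land at addresses 16..19
```

Here the bytes 01 02 03 04, which start one byte past a word boundary, are written as a single aligned word at address 16.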
Mercury's SHFTD instruction uses the special Source3 register in this case to specify one source operand, effectively making SHFTD a three-operand instruction. With multiple threads running simultaneously, these memory-to-memory operations are atomic, meaning that they occur within a single cycle and do not interfere with memory operations by other threads.
Network packet headers and data are generally not aligned on word boundaries, so bit operations are very helpful to programmers in the networked environment in which microcontrollers now operate. A complete set of logical, arithmetic and shift instructions, in addition to basic bit test and set instructions, is required to enable efficient program loops based on arbitrary streams of data, and quick access via masks to specific header values.
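As an illustration of mask-based access to header values, the following Python sketch pulls three fields out of the first 32-bit word of an IPv4 header using only shifts and masks. The helper name `bits` and the sample bytes are assumptions chosen for the example.

```python
def bits(word, msb, lsb):
    """Extract bits msb..lsb inclusive (bit 0 is the least significant)."""
    width = msb - lsb + 1
    return (word >> lsb) & ((1 << width) - 1)

# First four bytes of a sample IPv4 header, read in network (big-endian) order.
first_word = int.from_bytes(bytes([0x45, 0x00, 0x00, 0x54]), "big")

version = bits(first_word, 31, 28)       # 4-bit version field
ihl = bits(first_word, 27, 24)           # 4-bit header length, in 32-bit words
total_length = bits(first_word, 15, 0)   # 16-bit total length, in bytes
```

For this sample word the fields come out as version 4, a header length of 5 words, and a total length of 84 bytes. None of the fields smaller than a byte can be read with byte or word loads alone, which is why shift and mask operations matter so much for header parsing.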
In a networked environment the addressing modes commonly available on a microcontroller need to be enhanced. Common tasks in a network environment include traversing an array, addressing a specific field within a data structure, examining every byte of a packet, and accessing data on a stack. Specific addressing modes serve each of these functions well.
For example, traversing an array is much easier with indexed addressing, where a base address is indexed with a data register. Addressing a specific field can be done with address plus immediate, and examining every byte in a record can be done with indexed addressing with an auto-incremented index. Accessing data on a large protocol stack can be done with indexed addressing with an immediate field.
Specifically, accumulating sets of adjacent fields from a packet, a complicated procedure in a traditional processor, is simple with sophisticated addressing modes.
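The addressing modes mentioned above differ only in how the effective address is formed. The Python sketch below models three of them on a toy packet; the `AddrReg` class, the `read` helper and the register names are hypothetical stand-ins, not Mercury syntax.

```python
class AddrReg:
    """An address register with a post-increment helper."""
    def __init__(self, value=0):
        self.value = value

    def post_inc(self, step):
        """Return the current address, then advance the register by `step`."""
        addr = self.value
        self.value += step
        return addr

def read(mem, addr, size):
    """Read `size` bytes at `addr` as a big-endian unsigned integer."""
    return int.from_bytes(mem[addr:addr + size], "big")

packet = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC])
a1 = AddrReg(0)

# Address plus immediate: a 2-byte field at a fixed offset from the base.
field = read(packet, a1.value + 2, 2)

# Auto-increment: visit every byte; the register advances after each access.
byte_sum = 0
while a1.value < len(packet):
    byte_sum += read(packet, a1.post_inc(1), 1)

# Indexed: a base address plus an index held in a data register.
d0 = 4
indexed_byte = read(packet, 0 + d0, 1)
```

In each case the address arithmetic that a load/store architecture would spend separate instructions on is folded into the memory access itself.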
With an appropriately designed ISA, common activities on the network can be performed with a single instruction. Powerful arithmetic instructions, including 16-bit multiply-accumulates (signed, unsigned, or fractional) that target a 48-bit accumulator, and a CRCGEN instruction for byte verification make programming for the net more efficient.
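To make concrete what such single instructions replace, here is a Python sketch of a signed 16-bit multiply-accumulate into a 48-bit accumulator and of a bit-serial CRC update. Two assumptions are worth stating loudly: the accumulator is modeled as wrapping on overflow, and the CRC shown is the common reflected CRC-32 (polynomial 0xEDB88320). The source specifies neither the overflow behavior nor the polynomial that CRCGEN uses.

```python
ACC_MASK = (1 << 48) - 1  # 48-bit accumulator

def to_signed16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def macs(acc, a, b):
    """Signed 16x16 multiply-accumulate; assumed to wrap at 48 bits."""
    return (acc + to_signed16(a) * to_signed16(b)) & ACC_MASK

def crc32_step(crc, byte):
    """Fold one byte into a reflected CRC-32, one bit at a time."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc

def crc32(data):
    """CRC-32 of a byte string, built from the per-byte step."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = crc32_step(crc, b)
    return crc ^ 0xFFFFFFFF

acc = macs(macs(0, 3, 4), 0xFFFF, 2)   # 3*4 + (-1)*2
check = crc32(b"123456789")
```

The eight-iteration inner loop in `crc32_step` is the kind of per-byte work that a dedicated CRC instruction collapses into one operation.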
Branch and jump logic in a network environment, such as wireless, is different from typical desktop environments. In a desktop environment, for example, loops are common and most branches are taken. In a network environment, nested "if" statements are much more common and most branches are not taken. Also, since caches are not useful here, there is no penalty for a cache miss.
In this environment, branch prediction is neither very relevant nor very useful. When a branch is mispredicted, it affects only instructions in the thread that includes the branch. Because the ISA we have chosen to develop is multi-threaded, there are typically several threads executing at once, so the penalty for a mispredicted branch is very low.
More important is predictable behavior. Branch prediction is static to preserve real-time determinism, and the compiler uses a bit in the instruction to specify to the hardware if the branch should be predicted forward or backward.
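The effect of a compiler-chosen static hint can be illustrated with a small Python model that counts mispredictions over a branch trace. The trace format, the branch name and the reading of the hint bit as a simple taken/not-taken prediction are simplifying assumptions for this sketch.

```python
def mispredictions(trace, hints):
    """Count branches whose outcome differs from their static hint.

    trace: list of (branch_id, taken) outcomes in execution order.
    hints: dict mapping branch_id to the compiler's predicted outcome.
    """
    return sum(1 for branch, taken in trace if hints[branch] != taken)

# A hypothetical error check in packet code: the branch is rarely taken.
trace = [("err_check", False)] * 9 + [("err_check", True)]

good_hint = mispredictions(trace, {"err_check": False})
bad_hint = mispredictions(trace, {"err_check": True})
```

With the hint matching the common case, only one of the ten branches is mispredicted; with the opposite hint, nine are. Because the hint never changes at run time, the cost of any given path through the code is the same on every execution.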