Using Memory More Effectively in NPU Designs

By Chuck Jordan, Teja Technologies, CommsDesign.com
March 24, 2003 (3:26 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030321S0016

Performance of software applications for network processor (NPU)-based systems is highly sensitive to memory structure and data access. Developers must consider speed, size, and access constraints when determining the most efficient utilization of memory to avoid latency and ensure high system-level performance.

To make the design of memory elements easier, designers need to extend the C programming model memory abstraction approach to the NPU world. Additionally, designers need a flexible mechanism for mapping these abstractions to the physical NPU resources.

In this article, we'll explore techniques designers can use to improve NPU memory utilization. At the same time, we'll examine how development tools can help in this process. Throughout the discussion, we'll focus on Intel's IXP2xxx processor family.

Making Upfront Decisions
When NPU application code is written by hand in a native assembly language, the developer is forced to make decisions about memory utilization during the initial design phase of application development. In particular, where data structures are stored is key to achieving high performance, and designers face a number of choices for placing their data in off-chip or on-chip memories.

Changes are invariably required as a result of testing and benchmarking. Because memory access instructions are specific to a given memory resource, manually altering them not only lengthens the design cycle but also increases the likelihood of introducing errors into the system.

It is important for the designer to understand the various memory resources available for a particular underlying NPU. Typically, DRAM and SRAM, which both can be quite large, are off-chip. For example, on the Intel IXP2xxx NPU, smaller on-chip memories include scratchpad, local memory, and various general-purpose registers. The cost of NPU memory access typically ranges over two orders of magnitude from one cycle for registers to 100+ cycles for DRAM (Figure 1).

Figure 1: Logical architecture of the IXP2400.

The IXP2xxx microengines have general-purpose registers that can be used to hold data and hence can be modeled as a memory bank (Table 1). Each microengine hardware thread can define its own instance of these registers, or a register can be defined as "absolute," providing a means to share the data among the threads of the same microengine. Register access times are one cycle. However, relative to the other memory resources, there is a limited supply of registers, meaning that the data structures that are mapped to them need to be relatively small. Ideally, they would be used as a sort of data cache to hold temporary or permanent, but frequently used data.

Table 1: Memory resources of the IXP2400

Local memory, which is addressable storage located in each microengine, is also optimal for such data. It is on-chip and very fast—with access times only slightly slower than the registers.

Scratchpad is a 16-kbyte memory that can be accessed by any thread of any microengine and also by the NPU's embedded XScale processor. Since scratchpad memory is on-chip, designers can avoid the latency of going out on an external bus. Its 32-bit access times are 10 times those of local memory, so a developer would want to avoid per-packet operations in this memory bank. However, a strategy of copying collections of local microengine data from registers or local memory to other, slower memory banks on an infrequent basis is feasible.

In the IXP2xxx architecture, quad-data-rate (QDR) SRAM is left off-chip. Anything placed off-chip involves external busses, arbitration, slower clocks, etc. The specified peak bandwidth of the IXP2400 implies a mean time of 1.5 cycles per 32-bit access. Attaining this rate would necessitate the use of a burst transfer.

Under the burst transfer method, software signals that it wants to perform a larger, contiguous, aligned, multiword transfer. The hardware arbitrates for the bus once, then holds the bus and performs a rapid multiword transfer. While the bus is held, other requests for it from microengines, the XScale, or PCI are stalled until the burst completes.

So "peak" is really a theoretical maximum, rarely achieved in steady state. There are a few places in the software where a burst can be done. Often, however, only a single 32-bit word is being accessed. Each single-word access pays the full delay of arbitrating for the bus and stalling until the arbitration unit grants permission to proceed.
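The gap between single-word and burst access can be put into rough numbers. The following is only a back-of-the-envelope model in C: the function names are made up for this sketch, and `arb_cycles` and `xfer_cycles` are values the caller supplies, not IXP2400 datasheet figures.

```c
#include <assert.h>

/* Cycles to move `words` 32-bit words when every word arbitrates on its own.
 * arb_cycles and xfer_cycles are caller-supplied assumptions. */
unsigned long single_word_cycles(unsigned words, unsigned arb_cycles,
                                 unsigned xfer_cycles)
{
    return (unsigned long)words * (arb_cycles + xfer_cycles);
}

/* Cycles for the same transfer as one burst: arbitrate once, then stream. */
unsigned long burst_cycles(unsigned words, unsigned arb_cycles,
                           unsigned xfer_cycles)
{
    if (words == 0)
        return 0;
    return (unsigned long)arb_cycles + (unsigned long)words * xfer_cycles;
}

/* Cycles saved by bursting; in this model a burst is never slower. */
unsigned long cycles_saved(unsigned words, unsigned arb_cycles,
                           unsigned xfer_cycles)
{
    unsigned long single = single_word_cycles(words, arb_cycles, xfer_cycles);
    unsigned long burst  = burst_cycles(words, arb_cycles, xfer_cycles);
    assert(single >= burst);
    return single - burst;
}
```

In this model the saving grows with every extra word, because the arbitration cost is paid once rather than per word. A one-word "burst" saves nothing.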

DRAM is also off-chip in the IXP2xxx architecture and is typically the largest memory resource in an NPU-based system, with the most appropriate cost/performance characteristics for packet payloads. It has a 64-bit wide data bus. Software that is attempting to access just 32 bits must fetch 64 bits and toss half the data. Software writing 32 bits must read64-modify32-write64 (Figure 2). On the IXP2800, when the memory controller reads, it always gets 128 bits.
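The read64-modify32-write64 sequence can be sketched in portable C. This is an illustration only: the `dram` array stands in for the off-chip memory, the function names are hypothetical, and placing the lower address in the low half of the 64-bit word is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Write 32 bits into one half of a 64-bit DRAM word.
 * byte_offset must be 4-byte aligned. */
void dram_write32(uint64_t *dram, uint32_t byte_offset, uint32_t value)
{
    assert(byte_offset % 4 == 0);
    uint64_t word = dram[byte_offset / 8];           /* read 64 bits          */
    unsigned shift = (byte_offset % 8) ? 32 : 0;     /* pick the target half  */
    word &= ~((uint64_t)0xFFFFFFFFu << shift);       /* clear the old 32 bits */
    word |= (uint64_t)value << shift;                /* modify 32 bits        */
    dram[byte_offset / 8] = word;                    /* write 64 bits back    */
}

/* Read 32 bits: fetch the whole 64-bit word and discard the other half. */
uint32_t dram_read32(const uint64_t *dram, uint32_t byte_offset)
{
    assert(byte_offset % 4 == 0);
    uint64_t word = dram[byte_offset / 8];
    unsigned shift = (byte_offset % 8) ? 32 : 0;
    return (uint32_t)(word >> shift);
}
```

Every 32-bit write costs a full read and a full write of the 64-bit word, which is why small DRAM updates are so expensive compared with bursts.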

Figure 2: Relative access times for a 32-bit read/write of the IXP2400 NPU memory resources.

Since individual 32-bit DRAM accesses are the slowest of all in the system, transfers from media switch fabric to DRAM or from DRAM to media switch fabric should always be done in a burst in order for the system to perform at line rate. Since DRAM is so important for packet transfers, other users of DRAM risk adversely affecting the performance goal. Intel has addressed this by allowing the programmer to control the ratio of bus arbitration for PCI and XScale. PCI and XScale accesses can be programmatically slowed down relative to the microengines' ability to arbitrate and win the bus.

Calling for Memory Abstraction
The Intel NPU microengines' instruction set has specific instructions for each memory bank. For example, to access QDR SRAM, the sram instruction is used. To access DRAM, the dram instruction is used. As a result, when writing applications in microcode, the programmer must fix the memory type since this will dictate which instructions to use.

Further, the flexibility of this advanced hardware presents some critical considerations to the programmer. Data structures should ideally reside in the fastest possible memory bank, but they may not fit in the available space. The scope issue forces data to be placed in a memory bank that is reachable by all processing elements that use it to avoid costly copying of data from place to place. Memory bursts are the ideal way to access data but require that the data be contiguous, aligned on certain boundaries, and large enough to justify the overhead of setting up the burst.

To take full advantage of the hardware choices, enable flexible mapping of data to different memory types, and build in reusability, programmers need the ability to develop software without hard-coding memory choices into the application logic.

In traditional microprocessor software, the C programming language gives the programmer the abstraction of a single, flat, linear address space and the pointer data type to hold addresses. This memory abstraction must be extended for NPU-based software applications to multiple, flat, linear address spaces and the pointer data type modified to include the memory bank specifier. For example, compare the program fragments shown below:

The MyMemBank specifier in the extended C specification is subject to remapping to different physical memory banks such as registers, local memory, SRAM, DRAM, etc. By using a software platform that features a C-language extension, this extended pointer can be supported as a hardware abstraction layer (HAL) pointer.
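Standard C has no memory-bank qualifier, but the idea can be approximated with a "fat" pointer that carries a bank tag next to an offset. The sketch below is not the extended-C syntax discussed here: `hal_ptr_t`, `bank_t`, `MY_MEM_BANK`, and the host-side `bank_storage` array are all hypothetical names chosen so the example runs on an ordinary machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bank tags, one per flat address space. */
typedef enum { BANK_LOCAL, BANK_SCRATCH, BANK_SRAM, BANK_DRAM, BANK_COUNT } bank_t;

/* Late-bound mapping: changing this one line remaps the data structure. */
#define MY_MEM_BANK BANK_SRAM

/* Pointer = which address space + byte offset inside it. */
typedef struct {
    bank_t   bank;
    uint32_t offset;
} hal_ptr_t;

/* Host-side stand-in for the physical banks (64 words each, zeroed). */
static uint32_t bank_storage[BANK_COUNT][64];

uint32_t hal_load32(hal_ptr_t p)
{
    assert(p.bank < BANK_COUNT && p.offset % 4 == 0 && p.offset / 4 < 64);
    return bank_storage[p.bank][p.offset / 4];
}

void hal_store32(hal_ptr_t p, uint32_t value)
{
    assert(p.bank < BANK_COUNT && p.offset % 4 == 0 && p.offset / 4 < 64);
    bank_storage[p.bank][p.offset / 4] = value;
}
```

Application logic written against `hal_load32` and `hal_store32` never names a physical bank. A real code generator would resolve the tag at compile time and emit the bank-specific instruction rather than index an array.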

To ease the design process, a developer can use an environment in which applications are defined in a hardware-independent model and can be mapped to any of the resources in the NPU-based system. This could include the multiple processors on the NPU itself, external control and management processors, and co-processors such as switch fabrics and search engines.

In addition to HAL pointers, other aspects of the application and their mapping to hardware can also be abstracted. Table 2 lists some examples.

Table 2: Hardware Elements That Can Be Abstracted on the IXP2xxx Family

Designing with Memory Abstract-Enabled Platforms
To make efficient use of memory easier to achieve, designers can turn to software tools that provide support for memory abstraction. With a software platform that implements memory abstraction, the user starts by defining all the classes needed in the application in the object-oriented style, using the framework classes provided or subclassing them to specify unique functionality.

Next, a logical software architecture is defined that lays out the instances of the classes and the channels of communication that define the desired application. A useful way to represent a software architecture is via a topology diagram. State machines, represented as green squares, have constant pointers to data structures, shown as orange squares (objects). These "usage" relationships, from state machine to objects, are shown as solid blue arrows (Figure 3).

Figure 3: Fragment of a logical software architecture of RFC-1812 IPv4 Forwarding application.

Also shown are light-blue pipes called "channels". Channels represent the inter-communication primitive from one state machine to another. In this way, the various stages of a pipeline can be built up to form the logic of a networking application. Numbers in the top-right corner of any object represent its multiplicity. For example, the "rx_0_I" component (an instance of the Rx class) has a multiplicity of 8.

State machines and objects in the software architecture can be placed in convenient groupings, or containers, which make subsequent mapping to hardware easier. State machines can be mapped to a container called a thread while objects can be mapped to a container called a memory space. As an example, a memory space might be given the name "counter_space" and could contain a collection of all of the objects that hold packet statistics. These containers are "abstract." Memory spaces have no knowledge of where they will land in memory, nor do threads know on which processor they will run.

Hardware Knowledge Needed
In order to map the software architecture to the hardware architecture, the platform will need some knowledge of the hardware. The hardware architecture can again be thought of as a series of objects, interconnected via busses. Hardware elements for consideration are processors, memory banks, and the busses that connect them (Figure 4).

Figure 4: Depiction of a hardware architecture of a dual-NPU system.

From a diagram such as Figure 4, the platform can learn scope information. For example, the ingress processor ixp2400_I has access to SRAM_I but not SRAM1_E. The platform provides a means for the designer to describe the custom board design or import known reference board designs.

Both software architectures and hardware architectures can be built up hierarchically through containment. This enables modular, reusable description of "large" systems, such as an equipment rack with multiple shelves, each with multiple boards. Note: Each of these boards will have processors, memory, and busses.

Once the architecture descriptions are complete, the software elements are mapped to the hardware elements. During this mapping, logical threads are assigned to processors, logical memory spaces mapped to physical memory banks, and logical channels to physical media. Mapping can be done in one of two ways.

Designers can either use a graphical user interface to select the appropriate mapping for a given thread or memory space from a provided list of choices, or write the mapping relationships in textual format. Each memory bank records the physical address at which it appears in the memory map, along with its size. As memory spaces are mapped to memory banks, a tool can account for the amount of storage occupied within each bank. If something doesn't fit, an error alerts the user that the storage of the memory bank has been exceeded.
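The size accounting described above amounts to a running total per bank. A minimal sketch in C follows, with hypothetical names (`mem_bank_t`, `map_memory_space`); an actual mapping tool would track much more, such as alignment and base addresses.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    size_t      capacity;   /* bytes available in the bank             */
    size_t      used;       /* bytes already claimed by memory spaces  */
} mem_bank_t;

/* Map a memory space of `size` bytes into `bank`.
 * Returns the offset assigned within the bank, or -1 if it does not fit. */
long map_memory_space(mem_bank_t *bank, size_t size)
{
    assert(bank != NULL && bank->used <= bank->capacity);
    if (size > bank->capacity - bank->used)
        return -1;                       /* bank would overflow: report it */
    long offset = (long)bank->used;
    bank->used += size;
    return offset;
}
```

For example, a 16-kbyte scratchpad accepts a 4-kbyte space and then an 8-kbyte space, but rejects a second 8-kbyte space because only 4 kbytes remain.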

A tool can also check that each state machine, once mapped to a processor, has access to the memory banks that hold the objects it uses. If something is wrong, the user is alerted and can reconsider how the design has been mapped. Many such validations are possible, covering memory size, channel connectivity, and code size.

Once the assignments or mappings are made, code generators produce the appropriate execution code. The user can then test the application with the software simulators or the physical hardware.

Memory, Service Abstractions
Pick up any book on the C programming language and there is usually a chapter on memory functions describing an ANSI-approved function called memcpy(). The user should be able to call this sort of function to perform a "burst" transfer, and should not be burdened with unfamiliar pointer type declarations.

When code is being generated with the NPU microengine assembly language as the target language, the pointer arguments can be conveyed with two pieces of information: the address and the memory type. In this way, the memory copy macro can know which memory types are involved in the transfer. If a compile-time constant is employed, the optimizing code generator uses the memory type information to emit the most efficient instructions for the memory types involved.

To keep the implementation as efficient as possible, a burst transfer is used whenever the memory banks involved allow it. For example, if the call to the memory copy API has a source address within the DRAM memory bank, and the destination address resides in local memory, the code generator will emit code that performs a DRAM burst read into transfer registers. It will then set the local memory index register and copy each transfer register, one cycle per register, into local memory.
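The shape of such a type-aware copy can be simulated on a host machine. In the sketch below, `hal_memcpy`, `mem_type_t`, and `XFER_REGS` are hypothetical; the staging array plays the role of the transfer registers, and the burst size of 8 words is an assumption, not a hardware figure. The return value counts simulated off-chip bus requests so the effect of bursting is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { MEM_LOCAL, MEM_SCRATCH, MEM_SRAM, MEM_DRAM } mem_type_t;

#define XFER_REGS 8   /* assumed number of words moved per burst */

static int is_off_chip(mem_type_t t)
{
    return t == MEM_SRAM || t == MEM_DRAM;
}

/* Copy nwords 32-bit words; returns the number of off-chip bus requests. */
unsigned hal_memcpy(uint32_t *dst, mem_type_t dst_type,
                    const uint32_t *src, mem_type_t src_type, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    unsigned bus_requests = 0;
    uint32_t xfer[XFER_REGS];            /* stand-in for transfer registers */

    while (nwords > 0) {
        size_t n = nwords < XFER_REGS ? nwords : XFER_REGS;
        memcpy(xfer, src, n * sizeof xfer[0]);   /* one burst read per chunk */
        if (is_off_chip(src_type))
            bus_requests++;
        for (size_t i = 0; i < n; i++)           /* drain one register at a time */
            dst[i] = xfer[i];
        if (is_off_chip(dst_type))
            bus_requests++;                      /* one burst write per chunk */
        src += n;
        dst += n;
        nwords -= n;
    }
    return bus_requests;
}
```

Copying 20 words from DRAM to local memory takes three bus requests here instead of 20, and the call site looks the same whichever banks are involved.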

The call to do a memory copy is written only once in the logic of a class and is subject to hardware mapping. Therefore, if the designer chooses to map their object to a different memory bank, they need not re-write their software. This affords portability to several different target languages and NPU generations.

Mutual Exclusion
Data structures that are shared by multiple threads require a lock. The thread holding the lock is free to access the structure. Others that are waiting on the lock sleep for a time until they are granted the lock.

The user can instantiate a mutual exclusion (mutex) object in their software architecture and have all the state machines that use it point to it. The code generated for the mutex function depends on the location where the underlying data structure has been mapped.

During mapping, if all the threads that use the mutex are mapped to the same microengine, a very fast mutex algorithm that uses absolute registers can be emitted by the code generator. If multiple microengines share the mutex (and possibly the XScale too), the user can map the mutex to QDR SRAM or scratchpad and pick an algorithm that uses the atomic operations for these memories. There are three additional algorithms that can be selected: fair, unfair, and polled with different space vs. speed tradeoffs.

Channels
Channels provide a unidirectional communication from one or more producer state machines to one or more consumer state machines. During custom mapping, several different channel implementations are selectable. Some use queues in shared memory and inter-thread signaling while others use socket I/O and assume a protocol stack. The channel implementation used depends upon how your architecture has been mapped.

When the channel is a shared memory channel, a pointer to data, rather than the data itself, can be given to the channel. If the microengines are neighbors, a best-case channel implemented to use the fast next-neighbor ring can be emitted.
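A shared-memory channel that passes pointers can be pictured as a small ring. The following is a host-side sketch with hypothetical names (`channel_t`, `channel_put`, `channel_get`) and an assumed depth of four slots. It handles one producer and one consumer with no signaling or locking, and returning -1 when the ring is full is this sketch's convention.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4   /* assumed depth; must be a power of two */

typedef struct {
    void    *slot[RING_SLOTS];
    unsigned head;     /* count of items read so far    */
    unsigned tail;     /* count of items written so far */
} channel_t;

/* Enqueue a pointer. Returns 0 on success, -1 if the ring is full. */
int channel_put(channel_t *ch, void *msg)
{
    assert(ch != NULL);
    if (ch->tail - ch->head == RING_SLOTS)
        return -1;
    ch->slot[ch->tail % RING_SLOTS] = msg;
    ch->tail++;
    return 0;
}

/* Dequeue a pointer, or return NULL if the ring is empty. */
void *channel_get(channel_t *ch)
{
    assert(ch != NULL);
    if (ch->head == ch->tail)
        return NULL;
    void *msg = ch->slot[ch->head % RING_SLOTS];
    ch->head++;
    return msg;
}
```

Only the pointer crosses the channel, so the cost of a send does not depend on the size of the data it refers to.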

Code example 2 illustrates how the developer can code a channel function using an abstraction and the resulting code generated for that channel depending on two different mappings to hardware.

The pAlert argument is the pointer to the alert that is seen in the code snippets. The CHAN argument is the late-binding constant to a channel that is mapped during the mapping stage. "rc" is the return code, which may be returned as -1 if the channel returns a full—or near-full—indication.

Code example 3 shows the code associated with sending a message across the scratch ring to a non-neighbor micro-engine. A signal is also posted to awaken the consumer thread on that micro-engine.

Code example 4 shows the code associated with sending a message across the next-neighbor ring to a neighbor micro-engine. In this case, a different, lighter-weight signaling method is used to awaken the neighbor.

Simply by changing the mapping from one channel implementation to another, different code is emitted. In code example 4, if the channel is between neighboring microengines, a significant performance improvement is realized. The 50-to-1 performance gain is not only on the sending side. The receiving side realizes even more benefits since memory reads are slower than memory writes and next neighbor ring access is a register access.

Queues and Memory Pools
Queues are a useful means of passing work items from a producer to a consumer thread. A producer can enqueue a pointer to the queue while a consumer can dequeue a pointer from the queue.

During mapping, the user can choose between several different queue implementations. For example, the user can choose implementations which use a hardware scratch ring, a hardware SRAM linked-list, or a purely software ring or linked-list queue. As with the examples cited regarding channels, different queue mappings result in significant performance variations.

A memory pool is a collection of fixed-size memory buffers called nodes. A node can be dynamically allocated by software that needs a memory buffer. When the software is done with it, it can free the buffer back to the pool. Like the queue, the user can choose an implementation that uses a hardware scratch ring or a hardware SRAM linked-list. By doing so, memory allocation overhead is minimal due to the hardware assist provided by the IXP2xxx.
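In software, the linked-list flavor of such a pool is a free list threaded through the unused nodes. The sketch below is a plain C illustration with hypothetical names and assumed sizes (four nodes of 64 bytes). It has no hardware assist and is not safe for concurrent use.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_SIZE  64   /* assumed bytes per node */
#define NODE_COUNT 4    /* assumed nodes per pool */

typedef union node {
    union node   *next;                 /* valid only while the node is free */
    unsigned char payload[NODE_SIZE];   /* valid while the node is allocated */
} node_t;

typedef struct {
    node_t  storage[NODE_COUNT];
    node_t *free_list;
} pool_t;

/* Thread every node onto the free list. */
void pool_init(pool_t *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < NODE_COUNT; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
}

/* Pop a node, or return NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    assert(p != NULL);
    node_t *n = p->free_list;
    if (n != NULL)
        p->free_list = n->next;
    return n;
}

/* Push a node back; buf must have come from pool_alloc on the same pool. */
void pool_free(pool_t *p, void *buf)
{
    assert(p != NULL && buf != NULL);
    node_t *n = buf;
    n->next = p->free_list;
    p->free_list = n;
}
```

Allocation and release are each a couple of pointer operations, which is the work the hardware ring or linked-list implementations take over.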

Wrap Up
Successful development of high-performance software for NPUs demands a detailed understanding of the various memory resources available. Issues such as scope, speed, size, and hardware constraints must be well understood by the programmer.

Using traditional development methods, code changes are difficult for the developer since the NPU instruction set fixes memory types into the software design. Therefore, in order to effectively program systems with multiple, flat, linear address spaces, there is a need for a language extension to abstract memory types.

Extending C to have a new pointer type, which is qualified with a memory type specifier, provides several advantages including flexible hardware mapping, rapid performance optimization, shortened development cycle, and code reusability. Abstracted memory access opens the door for other mapping capabilities such as memory pools, communication channels, queues, and mutual exclusion implementations that take advantage of the NPU hardware assist features. A flexible mechanism for mapping these abstractions to the physical NPU resources results in the ability to iterate multiple designs, rapidly converging on that which yields the best systems-level performance.

About the Author
Chuck Jordan is a senior software engineer at Teja Technologies. He received a B.S. degree in computer science from Cal Poly in San Luis Obispo, Calif. Chuck can be reached at cjordan@teja.com.
