by Marcello Lajolo
NEC Laboratories America
Princeton, NJ, USA
As technology moves toward System-on-a-Chip (SoC) integration, the missing links between system-level specification and design implementation will have a major impact on designer productivity and design quality. ACES is an integrated, C-based SoC design environment that leverages high-level synthesis and co-verification tools to assist the designer in the hardware/software partitioning and architecture selection phases. This document is devoted to the issues concerning the hardware/software interface and the coordination between software tasks that is implemented by ACES’s real-time operating system (RTOS). We begin by describing the general problem of interfacing hardware and software, and then analyze the philosophy and implementation of the RTOS from the point of view of the IP integration challenge.
C-based design methodologies are receiving more and more attention from EDA (Electronic Design Automation) vendors. For example, commercial hardware/software co-verification tools from companies such as Mentor Graphics, CoWare, VAST, Virtio and Axys provide fast instruction-set simulators linked to various hardware simulators. They mainly focus on functional and performance modeling for software-dominated embedded systems, and they do not address high-level hardware modeling and refinement. The main limitation of these tools is that they often require the hardware to be modeled at the RT level. Although some of these vendors have recently started to offer mixed C/RTL co-verification (e.g. C-Bridge from Mentor Graphics), none of them yet offers an automated behavioral synthesis path from C specifications.
Coprocessor synthesis [1, 2, 3] is another emerging area. The main idea is to combine software compilation and hardware synthesis technologies into a system that lets designers explore and implement their designs directly from descriptions written in algorithmic C. The main limitation of this approach is its assumption that the designer has already come up with a feasible hardware/software partitioning for the entire design; the coprocessor synthesizer can then accelerate the software by offloading compute-intensive algorithms from the CPU to dedicated hardware. Although very useful, tools of this type provide only partial support for a complete SoC design flow, because it is well known that many decisions regarding the efficiency (performance, power, area, etc.) of the system have largely been fixed by the time a designer commits to a particular architecture. Alternative and complementary methodologies and solutions must hence be provided to help the designer during the initial phases of the design process, when coarse hardware/software partitioning trade-offs have to be analyzed.
In this paper, we present the behavioral modeling and simulation capabilities of an in-house C-based design methodology called ACES (Application to C to Exploration to System LSI), which leverages high-level synthesis and co-verification tools to assist the designer in the hardware/software partitioning and architecture selection phases. Compared with similar approaches, ACES has the unique advantage of building on the strengths of two key pieces of NEC’s C-based design flow: CYBER and CLASSMATE. CYBER is a behavioral hardware synthesis tool that provides the link to implementation that is missing in the alternative environments, and CLASSMATE is a hardware/software co-verification tool that can be exploited to derive accurate and fast timed functional models for behavioral IPs. The result is a unified design flow from system specification down to system implementation.
2 The ACES design flow
The growing interest toward the adoption of C-based design methodologies is at least twofold. On one hand, designers need to raise the level of abstraction of their specification in order to cope with the increasing system complexity. On the other hand, the severe time-to-market pressure prevents design reuse. It is well known that designers prioritize time-to-market with respect to design optimization and design cost. Design for reuse has lower priority, since designing reusable modules is more difficult and takes longer.
ACES advocates a clear separation between the behavior and the architecture of the SoC. Exploring architecture alternatives before settling on one requires good design space exploration support, so system design must start at the behavioral system level. At this level the functional behavior of a system can be specified and evaluated, and design guidance for an efficient implementation can be provided via a systematic design space exploration.
Figure 1: The ACES design flow.
Fig. 1 shows the entire flow with the major steps in ACES. The system is described at the behavioral level as a network of tasks that communicate by means of both events and shared variables. High-level descriptions of the tasks, provided in SystemC (but also C and C++), are translated through a compilation process into Discrete Event models with precise semantics, written in SystemC. For each module in the system specification, ACES can synthesize a hardware netlist, a software program and the interfaces between hardware and software, based on partitioning and communication mapping information given manually by the user on a module-by-module basis. Behavioral SystemC co-simulation is used to test the behavior of the system and to perform hardware/software partitioning in a closed loop. Good estimates of both hardware and software performance and power are of crucial importance in this phase in order to avoid costly design re-iterations. The graphical simulation environment provides rich libraries of pre-designed modules for test bench generation and waveform display. These features enable designers to focus their efforts on the core functionality of the design and on algorithm choices, and to spend less time on the tedious aspects of system specification. During co-simulation, hardware modules run concurrently, while the operation of the software modules is coordinated by a scheduler that models the RTOS used in the final implementation. Once a suitable hardware/software architecture has been identified, the hardware and software synthesized by ACES can be exported into CLASSMATE, where they can be simulated and verified at the architectural level.
ACES provides the unique possibility to change the hardware/software implementation of each component in the system by simply changing a parameter. The same simulation code is used to simulate the functionality for both hardware and software implementations. The only things that change are the delay annotations that are used for modeling performance and power consumption and also the scheduling policy of the module in order to model shared system resources like the CPU.
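The idea of switching a module between hardware and software with a single parameter can be sketched in plain C as follows. This is only an illustration under stated assumptions: the type and function names (impl_t, module_cfg_t, run_module, behavior) and the cycle counts are hypothetical and are not taken from ACES, whose models are written in SystemC. The functional code is shared by both mappings; only the delay annotation and the scheduling rule (software modules are serialized on one CPU) differ.

```c
#include <assert.h>

/* Hypothetical names: the actual ACES annotation format is not shown in the text. */
typedef enum { IMPL_HW, IMPL_SW } impl_t;

typedef struct {
    impl_t   impl;       /* the single parameter that selects the mapping */
    unsigned hw_cycles;  /* delay annotation used for the hardware mapping */
    unsigned sw_cycles;  /* delay annotation used for the software mapping */
} module_cfg_t;

/* Shared resource model: the time at which the CPU becomes free again. */
typedef struct {
    unsigned long free_at;
} cpu_t;

/* Functional behavior: identical for hardware and software mappings. */
int behavior(int in)
{
    return 2 * in + 1;
}

/* Runs one activation of a module that becomes ready at time ready_at.
 * Returns the completion time according to the selected mapping. */
unsigned long run_module(const module_cfg_t *m, cpu_t *cpu,
                         unsigned long ready_at, int in, int *out)
{
    assert(m != 0 && cpu != 0 && out != 0);
    *out = behavior(in);

    if (m->impl == IMPL_HW) {
        /* Hardware modules run concurrently: only the annotated delay applies. */
        return ready_at + m->hw_cycles;
    }

    /* Software modules share the CPU: wait until it is free, then occupy it. */
    unsigned long start = ready_at > cpu->free_at ? ready_at : cpu->free_at;
    cpu->free_at = start + m->sw_cycles;
    return cpu->free_at;
}
```

With sw_cycles set to 10, two software activations that are both ready at time 0 complete at times 10 and 20, whereas the same module mapped to hardware with hw_cycles set to 3 completes at time 3 and produces the same output value.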
3 Design abstraction levels
Designers today follow design flows built around different design abstraction levels for SoC hardware and software design. Different parts of a SoC might be modeled at different abstraction levels, depending on the verification questions to be answered. Commonly used design abstraction levels are: Untimed Functional Level, Timed Functional Level, Bus Cycle Accurate Level (Transaction Level Model), and Register Transfer Level.
In ACES we use timed functional models, which are essentially functional (algorithmic) models with added delay annotations. For hardware components, these delay annotations are automatically generated by the CYBER behavioral synthesis tool, while for software components they are extracted from a cycle-accurate simulation performed in the CLASSMATE environment. We think that timed functional models are the right level of abstraction at which to analyze preliminary hardware/software partitioning trade-offs. Transaction-level models are of course useful, but our experience is that it is quite difficult to exceed a simulation speed of 100 Kcycles/s for a generic kernel of a modern embedded system consisting of a general-purpose microprocessor, cache memory, main memory and peripherals, and possibly some custom hardware and DSPs. In order to analyze many different system alternatives, simulation speed must be increased by at least an order of magnitude, and timed functional models can help to achieve this.
We selected SystemC for describing timed functional models due to its wide industrial adoption as a system-level design language and also in order to facilitate the integration of IPs described at different abstraction levels.
4 Communication model
ACES allows tasks to communicate by means of both events and shared variables. Events are a semi-synchronizing communication primitive that is both powerful enough to represent practical design specifications and efficiently implementable within hardware, within software and between the two domains. ACES associates an event with each explicit communication link between behavioral tasks. Events can be generated by a task (by writing to an output port) or by the environment, and can then be received some time later by another task or by the environment. At the behavioral level, events are broadcast to all receiving modules (no architectural or communication delays are taken into account). At the architectural level, more accurate information is used regarding the penalty incurred for accessing a specific architectural link (e.g. a bus). Shared variables are mainly used to allow tasks to share resources (e.g. memories) and to perform read and write operations by accessing a global data structure instead of having to describe explicit ports at the behavioral level. ACES can automatically generate the explicit physical ports required for performing those accesses at the architectural level, and the additional logic required to ensure mutually exclusive access to those shared resources in case of concurrent accesses.
5 Interface synthesis
Communication through buses between a CPU and hardware modules is typically implemented as memory-mapped input-output (I/O), where shared registers or I/O devices are allocated a certain address space and the CPU can read and write them by performing ordinary memory access operations. From the hardware modules, such registers can be accessed like any other component, and hardware modules often send interrupt or trap signals to CPUs.
The role of interfaces in ACES is to implement an event-based communication mechanism across hardware and software and to ensure mutual exclusion among concurrent accesses to a shared resource. ACES uses a standard event communication mechanism over the various possible interfaces: software-to-software, hardware-to-hardware, hardware-to-software and software-to-environment. An automatic interface synthesis process handles memory-mapped and port-based I/O transparently to the user. The communication model assumes that the memory address bus of the processor is accessible, and uses a library of processor-dependent interface modules to adapt the protocol. Memory-mapped addressing involves the synthesis of address decoders for both software-to-hardware and hardware-to-software interfaces.
Figure 2: Architectural template.
For a first feasibility study, we have experimented with the architectural template shown in Figure 2, in which a generic CPU interface is placed between the processor (the NEC V850 micro-controller) and the hardware world. The CPU interface is described in SystemC, so that it can be synthesized by CYBER. It uses a generic number of I/O ports for the connections with the hardware and a fixed set of signals (SlaveAddress, SlaveWriteData, ...) for the connections with the processor peripheral bus; these correspond to the signals used in the bus model provided in CLASSMATE. Each signal going from hardware to software (FromHW[0:(N-1)]) has an associated event presence bit (eFlags[0:(N-1)]), as explained in the following section, while signals going from software to hardware (ToHW[0:(N-1)]) do not carry any event presence bit. The code of the CPU interface is fixed, while the number of I/O ports to and from the hardware world can be programmed. Each I/O port is 16 bits wide and is associated with a specific relative address in the memory-mapped I/O section of the processor, while the absolute address of the memory-mapped I/O section can be programmed. The automatic interface synthesis process generates the description of the connections between the CPU interface and the hardware and software worlds.
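From the software side, access to such a CPU interface could look like the following C sketch. It is a host-runnable illustration, not the ACES interface code: the register layout (ToHW ports first, then FromHW ports, then a single word holding the eFlags bits), the value of N_PORTS and all function names are assumptions made for this example. Only the 16-bit port width, the programmable base address and the presence of event bits on the hardware-to-software direction come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define N_PORTS 4  /* assumed number of ports in each direction */

/* Assumed relative layout of the memory-mapped I/O section, in 16-bit words. */
enum {
    TOHW_OFF   = 0,               /* ToHW[0..N-1]: software-to-hardware data */
    FROMHW_OFF = N_PORTS,         /* FromHW[0..N-1]: hardware-to-software data */
    EFLAGS_OFF = 2 * N_PORTS,     /* one word holding the eFlags bits */
    IF_WORDS   = 2 * N_PORTS + 1
};

typedef struct {
    volatile uint16_t *base;  /* programmable absolute base of the I/O section */
} cpu_if_t;

/* Software-to-hardware: a plain write, no event presence bit involved. */
void to_hw_write(cpu_if_t *c, unsigned port, uint16_t value)
{
    assert(port < N_PORTS);
    c->base[TOHW_OFF + port] = value;
}

/* Hardware-to-software: the data is meaningful only if the event bit is set.
 * Returns 1 and stores the value if an event is present, 0 otherwise. */
int from_hw_read(cpu_if_t *c, unsigned port, uint16_t *value)
{
    assert(port < N_PORTS);
    if (((c->base[EFLAGS_OFF] >> port) & 1u) == 0) {
        return 0;
    }
    *value = c->base[FROMHW_OFF + port];
    return 1;
}
```

On a real target, base would point at the programmed absolute address of the memory-mapped I/O section; on a host it can simply point at an array of IF_WORDS 16-bit words, which is how the sketch can be exercised without hardware.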
6 Real-time Operating System Synthesis
The purpose of the real-time operating system (RTOS) is to integrate the code for individual tasks implemented in software and to ensure that they, together with tasks implemented in hardware, implement a valid network of tasks. To achieve this goal, the RTOS needs to perform the following functions:
- schedule software tasks such that each one is executed in a timely manner,
- provide a mechanism for event emission and detection between software tasks,
- implement the desired semantics of consumption of input events in the software tasks,
- provide a mechanism for transferring events between software tasks on one side and hardware tasks on the other side.
ACES generates an RTOS that is tailored for a specific network of tasks. Note that, instead of automatically generating an RTOS, it would be possible to use a commercially available one. For this we would only need to implement the event emission and detection mechanisms using the event services provided by the RTOS, and provide the RTOS scheduler with enough information (usually task execution times and deadlines) to enable it to perform its duties. However, we believe that our approach has several advantages. Firstly, since the RTOS has a fixed communication structure (neither the number of tasks nor the sensitivity of tasks to events changes over the lifetime of the generated RTOS), the emission and detection of events can be implemented very efficiently, and in some cases (when a task is sensitive to a single event) avoided completely. Secondly, since only the necessary functionality is generated, the generated RTOS is often much smaller than commercial ones. Finally, in our approach one can easily experiment with trade-offs, e.g. between scheduling policies or between different event input mechanisms (polling vs. interrupts). Commercial RTOSs typically do not provide such flexibility.
6.1 Communicating events between software modules
When a software module emits an event (i.e. it updates an output), all modules sensitive to that event must be informed of it and enabled. To every software module we assign (during the software synthesis step) a set of private flags, one for each input signal, to indicate whether that event has occurred since the previous transition. Once it is scheduled to run, a software module checks its input flags to decide which (if any) of its transitions to execute. Thus, the emission of an event consists of setting all the appropriate flags and enabling all the appropriate tasks, and the detection of an event is a simple check on the status of a flag.
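The flag-based mechanism described above can be sketched in C as follows. This is a minimal illustration, assuming a small fixed network: the task and signal counts, the sens table and the names task_t, emit_event and event_present are hypothetical and do not come from the code that ACES generates. The sensitivity table is constant because the communication structure is fixed at synthesis time.

```c
#include <assert.h>
#include <stdbool.h>

#define N_TASKS    3  /* assumed size of the example network */
#define N_SIGNALS  2
#define MAX_INPUTS 2

typedef struct {
    bool flag[MAX_INPUTS];  /* private flags, one per input signal of the task */
    bool enabled;           /* set when the scheduler should run the task */
} task_t;

/* sens[s][t] is the input index of signal s in task t, or -1 if task t
 * is not sensitive to signal s. Fixed when the RTOS is generated. */
static const int sens[N_SIGNALS][N_TASKS] = {
    { 0, -1, 0 },   /* signal 0 feeds input 0 of tasks 0 and 2 */
    { -1, 0, 1 },   /* signal 1 feeds input 0 of task 1 and input 1 of task 2 */
};

/* Emission: set the flag of every sensitive task and enable that task. */
void emit_event(task_t tasks[N_TASKS], int sig)
{
    assert(sig >= 0 && sig < N_SIGNALS);
    for (int t = 0; t < N_TASKS; t++) {
        int input = sens[sig][t];
        if (input >= 0) {
            tasks[t].flag[input] = true;
            tasks[t].enabled = true;
        }
    }
}

/* Detection: a simple check on the status of a private flag. */
bool event_present(const task_t *task, int input)
{
    assert(input >= 0 && input < MAX_INPUTS);
    return task->flag[input];
}
```

Because the table is known statically, a generator could unroll the loop in emit_event into a few direct assignments per signal, which is one way to read the efficiency argument made for the tailored RTOS.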
6.2 Consumption of events
A software module is executed whenever any one of its input events occurs. Thus, it may happen that the module is executed but no transition is enabled, so that none is executed (i.e. an empty execution may occur). The RTOS needs to ensure that in this case input events are not consumed, but rather preserved for the next execution. This behavior might result in a lot of context switching activity that could be avoided if the operating system were made aware of the list of events that can enable a transition in each software task. We are currently working on modifying this behavior so that a module is executed only when it receives an event that will enable a transition. This requires the operating system to keep track of the list of events that each module is waiting for. Prior to scheduling a task, the operating system will check whether any of the input events present for the task appear in the list of events awaited by that task, and only in that case will it actually allow the task to run.
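Both the current behavior and the planned refinement can be sketched with event bit masks in C. The names (sw_task_t, run_task, should_schedule) are hypothetical, and the sketch makes two simplifying assumptions that the text does not state: a transition is enabled when at least one pending event is in the task's awaited set, and a fired transition consumes all pending input events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pending;  /* input events that have occurred, one bit per event */
    uint32_t awaited;  /* events that can enable a transition in this task */
} sw_task_t;

/* Current behavior: the task is run on any input event. Returns true if a
 * transition fired. On an empty execution nothing is consumed, so the
 * pending events are preserved for the next execution. */
bool run_task(sw_task_t *t)
{
    if ((t->pending & t->awaited) == 0) {
        return false;  /* empty execution: events stay pending */
    }
    t->pending = 0;    /* assumption: a fired transition consumes all inputs */
    return true;
}

/* Planned behavior: the scheduler checks the awaited list first and skips
 * the task (and the context switch) when no transition can be enabled. */
bool should_schedule(const sw_task_t *t)
{
    return (t->pending & t->awaited) != 0;
}
```

For a task that awaits only event bit 1, the arrival of event bit 0 leaves should_schedule false and the event pending; once bit 1 also arrives, the task is scheduled and its transition fires.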
6.3 Communicating events from software to hardware
Events emitted by software modules and consumed by hardware modules are communicated through I/O ports. Each such signal is assigned one specific output port to carry the data value. To emit an event to hardware modules, the RTOS simply writes the data value to the associated port. No event presence flag is generated. At the interface level, the data value is then latched and the output of the latch is connected to the hardware, which by definition is always ready.
6.4 Communicating events from hardware to software
Events emitted by a hardware module are delivered to a software component either by polling or by interrupt. Both ways are widely used in the design of embedded systems. In general polling is cheaper and more predictable, while interrupts offer superior performance. In ACES, the user needs to choose which mechanism to use for which signal, based on the requirements of the design, and the interrupt handling capabilities of the processor used in the implementation.
Figure 3: Hardware-to-software interface.
To every signal that is communicated by polling, ACES assigns one input and one acknowledge I/O port bit, and as many input I/O port bits as necessary to store its data value (see Fig. 3). To emit such an event, a hardware module is slightly modified by adding a one-bit auxiliary port (E). Whenever the output port associated with the event bit is written, the event bit is complemented. At the interface level, an Event Generator first transforms each change of the event bit into a pulse lasting one clock cycle, and an Event Stretcher circuit then keeps the event flag high for as long as the software needs to receive the event and generate an acknowledge. This is an asynchronous communication scheme that allows hardware and software to communicate independently of the actual communication medium (e.g. a bus) that will be used in the final implementation. On the software side, the event is then handled by the polling task, which ACES generates automatically if there are any polled signals. This task is run like any other task and can be enabled by a designated signal. By default, ACES designates a signal named poll_trigger for this purpose. The signal is emitted by the scheduler once every round-robin cycle (in a round-robin scheduler) or whenever there are no enabled tasks (in a priority-based scheduler). However, the user can change this and let the polling task be enabled by any other (interrupt-driven) signal.
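The behavior of the Event Generator and Event Stretcher can be approximated by a cycle-level C model such as the one below. It is a behavioral sketch rather than the synthesized circuit: the names hw2sw_if_t and hw2sw_clock are hypothetical, and the choice that a new event takes priority over an acknowledge arriving in the same cycle is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool prev_e;  /* value of the event bit E seen in the previous cycle */
    bool flag;    /* stretched event flag visible to the software */
} hw2sw_if_t;

/* Models one clock cycle of the interface.
 * e   : current value of the toggling event bit E from the hardware module
 * ack : acknowledge written by the software in this cycle
 * Returns the value of the stretched event flag after the cycle. */
bool hw2sw_clock(hw2sw_if_t *s, bool e, bool ack)
{
    assert(s != 0);

    /* Event Generator: any change of E becomes a one-cycle pulse. */
    bool pulse = (e != s->prev_e);
    s->prev_e = e;

    /* Event Stretcher: hold the flag high until the software acknowledges.
     * Assumption: a new pulse wins over an acknowledge in the same cycle. */
    if (pulse) {
        s->flag = true;
    } else if (ack) {
        s->flag = false;
    }
    return s->flag;
}
```

Starting from a cleared state, toggling E raises the flag, the flag stays high over any number of idle cycles, an acknowledge clears it, and the next toggle of E (in either direction) raises it again. A polling task would read the flag, emit the corresponding software event and then write the acknowledge bit.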
ACES assigns one abstract interrupt vector to every signal that is communicated by interrupt (as well as the required number of I/O bits to store its data value). To emit such an event, a hardware module first sets the data I/O bits and then requests an interrupt. When the interrupt is granted, the corresponding interrupt service routine (ISR) is executed. By default, an ISR contains only an event emission routine. However, the user has the option to specify that all software modules sensitive to some specific event are also to be executed inside the ISR. In this way, the most critical tasks can be given immediate attention. For signals communicated by interrupt there is no associated acknowledge I/O port bit. The assumption is that the acknowledge logic is implemented directly in the interrupt controller of the target processor (standard controllers generally do this). If the target processor does not provide an output acknowledge port, an additional port (see Section 6.6) is reserved by the RTOS, and in the ISR this port is set and then immediately reset in order to generate the acknowledge bit for the hardware.
In the generated RTOS, the parts of the code that need to be executed atomically are surrounded by the ATOMIC and END_ATOMIC macros. The proper definition of these macros (usually by disabling and enabling interrupts) is the responsibility of the target-specific library included in the generated RTOS.
In ACES each transition of a task must be performed atomically; that is, the values of the input event buffers for that task must not change once the transition has started. It is important to notice that this does not mean that the task cannot be interrupted while it is performing its transition (e.g. to serve an interrupt), but only that, when resumed, it will continue to see in its input buffers the values that it found at the beginning of the transition.
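One possible way to obtain this behavior is to copy the input buffers into a private snapshot inside a short critical region and let the transition work only on the snapshot. The C sketch below illustrates the idea on a host machine. It is an assumption about how such a scheme could be built, not the generated RTOS code: the macros are written here as ATOMIC and END_ATOMIC, a counter stands in for the interrupt mask that a target-specific library would manipulate, and begin_transition and end_transition are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-in for the target-specific library: a counter replaces the
 * real interrupt disable/enable instructions. */
static int irq_disabled;
#define ATOMIC      do { irq_disabled++; } while (0)
#define END_ATOMIC  do { irq_disabled--; } while (0)

/* Takes a private snapshot of the live input event buffer. The transition
 * then reads only the snapshot, so events arriving later (for example from
 * an ISR) do not change what the running transition sees. */
uint32_t begin_transition(volatile uint32_t *live)
{
    uint32_t snapshot;
    ATOMIC;
    snapshot = *live;
    END_ATOMIC;
    return snapshot;
}

/* Removes only the consumed events from the live buffer, so events that
 * arrived during the transition are kept for the next execution. */
void end_transition(volatile uint32_t *live, uint32_t consumed)
{
    ATOMIC;
    *live &= ~consumed;
    END_ATOMIC;
}
```

If event bit 0 is pending when the transition starts and event bit 1 arrives while it runs, the snapshot still contains only bit 0, and after end_transition the live buffer contains only bit 1.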
6.6 I/O ports
The RTOS uses ports to communicate events to and from the outside world. ACES maintains a pool of abstract input, output and acknowledge ports and assigns them to signals as necessary. These abstract ports are implemented either by existing I/O ports of the micro-controller, or by memory-mapped ports (for which ACES also generates necessary hardware). The actual I/O capabilities of a specific micro-controller are described in a specific processor configuration file.
ACES scans the list of required I/O signals for the software partition, and assigns each of them to the first available I/O port with the required number of adjacent available bits. When the I/O ports are exhausted, memory-mapped I/O is used, starting from the address specified as hwsw-sec-start-addr.
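The assignment policy described above is essentially a first-fit allocation of adjacent bits, with memory-mapped ports as the fallback. A minimal C sketch follows. The number and width of the native ports, the 2-byte address step for the 16-bit memory-mapped ports and all identifiers are assumptions for illustration; in ACES the real port capabilities come from the processor configuration file, and the first memory-mapped address is the value given as hwsw-sec-start-addr.

```c
#include <assert.h>
#include <stdint.h>

#define N_NATIVE_PORTS 2   /* assumed: native I/O ports of the micro-controller */
#define PORT_BITS      8   /* assumed: width of each native port */
#define MM_PORT_BITS   16  /* memory-mapped ports are 16 bits wide */
#define MM_PORT_BYTES  2   /* assumed: byte addressing, one port per 2 bytes */

typedef struct {
    uint8_t  used[N_NATIVE_PORTS];  /* bit mask of allocated bits per port */
    uint32_t next_mm_addr;          /* initialized to hwsw-sec-start-addr */
} port_pool_t;

typedef struct {
    int      memory_mapped;  /* 0: native port, 1: memory-mapped port */
    unsigned port;           /* native port index (if not memory-mapped) */
    unsigned first_bit;      /* lowest bit of the allocated field */
    uint32_t addr;           /* address (if memory-mapped) */
} assignment_t;

/* Assigns a signal of the given bit width to the first native port that
 * has enough adjacent free bits, or to a new memory-mapped port. */
assignment_t assign_signal(port_pool_t *pool, unsigned width)
{
    assignment_t a = { 0, 0, 0, 0 };
    assert(width >= 1 && width <= MM_PORT_BITS);

    if (width <= PORT_BITS) {
        unsigned base_mask = (1u << width) - 1u;
        for (unsigned p = 0; p < N_NATIVE_PORTS; p++) {
            for (unsigned bit = 0; bit + width <= PORT_BITS; bit++) {
                unsigned mask = base_mask << bit;
                if ((pool->used[p] & mask) == 0) {
                    pool->used[p] |= (uint8_t)mask;
                    a.port = p;
                    a.first_bit = bit;
                    return a;
                }
            }
        }
    }

    /* Native ports exhausted (or signal too wide): use memory-mapped I/O. */
    a.memory_mapped = 1;
    a.addr = pool->next_mm_addr;
    pool->next_mm_addr += MM_PORT_BYTES;
    return a;
}
```

With two 8-bit native ports, signals of width 5, 5 and 3 are placed at port 0 bit 0, port 1 bit 0 and port 0 bit 5 respectively, and a following 4-bit signal no longer fits in either native port, so it is placed at the first memory-mapped address.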
6.7 Interrupt handlers
In addition to I/O ports, interrupts can also be used for communication from hardware modules (or from the outside world) to software modules. As with I/O ports, the RTOS views the micro-controller’s capabilities in terms of abstract interrupt vectors. They are identified by the numbers 0,...,n, where n is some arbitrary number. Each vector essentially associates an abstract interrupt input with a handler routine. The association is handled by the RTOS with two macros for each vector i: DECLARE_ISR_i and INSTALL_ISR_i. These macros must be defined in a header file called uC_intr.h, which resides in a target-specific directory and which ACES includes in the generated RTOS.
The macro DECLARE_ISR_i defines the interrupt service routine of the i-th interrupt. The macro INSTALL_ISR_i is called in the initialization routine that is executed at the very beginning of the execution. It is typically used to store the address of the interrupt service routine in the interrupt vector table.
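A target-specific header could define the two macros along the following lines. This is a host-side C sketch under explicit assumptions: the macro names are written as DECLARE_ISR_i and INSTALL_ISR_i (C identifiers cannot contain spaces), a plain array stands in for the hardware interrupt vector table, and vector_table, pending_events, rtos_init and raise_interrupt are hypothetical names. A real uC_intr.h would typically rely on compiler-specific interrupt attributes or pragmas instead.

```c
#include <assert.h>

#define N_VECTORS 2  /* assumed number of abstract interrupt vectors */

typedef void (*isr_fn)(void);

/* Host stand-in for the interrupt vector table of the target processor. */
static isr_fn vector_table[N_VECTORS];

/* Events emitted by the ISRs, one bit per abstract vector. */
static unsigned pending_events;

/* What a target-specific header might provide for vectors 0 and 1. */
#define DECLARE_ISR_0  void isr_0(void)
#define INSTALL_ISR_0  (vector_table[0] = isr_0)
#define DECLARE_ISR_1  void isr_1(void)
#define INSTALL_ISR_1  (vector_table[1] = isr_1)

/* By default an ISR contains only the event emission routine. */
DECLARE_ISR_0 { pending_events |= 1u << 0; }
DECLARE_ISR_1 { pending_events |= 1u << 1; }

/* Initialization routine executed at the very beginning of execution. */
void rtos_init(void)
{
    INSTALL_ISR_0;
    INSTALL_ISR_1;
}

/* Simulates the hardware granting an interrupt on abstract vector v. */
void raise_interrupt(unsigned v)
{
    assert(v < N_VECTORS && vector_table[v] != 0);
    vector_table[v]();
}
```

After rtos_init has run, raising vector 1 calls isr_1 through the table and sets bit 1 of pending_events, which mirrors the default behavior of an ISR that only emits its event.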
7 Communication synthesis
Once each component (i.e. task) has been mapped to either hardware or software, the boundary between the hardware and the software world identifies a specific set of signals that become the actors in the interfacing mechanism between hardware and software. This set of signals is made available to the designer in a tabular format that can be visualized using a common web browser. The table of signals presents the following fields:
<name> <source> <destination> <implem_style>
The name field contains a unique name that identifies a particular connection. The source field identifies the module that writes to that signal, and the destination field specifies the module that reads it. Finally, the implem_style field is used to distinguish between polling-based and interrupt-based implementations. For hardware-to-software communications, the designer can choose one of the two implementation styles, while for software-to-hardware communications the implem_style field is always set to memory-mapped and cannot be changed.
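The constraint on the implementation style can be captured by a small validity check over the rows of the table. The C sketch below is illustrative only: the row type, the enumerations and the function name are hypothetical, and the src_dom and dst_dom fields are an assumption of this example (they represent the partitioning decision for the source and destination modules, which is not one of the four columns of the table).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DOM_HW, DOM_SW } domain_t;
typedef enum { STYLE_POLLING, STYLE_INTERRUPT, STYLE_MEMORY_MAPPED } style_t;

typedef struct {
    const char *name;         /* unique name of the connection */
    const char *source;       /* module that writes the signal */
    const char *destination;  /* module that reads the signal */
    domain_t    src_dom;      /* assumed: mapping of the source module */
    domain_t    dst_dom;      /* assumed: mapping of the destination module */
    style_t     style;        /* the implem_style field */
} signal_row_t;

/* Checks that a boundary signal uses an implementation style allowed for
 * its direction: polling or interrupt for hardware-to-software, and always
 * memory-mapped for software-to-hardware. */
bool row_is_valid(const signal_row_t *r)
{
    assert(r != 0);
    if (r->src_dom == DOM_SW && r->dst_dom == DOM_HW) {
        return r->style == STYLE_MEMORY_MAPPED;
    }
    if (r->src_dom == DOM_HW && r->dst_dom == DOM_SW) {
        return r->style == STYLE_POLLING || r->style == STYLE_INTERRUPT;
    }
    return false;  /* not a hardware/software boundary signal */
}
```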
After all the choices have been made, the designer can simply push a button on the browser in order to generate the hardware/software interfaces that can then be used in order to perform an accurate architectural simulation in CLASSMATE.
We believe that a C-based design methodology built on high-level synthesis systems represents a paradigm shift in SoC design. ACES is an in-house design flow that leverages high-level synthesis and communication synthesis to assist the designer in the hardware/software partitioning and architecture selection phases.
We have presented an initial feasibility study for providing automatic hardware/software integration capabilities in the ACES C-based system-level design flow. All the components necessary for detailed hardware/software co-verification can be generated automatically by ACES and this enables designers to quickly experiment with different system architectures using an intuitive graphical environment, and rapidly evaluate them through cycle-accurate simulation without having to perform the tedious and error-prone task of hardware/software integration.
The author would like to thank Aldo Dolfi and Francesco Regazzoni for their contributions to the implementation of the ideas described in this paper.
 Mentor’s Application Specific Assistant Processor
 K. Wakabayashi and T. Okamoto, “C-Based SoC Design Flow and EDA Tools: An ASIC and System Vendor Perspective,” IEEE Trans. Computer-Aided Design, vol. 19, pp. 1507–1522, Dec. 2000.
 SystemC Home Page: .
 F.Balarin, M.Chiodo, P.Giusto, H.Hsieh, A.Jureska, L.Lavagno, C.Passerone, A.Sangiovanni-Vincentelli, E.Sentovich, K.Suzuki, and B.Tabbara, Hardware-software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Norwell, MA., 1997.