An Embedded Processor Architecture With Extensive Support For SoC Debug
by Richard Curnow, Mark Hill, Andrew Jones, SuperH (UK) Ltd.
The SH-5 is a 64-bit embedded architecture developed and licensed by SuperH, Inc. The SH-5 is a general purpose architecture with broad support for multimedia software through its SIMD instructions. A rich set of CPU-related debug features has been designed to support debug of the SH-5 CPU inside an SoC product.
An integrated bus analyzer monitors transactions on externally connected IP blocks and the SuperHyway (an on-chip interconnect fabric designed for high performance SoC designs). This gives developers much better visibility of the behaviour of their SoC. A debug module (SHdebug) provides external interfaces through which the SuperHyway can be extended off-chip, supporting both off-chip control and monitoring of IP blocks as well as allowing the SH-5 to execute debug support code with no intrusion into the debugged software's memory footprint.
This paper provides a description of the SH-5 debug architecture, compares it with some other SoC (System-on-Chip) debug approaches and gives examples of how it can be applied to debug particular scenarios.
High performance microprocessors require class-leading debug features in order to deliver complex SoC based products to market on time.
The SH5-103 core is an implementation of the 64-bit SH-5 processor family and delivers 604 Dhrystone 2.1 MIPS at a 400MHz clock frequency. It has two instruction modes: SHmedia (32-bit encoding supporting the full capabilities) and SHcompact (16-bit encoding, backwards-compatible with SH-4 and earlier cores, offering high code density). A block diagram (figure 1) shows how the core is integrated with other intellectual property (IP) blocks, and the debug architecture described in this paper.
Figure 1: SH5-103 & its debug architecture
SoC designs involve the integration of many separate intellectual property (IP) blocks. Typically, software for SoC based products will include device drivers for the various IP blocks, and the drivers can interact in highly complex ways. Products such as interactive set-top boxes commonly have millions of lines of software associated with them. Consequently software development can dominate the time to market for such products. The provision of efficient debug hardware can make the difference between a product hitting or missing its market opportunity.
The integration of IP blocks into an SoC design means that board-level system debug approaches, such as probing signals with a logic analyser, are largely ineffective since most of the interesting transactions don't come off-chip. The SoC design needs to support visibility of the system interconnect traffic in a manner comprehensible to a software debugger.
The SH-5 has a rich set of CPU-level debug features, such as hardware breakpoints, a variety of watchpoints and branch trace. A key feature of systems based on the SH-5 core is the ability to couple the hardware debug support for on-chip interconnect and integrated IP with the CPU-level debug features. This coupling allows a debugger to observe the processing of particular instructions, the transmission of bus transactions and the activity of integrated IP through a single debug interface.
Some embedded software applications rely on real-time performance. For such systems, intrusive debug (e.g. interrupting normal code execution to execute debug code) may not be possible. Such applications are supported in SH-5 by non-intrusive debug capabilities (such as trace).
3. DEBUG FRAMEWORK
The debug features of SH-5 split naturally into CPU-oriented debug, system-oriented debug, and interactions between the two.
The CPU-oriented features are those which allow monitoring or breakpointing of instruction processing. The system-oriented features are those which allow monitoring or breakpointing of the SuperHyway interconnect and integrated IP.
In addition, there is a significant amount of hardware support shared between CPU and system debug. This includes a Debug Module (DM) which drives a high-speed debug link which can be used for trace or for extending the SuperHyway off-chip to a debug adaptor, as shown in figure 1.
The DM provides both initiator and target capability to the debug adaptor. This allows debug software executing on a host system to probe and manipulate many aspects of the SH-5 system operation (especially peripherals), without needing to perturb the SH-5 CPU execution in any way, thereby avoiding any disturbance to the real-time behaviour of the SH-5 target software.
The CPU and system debug are linked together through chain latches. These allow events detected in one realm to influence the detection of events in the other. A common feature is the ability to filter events that may generate trace or launch exceptions. This is further discussed in section 6.
4. ADVANCED CPU DEBUG
The SH-5 CPU contains a watchpoint controller (WPC) which supports a rich set of features, which are described below.
4.1 Breakpoints and watchpoints
The SH-5 architecture provides a set of hardware watchpoints. Instruction address (IA) and operand address (OA) watchpoint channels can each match on a range of addresses. The OA channels have an option to allow matching on the stored value as well as the address stored to. A BRK instruction allows for arbitrarily many patched breakpoints in RAM.
Instruction value (IV) watchpoints are a novel feature of SH-5, and allow SHmedia instructions to be matched against a mask/value specification. The regularity of the SHmedia encoding allows matching such as:
- Uses of a particular register as a source or destination of an instruction
- All instructions of a particular type
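Mask/value matching of this kind can be sketched as follows. The sketch is illustrative only: the position and width of the destination register field, the example instruction words and the helper names `iv_match` and `dest_register_spec` are assumptions made for the example, not details taken from the SH-5 architecture definition.

```python
# Illustrative model of an instruction value (IV) watchpoint.
# The field layout below is an assumption for this sketch, not the
# documented SHmedia encoding.

DEST_SHIFT = 4                    # assumed position of the destination field
DEST_MASK = 0x3F << DEST_SHIFT    # assumed 6-bit register field


def iv_match(instruction, mask, value):
    """Return True when the masked instruction bits equal the masked value."""
    return (instruction & mask) == (value & mask)


def dest_register_spec(reg):
    """Build a (mask, value) pair matching any instruction that writes `reg`."""
    return DEST_MASK, (reg & 0x3F) << DEST_SHIFT


# Match every instruction whose (assumed) destination field names register 18.
mask, value = dest_register_spec(18)
words = [0x00000120, 0xABCD0125, 0x00000130]   # made-up instruction words
hits = [w for w in words if iv_match(w, mask, value)]
```

Bits outside the mask are ignored, so one mask/value pair covers every instruction sharing the selected field, whatever its opcode or other operands.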
A set of actions can be programmed for each IA, IV and OA channel, which is applied whenever the channel detects a match. Actions include:
- raise a debug exception
- modify the state of chain latches, event counters or performance counters
- generate a trace message
Only the exception action is intrusive; the others may occur without any effect on the flow or timing of the software running on the SH-5.
4.2 Single step and nested exceptions
The SH-5 has a single-step feature, where an exception is launched on the completion of each instruction. This is activated through a bit in the machine status register, so single-stepping can be configured on a per-thread basis and saved through context switches.
The SH-5 handles nested exceptions (referred to as ‘panics’) which occur within the critical section of another exception handler. Thus watchpoints can be safely used within exception and interrupt handlers, and such handlers can be single-stepped.
4.3 Branch tracing
A branch trace channel can be programmed to generate trace messages when particular types of branch occur. Any combination of the following branch types can be selected:
- Unconditional branches
- Conditional branches
- Exception & interrupt launches
- Return from exception/interrupt (RTE)
4.4 Performance counters
An activity related to debug is that of software performance optimisation. The SH-5 supports this with performance counters. These can be set to count on a range of conditions (e.g. branches taken/not taken, cache misses/hits), or on the number of hits taken by the instruction and operand address/value watchpoints described in section 4.1.
4.5 Shadow PC register
There is a read-only memory-mapped register containing a copy of the current PC. The register also provides the current ASID value. Reading this register is non-intrusive. The primary uses of this register are:
- Investigating what an apparently unresponsive program is doing
- Profiling: if the register is read periodically, a histogram of PCs can be constructed. This allows a program to be profiled without requiring recompilation.
5. ADVANCED SYSTEM DEBUG
The system-level debug features of the SH-5 are described in the following sections.
5.1 Debug module
The SH-5 core has an IP block called the debug module (DM), which is dedicated to supporting the attachment of a debug tool and managing CPU and system trace.
The DM provides a debug-link interface through which the on-chip SuperHyway interconnect can be brought off-chip. The debug link can either be layered onto the SH-5's JTAG port, or it can use the dedicated higher-performance SHdebug link connection. The debug link's functional capabilities are the same regardless of which external connection is used.
A 16Mbyte region of the SH-5 physical address space is mapped to remote memory across the debug link. Any access made to this region is passed to the debug adaptor (or host). This allows debug code to reside outside the SH-5 on-board memory. In particular, no debug monitor ROM is required, nor does space need to be provided in on-board RAM for use by the debug monitor. Combined with the CPU control features described in section 5.3, this allows an SH-5 board to be booted with no onboard ROM at all. This novel feature is very convenient for software development, since it avoids the need to keep removing an onboard ROM to reprogram it, as would be necessary for most systems. More importantly, it allows memory resources on the host to be used transparently. This means that SH-5 based systems can be debugged without perturbing the memory footprint of the application in order to make space for debug code or data.
In the other direction, the debug adaptor can use the debug link to inject any type of SuperHyway request into the SH-5. Any SH-5 physical location can be read or written from the debug host without any co-operation from the SH-5 CPU, and with no direct effect on the SH-5 CPU's operation. This has benefits such as these:
- Access to the memory-mapped registers inside the SH-5 CPU which allow its execution state to be monitored and manipulated. For example, the SH-5 CPU can be suspended, its boot address can be changed and a reset can be forced.
- IP Blocks can be interrogated while applications are in progress to monitor specific events. For example, the host debug software can poll an IP block's interrupt flag independently of any CPU activity. Any IP block using memory-mapped locations can be examined in this way, and be debugged in a uniform manner.
- SuperH IP blocks contain standard memory-mapped status flags which report on the occurrence of errors. These can be polled to determine which blocks participated in erroneous transactions.
- Any external memory, flash, DRAM and SRAM can be accessed directly from the host. It is even possible to reprogram flash memories in-situ using host-originated operations.
- Configuration of CPU or SuperHyway watchpoints, chain-latches and other debug features can all be done without executing target code.
The debug module handles trace message generation. A 1kbyte buffer inside the DM allows rate matching between generating trace on-chip and sending the trace off-chip. Trace messages can be sent to any of 3 destinations:
- Left inside the FIFO inside the DM
- A buffer anywhere in the SH-5 physical memory map (typically in RAM)
- the off-chip debug link
The FIFO and memory buffer can be operated in circular or hold modes (i.e. retain newest or oldest trace respectively once the buffer is full).
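The difference between the two modes can be illustrated with a small software model. This is a sketch of the behaviour described above, not of the DM hardware; the class name `TraceBuffer`, its interface and the capacity of three messages are assumptions chosen for the example.

```python
from collections import deque


class TraceBuffer:
    """Fixed-capacity trace buffer with 'circular' or 'hold' behaviour."""

    def __init__(self, capacity, mode):
        if mode not in ("circular", "hold"):
            raise ValueError("mode must be 'circular' or 'hold'")
        self.capacity = capacity
        self.mode = mode
        self.messages = deque()

    def push(self, message):
        """Store a message; return False if it was dropped (hold mode, full)."""
        if len(self.messages) < self.capacity:
            self.messages.append(message)
            return True
        if self.mode == "circular":
            self.messages.popleft()       # discard the oldest entry
            self.messages.append(message)
            return True
        return False                      # hold mode: keep the oldest entries


circular = TraceBuffer(3, "circular")
hold = TraceBuffer(3, "hold")
for n in range(5):                        # five messages into three slots
    circular.push(n)
    hold.push(n)
```

After five messages the circular buffer holds the three newest (2, 3, 4) and the hold buffer the three oldest (0, 1, 2). Circular mode suits capturing the history leading up to a failure, whereas hold mode suits capturing what follows a trigger.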
The FIFO and debug link modes do not affect the SuperHyway, so are less intrusive than the RAM buffer mode in that respect.
These options provide a trade-off of performance against intrusiveness. Since all these destinations run slower than the CPU, a choice of two options exists when the CPU generates more traceable hits than can be sent to the selected destination. These are:
- Stall the CPU until more hits can be processed (potentially intrusive but loss-less)
- Discard hits until more hits can be processed (non-intrusive but potentially lossy)
5.2 Bus analyzer
Because a logic analyzer cannot be connected to the system bus on an SoC, the SH-5 contains two independent bus analyzer channels (BA) that can be programmed to monitor the on-chip SuperHyway interconnect.
Each BA is a module attached to the on-chip SuperHyway bus. A BA module can be programmed to match particular conditions arising on the bus (e.g. request made by a particular module, request accessing a particular physical address).
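A bus analyzer channel can be thought of as a filter applied to every transaction on the interconnect. The following sketch models that idea in software. The class and field names, the initiator names and the write-only qualifier are assumptions made for illustration; the real programming model of the BA registers is not described here.

```python
class Transaction:
    """One request observed on the interconnect (simplified)."""

    def __init__(self, initiator, address, is_write):
        self.initiator = initiator    # name of the module making the request
        self.address = address        # physical address accessed
        self.is_write = is_write


class BusAnalyzerChannel:
    """Hypothetical match specification for one bus analyzer channel."""

    def __init__(self, initiator=None, addr_lo=0, addr_hi=0xFFFFFFFF,
                 writes_only=False):
        self.initiator = initiator    # None means "any initiator"
        self.addr_lo = addr_lo
        self.addr_hi = addr_hi
        self.writes_only = writes_only

    def matches(self, t):
        if self.initiator is not None and t.initiator != self.initiator:
            return False
        if not (self.addr_lo <= t.address <= self.addr_hi):
            return False
        if self.writes_only and not t.is_write:
            return False
        return True


# Watch for writes, from any initiator, to a four-byte device register.
channel = BusAnalyzerChannel(addr_lo=0x1000, addr_hi=0x1003, writes_only=True)
traffic = [
    Transaction("cpu", 0x1000, True),
    Transaction("dma", 0x1002, True),
    Transaction("cpu", 0x1000, False),
    Transaction("dma", 0x2000, True),
]
hits = [t.initiator for t in traffic if channel.matches(t)]
```

Leaving the initiator unspecified is what allows a write from an unexpected module, such as a DMA engine, to be caught alongside CPU writes.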
When a hit occurs, a number of actions can occur depending on how the bus analyzer has been programmed. These include:
- Generating a trace message
- Signalling a debug interrupt to the CPU. In this case, the parameters of the matching operation are stored in memory-mapped registers and can be read for processing by the interrupt handler.
- Freezing any combination of SuperHyway initiator ports on the device. This freezes the state of those modules' activity, allowing their memory-mapped registers to be read in the context of when the bus analyzer hit occurred.
5.3 Memory mapped boot / debug vector
When the SH-5 handles debug exceptions, panics and soft resets, it selects between two separate vectors when generating the handler address. The selection is based on the value of a memory-mapped register.
The first base vector, which is the default, can only be modified through the instruction stream. It can be used by SH-5 software to handle panics and debug exceptions arising when no debugger is in use.
The second vector is memory-mapped, so can be modified by a remote debugger. In addition, the SH-5 has a memory-mapped control register to allow suspend, resume and reset.
These registers allow a debug host to take control of an SH-5 without co-operation from the running software. In particular, the boot address can be set so that the SH-5 boots from remote code accessed over the debug link.
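The sequence a debug host might follow to boot the SH-5 from host-resident code can be sketched as below. Every register address, bit value and the base of the remote region is a hypothetical placeholder, as are the names `RecordingLink` and `boot_from_host`; the ordering of the writes is also an assumption. Only the existence of the memory-mapped vector, the vector selection register and the suspend/resume/reset control is taken from the text.

```python
# All addresses and bit values below are hypothetical placeholders.
VECTOR_SELECT_REG = 0x00000010    # selects default or memory-mapped vector
DEBUG_VECTOR_REG = 0x00000018     # the memory-mapped (second) vector
CPU_CONTROL_REG = 0x00000020      # suspend / resume / reset control
CTRL_SUSPEND, CTRL_RESUME, CTRL_RESET = 0x1, 0x2, 0x4
USE_DEBUG_VECTOR = 0x1
REMOTE_REGION_BASE = 0x0D000000   # assumed base of the 16Mbyte remote window


class RecordingLink:
    """Stand-in for a debug link: records host-originated register writes."""

    def __init__(self):
        self.writes = []

    def write(self, address, value):
        self.writes.append((address, value))


def boot_from_host(link, entry_offset=0):
    """Take control of the target and restart it from host-resident code."""
    link.write(CPU_CONTROL_REG, CTRL_SUSPEND)           # stop the CPU
    link.write(DEBUG_VECTOR_REG, REMOTE_REGION_BASE + entry_offset)
    link.write(VECTOR_SELECT_REG, USE_DEBUG_VECTOR)     # use the second vector
    link.write(CPU_CONTROL_REG, CTRL_RESET)             # restart from new vector
    link.write(CPU_CONTROL_REG, CTRL_RESUME)


link = RecordingLink()
boot_from_host(link, entry_offset=0x100)
```

None of these steps requires code to run on the target, which is why the approach works even on a board with no ROM fitted.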
The SH-5 can be selectively forced into real mode when a debug event (exception or interrupt) is launched. In real mode, both the instruction and data cache states are frozen. The benefits are:
- Debug event handlers can be non-intrusive with respect to the MMU, since no virtual memory mappings have to be set up for them
- The handler does not disturb the TLB and cache state as seen by the running system, and it can examine the states exactly as they were when the event occurred.
If required, debug event handlers can run in virtual mode (with improved performance), for example when an operating system provides debug event handlers itself to support inter-process debugging.
5.4 Off chip triggers
The SH-5 supports a trigger-in pin and a trigger-out pin. These can be used to signal debug events in both directions.
The trigger-in pin's state can be used in pre-condition settings that enable particular watchpoint channels (IA, IV, OA, branch, bus analyzer). This allows the programmer to only enable an internal SH-5 debug channel when some particular external event has occurred.
The SH-5 can be programmed to modify the state of the trigger-out pin when one of its watchpoint channels matches. This allows an internal SH-5 event to be signalled to external equipment, e.g. to trigger an oscilloscope connected to external pins, or to signal to another CPU in a multi-core design. Powerful multi-core debug solutions can be built on this mechanism.
5.5 Fast printf
The fast printf mechanism causes a special type of trace message to be sent from the SH-5 over the debug link, whenever a value is written to a special memory-mapped register.
With suitable adaptor software to interpret the message, this allows arbitrary tracing code to be inserted into the SH-5 software. For example, a fast printf message could be generated at certain critical points in the program so that its progress can be monitored.
The fast printf capability can also be used as a signalling mechanism between debug agent code running on the SH-5 and support code running on the debug adaptor/host.
5.6 Debug interrupt
The SH-5 has a debug interrupt, which is signalled to the CPU by the debug module (DM) under certain conditions. One such condition is a write to a particular memory-mapped DM register. This allows the debug host to take control of the SH-5, for example, if the user has pressed Ctrl-C.
6. CHAINING AND FILTERING
Chaining allows a sequence of debug events to be detected and acted on. It is achieved through chain latches. Each IA, IV, OA and bus analyzer channel can be programmed to set or clear a particular chain latch when a hit is detected. Each channel can be configured to be enabled only when a particular chain latch is set. This allows a channel to be made conditional on a sequence of hits occurring on other channels, for example:
- Match on the first store to a variable following a call to a particular function:
  - IA channel matches the function and sets the latch
  - OA channel matches the store and is conditional on the chain latch.
- Match on the first call to a function after a particular peripheral device register has been written to:
  - bus analyzer channel matches on the device register and sets the latch
  - IA channel matches on the function and is conditional on the latch.
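The first of these examples can be modelled in software as follows. The sketch assumes that the OA channel also clears the latch when it hits, so that only the first store is reported, and that all channels see the latch state as it stood before the current event; the `Channel` class, the `run` function and the event tuples are hypothetical.

```python
class Channel:
    """One watchpoint channel with optional chain-latch behaviour."""

    def __init__(self, name, matches, requires=None, sets=None, clears=None):
        self.name = name
        self.matches = matches      # predicate applied to each event
        self.requires = requires    # latch that must be set to enable the channel
        self.sets = sets            # latch set on a hit
        self.clears = clears        # latch cleared on a hit


def run(channels, events):
    """Return (channel name, event index) for every hit, in order."""
    latches = set()
    hits = []
    for index, event in enumerate(events):
        enabled = set(latches)      # latch state seen by all channels this event
        for ch in channels:
            if ch.requires is not None and ch.requires not in enabled:
                continue
            if ch.matches(event):
                hits.append((ch.name, index))
                if ch.sets is not None:
                    latches.add(ch.sets)
                if ch.clears is not None:
                    latches.discard(ch.clears)
    return hits


# IA channel: entry to the function sets latch 0.
ia = Channel("IA", lambda e: e == ("exec", "func"), sets=0)
# OA channel: store to the variable, enabled only while latch 0 is set.
oa = Channel("OA", lambda e: e == ("store", "var"), requires=0, clears=0)

events = [("store", "var"), ("exec", "func"), ("load", "var"),
          ("store", "var"), ("store", "var")]
hits = run([ia, oa], events)
```

The store at index 0 is ignored because the function has not yet been called, and the store at index 4 is ignored because the OA channel cleared the latch when it hit at index 3.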
Each IV and OA channel (and branch tracing) can also be restricted to the match range of an IA channel, allowing IV/OA watchpoints to be limited to selected functions or basic blocks.
Related to the chain latches are the event counters. A channel can be configured to match only when a particular event counter is zero. Channels can be selectively programmed to decrement an event counter when a match occurs. This mechanism allows conditions such as raising an exception on the 100th store to a variable.
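One way to arrange the 100th-store example is to load an event counter with 99, let a counting channel decrement it on each store, and enable the exception-raising channel only when the counter reads zero. The sketch below models that arrangement. The exact ordering of the zero test and the decrement in hardware is an assumption, and `stores_until_trigger` is a hypothetical name.

```python
def stores_until_trigger(initial_count, store_events):
    """Return the 1-based index of the store on which the trigger fires.

    Models two channels watching the same variable: a counting channel
    that decrements the event counter on each store, and a trigger
    channel that is enabled only while the counter is zero.
    Returns None if the trigger never fires.
    """
    counter = initial_count
    for index, _ in enumerate(store_events, start=1):
        if counter == 0:
            return index        # trigger channel is enabled and matches
        counter -= 1            # counting channel hit
    return None


first_trigger = stores_until_trigger(99, range(250))
```

With an initial count of 99 the first 99 stores only decrement the counter, and the trigger fires on the 100th.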
Tracing can be made conditional on the chain latches and event counters too, to avoid generation of unnecessary trace messages.
The chain latch and event counter settings are memory-mapped, and can be examined and modified from the debug host.
Additionally, the IA, IV, OA and branch tracing channels can be made conditional on:
- Whether the SH-5 is currently executing the SHmedia or SHcompact instruction set
- Whether the SH-5 is currently in user mode or privileged mode
- The current ASID (which allows filtering by thread)
7. DEBUG SCENARIOS
Some examples follow of how the architecture can be applied to SoC debug problems.
7.1 Device register modifications
Suppose software programs a register in an IP block to have a particular value, and later that value is found to have changed unexpectedly.
To diagnose this, one of the bus analyzer channels can be programmed to match on writes to the register. Then any write, from any SuperHyway initiator, will be captured (either as a trace message or through an interrupt).
7.2 Corrupted data in the cache
Suppose some data is used successfully in a function, and later the same data is found to be corrupted.
An unexpected store to the data can be detected by programming an OA channel to match on the address of the affected variable. To limit the number of hits, an IA channel can be programmed to match at the end of the function, and its action can be to set a chain latch. This chain latch can be used as a precondition for the OA channel, to cascade the conditions together.
Suppose no spurious store is found. The next theory might be that the cache block is evicted to memory and then reloaded, and gets corrupted in the process. One bus analyzer channel can be programmed to match on requests from the CPU to external memory for the physical address of the variable. The other channel can be programmed to capture responses from the memory to the CPU. From the trace data that is generated, the data read from memory can be examined. Similarly, the data stored to memory earlier on can be traced. If the stored and loaded data differ, this might indicate a problem with the set-up of the memory controller or with the memory itself.
8. CONFIGURABILITY
During development of a new SoC, the debug module would be included to provide full access to all the debug features. Once the SoC and its software are mature, new generations of the SoC may be produced, e.g. to reduce the die size. As part of such optimisations, it is possible to remove the debug module from the design, saving the area occupied by its logic and by the 1kbyte buffer memory.
9. TOOL SUPPORT
9.1 Hardware
The SH-5 debug architecture supports two types of off-chip debug connection:
- Industry standard JTAG
- High speed SHdebug link.
The SHdebug link contains 9 signals. Data from the adaptor to the SH-5 is serial. Data from the SH-5 to the adaptor is transmitted 4 bits at a time (to give increased trace bandwidth). The SHdebug link is typically capable of operation at 100MHz assuming suitable adaptor design.
At the adaptor end of the link, an FPGA can be used to implement the link protocol and message buffering. The FPGA is typically interfaced to a debug host workstation via a high speed connection such as Ethernet. SuperH currently market the MicroProbe unit, which performs this function together with a C/C++ toolchain hosted on PC, SUN or Linux.
The JTAG connection provides a lower-cost, lower-speed (typically 20MHz) solution, implemented through JTAG to USB interfaces, or JTAG to Ethernet interfaces.
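The raw bit rates implied by the figures quoted for the two connections can be worked through as follows. This is back-of-envelope arithmetic only: it assumes one transfer per clock, a single serial data bit per direction on JTAG, and it ignores protocol overhead on both links.

```python
SHDEBUG_CLOCK_HZ = 100_000_000   # typical SHdebug link clock
SHDEBUG_OUT_BITS = 4             # SH-5 to adaptor: 4 bits at a time
SHDEBUG_IN_BITS = 1              # adaptor to SH-5: serial
JTAG_CLOCK_HZ = 20_000_000       # typical JTAG clock
JTAG_BITS = 1                    # assumed: one serial bit per direction


def raw_bits_per_second(clock_hz, width_bits):
    """Raw link rate, ignoring any protocol overhead."""
    return clock_hz * width_bits


# Direction SH-5 -> adaptor (trace output).
out_ratio = (raw_bits_per_second(SHDEBUG_CLOCK_HZ, SHDEBUG_OUT_BITS)
             / raw_bits_per_second(JTAG_CLOCK_HZ, JTAG_BITS))

# Direction adaptor -> SH-5 (e.g. code download).
in_ratio = (raw_bits_per_second(SHDEBUG_CLOCK_HZ, SHDEBUG_IN_BITS)
            / raw_bits_per_second(JTAG_CLOCK_HZ, JTAG_BITS))
```

Under these assumptions the outbound direction is 20 times faster over SHdebug (400 Mbit/s against 20 Mbit/s) and the inbound direction 5 times faster (100 Mbit/s against 20 Mbit/s).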
With a typical advantage of a factor of 5 in operating frequency and a factor of 4 in output width, the SHdebug link outperforms JTAG by approximately:
- 20 times for trace bandwidth
- 5 times in SH-5 originated remote memory access
- 5 times in host access to SH-5 (e.g. for code download into SH-5 memory)
9.2 Software
SuperH have already ported GDB to support the SH-5 as a remote target. This port is currently supported on Solaris, Linux and Microsoft Windows.
The trace features have been extensively targeted by in-house software and SuperH is working with 3rd parties to develop best in class trace tools for the SuperH architecture.
10. CONCLUSION
SH-5 evaluation devices are available now and a range of tools, operating systems and application software has been successfully ported and debugged using the features of the SHdebug architecture. The debug features described in this paper have already proven to be invaluable for:
- Downloading, running and debugging ‘bare-machine’ software on SH-5 development boards
- Testing and debugging of the evaluation devices.
- Investigation and solving of problems in operating system development and porting work.
- Developing and debugging of operating system drivers.
11. REFERENCES
SH-5: The 64 Bit SuperH Architecture. Prasenjit Biswas, Atsushi Hasegawa et al. IEEE Micro, July/August 2000 (Vol. 20, No. 4), pp. 28-39.
12. GLOSSARY
BA: Bus analyzer
CPU: Central processing unit
DM: Debug module
IA: Instruction address (watchpoint)
IP: Intellectual property
IV: Instruction value (watchpoint)
OA: Operand address (watchpoint)
WPC: Watchpoint controller