Generating AMD microcode stimuli using VCS constraint solver

Gregory Tang and Rajat Bahl, AMD, Inc.
Alex Wakefield and Padmaraj Ramachandran, Synopsys Inc.

As microprocessor designs have grown considerably in complexity, the use of hand-written directed tests in verification has dwindled. Automated random test generators that cover the stimulus space more efficiently have emerged in their place. These random test generators create microcode test sequences, emphasizing the distribution of stimuli across all meaningful values for opcodes and other instruction attributes. Traditional methods randomize instruction fields sequentially, which often results in verbose, redundant code and limited control over distributions. In this article, we explore using a hierarchical constrained-random approach to accelerate generation and reduce memory consumption, while providing optimal distribution and biasing to hit corner cases using the Synopsys VCS constraint solver. We present and analyze the method and discuss its effectiveness in todayâ€™s verification environment.

The constraint language constructs in the SystemVerilog language provide a clean, concise format for describing microcode instructions in terms of their possible attribute combinations and allow precise control over the distribution of values for each individual field. An initial test generator prototype was implemented using a single class, in which constraints for all opcodes were defined. This early design was able to overcome all the aforementioned flaws of sequential randomization methods.

Using an object-oriented approach, a base class was then implemented with global constraints pertaining to all opcodes. Sub-classes were derived to define groups of related opcodes with similar constraints. By partitioning the constraints hierarchically into smaller groups of opcodes, the memory requirements were drastically reduced, which increased performance.

Generator Architecture

The opcode generator consists of two layers, as shown in Figure 1. The upper layer was implemented using a SystemVerilog random sequence construct with weighted knobs to control the distribution of the high-level items. The lower layer consists of the opcode class, which is randomized with various additional constraints and weights provided by the upper layer. The tests consist of a set of weighted values that direct the generator to the required mix of instructions. The constraint solver directly applies these weights to the generator layer to control the distribution of the various opcode types created.

A simplified example of code for the architecture represented in Figure 1 is shown in Figure 2.

Figure 1 â€“ Test/Generator Architecture

Figure 2 â€“ Test/Generator Code Sample

Single Class Randomization

The simplest coding style for instruction generation is to have a single class containing all opcodes. This approach is very flexible because it allows constraints to be applied between any of the data members in the opcode class. The trade-off is that the randomization speed may be slow because the problem presented to the constraint solver contains many random variables and a large, complex set of constraints. Our opcode class contained approximately 100 random variables and 800 constraint equations.

The single-class architecture is shown in Figure 3, and the code to implement the opcode class is shown in Figure 4. The code consisted of a set of random variables and a set of implication constraints to ensure that legal opcodes are generated. An opcode type is one of the key data members in the class, controlling which type of instruction is generated.

Figure 3 â€“ Single-class Architecture

Figure 4 â€“ Single-class Code Sample

Multi-Class RandomizationTo reduce the size of the randomization problem, the opcode class was split into multiple smaller classes. The opcodes were divided into a series of categories that mapped well to the knobs or weights used in the test interface.

A base instruction class consisted of all the data members common to all of the child classes and contained most methods used to set, print, and pack data. Data members and constraints common to every opcode were placed into this base class, as shown in Figures 5 and 6.

Each opcode category child class contained constraints specific to that set of opcodes. Inside each of these classes, a similar structure to the original single-class code was present â€“ a set of implication operators based on the opcode type.

Figure 5 - Multi-class Architecture

Figure 6 â€“ Multi-class Code Sample

Architectural ConsiderationsThe instruction generator was controlled by a set of knobs or switches allowing the test writer to generate constrained stimulus. We did not have any constraints in the test layer that directly controlled items in the sub-classes. The upper-layer random sequence was controlled by knobs only and chose the opcode category first. This allowed us to allocate the correct object type at this location to add the sub-class into the sequence.

If the test layer directly controls any items in the lower levels, then all decisions related to which sub-class to randomize must be made first. A wrapper class would likely be required, which constrains all of the variables controlled by the tests. This wrapper class would be randomized first, and then the correct sub-class object would be allocated and randomized in the second phase of the generation.

Figure 7 â€“ Wrapper-class Diagram and Code SampleConstraint Profiler

The VCS constraint profiler allowed us to analyze the performance of the generators for runtime and memory. The profiler provided runtime performance details in three categories: cumulative randomize calls, per randomize call, and per partition (shown in Figures 8, 9, and 10 respectively).

We see that the call with the greatest overall impact on CPU time is in the op_gen.sv file at line 4308. This call executes quickly; however, there are 7,104 calls to this randomize, consuming 44 seconds of CPU time.

The next table shows the slowest individual randomize calls, which often corresponds to the first table. The line number and visit count are displayed â€“ op_gen.sv:4308@162 is file op_gen.sv, line 4308, 162nd execution of this line of code due to a loop. Here we can see the slowest call takes 3.2 seconds. However, there are only two calls to this randomize problem in the entire simulation, so optimizing this call will have little impact. The individual randomize call data often correlates well to the cumulative randomization table.

VCS partitions any randomize call into several partitions if possible, particularly when random variables that are not related to each other occur within the same randomize call. This division of the problem allows the unrelated variables to be solved independently. The partition table shows the slowest partitions, and often correlates well to individual and cumulative randomize tables.

Figure 8 â€“ Constraint Profile â€“ Cumulative Randomize CPU Runtime

Figure 9 â€“ VCS Constraint Profile â€“ Individual Randomize CPU Runtime

Figure 10 â€“ VCS Constraint Profile â€“ Individual Partition CPU Runtime

Similar details are provided for memory use in Figure 11. This is particularly useful when the BDD solver is used, because in this mode the solver elaborates the entire solution space of the randomize call before selecting a solution. Elaborating the entire solution space can take large amounts of memory, and the solver must spend some time elaborating the entire solution space. The solution space is cached to speed up subsequent randomization calls.

Figure 11 â€“ VCS Constraint Profile â€“ Memory Data

The BDD solver works well for specific architectures, particularly if the randomize problem does not take excessive memory and the same randomize call occurs many times, as is often the case with CPU opcode generation.

Performance Results

To ensure we correctly compared the performance between the two architectures, we used the profile data to identify randomize results for two opcodes, and then created a small testbench that randomized these opcodes multiple times. This allowed us to measure reasonable CPU time with both solvers in isolation, without any side effects from the other testbench components.

The VCS 2009.12 release provided a testcase extraction feature to extract the slowest partition from each randomize call automatically.

Runtime results

A comparison of the runtime from the single-class and multiple-class architectures with each solver is shown in Figure 12. Clearly, the runtime for the multiple-class architecture is faster with either solver and for both opcodes. The default RACE solver shows a 4x speedup, while the BDD solver shows a 2x speedup.

Figure 12 â€“Runtime Performance Comparison

Memory Results

The memory requirements were also significantly better with the multiple-class architecture, as shown in Figure 13. We measured memory results for the BDD solver only, because the memory consumed by RACE is typically smaller and not a limiting factor.

Figure 13 â€“ Memory Comparison

Performance Analysis

The main reason for the acceleration and memory decrease is due to the smaller set of variables and constraints present in the new implementation. Profile data for each of the architectures is shown in Figure 14 and Figure 15. The new implementation has 7x fewer constraints than the original, allowing the solver to calculate solutions more efficiently.

Figure 14 â€“ Single Class - Top Randomize Calls

Figure 15 â€“ Multiple-class â€“ Top Randomize Calls

Conclusions

Generation of x86 opcodes using serial randomization achieved the desired speed and memory; however, it suffered from a significant distribution problem. Because each portion of the opcode was generated serially, no control over the overall distribution was possible. This resulted in skewed results, and many additional seeds and simulations to close coverage.

A simple constrained random approach solved the distribution issue; however, it quickly reached speed and memory limits due to the complex x86 instruction set. The generation was too slow, resulting in reduced simulation performance.

Randomizing instructions by first choosing the opcode category significantly simplified the problem. The constraint solver had a smaller problem to consider because only constraints specific to the opcode category were present. This resulted in improved memory and speed without sacrificing distribution or test-level control.

Industry Articles

Generating AMD microcode stimuli using VCS constraint solver