Measuring the complexity of processor bugs to improve testbench quality

I am often asked “When is the processor verification done?” or, in other words, “How do I measure the efficiency of my testbench, and how can I be confident in the quality of the verification?”. There is no easy answer. Indicators commonly used in the industry, such as coverage and the bug curve, are absolutely necessary, but they are not enough to reach the highest possible quality: they do not really reveal whether the verification methodology is able to find the last bugs. With experience, I have learned that measuring the complexity of processor bugs is an excellent indicator to use throughout the development of a project.

What defines the complexity of a processor bug and how to measure it?

Experience taught me that we can define the complexity of a bug by counting the number of independent events or conditions that are required to hit the bug.

What do we consider an event?

Let’s take a simple example. A typical bug is found in the caches when a required hazard check is missing. Data corruption can occur when:

  1. A cache line at address @A is Valid and Dirty in the cache.
  2. A load at address @B causes an eviction of line @A.
  3. Another load at address @A starts.
  4. The external write bus is slower than the read, so the load @A completes before the end of the eviction.

External memory returns the previous data because the most recent data, carried by the eviction, has not been written yet, causing data corruption.
In this example, 4 events – or conditions – are required to hit the bug. These 4 events give the bug a score of 4, or in other words a complexity of 4.
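As a minimal sketch of this scoring, the example can be written down as a list of required events, with the score being the length of that list. The `Bug` class and the event names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bug:
    """A bug described by the independent events needed to hit it (hypothetical model)."""
    name: str
    events: tuple

    @property
    def complexity(self) -> int:
        # The complexity score is the number of required events.
        return len(self.events)

    def is_hit(self, observed) -> bool:
        # The bug is hit only if every required event was observed.
        return set(self.events) <= set(observed)


# Illustrative labels for the four conditions of the cache example.
cache_bug = Bug(
    name="missing hazard on eviction",
    events=(
        "line_A_valid_and_dirty",
        "load_B_evicts_line_A",
        "load_A_starts",
        "write_bus_slower_than_read",
    ),
)

print(cache_bug.complexity)                    # 4
print(cache_bug.is_hit(cache_bug.events))      # True
print(cache_bug.is_hit(cache_bug.events[:3]))  # False: one condition is missing
```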

Classifying processor bugs

To measure the complexity of a bug, we can come up with a classification that will be used by the entire processor verification team. In a previous blog post, we discussed 4 types of bugs and explained how we use these categories to improve the quality of our testbench and verification. Let’s go one step further and combine this method with bug complexity.

An easy bug requires between 1 and 3 events to be triggered: the first simple test fails. A corner case needs 4 or more events.

Going back to our example above, we have a bug with a score of 4. If one of the four conditions is not present, then the bug is not hit.

A constrained random testbench needs several features to be able to hit the example above. The sequence of addresses should be smart enough to reuse addresses from previous requests, and delays on external buses should be sufficiently atypical to produce fast reads together with slow-enough writes.
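The two features can be sketched as follows. Python is used only for illustration; the function names, the probabilities, the 64-byte line size and the delay ranges are all assumptions, not values from a real testbench.

```python
import random

LINE_BYTES = 64  # assumed cache line size


def generate_addresses(n, reuse_prob=0.5, addr_space=1 << 16, seed=0):
    """Return n line-aligned addresses, reusing earlier ones with probability reuse_prob."""
    rng = random.Random(seed)
    history = []
    for _ in range(n):
        if history and rng.random() < reuse_prob:
            addr = rng.choice(history)  # reuse an address from a previous request
        else:
            addr = rng.randrange(0, addr_space, LINE_BYTES)  # fresh line-aligned address
        history.append(addr)
    return history


def pick_bus_delays(rng, slow_write_prob=0.2):
    """Return (read_delay, write_delay) in cycles; writes are sometimes much slower."""
    read_delay = rng.randint(1, 4)
    if rng.random() < slow_write_prob:
        write_delay = rng.randint(20, 100)  # atypical: slow external write bus
    else:
        write_delay = rng.randint(1, 4)
    return read_delay, write_delay


addresses = generate_addresses(20, reuse_prob=0.5, seed=1)
print(len(addresses) - len(set(addresses)), "requests reused an earlier address")
print(pick_bus_delays(random.Random(1)))
```

Raising `reuse_prob` and `slow_write_prob` makes the combination of an eviction, a reload of the same line and a slow write more likely within a single test.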

A hidden case needs even more events. Perhaps a more subtle bug has the same conditions as our example, but it only happens when an ECC error is discovered in the cache, at the exact same time as an interrupt happens, and only when the core finishes an FPU operation that results in a divide-by-zero error. With typical random testbenches, the probability of having all these conditions occur together is extremely low, making it a “hidden” bug.
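A rough back-of-the-envelope calculation shows why. If the events are independent, the probability of seeing them all in one test is the product of their individual probabilities. The per-event probability of 1/10 below is an assumed figure for illustration, not a measurement.

```python
from fractions import Fraction
from math import prod


def hit_probability(event_probs):
    # Independent events: the probability that all occur together is the product.
    return prod(event_probs, start=Fraction(1))


def expected_tests(event_probs):
    # Mean number of independent random tests until the first hit.
    return 1 / hit_probability(event_probs)


per_event = Fraction(1, 10)    # assumed probability of each event in one test
corner_case = [per_event] * 4  # the four conditions of the cache example
hidden_case = [per_event] * 7  # plus ECC error, interrupt, FPU divide-by-zero

print(expected_tests(corner_case))  # 10000
print(expected_tests(hidden_case))  # 10000000
```

Under these assumptions, three extra events multiply the number of random tests needed by a thousand.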

Making these hidden bugs more reachable in the testbench improves the quality of verification: it consists of turning hidden cases into corner cases.
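The three categories can be sketched as a small classification function. The boundary between easy bugs and corner cases (3 events) comes from the text above; the boundary between corner and hidden cases is not given as a number, so the `testbench_reach` parameter (the highest complexity the testbench can realistically hit) is a hypothetical stand-in for it.

```python
def classify(complexity: int, testbench_reach: int) -> str:
    """Classify a bug by its complexity score.

    testbench_reach is an assumed parameter: the highest complexity the
    current testbench hits with a reasonable probability.
    """
    if complexity < 1:
        raise ValueError("a bug needs at least one event")
    if complexity <= 3:
        return "easy"
    if complexity <= testbench_reach:
        return "corner case"
    return "hidden case"


# The same 7-event bug before and after improving the testbench.
print(classify(7, testbench_reach=5))  # hidden case
print(classify(7, testbench_reach=8))  # corner case
```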

Analyzing the complexity of a bug helps improve processor quality

This classification has no upper limit. Experience has shown me that a testbench capable of finding bugs with a score of 8 or 9 is a strong simulation testbench and is key to delivering quality RTL. From what I have seen, today's most advanced simulation testbenches can find bugs with a complexity of up to 10. Fortunately, formal verification makes it much easier to find bugs of even higher complexity, paving the way to even better designs and giving clues about what to improve in simulation.

Using bug complexity to improve the quality of a verification testbench

This classification and methodology are useful only if they are used from the moment verification starts and throughout the project development, for two reasons:

  1. Bugs must be fixed as they are discovered. Leaving a level 2 or 3 bug unfixed means that many failures occur when launching large soak tests. Statistically, a similar bug (from the same squadron) that requires more events could then go unnoticed.
  2. Bug complexity is used to improve and measure the quality of a testbench. Because the complexity score matches the number of events required to trigger the bug, the higher the scores of the bugs it finds, the more stressful the testbench is. Keeping track of and analyzing the events that triggered a bug is very useful for understanding how to tune random constraints or for creating a new functional coverage point.
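One possible way to keep track of these events is a simple bug log, sketched below. The log format, bug identifiers and event names are all made up for illustration. The sketch reports the highest complexity found by simulation and lists events that only appear in bugs found by other means, which are candidates for new constraints or coverage points.

```python
from collections import Counter

# Hypothetical bug log: every identifier and event name here is invented.
bug_log = [
    {"id": "B1", "found_by": "simulation",
     "events": ["unaligned_access", "cache_miss"]},
    {"id": "B2", "found_by": "simulation",
     "events": ["line_valid_dirty", "eviction", "load_same_line", "slow_write_bus"]},
    {"id": "B3", "found_by": "formal",
     "events": ["line_valid_dirty", "eviction", "load_same_line", "slow_write_bus",
                "ecc_error", "interrupt"]},
]


def complexity_histogram(bugs):
    # Number of bugs found at each complexity score.
    return Counter(len(b["events"]) for b in bugs)


def max_complexity(bugs, found_by):
    # Highest score reached by one verification method (0 if it found nothing).
    return max((len(b["events"]) for b in bugs if b["found_by"] == found_by), default=0)


def events_never_seen_in_simulation(bugs):
    # Events present only in bugs that simulation did not find.
    sim = {e for b in bugs if b["found_by"] == "simulation" for e in b["events"]}
    rest = {e for b in bugs if b["found_by"] != "simulation" for e in b["events"]}
    return sorted(rest - sim)


print(max_complexity(bug_log, "simulation"))     # 4
print(events_never_seen_in_simulation(bug_log))  # ['ecc_error', 'interrupt']
```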

Finally, by combining this approach with our methodology of hunting bugs flying in squadrons, we ensure high-quality verification and can be confident that we are going beyond verification sign-off criteria.
