Whether you are testing a new microcontroller or an ASIC, post-silicon validation of the design is a must. Here's a look at diagnostic tests and techniques.
Writing software to verify the design of a silicon device is quite different from other types of embedded software development. This article introduces the subject of post-silicon validation and provides techniques for the software developer to successfully plan, develop, and execute software used to verify the design of a new silicon device. Although my experience is in validating microcontrollers, the techniques presented here apply to any type of device with interfaces controlled by software.
Silicon devices are increasingly complex, so it's important to plan and execute good design tests throughout the development cycle. There are two main categories of design testing: pre-silicon verification and post-silicon validation. In pre-silicon verification, design analysis tools are used to simulate the design and the test environment before an actual silicon device is created. Since the environment is simulated, there is flexibility in setting up test cases at the block and gate level. Inputs can be injected and outputs probed and logged from virtually anywhere in the design. A powerful, low-level test and debug environment is the result. This is equivalent to white-box unit testing in the software world.
Most modern chip designs are developed in a hardware description language such as Verilog or VHDL. The language used for the design in most cases is also used to create the pre-silicon test cases. A problem with design simulation and pre-silicon testing is that they take a long time to execute when compared to actual silicon. What takes seconds in silicon could take hours or days in a simulated environment. This limits the amount of testing that can be performed. Pre-silicon verification testing is usually performed either by the digital designers themselves or by a separate verification team whose members have chip design backgrounds and skills. The subject of pre-silicon design verification is extensive, and much information is available about its methodologies and tools.
This article will focus on the second stage of design testing. Post-silicon validation is a less documented area of testing and is usually performed by skilled embedded software developers with little or no experience in digital design. Post-silicon validation is testing performed on the silicon device once it arrives from the fab. The tests use programming languages such as C and assembly and are run on a reference or validation board containing the target silicon. Testing can be done at speed and now involves interaction with other hardware and peripherals.
One difficulty is that visibility inside the chip is limited; internal signals cannot be probed as they can in a simulation environment. Test and debug is a challenge at this stage. The work is done at the device's register interfaces using software and at its external signals using tools such as logic analyzers and oscilloscopes.
Consider the following scenario: The first silicon finally arrives for a new system on chip (SoC) design and everyone is anxious to see how well the design works. The silicon validation team plugs the new chip into a reference board that has never been tested with the chip. Revision 1 of the bootstrap and diagnostic code is loaded into ROM, and the board is powered up for the first time. Chances are good that it will crash almost immediately. The problem might be with the reference board, the new silicon, the diagnostic software, the boot code, a misunderstanding of how the chip is supposed to work, or some combination of these. Unfortunately, an early failure like this usually prevents the rest of the tests from running correctly, which is not good either. Obviously, nothing can be said yet about the quality of the chip.
Post-silicon diagnostic tests should be developed early in the chip development process, and the validation team should be organized and ready for the delivery of the chip. It is a good idea to anticipate various failure types and have plans for dealing with each. While for many embedded projects, the programmers have the luxury of developing and iterating software using the target hardware, this is not the case when validating new silicon. Diagnostic code must be written and tested without hardware.
There are two main goals of validation testing: first, to thoroughly test the functionality of the chip; second, to structure the code and plan the test environment to quickly and efficiently debug, isolate, and report problems and issues that come up in the lab while testing. This means putting together an organized system of debug statements, logic analyzer triggers, event logging, and so on.
Rarely can the diagnostic software print a nice statement describing the exact silicon problem. Many times, the test will fail due to some side effect such as improper register settings or initialization by software, a software logic error, a silicon problem unrelated to what the test was testing at the time, or a lockup in the middle of the test. You need to design your software to report all available clues.
Of course, the diagnostic software should be well structured and easy to understand and maintain. There are many reasons for this. Because of the complexity of embedded software diagnostics, it is easy to raise a false alarm to the chip's designers when reporting a test failure. This should be minimized, although sometimes it can't be avoided. When you strongly suspect that a problem is in the silicon, a designer will want to see everything that the software is doing. If the code is overly complex or messy, it adds more difficulty to finding the location of the problem (and can be embarrassing).
While you usually have adequate time before the first silicon arrives to design and write the software, development often moves more quickly than you think and pressure mounts once the chip is in the lab. If the code is not well structured upfront, it can rapidly become unreadable and unreliable due to rushed code changes in the lab. The test can lose its credibility and waste a lot of people's time on misreported problems. Good clean code helps testing go smoothly and provides the validation and design engineers more confidence in working through potential problems.
There are three types of validation tests: directed diagnostic tests, stress tests, and real-world tests. Although there is some duplication in the functionality tested, these types represent different test dimensions that are all needed to ensure the quality of the silicon device.
Directed diagnostic tests are designed to exercise every feature of the chip. There should be a test for every field in each register. All functions of the device must be exhaustively checked one by one.
For starters, you want to set all possible programmable values of each writable register or field (or a reasonable sampling of values if the range is large) and verify that all readable registers can be read back after being written and that all status bits are set correctly for all stated conditions. For registers with large value ranges, you should check boundary conditions as well as values sprinkled throughout the range. Being thorough is important.
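A write and read-back check of this kind might be sketched as follows. This is only an illustration that runs on a host machine: the `reg_access_t` layer, the two software register models, and the choice of bit patterns are all assumptions made for the example, not part of any real device. On the bench, the read and write hooks would wrap actual register accesses.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical access layer: in the lab these hooks would wrap real
 * memory-mapped reads and writes; here they wrap software models. */
typedef struct {
    uint32_t (*read)(void);
    void (*write)(uint32_t value);
    uint32_t writable_mask;   /* bits the manual documents as read/write */
} reg_access_t;

/* Model 1: a healthy 8-bit register (illustrative only). */
static uint32_t good_reg;
uint32_t good_read(void) { return good_reg; }
void good_write(uint32_t v) { good_reg = v & 0xFFu; }

/* Model 2: the same register with bit 3 stuck at 0 (a simulated defect). */
static uint32_t stuck_reg;
uint32_t stuck_read(void) { return stuck_reg; }
void stuck_write(uint32_t v) { stuck_reg = v & 0xFFu & ~(1u << 3); }

const reg_access_t good_reg_access  = { good_read,  good_write,  0xFFu };
const reg_access_t stuck_reg_access = { stuck_read, stuck_write, 0xFFu };

/* Writes boundary values and alternating-bit patterns, reads each one
 * back, and returns the number of patterns that did not read back. */
unsigned reg_readback_test(const reg_access_t *reg)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu,   /* boundaries: all zeros, all ones */
        0xAAAAAAAAu, 0x55555555u,   /* alternating bits */
        0x00000001u, 0x80000000u    /* lowest and highest single bit */
    };
    unsigned mismatches = 0;

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        uint32_t expected = patterns[i] & reg->writable_mask;
        reg->write(expected);
        if ((reg->read() & reg->writable_mask) != expected) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling `reg_readback_test(&good_reg_access)` reports no mismatches, while the stuck-bit model fails every pattern that sets bit 3. Running the diagnostic against a deliberately broken model is one way to gain confidence in the test code before the silicon exists.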
I sometimes feel like a robot when I write these tests. You go through the user manual. It tells you that register x bit 3 will cause y when set. You write a test case to set register x bit 3 and then verify that y happened. The manual says that status bit 7 is set when condition z exists. You do what it takes to make condition z occur and then make sure status bit 7 is set. You make condition z go away and verify that bit 7 is no longer set. One by one and step by step, you test them all.
Directed diagnostics are very important, as they check that all the basic functionality of the chip is there. They form a solid basis for the other tests. A problem encountered here can be isolated quickly, unlike failures uncovered by the other types of tests.
Unfortunately, a successful run of the directed diagnostics can give you a false sense of security. While the testing looks thorough and impressive, it's not sufficient by itself to test the chip. One reason is that these test cases, taken in isolation, are usually well tested by the chip designers during simulation. The test cases are the first things they think of and are often tested just to call the development of the feature completed. This testing type is also done extensively in pre-silicon verification, and although it still needs to be tested in silicon, chances are fairly low that anything significant will be found wrong post-silicon.
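The set, verify, clear, verify routine can be sketched in C. Everything here is invented for illustration: the FIFO device, the bit assignments (control bit 3 as an enable, status bit 7 as a "full" flag), and the small software model that stands in for the silicon so the sketch can run on a host.

```c
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define CTRL_FIFO_ENABLE  (1u << 3)   /* "register x bit 3" */
#define STATUS_FIFO_FULL  (1u << 7)   /* "status bit 7" */
#define FIFO_DEPTH        4u

/* Tiny software model standing in for the device under test. */
typedef struct {
    uint32_t ctrl;
    unsigned fill;
} sim_dev_t;

uint32_t sim_read_status(const sim_dev_t *d)
{
    return (d->fill >= FIFO_DEPTH) ? STATUS_FIFO_FULL : 0u;
}

void sim_push(sim_dev_t *d)
{
    if ((d->ctrl & CTRL_FIFO_ENABLE) && d->fill < FIFO_DEPTH) {
        d->fill++;
    }
}

void sim_pop(sim_dev_t *d)
{
    if (d->fill > 0) {
        d->fill--;
    }
}

/* Returns 0 on pass, otherwise the number of the step that failed,
 * so a failure report points at a specific line of the manual. */
int test_fifo_full_status(sim_dev_t *d)
{
    d->ctrl |= CTRL_FIFO_ENABLE;
    if (sim_read_status(d) & STATUS_FIFO_FULL) {
        return 1;                     /* bit 7 set before condition z exists */
    }

    for (unsigned i = 0; i < FIFO_DEPTH; i++) {
        sim_push(d);                  /* make condition z occur */
    }
    if (!(sim_read_status(d) & STATUS_FIFO_FULL)) {
        return 2;                     /* bit 7 not set when it should be */
    }

    sim_pop(d);                       /* make condition z go away */
    if (sim_read_status(d) & STATUS_FIFO_FULL) {
        return 3;                     /* bit 7 did not clear */
    }
    return 0;
}
```

Returning a step number rather than a bare pass/fail is a design choice of this sketch; it is one cheap way to report an extra clue when a test case fails.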
Another reason that directed testing is not sufficient is that most problems involve a combination of features being active simultaneously with certain timings of events. Stress testing is designed to catch these types of problems, which constitute the bulk of the problems that exist in the chip. Writing good stress tests, however, is much more of an art than a science (the robot has trouble).
The goal of stress testing is to uncover problems that arise when certain combinations of events and their timing in relation to each other occur. This testing is sometimes known as random testing, though it may not be truly random. Stress testing is especially valuable in post-silicon validation since it is difficult to accomplish in pre-silicon verification. (The lack of real-time speed in the simulation environment as well as lack of actual peripheral interactions makes this sort of testing extremely difficult.)
Stress tests can be very complicated to write and debug, so their design should be thought out carefully. The goal is to get as many chip functions running at the same time and as fast as possible. Elements of randomness should also be included. These tests should be designed to run continuously so they can be left running overnight or through a weekend.
Stress tests vary depending on the type of silicon device you're testing. For example, if you have a device that handles interrupts and has low- and high-power modes, you can design a test that will cause all devices to send interrupts at a high rate of speed, verify that the interrupt handlers are called, and then have the handlers themselves switch the power modes back and forth, verifying that the power modes work. Perhaps have your DMA controller moving memory back and forth at the same time. Try to mix as many things together as possible.
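One way to mix operations with an element of randomness is sketched below. The three operations are placeholders that only count calls; in a real test they would start interrupts, switch power modes, or kick off DMA transfers. The xorshift generator, the idea of driving the test from a recorded seed so that a failing run can be replayed, and all names here are assumptions of this sketch rather than a prescribed method.

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder operations: each returns 0 on success. Real versions
 * would drive the interrupt sources, power modes, and DMA controller. */
unsigned op_counts[3];

int op_fire_interrupt(uint32_t arg) { (void)arg; op_counts[0]++; return 0; }
int op_toggle_power(uint32_t arg)   { (void)arg; op_counts[1]++; return 0; }
int op_start_dma(uint32_t arg)      { (void)arg; op_counts[2]++; return 0; }

typedef int (*stress_op_fn)(uint32_t arg);

static const stress_op_fn stress_ops[] = {
    op_fire_interrupt, op_toggle_power, op_start_dma
};
#define N_OPS (sizeof stress_ops / sizeof stress_ops[0])

/* Small xorshift pseudo-random generator; state must be nonzero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Runs `iterations` pseudo-randomly chosen operations.
 * Returns -1 if every operation passed, otherwise the index of the
 * failing iteration. `signature` receives a hash of the operation
 * order so two runs can be compared. */
long stress_run(uint32_t seed, long iterations, uint32_t *signature)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift cannot start at 0 */
    uint32_t sig = 0;

    for (long i = 0; i < iterations; i++) {
        uint32_t r = xorshift32(&state);
        size_t which = r % N_OPS;
        sig = sig * 31u + (uint32_t)which;
        if (stress_ops[which](r >> 8) != 0) {
            *signature = sig;
            return i;
        }
    }
    *signature = sig;
    return -1;
}
```

Because the sequence depends only on the seed, logging the seed at the start of an overnight run makes it possible to repeat the same order of operations later, although timing-dependent failures may still not reproduce exactly.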
Look at each function of the device and ask yourself if you can exercise it during the test. If it is not feasible to actually verify the results of the function in detail while it's running (believe me, many times this is harder than it seems), it's beneficial to just have the function running during the test for more stress on the system and to see if it breaks anything. System stress testing can consist of one large test or be broken down into a few different ones.
Remember to design the software tests with the expectation that silicon problems will occur and will need to be isolated. The most common way a stress test will fail is by the software just locking up, so assume this will happen when you write the tests. Incorporate debug logic such as outputting codes to an I/O port where a logic analyzer or scope can collect them or whatever makes the most sense in your particular test environment.
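The idea of outputting codes to an I/O port can be sketched as a trail of checkpoints. The port here is a plain variable so the example runs on a host; on real hardware `DEBUG_PORT` would point at whatever memory-mapped port or GPIO the logic analyzer is attached to. The checkpoint names and code values are hypothetical.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped debug port (an assumption of this
 * sketch). On the validation board this would be a fixed address
 * wired to logic analyzer pins. */
volatile uint8_t sim_debug_port;
volatile uint8_t *const DEBUG_PORT = &sim_debug_port;

/* Hypothetical checkpoint codes, one per stage of the test. */
enum checkpoint_code {
    CP_TEST_START     = 0x10,
    CP_IRQ_ENABLED    = 0x20,
    CP_DMA_STARTED    = 0x30,
    CP_POWER_SWITCHED = 0x40,
    CP_TEST_DONE      = 0x7F
};

/* A single store: cheap enough to leave in place during a stress run. */
void checkpoint(uint8_t code)
{
    *DEBUG_PORT = code;
}

void stress_iteration(void)
{
    checkpoint(CP_TEST_START);
    /* ... enable interrupt sources here ... */
    checkpoint(CP_IRQ_ENABLED);
    /* ... start DMA transfers here ... */
    checkpoint(CP_DMA_STARTED);
    /* ... switch power modes here ... */
    checkpoint(CP_POWER_SWITCHED);
    /* ... verify results here ... */
    checkpoint(CP_TEST_DONE);
}
```

If the software locks up, the last code captured on the port shows which stage was reached, which narrows the search even when nothing can be printed.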
Well-designed stress tests will uncover problems. But it will always be a challenge to pinpoint the exact cause of each failure.
Real-world testing
Real-world testing consists of running a selected set of software applications that stress the chip in different ways. These applications should be typical of end user products that will contain the chip. This testing will vary depending on the type of silicon and usually consists of using off-the-shelf applications with device drivers written to interface with the device. You should select the applications to stress as much of the chip's functionality as possible.
What is the benefit of this type of test when added to the combination of directed and system stress testing? Although, in theory, directed diagnostics and stress tests should cover testing of all functionality of the chip, no amount of directed testing can detect every possible failure. Given this fact, successful testing is about efficiently using time and manpower to get the best possible test coverage in order to minimize the probability of customers and end users encountering problems in the field. Knowing when to stop is a common test dilemma, but you do know that the answer simply can't be "after we test every possible permutation of the device."
For this reason, real-world testing is a valuable supplement to stress testing. Although the applications tested are not specifically designed for use in testing, they can cause different combinations of events and timing that did not occur in previous testing and could uncover system-level problems. These are good problems to catch, since the patterns of operation in a real-world test are more likely to be seen during the product's use in the field.
You may wonder if you could skip the stress tests (or even the directed diagnostics) and perform an extensive real-world test instead. This is not a good idea for several reasons. One is that you do not always know how well an application exercises the functionality of the chip. It will be hard to gather a set of software that uses every feature, especially since some may not be taken full advantage of by any existing applications.
Another reason not to do real-world testing alone is that if a silicon problem occurs and the application crashes, the cause of the problem will be difficult to isolate. In many cases, you will not even have the source code for real-world tests. Having a good set of directed diagnostics and instrumented system stress tests will allow you to isolate problems more quickly and easily than you could with actual applications.
Real-world testing is important, but it should only be used to supplement the directed diagnostics and stress tests. You must always exhaustively verify that the chip does what the manual says it will, so new application software can be written to it. That is the only way to ensure the quality of the device for present and future software designs.
The natural execution order for the three test types is (1) directed diagnostics, (2) stress tests, and (3) real-world applications. However, in many cases, you will find that bouncing back and forth helps find and isolate the most problems the most quickly.
The suggested sequence is good for the following reasons. Doing directed diagnostics first lets you know that each individual feature works. A problem with a particular function, say a bit in a register not hooked up correctly, could take hours or days to debug in a system test, but would be caught and isolated immediately in a directed diagnostic test case.
For the same reason, stress tests are run before real-world tests. Although problems are still hard to isolate in system stress tests, they can at least be specially designed to aid in capturing and isolating problems. Real-world testing is done last, now that you're armed with the knowledge of existing problems and any necessary workarounds.
There are also reasons to conduct the tests in parallel as much as possible. For example, system testing will usually uncover the most complex problems. The earlier a problem is found, the more time there is to investigate and fix its cause.
In one project I worked on, getting through all of the directed diagnostics took considerable time and uncovered mostly problems that were fairly straightforward for the designers to fix. The number of problems found decreased as time went on, and since the directed tests touched all the functionality, it appeared that testing was almost done and the product almost complete. However, some very complex problems emerged when stress testing started. Although the knowledge gained by directed diagnostics made it easier to isolate these problems, the surge of failures was detected, isolated, and reported much later in the testing cycle than we would have preferred. Also, the problems uncovered by the stress tests turned out to be the hardest for the designers to fix.
The best approach to executing the tests is a spiral model. Each pass of the spiral covers a specific subset of the device functionality, which is tested first with directed, then stress, and finally real-world tests.
You can run all of the tests (without tracking down failures) in parallel if you have the equipment and manpower to do so. The order of the tests becomes important only if you want to spend time efficiently as you troubleshoot the problems described earlier.
In my specific experience, running and troubleshooting the tests happened at the same time since it was such a new device and the diagnostic code was unproven. In other words, the tests never ran without failure the first time. If the chip and diagnostic software are more proven, as they may be in a sample update, running all the tests in parallel and ordering the troubleshooting of problems in the specified order may make the most sense.
To develop your diagnostics, you may be tempted to use an operating system that gives maximum power in developing and debugging software. However, especially with SoCs, the first silicon may not even boot the operating system. Even if the system boots, testing could fail or lock up due to stress from underlying operating system functionality that will be hard to isolate.
The best solution is to develop the bulk of your diagnostics using a very thin operating system (or none). It is helpful, though, to use the C runtime library as well as file I/O. It's always good to have a minimum set of diagnostics written in assembly language that is run directly upon processor reset and uses no operating system at all. This is useful if there are significant problems in the first silicon that prevent even a simple run-time environment from booting.
You'll need a good diagnostic framework or scripting language to execute sequences of tests and make it easy to activate and deactivate tests as necessary, preferably without having to modify any source code. You should always report your test results in a consistent way. While some tests may involve some manual steps in verifying a signal on an oscilloscope or logic analyzer, you should leave the bulk of them completely automated. These can then be easily run as regression tests. You'll discover, however, that some tests may take too much effort to automate; even so, the tests must still be done. It's important to document how to run the manual tests. This includes a step-by-step procedure for how to execute the tests and verify the expected results.
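A very small framework along these lines could be table driven, as in the sketch below. The test names, the placeholder test bodies, and the `PASS`/`FAIL`/`SKIP` line format are all invented for the example. The point is only that tests are switched on and off by name at run time (for instance from a command received over a serial console) and that every result is reported in the same shape.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TEST_PASS = 0, TEST_FAIL = 1 } test_result_t;

typedef struct {
    const char *name;
    test_result_t (*run)(void);
    bool enabled;
} test_entry_t;

/* Placeholder tests; real ones would exercise the device. The second
 * one fails on purpose to show how a failure is reported. */
test_result_t test_reg_readback(void) { return TEST_PASS; }
test_result_t test_timer_status(void) { return TEST_FAIL; }
test_result_t test_dma_copy(void)     { return TEST_PASS; }

test_entry_t test_table[] = {
    { "reg_readback", test_reg_readback, true },
    { "timer_status", test_timer_status, true },
    { "dma_copy",     test_dma_copy,     true },
};
#define TEST_COUNT (sizeof test_table / sizeof test_table[0])

/* Enables or disables a test by name. Returns false if no test with
 * that name exists. */
bool set_test_enabled(test_entry_t *table, size_t count,
                      const char *name, bool enabled)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0) {
            table[i].enabled = enabled;
            return true;
        }
    }
    return false;
}

/* Runs every enabled test, logs one line per test in a fixed format,
 * and returns the number of failures. */
unsigned run_tests(test_entry_t *table, size_t count, FILE *log)
{
    unsigned failures = 0;

    for (size_t i = 0; i < count; i++) {
        if (!table[i].enabled) {
            fprintf(log, "SKIP %s\n", table[i].name);
            continue;
        }
        test_result_t result = table[i].run();
        fprintf(log, "%s %s\n",
                result == TEST_PASS ? "PASS" : "FAIL", table[i].name);
        if (result != TEST_PASS) {
            failures++;
        }
    }
    return failures;
}
```

A call such as `set_test_enabled(test_table, TEST_COUNT, "timer_status", false)` followed by `run_tests(test_table, TEST_COUNT, stdout)` skips the failing test without a rebuild, and the one-line-per-test log is easy to compare between regression runs.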
Also, make sure that the validation hardware board has adequate connectors to attach all the signals needed for debugging to a logic analyzer. As discussed before, always assume from the outset that you'll encounter difficult silicon problems that will need to be debugged. The logic analyzer settings can be planned, programmed, and set up even before the chip gets to the lab.
Break the system
The goal of silicon validation is to find problems. Every silicon device that comes out of the fab should be assumed to have problems, some minor, some major, and some obscure. The aim should be to relentlessly cause them to happen. Verifying that something functions is different from working diligently and creatively to stress the system and uncover problems and bad behavior. The latter mindset distinguishes an excellent validation team from a mediocre one.
That said, remember that if a test fails, it doesn't automatically mean that it's the silicon designer's problem to fix. It could be that your test configured or used the chip incorrectly. Perhaps an easy software workaround can be defined and incorporated into the final documentation for the part. Application notes may need to be generated. For new silicon, the company has to learn how the chip works and how it should be used. In other words, the validation engineer and the designer both own the problem and should work together to solve it.
Many testers incorrectly assume that because the diagnostic code is not a shippable product, good software practices need not apply. This is not true. Good upfront software design and design review for all tests are critical. A coding standard should exist, and version control and thorough unit-level tests should be used. Peer reviews are always a good idea and can have tremendous benefit when conducted before the first silicon even arrives.
Many of the issues with design-level diagnostic test code are different from product-level embedded development. Coding standards and reviews should reflect this. For example, while having various levels of data abstractions is good in production software, it can be bothersome in test code. Having to look through four layers of header files just to find the definition for the bits being set in a register is a waste of time in the lab setting. This is an extreme example, but the point is to develop and review the test code for suitability for its intended purpose of testing and isolating silicon problems.
Post-silicon validation is a challenging and interesting field within engineering. Members of the validation team need to have a methodical mind and good testing, communication, and people skills. They also need good software development skills and the ability to debug very low-level hardware and software issues.
Steve Babin has over 15 years of embedded software development and management experience. He is currently responsible for developing software for cell phones at IBM's Pervasive Computing Group. Earlier in his career, Steve was project leader for the Elan SC400 silicon validation project at Advanced Micro Devices. He holds a BSEE from Louisiana State University. Contact him at firstname.lastname@example.org.