Small but Deadly - the Life Cycle of an I/O Bug

By David Murray and Sean Boylan, Duolog Technologies

Some years ago, Duolog worked with a customer to develop a verification infrastructure for system-level validation of a large multimedia chip. Duolog developed a modular, programmable chip-level testbench, incorporating peripherals, memories, reset, clocks and control. The testbench was used for system validation and its main targets were RTL simulation, emulation and FPGA, meaning that the whole infrastructure needed to be synthesizable. While challenging to develop, the testbench was delivered on schedule to coincide with an RTL release of the chip database that was to be used by embedded software engineers ramping up for initial HW/SW integration. As we were providing a verification service, we needed to be highly responsive and wanted to ensure that any problems with the testbench were dealt with immediately. Therefore, while the system testbench was developed off-site, our engineers accompanied its delivery on-site to ensure there was nothing blocking this critical stage of the design flow.

The next three weeks were a revelation. On a daily basis, the Duolog engineers were summoned to some random cubicle at the customer’s site. There, we typically found two engineers hunched in front of a workstation. One was a hardware engineer responsible for IP design and implementation. The other was an embedded software engineer responsible for writing the device driver for that particular IP. The software engineer was trying to integrate and test his code within a co-simulation environment but was not getting the response expected from a simple test case. For example, in the case of a UART IP, the first test - a HW/SW interface test of the IP’s registers and bitfields - had passed with some difficulties. The second test was a loopback test, which involved getting the UART to transmit a byte out of the chip on the TX line and, using Duolog’s testbench, looping it back to the RX pin so that the same byte would be received in the UART. This simple integration test would give a good degree of confidence that the UART was where it was supposed to be in the memory map and was behaving as expected. The software integration could then progress from there. There was, however, a problem with the integration as the expected byte never came back. There was a bug in the flow!

BUG!!!

The following dialogue is representative of what happened next and was replayed on many occasions during those three weeks. Names have been changed to protect the innocent and swear words have been removed.

The software engineer, with the Duolog verification engineer and the IP designer in situ, replayed the problem by stepping through his software code:

SW ENGINEER: “See here - in this line of code, we transmit a byte. In the next line of code, we read the receive UART FIFO and look - there’s nothing there! I’ve checked the memory map and I’m sure that I’m reading and writing to the correct registers! This is a bug!!”

From the UARTâ€™s perspective, the chip design and verification infrastructure was as follows:

Figure 1: UART loopback path

The UART IP was buried deep within the chip core, under several layers of hierarchy. The signals on the core were multiplexed onto several different pins and went through an I/O layer to the SoC boundary. This I/O layer contained the functional and test muxes, BSR cells, some power isolation logic and finally the I/O cells themselves.

The SoC boundary was where Duolog's testbench domain started. Duolog instantiated the SoC top-level in our testbench and did some de-multiplexing to get the correct pins to the correct testbench signals. The testbench provided functionality to loop back certain signals, including the ability to connect the UART TX port back to its RX pad. This allowed a byte to be transmitted serially over the TX pin and straight back through the RX pin where it would be routed through multiple levels of SoC hierarchy to the UART block. Within the UART, it would be received as a byte which the software could then view through a read access. This was the functionality that was not working for the software engineer.

The dialogue continued:

TB DESIGNER: “Is the UART working? Is it transmitting?”

The IP designer immediately, and understandably, defended his design and assured us that it was indeed working. He pulled up a screen of waveforms and showed the following:

Figure 2: RX not looped back

IP DESIGNER (defensively): “Look at this waveform, the transmit is working.”

IP DESIGNER (more assured): “Your testbench is not looping the signals back correctly - see the flat-line on the RX signal.”

SW ENGINEER (nodding in compliance): “Yep. The testbench is not working. It’s delaying my testing!”

TB DESIGNER: “What exactly are we looking at here?”

IP DESIGNER & SW ENGINEER (in unison): “The UART interface.”

TB DESIGNER: “But from where?”

IP DESIGNER (dismissively): “The UART, of course.”

The UART was transmitting down 6-7 levels of hierarchy and a lot could go wrong in those intermediate levels.

TB DESIGNER: “Can you show me the boundary of the chip then?”

The IP designer obliged but first had to stop the simulation and set up a trace for the ports of the chip top-level. After this the software engineer took over and re-ran the simulation. About 10 minutes later, the results came back - a screen full of waveforms with obscure names.

TB DESIGNER: “Which ones are the UART TX and RX?” After consulting various specifications they figured out that, in that particular configuration mode, the TX and RX should be on pins 72 and 73 respectively. We homed in on pins 72 and 73 and saw the following waveforms:

Figure 3: Chip boundary

TB DESIGNER (somewhat relieved): “It’s not even making it to the testbench. Let’s check the core boundary.”

After selecting new trace signals and re-running the simulation, the UART signals on the SoC core boundary appeared as follows:

Figure 4: SoC core boundary

TB DESIGNER: “Now we’re making progress. It’s an I/O bug!”

‘It’s an I/O BUG!’

The search had only just begun. The engineer responsible for the I/O layer integration was summoned to help diagnose the root cause.

Figure 5: Is it an I/O bug?

The symptoms of the problem and how the bug was isolated were explained. More information was needed and all of the signals in the I/O layer had to be traced. The huge RTL source file for the I/O layer was opened, along with several Excel sheets containing the I/O specification. After consulting the Excel sheets, it was confirmed that pins 72 and 73 should indeed contain the UART signals.

INTEGRATION ENGINEER (anxiously): “The I/O specification changed just yesterday so the UART signals will be moving to different pins in the next revision of the I/O Layer RTL.”

The integration engineer checked the RTL to make sure it was the latest version. Inside the RTL code, he went to the pin representing the UART TX. There were multiple concurrent statements and several instantiations. He flicked between the RTL code and the waveform viewer for the I/O layer. He grouped several signals together in the waveform viewer and analyzed them. He made the following brief summary:

INTEGRATION ENGINEER: “OK. There is no toggling on the I/O cell or BSR cell so the problem is somewhere in the pin multiplexing.”

Figure 6: Output of pin multiplexing

After spending several minutes analyzing the code and the values, the integration engineer picked up the phone and asked the DFT engineer to come and take a look at the RTL. He promptly arrived and they pored over the code until it all became clear. The functional multiplexing was working but was being overridden by the DFT multiplexing which was forcing the output to a test signal instead. The associated expression was using the wrong polarity for the test enable signals. The DFT engineer re-coded it on the spot and the co-simulation was re-run. A positive result came back:

Figure 5: Chip pin working correctly
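The polarity mistake is easy to picture with a tiny behavioural model. The following is only a sketch under stated assumptions: the function names, the constant test signal and the active-high test enable are ours for illustration and are not taken from the customer's RTL.

```python
# Illustrative model only: names and the active-high test enable are
# assumptions, not the actual I/O layer code.

def pad_out_buggy(func_sig: int, test_sig: int, test_enable: int) -> int:
    """Inverted polarity: the test path wins while test_enable is 0."""
    return func_sig if test_enable else test_sig


def pad_out_fixed(func_sig: int, test_sig: int, test_enable: int) -> int:
    """Intended behaviour: the test path wins only while test_enable is 1."""
    return test_sig if test_enable else func_sig


def is_toggling(wave: list[int]) -> bool:
    """A waveform toggles if it takes more than one value."""
    return len(set(wave)) > 1


# One UART frame on the TX line: idle, start bit, eight data bits, stop bit.
uart_tx = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
TEST_SIG = 0     # test signal held constant in functional mode (assumed)
TEST_ENABLE = 0  # functional mode: the test logic should be transparent

buggy_pin = [pad_out_buggy(bit, TEST_SIG, TEST_ENABLE) for bit in uart_tx]
fixed_pin = [pad_out_fixed(bit, TEST_SIG, TEST_ENABLE) for bit in uart_tx]

print("buggy pin toggles:", is_toggling(buggy_pin))
print("fixed pin toggles:", is_toggling(fixed_pin))
```

With the inverted select, the pin follows the constant test signal and flat-lines even though the UART is transmitting, which matches the symptom seen at the chip boundary.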

This meant that the testbench was working as expected, which was a great relief! Satisfied, the integration and DFT engineers left. However, the software test continued to fail. They followed the signalling to the SoC core and again encountered some bad news.

Figure 7: SoC core has no toggling

It seemed as if the input multiplexing path was not working. The integration engineer was called again. He referred to the I/O specification, which stated that the UART_RX could be sourced from several different pins, depending on a mode register set by software. He found the problem quickly - an incorrect mode decode had been coded into the input multiplexer - and within 30 minutes, they had the RX correctly toggling at the IP boundary.

Figure 6: Correct at last at the IP boundary!!
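A mode-decode bug of this kind can be sketched as two lookup tables that disagree. This is a hypothetical illustration: the mode values, pins 15 and 40, and the swapped entries are invented, and only pin 73 as a UART RX pin comes from the story above.

```python
# Hypothetical example: mode values and pins 15 and 40 are invented.
SPEC_RX_PIN_BY_MODE = {0: 73, 1: 15, 2: 40}  # what the I/O specification says
RTL_RX_PIN_BY_MODE = {0: 15, 1: 73, 2: 40}   # what was coded (modes 0 and 1 swapped)


def uart_rx(pins: dict[int, int], mode: int, decode: dict[int, int]) -> int:
    """Return the value the UART sees on RX for a given mode register value."""
    return pins[decode[mode]]


def decode_mismatches(spec: dict[int, int], impl: dict[int, int]) -> list[int]:
    """List the mode values for which the implementation disagrees with the spec."""
    return sorted(mode for mode in spec if impl.get(mode) != spec[mode])


# The testbench loops TX back onto pin 73; the other candidate pins stay at 0.
pins = {73: 1, 15: 0, 40: 0}

print(uart_rx(pins, 0, SPEC_RX_PIN_BY_MODE))  # value the spec says the UART should see
print(uart_rx(pins, 0, RTL_RX_PIN_BY_MODE))   # value the mis-decoded mux delivers
print(decode_mismatches(SPEC_RX_PIN_BY_MODE, RTL_RX_PIN_BY_MODE))
```

Comparing the coded decode against the specification table, mode by mode, is the check the integration engineer was effectively doing by hand.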

This had taken most of the day, but they had found and fixed two I/O bugs. However, this story was replayed on an almost daily basis, with a whole variety of bugs, over the next three weeks. I/O bugs were found hiding in the following habitats:

Functional Hardware Environment

Incorrect signal or incorrect modes coded on output path
Incorrect signal or incorrect modes coded on input path
Software setup not configuring the device properly
Test logic interference

Testbench Environment

Testbench muxing not correctly configured
Testbench control not set up correctly

Specification Environment

Bugs re-appearing because of frequent changes to the I/O specification

The root cause of many of the problems was incremental changes to the I/O specification, requiring manual modification of the I/O layer code. This was a classic case of an Interpret-Translate-Feedback type of bug. The I/O specification was a live document and changed very frequently. There was a semi-automated process which produced snippets of RTL code that then needed to be integrated into the final I/O layer RTL. On occasion, bugs that had been fixed didn’t make it back into the central database; the fixes were lost and had to be made again.

We eventually published a set of slides showing where to look for signals toggling, and what to check for at various stages of the path through the chip and out to the testbench.

Figure 9: Debug slides
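The debug procedure amounts to walking the signal path stage by stage and finding the first point where the signal stops toggling. A minimal sketch of that idea follows; the stage names mirror the path described earlier (IP, core boundary, pin multiplexing, BSR cell, I/O cell, chip pin), while the identifiers and trace data are invented for illustration.

```python
# Stage order follows the path described in the article; the identifiers
# and the trace data below are invented for illustration.
TX_PATH = [
    "uart_ip_tx",        # UART IP boundary
    "core_boundary_tx",  # SoC core boundary
    "pin_mux_out",       # functional/test pin multiplexing
    "bsr_cell_out",      # boundary-scan (BSR) cell
    "io_cell_out",       # I/O cell
    "chip_pin_72",       # SoC boundary, where the testbench domain starts
]


def is_toggling(wave):
    """A waveform toggles if it takes more than one value."""
    return len(set(wave)) > 1


def first_flat_stage(path, traces):
    """Return the first stage whose trace is flat or missing, else None."""
    for stage in path:
        if not is_toggling(traces.get(stage, [])):
            return stage
    return None


frame = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
flat = [0] * len(frame)
traces = {
    "uart_ip_tx": frame,
    "core_boundary_tx": frame,
    "pin_mux_out": flat,
    "bsr_cell_out": flat,
    "io_cell_out": flat,
    "chip_pin_72": flat,
}

print(first_flat_stage(TX_PATH, traces))
```

Tracing every stage in one run, rather than adding one probe per simulation, would have saved several of the ten-minute re-runs described above.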

Cost of an I/O bug

I/O bugs are typically ‘blocking’ bugs. For the project in question, the embedded software was on the critical path and an unstable I/O layer stood between the software engineers and their integration goals. The I/O layer is also at the cross-functional boundary of a number of disciplines, so determining the root cause of problems in the I/O layer can be quite time consuming. Every time something related to the chip-level environment didn’t work, Duolog was called in and inevitably there were bugs in the I/O layer. Every bug had to be located, assigned, fixed, validated, closed and checked, adding delays of days, or even weeks, to the embedded software schedule. Not only were the software engineers working flat-out to solve the problems, so too were the IP team, the integration team, the architecture team and the testbench team.

The cost of all the I/O bugs could be measured in:

Resource cost of handling bugs - analyze, find, fix (exterminate), validate, close - across all teams
Delays in turnaround of I/O bugs resulting in delays to the HW/SW integration schedule, which ultimately resulted in delays to the product release
Cost to re-spin a chip if an I/O bug makes it to silicon!

Exterminating I/O bugs

If one bug gets into the system, even a small one, then it becomes critical. How do you catch these bugs? Do you only uncover them when they have already done the damage? Do you hire exterminators to get rid of them temporarily, and repeat this every so often? Obviously, the most effective, and environmentally friendly, way of getting rid of bugs is not to let them in to begin with!

Figure 10: Stop the bugs from entering in the first place

In order to keep I/O bugs out of your integration flow, there are a few fundamental characteristics that are required:

Use a single-source executable specification to capture all of your I/O data, and derive all outputs from this source. Ensure that all I/O users are within this scope, as bugs will quickly infiltrate a flow that has data originating in different places; every separate source is a gap in your bug barrier!
Use a comprehensive and rigorous suite of design rule checks to ensure a high level of quality for the I/O specification. This will ensure that even the smallest bugs can’t get through the barrier.
From your validated central specification, auto-generate all views of the I/O layer, including RTL, verification infrastructures, software configuration, die and package netlists, I/O register descriptions and documentation. Cover everything possible with this automated process as anything that is exposed can be infected by a bug.
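To make the three characteristics concrete, here is a minimal sketch of the idea: one table of pin mappings, a couple of design rule checks run against it, and two views generated from the same data. It is not a description of any particular tool; the `PinMapping` record, the rules chosen, the generated formats and every pin other than 72 and 73 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinMapping:
    signal: str     # functional signal name
    pin: int        # chip pin number
    mode: int       # mode register value that selects this mapping
    direction: str  # "in" or "out"


# Single source of truth. Pins 72/73 come from the story; the rest is invented.
IO_SPEC = [
    PinMapping("uart_tx", 72, 0, "out"),
    PinMapping("uart_rx", 73, 0, "in"),
    PinMapping("uart_rx", 15, 1, "in"),
]


def check_spec(spec):
    """Design rule checks: valid directions, one signal per (pin, mode)."""
    errors = []
    owner = {}
    for m in spec:
        if m.direction not in ("in", "out"):
            errors.append(f"{m.signal}: invalid direction {m.direction!r}")
        key = (m.pin, m.mode)
        if key in owner and owner[key] != m.signal:
            errors.append(
                f"pin {m.pin}, mode {m.mode}: {owner[key]} clashes with {m.signal}"
            )
        owner.setdefault(key, m.signal)
    return errors


def generate_sw_header(spec):
    """Software view: one C-style define per mapping."""
    return [f"#define {m.signal.upper()}_MODE{m.mode}_PIN {m.pin}" for m in spec]


def generate_tb_pinmap(spec):
    """Testbench view: signal -> {mode: pin}."""
    pinmap = {}
    for m in spec:
        pinmap.setdefault(m.signal, {})[m.mode] = m.pin
    return pinmap


errors = check_spec(IO_SPEC)
if errors:
    raise SystemExit("\n".join(errors))  # never generate from a failing spec
print("\n".join(generate_sw_header(IO_SPEC)))
print(generate_tb_pinmap(IO_SPEC))
```

Because both views are derived from `IO_SPEC`, a pin move in the specification reaches the software and the testbench together, and a clash is rejected before anything is generated.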

An I/O integration flow with these characteristics is the only effective way to eliminate bugs by not letting them into the system in the first place. As the old proverb says, ‘An ounce of prevention is worth a pound of cure’. Automate your I/O flow - keep the bugs out.

http://www.duolog.com/

David Murray is CTO of Duolog Technologies and was the original designer of Spinner, an award-winning tool that fully automates the I/O layer of a chip. With more than 18 years' experience in the IC design industry across a wide range of disciplines, David has written and presented several papers on topics from functional coverage to algorithmic IP.

Sean Boylan is the Spinner product manager and has 14 years' experience developing innovative software for the EDA and telecommunications industries. Prior to Duolog he worked for 3Com and Ericsson.
