Wilhelm Schreiber, Research and Development Engineer, Siemens Information and Communications Network Division, Munich, Germany; Thomas Zipper, Research and Development Engineer, Siemens Information and Communications Network Division, Lake Mary, Florida
When our R&D team in the Siemens Information and Communications Network Division set out to develop an advanced network communications device, the "ACE" (AAL2 Connecting Element), we faced serious challenges. We had a client that needed the device very quickly, so we had to deliver a working design rapidly. We also needed to implement some highly sophisticated algorithms that were not only a design challenge in themselves but also created very tight timing constraints. Finally, the system had three different clock domains that needed to be managed. Our solution was to implement the design in an FPGA and to employ physical optimization during synthesis. This strategy allowed us to achieve our design and schedule objectives.
The ACE is the switching center within a Universal Mobile Telecommunications System (UMTS), the latest generation of mobile radio systems. UMTS is the successor to GSM and is being proposed as a universal standard. The ACE handles AAL2 (ATM Adaptation Layer 2) switching. The ACE device is part of a mobile switching center (MSC), which in the first version is able to mediate up to three hundred thousand subscribers. It provides Utopia L2 and PCI interfaces.
An added benefit of the FPGA implementation was that it gave us the opportunity to change the design to correct errors or adapt to new standards. Our requirements were not stable, and we knew that subsequent changes were almost a certainty. We also needed fewer than ten thousand pieces, making FPGAs a much more cost-effective alternative to a $750,000 ASIC. Thus the ACE was implemented in two 300K-gate Xilinx Virtex 2000E arrays.
At the beginning, it was very difficult to estimate the final size of our design, so we elected to divide the functionality into two physically unique circuits, ACE1 and ACE2. Both ACE1 and ACE2 have three clock domains: a 48-MHz Utopia interface, a 52-MHz internal clock and a 33-MHz PCI interface. ACE1 is responsible for importing, depacketizing and packetizing ATM cells. The cells are imported into a receiver where they undergo reassembly to AAL2 packets.
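To make the timing budget of each clock domain concrete, the three frequencies above can be converted into clock periods. The following is only an illustrative sketch; the dictionary keys and the function name are hypothetical labels chosen for this example, not identifiers from the ACE design.

```python
# Clock frequencies of the three domains, as stated in the text (in MHz).
# The keys are illustrative labels only.
CLOCKS_MHZ = {"utopia": 48.0, "internal": 52.0, "pci": 33.0}


def period_ns(freq_mhz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in MHz."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


# Period available to any register-to-register path in each domain.
PERIODS_NS = {name: period_ns(f) for name, f in CLOCKS_MHZ.items()}

if __name__ == "__main__":
    for name, period in PERIODS_NS.items():
        print(f"{name:>8}: {period:.2f} ns per cycle")
```

At 52 MHz, for example, a path has roughly 19.2 ns from one clock edge to the next, which must cover logic delay, routing delay and setup time.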
ACE2 comprises two functions: pointer management and a scheduler. Pointer management is responsible for managing the 32K ATM cells that are stored in queues within the cell RAM. The scheduler decides at which time each cell must leave the ACE. Linked lists within the scheduler required very sophisticated memory management to handle pointers between RAMs. These are implemented in on-board FPGA RAM because of the number of addresses that need to be switched. Finally, the ATM cells are assigned a header and output over the Utopia interface.
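The general idea of managing queued cells through pointers held in RAM can be sketched in software. The example below is a minimal model, assuming a common free-list scheme with one linked list per queue; it is not the ACE2 implementation, and the class name, method names and the number of queues are hypothetical. Only the figure of 32K cell slots comes from the text.

```python
NULL = -1  # marks "no next slot" in the pointer memory


class CellPointerManager:
    """Software model of linked-list queues over a fixed pool of cell slots.

    next_ptr plays the role of a pointer RAM: entry i holds the slot that
    follows slot i, either in a queue or in the free list.
    """

    def __init__(self, num_cells: int = 32 * 1024, num_queues: int = 4):
        # Initially every slot is chained into the free list: 0 -> 1 -> ...
        self.next_ptr = list(range(1, num_cells)) + [NULL]
        self.free_head = 0
        self.free_count = num_cells
        self.head = [NULL] * num_queues
        self.tail = [NULL] * num_queues

    def enqueue(self, queue: int) -> int:
        """Take a free slot, append it to the queue, return its address."""
        if self.free_head == NULL:
            raise MemoryError("cell RAM is full")
        slot = self.free_head
        self.free_head = self.next_ptr[slot]
        self.next_ptr[slot] = NULL
        if self.head[queue] == NULL:
            self.head[queue] = slot
        else:
            self.next_ptr[self.tail[queue]] = slot
        self.tail[queue] = slot
        self.free_count -= 1
        return slot

    def dequeue(self, queue: int):
        """Remove the oldest slot of the queue and return it to the free list."""
        slot = self.head[queue]
        if slot == NULL:
            return None  # queue is empty
        self.head[queue] = self.next_ptr[slot]
        if self.head[queue] == NULL:
            self.tail[queue] = NULL
        self.next_ptr[slot] = self.free_head
        self.free_head = slot
        self.free_count += 1
        return slot
```

Every enqueue and dequeue touches several pointer entries, which hints at why read-modify-write paths through RAM can become timing critical in hardware.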
When planning this design project, we determined very early that our methodology needed improvement. We needed to employ a coding style that would take full advantage of the features of the target technology while avoiding its weaknesses. In addition, we knew that the very tight timing within the pointer management and scheduler functions in ACE2 would require highly automated and intelligent synthesis capabilities. In the past, we had found that when the backend design stages were given poor-quality constraints or netlists, a run could take many hours and still leave timing issues unresolved.
To achieve our timing objectives, we attempted several alternatives, with varying degrees of success. We first attempted to replace deep adders and multipliers inferred in the source code by our synthesis tool (the Synplify software) with predefined blocks (hard macros). Unfortunately, this yielded no improvement.
We also considered a couple of methodological changes. We thought of employing modular design in combination with gate level floor planning. Because we were using the Xilinx PCI interface, however, we were unable to deploy this methodology. Likewise, we found that gate level floor planning within the backend offered no benefit as the instability of the design required continuous changes to the source code that in turn necessitated frequently redoing the floor plan.
We did observe some improvement when we employed techniques with our synthesis and backend tools. Specifically within our Synplify synthesis environment, we found that reducing maximum fan-out on different nets led to replication that improved timing. We also found that providing synthesis with all known constraints, even pin location, made a difference. When utilization was not critical, we disabled resource sharing, as well. Within the backend, we observed positive results when we set timing ignores (TIGs), but also found that excessive TIGs dramatically increased runtime. Xilinx multi-pass place and route (MPPR) with different cost tables also was beneficial.
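The effect of a maximum fan-out limit can be illustrated with a small sketch. It assumes the simplest possible policy, splitting the loads of a net into groups in list order so that each replicated driver serves at most the allowed number of loads. A synthesis tool chooses the groups with far more care (for example, by timing criticality), so this shows only the principle; the function and signal names are hypothetical.

```python
def split_fanout(loads, max_fanout):
    """Partition the loads of one net into groups of at most max_fanout.

    Each returned group would be driven by its own copy of the original
    driver, so the number of groups equals the number of driver copies.
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    return [loads[i:i + max_fanout] for i in range(0, len(loads), max_fanout)]


if __name__ == "__main__":
    # Ten hypothetical loads on one net, limited to four loads per driver.
    loads = [f"ff{i}" for i in range(10)]
    for copy, group in enumerate(split_fanout(loads, 4)):
        print(f"driver copy {copy}: {group}")
```

Each copy of the driver sees a lighter load and can sit closer to the loads it serves, at the cost of additional logic.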
With the schedule constraints on this project, however, we knew we needed to adopt more than just incremental improvements to our existing methodology. To help us meet our timing constraints and stay on schedule, we elected to employ physical optimization during synthesis. Our most critical timing path was between a RAM and a 48-bit adder within the scheduler and back to a RAM. When we found that the path did not meet timing after our initial synthesis, we performed physical optimization with the Synplicity Amplify software. We assigned area constraints to the critical paths in the design. The Amplify software produced a detailed placement for the critical path and handed it off to the Xilinx backend where we found that timing issues were resolved. To resolve all timing problems in our design, we set up 12 areas for optimization. We employed Synplicity's HDL Analyst and Amplify floor planner to help apportion the logic into appropriate areas.
The combination of synthesis and physical optimization provided very accurate initial timing estimates and greatly improved productivity. Results were very good when we specified area constraints, and were even better with the advanced Amplify capabilities.
On large workstations with 8 GB of main memory, we were able to synthesize one FPGA in 10 minutes, compared to 2-3 hours for backend design processing. When performing MPPR, we employed a workstation farm, as a sequential processing run could take 200-300 hours. The highly automated functionality of the Synplicity tools also made our job much easier. For example, we were able to replicate a configuration of primitive logic gates and very high fan-out receivers that appear frequently throughout our design. We could also manually replicate as needed, as we did on a timing-critical low fan-out net.
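The benefit of a workstation farm for MPPR follows from simple arithmetic. The sketch below assumes that every pass is independent and takes the same time; the pass count of 100 and the farm size of 20 machines are assumptions made for illustration (100 passes at 2-3 hours each is merely consistent with the 200-300 hour figure above), and the function name is hypothetical.

```python
import math


def mppr_wall_clock_hours(num_passes, hours_per_pass, num_machines):
    """Estimate the wall-clock time of independent place-and-route passes.

    Passes are distributed in rounds: each machine runs one pass per round.
    """
    if num_passes < 0 or hours_per_pass < 0 or num_machines < 1:
        raise ValueError("invalid arguments")
    rounds = math.ceil(num_passes / num_machines)
    return rounds * hours_per_pass


if __name__ == "__main__":
    passes, hours = 100, 2.5  # assumed values, see the note above
    for machines in (1, 20):
        total = mppr_wall_clock_hours(passes, hours, machines)
        print(f"{machines:>3} machine(s): {total} hours")
```

Under these assumptions, a sequential run takes 250 hours, while 20 machines finish in five rounds, or 12.5 hours.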
The ACE design, and in particular achieving the timing requirements in the ACE2 portion, presented tremendous challenges to our design team. By making a relatively minor improvement to our design flow, the addition of physical optimization during synthesis, we were able to overcome these challenges and meet our design objectives.
See related chart