STAC: Advanced inter-die communication Technology
By Andrew Jones, Stuart Ryan
STMicroelectronics R&D Ltd
This paper outlines a new architecture for implementing a communication channel between two highly integrated dice. It shows how the on-chip interconnect can be extended to bridge between chips while retaining high bandwidth and low latency. In addition, the technique allows other signals to be integrated into this communication channel without sideband signals, in a low pin-count, low-power architecture. This arrangement provides a universal link that allows multiple chips within a package, which may have been designed independently, to cooperate.
System in package (SiP) technology is now widely deployed for a number of different reasons. One observer has noted that: "In the 21st century the new challenge is not how many transistors can be built on a single chip, but rather how to integrate diverse circuits together predictably, harmoniously and cost effectively."
Many of the reasons for the prominence of SiP technology are ways of circumventing the difficulties of integrating heterogeneous circuits monolithically. One example is when integrating noisy digital circuits with noise-sensitive analog circuits proves too challenging. Another is yield: since the likelihood of a defect scales with die area, integrating large digital logic with smaller functions, e.g. RF devices, can reduce yield, and SiP technologies can be used to address this.
The most common SoC methodologies involve working with a single technology library, and partitioning decisions are highly dependent on library construction and on the routing capabilities identified within the tools. The higher integration capacity of SiP reduces the complexity of each partitioned system and reduces the size and routing complexity of the printed circuit board (PCB). Some SiP design challenges arise from the lack of a common design infrastructure across ASIC technologies and from the multitude of layout possibilities. Packaging concepts include chip stacked on chip, flip-chip stacked on chip and chips placed side by side in a package.
So, in forthcoming generations of integrated consumer electronic devices, chips are increasingly likely to be constructed from more than one die within a single package. There are two important motivations for this:
1) The decreasing feature size in CMOS silicon processes allows digital logic to shrink significantly in each successive fabrication technology. For example, an area shrink of around 50% could be expected when moving a digital logic cell from 90nm to 65nm. However, for several reasons, analog and IO cells typically shrink very little in smaller geometries, yet their cost does increase, since denser technologies are typically more expensive per unit area. The use of non-shrinking analog circuits for IO can also lead to increasingly pad-limited designs.
2) The transition to sub-32nm designs introduces a dichotomy between supporting low-voltage, high-speed IO logic (for example DDR3 at 1.5V and 800MHz+) and higher-voltage interconnect technologies (for example HDMI, SATA and USB3). The lower-voltage DDR3 interface requires a thinner transistor gate oxide than, say, HDMI. Supporting both gate oxide thicknesses can represent an extra, unwelcome cost. Perhaps more decisively, though, porting high-speed analog interfaces to a new process consumes significant resources in terms of characterisation and qualification time. Decoupling the implementation of analog blocks from that of the major digital parts of a system reduces the time to working silicon.
Splitting a conventional monolithic SoC (System-on-Chip) into multiple dice to form a SiP alleviates these pressures. In the consumer space an example SiP might comprise a 32nm die containing high-speed CPUs, DDR3 controller(s) and differentiating IP, connected to a 55nm die containing analog PHYs. Because it carries a reduced set of analog IP, the 32nm die gets the maximum benefit from the shrink.
In fact, the compelling reason here is time to market. It is simply much faster to implement certain types of analog circuits using trailing-edge lithography and thus re-use hardened cores together with their validation, their analog characterisation and, in some cases, their third-party certification. The SiP challenges are the price we pay for this.
While the initial step in this strategy would be to implement two dice in the SiP, it is easy to see that more feature-rich devices can be considered if more dice are included. For example, an RF die could be added to support a TV tuner, or perhaps a wireless networking PHY layer, within the same package.
Figure 1: Example Multimedia implemented as a SIP
The conventional way of doing this is essentially to design the system as one would a mono-die, then partition the system into a left subsystem and a right subsystem so as to minimize the communication between the two. This is performed while maintaining a balance between the two subsystems that makes them advantageous (in design terms) for implementation. The two subsystems are then implemented on different dice, and the identified inter-die communication is simply the set of signals that need to pass from left to right and vice versa.
There are a number of problems with this method, for example:
- This can lead to a problematically high number of wires; this is a challenge because the number of wires which can be efficiently used to link left die to right die is severely limited in many technologies.
- This approach makes modifying IP blocks in either die difficult if the change requires new or different signals to cross between the dice.
- It makes die re-use particularly difficult, as in general this approach leads to highly specific inter-die traffic.
- Independent testing, validation or packaging of the dice may be problematic unless the inter-die communication has been designed particularly carefully.
- If any of the signals carry secrets or otherwise allow eavesdroppers or spoofers to gain access to secure information, an ad-hoc solution must be adopted to limit this vulnerability particularly if the inter-die communication is considered accessible. This may or may not be the case depending on the packaging technology and the applicable security standards.
The remainder of this paper describes some new aspects of an interface that enables large SiP designs which are split across multiple dice. This point-to-point interface is referred to as STAC (STBus Advanced inter-die Communications) and is currently in deployment in production devices.
Serialised inter-die Network
We provide a STAC port to the on-chip interconnect which is taken to the edge of the die. The STAC port reduces the width of the on-chip interconnect from 128 bits to as few as 8 bits and uses a high-speed serial interface to bridge to a similar port on the other die, which in turn links to that die's on-chip interconnect. This allows the on-chip network to carry packetized memory requests and responses onto a flow-controlled, token-based, high-speed serialised network between the dice. The STAC implements a wormhole-routed port with virtual channels and dedicated buffers to respect end-to-end Quality of Service (QoS) commitments.
Typically, the vast majority of communication between the two dice connected by a STAC will be read or write transactions to the memory address space of either chip. However, there will also be significant communication in the form of assertion and de-assertion of interrupt lines, DMA handshakes, reset requests and acknowledges, power-down requests and so on. These are commonly referred to as out-of-band (OOB) signals. Conventionally, OOB signals are routed separately, either by dedicating a specific wire in the inter-die interface to each function (as would occur if the chip were implemented mono-die) or by encoding the signals onto a reduced set of wires (e.g. the PCI 2.1 standard's use of four physical wires, INTA-INTD, to carry an arbitrary number of logical interrupts). Note that, to save pins, several existing on-chip interconnects (previously bus-based) multiplex data and command onto fewer wires than exist logically. This normally adds latency and increases power (consider that, in the absence of multiplexing, the higher-order bits typically carry the fewest transitions). However, this is not the case with packetized networks, where packets are normally folded and the concentration of power in the interconnect is therefore lessened. In the narrow forms we use here (8 or 16 bits) this does not represent an issue.
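As a rough illustration of the width reduction described above, the following Python sketch folds one 128-bit interconnect word into sixteen 8-bit link slices and reassembles it on the far side. The function names, the low-slice-first ordering and the absence of any header or flow-control token are assumptions made for illustration; the paper does not specify the STAC wire format.

```python
def fold(word: int, word_bits: int = 128, link_bits: int = 8) -> list:
    """Split one wide interconnect word into link-width slices.

    Assumption: the least significant slice is sent first.
    """
    if word_bits % link_bits:
        raise ValueError("word width must be a multiple of the link width")
    if not 0 <= word < (1 << word_bits):
        raise ValueError("word does not fit in word_bits")
    mask = (1 << link_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, link_bits)]


def unfold(slices, link_bits: int = 8) -> int:
    """Reassemble the wide word from slices received in transmission order."""
    word = 0
    for index, piece in enumerate(slices):
        word |= piece << (index * link_bits)
    return word


word = 0x0123456789ABCDEF0011223344556677
slices = fold(word)            # sixteen 8-bit slices for a 128-bit word
restored = unfold(slices)      # equals the original word
```

A narrower link simply trades more slices per word for fewer pins; calling fold with link_bits=16 yields eight slices instead of sixteen.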
Memory transactions are carried over the STAC as a sequence of segments. In this paper we focus on one interesting type of STAC segment, the wire STAC segment. Each OOB signal is allocated a position within one of several bundles of wires. Each bundle is transmitted as a single wire STAC segment carrying a bundle identifier called a virtual channel. The receiver of such a segment can associate the state of each bit within the segment with the state of a specific wire within the bundle indicated by the virtual channel. This scheme is illustrated in Figure 2. Of course, the state of each wire in a bundle is not transmitted continuously. Instead, the wires are sampled at regular intervals, the samples are transmitted across the STAC in wire STAC segments alongside data traffic, and each sample is used to update a register that holds the state of each OOB signal on the other side of the STAC. The STAC performs this transmission bi-directionally, so that wires can be virtually connected from either side. The sampling rate, the number of bundles transmitted and the priority with which these bundles are transmitted are all configurable.
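The bundle mechanism can be sketched in a few lines of Python. This is a minimal model only: the class and function names and the choice of one bit per wire, with bit position equal to wire position, are assumptions for illustration and are not taken from the STAC specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireSegment:
    """One sampled bundle: the virtual channel identifies the bundle."""
    virtual_channel: int
    state: int  # bit i holds the sampled level of wire i in the bundle


def sample_bundle(virtual_channel: int, wires) -> WireSegment:
    """Transmit side: snapshot the current wire levels into one segment."""
    state = 0
    for position, level in enumerate(wires):
        if level:
            state |= 1 << position
    return WireSegment(virtual_channel, state)


class BundleReceiver:
    """Receive side: one register per bundle holds the last sampled state."""

    def __init__(self, num_bundles: int):
        self.registers = [0] * num_bundles

    def receive(self, segment: WireSegment) -> None:
        self.registers[segment.virtual_channel] = segment.state

    def wire(self, virtual_channel: int, position: int) -> bool:
        return bool((self.registers[virtual_channel] >> position) & 1)


# Bundle 2 carries four OOB wires; wires 0 and 3 are currently asserted.
segment = sample_bundle(2, [True, False, False, True])
receiver = BundleReceiver(num_bundles=4)
receiver.receive(segment)
```

After the segment is received, receiver.wire(2, 3) reflects the level of wire 3 on the transmitting die as of the last sample, which is the sense in which the wire is virtually connected.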
The physical implementation of the STAC is not key for this technique. Any suitable high-speed link technology is a candidate for implementing this invention. In initial uses of this architecture we used similar IO to that of the DDR3 interface. The STAC can be implemented in either serial or parallel form as required by the constraints of the design.
Figure 2: Virtual Wiring
Figure 3: STAC wire segment generation
Figure 4: STAC wire segment multiplexing
Quality of Service
Virtual wires will convey signals related to interrupts, resets, power-state change requests, handshakes e.g. for controlling DMA, and many other types of signals. The QoS of the transmission and reception of these signals relates to 5 parameters:
- Guaranteed delivery
- Delivery order
In a conventional system deploying real wires on a single silicon die, items 2, 3 & 4 are not usually considered an issue. Delay (latency), even for a single wire or set of wires traversing much of the length of a chip, is often well characterised, so that the propagation delay is known precisely. On-chip control signals are expected to have an error rate below the threshold at which it is economically worth addressing for consumer devices.
In a virtual wiring environment things are complicated by two facts: 1) the wires are sampled at a finite rate, and 2) the STAC segment packets which carry the signals are multiplexed across a link and hence may be delayed in transmission by an amount of time that depends on what other packets are attempting to use the link concurrently.
Because of the characteristics of the system, STAC segment packets are guaranteed to be delivered in the order in which they were transmitted (i.e. no overtaking within a priority class). Because the link is implemented in the very controlled electrical environment of either on-silicon or between silicon dice within the same package, transmission is deemed error-free.
The problem is then to architect the system so that it can:
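The ordering property above (no overtaking within a priority class) can be sketched as an arbiter with one FIFO per class. The strict-priority policy, the class numbering (0 is highest) and the packet labels are assumptions chosen for illustration; the paper states only that prioritisation is configurable.

```python
from collections import deque


class LinkArbiter:
    """Chooses the next packet for the link: one FIFO queue per priority class."""

    def __init__(self, num_classes: int):
        self.queues = [deque() for _ in range(num_classes)]

    def enqueue(self, priority: int, packet) -> None:
        # Packets of the same class keep their arrival order.
        self.queues[priority].append(packet)

    def next_packet(self):
        # Assumed policy: always serve the highest non-empty class first.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None  # link idle


arbiter = LinkArbiter(num_classes=3)
arbiter.enqueue(2, "Nl0")  # low-priority NoC packet
arbiter.enqueue(0, "I0")   # wire packet
arbiter.enqueue(1, "Nh0")  # high-priority NoC packet
arbiter.enqueue(0, "I1")   # second wire packet, same class as I0
sent = [arbiter.next_packet() for _ in range(5)]
```

Packets I0 and I1 share a class and therefore leave in the order they arrived, whereas a packet in a higher class may be sent ahead of an earlier packet in a lower class.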
A) Commit to a limited delay between an incoming signal changing state at a bundle bank register on the transmitting die and the equivalent signal changing state at the corresponding bundle bank register on the receiving die.
B) Commit to a constrained variation (jitter) in the delay above. For example, a QoS commitment would guarantee that the delay for a virtual wire will be no more than D nanoseconds and that the jitter will be no more than J nanoseconds.
Our solution revolves around being able to control:
- The sample rate S at which the signal is converted to a STAC segment packet.
- The prioritisation P of the queue at the STAC interface which arbitrates which of the STAC packets ready for transmission will be transmitted next.
- The option of sampling a bundle and transmitting a packet not at a regular sample rate but whenever any signal associated with the bundle changes state. We call this an activity-based bundle to distinguish it from the sample-based bundles described previously.
It is stated here without proof that D and J are both functions of S and P and of other (invariant) aspects of the system.
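To make the dependence on S and P concrete, the sketch below uses a deliberately simple model that is our assumption and not a result from this paper: a state change waits at most one sample period, then at most a priority-dependent blocking time for the link, then a fixed circuit and transit time. All parameter names and figures are hypothetical.

```python
from typing import Optional, Tuple


def delay_bounds(sample_period_ns: float,
                 blocking_ns: float,
                 fixed_ns: float) -> Tuple[float, float]:
    """Return (worst-case delay D, jitter J) under the assumed model.

    sample_period_ns: 1/S, the longest wait before the next sample.
    blocking_ns: longest wait for the link at the chosen priority P.
    fixed_ns: invariant circuit and transit delay.
    """
    best_case = fixed_ns  # sampled immediately, link idle
    worst_case = sample_period_ns + blocking_ns + fixed_ns
    return worst_case, worst_case - best_case


def max_sample_period(d_ns: float, j_ns: float,
                      blocking_ns: float, fixed_ns: float) -> Optional[float]:
    """Longest sample period that still meets both D and J, or None."""
    period = min(d_ns - blocking_ns - fixed_ns, j_ns - blocking_ns)
    return period if period > 0 else None


d, j = delay_bounds(sample_period_ns=100.0, blocking_ns=20.0, fixed_ns=15.0)
period = max_sample_period(d_ns=135.0, j_ns=120.0, blocking_ns=20.0, fixed_ns=15.0)
```

Under this model, raising the priority of a bundle reduces the blocking term, while raising the sample rate reduces the sampling term; both lower D and J.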
In some situations, achieving a satisfactorily low D and J for sample-based bundles would require a very high sample rate S. This may cause problems because the link may become inundated with wire STAC segments, many of which carry no state change and are thus redundant, degrading the service received by other users of the STAC link. For this situation, logic is provided which triggers a sampling of a bundle register only when it detects an edge on any of the signals latched by that register. In this case packets do not incur a sample-quantum wait; the end-to-end delay is simply the sum of the delays of the various circuits involved in generating and receiving the packet, and so the delay is bounded. Because this mechanism never transmits a virtual wire packet without a state change, it cannot saturate the link with redundant packets.
Figure 5: Sampling Control
Figure 6: STAC packet Prioritization
Figure 7: Virtual Wiring multiplexing of wire (I) packets with high (Nh) and low (Nl) priority NoC packets on a STAC link
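The difference in link load between the two schemes can be seen in a small Python sketch. The trace, the tick-based time base and the function names are assumptions for illustration; each list element is the state of one bundle register at one clock tick.

```python
def sample_based(trace, period: int):
    """Emit a snapshot every `period` ticks, whether or not anything changed."""
    return [(tick, trace[tick]) for tick in range(0, len(trace), period)]


def activity_based(trace, initial: int = 0):
    """Emit a snapshot only on ticks where the bundle state differs from the last one."""
    packets = []
    previous = initial
    for tick, state in enumerate(trace):
        if state != previous:
            packets.append((tick, state))
            previous = state
    return packets


# One wire in the bundle rises at tick 10 and falls at tick 15.
trace = [0] * 10 + [1] * 5 + [0] * 5
periodic = sample_based(trace, period=2)
on_change = activity_based(trace)
```

For this trace the sample-based scheme emits ten packets, eight of which carry no change, and the fall at tick 15 is not seen until the sample at tick 16. The activity-based scheme emits two packets, each on the tick of the edge itself.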
This paper has outlined new techniques for implementing a universal communication channel between systems on different silicon dice. It shows how the on-chip interconnect can be extended between chips while retaining high bandwidth and low latency. In addition, the technique allows other signals to be integrated into this communication channel to reduce pin count and power consumption, in a universal manner that allows arbitrary implementations of such interfaces to retain their interface compatibility. To allow arbitrary mixing and matching of companion pairs of dice, an architecture standard that defines address map, interrupt, reset and handshake conventions needs to be applied. STMicroelectronics is deploying this technology across a range of consumer video devices.
Thanks to the STMicroelectronics R&D team in Grenoble, France, as a whole, and in particular to Ignazio Urzi and Dominique Henoff, and also to Alberto Scandurra of the R&D team in Catania, Italy, for their experience, insights and pragmatism in implementing our architectural ideas.
 R. Deaves and A. Jones, A Toolkit for Rapid Modeling, Analysis and Verification of SoC Designs, IPSOC, Nov. 2003.
 R. Deaves and A. Jones, An IP-based SoC Design Kit for Rapid Time-to-Market, IPSOC, Dec. 2002.
 A Jones & S. Ryan. A re-usable architecture for functional isolation of SoCs. IP 07 IP Based Electronic System Conference. Dec 2007
 King L. Tai, System-in-package (SIP): challenges and opportunities, Proceedings of the 2000 conference on Asia South Pacific design automation, p.191-196, January 2000, Yokohama, Japan
 Trigas, C. Design challenges for system-in-package vs system-on-chip Custom Integrated Circuits Conference, 2003. Proceedings of the IEEE 2003 P.663 - 666
 Goetz, M. System on chip design methodology applied to system in package architecture Electronic Components and Technology Conference, 2002. Proceedings. 52nd Pp 254 - 258
 Pfahl R.C and Adams,J. System in Package Technology, International. Electronics Manufacturing Initiative (iNEMI), 2005.