This Whitepaper gives an overview of the Serial ATA (SATA) protocol and the implications of integrating SATA into an FPGA-based programmable system. Besides details of the different protocol layers, we discuss the hardware and software components for building a complete, reliable, high-performance SATA solution utilizing a design platform from Missing Link Electronics (MLE).
MLE provides configurable Single Board Computer platforms where the microcontroller can be optimized towards the Open Source software that runs on it – and the particular application the system serves. These platforms come with a full GNU/Linux software stack pre-installed, comprising kernel, drivers, and a root filesystem with many hundreds of application software packages. This creates an environment similar to an industrial PC with full TCP/IP networking via Ethernet, WiFi, or UMTS/GSM, for example. Also included are many options of adaptable I/O interfaces such as USB, Bluetooth, DVI, Audio, SPI, IIC, CAN, analog I/O, LVDS, GPIO, and – the ASICS World Services’ SATA Host IP core.
This SATA Host IP core has been certified for Serial ATA Revision 3.0 compliance on a Xilinx Virtex-6 FPGA by the UNH IOL SATA Consortium in May 2010.
ASICS World Services provides a broad line of general-purpose IP cores, including a variety of USB-related products such as USB 3.0 Device and USB 2.0 OTG IP cores, as well as SATA I/II/III Device and Host IP cores, encryption (AES), error correction (Reed-Solomon), and many other functions.
ASICS World Services was founded in 1999 and immediately began offering semiconductor products and services. Since then, ASICS World Services has established a worldwide reputation for professional, highest-quality IP cores with flexible licensing at low cost. An expanding customer base is quickly making ASICS World Services one of the leading IP providers in the world.
For more information, please visit: http://www.asics.ws
Serial ATA (SATA) has become the new de facto standard for mass data storage interfaces. Hard disk drives are migrating towards SATA, and newer Solid State Disks (SSDs) are almost always SATA based. Especially the combination of SATA and SSD offers compact, high-performance, high-capacity data access, and programmable systems can only benefit from this: customizable electronic test systems or industrial, scientific, or medical systems now have a viable option for mass data storage without the weight and mechanical restrictions of hard disks.
A programmable system is a computer system where application software, operating system, and hardware can be configured to suit a particular application’s requirements. Field-Programmable Gate Arrays (FPGAs) are the enabling technology that makes even the hardware programmable. The key advantages are additional degrees of freedom for cost and performance optimizations, the flexibility to perform late changes – including field upgrades – and the ability to offer long-term available products, because one becomes more independent of parts availability.
As shown in Fig. 1, multiple layers are involved in full SATA host controller functionality:
Figure 1: Serial ATA Function Layers
Therefore, when it comes to implementing a complete SATA solution for an FPGA-based programmable system, much more than just a high-quality IP core is needed. Two aspects often get overlooked:
First, only the Phy, Link, and some portions of the Transport layer make sense to implement in FPGA hardware and, thus, are provided when buying an IP core. The higher levels of the Transport layer, the Application layer, the Device layer, and the User Program layer are better implemented in software and, thus, are typically not part of an IP core delivery. This may create an unnecessarily high burden of IP core integration and add to the hidden costs of an IP core.
One of the reasons is that those higher layers depend on the target system’s architecture, which in turn depends on the target system’s use case. Therefore, the second aspect often overlooked is that components such as Scatter-Gather DMA engines (which comprise hardware and software) must be implemented, tested, and integrated to deliver a complete solution that ties the IP core together with the user programs.
The purpose of this Whitepaper is, first, to give an overview of SATA, the protocol, and the components necessary to implement a complete SATA solution, and second, to give system architects guidelines for putting these components together to deliver a full solution quickly and predictably.
2 SATA Protocol Layers
This section gives the reader a more detailed understanding of the SATA protocol than Wikipedia; on the other hand, we aim to be more concise than the SATA specification.
- Phy layer - which delivers the electrical interface and which nowadays can be fully implemented inside an FPGA.
- Link layer - sends and receives data frames and takes care of bit errors. In the SATA Host IP core this is built from programmable logic gates.
- Transport layer - controls the read and write operation via so-called Frame Information Structure (FIS) types and is also implemented from programmable logic gates.
- Application layer - handles standard ATA commands to the SATA device. An efficient implementation combines hardware and software.
- Device layer - serves as a hardware abstraction layer (HAL) to make SATA connected devices available to user programs.
- User program layer - comprises program suites for operating (and testing) SATA connected devices.
2.1 Phy layer
Modern FPGA devices feature so-called Multi-Gigabit Transceivers (MGTs), which are suitable for many different high-speed serial protocols. The SATA Host IP core from ASICS World Services utilizes these MGTs to implement high-quality SATA functionality, i.e., the Phy layer of the SATA Host IP core is realized completely within the FPGA. There is no need for external Phy components. Figure 2 illustrates the building blocks of the Phy layer, which consist of the Xilinx Multi-Gigabit Transceivers along with control and management logic for link bring-up and management.
Figure 2: Building blocks of the PHY layer
The built-in Multi-Gigabit Transceivers in Xilinx FPGAs include the following functionality:
- Analog Drivers with adjustable drive strength
- Differential Receivers with line idle detect
- Clock/Data Recovery
- High Speed SERDES
- 8B/10B Encoding
- Symbol Alignment
- Elasticity Buffers
An external Out-Of-Band (OOB) signaling block, similar to the one described in Xilinx Application Note XAPP870, is used to generate SATA COMRESET and COMWAKE signals. It is also used to detect COMINIT and COMWAKE from the device and to initiate exit from power-save modes.
The Reconfiguration Logic block provides a mechanism to reconfigure the transceivers to operate at different data rates, enabling switching between SATA Gen 1 (1.5 Gbps), Gen 2 (3.0 Gbps), and Gen 3 (6.0 Gbps).
The Speed Negotiation block is responsible for managing the link. It detects when a device is connected and negotiates the speed with the device. This block also manages entering and exiting the power-save modes (slumber and partial) and constantly keeps the Link layer updated on the current link status.
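The OOB handshake described above follows a fixed order that can be written down as a short model. The sketch below (a hypothetical helper, host-side view) lists the signal exchange up to the point where speed negotiation begins; the timing and retry behavior of a real Phy are omitted:

```python
def oob_bringup_sequence():
    """Idealized OOB exchange during SATA link initialization, host side.
    Signal order follows the Phy layer description; a real implementation
    must also handle timeouts and retries."""
    return [
        ("host", "COMRESET"),    # host (re-)initializes the link
        ("device", "COMINIT"),   # device signals its presence
        ("host", "COMWAKE"),     # host wakes the device
        ("device", "COMWAKE"),   # device acknowledges
        # ALIGN-primitive-based speed negotiation follows
    ]
```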
2.2 Link layer
The Link layer performs frame-based transactions. It transmits and receives control primitives to manage the flow of frames.
The Link layer creates a bridge between the Transport layer and the Phy layer. It encapsulates data frames in special symbols that indicate the beginning and end of a frame and removes those symbols on the receiving side. It also automatically asserts back pressure to the SATA device by inserting special hold primitives to throttle the transfers, and responds to back-pressure requests from the device by stripping hold primitives and waiting for valid data.
Other responsibilities of the Link layer are to calculate and verify cyclic redundancy checks (CRC) and to scramble/descramble all transmitted data. Scrambling is an important part of the Link layer as it dramatically reduces Electro-Magnetic Interference (EMI).
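To make the scrambling step concrete, here is a bit-serial sketch of a SATA-style scrambler: a 16-bit LFSR (generator polynomial x^16 + x^15 + x^13 + x^4 + 1, per the SATA specification) produces a keystream that is XORed onto each Dword. Real hardware uses a parallelized form, and the exact bit ordering here is illustrative rather than spec-exact. Because the keystream depends only on the seed, applying it twice restores the original data, which is exactly how the receive side descrambles:

```python
def lfsr_keystream(state, nbits=32):
    """Advance a 16-bit Fibonacci LFSR (taps from x^16+x^15+x^13+x^4+1)
    bit by bit; return nbits of keystream and the new state."""
    out = 0
    for i in range(nbits):
        out |= ((state >> 15) & 1) << i
        fb = ((state >> 15) ^ (state >> 14) ^ (state >> 12) ^ (state >> 3)) & 1
        state = ((state << 1) | fb) & 0xFFFF
    return out, state

def scramble(dwords, seed=0xFFFF):
    """XOR each 32-bit Dword with the LFSR keystream; self-inverse."""
    state, result = seed, []
    for dw in dwords:
        key, state = lfsr_keystream(state)
        result.append(dw ^ key)
    return result
```

Running `scramble` twice over the same data returns the original Dwords, so the same block serves both directions.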
2.3 Transport layer
Communication on the Transport layer is done via Frame Information Structures (FIS). The SATA standard defines the set of FIS types shown in Table 1. In the following, we look at the detailed FIS flow between host and device for read and write operations.
Table 1: FIS types
As illustrated in Figure 3, for a read DMA operation the host informs the device about the current active operation via a Register FIS, which holds a standard ATA command. When the device is ready to transmit data, it shall send one or more Data FISes and complete the transaction via a Register FIS Device to Host.
Figure 3 also shows the FIS flow between host and device for a write DMA operation. Again the host informs the device of the operation via a Register FIS. When the device is ready to receive data, it shall send a DMA Activate FIS, and the host will start transmitting a single Data FIS. When the device has processed this FIS and still expects data, it shall send a DMA Activate FIS again. In case of an error or a completed operation, it shall complete the transaction via a Register FIS Device to Host.

Figure 3: FIS Flow between Host and Device during a DMA operation
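This ping-pong of DMA Activate and Data FISes is easy to capture in a few lines. The following sketch (a hypothetical helper, not part of the IP core deliverables) enumerates the FIS exchange for a non-queued write DMA transfer split into n Data FISes:

```python
def write_dma_fis_flow(n_data_fis):
    """Model the FIS exchange for a (non-queued) write DMA command.
    H2D = host to device, D2H = device to host."""
    flow = [("H2D", "Register")]              # ATA WRITE DMA command
    for _ in range(n_data_fis):
        flow.append(("D2H", "DMA Activate"))  # device ready for more data
        flow.append(("H2D", "Data"))          # one Data FIS
    flow.append(("D2H", "Register"))          # final status (or error)
    return flow
```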
A new feature introduced with SATA is so-called “First Party DMA”. This transfers some control over the DMA engine to the device and thus enables the device to cache a list of commands and reorder them for optimized performance – so-called “Native Command Queueing”. New ATA commands are used for First Party DMA transfers. As these commands are not necessarily completed instantaneously by the device, but rather queued, the FIS flow is a bit different for this mode of operation. The flow for a Read FPDMA Queued command is shown on the left-hand side of Figure 4.
Figure 4: FIS Flow between Host and Device during First Party DMA Queued Operation
After receiving a Register FIS from the host, the device queues the command and answers with a Set Device Bits FIS to clear the busy field, so the host can send the next command. Each not-yet-completed command carries a unique tag to distinguish it. Each new command is added to the device’s queue, and a scheduler in the device selects the command to be processed next. To process a command, the device sends a DMA Setup FIS to the host with the tag field set accordingly. The DMA engine of the host then selects the scatterlist which belongs to the particular command and processes the data. To complete the transfer, the device shall send a Register FIS Device to Host.
As shown in Figure 4, the Write FPDMA Queued command is processed in a similar manner, but, as in the non-FPDMA mode, again with a “ping-pong” of DMA Activate and Data FIS types. For efficiency, the SATA protocol allows for an additional feature, so-called “Auto Activation”, which combines the first DMA Activate FIS into the DMA Setup FIS. Obviously this feature is most relevant for writes of smaller portions of data.
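Host software must track which of the up to 32 NCQ tags are outstanding at any time. A minimal sketch of such bookkeeping (a hypothetical class, loosely mirroring the semantics of the SActive register) could look like this:

```python
class NcqTagPool:
    """Track up to 32 outstanding FPDMA queued commands by tag (a sketch)."""

    def __init__(self, depth=32):
        self.free = (1 << depth) - 1    # bitmap: 1 = tag available

    def allocate(self):
        """Claim the lowest free tag, or None if the queue is full."""
        if self.free == 0:
            return None                 # host must wait for completions
        tag = (self.free & -self.free).bit_length() - 1
        self.free &= ~(1 << tag)
        return tag

    def complete(self, finished_bits):
        """A Set Device Bits FIS tells the host which tags finished."""
        self.free |= finished_bits
```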
2.4 Application layer
Communication on the Application layer uses ATA commands. While a limited number of these commands could certainly be implemented as a finite state machine in FPGA hardware, a software implementation is much more efficient and flexible. Here, the Open Source Linux kernel provides a known-good implementation which follows the ATA standard almost exactly and is proven in over a billion devices shipped.
Figure 5: Complete SATA Solution
The Linux ATA library, libATA, copes with more than 100 different ATA commands to communicate with ATA devices. These commands include not only data transfers but also provide functionality for S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) and for SECURITY features such as secure erase and device locking.
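On Linux, user programs can also reach individual ATA commands through the SCSI generic SG_IO ioctl by wrapping them in an ATA PASS-THROUGH (16) CDB, which is what tools such as hdparm and smartctl do under the hood. The sketch below builds such a CDB for IDENTIFY DEVICE; the field layout follows the common SAT convention, but should be verified against the standard before use on real hardware:

```python
def identify_device_cdb():
    """Build a 16-byte SCSI ATA PASS-THROUGH (16) CDB wrapping the ATA
    IDENTIFY DEVICE command (0xEC), suitable for submission via SG_IO.
    Field layout per SAT; verify against the standard before use."""
    cdb = bytearray(16)
    cdb[0] = 0x85        # ATA PASS-THROUGH (16) operation code
    cdb[1] = 4 << 1      # protocol: PIO Data-In
    cdb[2] = 0x0E        # t_dir=1 (from device), byt_blok=1, t_length=2
    cdb[6] = 1           # sector count: one 512-byte block
    cdb[14] = 0xEC       # ATA command: IDENTIFY DEVICE
    return bytes(cdb)
```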
The ability to utilize this code base, however, requires the extra work of implementing hardware-dependent software in the form of Linux device drivers, so-called Linux Kernel Modules. As Figure 5 shows, the MLE “Soft” Hardware Platform comes with a full GNU/Linux software stack pre-installed, plus a tested and optimized device driver for the SATA Host IP core from ASICS World Services.
2.5 Device layer
The MLE Linux therefore acts as a hardware abstraction layer between the user programs and the SATA Host IP core. As in any other Linux-based system, SATA devices are hidden behind a block layer backed by the ATA library. All read, write, or management operations can now be performed on an abstract block device (/dev/sdX) without any special knowledge of the underlying SATA device itself. Even better, certain optimizations such as data caching and read-ahead transfers come with this software stack – fully integrated by MLE.
At this point the UNIX/Linux philosophy is reached, which shields all devices behind filesystems. Apart from “raw” data access, one can also utilize the many different filesystems, for example EXT2, EXT3, or XFS, which the MLE Linux readily supports. These provide robust, software-developer-friendly access.
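Because the device appears as an ordinary block device node, “raw” access needs nothing beyond standard file I/O. A small sketch (hypothetical helper; /dev/sdX requires appropriate permissions, and any regular file works as a stand-in for testing):

```python
import os

def read_block(path, block_no, block_size=4096):
    """Read one block at the given block offset from a block device node
    such as /dev/sdX (or from any regular file)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, block_size, block_no * block_size)
    finally:
        os.close(fd)
```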
2.6 User Program layer
The MLE “Soft” Hardware Platform is not just an operating system kernel but ships with a plethora of integrated and pre-validated user space programs. Each program and its corresponding configuration data is conveniently organized in a Debian-style software package format.
Of particular interest are the tools for filesystem manipulation and for SATA device administration: programs such as hdparm and smartctl give low-level access to SATA devices attached to the system and can be used, for example, as a testbench or for performance analysis. This is complemented by full-featured Open Source test suites like bonnie++ and iometer.
Python scripting capabilities and a C/C++ software development flow now make it very efficient to implement user programs and help focus on building the SATA-based application rather than spending valuable resources on SATA integration and debugging work. And having a known-good environment also facilitates analysis during system architecture exploration.
3 Architectural Choices
When integrating a SATA IP core into an FPGA-based system there are many degrees of freedom. So, for pushing the limits of the whole system, knowledge of software or hardware alone is not sufficient; knowledge of both is required. Integration of software and hardware has to go hand in hand.
Figure 6 shows examples of how system integration of a SATA IP core can be done. The most obvious integration is adding the IP core as a slave to the bus (A) and letting the CPU do the transfers between memory and the IP core. Obviously data will pass twice over the system bus, but if no high data rates are required, this easy-to-implement approach may be sufficient. In this case, however, the CPU can only run a small application layer, as most of the time it will be busy copying data.
Figure 6: Architectural Choices
The moment the CPU has to run a full operating system, the performance impact becomes truly dramatic. That is where one will consider reducing the CPU load by adding a dedicated copy engine, the Xilinx Central DMA (B). This way data is still transferred twice over the bus, but the CPU does not spend all of its time copying data.
Still, the performance of a system with a full operating system is far from that of a standalone application, and both are far from the theoretical performance SATA can deliver.
Architecture (C) changes this by reducing the load on the system bus, using simple dedicated copy engines attached via Xilinx’ streaming NPI port to the MPMC (Multi-Port Memory Controller). This boosts the performance of the standalone application up to the theoretical limit.
However, the Linux performance of such a system is still limited. From the standalone application we know the bottleneck is not in the interconnect; this time the bottleneck is memory management in Linux. Linux handles memory in blocks of a page size, which is 4096 bytes on typical systems. With a simple DMA engine and free memory scattered all over the RAM in 4096-byte blocks, only 4096 bytes may be moved with each transfer. This problem is tackled by architecture (D). For example, the PPC440 core included in the Virtex-5 FXT has dedicated DMA engines that are capable of scatter-gather. This way the DMA engine is passed a pointer to a list of memory entries and scatters/gathers data to/from this list. This results in larger transfer sizes and brings the system very close to standalone performance.
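The descriptor list a scatter-gather engine walks can be modeled in a few lines. In this sketch (a hypothetical helper, idealized format), one list covers many non-contiguous 4096-byte pages, so a single DMA command replaces hundreds of page-sized transfers:

```python
PAGE_SIZE = 4096

def build_sg_list(page_addrs, total_bytes):
    """Build (address, byte count) descriptors over scattered pages;
    a scatter-gather DMA engine processes the whole list in one go."""
    descriptors, remaining = [], total_bytes
    for addr in page_addrs:
        if remaining == 0:
            break
        chunk = min(PAGE_SIZE, remaining)
        descriptors.append((addr, chunk))
        remaining -= chunk
    return descriptors
```

A 1 MiB transfer thus needs 256 descriptors in one list instead of 256 separate DMA invocations of one page each.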
Figure 7: Performance of complete SATA solution
4 Conclusion

Today, the decision whether to make or to buy a SATA host controller core is obvious: very few design teams are capable of implementing a functioning SATA host controller for the cost of licensing one. At the same time, it seems common to spend significant time and money in-house to integrate this core into a programmable system-on-chip, develop device drivers for it, and implement application software for operating (and testing) it.
The joint work from ASICS World Services and Missing Link Electronics promises to significantly reduce the cost and time it normally takes to deliver complete SATA host controller solutions. To learn more about this complete SATA solution from ASICS World Services and Missing Link Electronics, please visit the MLE Live Online Evaluation site. There you will get more technical insight and are invited to test-drive a complete SATA system via the Internet: http://www.missinglinkelectronics.com/loe
 INTERNATIONAL COMMITTEE FOR INFORMATION TECHNOLOGY STANDARDS. AT Attachment 8 - ATA/ATAPI Command Set, September 2008. http://www.t13.org/.
 SERIAL ATA INTERNATIONAL ORGANIZATION. Serial ATA Revision 2.6. February 2007. http://www.sata-io.org/.
 WIKIPEDIA - THE FREE ENCYCLOPEDIA. Serial ATA. http://en.wikipedia.org/wiki/Serial_ATA.
 XILINX, INC. Serial ATA Physical Link Initialization with the GTP Transceiver of Virtex-5 LXT FPGAs, 2008. http://www.xilinx.com/support/documentation/application_notes/xapp870.pdf.
 XILINX, INC. Virtex-5 FPGA RocketIO GTX Transceiver User Guide, October 2009. http://www.xilinx.com/bvdocs/userguides/ug198.pdf.