By Anand V Kulkarni, Senior Engineering Manager, Atria Logic, Bangalore, India
Numerous industries in the broadcast, cable, videoconferencing and consumer electronics space are using H.264 as the video codec of choice for their products and services. The H.264/AVC video coding standard achieves a significant improvement in coding efficiency over former standards, at the cost of increased computational complexity. This creates a big challenge for efficient hardware and/or software implementations. This article describes the architecture and implementation of a UHD 4k@60fps H.264/AVC codec SoC solution that combines the Atria Logic feature rich, low latency, high video quality H.264 (AVC) UHD Hi422 Intra codec IP with the Xilinx LogiCore HDMI subsystem IPs.
There are many previous works with hardware-only implementations. The proposed solution differentiates itself through a SW friendly approach, with a full-fledged Linux OS and application drivers running on the ARM Cortex A9 processor. It is a ready-to-tape-out SoC solution that allows customer-specific customizations in SW, significantly reducing time to market.
Fig: Block Diagram of Atria Logic UHD H.264 Codec Solution
- Complete Modular implementation with scope for customization and scalability
- H.264 Intra-only Hi422 Level 5.1 encoder and decoder
- HDMI2.0 Receiver and Transmitter subsystem integrated
- 8/10-bit support
- YUV 4:2:2/4:4:4, RGB support
- Very low latency of ~0.3msec
- Variable bit rate (VBR) and constant bit rate (CBR) support
- Video quality at 0.99 SSIM, or 50dB PSNR or higher
- Video processing subsystem for pre/post processing such as color space conversion, video scaling and chroma subsampling
- Gb Ethernet streaming output support
3. Hardware Implementation
As shown in Fig 1, the hardware implementation consists of 3 major components:
- Xilinx LogiCore IPs for HDMI system, Video processing IP sub-blocks
- Atria Logic UHD H.264 Encoder IP core
- Atria Logic UHD H.264 Decoder IP core
Apart from these major components, other peripheral IPs such as a DDR3 memory controller (MIG from Xilinx for PL memory, or the built-in controller within the PS for PS_DDR) and Ethernet MAC IPs are also used.
3.1 HDMI System
Xilinx provides an IP suite for the HDMI system on the 7 series FPGA family, which consists of the HDMI Transceiver (GTX), HDMI Receiver (RX) and HDMI Transmitter (TX) subsystems.
Fig: Xilinx LogiCore HDMI System
3.1.1 HDMI RX Subsystem
The HDMI RX subsystem consists of two major modules. The HDMI Transceiver (GTX) module receives the serial RX data and converts it into a parallel data stream. The RX Subsystem module extracts the video and audio streams from the HDMI stream.
The transceiver module incorporates the high speed GT transceivers. The transceivers convert the parallel data into serial and vice versa. The transceiver module is open source. This allows the user to optimize the clock buffer resources.
The RX Subsystem has three AXI interfaces. The video bridge converts the captured native video to AXI streaming video, and outputs the video data through the AXI video interface. The video timing controller measures the video timing.
The received audio is transmitted through the AXI streaming audio interface. The CPU interface is used to access the peripherals by a processor for control and status. The RX Subsystem internals are fixed and cannot be modified or altered by the user. The HDCP module is optional and it is not included in the standard deliverables.
Further details on the HDMI RX subsystem can be found in the Xilinx HDMI v1.4/2.0 Receiver Subsystem Product Guide (PG236).
3.1.2 HDMI TX Subsystem
The TX Subsystem module takes the incoming video and audio streams and converts them into an HDMI stream. The stream is forwarded to the HDMI GTX module, which converts the parallel data stream into a serial high speed data stream.
The TX Subsystem consists of the transmitter core, AXI video bridge, video timing controller and optional HDCP. The TX Subsystem has three AXI interfaces. The video AXI stream carries dual or quad pixels per clock and it supports 8, 10 and 12-bits per component. The video bridge converts the incoming video AXI-stream to native video. The video timing controller generates the native video timing.
The audio AXI stream transports multiple channels of uncompressed audio data.
The ARM Cortex A9 processor controls the transmitter blocks through the CPU interface. The AXI4 interconnect routes the main CPU interface to the various blocks.
The TX Subsystem internals are fixed and cannot be modified or altered by the user. Further details on the HDMI TX subsystem can be found in the Xilinx HDMI v1.4/2.0 Transmitter Subsystem Product Guide (PG235).
3.2 Atria Logic H.264 Encoder IP
The AL-H264E-4KI422-HW is a hardware-based, feature rich, low latency, high video quality H.264 (AVC) UHD Hi422 Intra encoder IP core. The AL-H264E-4KI422-HW encoder pairs up with the Atria Logic AL-H264D-4KI422-HW low latency decoder for low latency end-to-end links.
Fig: AtriaLogic UHD H.264 Encoder IP Block Diagram
The encoder is targeted for medical imaging, broadcast, enterprise/CE and industrial applications. Medical imaging applications include endoscopy, micro surgery and remote assisted surgery and diagnostics. Broadcast applications include video recorders for news and event coverage, film sets and production studios, as well as real-time monitoring of video shoots. Enterprise/CE applications include HDBaseT video transmission over CAT5/6 Ethernet cabling to computer monitors and UHD TV displays. Industrial applications include monitoring of manufacturing plants, and remote manipulation of mobile or fixed light or heavy machinery.
The encoder supports the H.264 Hi422 (High-422) profile at Level 5.1 (3840x2160p30) for Intra-only coding. Support for 10-bit video content means that there is no degradation of grayscale or color gradients in terms of banding. Support for YUV 4:2:2 video content means that there is better color separation, especially noticeable for red colors, which provides much sharper image details. These video quality aspects are especially important in case of medical imaging applications.
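To make the bit-depth and chroma-format points concrete, the following minimal Python sketch computes the average raw bits per pixel for the sample formats mentioned in this article, and the width of each band when a full-range ramp is stretched across a 3840-pixel line. The function names are illustrative assumptions and are not part of the Atria Logic deliverables.

```python
from fractions import Fraction

# Average samples per pixel: one luma sample plus two chroma samples
# scaled by the horizontal chroma subsampling of the format.
SAMPLES_PER_PIXEL = {
    "4:2:2": Fraction(2),  # chroma at half horizontal resolution
    "4:4:4": Fraction(3),  # chroma at full resolution
}


def bits_per_pixel(chroma_format: str, bit_depth: int) -> Fraction:
    """Average raw (uncompressed) bits per pixel."""
    return SAMPLES_PER_PIXEL[chroma_format] * bit_depth


def gradient_step_width(width_px: int, bit_depth: int) -> float:
    """Width in pixels of each band when a full-range ramp spans width_px."""
    return width_px / (2 ** bit_depth)


if __name__ == "__main__":
    for fmt in ("4:2:2", "4:4:4"):
        for depth in (8, 10):
            print(fmt, depth, "bit:", bits_per_pixel(fmt, depth), "bits/pixel")
    for depth in (8, 10):
        print(depth, "bit ramp over 3840 px:",
              gradient_step_width(3840, depth), "px per step")
```

With 8-bit samples each step of the ramp is 15 pixels wide, while 10-bit samples reduce it to 3.75 pixels, which is why banding is far less visible at 10 bits.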
Support for Intra-only encoding allows the encoder to encode uncompressed video at frame latencies. A macroblock-line level pipelined architecture brings the latency further down to sub-frame level, at about 0.3msec.
With a pipelined design that processes 8 pixels per clock, a frame rate of 4k@60fps is achieved.
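The relationship between the pixels-per-clock rate and the achievable frame rate can be checked with simple arithmetic. The sketch below is an assumption-laden estimate rather than a description of the actual core: it counts active pixels only and assumes the pipeline sustains its full 8 pixels on every clock. It compares the resulting minimum clock with the 200MHz encoder core clock quoted in the validation setup.

```python
def min_core_clock_hz(width: int, height: int, fps: int,
                      pixels_per_clock: int) -> float:
    """Lowest clock that keeps up with the active pixel rate.

    Assumes the pipeline never stalls and ignores blanking intervals.
    """
    return width * height * fps / pixels_per_clock


if __name__ == "__main__":
    required = min_core_clock_hz(3840, 2160, 60, 8)
    encoder_clock_hz = 200e6  # encoder core clock from the validation setup
    print(f"minimum clock: {required / 1e6:.3f} MHz")
    print(f"headroom at 200 MHz: {encoder_clock_hz / required:.2f}x")
```

Under these assumptions the minimum clock is about 62.2MHz, so the quoted core clock leaves a margin for pipeline stalls and memory access.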
When connected to the Atria Logic AL-H264D-4KI422-HW low latency decoder, the glass-to-glass latency is about 0.6msec when transmission latency is not taken into account, and about 2 frames when transmission over an IP network is included. Such low latency is important for any closed-loop man-machine application such as those mentioned above.
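To put the sub-frame latency figures in perspective, the following sketch converts a latency into video lines and 16-line macroblock rows at 3840x2160p60. It is only an illustration: it assumes active lines only (blanking is ignored), and the helper names are hypothetical.

```python
MACROBLOCK_HEIGHT = 16  # an H.264 macroblock covers 16 luma lines


def latency_in_lines(latency_s: float, fps: int, lines_per_frame: int) -> float:
    """Number of video lines that elapse during latency_s."""
    return latency_s * fps * lines_per_frame


def latency_in_macroblock_rows(latency_s: float, fps: int,
                               lines_per_frame: int) -> float:
    """Same latency expressed in macroblock rows."""
    return latency_in_lines(latency_s, fps, lines_per_frame) / MACROBLOCK_HEIGHT


if __name__ == "__main__":
    for label, latency in (("single codec", 0.3e-3), ("glass to glass", 0.6e-3)):
        lines = latency_in_lines(latency, 60, 2160)
        rows = latency_in_macroblock_rows(latency, 60, 2160)
        frame_fraction = latency * 60
        print(f"{label}: {lines:.1f} lines, {rows:.2f} MB rows, "
              f"{frame_fraction:.1%} of a frame")
```

Under these assumptions, 0.3msec corresponds to roughly 39 lines, or between two and three macroblock rows, which is a small fraction of the 16.7msec frame period.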
The efficient implementation only takes up 78% of the programmable logic and DSP resources and 55% of the available RAM, leaving ample room for implementation of any other required circuitry. Integration of a Gb Ethernet MAC provides streaming over IP support.
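Streaming over Gb Ethernet implies a minimum compression ratio, which the sketch below estimates. It assumes 10-bit 4:2:2 video (20 bits per pixel on average), counts active pixels only, and treats the full 1Gbit/s line rate as available for video, so the ratio needed in practice would be somewhat higher once protocol overhead is included.

```python
def raw_bitrate_bps(width: int, height: int, fps: int,
                    bits_per_pixel: int) -> int:
    """Uncompressed bit rate of the active picture area."""
    return width * height * fps * bits_per_pixel


def min_compression_ratio(raw_bps: int, link_bps: float) -> float:
    """Smallest compression ratio that fits raw_bps into link_bps."""
    return raw_bps / link_bps


if __name__ == "__main__":
    raw = raw_bitrate_bps(3840, 2160, 60, 20)  # 10-bit 4:2:2
    ratio = min_compression_ratio(raw, 1e9)    # 1 Gbit/s, overhead ignored
    print(f"raw video: {raw / 1e9:.2f} Gbit/s")
    print(f"minimum compression ratio for 1 Gbit/s: {ratio:.2f}:1")
```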
3.3 Atria Logic H.264 Decoder IP
The AL-H264D-4KI422-HW is a hardware-based, feature rich, low latency, high video quality H.264 (AVC) UHD Hi422 Intra decoder IP core. The AL-H264D-4KI422-HW decoder pairs up with the Atria Logic AL-H264E-4KI422-HW low latency encoder for low latency end-to-end links.
Fig: AtriaLogic UHD H.264 Decoder IP Block Diagram
The decoder supports the H.264 Hi422 (High-422) profile at Level 5.1 (3840x2160p30) for Intra-only coding. Support for 10-bit video content means that there is no degradation of grayscale or color gradients in terms of banding. Support for YUV 4:2:2 video content means that there is better color separation, especially noticeable for red colors, which provides much sharper image details. These video quality aspects are especially important in case of medical imaging applications.
Support for Intra-only decoding allows the decoder to decode compressed video at frame latencies. A macroblock-line level pipelined architecture brings the latency further down to sub-frame level, at about 0.3msec.
With a pipelined design that processes 8 pixels per clock, a frame rate of 4k@60fps is achieved.
When connected to the Atria Logic AL-H264E-4KI422-HW low latency encoder, the glass-to-glass latency is about 0.6msec when transmission latency is not taken into account, and about 2 frames when transmission over an IP network is included. Such low latency is important for any closed-loop man-machine application such as those mentioned above.
The efficient implementation only takes up 68% of the programmable logic resources, 35% of available DSP resources, and 45% of the available RAM, leaving ample room for implementation of any other required circuitry. Integration of a Gb Ethernet MAC provides streaming over IP support.
4. Software Implementation
The software infrastructure implemented in this solution benefits from the seamless integration of the Ubuntu Linux distribution on the Zynq SoC. The list of features that Linux supports is extensive. Running Linux gives the ZC706 platform much more flexibility to accommodate a very wide range of applications, because of Linux's programmability and versatility. As one of the most powerful user-programmable operating systems, Linux allows us to customize the system exactly to our needs.
On top of the Linux OS, we built our own application drivers to configure our codec IPs. For the HDMI subsystem LogiCore IPs, Xilinx provides bare-metal drivers that run on the ARM Cortex A9 core and include the configuration and flow control needed for the HDMI GTX, RX and TX cores. We ported the bare-metal drivers to run on top of the Linux OS alongside our application drivers to bring up the solution.
5. Validation Setup
For the validation of this solution, the Xilinx ZC706 platform is used. For more information on the Xilinx ZC706 board, refer to http://www.xilinx.com/products/boards-and-kits/ek-z7-zc706-g.html
For the memory interface, both the PS_DDR and PL_DDR interfaces are supported. The user can opt for either of these options.
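As a rough check of either memory option against the video data rate, the sketch below compares theoretical peak DDR bandwidth with the raw 4k@60fps stream, using the PS_DDR and PL_DDR widths and clocks listed in this section. It assumes that the quoted frequencies are DDR clock rates with two transfers per clock, that the video is 10-bit 4:2:2, and that only active pixels are stored. Sustained bandwidth in a real system is lower than the peak figure.

```python
def ddr_peak_bytes_per_s(clock_hz: int, bus_width_bits: int) -> int:
    """Theoretical peak bandwidth, assuming two transfers per clock (DDR)."""
    return clock_hz * 2 * bus_width_bits // 8


def raw_video_bytes_per_s(width: int, height: int, fps: int,
                          bits_per_pixel: int) -> int:
    """Uncompressed data rate of the active picture area."""
    return width * height * fps * bits_per_pixel // 8


if __name__ == "__main__":
    video = raw_video_bytes_per_s(3840, 2160, 60, 20)  # 10-bit 4:2:2
    options = {
        "PS_DDR (32-bit, 533 MHz)": ddr_peak_bytes_per_s(533_000_000, 32),
        "PL_DDR (64-bit, 400 MHz)": ddr_peak_bytes_per_s(400_000_000, 64),
    }
    for name, peak in options.items():
        print(f"{name}: peak {peak / 1e9:.2f} GB/s, "
              f"one video pass uses {video / peak:.0%}")
```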
Figure: UHD Encoder-Decoder HW setup on ZC706
5.1 HDMI Interface
For the HDMI interface, the Inrevium TB-FMCH-HDMI 4K REV2.0 FMC daughter card is used. It is connected to the HPC FMC connector of the Zynq ZC706 board. The FMC daughter card has 2 ports: Sink and Source. For the Encoder, one end of the HDMI cable is connected to the sink port of the FMC daughter card and the other end to the laptop or camera.
For more information on FMC daughter card: http://solutions.inrevium.com/products/fmc/hpc/index.html
5.2 Ethernet Interface
For the Ethernet interface, the Inrevium TB-FMCL-GLAN B Rev 2 FMC daughter card is used. It is connected to the LPC FMC connector of the Zynq ZC706 board. The FMC daughter card has 2 ports: Port 1 and Port 2.
The Ethernet cable is connected to Port 1 of the FMC card for both the Encoder and the Decoder.
For more information on FMC daughter card: http://solutions.inrevium.com/products/fmc/lpc/index.html
- Encoder core clock – 200MHz
- Decoder core clock – 225MHz
- PS_DDR clock – 32-bit, 533MHz
- PL_DDR clock – 64-bit, 400MHz
- AXI_clock – 150MHz
- HDMI stream clock – 300MHz
- HDMI RX/TX video out clock – 150MHz
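The HDMI clock figures above can be related to the pixels-per-clock modes of the video interface. The sketch below starts from the standard 3840x2160p60 timing of 4400 x 2250 total samples per frame (a 594MHz pixel clock) and derives the interface clock needed at 2 and 4 pixels per clock. Matching the 300MHz and 150MHz entries to these two modes is an interpretation, not something stated by the clock list itself.

```python
def pixel_clock_hz(total_width: int, total_height: int, fps: int) -> int:
    """Pixel clock including horizontal and vertical blanking."""
    return total_width * total_height * fps


def interface_clock_hz(pixel_clock: int, pixels_per_clock: int) -> float:
    """Clock needed when several pixels are carried on each cycle."""
    return pixel_clock / pixels_per_clock


if __name__ == "__main__":
    pclk = pixel_clock_hz(4400, 2250, 60)  # 3840x2160p60 total timing
    print(f"pixel clock: {pclk / 1e6:.1f} MHz")
    for ppc in (2, 4):
        clk = interface_clock_hz(pclk, ppc)
        print(f"{ppc} pixels/clock: {clk / 1e6:.2f} MHz")
```

At 2 pixels per clock the requirement is 297MHz and at 4 pixels per clock it is 148.5MHz, both just under the corresponding clocks listed above.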
6. Performance and Quality Metrics
- Latency of ~0.3ms each for the encoder and the decoder (~0.6ms glass to glass, excluding transmission)
- PSNR - Around 45dB on average
- SSIM (Structural Similarity) - Varies between 0.97 and 0.99 (less than 2% degradation w.r.t. uncompressed picture quality)
- Blockiness - The degradation in blockiness is only 1% to 1.2% when compared to uncompressed video
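For reference, PSNR figures such as the one above are conventionally computed from the mean squared error between the original and decoded samples. The following is a minimal, unoptimized sketch of that standard formula for flat lists of sample values. It is not the measurement tool used for the results reported here, and the function name is illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], decoded: Sequence[int],
         bit_depth: int = 10) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    peak = (1 << bit_depth) - 1  # 1023 for 10-bit, 255 for 8-bit
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical pictures
    return 10 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    original = [512, 600, 700, 800]
    decoded = [514, 598, 701, 799]
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```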
In this article, the implementation details of a UHD (4k) H.264 codec solution using the Xilinx Zynq platform are presented. The solution provides a high quality, low latency implementation which can cater to multiple applications in the medical, broadcast and consumer electronics domains.
The author would like to thank the Xilinx customer support team for their support in bringing up the HDMI system and VPSS IPs.
- HDMI v1.4/2.0 Receiver Subsystem Product Guide (PG236) - Xilinx
- HDMI v1.4/2.0 Transmitter Subsystem Product Guide (PG235) - Xilinx
- Video Processing Subsystem - Xilinx