Monday, October 10, 2011

Supercomputer Performance on a Chip Powers Next-Generation Embedded Image Processing

A new framework for the C language has been developed to move appropriate code onto parallel graphics processors for fast execution in graphical and other numerically intensive applications. A new generation of hybrid processors is emerging to take advantage of this capability.
Current-generation imaging applications have exhausted every conventional method for squeezing out additional performance. Microprocessor vendors first deepened on-die caches, then created SIMD (single instruction, multiple data) instruction set extensions to process media streams, and finally implemented multithreading and clock frequency scaling. However, imaging applications typically operate on data sets an order of magnitude larger than those caches, which better serve traditional code and static (reused) data. Although the other performance-enhancing techniques do help image processing, we will demonstrate here a much more efficient and scalable way to attack the problem, one that draws on a supercomputing pedigree.
Large racks populated with CPU blades and loud fans have been deployed in MRI and CT scanning equipment, often with ASICs and FPGAs as additional performance-boosting offload engines. Additional imaging applications have emerged, from surveillance to facial recognition, built around smaller box PCs. But the performance of such CPU-centric systems on these imaging applications leaves much to be desired.
Conventional sequential microprocessors and coding languages have run their course and are ill-prepared for the faster real-time performance and smaller system sizes that next-generation medical imaging, video surveillance, homeland security and similar applications demand. A fresh approach is needed for the embedded market, and it can now be realized thanks to advances in graphics processing units (GPUs) and an innovative programming language, OpenCL.
The OpenCL programming language was developed to give developers a platform for creating C-based applications that run on the CPU but can also offload parallel kernels to the GPU. GPUs are implemented with dozens to hundreds of very powerful math engines backed by fast local RAM. PCs with dual- and quad-core desktop processors and discrete PCI Express graphics cards can certainly deliver the performance required for the embedded market; however, next-generation systems require smaller size and lower power consumption.
Big Performance in Small Packages
Graphics processing has evolved in a relatively short time from specialized supercomputers to powerful GPU add-in cards. High-end GPUs pack teraflops of floating-point compute horsepower onto a single PCI Express graphics card.
At the same time, the massively parallel processing logic once dedicated to specific 3D graphics tasks has become more flexible and programmable. These enhancements have enabled GPUs to address a wider range of applications. While GPUs are not suited to accelerating every application, workloads with characteristics similar to graphics can achieve large improvements in performance and power efficiency by exploiting the GPU. The best-suited applications are characterized by many largely independent tasks, ideally operating on large, regularly structured data sets.
Smaller die geometries have enabled the first family of single-die CPU+GPU solutions, known as heterogeneous multicore processors, which greatly reduce space and heat while increasing the data bandwidth between CPU cores and graphics cores. These Fusion processors, or APUs (accelerated processing units), are well positioned to dramatically reduce the size and weight of imaging systems. The low-power APUs integrate GPUs capable of tens to hundreds of gigaflops onto the same die as multiple conventional CPU cores. AMD’s new Fusion processors, such as the AMD Embedded G-Series dual-core 1.6 GHz T56N CPU with integrated AMD Radeon HD 6310 GPU shown in Figure 1, are ushering in a new era of efficient supercomputing for embedded applications that are not well served by conventional CPUs.
Figure 1
The Accelerated Processing Unit (APU) is the result of the integration of a multicore x86 CPU with a parallel graphics processor that is also capable of single instruction multiple data (SIMD) parallel processing for general computing tasks.
GPUs are optimized primarily for graphics tasks and are best suited to certain types of parallel workloads. Because of their data-parallel execution logic, GPUs are effective at tackling problems that can be decomposed into a large number of independent parallel tasks. With hundreds of computing cores, modern GPUs scale much further than the handful of cores offered in a CPU-centric paradigm. The term “GPGPU” refers to the use of a GPU for general-purpose parallel computations.
GPUs are geared toward processing many independent “work items” in parallel. When it comes to graphics rendering, the particular work items are operations on vertices and pixels, including texturing and shading calculations. In a GPGPU program, a set of operations is typically executed in parallel on each item in a data set. Applications well suited to GPU execution, such as image processing, have large data sets, high parallelism and minimal dependencies between data elements. A common form for a data set to take in a GPGPU application is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map onto grids: matrix algebra, image processing, physically based simulation and so on. OpenCL kernels can also take advantage of dedicated texture processing hardware in the GPU to perform various 2D filtering operations on certain memory reads.
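To make the work-item model concrete, here is a minimal sketch of an OpenCL C kernel operating over a 2D image grid; the kernel name and its simple brightness operation are our own illustration, not taken from any particular application:

// Each work item handles one pixel, identified by its 2D global IDs.
__kernel void brighten(__global const uchar *src,
                       __global uchar *dst,
                       const int width,
                       const int height,
                       const int delta)
{
    int x = get_global_id(0);   /* column within the 2D range */
    int y = get_global_id(1);   /* row within the 2D range    */

    if (x < width && y < height) {
        int idx = y * width + x;
        int v = src[idx] + delta;
        dst[idx] = (uchar)(v > 255 ? 255 : v);   /* clamp to pixel range */
    }
}

Every pixel gets its own work item, so the same source scales from a handful of CPU cores to hundreds of GPU cores without modification.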
Originated by Apple and turned over to the Khronos Group for standardization, Open Computing Language (OpenCL) is a framework for writing programs that can exploit heterogeneous platforms consisting of CPUs, GPUs and other processors. OpenCL includes a language based on C99 for writing kernels (functions that execute on OpenCL devices) plus APIs that are used to define and then control the platforms. Each kernel is typically the body of a loop or routine. The developer specifies the kernel, memory regions and data set to process using OpenCL constructs.
OpenCL supports both task-based and data-based parallelism, and presents GPU resources cleanly as many instances of a single processor type together with buffers and memory spaces. AMD provides an SDK for its Fusion series of APUs that allows embedded developers to get started with OpenCL.
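As a hedged sketch of what the host side of such an application looks like (error checking omitted; the kernel name matches the brighten sketch above, and the remaining kernel arguments are elided for brevity):

#include <CL/cl.h>
#include <stddef.h>

/* Minimal host-side setup: pick the first GPU, build a kernel from
   source and launch it over a 2D range, one work item per pixel. */
void run_on_gpu(const char *kernel_src, size_t width, size_t height)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "brighten", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                width * height, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    /* ...remaining kernel arguments set the same way... */

    size_t global[2] = { width, height };   /* the 2D data set */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                           0, NULL, NULL);
    clFinish(queue);
}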
OpenSURF: Community Development with a Vision
As an example of the use of OpenCL with a GPGPU, one group within the open-source community analyzed and profiled the components of the Speeded Up Robust Features (SURF) Computer Vision algorithm written in OpenCL. SURF analyzes an image like the one in Figure 2 and produces feature vectors for points of interest (“ipoints”).
Figure 2
The SURF application identifies points of interest within an image.
SURF features have been used to perform operations such as object recognition, feature comparison and face recognition. Each feature vector describes a single ipoint and consists of the location of the point in the image, the local orientation at the detected point, the scale at which the interest point was detected, and a descriptor vector (typically 64 values) that can be compared with the descriptors of other features.
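A plausible C representation of one such feature (the field names are our illustration; the actual OpenSURF definitions may differ) is:

/* Hypothetical layout of one SURF interest point. */
typedef struct {
    float x, y;            /* location of the point in the image    */
    float scale;           /* scale at which it was detected        */
    float orientation;     /* dominant local orientation (radians)  */
    float descriptor[64];  /* values compared against other ipoints */
} ipoint;

Two features are typically matched by computing the distance between their 64-value descriptors.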
A diagram of the application is shown in Figure 3. To find points of interest, SURF applies a Fast-Hessian detector that uses approximated Gaussian filters at different scales to generate a stack of Hessian matrices. SURF utilizes an integral image, which allows scaling of the filter instead of the image. The location of each ipoint is calculated by finding the local maxima or minima in the image at different scales using the generated Hessian matrices. The local orientation at an ipoint is what keeps the descriptor invariant to image rotation. Orientation (the fourth stage of the pipeline in Figure 3) is calculated using the Haar wavelet responses in the X and Y directions in the neighborhood of the detected ipoint. The dominant local orientation is selected by rotating a circle segment covering an angle of 60 degrees around the origin. At each position, the X and Y responses within the segment are summed to form a new vector, and the orientation of the longest vector becomes the feature orientation.
Figure 3
The SURF application block diagram reveals the computational work items.
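The integral image mentioned above is a summed-area table: each entry holds the sum of all pixels above and to its left, so the sum over any box region takes just four lookups, regardless of the filter's size. A minimal C sketch of the idea (our own, assuming row-major storage):

/* Build the integral image: ii[y*w+x] holds the sum of img over
   the rectangle from (0,0) to (x,y) inclusive. */
void integral_image(const unsigned char *img, unsigned int *ii,
                    int w, int h)
{
    for (int y = 0; y < h; y++) {
        unsigned int row_sum = 0;
        for (int x = 0; x < w; x++) {
            row_sum += img[y * w + x];
            ii[y * w + x] = row_sum + (y > 0 ? ii[(y - 1) * w + x] : 0);
        }
    }
}

/* Sum of the box (x0,y0)..(x1,y1), inclusive: four reads at any size.
   This is why SURF can scale the filter instead of the image. */
unsigned int box_sum(const unsigned int *ii, int w,
                     int x0, int y0, int x1, int y1)
{
    unsigned int a = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;
    unsigned int b = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;
    unsigned int c = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;
    return ii[y1 * w + x1] - b - c + a;
}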
To demonstrate how familiar C-style constructs in OpenCL combine with the data-parallel compute capability of GPGPUs, Figure 4 shows SURF code for calculating and normalizing descriptors.
Figure 4
The CPU host calls the OpenCL runtime (top), while a portion of the SURF kernel (bottom) reveals the OpenCL constructs that tap GPU resources.
The calculation of the largest response is done using a local memory-based reduction. The 64-element descriptor is calculated by dividing the neighborhood of the ipoint into 16 regular subregions. Haar wavelets are calculated in each region, and each region contributes 4 values to the descriptor. Thus, 16 × 4 = 64 values are used to compare descriptors in applications based on SURF.
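The reduction pattern referred to above is a standard one; a generic OpenCL C max-reduction sketch (our own illustration, not the actual SURF kernel, which reduces vector responses) looks like this:

/* Work-group max reduction in local memory; assumes the work-group
   size is a power of two. One result is written per work group. */
__kernel void reduce_max(__global const float *in,
                         __global float *out,
                         __local float *scratch)
{
    uint lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];  /* stage one value per item */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Halve the active range each step; survivors keep the larger value. */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] = fmax(scratch[lid], scratch[lid + s]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}

Because intermediate values live in fast local memory, the reduction avoids a round trip to global memory on every step.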
The goal of using OpenCL with GPGPUs is to extract as much parallelism out of the framework as possible. In SURF, execution performance is determined by the characteristics of the data set rather than the size of the data. This is because the number of ipoints detected in the non-max suppression stage of the algorithm helps to determine the workgroup dimensions for the orientation and descriptor kernels. Computer Vision frameworks like SURF also have a large number of tunable parameters—for example, a detection threshold, which changes the number of points detected in the suppression stage—that greatly impact the performance of an application.
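As a hedged illustration of how the data set's character drives the launch configuration (the helper function and its names are hypothetical):

#include <CL/cl.h>

/* Launch one 64-item work group per detected ipoint, so the amount of
   parallel work scales with the ipoint count, not the pixel count. */
void launch_descriptor_kernel(cl_command_queue queue, cl_kernel kernel,
                              size_t num_ipoints)
{
    size_t local_size  = 64;
    size_t global_size = num_ipoints * local_size;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}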
Optimization
In heterogeneous computing, knowledge about the architecture of the targeted set of devices is critical to reap the full benefits of the hardware. For example, selected kernels in an application may be able to exploit vectorized operations available on the targeted device, and if some of the kernels can be optimized with vectorization in mind, the overall application may be sped up significantly. However, it is important to gauge the contribution of each kernel to the overall application runtime. Then informed optimizations can be applied to obtain the best performance.
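For instance, a scalar kernel can often be rewritten with OpenCL's built-in vector types so that each work item drives the device's SIMD lanes directly (a generic sketch, not tied to SURF):

/* Scalar form: one multiply-add per work item. */
__kernel void scale_add(__global const float *a, __global const float *b,
                        __global float *out, const float k)
{
    int i = get_global_id(0);
    out[i] = k * a[i] + b[i];
}

/* Vectorized form: four elements per work item via float4. The buffer
   length is assumed to be a multiple of four. */
__kernel void scale_add4(__global const float4 *a, __global const float4 *b,
                         __global float4 *out, const float k)
{
    int i = get_global_id(0);
    out[i] = k * a[i] + b[i];
}

The second form processes four times the data per work item, which can translate into substantially higher throughput on hardware with vector ALUs.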
In a heterogeneous computing scenario, an application starts out executing on the CPU, and then the CPU launches kernels on a second device (e.g., a GPU). The data transferred between these devices must be managed efficiently to minimize the impact of communication. Data manipulated by multiple kernels should be kept on the same device where the kernels are run. In many cases, data is transferred back to the CPU host, or integrated into CPU library functions. Analysis of program flow can pinpoint sections of the application where it would be beneficial to modify data management, leading to more efficient use of the overall system.
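In host code, keeping data where the kernels run usually means chaining kernels on the same device buffers and copying back only the final result, as in this hedged sketch (the function and buffer names are ours):

#include <CL/cl.h>

/* Hypothetical two-stage pipeline: the intermediate buffer d_tmp never
   leaves the GPU; only the final result crosses back to the host. */
void run_pipeline(cl_command_queue q, cl_kernel stage1, cl_kernel stage2,
                  cl_mem d_in, cl_mem d_tmp, cl_mem d_out,
                  float *host_result, size_t n)
{
    size_t global = n;

    clSetKernelArg(stage1, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(stage1, 1, sizeof(cl_mem), &d_tmp);
    clEnqueueNDRangeKernel(q, stage1, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* d_tmp stays resident on the device between the two kernels. */
    clSetKernelArg(stage2, 0, sizeof(cl_mem), &d_tmp);
    clSetKernelArg(stage2, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(q, stage2, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* A single blocking read at the end of the pipeline. */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float),
                        host_result, 0, NULL, NULL);
}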
GPGPUs can be programmed and optimized efficiently using OpenCL, a royalty-free language based on C, unlocking the full potential of standardized processing hardware for the rapid development of next-generation imaging applications.
