May 28, 2013
1 Installation
1.1 Dependencies
1.2 HIPAcc Installation
Android Cross Compilation:
2 Domain Specific Language
2.1 Built-in C++ Classes
2.2 Defining Image Operators
Device Code:
Host Code:
2.3 Memory Management
2.4 Supported Data Types and Operations
Data Types:
Convert Functions:
Math Functions:
3 Framework Usage
3.1 Optimization Selection
3.2 Sample Invocation
3.3 Supported Target Hardware
To install the HIPAcc framework, which features a Clang-based source-to-source compiler, a version of Clang is required that has been tested with HIPAcc. The file dependencies.sh therefore lists the revisions of Clang (and of the other dependencies) the current version of HIPAcc works with. In addition to Clang, LLVM and libcxx are also required. Using Polly is optional.
Polly: polyhedral optimizations for LLVM (optional)
Installation of dependencies:
cd <source_dir>
git clone http://llvm.org/git/llvm.git
cd llvm && git checkout <llvm_revision>

cd <source_dir>
git clone http://llvm.org/git/libcxx.git
cd libcxx && git checkout <libcxx_revision>

cd <source_dir>/llvm/tools
git clone http://llvm.org/git/clang.git
cd clang && git checkout <clang_revision>

# optional: installation of polly
cd <source_dir>/llvm/tools
git clone http://llvm.org/git/polly.git
cd polly && git checkout <polly_revision>
Configure the software packages using CMake, then build and install using make install.
Note: On GNU/Linux systems, libc++ has to be built using clang/clang++. The easiest way to do this is to use CMake and to specify the compilers on the command line: CXX=clang++ CC=clang cmake ../ -DCMAKE_INSTALL_PREFIX=/opt/local
Note: On Mac OS 10.6, cxxabi.h from http://home.roadrunner.com/~hinnant/libcppabi.zip is required to build libc++ successfully.
Next, you have to download HIPAcc from
http://github.com/hipacc/hipacc.
HIPAcc can be downloaded either as a versioned tarball or retrieved using git. Download the latest release (currently, the tarball
hipacc-0.6.1.tar.gz) or get the latest sources using git:
git clone git@github.com:hipacc/hipacc.git hipacc
To build and install the HIPAcc framework, CMake is used. In the main directory, the file INSTALL contains required instructions:
To configure the project, call cmake in the root directory. A working installation of Clang/LLVM (and Polly) is required. The llvm-config tool will be used to determine configuration for HIPAcc and must be present in the environment.
The following variables can be set to tell cmake where to look for certain components:
CMAKE_INSTALL_PREFIX: Installation prefix
CMAKE_BUILD_TYPE: Build type (Debug or Release)
OPENCL_INC_PATH: OpenCL include path
(e.g., -DOPENCL_INC_PATH=/opt/cuda/include)
OPENCL_LIB_PATH: OpenCL library path
(e.g., -DOPENCL_LIB_PATH=/usr/lib64)
CUDA_BIN_PATH: CUDA binary path
(e.g., -DCUDA_BIN_PATH=/opt/cuda/bin)
The following options can be enabled or disabled:
USE_POLLY: Use Polly for kernel analysis (e.g., -DUSE_POLLY=ON)
USE_JIT_ESTIMATE: Use just-in-time compilation of generated kernels in order to get resource estimates - option only available for GNU/Linux systems
A possible configuration may look like the following:
cd <hipacc_root>
mkdir build && cd build
mkdir install_dir
cmake ../ -DCMAKE_INSTALL_PREFIX=./install_dir
make && make install
Generating target code for Android requires cross compilation, so additional variables are recognized for Android builds. The examples given below target the Arndale Board with a Samsung Exynos 5 Dual SoC (ARM Cortex-A15 CPU & ARM Mali-T604 GPU):
ANDROID_SOURCE_DIR: Android source directory
(e.g. -DANDROID_SOURCE_DIR=/opt/arndaleboard/android-jb-mr1)
TARGET_NAME: Name of the target platform
(e.g. -DTARGET_NAME=arndale)
HOST_TYPE: Name of the local compile host type
(e.g. -DHOST_TYPE=linux-x86)
NDK_TOOLCHAIN_DIR: Android NDK directory
(e.g. -DNDK_TOOLCHAIN_DIR=/opt/android/android-14-toolchain)
RS_TARGET_API: Android API level
(e.g. -DRS_TARGET_API=16)
EMBEDDED_OPENCL_INC_PATH: OpenCL include path
(e.g. -DEMBEDDED_OPENCL_INC_PATH=/opt/cuda/include)
EMBEDDED_OPENCL_LIB_PATH: OpenCL library path within the target system
(e.g. -DEMBEDDED_OPENCL_LIB_PATH=vendor/lib/egl)
EMBEDDED_OPENCL_LIB: Name of the embedded OpenCL library
(e.g. -DEMBEDDED_OPENCL_LIB=libGLES_mali.so)
The DSL can be separated into two parts: a) host code that describes the binding to standard C/C++ code and b) device code that describes the calculation on the graphics card.
The library consists of built-in C++ classes that describe the basic components required to express image processing on an abstract level:
Image:
Describes data storage for the image pixels. Each pixel can be stored as an
integer number, a floating point number, or in another format depending on
instantiation of this templated class. The data layout is handled internally
using multi-dimensional arrays. Syntax:
Image<type>(width, height);
Iteration Space:
Describes a rectangular region of interest in the output image, for example
the complete image. Each pixel in this region is a point in the iteration
space. Syntax:
IterationSpace<type>(Image, width, height, offset_x, offset_y);
width, height, offset_x and offset_y are optional.
Kernel:
Describes an algorithm to be applied to each pixel in the Iteration
Space. Syntax:
Kernel(IterationSpace, <parameters>);
The parameters are defined in the kernel class itself by the user.
Mask: Stores the coefficients that can be used by convolution kernels. Syntax:
Mask<type>(size_x, size_y);
BoundaryCondition: Describes how pixels are handled when a Kernel accesses an Accessor out-of-bounds. The following boundary handling modes are supported:
BOUNDARY_UNDEFINED: No border handling specified → no border handling code will be added by the compiler.
BOUNDARY_CLAMP: The x/y addresses are clamped to the last valid position within the image.
BOUNDARY_REPEAT: Accesses outside the image are handled as if the image is repeated in each direction.
BOUNDARY_MIRROR: Accesses outside the image are handled as if the image is mirrored at the border.
BOUNDARY_CONSTANT: Accesses outside the image return a constant value.
Syntax:
BoundaryCondition<type>(Image, size_x, size_y, boundary_handling_mode, constant_value);
size_x and size_y define the domain where boundary handling is necessary (e.g., within a 5×5 convolution filter); boundary_handling_mode is one of the aforementioned constants. In case BOUNDARY_CONSTANT is used, the optional constant_value has to be specified.
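The index remapping behind the clamp, repeat, and mirror modes can be sketched in plain C++ for a single axis of size n. These helpers are illustrative only and are not part of the HIPAcc API; HIPAcc generates equivalent logic in the target code.

```cpp
#include <algorithm>

// BOUNDARY_CLAMP: stick to the last valid position.
int clamp_index(int i, int n) {
    return std::min(std::max(i, 0), n - 1);
}

// BOUNDARY_REPEAT: wrap around as if the image tiles the plane.
int repeat_index(int i, int n) {
    int r = i % n;
    return r < 0 ? r + n : r;
}

// BOUNDARY_MIRROR: reflect at the border (border pixel duplicated,
// one of two common mirroring conventions).
int mirror_index(int i, int n) {
    int period = 2 * n;
    int r = ((i % period) + period) % period;
    return r < n ? r : period - 1 - r;
}
```

For example, with n = 4, a clamped access at index 9 yields 3, a repeated access at -1 yields 3, and a mirrored access at 4 yields 3.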
Accessor:
Describes which pixels of an Image are seen within the Kernel. Similar to an Iteration Space, the Accessor defines a region of interest on an input image. Syntax:
Accessor<type>(Image, width, height, offset_x, offset_y);
width, height, offset_x and offset_y are optional.
To avoid out-of-bounds memory accesses, a BoundaryCondition object can also be specified instead of an Image.
Syntax:
Accessor<type>(BoundaryCondition, width, height, offset_x, offset_y);
width, height, offset_x and offset_y are optional.
In case the Iteration Space (defined for the output image) does not
match the region of interest defined by the Accessor (defined for an
input image), interpolation is required. Therefore, HIPAcc provides
Accessors implementing different interpolation modes. Supported are
nearest neighbor, linear filtering, bicubic filtering, and Lanczos
filtering. Syntax:
AccessorNN<type>(Image, width, height, offset_x, offset_y);
AccessorLF<type>(Image, width, height, offset_x, offset_y);
AccessorCF<type>(Image, width, height, offset_x, offset_y);
AccessorL3<type>(Image, width, height, offset_x, offset_y);
Interpolation can also be combined with border handling. Syntax:
AccessorNN<type>(BoundaryCondition, width, height, offset_x, offset_y);
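The coordinate mapping performed by a nearest-neighbor Accessor can be sketched along one axis in plain C++. This is a hypothetical helper for illustration, not the HIPAcc implementation: it maps an iteration-space coordinate (output of size out_n) to an input region of interest of a different size (in_n).

```cpp
// Nearest-neighbor interpolation along one axis (sketch only):
// sample at the pixel center, then truncate to the nearest input index.
int nn_index(int out_i, int out_n, int in_n) {
    return (int)(((float)out_i + 0.5f) * (float)in_n / (float)out_n);
}
```

For instance, upscaling a 4-pixel input region onto an 8-pixel iteration space maps output indices 0 and 1 to input index 0; when both sizes match, the mapping is the identity.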
In the following, the HIPAcc framework is illustrated using a Gaussian filter for smoothing an image. The Gaussian filter reduces image noise and detail. It is a local operator that is applied to a neighborhood of each pixel to produce a smoothed image (see Equation (1)). The filter mask of the Gaussian filter, as described in Equation (2), depends only on the size of the considered neighborhood (σ) and is otherwise constant for the whole image. Therefore, the filter mask is typically precalculated and stored in a lookup table to avoid redundant calculations for each image pixel.
![](hipacc_documentation-Z-G-1.png)
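The precalculation mentioned above can be sketched in plain C++: compute the mask coefficients once, following the same expf expression used in the kernel of Listing 1, and normalize them so that they sum to one. This is an illustrative stand-alone sketch, not HIPAcc code.

```cpp
#include <cmath>
#include <vector>

// Precompute a normalized Gaussian mask once, instead of evaluating the
// exponential for every image pixel (size_x/size_y act as the sigma-like
// parameters, mirroring the expression in Listing 1).
std::vector<float> gaussian_mask(int size_x, int size_y) {
    int ax = size_x / 2, ay = size_y / 2;
    std::vector<float> mask(size_x * size_y);
    float sum = 0.0f;
    for (int yf = -ay; yf <= ay; ++yf) {
        for (int xf = -ax; xf <= ax; ++xf) {
            float c = std::exp(-1.0f * ((xf * xf) / (2.0f * size_x * size_x) +
                                        (yf * yf) / (2.0f * size_y * size_y)));
            mask[(yf + ay) * size_x + (xf + ax)] = c;
            sum += c;
        }
    }
    for (float &c : mask) c /= sum;  // normalize: coefficients sum to 1
    return mask;
}
```

Such a precomputed mask is exactly what would be assigned to a Mask object, as done in Listing 2.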
To express this filter in our framework, the programmer derives a class from the built-in Kernel class and implements the virtual kernel function, as shown in Listing 1. To access the pixels of an input image, the parenthesis operator () is used, taking the column (dx) and row (dy) offsets as optional parameters. Similarly, coefficients of a filter Mask are accessed using the parenthesis operator (), specifying the desired column (x) and row (y) index. The output image as specified by the Iteration Space is accessed using the output() method provided by the built-in Kernel class. The user instantiates the class with input image accessors, one iteration space, and other parameters that are member variables of the class.
class GaussianFilter : public Kernel<float> {
private:
Accessor<float> &Input;
const int size_x, size_y;
public:
GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, const int size_x, const int size_y) :
Kernel(IS),
Input(Input),
size_x(size_x),
size_y(size_y)
{ addAccessor(&Input); }
void kernel() {
const int ax = size_x >> 1;
const int ay = size_y >> 1;
float sum = 0;
for (int yf = -ay; yf<=ay; yf++) {
for (int xf = -ax; xf<=ax; xf++) {
float gauss_constant = expf(-1.0f*((xf*xf)/(2.0f*size_x*size_x) + (yf*yf)/(2.0f*size_y*size_y)));
sum += gauss_constant*Input(xf, yf);
}
}
output() = sum;
}
};
| Listing 1: Gaussian filter, calculating the Gaussian mask for each pixel. |
While in Listing 1 the Gaussian filter mask was calculated for each pixel (according to Equation (2)), it can also be precalculated and stored in a Mask. This is shown in Listing 2, where the mask coefficients are retrieved from a Mask object.
class GaussianFilter : public Kernel<float> {
private:
Accessor<float> &Input;
Mask<float> &cMask;
const int size_x, size_y;
public:
GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, Mask<float> &cMask, const int size_x, const int size_y) :
Kernel(IS),
Input(Input),
cMask(cMask),
size_x(size_x),
size_y(size_y)
{ addAccessor(&Input); }
void kernel() {
const int ax = size_x >> 1;
const int ay = size_y >> 1;
float sum = 0;
for (int yf = -ay; yf<=ay; yf++) {
for (int xf = -ax; xf<=ax; xf++) {
sum += cMask(xf, yf)*Input(xf, yf);
}
}
output() = sum;
}
};
| Listing 2: Gaussian filter, using a precalculated Gaussian mask. |
As an alternative, the convolution can be expressed using the convolve method, taking three parameters: a) the mask itself, b) the reduction operator for each element, and c) the calculation instruction for one element of the mask with pixels of the image. This is shown in Listing 3.
class GaussianFilter : public Kernel<float> {
private:
Accessor<float> &Input;
Mask<float> &cMask;
public:
GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, Mask<float> &cMask) :
Kernel(IS),
Input(Input),
cMask(cMask)
{ addAccessor(&Input); }
void kernel() {
output() = convolve(cMask, HipaccSUM, [&] () -> float {
return cMask()*Input(cMask);
});
}
};
| Listing 3: Gaussian filter, using the convolve function. |
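The reduce-over-mask pattern that convolve expresses can be sketched in plain C++ with a lambda. The helper below is hypothetical (not the HIPAcc implementation) and hard-codes the sum reduction, in the spirit of HipaccSUM: the lambda computes the contribution of one mask element and the corresponding pixel, and the helper accumulates the results over all elements.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the convolve pattern: iterate over all mask elements and
// combine the per-element results with a sum reduction. 'op' receives
// the mask coefficient and the corresponding pixel value.
float convolve_sum(const std::vector<float> &mask,
                   const std::vector<float> &pixels,
                   const std::function<float(float, float)> &op) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < mask.size(); ++i)
        acc += op(mask[i], pixels[i]);  // HipaccSUM-style reduction
    return acc;
}
```

A call mirroring Listing 3 would be `convolve_sum(mask, window, [](float c, float p) { return c * p; })`.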
In Listing 4, the input and output Image objects IN and OUT are defined as two-dimensional W × H grayscale images, with pixels represented as floating-point numbers (lines 10–11). The Image object IN is initialized with the host_in pointer to a plain C array (line 14). The Gaussian filter Mask object GMask is defined (line 17) and initialized (line 18) for the filter size. Because the Gaussian filter accesses neighboring pixels, border handling is required: in line 21, a BoundaryCondition object specifying mirroring as boundary mode for the filter size is defined. The region of interest IsOut covers the whole image (line 24), and the Accessor AccIn is defined on the input image, taking the boundary condition into account (line 27). The kernel is initialized with the iteration space, accessor, and filter mask objects as well as the filter size parameters size_x and size_y (line 30), and executed by a call to the execute() method (line 33). To retrieve the output image, the host_out pointer is assigned the data of the Image object OUT by invoking the getData() method (line 36).
 1  const int width=1024, height=1024, size_x=3, size_y=3;
 2
 3  // pointers to raw image data
 4  float *host_in = ...;
 5  float *host_out = ...;
 6
 7  // pointer to Gaussian filter mask
 8  float *filter_mask = ...;
 9  // input and output images
10  Image<float> IN(width, height);
11  Image<float> OUT(width, height);
12
13  // initialize input image
14  IN = host_in; // operator=
15
16  // filter Mask for Gaussian filter
17  Mask<float> GMask(size_x, size_y);
18  GMask = filter_mask;
19
20  // Boundary handling mode for out of bounds accesses
21  BoundaryCondition<float> BcInMirror(IN, GMask, BOUNDARY_MIRROR);
22
23  // define region of interest
24  IterationSpace<float> IsOut(OUT);
25
26  // Accessor used to access image pixels with the defined boundary handling mode
27  Accessor<float> AccIn(BcInMirror);
28
29  // define kernel
30  GaussianFilter GF(IsOut, AccIn, GMask, size_x, size_y);
31
32  // execute kernel
33  GF.execute();
34
35  // retrieve output image
36  host_out = OUT.getData();
| Listing 4: Host code, instantiating and executing the Gaussian filter. |
For each Image defined in the DSL, memory is allocated on the compute device (e.g., GPU). Synchronization between the memory allocated on the compute device and the data assigned to the Image instance is explicitly done by the programmer. Assigning a memory pointer to an Image triggers the memory transfer from the host to the compute device. Copying the data back to the host is initiated by the getData() operator.
Once the data is on the compute device, data can be directly copied between Images and Accessors. Listing 5 shows the possibilities of memory assignments between Images and Accessors as well as the data transfers to and from the compute device.
// input Image
int width, height;
uchar *image = read_image(&width, &height, "input.pgm");
Image<uchar> IN(width, height);

// copy data to the device: host -> device
IN = image;

// define second Image
Image<uchar> TMP(width, height);

// copy from IN to TMP: device -> device
TMP = IN;

// define ROI on IN (Accessor)
Accessor<uchar> AccIn(IN, roi_width, roi_height, offset_x, offset_y);

// define ROI on TMP (Accessor)
Accessor<uchar> AccTmp(TMP, roi_width, roi_height, 0, 0);

// copy from ROI on IN to ROI on TMP: device -> device
AccTmp = AccIn;

// output image
Image<uchar> OUT(roi_width, roi_height);

// copy from ROI on TMP to OUT: device -> device
OUT = AccTmp;

// copy from Image to Accessor: device -> device
AccTmp = OUT;

// copy data from device to host: device -> host
OUT.getData();
| Listing 5: Data transfer possibilities in HIPAcc. |
HIPAcc supports all built-in (primitive) data types supported in C/C++ and provides vector types (currently only with 4 vector elements) for these data types. Table 1 lists the supported built-in data types in C/C++ and the corresponding scalar and vector data types in HIPAcc.
| Table 1: Supported built-in types and vector types by the HIPAcc framework. |
While casting and implicit conversion between built-in scalar data types is provided by the C/C++ languages, no such support is provided for vector data types. In order to convert between different vector data types, convert functions are provided by the HIPAcc framework (see Table 2).
| Table 2: Convert functions for vector types provided by the HIPAcc framework. |
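The element-wise behavior of such a convert function can be sketched for one pair of types in plain C++. The struct and function names below are hypothetical stand-ins for uchar4/float4 and convert_float4, defined here only so the sketch is self-contained:

```cpp
// Minimal 4-element vector types, mimicking uchar4 and float4 (sketch only).
struct uchar4_t { unsigned char x, y, z, w; };
struct float4_t { float x, y, z, w; };

// Element-wise conversion in the spirit of convert_float4(): each
// component is converted independently.
float4_t to_float4(const uchar4_t &v) {
    return { (float)v.x, (float)v.y, (float)v.z, (float)v.w };
}
```

The HIPAcc convert functions follow this per-component scheme for all supported vector type combinations.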
Standard math functions (math.h / cmath) are supported on scalar data types. For vector data types, corresponding math functions are provided by HIPAcc in the hipacc::math namespace. Listing 6 shows the usage of vector types and math functions on vector types.
using namespace hipacc;
using namespace hipacc::math;
ushort4 pixel_s = { 0, 0, 0, 0};
uchar4 pixel;
pixel.x = 204;
pixel.y = 0;
pixel.z = 0;
pixel.w = 0;
float4 tmp;
// using sin from hipacc::math
tmp = sin(convert_float4(pixel));
// calling sin from hipacc::math directly
tmp = hipacc::math::sin(convert_float4(pixel));
pixel_s = convert_ushort4(tmp);
| Listing 6: Example usage of vector types and math functions. |
Using vector types, the Gaussian filter can also be applied to images with 4-channel pixels, as shown in Listing 7.
class GaussianFilter : public Kernel<uchar4> {
private:
Accessor<uchar4> &Input;
Mask<float> &cMask;
public:
GaussianFilter(IterationSpace<uchar4> &IS, Accessor<uchar4> &Input, Mask<float> &cMask) :
Kernel(IS),
Input(Input),
cMask(cMask)
{ addAccessor(&Input); }
void kernel() {
output() = convert_uchar4(convolve(cMask, HipaccSUM, [&] () -> float4 {
return cMask() * convert_float4(Input(cMask));
}));
}
};
| Listing 7: Gaussian filter on 4 channel pixels, using the convolve function. |
In order to generate target code for a GPU accelerator, the user invokes the hipacc compiler providing an input file and specifying the output file using the -o <file> option.
In addition, the -target <n> option specifies the target hardware. Supported devices are listed in Table 3.
The code variant (i.e., combination of optimizations) for a particular target device is automatically chosen by the HIPAcc framework according to an expert system and based on heuristics.
For manual testing, the user can enable or disable optimizations using corresponding command line options.
For example, the user can specify that local memory or texture memory should be turned on or off.
Similarly, the amount of padding or the unroll factor can be set by the user.
The --time-kernels compiler flag generates code that executes each kernel 10 times for calculating the execution time.
This timing information (in ms) can be retrieved for the kernel executed last using the hipaccGetLastKernelTiming() function.
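The averaging idea behind repeated kernel execution can be sketched in plain C++ with std::chrono. The function below is illustrative only; it is not the HIPAcc runtime API (which reports the timing via hipaccGetLastKernelTiming()):

```cpp
#include <chrono>
#include <functional>

// Run 'kernel' n times and return the average execution time in
// milliseconds, mirroring the repeated-execution idea of --time-kernels
// (sketch only; real GPU timing would synchronize with the device).
double average_time_ms(const std::function<void()> &kernel, int n = 10) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        kernel();
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / n;
}
```

Averaging over several runs smooths out launch overhead and scheduling noise, which is why a single execution is rarely a reliable measurement.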
Below, all options of the source-to-source compiler are listed.
$ hipacc --help
Copyright (c) 2012, University of Erlangen-Nuremberg
Copyright (c) 2012, Siemens AG
Copyright (c) 2010, ARM Limited
All rights reserved.
OVERVIEW: HIPAcc - Heterogeneous Image Processing Acceleration framework
USAGE: hipacc [options] <input>
OPTIONS:
-emit-cuda Emit CUDA code; default is OpenCL code
-emit-opencl-cpu Emit OpenCL code for CPU devices, no padding supported
-emit-renderscript Emit Renderscript code for Android
-emit-renderscript-gpu Emit Renderscript code for Android (force GPU execution)
-emit-filterscript Emit Filterscript code for Android
-emit-padding <n> Emit CUDA/OpenCL/Renderscript image padding, using alignment of <n> bytes for GPU devices
-target <n> Generate code for GPUs with code name <n>.
Code names for CUDA/OpenCL on NVIDIA devices are:
'Tesla-10', 'Tesla-11', 'Tesla-12', and 'Tesla-13' for Tesla architecture.
'Fermi-20' and 'Fermi-21' for Fermi architecture.
'Kepler-30' and 'Kepler-35' for Kepler architecture.
Code names for OpenCL on AMD devices are:
'Evergreen' for Evergreen architecture (Radeon HD5xxx).
'NorthernIsland' for Northern Island architecture (Radeon HD6xxx).
Code names for OpenCL/Renderscript on ARM devices are:
'Midgard' for Midgard architecture (Mali-T6xx).
-explore-config Emit code that explores all possible kernel configurations and prints their performance
-use-config <nxm> Emit code that uses a configuration of nxm threads, e.g. 128x1
-time-kernels Emit code that executes each kernel multiple times to get accurate timings
-use-textures <o> Enable/disable usage of textures (cached) in CUDA/OpenCL to read/write image pixels - for GPU devices only
Valid values for CUDA on NVIDIA devices: 'off', 'Linear1D', 'Linear2D', and 'Array2D'
Valid values for OpenCL: 'off' and 'Array2D'
-use-local <o> Enable/disable usage of shared/local memory in CUDA/OpenCL to stage image pixels to scratchpad
Valid values: 'on' and 'off'
-vectorize <o> Enable/disable vectorization of generated CUDA/OpenCL code
Valid values: 'on' and 'off'
-pixels-per-thread <n> Specify how many pixels should be calculated per thread
-o <file> Write output to <file>
--help Display available options
--version Display version information
The installation of the HIPAcc framework provides a set of example programs and a Makefile for getting started easily.
The installation directory contains the tests directory with sample programs.
Setting the TEST_CASE environment variable to one of these directories and HIPACC_TARGET to the graphics card in the system is all that is required to get started.
Afterwards, the make cuda and make opencl targets can be used to generate code using the CUDA and OpenCL back ends, respectively.
TEST_CASE: directory of the example that should be compiled using the HIPAcc compiler.
HIPACC_TARGET: specifies the target architecture for which the compiler should optimize
Here are sample definitions of these variables:
# compile the bilateral filter example
export TEST_CASE=./tests/bilateral_filter
# generate target code for a Quadro FX 5800 graphics card from NVIDIA
export HIPACC_TARGET=Tesla-13
The target hardware as supported by HIPAcc is categorized according to a target architecture. The target architecture corresponds to the code name of NVIDIA GPUs with compute capability appended and corresponds to the series specification for GPUs from AMD and ARM. Table 3 lists the devices currently supported by the HIPAcc framework.
| Table 3: Target architecture and sample GPU devices supported by the HIPAcc framework. |