HIPAcc Documentation

The Heterogeneous Image Processing Acceleration Framework
Version 0.6.1

Richard Membarth

May 28, 2013

Contents

    1  Installation
        1.1  Dependencies
        1.2  HIPAcc Installation
                Android Cross Compilation:

    2  Domain Specific Language
        2.1  Built-in C++ Classes
        2.2  Defining Image Operators
                Device Code:
                Host Code:
        2.3  Memory Management
        2.4  Supported Data Types and Operations
                Data Types:
                Convert Functions:
                Math Functions:

    3  Framework Usage
        3.1  Optimization Selection
        3.2  Sample Invocation
        3.3  Supported Target Hardware

1  Installation

HIPAcc features a Clang-based source-to-source compiler and therefore requires a version of Clang that has been tested with HIPAcc. The file dependencies.sh lists the revisions of Clang and of the other dependencies that the current version of HIPAcc works with. In addition to Clang, LLVM and libcxx are required; Polly is optional.

1.1  Dependencies

Installation of dependencies:

cd <source_dir>
git clone http://llvm.org/git/llvm.git
cd llvm && git checkout <llvm_revision>
cd <source_dir>
git clone http://llvm.org/git/libcxx.git
cd libcxx && git checkout <libcxx_revision>
cd <source_dir>/llvm/tools
git clone http://llvm.org/git/clang.git
cd clang && git checkout <clang_revision>
# optional: installation of polly
cd <source_dir>/llvm/tools
git clone http://llvm.org/git/polly.git
cd polly && git checkout <polly_revision>

Configure the software packages using CMake, build and install using make install.
Note: On GNU/Linux systems, libc++ has to be built using clang/clang++. The easiest way to do this is to use CMake and to specify the compilers at the command line: CXX=clang++ CC=clang cmake ../ -DCMAKE_INSTALL_PREFIX=/opt/local
Note: On Mac OS X 10.6, cxxabi.h from http://home.roadrunner.com/~hinnant/libcppabi.zip is required to build libc++ successfully.

1.2  HIPAcc Installation

Next, download HIPAcc from http://github.com/hipacc/hipacc, either as a versioned tarball (the latest release is currently hipacc-0.6.1.tar.gz) or by retrieving the latest sources using git:
git clone git@github.com:hipacc/hipacc.git hipacc

To build and install the HIPAcc framework, CMake is used. In the main directory, the file INSTALL contains required instructions:

To configure the project, call cmake in the root directory. A working installation of Clang/LLVM (and Polly) is required. The llvm-config tool will be used to determine configuration for HIPAcc and must be present in the environment.

The following variables can be set to tell cmake where to look for certain components:

The following options can be enabled or disabled:

A possible configuration may look like the following:

cd <hipacc_root>
mkdir build && cd build
mkdir install_dir
cmake ../ -DCMAKE_INSTALL_PREFIX=./install_dir
make && make install

Android Cross Compilation:

Generating target code for Android requires cross compilation. Therefore, additional variables are recognized for cross compilation for Android. The examples given target the Arndale Board with a Samsung Exynos 5 Dual (ARM Cortex-A15 and ARM Mali T604 GPU):

2  Domain Specific Language

The DSL can be separated into two parts: a) host code that describes the binding to standard C/C++ code and b) device code that describes the calculation on the graphics card.

2.1  Built-in C++ Classes

The library consists of built-in C++ classes that describe the following three basic components required to express image processing on an abstract level:

2.2  Defining Image Operators

In the following, the HIPAcc framework is illustrated using a Gaussian filter, smoothing an image. By doing so, the Gaussian filter reduces image noise and detail. This filter is a local operator that is applied to a neighborhood (σ) of each pixel to produce a smoothed image (see Equation (1)). The filter mask of the Gaussian filter as described in Equation (2) depends only on the size of the considered neighborhood (σ) and is otherwise constant for the image. Therefore, the filter mask is typically precalculated and stored in a lookup table to avoid redundant calculations for each image pixel.

[hipacc_documentation-Z-G-1.png]

Device Code:

To express this filter in our framework, the programmer derives a class from the built-in Kernel class and implements the virtual kernel function, as shown in Listing 1. To access the pixels of an input image, the parenthesis operator () is used, taking the column (dx) and row (dy) offsets as optional parameters. Similarly, coefficients of a filter Mask are accessed using the parenthesis operator (), specifying the desired column (x) and row (y) index. The output image as specified by the Iteration Space is accessed using the output() method provided by the built-in Kernel class. The user instantiates the class with input image accessors, one iteration space, and other parameters that are member variables of the class.

class GaussianFilter : public Kernel<float> {
  private:
    Accessor<float> &Input;
    const int size_x, size_y;

  public:
    GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, const int size_x, const int size_y) :
      Kernel(IS),
      Input(Input),
      size_x(size_x),
      size_y(size_y)
    { addAccessor(&Input); }

    void kernel() {
      const int ax = size_x >> 1;
      const int ay = size_y >> 1;
      float sum = 0;

      for (int yf = -ay; yf<=ay; yf++) {
        for (int xf = -ax; xf<=ax; xf++) {
          float gauss_constant = expf(-1.0f*((xf*xf)/(2.0f*size_x*size_x) + (yf*yf)/(2.0f*size_y*size_y)));
          sum += gauss_constant*Input(xf, yf);
        }
      }
      output() = sum;
    }
};

Listing 1: Gaussian filter, calculating the Gaussian mask for each pixel.

While in Listing 1, the Gaussian filter mask was calculated for each pixel (according to Equation (2)), the Gaussian filter mask can be precalculated and stored to a Mask. This is shown in Listing 2 where the mask coefficient is retrieved from a Mask object.

class GaussianFilter : public Kernel<float> {
  private:
    Accessor<float> &Input;
    Mask<float> &cMask;
    const int size_x, size_y;

  public:
    GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, Mask<float> &cMask, const int size_x, const int size_y) :
      Kernel(IS),
      Input(Input),
      cMask(cMask),
      size_x(size_x),
      size_y(size_y)
    { addAccessor(&Input); }

    void kernel() {
      const int ax = size_x >> 1;
      const int ay = size_y >> 1;
      float sum = 0;

      for (int yf = -ay; yf<=ay; yf++) {
        for (int xf = -ax; xf<=ax; xf++) {
          sum += cMask(xf, yf)*Input(xf, yf);
        }
      }
      output() = sum;
    }
};

Listing 2: Gaussian filter, using a precalculated Gaussian mask.
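
The coefficients stored in the Mask have to be computed on the host beforehand. The following is a minimal sketch of such a precalculation in plain C++, evaluating the same expression that Listing 1 evaluates per pixel. The helper make_gaussian_mask is hypothetical and not part of HIPAcc; it assumes odd filter sizes and a row-major layout, and it leaves the coefficients unnormalized, exactly as in Listing 1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of HIPAcc): returns a row-major
// size_x * size_y array holding the coefficients of Listing 1.
// Returns an empty vector if a size is not a positive odd number.
std::vector<float> make_gaussian_mask(int size_x, int size_y) {
    if (size_x <= 0 || size_y <= 0 || size_x % 2 == 0 || size_y % 2 == 0)
        return std::vector<float>();

    const int ax = size_x >> 1;  // half width, as in Listing 1
    const int ay = size_y >> 1;  // half height, as in Listing 1
    std::vector<float> mask(static_cast<std::size_t>(size_x) * size_y);

    for (int yf = -ay; yf <= ay; ++yf) {
        for (int xf = -ax; xf <= ax; ++xf) {
            // same expression as gauss_constant in Listing 1
            const float c = std::exp(-1.0f * ((xf * xf) / (2.0f * size_x * size_x) +
                                              (yf * yf) / (2.0f * size_y * size_y)));
            mask[(yf + ay) * size_x + (xf + ax)] = c;
        }
    }
    return mask;
}
```

The pointer returned by data() on the resulting vector could then serve as the plain C array that is assigned to a Mask object in the host code.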

As an alternative, the convolution can be expressed using the convolve method, taking three parameters: a) the mask itself, b) the reduction operator for each element, and c) the calculation instruction for one element of the mask with pixels of the image. This is shown in Listing 3.

class GaussianFilter : public Kernel<float> {
  private:
    Accessor<float> &Input;
    Mask<float> &cMask;

  public:
    GaussianFilter(IterationSpace<float> &IS, Accessor<float> &Input, Mask<float> &cMask) :
      Kernel(IS),
      Input(Input),
      cMask(cMask)
    { addAccessor(&Input); }

    void kernel() {
      output() = convolve(cMask, HipaccSUM, [&] () -> float {
        return cMask()*Input(cMask);
        });
    }
};

Listing 3: Gaussian filter, using the convolve function.
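
To make the roles of the three convolve parameters concrete, the following plain C++ sketch models the computation for a single output pixel. It is an illustration only and not HIPAcc's implementation: the Grid type and the convolve_at function are hypothetical names, the reduction operator is passed as an ordinary callable instead of HipaccSUM, and no boundary handling is performed.

```cpp
#include <vector>

// Illustrative row-major 2D container (hypothetical, not a HIPAcc class).
struct Grid {
    int width, height;
    std::vector<float> data;
    float at(int x, int y) const { return data[y * width + x]; }
};

// Evaluates op(mask coefficient, image pixel) for every mask element centered
// at (x, y) and folds the results with reduce. The caller has to keep (x, y)
// far enough from the image border, since no boundary handling is done here.
template <typename Reduce, typename Op>
float convolve_at(const Grid &mask, const Grid &image, int x, int y,
                  Reduce reduce, Op op) {
    const int ax = mask.width >> 1;
    const int ay = mask.height >> 1;
    bool first = true;
    float acc = 0.0f;
    for (int yf = -ay; yf <= ay; ++yf) {
        for (int xf = -ax; xf <= ax; ++xf) {
            const float v = op(mask.at(xf + ax, yf + ay), image.at(x + xf, y + yf));
            acc = first ? v : reduce(acc, v);  // first element initializes the result
            first = false;
        }
    }
    return acc;
}
```

Passing a sum as reduce and a product as op corresponds to the convolution of Listing 3; passing a maximum as reduce would instead yield the largest weighted pixel in the neighborhood.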

Host Code:

In Listing 4, the input and output Image objects IN and OUT are defined as two-dimensional W × H grayscale images, having pixels represented as floating-point numbers (lines 10–11). The Image object IN is initialized with the host_in pointer to a plain C array (line 14). The Gaussian filter Mask object GMask is defined (line 17) and is initialized (line 18) for the filter size. Because the Gaussian filter accesses neighboring pixels, border handling is required. In line 21, a Boundary Condition object specifying mirroring as boundary mode for the filter size is defined. The region of interest IsOut contains the whole image (line 24) and the Accessor AccIn is defined on the input image taking the boundary condition into account (line 27). The kernel is initialized with the iteration space, accessor, and filter mask objects as well as filter size parameters size_x and size_y (line 30), and executed by a call to the execute() method (line 33). To retrieve the output image, the host_out pointer is assigned the result of invoking the getData() method on the Image object OUT (line 36).

const int width=1024, height=1024, size_x=3, size_y=3;

// pointers to raw image data
float *host_in = ...;
float *host_out = ...;
// pointer to Gaussian filter mask
float *filter_mask = ...;

// input and output images
Image<float> IN(width, height);
Image<float> OUT(width, height);

// initialize input image
IN = host_in; // operator=

// filter Mask for Gaussian filter
Mask<float> GMask(size_x, size_y);
GMask = filter_mask;

// Boundary handling mode for out of bounds accesses
BoundaryCondition<float> BcInMirror(IN, GMask, BOUNDARY_MIRROR);

// define region of interest
IterationSpace<float> IsOut(OUT);

// Accessor used to access image pixels with the defined boundary handling mode
Accessor<float> AccIn(BcInMirror);

// define kernel
GaussianFilter GF(IsOut, AccIn, GMask, size_x, size_y);

// execute kernel
GF.execute();

// retrieve output image
host_out = OUT.getData();

Listing 4: Host code, instantiating and executing the Gaussian filter.
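
The effect of a mirroring boundary mode on out-of-bounds accesses can be sketched in plain C++ as an index mapping. The function mirror_index is a hypothetical illustration and not HIPAcc code. It assumes that the border pixel itself is repeated when reflecting (index -1 maps to 0, index size maps to size - 1); the exact convention used by BOUNDARY_MIRROR is not specified in this document.

```cpp
// Maps a possibly out-of-range coordinate i back into [0, size) by
// reflecting it at the image border. Assumption: the border pixel is
// repeated, i.e. -1 -> 0, -2 -> 1, size -> size - 1, size + 1 -> size - 2.
// Requires size >= 1.
int mirror_index(int i, int size) {
    while (i < 0 || i >= size) {
        if (i < 0)
            i = -i - 1;            // reflect at the lower border
        else
            i = 2 * size - i - 1;  // reflect at the upper border
    }
    return i;
}
```

Under this assumption, a 3 × 3 filter applied at pixel (0, 0) of the image would read the pixels at column and row indices 0, 0, and 1 instead of -1, 0, and 1.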

2.3  Memory Management

For each Image defined in the DSL, memory is allocated on the compute device (e.g., GPU). Synchronization between the memory allocated on the compute device and the data assigned to the Image instance is explicitly done by the programmer. Assigning a memory pointer to an Image triggers the memory transfer from the host to the compute device. Copying the data back to the host is initiated by the getData() method.

Once the data is on the compute device, data can be directly copied between Images and Accessors. Listing 5 shows the possibilities of memory assignments between Images and Accessors as well as the data transfer to and from the compute device.

// input Image
int width, height;
uchar *image = read_image(&width, &height, "input.pgm");
Image<uchar> IN(width, height);

// copy data to the device: host -> device
IN = image;

// define second Image
Image<uchar> TMP(width, height);

// copy from IN to TMP: device -> device
TMP = IN;


// define ROI on IN (Accessor)
Accessor<uchar> AccIn(IN, roi_width, roi_height, offset_x, offset_y);

// define ROI on TMP (Accessor)
Accessor<uchar> AccTmp(TMP, roi_width, roi_height, 0, 0);

// copy from ROI on IN to ROI on TMP: device -> device
AccTmp = AccIn;


// output image
Image<uchar> OUT(roi_width, roi_height);

// copy from ROI on TMP to OUT: device -> device
OUT = AccTmp;

// copy from OUT to ROI on TMP (Image to Accessor): device -> device
AccTmp = OUT;

// copy data from device to host: device -> host
OUT.getData();

Listing 5: Data transfer possibilities in HIPAcc.
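
What an assignment between two Accessors with regions of interest amounts to can be modeled on the host as a windowed copy between two row-major buffers. The sketch below is such a model in plain C++; copy_roi is a hypothetical helper, not a HIPAcc function, and in HIPAcc the actual copy takes place on the compute device.

```cpp
#include <vector>

typedef unsigned char uchar;

// Host-side model of a ROI-to-ROI copy: copies a roi_width x roi_height
// window starting at (src_x, src_y) in src to the window starting at
// (dst_x, dst_y) in dst. Both buffers are row-major; the caller has to
// ensure that both windows lie inside their buffers.
void copy_roi(const std::vector<uchar> &src, int src_width, int src_x, int src_y,
              std::vector<uchar> &dst, int dst_width, int dst_x, int dst_y,
              int roi_width, int roi_height) {
    for (int y = 0; y < roi_height; ++y) {
        for (int x = 0; x < roi_width; ++x) {
            dst[(dst_y + y) * dst_width + (dst_x + x)] =
                src[(src_y + y) * src_width + (src_x + x)];
        }
    }
}
```

In terms of Listing 5, the source window corresponds to AccIn (offset offset_x, offset_y on IN) and the destination window to AccTmp (offset 0, 0 on TMP).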

2.4  Supported Data Types and Operations

Data Types:

HIPAcc supports all built-in (primitive) data types supported in C/C++ and provides vector types (currently only with 4 vector elements) for these data types. Table 1 lists the supported built-in data types in C/C++ and the corresponding scalar and vector data types in HIPAcc.

C/C++ built-in type    scalar type    vector type
char                   char           char4
unsigned char          uchar          uchar4
short                  short          short4
unsigned short         ushort         ushort4
int                    int            int4
unsigned int           uint           uint4
long                   long           long4
unsigned long          ulong          ulong4
float                  float          float4
double                 double         double4
Table 1:  Supported built-in types and vector types by the HIPAcc framework.

Convert Functions:

While casting and implicit conversion between built-in scalar data types is provided by the C/C++ languages, no such support is provided for vector data types. In order to convert between different vector data types, convert functions are provided by the HIPAcc framework (see Table 2).

convert function    return type    argument type
convert_char4       char4          any vector data type
convert_uchar4      uchar4         any vector data type
convert_short4      short4         any vector data type
convert_ushort4     ushort4        any vector data type
convert_int4        int4           any vector data type
convert_uint4       uint4          any vector data type
convert_long4       long4          any vector data type
convert_ulong4      ulong4         any vector data type
convert_float4      float4         any vector data type
convert_double4     double4        any vector data type
Table 2:  Convert functions for vector types provided by the HIPAcc framework.
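
Conceptually, a convert function applies a scalar conversion to each of the four vector elements. The following plain C++ sketch illustrates this with stand-in types; my_uchar4, my_float4, to_float4, and to_uchar4 are hypothetical names chosen to avoid clashing with the types and functions provided by HIPAcc. The sketch assumes plain C++ cast semantics (truncation toward zero, no saturation); whether the HIPAcc convert functions round or saturate is not stated in this document.

```cpp
// Stand-in vector types; HIPAcc provides its own uchar4 and float4.
struct my_uchar4 { unsigned char x, y, z, w; };
struct my_float4 { float x, y, z, w; };

// Element-wise conversion from 4 x unsigned char to 4 x float.
my_float4 to_float4(my_uchar4 v) {
    my_float4 r = { static_cast<float>(v.x), static_cast<float>(v.y),
                    static_cast<float>(v.z), static_cast<float>(v.w) };
    return r;
}

// Element-wise conversion from 4 x float to 4 x unsigned char.
// Assumption: plain C++ casts, which truncate toward zero; values outside
// the range 0..255 must be clamped by the caller before converting.
my_uchar4 to_uchar4(my_float4 v) {
    my_uchar4 r = { static_cast<unsigned char>(v.x), static_cast<unsigned char>(v.y),
                    static_cast<unsigned char>(v.z), static_cast<unsigned char>(v.w) };
    return r;
}
```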

Math Functions:

Standard math functions (math.h / cmath) are supported on scalar data types. For vector data types, corresponding math functions are provided by HIPAcc in the hipacc::math namespace. Listing 6 shows the usage of vector types and math functions on vector types.

using namespace hipacc;
using namespace hipacc::math;

ushort4 pixel_s = { 0, 0, 0, 0};

uchar4 pixel;
pixel.x = 204;
pixel.y = 0;
pixel.z = 0;
pixel.w = 0;

float4 tmp;
// using sin from hipacc::math
tmp = sin(convert_float4(pixel));
// calling sin from hipacc::math directly
tmp = hipacc::math::sin(convert_float4(pixel));

// convert to the ushort4 type of pixel_s (see Table 2)
pixel_s = convert_ushort4(tmp);

Listing 6: Example usage of vector types and math functions.

Using vector types, the Gaussian filter can also be applied to images with 4-channel pixels, as shown in Listing 7.

class GaussianFilter : public Kernel<uchar4> {
  private:
    Accessor<uchar4> &Input;
    Mask<float> &cMask;

  public:
    GaussianFilter(IterationSpace<uchar4> &IS, Accessor<uchar4> &Input, Mask<float> &cMask) :
      Kernel(IS),
      Input(Input),
      cMask(cMask)
    { addAccessor(&Input); }

    void kernel() {
      output() = convert_uchar4(convolve(cMask, HipaccSUM, [&] () -> float4 {
            return cMask() * convert_float4(Input(cMask));
            }));
    }
};

Listing 7: Gaussian filter on 4 channel pixels, using the convolve function.

3  Framework Usage

In order to generate target code for a GPU accelerator, the user invokes the hipacc compiler providing an input file and specifying the output file using the -o <file> option. In addition, the -target <n> option specifies the target hardware. Supported devices are listed in Table 3.

3.1  Optimization Selection

The code variant (i.e., combination of optimizations) for a particular target device is automatically chosen by the HIPAcc framework according to an expert system and based on heuristics. For manual testing, the user can enable or disable optimizations using corresponding command line options. For example, the user can specify that local memory or texture memory should be turned on or off. Similarly, the amount of padding or the number of pixels calculated per thread can be set by the user. The -time-kernels compiler flag generates code that executes each kernel 10 times for calculating the execution time. This timing information (in ms) can be retrieved for the kernel executed last using the hipaccGetLastKernelTiming() function.

Below, all options of the source-to-source compiler are listed.

membarth@codesign75:~/projects/hipacc/build/release > ./bin/hipacc --help

Copyright (c) 2012, University of Erlangen-Nuremberg
Copyright (c) 2012, Siemens AG
Copyright (c) 2010, ARM Limited
All rights reserved.

OVERVIEW: HIPAcc - Heterogeneous Image Processing Acceleration framework

USAGE:  hipacc [options] <input>

OPTIONS:

  -emit-cuda              Emit CUDA code; default is OpenCL code
  -emit-opencl-cpu        Emit OpenCL code for CPU devices, no padding supported
  -emit-renderscript      Emit Renderscript code for Android
  -emit-renderscript-gpu  Emit Renderscript code for Android (force GPU execution)
  -emit-filterscript      Emit Filterscript code for Android
  -emit-padding <n>       Emit CUDA/OpenCL/Renderscript image padding, using alignment of <n> bytes for GPU devices
  -target <n>             Generate code for GPUs with code name <n>.
                          Code names for CUDA/OpenCL on NVIDIA devices are:
                            'Tesla-10', 'Tesla-11', 'Tesla-12', and 'Tesla-13' for Tesla architecture.
                            'Fermi-20' and 'Fermi-21' for Fermi architecture.
                            'Kepler-30' and 'Kepler-35' for Kepler architecture.
                          Code names for for OpenCL on AMD devices are:
                            'Evergreen'      for Evergreen architecture (Radeon HD5xxx).
                            'NorthernIsland' for Northern Island architecture (Radeon HD6xxx).
                          Code names for for OpenCL/Renderscript on ARM devices are:
                            'Midgard' for Mali-T6xx' for Mali.
  -explore-config         Emit code that explores all possible kernel configuration and print its performance
  -use-config <nxm>       Emit code that uses a configuration of nxm threads, e.g. 128x1
  -time-kernels           Emit code that executes each kernel multiple times to get accurate timings
  -use-textures <o>       Enable/disable usage of textures (cached) in CUDA/OpenCL to read/write image pixels - for GPU devices only
                          Valid values for CUDA on NVIDIA devices: 'off', 'Linear1D', 'Linear2D', and 'Array2D'
                          Valid values for OpenCL: 'off' and 'Array2D'
  -use-local <o>          Enable/disable usage of shared/local memory in CUDA/OpenCL to stage image pixels to scratchpad
                          Valid values: 'on' and 'off'
  -vectorize <o>          Enable/disable vectorization of generated CUDA/OpenCL code
                          Valid values: 'on' and 'off'
  -pixels-per-thread <n>  Specify how many pixels should be calculated per thread
  -o <file>               Write output to <file>
  --help                  Display available options
  --version               Display version information

3.2  Sample Invocation

The installation of the HIPAcc framework provides a set of example programs and a Makefile for getting started easily. The installation directory contains the tests directory with sample programs. To get started, set the TEST_CASE environment variable to one of these directories and the HIPACC_TARGET environment variable to the target architecture of the graphics card in the system. Afterwards, the make cuda and make opencl targets can be used to generate code using the CUDA and OpenCL back ends, respectively.

Here are sample definitions of these variables:

# compile the bilateral filter example
export TEST_CASE=./tests/bilateral_filter

# generate target code for a Quadro FX 5800 graphics card from NVIDIA
export HIPACC_TARGET=Tesla-13

3.3  Supported Target Hardware

The target hardware as supported by HIPAcc is categorized according to a target architecture. The target architecture corresponds to the code name of NVIDIA GPUs with compute capability appended and corresponds to the series specification for GPUs from AMD and ARM. Table 3 lists the devices currently supported by the HIPAcc framework.

target architecture    supported devices
Tesla-10               NVIDIA GeForce 8800 GTX, 9800 GT, Tesla C870
Tesla-11               NVIDIA GeForce 8800 GTS, 9800 GTX
Tesla-12               NVIDIA GeForce GT 240
Tesla-13               NVIDIA GeForce GTX 285, Tesla C1060
Fermi-20               NVIDIA GeForce GTX 590, Tesla C2050
Fermi-21               NVIDIA GeForce GTX 560 Ti
Kepler-30              NVIDIA GeForce GTX 680
Kepler-35              NVIDIA GeForce GTX TITAN
Evergreen              AMD Radeon HD 58xx
NorthernIsland         AMD Radeon HD 69xx
Midgard                ARM Mali T6xx
Table 3:  Target architecture and sample GPU devices supported by the HIPAcc framework.
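
The naming scheme for NVIDIA targets (code name with the compute capability appended) can be illustrated by a small parser in plain C++. The Target structure and the function parse_nvidia_target are hypothetical and not part of HIPAcc; the sketch only assumes that the suffix after the dash consists of exactly two digits, as is the case for all NVIDIA entries in Table 3.

```cpp
#include <string>

// Hypothetical result type: code name plus compute capability major.minor.
struct Target {
    std::string code_name;
    int cc_major;
    int cc_minor;
};

// Splits names such as "Kepler-35" into the code name "Kepler" and the
// compute capability 3.5. Returns false for names without a two-digit
// suffix, such as the AMD and ARM targets "Evergreen" or "Midgard".
bool parse_nvidia_target(const std::string &target, Target &out) {
    const std::string::size_type dash = target.rfind('-');
    if (dash == std::string::npos || target.size() != dash + 3)
        return false;
    const char major_digit = target[dash + 1];
    const char minor_digit = target[dash + 2];
    if (major_digit < '0' || major_digit > '9' ||
        minor_digit < '0' || minor_digit > '9')
        return false;
    out.code_name = target.substr(0, dash);
    out.cc_major = major_digit - '0';
    out.cc_minor = minor_digit - '0';
    return true;
}
```

For example, Tesla-13 is split into the code name Tesla and compute capability 1.3.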

Last modified: Tue, May 28, 2013, 11:05 am UTC+1