% *======================================================================*
%  Cactus Thorn template for ThornGuide documentation
%  Author: Ian Kelley
%  Date: Sun Jun 02, 2002
%  $Header$
%
%  Thorn documentation in the latex file doc/documentation.tex
%  will be included in ThornGuides built with the Cactus make system.
%  The scripts employed by the make system automatically include
%  pages about variables, parameters and scheduling parsed from the
%  relevant thorn CCL files.
%
%  This template contains guidelines which help to assure that your
%  documentation will be correctly added to ThornGuides. More
%  information is available in the Cactus UsersGuide.
%
%  Guidelines:
%   - Do not change anything before the line
%       % START CACTUS THORNGUIDE",
%     except for filling in the title, author, date, etc. fields.
%        - Each of these fields should only be on ONE line.
%        - Author names should be separated with a \\ or a comma.
%   - You can define your own macros, but they must appear after
%     the START CACTUS THORNGUIDE line, and must not redefine standard
%     latex commands.
%   - To avoid name clashes with other thorns, 'labels', 'citations',
%     'references', and 'image' names should conform to the following
%     convention:
%       ARRANGEMENT_THORN_LABEL
%     For example, an image wave.eps in the arrangement CactusWave and
%     thorn WaveToyC should be renamed to CactusWave_WaveToyC_wave.eps
%   - Graphics should only be included using the graphicx package.
%     More specifically, with the "\includegraphics" command.  Do
%     not specify any graphic file extensions in your .tex file. This
%     will allow us to create a PDF version of the ThornGuide
%     via pdflatex.
%   - References should be included with the latex "\bibitem" command.
%   - Use \begin{abstract}...\end{abstract} instead of \abstract{...}
%   - Do not use \appendix, instead include any appendices you need as
%     standard sections.
%   - For the benefit of our Perl scripts, and for future extensions,
%     please use simple latex.
%
% *======================================================================*
%
% Example of including a graphic image:
%    \begin{figure}[ht]
% 	\begin{center}
%    	   \includegraphics[width=6cm]{MyArrangement_MyThorn_MyFigure}
% 	\end{center}
% 	\caption{Illustration of this and that}
% 	\label{MyArrangement_MyThorn_MyLabel}
%    \end{figure}
%
% Example of using a label:
%   \label{MyArrangement_MyThorn_MyLabel}
%
% Example of a citation:
%    \cite{MyArrangement_MyThorn_Author99}
%
% Example of including a reference
%   \bibitem{MyArrangement_MyThorn_Author99}
%   {J. Author, {\em The Title of the Book, Journal, or periodical}, 1 (1999),
%   1--16. {\tt http://www.nowhere.com/}}
%
% *======================================================================*

% If you are using CVS use this line to give version information
% $Header$

\documentclass{article}

% Use the Cactus ThornGuide style file
% (Automatically used from Cactus distribution, if you have a
%  thorn without the Cactus Flesh download this from the Cactus
%  homepage at www.cactuscode.org)
\usepackage{../../../../doc/latex/cactus}

\begin{document}

% The author of the documentation
\author{Erik Schnetter \textless eschnetter@perimeterinstitute.ca\textgreater}

% The title of the document (not necessarily the name of the Thorn)
\title{OpenCLRunTime}

% the date your document was last changed, if your document is in CVS,
% please use:
%    \date{$ $Date: 2004-01-07 14:12:39 -0600 (Wed, 07 Jan 2004) $ $}
\date{May 17, 2012}

\maketitle

% Do not delete next line
% START CACTUS THORNGUIDE

% Add all definitions used in this documentation here
%   \def\mydef etc

% Add an abstract for this thorn's documentation
\begin{abstract}
  Executing OpenCL kernels requires some boilerplate code: One needs
  to choose an OpenCL platform and device, needs to compile the code
  (from a C string), needs to pass in arguments, and finally needs to
  execute the actual kernel code. This thorn \texttt{OpenCLRunTime}
  provides a simple helper routine for these tasks.
\end{abstract}


\section{Overview}

Thorn \texttt{OpenCLRunTime} performs the following tasks:
\begin{itemize}
  \item At startup, it outputs a description of the hardware
    (platforms and devices) available via OpenCL, i.e.\ CPUs and GPUs
    that have OpenCL drivers installed.
  \item At startup, it selects one device of one platform that will be
    used later on.
  \item It provides an API that can be used to compile kernels for
    this device, and which remembers previously compiled kernels so
    that they don't have to be recompiled.
  \item It disassembles the compiled kernels so that they can be
    examined (which would otherwise be difficult in an environment
    using dynamic compilation).
  \item It allocates storage for grid functions on the device, and
    handles copying data to and from the device (in interaction with
    thorn \texttt{Accelerator}).
  \item It offers a set of convenience macros similar to
    \texttt{CCTK\_ARGUMENTS} and \texttt{CCTK\_PARAMETERS} to access
    grid structure information. (Parameter values are expanded when
    compiling kernels, often enabling additional optimisations.)
  \item It offers looping macros similar to those provided by thorn
    \texttt{LoopControl}, which parallelise loops via OpenCL's
    multithreading.
  \item It offers datatypes and macros for easy vectorisation, based
    on OpenCL's vector types.
\end{itemize}

At the moment, \texttt{OpenCLRunTime} only supports unigrid
simulations; adaptive mesh refinement or multi-block methods are not
yet supported. (The main reason for this is that the \texttt{cctkGH}
entries on the device are not updated; however, this should be
straightforward to implement.)


\section{Example}

An OpenCL compute kernel may be called as follows:

\begin{verbatim}
  char const *const groups[] = {
    "WaveToyOpenCL::Scalar",
    NULL};
  
  int const imin[] = {cctk_nghostzones[0],
                      cctk_nghostzones[1],
                      cctk_nghostzones[2]};
  int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],
                      cctk_lsh[1] - cctk_nghostzones[1],
                      cctk_lsh[2] - cctk_nghostzones[2]};
  
  static struct OpenCLKernel *kernel = NULL;
  char const *const sources[] = {"", OpenCL_source_WaveToyOpenCL_evol, NULL};
  OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",
                           sources, groups, NULL, NULL, NULL, -1,
                           imin, imax, &kernel);
\end{verbatim}

The array \texttt{groups} specifies which grid functions are to be
available in the OpenCL kernel. This is a C array terminated by
NULL\@. (This information could instead also be gathered from the
respective \texttt{schedule.ccl} declarations.)

The integer arrays \texttt{imin} and \texttt{imax} specify the
iteration bounds of the kernel. This information is necessary so that
OpenCL can properly map this iteration space onto the available OpenCL
groups (threads).

The array \texttt{sources} (a C array terminated by NULL) specifies
the actual source code for the kernel. The first string (here empty)
can contain declarations and definitions that should be available
outside the kernel function. The second string specifies the actual
kernel code, excluding the actual function declaration which is
inserted automatically. This is an example for such a kernel code:

\begin{verbatim}
// Grid points are index in the same way as for a CPU
// Using ptrdiff_t instead of int is more efficient on 64-bit
// architectures
ptrdiff_t const di = 1;
ptrdiff_t const dj =
  CCTK_GFINDEX3D(cctkGH,0,1,0) - CCTK_GFINDEX3D(cctkGH,0,0,0);
ptrdiff_t const dk =
  CCTK_GFINDEX3D(cctkGH,0,0,1) - CCTK_GFINDEX3D(cctkGH,0,0,0);

// Coordinates are calculated in the same as as for a CPU
CCTK_REAL const idx2 = 1.0 / pown(CCTK_DELTA_SPACE(0), 2);
CCTK_REAL const idy2 = 1.0 / pown(CCTK_DELTA_SPACE(1), 2);
CCTK_REAL const idz2 = 1.0 / pown(CCTK_DELTA_SPACE(2), 2);
CCTK_REAL const dt2  = pown(CCTK_DELTA_TIME, 2);
  
// Note: The kernel below is not vectorised (since it doesn't use
// CCTK_REAL_VEC). Therefore, vectorisation must be switched off in
// the paramter file (via OpenCLRunTime::vector_size_x = 1).

// This loop macro automatically parallelizes the code
// imin[] and imax[] are passed from the host
LC_LOOP3(evol,
         i,j,k, imin[0],imin[1],imin[2], imax[0],imax[1],imax[2],
         cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])
{
  // Calculate index of current point
  ptrdiff_t const ijk = di*i + dj*j + dk*k;
  
  CCTK_REAL const dxxu = idx2 * (u_p[ijk-di] - 2.0 * u_p[ijk] + u_p[ijk+di]);
  CCTK_REAL const dyyu = idy2 * (u_p[ijk-dj] - 2.0 * u_p[ijk] + u_p[ijk+dj]);
  CCTK_REAL const dzzu = idz2 * (u_p[ijk-dk] - 2.0 * u_p[ijk] + u_p[ijk+dk]);
  
  CCTK_REAL const uval =
    +2.0 * u_p[ijk] - u_p_p[ijk] + dt2 * (dxxu + dyyu + dzzu);
  
  u[ijk] = uval;
  
} LC_ENDLOOP3(evol);
\end{verbatim}

The last argument \texttt{kernel} is used to store the compiled kernel
and associated information, so that kernels do not have to be
recompiled for every call.

In this case, the actual kernel source code is contained in a source
file \texttt{evol.cl} in thorn \texttt{WaveToyOpenCL}. The C string
\texttt{OpenCL\_source\_WaveToyOpenCL\_evol} is generated
automatically by Cactus as described in the users' guide and/or thorn
\texttt{OpenCL}\@.


\section{Details}

\subsection{Hardware information}

At startup, this thorn outputs a description of the hardware
(platforms and devices) available via OpenCL, i.e.\ CPUs and GPUs that
have OpenCL drivers installed. Platforms correspond to vendors (AMD,
Apple, Intel, Nvidia), devices to actual hardware (a CPU, a GPU,
etc.). This information is written into a file \texttt{opencl.txt} in
the output directory.

\subsection{Device selection}

At startup, this thorn selects one device of one platform that will be
used later on. It chooses the first device of the first platform that
matches the parameter \texttt{opencl\_device\_type}, which can be
\texttt{CPU}, \texttt{GPU}, \texttt{accelerator}, or \texttt{any}.

\subsection{Compiling kernels}

This thorn provides an API that can be used to compile kernels for
this device, and which remembers previously compiled kernels so that
they don't have to be recompiled. The compiler options for OpenCL are
specified by the parameter \texttt{opencl\_options} and enable
aggressive optimisations by default, as one would want for
floating-point intensive code that is not too susceptive to round-off
errors.

Cactus parameter values are expanded at compile time, enabling further
optimisations. (However, when a parameter value changes, the kernel is
not automatically recompiled -- steerable parameters are not yet
supported. This would be straightforward to implement.)

Typically, OpenCL compilers can optimise more than e.g.\ C or Fortran
compilers. The reason is that an OpenCL compiler knows the complete
code -- it is not possible to call routines that are defined
elsewhere, or to be influenced by changes originating elsewhere
(e.g.\ in another thread). Unfortunatelly, this does not mean that all
OpenCL compilers are good at optimising -- OpenCL is a fairly young
language, and some of the technology is still immature.

A file containing the exact source code passed to the compiler is
placed into the output directory with a name \texttt{KERNELNAME.cl}. A
log file containing all compiler output including error messages is
placed into the output directory with a name \texttt{KERNELNAME.log}.
Both are indispensable for debugging OpenCL code.

\subsection{Disassembling kernels}

This thorn disassembles the compiled kernels so that they can be
examined (which would otherwise be difficult in an environment using
dynamic compilation). The disassembled output is placed into the
output directory with a name \texttt{KERNELNAME.s}, if disassembling
is supported and makes sense. (For example, object files for Nvidia
GPUs contain PTX, which is essentially a high-level assembler code,
and are thus not disassembled.)

By default, kernels are disassembled in the background.

\subsection{Memory management}

This thorn allocates storage for grid functions on the device, and
handles copying data to and from the device (in interaction with thorn
\texttt{Accelerator}).

OpenCL devices have memory that is independent of the host memory.
This is the case even when using CPUs -- a particular memory region
cannot be accessed by host code and by OpenCL kernels at the same
time. This thorn offers several memory models (memory allocation
strategies):
\begin{description}
\item[always-mapped:] Host and device access the memory
  simultaneously. This may work, but violates the OpenCL standard.
  \textbf{Do not use this, unless you know that your implementation
    supports this.} If this does not work, some values in memory will
  randomly change.
\item[copy:] Host and device memory are allocated independently. Data
  will be copied. This makes sense e.g.\ for GPUs that have their own
  memory. This model also allows memory layout optimisation such as
  aligning grid functions with vector sizes or cache lines. Such
  layout optimisations are currently not supported by the Cactus flesh
  (but work is in progress to implement this there).
\item[map:] Device memory is allocated such that it (likely) coincides
  with the memory already allocated on the host. However, either only
  the host or only the device can access this memory at a time; the
  OpenCL run-time needs to be notified to switch between these. This
  memory model will save space, but may be slower if host memory
  cannot efficiently be accessed from the device. This memory model is
  also not yet fully tested.
\end{description}

Routines may execute either on the host (regular routines) or on a
device (OpenCL routines). Variables accessed (read or written) by
routines may need to be copied between host and device. Thorn
\texttt{Accelerator} keeps track of this, and notifies thorn
\texttt{OpenCLRunTime} when data need to be copied.

Data also need to be available on the host for inter-processor
synchronisation and I/O\@.

The parameter \texttt{sync\_copy\_whole\_buffer} determines whether
the whole grid function or only values on/near the boundary are copied
for synchronisation.

\subsection{Grid structure}

This thorn offers a set of convenience macros similar to
\texttt{CCTK\_ARGUMENTS} and \texttt{CCTK\_PARAMETERS} to access grid
structure information. Currently, only a subsect of the information in
\texttt{cctkGH} is available:

\begin{verbatim}
  ptrdiff_t cctk_lbnd[]
  ptrdiff_t cctk_lsh[]
  ptrdiff_t imin[]
  ptrdiff_t imax[]
  CCTK_REAL cctk_time
  CCTK_REAL cctk_delta_time
  CCTK_REAL cctk_origin_space[]
  CCTK_REAL cctk_delta_space[]
  CCTK_DELTA_TIME
  CCTK_ORIGIN_SPACE()
  CCTK_DELTA_SPACE()
  CCTK_GFINDEX3D()
\end{verbatim}

\texttt{cctk\_lbnd} and \texttt{cctk\_lsh} have the same meaning as on
the host. \texttt{imin} and \texttt{imax} contain the values specified
when calling \texttt{OpenCLRunTime\_CallKernel}, and determine the
loop bounds used in this kernel. The real-valued variables and their
macro counterparts have the same meaning as on the host. The type of
the integer fields has been changed from \texttt{int} to
\texttt{ptrdiff\_t}, which is a 64-bit type on 64-bit platforms, and
leads to more efficient code since it avoids type conversions.

\subsection{Loops}

This thorn offers looping macros similar to those provided by thorn
\texttt{LoopControl}, which parallelise loops via OpenCL's
multithreading.

The loop macros \texttt{LC\_LOOP3} and \texttt{LC\_ENDLOOP3} should be
called as in the example above. The first argument defines a name for
the loop, the next three arguments define the names of the iteration
indices. The remaining arguments describe the loop bounds and the
grid function size.

These macros need to be used. Each OpenCL thread will loop only over a
part of the region described by \texttt{imin} and \texttt{imax}. If
this macro is not used, OpenCL's multithreading may be used in an
inconsistent manner (unless you use OpenCL's API to distribute the
workload yourself).

\subsection{Vectorisation}

OpenCL supports vector data types. Using such vector data types is
important to achieve good performance on CPUs. This thorn provides
macros, in particular \texttt{CCTK\_REAL\_VEC}, that can be used for
this. Unfortunately, vectorisation has to be performed explicitly by
the kernel writer, and is not performed by this thorn. (However, note
that some OpenCL compilers can vectorise code automatically.)

When vectorising code explicitly, one needs to use special
instructions to load and store values from and to memory. This is not
(yet) described here; however, the macros are similar to those offered
by thorn \texttt{Vectors}. At the moment, these vectorisation
capabilities are targeted for automated code generation (e.g.\ by
Kranc) rather than for manual programming.


% \begin{thebibliography}{9}
% 
% \end{thebibliography}

% Do not delete next line
% END CACTUS THORNGUIDE

\end{document}