% *======================================================================* % Cactus Thorn template for ThornGuide documentation % Author: Ian Kelley % Date: Sun Jun 02, 2002 % $Header$ % % Thorn documentation in the latex file doc/documentation.tex % will be included in ThornGuides built with the Cactus make system. % The scripts employed by the make system automatically include % pages about variables, parameters and scheduling parsed from the % relevant thorn CCL files. % % This template contains guidelines which help to assure that your % documentation will be correctly added to ThornGuides. More % information is available in the Cactus UsersGuide. % % Guidelines: % - Do not change anything before the line % % START CACTUS THORNGUIDE", % except for filling in the title, author, date, etc. fields. % - Each of these fields should only be on ONE line. % - Author names should be separated with a \\ or a comma. % - You can define your own macros, but they must appear after % the START CACTUS THORNGUIDE line, and must not redefine standard % latex commands. % - To avoid name clashes with other thorns, 'labels', 'citations', % 'references', and 'image' names should conform to the following % convention: % ARRANGEMENT_THORN_LABEL % For example, an image wave.eps in the arrangement CactusWave and % thorn WaveToyC should be renamed to CactusWave_WaveToyC_wave.eps % - Graphics should only be included using the graphicx package. % More specifically, with the "\includegraphics" command. Do % not specify any graphic file extensions in your .tex file. This % will allow us to create a PDF version of the ThornGuide % via pdflatex. % - References should be included with the latex "\bibitem" command. % - Use \begin{abstract}...\end{abstract} instead of \abstract{...} % - Do not use \appendix, instead include any appendices you need as % standard sections. % - For the benefit of our Perl scripts, and for future extensions, % please use simple latex. 
%
% *======================================================================*
%
% Example of including a graphic image:
%    \begin{figure}[ht]
%       \begin{center}
%          \includegraphics[width=6cm]{MyArrangement_MyThorn_MyFigure}
%       \end{center}
%       \caption{Illustration of this and that}
%       \label{MyArrangement_MyThorn_MyLabel}
%    \end{figure}
%
% Example of using a label:
%   \label{MyArrangement_MyThorn_MyLabel}
%
% Example of a citation:
%    \cite{MyArrangement_MyThorn_Author99}
%
% Example of including a reference
%   \bibitem{MyArrangement_MyThorn_Author99}
%   {J. Author, {\em The Title of the Book, Journal, or periodical}, 1 (1999),
%   1--16. {\tt http://www.nowhere.com/}}
%
% *======================================================================*

% If you are using CVS use this line to give version information
% $Header$

\documentclass{article}

% Use the Cactus ThornGuide style file
% (Automatically used from Cactus distribution, if you have a
%  thorn without the Cactus Flesh download this from the Cactus
%  homepage at www.cactuscode.org)
\usepackage{../../../../doc/latex/cactus}

\begin{document}

% The author of the documentation
\author{Erik Schnetter \textless eschnetter@perimeterinstitute.ca\textgreater}

% The title of the document (not necessarily the name of the Thorn)
\title{OpenCLRunTime}

% the date your document was last changed, if your document is in CVS,
% please use:
%    \date{$ $Date: 2004-01-07 14:12:39 -0600 (Wed, 07 Jan 2004) $ $}
\date{May 17, 2012}

\maketitle

% Do not delete next line
% START CACTUS THORNGUIDE

% Add all definitions used in this documentation here
%   \def\mydef etc

% Add an abstract for this thorn's documentation
\begin{abstract}
  Executing OpenCL kernels requires some boilerplate code: One needs
  to choose an OpenCL platform and device, needs to compile the code
  (from a C string), needs to pass in arguments, and finally needs to
  execute the actual kernel code. This thorn \texttt{OpenCLRunTime}
  provides a simple helper routine for these tasks.
\end{abstract} \section{Overview} Thorn \texttt{OpenCLRunTime} performs the following tasks: \begin{itemize} \item At startup, it outputs a description of the hardware (platforms and devices) available via OpenCL, i.e.\ CPUs and GPUs that have OpenCL drivers installed. \item At startup, it selects one device of one platform that will be used later on. \item It provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don't have to be recompiled. \item It disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation). \item It allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn \texttt{Accelerator}). \item It offers a set of convenience macros similar to \texttt{CCTK\_ARGUMENTS} and \texttt{CCTK\_PARAMETERS} to access grid structure information. (Parameter values are expanded when compiling kernels, often enabling additional optimisations.) \item It offers looping macros similar to those provided by thorn \texttt{LoopControl}, which parallelise loops via OpenCL's multithreading. \item It offers datatypes and macros for easy vectorisation, based on OpenCL's vector types. \end{itemize} At the moment, \texttt{OpenCLRunTime} only supports unigrid simulations; adaptive mesh refinement or multi-block methods are not yet supported. (The main reason for this is that the \texttt{cctkGH} entries on the device are not updated; however, this should be straightforward to implement.) 
\section{Example}

An OpenCL compute kernel may be called as follows:
\begin{verbatim}
  char const *const groups[] = {
    "WaveToyOpenCL::Scalar",
    NULL};

  int const imin[] = {cctk_nghostzones[0],
                      cctk_nghostzones[1],
                      cctk_nghostzones[2]};
  int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],
                      cctk_lsh[1] - cctk_nghostzones[1],
                      cctk_lsh[2] - cctk_nghostzones[2]};

  static struct OpenCLKernel *kernel = NULL;
  char const *const sources[] =
    {"", OpenCL_source_WaveToyOpenCL_evol, NULL};
  OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",
                           sources, groups, NULL, NULL, NULL, -1,
                           imin, imax, &kernel);
\end{verbatim}

The array \texttt{groups} specifies which grid functions are to be
available in the OpenCL kernel. This is a C array terminated by
NULL\@. (This information could instead also be gathered from the
respective \texttt{schedule.ccl} declarations.)

The integer arrays \texttt{imin} and \texttt{imax} specify the
iteration bounds of the kernel. This information is necessary so that
OpenCL can properly map this iteration space onto the available OpenCL
groups (threads).

The array \texttt{sources} (a C array terminated by NULL) specifies
the actual source code for the kernel. The first string (here empty)
can contain declarations and definitions that should be available
outside the kernel function. The second string specifies the actual
kernel code, excluding the actual function declaration which is
inserted automatically.
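
The bounds computation in the call above can be factored into a small
helper. The following is only a sketch: the function names
\texttt{interior\_bounds} and \texttt{interior\_size} are hypothetical
and not part of \texttt{OpenCLRunTime}, and the sketch assumes (as the
example suggests) that \texttt{imax} is an exclusive upper bound.
\begin{verbatim}
/* Hypothetical helper, not part of OpenCLRunTime: compute the
   half-open interior range [imin, imax) of a local grid by excluding
   the ghost zones on both sides in each direction. */
static void interior_bounds(int const lsh[3], int const nghostzones[3],
                            int imin[3], int imax[3])
{
  for (int d = 0; d < 3; ++d) {
    imin[d] = nghostzones[d];
    imax[d] = lsh[d] - nghostzones[d];
  }
}

/* Number of points visited by a loop over [imin, imax); returns zero
   if the range is empty in any direction. */
static long interior_size(int const imin[3], int const imax[3])
{
  long n = 1;
  for (int d = 0; d < 3; ++d) {
    if (imax[d] <= imin[d]) return 0;
    n *= imax[d] - imin[d];
  }
  return n;
}
\end{verbatim}
Checking that \texttt{interior\_size} is non-zero before calling a
kernel is one way to catch grids that are too small for the chosen
number of ghost zones.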
The following is an example of kernel code that can be passed as the
second string in \texttt{sources}:
\begin{verbatim}
  // Grid points are indexed in the same way as for a CPU
  // Using ptrdiff_t instead of int is more efficient on 64-bit
  // architectures
  ptrdiff_t const di = 1;
  ptrdiff_t const dj =
    CCTK_GFINDEX3D(cctkGH,0,1,0) - CCTK_GFINDEX3D(cctkGH,0,0,0);
  ptrdiff_t const dk =
    CCTK_GFINDEX3D(cctkGH,0,0,1) - CCTK_GFINDEX3D(cctkGH,0,0,0);

  // Coordinates are calculated in the same way as for a CPU
  CCTK_REAL const idx2 = 1.0 / pown(CCTK_DELTA_SPACE(0), 2);
  CCTK_REAL const idy2 = 1.0 / pown(CCTK_DELTA_SPACE(1), 2);
  CCTK_REAL const idz2 = 1.0 / pown(CCTK_DELTA_SPACE(2), 2);
  CCTK_REAL const dt2  = pown(CCTK_DELTA_TIME, 2);

  // Note: The kernel below is not vectorised (since it doesn't use
  // CCTK_REAL_VEC). Therefore, vectorisation must be switched off in
  // the parameter file (via OpenCLRunTime::vector_size_x = 1).

  // This loop macro automatically parallelises the code
  // imin[] and imax[] are passed from the host
  LC_LOOP3(evol,
           i,j,k, imin[0],imin[1],imin[2], imax[0],imax[1],imax[2],
           cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])
  {
    // Calculate index of current point
    ptrdiff_t const ijk = di*i + dj*j + dk*k;

    CCTK_REAL const dxxu = idx2 * (u_p[ijk-di] - 2.0 * u_p[ijk] + u_p[ijk+di]);
    CCTK_REAL const dyyu = idy2 * (u_p[ijk-dj] - 2.0 * u_p[ijk] + u_p[ijk+dj]);
    CCTK_REAL const dzzu = idz2 * (u_p[ijk-dk] - 2.0 * u_p[ijk] + u_p[ijk+dk]);

    CCTK_REAL const uval =
      +2.0 * u_p[ijk] - u_p_p[ijk] + dt2 * (dxxu + dyyu + dzzu);

    u[ijk] = uval;

  } LC_ENDLOOP3(evol);
\end{verbatim}

The last argument of \texttt{OpenCLRunTime\_CallKernel},
\texttt{kernel}, is used to store the compiled kernel and associated
information, so that kernels do not have to be recompiled for every
call.

In this case, the actual kernel source code is contained in a source
file \texttt{evol.cl} in thorn \texttt{WaveToyOpenCL}. The C string
\texttt{OpenCL\_source\_WaveToyOpenCL\_evol} is generated
automatically by Cactus as described in the users' guide and/or thorn
\texttt{OpenCL}\@.
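
For reference, the loop body of the example kernel can be written as a
formula. Assuming that \texttt{u}, \texttt{u\_p}, and
\texttt{u\_p\_p} hold the solution at time levels $n+1$, $n$, and
$n-1$ (the usual Cactus naming for past time levels), the kernel
evaluates
\[
  u^{n+1}_{i,j,k} = 2\, u^{n}_{i,j,k} - u^{n-1}_{i,j,k}
  + \Delta t^2 \left( D_{xx} + D_{yy} + D_{zz} \right) u^{n}_{i,j,k} ,
\]
with the second-order centred difference
\[
  D_{xx} u^{n}_{i,j,k} =
  \frac{u^{n}_{i-1,j,k} - 2\, u^{n}_{i,j,k} + u^{n}_{i+1,j,k}}{\Delta x^2}
\]
and analogous expressions for $D_{yy}$ and $D_{zz}$. Here
\texttt{idx2} corresponds to $1/\Delta x^2$ and \texttt{dt2} to
$\Delta t^2$. Since the stencil reaches one grid point in each
direction, the loop bounds must exclude at least one layer of points
at each face, which the example achieves via
\texttt{cctk\_nghostzones}, provided there is at least one ghost zone
in each direction.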
\section{Details}

\subsection{Hardware information}

At startup, this thorn outputs a description of the hardware
(platforms and devices) available via OpenCL, i.e.\ CPUs and GPUs that
have OpenCL drivers installed. Platforms correspond to vendors (AMD,
Apple, Intel, Nvidia), devices to actual hardware (a CPU, a GPU,
etc.).

This information is written into a file \texttt{opencl.txt} in the
output directory.

\subsection{Device selection}

At startup, this thorn selects one device of one platform that will be
used later on. It chooses the first device of the first platform that
matches the parameter \texttt{opencl\_device\_type}, which can be
\texttt{CPU}, \texttt{GPU}, \texttt{accelerator}, or \texttt{any}.

\subsection{Compiling kernels}

This thorn provides an API that can be used to compile kernels for
this device, and which remembers previously compiled kernels so that
they don't have to be recompiled.

The compiler options for OpenCL are specified by the parameter
\texttt{opencl\_options} and enable aggressive optimisations by
default, as one would want for floating-point intensive code that is
not too susceptible to round-off errors.

Cactus parameter values are expanded at compile time, enabling further
optimisations. (However, when a parameter value changes, the kernel is
not automatically recompiled -- steerable parameters are not yet
supported. This would be straightforward to implement.)

Typically, OpenCL compilers can optimise more than e.g.\ C or Fortran
compilers. The reason is that an OpenCL compiler knows the complete
code -- it is not possible to call routines that are defined
elsewhere, or to be influenced by changes originating elsewhere
(e.g.\ in another thread). Unfortunately, this does not mean that all
OpenCL compilers are good at optimising -- OpenCL is a fairly young
language, and some of the technology is still immature.
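
To illustrate what expanding a parameter value at compile time can
mean, the following sketch turns an integer parameter into a
preprocessor constant that could be prepended to the kernel source, so
that the OpenCL compiler sees a literal instead of a run-time
variable. This is an assumption made for illustration only: the
function \texttt{define\_int\_parameter} is hypothetical, and the
mechanism actually used by this thorn may differ.
\begin{verbatim}
#include <stdio.h>

/* Hypothetical illustration, not the thorn's actual code: write a
   line "#define NAME (VALUE)" into buf. Returns 1 on success and 0
   if the line did not fit into the buffer. */
static int define_int_parameter(char *buf, size_t size,
                                char const *name, int value)
{
  int const len = snprintf(buf, size, "#define %s (%d)\n", name, value);
  return len >= 0 && (size_t)len < size;
}
\end{verbatim}
With a constant of this kind in place, a compiler can for example
unroll loops whose trip count depends on the parameter, or remove
branches that test it. It also shows why a changed parameter value
requires recompiling the kernel.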
A file containing the exact source code passed to the compiler is placed into the output directory with a name \texttt{KERNELNAME.cl}. A log file containing all compiler output including error messages is placed into the output directory with a name \texttt{KERNELNAME.log}. Both are indispensable for debugging OpenCL code. \subsection{Disassembling kernels} This thorn disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation). The disassembled output is placed into the output directory with a name \texttt{KERNELNAME.s}, if disassembling is supported and makes sense. (For example, object files for Nvidia GPUs contain PTX, which is essentially a high-level assembler code, and are thus not disassembled.) By default, kernels are disassembled in the background. \subsection{Memory management} This thorn allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn \texttt{Accelerator}). OpenCL devices have memory that is independent of the host memory. This is the case even when using CPUs -- a particular memory region cannot be accessed by host code and by OpenCL kernels at the same time. This thorn offers several memory models (memory allocation strategies): \begin{description} \item[always-mapped:] Host and device access the memory simultaneously. This may work, but violates the OpenCL standard. \textbf{Do not use this, unless you know that your implementation supports this.} If this does not work, some values in memory will randomly change. \item[copy:] Host and device memory are allocated independently. Data will be copied. This makes sense e.g.\ for GPUs that have their own memory. This model also allows memory layout optimisation such as aligning grid functions with vector sizes or cache lines. Such layout optimisations are currently not supported by the Cactus flesh (but work is in progress to implement this there). 
\item[map:] Device memory is allocated such that it (likely) coincides
  with the memory already allocated on the host. However, either only
  the host or only the device can access this memory at a time; the
  OpenCL run-time needs to be notified to switch between these. This
  memory model will save space, but may be slower if host memory
  cannot efficiently be accessed from the device. This memory model is
  also not yet fully tested.
\end{description}

Routines may execute either on the host (regular routines) or on a
device (OpenCL routines). Variables accessed (read or written) by
routines may need to be copied between host and device. Thorn
\texttt{Accelerator} keeps track of this, and notifies thorn
\texttt{OpenCLRunTime} when data need to be copied.

Data also need to be available on the host for inter-processor
synchronisation and I/O\@. The parameter
\texttt{sync\_copy\_whole\_buffer} determines whether the whole grid
function or only values on/near the boundary are copied for
synchronisation.

\subsection{Grid structure}

This thorn offers a set of convenience macros similar to
\texttt{CCTK\_ARGUMENTS} and \texttt{CCTK\_PARAMETERS} to access grid
structure information.

Currently, only a subset of the information in \texttt{cctkGH} is
available:
\begin{verbatim}
  ptrdiff_t cctk_lbnd[]
  ptrdiff_t cctk_lsh[]
  ptrdiff_t imin[]
  ptrdiff_t imax[]
  CCTK_REAL cctk_time
  CCTK_REAL cctk_delta_time
  CCTK_REAL cctk_origin_space[]
  CCTK_REAL cctk_delta_space[]
  CCTK_DELTA_TIME
  CCTK_ORIGIN_SPACE()
  CCTK_DELTA_SPACE()
  CCTK_GFINDEX3D()
\end{verbatim}

\texttt{cctk\_lbnd} and \texttt{cctk\_lsh} have the same meaning as on
the host. \texttt{imin} and \texttt{imax} contain the values specified
when calling \texttt{OpenCLRunTime\_CallKernel}, and determine the
loop bounds used in this kernel. The real-valued variables and their
macro counterparts have the same meaning as on the host.
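
As an illustration of what an index macro such as
\texttt{CCTK\_GFINDEX3D} computes, the following sketch linearises a
three-dimensional index. It assumes the usual Cactus memory layout in
which the first index varies fastest, and it assumes that grid
functions are stored without padding. The function name
\texttt{gfindex3d} is hypothetical, and the macro available in kernels
may differ in detail.
\begin{verbatim}
#include <stddef.h>

/* Sketch of a linear index for the point (i,j,k) on a grid with
   local shape lsh[], assuming that i varies fastest and that there
   is no padding between grid lines or planes. */
static ptrdiff_t gfindex3d(ptrdiff_t const lsh[3],
                           ptrdiff_t i, ptrdiff_t j, ptrdiff_t k)
{
  return i + lsh[0] * (j + lsh[1] * k);
}
\end{verbatim}
Under these assumptions, stepping by one point in the $x$, $y$, and
$z$ directions changes the linear index by $1$, \texttt{lsh[0]}, and
\texttt{lsh[0]*lsh[1]}, respectively.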
The type of the integer fields has been changed from \texttt{int} to
\texttt{ptrdiff\_t}, which is a 64-bit type on 64-bit platforms, and
leads to more efficient code since it avoids type conversions.

\subsection{Loops}

This thorn offers looping macros similar to those provided by thorn
\texttt{LoopControl}, which parallelise loops via OpenCL's
multithreading.

The loop macros \texttt{LC\_LOOP3} and \texttt{LC\_ENDLOOP3} should be
called as in the example above. The first argument defines a name for
the loop, the next three arguments define the names of the iteration
indices. The remaining arguments describe the loop bounds and the grid
function size.

These macros need to be used. Each OpenCL thread will loop only over a
part of the region described by \texttt{imin} and \texttt{imax}. If
these macros are not used, OpenCL's multithreading may be used in an
inconsistent manner (unless you use OpenCL's API to distribute the
workload yourself).

\subsection{Vectorisation}

OpenCL supports vector data types. Using such vector data types is
important to achieve good performance on CPUs. This thorn provides
macros, in particular \texttt{CCTK\_REAL\_VEC}, that can be used for
this.

Unfortunately, vectorisation has to be performed explicitly by the
kernel writer, and is not performed by this thorn. (However, note that
some OpenCL compilers can vectorise code automatically.)

When vectorising code explicitly, one needs to use special
instructions to load and store values from and to memory. This is not
(yet) described here; however, the macros are similar to those offered
by thorn \texttt{Vectors}. At the moment, these vectorisation
capabilities are targeted at automated code generation (e.g.\ by
Kranc) rather than at manual programming.

% \begin{thebibliography}{9}
%
% \end{thebibliography}

% Do not delete next line
% END CACTUS THORNGUIDE

\end{document}