Map-Reduce schemes

The most important piece of code in the KeOps package is the one-dimensional, heavily templated Map-Reduce scheme that can be found in the CUDA file. Used as a default backend by the Genred operator, this standard distributed algorithm relies on principles that are exposed in the reference CUDA programming guide

In a nutshell, this scheme may be described as a tiled “for” loop on the reduction index \(j\), parallelized over the sole index \(i\) – hence the “1D” denomination – which reduces the computed values of \(F(p^1,\dots, x^1_i, \dots, y^1_j, \dots)\) on-the-fly, without ever storing or sending them to the Device memory.

The 1D algorithm

More precisely, as illustrated in the Figure below – with the standard C++ convention of indexes that range “from \(0\) to \(\mathrm{N} - 1\)” – we may decompose the instructions executed by our CUDA blocks as follows:

  1. Each block of \(\mathrm{K}\) threads is attributed an index \(\mathrm{A}\) that ranges between \(0\) and \(\lceil \mathrm{M} / \mathrm{K} \rceil - 1\). This number may exceed the physical number of blocks that can run simultaneously on the GPU chip, but the nvcc compiler abstracts these technicalities away.

  2. In every block, the \(\mathrm{K}\) threads are indexed by an integer \(k \in [\![0, \mathrm{K}[\![~~ = [\![0, \mathrm{K}-1]\!]\). The \(k\)-th thread is assigned to a fixed value of \(i = k + \mathrm{AK}\). It loads the relevant values of \(p^1, \dots, p^P\) and \(x^1_i, \dots, x^X_i\) from the Device to the Thread memory or register, taking advantage of the speed-up for contiguous memory accesses: threads in the same block read neighboring memory adresses. Once again, the compiler handles the distribution of \(\mathrm{K}\) virtual workers on a fixed number of physical CUDA cores.

  3. Each thread is instructed to compute a single value \(a_i = a_{k+\mathrm{AK}}\) through a “for” loop on the values of \(j\) in \(\left[\!\left[ 1,\mathrm{N} \right]\!\right]\). To minimize the transfers between the Device and Shared memories while maximizing the amount of contiguous memory accesses (as discussed page ), this \(j\)-loop is cut in blocks of size \(\mathrm{K}\): the large \(\mathrm{M}\)-by-\(\mathrm{N}\) plane of \((i,j)\) indices is effectively cut in small \(\mathrm{K}\)-by-\(\mathrm{K}\) tiles, following a standard CUDA procedure. Having initialized a temporary buffer “\(a\)” (in the Thread memory) to the neutral element of the \(\operatorname{Reduction}\)\(0\) if it is a sum, \(+\infty\) if it is a minimum, etc. – the \(k\)-th thread of the block loops over values of the tile index \(\mathrm{B}\) in \(\left[\!\left[0, \lceil \mathrm{N} / \mathrm{K} \rceil - 1\right]\!\right]\):

    1. Being assigned to an index \(j_k = k + \mathrm{BK}\), the worker loads the relevant values of \(y^1_{j_k}, \dots, y^Y_{j_k}\) from the Device to the Shared memory. This task is performed in conjunction with the other threads of the block and comes down to a contiguous transfer of a slice “\(j \in [\![ \mathrm{BK}, \mathrm{BK + K} [\![\)” of the \(y\)-data arrays from the “library” of the State department to the shared “office shelf”.

    2. The thread waits for latecomers and synchronizes with all workers in the same block: we don’t want to start the computing job if some of the \(y_j\)’s have not yet been loaded properly in the Shared memory!

    3. Making a loop over the reduction index \(j\) in \([\![\mathrm{BK}, \mathrm{BK + K} [\![\), the worker:

      1. Loads the relevant values of the \(y_j\)’s from the Shared to the Thread memory.

      2. Computes the value of \(F(p^1,\dots, x^1_i, \dots, y^1_j, \dots)\), with all variables standing close to the computing core in the Thread memory.

      3. Reduces this value onto the running buffer \(a\), in the Thread memory.

    4. Once again, the thread synchronizes with the other workers.

  4. Once this large outer loop has been completed, the buffer \(a\) associated to the \(k\)-th thread contains our final value \(a_{k+\mathrm{AK}}\). It is then saved from the Thread to the Device memory in an appropriate “output” array, alongside the other values in the “\(i \in [\![\mathrm{AK}, \mathrm{AK + K} [\![\)” range that have been computed by the block.

Naive scheme.

(a) Simple, ideal scheme.

Tiled scheme.

(b) Optimized GPU scheme.

Figure. The default 1D Map-Reduce scheme used by the KeOps Genred engine can be described as a simple loop over the reduction index \(j\) that is optimized for GPU chips. (a) Each thread \(i\) computes one of the \(a_i\)‘s by looping over the reduction index \(j\) and eating-up the values of \(F\) on-the-fly. (b) Due to the importance of the Shared memory and block-wise memory accesses, (a) is cut in \(\mathrm{K}\)-by-\(\mathrm{K}\) tiles to ensure an optimal management of the \(y_j\)‘s.


As most efficient CUDA programs, the algorithm presented above is pretty verbose: a full page of tedious memory transfers surrounds what is, at heart, a good old “for-for” loop. Crucially though, our efforts pay off: as evidenced by our benchmarks, KeOps typically provides a x30/x10,000 speed-up when compared with tensorized PyTorch-GPU/NumPy-CPU implementations of the same kernel dot product, while keeping a linear (instead of quadratic) memory footprint.

This efficiency mostly comes down to the fact that instead of storing the \(\mathrm{M}\)-by-\(\mathrm{N}\) computed values of \(F(p^1,\dots, x^1_i, \dots, y^1_j, \dots)\) in superfluous quadratic buffers (such as the “kernel matrix”), generating at least \(2\mathrm{M}\mathrm{N}\) high-latency transfers between the Thread and the Device memories, KeOps maximizes the use of the Shared memory and consumes the relevant values of \(F(p^1,\dots, x^1_i, \dots, y^1_j, \dots)\) on-the-spot, in the registers of the CUDA cores.

Note that this level of performance could not have been achieved with high-level Python code: PyTorch and TensorFlow variables always refer to arrays that are stored in the Device memory. Writing C++ CUDA programs is the only way of getting an explicit access to the Shared and Thread memories. As discussed in the next sections, supporting generic formulas and reductions with KeOps thus required the implementation of a fully-fledged symbolic math engine, within the CUDA framework, that could be executed inside our loop at steps 3.3.2-3.

The 2D algorithm

The “GPU_1D” algorithm that we just presented is efficient whenever \(\mathrm{M}\) is larger than the number of CUDA cores available on the chip: no thread stays idle. This is generally the case in shape analysis and data sciences, where the support of batch processing by KeOps allows programmers to fully saturate their GPUs with large input tensors.

Nevertheless, to provide cover for cases where the number of “indexing lines” \(\mathrm{M}\) is much smaller than the size of the “reduction range” \(\mathrm{N}\), KeOps also implements a 2D Map-Reduce scheme in the CUDA file. Assigning the \(\mathrm{K}\)-by-\(\mathrm{K}\) tiles of the computation plan one-by-one to the CUDA blocks – instead of using a line-wise grouping method – this algorithm requires the allocation of intermediate buffers but makes sure that no block stays idle during the computation.