Map-Reduce schemes
The most important piece of code in the KeOps package is the one-dimensional, heavily templated Map-Reduce scheme that can be found in the GpuConv1D.cu CUDA file. Used as a default backend by the Genred operator, this standard distributed algorithm relies on principles that are exposed in the reference CUDA programming guide. In a nutshell, this scheme may be described as a tiled “for” loop on the reduction index j, parallelized over the output index i.
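To fix ideas, the snippet below is a minimal sequential sketch of the kind of computation that this scheme distributes: assuming, for illustration only, that the formula F is a Gaussian kernel product and that the Reduction is a sum, every output value a_i is obtained by reducing F(x_i, y_j, b_j) over the index j. The function name and the plain C++ containers are hypothetical.

```cpp
// Sequential reference: the kind of computation that the Map-Reduce schemes distribute.
// Hypothetical example where the formula F is a Gaussian kernel product and the
// Reduction is a sum, i.e.  a_i = sum_j exp(-|x_i - y_j|^2) * b_j,
// with M points x_i and N points y_j in dimension D.
#include <cmath>
#include <vector>

void kernel_product_cpu(int M, int N, int D,
                        const std::vector<float>& x,   // (M, D), row-major
                        const std::vector<float>& y,   // (N, D), row-major
                        const std::vector<float>& b,   // (N,)
                        std::vector<float>& a) {       // (M,), output
    for (int i = 0; i < M; ++i) {            // one output value per index i
        float acc = 0.0f;                    // neutral element of the sum
        for (int j = 0; j < N; ++j) {        // reduction index j
            float d2 = 0.0f;
            for (int d = 0; d < D; ++d) {
                float diff = x[i * D + d] - y[j * D + d];
                d2 += diff * diff;
            }
            acc += std::exp(-d2) * b[j];     // F(x_i, y_j, b_j), reduced on the spot
        }
        a[i] = acc;
    }
}
```

The GPU scheme described below is nothing more than a distributed, memory-aware version of this naive double loop.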
The 1D algorithm
More precisely, as illustrated in the Figure below – with the standard C++ convention of indexes that range “from 0 to N-1” – the algorithm can be described as follows:
1. Each block of K threads is attributed an index A that ranges between 0 and ⌈M/K⌉ - 1. This number may exceed the physical number of blocks that can run simultaneously on the GPU chip, but the nvcc compiler abstracts these technicalities away.

2. In every block, the K threads are indexed by an integer k in [0, K). The k-th thread is assigned to a fixed value of i = A·K + k. It loads the relevant values of the parameters p and of x_i from the Device to the Thread memory or registers, taking advantage of the speed-up for contiguous memory accesses: threads in the same block read neighboring memory addresses. Once again, the compiler handles the distribution of virtual workers on a fixed number of physical CUDA cores.

3. Each thread is instructed to compute a single value a_i = Reduction_j F(p, x_i, y_j) through a “for” loop on the values of j in [0, N). To minimize the transfers between the Device and Shared memories while maximizing the amount of contiguous memory accesses (as discussed previously), this j-loop is cut in blocks of size K: the large M-by-N plane of (i, j) indices is effectively cut in small K-by-K tiles, following a standard CUDA procedure. Having initialized a temporary buffer “a” (in the Thread memory) to the neutral element of the Reduction – 0 if it is a sum, +∞ if it is a minimum, etc. – the k-th thread of the block loops over values of the tile index B in [0, ⌈N/K⌉):

   a. Being assigned to an index j = B·K + k, the worker loads the relevant values of y_{B·K + k} from the Device to the Shared memory. This task is performed in conjunction with the other threads of the block and comes down to a contiguous transfer of a slice “j in [B·K, (B+1)·K)” of the y-data arrays from the “library” of the State department to the shared “office shelf”.

   b. The thread waits for latecomers and synchronizes with all workers in the same block: we don’t want to start the computing job if some of the y_j’s have not yet been loaded properly in the Shared memory!

   c. Making a loop over the reduction index j in [B·K, (B+1)·K), the worker:

      i. Loads the relevant values of the y_j’s from the Shared to the Thread memory.

      ii. Computes the value of F(p, x_i, y_j), with all variables standing close to the computing core in the Thread memory.

      iii. Reduces this value onto the running buffer “a”, in the Thread memory.

   d. Once again, the thread synchronizes with the other workers.

4. Once this large outer loop has been completed, the buffer “a” associated to the k-th thread contains our final value a_i. It is then saved from the Thread to the Device memory in an appropriate “output” array, alongside the other values in the “i” range [A·K, (A+1)·K) that have been computed by the block.
Figure. The default 1D Map-Reduce scheme used by the KeOps Genred engine can be described as a simple loop over the reduction index j, cut in K-by-K tiles.
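As an illustration of steps 1 to 4, here is a simplified, hand-written CUDA sketch of the 1D scheme, specialized to the Gaussian kernel product of the previous snippet with vectors of dimension D = 3. It is not the actual templated code of GpuConv1D.cu: the kernel name, data layout and fixed formula are assumptions made for readability.

```cpp
// Simplified CUDA sketch of the tiled 1D Map-Reduce scheme described above,
// specialized (for readability) to a Gaussian kernel product with D = 3.
// This illustrates steps 1-4; it is not the actual templated GpuConv1D.cu code.
#define D 3

__global__ void gauss_conv_1d(int M, int N,
                              const float* x,   // (M, D) "i" variable
                              const float* y,   // (N, D) "j" variable
                              const float* b,   // (N,)   "j" variable
                              float* out) {     // (M,)   output
    extern __shared__ float yj_shared[];        // K * (D + 1) floats per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // step 2: i = A*K + k

    float xi[D];                                 // step 2: keep x_i in registers
    if (i < M)
        for (int d = 0; d < D; ++d) xi[d] = x[i * D + d];

    float acc = 0.0f;                            // step 3: neutral element of the sum

    for (int start = 0; start < N; start += blockDim.x) {   // loop on the tile index B
        int j_load = start + threadIdx.x;
        if (j_load < N) {                        // step 3.a: collaborative, contiguous load
            for (int d = 0; d < D; ++d)
                yj_shared[threadIdx.x * (D + 1) + d] = y[j_load * D + d];
            yj_shared[threadIdx.x * (D + 1) + D] = b[j_load];
        }
        __syncthreads();                         // step 3.b: wait for latecomers

        if (i < M) {
            int tile_size = min((int)blockDim.x, N - start);
            for (int jl = 0; jl < tile_size; ++jl) {          // step 3.c: inner j-loop
                const float* yj = &yj_shared[jl * (D + 1)];   // 3.c.i: Shared -> Thread
                float d2 = 0.0f;
                for (int d = 0; d < D; ++d) {                 // 3.c.ii: compute F(x_i, y_j)
                    float diff = xi[d] - yj[d];
                    d2 += diff * diff;
                }
                acc += expf(-d2) * yj[D];                     // 3.c.iii: reduce onto "a"
            }
        }
        __syncthreads();                         // step 3.d: synchronize again
    }

    if (i < M) out[i] = acc;                     // step 4: write back to Device memory
}
```

Such a kernel would typically be launched with one virtual thread per index i and a Shared memory buffer sized for one tile, e.g. gauss_conv_1d<<<(M + K - 1) / K, K, K * (D + 1) * sizeof(float)>>>(M, N, x_d, y_d, b_d, out_d) for some block size K such as 128 or 192.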
Performances
Like most efficient CUDA programs, the algorithm presented above is pretty verbose: a full page of tedious memory transfers surrounds what is, at heart, a good old “for-for” loop. Crucially though, our efforts pay off: as evidenced by our benchmarks, KeOps typically provides a x30/x10,000 speed-up when compared with tensorized PyTorch-GPU/NumPy-CPU implementations of the same kernel dot product, while keeping a linear (instead of quadratic) memory footprint.
This efficiency mostly comes down to the fact that instead of storing the M-by-N values of F(p, x_i, y_j) in a full “kernel matrix” that resides in the Device memory, our scheme computes them on-the-fly in the Thread memory and reduces them on the spot, so that the memory footprint of the computation stays linear in M + N.
Note that this level of performance could not have been achieved with high-level Python code: PyTorch and TensorFlow variables always refer to arrays that are stored in the Device memory. Writing C++ CUDA programs is the only way of getting explicit access to the Shared and Thread memories. As discussed in the next sections, supporting generic formulas and reductions with KeOps thus required the implementation of a fully-fledged symbolic math engine, within the CUDA framework, that could be executed inside our loop at steps 3.c.ii-iii above.
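To give a rough idea of how such an engine plugs into the scheme, the sketch below shows how a formula and a reduction could be encoded as small device functors and inlined at compile time into the inner loop of steps 3.c.ii-iii. The real KeOps engine generates this kind of code from a symbolic expression; the structs and names below are purely illustrative.

```cpp
// Rough illustration of how a generic formula F and a generic Reduction can be
// injected at compile time into the inner loop of the scheme above. The actual
// KeOps engine generates this kind of code from a symbolic formula; the
// hand-written functors below are only stand-ins for that mechanism.
struct GaussKernelFormula {
    static const int DIM_X = 3, DIM_OUT = 1;
    __device__ static void eval(float* out, const float* xi,
                                const float* yj, const float* bj) {
        float d2 = 0.0f;
        for (int d = 0; d < DIM_X; ++d) {
            float diff = xi[d] - yj[d];
            d2 += diff * diff;
        }
        out[0] = expf(-d2) * bj[0];
    }
};

struct SumReduction {
    __device__ static void init(float* acc) { acc[0] = 0.0f; }                  // neutral element
    __device__ static void reduce(float* acc, const float* val) { acc[0] += val[0]; }
};

// Inside the tiled kernel, steps 3.c.ii-iii then read, for any formula/reduction pair:
template <class FORMULA, class REDUCTION>
__device__ void reduce_one_term(float* acc, const float* xi,
                                const float* yj, const float* bj) {
    float val[FORMULA::DIM_OUT];          // temporary value, kept in registers
    FORMULA::eval(val, xi, yj, bj);       // step 3.c.ii: compute F(p, x_i, y_j)
    REDUCTION::reduce(acc, val);          // step 3.c.iii: reduce onto the running buffer
}
```

Because everything is resolved at compile time, the nvcc compiler can inline the formula and the reduction into the loop body, keeping all intermediate values in registers.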
The 2D algorithm
The “GPU_1D” algorithm that we just presented is efficient whenever the number M of “indexing lines” x_i is large enough to keep all the CUDA cores of the chip busy. Nevertheless, to provide cover for cases where the number of “indexing lines” M is much smaller than the size N of the reduction range, KeOps also implements a 2D Map-Reduce scheme in the GpuConv2D.cu CUDA file.