Block-sparsity
Complexity of the KeOps routines.
Notwithstanding their finely grained management of memory, the GPU_1D and GPU_2D schemes have a quadratic time complexity: if $M$ and $N$ denote the numbers of $i$- and $j$-variables, $O(M \cdot N)$ coefficients are computed to implement a discrete Gaussian convolution:

$$a_i \;=\; \sum_{j=1}^{N} \exp\big(-\|x_i - y_j\|^2 / 2\sigma^2\big)\, b_j \qquad \text{for } i \in \{1, \dots, M\}.$$
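For concreteness, here is what this dense computation looks like with the LazyTensor interface discussed below. This is a minimal sketch: the sizes $M$, $N$ and the bandwidth sigma are arbitrary placeholders.

```python
# Minimal sketch of a dense Gaussian convolution with KeOps:
# all M*N kernel coefficients are evaluated.
import torch
from pykeops.torch import LazyTensor

M, N, sigma = 100000, 200000, 0.1        # placeholder sizes and bandwidth
x = torch.randn(M, 3)                    # target points x_i
y = torch.randn(N, 3)                    # source points y_j
b = torch.randn(N, 1)                    # signal b_j

x_i = LazyTensor(x[:, None, :])          # (M, 1, 3) symbolic tensor
y_j = LazyTensor(y[None, :, :])          # (1, N, 3) symbolic tensor
D_ij = ((x_i - y_j) ** 2).sum(-1)        # (M, N) squared distances, never materialized
K_ij = (-D_ij / (2 * sigma ** 2)).exp()  # symbolic Gaussian kernel matrix
a = K_ij @ b                             # O(M*N) reduction: a_i = sum_j K(x_i, y_j) b_j
```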
Can we do better?
To break through this quadratic lower bound, a simple idea is to skip some computations, using a sparsity prior on the kernel matrix. For instance, we could decide to skip kernel computations when points $x_i$ and $y_j$ are far away from each other.
Sparsity on the CPU.
On CPUs, a standard strategy is to use sparse matrices, encoding our
operators through lists of non-zero coefficients and indices.
Schematically, this comes down to endowing each index $i \in \{1, \dots, M\}$ with a set $\mathcal{N}(i) \subset \{1, \dots, N\}$ of $j$-neighbors, and to restricting ourselves to computations of the form:

$$a_i \;=\; \sum_{j \in \mathcal{N}(i)} K(x_i, y_j)\, b_j.$$
This approach is very well suited to matrices which only have a handful of nonzero coefficients per row, such as the intrinsic Laplacian of a 3D mesh. But on large, densely connected problems, sparse encoding runs into a major issue: since it relies on non-contiguous memory accesses, it scales poorly on GPUs.
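For concreteness, here is a minimal sketch of this strategy with SciPy's CSR format; the 4×4 matrix and its coefficients are purely illustrative:

```python
# Minimal sketch of the CPU strategy: a sparse operator stored as
# lists of non-zero coefficients and indices (CSR format).
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative 4x4 operator with a handful of non-zeros per row:
data    = np.array([1.0, 0.5, 0.5, 2.0, 0.5])  # non-zero coefficients
indices = np.array([0,   2,   1,   2,   3  ])  # their column (j) indices
indptr  = np.array([0, 2, 3, 4, 5])            # row k spans data[indptr[k]:indptr[k+1]]
K = csr_matrix((data, indices, indptr), shape=(4, 4))

b = np.ones(4)
a = K @ b  # only the 5 stored coefficients are touched...
           # ...but through non-contiguous, index-driven memory accesses.
```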
Block-sparse reductions
As explained in our introduction,
GPU chips are wired to rely on coalesced memory
operations which load blocks of dozens of contiguous bytes at once.
Instead of allowing the use of arbitrary indexing sets $\mathcal{N}(i)$, which translate into scattered and thus inefficient memory accesses, we restrict ourselves to computations of the form:

$$a_i \;=\; \sum_{j \in \mathcal{N}(i)} K(x_i, y_j)\, b_j \qquad \text{for } i \in [\mathrm{Start}_q, \mathrm{End}_q),$$

where:

1. The $Q$ intervals $[\mathrm{Start}_q, \mathrm{End}_q)$ form a partition of the set of $i$-indices $\{1, \dots, M\}$:
   $$\{1, \dots, M\} \;=\; \bigsqcup_{q=1}^{Q} \, [\mathrm{Start}_q, \mathrm{End}_q).$$
2. For every segment $q \in \{1, \dots, Q\}$, the $K_q$ intervals $[\mathrm{start}^q_k, \mathrm{end}^q_k)$ encode a set of neighbors as a finite collection of contiguous ranges of $j$-indices:
   $$\mathcal{N}(i) \;=\; \bigsqcup_{k=1}^{K_q} \, [\mathrm{start}^q_k, \mathrm{end}^q_k) \qquad \text{for all } i \in [\mathrm{Start}_q, \mathrm{End}_q).$$
By encoding our sparsity patterns as block-wise binary masks made up of tiles

$$T^q_k \;=\; [\mathrm{Start}_q, \mathrm{End}_q) \times [\mathrm{start}^q_k, \mathrm{end}^q_k) \;\subset\; \{1, \dots, M\} \times \{1, \dots, N\},$$

we can leverage coalesced memory operations for maximum efficiency on the GPU. As long as our index ranges are wider than the CUDA blocks, we should stay close to optimal performance.
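As an illustration, here is a made-up encoding of such a mask in NumPy: the three $i$-segments, three $j$-ranges and the boolean `keep` matrix below are arbitrary, and only meant to show how tiles translate into contiguous ranges of indices.

```python
# A made-up 3x3 cluster layout: segments of i- and j-indices, plus a
# block-wise boolean mask that tells which tiles should be computed.
import numpy as np

# [Start_q, End_q) intervals partitioning the M = 300 i-indices:
ranges_i = np.array([[0, 100], [100, 250], [250, 300]])
# [start_k, end_k) intervals partitioning the N = 200 j-indices:
ranges_j = np.array([[0, 50], [50, 120], [120, 200]])

# keep[q, k] == True  <=>  tile [Start_q,End_q) x [start_k,end_k) is computed:
keep = np.array([[True,  True,  False],
                 [False, True,  True ],
                 [False, False, True ]])

# For each i-segment q, the reduction only visits contiguous j-ranges:
for q, (Start_q, End_q) in enumerate(ranges_i):
    neighbors = ranges_j[keep[q]]  # contiguous ranges of j-indices
    print(f"i in [{Start_q}, {End_q}) reduces over j in", neighbors.tolist())
```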
Going further.
This scheme can be generalized to generic formulas and reductions. For reductions with respect to the $i$ axis, we simply have to define transposed rules, endowing each index $j$ with its own collection of contiguous ranges of $i$-neighbors $\mathcal{N}(j)$, and restrict ourselves to computations of the form:

$$a'_j \;=\; \operatorname*{Reduction}_{i \in \mathcal{N}(j)} F\big(p^1, \dots, x^1_i, \dots, y^1_j, \dots\big) \qquad \text{for } j \in [\mathrm{Start}'_q, \mathrm{End}'_q).$$
A decent trade-off
This block-wise approach to sparse reductions may seem a bit too coarse, as some negligible coefficients get computed with little to no impact on the final result… But in practice, the GPU speed-ups on contiguous memory operations more than make up for it: implemented in the GpuConv1D_ranges.cu CUDA file, our block-sparse Map-Reduce scheme is the workhorse of the multiscale Sinkhorn algorithm showcased in the GeomLoss library.
As explained in our documentation, the main user interface for KeOps block-sparse reductions is an optional “.ranges” attribute for LazyTensors, which encodes arbitrary block-sparsity masks. In practice, as illustrated in the Figure below, helper routines allow users to specify tiled sparsity patterns directly from clustered arrays of samples.
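As a sketch of this pipeline, here is how the clustering helpers from pykeops.torch.cluster can be combined with the “.ranges” attribute, following the library's block-sparse tutorial; the cluster size eps, the bandwidth sigma and the 4*sigma truncation threshold are illustrative values:

```python
# Sketch of a block-sparse Gaussian convolution with the pykeops helpers.
import torch
from pykeops.torch import LazyTensor
from pykeops.torch.cluster import (
    grid_cluster, cluster_ranges_centroids, sort_clusters, from_matrix,
)

M, N, sigma = 10000, 10000, 0.05
x, y = torch.rand(M, 2), torch.rand(N, 2)

# 1. Cluster both point clouds on a coarse grid of side eps:
eps = 0.05
x_labels, y_labels = grid_cluster(x, eps), grid_cluster(y, eps)

# 2. Compute the [Start_q, End_q) ranges and centroid of each cluster:
x_ranges, x_centroids, _ = cluster_ranges_centroids(x, x_labels)
y_ranges, y_centroids, _ = cluster_ranges_centroids(y, y_labels)

# 3. Sort the points so that each cluster is a contiguous memory block:
x, x_labels = sort_clusters(x, x_labels)
y, y_labels = sort_clusters(y, y_labels)
b = torch.randn(N, 1)  # signal b_j, indexed like the sorted y

# 4. Keep a tile only if the two centroids are close enough for the
#    Gaussian kernel to be non-negligible (4*sigma is a common heuristic):
D = ((x_centroids[:, None, :] - y_centroids[None, :, :]) ** 2).sum(-1)
keep = D < (4 * sigma) ** 2
ranges_ij = from_matrix(x_ranges, y_ranges, keep)

# 5. Attach the block-sparsity mask to the symbolic kernel matrix:
x_i, y_j = LazyTensor(x[:, None, :]), LazyTensor(y[None, :, :])
K_ij = (-((x_i - y_j) ** 2).sum(-1) / (2 * sigma ** 2)).exp()
K_ij.ranges = ranges_ij  # the optional ".ranges" attribute
a = K_ij @ b             # block-sparse reduction: skipped tiles cost nothing
```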
Figure. Illustrating block-sparse reductions with 2D point clouds.