Block-sparse reductions
Complexity of the KeOps routines
KeOps is built around online map-reduce routines
that have a quadratic time complexity: if $M$ and $N$ denote the respective numbers of $i$- and $j$-indexed variables, the cost of a generic reduction scales asymptotically in $O(M N)$.
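For reference, here is a minimal sketch of such a dense, quadratic reduction with the PyTorch bindings; the Gaussian formula and the sizes below are arbitrary choices used for illustration, not prescribed by the text above:

```python
import torch
from pykeops.torch import Genred

M, N = 10000, 20000
x = torch.randn(M, 3)  # M "i"-indexed points in R^3
y = torch.randn(N, 3)  # N "j"-indexed points in R^3
b = torch.randn(N, 1)  # one signal value per "j" point

# Gaussian convolution a_i = sum_j exp(-|x_i - y_j|^2) * b_j, computed with an
# online map-reduce scheme: O(M*N) time, but no M-by-N buffer is ever stored.
gauss_conv = Genred(
    "Exp(-SqDist(x, y)) * b",
    ["x = Vi(3)", "y = Vj(3)", "b = Vj(1)"],
    reduction_op="Sum",
    axis=1,  # reduce over the "j" index
)
a = gauss_conv(x, y, b)  # (M, 1) result
```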
Can we do better?
To go beyond this quadratic lower bound,
a simple idea is to use a sparsity prior
on the kernel matrix to skip some computations.
For instance, we could decide to skip kernel computations
when points $x_i$ and $y_j$ are too far away from each other.
Sparsity on the CPU
On CPUs, a standard strategy is to use sparse matrices
and encode our operators with lists of non-zero coefficients and indices.
Schematically, this comes down to endowing each index $i \in [0, M)$ with a set $\mathcal{N}(i)$ of $j$-neighbors, and to restricting the reduction to these pairs:

$$a_i \;=\; \operatorname{Reduction}_{j \in \mathcal{N}(i)} \big[ F(x_i, y_j) \big].$$
This approach is very well suited to matrices with a handful of nonzero coefficients per line, e.g. the intrinsic Laplacian of a 3D mesh. But on large, densely connected problems, sparse encoding runs into a major issue: as it relies on non-contiguous memory accesses, it scales very poorly on parallel hardware.
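As a rough illustration of this strategy, here is a small NumPy/SciPy sketch of the neighbor-list encoding with a CSR matrix (this is not a KeOps routine, and the truncation radius is an arbitrary choice):

```python
import numpy as np
from scipy.sparse import csr_matrix

M, N = 1000, 2000
x = np.random.randn(M, 3)
y = np.random.randn(N, 3)
b = np.random.randn(N, 1)

# Truncated Gaussian kernel: K_ij is kept only when |x_i - y_j| < radius.
radius = 0.5
D2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # (M, N) squared distances, dense for illustration
i_idx, j_idx = np.nonzero(D2 < radius**2)            # explicit list of (i, j) neighbor pairs
vals = np.exp(-D2[i_idx, j_idx])

K = csr_matrix((vals, (i_idx, j_idx)), shape=(M, N))  # row i stores its neighbor set N(i)
a = K @ b                                             # a_i = sum over j in N(i) of K_ij * b_j
```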
Block-sparsity on the GPU
As explained on the NVidia devblog,
GPUs rely on coalesced memory operations which load blocks
of dozens of contiguous bytes at once. Instead of allowing the
use of arbitrary index sets $\mathcal{N}(i)$, we should thus restrict ourselves to block-wise sparsity patterns and replace the reduction above with

$$a_i \;=\; \operatorname{Reduction}_{j \in \mathcal{N}_k} \big[ F(x_i, y_j) \big] \qquad \text{for all } i \in S_k = [I_k, I_{k+1}),$$

where:
The segments $S_k = [I_k, I_{k+1})$ form a partition of the set of $i$-indices $[0, M)$:

$$[0, M) \;=\; \bigsqcup_{k=1}^{M_c} [I_k, I_{k+1}), \qquad 0 = I_1 < I_2 < \cdots < I_{M_c+1} = M.$$

For every segment $S_k$, the $K_k$ intervals $[\mathrm{start}_{k,l}, \mathrm{end}_{k,l})$ encode a set of neighbors $\mathcal{N}_k$ as a finite collection of contiguous index ranges:

$$\mathcal{N}_k \;=\; \bigcup_{l=1}^{K_k} [\mathrm{start}_{k,l}, \mathrm{end}_{k,l}) \;\subset\; [0, N).$$
By encoding our sparsity patterns as block-wise binary masks
made up of tiles

$$T_{k,l} \;=\; [I_k, I_{k+1}) \times [\mathrm{start}_{k,l}, \mathrm{end}_{k,l}) \;\subset\; [0, M) \times [0, N),$$

we can thus rely on coalesced memory loads and get the best performance out of the GPU.
Going further. This scheme can be generalized to generic
formulas and reductions. For reductions with respect to the $i$ axis, we simply swap the roles of the two index sets: the $j$-indices are partitioned into contiguous segments, each of which is paired with its own collection of contiguous ranges of $i$-indices.
A decent trade-off. This block-wise approach to sparse reductions may seem a bit too coarse, as negligible coefficients get computed to no avail… But in practice, the GPU speed-ups on contiguous memory loads more than make up for it.
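As a toy illustration of this tile-based scheme (plain NumPy, with arbitrary block boundaries and mask; this is not the KeOps API itself):

```python
import numpy as np

# M = 6 "i"-indices split into 2 contiguous blocks, N = 9 "j"-indices split into 3.
I = np.array([0, 3, 6])      # block boundaries along the i-axis: [0,3), [3,6)
J = np.array([0, 3, 6, 9])   # block boundaries along the j-axis: [0,3), [3,6), [6,9)

# keep[k, l] == True means that the tile [I[k], I[k+1]) x [J[l], J[l+1])
# takes part in the reduction; everything else is skipped wholesale.
keep = np.array([[True,  True,  False],
                 [False, True,  True ]])

K = np.random.randn(6, 9)  # a dense "kernel matrix", for illustration only
b = np.random.randn(9)
a = np.zeros(6)

for k in range(len(I) - 1):      # loop over the blocks of i-indices
    for l in range(len(J) - 1):  # loop over the blocks of j-indices
        if keep[k, l]:           # only the tiles of the binary mask are processed,
            # ... and each tile is handled with contiguous memory accesses.
            a[I[k]:I[k+1]] += K[I[k]:I[k+1], J[l]:J[l+1]] @ b[J[l]:J[l+1]]
```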
Documentation
Examples
As documented in e.g. the numpy.Genred
or torch.Genred
docstring,
all KeOps reductions accept an optional ranges argument,
which can be either None
(i.e. dense, quadratic reduction)
or a 6-uple of integer arrays that encodes a set of tiles $T_{k,l}$, i.e. a block-sparse reduction scheme such as the one described above.
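In practice, this 6-uple is seldom written by hand. Here is a hedged sketch of how it could be assembled with the from_matrix helper presented below, assuming hand-picked block boundaries and a coarse cluster mask:

```python
import torch
from pykeops.torch import Genred
from pykeops.torch.cluster import from_matrix

M, N = 1000, 2000
x = torch.randn(M, 3)
y = torch.randn(N, 3)
b = torch.randn(N, 1)

# Two contiguous blocks of "i"-indices and two blocks of "j"-indices:
ranges_i = torch.tensor([[0, 500], [500, 1000]], dtype=torch.int32)
ranges_j = torch.tensor([[0, 1000], [1000, 2000]], dtype=torch.int32)
# Coarse binary mask: the first i-block only interacts with the first j-block, etc.
keep = torch.tensor([[True, False],
                     [False, True]])

ranges = from_matrix(ranges_i, ranges_j, keep)  # the 6-uple of integer arrays

gauss_conv = Genred(
    "Exp(-SqDist(x, y)) * b",
    ["x = Vi(3)", "y = Vj(3)", "b = Vj(1)"],
    reduction_op="Sum",
    axis=1,
)
a_dense = gauss_conv(x, y, b)                  # dense, quadratic reduction
a_sparse = gauss_conv(x, y, b, ranges=ranges)  # block-sparse reduction
```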
A full tutorial on block-sparse reductions is provided in the gallery, for both NumPy and PyTorch APIs. As you go through these notebooks, you will learn how to:
Cluster and sort your data to enforce contiguity.
Define coarse binary masks that encode block-sparse reduction schemes.
Turn this cluster-level information into a ranges argument that can be used with KeOps’ generic reductions.
Test these block-sparse algorithms, and benchmark them vs. simpler, dense implementations.
The pykeops.numpy.cluster
and pykeops.torch.cluster
modules
provide a set of helper functions whose interface is described below.
Feel free to use and adapt them to your own setting,
beyond the simple case of Sum reductions and Gaussian convolutions!
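As a hedged sketch of the full pipeline that these helpers support, in the spirit of the tutorials mentioned above (the grid size, kernel bandwidth and 4-sigma truncation rule are arbitrary choices):

```python
import torch
from pykeops.torch import Genred
from pykeops.torch.cluster import (
    grid_cluster, cluster_ranges_centroids, sort_clusters, from_matrix,
)

# Random point clouds in the unit cube:
x = torch.rand(10000, 3)
y = torch.rand(20000, 3)

# 1. Cluster the points with a cubic grid of side "eps", then sort them so that
#    each cluster corresponds to a contiguous range of indices:
eps = 0.1
x_labels = grid_cluster(x, eps)
y_labels = grid_cluster(y, eps)
x_ranges, x_centroids, _ = cluster_ranges_centroids(x, x_labels)
y_ranges, y_centroids, _ = cluster_ranges_centroids(y, y_labels)
x, x_labels = sort_clusters(x, x_labels)
y, y_labels = sort_clusters(y, y_labels)
b = torch.randn(len(y), 1)  # signal, created after sorting (an existing signal
                            # would have to be permuted alongside y)

# 2. Coarse binary mask: keep a pair of clusters whenever their centroids are
#    closer than the truncation radius of the kernel (here, 4 * sigma):
sigma = 0.05
D2 = ((x_centroids[:, None, :] - y_centroids[None, :, :]) ** 2).sum(-1)
keep = D2 < (4 * sigma) ** 2

# 3. Turn this cluster-level information into a "ranges" argument:
ranges = from_matrix(x_ranges, y_ranges, keep)

# 4. Run the block-sparse reduction and compare it with the dense one:
gauss_conv = Genred(
    "Exp(-g * SqDist(x, y)) * b",
    ["x = Vi(3)", "y = Vj(3)", "b = Vj(1)", "g = Pm(1)"],
    reduction_op="Sum",
    axis=1,
)
g = torch.tensor([1.0 / (2 * sigma**2)])
a_sparse = gauss_conv(x, y, b, g, ranges=ranges)
a_dense = gauss_conv(x, y, b, g)
print((a_sparse - a_dense).abs().max())  # truncation error
```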