Block-sparse reductions

Complexity of the KeOps routines

KeOps is built around online map-reduce routines that have a quadratic time complexity: if $M$ and $N$ denote the number of $i$ and $j$ variables, the cost of a generic reduction scales asymptotically in $O(MN)$. This is most evident in our convolution benchmark, where we compute all the kernel coefficients $K(x_i,y_j) = \exp(-\|x_i-y_j\|^2)$ to implement a discrete convolution:

\[a~=~ K_{xy} \,b \qquad\text{i.e.}\qquad a_i ~=~ \sum_{j=1}^N K(x_i,y_j)\cdot b_j \qquad \text{for $i\in[1,M]$.}\]

Can we do better? To go beyond this quadratic lower bound, a simple idea is to use a sparsity prior on the kernel matrix to skip some computations. For instance, we could decide to skip kernel computations when points $x_i$ and $y_j$ are far apart from each other. But can we do so efficiently?

Sparsity on the CPU

On CPUs, a standard strategy is to use sparse matrices and encode our operators with lists of non-zero coefficients and indices. Schematically, this comes down to endowing each index $i\in[1,M]$ with a set $J_i\subset[1,N]$ of $j$-neighbors, and to restrict ourselves to the computation of

\[a_i ~=~ \sum_{j\in J_i} K(x_i,y_j)\cdot b_j, \qquad \text{for $i\in[1,M]$.}\]

This approach is very well suited to matrices with a handful of nonzero coefficients per line, e.g. the intrinsic Laplacian of a 3D mesh. But on large, densely connected problems, sparse encoding runs into a major issue: as it relies on non-contiguous memory accesses, it scales very poorly on parallel hardware.

Block-sparsity on the GPU

As explained on the NVidia devblog, GPUs rely on coalesced memory operations which load blocks of dozens of contiguous bytes at once. Instead of allowing the use of arbitrary index sets $J_i$ for all lines of our sparse kernel matrix, we should thus restrict ourselves to computations of the form:

\[a_i ~=~ \sum_{l=1}^{S_k} \sum_{j=\text{start}^k_l}^{\text{end}^k_l-1} K(x_i,y_j)\cdot b_j, \qquad \text{for $i \in [\text{start}_k, \text{end}_k)$ and $k\in [1,K]$,}\]

where:

The $[\text{start}_k, \text{end}_k)$ intervals form a partition of the set of $i$-indices $[1,M]$:

\[[1,M]~=~ \bigsqcup_{k=1}^K \,[\text{start}_k, \text{end}_k).\]
For every segment $k\in[1,K]$, the $S_k$ intervals $[\text{start}^k_l, \text{end}^k_l)$ encode a set of neighbors as a finite collection of contiguous index ranges:

\[\forall~i\in[\text{start}_k, \text{end}_k), ~ J_i~=~ \bigsqcup_{l=1}^{S_k} \,[\text{start}^k_l, \text{end}^k_l).\]

By encoding our sparsity patterns as block-wise binary masks made up of tiles $T^k_l~=~[\text{start}_k, \text{end}_k) \times [\text{start}^k_l, \text{end}^k_l) \subset [1,M]\times[1,N]$, we can leverage coalesced memory operations for maximum efficiency on the GPU. As long as our index ranges are wider than the buffer size, we should get close to optimal performances.

Going further. This scheme can be generalized to generic formulas and reductions. For reductions with respect to the $i$ axis, we’d simply have to define transposed tiles $U^k_l~=~[\text{start}^k_l, \text{end}^k_l) \times [\text{start}_k, \text{end}_k) \subset [1,M]\times[1,N]$ and restrict ourselves to computations of the form:

\[b_j ~=~ \sum_{l=1}^{S_k} \sum_{i=\text{start}^k_l}^{\text{end}^k_l-1} K(x_i,y_j)\cdot a_i, \qquad \text{for $j \in [\text{start}_k, \text{end}_k)$ and $k\in [1,K]$.}\]

A decent trade-off. This block-wise approach to sparse reductions may seem a bit too coarse, as negligible coefficients get computed to no avail… But in practice, the GPU speed-ups on contiguous memory loads more than make up for it.

Documentation

See the pytorch or numpy api documentation for the syntax.

Examples

As documented in e.g. the numpy.Genred or torch.Genred docstring, all KeOps reductions accept an optional ranges argument, which can be either None (i.e. dense, quadratic reduction) or a 6-uple of integer arrays, which encode a set of $[\text{start}_k, \text{end}_k)$ and $[\text{start}^k_l, \text{end}^k_l)$ intervals for reductions with respect to the $j$ and $i$ indices.

A full tutorial on block-sparse reductions is provided in the gallery, for both NumPy and PyTorch APIs. As you go through these notebooks, you will learn how to:

Cluster and sort your data to enforce contiguity.
Define coarse binary masks that encode block-sparse reduction schemes.
Turn this information at cluster level into a ranges argument that can be used with KeOps’ generic reductions.
Test these block-sparse algorithms, and benchmark them vs. simpler, dense implementations.

The pykeops.numpy.cluster and pykeops.torch.cluster modules provide a set of helper functions whose interface is described below. Feel free to use and adapt them to your own setting, beyond the simple case of Sum reductions and Gaussian convolutions!