# Block-sparse reductions¶

## Complexity of the KeOps routines¶

KeOps is built around online map-reduce routines
which have a **quadratic time complexity**: if \(M\) and
\(N\) denote the number of \(i\) and \(j\) variables,
the cost of a generic reduction scales asymptotically in \(O(MN)\).
This is most evident in our convolution benchmark,
where we compute all the kernel coefficients \(K(x_i,y_j) = \exp(-\|x_i-y_j\|^2)\)
to implement a discrete convolution:

**Can we do better?**
To go beyond this quadratic lower bound,
a simple idea is to **skip some computations**, using a **sparsity prior**
on the kernel matrix. For instance, we could decide to skip kernel computations
when **points** \(x_i\) **and** \(y_j\) **are far apart from each other**.
But can we do so **efficiently**?

### Sparsity on the CPU¶

On CPUs, a standard strategy is to use sparse matrices,
encoding our operators through **lists of non-zero coefficients and indices**.
Schematically, this comes down to endowing each index \(i\in[1,M]\)
with a set \(J_i\subset[1,N]\) of \(j\)-neighbors,
and to restrict ourselves to the computation of

This approach is very well suited to matrices with a handful of nonzero coefficients per line,
e.g. the intrinsic Laplacian of a 3D mesh.
But on large, densely connected problems, sparse encoding
runs into a major issue: as it relies on **non-contiguous** memory accesses,
it scales **very poorly** on modern parallel hardware.

### Block-sparsity on the GPU¶

As explained on the NVidia devblog,
GPUs rely on **coalesced memory operations** which load blocks
of **dozens of contiguous bytes** at once. Instead of allowing the
use of arbitrary index sets \(J_i\) for all lines of our sparse kernel matrix,
we should thus restrict ourselves to computations of the form:

where:

The \([\text{start}_k, \text{end}_k)\) intervals form a

**partition**of the set of \(i\)-indices \([1,M]\):\[[1,M]~=~ \bigsqcup_{k=1}^K \,[\text{start}_k, \text{end}_k).\]For every segment \(k\in[1,K]\), the \(S_k\) intervals \([\text{start}^k_l, \text{end}^k_l)\) encode a set of

**neighbors**as a finite collection of contiguous index ranges:\[\forall~i\in[\text{start}_k, \text{end}_k), ~ J_i~=~ \bigsqcup_{l=1}^{S_k} \,[\text{start}^k_l, \text{end}^k_l).\]

By encoding our sparsity patterns as **block-wise binary masks**
made up of tiles \(T^k_l~=~[\text{start}_k, \text{end}_k) \times [\text{start}^k_l, \text{end}^k_l) \subset [1,M]\times[1,N]\),
we can leverage coalesced memory operations for maximum efficiency on the GPU.
As long as our index ranges are **wider than the buffer size**,
we should thus be close to **optimal performances**.

**Going further.** This scheme can be generalized to **generic**
formulas and reductions. For reductions with respect to the \(i\) axis,
we’d simply have to define *transposed* tiles
\(U^k_l~=~[\text{start}^k_l, \text{end}^k_l) \times [\text{start}_k, \text{end}_k) \subset [1,M]\times[1,N]\)
and restrict ourselves to computations of the form:

**A decent trade-off.** This **block-wise** approach to sparse reductions may seem a bit
too **coarse**, as negligible coefficients get computed to no avail…
But in practice, the **GPU speed-ups** on contiguous memory loads
more than make up for it.

## Examples¶

As documented in e.g. the `numpy.Genred`

or `torch.Genred`

docstring,
all KeOps reductions accept an optional **ranges** argument,
which can be either `None`

(i.e. dense, quadratic reduction)
or a 6-uple of integer arrays, which encode
a set of \([\text{start}_k, \text{end}_k)\) and
\([\text{start}^k_l, \text{end}^k_l)\) intervals
for reductions with respect to the \(j\) and \(i\) indices.

A **full tutorial** on block-sparse reductions
is provided in the gallery,
for both NumPy
and PyTorch APIs.
As you go through these notebooks, you will learn how to:

Cluster and sort your data to enforce

**contiguity**.Define

**coarse binary masks**that encode block-sparse reduction schemes.Turn this information at cluster level into a

**ranges**argument that can be used with KeOps’**generic reductions**.**Test**these block-sparse algorithms, and**benchmark**them vs. simpler, dense implementations.

The `pykeops.numpy.cluster`

and `pykeops.torch.cluster`

modules
provide a set of **helper functions** whose interface is described below.
Feel free to use and adapt them to **your own setting**,
beyond the simple case of **Sum** reductions and Gaussian **convolutions**!