Why using KeOps?

Scalable kernel operations

KeOps can now be used on a broad class of problems (A generic framework that suits your needs). But at heart, the main motivation behind this library is the need to compute fast and scalable Gaussian convolutions (RBF kernel products). For very large values of \(M\) and \(N\), given :

  • a target point cloud \((x_i)_{i=1}^M \in \mathbb R^{M \times D}\),

  • a source point cloud \((y_j)_{j=1}^N \in \mathbb R^{N \times D}\),

  • a signal \((b_j)_{j=1}^N \in \mathbb R^{N}\) attached to the \(y_j\)’s,

KeOps allows you to compute efficiently the array \((a_i)_{i=1}^M \in \mathbb R^{M}\) given by

\[a_i = \sum_j K(x_i,y_j) b_j, \qquad i=1,\cdots,M,\]

where \(K(x_i,y_j) = \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\). On top of this, thanks to KeOps’ automatic differentiation module, you can also get access to the gradient of the \(a_i\)’s with respect to the \(x_i\)’s:

\[a_i' = \sum_j \partial_x K(x_i,y_j) b_j, \qquad i=1,\cdots,M,\]

without having to code the formula \(\partial_x K(x_i,y_j) = -\tfrac{1}{\sigma^2}(x_i - y_j) \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\) !

High performances

In recent years, Deep Learning frameworks such as PyTorch or TensorFlow have evolved into fully-fledged applied math libraries: with negligible overhead, they bring automatic differentiation and seamless GPU support to research communities that were used to Matlab, NumPy and other tensor-centric frameworks.

Unfortunately, though, no magic is involved: optimised CUDA codes still have to be written for every single operation provided to end-users. Supporting all the standard mathematical routines thus comes at a huge engineering cost for the developers of the main frameworks. As of 2019, this effort has been mostly restricted to the operations needed to implement Convolutional Neural Networks: linear algebra routines and convolutions on grids, images and volumes.

Consequently, in array-centric frameworks, the standard way of computing Gaussian convolutions is to create and store in memory the full \(M\)-by-\(N\) kernel matrix \(K_{i,j}=K(x_i,y_j)\), before computing \((a_i) = (K_{i,j}) (b_j)\) as a matrix product. But for large datasets (say, \(M,N \geqslant 10,000\)), this is not realistic: large matrices just don’t fit in GPU memories.

KeOps is all about letting researchers break through this memory bottleneck. Relying on online map-reduce schemes, we provide CUDA routines that “sum” the coefficients \(K_{i,j}\cdot b_j\) as they are computed, without ever storing the full matrix \(K\) in memory.

As evidenced by our benchmarks, the KeOps routines outperform their standard GPU-tensorized counterparts by two orders of magnitude on modern hardware: on top of a reduced memory usage, they can also bring a considerable speed-up to all applications that rely on massive - but simple - map-reduce operations.

A generic framework that suits your needs

KeOps supports generic operations, way beyond the simple case of kernel convolutions. Let’s say that you have at hand:

  • a collection \(p^1, p^2, ..., p^P\) of vectors.

  • a collection \(x^1_i, x^2_i, ..., x^X_i\) of vector sequences, indexed by an integer \(i\) ranging from 1 to \(M\).

  • a collection \(y^1_j, y^2_j, ..., y^Y_j\) of vector sequences, indexed by an integer \(j\) ranging from 1 to \(N\).

  • a vector-valued function \(f(p^1, p^2,..., x^1_i, x^2_i,..., y^1_j, y^2_j, ...)\) on these input vectors.

Then, referring to the \(p\)’s as parameters, the \(x_i\)’s as x-variables and the \(y_j\)’s as y-variables, the KeOps library allows you to compute efficiently any expression \(a_i\) of the form

\[a_i = \operatorname{Reduction}_{j=1,\cdots,N}\limits \big[ f(p^1, p^2,..., x^1_i, x^2_i,..., y^1_j, y^2_j, ...) \big], \qquad i=1,\cdots,M\]

alongside its derivatives with respect to all the variables and parameters.

Feel free to browse through our gallery of tutorials for examples of applications.