Why use KeOps?
Scalable kernel operations
KeOps can be used on a broad class of problems (A generic framework). But the first motivation behind this library is simple: we intend to accelerate the computation of Gaussian convolutions on point clouds, also known as RBF kernel products on sampled data.
Working in a vector space $\mathbb{R}^D$, let us assume that we have at hand:

- a target point cloud $x_1, \dots, x_M \in \mathbb{R}^D$, encoded as an $(M, D)$ array;
- a source point cloud $y_1, \dots, y_N \in \mathbb{R}^D$, encoded as an $(N, D)$ array;
- a signal $b_1, \dots, b_N \in \mathbb{R}$, attached to the $y_j$'s and encoded as an $(N, 1)$ vector.

Then, KeOps allows us to compute efficiently the vector of values $a_1, \dots, a_M \in \mathbb{R}$ given by

$$a_i = \sum_{j=1}^{N} k(x_i, y_j)\, b_j, \qquad i = 1, \dots, M,$$

where $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ is a Gaussian kernel of deviation $\sigma > 0$, without ever having to materialize the $M \times N$ kernel matrix $(k(x_i, y_j))$ in memory.
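For reference, the brute-force computation that KeOps accelerates can be sketched in plain NumPy. The sizes and array names below are illustrative; note that this dense baseline materializes the full $(M, N)$ kernel matrix, which is exactly what KeOps avoids:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D, sigma = 5, 7, 3, 1.0

x = rng.standard_normal((M, D))   # target point cloud, (M, D) array
y = rng.standard_normal((N, D))   # source point cloud, (N, D) array
b = rng.standard_normal((N, 1))   # signal attached to the y_j's, (N, 1) vector

# Dense baseline: materialize the (M, N) matrix of squared distances...
D2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
# ...turn it into a Gaussian kernel matrix...
K = np.exp(-D2 / (2 * sigma**2))
# ...and reduce it against the signal: a_i = sum_j k(x_i, y_j) b_j.
a = K @ b                          # (M, 1) output vector
```

KeOps computes the same vector `a` without storing `K`, which is what makes the operation scalable to large `M` and `N`.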
A generic framework
Going further, KeOps supports a wide range of mathematical and deep learning computations. Let's say that we have at hand:

- a collection $p^1, p^2, \dots, p^P$ of vectors;
- a collection $x_i^1, x_i^2, \dots, x_i^X$ of vector sequences, indexed by an integer $i$ that ranges from 1 to $M$;
- a collection $y_j^1, y_j^2, \dots, y_j^Y$ of vector sequences, indexed by an integer $j$ that ranges from 1 to $N$;
- a vector-valued function $F(p^1, \dots, p^P, x_i^1, \dots, x_i^X, y_j^1, \dots, y_j^Y)$ on these input vectors and indices, such as a small neural network;
- a Reduction or “pooling” operator such as a sum, a max or an arg-min.

Then, referring to the $p$'s as parameters, the $x_i$'s as i-variables and the $y_j$'s as j-variables, KeOps allows us to compute efficiently the expression

$$a_i = \operatorname{Reduction}_{j=1,\dots,N}\, F(p^1, \dots, x_i^1, \dots, y_j^1, \dots), \qquad i = 1, \dots, M,$$

alongside its derivatives with respect to all variables and parameters.
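As an illustration, here is a brute-force NumPy version of this generic Map-Reduce template, with a toy scalar formula `F` and an arg-min reduction. All names and the choice of formula are illustrative, not part of the KeOps API:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, D = 4, 6, 3
p = rng.standard_normal(D)        # a parameter vector
x = rng.standard_normal((M, D))   # i-variables
y = rng.standard_normal((N, D))   # j-variables

def F(p, x_i, y_j):
    # A toy formula on the input vectors (e.g. a small network would also fit).
    return ((x_i - y_j) ** 2).sum() + p @ y_j

# a_i = argmin_{j} F(p, x_i, y_j), computed by brute force over all (i, j) pairs:
a = np.array([np.argmin([F(p, x[i], y[j]) for j in range(N)]) for i in range(M)])
```

KeOps evaluates the same template without the Python loops or the intermediate $(M, N)$ array of values.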
Examples of applications
This type of computation is common in machine learning and applied mathematics:
A kernel matrix-vector product is implemented using a sum reduction and a formula $F(x_i, y_j) = k(x_i, y_j)\, b_j$ that is weighted by a suitable kernel function $k(x, y)$. As detailed in the section above, the computation reads

$$a_i = \sum_{j=1}^{N} k(x_i, y_j)\, b_j, \qquad i = 1, \dots, M.$$

This operation is key to spline regression, kernel methods and the study of Gaussian processes. In physics, we often use Newton or Coulomb kernels such as $k(x, y) = 1 / \|x - y\|$: accelerating kernel products is the first step towards fast N-body simulations.
K-Nearest Neighbors queries are implemented using an “arg-K-min” reduction that returns, for all index $i$, the indices $(j_i^1, \dots, j_i^K)$ that correspond to the K smallest values of a distance or similarity metric $d(x_i, y_j)$. For instance, in a Euclidean space, we compute

$$(j_i^1, \dots, j_i^K) = \operatorname*{arg\,K\,min}_{j=1,\dots,N} \|x_i - y_j\|^2,$$

where $d(x_i, y_j) = \|x_i - y_j\|^2$ is a sum of squared distances. K-NN search is a key building block for numerous methods in data sciences, from simple classifiers to advanced methods in topological data analysis and dimensionality reduction. KeOps intends to provide fast runtimes for all types of metrics, beyond the standard Euclidean distance and cosine similarity.
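A brute-force NumPy sketch of this “arg-K-min” reduction in a Euclidean space (sizes are illustrative; a full sort stands in for the partial selection that an optimized implementation would use):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, D, K = 5, 100, 3, 4
x = rng.standard_normal((M, D))   # query points
y = rng.standard_normal((N, D))   # database points

# Squared Euclidean distances between every query and every database point:
D2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (M, N)

# "arg-K-min": for each query, the indices of its K nearest neighbors,
# sorted by increasing distance.
knn = np.argsort(D2, axis=1)[:, :K]                   # (M, K)
```

As with the kernel product, the point of KeOps is to compute `knn` without ever storing the $(M, N)$ distance matrix `D2`.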
In computer graphics and geometric deep learning, we implement point cloud convolutions and message passing layers using a function

$$F(p, x_i, x_j, f_j) = \mathrm{Window}(x_i, x_j) \cdot \mathrm{Filter}(p, f_j)$$

that is the product of a neighborhood window $\mathrm{Window}(x_i, x_j)$ between point positions $x_i$, $x_j$, and of a parametric filter $\mathrm{Filter}(p, \cdot)$ that is applied to a collection of feature vectors $f_j$. The reduction or “pooling” operator is usually a (weighted) sum or a maximum.

Most architectures in computer vision rely on K-Nearest Neighbors graphs (“KNN graphs”) to define sparse neighborhood windows: $\mathrm{Window}(x_i, x_j)$ is equal to 1 if $x_j$ is one of the K closest neighbors of $x_i$ and 0 otherwise. The point convolution then reads

$$f_i' = \sum_{j=1}^{N} \mathrm{Window}(x_i, x_j) \cdot \mathrm{Filter}(p, f_j).$$

Crucially, KeOps now also lets users work with global point convolutions without compromising on performance: we refer to Section 5.3 of our NeurIPS 2020 paper and to this presentation of quasi-geodesic convolutions on protein surfaces for a detailed discussion.
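A brute-force sketch of such a KNN point convolution in NumPy. A plain linear map stands in for the parametric filter, and the point itself is counted among its own neighbors; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, F_in, F_out, K = 20, 3, 4, 2, 5
x = rng.standard_normal((N, D))          # point positions
f = rng.standard_normal((N, F_in))       # input feature vectors
W = rng.standard_normal((F_in, F_out))   # parametric filter (here: linear)

# KNN window: Window(x_i, x_j) = 1 if x_j is among the K closest points to x_i.
D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
knn = np.argsort(D2, axis=1)[:, :K]          # (N, K) neighbor indices
window = np.zeros((N, N))
np.put_along_axis(window, knn, 1.0, axis=1)  # dense 0/1 neighborhood mask

# Point convolution: sum reduction of Window(x_i, x_j) * Filter(f_j).
out = window @ (f @ W)                       # (N, F_out) output features
```

The dense `window` matrix is only built here for clarity: a scalable implementation keeps the window symbolic, which is precisely the KeOps use case.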
In natural language processing, we implement attention layers for transformer networks using an exponentiated dot product

$$F(q_i, k_j) = \exp\big(\langle q_i, k_j \rangle / \sqrt{D}\big)$$

between query ($q_i$) and key ($k_j$) vectors of dimension $D$. The reduction is a normalized matrix-vector product with an array of value vectors $v_j$ (a soft maximum) and the overall computation reads

$$a_i = \frac{\sum_{j=1}^{N} \exp\big(\langle q_i, k_j \rangle / \sqrt{D}\big)\, v_j}{\sum_{j=1}^{N} \exp\big(\langle q_i, k_j \rangle / \sqrt{D}\big)}.$$

It can be implemented efficiently using the KeOps “Sum-SoftMax-Weight” reduction.
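A brute-force NumPy sketch of this attention computation, with the usual max-shift for numerical stability (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, D, Dv = 4, 6, 8, 5
q = rng.standard_normal((M, D))    # query vectors
k = rng.standard_normal((N, D))    # key vectors
v = rng.standard_normal((N, Dv))   # value vectors

# Scaled dot products, then a row-wise softmax (stabilized by a max-shift):
s = q @ k.T / np.sqrt(D)                        # (M, N) attention scores
w = np.exp(s - s.max(axis=1, keepdims=True))    # exponentiated dot products
att = w / w.sum(axis=1, keepdims=True)          # normalized softmax weights

# Normalized matrix-vector product with the values:
out = att @ v                                   # (M, Dv) attention output
```

The KeOps “Sum-SoftMax-Weight” reduction fuses these steps so that the $(M, N)$ score matrix `s` never has to be stored.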
We implement the Fourier transform with a sum reduction and a complex exponential:

$$\widehat{f}(\omega_i) = \sum_{j=1}^{N} \exp\big(\mathrm{i}\,\langle \omega_i, x_j \rangle\big)\, f_j.$$

This expression evaluates the spectral content at frequency $\omega_i$ of a signal $f$ that is represented by sampled values $f_j = f(x_j)$ at locations $x_j$. KeOps thus allows users to implement efficient Fourier-Stieltjes transforms on non-uniform data using both real- and complex-valued trigonometric functions.
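A brute-force NumPy sketch of such a non-uniform discrete Fourier transform in one dimension (the `exp(-i...)` sign convention, the frequency grid and the test signal are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
x = rng.uniform(0.0, 2 * np.pi, N)   # non-uniform sample locations
f = np.cos(3 * x)                    # sampled signal values f_j = f(x_j)

omega = np.arange(8)                 # frequencies of interest
# Non-uniform DFT: F_hat[i] = sum_j exp(-i * omega_i * x_j) * f_j
F_hat = np.exp(-1j * omega[:, None] * x[None, :]) @ f
```

As expected, the spectrum concentrates around the frequency of the cosine, even though the samples are irregularly spaced.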
In optimization theory, we implement the Legendre-Fenchel transform or convex conjugate of an arbitrary function $f$ that is sampled on a point cloud $y_1, \dots, y_N$ with a vector of values $\big(f(y_j)\big)$ using a dot product and a maximum reduction:

$$f^*(x_i) = \max_{j=1,\dots,N}\; \langle x_i, y_j \rangle - f(y_j).$$

In imaging sciences, we implement the distance transform of a binary mask $m$ that is defined on a rectangle domain using a minimum reduction and a squared distance function:

$$d(x_i) = \min_{j \,:\, m(y_j) = 1} \|x_i - y_j\|^2.$$

We note that just like the Legendre-Fenchel transform, the distance transform is separable and can be implemented efficiently on 2D and 3D grids. Just as with separable Gaussian convolution, the trick is to apply the transform successively on the lines and columns of the image. Thanks to its native support for batch processing, KeOps is ideally suited to these manipulations: it can be used to implement all types of fast separable transforms on the GPU.
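A brute-force NumPy sketch of this separable trick for the squared-Euclidean distance transform of a 2D binary mask: the 1D transform is applied along the rows, then along the columns. The `dt1d` helper and the mask layout are illustrative:

```python
import numpy as np

H, W = 6, 8
mask = np.zeros((H, W), dtype=bool)
mask[1, 2] = mask[4, 6] = True       # two "on" pixels

def dt1d(g):
    """1D squared-distance transform along the first axis:
    out[i, ...] = min_k (i - k)**2 + g[k, ...]."""
    L = g.shape[0]
    idx = np.arange(L)
    d2 = (idx[:, None] - idx[None, :]) ** 2        # (L, L) squared offsets
    d2 = d2.reshape(L, L, *([1] * (g.ndim - 1)))   # broadcast over other dims
    return (d2 + g[None]).min(axis=1)

# Separable computation: transform the columns, then the rows.
g = np.where(mask, 0.0, np.inf)   # cost: 0 on the mask, +inf elsewhere
g = dt1d(g)                       # 1D transform along axis 0
dist = dt1d(g.T).T                # 1D transform along axis 1
```

Composing the two 1D passes yields the full 2D transform $d(x_i) = \min_{j : m(y_j)=1} \|x_i - y_j\|^2$, at a fraction of the cost of the direct 2D minimization.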
In optimal transport theory, we implement the C-transform using a “min” reduction and a formula

$$f^{\,C}(x_i) = \min_{j=1,\dots,N}\; \big[\, C(x_i, y_j) - f(y_j) \,\big]$$

that penalizes the value of the ground cost function $C(x_i, y_j)$ by that of the dual potential $f(y_j)$. Going further, numerically stable Sinkhorn iterations correspond to the case where the minimum in the C-transform is replaced by a (rescaled) log-sum-exp reduction, known as a soft minimum at temperature $\varepsilon > 0$:

$$f^{\,C,\varepsilon}(x_i) = -\varepsilon \log \sum_{j=1}^{N} \exp \tfrac{1}{\varepsilon} \big[\, f(y_j) - C(x_i, y_j) \,\big].$$

As detailed in our NeurIPS 2020 paper, KeOps speeds up modern optimal transport solvers by one to three orders of magnitude, from standard auction iterations to multiscale Sinkhorn loops. A collection of reference solvers is provided by the GeomLoss library, which now scales up to millions of samples in seconds.
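A brute-force NumPy sketch of the hard C-transform and its soft-minimum relaxation, using a max-shifted log-sum-exp for numerical stability (the quadratic ground cost and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, eps = 4, 7, 0.1
x = rng.standard_normal((M, 2))   # target samples x_i
y = rng.standard_normal((N, 2))   # source samples y_j
g = rng.standard_normal(N)        # dual potential sampled on the y_j's

# Quadratic ground cost C(x_i, y_j) = |x_i - y_j|^2 / 2:
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1) / 2

# Hard C-transform: a plain "min" reduction.
f_hard = (C - g[None, :]).min(axis=1)

# Soft minimum at temperature eps: a stabilized, rescaled log-sum-exp.
z = (g[None, :] - C) / eps
m = z.max(axis=1)
f_soft = -eps * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))
```

The soft minimum lower-bounds the hard one and converges to it as $\varepsilon \to 0$, which is what makes the log-sum-exp formulation of Sinkhorn iterations numerically stable.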
Numerous particle and swarming models rely on interaction steps that fit this template to update the positions and inner states of their agents. For instance, on modest gaming hardware, KeOps can scale up simulations of Vicsek-like systems to millions of active swimmers and flyers: this allows researchers to make original conjectures on their models with a minimal amount of programming effort.
Crucially, we can understand all these computations as reductions of “symbolic” matrices whose coefficients are given, for all indices $(i, j)$, by a mathematical formula $F(p, x_i, y_j)$. KeOps LazyTensors avoid storing in memory the $M \times N$ matrix of values $F(p, x_i, y_j)$: coefficients are computed on the fly and reduced as soon as possible, which cuts the memory footprint of these operations from quadratic to linear in the number of samples.
High performances
KeOps fits within a thriving ecosystem of Python/C++ libraries for scientific computing. So how does it compare with other acceleration frameworks such as Numba, Halide, TVM, Julia or JAX/XLA? To answer this question, let us now briefly explain the relationship between our library and the wider software environment for tensor computing.
Tensor computing on the GPU
Fast numerical methods are the fuel of machine learning research. Over the last decade, the sustained development of the CUDA ecosystem has driven the progress in the field: though Python is the lingua franca of data science and machine learning, most frameworks rely on efficient C++ backends to leverage the computing power of GPUs. Recent advances in computer vision or natural language processing attest to the fitness of modern libraries: they stem from the mix of power and flexibility that is provided by PyTorch, TensorFlow and general purpose accelerators such as JAX/XLA.
Nevertheless, important work remains to be done. Geometric computations present a clear gap in performances between Python and C++: notable examples are implementations of point cloud convolutions or of the nearest neighbor search that is discussed above. To scale up geometric computations to real-world data, a common practice is therefore to replace the compute-intensive parts of a Python code by handcrafted CUDA kernels. These are expensive to develop and maintain, which leads to an unfortunate need to compromise between ease of development and scalability.
KeOps: a specialized tool
Requirements for geometric data analysis and learning. None of the aforementioned methods are fully suited for exploratory research in geometric data analysis and machine learning. Let us briefly explain why:
- First of all, some acceleration schemes do not stream well on GPUs or have to rely on expensive pre-computations: hierarchical matrices or advanced nearest neighbor finders can hardly be used in the training loop of a neural network.
- Other strategies make strong assumptions on the properties of the convolution filter $k(x, y)$ or on the dimension and geometry of the ambient feature space. These restrictions make existing tools cumbersome to use in e.g. deep learning, where one wishes to have modelling freedom with respect to the choice of the embedding space geometry and dimension.
- Finally, most acceleration frameworks for Python expect users to be knowledgeable on GPU parallelism, or do not support automatic differentiation.
The bottom line is that most existing tools are not ready to be used by a majority of researchers in the community.
A gap in the literature. In order to tackle these issues, the developers of deep learning libraries have recently put an emphasis on just-in-time compilation for neural networks. For instance, the recent PyTorch JIT and XLA engines enable operator fusion and unlock performance speed-ups for research code. These general purpose compilers are fully transparent to users and show promise for a wide range of applications. Nevertheless, they fall short on the type of geometric computations that are discussed above. This is most apparent for nearest neighbor search, matrix-vector products with kernel matrices and message passing methods on point clouds, where one still has to develop and maintain custom CUDA kernels to achieve state-of-the-art performance.
A unique position. KeOps intends to fix this specific but important problem with all the convenient features of a modern library. We present examples of applications in our gallery of tutorials and discuss its inner workings in our guided tour of the KeOps++ engine. As evidenced by our benchmarks, the KeOps routines outperform their standard counterparts by two orders of magnitude in many settings. On top of a reduced memory usage, they can thus bring a considerable speed-up to numerous methods in machine learning, computational physics and other applied fields.
Is KeOps going to speed up your program?
Strengths. At its heart, KeOps leverages the low Kolmogorov complexity of symbolic arrays: it can be used whenever the computational bottleneck of a method is an interaction step that fits a simple Map-Reduce template. In practice, it is thus likely to offer gains on runtime and memory usage when the formula $F(p, x_i, y_j)$ is compact and the numbers of samples $M$ and $N$ range from a few thousands to a few millions.
Limitations. On the other hand, the main limitations of KeOps stem from the overflow of CUDA registers in the computation of the formula $F(p, x_i, y_j)$: performance drops when the feature vectors $x_i$ and $y_j$ become high-dimensional, typically beyond a hundred dimensions.
Another drawback is that we do not pre-ship binaries but instead rely on C++/CUDA compilers to run our kernels. Fortunately, this weakness is now mitigated by the ubiquitous deployment of fast compilers built in e.g. the CUDA drivers. With the release of KeOps 2.0 in March 2022, installation and compilation issues have (mostly) become a thing of the past.
Main features
Feel free to browse through our gallery of tutorials for examples of applications. Among other features, KeOps supports:
- Non-radial kernels, neural networks and other arbitrary formulas.
- Most common reduction operations: Summations, stabilized LogSumExp reductions, Min, Max, ArgKMin, SoftMin, Softmax…
- Batch processing and block-wise sparsity masks.
- High-order derivatives with respect to all parameters and variables.
- The resolution of positive definite linear systems using a conjugate gradient solver.