Keops logo

Kernel Operations on the GPU, with autodiff, without memory overflows

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a x10-x100 speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.

The project is hosted on GitHub, under the permissive MIT license.
PyPi version Downloads R version RDownloads

Symbolic matrices

Why using KeOps? Math libraries represent most objects as matrices and tensors:

  • (a) Dense matrices. Variables are often encoded as dense numerical arrays \((M_{i,j}) = (M[i,j]) \in \mathbb{R}^{\mathrm{M}\times\mathrm{N}}\). This representation is convenient and well-supported, but also puts a heavy load on the memories of our computers. Unfortunately, large arrays are cumbersome to move around and may not even fit in RAM or GPU memories.

    In practice, this means that a majority of scientific programs are memory-bound. Run times for most neural networks and mathematical computations are not limited by the raw capabilities of our CPUs and CUDA cores, but by the time-consuming transfers of large arrays from memory circuits to arithmetic computing units.

  • (b) Sparse matrices. To work around this problem, a common solution is to rely on sparse matrices: tensors that have few non-zero coefficients. We represent these objects using lists of indices \((i_n,j_n)\) and values \(M_n = M_{i_n,j_n}\) that correspond to a small number of non-zero entries. Matrix-vector operations are then implemented with indexing methods and scattered memory accesses.

    This method is elegant and allows us to represent large arrays with a small memory footprint. But unfortunately, it does not stream well on GPUs: parallel computing devices are wired to perform block-wise memory accesses and have a hard time dealing with lists of random indices \((i_n,j_n)\). As a consequence, when compared with dense arrays, sparse encodings only speed up computations for matrices that have less than 1% non-zero coefficients. This restriction prevents sparse matrices from being truly useful outside of graph and mesh processing.

  • (c) Symbolic matrices. KeOps provides another solution to speed up tensor programs. Our key remark is that most of the large arrays that are used in machine learning and applied mathematics share a common mathematical structure. Distance matrices, kernel matrices, point cloud convolutions and attention layers can all be described as symbolic tensors: given two collections of vectors \((x_i)\) and \((y_j)\), their coefficients \(M_{i,j}\) at location \((i,j)\) are given by mathematical formulas \(F(x_i,y_j)\) that are evaluated on data samples \(x_i\) and \(y_j\).

    These objects are not “sparse” in the traditional sense… but can nevertheless be described efficiently using a mathematical formula \(F\) and relatively small data arrays \((x_i)\) and \((y_j)\). The main purpose of the KeOps library is to provide support for this abstraction with all the perks of a deep learning library:

    • A transparent interface with CPU and GPU integration.

    • Numerous tutorials and benchmarks.

    • Full support for automatic differentiation, batch processing and approximate computations.

In practice, KeOps symbolic tensors are both fast and memory-efficient. We take advantage of the structure of CUDA registers to bypass costly memory transfers between arithmetic units and memory circuits. This allows us to provide a x10-x100 speed-up to PyTorch GPU programs in a wide range of settings.

Using our Python interface, a typical sample of code looks like:

# Create two arrays with 3 columns and a (huge) number of lines, on the GPU
import torch  # NumPy, Matlab and R are also supported
M, N, D = 1000000, 2000000, 3
x = torch.randn(M, D, requires_grad=True).cuda()  # x.shape = (1e6, 3)
y = torch.randn(N, D).cuda()                      # y.shape = (2e6, 3)

# Turn our dense Tensors into KeOps symbolic variables with "virtual"
# dimensions at positions 0 and 1 (for "i" and "j" indices):
from pykeops.torch import LazyTensor
x_i = LazyTensor(x.view(M, 1, D))  # x_i.shape = (1e6, 1, 3)
y_j = LazyTensor(y.view(1, N, D))  # y_j.shape = ( 1, 2e6,3)

# We can now perform large-scale computations, without memory overflows:
D_ij = ((x_i - y_j)**2).sum(dim=2)  # Symbolic (1e6,2e6,1) matrix of squared distances
K_ij = (- D_ij).exp()               # Symbolic (1e6,2e6,1) Gaussian kernel matrix

# We come back to vanilla PyTorch Tensors or NumPy arrays using
# reduction operations such as .sum(), .logsumexp() or .argmin()
# on one of the two "symbolic" dimensions 0 and 1.
# Here, the kernel density estimation   a_i = sum_j exp(-|x_i-y_j|^2)
# is computed using a CUDA scheme that has a linear memory footprint and
# outperforms standard PyTorch implementations by two orders of magnitude.
a_i = K_ij.sum(dim=1)  # Genuine torch.cuda.FloatTensor, a_i.shape = (1e6, 1),

# Crucially, KeOps fully supports automatic differentiation!
g_x = torch.autograd.grad((a_i ** 2).sum(), [x])

KeOps allows you to get the most out of your hardware without compromising on usability. It provides:

  • Linear (instead of quadratic) memory footprint for numerous types of computations.

  • Support for a wide range of mathematical formulas that can be composed at will.

  • Seamless computation of derivatives and gradients, up to arbitrary orders.

  • Sum, LogSumExp, Min, Max but also ArgMin, ArgMax or K-min reductions.

  • A conjugate gradient solver for large-scale spline interpolation and Gaussian process regression.

  • Transparent integration with standard packages, such as the SciPy solvers for linear algebra.

  • An interface for block-sparse and coarse-to-fine strategies.

  • Support for multi GPU configurations.

More details are provided below:

Projects using KeOps

Symbolic matrices are to geometric learning what sparse matrices are to graph processing.

KeOps can thus be used in a wide range of settings, from shape analysis (registration, geometric deep learning, optimal transport…) to machine learning (kernel methods, k-means, UMAP…), Gaussian processes, computational biology and physics. Among other projects, KeOps provides core routines for the following packages:

  • GPyTorch (from the universities of Cornell, Columbia, Pennsylvania) and Falkon (from the university of Genoa and the Sierra Inria team), two libraries for Gaussian Process regression that now scale up to billion-scale datasets.

  • Deformetrica, a computational anatomy software from the Aramis Inria team.

  • The Gudhi library for topological data analysis and higher dimensional geometry understanding, from the DataShape Inria team.

  • GeomLoss, a PyTorch package for Chamfer (Hausdorff) distances, Kernel (Sobolev) divergences and Earth Mover’s (Wasserstein) distances. It provides optimal transport solvers that scale up to millions of samples in seconds.

  • The deep graph matching consensus module, for learning and refining structural correspondences between graphs.

  • FshapesTk and the Shapes toolbox, two research-oriented LDDMM toolkits.

Licensing, citation, academic use

This library is licensed under the permissive MIT license, which is fully compatible with both academic and commercial applications.

If you use this code in a research paper, please cite our original publication:

  author  = {Benjamin Charlier and Jean Feydy and Joan Alexis Glaunès and François-David Collin and Ghislain Durif},
  title   = {Kernel Operations on the GPU, with Autodiff, without Memory Overflows},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {74},
  pages   = {1-6},
  url     = {}


Charlier, B., Feydy, J., Glaunès, J. A., Collin, F.-D. & Durif, G. Kernel Operations on the GPU, with Autodiff, without Memory Overflows. Journal of Machine Learning Research 22, 1–6 (2021).

For applications to geometric (deep) learning, you may also consider our NeurIPS 2020 paper:

  title={Fast geometric learning with symbolic matrices},
  author={Feydy, Jean and Glaun{\`e}s, Joan and Charlier, Benjamin and Bronstein, Michael},
  journal={Advances in Neural Information Processing Systems},


Please contact us for any bug report, question or feature request by filing a report on our GitHub issue tracker!

Core library - KeOps, PyKeOps, KeOpsLab:

R bindings - RKeOps:


Beyond explicit code contributions, KeOps has grown out of numerous discussions with applied mathematicians and machine learning experts. We would especially like to thank Alain Trouvé, Stanley Durrleman, Gabriel Peyré and Michael Bronstein for their valuable suggestions and financial support.

This website was rendered on the Jean Zay scientific cluster with a Nvidia A100 GPU and 8 cores of an AMD Milan EPYC 7543. Access to the HPC resources of IDRIS was granted to Jean Feydy under the allocation 2022-AD011013722 made by GENCI.

KeOps was awarded an open science prize by the French Ministry of Higher Education and Research in 2023 (“Espoir - Documentation”).

Table of contents