# Conclusion

The previous sections uncovered the inner workings of the KeOps
library. After pages and pages of technical derivations,
**users can now reap the reward of our year-long investment in low-level software
engineering** through the user-friendly `LazyTensor` wrapper.

## History of the project

**Our starting point: Computational Anatomy.**
Back in 2017, we started working on the KeOps library to give our
esteemed colleagues in the (medical) shape analysis community easy
access to the CUDA routines of the
fShapes toolkit – a
Matlab toolbox that relies
extensively on Gaussian kernel products. This initial target was reached
pretty quickly: today, the reference Deformetrica
software – maintained by the Aramis Inria team
at the
ICM Institute for Brain and Spinal Cord –
is fully reliant on
the **PyTorch+KeOps framework**. Most of our collaborators use one of
the KeOps bindings to implement their shape processing pipelines.

As discussed in one of our tutorials,
new “LDDMM” codebases for
statistical shape modelling are ten times slimmer (and easier to
maintain!) than they were just three years ago: **graduate students can
now get started in days instead of months**, and we expect to witness much
progress in the field as research teams are relieved from the burden
of low-level C++ development. As far as our specialized community
of mathematicians is concerned, with more than 1,000 downloads per
month on the PyPI repository,
KeOps is already a success.

**Reaching a wider audience.**
In 2018-2019, after several interactions with colleagues at
**machine learning** and **optimal transport** conferences, we quickly realized that our
generic Map-Reduce engine could be used to **solve problems that go way
beyond neuro-anatomy**. Provided that some effort was made to improve the
general user experience, KeOps `LazyTensors` could be a game changer
for engineers and researchers in many applied fields.

**Today**, after months of patient re-packaging and documentation, KeOps
is a **fully-fledged open source library**
(MIT License) whose development
can be tracked on GitHub.
The Python bindings are easy to install through the PyPI
repository (`pip install pykeops`), with numerous examples available
in our galleries.

**Place of KeOps in the scientific ecosystem.**
The KeOps package has **no claim to set the state-of-the-art** in High
Performance Computing: when implemented properly, hand-written CUDA
schemes will always outperform naive GPU loops, be it for (approximate)
nearest neighbor search or B-spline interpolation.

However, as it combines a **reasonable level of performance** with the
**flexibility of a deep learning interface**, KeOps can unlock
research programs by significantly increasing the productivity of
developers. Allowing our colleagues in computational anatomy to **benefit**
from the “deep learning revolution” **without having to focus exclusively
on convolutional neural networks** was the main ambition of this work; we
now hope that this localized success can be replicated in **other fields**.

## Supported reductions and formulas

As discussed in our
introductory tutorials,
`LazyTensors` can be built
from **any valid NumPy array or PyTorch tensor** and support a wide
range of mathematical operations. Generic, broadcasted computations
define valid programs:

```
import torch
from pykeops.torch import LazyTensor
A, B, M, N, D = 7, 3, 100000, 200000, 10
x_i = LazyTensor( torch.randn(A, B, M, 1, D) ) # "i"-variable
l_i = LazyTensor( torch.randn(1, 1, M, 1, D) ) # "i"-variable
y_j = LazyTensor( torch.randn(1, B, 1, N, D) ) # "j"-variable
s = LazyTensor( torch.rand( A, 1, 1, 1, 1) ) # parameter
F_ij = (x_i ** 1.5 + y_j / l_i).cos() # Algebraic expression
F_ij = F_ij - (x_i | y_j) # Scalar product
F_ij = F_ij + (x_i[:,:,:,:,2] * s.relu() * y_j) # Indexing, ReLU activation
a_j = F_ij.sum(dim=2) # a_j.shape = [7, 3, 200000, 10]
```

`LazyTensors` fully support automatic differentiation – up to
arbitrary orders – as well as a decent collection of reduction
operations. On top of the `sum()`, `@` (matrix multiplication)
and `logsumexp()` operators which have already been discussed in
depth, users may rely on the `min()`, `argmin()`, `min_argmin()`,
`max()`, `argmax()`, `max_argmax()`, `Kmin(K=...)`,
`argKmin(K=...)` or `min_argKmin(K=...)` methods to implement
their algorithms.
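
For instance, here is a minimal sketch of these reductions in action; the
data, sizes and squared-distance formula below are placeholder choices,
not prescribed by the library:

```
import torch
from pykeops.torch import LazyTensor

M, N, D, K = 10000, 20000, 3, 10
x = torch.randn(M, D, requires_grad=True)  # query points
y = torch.randn(N, D)                      # database points

x_i = LazyTensor(x[:, None, :])    # (M, 1, D) "i"-variable
y_j = LazyTensor(y[None, :, :])    # (1, N, D) "j"-variable
D_ij = ((x_i - y_j) ** 2).sum(-1)  # symbolic (M, N) squared distances

# K-nearest neighbors: indices of the K smallest distances along "j":
indices = D_ij.argKmin(K, dim=1)   # (M, K) LongTensor

# Reductions are differentiable, like any other PyTorch operation:
loss = (-D_ij).logsumexp(dim=1).sum()
g_x, = torch.autograd.grad(loss, [x])  # (M, D) gradient with respect to x
```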

**Linear solver.**
Interestingly, KeOps also provides support for the resolution of
**large “mathematical” linear systems** – a critical operation in geology
(Kriging), imaging (splines), statistics (Gaussian Process regression)
and data science (kernel regression). Assuming that the `LazyTensor`
“**K_xx**” encodes a symmetric, positive definite matrix \(K_{xx}\),
the `solve()` method:

```
a_i = K_xx.solve(b_i, alpha=alpha)
```

returns the solution \(a^\star = (\alpha \operatorname{Id}~+~ K_{xx})^{-1}\, b\)
of the linear system “\((\alpha \operatorname{Id}~+~ K_{xx})\,a = b\)”, computed with a conjugate gradient scheme.
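
For instance, a full interpolation problem can be set up in a few lines.
The sketch below is illustrative: the Gaussian kernel, bandwidth `sigma`
and regularization strength `alpha` are arbitrary choices, not defaults
of the library:

```
import torch
from pykeops.torch import LazyTensor

N, D = 10000, 3
x = torch.randn(N, D)    # sample locations
b = torch.randn(N, 1)    # target values
sigma, alpha = 0.5, 0.1  # kernel bandwidth and regularization (arbitrary)

x_i = LazyTensor(x[:, None, :])  # (N, 1, D)
x_j = LazyTensor(x[None, :, :])  # (1, N, D)
# Symbolic, positive definite Gaussian kernel matrix K_xx:
K_xx = (-((x_i - x_j) ** 2).sum(-1) / (2 * sigma ** 2)).exp()

# Solve "(alpha * Id + K_xx) a = b" with a conjugate gradient scheme:
a = K_xx.solve(b, alpha=alpha)
```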

**Using KeOps as a backend for high-level libraries.**
Going further, as discussed in
our gallery,
`LazyTensors` can be neatly interfaced
with the high-quality solvers of the
SciPy
and
GPyTorch libraries. Preliminary
results with the maintainers of the latter already show remarkable
improvements over the state-of-the-art: re-running the benchmarks of
(Wang et al., 2019) with a new KeOps backend,
exact
Gaussian Process regressions that took **7 hours** to train on a cluster
of 8 top-drawer V100 GPUs (\(\texttt{3DRoad}\) dataset, \(\mathrm{N} = \texttt{278,319}\),
\(\mathrm{D} = \texttt{3}\))
can now be performed in **15 minutes** on a single gaming
chip, the Nvidia GeForce RTX 2080 Ti.
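
To give a flavour of the SciPy side of this bridge, the sketch below wraps
a NumPy `LazyTensor` as a matrix-free linear operator, following the
pattern of our gallery; the Gaussian kernel and problem sizes are
illustrative placeholders:

```
import numpy as np
from scipy.sparse.linalg import aslinearoperator, eigsh
from pykeops.numpy import LazyTensor

N, D = 10000, 3
x = np.random.randn(N, D)

x_i = LazyTensor(x[:, None, :])  # (N, 1, D)
x_j = LazyTensor(x[None, :, :])  # (1, N, D)
# Symbolic (N, N) Gaussian kernel matrix, never stored in memory:
K_xx = (-((x_i - x_j) ** 2).sum(-1)).exp()

K = aslinearoperator(K_xx)  # SciPy only ever sees matrix-vector products
eigenvalues, eigenvectors = eigsh(K, k=5)  # leading eigenpairs, matrix-free
```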

## Future works

Our gallery of tutorials
showcases an eclectic collection of
applications in machine learning, statistics, optimal transport theory
and computational anatomy.
We continue to work towards a **closer integration** with the **Python**
scientific stack and
will improve and implement **R** and **Julia** bindings in the months to come.
We also plan to implement boilerplate features such as
row- and column-wise indexing, block-wise definition of LazyTensors
and full support for tensor variables. **Additional low-level profiling**
should also help us converge towards **optimal runtimes**.

By making our routines freely available to the general public, we hope to
**help the applied maths community catch up with the state-of-the-art**
in computer science: **in 2019**, brute-force quadratic algorithms should
have no problem scaling up to **millions of samples** in minutes; clever
approximation schemes are only needed if users intend to perform
real-time analysis or scale to Gigabytes of data.

**Our long-term goal: fast approximation schemes.**
Long-term, our main challenge will be to reconcile KeOps with the
**rich literature** in numerical mathematics that focuses on fast
**approximation schemes for kernel dot products** – which are often
referred to as discrete **convolutions** in computational geometry or
discrete **integral operators** in physics.
Adapting ideas from the
Nyström,
Fast Multipole and
Fast & Free Memory Methods to GPU chips,
we hope to **let users trade time for accuracy** with
a simple `K.tol = 1e-3` interface by 2020.