Design choices
Developing versatile CUDA libraries. As discussed at the end of our presentation of the GPU_1D scheme, supporting generic reductions and automatic differentiation within KeOps was a daunting challenge. Let us briefly explain why.
The overwhelming majority of Deep Learning frameworks and contributed packages follow a simple “one Python operation = one CUDA program” development paradigm – with the notable exception of projects such as TensorComprehensions. Following the usual practice, each PyTorch operation is linked to a pre-compiled binary program that implements a specific computation: a sum, a matrix-vector product or whatever. With every Python instruction, these routines are simply executed on the “.data” attributes (raw C++ arrays) of the relevant variables.
Back in 2017, in early KeOps releases, this is how we first implemented the Gaussian kernel dot product and its derivatives of order 1 and 2 : with explicit “gaussian_dot.cu”, “gaussian_dot_grad_x.cu”, “gaussian_dot_grad_xx.cu” CUDA files. Once the basics are understood, writing by hand an ad hoc CUDA program for every specific type of operation is not too difficult.
The KeOps symbolic engine
On the other hand, allowing KeOps users to define and reduce their own formulas without having to write a single line of C++ code is a much more challenging target. One way or another, the specification of a new symbolic computation in a Python script has to result in the injection of lightweight, efficient C++ code right inside the main KeOps CUDA loop, at steps 3.3.2-3.
As of 2019, supporting the dynamic generation of efficient CUDA code for Deep Learning scripts is the major challenge that the TensorFlow (Google) and PyTorch (Facebook) development teams strive to tackle. Usually referred to as Just In Time (JIT) compilation, this feature is now partially supported by the latest versions of PyTorch and TensorFlow, through an “Accelerated Linear Algebra” (XLA) backend. Asking users to accept some restrictions on the structure of their Python scripts, experimental JIT compilers attempt to fuse consecutive CUDA routines in order to optimize the use of resources and buffers.
With limited means – three part-time developers, mathematicians by trade – KeOps achieves this target in the controlled setting of symbolic Map-Reduce schemes. The well-documented constraints that are put on the development of academic software projects guide our main design decisions:
To ensure the portability and long-term maintainability of the KeOps library, we perform the syntactic analysis of user-defined formulas without any dependency on external “math processing” engines.
To make sure that KeOps is able to reach users outside of our own microcosm, its core logic must be written in pure C++ – without any hard dependency on Python. Today, most KeOps users access it through its PyTorch and NumPy interfaces… But future R or Julia bindings may well have a more lasting impact.
Working with variadic templates
To achieve our goals whilst abiding by these constraints, we chose to rely on the power of modern C++/CUDA compilers. Leveraging expressive meta-programming instructions that were introduced by the C++11 revision, the keops/core/ folder effectively implements a small but robust math engine within the C++ templating system.
Letting a general-purpose tool such as nvcc or clang handle the parsing and (most of) the low-level optimization of our code may look like a rash decision. But in practice, the graph-based optimizers of modern C++ compilers are now so efficient that we never felt like going back: with the help of a few simplification rules – encoded as template specializations – the binaries generated by the KeOps generic engine perform just as well as clever hand-written CUDA kernels. With foundations that rely on standard, well-tested tools that are now too big to fail, we expect to be able to maintain KeOps for many years to come.