Benchmarking Gaussian convolutions in high dimensions

Let’s compare the performances of PyTorch and KeOps on simple Gaussian RBF kernel products, as the dimension grows.

Setup

import importlib
import os
import time

import numpy as np
import torch
from matplotlib import pyplot as plt

from benchmark_utils import flatten, random_normal, full_benchmark

use_cuda = torch.cuda.is_available()

Benchmark specifications:

N = 10000  # Number of samples
# Dimensions to test:
Dims = [1, 3, 5, 10, 20, 30, 50, 80, 100, 120, 150, 200, 300, 500, 1000, 2000, 3000]

Synthetic dataset.

def generate_samples(D, device="cuda", lang="torch", batchsize=1, **kwargs):
    """Generates two point clouds x, y and a scalar signal b of size N.

    Args:
        D (int): dimension of the ambient space.
        device (str, optional): "cuda", "cpu", etc. Defaults to "cuda".
        lang (str, optional): "torch", "numpy", etc. Defaults to "torch".
        batchsize (int, optional): number of experiments to run in parallel. Defaults to None.

    Returns:
        3-uple of arrays: x, y, b
    """
    randn = random_normal(device=device, lang=lang)

    x = randn((batchsize, N, D))
    y = randn((batchsize, N, D))
    b = randn((batchsize, N, 1))

    return x, y, b

Define a simple Gaussian RBF product, using a tensorized implementation. Note that expanding the squared norm \(\|x-y\|^2\) as a sum \(\|x\|^2 - 2 \langle x, y \rangle + \|y\|^2\) allows us to leverage the fast matrix-matrix product of the BLAS/cuBLAS libraries.

def gaussianconv_pytorch(x, y, b, **kwargs):
    """(B,N,D), (B,N,D), (B,N,1) -> (B,N,1)"""

    D_xx = (x * x).sum(-1).unsqueeze(2)  # (B,N,1)
    D_xy = torch.matmul(x, y.permute(0, 2, 1))  # (B,N,D) @ (B,D,M) = (B,N,M)
    D_yy = (y * y).sum(-1).unsqueeze(1)  # (B,1,M)
    D_xy = D_xx - 2 * D_xy + D_yy  # (B,N,M)
    K_xy = (-D_xy).exp()  # (B,N,M)

    return K_xy @ b  # (B,N,1)

Define a simple Gaussian RBF product, using an online implementation:

from pykeops.torch import generic_sum


def gaussianconv_keops(x, y, b, backend="GPU", **kwargs):
    D = x.shape[-1]
    fun = generic_sum(
        "Exp(X|Y) * B",  # Formula
        "A = Vi(1)",  # Output
        "X = Vi({})".format(D),  # 1st argument
        "Y = Vj({})".format(D),  # 2nd argument
        "B = Vj(1)",  # 3rd argument
    )
    ex = (-(x * x).sum(-1)).exp()[:, :, None]
    ey = (-(y * y).sum(-1)).exp()[:, :, None]
    return ex * fun(2 * x, y, b * ey, backend=backend)

Same, but without the chunked computation mode:

def gaussianconv_keops_nochunks(x, y, b, backend="GPU", **kwargs):
    D = x.shape[-1]
    fun = generic_sum(
        "Exp(X|Y) * B",  # Formula
        "A = Vi(1)",  # Output
        "X = Vi({})".format(D),  # 1st argument
        "Y = Vj({})".format(D),  # 2nd argument
        "B = Vj(1)",  # 3rd argument
        enable_chunks=False,
    )
    ex = (-(x * x).sum(-1)).exp()[:, :, None]
    ey = (-(y * y).sum(-1)).exp()[:, :, None]
    return ex * fun(2 * x, y, b * ey, backend=backend)

PyTorch vs. KeOps (Gpu)

routines = [
    (gaussianconv_pytorch, "PyTorch (GPU)", {}),
    (gaussianconv_keops_nochunks, "KeOps < 1.4.2 (GPU)", {}),
    (gaussianconv_keops, "KeOps >= 1.4.2 (GPU)", {}),
]

full_benchmark(
    f"Gaussian Matrix-Vector products in high dimension, with N={N:,} (GPU)",
    routines,
    generate_samples,
    problem_sizes=Dims,
    xlabel="Dimension of the points",
)


plt.show()
Gaussian Matrix-Vector products in high dimension, with N=10,000 (GPU)

Out:

Benchmarking : Gaussian Matrix-Vector products in high dimension, with N=10,000 (GPU) ===============================
PyTorch (GPU) -------------
  1x100 loops of size         1:   1x100x0.009003s
  1x 10 loops of size         3:   1x 10x0.009006s
  1x 10 loops of size         5:   1x 10x0.009011s
  1x 10 loops of size        10:   1x 10x0.008990s
  1x 10 loops of size        20:   1x 10x0.009082s
  1x 10 loops of size        30:   1x 10x0.009244s
  1x 10 loops of size        50:   1x 10x0.009667s
  1x 10 loops of size        80:   1x 10x0.009531s
  1x 10 loops of size       100:   1x 10x0.009871s
  1x 10 loops of size       120:   1x 10x0.010059s
  1x 10 loops of size       150:   1x 10x0.010541s
  1x 10 loops of size       200:   1x 10x0.011160s
  1x 10 loops of size       300:   1x 10x0.012691s
  1x 10 loops of size       500:   1x 10x0.016053s
  1x 10 loops of size     1,000:   1x 10x0.023758s
  1x  1 loops of size     2,000:   1x  1x0.040664s
  1x  1 loops of size     3,000:   1x  1x0.057225s
KeOps < 1.4.2 (GPU) -------------
Compiling libKeOpstorch87af0eab60 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,1); Y = Vj(1,1); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         1:   1x100x0.000769s
Compiling libKeOpstorch7ca52f1064 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,3); Y = Vj(1,3); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         3:   1x100x0.000901s
Compiling libKeOpstorchc90d2006c2 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,5); Y = Vj(1,5); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         5:   1x100x0.001001s
Compiling libKeOpstorchb7ca7e2c2a in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,10); Y = Vj(1,10); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size        10:   1x100x0.001610s
Compiling libKeOpstorcha2382f7584 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,20); Y = Vj(1,20); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size        20:   1x100x0.002453s
Compiling libKeOpstorch99a789f624 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,30); Y = Vj(1,30); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        30:   1x 10x0.003727s
Compiling libKeOpstorch0624cdfdb4 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,50); Y = Vj(1,50); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        50:   1x 10x0.005852s
Compiling libKeOpstorchb61084bc42 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,80); Y = Vj(1,80); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        80:   1x 10x0.008205s
Compiling libKeOpstorch1886501f5a in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,100); Y = Vj(1,100); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       100:   1x 10x0.013620s
Compiling libKeOpstorch514782b410 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,120); Y = Vj(1,120); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       120:   1x 10x0.016492s
Compiling libKeOpstorch96e84940c5 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,150); Y = Vj(1,150); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       150:   1x 10x0.017452s
Compiling libKeOpstorchfb9bb9ddab in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,200); Y = Vj(1,200); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       200:   1x 10x0.024536s
Compiling libKeOpstorchdcb8531a88 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,300); Y = Vj(1,300); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size       300:   1x  1x0.164111s
Compiling libKeOpstorchb186e32e33 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,500); Y = Vj(1,500); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size       500:   1x  1x0.511857s
Compiling libKeOpstorch9c17343b87 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,1000); Y = Vj(1,1000); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size     1,000:   1x  1x2.781254s
Compiling libKeOpstorch1704d5c180 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,2000); Y = Vj(1,2000); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size     2,000:   1x  1x16.039861s
**
Too slow !
KeOps >= 1.4.2 (GPU) -------------
Compiling libKeOpstorchd0354445da in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,1); Y = Vj(1,1); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         1:   1x100x0.000754s
Compiling libKeOpstorch72e8e950d7 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,3); Y = Vj(1,3); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         3:   1x100x0.000845s
Compiling libKeOpstorchf54260d8e0 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,5); Y = Vj(1,5); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size         5:   1x100x0.000964s
Compiling libKeOpstorch9fc9d3c9c9 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,10); Y = Vj(1,10); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size        10:   1x100x0.001604s
Compiling libKeOpstorchb5317d4f56 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,20); Y = Vj(1,20); B = Vj(2,1);
       dtype  : float32
... Done.
  1x100 loops of size        20:   1x100x0.002541s
Compiling libKeOpstorch087bb1966d in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,30); Y = Vj(1,30); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        30:   1x 10x0.003696s
Compiling libKeOpstorch2d5543fa75 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,50); Y = Vj(1,50); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        50:   1x 10x0.005827s
Compiling libKeOpstorch42e1b72653 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,80); Y = Vj(1,80); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size        80:   1x 10x0.008191s
Compiling libKeOpstorchd54aed8b9c in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,100); Y = Vj(1,100); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       100:   1x 10x0.014915s
Compiling libKeOpstorchef33c8380f in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,120); Y = Vj(1,120); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       120:   1x 10x0.017933s
Compiling libKeOpstorch1350140370 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,150); Y = Vj(1,150); B = Vj(2,1);
       dtype  : float32
... Done.
  1x 10 loops of size       150:   1x 10x0.020148s
Compiling libKeOpstorch5ff25d912f in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,200); Y = Vj(1,200); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size       200:   1x  1x0.026719s
Compiling libKeOpstorcheccb711a8c in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,300); Y = Vj(1,300); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size       300:   1x  1x0.038495s
Compiling libKeOpstorchdeeda41316 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,500); Y = Vj(1,500); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size       500:   1x  1x0.062088s
Compiling libKeOpstorch592e3a18c2 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,1000); Y = Vj(1,1000); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size     1,000:   1x  1x0.110568s
Compiling libKeOpstorchd539dcbf58 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,2000); Y = Vj(1,2000); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size     2,000:   1x  1x0.193958s
Compiling libKeOpstorch6e1dfcb773 in /data/jean/keops/pykeops/build:
       formula: Sum_Reduction(Exp(X|Y) * B,0)
       aliases: X = Vi(0,3000); Y = Vj(1,3000); B = Vj(2,1);
       dtype  : float32
... Done.
  1x  1 loops of size     3,000:   1x  1x0.289959s

Total running time of the script: ( 17 minutes 10.876 seconds)

Gallery generated by Sphinx-Gallery