Genred
Summary
This section contains the full API documentation for the PyTorch Generic reductions, with full support of PyTorch’s torch.autograd
engine.
Genred: Creates a new generic operation.
Genred.__init__: Instantiates a new generic operation.
Genred.__call__: Applies the routine to arbitrary torch Tensors.
Syntax
 class pykeops.torch.Genred
Creates a new generic operation.
This is KeOps’ main function, whose usage is documented in the user guide, the gallery of examples and the high-level tutorials. Taking as input a handful of strings and integers that specify a custom MapReduce operation, it returns a C++ wrapper that can be called just like any other PyTorch function.
Note
Genred() is fully compatible with PyTorch’s autograd engine: you can backpropagate through the KeOps __call__() just as if it were a vanilla PyTorch operation (except for Min or Max reduction types, see reductions).
Example
>>> import torch
>>> from pykeops.torch import Genred
>>> my_conv = Genred('Exp(-SqNorm2(x - y))',  # formula
...                  ['x = Vi(3)',            # 1st input: dim-3 vector per line
...                   'y = Vj(3)'],           # 2nd input: dim-3 vector per column
...                  reduction_op='Sum',      # we also support LogSumExp, Min, etc.
...                  axis=1)                  # reduce along the lines of the kernel matrix
>>> # Apply it to 2d arrays x and y with 3 columns and a (huge) number of lines
>>> x = torch.randn(1000000, 3, requires_grad=True).cuda()
>>> y = torch.randn(2000000, 3).cuda()
>>> a = my_conv(x, y)  # a_i = sum_j exp(-|x_i-y_j|^2)
>>> print(a.shape)
torch.Size([1000000, 1])
>>> [g_x] = torch.autograd.grad((a ** 2).sum(), [x])  # KeOps supports autograd!
>>> print(g_x.shape)
torch.Size([1000000, 3])
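For intuition, the reduction performed in the example above can be written as a dense double loop. The sketch below is plain Python and does not call KeOps; the helper name `gaussian_conv_reference` is hypothetical and only illustrates the formula a_i = sum_j exp(-|x_i - y_j|^2) on tiny inputs, returning one row per point x_i to mirror the (M, 1) output shape.

```python
import math

def gaussian_conv_reference(x, y):
    """Dense reference for a_i = sum_j exp(-|x_i - y_j|^2).

    x and y are lists of equal-length coordinate lists. Hypothetical helper,
    for illustration only: it loops over every (i, j) pair explicitly.
    """
    out = []
    for x_i in x:                      # one output row per "i" index (axis=1)
        acc = 0.0
        for y_j in y:                  # reduction over the "j" index
            sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, y_j))
            acc += math.exp(-sq_dist)
        out.append([acc])              # keep a trailing dimension of size 1
    return out

x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
y = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
a = gaussian_conv_reference(x, y)
print(len(a), len(a[0]))  # 2 1
```

A loop of this kind is only practical for small M and N; it is meant as a way to check results on toy data, not as a substitute for the compiled routine.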
 __init__(formula, aliases, reduction_op='Sum', axis=0, dtype=None, opt_arg=None, formula2=None, cuda_type=None, dtype_acc='auto', use_double_acc=False, sum_scheme='auto', enable_chunks=True, rec_multVar_highdim=False, use_fast_math=True)
Instantiate a new generic operation.
Note
Genred relies on C++ or CUDA kernels that are compiled on-the-fly, and stored in a cache directory as shared libraries (“.so” files) for later use.
 Parameters:
formula (string) – The scalar- or vector-valued expression that should be computed and reduced. The correct syntax is described in the documentation, using appropriate mathematical operations.
aliases (list of strings) – A list of identifiers of the form "AL = TYPE(DIM)" that specify the categories and dimensions of the input variables. Here:
- AL is an alphanumerical alias, used in the formula.
- TYPE is a category. One of:
  - Vi: indexation by \(i\) along axis 0.
  - Vj: indexation by \(j\) along axis 1.
  - Pm: no indexation, the input tensor is a vector and not a 2d array.
- DIM is an integer, the dimension of the current variable.
As described below, __call__() will expect as input Tensors whose shapes are compatible with aliases.
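The "AL = TYPE(DIM)" convention described above can be illustrated with a small parser. This is a minimal sketch in plain Python: `parse_alias` is a hypothetical helper, not part of the KeOps API, and it only accepts the exact form described in this section.

```python
import re

# Hypothetical helper (not a KeOps function): parse "AL = TYPE(DIM)" strings.
_ALIAS_RE = re.compile(r"^\s*(\w+)\s*=\s*(Vi|Vj|Pm)\(\s*(\d+)\s*\)\s*$")

def parse_alias(alias):
    """Return (name, category, dim) for a string such as 'x = Vi(3)'."""
    match = _ALIAS_RE.match(alias)
    if match is None:
        raise ValueError(f"not of the form 'AL = TYPE(DIM)': {alias!r}")
    name, category, dim = match.groups()
    return name, category, int(dim)

aliases = ["x = Vi(3)", "y = Vj(3)", "s = Pm(1)"]
parsed = [parse_alias(a) for a in aliases]
print(parsed)  # [('x', 'Vi', 3), ('y', 'Vj', 3), ('s', 'Pm', 1)]
```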
 Keyword Arguments:
reduction_op (string, default = "Sum") – Specifies the reduction operation that is applied to reduce the values of formula(x_i, y_j, ...) along axis 0 or axis 1. The supported values are listed in Reductions.
axis (int, default = 0) – Specifies the dimension of the “kernel matrix” that is reduced by our routine. The supported values are:
- axis = 0: reduction with respect to \(i\), outputs a Vj or “\(j\)” variable.
- axis = 1: reduction with respect to \(j\), outputs a Vi or “\(i\)” variable.
opt_arg (int, default = None) – If reduction_op is in ["KMin", "ArgKMin", "KMin_ArgKMin"], this argument allows you to specify the number K of neighbors to consider.
dtype_acc (string, default "auto") – Type of the accumulator used for the reduction, before casting to dtype. A wider accumulator improves the accuracy of results on large data, but is slower. The default value “auto” sets this option to the value of dtype. The supported values are:
- dtype_acc = "float16": allowed only if dtype is “float16”.
- dtype_acc = "float32": allowed only if dtype is “float16” or “float32”.
- dtype_acc = "float64": allowed only if dtype is “float32” or “float64”.
use_double_acc (bool, default False) – Same as setting dtype_acc="float64" (only one of the two options can be set). If True, accumulate results of the reduction in float64 variables, before casting to float32. This can only be set to True when data is in float32 or float64. It improves the accuracy of results on large data, but is slower.
sum_scheme (string, default "auto") – Method used to sum up results for reductions. The default value “auto” sets this option to “block_sum”. Possible values are:
- sum_scheme = "direct_sum": direct summation.
- sum_scheme = "block_sum": use an intermediate accumulator in each block before accumulating in the output. This improves accuracy for large sized data.
- sum_scheme = "kahan_scheme": use the Kahan summation algorithm to compensate for round-off errors. This improves accuracy for large sized data.
enable_chunks (bool, default True) – For GPU mode only: enable automatic selection of a special “chunked” computation mode for accelerating reductions with formulas involving large dimension variables.
rec_multVar_highdim (bool, default False) – For GPU mode only: enable a special “final chunked” computation mode for accelerating reductions with formulas involving large dimension variables. Beware: this will only work if the formula has the very special form that allows such a computation mode.
use_fast_math (bool, default True) – Enables the use_fast_math CUDA option.
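To see why the sum_scheme values described above differ in accuracy, the three strategies can be mimicked in plain Python. This is only an analogy under stated assumptions: the function names and the block size of 1000 are hypothetical, the code runs on the CPU in float64, and it is not how KeOps implements these schemes in its compiled kernels.

```python
import math

def direct_sum(values):
    """Accumulate every term straight into the running total."""
    total = 0.0
    for v in values:
        total += v
    return total

def block_sum(values, block_size=1000):
    """Accumulate each block separately, then add the block total to the output."""
    total = 0.0
    for start in range(0, len(values), block_size):
        partial = 0.0
        for v in values[start:start + block_size]:
            partial += v
        total += partial
    return total

def kahan_sum(values):
    """Kahan summation: carry a compensation term for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

# One large term followed by many terms too small to register individually.
values = [1.0] + [1e-16] * 100_000
exact = math.fsum(values)  # correctly rounded reference sum
for name, fn in [("direct_sum", direct_sum),
                 ("block_sum", block_sum),
                 ("kahan_scheme", kahan_sum)]:
    print(f"{name:>12}: abs error = {abs(fn(values) - exact):.3e}")
```

On this input the direct sum never moves away from 1.0, while the blocked and compensated variants recover most or all of the small contributions, at the cost of extra work per term.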
 __call__(*args, backend='auto', device_id=-1, ranges=None, out=None)
Applies the routine to arbitrary torch Tensors.
 Parameters:
*args (2d Tensors (variables Vi(..), Vj(..)) and 1d Tensors (parameters Pm(..))) – The input numerical arrays, which should all have the same dtype, be contiguous and be stored on the same device. KeOps expects one array per alias, with the following compatibility rules:
- All Vi(Dim_k) variables are encoded as 2d-tensors with Dim_k columns and the same number of lines \(M\).
- All Vj(Dim_k) variables are encoded as 2d-tensors with Dim_k columns and the same number of lines \(N\).
- All Pm(Dim_k) variables are encoded as 1d-tensors (vectors) of size Dim_k.
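The compatibility rules above can be checked before calling the routine. The following sketch works on plain shape tuples (for a torch Tensor `t`, one could pass `tuple(t.shape)`); `check_shapes` is a hypothetical helper, not a KeOps function, and it only covers the shape rules, not dtype, contiguity or device.

```python
def check_shapes(signature, shapes):
    """Validate shapes against (category, dim) pairs and return (M, N).

    signature: list of pairs such as ("Vi", 3), one per alias.
    shapes: list of shape tuples, in the same order as the aliases.
    An entry of the result is None when no Vi (resp. Vj) variable is present.
    """
    n_lines = {"Vi": None, "Vj": None}
    for (category, dim), shape in zip(signature, shapes):
        shape = tuple(shape)
        if category == "Pm":
            # Parameters are 1d vectors of size dim.
            if shape != (dim,):
                raise ValueError(f"Pm({dim}) expects shape ({dim},), got {shape}")
        elif category in n_lines:
            # Variables are 2d arrays with dim columns.
            if len(shape) != 2 or shape[1] != dim:
                raise ValueError(f"{category}({dim}) expects shape (*, {dim}), got {shape}")
            if n_lines[category] is None:
                n_lines[category] = shape[0]
            elif n_lines[category] != shape[0]:
                raise ValueError(f"all {category} variables must have the same number of lines")
        else:
            raise ValueError(f"unknown category {category!r}")
    return n_lines["Vi"], n_lines["Vj"]

sizes = check_shapes([("Vi", 3), ("Vj", 3), ("Pm", 1)],
                     [(1000, 3), (2000, 3), (1,)])
print(sizes)  # (1000, 2000)
```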
 Keyword Arguments:
backend (string) – Specifies the map-reduce scheme. The supported values are:
- "auto" (default): let KeOps decide which backend is best suited to your data, based on the tensors’ shapes. "GPU_1D" will be chosen in most cases.
- "CPU": use a simple C++ for loop on a single CPU core.
- "GPU_1D": use a simple multithreading scheme on the GPU (basically, one thread per value of the output index).
- "GPU_2D": use a more sophisticated 2D parallelization scheme on the GPU.
- "GPU": let KeOps decide which one of the "GPU_1D" or the "GPU_2D" scheme will run faster on the given input.
device_id (int, default=-1) – Specifies the GPU that should be used to perform the computation; a negative value lets your system choose the default GPU. This parameter is only useful if your system has access to several GPUs.
ranges (6-uple of IntTensors, None by default) – Ranges of integers that specify a block-sparse reduction scheme with Mc clusters along axis 0 and Nc clusters along axis 1. If None (default), we simply loop over all indices \(i\in[0,M)\) and \(j\in[0,N)\).
The first three ranges will be used if axis = 1 (reduction along the axis of “\(j\) variables”), and to compute gradients with respect to Vi(..) variables:
- ranges_i, (Mc,2) IntTensor: slice indices \([\operatorname{start}^I_k,\operatorname{end}^I_k)\) in \([0,M]\) that specify our Mc blocks along the axis 0 of “\(i\) variables”.
- slices_i, (Mc,) IntTensor: consecutive slice indices \([\operatorname{end}^S_1, ..., \operatorname{end}^S_{M_c}]\) that specify Mc ranges \([\operatorname{start}^S_k,\operatorname{end}^S_k)\) in redranges_j, with \(\operatorname{start}^S_k = \operatorname{end}^S_{k-1}\). The first 0 is implicit, meaning that \(\operatorname{start}^S_0 = 0\), and we typically expect that slices_i[-1] == len(redranges_j).
- redranges_j, (Mcc,2) IntTensor: slice indices \([\operatorname{start}^J_\ell,\operatorname{end}^J_\ell)\) in \([0,N]\) that specify reduction ranges along the axis 1 of “\(j\) variables”.
If axis = 1, these integer arrays allow us to say that for k in range(Mc), the output values for indices i in range( ranges_i[k,0], ranges_i[k,1] ) should be computed using a MapReduce scheme over indices j in Union( range( redranges_j[l, 0], redranges_j[l, 1] )) for l in range( slices_i[k-1], slices_i[k] ).
Likewise, the last three ranges will be used if axis = 0 (reduction along the axis of “\(i\) variables”), and to compute gradients with respect to Vj(..) variables:
- ranges_j, (Nc,2) IntTensor: slice indices \([\operatorname{start}^J_k,\operatorname{end}^J_k)\) in \([0,N]\) that specify our Nc blocks along the axis 1 of “\(j\) variables”.
- slices_j, (Nc,) IntTensor: consecutive slice indices \([\operatorname{end}^S_1, ..., \operatorname{end}^S_{N_c}]\) that specify Nc ranges \([\operatorname{start}^S_k,\operatorname{end}^S_k)\) in redranges_i, with \(\operatorname{start}^S_k = \operatorname{end}^S_{k-1}\). The first 0 is implicit, meaning that \(\operatorname{start}^S_0 = 0\), and we typically expect that slices_j[-1] == len(redranges_i).
- redranges_i, (Ncc,2) IntTensor: slice indices \([\operatorname{start}^I_\ell,\operatorname{end}^I_\ell)\) in \([0,M]\) that specify reduction ranges along the axis 0 of “\(i\) variables”.
If axis = 0, these integer arrays allow us to say that for k in range(Nc), the output values for indices j in range( ranges_j[k,0], ranges_j[k,1] ) should be computed using a MapReduce scheme over indices i in Union( range( redranges_i[l, 0], redranges_i[l, 1] )) for l in range( slices_j[k-1], slices_j[k] ).
out (2d Tensor, None by default) – The output numerical array, for in-place computation. If provided, the output array should have the same dtype as the arguments, be contiguous and be stored on the same device. Moreover, it should have the correct shape for the output.
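The block-sparse semantics of the ranges argument for axis = 1 can be sketched as a dense reference loop. The code below is plain Python with a hypothetical helper name (`block_sparse_sum`) and a toy kernel matrix of ones; it only illustrates how ranges_i, slices_i and redranges_j select which (i, j) pairs contribute, and is not the KeOps implementation.

```python
def block_sparse_sum(kernel, ranges_i, slices_i, redranges_j):
    """Reference for axis=1: for each i-block k, sum kernel[i][j] over the
    j-ranges redranges_j[l] with l in [slices_i[k-1], slices_i[k]).

    kernel is a dense M-by-N list of rows, used here only for illustration.
    """
    out = [0.0] * len(kernel)
    start = 0                                    # implicit leading 0 of slices_i
    for k, (i_begin, i_end) in enumerate(ranges_i):
        end = slices_i[k]
        for i in range(i_begin, i_end):          # output indices of block k
            for l in range(start, end):          # reduction ranges attached to block k
                j_begin, j_end = redranges_j[l]
                out[i] += sum(kernel[i][j_begin:j_end])
        start = end
    return out

# Toy 4-by-6 kernel of ones, with Mc = 2 blocks along the "i" axis.
kernel = [[1.0] * 6 for _ in range(4)]
ranges_i = [(0, 2), (2, 4)]                # i-blocks [0, 2) and [2, 4)
slices_i = [1, 3]                          # block 0 uses l = 0; block 1 uses l = 1, 2
redranges_j = [(0, 3), (0, 2), (4, 6)]     # j-ranges; len(redranges_j) == slices_i[-1]
result = block_sparse_sum(kernel, ranges_i, slices_i, redranges_j)
print(result)  # [3.0, 3.0, 4.0, 4.0]
```

The axis = 0 case is symmetric: swap the roles of i and j and use ranges_j, slices_j and redranges_i instead.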
 Returns:
The output of the reduction, stored on the same device as the input Tensors. The output of a Genred call is always a 2d-tensor with \(M\) or \(N\) lines (if axis = 1 or axis = 0, respectively) and a number of columns that is inferred from the formula.
 Return type:
(M,D) or (N,D) Tensor