# High-level frameworks

To provide a transparent interface for the backpropagation algorithm, deep learning libraries rely on three core modules:

**A comprehensive list of operations**, with forward and backward routines implemented*next to each other*.**Efficient GPU and CPU backends**for those routines, allowing users to take advantage of their hardware without having to write a single line of C++.**A high-level graph manipulation engine**for symbolic computations, which executes the backpropagation’s “backward pass” whenever a gradient value is required.

## A minimal working example

Let us illustrate the underlying mechanics of PyTorch – the most
popular framework among academics – in a simple case: the computation
of the Gaussian *kernel norm*:

\[\begin{split}\begin{aligned} H(q,p)~&=~ \frac{1}{2} \sum_{i,j=1}^\mathrm{N} k( q_i - q_j ) \, \langle\,p_i,p_j\,\rangle_{\mathbb{R}^\mathrm{D}} & \text{where} & & k(x) ~=~& \exp(-\|x\|^2\,/\,2\sigma^2)\label{eq:hamiltonien_kernel2}\\ &=~ \frac{1}{2} \langle\,p,\, K_{q,q}\, p\,\rangle_{\mathbb{R}^{\mathrm{N}\times\mathrm{D}}} & \text{where} & & (K_{q,q})_{i,j} ~&=~k( q_i - q_j ),\label{eq:hamiltonien_kernel3}\end{aligned}\end{split}\]

and of its gradients with respect to the input arrays \((q_i) \in \mathbb{R}^{\mathrm{N}\times \mathrm{D}}\) and \((p_i) \in \mathbb{R}^{\mathrm{N}\times \mathrm{D}}\). Using the standard (tensorized) PyTorch interface, programmers may write:

```
import torch # GPU + autodiff library
from torchviz import make_dot # See github.com/szagoruyko/pytorchviz
# With PyTorch, using the GPU is that simple:
use_gpu = torch.cuda.is_available()
dtype = torch.cuda.FloatTensor if use_gpu else torch.FloatTensor
# Under the hood, this flag determines the backend that is to be
# used for forward and backward operations, which have all been
# implemented both in pure CPU and GPU (CUDA) code.
# Step 1: Define numerical tensors (from scratch or numpy) --------------------
N, D = 1000, 3 # Work with clouds of 1,000 points in 3D
# Generate arbitrary arrays on the host (CPU) or device (GPU):
q = torch.rand(N, D).type(dtype) # random tensor of shape (N,D)
p = torch.rand(N, D).type(dtype) # random tensor of shape (N,D)
s = torch.Tensor([2.5]).type(dtype) # deviation "sigma" of our kernel
# Step 2: Ask PyTorch to keep track of q and p's children ---------------------
# In this demo, we won't try to fine tune the kernel and
# do not need any derivative with respect to sigma:
q.requires_grad = True
p.requires_grad = True
# Step 3: Actual computations -------------------------------------------------
# Every PyTorch instruction is executed on-the-fly, but the graph API
# 'torch.autograd' keeps track of the operations and stores in memory
# the intermediate results that are needed for the backward pass.
q_i = q[:,None,:] # shape (N,D) -> (N,1,D)
q_j = q[None,:,:] # shape (N,D) -> (1,N,D)
D_ij = ((q_i - q_j) ** 2).sum(dim=2) # squared distances |q_i-q_j|^2
K_ij = (- D_ij / (2 * s**2) ).exp() # Gaussian kernel matrix
v = K_ij@p # matrix multiplication. (N,N) @ (N,D) = (N,D)
# Finally, compute the kernel norm H(q,p):
H = .5 * torch.dot( p.view(-1), v.view(-1) ) # .5 * <p,v>
# Display the computational graph in the figure below, annotated by hand:
make_dot(H, {'q':q, 'p':p}).render(view=True)
```

## Encoding formulas as tree-like objects

With PyTorch, tensor variables are much more than plain numerical
arrays. Any tensor that descends from a differentiable variable
(marked with the flag `requires_grad = True`

) possesses two essential attributes:

A

`data`

pointer which refers to a C++ array that may be stored in either Host (CPU) or Device (GPU) memories.A

`grad_fn`

recursive tree-like object, which records thecomputational historyof the tensor and can be used whenever a backward pass is required by the`grad()`

operator.

In the picture above, we displayed the `H.grad_fn`

attribute of
our kernel norm using the GraphViz Dot program.
This acyclic graph is the exact
equivalent of the second “backward” line of the backpropagation
diagram that we presented in the previous section:

White nodesstand for backward operators \(\partial_x F_{i+1} : (x_i,x_{i+1}^*) \mapsto x_i^*\).

The green leaveis the first covariable \(x_p^*\in \mathbb{R}\), the “gradient with respect to the output” that is initialized to 1 by default.

Red leavesare the covariables “\(x_0^*\)”, the gradients that we are about to compute.

Blue leavesare thestoredvalues \(x_i\) that were computed during the forward pass.

## A well-packaged backropagation engine

Thanks to the groundwork done by the PyTorch symbolic engine, computing gradients is now as simple as writing:

```
grad_q, grad_p = torch.autograd.grad( H, [q, p] ) # pair of (N,D) tensors
```

**That’s it – and it goes pretty fast!** As should be evident by now, the
blend of semi-symbolic calculus and parallel performances that deep
learning frameworks provide is a **game changer** for applied
mathematicians. Before going any further, we thus strongly advise
readers to try out these scripts on their machines and go through the
main Matlab/NumPy to PyTorch migration guide.

## Custom operators, higher-order differentiation

As explained in this tutorial, creating new pairs of (forward, backward)
PyTorch operators is easy. Allowing users to inject their own
C++ code in a PyTorch program, the
`torch.autograd.Function`

module is a convenient interface for the
developers of PyTorch_Geometric,
GPytorch
and other contributed extensions
to the vanilla framework.

Please note that the PyTorch engine also supports the computation of
**high-order gradients** through the `create_graph = True`

optional argument of the
`grad()`

operator. Even though full Hessian matrices may *not* be
computed efficiently using backprop-like strategies – they’re typically
way too large anyway – formulas that involve gradients may themselves be
understood as “vector computer programs” and differentiated accordingly.
In practice, developers of contributed libraries just have to make sure
that their *backward* operators rely on well-defined *forward* routines,
thus allowing the `autograd`

engine to bootstrap the computation of
high-order derivatives.