# Backpropagation

Now that the main rules of GPU programming have been presented, let us
**recap the fundamentals** of *backpropagation*, also known as *reverse
accumulation AD*: the algorithm that allows Automatic Differentiation (AD)
engines to differentiate scalar-valued computer programs
\(F : \mathbb{R}^n \to \mathbb{R}\) efficiently. As we uncover the methods
that are hidden behind the transparent `.grad()` calls of modern libraries,
we hope to give the reader a clear understanding of the **rules** and
**limitations** of automatic differentiation engines.

## Definitions

**Differential.**
First of all, let us make our notations precise by recalling the
mathematical definition of the *differential* of a smooth function.
Let \((X,\|\cdot\|_X)\) and \((Y,\|\cdot\|_Y)\) be two normed
vector spaces. A function \(F : X \to Y\) is said to be (Fréchet)
differentiable at \(x_0 \in X\) if there exists a continuous linear
operator \(\mathcal{L} : X \to Y\) such that:

\[\lim_{\|\delta x\|_X \to 0}~ \frac{\big\| F(x_0 + \delta x) \,-\, F(x_0) \,-\, \mathcal{L}(\delta x) \big\|_Y}{\|\delta x\|_X} ~=~ 0~,\]

or equivalently:

\[F(x_0 + \delta x) ~=~ F(x_0) \,+\, \mathcal{L}(\delta x) \,+\, o(\|\delta x\|_X)~.\]

If it exists, such a linear operator \(\mathcal{L}\) is unique. It
is called the **differential** of \(F\) at \(x_0\) and is
denoted by \(\mathcal{L} = \mathrm{d}_x F(x_0)\). We use a dot symbol to
denote the application of \(\mathcal{L}\), as in:

\[\mathrm{d}_x F(x_0) \cdot \delta x ~=~ \mathcal{L}(\delta x)~.\]

**Jacobian matrix.**
Let us consider the spaces \(X = \mathbb{R}^n\) and \(Y = \mathbb{R}^m\) endowed
with their usual (Euclidean) metric structures. Given a function
\(F = (F^1,\dots,F^m)\) that is differentiable at a location
\(x_0\) in \(\mathbb{R}^n\), the matrix of the linear operator
\(\mathrm{d}_x F(x_0)\) in the canonical basis is the *Jacobian matrix* of
partial derivatives:

\[J_F(x_0) ~=~ \begin{pmatrix} \partial_{x_1} F^1(x_0) & \cdots & \partial_{x_n} F^1(x_0) \\ \vdots & & \vdots \\ \partial_{x_1} F^m(x_0) & \cdots & \partial_{x_n} F^m(x_0) \end{pmatrix} ~\in~ \mathbb{R}^{m \times n}~.\]

**Gradient vector.**
When \(F\) is scalar-valued (\(m = 1\)), the Jacobian matrix is
a single *row*: to retrieve a column *gradient vector* in \(\mathbb{R}^n\), one
usually considers its **transpose**. To define this manipulation in a way
that is **coordinate-free**, independent of the choice of a reference
basis, we must assume that \(X\) is a *Hilbert space*: its metric
structure is *complete* and comes from an inner product
\(\langle\,\cdot\,,\,\cdot\,\rangle_X: X\times X \rightarrow \mathbb{R}\).

Rigorously, let \((X,~ \langle\,\cdot\,,\,\cdot\,\rangle_X)\) be a Hilbert space,
\(x_0\) a reference location in \(X\) and \(F : X \to \mathbb{R}\) a
function that is differentiable at \(x_0\). As its differential
\(\mathrm{d}_x F(x_0) : X \to \mathbb{R}\) at \(x_0\) is a *continuous linear
form*, the **Riesz representation theorem** ensures that there exists a
unique vector \(\nabla_x F(x_0) \in X\), the **gradient** of
\(F\) at \(x_0\), such that for all \(\delta x \in X\):

\[\mathrm{d}_x F(x_0) \cdot \delta x ~=~ \langle\, \nabla_x F(x_0) \,,\, \delta x \,\rangle_X~.\]

## Computing gradients

A **naive** approach to the computation of gradient vectors, the so-called
**finite differences** scheme, is to use a Taylor expansion of \(F\)
around \(x_0\) and write that for small enough values of
\(\delta t\):

\[\partial_{x_i} F(x_0) ~\simeq~ \frac{F(x_0 + \delta t \, e_i) \,-\, F(x_0)}{\delta t}~, \qquad i \in \{1, \dots, n\}~,\]

where \((e_1, \dots, e_n)\) denotes the canonical basis of \(\mathbb{R}^n\).

This idea is simple to implement, but also requires \((n+1)\)
evaluations of the function \(F\) to compute a *single* gradient
vector! As soon as the dimension \(n\) of the input space exceeds 10
or 100, **this stops being tractable**: just like inverting a full matrix
\(A\) is not a sensible way of solving the linear system
“\(Ax = b\)”, one should not use finite differences - or any
equivalent *forward* method - to compute the gradient of a scalar-valued
objective.
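To make the cost of this scheme concrete, here is a minimal finite-differences sketch in Python with NumPy; the objective `F`, the sample point `x0` and the step `dt` are illustrative choices, not part of the text above:

```python
import numpy as np

def finite_differences_grad(F, x0, dt=1e-6):
    """Approximate the gradient of a scalar function F at x0.

    Requires (n + 1) evaluations of F for an input of dimension n,
    which is exactly why this scheme stops being tractable in high dimension."""
    n = x0.shape[0]
    f0 = F(x0)                       # 1 evaluation of F
    grad = np.zeros(n)
    for i in range(n):               # n further evaluations, one per coordinate
        e_i = np.zeros(n)
        e_i[i] = 1.0
        grad[i] = (F(x0 + dt * e_i) - f0) / dt
    return grad

# Illustrative objective: F(x) = sum(x**2), whose exact gradient is 2 * x.
F = lambda x: np.sum(x ** 2)
x0 = np.array([1.0, -2.0, 3.0])
print(finite_differences_grad(x0=x0, F=F))  # close to [2., -4., 6.]
```

Note that the loop body is embarrassingly parallel, but the total work still grows linearly with the input dimension \(n\): this is the bottleneck that reverse-mode AD removes.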

**Generalized gradient.**
To go beyond this simple scheme, we need to work with the gradient of
*vector-valued* applications. Once again, coordinate-free definitions
rely on scalar products.
Let \((X,~ \langle\,\cdot\,,\,\cdot\,\rangle_X)\) and
\((Y,~ \langle\,\cdot\,,\,\cdot\,\rangle_Y)\) be two Hilbert spaces, and
let \(F : X\rightarrow Y\) be a function that is differentiable at
\(x_0 \in X\). The adjoint

\[\big[\mathrm{d}_x F(x_0)\big]^* \,:\, Y^* \rightarrow X^*\]

of the differential induces a continuous linear map

\[\partial_x F(x_0) \,:\, Y \rightarrow X\]
through the Riesz representation theorem, called the **generalized
gradient** of \(F\) at \(x_0\) with respect to the Hilbertian
structures of \(X\) and \(Y\).

**Calculus.**
The generalized gradient appears in the infinitesimal development of
scalar quantities computed from \(F(x)\) around a reference location
\(x_0\). Let \(\alpha \in Y^*\) be a **continuous** linear form on
\(Y\), identified with a vector \(a \in Y\) through the Riesz
representation theorem:

\[\alpha \,:\, y \in Y \,\mapsto\, \langle\, a \,,\, y \,\rangle_Y \in \mathbb{R}~.\]

Then, for any increment \(\delta x \in X\), we can write that:

\[\langle\, a \,,\, F(x_0 + \delta x) \,\rangle_Y ~=~ \langle\, a \,,\, F(x_0) \,\rangle_Y \,+\, \big\langle\, \partial_x F(x_0) \cdot a \,,\, \delta x \,\big\rangle_X \,+\, o(\|\delta x\|_X)~.\]

**Fundamental example.**
If \(X\) and \(Y\) are respectively equal to \(\mathbb{R}^n\) and
\(\mathbb{R}\), endowed with the standard \(L^2\)-Euclidean dot
products:

\[\langle\, x \,,\, x' \,\rangle_{\mathbb{R}^n} ~=~ x^\top x' \qquad \text{and} \qquad \langle\, y \,,\, y' \,\rangle_{\mathbb{R}} ~=~ y \, y'~,\]

the matrix of \(\partial_x F(x_0) : \mathbb{R} \rightarrow \mathbb{R}^n\) in the canonical basis is equal to the vector \(\nabla_x F(x_0)\) of directional derivatives:

\[\partial_x F(x_0) \cdot 1 ~=~ \nabla_x F(x_0) ~=~ \big(\, \partial_{x_1} F(x_0) \,,\, \dots \,,\, \partial_{x_n} F(x_0) \,\big)^\top~.\]

Going further, the matrix of the generalized gradient in the canonical basis coincides with the transpose of the Jacobian matrix whenever the scalar products considered are equal to the “canonical” ones. Everything is consistent.
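This consistency can be checked numerically. In the sketch below (an illustration, with an arbitrarily chosen linear map), the differential of \(F(x) = A x\) is \(A\) itself, so its generalized gradient acts as the transpose \(A^\top\) and the defining identity \(\langle a, \mathrm{d}F \cdot \delta x \rangle_Y = \langle \partial F \cdot a, \delta x \rangle_X\) holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # F(x) = A @ x maps X = R^2 into Y = R^3
a = rng.standard_normal(3)        # vector a in Y, representing a linear form
dx = rng.standard_normal(2)       # increment delta_x in X

# Differential of a linear map is the map itself: dF(x0) . dx = A @ dx.
# Its generalized gradient (adjoint) is the transpose: ∂F(x0) . a = A.T @ a.
lhs = a @ (A @ dx)       # <a, dF(x0) . dx>_Y
rhs = (A.T @ a) @ dx     # <∂F(x0) . a, dx>_X
print(np.isclose(lhs, rhs))  # True: the adjoint identity holds
```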

## Metric structure, chain rule

This generalized “metric” definition of the gradient has two major advantages over the simple notion of “vector of partial derivatives”:

1. It stresses the fact that **a gradient is always defined with respect to a metric structure**, not a basis. In high-dimensional settings, as the equivalence of norms stops being effective, the choice of an appropriate *descent metric* becomes a **key regularization prior** for first-order optimization schemes. Encoded through a change of variables on the parameters that we strive to optimize, this modelling choice usually has a strong impact on Machine Learning pipelines.

2. It allows us to *compose* gradients freely. Indeed, if \(X\), \(Y\), \(Z\) are three Hilbert spaces and if \(F = H \circ G\) with \(G : X \rightarrow Y\) and \(H : Y \rightarrow Z\), then for all \(x_0 \in X\), the chain rule asserts that

   \[\mathrm{d}_x F(x_0) ~=~ \mathrm{d}_y H(G(x_0)) \circ \mathrm{d}_x G(x_0)~,\]

   so that with the usual flip for the composition of adjoint (i.e. transposed) operators:

\[\begin{split}\begin{aligned} \big[\mathrm{d}_x F(x_0)\big]^* ~&=~ \big[\mathrm{d}_x G(x_0)\big]^* \circ \big[\mathrm{d}_y H(G(x_0))\big]^* \\ \text{i.e.}~~~~~~~~~~\partial_x F(x_0)~~~~&=~ ~\,\,\partial_x G(x_0) ~\,\, \circ ~ \partial_y H(G(x_0)). \end{aligned}\end{split}\]
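For linear maps, this "flip" is just the matrix identity \((BA)^\top = A^\top B^\top\). The following sketch (with arbitrarily chosen matrices standing in for the differentials of \(G\) and \(H\)) verifies it numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))   # dG(x0) : R^2 -> R^3, as a matrix
B = rng.standard_normal((1, 3))   # dH(G(x0)) : R^3 -> R, as a matrix
a = rng.standard_normal(1)        # dual vector in Z = R

# Chain rule: d(H∘G)(x0) has matrix B @ A.  Its adjoint flips the order:
grad_composed = (B @ A).T @ a          # ∂F(x0) . a, computed in one go
grad_stepwise = A.T @ (B.T @ a)        # ∂G(x0) . ( ∂H(G(x0)) . a )
print(np.allclose(grad_composed, grad_stepwise))  # True
```

Crucially, the step-by-step evaluation on the right never forms the full matrix product: it only applies matrix-vector products, which is what makes the backward pass cheap.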

## Backpropagation

In practice, the function \(F : \mathbb{R}^n \rightarrow \mathbb{R}\) to differentiate is defined as a composition \(F = F_p\circ\cdots\circ F_2\circ F_1\) of elementary functions \(F_i:\mathbb{R}^{N_{i-1}}\rightarrow \mathbb{R}^{N_i}\) – the lines of our program – where \(N_0 = n\) and \(N_p = 1\):

\[x_0 \in \mathbb{R}^{N_0} ~\xrightarrow{~F_1~}~ x_1 \in \mathbb{R}^{N_1} ~\xrightarrow{~F_2~}~ \cdots ~\xrightarrow{~F_p~}~ x_p \in \mathbb{R}^{N_p}~.\]

To keep the notations simple, we now assume that all the input and output spaces \(\mathbb{R}^{N_i}\) are endowed with their canonical \(L^2\)-Euclidean metrics. The gradient vector \(\nabla_x F(x_0)\) that we strive to compute, at an arbitrary location \(x_0\in\mathbb{R}^n\), is the image of \(1 \in \mathbb{R}\) by the linear map

\[\partial_x F(x_0) \,:\, \mathbb{R} \rightarrow \mathbb{R}^n~.\]

Thanks to the chain rule, we can write that:

\[\partial_x F(x_0) ~=~ \partial_x F_1(x_0) \,\circ\, \partial_x F_2(x_1) \,\circ\, \cdots \,\circ\, \partial_x F_p(x_{p-1})~,\]
where the \(x_i = F_i\circ\cdots\circ F_1(x_0)\) denote the
intermediate results in the computation of \(x_p = F(x_0)\).
Assuming that the *forward* and *backward* operators

\[F_i \,:\, x \mapsto F_i(x) \qquad \text{and} \qquad \partial_x F_i \,:\, (x, a) \mapsto \partial_x F_i(x) \cdot a\]

are known and **encoded as computer programs**, we can thus compute
both \(F(x_0)\) and
\(\nabla_x F(x_0) = \partial_x F(x_0) \cdot 1\) with a
forward-backward pass through the following
diagram:

\[\begin{split}\begin{aligned}
x_0 ~&\xrightarrow{~F_1~}~ x_1 ~\xrightarrow{~F_2~}~ \cdots ~\xrightarrow{~F_p~}~ x_p \,=\, F(x_0)~, \\
x_0^* ~&\xleftarrow{~\partial_x F_1(x_0)~}~ x_1^* ~\xleftarrow{~\partial_x F_2(x_1)~}~ \cdots ~\xleftarrow{~\partial_x F_p(x_{p-1})~}~ x_p^* \,=\, 1~.
\end{aligned}\end{split}\]

**In a nutshell.**
The *backpropagation* algorithm can be split into two steps that correspond
to the two lines of the diagram above:

1. Starting from \(x_0 \in \mathbb{R}^n = \mathbb{R}^{N_0}\), compute and **store in memory** the successive vectors \(x_i \in \mathbb{R}^{N_i}\). The last one, \(x_p \in \mathbb{R}\), is equal to the value of the objective \(F(x_0)\).

2. Starting from the canonical value of \(x_p^* = 1 \in \mathbb{R}\), compute the successive **dual vectors**:

   \[x_i^* ~=~ \partial_x F_{i+1} (x_i) \cdot x_{i+1}^*~.\]

   The last one, \(x_0^* \in \mathbb{R}^n\), is equal to the gradient vector \(\nabla_x F(x_0) = \partial_x F(x_0) \cdot 1\).
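The two steps above can be sketched in a few lines of Python. The composition below is an illustrative choice (a pointwise square \(F_1\) followed by a sum \(F_2\)); each layer packs a forward operator with its backward operator \((x, a) \mapsto \partial_x F_i(x)\cdot a\):

```python
import numpy as np

# Each layer is a pair (forward, backward):
#   forward : x      -> F_i(x)
#   backward: (x, a) -> ∂x F_i(x) . a   (adjoint of the differential at x)
layers = [
    (lambda x: x ** 2,    lambda x, a: 2 * x * a),             # F_1: pointwise square
    (lambda x: np.sum(x), lambda x, a: a * np.ones_like(x)),   # F_2: sum to a scalar
]

def backprop(layers, x0):
    # Forward pass: compute and STORE the intermediate results x_0, ..., x_p.
    xs = [x0]
    for forward, _ in layers:
        xs.append(forward(xs[-1]))
    # Backward pass: start from x_p^* = 1 and apply the backward operators
    # in reverse order, feeding each one the stored intermediate result.
    a = 1.0
    for (_, backward), x in zip(reversed(layers), reversed(xs[:-1])):
        a = backward(x, a)
    return xs[-1], a   # F(x0) and ∇F(x0)

x0 = np.array([1.0, -2.0, 3.0])
value, grad = backprop(layers, x0)
print(value)  # F(x0)  = sum(x0**2) = 14.0
print(grad)   # ∇F(x0) = 2 * x0    = [2., -4., 6.]
```

Note that every step of the backward loop is a single (cheap) operator application: the full gradient comes out for roughly the price of one extra forward pass, in stark contrast with the \(n+1\) evaluations of finite differences.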

**Implementation and performance.**
This forward-backward procedure can be generalized to all acyclic
computational graphs. Hence, provided that all forward and backward
operators
are implemented and available, we can compute *automatically* the
gradient of any symbolic procedure that is written as a succession of
elementary differentiable operations: the \(F_i\)’s.

In practice, the *backward* operators of usual operations are seldom more costly
than 4-5 applications of the corresponding *forward* operators:
differentiating a polynomial gives us a polynomial, logarithms become
pointwise inversions, etc. Ergo, if one has enough memory at hand to
store the intermediate results \(x_0, \dots, x_{p-1}\) during
the forward pass, **the backpropagation algorithm is an automatic and
time-effective way of computing the gradients** of generic scalar-valued
functions, with **runtimes that do not exceed those of four or five
applications of the forward program**. This statement may come as a
shock to first-time users of deep learning frameworks; but as we are
about to see, it is both *true* and *effective*.