Backpropagation
Now that the main rules of GPU programming have been exposed, let us
recap the fundamentals of backpropagation, also known as reverse accumulation,
the algorithm that allows Automatic Differentiation (AD) engines to
differentiate scalar-valued computer programs.
Definitions
Differential.
First of all, let us make our notations precise by recalling the
mathematical definition of the differential of a smooth function.
Let $F : X \rightarrow Y$ be a map between two normed vector spaces. We say that $F$ is differentiable at a point $x \in X$ if there exists a continuous linear map $\mathrm{d}F(x) : X \rightarrow Y$ such that, for all increments $\delta x \in X$,
$$ F(x + \delta x) \;=\; F(x) \;+\; \mathrm{d}F(x)\cdot \delta x \;+\; o(\|\delta x\|), $$
or equivalently:
$$ \lim_{\|\delta x\|_X \to 0} \; \frac{\big\| F(x+\delta x) \,-\, F(x) \,-\, \mathrm{d}F(x)\cdot \delta x \big\|_Y}{\|\delta x\|_X} \;=\; 0 . $$
If it exists, such a linear operator $\mathrm{d}F(x)$ is unique: it is called the differential of $F$ at $x$.
Jacobian matrix.
Let us consider the spaces $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, endowed with their canonical bases. The matrix of the linear map $\mathrm{d}F(x)$ in these bases is called the Jacobian matrix of $F$ at $x$: its coefficient at row $i$ and column $j$ is the partial derivative $\partial F_i / \partial x_j\,(x)$.
Gradient vector.
When $Y = \mathbb{R}$, i.e. when $F$ is scalar-valued, the differential $\mathrm{d}F(x)$ is a continuous linear form on $X$ that we may represent as a vector.
Rigorously, let $X$ be a Hilbert space endowed with a scalar product $\langle\cdot,\cdot\rangle_X$; the Riesz representation theorem then guarantees the existence of a unique vector $\nabla F(x) \in X$, the gradient of $F$ at $x$, such that
$$ \forall\, \delta x \in X, \quad \mathrm{d}F(x)\cdot \delta x \;=\; \big\langle\, \nabla F(x),\, \delta x \,\big\rangle_X . $$
Computing gradients
A naive approach to the computation of gradient vectors, the so-called
finite differences scheme, is to use a Taylor expansion of $F$ around $x$: for a small step $\delta > 0$ and $i$ ranging from $1$ to $n$,
$$ \frac{\partial F}{\partial x_i}(x) \;\simeq\; \frac{F(x + \delta\, e_i) \,-\, F(x)}{\delta}, $$
where $e_i$ denotes the $i$-th vector of the canonical basis of $\mathbb{R}^n$.
This idea is simple to implement, but also requires $n+1$ evaluations of $F$ to estimate a single gradient vector: in high-dimensional settings, this cost quickly becomes prohibitive.
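To make this cost explicit, here is a minimal NumPy sketch of the scheme (the quadratic test function, dimension and step size $\delta$ are illustrative choices): estimating one gradient vector takes $n+1$ evaluations of $F$.

```python
import numpy as np

def finite_difference_gradient(F, x, delta=1e-6):
    """Estimate the gradient of a scalar-valued F at x with forward differences."""
    n = x.shape[0]
    F_x = F(x)                        # 1 evaluation of F
    grad = np.zeros(n)
    for i in range(n):                # n additional evaluations of F
        e_i = np.zeros(n)
        e_i[i] = 1.0
        grad[i] = (F(x + delta * e_i) - F_x) / delta
    return grad

# Toy check: F(x) = ||x||^2 / 2, whose exact gradient is x itself.
x = np.random.randn(5)
approx = finite_difference_gradient(lambda v: 0.5 * np.sum(v ** 2), x)
print(np.allclose(approx, x, atol=1e-4))  # True, up to truncation error
```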
Generalized gradient.
To go beyond this simple scheme, we need to work with the gradient of
vector-valued applications. Once again, coordinate-free definitions
rely on scalar products.
Let $X$ and $Y$ be two Hilbert spaces, endowed with scalar products $\langle\cdot,\cdot\rangle_X$ and $\langle\cdot,\cdot\rangle_Y$, and let $F : X \rightarrow Y$ be differentiable at $x \in X$. The adjoint (or transpose)
of the differential induces a continuous linear map
$$ \partial F(x) \;=\; \mathrm{d}F(x)^{\ast} \;:\; Y \;\rightarrow\; X $$
through the Riesz representation theorem, called the generalized
gradient of $F$ at $x$ with respect to the two scalar products. It is characterized by the identity
$$ \forall\, a \in Y,\ \forall\, \delta x \in X, \quad \big\langle\, \mathrm{d}F(x)\cdot \delta x,\, a \,\big\rangle_Y \;=\; \big\langle\, \delta x,\, \partial F(x)\cdot a \,\big\rangle_X . $$
Calculus.
The generalized gradient appears in the infinitesimal development of
scalar quantities computed from $F(x)$: let $a$ be a fixed vector of $Y$.
Then, for any increment $\delta x \in X$,
$$ \big\langle\, a,\, F(x+\delta x) \,\big\rangle_Y \;=\; \big\langle\, a,\, F(x) \,\big\rangle_Y \;+\; \big\langle\, \partial F(x)\cdot a,\, \delta x \,\big\rangle_X \;+\; o(\|\delta x\|) . $$
Fundamental example.
If $Y = \mathbb{R}$ is endowed with its canonical scalar product (the standard product of real numbers),
the matrix of $\partial F(x) : \mathbb{R} \rightarrow X$ reduces to a single column, the vector $\partial F(x)\cdot 1$, which coincides with the gradient $\nabla F(x)$ defined above.
Going further, the matrix of the generalized gradient in the canonical basis coincides with the transpose of the Jacobian matrix whenever the scalar products considered are equal to the “canonical” ones. Everything is consistent.
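As a quick numerical sanity check of this correspondence, the sketch below (NumPy, with an arbitrary matrix $M$ chosen for illustration) verifies that for the linear map $F(x) = Mx$, with canonical scalar products on both sides, the transposed Jacobian $M^{\mathsf{T}}$ reproduces the first-order variation of the scalar quantity $\langle a, F(x)\rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))        # F(x) = M @ x maps R^5 to R^3
F = lambda x: M @ x

x = rng.standard_normal(5)
dx = 1e-6 * rng.standard_normal(5)     # small increment in the input space
a = rng.standard_normal(3)             # fixed dual vector in the output space

# Variation of the scalar quantity <a, F(x)> ...
lhs = a @ F(x + dx) - a @ F(x)
# ... as predicted by the generalized gradient, i.e. the transposed Jacobian:
rhs = (M.T @ a) @ dx
print(np.allclose(lhs, rhs))           # True: <a, dF(x).dx> = <M^T a, dx>
```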
Metric structure, chain rule
This generalized “metric” definition of the gradient has two major advantages over the simple notion of “vector of partial derivatives”:
It stresses the fact that a gradient is always defined with respect to a metric structure, not a basis. In high-dimensional settings, as the equivalence of norms stops being effective, the choice of an appropriate descent metric becomes a key regularization prior for first-order optimization schemes. Encoded through a change of variables on the parameters that we strive to optimize, this modelling choice usually has a strong impact on Machine Learning pipelines.
It allows us to compose gradients without reserve. Indeed, if
$X$, $Y$, $Z$ are three Hilbert spaces and if $F = G \circ H$ with $H : X \rightarrow Y$ and $G : Y \rightarrow Z$, then for all $x \in X$, the chain rule asserts that
$$ \mathrm{d}F(x) \;=\; \mathrm{d}G(H(x)) \,\circ\, \mathrm{d}H(x) , $$
so that with the usual flip for the composition of adjoint (i.e. transposed) operators:
$$ \partial F(x) \;=\; \partial H(x) \,\circ\, \partial G(H(x)) , $$
as illustrated by the short sketch below.
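To see this "flip" in action, here is a small NumPy sketch with two hypothetical elementary maps (a linear layer followed by a pointwise $\tanh$, and a pointwise square followed by a linear layer), chosen purely for illustration: backpropagating a dual vector through the transposed Jacobians in reverse order matches multiplication by the transpose of the full Jacobian product.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((3, 4))
H = lambda x: np.tanh(A @ x)                      # H : R^6 -> R^4
G = lambda y: B @ (y ** 2)                        # G : R^4 -> R^3

def jac_H(x):   # Jacobian of H at x: diag(1 - tanh(Ax)^2) @ A
    return (1 - np.tanh(A @ x) ** 2)[:, None] * A

def jac_G(y):   # Jacobian of G at y: B @ diag(2y)
    return B * (2 * y)[None, :]

x = rng.standard_normal(6)
a = rng.standard_normal(3)                        # dual vector in the output space Z
y = H(x)

# Adjoints composed in reverse order vs. transpose of the full Jacobian:
grad_via_adjoints = jac_H(x).T @ (jac_G(y).T @ a)
grad_via_jacobian = (jac_G(y) @ jac_H(x)).T @ a
print(np.allclose(grad_via_adjoints, grad_via_jacobian))  # True
```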
Backpropagation
In practice, the function $F : \mathbb{R}^n \rightarrow \mathbb{R}$ that we wish to differentiate is given as a composition
$$ F \;=\; F_p \circ \cdots \circ F_2 \circ F_1 $$
of elementary operations $F_i : X_{i-1} \rightarrow X_i$.
To keep the notations simple, we now assume that all the input and
output spaces $X_0 = \mathbb{R}^n,\, X_1,\, \dots,\, X_p = \mathbb{R}$ are Euclidean spaces endowed with their canonical scalar products.
Thanks to the chain rule, we can write that:
$$ \nabla F(x) \;=\; \partial F(x)\cdot 1 \;=\; \big[\, \partial F_1(x_0) \,\circ\, \partial F_2(x_1) \,\circ\, \cdots \,\circ\, \partial F_p(x_{p-1}) \,\big] \cdot 1 , $$
where the $x_i = F_i(x_{i-1}) = (F_i \circ \cdots \circ F_1)(x)$ denote the intermediate results of the computation, with $x_0 = x$. Assuming that the forward operators $F_i$ and the backward operators $\partial F_i$
are known and encoded as computer programs, we can thus compute
both the value $F(x)$ and the gradient vector $\nabla F(x)$ with a single forward-backward sweep through the composition.
In a nutshell. The backpropagation algorithm can be cut in two steps that correspond to the forward and backward passes through this composition, illustrated by the sketch below:

1. Starting from $x_0 = x$, compute and store in memory the successive vectors $x_i = F_i(x_{i-1})$. The last one, $x_p$, is equal to the value of the objective $F(x)$.

2. Starting from the canonical value of $a_p = 1$, compute the successive dual vectors
$$ a_{i-1} \;=\; \partial F_i(x_{i-1}) \cdot a_i . $$
The last one, $a_0$, is equal to the gradient vector $\nabla F(x)$.
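The following Python sketch implements this two-pass procedure on a toy chain of three operations (a pointwise logarithm, a pointwise square and a final sum, all chosen for illustration): each layer provides a forward map $F_i$ together with a backward map that applies $\partial F_i(x_{i-1})$ to a dual vector.

```python
import numpy as np

# Each layer is a pair (forward, backward), where backward(x, a) applies the
# adjoint of the differential dF_i(x) to the dual vector a.
layers = [
    (lambda x: np.log(x),  lambda x, a: a / x),                 # pointwise log
    (lambda x: x ** 2,     lambda x, a: 2 * x * a),             # pointwise square
    (lambda x: np.sum(x),  lambda x, a: a * np.ones_like(x)),   # sum to a scalar
]

def value_and_grad(x0):
    # Forward pass: compute and store the intermediate results x_0, ..., x_p.
    xs = [x0]
    for forward, _ in layers:
        xs.append(forward(xs[-1]))

    # Backward pass: start from the canonical dual value a_p = 1 and apply
    # the backward operators in reverse order, at the stored locations x_{i-1}.
    a = 1.0
    for (_, backward), x in zip(reversed(layers), reversed(xs[:-1])):
        a = backward(x, a)
    return xs[-1], a          # F(x_0) and the gradient vector

x0 = np.array([1.0, 2.0, 3.0])
value, grad = value_and_grad(x0)
print(value, grad)            # grad[i] = 2 * log(x0[i]) / x0[i]
```

Note how the backward pass reads the stored $x_i$'s in reverse order: this is exactly the memory requirement discussed in the next paragraph.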
Implementation and performances.
This forward-backward procedure can be generalized to all acyclic
computational graphs. Hence, provided that all forward and backward
operators $F_i$ and $\partial F_i$
are implemented and available, we can compute automatically the
gradient of any symbolic procedure that is written as a succession of
elementary differentiable operations: the $F_i$'s.
In practice, the backwards of usual operations are seldom more costly
than 4-5 applications of the corresponding forward operators:
differentiating a polynomial gives us a polynomial, logarithms become
pointwise inversions, etc. Ergo, if one has enough memory at hand to
store the intermediate results $x_0, \dots, x_{p-1}$ of the forward pass, the gradient of a scalar-valued program can be computed for a small constant overhead over the cost of a single forward evaluation.
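This bookkeeping is precisely what AD engines automate. As a quick illustration (assuming that PyTorch is installed; the objective below is an arbitrary example, not taken from the text), a single call to backward() traverses the recorded graph and returns the full gradient for a cost comparable to that of the forward pass.

```python
import torch

x = torch.randn(10_000, requires_grad=True)    # leaf variable of dimension n = 10,000
y = (x ** 2).log1p().sum()                     # forward pass: intermediate results are recorded
y.backward()                                   # backward pass: one sweep through the graph

with torch.no_grad():                          # check against the hand-computed gradient
    print(torch.allclose(x.grad, 2 * x / (1 + x ** 2)))  # True: d/dx log(1 + x^2) = 2x / (1 + x^2)
```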