Backpropagation

Now that the main rules of GPU programming have been exposed, let us recap the fundamentals of backpropagation, or reverse accumulation Automatic Differentiation (AD): the algorithm that allows AD engines to differentiate scalar-valued computer programs $F : \mathbb{R}^n \to \mathbb{R}$ efficiently. As we uncover the methods that are hidden behind the transparent .grad() calls of modern libraries, we will hopefully allow the reader to get a clear understanding of the rules and limitations of automatic differentiation engines.

Definitions

Differential. First of all, let us make our notations precise by recalling the mathematical definition of the differential of a smooth function. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be two normed vector spaces. A function $F : X \to Y$ is said to be (Fréchet) differentiable at $x_0 \in X$ if there exists a continuous linear operator $L : X \to Y$ such that:

$$\forall\, \delta x \in X, \quad F(x_0 + \delta x) \,=\, F(x_0) + L(\delta x) + o(\|\delta x\|_X),$$

or equivalently:

$$\lim_{\delta x \to 0} \frac{\big\| F(x_0 + \delta x) - F(x_0) - L(\delta x) \big\|_Y}{\|\delta x\|_X} \,=\, 0\,.$$

If it exists, such a linear operator $L$ is unique. It is called the differential of $F$ at $x_0$ and is denoted by $L = \mathrm{d}_x F(x_0)$. We use a dot symbol to denote the application of $L$, as in

$$L(\delta x) \,=\, \mathrm{d}_x F(x_0) \cdot \delta x\,.$$
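
As a simple illustration, consider the squared Euclidean norm $F : \mathbb{R}^n \to \mathbb{R}$, $F(x) = x^\top x$. For any $x_0$ and $\delta x$ in $\mathbb{R}^n$,

$$F(x_0 + \delta x) \,=\, x_0^\top x_0 \,+\, 2\, x_0^\top \delta x \,+\, \delta x^\top \delta x \,=\, F(x_0) \,+\, 2\, x_0^\top \delta x \,+\, o(\|\delta x\|)\,,$$

so that $\mathrm{d}_x F(x_0)$ is the continuous linear form $\delta x \mapsto 2\, x_0^\top \delta x$.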

Jacobian matrix. Let us consider the spaces $X = \mathbb{R}^n$ and $Y = \mathbb{R}^m$, endowed with their usual (Euclidean) metric structures. Given a function $F = (F_1, \dots, F_m)$ that is differentiable at a location $x_0$ in $\mathbb{R}^n$, the matrix of the linear operator $\mathrm{d}_x F(x_0)$ in the canonical basis is the Jacobian matrix of partial derivatives:

$$\begin{pmatrix}
\frac{\partial F_1}{\partial x_1}(x_0) & \cdots & \frac{\partial F_1}{\partial x_n}(x_0) \\
\vdots & \frac{\partial F_i}{\partial x_j}(x_0) & \vdots \\
\frac{\partial F_m}{\partial x_1}(x_0) & \cdots & \frac{\partial F_m}{\partial x_n}(x_0)
\end{pmatrix}.$$
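
For instance, the map $F : \mathbb{R}^2 \to \mathbb{R}^2$, $F(x_1, x_2) = (x_1 x_2,\; x_1 + x_2)$ is differentiable everywhere, with Jacobian matrix

$$\begin{pmatrix} x_2 & x_1 \\ 1 & 1 \end{pmatrix}$$

at any location $x_0 = (x_1, x_2)$.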

Gradient vector. When $F$ is scalar-valued ($m = 1$), the Jacobian matrix is a row vector: to retrieve a column gradient vector in $\mathbb{R}^n$, one usually considers its transpose. To define this manipulation in a coordinate-free way, independent of the choice of a reference basis, we must assume that $X$ is a Hilbert space: its metric structure is complete and comes from an inner product $\langle \cdot, \cdot \rangle_X : X \times X \to \mathbb{R}$.

Rigorously, let $(X, \langle \cdot, \cdot \rangle_X)$ be a Hilbert space, $x_0$ a reference location in $X$ and $F : X \to \mathbb{R}$ a function that is differentiable at $x_0$. As its differential $\mathrm{d}_x F(x_0) : X \to \mathbb{R}$ at $x_0$ is a continuous linear form, the Riesz representation theorem ensures that there exists a unique vector $\nabla_x F(x_0) \in X$, the gradient of $F$ at $x_0$, such that

$$\forall\, \delta x \in X, \quad \mathrm{d}_x F(x_0) \cdot \delta x \,=\, \big\langle \nabla_x F(x_0),\, \delta x \big\rangle_X\,.$$
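
Coming back to the example above, with $F(x) = x^\top x$ on $X = \mathbb{R}^n$ endowed with the standard dot product, the identity $\mathrm{d}_x F(x_0) \cdot \delta x = 2\, x_0^\top \delta x = \langle 2\, x_0, \delta x \rangle_{\mathbb{R}^n}$ shows that $\nabla_x F(x_0) = 2\, x_0$.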

Computing gradients

A naive approach to the computation of gradient vectors, the so-called finite differences scheme, is to use a Taylor expansion of $F$ around $x_0$ and write that, for small enough values of $\delta t$,

$$\nabla_x F(x_0) \,=\, \begin{pmatrix} \partial_{x_1} F(x_0) \\ \partial_{x_2} F(x_0) \\ \vdots \\ \partial_{x_n} F(x_0) \end{pmatrix} \,\simeq\, \frac{1}{\delta t} \begin{pmatrix} F(x_0 + \delta t\,(1,0,0,\dots,0)) - F(x_0) \\ F(x_0 + \delta t\,(0,1,0,\dots,0)) - F(x_0) \\ \vdots \\ F(x_0 + \delta t\,(0,0,0,\dots,1)) - F(x_0) \end{pmatrix}.$$

This idea is simple to implement, but it requires $(n+1)$ evaluations of the function $F$ to compute a single gradient vector! As soon as the dimension $n$ of the input space exceeds 10 or 100, this stops being tractable: just as inverting a full matrix $A$ is not a sensible way of solving the linear system “$Ax = b$”, one should not use finite differences (or any equivalent forward method) to compute the gradient of a scalar-valued objective.
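
To make this concrete, here is a minimal NumPy sketch of the finite differences scheme; the helper `finite_differences_gradient` and the toy objective below are illustrative placeholders, not part of any specific library:

```python
import numpy as np

def finite_differences_gradient(F, x0, dt=1e-6):
    """Approximate the gradient of a scalar-valued F at x0 with forward differences.

    Requires (n + 1) evaluations of F for an input of dimension n.
    """
    n = x0.shape[0]
    fx0 = F(x0)                       # 1 evaluation of F
    grad = np.zeros(n)
    for i in range(n):                # n additional evaluations of F
        e_i = np.zeros(n)
        e_i[i] = 1.0
        grad[i] = (F(x0 + dt * e_i) - fx0) / dt
    return grad

# Toy objective: F(x) = <x, x>, whose exact gradient is 2 * x.
F = lambda x: np.dot(x, x)
x0 = np.array([1.0, 2.0, 3.0])
print(finite_differences_gradient(F, x0))  # approximately [2., 4., 6.]
```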

Generalized gradient. To go beyond this simple scheme, we need to work with the gradient of vector-valued maps. Once again, coordinate-free definitions rely on scalar products. Let $(X, \langle \cdot, \cdot \rangle_X)$ and $(Y, \langle \cdot, \cdot \rangle_Y)$ be two Hilbert spaces, and let $F : X \to Y$ be a function that is differentiable at $x_0 \in X$. The adjoint

$$(\mathrm{d}_x F)^*(x_0) \,:\, Y^* \to X^*$$

of the differential induces a continuous linear map

$$\nabla_x F(x_0) \,:\, Y \to X$$

through the Riesz representation theorem, called the generalized gradient of $F$ at $x_0$ with respect to the Hilbertian structures of $X$ and $Y$.

Calculus. The generalized gradient appears in the first-order expansion of scalar quantities computed from $F(x)$ around a reference location $x_0$. Let $\alpha \in Y^*$ be a continuous linear form on $Y$, identified with a vector $a \in Y$ through the Riesz representation theorem:

$$\forall\, y \in Y, \quad \langle \alpha, y \rangle \,=\, \alpha(y) \,=\, \langle a, y \rangle_Y\,.$$

Then, for any increment $\delta x \in X$, we can write that:

$$\begin{aligned}
\langle \alpha, F(x_0 + \delta x) \rangle \,&=\, \langle \alpha, F(x_0) \rangle \,+\, \langle \alpha,\; \mathrm{d}_x F(x_0) \cdot \delta x \rangle \,+\, o(\|\delta x\|_X) \\
&=\, \langle \alpha, F(x_0) \rangle \,+\, \langle (\mathrm{d}_x F)^*(x_0) \cdot \alpha,\; \delta x \rangle \,+\, o(\|\delta x\|_X), \\
\text{i.e.} \quad \langle a, F(x_0 + \delta x) \rangle_Y \,&=\, \langle a, F(x_0) \rangle_Y \,+\, \big\langle \nabla_x F(x_0) \cdot a,\; \delta x \big\rangle_X \,+\, o(\|\delta x\|_X)\,.
\end{aligned}$$

Fundamental example. If $X$ and $Y$ are respectively equal to $\mathbb{R}^n$ and $\mathbb{R}$, and are endowed with the standard $L^2$-Euclidean dot products:

$$\langle x, x' \rangle_{\mathbb{R}^n} \,=\, \sum_{i=1}^n x_i\, x'_i \qquad\text{and}\qquad \langle y, y' \rangle_{\mathbb{R}} \,=\, y\, y'\,,$$

the matrix of $\nabla_x F(x_0) : \mathbb{R} \to \mathbb{R}^n$ in the canonical basis is equal to the gradient vector of partial derivatives: with a slight abuse of notation, we identify this linear map with its image of the unit scalar and write

$$\nabla_x F(x_0) \,=\, \nabla_x F(x_0) \cdot 1\,.$$

Going further, the matrix of the generalized gradient in the canonical basis coincides with the transpose of the Jacobian matrix whenever the scalar products considered are equal to the “canonical” ones. Everything is consistent.
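
As a sanity check, consider a linear map $F : \mathbb{R}^n \to \mathbb{R}^m$, $F(x) = A\, x$ with $A \in \mathbb{R}^{m \times n}$. Its differential at any location is $\mathrm{d}_x F(x_0) \cdot \delta x = A\, \delta x$, so that its Jacobian matrix is $A$ itself; with the canonical dot products on both spaces,

$$\langle a,\, A\, \delta x \rangle_{\mathbb{R}^m} \,=\, \langle A^\top a,\, \delta x \rangle_{\mathbb{R}^n} \qquad \text{for all } a \in \mathbb{R}^m,\ \delta x \in \mathbb{R}^n\,,$$

and the generalized gradient $\nabla_x F(x_0) : a \mapsto A^\top a$ is indeed encoded by the transposed matrix $A^\top$.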

Metric structure, chain rule

This generalized “metric” definition of the gradient has two major advantages over the simple notion of “vector of partial derivatives”:

  1. It stresses the fact that a gradient is always defined with respect to a metric structure, not a basis. In high-dimensional settings, as the equivalence of norms stops being effective, the choice of an appropriate descent metric becomes a key regularization prior for first-order optimization schemes. Encoded through a change of variables on the parameters that we strive to optimize, this modelling choice usually has a strong impact on Machine Learning pipelines.

  2. It allows us to compose gradients freely. Indeed, if $X$, $Y$, $Z$ are three Hilbert spaces and if $F = H \circ G$ with $G : X \to Y$ and $H : Y \to Z$, then for all $x_0 \in X$, the chain rule asserts that

    $$\mathrm{d}_x F(x_0) \,=\, \mathrm{d}_y H(G(x_0)) \circ \mathrm{d}_x G(x_0)\,,$$

    so that with the usual flip for the composition of adjoint (i.e. transposed) operators:

    $$[\mathrm{d}_x F(x_0)]^* \,=\, [\mathrm{d}_x G(x_0)]^* \circ [\mathrm{d}_y H(G(x_0))]^*, \qquad\text{i.e.}\qquad \nabla_x F(x_0) \,=\, \nabla_x G(x_0) \,\circ\, \nabla_y H(G(x_0))\,.$$

    This composition rule is illustrated by the short numerical sketch below.
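
In the sketch below (a minimal NumPy example of our own; the names `G`, `H`, `grad_G` and `grad_H` are illustrative, not the API of any library), the gradient of a composition is computed by chaining the backward operators in reverse order:

```python
import numpy as np

# Forward maps: G(x) = x**2 (pointwise) and H(y) = sum(y), so that F = H o G.
G = lambda x: x ** 2
H = lambda y: np.sum(y)

# Backward operators, i.e. adjoints of the differentials:
#   grad_G(x0) . a = 2 * x0 * a   (the Jacobian of G at x0 is diag(2 * x0)),
#   grad_H(y0) . a = a * ones     (the Jacobian of H at y0 is a row of ones).
grad_G = lambda x0, a: 2.0 * x0 * a
grad_H = lambda y0, a: a * np.ones_like(y0)

x0 = np.array([1.0, -2.0, 3.0])

# Chain rule, applied from the output back to the input:
#   grad_F(x0) . 1 = grad_G(x0) . ( grad_H(G(x0)) . 1 )
grad_F = grad_G(x0, grad_H(G(x0), 1.0))
print(grad_F)    # [ 2., -4.,  6.]
print(2.0 * x0)  # exact gradient of F(x) = sum(x**2): the two lines match
```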

Backpropagation

In practice, the function $F : \mathbb{R}^n \to \mathbb{R}$ to differentiate is defined as a composition $F = F_p \circ \cdots \circ F_2 \circ F_1$ of elementary functions $F_i : \mathbb{R}^{N_{i-1}} \to \mathbb{R}^{N_i}$, the lines of our program, where $N_0 = n$ and $N_p = 1$:

(Diagram: forward composition of the elementary functions $F_1, \dots, F_p$.)

To keep the notations simple, we now assume that all the input and output spaces $\mathbb{R}^{N_i}$ are endowed with their canonical $L^2$-Euclidean metrics. The gradient vector $\nabla_x F(x_0)$ that we strive to compute, at an arbitrary location $x_0 \in \mathbb{R}^n$, is the image of $1 \in \mathbb{R}$ by the linear map

$$\nabla_x F(x_0) \,:\, \mathbb{R} \to \mathbb{R}^n\,.$$

Thanks to the chain rule, we can write that:

$$\begin{aligned}
\nabla_x F(x_0) \cdot 1 \,&=\, \nabla_x F_1(x_0) \,\circ\, \nabla_x F_2(F_1(x_0)) \,\circ\, \cdots \,\circ\, \nabla_x F_p\big(F_{p-1}(\cdots(F_1(x_0)))\big) \cdot 1 \\
&=\, \nabla_x F_1(x_0) \,\circ\, \nabla_x F_2(\,x_1\,) \,\circ\, \cdots \,\circ\, \nabla_x F_p(\,x_{p-1}\,) \cdot 1
\end{aligned}$$

where the $x_i = F_i \circ \cdots \circ F_1(x_0)$ denote the intermediate results in the computation of $x_p = F(x_0)$. Assuming that the forward and backward operators

$$F_i \,:\, \begin{array}{ccc} \mathbb{R}^{N_{i-1}} & \to & \mathbb{R}^{N_i} \\ x & \mapsto & F_i(x) \end{array} \qquad\text{and}\qquad \nabla_x F_i \,:\, \begin{array}{ccc} \mathbb{R}^{N_{i-1}} \times \mathbb{R}^{N_i} & \to & \mathbb{R}^{N_{i-1}} \\ (x,\, a) & \mapsto & \nabla_x F_i(x) \cdot a \end{array}$$

are known and encoded as computer programs, we can thus compute both $F(x_0)$ and $\nabla_x F(x_0) = \nabla_x F(x_0) \cdot 1$ with a forward-backward pass through the following diagram:

(Diagram: reverse AD, forward-backward pass through the composition.)

In a nutshell. The backpropagation algorithm can be split into two steps that correspond to the two lines of the diagram above:

  1. Starting from $x_0 \in \mathbb{R}^n = \mathbb{R}^{N_0}$, compute and store in memory the successive vectors $x_i \in \mathbb{R}^{N_i}$. The last one, $x_p \in \mathbb{R}$, is equal to the value of the objective $F(x_0)$.

  2. Starting from the canonical value $x_p^* = 1 \in \mathbb{R}$, compute the successive dual vectors:

    $$x_i^* \,=\, \nabla_x F_{i+1}(x_i) \cdot x_{i+1}^*\,.$$

    The last one, $x_0^* \in \mathbb{R}^n$, is equal to the gradient vector $\nabla_x F(x_0) = \nabla_x F(x_0) \cdot 1$. This two-pass procedure is sketched in the short program below.
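
In the following minimal NumPy sketch, each elementary operation is encoded, under assumptions of our own, as a (forward, backward) pair of functions; the names `layers` and `forward_backward` are illustrative placeholders rather than the interface of an actual AD engine:

```python
import numpy as np

# Each elementary operation F_i is encoded as a pair of programs:
#   forward :   x     -> F_i(x)
#   backward:  (x, a) -> grad_x F_i(x) . a   (adjoint of the differential at x)
layers = [
    (lambda x: x ** 2,    lambda x, a: 2.0 * x * a),          # F_1: pointwise square
    (lambda x: np.log(x), lambda x, a: a / x),                # F_2: pointwise logarithm
    (lambda x: np.sum(x), lambda x, a: a * np.ones_like(x)),  # F_3: sum, down to a scalar
]

def forward_backward(layers, x0):
    """Return F(x0) and grad F(x0) with one forward pass and one backward pass."""
    # Step 1: forward pass, storing the intermediate results x_0, ..., x_{p-1}.
    xs = [x0]
    for forward, _ in layers:
        xs.append(forward(xs[-1]))

    # Step 2: backward pass, starting from the dual value x_p^* = 1.
    a = 1.0
    for (_, backward), x in zip(reversed(layers), reversed(xs[:-1])):
        a = backward(x, a)

    return xs[-1], a  # F(x0) and grad F(x0)

x0 = np.array([1.0, 2.0, 3.0])
value, grad = forward_backward(layers, x0)
print(value)  # F(x0) = sum(log(x0**2))
print(grad)   # exact gradient 2 / x0: [2., 1., 0.666...]
```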

Implementation and performance. This forward-backward procedure can be generalized to all acyclic computational graphs. Hence, provided that all forward and backward operators are implemented and available, we can automatically compute the gradient of any symbolic procedure that is written as a succession of elementary differentiable operations: the $F_i$'s.

In practice, the backward operators of usual operations are seldom more costly than four or five applications of the corresponding forward operators: differentiating a polynomial gives us a polynomial, logarithms become pointwise inversions, etc. Ergo, if one has enough memory at hand to store the intermediate results $x_0, \dots, x_{p-1}$ during the forward pass, the backpropagation algorithm is an automatic and time-effective way of computing the gradients of generic scalar-valued functions, with runtimes that do not exceed that of four or five applications of the forward program. This statement may come as a shock to first-time users of deep learning frameworks; but as we are about to see, it is both true and effective.
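
To fix ideas, the pointwise logarithm $F_i : x \mapsto \log(x)$ comes with the backward operator

$$\nabla_x F_i \,:\, (x,\, a) \,\mapsto\, a \,/\, x \quad \text{(pointwise division)},$$

which is no more expensive to evaluate than the forward pass itself; similar bounds hold for polynomials, exponentials and the other elementary operations that make up our programs.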