The Conjugate Gradient method

A geometrical explanation.

Posted on May 1, 2019

Cheat sheet

This little box keeps all the important formulae so that you (now) don't need to keep them in mind.

Minimum of quadratic function: \[x^\star = \operatorname{arginf}_{x\in\mathbb R^n} \frac{1}{2}x^TAx -b^Tx\]
Error: \[e_i = x_i - x^\star\]
Residual: \[\begin{aligned}r_i &= b-Ax_i= -f'(x_i)\\ &= -Ae_i\end{aligned}\]
Orth. dir. descent: \[x_{i+1} = x_i + \alpha_i\cdot d_i\] with orthogonal directions \(\{d_i\}_{i=0}^{n-1}.\)
Conjugacy/A-orthogonality: \[u~\bot_A~ v\Leftrightarrow u^TAv = 0\]
Conj. dir. descent: \[x_{i+1} = x_i + \alpha_i\cdot d_i\] with conjugate directions \(\{d_i\}_{i=0}^{n-1}.\)
Warning: While this article can be accessed on mobile, I strongly recommend using a somewhat larger screen. Many elements cannot be shown or are too big to fit on a mobile phone screen.


Introduction

What this article is about, what I assume you to know, how the article is structured and how it is to be used.

The Conjugate Gradient method is one of the most important ideas in scientific computing: it is used to solve large sparse systems of linear equations, such as those arising in the numerical solution of partial differential equations, and it also serves as a powerful optimization method.

This article tries to give an intuitive introduction to the Conjugate Gradient method (from now on often abbreviated as "CG").

The CG method is based on a beautiful geometric idea, but this idea is barely explained in most expository texts, a fact that was lamented (and alleviated) by Jonathan Shewchuk's superb paper "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain". This article tries to transfer Shewchuk's brilliant explanations into a more interactive medium.

Note that this article is designed to be read on a decently sized monitor. While a phone screen will show the necessary info, there are some features and visualizations that might not work satisfactorily.

If you find errors or have comments on the mathematical and/or didactical exposition, please drop me an email at [email protected].

I assume you need this algorithm for one of the following use-cases:

  1. Solving a linear system of equations \[A x = b\]for a matrix \(A\in \mathbb R^{n\times n}\) symmetric and positive definite and vectors \(x, b\in \mathbb R^n\). This setting typically arises when you need to solve partial differential equations numerically.
  2. Minimizing a quadratic function \[f(x) = \frac{1}{2}x^TAx - b^T x + c\] with a matrix \(A\in \mathbb R^{n\times n}\) symmetric and positive definite, vectors \(x, b\in \mathbb R^n\) and a scalar \(c\in \mathbb R\).
  3. Minimizing a more complicated function \[f : \mathbb R^n\to \mathbb R.\]

The first two cases are equivalent:

Note that the quadratic function is differentiable and has gradient \(\nabla f(x) = \frac{1}{2}(A^T + A)x - b = Ax - b\) (last step by symmetry of \(A\)) and (by positive definiteness of \(A\)) is guaranteed to have exactly one minimum which needs to fulfill \(\nabla f(x) = 0\). Hence, minimization of \(f\) amounts exactly to solving the linear equation \(Ax = b\).
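If you want to convince yourself of this equivalence numerically, here is a minimal Python/NumPy sketch with a randomly generated positive definite matrix (nothing in it is needed for the rest of the article):

```python
import numpy as np

# Quick numerical sanity check (not needed later): for a small random SPD matrix,
# the solution of Ax = b is exactly the critical point of f(x) = 1/2 x^T A x - b^T x + c.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)            # symmetric positive definite by construction
b = rng.standard_normal(4)
c = 1.0

f = lambda x: 0.5 * x @ A @ x - b @ x + c
grad = lambda x: A @ x - b             # gradient of f, using the symmetry of A

x_star = np.linalg.solve(A, b)         # direct solve, only for this sanity check
print(np.linalg.norm(grad(x_star)))    # ~0: the solution of Ax = b is a critical point of f

d = rng.standard_normal(4)             # f is larger at nearby perturbed points
print(f(x_star + 1e-3 * d) > f(x_star), f(x_star - 1e-3 * d) > f(x_star))
```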

Although CG was developed for minimization of quadratic functions, it is applied to non-quadratic minimization problems as well.

Obviously, the solution to \(Ax = b\) is \(x = A^{-1}b\) but we assume that the dimension \(n\) of the problem is so high that we cannot invert the matrix. From here on, we won't allow matrix inversion as a feasible operation. We even forbid explicitly solving linear equations of the form \(Ax = b\). This is not the same as inverting \(A\) and multiplying \(A^{-1}b\) and is a lot faster than that, but we assume that the problem is so big that even this will not work. We allow, on the other hand, all kinds of (valid) multiplications between vectors and matrices and assume that those can be done efficiently.

Because the CG method is most intuitively understood in the context of optimization, we will drop the \(Ax = b\) aspect for now and we will mostly look at the problem of minimizing a quadratic function of the form \[f(x) = \frac{1}{2}x^TAx - b^T x + c.\] We also make the standing assumption that \(A\) is square, symmetric and positive definite.

Because we assume that exact solution of the minimization problem (or equivalently, solving the corresponding linear equation) cannot be done efficiently, we will look at iterative methods, i.e. we will construct an iteration \(x_{i}\mapsto x_{i+1}\) such that the sequence \[ x_0, x_1,x_2,\ldots\] converges to the minimum.

In the course of this article, I am assuming that you are familiar with basic linear algebra, in particular the following topics:

  • vectors, matrices
  • eigenvalues and eigenvectors
  • Gram-Schmidt orthogonalization (although you don't need explicit knowledge about it, only a passing idea of what it does)


Important notation

Conventions and notations

Recall that we want to find the minimum \[x^\star = \operatorname{arginf}_{x\in\mathbb R^n} \frac{1}{2}x^TAx -b^Tx + c\] for a symmetric, positive definite matrix \(A\in \mathbb R^{n\times n}\), and that the objective function's derivative is given by \(f'(x) = Ax - b.\)

Surfaces of constant value of \(f\) ("level sets") are ellipses in 2d, ellipsoids in 3d and so on.

Figure 1a: Level sets of a quadratic function \(f:\mathbb R^2 \to \mathbb R.\)
Figure 1b: Level sets of a quadratic function \(f:\mathbb R^3 \to \mathbb R.\)

We are going to discuss iterative optimization methods, i.e. we will construct a sequence \[x_0, x_1, x_2,\ldots\] such that

  1. Every iteration is computationally feasible and
  2. The more iterations we compute, the closer we are to \(x^\star\), i.e. \(\lim_{i\to\infty} x_i = x^\star.\)

Compare this to the direct solution of the optimization problem, which is possible by computing \(x^\star \) such that \(A x^\star = b\). This is one giant calculation and it does not make sense to "stop early", i.e. you are either done or not (with the obvious upside of the solution being provably exact).

The following two definitions are of vital importance for the rest of the article so try to really internalize them (see also the cheat sheet on the left hand side).

Given a current point (remember, we optimize in iterations; this is one of those iterates) \(x_i\), we define

  • The error \[e_i := x_i - x^\star\] is the vector that points from the solution to the current iterate. This will be an important variable in the analysis of Gradient Descent and CG, but we will never be able to obtain it during the optimization procedure; otherwise we would be done immediately: for a given \(x_i\) and known \(e_i\), the solution is \(x^\star = x_i - e_i\).
  • The residual \[r_i := b - Ax_i = -f'(x_i) = -Ae_i\] should be interpreted as the direction of steepest descent (starting in \(x_i\)). There is another interpretation as an error mapped into the space in which \(b\) resides, but this is largely irrelevant for the exposition here.
Figure 2: For a given iterate \(x_0\), error \(e_0\) (in grey because we cannot use it in practice) and residual \(r_0\). Note that the residual (as the direction of steepest descent) needs to be perpendicular to level sets of the objective function (grey ellipses).

The following items are major take-aways from this section:

  • The error \(e_i\) is a quantity used in theoretical analysis of the optimization algorithms but is inaccessible for each concrete optimization problem.
  • The residual \(r_i\) is the direction of steepest descent and is easily computed by \(r_i = b - Ax_i\).
  • Error and residual do not, in general, point in the same direction. It is for this reason that gradient descent does not converge in exactly one step (more on this later).
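To make the error/residual distinction concrete, here is a tiny numerical illustration; the matrix and vectors are made up for this sketch and do not refer to the article's figures:

```python
import numpy as np

# Made-up 2x2 example, purely to illustrate error vs. residual.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([2.0, 1.0])

x_star = np.linalg.solve(A, b)    # (1, 1); shown only to compute e_0, unknown in practice
x0 = np.array([3.0, -1.0])        # some starting point

e0 = x0 - x_star                  # error: requires x_star, hence inaccessible during a run
r0 = b - A @ x0                   # residual: computable from A, b and x_0 alone
print(e0, r0)                     # e_0 = (2, -2), r_0 = (-4, 2): different directions
print(np.allclose(r0, -A @ e0))   # True: r_0 = -A e_0
```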

Gradient descent

A decent first idea for function minimization

Minimization of a function amounts to finding "a good way down into the valley". Smart people in Gloucestershire, UK, have developed a decently efficient way of finding a quick way down a local elevation called Cooper's Hill by letting a round cheese roll down it and following it.

This is essentially how gradient descent works: Standing at some point \(x_i\), you figure out the direction of steepest descent (which is exactly the negative gradient) and you make a hop of a certain length \(\alpha_i\) (yet to be determined) along that direction: \[ \begin{aligned} x_{i+1} &= x_i + \alpha_i \cdot (-f'(x_i))\\ &=x_i + \alpha_i \cdot r_i. \end{aligned}\] Recall that \(r_i = -f'(x_i).\) We are dealing with iterative methods here, so we have already started numbering the points along which we hop during our optimization procedure.

Figure 3: Gradient descent with correct step size \(\alpha_i\), leading to characteristic 90 degrees zig-zag path. Note how steepest descent means descending in direction perpendicular to the surface of constant "cost" \(f\), i.e. the level set.

Now what is a good way of choosing \(\alpha_i\)? Gradient descent is a general-purpose method, very useful in many optimization settings (one of the most fashionable examples being stochastic gradient descent for training artificial neural networks in machine learning). Usually, one just picks some "small" \(\alpha\) and makes it bigger or smaller depending on how well it seems to work.

But we are currently in the setting of quadratic optimization (although CG can be generalized to nonlinear optimization), where we can do a lot better! We will choose \(\alpha_i\) such that \(f(x_{i+1})\) is minimal as a function of \(\alpha_i\). This means we simplify locally to a one-dimensional optimization problem: minimizing along \(x_i + \mathbb R \cdot r_i\). Hence, \[ 0 \stackrel{!}{=} \frac{\mathrm d}{\mathrm d \alpha_i} f(x_{i+1}) = f'(x_{i+1})^T\cdot\frac{\mathrm d x_{i+1}}{\mathrm d \alpha_i} = -f'(x_{i+1})^T\cdot f'(x_i). \] Now this is very interesting: The optimal choice of \(\alpha_i\) sets the position of \(x_{i+1}\) such that the gradient at \(x_{i+1}\), which is \(f'(x_{i+1})\), is orthogonal to the direction of steepest descent at \(x_i.\) Or, more succinctly, we have to go along zig-zags (with right angles between a zig and a zag).

That's nice but we still do not have a numerical value for \(\alpha_i\). Actually, the correct value is given by \[\alpha_i = \frac{r_i^Tr_i}{r_i^TAr_i},\] as the following calculation shows:

\(\alpha_i\) is determined by the condition \(f'(x_{i+1})^T\cdot f'(x_i) = 0\), and recalling that by definition of the residual, \(f'(x_k) = -r_k,\) this is equivalent to \[\begin{aligned} 0 &\stackrel{!}{=} r_{i+1}^Tr_i = (b- Ax_{i+1})^Tr_i = (b- A[x_i+\alpha_i r_i])^T r_i \\ &= (b-Ax_i)^Tr_i - \alpha_i\cdot (Ar_i)^Tr_i \\ &= r_i^Tr_i - \alpha_i\cdot r_i^TAr_i \end{aligned},\] which means \[\alpha_i = \frac{r_i^Tr_i}{r_i^TAr_i}.\]

This means that the gradient descent method for quadratic function minimization works as follows:
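In the original article this is an interactive algorithm box; here is a minimal Python/NumPy sketch of the same procedure. The example data at the bottom is assumed (a small SPD system), just to exercise the function:

```python
import numpy as np

def gradient_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Steepest descent for f(x) = 1/2 x^T A x - b^T x with the optimal step
    size alpha_i = (r_i^T r_i) / (r_i^T A r_i) derived above.
    A is assumed to be symmetric positive definite."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual = direction of steepest descent
        if np.linalg.norm(r) < tol:      # stop once the gradient is (almost) zero
            break
        alpha = (r @ r) / (r @ (A @ r))  # optimal step size along r
        x = x + alpha * r
    return x

# Assumed example data, not taken from the article's figures:
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(gradient_descent(A, b, np.array([-2.0, -2.0])))   # converges to the minimizer
```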

The following consideration explains why gradient descent is an inefficient method: Consider again figure 3, an optimization problem in 2d. This happens:

  1. We start at \(x_0\) and optimize along search direction \(r_0\), yielding the optimal step size \(\alpha_0\) and the next iterate \(x_1\)
  2. At \(x_1\), we optimize along search direction \(r_1\) which is perpendicular to \(r_0\). After doing that, we arrive at \(x_2\)
  3. At \(x_2\), we optimize along search direction \(r_2\), which is identical to \(r_0\). This means, we need to make another step along a direction we already "optimized along"
  4. In short, we lose optimality with respect to directions we have already optimized along, over and over again.

The same problem arises in higher dimensions as well. Of course, \(r_2\) will not exactly be \(r_0\) but there will be significant overlap between them.

How can we design an algorithm that does not fall into that trap? In the next section we will discuss an idea which won't work (but almost so). After massaging this idea a bit we will come up with a practical algorithm.

This section in bullet points:

  • The gradient descent is a good off-the-shelf method for optimization.
  • For quadratic optimization problems we can derive the optimal step size.
  • Even with the optimal step size, gradient descent walks along inefficient zig-zags.
  • The main issue is that we keep losing optimality along directions we have already optimized along.
  • Conjugate gradients will be able to overcome this problem.

Orthogonal directions descent

An idea that does not work but almost gives the right idea

Obviously, the best step is the one that takes us from the initial guess \(x_0\) directly to the optimum. This is possible for gradient descent if the initial error \(e_0\) is an eigenvector of \(A.\) But apart from this very lucky case, there is no way of computing this "perfect" step without solving the problem analytically.

So, what is the next best idea? Look again at figure 3: The gradient descent algorithm repeatedly steps in a direction which it has already traversed. Wouldn't it be nice if, instead of an infinite series of "zig-zag-zig-zag-zig-...", we pool all zigs and zags and limit ourselves to one "zig" and one "zag"? In other words, can't we just fix a set of orthogonal directions \(\{d_0,\ldots, d_{n-1}\}\) (in \(\mathbb R^2\) this would be a set of two vectors, and accordingly in higher dimensions) and walk along each of those directions exactly once?

In short: instead of zig-zagging back and forth along the same few directions over and over, we want to walk along each direction exactly once.

For our two-dimensional example, this optimization utopia would look as in figure 4: We pick any starting direction; in this case we reasonably set \(d_0 := -f'(x_0)\), i.e. we start by going down as steeply as possible, and define \(d_1\) to be the only direction perpendicular to \(d_0\) (unique up to scaling; note that this is a particularity of \(\mathbb R^2\), in higher dimensions we would need to choose an arbitrary perpendicular vector).

Then we go just the right amount along \(d_0\) and conclude by a second step along \(d_1\), which brings us directly to \(x^\star.\) This "works" analogously in higher dimensions.

Figure 4: Orthogonal directions descent (with correct step size \(\alpha_i\), leading to characteristic 90 degrees zig-zag path). Note that we are completely ignoring the level sets of the objective function: On the path from \(x_0\) to \(x_1\) we even resume going up the hill after tangentially touching a level set!

Let's see how we might do that: We have chosen an orthogonal set of vectors \(\{d_0,\ldots, d_{n-1}\}\) and we aim to make just \(n\) steps in order to reach \(x^\star:\) \[ x_{i+1} = x_i + \alpha_i d_i\] How do we need to choose the \(\alpha_i\)? We never want to make a single step along a direction we have already taken, so we demand that \(d_i^Te_{i+1} = 0,\) i.e. the distance that separates us from the optimum does not have any component along the direction of \(d_i\). As \[e_{i+1} = x_{i+1}-x^\star = x_i + \alpha_i d_i - x^\star = e_i + \alpha_i d_i,\] this requirement means that \[0 = d_i^T (e_i + \alpha_i d_i),\] which is equivalent to \[\alpha_i = -\frac{d_i^Te_i}{d_i^Td_i}.\] So this is where our little dream needs to die: We have no hope of evaluating \(\alpha_i\) without explicit knowledge of \(e_i\), which, remember, is as good as knowing the solution itself.

It seems that our idea of choosing a set of \(n\) orthogonal directions and planning on doing exactly \(n\) steps in order to arrive at \(x^\star\) was a bit too optimistic. But actually, the general idea was good. The flaw in it was choosing orthogonal directions and thereby choosing a set of vectors which was not adapted to the "geometry" of the problem. Of course, we like to use orthogonal sets of vectors for all kinds of linear algebra stuff. But let's look again at level sets of \(f\) in combination with orthogonal vector pairs at various locations. See how they do not quite fit together? The level sets (which, remember, describe the geometry of our optimization problem) appear to insist on a specific scaling of space, i.e. something like "the east-southeast direction is less important than the north-northeast direction" (recall that tighter level sets mean larger growth in that direction, hence the north-northeast direction has steeper ascent in \(f\)). The orthogonal vector pairs on the other hand pretend that space is just regular, rotation-invariant \(\mathbb R^2.\) Those two conceptions do not really work together. Orthogonal directions ignore the geometry of the problem. The solution will be to use \(A\)-orthogonal, or conjugate, vector sets.

Figure 5: Level sets of \(f\) and pairs of orthogonal vectors.


Conjugacy

Some motivation for the use of \(A\)-orthogonal, or conjugate, vector sets.

We define \(A\)-orthogonality, or conjugacy as follows: Two vectors \(u, v\) are said to be A-orthogonal (or conjugate), i.e. \(u~\bot_A~ v\), if \[u^TAv = 0\]

What does that mean? This is best explained geometrically:

Let's take our familiar level set picture, print it on a sheet of (chewed) bubblegum and tug at it in such a way that the elliptical level sets become circles.

Figure 6: Stretching the level set picture.

Now we choose pairs of vectors which are not orthogonal in the original figure but appear perpendicular on the stretched version. See how, depending on the vectors' orientation, the angle between them is sometimes acute and sometimes obtuse. It is a right angle exactly if the vectors point along the axes of the ellipse.

Figure 7: Conjugate vectors.

Play with the slider below to get a feel for what A-orthogonal vector pairs look like.
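If you prefer a numerical check over the visual one, conjugacy is a one-liner in code; the matrix below is a made-up SPD example:

```python
import numpy as np

def are_conjugate(u, v, A, tol=1e-12):
    """u and v are A-orthogonal (conjugate) iff u^T A v = 0 (up to rounding)."""
    return abs(u @ A @ v) < tol

# Made-up symmetric positive definite matrix:
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])

u = np.array([1.0, 0.0])
print(are_conjugate(u, np.array([0.0, 1.0]), A))   # False: orthogonal, but not A-orthogonal
print(are_conjugate(u, np.array([2.0, -3.0]), A))  # True: u^T A v = 3*2 + 2*(-3) = 0
```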


Major message of this section:

  • We hate zig-zags because they are inefficient.
  • We know we cannot expect to jump from a starting point directly to the minimum \(x^\star\).
  • In \(n\)-dimensional space, we concede that the algorithm might need at least \(n\) steps, "one for each dimension".
  • We say, ok, but \(n\) steps is really all you will get.
  • Even more, we do not want to just be close to the minimum after the \(n\)th step, but we demand to find it exactly, i.e., \(x_{n} = x^\star.\)
  • We realized that taking the orthogonal coordinate directions won't do the trick, but those \(A\)-orthogonal/conjugate directions seem to be a promising idea.
  • This is what we will do: Given a set of conjugate directions \(\{d_0,\ldots, d_{n-1}\},\) we will take exactly \(n\) steps \[x_{i+1} = x_i + \alpha_i\cdot d_i.\] What are the values of \(\alpha_i\) such that \(x_n = x^\star\)?

Conjugate directions descent

The main ideas of the CG method

So assume now that we have a set of \(n\) A-orthogonal (or conjugate) directions along which we want to take our descent. Now this is a big assumption: How do we find such a set of vectors? This is not clear at all right now but we will face that problem later. For now, we imagine a good fairy handing us a set of vectors \[\{d_0,\ldots, d_{n-1}\}\] which are conjugate (or A-orthogonal), i.e. for any \(i \neq j\): \[ d_i^TAd_j = 0.\]

For example, in the setting of the optimization problem and with conjugate vectors from figure 7 in \(\mathbb R^2\), any pair of vectors depicted there will do.

Then we start in some \(x_0\in \mathbb R^n\) and define \[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n-1} + \alpha_{n-1}\cdot d_{n-1}\end{aligned}\]

Also, we are being really greedy by demanding that \(x_n = x^\star,\) i.e. we want the exact solution to be found in at most \(n\) steps. Note that the only unknowns in this sequence of equations are the step sizes \(\alpha_i,\) everything else is either one of the directions \(d_i\) or the result of the preceding calculation.

How do we set those step sizes \(\alpha_i\) such that we achieve our goal of \(x_n = x^\star\)? The requirement is that \(e_{i+1}\) is conjugate to \(d_i\). Intuitively, this means that we will never go in the direction of \(d_i\) again. (There is a big caveat to this "never" that you might have spotted: Gradient descent had the issue that we lost optimality along directions which we had already optimized along. How can we be sure that something like this does not happen here? This is an important point which will be resolved later. For now, don't worry about it.) It makes sense, actually: If we force the algorithm to manage with \(n\) steps, each of which points in a specific direction \(d_i,\) it can only go in the direction of \(d_i\) in this step (or it will need an additional step along \(d_i\) later, which we forbid).

So, this means that we need \[d_i^T Ae_{i+1} = 0,\] i.e. \[ \begin{aligned}0 &= d_i^T A e_{i+1} = d_i^TA(e_i + \alpha_i d_i)\\ &= d_i^T (-r_i + \alpha_i Ad_i),\end{aligned}\] which is equivalent to \[ \alpha_i = \frac{d_i^T r_i}{d_i^TAd_i}.\] In contrast to the equivalent condition for the "orthogonal directions" idea, this is actually computable: The \(d_i\) are fixed (again, we still need to talk about how the heck we will get those vectors, but that's for later), and the residual \(r_i\) is just \(b-Ax_i,\) which is easily obtained.

At the same time, this conjugacy condition (note that \(0 = d_i^T A e_{i+1}\) means that \(d_i\) and \(e_{i+1}\) are conjugate) has another interpretation: Choosing \(\alpha_i\) is equivalent to finding the minimum point along the search direction \(d_i\): \[\begin{aligned} 0 &= \frac{\mathrm d}{\mathrm d \alpha} f(x_{i+1})\\ &= f'(x_{i+1})^T\frac{\mathrm d}{\mathrm d \alpha} x_{i+1} \\ &= -r_{i+1}^T d_i \\ &= d_i^TAe_{i+1}.\end{aligned}\] This is wonderful: The global minimization procedure is reduced to a sequence of minimization procedures along fixed, conjugate directions. Contrast this to our failed idea of orthogonal directions descent, where (even with the impracticality of orthogonal directions aside) we would have been forced to climb up the hill first before going along the next direction.

Now we claim that the sequence of computations \[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n-1} + \alpha_{n-1}\cdot d_{n-1}\end{aligned}\] with \[ \alpha_i = \frac{d_i^T r_i}{d_i^TAd_i}\] finds us the optimum in at most \(n\) steps, i.e. \(x_n = x^\star.\)

Before we go into the details of why that is correct, let's try to find numerical validation first: We take our minimization problem from before, we start at \(x_0 = (-2,-2)\) and we take the following set of conjugate directions: \[d_0 = \begin{pmatrix} 12 \\ 8\end{pmatrix}, d_1 = \begin{pmatrix} 18 \\ -13\end{pmatrix}.\] For now we do not worry about how we obtained this particular set of conjugate directions. Incidentally (not really, of course), we have chosen the first vector to be pointing along the direction of steepest descent at the initial value \(x_0.\) The iterations are calculated here:

We have chosen \(d_0 = r_0\) so that we actually start off in direction of steepest descent. This is a nice but (almost) arbitrary choice. \[\begin{aligned}\alpha_0 &= \frac{d_0^Tr_0}{d_0^TAd_0}= \frac{13}{75}\\ x_1 &= x_0 + \alpha_0 d_0 = \begin{pmatrix}0.08\\-0.61\overline 3\end{pmatrix}\\ r_1 &= b - Ax_1 = \begin{pmatrix} 2.98\overline 6\\-4.48\end{pmatrix}\\ \alpha_1 &= \frac{d_1^Tr_1}{d_1^TAd_1}= \frac{112}{1050}\\ x_2 &= x_1 + \alpha_1 d_1 = \begin{pmatrix}2\\-2\end{pmatrix},\end{aligned} \] and \(x_2= x^\star\) indeed.
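The original article defines \(A\) and \(b\) in its interactive figures; the numbers above are consistent with \(A = \begin{pmatrix}3 & 2\\ 2 & 6\end{pmatrix}\) and \(b = \begin{pmatrix}2\\-8\end{pmatrix}\), which is what the following short sketch assumes in order to reproduce the two steps:

```python
import numpy as np

# Assumed quadratic behind the figures (inferred from the numbers above).
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.array([-2.0, -2.0])                                     # starting point x_0
directions = [np.array([12.0, 8.0]), np.array([18.0, -13.0])]  # conjugate w.r.t. A

for d in directions:
    r = b - A @ x                     # residual at the current iterate
    alpha = (d @ r) / (d @ (A @ d))   # step size from the conjugacy condition
    x = x + alpha * d
    print(x)                          # prints (0.08, -0.6133...), then (2, -2) = x_star
```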
Plotting those two iterations yields figure 8. Note how the conjugate directions look perpendicular in the stretched plot on the right hand side.
Figure 8: Conjugate directions optimization method.
It seems that the algorithm terminates indeed after two steps. This might be a coincidence, so we try another pair of conjugate directions, this time not choosing the steepest descent as our first vector: \[d_0 = \begin{pmatrix} 0 \\ 1\end{pmatrix}, d_1 = \begin{pmatrix} 3 \\ -1\end{pmatrix}.\] The math is here:

\[\begin{aligned}\alpha_0 &= \frac{d_0^Tr_0}{d_0^TAd_0}= \frac{4}{3}\\ x_1 &= x_0 + \alpha_0 d_0 = \begin{pmatrix}-2\\-2/3\end{pmatrix}\\ r_1 &= b - Ax_1 = \begin{pmatrix} 28/3\\0\end{pmatrix}\\ \alpha_1 &= \frac{d_1^Tr_1}{d_1^TAd_1}= \frac{28}{21}\\ x_2 &= x_1 + \alpha_1 d_1 = \begin{pmatrix}2\\-2\end{pmatrix},\end{aligned} \] and \(x_2= x^\star\) again.
Figure 9: Conjugate directions optimization method, different conjugate vectors.

Again, the conjugate directions algorithm terminates in two steps.

In three dimensions, the conjugate directions algorithm takes three steps.

Figure 10a: Conjugate directions method in 3 dimensions in original coordinates. You can rotate and zoom this figure.
Figure 10b: Conjugate directions method in 3 dimensions in stretched coordinates (note how the conjugate directions appear perpendicular in this plot). You can rotate and zoom this figure.

So, if you accept those three positive examples as evidence for the effectiveness of the conjugate directions optimization method, you might wonder how and why it can work. The following lemma proves this.

Lemma 1: Convergence of the conjugate directions optimization method in \(n\) steps.

Given a basis of conjugate vectors \(\{d_i\}_{i=0}^{n-1}\), the method of conjugate directions arrives at the optimum in at most \(n\) steps.

Proof:

The proof for that is actually quite short and very illuminating. We begin by considering the initial error \(e_0,\) i.e. the (unknown) difference \(x_0-x^\star.\) This we can write as \[e_0 = \sum_{j=0}^{n-1}\delta_j \cdot d_j.\] Why is that possible? The set \(\{d_i\}_i\) constitutes a basis of \(\mathbb R^n\) and \(e_0\) is an element of that vector space, hence it can be obtained as a linear combination of the basis.

How do we obtain the \(\delta_j\)? We can multiply this characterization in turns by each \(d_j^TA\) from the left. Then every time, by conjugacy, all but one term on the left hand side vanishes: \[d_k^TAe_0 = \sum_{j=0}^{n-1}\delta_j \cdot d_k^TAd_j = \delta_k \cdot d_k^TAd_k\] We solve for \(\delta_k\) and obtain \[\delta_k = \frac{d_k^TAe_0}{d_k^TAd_k}.\] The next step is "adding 0": We can include additional terms next to \(e_0\) which vanish in combination with the term \(d_k^TA\) in front: \[\delta_k = \frac{d_k^TA{\color{blue}\left(e_0+\sum_{i=0}^{k-1}\alpha_i d_i\right)}}{d_k^TAd_k}.\] Now recall that our iterations are given by \[x_{i+1} = x_i + \alpha_i d_i\] and thus \[e_{i+1} = e_i + \alpha_i d_i\] and, iteratively, \[e_k = {\color{blue}e_0 +\sum_{i=0}^{k-1}\alpha_i d_i},\] so, all in all, replacing the blue term: \[\delta_k = \frac{d_k^TAe_k}{d_k^TAd_k}.\] This is exactly the negative of our step length: \[\alpha_k = -\delta_k.\] What does that mean? There are two ways of thinking about the conjugate directions method:

The initial position \(x_0\) is updated step for step in direction of the conjugate directions \(d_i\) and is built up along those components.

Alternatively we can think of \(e_0\) starting with error contributions from all directions, with each step along a conjugate direction \(d_i\) "biting away" the corresponding component in this error term until the last step takes away the remaining error:

\[\begin{aligned}e_i &= e_0 + \sum_{j=0}^{i-1}\alpha_j \cdot d_j\\ &= \sum_{j=0}^{n-1}\delta_j\cdot d_j + \sum_{j=0}^{i-1}(-\delta_j) \cdot d_j\\ &= \sum_{j=i}^{n-1}\delta_j \cdot d_j\end{aligned} \]

Note that this sum contains fewer and fewer terms with each iteration \(i\mapsto i+1\) and loses its last term at \(n-1\mapsto n,\) which is exactly the last iteration.

Let's take another closer look at the Conjugate Directions method, i.e. someone gave us a nice set of \(n\) conjugate (A-orthogonal) directions \(\{d_i\}_{i=0}^{n-1}\).

We make another observation: Conjugate Directions finds at every step the best solution within the bounds of where it's been allowed to explore.

What does that mean? We define \[\mathcal D_i := \operatorname{span}\{d_0,d_1\ldots,d_{i-1}\}.\] Then from the equation \[e_i = e_0 + \sum_{j=0}^{i-1}\alpha_j d_j,\] we get that \(e_i \in e_0 + \mathcal D_i.\) We claim that Conjugate Directions works such that \(e_i\) is "optimal within the bounds of where it's been allowed to explore". Where has the iteration been allowed to explore so far? That is exactly \(x_0 + \mathcal D_i.\) Equivalently we can say that the error has been "allowed to explore" \(e_0 + \mathcal D_i.\) In which sense is this optimal? The claim is the following lemma:

Lemma 2: Optimality of the error term

We define the \(A\)-norm by \[\|x\|_A^2 := x^TAx.\] Then \[ e_i = \operatorname{argmin} \{\|\epsilon\|_A:~ \epsilon\in e_0 + \mathcal D_i\}.\] I.e., \(e_i\) is the element in \(e_0 + \mathcal D_i\) with minimal \(A\)-norm.

Proof:

We can decompose (see the proof of lemma 1) \[ e_i = \sum_{j=i}^{n-1}\delta_j\cdot d_j\] and thus \[\begin{aligned} \|e_i\|_A^2 &= \sum_{j=i}^{n-1}\sum_{k=i}^{n-1}\delta_j\delta_k\cdot d_j^TAd_k \\ &= \sum_{j=i}^{n-1}\delta_j^2 \cdot d_j^TAd_j. \end{aligned}\] On the other hand, an arbitrary element \(\epsilon \in e_0 + \mathcal D_i\) has the following form: \[\epsilon = \underbrace{\sum_{j=0}^{n-1}\delta_j\cdot d_j}_{=e_0} + \underbrace{\sum_{j=0}^{i-1} \kappa_j\cdot d_j}_{\in \mathcal D_i} = \sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j + \sum_{j=i}^{n-1} \delta_j\cdot d_j.\] The A-norm of \(\epsilon\) is given by \[\begin{aligned} \|\epsilon\|_A^2 &= \left\| \sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j + \sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2\\ &= \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)^TA \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right) \\ &+ \sout{\left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)^TA \left( \sum_{j=i}^{n-1} \delta_j\cdot d_j\right)}\\ &+ \sout{\left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)^TA \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)}\\ &+ \left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)^TA \left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)\\ &= \left\|\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right\|_A^2 + \left\|\sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2 \\ &\geq \left\|\sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2 \\ &= \|e_i\|_A^2\end{aligned}\] Note that the two terms in the third and fourth line vanish because of A-orthogonality of the \(d_j\). In the end we just drop a nonnegative term and obtain exactly the A-norm of \(e_i\). This means that for any \(\epsilon \in e_0 + \mathcal D_i,\) we have \[\|\epsilon\|_A^2 \geq \|e_i\|_A^2,\] which is exactly optimality of \(e_i.\)

Let's talk about intuition next. What does this sense of optimality mean?

If we start with some \(x_0\), then we pick \(x_1\) such that it is the minimum point of \(x_0 + \mathcal D_1.\) This is just how we defined Conjugate Directions Descent (and exactly how Gradient Descent worked as well).

The surprising part is that optimality is preserved in all following iterations as well. This does not work in Gradient Descent, as we have already observed:

Consider again figure 3 (shown again below for convenience): Starting at \(x_0,\) Gradient Descent optimizes along search direction \(r_0\), which yields \(x_1.\) From there, it optimizes along search direction \(r_1\), yielding \(x_2.\) After that, the next search direction is \(r_2=r_0,\) i.e. the same search direction as in step 1. But now, we are not optimal in \(x_0 + \mathcal D_2\) anymore! We need to optimize along direction \(r_0\) again. Then again along \(r_1,\) because we have lost optimality in that direction as well, and so forth...

So, why should we indeed expect that "optimality with respect to search direction \(d_0\)" is conserved after making another step along \(d_1\) and stopping at \(x_2\)? There are again two possible explanations for that (choose whichever suits you better):

  • Analytically: At point \(x_2,\) our remaining error term is \(e_2 = \sum_{j=2}^{n-1} \delta_j \cdot d_j\) (this is lemma 1 and its proof), i.e. there is no contribution along direction \(d_0\) left (and no contribution from direction \(d_1\) either, but this is less surprising because \(x_2\) was obtained by cutting away the \(d_1\) contribution). This means we cannot decrease the A-norm anymore by going along \(d_0.\) We are still optimal along direction \(d_0\)!
  • Figure 11a.
    Figure 11b.
  • By stretching: Look at figure 11b: \(x_1\) was obtained by picking a point on \(x_0 + \mathcal D_1\) such that \(e_1\) was A-orthogonal to \(d_0\). In this stretched figure, this means that \(e_1\) appears perpendicular to \(d_0=r_0.\) Arrived at \(x_2,\) we see that this has not changed: Now \(d_0\) and \(e_2\) appear perpendicular, i.e. they are in fact A-orthogonal. By going even a short distance away from \(x_2\) in direction of \(d_0\) (shown in dashed), we would leave the smaller sphere depicting a set of constant A-norm, hereby increasing our A-norm. Hence, \(x_2\) is still optimal in direction of \(d_0.\) Note that we are of course also optimal with respect to the more recent search direction \(d_1\) but that is to be expected. The non-losing of optimality along past search directions is more surprising.

A good mental image for how figure 11b works is to imagine yourself standing at the solution point \(x^\star,\) pulling a string connected to a bead that is constrained to lie in \(x_0 + \mathcal D_i.\) Each time the expanding subspace \(\mathcal D\) is enlarged by a dimension, the bead becomes free to move a little closer to you.

The following property of Conjugate Directions Descent will be of paramount importance:

Lemma 3: Orthogonality of the residual

In the \(i\)th iteration of Conjugate Directions Descent, the residual is orthogonal to all previous search directions: \[r_{i+1} ~\bot~ d_0, d_1, \ldots, d_i.\]

Proof:

Recall that the error after the \((i+1)\)th step is given by \[e_{i+1} = \sum_{l=i+1}^{n-1}\delta_l\cdot d_l.\] Then we premultiply this equation on both sides with \(-d_j^TA\) for some \(j \leq i,\) which yields \[\begin{aligned}-d_j^T\underbrace{Ae_{i+1}}_{=-r_{i+1}} &= - \sum_{l=i+1}^{n-1}\delta_l \cdot \underbrace{d_j^TAd_l}_{=0~ (j < l)}\\ \Leftrightarrow d_j^Tr_{i+1} &= 0\end{aligned},\] which means exactly that \(r_{i+1}\) is orthogonal to all previous search directions \(d_j\), \(j=0,\ldots, i.\)

Note that an equivalent proposition does not hold for Gradient Descent: The new descent direction \(r_{i+1}\) is indeed orthogonal to the previous descent direction \(r_i\), but not orthogonal to \(r_{i-1}\) (which can be seen by looking at the characteristic zig-zag pattern).

Here's a recap of this (long) section's main ideas:

  • The conjugate directions optimization method works really well.
  • We arrive at the optimum in at most \(n\) steps.
  • The iteration of "local optimization procedures" (along the current search direction \(d_i\)) does not lose past progress as gradient descent does: Once we are optimal along some direction, we stay optimal along it (and going along an old search direction again would necessarily make things worse).
  • But... How do we get a set of conjugate directions to start with?

Gram-Schmidt conjugation

An algorithm for obtaining a set of conjugate directions (which we will not use).

How do we obtain a set of \(n\) conjugate (A-orthogonal) directions \(\{d_i\}_{i=0}^{n-1}\)? One possibility is Gram-Schmidt conjugation. This works just like Gram-Schmidt orthogonalization, but instead of producing orthogonal vectors, it generates conjugate vectors. It works as follows:

  1. Take a set of \(n\) linearly independent vectors \(\{u_i\}_{i=0}^{n-1}.\)
  2. Define \(d_0 = u_0.\)
  3. Construct \(d_1\) by removing all components from \(u_1\) which are not A-orthogonal to \(d_0\)
  4. Construct \(d_2\) by removing all components from \(u_2\) which are not A-orthogonal to \(d_0\) and \(d_1\)
  5. ...
  6. Construct \(d_{n-1}\) by removing all components from \(u_{n-1}\) which are not A-orthogonal to all \(d_i\), \(i=0,\ldots,{n-2}.\)

In formulae, we iteratively define (for \(i = 1,\ldots, n-1\)) \[d_i = u_i + \sum_{k=0}^{i-1}\beta_{ik}d_k.\] What are the \(\beta_{ik}\)? We premultiply with \(d_j^TA\) for some \(j\in\{0,\ldots, i-1\},\) to get \[\begin{aligned}d_j^TAd_i &= d_j^TAu_i + \sum_{k=0}^{i-1}\beta_{ik}d_j^TAd_k\\ 0 &= d_j^TAu_i + \beta_{ij}d_j^TAd_j\\ \beta_{ij} &= - \frac{d_j^TAu_i}{d_j^TAd_j}\end{aligned}\]
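As a sketch, the conjugation procedure just derived might look as follows in code; the example at the bottom conjugates the standard basis of \(\mathbb R^2\) against a made-up SPD matrix:

```python
import numpy as np

def gram_schmidt_conjugate(U, A):
    """Turn the linearly independent columns of U into A-orthogonal (conjugate)
    directions, using beta_ij = -(d_j^T A u_i) / (d_j^T A d_j) as derived above."""
    directions = []
    for u in U.T:                                  # iterate over the columns of U
        d = u.copy()
        for dj in directions:
            beta = -(dj @ A @ u) / (dj @ A @ dj)   # remove the non-conjugate component
            d = d + beta * dj
        directions.append(d)
    return np.column_stack(directions)

# Made-up SPD matrix; conjugate the standard basis of R^2 against it:
A = np.array([[3.0, 2.0], [2.0, 6.0]])
D = gram_schmidt_conjugate(np.eye(2), A)
print(D.T @ A @ D)    # off-diagonal entries ~0: the columns of D are mutually conjugate
```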

Thus, in principle we could do the following:

Gram-Schmidt + Conjugate Directions Descent
  1. Generate a linearly independent basis of "proposal directions" \(\{u_i\}_{i=0}^{n-1}.\)
  2. Use Gram-Schmidt in order to construct a basis of conjugate directions \(\{d_i\}_{i=0}^{n-1}.\)
  3. Do Conjugate Directions Descent in order to solve the minimization problem.

This works, but it is too costly in both memory and computational effort: we need to conjugate each \(u_i\) against all previously computed search directions \(\{d_j\}\), and the cost of this grows quite fast. In fact, the computational expense needed is exactly the effort needed for explicitly solving the linear equation \(Ax = b\) by Gaussian elimination, which we desperately wanted to avoid in the first place.

Note that the Gram-Schmidt construction of the \(i\)th search direction \(d_i\) only needs the previous search directions. Thus we could in principle choose to do the following:

Gram-Schmidt + Conjugate Directions Descent (taking turns)
  1. Generate a linearly independent basis of "proposal directions" \(\{u_i\}_{i=0}^{n-1}.\)
  2. Set \(d_0 = u_0.\)
  3. Make one step of Conjugate Directions Descent.
  4. Make one step of Gram-Schmidt in order to construct the next search direction \(d_1.\)
  5. Make one step of Conjugate Directions Descent.
  6. Make one step of Gram-Schmidt in order to construct the next search direction \(d_2.\)
  7. ...
  8. Make one step of Gram-Schmidt in order to construct the next search direction \(d_{n-1}.\)
  9. Make the last step of Conjugate Directions Descent.

But this does not alleviate the problem at all, because the computational complexity is the same.

The crucial idea is the following: We do not fix the set \(\{u_i\}\) right from the beginning. Instead:

Dream method (this will be CG)
  1. Set some clever initial search direction \(d_0\)
  2. Make one step of Conjugate Directions Descent.
  3. Pick a (good) next candidate for a search direction \(u_1\) and construct a search direction \(d_1\) conjugate to \(d_0\).
  4. Make one step of Conjugate Directions Descent.
  5. Pick a (good) next candidate for a search direction \(u_2\), which is already conjugate to \(d_0\) and construct from it a search direction \(d_2\) conjugate to \(d_1\), with \(d_2~\bot_A~ d_0\) already by choice.
  6. Make one step of Conjugate Directions Descent.
  7. Pick a (good) next candidate for a search direction \(u_3\), which is already conjugate to \(d_0\) and \(d_1\), and construct from it a search direction \(d_3\) conjugate to \(d_2\), with \(d_3~\bot_A~ d_0, d_1\) already by choice.
  8. Make one step of Conjugate Directions Descent.
  9. ...
  10. Continue until the end

See what's happening here? We do not need to check all previously computed search directions but only the most recent one! The most recent "proposal direction" \(u_{i+1}\) is already conjugate to all previous search directions with exception of the last one. Thus the conjugation procedure which yields the next search direction \(d_{i+1}\) is a lot easier!

That means, the main idea is the following:

We iteratively construct conjugate directions in such a way that we can "forget" all previous search directions but the previous one. Then we only need to make the new direction conjugate to this previous direction. This reduces the computational complexity coming from the conjugation procedure to a fraction of what up-front-Gram-Schmidt needs.

OK, so what are these miraculous \(u_i\) "proposal directions"? In fact, they are exactly the gradients \(u_i = r_i = -f'(x_i)\). We will now discuss why this works.


The Conjugate Gradient (CG) algorithm

Some last tricks and a full description

Let's summarize again what we learned so far:

Given a set of conjugate directions \(\{d_i\}_{i=0}^{n-1},\) the algorithm of Conjugate Directions Descent terminates in at most \(n\) steps at the minimum. Our remaining challenge is to choose this set of conjugate directions. We realized that generating such a set in advance (from an initial set of proposal directions \(\{u_i\}_{i=0}^{n-1}\) that we can modify to be a conjugate set \(\{d_i\}_{i=0}^{n-1}\)) is too costly. Then we argued that by a clever choice of "proposal directions" \(u_i\) on the fly (interspersed with Conjugate Directions Descent iterations), we might be able to cut down on computational complexity. We will see in this section that a really good choice for proposal directions are the residuals/directions of steepest descent, i.e. \[ u_i = r_i.\]

The key will be to prove that this proposal direction \(r_i\) is already conjugate to all previous search directions, with exception of the last one, \(d_{i-1}.\) Then we can construct the new search direction \(d_i\) out of the "proposal direction" \(r_i\) just by making it conjugate to \(d_{i-1}.\)

With this method we can generate a set of conjugate directions \(d_i\) on the fly and simultaneously do Conjugate Directions Descent. This combination is called Conjugate Gradients (because we take the gradients and make them conjugate to previous search directions).

The last remaining key statement to prove is thus the following:

Lemma 4:

Consider the setting of Conjugate Directions. If the set of "proposal directions" is the set of residuals, i.e. \(u_i = r_i,\) then \(r_{i+1}\) is conjugate to all previous search directions with exception of the last one, i.e. \[r_{i+1} ~\bot_A ~ d_0, d_1, \ldots, d_{i-1},\] for \(i=1,\ldots, {n-1}.\)

Also, \[r_{i+1} ~\bot~ r_0, r_1, \ldots, r_i\]

Proof:

Recall the definition \(\mathcal D_i := \operatorname{span}\{d_0,d_1,\ldots d_{i-1}\}.\) The \(i\)th search direction \(d_i\) is constructed from the \(i\)th residual and all previous search directions, hence \(d_i \in \operatorname{span}\{r_i, \mathcal D_{i}\}.\) This means (note that \(d_0 = r_0\)):
  • \(\operatorname{span}\{r_0\} = \operatorname{span}\{d_0\} = \color{green}{\mathcal D_1}\)
  • \(\operatorname{span}\{r_0, r_1\} = \operatorname{span}\{{\color{green}{\mathcal D_1}}, r_1\}= \operatorname{span}\{\mathcal D_1, d_1\} = {\color{blue}{\mathcal D_2}}\)
  • \(\operatorname{span}\{r_0, r_1, r_2\} = \operatorname{span}\{{\color{blue}{\mathcal D_2}}, r_2\}= \operatorname{span}\{\mathcal D_2, d_2\} = \mathcal D_3\)
  • ...
and thus we can equivalently write \[\mathcal D_i := \operatorname{span}\{r_0,r_1,\ldots r_{i-1}\}.\] We recall that by lemma 3, \(r_{i+1}\bot \mathcal D_{i+1},\) and this proves the claim \[r_{i+1} ~\bot~ r_0, r_1, \ldots, r_i.\] Now note that \(r_{i} = -Ae_{i} = -A(e_{i-1} + \alpha_{i-1}d_{i-1}) = r_{i-1} - \alpha_{i-1}Ad_{i-1}.\) Thus, \(r_{i} \in\operatorname{span}\{\underbrace{r_{i-1}}_{\in \mathcal D_i}, \underbrace{Ad_{i-1}}_{\in A\mathcal D_i}\}. \) This means that \(\mathcal D_{i+1} = \operatorname{span}\{\mathcal D_i, r_i\} = \operatorname{span}\{\mathcal D_i, A\mathcal D_i\},\) and recursively, \[\mathcal D_i := \operatorname{span}\{d_0,Ad_0,\ldots A^{i-1}d_{0}\}.\] Now we can prove the main statement: By lemma 3 and using the characterization of \(\mathcal D_{i+1}\) from above, \(r_{i+1} \bot \mathcal D_{i+1}=\operatorname{span}\{\mathcal D_i, A\mathcal D_i\}.\) This means that, in particular, \(r_{i+1}\bot A\mathcal D_{i}.\) In other words, \[r_{i+1}~\bot_A \mathcal D_i,\] or \[r_{i+1} ~\bot_A ~ d_0,d_1,\ldots, d_{i-1}.\]

So that means that the residual \(r_i\) is a really good proposal direction in point \(x_i.\) We will now derive the formula for generating a search direction \(d_i\) from it which is also conjugate to \(d_{i-1},\) in addition to being conjugate to all search directions prior to that, \(d_0, \ldots, d_{i-2}.\) This is done by Gram-Schmidt conjugation and mainly consists of purely symbolic manipulation without further grand ideas (and you can safely skip it if you are not interested in the details).

Lemma 5:

Given a current position \(x_{i+1}\) whose residual \(r_{i+1}\) serves as a proposal direction \(u_{i+1} = r_{i+1}\) that is conjugate to almost all previous search directions, i.e. \[ u_{i+1} ~ \bot_A ~ d_0,d_1,\ldots d_{i-1},\] setting \[d_{i+1} := r_{i+1} + \beta_{i+1}\cdot d_i,\quad \beta_{i+1} := \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i},\] makes this new search direction conjugate to all previous search directions, i.e. \[ d_{i+1} ~ \bot_A ~ d_0,d_1,\ldots d_{i}.\]

Proof:

We need to modify \(u_{i+1}\) into a vector \(d_{i+1}\) such that \(d_{i+1}\) is conjugate to \(d_i.\) The Gram-Schmidt conjugation procedure yields \[d_{i+1} := r_{i+1} + \beta_{i+1}\cdot d_i,\] where \(\beta_{i+1} = -\frac{u_{i+1}^TAd_{i}}{d_i^TAd_i}.\) Note that this indeed yields a vector which is conjugate to all previous search directions, not just the most recent one: both \(r_{i+1}\) (by lemma 4) and \(d_i\) (by construction) are conjugate to \(d_0,\ldots,d_{i-1}\), and the choice of \(\beta_{i+1}\) takes care of \(d_i.\) The remaining proof consists of deriving the specific formula for \(\beta_{i+1}.\) We recall (from the proof of lemma 4) \[r_{i+1} = r_i - \alpha_i Ad_i.\] Multiplying this with \(r_{i+1}^T\) from the left gives \[r_{i+1}^Tr_{i+1} = r_{i+1}^Tr_i - \alpha_i r_{i+1}^TAd_i.\] But \(r_i^Tr_{i+1}=0\) by the second statement of lemma 4 and so \[r_{i+1}^TAd_i = -\frac{1}{\alpha_i}\cdot r_{i+1}^Tr_{i+1}. \] Using \(\alpha_i = \frac{d_i^Tr_i}{d_i^TAd_i},\) we get \[ \beta_{i+1} = \frac{r_{i+1}^Tr_{i+1}}{d_i^Tr_i}.\] But as \(d_i\) was obtained by Gram-Schmidt conjugation, the denominator can be written as \(d_i^Tr_i = r_i^Tr_i + \beta_i\cdot d_{i-1}^Tr_i = r_i^Tr_i,\) with the second term vanishing due to lemma 3. Hence, \[ \beta_{i+1} = \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}.\]
Here's the full algorithm (sketched in code below). There are two modifications which increase the efficiency of CG:
  • We use \(r_{i+1} = -Ae_{i+1} = -A(e_i + \alpha_i d_i) = r_i - \alpha_iAd_i\) to simplify computation of the residual and reduce the number of matrix multiplications.
  • The step size is given by \(\alpha_i = \frac{r_i^Tr_i}{d_i^TAd_i}\) instead of our original form \(\frac{d_i^Tr_i}{d_i^TAd_i}.\) This is valid because of the identity \(d_i^Tr_i = r_i^Tr_i,\) derived in lemma 5.
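In the original article the full algorithm is an interactive element; here is a minimal Python/NumPy sketch of the standard CG iteration with the two modifications above built in. The example data at the end is assumed, matching the 2d example used throughout:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate Gradient for f(x) = 1/2 x^T A x - b^T x, with A symmetric
    positive definite. Uses r_{i+1} = r_i - alpha_i A d_i and
    alpha_i = (r_i^T r_i) / (d_i^T A d_i), as noted above."""
    n = len(b)
    max_iter = max_iter if max_iter is not None else n
    x = x0.astype(float)
    r = b - A @ x                  # initial residual
    d = r.copy()                   # first search direction: steepest descent
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d                 # the only matrix-vector product per iteration
        alpha = rr / (d @ Ad)      # step size
        x = x + alpha * d
        r = r - alpha * Ad         # residual update without forming b - A x
        rr_new = r @ r
        beta = rr_new / rr         # coefficient from lemma 5
        d = r + beta * d           # new search direction, conjugate to the previous ones
        rr = rr_new
    return x

# Assumed example data, consistent with the 2d example above:
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, np.array([-2.0, -2.0])))   # ~ (2, -2), reached in two steps
```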

Acknowledgements

This article is basically a ripped-off digest of the absolutely great paper "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by Jonathan Shewchuk.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.