Cheat sheet
This little box keeps all the important formulae you (now don't) need to keep in mind.
Click on and symbols to expand content you are interested in.
Introduction
The Conjugate Gradient method is one of the most important ideas in scientific computing: It is applied for solving large sparse linear equations, like those arising in the numerical solution of partial differential equations and it is also used as a powerful optimization method.
This article is trying to give an intuitive introduction to the Conjugate Gradient method.from now on often abbreviated as "CG"
The CG method is based on a beautiful geometric idea but this idea is barely explained in most exposition texts, a fact that was lamented (and alleviated) by Jonathan Shewchuk's grandiose paper "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain". This article tries to transfer Shewchuk's brilliant explanations into a more interactive medium.
Note that this article is designed to be read on a decently sized monitor. While a phone screen will show the necessary info, there are some features and visualizations that might not work satisfactorily.
If you find errors, have comments on the mathematical and/or didactical exposition, please drop me an email at [email protected]
I assume you need this algorithm for one of the following usecases:
 Solving a linear system of equations \[A x = b\]for a matrix \(A\in \mathbb R^{n\times n}\) symmetric and positive definite and vectors \(x, b\in \mathbb R^n\). This setting typically arises when you need to solve partial differential equations numerically.
 Minimizing a quadratic function \[f(x) = \frac{1}{2}x^TAx  b^T x + c\] with a matrix \(A\in \mathbb R^{n\times n}\) symmetric and positive definite, vectors \(x, b\in \mathbb R^n\) and a scalar \(c\in \mathbb R\)
 Minimizing a more complicated function \[f : \mathbb R^n\to \mathbb R.\]
The first two cases are equivalent ( ).
Although CG was developed for minimization of quadratic functions, it is applied to nonquadratic minimization problems as well.
Obviously, the solution to \(Ax = b\) is \(x = A^{1}b\) but we assume that the dimension \(n\) of the problem is so high that we cannot invert the matrix. From here on, we won't allow matrix inversion as a feasible operation. We even forbid explicitly solving linear equations of the form \(Ax = b\). This is not the same as inverting \(A\) and multiplying \(A^{1}b\) and is a lot faster than that, but we assume that the problem is so big that even this will not work. We allow, on the other hand, all kinds of (valid) multiplications between vectors and matrices and assume that those can be done efficiently.
Because the CG method is most intuitively understood in the context of optimization, we will drop the \(Ax = b\) aspect for now and we will mostly look at the problem of minimizing a quadratic function of the form \[f(x) = \frac{1}{2}x^TAx  b^T x + c.\] We also make the standing assumption that \(A\) is square, symmetric and positive definite.
Because we assume that exact solution of the minimization problem (or equivalently, solving the corresponding linear equation) cannot be done efficiently,
we will look at
In the course of this article, I am assuming that you are familiar with basic linear algebra, in particular the following topics:
 vectors, matrices
 eigenvalues and eigenvectors
 GramSchmidt orthogonalization (although you don't need explicit knowledge about it, only a passing idea of what it does)
Important notation
Recall that we want to find the minimum \[x^\star = \operatorname{arginf}_{x\in\mathbb R^n} \frac{1}{2}x^TAx b^Tx + c\] for a symmetric, positive definite matrix \(A\in \mathbb R^{n\times n}.\) and that the objective function's derivative is given by \(f'(x) = Ax  b.\)
Surfaces of constant value of \(f\) ("level sets") are ellipses in 2d, ellipsoids in 3d and so on.
We are going to discuss iterative optimization methods, i.e. we will construct a sequence \[x_0, x_1, x_2,\ldots\] such that
 Every iteration is computationally feasible and
 The more iterations we compute, the closer we are to \(x^\star\), i.e. \(\lim_{i\to\infty} x_i = x^\star.\)
Compare this to the direct solution of the optimization problem, which is possible by computing \(x^\star \) such that \(A x^\star = b\). This is one giant calculation and it does not make sense to "stop early", i.e. you are either done or not (with the obvious upside of the solution being provably exact).
The following two definitions are of vital importance for the rest of the article so try to really internalize them (see also the cheat sheet on the left hand side).
Give a current point (remember, we will optimize in iterations, this is one of those iterates) \(x_i\), we define

The
error \[e_i := x_i  x^\star\] is the vector thatpoints from the solution to the current iterate . This will be an important variable in the analysis of Gradient descent and CG but we will never be able to obtain it otherwise, we would be done immediately: For a given \(x_i\) and known \(e_i\), the solution is \(x^\star = x_i  e_i\) and we would be done. during the optimization procedure 
The
residual \[r_i := b  Ax_i = f'(x_i) = Ae_i\] should be interpretedThere is another interpretation as an error, mapped into the space in which \(b\) resides, but this is largely irrelevant for the exposition here. as thedirection of steepest descent (starting in \(x_i\)).
The following items are major takeaways from this section:
 The error \(e_i\) is a quantity used in theoretical analysis of the optimization algorithms but is inaccessible for each concrete optimization problem.
 The residual \(r_i\) is the direction of steepest descent and is easily computed by \(r_i = b  Ax_i\).
 Error and residual do in general
not point along the same axis. It is for this reason that gradient descent does not converge in exactly one step (more to this later).
Gradient descent
Minimization of a function amounts to finding "a good way down into the valley". Smart people in Gloucestershire, UK, have developed a decently efficient way of finding a quick way down a local elevation called Cooper's Hill by letting a round cheese roll down it and following it.
This is essentially how gradient descent works: Standing at some point \(x_i\), you figure out the direction of steepest descentwhich is exactly the negative gradient and you make a hop of certain length \(\alpha_i\)yet to be determined along that direction: \[ \begin{aligned} x_{i+1} &= x_i + \alpha_i \cdot (f'(x_i))\\ &=x_i + \alpha_i \cdot r_i. \end{aligned}\] Recall that \(r_i = f'(x_i).\) We will talk about iterative methods here so we have already started numbering the points along which we hop during our optimization procedure.
Now what is a good way of choosing \(\alpha_i\)? Gradient descent is a generalpurpose method very useful in many optimization settings.one of the most fashionable examples being stochastic gradient descent for training of artificial neural networks in Machine Learning Usually, one just picks some "small" \(\alpha\) and makes it bigger or smaller depending on how well it seems to work.
But, we are currently in the setting of quadratic optimizationBut: CG can be generalized to nonlinear optimization where we can do a lot better! We will choose \(\alpha_i\) such that \(f(x_{i+1})\) is minimalAs a function of \(\alpha_i.\) This means we simplify locally to a onedimensional optimization problem of minimizing along \(x_i + \mathbb R \cdot r_i\). Hence, \[ 0 \stackrel{!}{=} \frac{\mathrm d}{\mathrm d \alpha_i} f(x_{i+1}) = f'(x_{i+1})^T\cdot\frac{\mathrm d x_{i+1}}{\mathrm d \alpha_i} = f'(x_{i+1})^T\cdot f'(x_i). \] Now this is very interesting: The optimal choice of \(\alpha_i\) sets the position of \(x_{i+1}\) such that the steepest descent in \(x_{i+1}\), which is \(f'(x_{i+1})\), is orthogonal to the direction of steepest descent in \(x_i.\) Or, more succinctly, we have to go along zigzags.with right angles between a zig and a zag
That's nice but we still do not have a numerical value for \(\alpha_i\). Actually, the correct value is given by \[\alpha_i = \frac{r_i^Tr_i}{r_i^TAr_i},\] as the following calculation shows:( ).
This means that the gradient descent method for quadratic function minimization works as follows:
The following consideration explains why gradient descent is an inefficient method: Consider again figure 3, an optimization problem in 2d. This happens:
 We start at \(x_0\) and optimize along search direction \(r_0\), yielding the optimal step size \(\alpha_0\) and the next iterate \(x_1\)
 At \(x_1\), we optimize along search direction \(r_1\) which is perpendicular to \(r_0\). After doing that, we arrive at \(x_2\)
 At \(x_2\), we optimize along search direction \(r_2\), which is identical to \(r_0\). This means, we need to make another step along a direction we already "optimized along"
 In short, we lose optimality with respect to direction we have already optimized along, over and over again.
The same problem arises in higher dimensions as well. Of course, \(r_2\) will not exactly be \(r_0\) but there will be significant overlap between them.
How can we design an algorithm that does not fall into that trap? In the next section we will discuss an idea which won't work (but almost so). After massaging this idea a bit we will come up with a practical algorithm.
This section in bullet points:
 The gradient descent is a good offtheshelf method for optimization.
 For quadratic optimization problems we can derive the optimal step size.
 Even with the optimal step size, gradient descent walks along inefficent zigzags.
 The main issue is that we keep losing optimality along directions we have already optimized along.
 Conjugate gradients will be able to overcome this problem.
Orthogonal directions descent
Obviously, the best step is the one that takes us from the initial guess \(x_0\) directly to the optimum. This is possible for gradient descent if the initial error \(e_0\) is an eigenvector of \(A.\) But apart from this very lucky case, there is no way of computing this "perfect" step without solving the problem analytically.
So, what is the next best idea? Look again at figure 3: The gradient descent algorithm repeatedly steps in a direction which he has already transversed. Wouldn't it be nice if instead of an infinite series of "zigzagzigzagzig...", we pool all zigs and zags and limit ourselves to one "zig" and one "zag"? In other words, can't we just fix a set of orthogonal directions \(\{d_0,\ldots, d_{n1}\}\) (in \(\mathbb R^2\) this would be a set of two vectors, and accordingly in higher dimensions) and walk along those directions only exactly once?
In short: Instead of we want to do:For our twodimensional example, this optimization utopia would look like in figure 4: We pick any starting direction, in this case we reasonably set i.e. we start by going down as steeply as possible \(d_0 := f'(x_0)\) and define \(d_1\) to be the only direction Unique except up to scaling. Note that this is a particularity of \(\mathbb R^2\); in higher dimensions we would need to choose an arbitrary perpendicular vector perpendicular to \(d_0.\)
Then we go just the right amount along \(d_0\) and conclude by a second step along \(d_1\), which brings us directly to \(x^\star.\) This "works" analogously in higher dimensions.
Let's see how we might do that: We have chosen an orthogonal set of vectors \(\{d_0,\ldots, d_{n1}\}\) and we aim to make just \(n\) steps in order to reach \(x^\star:\)
\[ x_{i+1} = x_i + \alpha_i d_i\]
How do we need to choose the \(\alpha_i\)? We
It seems that our idea of choosing a set of \(n\) orthogonal directions and planning on doing exactly \(n\) steps in order to arrive at \(x^\star\) was a bit too optimistic. But actually, the general idea was good. The flaw in it was choosing orthogonal directions and thereby choosing a set of vectors which was not adapted to the "geometry" of the problem. Of course, we like to use orthogonal sets of vectors for all kinds of linear algebra stuff. But let's look again at level sets of \(f\) in combination with orthogonal vector pairs at various locations. See how they kind of not fit together? The level sets (which, remember, describe the geometry of our optimization problem) appear to insist on a specific scaling of space, i.e. something like "the eastsoutheast direction is less importantRecall that tighter level sets mean larger growth in that direction, hence the northnortheast direction has steeper ascent in \(f\). than the northnortheast direction". The orthogonal vector pairs on the other hand pretend that space is just regular, rotationinvariant \(\mathbb R^2.\) Those two conceptions do not really work together. Orthogonal directions ignore the geometry of the problem. The solution will be to use \(A\)orthogonal, or conjugate, vector sets.
Conjugacy
We define \(A\)orthogonality, or conjugacy as follows: Two vectors \(u, v\) are said to be Aorthogonal (or conjugate), i.e. \(u~\bot_A~ v\), if \[u^TAv = 0\]
What does that mean? This is best explained geometrically:
Let's take our familiar level set picture, print it on a sheet of (chewn) bubblegum and tug at it in such a way that the ellipse level sets become circles.
Now we choose pairs of vectors which are not orthogonal in the original figure but appear perpendicular on the stretched version. See how, depending on the vectors' orientation, the angle between them is sometimes acute and sometimes obtuse. It is a right angle exactly if the vectors point along the axes of the ellipse.
Play with the slider below to get a feel for how Aorthogonal vector pairs look like.
Major message of this section:
 We hate zigzags because they are inefficient.
 We know we cannot expect to jump from a starting point directly to the minimum \(x^*\).
 In \(n\)dimensional space, we concede that the algorithm might need at least \(n\) steps, "one for each dimension".
 We say, ok, but \(n\) steps is really all you will get.
 Even more, we do not want to just be close to the minimum after the \(n\)th step, but we demand to find it exactly, i.e., \(x_{n} = x^\star.\)
 We realized that taking the orthogonal coordinate directions won't do the trick, but those \(A\)orthogonal/conjugate directions seem to be a promising idea.
 This is what we will do: Given a set of conjugate directions \(\{d_0,\ldots, d_{n1}\},\) we will take exactly \(n\) steps \[x_{i+1} = x_i + \alpha_i\cdot d_i.\] What are the values of \(\alpha_i\) such that \(x_n = x^\star\)?
Conjugate directions descent
So assume now that we have a set of \(n\) Aorthogonal (or conjugate) directions along which we want to take our descent. Now this is a
For example, in the setting of the optimization problem and with conjugate vectors from figure 7 in \(\mathbb R^2\), any pair of vectors depicted there will do.
Then we start in some \(x_0\in \mathbb R^n\) and define \[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n1} + \alpha_{n1}\cdot d_{n1}\end{aligned}\]
Also, we are being really greedy by demanding that \(x_n = x^\star,\) i.e. we want the exact solution to be found in at most \(n\) steps. Note that the only unknowns in this sequence of equations are the step sizes \(\alpha_i,\) everything else is either one of the directions \(d_i\) or the result of the preceding calculation.
How do we set those step sizes \(\alpha_i\) such that we achieve our goal of \(x_n = x^\star\)? The requirement is that \(e_{i+1}\) is conjugate to \(d_i\). Intuitively, this means that we will
neverThere is a big caveat to this "never" that you might have spotted: Gradient descent had the issue that we lost optimality along directions which we had already optimized along. How can we
be sure that something like this does not happen here? This is an important point which will be resolved later. For now, don't worry about it. go in direction of \(d_i\) again. It makes sense, actually: If we force the algorithm to manage with \(n\) steps, each of which
point in a specific direction \(d_i,\) it can
So, this means that we need \[d_i^T Ae_{i+1} = 0,\] i.e. \[ \begin{aligned}0 &= d_i^T A e_{i+1} = d_i^TA(e_i + \alpha_i d_i)\\ &= d_i^T (r_i + \alpha_i Ad_i),\end{aligned}\] which is equivalent to \[ \alpha_i = \frac{d_i^T r_i}{d_i^TAd_i}.\] In contrast to the equivalent condition for the "orthogonal directions" idea, this is actually computable: The \(d_i\) are fixed,again, we still need to talk about how the heck we will get those vectors, but that's for later. and the residual \(r_i\) is just \(bAx_i,\) which is easily obtained.
At the same time, this conjugacy conditionNote that \(0 = d_i^T A e_{i+1}\) means that \(d_i\) and \(e_{i+1}\) are conjugate. has another interpretation: Choosing \(\alpha_i\) is equivalent to finding the minimum point along the search direction \(d_i\): \[\begin{aligned} 0 &= \frac{\mathrm d}{\mathrm d \alpha} f(x_{i+1})\\ &= f'(x_{i+1})^T\frac{\mathrm d}{\mathrm d \alpha} x_{i+1} \\ &= r_{i+1}^T d_i \\ &= d_i^TAe_{i+1}.\end{aligned}\] This is wonderful: The global minimization procedure is reduced to a sequence of minimization procedures along fixed, conjugate directions. Contrast this to our failed idea of orthogonal directional descent where we would have been forced to climb up the hill first before going along the next direction, even with the impracticality of orthogonal directions aside.
Now we claim that the sequence of computations \[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n1} + \alpha_{n1}\cdot d_{n1}\end{aligned}\] with \[ \alpha_i = \frac{d_i^T r_i}{d_i^TAd_i}\] finds us the optimum in at most \(n\) steps, i.e. \(x_n = x^\star.\)
Before we go into the details of why that is correct, let's try to find numerical validation first: We take our minimization problem from before, we start at \(x_0 = (2,2)\) and we take the following set of conjugate directions: \[d_0 = \begin{pmatrix} 12 \\ 8\end{pmatrix}, d_1 = \begin{pmatrix} 18 \\ 13\end{pmatrix}.\] For now we do not worry about how we obtained this particular set of conjugate directions. Incidentally (not really, of course), we have chosen the first vector to be pointing along the direction of steepest descent at the initial value \(x_0.\) The iterations are calculated here: ( )
Again, the conjugate directions algorithm terminates in two steps.
In three dimensions, the conjugate directions algorithm takes three steps.
So, if you accept those three positive examples as evidence for the effectiveness of the conjugate directions optimization method, you might wonder how and why it can work. The following lemma proves this.
Lemma 1: Convergence of the conjugate directions optimization method in \(n\) steps.
Given a basis of conjugate vectors \(\{d_i\}_{i=0}^{n1}\), the method of conjugate directions arrives at the optimum in at most \(n\) steps.
Proof: ( )
How do we obtain the \(\delta_j\)? We can multiply this characterization in turns by each \(d_j^TA\) from the left. Then every time, by conjugacy, all but one term on the left hand side vanishes: \[d_k^TAe_0 = \sum_{j=0}^{n1}\delta_j \cdot d_k^TAd_j = \delta_k \cdot d_k^TAd_k\] We solve for \(\delta_k\) and obtain \[\delta_k = \frac{d_k^TAe_0}{d_k^TAd_k}.\] The next step is "adding 0": We can include additional terms next to \(e_0\) which vanish in combination with the term \(d_k^TA\) in front: \[\delta_k = \frac{d_k^TA{\color{blue}\left(e_0+\sum_{i=0}^{k1}\alpha_i d_i\right)}}{d_k^TAd_k}.\] Now recall that our iterations are given by \[x_{i+1} = x_i + \alpha_i d_i\] and thus \[e_{i+1} = e_i + \alpha_i d_i\] and, iteratively, \[e_k = {\color{blue}e_0 +\sum_{i=0}^{k1}\alpha_i d_i},\] so, all in all, replacing the blue term: \[\delta_k = \frac{d_k^TAe_k}{d_k^TAd_k}.\] This is exactly the negative of our step length: \[\alpha_k = \delta_k.\] What does that mean? There are two ways of thinking about the conjugate directions method:
The initial position \(x_0\) is updated step for step in direction of the conjugate directions \(d_i\) and is built up along those components.
Alternatively we can think of \(e_0\) starting with error contributions from all directions, with each step along a conjugate direction \(d_i\) "biting away" the corresponding component in this error term until the last step takes away the remaining error:
\[\begin{aligned}e_i &= e_0 + \sum_{j=0}^{i1}\alpha_j \cdot d_j\\ &= \sum_{j=0}^{n1}\delta_j\cdot d_j + \sum_{j=0}^{i1}(\delta_j) \cdot d_j\\ &= \sum_{j=i}^{n1}\delta_j \cdot d_j\end{aligned} \]Note that this sum contains less and less terms each iteration \(i\mapsto i+1\) and loses its last term at \(n1\mapsto n,\) which is exactly the last iteration.
Let's take another closer look at the Conjugate Directions method, i.e. someone gave us a nice set of \(n\) conjugate (Aorthogonal) directions \(\{d_i\}_{i=0}^{n1}\).
Conjugate Directions finds at every step the best solutions within the bounds of where it's been allowed to explore. We make another observation:
What does that mean? We define \[\mathcal D_i := \operatorname{span}\{d_0,d_1\ldots,d_{i1}\}.\] Then from the equation \[e_i = e_0 + \sum_{j=0}^{i1}\alpha_j d_j,\] we get that \(e_i \in e_0 + \mathcal D_i.\) We claim that Conjugate Directions works such that \(e_i\) is "optimal within the bounds of where it's been allowed to explore". Where has the iteration been allowed to explore so far? That is exactly \(x_0 + \mathcal D_i.\) Equivalently we can say that the error has been "allowed to explore" \(e_0 + \mathcal D_i.\) In which sense is this optimal? The claim is the following lemma:
Lemma 2: Optimality of the error term
We define the \(A\)norm by \[\x\_A^2 := x^TAx.\] Then \[ e_i = \operatorname{argmin} \{\\epsilon\_A:~ \epsilon\in e_0 + \mathcal D_i\}.\] I.e., \(e_i\) is the element in \(e_0 + \mathcal D_i\) with minimal \(A\)norm.
Proof: ( )
Let's talk about intuition next. What does this sense of optimality mean?
If we start with some \(x_0\), then we pick \(x_1\) such that it is the minimum point of \(x_0 + \mathcal D_1.\) This is just how we defined Conjugate Directions Descent (and exactly how Gradient Descent worked as well).
The surprising part is that optimality is conserved to all following iterations as well. This does not work in Gradient Descent, as we have already observed: ( )
So, why should we indeed expect that "optimality with respect to search direction \(d_0\)" is conserved after making another step along \(d_1\) and stopping at \(x_2\)? There are again two possible explanations for that (choose which ever suits you better)
 Analytically: At point \(x_2,\) our remaining error term is \(e_2 = \sum_{j=2}^{n1} \delta_j \cdot d_j,\) i.e. there is no contribution and no contribution from direction \(d_1\) as well, but this is less surprising because \(x_2\)
was obtained by cutting away the \(d_1\) contribution along direction \(d_0\) left. This means, we cannot minimize the Anorm anymore by going alongthis is lemma 1 and its proof \(d_0.\) We are
still optimal along direction \(d_0!\)  By stretching: Look at figure 11b: \(x_1\) was obtained by picking a point on \(x_0 + \mathcal D_1\) such that \(e_1\) was Aorthogonal to \(d_0\). In this stretched figure,
this means that \(e_1\) appears perpendicular to \(d_0=r_0.\)
Arrived at \(x_2,\) we see that this has not changed: Now \(d_0\) and \(e_2\) appear perpendicular, i.e. they are in fact Aorthogonal. By going even a short distance away from \(x_2\) in direction of \(d_0\) (shown in dashed), we would leave
the smaller sphere depicting a set of constant Anorm, hereby increasing our Anorm. Hence, \(x_2\) is
still optimal in direction of \(d_0.\) Note that we are of course also optimal with respect to the more recent search direction \(d_1\) but that is to be expected. The nonlosing of optimality along past search directions is more surprising.
A good mental image for how figure 11b works is to imagine yourself standing at the solution point \(x^\star,\) pulling a string connected to a bead that is constrained to lie in \(x_0 + \mathcal D_1.\) Each time the expanding subspace \(\mathcal D\) is enlarged by a dimension, the bead becomes free to move a little closer to you.
The following property of Conjugate Directions Descent will be of paramount importance:
Lemma 3: Orthogonality of the residual
In the \(i\)th iteration of Conjugate Directions Descent, the residual is orthogonal to all previous search directions: \[r_{i+1} ~\bot~ d_0, d_1, \ldots, d_i.\]
Proof: ( )
Note that an equivalent proposition does not hold for Gradient Descent: The new descent direction \(r_{i+1}\) is indeed orthogonal to the previous descent direction \(r_i\), but not orthogonal to \(r_{i1}.\) (which can be seen by looking at the characteristic zigzag pattern).
Here's a recap of this (long) section's main ideas:
 The conjugate directions optimization method works really well.
 We arrive in \(n\) directions at the optimum.
 The iteration of "local optimization procedures" (along the current search directions \(d_i\)) does not lose its permanence as gradient descent does: Once we are optimal along some direction, we will always be (and going along an old search direction will necessarily deteriorate the optimization).
 But... How do we get a set of conjugate directions to start with?
GramSchmidt conjugation
How do we obtain a set of \(n\) conjugate (Aorthogonal) directions \(\{d_i\}_{i=0}^{n1}\)? One possibility is GramSchmidt conjugation. This works just like GramSchmidt orthogonalization, but instead of producing orthogonal vectors, it generates conjugate vectors. It works as follows:
 Take a set of \(n\) linearly independent vectors \(\{u_i\}_{i=0}^{n1}.\)
 Define \(d_0 = u_0.\)
 Construct \(d_1\) by removing all components from \(u_1\) which are not Aorthogonal to \(d_0\)
 Construct \(d_2\) by removing all components from \(u_2\) which are not Aorthogonal to \(d_0\) and \(d_1\)
 ...
 Construct \(d_{n1}\) by removing all components from \(u_{n1}\) which are not Aorthogonal to all \(d_i\), \(i=0,\ldots,{n2}.\)
In formulae, we iteratively define (for \(i = 1,\ldots, n1\)) \[d_i = u_i + \sum_{k=0}^{i1}\beta_{ik}d_k.\] What are the \(\beta_{ik}\)? We premultiply with \(d_j^TA\) for some \(j\in\{0,\ldots, i1\},\) to get \[\begin{aligned}d_j^TAd_i &= d_j^TAu_i + \sum_{k=0}^{i1}\beta_{ik}d_j^TAd_k\\ 0 &= d_j^TAu_i + \beta_{ij}d_j^TAd_j\\ \beta_{ij} &=  \frac{d_j^TAu_i}{d_j^TAd_j}\end{aligned}\]
Thus, in principle we could do the following:
GramSchmidt + Conjugate Directions Descent
 Generate a linearly independent basis of "proposal directions" \(\{u_i\}_{i=0}^{n1}.\)
 Use GramSchmidt in order to construct a basis of conjugate directions \(\{d_i\}_{i=0}^{n1}.\)
 Do Conjugate Directions Descent in order to solve the minimization problem.
This works but the problem here is that this is too costly in both memory and computational effort.we need to compare each \(u_i\) with all previously computed search directions \(\{d_j\}.\) The computational complexity of that grows quite fast. In fact, the computational expense needed is exactly the effort needed for explicitly solving the linear equation \(Ax = b\) by Gaussian elimination, which we wanted to deperately avoid in the first place.
Note that the GramSchmidt construction of the \(i\)th search direction \(d_i\) only needs the previous search directions. Thus we could in principle choose to do the following:
GramSchmidt + Conjugate Directions Descent (taking turns)
 Generate a linearly independent basis of "proposal directions" \(\{u_i\}_{i=0}^{n1}.\)
 Set \(d_0 = u_0.\)
 Make one step of Conjugate Directions Descent.
 Make one step of GramSchmidt in order to construct the next search direction \(d_1.\)
 Make one step of Conjugate Directions Descent.
 Make one step of GramSchmidt in order to construct the next search direction \(d_2.\)
 ...
 Make one step of GramSchmidt in order to construct the next search direction \(d_{n1}.\)
 Make the last step of Conjugate Directions Descent.
But this does not alleviate the problem at all, because the computational complexity is the same.
The crucial idea is the following: We do
Dream method (this will be CG)
 Set some clever initial search direction \(d_0\)
 Make one step of Conjugate Directions Descent.
 Pick a (good) next candidate for a search direction \(u_1\) and construct a search direction \(d_1\) conjugate to \(d_0\).
 Make one step of Conjugate Directions Descent.
 Pick a (good) next candidate for a search direction \(u_2\),
which is already conjugate to \(d_0\) and construct from it a search direction \(d_2\) conjugate to \(d_1\), with \(d_2~\bot_A~ d_0\) already by choice.  Make one step of Conjugate Directions Descent.
 Pick a (good) next candidate for a search direction \(u_3\),
which is already conjugate to \(d_0\) and \(d_1\) and construct from it a search direction \(d_3\) conjugate to \(d_2\), with \(d_2~\bot_A~ d_0, d_1\) already by choice.  Make one step of Conjugate Directions Descent.
 ...
 Continue until the end
See what's happening here? We do not need to check all previously computed search directions but only the most recent one! The most recent "proposal direction" \(u_{i+1}\) is already conjugate to all previous search directions with exception of the last one. Thus the conjugation procedure which yields the next search direction \(d_{i+1}\) is a lot easier!
That means, the main idea is the following:
We iteratively construct conjugate directions in such a way that we can "forget" all previous search directions but the previous one. Then we
OK, so what are these miraculous \(u_i\) "proposal directions"? In fact, they are exactly the gradients \(u_i = r_i = f'(x_i)\). We will now discuss why this works.
The Conjugate Gradient (CG) algorithm
Let's summarize again what we learned so far:
Given a set of conjugate directions \(\{d_i\}_{i=0}^{n1},\) the algorithm of Conjugate Directions Descent terminates in at most \(n\) steps at the minimum. Our remaining challenge is to choose this set of conjugate directions. We realized that generating such a set in advance (from an initial set of proposal directions \(\{u_i\}_{i=0}^{n1}\) that we can modify to be a conjugate set \(\{d_i\}_{i=0}^{n1}\)) is too costly. Then we argued that by a clever choice of "proposal directions" \(u_i\) on the fly (interspersed with Conjugate Directions Descent iterations), we might be able to cut down on computational complexity. We will see in this section that a really good choice for proposal directions are the residuals/directions of steepest descent, i.e. \[ u_i = r_i.\]
The key will be to prove that this proposal direction \(r_i\) is already conjugate to all previous search directions, with exception of the last one, \(d_{i1}.\) Then we can construct the new search direction \(d_i\) out of the "proposal direction" \(r_i\) just by making it conjugate to \(d_{i1}.\)
With this method we can generate a set of conjugate directions \(d_i\) on the fly and simultaneously do Conjugate Directions Descent. This combination is called Conjugate Gradients (because we take the gradients and make them conjugate to previous search directions).
The last remaining key statement to prove is thus the following:
Lemma 4:
Consider the setting of Conjugate Directions. If the set of "proposal directions" is the set of residuals, i.e. \(u_i = r_i,\) then \(r_{i+1}\) is conjugate to all previous search directions with exception of the last one, i.e. \[r_{i+1} ~\bot_A ~ d_0, d_1, \ldots, d_{i1},\] for \(i=1,\ldots, {n1}.\)
Also, \[r_{i+1} ~\bot~ r_0, r_1, \ldots, r_i\]
Proof: ( )
 \(\operatorname{span}\{r_0\} = \operatorname{span}\{d_0\} = \color{green}{\mathcal D_1}\)
 \(\operatorname{span}\{r_0, r_1\} = \operatorname{span}\{{\color{green}{\mathcal D_1}}, r_1\}= \operatorname{span}\{\mathcal D_1, d_1\} = {\color{blue}{\mathcal D_2}}\)
 \(\operatorname{span}\{r_0, r_1, r_2\} = \operatorname{span}\{{\color{blue}{\mathcal D_2}}, r_2\}= \operatorname{span}\{\mathcal D_2, d_2\} = \mathcal D_3\)
 ...
So that means that the residual \(r_i\) is a really good proposal direction in point \(x_i.\) We will now derive the formula for generating a search direction \(d_i\) from it which is also conjugate to \(d_{i1},\) in addition to being conjugate to all search directions prior to that, \(d_0, \ldots, d_{i2}.\) This is done by GramSchmidt conjugation and mainly consists of purely symbolic manipulation without further grand ideas (and you can safely skip it if you are not interested in the details).
Lemma 5:
Given a current position \(x_{i+1}\), its residual \(r_{i+1}\) being a proposal direction \(u_{i+1} = r_{i+1}\) which is conjugate to almost all previous search directions, i.e. \[ u_{i+1} ~ \bot_A ~ d_0,d_1,\ldots d_{i1},\] then by setting \[d_{i+1} := r_{i+1} + \beta_{i+1}\cdot d_i,\quad \beta_{i+1} := \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i},\] this new search direction is now conjugate to all previous search directions, i.e. \[ d_{i+1} ~ \bot_A ~ d_0,d_1,\ldots d_{i}.\]
Proof: ( )
 We use \(r_{i+1} = Ae_{i+1} = A(e_i + \alpha_i d_i) = r_i  \alpha_iAd_i\) to simplify computation of the residual and reduce the number of matrix multiplications.
 The step size is given by \(\alpha_i = \frac{r_i^Tr_i}{d_i^TAd_i}\) instead of our original form \(\frac{d_i^Tr_i}{d_i^TAd_i}.\) This is valid because of the identity \(d_i^Tr_i = r_i^Tr_i,\) derived in lemma 5.
Acknowledgements
This article is basically a rippedoff digest of the absolutely great paper "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by Jonathan Shewchuk.
This work is licensed under a Creative Commons AttributionShareAlike 4.0 International License.