Introduction
I wrote the following notes on the Conjugate Gradient method after reading Jonathan Shewchuk's An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
Shewchuk's paper is as wonderful as its title, with low-dimensional examples and diagrams that give a visceral feel for what the conjugate gradient method is. However, articulating an idea in your own words and in your own way is necessary if you want to understand it really deeply. These notes are my attempt to do this; hopefully you'll find them useful and perhaps even insightful. Please enjoy!
I've divided the notes into three sections: Gradient Descent, Conjugate Gradients, and Nonlinear Optimization.
Gradient Descent
We'd like to find the \(\mathbf{x}\) that solves \(\mathbf{A}\mathbf{x}=\mathbf{b}\), where \(\mathbf{A}\) is a symmetric positive-definite matrix of real numbers. We do this by iterating towards \(\mathbf{x}\) from some starting point \(\mathbf{x}_0\).
Let \(\mathbf{d}_i\) be an \(\mathbf{A}\)-orthogonal basis for \(\mathbb{R}^n\). That is, let the \(\mathbf{d}_i\)'s comprise a basis, and satisfy \(\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j=0\) when \(i\neq j\).
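One concrete way to get such a basis is Gram–Schmidt carried out in the \(\mathbf{A}\)-inner product. Here's a minimal NumPy sketch (the particular matrix and the choice of starting vectors are arbitrary, just for illustration; the conjugate gradient method will end up building such a basis for us, but it's useful to see that one exists):

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative symmetric positive-definite matrix (arbitrary choice).
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

# Gram-Schmidt in the A-inner product: start from the standard basis and
# subtract off each vector's A-projection onto the directions already built.
ds = []
for k in range(n):
    v = np.eye(n)[:, k]
    for d in ds:
        v = v - (d @ A @ v) / (d @ A @ d) * d
    ds.append(v)

# Check A-orthogonality: d_i^T A d_j should vanish for i != j.
for i in range(n):
    for j in range(n):
        if i != j:
            assert abs(ds[i] @ A @ ds[j]) < 1e-8
```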
Given \(\mathbf{x}_0\in\mathbb{R}^n\), there are unique \(\alpha_i\) such that
$$\mathbf{x}_0+\alpha_1\mathbf{d}_1+\alpha_2\mathbf{d}_2+\alpha_3\mathbf{d}_3+\cdots+\alpha_n\mathbf{d}_n=\mathbf{x}$$We use this equation to build a sequence of points \(\mathbf{x}_i\) that get closer and closer to \(\mathbf{x}\).
The error associated with each \(\mathbf{x}_i\) is given by \(\mathbf{e}_i=\mathbf{x}-\mathbf{x}_i\). Although we don't know \(\mathbf{e}_i\) (because we don't know \(\mathbf{x}\)), we do know a proxy for the error called the residual, defined by \(\mathbf{r}_i=\mathbf{A}\mathbf{e}_i\). Note that
$$ \begin{align} \mathbf{r}_i &=\mathbf{A}\mathbf{e}_i\\ &=\mathbf{A}(\mathbf{x}-\mathbf{x}_i)\\ &=\mathbf{b}-\mathbf{A}\mathbf{x}_i \end{align} $$Since each \(\mathbf{x}_i\) is built as \(\mathbf{x}_i=\mathbf{x}_0+\alpha_1\mathbf{d}_1+\cdots+\alpha_i\mathbf{d}_i\), the error \(\mathbf{e}_i\) is given by $$ \mathbf{e}_i = \alpha_{i+1}\mathbf{d}_{i+1} +\alpha_{i+2}\mathbf{d}_{i+2} +\cdots +\alpha_n\mathbf{d}_n, $$ and so $$ \mathbf{r}_i = \mathbf{A}\mathbf{e}_i = \alpha_{i+1}\mathbf{A}\mathbf{d}_{i+1} +\alpha_{i+2}\mathbf{A}\mathbf{d}_{i+2} +\cdots +\alpha_n\mathbf{A}\mathbf{d}_n. \tag{1} $$
The \(\mathbf{d}_i\)'s are \(\mathbf{A}\)-orthogonal, so taking the inner product of \((1)\) with \(\mathbf{d}_k\) for \(k\leq i\) kills every term on the right; observe \((\bigstar)\) that \(\mathbf{r}_i\) is therefore orthogonal to \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\). Taking the inner product of \((1)\) with \(\mathbf{d}_{i+1}\) instead leaves only the first term, \(\mathbf{d}^T_{i+1}\mathbf{r}_i=\alpha_{i+1}\mathbf{d}^T_{i+1}\mathbf{A}\mathbf{d}_{i+1}\), which gives us $$ \alpha_{i+1} = \frac{\mathbf{d}^T_{i+1}\mathbf{r}_i}{\mathbf{d}^T_{i+1}\mathbf{A}\mathbf{d}_{i+1}} $$
Thus, given only \(\mathbf{A}\), \(\mathbf{b}\), \(\mathbf{x}_i\), and the \(\mathbf{d}_i\)'s, we can compute $$ \mathbf{x}_{i+1} = \mathbf{x}_i+\alpha_{i+1}\mathbf{d}_{i+1}. $$
We can also compute \(\mathbf{r}_{i+1}\), either directly as \(\mathbf{b}-\mathbf{A}\mathbf{x}_{i+1}\), or as an update to \(\mathbf{r}_i\), observing from \((1)\) that \(\mathbf{r}_{i+1}=\mathbf{r}_i-\alpha_{i+1}\mathbf{A}\mathbf{d}_{i+1}\).
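As a quick illustration, here's what one such step looks like as a minimal sketch (the function name and argument layout are my own choices; the arrays are assumed to be NumPy vectors and a NumPy matrix):

```python
def step(A, x, r, d):
    """Advance one step along the next A-orthogonal direction d.

    x is the current iterate and r its residual b - A x; returns the
    next iterate and its residual."""
    Ad = A @ d                     # the only matrix-vector product
    alpha = (d @ r) / (d @ Ad)     # alpha_{i+1} = d^T r_i / (d^T A d)
    x_next = x + alpha * d         # x_{i+1} = x_i + alpha_{i+1} d_{i+1}
    r_next = r - alpha * Ad        # r_{i+1} = r_i - alpha_{i+1} A d_{i+1}
    return x_next, r_next
```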
This process is known as gradient descent, and it can be streamlined so that there is only one matrix-vector product per step.
Notice that we don't actually compute \(\mathbf{x}_i\) at each step. There's usually no point in doing so, because we can find the next coefficient \(\alpha_i\) simply by advancing the residual. Shewchuk recommends occasionally computing
$$ \mathbf{x}_i = \mathbf{x}_0+\alpha_1\mathbf{d}_1+\alpha_2\mathbf{d}_2+\cdots+\alpha_i\mathbf{d}_i $$and then setting \(\mathbf{r}_i=\mathbf{b}-\mathbf{A}\mathbf{x}_i\) in order to keep roundoff error from accumulating. Shewchuk also notes that when the \(\mathbf{d}_i\)'s are built from the standard basis, gradient descent is equivalent to Gaussian elimination.
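Putting these pieces together, here's a sketch of the streamlined process. The directions ds are taken as given, and recompute_every is an arbitrary illustrative interval for the occasional fresh residual described above:

```python
def descend(A, b, x0, ds, recompute_every=50):
    """Advance the residual along the A-orthogonal directions in ds.

    The iterate itself is only reconstructed occasionally, at which point
    the residual is recomputed as b - A x to keep roundoff from accumulating.
    """
    r = b - A @ x0
    alphas = []
    for i, d in enumerate(ds, start=1):
        Ad = A @ d                           # one matrix-vector product per step
        alphas.append((d @ r) / (d @ Ad))    # alpha_i
        r = r - alphas[-1] * Ad              # advance the residual
        if i % recompute_every == 0:
            x = x0 + sum(a * dk for a, dk in zip(alphas, ds))
            r = b - A @ x                    # fresh residual (one extra matvec)
    # Assemble the final iterate from the accumulated coefficients.
    return x0 + sum(a * dk for a, dk in zip(alphas, ds))
```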
Conjugate Gradients
The conjugate gradient method builds the \(\mathbf{d}_i\)'s out of the computed residuals \(\mathbf{r}_i\).
We start by setting \(\mathbf{d}_1=\mathbf{r}_0\), where \(\mathbf{r}_0=\mathbf{b}-\mathbf{A}\mathbf{x}_0\) is the initial residual corresponding to the starting point \(\mathbf{x}_0\).
Next, suppose we have \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\), and we have just computed \(\mathbf{r}_i\). We'd like to use \(\mathbf{r}_i\) to create a new vector \(\mathbf{d}_{i+1}\) that is \(\mathbf{A}\)-orthogonal to \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\). We propose that \(\mathbf{d}_{i+1}\) takes the following form.
$$ \mathbf{d}_{i+1} = \mathbf{r}_i -\beta_1\mathbf{d}_1 -\beta_2\mathbf{d}_2 -\cdots -\beta_i\mathbf{d}_i \tag{2} $$We assume \(\mathbf{r}_i\) is nonzero; otherwise the problem is already solved. We know from the previous section \((\bigstar)\) that \(\mathbf{r}_i\) is orthogonal to \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\), and so no matter what values we pick for the \(\beta_k\) we are never in any danger of \(\mathbf{d}_{i+1}\) being equal to zero. The question is, can we find \(\beta_k\) such that \(\mathbf{d}_{i+1}^T\mathbf{A}\mathbf{d}_k=0\) for \(k=1,2,\ldots,i\)? From the \(\mathbf{A}\)-orthogonality of the \(\mathbf{d}_i\)'s it follows that we can (multiply \((2)\) on the left by \(\mathbf{d}_k^T\mathbf{A}\) and set the result to zero), and that these values are given by
$$ \beta_k=\frac{\mathbf{r}_i^T\mathbf{A}\mathbf{d}_k}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}. $$Let's take a step back and see what we've done so far. We are performing gradient descent, but with the \(\mathbf{d}_i\)'s constructed out of the computed residuals \(\mathbf{r}_i\). Here's our process, with \(LC(\mathbf{d}_1,\mathbf{d}_2,...,\mathbf{d}_i)\) denoting a linear combination of \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\):
$$ \begin{align} \mathbf{d}_1 &= \mathbf{r}_0,\\ \mathbf{d}_2 &= \mathbf{r}_1 - LC(\mathbf{d}_1),\\ \mathbf{d}_3 &= \mathbf{r}_2 - LC(\mathbf{d}_1,\mathbf{d}_2),\\ &\;\;\vdots\\ \mathbf{d}_{i+1} &= \mathbf{r}_i - LC(\mathbf{d}_1,\mathbf{d}_2,\ldots,\mathbf{d}_i), \end{align} $$with each residual advanced along the way by \(\mathbf{r}_k=\mathbf{r}_{k-1}-\alpha_k\mathbf{A}\mathbf{d}_k\).
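In code this construction is a Gram–Schmidt-style sweep in the \(\mathbf{A}\)-inner product. Here's a minimal sketch (the helper name next_direction and the list-of-arrays interface are illustrative choices); note that it has to keep every previous direction around, a cost the next few paragraphs will eliminate:

```python
def next_direction(A, r_i, ds):
    """Build d_{i+1} from the residual r_i and the previous directions
    ds = [d_1, ..., d_i], using beta_k = r_i^T A d_k / (d_k^T A d_k)."""
    d_next = r_i.copy()
    for d_k in ds:
        Ad_k = A @ d_k
        d_next = d_next - (r_i @ Ad_k) / (d_k @ Ad_k) * d_k
    return d_next
```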
We are now set up to observe that a bunch of things equal zero. Recall observation \((\bigstar)\) from the previous section, that \(\mathbf{r}_i\) is orthogonal to \(\mathbf{d}_1\), \(\mathbf{d}_2\), ..., \(\mathbf{d}_i\). The process above also gives us two span statements: each \(\mathbf{d}_{k+1}\) differs from \(\mathbf{r}_k\) by a linear combination of earlier \(\mathbf{d}\)'s, so \(\operatorname{span}\{\mathbf{d}_1,\ldots,\mathbf{d}_i\}=\operatorname{span}\{\mathbf{r}_0,\ldots,\mathbf{r}_{i-1}\}\), and each \(\mathbf{A}\mathbf{d}_k=(\mathbf{r}_{k-1}-\mathbf{r}_k)/\alpha_k\) lies in \(\operatorname{span}\{\mathbf{r}_0,\ldots,\mathbf{r}_k\}=\operatorname{span}\{\mathbf{d}_1,\ldots,\mathbf{d}_{k+1}\}\). From \((\bigstar)\) and these span statements, it follows that \(\mathbf{r}_i\) is also orthogonal to \(\mathbf{r}_0\), \(\mathbf{r}_1\), ..., \(\mathbf{r}_{i-1}\), as well as to \(\mathbf{A}\mathbf{d}_1\), \(\mathbf{A}\mathbf{d}_2\), ..., \(\mathbf{A}\mathbf{d}_{i-1}\). This magically annihilates all the coefficients in \((2)\) except \(\beta_i\), allowing us to write
$$ \mathbf{d}_{i+1} = \mathbf{r}_i - \frac{\mathbf{r}_i^T\mathbf{A}\mathbf{d}_i}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} \mathbf{d}_i. \tag{3} $$To simplify this result, recall the gradient descent update for the residual
$$ \mathbf{r}_i=\mathbf{r}_{i-1}- \frac{\mathbf{d}^T_i\mathbf{r}_{i-1}}{\mathbf{d}^T_i\mathbf{A}\mathbf{d}_i} \mathbf{A}\mathbf{d}_i $$Taking the inner product with \(\mathbf{r}_i\), and remembering \(\mathbf{r}_i^T\mathbf{r}_{i-1}=0\), we see that
$$ -\frac{\mathbf{r}_i^T\mathbf{A}\mathbf{d}_i}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} =\frac{\mathbf{r}_i^T\mathbf{r}_i}{\mathbf{d}^T_i\mathbf{r}_{i-1}} =\frac{\mathbf{r}_i^T\mathbf{r}_i}{\mathbf{r}^T_{i-1}\mathbf{r}_{i-1}}, \tag{4} $$where the last equality, \(\mathbf{d}^T_i\mathbf{r}_{i-1}=\mathbf{r}^T_{i-1}\mathbf{r}_{i-1}\), comes from \((3)\): taking the inner product of \((3)\) with \(\mathbf{r}_i\) and noting that \(\mathbf{r}_i^T\mathbf{d}_i=0\) gives \(\mathbf{r}_i^T\mathbf{d}_{i+1}=\mathbf{r}_i^T\mathbf{r}_i\), which is the same statement with the index shifted up by one. Thus \((3)\) becomes
$$ \mathbf{d}_{i+1} = \mathbf{r}_i + \frac{\mathbf{r}_i^T\mathbf{r}_i}{\mathbf{r}^T_{i-1}\mathbf{r}_{i-1}} \mathbf{d}_i, $$allowing us to write the conjugate gradient process with all its steps as
$$ \begin{align} \mathbf{r}_0 &= \mathbf{b}-\mathbf{A}\mathbf{x}_0, \qquad \mathbf{d}_1 = \mathbf{r}_0,\\ \alpha_i &= \frac{\mathbf{d}_i^T\mathbf{r}_{i-1}}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} = \frac{\mathbf{r}_{i-1}^T\mathbf{r}_{i-1}}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i},\\ \mathbf{x}_i &= \mathbf{x}_{i-1}+\alpha_i\mathbf{d}_i,\\ \mathbf{r}_i &= \mathbf{r}_{i-1}-\alpha_i\mathbf{A}\mathbf{d}_i,\\ \mathbf{d}_{i+1} &= \mathbf{r}_i + \frac{\mathbf{r}_i^T\mathbf{r}_i}{\mathbf{r}^T_{i-1}\mathbf{r}_{i-1}}\mathbf{d}_i, \end{align} $$with the last four updates repeated for \(i=1,2,3,\ldots\) until the residual is as small as we like.
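Here's the whole process as a minimal NumPy sketch (the stopping tolerance and the iteration cap are illustrative choices; in exact arithmetic the loop would need at most \(n\) passes, since the \(\mathbf{d}_i\)'s form a basis):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A."""
    x = x0.copy()
    r = b - A @ x                 # r_0
    d = r.copy()                  # d_1 = r_0
    rr = r @ r
    for _ in range(len(b) if max_iter is None else max_iter):
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d                # the one matrix-vector product per step
        alpha = rr / (d @ Ad)     # alpha_i, using d_i^T r_{i-1} = r_{i-1}^T r_{i-1}
        x = x + alpha * d         # x_i = x_{i-1} + alpha_i d_i
        r = r - alpha * Ad        # r_i = r_{i-1} - alpha_i A d_i
        rr_new = r @ r
        d = r + (rr_new / rr) * d # d_{i+1} = r_i + (r_i^T r_i / r_{i-1}^T r_{i-1}) d_i
        rr = rr_new
    return x
```

On a small symmetric positive-definite test matrix, the result should agree with np.linalg.solve(A, b) to within the chosen tolerance.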
Nonlinear Optimization
In the previous two sections, we developed gradient descent and conjugate gradients in the context of finding the \(\mathbf{x}\in\mathbb{R}^n\) that solves \(\mathbf{A}\mathbf{x}=\mathbf{b}\). An equivalent problem is to find the \(\mathbf{x}\) that minimizes the scalar function
$$ F(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{x}^T\mathbf{b}. $$The key insight in this new context is that the residual \(\mathbf{r}_i\) associated with the point \(\mathbf{x}_i\) is simply \(-\nabla F\) evaluated at \(\mathbf{x}_i\): because \(\mathbf{A}\) is symmetric, \(\nabla F(\mathbf{x})=\mathbf{A}\mathbf{x}-\mathbf{b}\), and so \(-\nabla F(\mathbf{x}_i)=\mathbf{b}-\mathbf{A}\mathbf{x}_i=\mathbf{r}_i\).
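As a quick numerical sanity check of that identity, here's a throwaway sketch comparing the residual at a random point with a finite-difference estimate of \(-\nabla F\) (the matrix, the point, and the step size h are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive-definite
b = rng.standard_normal(n)
x_i = rng.standard_normal(n)

F = lambda x: 0.5 * x @ A @ x - x @ b

# The residual at x_i ...
r_i = b - A @ x_i

# ... versus -grad F at x_i, estimated by central differences.
h = 1e-6
minus_grad = np.array([-(F(x_i + h * e) - F(x_i - h * e)) / (2 * h)
                       for e in np.eye(n)])

print(np.max(np.abs(r_i - minus_grad)))   # should be ~0, up to roundoff
```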
Recall from our notes on gradient descent that the purpose of \(\alpha_i\) is to give us the next point \(\mathbf{x}_i = \mathbf{x}_{i-1}+\alpha_i\mathbf{d}_i\). In fact, \(\alpha_i\) minimizes the function \(F(\mathbf{x}_{i-1}+\alpha\mathbf{d}_i)\) over \(\alpha\): setting \(\frac{d}{d\alpha}F(\mathbf{x}_{i-1}+\alpha\mathbf{d}_i)=\mathbf{d}_i^T\big(\mathbf{A}(\mathbf{x}_{i-1}+\alpha\mathbf{d}_i)-\mathbf{b}\big)\) to zero recovers exactly \(\alpha_i=\mathbf{d}_i^T\mathbf{r}_{i-1}/\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i\). This allows us to rewrite the conjugate gradient method as follows:
$$ \begin{align} \alpha_i &= \underset{\alpha}{\arg\min}\, F(\mathbf{x}_{i-1}+\alpha\mathbf{d}_i),\\ \mathbf{x}_i &= \mathbf{x}_{i-1}+\alpha_i\mathbf{d}_i,\\ \mathbf{r}_i &= -\nabla F(\mathbf{x}_i),\\ \mathbf{d}_{i+1} &= \mathbf{r}_i + \frac{\mathbf{r}_i^T\mathbf{r}_i}{\mathbf{r}_{i-1}^T\mathbf{r}_{i-1}}\mathbf{d}_i, \end{align} $$starting from \(\mathbf{d}_1=\mathbf{r}_0=-\nabla F(\mathbf{x}_0)\). Nothing in this version refers to \(\mathbf{A}\) or \(\mathbf{b}\) directly.
The new version of the conjugate gradient method can be applied to scalar functions in general. Here's an example where \(F\) is a non-quadratic function of \(\mathbf{x}\).
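Below is a minimal sketch of that idea in code. The specific test function (the Rosenbrock function), the use of SciPy's minimize_scalar for the line search, and the iteration count are all illustrative assumptions rather than part of the method:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F(x):
    """The Rosenbrock function, a standard non-quadratic test problem."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_F(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

x = np.array([-1.2, 1.0])          # arbitrary starting point
r = -grad_F(x)                     # r_0 = -grad F(x_0)
d = r.copy()                       # d_1 = r_0

for i in range(200):
    # Line search: alpha_i minimizes F(x + alpha d) over alpha.
    alpha = minimize_scalar(lambda a: F(x + a * d)).x
    x = x + alpha * d
    r_new = -grad_F(x)
    d = r_new + (r_new @ r_new) / (r @ r) * d   # d_{i+1} = r_i + beta_i d_i
    r = r_new

print(x, F(x))                     # Rosenbrock's minimum is at (1, 1)
```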

Conjugate gradients gives a kind of gradient-informed inertia to the evolution of the search directions \(\mathbf{d}_i\). This keeps the iterations from getting trapped in the wasteful oscillations that occur with ordinary gradient descent. Conjugate gradients gives an exact solution (in at most \(n\) steps, in exact arithmetic) when \(F\) has the form \(\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{x}^T\mathbf{b}\), and its performance degrades as \(F\) departs from this form.