Neural Networks

Patch Kessler
2024.05.20

This is my HELLO WORLD for neural networks, the computing paradigm that's apparently on the cusp of ushering in Artificial General Intelligence for all of us. For a while now I've wanted to understand exactly what neural networks are and how they work. The best way to understand something is to build it yourself (in my view at least), and to this end I endeavor in these notes to create a basic neural network that can recognize handwritten digits.

My starting point consists of the fantastic videos from 3blue1brown as well as some of the material Grant Sanderson references, especially the book Neural Networks and Deep Learning by Michael Nielsen.

The big picture is clear to me: values propagate through the network from one layer to the next, and weights and biases are iteratively adjusted so as to minimize a cost function. My goal here is to work through the details, especially backpropagation, and then to code everything in Matlab as elegantly as possible.

Model Overview

The following image shows a representative four-layer neural network. We store the nodal values from each layer in vectors \(\mathbf{u}^i\) (which may have different lengths), and the weights and biases in matrices \(\mathbf{A}^i\) and vectors \(\mathbf{b}^i\).

Our input is the vector \(\mathbf{u}^1\), which in our case contains the pixel values from an image, and our output is the vector \(\mathbf{u}^4\), which tells us something useful about the image. The actual matrix/vector computations are as follows.

$$ \begin{align} \mathbf{u}^2 &=\sigma(\mathbf{A}^1\mathbf{u}^1+\mathbf{b}^1)\\ \mathbf{u}^3 &=\sigma(\mathbf{A}^2\mathbf{u}^2+\mathbf{b}^2)\\ \mathbf{u}^4 &=\sigma(\mathbf{A}^3\mathbf{u}^3+\mathbf{b}^3) \tag{1} \end{align} $$

The sigmoid function \(\sigma(x)\) operates element by element, and is given by

$$ \sigma(x) = \frac{1}{1+e^{-x}}. $$

The sigmoid function has the pleasing property that \(\sigma' = \sigma(1-\sigma)\), which we will use repeatedly when differentiating.
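
To see why, differentiate directly:

$$ \sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\bigl(1-\sigma(x)\bigr). $$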

We'll be adjusting the weights \(\mathbf{A}^i\) and biases \(\mathbf{b}^i\) so as to minimize the following cost function

$$ G = \sum_{k=1}^n||\mathbf{u}^4(\mathbf{x}_k)-\mathbf{y}_k||^2, $$

where \((\mathbf{x}_1,\mathbf{y}_1)\), \((\mathbf{x}_2,\mathbf{y}_2)\), ...,\((\mathbf{x}_n,\mathbf{y}_n)\) are training data sets. In each of these sets, \(\mathbf{x}_k\) is the network input, and \(\mathbf{y}_k\) is the desired network output. In the expression for \(G\), we compare the desired network output \(\mathbf{y}_k\) to the actual network output \(\mathbf{u}^4(\mathbf{x}_k)\).
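
To make the notation concrete, here is a minimal Matlab sketch of evaluating \(G\), assuming the weights and biases are stored in cell arrays A and b (as they will be in the code below) and the training pairs sit in the columns of X and Y:

% Evaluate the cost G over the full training set.
% X(:,k) holds x_k, Y(:,k) holds y_k.
G = 0;
for k = 1:size(X,2)
    u = X(:,k);                          % u^1: the input vector
    for i = 1:numel(A)
        u = 1./(1+exp(-(A{i}*u+b{i})));  % u^{i+1} = sigma(A^i u^i + b^i)
    end
    G = G + sum((u-Y(:,k)).^2);          % add ||u^4(x_k) - y_k||^2
end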

We minimize \(G\) by computing its derivatives with respect to the parameters in \(\mathbf{A}^i\) and \(\mathbf{b}^i\). We compute these derivatives exactly and inexpensively using the chain rule (aka backpropagation). These derivatives comprise the gradient of the cost function \(G\), and guide our updates to \(\mathbf{A}^i\) and \(\mathbf{b}^i\) as we work to reduce the value of \(G\).
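
Each step of gradient descent then nudges every weight and bias a small amount in the downhill direction of \(G\); with a learning rate \(\eta\) (a small positive number we choose), the updates are

$$ \mathbf{A}^i \leftarrow \mathbf{A}^i - \eta\,\nabla\mathbf{A}^i, \qquad \mathbf{b}^i \leftarrow \mathbf{b}^i - \eta\,\nabla\mathbf{b}^i, $$

where \(\nabla\mathbf{A}^i\) and \(\nabla\mathbf{b}^i\) denote the arrays of partial derivatives of \(G\) with respect to the entries of \(\mathbf{A}^i\) and \(\mathbf{b}^i\).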

Although our example has only four layers (\(\mathbf{u}^1\) to \(\mathbf{u}^4\)), the patterns we find easily extend to a network with any number of layers.

Differentiation

We express (1) in terms of its components, which greatly simplifies the mechanics of differentiation.

$$ \begin{align}u^2_{i_2} &=\sigma(A^1_{i_2,i_1}u^1_{i_1}+b^1_{i_2})\\ u^3_{i_3} &=\sigma(A^2_{i_3,i_2}u^2_{i_2}+b^2_{i_3})\\ u^4_{i_4} &=\sigma(A^3_{i_4,i_3}u^3_{i_3}+b^3_{i_4})\tag{2}\end{align} $$

We adopt a modification of Einstein's summation convention in which we sum over the range of an index whenever the index occurs more than once in a product. For instance, if \(i_1\) ranges from 1 to 3, then

$$ A_{i_2,i_1}u_{i_1}v_{i_1} = A_{i_2,1}u_1v_1 + A_{i_2,2}u_2v_2 + A_{i_2,3}u_3v_3. $$

Yes, we know that allowing more than two repeated indices can cause ambiguity (e.g., in differential geometry and relativity); however, in our work here it causes no trouble. Using this modified convention,

$$ G = \sum_{k=1}^n(u_{i_4}^4-y_{i_4})(u_{i_4}^4-y_{i_4}). $$

Differentiating \(G\) with respect to \(A^3_{\alpha\beta}\), we find that

$$ \frac{\partial G}{\partial A^3_{\alpha\beta}} = \sum_{k=1}^n2(u_{i_4}^4-y_{i_4})u_{i_4}^4(1-u_{i_4}^4)\frac{\partial A^3_{i_4,i_3}}{\partial A^3_{\alpha\beta}}u^3_{i_3}. $$

We use \(\mathbf{w}^4\) to denote \(\mathbf{u}^4.*(1-\mathbf{u}^4)\), where \(.*\) is element-by-element multiplication in the Matlab sense. Also, the derivative in the summand equals 1 when \(\alpha=i_4\) and \(\beta=i_3\), and 0 otherwise, and so

$$ \frac{\partial G}{\partial A^3_{\alpha\beta}} = \sum_{k=1}^n2(u_{\alpha}^4-y_{\alpha})w_{\alpha}^4u^3_{\beta}. $$

This is just one component (row \(\alpha\), column \(\beta\)) of \(\nabla\mathbf{A}^3\), our shorthand for the matrix of partial derivatives of \(G\) with respect to the entries of \(\mathbf{A}^3\). We can write \(\nabla\mathbf{A}^3\) in its entirety using the mechanics of matrix multiplication.

$$ \nabla\mathbf{A}^3 = \sum_{k=1}^n(2(\mathbf{u}^4-\mathbf{y}).*\mathbf{w}^4)*(\mathbf{u}^3)^T $$

In this equation, matrix multiplication \(*\) between the column vector \(2(\mathbf{u}^4-\mathbf{y}).*\mathbf{w}^4\) on the left and the row vector \((\mathbf{u}^3)^T\) on the right creates a 2D array (an outer product). Continuing in this way, and staring at scribbles for an evening or two while being laid up with Covid, we find that

$$ \begin{align}\nabla\mathbf{A}^3 &= \sum_{k=1}^n(2(\mathbf{u}^4-\mathbf{y}).*\mathbf{w}^4)*(\mathbf{u}^3)^T\\ \nabla\mathbf{A}^2 &= \sum_{k=1}^n(((2(\mathbf{u}^4-\mathbf{y}).*\mathbf{w}^4)^T\mathbf{A}^3)^T.*\mathbf{w}^3)*(\mathbf{u}^2)^T\\ \nabla\mathbf{A}^1 &= \sum_{k=1}^n(((((2(\mathbf{u}^4-\mathbf{y}).*\mathbf{w}^4)^T\mathbf{A}^3)^T.*\mathbf{w}^3)^T\mathbf{A}^2)^T.*\mathbf{w}^2)*(\mathbf{u}^1)^T\end{align} $$

There's definitely a pattern here. Each gradient is the outer product of a column-vector factor with the transpose of that layer's input, and each factor is obtained from the previous one by multiplying on the left by the transpose of the weight matrix just processed and then scaling element-by-element by the next \(\mathbf{w}\). Here's what computing this looks like in Matlab.

% We work with an m+1 layer network.
% A is a 1xm cell array with A{i} containing the weight matrix A^i
% b is a 1xm cell array with b{i} containing the bias vector b^i
%
% Training data
% X(:,i) is the ith input training vector
% Y(:,i) is the ith output training vector
%
% Objective
% Our goal here is to build cell arrays gradA and gradb,
% with the same structure as A and b but containing
% gradient data.

% initialize gradA and gradb
gradA = cell(size(A));
gradb = cell(size(b));
for i = 1:numel(A)
    gradA{i} = zeros(size(A{i}));
    gradb{i} = zeros(size(b{i}));
end

for i = 1:size(X,2) % for each set of training data

    % forward pass: ui{k} holds u^k, and wi{k} holds u^k.*(1-u^k)
    ui = {X(:,i)};
    wi = {[]};  % wi{1} is never used
    for k = 1:numel(A)
        ui{k+1} = 1./(1+exp(-(A{k}*ui{k}+b{k})));  % sigmoid of A^k*u^k + b^k
        wi{k+1} = ui{k+1}.*(1-ui{k+1});            % sigma' at the same point
    end

    % initialize g with the output-layer factor 2*(u-y).*w
    g = 2*(ui{end}-Y(:,i)).*wi{end};

    % backward pass: update gradient arrays layer by layer
    for k = numel(A):-1:1
        gradA{k} = gradA{k} + g*ui{k}';  % outer product with the layer input
        gradb{k} = gradb{k} + g;
        if k==1
            break;
        end
        g = (g'*A{k})'.*wi{k};           % propagate the factor back one layer
    end
    
end
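
Backpropagation is easy to get subtly wrong, so it's worth spot-checking a few entries of gradA against brute-force finite differences before trusting them. A sketch of such a check; costG is a hypothetical helper that wraps the cost-function loop shown earlier, and the indices and step size are arbitrary:

% Spot-check dG/dA{L}(r,c) against a central finite difference.
L = 1; r = 1; c = 1;            % which weight to perturb (arbitrary)
h = 1e-6;                       % finite-difference step size

Ap = A; Ap{L}(r,c) = Ap{L}(r,c) + h;
Am = A; Am{L}(r,c) = Am{L}(r,c) - h;

% costG (hypothetical) evaluates G for given weights, biases, and training data
gFD = (costG(Ap,b,X,Y) - costG(Am,b,X,Y)) / (2*h);

fprintf('backprop: %g   finite difference: %g\n', gradA{L}(r,c), gFD);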

Data

We work with the MNIST Database of Handwritten Digits, which for some reason we were unable to download directly from Yann LeCun's website (doing this gave me a 403 error). Not to worry: after some Googling we found the same data on the following site:
https://www.kaggle.com/datasets/hojjatk/mnist-dataset

We are working in Matlab, and so rather than write our own script to read in the data, we found a working one on the MathWorks File Exchange:
https://www.mathworks.com/matlabcentral/fileexchange/27675-read-digits-and-labels-from-mnist-database

This script reads in the files and packages the data as Matlab arrays that are easy to handle. Here, for instance, are three different images of the number 3. (The blue lines are straight; your impression otherwise is an optical illusion.)

We have 60,000 of these images that we use for training, and 10,000 more that we use to test the network performance.
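
For the curious, the raw MNIST files use a simple big-endian IDX format that can also be read directly with fread, without the file exchange script. A minimal sketch is below; the file names are the standard uncompressed ones and may differ slightly depending on how the archives were unpacked, and since the network below uses 400 inputs, the 28x28 images would still need to be cropped or downsampled (presumably to 20x20), which isn't shown here.

% Read the raw MNIST training images and labels (big-endian IDX format).
fid = fopen('train-images-idx3-ubyte', 'r', 'b');        % 'b' = big-endian
fread(fid, 1, 'int32');                                  % magic number (2051)
nImg = fread(fid, 1, 'int32');                           % number of images
nRow = fread(fid, 1, 'int32');                           % rows per image (28)
nCol = fread(fid, 1, 'int32');                           % columns per image (28)
Xtrain = fread(fid, [nRow*nCol, nImg], 'uint8')/255;     % one image per column, scaled to [0,1]
fclose(fid);

fid = fopen('train-labels-idx1-ubyte', 'r', 'b');
fread(fid, 2, 'int32');                                  % magic number (2049) and label count
labels = fread(fid, nImg, 'uint8');                      % digit labels 0..9
fclose(fid);

% One-hot encode the labels: column k of Ytrain has a single 1 in row labels(k)+1.
Ytrain = zeros(10, nImg);
Ytrain(sub2ind(size(Ytrain), labels'+1, 1:nImg)) = 1;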

Training Time

There's nothing left to prepare: it's time to train our network. We do this with the following basic Matlab code:
detectHandWriting_01.m
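
That script isn't reproduced here, but its core is just repeated gradient descent using the backpropagation code above. Here is a minimal sketch of one way to structure it; the learning rate eta, the initialization scale, and the names Xtrain and Ytrain (the full training arrays, as in the reading sketch above) are assumptions, not values taken from detectHandWriting_01.m.

layerSizes = [400 30 25 10];     % nodes per layer (the architecture settled on below)
eta = 0.05;                      % learning rate (assumed value)
batchSize = 1500;                % training examples per gradient step
nSteps = 8600;                   % roughly the 300 + 8300 steps described below

% random initial weights, zero initial biases (initialization scale is a guess)
m = numel(layerSizes)-1;
A = cell(1,m); b = cell(1,m);
for i = 1:m
    A{i} = 0.1*randn(layerSizes(i+1), layerSizes(i));
    b{i} = zeros(layerSizes(i+1), 1);
end

for step = 1:nSteps
    idx = randperm(size(Xtrain,2), batchSize);   % pick a random mini-batch
    X = Xtrain(:,idx);
    Y = Ytrain(:,idx);

    % ... run the backpropagation code above to get gradA and gradb ...

    for i = 1:m                                  % gradient-descent update
        A{i} = A{i} - eta*gradA{i};
        b{i} = b{i} - eta*gradb{i};
    end
end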

After playing around with a couple of network architectures, we settle on a four-layer network with layers having 400, 30, 25, and 10 nodes. Here are some traces showing the evolution of network performance as training progresses.

Different traces show the fractions of different groups of test data that are correctly classified. Note that the digit 1 is classified more accurately and sooner than the digit 8, which kind of makes sense (8 and 5 look very similar). The background gray traces are other individual digits.

The full picture of how each digit succeeds or fails to be classified is captured in the wonderfully named matrix of confusion. To build this matrix, you simply tabulate, for each test image, the true digit against the digit the network predicts.
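
Tabulating it takes only a few lines. A sketch, assuming the test images sit in the columns of Xtest and their true digits in testLabels (placeholder names):

% Build the 10x10 confusion matrix:
% row = true digit + 1, column = predicted digit + 1.
C = zeros(10);
for k = 1:size(Xtest,2)
    u = Xtest(:,k);                      % forward pass through the trained network
    for i = 1:numel(A)
        u = 1./(1+exp(-(A{i}*u+b{i})));
    end
    [~, pred] = max(u);                  % output node with the largest activation
    C(testLabels(k)+1, pred) = C(testLabels(k)+1, pred) + 1;
end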

In the image on the right, we try to understand this matrix visually by representing the magnitude of each entry with a small square. The values in the confusion matrix are dynamic, migrating towards the diagonal as training progresses. Here's a movie showing this evolution for our first 300 steps of gradient descent.

During the 8300 steps of gradient descent that come next (i.e., most of our computing effort), the confusion matrix elements don't move around as dramatically; however, this is where the network performance is really dialed in. Training a network like this is an art that we've only gotten a small taste of in this work. Here are some additional notes:

  • We train on batches of 1500 training examples at a time. If we train for too long on any one batch, the score on that batch keeps getting better, but the score on the test data starts to fall off. This is over-fitting.
  • With 13065 parameters, our performance is 94.5%. With 6441 parameters, our performance drops to 92%. The connections between network architecture, network size, training regime, and ultimate performance are a deep area of active research that we've only scratched the surface of.
  • A review of different learning strategies applied to the MNIST data set is presented rather nicely at the following site: https://paperswithcode.com/sota/image-classification-on-mnist

My Own Handwriting?

I was curious whether this network could recognize my own handwriting, so I wrote down ten digits on a piece of paper and took a photo.

I cropped, downsampled, and converted the images to grayscale so that they resembled the images from the MNIST dataset.
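
A sketch of that preprocessing for one cropped digit, using Image Processing Toolbox functions; the 20x20 target size is my assumption about how 400 inputs were obtained, and the inversion reflects MNIST's light-strokes-on-dark convention:

% Turn a cropped photo of a single digit into a network input vector.
img = imread('digit_3.jpg');          % cropped photo of one digit (hypothetical file name)
if size(img,3) == 3
    img = rgb2gray(img);              % convert to grayscale if the photo is color
end
img = imresize(img, [20 20]);         % downsample to match the 400-node input layer
img = 1 - double(img)/255;            % scale to [0,1] and invert: MNIST strokes are light on dark
x = img(:);                           % column vector u^1, ready for the forward pass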

Surprisingly, although the network performs well on the MNIST testing data, it stumbles with my handwriting, which seems pretty legible to me. In fact I tried several times, using a different pen to get thicker lines, and so on. With the above sample, the network does well with every digit except 3 and 5: it misclassifies my 3 as a 5, and my 5 as an 8. One remedy would be to make the network more robust by adding more nodes and training it further, so that its performance climbs above 94.5%. It is only a toy system after all... Another tactic would be for me to modify my own handwriting until it gets classified correctly, moving the learning from the toy system into my own neural network.

Parsing Line Art

A problem I put considerable time into about ten years ago is recognizing letters from line art in PDF mechanical design drawings (see more detail at https://www.balloondigest.com/). At the time, I laboriously constructed filters based on relative location for every single letter, uppercase and lowercase, and although my code was functional, it wasn't that robust.

When I started hearing about neural networks, I thought they might be a much better way to do this. The line art for letters on a mechanical drawing PDF usually consists of chains of connected points in the plane, and these chains can easily be ingested by a neural network. For instance, make the points into columns of a 2xN array, and then add a third row whose values are the same for points that belong to the same chain.
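
As a toy illustration of that encoding (the coordinates and chains here are made up):

% Two short chains of points encoded as a 3xN array:
% rows 1-2 hold x and y coordinates, row 3 holds a chain ID
% shared by points that are connected.
chain1 = [0 1 2; 0 1 0];              % three points forming one polyline
chain2 = [3 3; 0 2];                  % two points forming another
P = [chain1, chain2];                 % 2x5 array of coordinates
P(3,:) = [1 1 1 2 2];                 % chain IDs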

So many projects, so little time...

Resources

A book on Deep Learning that I recently purchased and have just started reading, and which seems really good so far, is
Understanding Deep Learning, by Simon J. D. Prince.

I've just scratched the surface of a vast and ever-expanding field. It's like discovering YouTube and realizing that you can't watch all the videos. However, you can focus on some of them, and my own list of things to look at next includes convolutional neural networks and variational autoencoders.