Quid Nunc?
Seven weeks after the conclusion of the handwritten digit recognition class project, my attempts at hacking together an artificial neural network had produced sub-par recognition rates and left a competitive itch that needed scratching. In the days since turning over my work, I was constantly bothered by the nagging feeling that I hadn't done a stellar job.
Specifically, these issues were left unresolved:
- training ten ANN binary classifiers, one for each digit
- ANN architecture challenges, which led to
- sub-optimal vector/matrix implementation of activation and back propagation computation
- poor choice of cost/error function
- poor recognition rates for all but one or two digits (an average of < 40%, and that's a generous estimate)
On top of it all, the proper blog post about my efforts was never completed, and given the litany of problems, I seriously doubted that the effort to complete it could be justified.
"In the face of overwhelming odds, I'm left with only one option, I'm gonna have to science the shit out of this"
The quote above is from Mark Watney, who curiously resembles Jason Bourne/Private James Francis Ryan, in last year's documentary, The Martian. It's also the line I couldn't get out of my head.
The first step was to admit that I didn't know a ton of stuff about ANNs. So two weeks after the cynicism wore off, I enrolled in an online machine learning course taught by Professor Andrew Ng (chief scientist at Baidu Research) on Coursera (https://www.coursera.org/learn/machine-learning/). Fun fact: it's THE course that led to the founding of Coursera!
ANN Multi-classification and Prediction
What a difference 5 weeks of course work makes! I was finally able to abandon my original approach of using multiple binary classifiers to classify training sets into k categories.
Figure 1. Artificial neural network architecture with a single hidden layer for multi-classification
In order to classify each pattern into one of k classes, two additional steps were needed:
- each training pattern must have a k-element 'true' output, with a 1 at the element corresponding to its class and 0 everywhere else
- after training, the predicted class for each pattern is the output unit with the highest value, e.g. if the maximum of all yk occurs at k = 8, the input pattern belongs to the 8th category (see the short R sketch below)
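To make these two steps concrete, here is a minimal R sketch of my own (an illustration, not the original project code), assuming the training labels sit in a vector labels containing the digits 0-9 and that y_k is the matrix of output-layer activations produced by forward propagation:

# step 1: build the k-element 'true' outputs, one row per training pattern,
# with a 1 in the column of its class and 0 everywhere else
k = 10
m = length(labels)
y_matrix = matrix(0, nrow = m, ncol = k)
y_matrix[cbind(1:m, labels + 1)] = 1    # digit d maps to column d + 1
# step 2: after training, predict by taking the output unit with the highest value
predicted = max.col(y_k) - 1            # convert back to digit labels 0-9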
Cost functions based on logistic regression
Previously, I used an error function based on the mean squared error. Although this is a good starting point, it only measures the closeness of yk to tk, and since it is averaged over all m training patterns, one may easily be fooled into thinking that a low value of J indicates good performance. The new error function below, borrowed from logistic regression, mitigates this effect (written out here in the post's notation):

J = -(1/m) Σi Σk [ tk log(yk) + (1 - tk) log(1 - yk) ]

Immediately, one notices the two terms tk log(yk) and (1 - tk) log(1 - yk).
Admittedly, it looks complicated, and considering that partial derivatives of this error function with respect to the weights will be needed, is it a justifiable choice?
Suppose that for an input pattern, tk = 1. The second term is zero, and the first term will be non-zero as long as yk is not equal to tk, i.e. yk < 1. As yk approaches 1, J becomes smaller. Now suppose tk = 0. The first term is zero, while the second term is non-zero as long as yk > 0. Using the sigmoid activation function, especially at the output layer, ensures that 0 < yk < 1.
The new error function J therefore provides a more complete measurement of the error for any value of yk between 0 and 1. This is especially useful for multi-classification tasks, since we need to accumulate the computed error at every output unit while taking into account whether tk is 0 or 1 for that class.
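As a quick numerical illustration of this point (my own example, not from the course material), the contribution of a single output unit to J when tk = 1 is simply -log(yk) and can be checked directly in R:

# per-unit contribution to J when tk = 1
-log(0.9)   # ~0.105: a nearly correct output is penalized lightly
-log(0.1)   # ~2.303: a confidently wrong output is penalized heavily

Unlike the squared error, which is bounded by 1, this penalty grows without bound as yk approaches 0 for a class that should be 1.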
For completeness, the partial derivative of this cost function with respect to the weights (Θ) is
This partial derivative expression also applies to ANNs with several hidden layers. Computing it requires that the inputs (xj), the activation h(x), and the output units y can be computed. The xj may be units from intermediate hidden layers, i.e. they serve as inputs to the succeeding layers. Similarly, y may be the expected output of the next hidden layer instead of the final output layer.
Vectorized implementations of ANN computations
As noted previously, instead of going through each input pattern one at a time, representing all input patterns and all layers as matrices allows us to perform matrix and vector computations. Activating each layer involves computing the sum of the connection weights multiplied by the input units, which in linear algebra is exactly what a matrix multiplication does.
Languages with optimized matrix implementations (e.g. Matlab, Octave, R) are a tremendous help in minimizing errors. Although this places a large amount of trust in the libraries underlying the language, they allow us to implement the required computations in 1-2 lines. Sums likewise require only a single function call in R instead of loop-based implementations that are usually prone to errors.
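To illustrate the point about sums (a generic example of mine, assuming err is a numeric vector of errors):

# loop-based sum of squared errors: more code, more chances for mistakes
sse = 0
for (i in 1:length(err)) {
  sse = sse + err[i]^2
}
# vectorized equivalent: a single line
sse = sum(err^2)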
Thus, in my R implementation (see Reference section)
# add bias column to input layer
x = cbind(array(1, c(nrow(training_set), 1)), training_set)
# compute hidden layer activation
z_2 = x %*% t(w_ji)
z_j = h_func(z_2)
# add bias column
a_2 = cbind(array(1, c(nrow(z_j), 1)), z_j)
# compute output layer
y_k = h_func(a_2 %*% t(w_kj))
The bias units are inserted as columns of 1's prior to activation. Essentially all forward and back propagation computations have remained the same, albeit with vastly improved vectorized implementations (a sketch of the back propagation step is shown below). Our newfound understanding also gives us the flexibility to change the architecture, i.e. modify the number of units in the hidden and output layers. We are also able to handle varying numbers of features in the input layer.
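For completeness, here is a minimal sketch of the corresponding vectorized back propagation step, written to match the variable names above (x, a_2, y_k, w_ji, w_kj) plus an assumed m (number of training patterns), y_matrix (the k-element 'true' outputs), and learning rate alpha. It illustrates the idea rather than reproducing my actual implementation:

# error at the output layer: predictions minus 'true' outputs
d_3 = y_k - y_matrix
# propagate the error back to the hidden layer, dropping the bias column,
# and scale by the sigmoid derivative a*(1 - a)
d_2 = (d_3 %*% w_kj)[, -1] * (a_2[, -1] * (1 - a_2[, -1]))
# accumulate the gradients for each weight matrix
dWkj = t(d_3) %*% a_2 / m
dWji = t(d_2) %*% x / m
# gradient descent update of the weights
w_kj = w_kj - alpha * dWkj
w_ji = w_ji - alpha * dWji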
Recognition performance
Figure 2. Test set (30000 digits)
Now that our arsenal is complete, we go back to the previous results. We utilized a training set composed of 1000 samples, with 100 image samples per digit. We used 28 hidden layer units and 10 output layer units corresponding to the digits 0-9. We trained the network and, after 1000 iterations, achieved a final error of 0.0005671102.
We then used the weights to classify a set of 30000 digits (Figure 2). Each band (30 x 100) corresponds to a single digit. For perfect classification, there should be no misplaced colored pixels in any digit band.
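The recognition rates quoted below can be computed directly from the predicted classes (taken, as in the earlier sketch, from the output unit with the highest value) and the true test labels; test_labels is a hypothetical name for the latter:

# predicted digit for each of the 30000 test patterns
predicted = max.col(y_k) - 1
# count misclassifications and compute the recognition rate
misclassified = sum(predicted != test_labels)
recognition_rate = mean(predicted == test_labels)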
Figure 3. Results using a 1000-digit training set
Figure 3 shows the result. There were ~4725 misplaced pixels, which corresponds to a recognition rate of about 85%.
Figure 4. Incorrectly classified as 4 instead of 9
Figure 4 shows one sample digit incorrectly classified as 4 instead of 9. At about 85% recognition performance, this already betters my previous result of < 40%.
Optimization: The Final Frontier
As we increased the number of training samples, training the ANN noticeably slowed down. At one point, it took nearly 2 hours to run through 10000 iterations. Updating the weights by gradient descent back propagation is notoriously slow, even with vectorized implementations, as the sample size and number of features increase. In particular, its minimization is very slow and there is no guarantee that it won't get stuck in some local minimum. What now?
In the machine learning class, instead of optimizing the weights via gradient descent, we utilized an advanced optimization algorithm, which I have converted to R (see fmincg.R in the reference section) from the original Matlab implementation. We need only define the cost function as well as the gradients for each weight set.
I still haven't grasped the many intricate details of the algorithm, but it's fast and its search for a minimum is smarter than gradient descent's. For now, I'm quite content playing with it, but hopefully we'll get around to analyzing it in the future.
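The calling pattern is the same idea as base R's general-purpose optimizer, so here is a rough sketch using stats::optim with conjugate gradients as a stand-in for fmincg.R (this is not the course routine itself; cost_fn and grad_fn are hypothetical helpers that compute the cost and gradient vector from an unrolled weight vector):

# unroll the two weight matrices into a single parameter vector
theta0 = c(as.vector(w_ji), as.vector(w_kj))
# minimize the cost using conjugate gradients, supplying the gradient function
result = optim(par = theta0, fn = cost_fn, gr = grad_fn,
               method = "CG", control = list(maxit = 1000))
theta = result$par
# roll the optimized vector back into the two weight matrices
w_ji = matrix(theta[1:length(w_ji)], nrow = nrow(w_ji))
w_kj = matrix(theta[-(1:length(w_ji))], nrow = nrow(w_kj))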
... And We're Back
Figure 5. 10000-sample trained ANN utilizing an advanced optimization algorithm
Equipped with the advanced optimization algorithm (fmincg, thanks Mr. Carl Edward Rasmussen!), we set about training a network with 10000 samples and tested it on the 30000-image set. Figure 5 shows the results. There were ~2110 misplaced pixels, which corresponds to a 92.97% recognition rate! There is a noticeable improvement across all digits.
Regularization
One other concept when dealing with ANNs is overfitting.
Because of the number of features or free parameters in the network, it is very likely that exposure to the training set will cause the ANN to 'memorize' the set. Consequently, it may fail to generalize when exposed to new samples. One way to avoid this is to use regularization.
In regularization, the cost function is modified by adding terms corresponding to the weights, scaled by some factor λ. Gradient descent (or any advanced optimization method) then minimizes this modified cost function. As a result, the network weights are kept smaller, the decision boundaries become smoother, and the chances of an overfit are reduced.
The modified cost function is shown below: the term λ/(2m) times the sum of the squared weights is added to the original J. It should be noted that the bias terms, which are the first elements or columns of the weight matrices Θ, are not regularized.

This is easily implemented. The unregularized cost function and gradients are computed as before with minimal modifications, and for λ = 0 the calculation reverts to the usual one. Shown below is a sample implementation of the regularization process.
# cost function
cost = sum(-y_matrix * log(y_k) - (1 - y_matrix)*log(1 - y_k))/m
# regularization on lambda != 0
if (lambda != 0) {
  rWji = w_ji
  rWkj = w_kj
  # do not regularize bias column
  rWji[, 1] = array(0, nrow(w_ji))
  rWkj[, 1] = array(0, nrow(w_kj))
  cost = cost + lambda*(sum(rWji*rWji) + sum(rWkj*rWkj))/(2*m)
  dWji = dWji + lambda*rWji/m
  dWkj = dWkj + lambda*rWkj/m
}
Figure 6. Results with regularization (λ = 1)
Figure 6 shows the results with regularization at λ = 1. This corresponds to ~1503 misplaced pixels, or about 94.99% recognition.
Summary
This has been a very interesting project and the improved results have been very encouraging. There is renewed widespread interest in ANNs, especially with deep learning, artificial intelligence, and the big data buzz almost always in the news nowadays. The Coursera machine learning class helped a lot in furthering my understanding of ANNs, and I'm looking forward to the next set of lectures.
There are other avenues worth pursuing:
- randomly arranging the order of the training set
- determining whether adding a bias term is necessary
- utilizing multiple hidden layers (deep learning)
- convolutional and recurrent neural networks
- feature reduction (via PCA or ICA)
- alternatives to gradient descent (particle swarm optimization, perhaps?)
References
- Wikipedia contributors, The Martian (Film), Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/The_Martian_(film)