Quid Nunc?
Seven weeks after the conclusion of the handwritten digit recognition class project, my attempts at hacking together an artificial neural network had produced sub-par recognition rates and left a competitive itch that needed scratching. In the days since turning over my work, I was constantly bothered by the nagging feeling that I hadn't done a stellar job.
Specifically, these issues were left unresolved:
- training ten ANN binary classifiers, one for each digit
- ANN architecture challenges, which led to
- sub-optimal vector/matrix implementation of activation and back propagation computation
- poor choice of cost/error function
- poor recognition rates for all but one or two digits (an average of < 40%, and that's a generous estimate)
On top of it all, the proper blog post about my efforts was never completed, and given the litany of problems, I seriously doubted that the effort to complete it could be justified.
"In the face of overwhelming odds, I'm left with only one option, I'm gonna have to science the shit out of this"
The quote above is from Mark Watney, who curiously resembles Jason Bourne/Private James Francis Ryan, in last year's documentary, The Martian. It's also the line I couldn't get out of my head.
The first step was to admit that I didn't know a ton of stuff about ANNs. So two weeks after the cynicism wore off, I enrolled in an online machine learning course taught by Professor Andrew Ng (chief scientist at Baidu Research) on Coursera (https://www.coursera.org/learn/machine-learning/). Fun fact: it's THE course that led to the founding of Coursera!
ANN Multi-classification and Prediction
What a difference 5 weeks of course work makes! I was finally able to abandon my original approach of using multiple binary classifiers to classify training sets into k categories.
Figure 1. Artificial neural network architecture with a single hidden layer for multi-classification
In order to classify each pattern into one of k classes, two additional steps were needed:
- each training pattern must have a k-element 'true' output, with a 1 at the element corresponding to its class and 0 everywhere else
- after training, the predicted class for each pattern is the output unit with the highest value, e.g. if the maximum of all yk occurs at k = 8, the input pattern belongs to the 8th category (see the short R sketch below)
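To make these two steps concrete, here is a minimal R sketch of my own (an illustration, not the original project code), assuming the training labels sit in a vector labels containing the digits 0-9 and that y_k is the matrix of output-layer activations produced by forward propagation:

# step 1: build the k-element 'true' outputs, one row per training pattern,
# with a 1 in the column of its class and 0 everywhere else
k = 10
m = length(labels)
y_matrix = matrix(0, nrow = m, ncol = k)
y_matrix[cbind(1:m, labels + 1)] = 1    # digit d maps to column d + 1
# step 2: after training, predict by taking the output unit with the highest value
predicted = max.col(y_k) - 1            # convert back to digit labels 0-9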
Cost functions based on logistic regression
Previously, I used an error function based on the mean squared error. Although this is a good starting point, it only measures the closeness of yk to tk, and since it is averaged over all m training patterns, one may easily be fooled into thinking that a low value of J indicates good performance. The new error function below, borrowed from logistic regression, mitigates this effect (written out here in the post's notation):

J = -(1/m) Σi Σk [ tk log(yk) + (1 - tk) log(1 - yk) ]

Immediately, one notices the two terms tk log(yk) and (1 - tk) log(1 - yk).
Admittedly, it looks complicated, and considering that partial derivatives of this error function with respect to the weights will be needed, is it a justifiable choice?
Suppose that for an input pattern, tk = 1. The second term is zero, and the first term will be non-zero as long as yk is not equal to tk, i.e. yk < 1. As yk approaches 1, J becomes smaller. Now suppose tk = 0. The first term is zero, while the second term is non-zero as long as yk > 0. Using the sigmoid activation function, especially at the output layer, ensures that 0 < yk < 1.
The new error function J therefore provides a more complete measurement of the error for any value of yk between 0 and 1. This is especially useful for multi-classification tasks, since we need to accumulate the computed error at every output unit while taking into account whether tk is 0 or 1 for that class.
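As a quick numerical illustration of this point (my own example, not from the course material), the contribution of a single output unit to J when tk = 1 is simply -log(yk) and can be checked directly in R:

# per-unit contribution to J when tk = 1
-log(0.9)   # ~0.105: a nearly correct output is penalized lightly
-log(0.1)   # ~2.303: a confidently wrong output is penalized heavily

Unlike the squared error, which is bounded by 1, this penalty grows without bound as yk approaches 0 for a class that should be 1.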
For completeness, the partial derivative of this cost function with respect to the weights (Θ) is
This partial derivative expression also applies to ANNs with several hidden layers. Computing it requires that the inputs (xj), the activation h(x), and the output units y can be computed. The xj may be units from intermediate hidden layers, i.e. they serve as inputs to the succeeding layers. Similarly, y may be the expected output of the next hidden layer instead of the final output layer.
Vectorized implementations of ANN computations
As noted previously, instead of going through each input pattern one at a time, representing all input patterns and all layers as matrices allows us to perform matrix and vector computations. Activating each layer involves computing the sum of the connection weights multiplied by the input units, which in linear algebra is exactly what a matrix multiplication does.
Languages with optimized matrix implementations (e.g. Matlab, Octave, R) are a tremendous help in minimizing errors. Although this places a large amount of trust in the libraries underlying the language, they allow us to implement the required computations in 1-2 lines. Sums likewise require only a single function call in R instead of loop-based implementations that are usually prone to errors.
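To illustrate the point about sums (a generic example of mine, assuming err is a numeric vector of errors):

# loop-based sum of squared errors: more code, more chances for mistakes
sse = 0
for (i in 1:length(err)) {
  sse = sse + err[i]^2
}
# vectorized equivalent: a single line
sse = sum(err^2)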
Thus, in my R implementation (see Reference section)
# add bias column to input layer
x = cbind(array(1, c(nrow(training_set), 1)), training_set)
# compute hidden layer activation
z_2 = x %*% t(w_ji)
z_j = h_func(z_2)
# add bias column
a_2 = cbind(array(1, c(nrow(z_j), 1)), z_j)
# compute output layer
y_k = h_func(a_2 %*% t(w_kj))
The bias units are inserted as columns of 1's prior to activation. Essentially all forward and back propagation computations have remained the same, albeit with vastly improved vectorized implementations (a sketch of the back propagation step is shown below). Our newfound understanding also gives us the flexibility to change the architecture, i.e. modify the number of units in the hidden and output layers. We are also able to handle varying numbers of features in the input layer.
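For completeness, here is a minimal sketch of the corresponding vectorized back propagation step, written to match the variable names above (x, a_2, y_k, w_ji, w_kj) plus an assumed m (number of training patterns), y_matrix (the k-element 'true' outputs), and learning rate alpha. It illustrates the idea rather than reproducing my actual implementation:

# error at the output layer: predictions minus 'true' outputs
d_3 = y_k - y_matrix
# propagate the error back to the hidden layer, dropping the bias column,
# and scale by the sigmoid derivative a*(1 - a)
d_2 = (d_3 %*% w_kj)[, -1] * (a_2[, -1] * (1 - a_2[, -1]))
# accumulate the gradients for each weight matrix
dWkj = t(d_3) %*% a_2 / m
dWji = t(d_2) %*% x / m
# gradient descent update of the weights
w_kj = w_kj - alpha * dWkj
w_ji = w_ji - alpha * dWji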
Recognition performance
Figure 2. Test set (30000 digits)
Now that our arsenal is complete, we go back to the previous results. We utilized a training set composed of 1000 samples, with 100 image samples per digit. We used 28 hidden layer units and 10 output layer units corresponding to the digits 0-9. We trained the network and, after 1000 iterations, achieved a final error of 0.0005671102.
We then used the weights to classify a set of 30000 digits (Figure 2). Each band (30 x 100) corresponds to a single digit. For perfect classification, there should be no misplaced colored pixels in any digit band.
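The recognition rates quoted below can be computed directly from the predicted classes (taken, as in the earlier sketch, from the output unit with the highest value) and the true test labels; test_labels is a hypothetical name for the latter:

# predicted digit for each of the 30000 test patterns
predicted = max.col(y_k) - 1
# count misclassifications and compute the recognition rate
misclassified = sum(predicted != test_labels)
recognition_rate = mean(predicted == test_labels)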
Figure 3. Results using a 1000-digit training set
Figure 3 shows the result. There were ~4725 misplaced pixels, which corresponds to a recognition rate of about 85%.
Figure 4. Incorrectly classified as 4 instead of 9
Figure 4 shows one sample digit incorrectly classified as 4 instead of 9. At about 85% recognition performance, this already betters my previous result of < 40%.
Optimization: The Final Frontier
As we increased the number of training samples, training the ANN noticeably slowed down. At one point, it took nearly 2 hours to run through 10000 iterations. Updating the weights by gradient descent back propagation is notoriously slow, even with vectorized implementations, as the sample size and number of features increase. In particular, its minimization is very slow and there is no guarantee that it won't get stuck in some local minimum. What now?
In the machine learning class, instead of optimizing the weights via gradient descent, we utilized an advanced optimization algorithm, which I have converted to R (see fmincg.R in the reference section) from the original Matlab implementation. We need only define the cost function as well as the gradients for each weight set.
I still haven't grasped the many intricate details of the algorithm, but it's fast and its search for a minimum is smarter than gradient descent's. For now, I'm quite content playing with it, but hopefully we'll get around to analyzing it in the future.
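The calling pattern is the same idea as base R's general-purpose optimizer, so here is a rough sketch using stats::optim with conjugate gradients as a stand-in for fmincg.R (this is not the course routine itself; cost_fn and grad_fn are hypothetical helpers that compute the cost and gradient vector from an unrolled weight vector):

# unroll the two weight matrices into a single parameter vector
theta0 = c(as.vector(w_ji), as.vector(w_kj))
# minimize the cost using conjugate gradients, supplying the gradient function
result = optim(par = theta0, fn = cost_fn, gr = grad_fn,
               method = "CG", control = list(maxit = 1000))
theta = result$par
# roll the optimized vector back into the two weight matrices
w_ji = matrix(theta[1:length(w_ji)], nrow = nrow(w_ji))
w_kj = matrix(theta[-(1:length(w_ji))], nrow = nrow(w_kj))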
... And We're Back
Figure 5. 10000-sample trained ANN utilizing an advanced optimization algorithm
Equipped with the advanced optimization algorithm (fmincg, thanks Mr. Carl Edward Rasmussen!), we set about training a network with 10000 samples and tested it on the 30000-image set. Figure 5 shows the results. There were ~2110 misplaced pixels, which corresponds to a 92.97% recognition rate! There is a noticeable improvement across all digits.
Regularization
One other concept when dealing with ANNs is overfitting.
Because of the number of features or free parameters in the network, it is very likely that exposure to the training set will cause the ANN to 'memorize' the set. Consequently, it may fail to generalize when exposed to new samples. One way to avoid this is to use regularization.
In regularization, the cost function is modified by adding terms corresponding to the weights, scaled by some factor λ. Gradient descent (or any advanced optimization method) then minimizes this modified cost function. As a result, the network weights are kept smaller, the decision boundaries become smoother, and the chances of an overfit are reduced.
The modified cost function is shown below: the term λ/(2m) times the sum of the squared weights is added to the original J. It should be noted that the bias terms, which are the first elements or columns of the weight matrices Θ, are not regularized.

This is easily implemented. The unregularized cost function and gradients are computed as before with minimal modifications, and for λ = 0 the calculation reverts to the usual one. Shown below is a sample implementation of the regularization process.
# cost function
cost = sum(-y_matrix * log(y_k) - (1 - y_matrix)*log(1 - y_k))/m
# regularization on lambda != 0
if (lambda != 0) {
  rWji = w_ji
  rWkj = w_kj
  # do not regularize bias column
  rWji[, 1] = array(0, nrow(w_ji))
  rWkj[, 1] = array(0, nrow(w_kj))
  cost = cost + lambda*(sum(rWji*rWji) + sum(rWkj*rWkj))/(2*m)
  dWji = dWji + lambda*rWji/m
  dWkj = dWkj + lambda*rWkj/m
}
Figure 6. Results with regularization (λ = 1)
Figure 6 shows the results with regularization at λ = 1. This corresponds to ~1503 misplaced pixels, or about 94.99% recognition.
Summary
This has been a very interesting project and the improved results have been very encouraging. There is renewed widespread interest in ANNs, especially with deep learning, artificial intelligence, and the big data buzz almost always in the news nowadays. The Coursera machine learning class helped a lot in furthering my understanding of ANNs, and I'm looking forward to the next set of lectures.
There are other avenues worth pursuing:
- randomly arranging the order of the training set
- determining whether adding a bias term is necessary
- utilizing multiple hidden layers (deep learning)
- convolutional and recurrent neural networks
- feature reduction (via PCA or ICA)
- alternatives to gradient descent (particle swarm optimization, perhaps?)
References
- Wikipedia contributors, The Martian (Film), Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/The_Martian_(film)