Wednesday, March 30, 2016

Useful R tips #4

Unpacking lists into objects

Suppose we have a function which returns a list containing several objects:

X = list('alpha' = alpha, 'b' = b, 'w' = w, 'evals' = evals, 'stp' = stp, 'flag' = flag)

return(X)

We can unpack this list into the individual objects using

for (i in 1:length(X)) { assign(names(X)[i], X[[i]]) }

ls() will show the objects alpha, b, w, evals, stp, and flag now defined in the current environment/scope.

Now, suppose we want to mimic Matlab/GNU Octave's ability to return multiple objects, without having to rewrite the loop above each time. We can implement an unpack() function that unpacks the list inside the caller's environment/scope:

unpack <- function(X) {
  # assign each named element of X to an object in the parent/calling frame
  for (i in 1:length(X)) {
    assign(names(X)[i], X[[i]], envir = parent.frame())
  }
}

To use:

result = examineExampleP(i, glob, alpha, w, b, X, Y, krnel, kpar1, kpar2, C, tol, steps, stp, evals, eps, K)

unpack(result)

or simply

unpack( examineExampleP(i, glob, alpha, w, b, X, Y, krnel, kpar1, kpar2, C, tol, steps, stp, evals, eps, K) )

Both calls unpack the list returned by the function into objects in the caller's environment/scope. When called from the console, the list is unpacked into the global environment.
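
For a self-contained illustration, here is a minimal sketch with a hypothetical stats() helper standing in for examineExampleP():

# a hypothetical helper returning several named results
stats <- function(x) {
  list(mu = mean(x), sigma = sd(x), n = length(x))
}

unpack(stats(rnorm(100)))
ls()   # mu, n, sigma (plus whatever was already defined) are now in scope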

Sunday, March 20, 2016

Useful R tips #3

An added bonus of implementing machine learning algorithms in R is that you often get to use optimized, fast vectorized libraries and functions. Here are some of the functions I've found to be very useful, especially when dealing with data sets, whether they are training, cross-validation, or test sets.

Random permutations of a set

Suppose you have a matrix X whose dimensions are m-rows (examples) and n-columns (features)

X[sample(1:nrow(X)), ]

This generates a random permutation of X, i.e. the rows are reshuffled. To get only the first K examples of the shuffled matrix, we need only use matrix indexing, as in the statement below

X[sample(1:nrow(X)), ][1:K, ]

Or for clarity, break it up into two lines

Xrand = X[sample(1:nrow(X)), ]
Xperm = Xrand[1:K, ]
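
One optional addition: if the shuffle needs to be reproducible, for example when comparing training runs, seed the random number generator before calling sample(). A minimal sketch (the seed value is arbitrary):

set.seed(42)                      # fix the RNG state for reproducibility
Xrand = X[sample(1:nrow(X)), ]    # the same permutation on every run
Xperm = Xrand[1:K, ]              # first K shuffled examples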

Duplicate (replicate) matrix/column vector m or n times

Suppose we have a single row vector and we wish to replicate it m times so we can perform elementwise operations between X[m, n] and centroid[1, n]

# centroid is a 1 x n row vector
# X is an m x n matrix
repmat(centroid, m, 1)

m here is the number of times (rows) the row vector is replicated. Notice that the third parameter of repmat() is 1, which means we don't create additional copies of centroid's columns. If centroid were a column vector instead, we would only need to specify the number of columns to generate, as in the statement below

# centroid is an m x 1 column vector
# X is an m x n matrix
repmat(centroid, 1, n)
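
A note of caution: repmat() is not part of base R; it is provided by add-on packages such as pracma, or it can be written by hand. A base-R sketch of the row-vector case (assuming centroid is a plain numeric vector of length n):

# replicate centroid (length n) into an m x n matrix, one copy per row
centroid_mat = matrix(centroid, nrow = m, ncol = length(centroid), byrow = TRUE)

# elementwise operations with X (m x n) then work directly, e.g. the
# squared distance of every example from the centroid:
d2 = rowSums((X - centroid_mat)^2)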

Apply a function to each row/column of matrix

Suppose we need to apply a function to each row or column of a matrix without writing a loop. This comes up often because functions such as sum(), mean(), max(), and min() return only a single value for the whole matrix. One solution is to use R's apply() function.

To get the mean of each column (or feature) in X, we may do so using:

X_means = apply(X, 2, mean)

Instead of implementing it in a loop like

X_means = array(0, c(1, ncol(X)))

for (n in 1:ncol(X)) {
  X_means[n] = mean(X[, n])
}

If X is an m x n matrix, the above call to apply() yields a vector of n column means. The second argument of apply(), 2, indicates that the function is applied to each column. To apply it to each row, we use 1 instead.

X_means = apply(X, 1, mean)

This returns a vector of m values, one mean per row. Some other useful functions to use with apply() are sd (standard deviation), max/min (maximum/minimum), and which.max/which.min (index of the maximum/minimum).
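
As a quick sketch of these variants on a small toy matrix A:

# toy matrix: 4 examples (rows) x 3 features (columns)
A = matrix(c(1, 5, 2,
             7, 0, 3,
             4, 4, 4,
             2, 9, 1), nrow = 4, byrow = TRUE)

apply(A, 2, sd)         # standard deviation of each column (3 values)
apply(A, 1, max)        # maximum of each row (4 values)
apply(A, 1, which.max)  # column index of each row's maximum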

Thursday, March 3, 2016

Revisiting Handwritten Digit Recognition Using ANN

Quid Nunc?


Seven weeks after the conclusion of the handwritten digit recognition class project, my attempts at hacking together an artificial neural network, which produced sub-par recognition rates, have left a competitive itch that needed redressing. In the days since turning over my work, I was constantly bothered by the nagging feeling that I hadn't done a stellar job.

Specifically, these issues were left unresolved:

  • training ten ANN binary classifiers, one for each digit
  • ANN architecture challenges, which led to
  • sub-optimal vector/matrix implementation of activation and back propagation computation
  • poor choice of cost/error function
  • poor recognition rates for all but one or two digits (an average of < 40%, which is a generous estimate)

On top of it all, the proper blog post about my efforts hasn't been completed yet. I seriously doubt that any effort to complete it can be easily justified, given the litany of problems.

"In the face of overwhelming odds, I'm left with only one option, I'm gonna have to science the shit out of this"


The quote above is from Mark Watney, who curiously resembles Jason Bourne/Private James Francis Ryan, in last year's documentary, The Martian. It's also the line that I couldn't get out of my head.

The first step was to admit that I don't know a ton of stuff about ANNs. So two weeks after the cynicism wore off, I enrolled in an online machine learning course taught by Professor Andrew Ng (chief scientist at Baidu Research) on Coursera (https://www.coursera.org/learn/machine-learning/). Fun fact: it's THE course that led to the founding of Coursera!

ANN Multi-classification and Prediction

What a difference 5 weeks of course work makes! I was finally able to abandon my original approach of using multiple binary classifiers to classify training sets into k categories.

Figure 1 Artificial neural network architecture with a single hidden layer for multi-classification

Each unit in the output layer represents one of the k classes, i.e. y1 corresponds to the first class (or category), y2 to the second class, and so on. Ideally, a perfect classification of a pattern into one of the k classes means that the output unit for that class is 1 and all others are 0, e.g. if a pattern is classified as belonging to the fifth category, y5 = 1 and yk = 0 for all k not equal to 5.

In order to classify each pattern into one of k classes, two additional steps were needed
  • each training pattern must have a k-element 'true' output
  • after training, the predicted class for each pattern is the output unit with the highest value, e.g. if the maximum over all yk occurs at k = 8, the input pattern belongs to the 8th category.
For this specific application, digits 0-9 correspond to classes 1-10. The yk values can be thought of as the probabilities of the input pattern belonging to the kth class. A short R sketch of both steps is shown below.
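
In the sketch, labels is a hypothetical vector of class indices (1-10, one per training pattern); y_matrix and y_k follow the naming used in the code later in this post:

# build the k-element 'true' outputs as an m x 10 one-hot matrix
m = length(labels)
y_matrix = matrix(0, nrow = m, ncol = 10)
y_matrix[cbind(1:m, labels)] = 1

# after training: predicted class = output unit with the largest activation
predicted = apply(y_k, 1, which.max)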


Cost functions based on logistic regression

Previously, I've used the error function based on the mean squared error

$$J = \frac{1}{2m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left(y_k^{(i)} - t_k^{(i)}\right)^2$$

Although this is a good starting point, it only measures the closeness of yk to tk. Since it is averaged over all m training patterns, one may easily be fooled into thinking that a low value of J indicates good performance. The new error function below may be more powerful and may mitigate this effect

$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[\, t_k^{(i)} \log\left(y_k^{(i)}\right) + \left(1 - t_k^{(i)}\right)\log\left(1 - y_k^{(i)}\right) \right]$$

Immediately, one notices the terms

$$-\,t_k^{(i)} \log\left(y_k^{(i)}\right) \qquad \text{and} \qquad -\left(1 - t_k^{(i)}\right)\log\left(1 - y_k^{(i)}\right)$$

Admittedly, it looks complicated, and considering that partial derivatives of this error function with respect to the weights will be needed, is it a justifiable choice?

Suppose that for an input pattern, tk = 1. The second term is zero, while the first term is non-zero as long as yk is not equal to tk, i.e. yk < 1. As yk approaches 1, J becomes smaller. Now suppose tk = 0. The first term is zero, while the second term is non-zero as long as yk > 0. Using the sigmoid activation function, especially at the output layer, ensures that 0 < yk < 1.

Any value of yk between 0 and 1 thus contributes a more complete measurement of the error to the new error function J. This is especially useful for multi-classification tasks, since we need to accumulate the computed error at every output unit while taking into account whether the target is 0 or 1 for that class.

For completeness, the partial derivative of this cost function with respect to the weights (Θ) is

$$\frac{\partial J}{\partial \Theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h\!\left(x^{(i)}\right) - y^{(i)}\right)x_j^{(i)}$$

This partial derivative expression also applies to ANNs with several hidden layers. Computing it requires the inputs (xj), the activation h(x), and the output units. The xj may be units from intermediate hidden layers, i.e. they serve as inputs to succeeding layers. Similarly, y may be the expected output of the next hidden layer instead of the final output layer.

Vectorized implementations of ANN computations

As noted previously, instead of going through each input pattern one at a time, representing all input patterns and all layers as matrices allows us to perform matrix and vector computations. Activating each layer involves computing the sum of the connection weights multiplied by the input units; in linear algebra, this combined operation is exactly matrix multiplication.

In languages that support optimized implementations (e.g. Matlab, Octave, R), these operations are a tremendous help in minimizing errors. Although this places a large amount of trust in the libraries underlying the language, they allow us to implement the required computations in one or two lines. Sums likewise require only a single function call in R instead of a loop-based implementation that is usually prone to errors.

Thus, in my R implementation (see Reference section)

# add bias column to input layer
x = cbind(array(1, c(nrow(training_set), 1)), training_set)
# compute hidden layer activation
z_2 = x %*% t(w_ji)
z_j = h_func(z_2)
# add bias column
a_2 = cbind(array(1, c(nrow(z_j), 1)), z_j)
# compute output layer
y_k = h_func(a_2 %*% t(w_kj))

The bias units are inserted as columns of 1's prior to activation. Essentially, all forward and back propagation computations have remained the same, albeit with vastly improved vectorized implementations. Our newfound understanding also gives us the flexibility to change the architecture, i.e. modify the number of units in the hidden and output layers. We are also able to handle varying numbers of features in the input layer.
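
For reference, h_func above is presumably the sigmoid activation discussed earlier; a minimal definition would be:

# sigmoid activation, applied elementwise to a matrix of pre-activations
h_func <- function(z) {
  1 / (1 + exp(-z))
}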

Recognition performance

Figure 2 Test set (30000 digits)
Now that our arsenal is complete, we go back to the previous results. We used a training set composed of 1000 samples, with 100 image samples per digit. We used 28 hidden layer units and 10 output layer units corresponding to digits 0-9. We trained the network and, after 1000 iterations, achieved a final error of 0.0005671102.

We then used the weights to classify a set of 30000 digits (Figure 2). Each band (30 x 100) corresponds to a single digit. For a perfect classification, there should be no misplaced colored pixels in any digit band.

Figure 3 Results using a 1000 digit training set
Figure 3 shows the result. There were ~4725 misplaced pixels, which corresponds to about an 85% recognition rate.
Figure 4 Incorrectly classified as 4 instead of 9
Figure 4 shows one sample digit incorrectly classified as 4 instead of 9. At about 85% recognition performance, this already betters my previous result of < 40%.

Optimization: The Final Frontier

As we increase the number of training samples, ANN training noticeably slows down. At one point, it took nearly 2 hours to run through 10000 iterations. Updating the weights via gradient descent back propagation is notoriously slow, even with vectorized implementations, as the sample size and number of features increase. In particular, its minimization is very slow and there is no guarantee that it won't get stuck in some local minimum. What now?

In the machine learning class, instead of optimizing the weights via gradient descent, we utilized an advanced optimization algorithm, which I have converted to R from the original Matlab implementation (see fmincg.R in the reference section). We needed only to define the cost function as well as the gradients for each weight set.

I still haven't grasped the many intricate details of its algorithm, but it's fast and its search for minima is smarter than gradient descent's. For now, I'm quite content playing with it, but hopefully we'll get around to analyzing it in the future.
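
As an aside for readers who prefer to stay in base R, a similar pattern can be sketched with optim(). The names cost_fn, grad_fn, and theta_init below are hypothetical placeholders for an unrolled cost function, its gradient, and the initial weight vector; this is not the fmincg.R interface, just an illustration of the idea:

# minimize the cost with a quasi-Newton method instead of plain gradient descent
fit = optim(par     = theta_init,   # unrolled initial weights
            fn      = cost_fn,      # returns the scalar cost J(theta)
            gr      = grad_fn,      # returns the unrolled gradient dJ/dtheta
            method  = "BFGS",
            control = list(maxit = 400))

theta_opt = fit$par                 # optimized weights, still unrolled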

... And We're Back

Figure 5 10000-sample trained ANN utilizing an advanced optimization algorithm
Equipped with the advanced optimization algorithm (fmincg, thanks Mr. Carl Edward Rasmussen!), we set about training a network with 10000 samples and tested it on the 30000-image set. Figure 5 shows the results. There were ~2110 misplaced pixels, which corresponds to a recognition rate of 92.96667%! There is a noticeable improvement across all digits.

Regularization

One other concept when dealing with ANNs is overfitting.

Because of the number of features or free parameters in the network, it is very likely that exposure to the training set will cause the ANN to 'memorize' the set. Consequently, it may fail to generalize when exposed to new samples. One way to avoid this is to use regularization.

In regularization, the cost function is modified by adding terms corresponding to the weights, scaled by some factor λ. Gradient descent (or any advanced optimization method) then minimizes this modified cost function. As a result, the network weights are usually kept smaller, and the decision boundaries become smoother, minimizing the chances of an overfit.

The modified cost function is shown below. It should be noted that the bias terms are not regularized. These bias terms are the first elements or columns of the weight matrices Θ.

$$J_{\mathrm{reg}} = J + \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\left(\Theta_{ij}^{(l)}\right)^2 \quad \text{(excluding the bias columns)}$$

This is easily implemented. The unregularized cost function and gradients are computed as before, with minimal modifications. For λ = 0, this reverts to the usual calculation. Shown below is a sample implementation of the regularization step.

# cost function
cost = sum(-y_matrix * log(y_k) - (1 - y_matrix) * log(1 - y_k)) / m

# regularization when lambda != 0
if (lambda != 0) {

  rWji = w_ji
  rWkj = w_kj

  # do not regularize the bias column
  rWji[, 1] = array(0, nrow(w_ji))
  rWkj[, 1] = array(0, nrow(w_kj))

  cost = cost + lambda * (sum(rWji * rWji) + sum(rWkj * rWkj)) / (2 * m)

  dWji = dWji + lambda * rWji / m
  dWkj = dWkj + lambda * rWkj / m
}


Figure 6 Results with regularization (λ = 1)

Figure 6 shows the results of regularization at λ = 1. This corresponds to ~ 1503 misplaced pixels or about 94.99% recognition.

Summary

This has been a very interesting project and the improved results have been very encouraging. There is renewed, widespread interest in ANNs, especially with deep learning, artificial intelligence, and the big data buzz almost always in the news nowadays. The Coursera machine learning class helped a lot in furthering my understanding of ANNs. I'm looking forward to the next set of lectures.

There are other avenues worth pursuing:
  • randomly arranging the order of the training set
  • determining whether adding a bias term is necessary
  • utilizing multiple hidden layers (deep learning)
  • convolutional and recurrent neural networks
  • feature reduction (via PCA or ICA)
  • alternatives to gradient descent (particle swarm optimization, perhaps?)
As usual, all the code used here is posted at my GitHub repository. Apologies for the shameless plug :)

References