Wednesday, March 30, 2016

Useful R tips #4

Unpacking lists into objects

Suppose we have a function which returns a list containing several objects:

X = list('alpha' = alpha, 'b' = b, 'w' = w, 'evals' = evals, 'stp' = stp, 'flag' = flag)

return(X)

We can unpack this list into the individual objects using

for (i in 1:length(X)) { assign(names(X)[i], X[[i]]) }

ls() will show the objects alpha, b, w, evals, stp, and flag now defined in the current environment/scope.

Now, suppose we want to mimic Matlab/GNU Octave's ability to return multiple objects without having to rewrite the lines of code above every time. We can implement an unpack() function that unpacks the list inside the caller's environment/scope:

unpack <- function(X) {
  # assign to objects in the parent/calling frame
  for (i in 1:length(X)) {
    assign(names(X)[i], X[[i]], envir = parent.frame())
  }
}

To use:

result = examineExampleP(i, glob, alpha, w, b, X, Y, krnel, kpar1, kpar2, C, tol, steps, stp, evals, eps, K)

unpack(result)

or simply

unpack( examineExampleP(i, glob, alpha, w, b, X, Y, krnel, kpar1, kpar2, C, tol, steps, stp, evals, eps, K) )

Both will unpack the list returned by the function into objects in the caller's environment/scope. When called from the console, the list is unpacked into the global environment.
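As a minimal, self-contained illustration (toy_stats() below is a made-up example, not a function from the original code):

# a hypothetical function returning several named results
toy_stats <- function(v) {
  list('mu' = mean(v), 'sigma' = sd(v), 'n' = length(v))
}

# unpack the returned list directly into the calling environment
unpack(toy_stats(rnorm(100)))
mu; sigma; n   # now available as individual objects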

Sunday, March 20, 2016

Useful R tips #3

An added bonus of implementing machine learning algorithms in R is that you can often make use of optimized, fast, vectorized libraries and functions. Here are some of the functions I've found very useful when dealing with data sets, whether they are training, cross-validation, or test sets.

Random permutations of a set

Suppose you have a matrix X whose dimensions are m-rows (examples) and n-columns (features)

X[sample(1:nrow(X)), ]

This generates a random permutation of X, i.e. its rows are reshuffled. To keep only the first K examples, we need only add matrix indexing, as in the statement below

X[sample(1:nrow(X)), ][1:K, ]

Or for clarity, break it up into two lines

Xrand = X[sample(1:nrow(X)), ]
Xperm = Xrand[1:K, ]

Duplicate (replicate) matrix/column vector m or n times

Suppose we have a single row vector and we wish to replicate it m-times so we can perform operations on X[m, n] and centroid[1, n]

# centroid is a 1xn row vector
# X is an m x n matrix
repmat(centroid, m, 1)

m here is the number of times (rows) the row vector is replicated. Notice that the third parameter of repmat is 1, which means that we don't need to create more copies of centroid's columns. If centroid were a column vector, we would instead specify the number of columns to generate, as in the statement below

# centroid is a mx1 column vector
# X is an m x n matrix
repmat(centroid, 1, n)
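Note that repmat() is not part of base R; it is provided by packages such as pracma, or it can be sketched in a couple of lines using kronecker():

# a minimal base-R sketch of Matlab/Octave-style repmat
# (assumes a is already a matrix; wrap plain vectors with matrix() first)
repmat <- function(a, n, m) {
  kronecker(matrix(1, n, m), a)
}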

Apply a function to each row/column of matrix

Suppose we need to apply a function to each row or column of a matrix without writing a loop. This comes up often because functions such as sum(), mean(), max(), and min() return only a single value. One solution is to use R's apply() function.

To get the mean of each column (or feature) in X, we may do so using:

X_means = apply(X, 2, mean)

Instead of implementing it in a loop like

X_means = array(0, c(1, ncol(X)))

for (n in 1:ncol(X)) { 

X_means[n] = mean(X[,n]) 
}

If X is an m x n matrix, the above call to apply() yields a vector of n values, one per column. The second argument to apply(), 2, indicates that the function is applied to each column. To apply it to each row, we use 1 instead.

X_means = apply(X, 1, mean)

This returns a vector of m values, one per row. Some other useful functions to use with apply() are sd (standard deviation), max/min (maximum/minimum), and which.max/which.min (index of the maximum/minimum).
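For example, which.max applied row-wise is a handy way to pick the column with the largest entry in each row (scores below is just an illustrative matrix, not data from this post):

# index of the largest entry in each row, e.g. picking a predicted class per example
scores = matrix(c(0.1, 0.7, 0.2,
                  0.6, 0.3, 0.1), nrow = 2, byrow = TRUE)
apply(scores, 1, which.max)   # returns 2 1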

Thursday, March 3, 2016

Revisiting Handwritten Digit Recognition Using ANN

Quid Nunc?


Seven weeks after the conclusion of the handwritten digit recognition class project, my attempts at hacking together an artificial neural network that produced sub-par recognition rates have left a competitive itch that needed scratching. In the days since turning over my work, I was constantly bothered by the nagging feeling that I hadn't done a stellar job.

Specifically, these issues were left unresolved:

  • training ten ANN binary classifiers, one for each digit
  • ANN architecture challenges, which led to
  • sub-optimal vector/matrix implementation of activation and back propagation computation
  • poor choice of cost/error function
  • poor recognition rates for all but one or two digits (an average of < 40%, which is a generous estimate)

On top of it all, the proper blog post about my efforts hasn't been completed yet. I seriously doubt that any effort to complete it can be easily justified, given the litany of problems.

"In the face of overwhelming odds, I'm left with only one option, I'm gonna have to science the shit out of this"


The quote above is from Mark Watney who curiously resembles Jason Bourne/Private James Francis Ryan in last year's documentary, The Martian. It's also the line that I couldn't get out of my head.

The first step was to admit that I don't know a ton of stuff about ANNs. So two weeks after the cynicism wore off, I enrolled in an online machine learning course taught by Professor Andrew Ng (chief scientist at Baidu Research) on Coursera (https://www.coursera.org/learn/machine-learning/). Fun fact: it's THE course that led to the founding of Coursera!

ANN Multi-classification and Prediction

What a difference 5 weeks of course work makes! I was finally able to abandon my original approach of using multiple binary classifiers to classify training sets into k categories.

Figure 1 Artificial neural network architecture with a single hidden layer for multi-classification

Each unit in the output layer represents one of the k classes, i.e. y1 corresponds to the first class (or category), y2 to the second class (or category), etc. Ideally, a perfect classification of a pattern into one of the k classes entails that the output unit for that class is 1 and the rest are 0, e.g. if a pattern is classified as belonging to the fifth category, y5 = 1, and yk = 0 for k not equal to 5.

In order to classify each pattern into one of the k classes, two additional steps were needed:
  • each training pattern must have a k-element 'true' output
  • after training, the predicted class for each pattern is the output unit with the highest value, e.g. if the maximum of all yk occurs at k = 8, the input pattern belongs to the 8th category.
For this specific application, digits 0-9 correspond to classes 1-10. The yk values can be thought of as the probabilities of the input pattern belonging to the kth class. Both steps are sketched below.
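As an illustration of both steps (a sketch only; labels, y_k, m, and k below are placeholders based on the description above, not code from the project):

# one-hot 'true' outputs: labels is a length-m vector of class indices 1..k
y_matrix = matrix(0, nrow = m, ncol = k)
y_matrix[cbind(1:m, labels)] = 1

# predicted class per pattern: index of the largest output unit in each row of y_k
predictions = apply(y_k, 1, which.max)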


Cost functions based on logistic regression

Previously, I used an error function based on the mean squared error, of the form

J = (1/(2m)) Σm Σk (yk − tk)²
Although this is a good starting point, it only measures the closeness of yk to tk. Since it is averaged over all m training patterns, one may easily be fooled into thinking that a low value of J indicates good performance. The new error function below, based on the logistic regression (cross-entropy) cost, may be more powerful and may mitigate this effect:

J = −(1/m) Σm Σk [ tk log(yk) + (1 − tk) log(1 − yk) ]
Immediately, one notices the two terms

−tk log(yk)   and   −(1 − tk) log(1 − yk)
Admittedly, it looks complicated and considering that partial derivatives of this error function with respect to the weights will be needed, is it a justifiable choice?

Suppose that for an input pattern, tk = 1. The second term is zero. The first term will be non-zero as long as yk is not equal to tk, i.e. yk < 1, and as yk approaches 1, J becomes smaller. Next, suppose tk = 0. The first term is zero, while the second term is non-zero as long as yk > 0. Using the sigmoid activation function, especially at the output layer, ensures that 0 < yk < 1.

For any value of yk between 0 and 1, the new error function J gives a more complete measurement of the error. This is especially useful for multi-classification tasks, since we need to accumulate the computed error at each output unit while taking into account whether tk is 0 or 1 for that class.
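A quick numerical comparison (my own illustration, not from the original post) shows how much more strongly the cross-entropy term penalizes a confident wrong answer than the squared error does when tk = 1:

# contribution of a single output unit when the true value is tk = 1
yk = c(0.9, 0.5, 0.1)
0.5 * (yk - 1)^2   # squared error:             0.005 0.125 0.405
-log(yk)           # cross-entropy (rounded):   0.105 0.693 2.303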

For completeness, the partial derivative of this cost function with respect to the weights (Θ) is

∂J/∂Θj = (1/m) Σm (h(x) − y) xj

This partial derivative expression also applies to ANNs with several hidden layers. Computing it requires only that the inputs (xj), the unit activation h(x), and the outputs y are available. xj may be units from intermediate hidden layers, i.e. they are inputs to succeeding layers. Similarly, y may be the expected output of the next hidden layer instead of the final output layer.

Vectorized implementations of ANN computations

As noted previously, instead of going through each input pattern one at a time, representing all input patterns and all layers as matrices allows us to perform matrix and vector computations. Activating each layer involves computing the sum of the connection weights multiplied by the input units; in linear algebra, this combined multiply-and-sum is exactly a matrix multiplication.

Languages with optimized implementations of these operations (e.g. Matlab, Octave, R) are a tremendous help in minimizing errors. Although this places a large amount of trust in the underlying libraries, it lets us implement the required computations in one or two lines. Sums likewise require only a single function call in R instead of a loop-based implementation that is usually more error-prone.

Thus, in my R implementation (see Reference section)

# add bias column to input layer
x = cbind(array(1, c(nrow(training_set), 1)), training_set)
# compute hidden layer activation
z_2 = x %*% t(w_ji)
z_j = h_func(z_2)
# add bias column
a_2 = cbind(array(1, c(nrow(z_j), 1)), z_j)
# compute output layer
y_k = h_func(a_2 %*% t(w_kj))

The bias units are inserted as columns of 1's prior to activation. Essentially, all forward and back propagation computations remain the same, albeit with a vastly improved vectorized implementation. Our newfound understanding also gives us the flexibility to change the architecture, i.e. modify the number of units in the hidden and output layers, and to handle varying numbers of features in the input layer. A sketch of the corresponding back-propagation step is shown below.
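For completeness, the back-propagation step can be vectorized in the same spirit. The snippet below is my reconstruction under the same naming conventions as the forward pass above (y_matrix is the one-hot target matrix and m the number of patterns), not the project's exact code:

# output-layer error for the cross-entropy cost with sigmoid outputs
delta_3 = y_k - y_matrix
# hidden-layer error: propagate back through w_kj, drop the bias column,
# and multiply by the sigmoid derivative z_j*(1 - z_j)
delta_2 = (delta_3 %*% w_kj)[, -1] * z_j * (1 - z_j)
# gradients, averaged over the m training patterns
dWkj = t(delta_3) %*% a_2 / m
dWji = t(delta_2) %*% x / m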

Recognition performance

Figure 2 Test set (30000 digits)
Now that our arsenal is complete, we go back to the previous results. We utilized a training set of 1000 samples, with 100 images per digit. We used 28 hidden layer units and 10 output layer units corresponding to digits 0-9. We trained the network and, after 1000 iterations, achieved a final error of 0.0005671102.

We then used the weights to classify a set of 30000 digits (Figure 2). Each band (30 x 100) corresponds to a single digit. For perfect classification, there should be no misplaced colored pixels in each digit band.

Figure 3 Results using a 1000 digit training set
Figure 3 shows the result. There were ~4725 misplaced pixels, which corresponds to roughly an 85% recognition rate.
Figure 4 Incorrectly classified as 4 instead of 9
Figure 4 shows one sample digit incorrectly classified as 4 instead of 9. At about 85% recognition performance, this already betters my previous result of < 40%.

Optimization: The Final Frontier

As we increase the size of the training set, the ANN noticeably slows down. At one point, it took nearly 2 hours to run through 10000 iterations. Updating the weights by gradient descent back propagation is notoriously slow, even with vectorized implementations, as the sample size and number of features increase. In particular, its minimization is very slow and there is no guarantee that it won't get stuck in some local minimum. What now?

In the machine learning class, instead of optimizing the weights via gradient descent, we utilized an advanced optimization algorithm, which I converted to R (see fmincg.R in the reference section) from the original Matlab implementation. We need only define the cost function as well as the gradients for each weight set.

I still haven't grasped the many intricate details of the algorithm, but it's fast and its search for minima is smarter than plain gradient descent. For now, I'm quite content playing with it, but hopefully we'll get around to analyzing it in the future.
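If fmincg is not at hand, base R's optim() can play a similar role. The sketch below only shows the assumed setup (costFunction and its returned list are hypothetical placeholders, not part of the posted code):

# unroll the weight matrices into a single parameter vector
theta0 = c(as.vector(w_ji), as.vector(w_kj))

# costFunction(theta) is assumed to return list(J = <cost>, grad = <unrolled gradient>)
fit = optim(par = theta0,
            fn  = function(theta) costFunction(theta)$J,
            gr  = function(theta) costFunction(theta)$grad,
            method = "L-BFGS-B",
            control = list(maxit = 400))

# re-roll fit$par back into w_ji and w_kj before running the forward pass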

... And We're Back

Figure 5 10000-sample trained ANN utilizing an advanced optimization algorithm
Equipped with the advanced optimization algorithm (fmincg, thanks Mr. Carl Edward Rasmussen!), we set about training a network with 10000 samples and tested it on the 30000-image set. Figure 5 shows the results. There were ~2110 misplaced pixels, which corresponds to about 92.96667% recognition! There is a noticeable improvement across all digits.

Regularization

One other concept when dealing with ANNs is overfitting.

Because of the number of features or free parameters in the network, it is very likely that exposure to the training set will cause the ANN to 'memorize' the set. Consequently, it may fail to generalize when exposed to new samples. One way to avoid this is to use regularization.

In regularization, the cost function is modified by adding terms corresponding to the weights and scaling it by some factor λ. Gradient descent (or any advanced optimization method) minimizes this modified cost function. As a result, network weights are usually kept smaller. The decision boundaries become smoother, minimizing the chances of an overfit.

The modified cost function is shown below. It should be noted that the bias terms are not regularized; these correspond to the first columns of the weight matrices Θ.

Jreg = J + (λ/(2m)) Σ Θ²   (the sum excluding the bias columns)

This is easily implemented. The unregularized cost function and gradients are computed as before, with minimal modifications, and for λ = 0 this reverts to the usual calculation. Shown below is a sample implementation of the regularization process.

# cost function
cost = sum(-y_matrix * log(y_k) - (1 - y_matrix)*log(1 - y_k))/m

# regularization when lambda != 0
if (lambda != 0) {
  rWji = w_ji
  rWkj = w_kj

  # do not regularize the bias column
  rWji[, 1] = array(0, nrow(w_ji))
  rWkj[, 1] = array(0, nrow(w_kj))

  cost = cost + lambda*(sum(rWji*rWji) + sum(rWkj*rWkj))/(2*m)

  dWji = dWji + lambda*rWji/m
  dWkj = dWkj + lambda*rWkj/m
}


Figure 6 Results with regularization (λ = 1)

Figure 6 shows the results of regularization at λ = 1. This corresponds to ~ 1503 misplaced pixels or about 94.99% recognition.

Summary

This has been a very interesting project and the improved results have been very encouraging. There is renewed widespread interest in ANNs, especially with deep learning, artificial intelligence, and the big data buzz almost always in the news nowadays. The Coursera machine learning class helped a lot in furthering my understanding of ANNs. I'm looking forward to the next set of lectures.

There are other avenues worth pursuing
  • randomly arranging the order of the training set
  • determining whether adding a bias term is necessary
  • utilizing multiple hidden layers (deep learning)
  • convolutional and recurrent neural networks
  • feature reduction (via PCA or ICA)
  • alternatives to gradient descent (particle swarm optimization, perhaps?)
As usual, all the code used here is posted at my GitHub repository. Apologies for the shameless plug :)

References

Sunday, January 10, 2016

Handwritten Digit Recognition Using Artificial Neural Networks (WIP)

Note: I am still writing the paper for this project, so this post is essentially just a repository of the results... for now.

Training set (10 sample images of each digit) from MNIST database


Training set transformed into a 100 x 784 image, 10 patterns per digit


Test set transformed into a 10000 x 784 image, 1000 patterns per digit.

Results

Training results

Convergence

Recognizer performance (heatmap)


Recognizer performance (full test set)


Saturday, January 9, 2016

Neural Networks

The Mathematical Brain


It seems like artificial neural networks (ANN) have been around forever. Inspired largely by the organization of real biological neural networks, these computational models have been used for function approximation, pattern recognition and classification, data processing, robotics, etc.

Biological neuron

Diagram of a typical myelinated vertebrate motor neuron
(Source: Neuron, Wikipedia)
A neuron is a specialized type of cell. A prominent feature of the neuron is its electrical excitability, i.e. the capability to generate and propagate action potentials. The presence of synapses allows it to transmit signals to other cells. The majority of the inputs to a neuron arrive through its dendrites. Neurons conduct nerve impulses in an 'all-or-nothing' process, i.e. when a neuron responds, it responds fully, or not at all.

Artificial neural network


Artificial neural networks are similar to their biological counterparts due to these key features:
  • interconnections between the different layers of neurons
  • an activation function that converts a neuron's weighted input into its output activation.
(Pattern Recognition and Machine Learning, p. 228)

The figure above depicts a two-layer neural network (input, hidden). The input, hidden, and output variables are represented by nodes (xD, zM, yK), and the weight parameters are represented by links between the nodes (wMD, wKM). Links from additional input and hidden variables are the bias parameters (x0, z0). Arrows indicate the direction of information flow through the network during forward propagation.

The input layer is activated by summing the contributions of all inputs plus the bias, and these sums are transformed in the hidden layer zj via an activation function h():

aj = Σi wji xi + wj0,   zj = h(aj)
(Pattern Recognition and Machine Learning, p. 227)

Similarly, at the output layer, the inputs coming from the hidden layer are summed and activated:

ak = Σj wkj zj + wk0,   yk = σ(ak)
(Pattern Recognition and Machine Learning, pp. 227-228)

Incorporating the inputs and outputs from all layers, the network output can be summarized in the more compact form

yk(x, w) = σ( Σj wkj h( Σi wji xi + wj0 ) + wk0 )
(Pattern Recognition and Machine Learning, p. 228)

Finally, all that remains are suitable choices for activation functions σ and h. Usually, sigmoid functions are used. (More on this in our demonstration later).

Training and Learning: Back-propagation


The most important feature of artificial neural networks is the capability to "learn". This is done by updating the interconnection weights using a learning process or algorithm.

For input patterns xn, we define an error function that measures the performance of the network by comparing its outputs with the set of true values tn:

E(w) = (1/2) Σn || y(xn, w) − tn ||²
(Pattern Recognition and Machine Learning, p. 233)

The goal is to minimize this error and to use the gradient information to update the weights:

w(τ+1) = w(τ) − η ∇E(w(τ))
(Pattern Recognition and Machine Learning, p. 240)
The idea is to measure the error starting from the output layer, propagate this 'error' towards the previous layers, and use it to update the weights:

δj = h'(aj) Σk wkj δk,   ∂E/∂wji = δj zi
(Pattern Recognition and Machine Learning)
As can be seen from the equations above, we take outputs and errors from the succeeding layer (e.g. the output layer) and propagate them towards the previous layers (via the chain rule) to compute the weight updates. Because the weights are updated using these partial derivatives (the gradient), this procedure is sometimes called gradient descent.

Boolean Gates: a neural network implementation


We now proceed with a demonstration of a fully functioning artificial neural network (ANN). In this activity, we will be implementing an ANN that functions as a Boolean (Logic) gate (AND, OR, XOR, NOT, NAND, NOR, XNOR). We have eschewed using pre-compiled/built/bundled libraries (or packages) to better illustrate key features, concepts, and design considerations when implementing artificial neural network software. This was inspired by a blog post about an 11-line ANN Python code (see reference below).

Why implement Boolean gates?


An early consideration when implementing ANNs is the availability of training sets. In the case of Boolean operations, the training set is compact and complete! The complete truth table for any Boolean operation requires only four patterns and their corresponding true values (outputs).

Shown below are the truth tables for the two-input logic gates (adapted from Logic Gate, Wikipedia):

A B | AND OR NAND NOR XOR XNOR
0 0 |  0   0   1    1   0    1
0 1 |  0   1   1    0   1    0
1 0 |  0   1   1    0   1    0
1 1 |  1   1   0    0   0    1


Logic gates are ubiquitous components of many electronic devices. A lot of simple circuits can be composed or reduced entirely to one or a few combinations of these circuit elements.

Network Design

Two-layer network architecture for Boolean gate

Shown above is the network structure we have adopted, consisting of two layers: input (x1, x2) and hidden (z1 ... z4), with a single output yk. The weights connecting the input to the hidden layer and the hidden layer to the output are Wji and Wkj, respectively. The contributions of the bias parameters x0 and z0 are fixed at 1.

We have used only two layers because there seem to be no definitive quantitative criteria (in the literature) for choosing the number of hidden layers or the number of units per hidden layer. In practice, though, the number of units in a hidden layer is usually kept larger than or equal to the number of inputs from the previous layer. If, on the other hand, the number of units is smaller than the number of inputs, this may imply losses during the information transfer between layers, with unpredictable side effects on the overall performance.

Because our ANN needs to deliver a Boolean output, we have used a sigmoid activation function:

σ(x) = 1 / (1 + e^(−x))
Sigmoid function (source: Wikipedia)
Logistic curve (source: Wikipedia)

Because our ANN is literally a binary classifier, the choice of a sigmoid function is deliberate. Another attractive property is that it is defined for all real numbers, although its characteristic S-shape can already be observed within the smaller interval (−6, +6). It is also possible to approximate continuous functions using combinations of sigmoids, similar to Fourier series expansions of periodic functions.

Furthermore, its first derivative is easily computed. It is simply

σ'(x) = σ(x)(1 − σ(x))
Logistic curves are used extensively, especially in modelling population growth, chemical reactions, and distributions. It must be noted that for other classification schemes, especially those involving more than two classes, a softmax function (or normalized exponential), which is a generalization of the logistic function, is typically used:

σ(z)j = e^(zj) / Σk e^(zk)
Softmax function (source: Wikipedia)
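In R, a direct (if numerically naive) softmax could be sketched as:

# softmax: exponentiate and normalize a vector of scores
softmax <- function(z) exp(z) / sum(exp(z))
softmax(c(1, 2, 3))   # 0.09003057 0.24472847 0.66524096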
Because we have a complete training set consisting of only four patterns, it is easy to do batch processing instead of sequential processing. That is, per iteration, we can train the network on all patterns at once instead of training it on one pattern before moving to another in the next iteration. This has the advantage of preventing the ANN from over-training on a specific pattern. A disadvantage of batch processing is that all derivatives and the results of intermediate computations need to be stored in memory, especially during the back-propagation process. Sequential processing, sometimes called 'online learning', does not have the same memory requirements.

We can also represent the network parameters as vectors and matrices, since the forward process mostly involves dot products and matrix products. Some of these operations exhibit inherent parallelism and are ideally suited to implementation on modern multi-core, multi-threaded computers and graphics cards. The human brain exhibits inherent parallel processing as well as a high degree of redundancy, i.e. many neurons perform tasks similar to those of others.

Results


I have implemented a basic neural network Boolean Gate classifier based on the Python code mentioned earlier. The features of the implementation are summarized below

Features

  • two fixed layers (2-unit input layer plus bias, 4-unit hidden layer plus bias)
  • fixed bias = 1.0
  • implements: training, forward, back-propagation processes
  • weight initialization using Uniform white noise (-1, 1) and Gaussian white noise (mean, standard deviation)
  • stopping conditions: tolerance or maximum iteration
  • adjustable learning rate
  • can be trained for all logic gates: AND, OR, XOR, NAND, NOR, XNOR

Sample output: training, testing

Training parameters


  • maximum iterations: 1000000
  • tolerance: 0.005
  • learning rate: 1.0
  • Boolean gate: XNOR (truth vector: [1, 0, 0, 1])
  • weight initialization: uniform white noise (minimum: -1.0, maximum: 1.0)
Convergence: number of iterations: 77080, max. iterations: 100000, Error: 0.004999995

Sample run


# training
result = nnet_train(1000000, 1, 5*10^(-3), c(1, 0, 0, 1))

# test against different input patterns
nnet_forward(c(0,0), result$w_ji, result$w_kj)$y_k
0.9933873 # ~ 1


nnet_forward(c(0,1), result$w_ji, result$w_kj)$y_k
0.005123679 # ~ 0


nnet_forward(c(1,0), result$w_ji, result$w_kj)$y_k
0.005125778 # ~ 0

nnet_forward(c(1,1), result$w_ji, result$w_kj)$y_k
0.996862 # ~ 1

The plot above shows the ANN converging to an error of less than 0.005 after 77080 iterations. After training, we need only run the forward operation using the trained weights (Wji and Wkj) to check the performance of the Boolean gate. Shown above are sample runs, with comments as highlights. Because the sigmoid activation function is a real-valued continuous function, the outputs are not exactly equal to 0 or 1. As a final step, we may add a threshold function if a pure binary output is desired, e.g. outputs below 0.5 are considered 0, and outputs at or above the threshold are set to 1.
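Such a threshold step could be as simple as the following sketch (my addition, not part of the original code):

# map the continuous sigmoid output to a hard 0/1 decision at a 0.5 cutoff
threshold <- function(y, cutoff = 0.5) as.numeric(y >= cutoff)
threshold(nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k)   # 1 for the XNOR example above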

At 77000+ iterations, the training is slow for a simple Boolean operation and the tolerance is quite high. For other classification tasks this may be insufficient, but in our case, it is good enough as it is able to accurately mimic the behavior of the XNOR gate. Perhaps this can be addressed by redesigning the network architecture, i.e. more intermediate layers and units.

Learning rate


Boolean gate trained with a lower learning rate: 0.01, Error: 0.08328227
# slow learning rate
result = nnet_train(100000, 0.01, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9111153 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.05935309 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.1008474 # ~ 0 !?!

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9159617 # ~ 1

Using a low learning rate, the program stopped after reaching the maximum of 100000 iterations with an Error nowhere near 0.005. The classification performance is also poorer; in the output above, it almost fails to classify the (1, 0) input pattern.

Boolean gate trained with a fast learning rate: 2, Error: 0.004999969
Using a faster learning rate (2.0), the convergence is much faster, terminating at 37576 iterations with an error of 0.004999969. The classification performance also improved.


# fast learning rate
result = nnet_train(100000, 2, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9951378 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.004323133 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.005610439 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9947962 # ~ 1

Training time and tolerance


An issue related to the learning rate is the training time. We can lower or raise the maximum iterations, or decrease the tolerance (while the maximum iterations is increased), to demonstrate this. Here, we show training times of 1000, 10000, and 50000 iterations (the last being roughly half of the original training time).
Short training period: maximum iterations: 1000, Error: 0.1086976
# short training time
result = nnet_train(1000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.8593224 # ~ 1 !?!

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.1196668 # ~ 0 !?!

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.1058873 # ~ 0 !?!

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9321584 # ~ 1

From the output above, it is clear that the network did not have enough training time to sufficiently learn the XNOR gate operation. It barely classifies the patterns (0, 0), (0, 1), and (1, 0).

Increasing the training time to 10000, we obtain the following results

Training time: 10000, Error: 0.01472075 

# training time = 10000
result = nnet_train(10000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9856066 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.0125765 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.01660571 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.984696 # ~ 1

The performance improves, and despite far less training time than the 77000+ iterations obtained earlier, the network is already able to clearly mimic the XNOR operation. Increasing the training time further to 50000, we obtain the following results.

Training time: 50000, Error: 0.006208026

# training time = 50000
result = nnet_train(50000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.994051811 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.005480674 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.006903740 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.993500497 # ~ 1

The improvement is significant compared to the 10000-iteration run and is already comparable to the performance of the network trained for 77000+ iterations. If computational time were a scarce resource, training for 50000 iterations would already be an acceptable trade-off, especially since the additional improvement at 77000+ iterations is marginal.

Training set


Finally, we address an important issue affecting ANN performance with regard to the training set. Suppose our training set is incomplete (missing a pattern) or redundant; how does this affect training and classification performance? This issue becomes especially important for more complicated classification tasks in which the available training data are lacking, incomplete, or redundant, i.e. there is nothing new under the sun. In image classification tasks, the variance in the available data is large enough that the resources needed to train the neural network may be prohibitively expensive or out of reach for many researchers.

To illustrate this in our network, we knock off one training pattern (1, 0) and observe the effects. First, we make the changes to the source code:

...

# boolean gate operation (incomplete: the (1, 0) pattern has been removed)
boolean_gate <- function(output = c(0, 1, 0)) {
  booldata = array(0, c(3, 2))
  booldata[1, ] = c(0, 0)
  booldata[2, ] = c(0, 1)
  booldata[3, ] = c(1, 1)
  return(list('patterns' = booldata, 'output' = output))
}

nnet_train <- function(maxiter = 1000000, learning_rate = 0.1, tol = 10^(-3), output = c(0, 1, 0), Gaussian = FALSE, mu = 0.0, sigma = 1.0) {

...


The rest of the code has not been changed. We obtained the following results.

Iterations: 34324, Error: 0.00499997

# training time = 100000
incomplete_net = nnet_train(100000, 1, 5*10^(-3), c(1, 0, 1))

nnet_forward(c(0, 0), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.9933288 # ~ 1

nnet_forward(c(0, 1), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.006582452 # ~ 0

nnet_forward(c(1, 0), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.9999934 # ~ 1 !!!!!!

nnet_forward(c(1, 1), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.998254 # ~ 1

Although the network converges after 34000+ iterations with an Error of 0.00499997, the incomplete training set causes it to fail at classifying the input pattern (1, 0): it outputs a 1 instead of a 0. For other complex classification tasks, where it is very difficult if not impossible to have a complete training set, we need at least a well-represented training set, i.e. all classes are sufficiently represented or the number of examples per class is sufficiently large. What constitutes a sufficiently large pool of examples is still an open issue.

Depending on the application, and indeed in extreme cases, increasing the size of the training set may not be the preferable option. Consider our network as we attempt to classify inputs that are not strictly binary:

nnet_forward(c(-1, 0), result$w_ji, result$w_kj)$y_k
0.9999955 # ~ 1 !?!

nnet_forward(c(0.1, 0.1), result$w_ji, result$w_kj)$y_k
0.9768845 # ~ 1

nnet_forward(c(1, 0.1), result$w_ji, result$w_kj)$y_k
0.006361499 # ~ 0

nnet_forward(c(-1, 0.1), result$w_ji, result$w_kj)$y_k
0.9999937 # ~ 1 !?!


In this case, it would seem to be overkill to build a larger training set consisting of other real-valued patterns. Sanitizing the input and output may be the better option, for example as sketched below.
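One possible sanitization step (my own sketch, not part of the posted code) is to clamp and round arbitrary inputs onto the binary grid the gate expects before they reach the network:

# force real-valued inputs onto {0, 1}
sanitize_input <- function(x) as.numeric(pmin(pmax(round(x), 0), 1))

sanitize_input(c(-1, 0.1))   # 0 0
nnet_forward(sanitize_input(c(-1, 0.1)), result$w_ji, result$w_kj)$y_k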

Other features


As mentioned earlier, we implemented other features in our network. We shall not go over them in detail here but instead demonstrate how they are used, for further investigation.

# training a NAND gate
result = nnet_train(100000, 1, 5*10^(-3), c(1, 1, 1, 0))

# training an OR gate
result = nnet_train(100000, 1, 5*10^(-3), c(0, 1, 1, 1))

# initializing weights using Gaussian white noise with mean = 0.0, standard deviation = 1.0
gaussian_net = nnet_train(100000, 1, 5*10^(-3), c(1,0,0,1), TRUE, 0.0, 1.0)

Summary


We have learned how an artificial neural network functions by demonstrating the forward, back-propagation, and training processes in our software implementation. We have also investigated the performance of the network as its parameters are tuned. Finally, we have gained insights into how the ANN may be improved and briefly touched on some open issues.

Source Code


Below is the R-implementation, with the functions and network parameters highlighted for clarity

# sigmoid activation function
h_func <- function(x) {
  return(1/(1 + exp(-x)))
}

# 1st derivative of the activation function
# (expressed in terms of the activation output: sigma'(a) = sigma(a)*(1 - sigma(a)))
h_funcd <- function(x) {
  return(x*(1 - x))
}

# boolean gate operation (defaults to the XOR operation)
boolean_gate <- function(output = c(0, 1, 1, 0)) {
  booldata = array(0, c(4, 2))
  booldata[1, ] = c(0, 0)
  booldata[2, ] = c(0, 1)
  booldata[3, ] = c(1, 0)
  booldata[4, ] = c(1, 1)
  return(list('patterns' = booldata, 'output' = output))
}

# feed-forward operation
nnet_forward <- function(x, w_ji, w_kj) {
  # activation with the bias contribution fixed at 1.0
  z_j = h_func(x %*% w_ji + 1.0)
  y_k = h_func(z_j %*% w_kj + 1.0)
  return(list('y_k' = y_k, 'z_j' = z_j))
}

# backward propagation
nnet_backprop <- function(y_k, z_j, x, t_k, w_kj) {
  error_k = t_k - y_k
  delta_k = error_k*h_funcd(y_k)

  error_j = delta_k %*% t(w_kj)
  delta_j = error_j*h_funcd(z_j)

  # compute delta weight updates
  dWkj = t(z_j) %*% delta_k
  dWji = t(x) %*% delta_j

  # compute mean error
  Error = mean(abs(error_k))

  return(list('dWkj' = dWkj, 'dWji' = dWji, 'Error' = Error))
}

nnet_train <- function(maxiter = 1000000, learning_rate = 0.1, tol = 10^(-3), output = c(0, 1, 1, 0), Gaussian = FALSE, mu = 0.0, sigma = 1.0) {
  boolean_op = boolean_gate(output)
  x = boolean_op$patterns
  t_k = boolean_op$output
  ii = dim(x)
  Error = 4
  # initialize weights (one random value per weight)
  if (!Gaussian) {
    w_ji = array(runif(n = prod(ii), min = -1, max = 1), rev(ii))
    w_kj = runif(n = ii[1], min = -1, max = 1)
  } else {
    w_ji = array(rnorm(n = prod(ii), mean = mu, sd = sigma), rev(ii))
    w_kj = rnorm(n = ii[1], mean = mu, sd = sigma)
  }
  i = 0
  convergence = numeric(0)
  y_k = array(0, length(t_k))
  # training loop
  while (i < maxiter && Error > tol) {
    forward = nnet_forward(x, w_ji, w_kj)
    backward = nnet_backprop(forward$y_k, forward$z_j, x, t_k, w_kj)
    # update weights
    w_ji = w_ji + learning_rate*backward$dWji
    w_kj = w_kj + learning_rate*backward$dWkj
    i = i + 1

    # save current performance
    Error = backward$Error
    y_k = forward$y_k
    convergence = c(convergence, Error)
  }
  return(list('convergence' = convergence, 'x' = x, 'y_k' = y_k, 'Error' = Error, 'iterations' = i, 'w_kj' = w_kj, 'w_ji' = w_ji))
}

References


  • Christopher M. Bishop, Pattern Recognition and Machine Learning, New York: Springer, 2006