Saturday, January 9, 2016

Neural Networks

The Mathematical Brain


It seems like artificial neural networks (ANN) have been around forever. Inspired largely by the organization of real biological neural networks, these computational models have been used for function approximation, pattern recognition and classification, data processing, robotics, and many other applications.

Biological neuron

Diagram of a typical myelinated vertebrate motor neuron
(Source: Neuron, Wikipedia)
A neuron is a specialized type of cell. A prominent feature of the neuron is its electrical excitability, i.e. its capability to generate and propagate action potentials. The presence of synapses allows it to transmit signals to other cells. The majority of the inputs to a neuron arrive through its dendrites. Neurons conduct nerve impulses in an 'all-or-nothing' manner, i.e. when a neuron responds, it responds fully, or not at all.

Artificial neural network


Artificial neural networks resemble their biological counterparts in two key features:
  • interconnections between the different layers of neurons, and
  • an activation function that converts a neuron's weighted input to its output activation.
(Pattern Recognition and Machine Learning, p. 228)

The figure above depicts a two-layer neural network (two layers of adaptive weights). The input, hidden, and output variables are represented by nodes (xD, zM, yK), and the weight parameters are represented by links between the nodes (wMD, wKM). Links from the additional input and hidden variables x0 and z0 represent the bias parameters. Arrows indicate the direction of information flow through the network during forward propagation.

Each hidden unit forms a weighted sum of all the inputs plus a bias, and this sum is transformed into the hidden-layer output zj via an activation function h():

a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad z_j = h(a_j)

(Pattern Recognition and Machine Learning, p. 227)
Similarly, the outputs of the hidden layer are summed and activated at the output layer:

a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad y_k = \sigma(a_k)

(Pattern Recognition and Machine Learning, pp. 227-228)
Incorporating the inputs and outputs from all layers, the network output can be summarized in the more compact form below:

y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)

(Pattern Recognition and Machine Learning, p. 228)

Finally, all that remains is a suitable choice for the activation functions σ and h; sigmoid functions are commonly used. (More on this in our demonstration later.)
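As a minimal sketch of this compact forward pass (assuming hypothetical weight matrices W1, W2 and bias vectors b1, b2, with the logistic sigmoid used for both h and σ), the whole computation reduces to a few lines of R:

# minimal sketch of the compact forward pass; W1 (D x M), W2 (M x K),
# b1, b2 and the input x are illustrative, not part of the final implementation
sigmoid <- function(a) 1/(1 + exp(-a))

forward <- function(x, W1, b1, W2, b2) {
  z_j <- sigmoid(x %*% W1 + b1)    # hidden-layer activations z_j = h(a_j)
  y_k <- sigmoid(z_j %*% W2 + b2)  # network outputs y_k = sigma(a_k)
  return(y_k)
}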

Training and Learning: Back-propagation


The most important feature of artificial neural networks is their capability to "learn". This is done by updating the interconnection weights using a learning process or algorithm.

For a set of input patterns xn, we define an error function that measures the performance of the network by comparing its outputs with the corresponding true values tn:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \lVert \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \rVert^2

(Pattern Recognition and Machine Learning, p. 233)

The goal is to minimize this error and use the resulting gradient information to update the weights:

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E(\mathbf{w}^{(\tau)})

(Pattern Recognition and Machine Learning, p. 240)
The idea is to measure the error at the output layer, propagate this 'error' back towards the previous layers, and use it to update the weights:

\delta_k = y_k - t_k, \qquad \delta_j = h'(a_j) \sum_{k} w_{kj} \delta_k, \qquad \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i

(Pattern Recognition and Machine Learning)
As can be seen from the set of equations above, we take the outputs and errors from the succeeding layer (the output layer) and propagate them towards the previous layers (via the chain rule) to compute the weight updates. Because the weights are updated in the direction of the negative partial derivatives, this procedure is a form of gradient descent.
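As a rough illustration of one such update (a sketch only, with illustrative names W1, W2, b1, b2, eta; logistic activations and a sum-of-squares error are assumed), a single gradient-descent step for one pattern could look like this in R:

# one illustrative gradient-descent step for a single training pattern;
# all names here are hypothetical, not the final implementation below
sigmoid <- function(a) 1/(1 + exp(-a))

x   <- matrix(c(0, 1), nrow = 1)              # one input pattern (1 x D)
t_k <- 0                                      # its target value
W1  <- matrix(runif(2 * 4, -1, 1), 2, 4)      # D x M weights
W2  <- matrix(runif(4 * 1, -1, 1), 4, 1)      # M x K weights
b1  <- rep(0, 4); b2 <- 0; eta <- 1.0         # biases and learning rate

z_j <- sigmoid(x %*% W1 + b1)                 # forward pass, hidden layer
y_k <- sigmoid(z_j %*% W2 + b2)               # forward pass, output layer

delta_k <- (y_k - t_k) * y_k * (1 - y_k)            # output-layer error term
delta_j <- (delta_k %*% t(W2)) * z_j * (1 - z_j)    # propagated via the chain rule

W2 <- W2 - eta * t(z_j) %*% delta_k           # gradient-descent weight updates
W1 <- W1 - eta * t(x)   %*% delta_j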

Boolean Gates: a neural network implementation


We now proceed with a demonstration of a fully functioning artificial neural network (ANN). In this activity, we implement an ANN that functions as a Boolean (logic) gate (AND, OR, XOR, NOT, NAND, NOR, XNOR). We have eschewed pre-built libraries (or packages) in order to better illustrate key features, concepts, and design considerations when implementing artificial neural network software. This was inspired by a blog post about an 11-line ANN Python code (see reference below).

Why implement Boolean gates?


An early consideration when implementing ANNs is the availability of training sets. In the case of two-input Boolean operations, the training set is compact and complete! The complete truth table for any two-input Boolean operation requires only four patterns and their corresponding true values (outputs).

Shown below are the truth tables for all the two-input logic gates:

A  B | AND  OR  NAND  NOR  XOR  XNOR
0  0 |  0    0    1     1    0     1
0  1 |  0    1    1     0    1     0
1  0 |  0    1    1     0    1     0
1  1 |  1    1    0     0    0     1
(source: Logic Gate, Wikipedia)


Logic gates are ubiquitous components of many electronic devices. Many simple circuits can be composed of, or reduced entirely to, one or a few combinations of these circuit elements.
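To emphasize how compact this training set is, the four two-input patterns and a truth vector (here for XNOR) can be written down directly in R; the names below are illustrative:

# the complete two-input training set: four patterns and their XNOR targets
patterns <- rbind(c(0, 0),
                  c(0, 1),
                  c(1, 0),
                  c(1, 1))
targets_xnor <- c(1, 0, 0, 1)
# swapping the truth vector retargets the same four patterns to any other gate,
# e.g. c(0, 1, 1, 0) for XOR or c(0, 0, 0, 1) for AND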

Network Design

Two-layer network architecture for Boolean gate

Shown above is the network structure we have adopted, consisting of two layers: an input layer (x1, x2) and a hidden layer (z1 ... z4), with a single output yk. The weights connecting the inputs to the hidden layer and the hidden layer to the output are Wji and Wkj, respectively. The contributions of the bias parameters x0 and z0 are fixed at 1.

We have used only two layers because there seems to be no definitive quantitative criterion (in the literature) for choosing the number of hidden layers or the number of units per hidden layer. In practice, however, the number of units in a hidden layer is usually kept larger than or equal to the number of inputs from the previous layer. If the number of units is smaller than the number of inputs, this may imply losses during the information transfer between the layers, with unpredictable side effects on the overall performance.
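For concreteness, with D = 2 inputs, M = 4 hidden units, and a single output, the two weight matrices of such a network have the following shapes (a sketch with illustrative names):

D <- 2; M <- 4; K <- 1                       # inputs, hidden units, outputs
W_ji <- matrix(runif(D * M, -1, 1), D, M)    # input-to-hidden weights (2 x 4)
W_kj <- matrix(runif(M * K, -1, 1), M, K)    # hidden-to-output weights (4 x 1)
dim(W_ji)   # 2 4
dim(W_kj)   # 4 1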

Because our ANN needs to deliver a Boolean output, we have used a sigmoid activation function:

Sigmoid (logistic) function:

\sigma(x) = \frac{1}{1 + e^{-x}}

(source: Wikipedia)
Logistic curve (source: Wikipedia)

Because our ANN is essentially a binary classifier, the choice of a sigmoid function is deliberate. Another attractive property is that it is defined for all real numbers, although its characteristic S-shape is already apparent within a small interval (about -6 to +6). It is also possible to approximate continuous functions using combinations of sigmoids, much as Fourier series expansions represent periodic functions.

Furthermore, its first derivative is easily computed. This is simply

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))
Logistic curves are used extensively, especially in modelling population growth, chemical reactions, and statistical distributions. It should be noted that for other classification schemes, especially those involving more than two classes, a softmax function (or normalized exponential), which is a generalization of the logistic function, is used instead:

\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}

(source: Wikipedia)
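For reference, a direct (non-optimized) R version of the normalized exponential might look like the sketch below; it is not used in our two-class network:

# softmax / normalized exponential over a vector of scores z
softmax <- function(z) {
  e <- exp(z - max(z))   # subtract the maximum for numerical stability
  return(e / sum(e))
}
softmax(c(1, 2, 3))      # approx. 0.090 0.245 0.665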
Because we have a complete training set consisting of only four patterns, it is easy to do batch processing instead of sequential processing. That is, in each iteration we can train the network on all patterns at once instead of training it on a single pattern and moving to the next pattern in the following iteration. This has the advantage of preventing the ANN from over-training on a specific pattern. A disadvantage of batch processing is that all derivatives and intermediate results need to be stored in memory, especially during the back-propagation process. Sequential processing, sometimes called 'online learning', does not have the same memory requirements.
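The structural difference between the two modes can be sketched as follows; grad_fn is a hypothetical helper that returns the error gradient for a single pattern, standing in for the forward/backward passes:

# batch mode: sum the gradients over all patterns, then apply one update
batch_step <- function(W, X, T, grad_fn, eta) {
  G <- Reduce(`+`, lapply(seq_len(nrow(X)), function(n) grad_fn(W, X[n, ], T[n])))
  return(W - eta * G)
}

# sequential ('online') mode: update the weights after every single pattern
online_epoch <- function(W, X, T, grad_fn, eta) {
  for (n in seq_len(nrow(X))) {
    W <- W - eta * grad_fn(W, X[n, ], T[n])
  }
  return(W)
}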

We can also represent the network parameters as vectors and matrices, since the forward process mainly involves vector dot products and matrix multiplications. Some of these operations exhibit inherent parallelism and are ideally suited to implementation on modern multi-core, multi-threaded computers and graphics cards. The human brain likewise exhibits inherent parallel processing as well as a high degree of redundancy, i.e. many neurons perform tasks similar to those of others.
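As a sketch of this vectorized view (illustrative names; the fixed bias of 1.0 mirrors our implementation), stacking all four patterns into a matrix lets a single matrix product evaluate the entire training set at once:

sigmoid <- function(a) 1/(1 + exp(-a))

X    <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))   # all four patterns (4 x 2)
W_ji <- matrix(runif(2 * 4, -1, 1), 2, 4)           # input-to-hidden weights
W_kj <- matrix(runif(4, -1, 1), 4, 1)               # hidden-to-output weights

Z <- sigmoid(X %*% W_ji + 1.0)   # hidden activations for every pattern at once
Y <- sigmoid(Z %*% W_kj + 1.0)   # one output per pattern (4 x 1)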

Results


I have implemented a basic neural network Boolean gate classifier based on the Python code mentioned earlier. The features of the implementation are summarized below.

Features

  • two fixed layers (2-unit input layer plus bias, 4-unit hidden layer plus bias)
  • fixed bias = 1.0
  • implements: training, forward, back-propagation processes
  • weight initialization using Uniform white noise (-1, 1) and Gaussian white noise (mean, standard deviation)
  • stopping conditions: tolerance or maximum iteration
  • adjustable learning rate
  • can be trained for all logic gates: AND, OR, XOR, NAND, NOR, XNOR

Sample output: training, testing

Training parameters


  • maximum iterations: 1000000
  • tolerance: 0.005
  • learning rate: 1.0
  • Boolean gate: XNOR (truth vector: [1, 0, 0, 1])
  • weight initialization: uniform white noise (minimum: -1.0, maximum: 1.0)
Convergence: number of iterations: 77080, max. iterations: 100000, Error: 0.004999995

Sample run


# training
result = nnet_train(1000000, 1, 5*10^(-3), c(1, 0, 0, 1))

# test against different input patterns
nnet_forward(c(0,0), result$w_ji, result$w_kj)$y_k
0.9933873 # ~ 1


nnet_forward(c(0,1), result$w_ji, result$w_kj)$y_k
0.005123679 # ~ 0


nnet_forward(c(1,0), result$w_ji, result$w_kj)$y_k
0.005125778 # ~ 0

nnet_forward(c(1,1), result$w_ji, result$w_kj)$y_k
0.996862 # ~ 1

The plot above shows the ANN converging to an error of less than 0.005 after 77080 iterations. After training, we only need to run the forward operation using the learned set of weights (Wji and Wkj) in order to check the performance of the Boolean gate. Shown above are sample runs, with comments added for emphasis. Because the sigmoid activation function is a real-valued continuous function, the outputs are not exactly equal to 0 or 1. As a final step, we may add a threshold function if a pure binary output is desired, e.g. outputs below 0.5 are set to 0 and outputs at or above the threshold are set to 1, as sketched below.
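For example, a simple thresholding helper (a sketch; y_k stands for the continuous network output) is enough to binarize the result:

# binarize the continuous sigmoid output at a 0.5 threshold
to_binary <- function(y_k, threshold = 0.5) {
  return(as.integer(y_k >= threshold))
}

to_binary(0.9933873)   # 1
to_binary(0.0051237)   # 0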

At 77000+ iterations, the training is slow for a simple Boolean operation, and the tolerance is quite high. For other classification tasks this may be insufficient, but in our case it is good enough, as the network accurately mimics the behavior of the XNOR gate. Perhaps this can be addressed by redesigning the network architecture, i.e. adding more intermediate layers and units.

Learning rate


Boolean gate trained with a lower learning rate: 0.01, Error: 0.08328227
# slow learning rate
result = nnet_train(100000, 0.01, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9111153 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.05935309 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.1008474 # ~ 0 !?!

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9159617 # ~ 1

Using a low learning rate, the program stopped after reaching the maximum of 100000 iterations with an error nowhere near 0.005. The classification performance is also poorer. In the output above, the network almost fails to classify the (1, 0) input pattern.

Boolean gate trained with a fast learning rate: 2, Error: 0.004999969
Using a higher learning rate (2.0), convergence is much faster, terminating at 37576 iterations with an error of 0.004999969. The classification performance also improved.


# fast learning rate
result = nnet_train(100000, 2, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9951378 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.004323133 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.005610439 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9947962 # ~ 1

Training time and tolerance


An issue related to the learning rate is the training time. To demonstrate this, we can lower or raise the maximum number of iterations, or decrease the tolerance (while the maximum number of iterations is increased). Here, we show training times of 1000, 10000, and 50000 iterations (the last being roughly half of the original training time).
Short training period: maximum iterations: 1000, Error: 0.1086976
# short training time
result = nnet_train(1000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.8593224 # ~ 1 !?!

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.1196668 # ~ 0 !?!

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.1058873 # ~ 0 !?!

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.9321584 # ~ 1

From the output above, it is clear that the network did not have enough training time to sufficiently learn the XNOR gate operation. It is barely able to classify the patterns (0, 0), (0, 1), and (1, 0).

Increasing the training time to 10000, we obtain the following results

Training time: 10000, Error: 0.01472075 

# training time = 10000
result = nnet_train(10000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.9856066 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.0125765 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.01660571 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.984696 # ~ 1

The performance improves, and despite requiring less training time than the 77000+ iterations obtained earlier, the network is already able to clearly mimic the XNOR operation. Increasing the training time further to 50000, we obtain the following results.

Training time: 50000, Error: 0.006208026

# training time = 50000
result = nnet_train(50000, 1, 5*10^(-3), c(1, 0, 0, 1))

nnet_forward(c(0, 0), result$w_ji, result$w_kj)$y_k
0.994051811 # ~ 1

nnet_forward(c(0, 1), result$w_ji, result$w_kj)$y_k
0.005480674 # ~ 0

nnet_forward(c(1, 0), result$w_ji, result$w_kj)$y_k
0.006903740 # ~ 0

nnet_forward(c(1, 1), result$w_ji, result$w_kj)$y_k
0.993500497 # ~ 1

The improvement is significant compared to the 10000-iteration run and is already comparable to the performance of the network trained for 77000+ iterations. If computational time were a scarce resource, training for 50000 iterations would be an acceptable trade-off, especially since the improvement from the 77000+ iteration run is marginal.

Training set


Finally, we address an important issue affecting ANN performance with regard to the training set. Suppose our training set is incomplete (missing a pattern) or redundant; how does this affect the training and classification performance? This issue becomes important for more complicated classification tasks in which the available training data is lacking, incomplete, or redundant, i.e. there is nothing new under the sun. In image classification tasks, the variance in the available data is sufficiently large that the resources needed to train the neural network may be prohibitively expensive or out of reach for many researchers.

To illustrate this in our network, we knock off one training pattern (1, 0) and observe the effects. First, we make the changes to the source code:

...

# boolean gate operation (incomplete: the (1, 0) pattern is removed)
boolean_gate <- function(output = c(0, 1, 0)) {
  booldata = array(0, c(3, 2))
  booldata[1, ] = c(0, 0)
  booldata[2, ] = c(0, 1)
  booldata[3, ] = c(1, 1)
  return(list('patterns' = booldata, 'output' = output))
}

nnet_train <- function(maxiter = 1000000, learning_rate = 0.1, tol = 10^(-3), output = c(0, 1, 0), Gaussian = FALSE, mu = 0.0, sigma = 1.0) {

...


The rest of the code has not been changed. We obtained the following results.

Iterations: 34324, Error: 0.00499997

# training time = 100000
incomplete_net = nnet_train(100000, 1, 5*10^(-3), c(1, 0, 1))

nnet_forward(c(0, 0), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.9933288 # ~ 1

nnet_forward(c(0, 1), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.006582452 # ~ 0

nnet_forward(c(1, 0), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.9999934 # ~ 1 !!!!!!

nnet_forward(c(1, 1), incomplete_net$w_ji, incomplete_net$w_kj)$y_k
0.998254 # ~ 1

Although the network converges after 34000+ iterations with an error of 0.00499997, the incomplete training set causes it to fail at classifying the input pattern (1, 0): it outputs a 1 instead of a 0. For other, more complex classification problems where it is very difficult, if not impossible, to have a complete training set, we need at least a well-represented training set, i.e. all classes are sufficiently represented or the number of examples per class is sufficiently large. What constitutes a sufficiently large pool of examples is still an open issue.

Depending on the application, and indeed in extreme cases, increasing the size of the training set may not be the preferable option. Consider our network as we attempt to classify inputs that are not strictly binary:

nnet_forward(c(-1, 0), result$w_ji, result$w_kj)$y_k
0.9999955 # ~ 1 !?!

nnet_forward(c(0.1, 0.1), result$w_ji, result$w_kj)$y_k
0.9768845 # ~ 1

nnet_forward(c(1, 0.1), result$w_ji, result$w_kj)$y_k
0.006361499 # ~ 0

nnet_forward(c(-1, 0.1), result$w_ji, result$w_kj)$y_k
0.9999937 # ~ 1 !?!

nnet_forward(c(1, 0.1), result$w_ji, result$w_kj)$y_k
0.006361499 # ~ 0

In this case, it would seem to be overkill to build a larger training set consisting of other real-valued patterns. Sanitizing the input and output may be the better option.
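A sketch of such sanitization (an illustrative helper, not part of the implementation below) simply clamps and rounds the incoming values to {0, 1} before calling the forward pass:

# coerce arbitrary real-valued inputs to strict binary patterns before use
sanitize_input <- function(x) {
  x <- pmin(pmax(x, 0), 1)    # clamp to the interval [0, 1]
  return(round(x))            # round to the nearest of 0 or 1
}

sanitize_input(c(-1, 0.1))    # 0 0
sanitize_input(c(1, 0.1))     # 1 0
# e.g. nnet_forward(sanitize_input(c(-1, 0.1)), result$w_ji, result$w_kj)$y_k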

Other features


As mentioned earlier, we implemented other features in our network. We shall not go over them in detail here, but instead demonstrate how they are used, for further investigation.

# training a NAND gate
result = nnet_train(100000, 1, 5*10^(-3), c(1, 1, 1, 0))

# training an OR gate
result = nnet_train(100000, 1, 5*10^(-3), c(0, 1, 1, 1))

# initializing weights using Gaussian white noise with mean = 0.0, standard deviation = 1.0
gaussian_net = nnet_train(100000, 1, 5*10^(-3), c(1,0,0,1), TRUE, 0.0, 1.0)

Summary


We have learned how an artificial neural network functions by demonstrating the forward, back-propagation, and training processes in our software implementation. We have also investigated the performance of the network as its parameters are tuned, gained insights into how the ANN may be improved, and briefly touched on some open issues.

Source Code


Below is the R implementation, with the functions and network parameters highlighted for clarity.

# sigmoid activation function
h_func <- function(x) {
  return(1/(1 + exp(-x)))
}

# 1st derivative of the activation function,
# expressed in terms of the sigmoid output: h'(a) = h(a)*(1 - h(a))
h_funcd <- function(x) {
  return(x*(1 - x))
}

# boolean gate operation (defaults to XOR operation)
boolean_gate <- function(output = c(0, 1, 1, 0)) {
  booldata = array(0, c(4, 2))
  booldata[1, ] = c(0, 0)
  booldata[2, ] = c(0, 1)
  booldata[3, ] = c(1, 0)
  booldata[4, ] = c(1, 1)
  return(list('patterns' = booldata, 'output' = output))
}

# feed-forward operation
nnet_forward <- function(x, w_ji, w_kj) {
  # activation function with fixed bias contribution = 1.0
  z_j = h_func(x %*% w_ji + 1.0)
  y_k = h_func(z_j %*% w_kj + 1.0)
  return(list('y_k' = y_k, 'z_j' = z_j))
}

# backward propagation
nnet_backprop <- function(y_k, z_j, x, t_k, w_kj) {
  # output-layer error and delta
  error_k = t_k - y_k
  delta_k = error_k*h_funcd(y_k)

  # propagate the error back to the hidden layer
  error_j = delta_k %*% t(w_kj)
  delta_j = error_j*h_funcd(z_j)

  # compute delta weight updates
  dWkj = t(z_j) %*% delta_k
  dWji = t(x) %*% delta_j

  # compute mean error
  Error = mean(abs(error_k))

  return(list('dWkj' = dWkj, 'dWji' = dWji, 'Error' = Error))
}

nnet_train <- function(maxiter = 1000000, learning_rate = 0.1, tol = 10^(-3), output = c(0, 1, 1, 0), Gaussian = FALSE, mu = 0.0, sigma = 1.0) {
  boolean_op = boolean_gate(output)
  x = boolean_op$patterns
  t_k = boolean_op$output
  ii = dim(x)
  # start the error above any reasonable tolerance
  Error = 4
  # initialize weights
  if (!Gaussian) {
    w_ji = array(runif(n = prod(ii), min = -1, max = 1), rev(ii))
    w_kj = runif(n = ii[1], min = -1, max = 1)
  } else {
    w_ji = array(rnorm(n = prod(ii), mean = mu, sd = sigma), rev(ii))
    w_kj = rnorm(n = ii[1], mean = mu, sd = sigma)
  }
  i = 0
  convergence = numeric(0)
  y_k = array(0, length(t_k))
  # training loop
  while (i < maxiter && Error > tol) {
    forward = nnet_forward(x, w_ji, w_kj)
    backward = nnet_backprop(forward$y_k, forward$z_j, x, t_k, w_kj)
    # update weights
    w_ji = w_ji + learning_rate*backward$dWji
    w_kj = w_kj + learning_rate*backward$dWkj
    i = i + 1

    # save current performance
    Error = backward$Error
    y_k = forward$y_k
    convergence = c(convergence, Error)
  }
  return(list('convergence' = convergence, 'x' = x, 'y_k' = y_k, 'Error' = Error, 'iterations' = i, 'w_kj' = w_kj, 'w_ji' = w_ji))
}

References


  • Christopher M. Bishop, Pattern Recognition and Machine Learning, New York: Springer, 2006
