Monday, December 21, 2015

Expectation-Maximization Algorithm

Data: Problems, Challenges, and the EM Algorithm


A common challenge in machine learning and pattern recognition arises when the observed data is incomplete, or when the distribution from which the data was generated (or drawn) is either a) unknown but has parameters that can be estimated, or b) representable as a mixture of PDFs whose parameters are not known. In both cases, reasonable estimates of the means and variances can be made even if some of the data is missing.

The Expectation-Maximization (EM) algorithm is ideally suited to situations where the available data is incomplete. Coined and explained in (Dempster et al., 1977), the moniker comes from its iterative two-step process of expectation (E) and maximization (M), although uses of the algorithm have been recorded as early as the 1950s, applied to gene frequency estimation.

In the expectation step, the current estimates of the parameters (initially, a guess) are used to compute the expected log-likelihood of the complete data, with the unobserved part accounted for through its conditional distribution given the observations. The maximization step then computes new parameter estimates by maximizing the expected log-likelihood obtained in the E-step. These new estimates feed into the next E-step, and the cycle repeats. The process continues until convergence, i.e., until there is no significant change in the parameter estimates, or until a set number of iterations is reached.
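In symbols, writing Theta(t) for the parameter estimates at iteration t, x_obs for the observed (incomplete) data, and x_c for the complete data, the two steps take the standard form

$$\text{E-step:}\quad Q(\Theta;\Theta(t)) = E\big[\ln p(x_c;\Theta)\,\big|\,x_{\mathrm{obs}};\Theta(t)\big]$$
$$\text{M-step:}\quad \Theta(t+1) = \arg\max_{\Theta} Q(\Theta;\Theta(t))$$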

Exact expressions for both steps, specialized to a Gaussian mixture model, are given in the next sections.

Mixture Models


An unknown or arbitrary distribution p(x) can be modeled as a linear combination of PDFs p(x|j) in the form

p(x) = Σ_j P_j p(x|j),  j = 1, 2, ..., J

(Introduction to Pattern Recognition, p. 11)

with the condition that the prior probabilities P_j sum to 1; for a sufficiently large number of components J, such a mixture can model essentially any distribution. In most cases, the p(x|j) are modeled as Gaussians N(m_j, s_j), j = 1, 2, ..., J. The resulting PDF is multi-modal, i.e., it has multiple peaks.
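In the demonstration below, each component is a two-dimensional Gaussian with an isotropic covariance σ_j² I (μ_j and σ_j² correspond to the means m and variances S/s in the tables and script), so the mixture takes the concrete form

$$p(x) = \sum_{j=1}^{J} P_j \,\frac{1}{2\pi\sigma_j^2} \exp\!\left(-\frac{\lVert x-\mu_j\rVert^2}{2\sigma_j^2}\right), \qquad \sum_{j=1}^{J} P_j = 1.$$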

Shown below are 500 data points drawn from a mixture of 3 Gaussian PDFs. For easy visualization, we have color-coded the points into three classes, although such groupings are generally unknown. Clustering algorithms could easily be applied to classify them; otherwise, visual inspection should suffice for now. A minimal R sketch for generating such a data set is given after the figure.
Data points drawn from a mixture of 3 Gaussian PDFs
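A figure like the one above can be reproduced without the converted book code. The following is a minimal sketch (it assumes the MASS package and uses the mixture parameters listed under Mixture Parameters below; it does not use the book's mixt_model routine):

# minimal sketch: draw N points from a 3-component 2D Gaussian mixture
library(MASS)                                   # for mvrnorm
set.seed(0)
N  = 500
P  = c(0.4, 0.4, 0.2)                           # prior (mixing) probabilities
mu = list(c(1, 1), c(3, 3), c(2, 6))            # component means
s2 = c(0.1, 0.2, 0.3)                           # isotropic variances
y  = sample(1:3, N, replace = TRUE, prob = P)   # component label for each point
X  = t(sapply(y, function(j) mvrnorm(1, mu[[j]], s2[j] * diag(2))))
plot(X, col = y, pch = 19, xlab = "x1", ylab = "x2")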

Expectation Step


From the observed data, we can form the joint PDF p(x, j; theta) = p(x|j; theta) P_j using the prior probabilities P_j and the assumption that the data is generated from a Gaussian mixture.
E-step: the expected log-likelihood is computed using the current parameter estimates
(Pattern Recognition, p. 46)

E-step: exact expression for the E-step log-likelihood
(Pattern Recognition, p. 47)
Shown above are the exact expressions of the log-likelihood for the E-step. Because the components of the log-likelihood are exponential functions (a consequence of the Gaussian mixture model), we arrive at a form that can be easily evaluated. Here x_k is the k-th data point; the unknown parameters are the means m_j, the variances s_j, and the priors P_j; and J is the number of component PDFs.
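In this notation, and with the isotropic Gaussian components σ_j² I used in the demonstration, the expected log-likelihood evaluated in the E-step is the standard Gaussian-mixture expression

$$Q(\Theta;\Theta(t)) = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j\mid x_k;\Theta(t))\,\big[\ln P_j + \ln p(x_k\mid j;\theta_j)\big]$$
$$\text{with}\qquad \ln p(x_k\mid j;\theta_j) = -\ln\!\big(2\pi\sigma_j^2\big) - \frac{\lVert x_k-\mu_j\rVert^2}{2\sigma_j^2}.$$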

Maximization Step


From the E-step, we obtained an exact expression for the log-likelihood function. This is then maximized with respect to each unknown parameter: we compute the gradient of the log-likelihood with respect to each parameter and set it equal to zero, i.e., we find the maximum at the critical point.

M-step: maximization of each unknown parameter
(Pattern Recognition, p. 47)
The expression P(j|x_k; Theta(t)) (capital Theta, the full set of parameter estimates at iteration t) is evaluated using
M-step: expressions for P(j|x_k; Theta(t)) and p(x_k|j; theta(t))
(Pattern Recognition, p. 47)
p(x_k|j; theta(t)) (small theta) is simply the value of the Gaussian PDF at x_k using the current estimates theta(t) as parameters. The algorithm is initiated by providing valid estimates for the prior probabilities P_j(t = 0); these are valid only if they sum to 1.
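For reference, the resulting update equations (the standard EM updates for a Gaussian mixture with isotropic covariances σ_j² I in d dimensions; d = 2 in the demonstration) are

$$P(j\mid x_k;\Theta(t)) = \frac{p(x_k\mid j;\theta_j(t))\,P_j(t)}{\sum_{m=1}^{J} p(x_k\mid m;\theta_m(t))\,P_m(t)}$$
$$\mu_j(t+1) = \frac{\sum_{k} P(j\mid x_k;\Theta(t))\,x_k}{\sum_{k} P(j\mid x_k;\Theta(t))}, \qquad \sigma_j^2(t+1) = \frac{\sum_{k} P(j\mid x_k;\Theta(t))\,\lVert x_k-\mu_j(t+1)\rVert^2}{d\,\sum_{k} P(j\mid x_k;\Theta(t))}$$
$$P_j(t+1) = \frac{1}{N}\sum_{k=1}^{N} P(j\mid x_k;\Theta(t)).$$

These are the quantities the EM iteration updates until the change in the estimates falls below the tolerance.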

Demonstration

For this demonstration, we have converted Matlab(R) implementations of the algorithms (mixture models, EM) into R rather than creating our own versions of the code, since we are concerned only with the exposition of the technique. Links to our conversions of the code are provided in the reference section below. We have used examples from the book Introduction to Pattern Recognition: A Matlab Approach. Further investigations may be done by adjusting the initial estimates as well as the tolerance.

Important parameters for the mixture above are listed below. The 500 data points are distributed among three two-dimensional (2D) PDFs with a priori probabilities of 0.4, 0.4, and 0.2, respectively. We shall estimate the mixture parameters (means, S, P) using the EM algorithm.

Mixture Parameters

Parameter X1 X2 X3
mean1 1 3 2
mean2 1 3 6
S 0.1 0.2 0.3
P (apriori) 0.4 0.4 0.2
N 500

Results

Initial guess


Parameter X1 X2 X3
mean1 0 5 5
mean2 2 2 5
S 0.15 0.27 0.40
P (apriori) 0.33 (1/3) 0.33 (1/3) 0.33 (1/3)
tolerance 10^-5

Although we have fixed our initial guesses for the means here, we may also choose random values. We used a tolerance of 10^-5; it may be set to a lower value at the expense of computation time (number of iterations). The prior probabilities are initially set equal across the three sets of data points.

Final estimates


Parameter X1 X2 X3
mean1 0.9749602 2.9819268 2.0322002
mean2 0.9800478 2.9859954 5.9081155
S 0.1012807 0.2139030 0.2915451
P (apriori) 0.399999 0.395992 0.204009
error 6.925152e-07
iterations 16

The algorithm was able to provide a reasonable estimate (error ~ 10^-7) of the parameters in 16 iterations. The mean, variance (S), and P estimates obtained from the data agree closely with the parameters used to generate the 500 data points (from a Gaussian mixture model).
error vs. iteration

regions within 2-sigma about the mean
In the above figure, we have highlighted the regions within 2 sigma (the square root of the variance) of each estimated mean. For a 2D isotropic Gaussian, such a region theoretically contains about 86% of that component's points (1 - e^-2); the familiar ~95% figure applies to each coordinate taken separately.
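This can be checked numerically. A quick sketch, assuming the data from the script below and hypothetical names m_est and s_est for the estimated means and variances returned by the EM run:

# fraction of component-1 points inside the 2-sigma disc of its estimated mean
# (m_est and s_est are hypothetical names for the EM estimates)
d2 = mahalanobis(data$X[data$y == 1, ], center = m_est[, 1], cov = s_est[1] * diag(2))
mean(d2 <= 2^2)   # should be near 1 - exp(-2), i.e. about 0.86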

Poor initial estimates of parameters

It must be noted that the EM algorithm is highly sensitive to the validity or goodness of the initial estimates for the mean, variances, and prior probabilities. For example, using the initial estimates below, we ran the EM algorithm again.

Sample poor initial estimates

Parameter X1 X2 X3
mean1 1.6 1.4 1.3
mean2 1.4 1.6 1.5
S 0.2 0.4 0.3
P (apriori) 0.2 0.4 0.4
tolerance 10^-5
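Using the script further below, this run can be reproduced by swapping in the poor initial values from the table (the variable names here are illustrative):

# poor initial estimates taken from the table above
m_bad = array(0, c(2,3))
m_bad[,1] = c(1.6, 1.4)
m_bad[,2] = c(1.4, 1.6)
m_bad[,3] = c(1.3, 1.5)
s_bad  = c(0.2, 0.4, 0.3)
Pa_bad = c(0.2, 0.4, 0.4)
em_alg_function(data$X, m_bad, s_bad, Pa_bad, e_min)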

It did not converge after 1000 iterations at the same tolerance (10^-5), and its final estimates after the hard (fixed) termination are mostly incorrect.

Parameter X1 X2 X3
mean1 0.5085580 2.6059821 0.9592023
mean2 0.9145188 3.8784266 0.9630420
S 0.00100000 1.42680773 0.09210382
P (apriori) 1.539032e-17 6.231477e-01 3.768523e-01
error 0.0003108576
iterations 1000

error after 1000 iterations
Incorrect parameter estimates
Shown above are the incorrect estimates of the PDF parameters due to bad initial estimates. Notice that the 2-sigma region of the red PDF is negligible, i.e., it is nowhere to be seen in the plot above. The mean and variance estimates for the green PDF are off, and those for the blue PDF are totally incorrect, i.e., its mean and variance encompass the red PDF.

R script used to test the EM algorithm


# Mixture Models -----------------------------
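# NOTE: mixt_model(), plot_data(), and em_alg_function() used below are our R
# conversions of the Matlab routines from Introduction to Pattern Recognition:
# A Matlab Approach (see the Demonstration section above); source them first.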

# set initial random number generator seed
set.seed(0)

# generate mean for three 2D Gaussian PDFs
m = array(0, c(2,3))
m[,1] = c(1,1)
m[,2] = c(3,3)
m[,3] = c(2,6)

# ... and distribution variances (co-variances), each a 2x2 matrix
S = array(0,c(2,2,3))
S[,,1] = 0.1*diag(2)
S[,,2] = 0.2*diag(2)
S[,,3] = 0.3*diag(2)

# set prior probabilities for each PDF
P = c(0.4, 0.4, 0.2)

# distribute among 500 data points and initialize random number generator seed
N = 500
sed = 0

# generate mixture model
data = mixt_model(m, S, P, N, sed)

# plot mixture and their corresponding mean centroids
plot_data(data$X, data$y, m, 1)

# EM Algorithm -----------------------------

# set initial estimate of the mean
m_ini = array(0, c(2,3))
m_ini[,1] = c(0, 2)
m_ini[,2] = c(5, 2)
m_ini[,3] = c(5, 5)

# set initial estimate of the variances
s_ini = c(0.15, 0.27, 0.4)

# set initial apriori probability estimates
Pa_ini = c(1/3, 1/3, 1/3)

# set algorithm tolerance level
e_min = 10^(-5)

# generate estimates using EM algorithm
em_alg_function(data$X, m_ini, s_ini, Pa_ini, e_min)
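For readers without the converted book code at hand, the following is a minimal, self-contained sketch of the EM iteration for this isotropic-Gaussian case (our own illustrative code with our own names, not the book's em_alg_function; it implements the E- and M-step equations given earlier):

# minimal EM sketch for a mixture of J isotropic Gaussians (illustrative only)
# X: N x d data matrix; m: d x J initial means; s: initial variances; Pa: initial priors
em_sketch = function(X, m, s, Pa, e_min = 1e-5, max_iter = 1000) {
  N = nrow(X); d = ncol(X); J = length(Pa)
  for (iter in 1:max_iter) {
    # E-step: responsibilities P(j | x_k; Theta(t))
    dens = sapply(1:J, function(j) {
      q = colSums((t(X) - m[, j])^2)                         # squared distances to mean j
      Pa[j] * exp(-q / (2 * s[j])) / (2 * pi * s[j])^(d / 2) # prior times Gaussian density
    })
    resp = dens / rowSums(dens)
    # M-step: re-estimate means, variances, and priors
    m_new = sapply(1:J, function(j) colSums(resp[, j] * X) / sum(resp[, j]))
    s_new = sapply(1:J, function(j) {
      q = colSums((t(X) - m_new[, j])^2)
      sum(resp[, j] * q) / (d * sum(resp[, j]))
    })
    Pa_new = colMeans(resp)
    # convergence check on the total change in the estimates
    e = sum(abs(m_new - m)) + sum(abs(s_new - s)) + sum(abs(Pa_new - Pa))
    m = m_new; s = s_new; Pa = Pa_new
    if (e < e_min) break
  }
  list(m = m, s = s, Pa = Pa, iterations = iter, error = e)
}

# example usage with the same data and initial estimates as above
em_sketch(data$X, m_ini, s_ini, Pa_ini, e_min)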

References

  • A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
  • C. B. Do and S. Batzoglou, "What is the expectation maximization algorithm?", Nature Biotechnology, Vol. 26 (2008), pp. 897-899.
  • S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th Ed., United Kingdom: Academic Press, 2009.
  • S. Theodoridis and K. Koutroumbas, Introduction to Pattern Recognition: A Matlab Approach, United Kingdom: Academic Press, 2010.
