Following my previous post on optimization and mixtures (here), Nicolas told me that my idea was probably not the most clever one (there).

So, we get back to our simple mixture model,

In order to describe how EM algorithm works, assume first that both and are perfectly known, and the mixture parameter is the only one we care about.

- The simple model, with only one parameter that is unknown

Here, the likelihood is

so that we write the log likelihood as

which might not be simple to maximize. Recall that the mixture model can interpreted through a latent variate (that cannot be observed), taking value when is drawn from , and 0 if it is drawn from . More generally (especially in the case we want to extend our model to 3, 4, … mixtures), and .

With that notation, the likelihood becomes

and the log likelihood

the term on the right is useless since we only care about p, here. From here, consider the following iterative procedure,

Assume that the mixture probability is known, denoted . Then I can predict the value of (i.e. and ) for all observations,

So I can inject those values into my log likelihood, i.e. in

having maximum (no need to run numerical tools here)

that will be denoted . And I can iterate from here.

Formally, the first step is where we calculate an expected (E) value, where is the best predictor of given my observations (as well as my belief in ). Then comes a maximization (M) step, where using , I can estimate probability .

- A more general framework, all parameters are now unkown

So far, it was simple, since we assumed that and were perfectly known. Which is not reallistic. An there is not much to change to get a complete algorithm, to estimate . Recall that we had which was the expected value of Z_{1,i}, i.e. it is a probability that observation i has been drawn from .

If , instead of being in the segment was in , then we could have considered mean and standard deviations of observations such that =0, and similarly on the subset of observations such that =1.

But we can’t. So what can be done is to consider as the weight we should give to observation i when estimating parameters of , and similarly, 1-would be weights given to observation i when estimating parameters of .

So we set, as before

and then

and for the variance, well, it is a weighted mean again,

and this is it.

- Let us run the code on the same data as before

Here, the code is rather simple: let us start generating a sample

> X1 = rnorm(n,0,1)

> X20 = rnorm(n,0,1)

> Z = sample(c(1,2,2),size=n,replace=TRUE)

> X2=4+X20

> X = c(X1[Z==1],X2[Z==2])

then, given a vector of initial values (that I called and then before),

> s = c(0.5, mean(X)-1, var(X), mean(X)+1, var(X))

I define my function as,

> em = function(X0,s) {

+ Ep = s[1]*dnorm(X0, s[2], sqrt(s[4]))/(s[1]*dnorm(X0, s[2], sqrt(s[4])) +

+ (1-s[1])*dnorm(X0, s[3], sqrt(s[5])))

+ s[1] = mean(Ep)

+ s[2] = sum(Ep*X0) / sum(Ep)

+ s[3] = sum((1-Ep)*X0) / sum(1-Ep)

+ s[4] = sum(Ep*(X0-s[2])^2) / sum(Ep)

+ s[5] = sum((1-Ep)*(X0-s[3])^2) / sum(1-Ep)

+ return(s)

+ }

Then I get , or . So this is it ! We just need to iterate (here I stop after 200 iterations) since we can see that, actually, our algorithm converges quite fast,

> for(i in 2:200){

+ s=em(X,s)

+ }

Let us run the same procedure as before, i.e. I generate samples of size 200, where difference between means can be small (0) or large (4),

Ok, Nicolas, you were right, we’re doing much better ! Maybe we should also go for a Gibbs sampling procedure ?… next time, maybe….