The maximum likelihood principle states that the best choice of model parameters is the one that maximizes the probability the model assigns to the observed data. One rationale for this is that, in the limit of many observations, such a choice approaches the true value of these parameters (assuming, of course, that the model is a valid one). This can be seen by the argument given below. Interestingly, maximum likelihood is equivalent to maximum compression. An alternative is the so-called maximum a posteriori (MAP) approach, in which a Bayesian prior distribution on the parameters is assumed and the maximum is computed taking this prior into account. In terms of compression, MAP estimation corresponds to minimizing the number of bits required to send a description of both the observations and the model parameters. This formulation is also called the Minimum Message Length method.
Stated more precisely, the maximum likelihood estimate of a model's parameters is
- argmaxθ p(X | θ)
where X is the observed data and θ represents the model parameters. The MAP estimate of the model parameters is
- argmaxθ p(X | θ) p(θ)
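As a rough illustration of the two estimates above, the following sketch compares them for a coin-flip model. The Bernoulli model, the Beta(2, 2) prior, the counts of 7 heads and 3 tails, and the grid search are all assumptions chosen for the example, not part of the definitions.

```python
import math

def log_likelihood(theta, heads, tails):
    # log p(X | theta) for independent coin flips with P(heads) = theta
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def log_beta_prior(theta, a, b):
    # log of an unnormalized Beta(a, b) density (assumed prior);
    # the dropped normalizing constant does not change the argmax
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

def argmax_on_grid(score, steps=9999):
    # candidate values of theta strictly inside (0, 1)
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=score)

heads, tails = 7, 3   # hypothetical observations
a, b = 2.0, 2.0       # hypothetical prior parameters

# maximum likelihood: argmax over theta of p(X | theta)
theta_ml = argmax_on_grid(lambda t: log_likelihood(t, heads, tails))

# MAP: argmax over theta of p(X | theta) p(theta)
theta_map = argmax_on_grid(
    lambda t: log_likelihood(t, heads, tails) + log_beta_prior(t, a, b)
)

print(theta_ml, theta_map)
```

Under these assumptions the maximum likelihood estimate lands near 7/10, while the prior pulls the MAP estimate toward 1/2, to about 8/12.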
The relationship between compression, truth and maximum likelihood can be seen by bounding the expected log-likelihood.
If p(x) is the distribution of a discrete random variable X that takes on real values x, and q(x) is some other distribution over the same values, then by the definition of a distribution, we know that
- q(x) ≥ 0
- Σx q(x) = 1
and likewise for p.
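If we take N repeated independent samples xi of X, then each term log q(xi) has the same expectation, so the expected value of the mean of log q(xi) is given by
- E(Σi log q(xi) / N) = Σx p(x) log q(x)
where the sum on the left runs over the N samples and the sum on the right runs over the possible values x.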
But since log y ≤ y - 1 for positive y, with equality only at y = 1, we have
- Σx p(x) log q(x) - Σx p(x) log p(x) = Σx p(x) log [ q(x)/p(x) ]
- ... ≤ Σx p(x) [ q(x)/p(x) - 1 ]
- ... = Σx q(x) - Σx p(x)
- ... ≤ 1 - 1 = 0
where the sums run over the values x with p(x) > 0. Therefore
- Σx p(x) log q(x) ≤ Σx p(x) log p(x)
and equality can only be achieved when q(x) = p(x) for every x.
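The inequality can also be checked numerically. The sketch below is only a sanity check, not a proof: it draws an arbitrary distribution p over five values and many arbitrary alternatives q, then compares the two sums. The helper names, the number of values and the random seed are assumptions made for the example.

```python
import math
import random

def expected_log(p, q):
    # sum over x of p(x) log q(x), restricted to values with p(x) > 0
    return sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def random_distribution(k, rng):
    # k positive weights normalized to sum to 1
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(5, rng)
self_score = expected_log(p, p)   # sum over x of p(x) log p(x)

# gap = sum p log p - sum p log q, for many randomly chosen q
gaps = []
for _ in range(1000):
    q = random_distribution(5, rng)
    gaps.append(self_score - expected_log(p, q))

min_gap = min(gaps)
print(min_gap)
```

In every trial the gap should be non-negative, consistent with Σx p(x) log q(x) ≤ Σx p(x) log p(x).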
This means that the choice of q that maximizes the expected value of the mean of log q is p itself. To the extent that the law of large numbers lets us approximate this expected value by the observed mean, maximizing the observed mean, which is the log-likelihood of the samples divided by N, lets us approximate p.
Thus it can be said that statistical inference lets us approximate the truth.
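A small simulation can make the role of the law of large numbers concrete. The sketch below assumes a Bernoulli source with true parameter 0.3 and restricts q to Bernoulli models on a grid; both choices, along with the sample sizes and the seed, are illustrative assumptions.

```python
import math
import random

def mean_log_q(theta, samples):
    # observed mean of log q(x_i) for a Bernoulli model with q(1) = theta
    ones = sum(samples)
    zeros = len(samples) - ones
    return (ones * math.log(theta) + zeros * math.log(1.0 - theta)) / len(samples)

def best_theta(samples, steps=999):
    # the theta on the grid that maximizes the observed mean of log q
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda t: mean_log_q(t, samples))

rng = random.Random(1)
true_theta = 0.3   # assumed "truth" p(1)

estimates = {}
for n in (10, 100, 10000):
    samples = [1 if rng.random() < true_theta else 0 for _ in range(n)]
    estimates[n] = best_theta(samples)

print(estimates)
```

With few samples the maximizer can be far from 0.3; with many samples it should settle close to it, which is the sense in which maximizing the observed mean approximates p.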
Interestingly, the negative of the expected value of log q, with logarithms taken to base 2, is the expected length in bits per sample of an ideally compressed representation of the xi, where q is the model used to do the compression. Thus we can also claim that ultimate compression = truth. This leads us off to Occam's Razor.
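The compression claim can be illustrated with idealized code lengths of -log2 q(x) bits for each value x. The four-symbol source and the two candidate models below are assumptions chosen so that the arithmetic is exact, and the code lengths describe an ideal coder rather than any particular compressor.

```python
import math

def expected_bits(p, q):
    # expected code length per sample, in bits, when data drawn from p
    # is encoded with ideal code lengths -log2 q(x)
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.125, 0.125]     # assumed true distribution
uniform = [0.25, 0.25, 0.25, 0.25]  # a mismatched model

bits_true_model = expected_bits(p, p)           # 1.75 bits per sample
bits_uniform_model = expected_bits(p, uniform)  # 2.0 bits per sample

print(bits_true_model, bits_uniform_model)
```

Encoding with the true distribution costs 1.75 bits per sample on average, while the mismatched uniform model costs 2 bits, so the model closest to the truth gives the shortest expected description.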