
Maximum likelihood estimation (MLE) is a popular statistical method used for fitting a mathematical model to data. Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.

The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields.

For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. Maximum likelihood estimation gives a unique and easy to determine solution in the case of the normal distribution and many other problems, although in very complex problems this may not be the case. If a uniform prior distribution is assumed over the parameters, the maximum likelihood estimate coincides with the most probable values thereof, that is, with the maximum a posteriori (MAP) estimate.

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

## Principles

Consider a family $D_\theta$ of probability distributions parameterized by an unknown parameter θ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted $f_\theta$. We draw a sample $x_1, \ldots, x_n$ of n values from this distribution, and then using $f_\theta$ we compute the (multivariate) probability density associated with our observed data,

$$f_\theta(x_1, \ldots, x_n).$$

As a function of θ with $x_1, \ldots, x_n$ fixed, this is the likelihood function

$$\mathcal{L}(\theta) = f_\theta(x_1, \ldots, x_n).$$

The method of maximum likelihood estimates θ by finding the value of θ that maximizes $\mathcal{L}(\theta)$. This is the maximum likelihood estimator (MLE) of θ:

$$\hat{\theta} = \underset{\theta}{\operatorname{arg\,max}}\ \mathcal{L}(\theta).$$

From a simple point of view, the outcome of a maximum likelihood analysis is the maximum likelihood estimate. This can be supplemented by an approximation for the covariance matrix of the MLE, where this approximation is derived from the likelihood function. A more complete outcome from a maximum likelihood analysis would be the likelihood function itself, which can be used to construct improved versions of confidence intervals compared to those obtained from the approximate covariance matrix. See also the likelihood-ratio test.

Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} f_\theta(x_i),$$

and since maxima are unaffected by monotone increasing transformations, one can take the logarithm of this expression to turn it into a sum:

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f_\theta(x_i).$$

The maximum of this expression can then be found numerically using various optimization algorithms.
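As a rough illustration of numerical maximization, the sketch below maximizes an iid log-likelihood with a golden-section search. The exponential model, the data values, the search interval and the helper names are all assumptions chosen for the example; they do not come from the article. The exponential model is convenient because its MLE has the closed form n / Σxᵢ, which gives something to compare against.

```python
import math

def exp_log_likelihood(rate, data):
    """Log-likelihood of iid exponential data: sum of log(rate * exp(-rate * x))."""
    return sum(math.log(rate) - rate * x for x in data)

def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

# Hypothetical observations, assumed to be iid exponential.
data = [0.8, 1.9, 0.3, 2.4, 1.1, 0.5]

rate_hat = golden_section_max(lambda r: exp_log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)  # known MLE for the exponential rate

print(rate_hat, closed_form)
```

Golden-section search only needs function values and works here because this log-likelihood has a single peak; for multi-parameter or multi-modal problems a different optimizer would be needed.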

This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that (on average) will neither tend to over-estimate nor under-estimate the true value of θ.

Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.

## Properties

### Functional invariance

The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the complete parameter. Consistently with this, if $\hat{\theta}$ is the MLE for θ, and if g is any function of θ, then the MLE for α = g(θ) is by definition

$$\hat{\alpha} = g(\hat{\theta}).$$

It maximizes the so-called profile likelihood:

$$\bar{\mathcal{L}}(\alpha) = \sup_{\theta:\ \alpha = g(\theta)} \mathcal{L}(\theta).$$
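A small numerical check of functional invariance, as a sketch: take binomial data with 49 successes in 80 trials (the coin data used in the Examples section) and let g map the success probability p to the odds α = p / (1 − p). Because this g is one-to-one, the profile likelihood is just the likelihood rewritten in terms of α. The function names and the bisection bounds are assumptions made for the illustration.

```python
def score_in_odds(alpha, t, n):
    """Derivative of the binomial log-likelihood t*log(alpha) - n*log(1 + alpha),
    written in terms of the odds alpha = p / (1 - p)."""
    return t / alpha - n / (1.0 + alpha)

def bisect_root(f, lo, hi, iterations=200):
    """Find a root of f on [lo, hi], assuming f(lo) > 0 > f(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t, n = 49, 80
p_hat = t / n

# Invariance: apply g to the MLE of p.
odds_by_invariance = p_hat / (1 - p_hat)

# Direct route: maximize the likelihood over the odds by solving score = 0.
odds_direct = bisect_root(lambda a: score_in_odds(a, t, n), 1e-6, 1e6)

print(odds_by_invariance, odds_direct)
```

Both routes should agree up to numerical precision, which is what the invariance property predicts.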

### Bias

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the value on the drawn ticket, even though the expectation of that value is only (n+1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
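The ticket example can be checked with exact arithmetic. For a drawn ticket k, the likelihood of a box size m is 1/m when m ≥ k and 0 otherwise, so it is largest at m = k. The sketch below computes the expectation of that estimator for a hypothetical true box size of 10; the function name and the choice n = 10 are assumptions for illustration.

```python
from fractions import Fraction

def expected_ticket_mle(n):
    """Exact expectation of the MLE (the drawn ticket) when tickets 1..n are equally likely."""
    return sum(Fraction(k, n) for k in range(1, n + 1))

n = 10  # hypothetical true number of tickets
expectation = expected_ticket_mle(n)
bias = expectation - n

print(expectation, bias)
```

The expectation equals (n+1)/2, so for this n the estimator undershoots the true value by almost half on average.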

### Asymptotics

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

• The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
• The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
• The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.

Since the Cramér-Rao bound only speaks of unbiased estimators while the maximum likelihood estimator is usually biased, asymptotic efficiency as defined here says little on its own: there might be other nearly unbiased estimators with much smaller variance. However, it can be shown that among all regular estimators (estimators whose asymptotic distribution is not dramatically disturbed by small changes in the parameters), the asymptotic distribution of the maximum likelihood estimator is the best possible, i.e., most concentrated.

Some regularity conditions which ensure this behavior are:

1. The first and second derivatives of the log-likelihood function must be defined.
2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.
3. The maximum likelihood estimator is consistent.

By the mathematical meaning of the word asymptotic, asymptotic properties are properties which are only approached in the limit of larger and larger samples: they are approximately true when the sample size is large enough. The theory does not tell us how large the sample needs to be in order to obtain a good enough degree of approximation. Fortunately, in practice they often appear to be approximately true when the sample size is moderately large. So in practice, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE. When we do this, the Fisher information matrix is usefully estimated by the observed information matrix.
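To make the last point concrete, here is a minimal sketch of inference based on the observed information for a single parameter. It assumes binomial data with 49 successes in 80 trials (the coin data from the Examples section), approximates the second derivative of the log-likelihood by a central finite difference, and uses 1.96 as the approximate 97.5% quantile of the standard normal distribution. The function names and the step size h are assumptions made for the example.

```python
import math

def log_likelihood(p, t, n):
    """Binomial log-likelihood in p, up to an additive constant."""
    return t * math.log(p) + (n - t) * math.log(1 - p)

def observed_information(p, t, n, h=1e-4):
    """Negative second derivative of the log-likelihood, by central finite difference."""
    f = lambda q: log_likelihood(q, t, n)
    return -(f(p + h) - 2 * f(p) + f(p - h)) / h ** 2

t, n = 49, 80
p_hat = t / n

info = observed_information(p_hat, t, n)
std_error = 1 / math.sqrt(info)

# Approximate 95% interval from the asymptotic Gaussian distribution of the MLE.
interval = (p_hat - 1.96 * std_error, p_hat + 1.96 * std_error)

print(info, std_error, interval)
```

For this model the observed information at the MLE has the closed form n / (p̂(1 − p̂)), so the finite-difference value can be checked directly.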

Some cases where the asymptotic behaviour described above does not hold are outlined next.

Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory clearly does not give a practically useful approximation. Examples here would be variance-component models, where each component of variance, σ², must satisfy the constraint σ² ≥ 0.

Data boundary parameter-dependent. For the theory to apply in a simple way, the set of data values which has positive probability (or positive probability density) should not depend on the unknown parameter. A simple example where such parameter-dependence does hold is the case of estimating θ from a set of independent identically distributed observations when the common distribution is uniform on the range (0,θ). For estimation purposes the relevant range of θ is such that θ cannot be less than the largest observation. In this instance the maximum likelihood estimate exists and has some good behaviour, but the asymptotics are not as outlined above.

Nuisance parameters. For maximum likelihood estimations, a model may have a number of nuisance parameters. For the asymptotic behaviour outlined to hold, the number of nuisance parameters should not increase with the number of observations (the sample size). A well-known example of this case is where observations occur as pairs, where the observations in each pair have a different (unknown) mean but otherwise the observations are independent and normally distributed with a common variance. Here for 2N observations, there are N+1 parameters. It is well known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.

Increasing information. For the asymptotics to hold in cases where the assumption of independent identically distributed observations does not hold, a basic requirement is that the amount of information in the data increases indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence in the data (for example, if new observations are essentially identical to existing observations), or if new independent observations are subject to an increasing observation error.
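The nuisance-parameter case above (paired observations, each pair with its own mean) can be explored by simulation. The sketch below is an illustration under stated assumptions: the true standard deviation, the number of pairs, the range of pair means and the random seed are all made up. A standard calculation for this setup gives E[σ̂²] = σ²/2, so the estimate should settle near half the true variance rather than the variance itself.

```python
import random

def paired_variance_mle(pairs):
    """MLE of the common variance when each pair has its own unknown mean.

    The MLE of each pair mean is the pair average; the variance MLE is the
    average squared residual over all 2N observations.
    """
    total = 0.0
    for x1, x2 in pairs:
        mean_hat = (x1 + x2) / 2
        total += (x1 - mean_hat) ** 2 + (x2 - mean_hat) ** 2
    return total / (2 * len(pairs))

rng = random.Random(0)   # fixed seed so the run is repeatable
true_sigma = 2.0         # hypothetical common standard deviation
num_pairs = 20000        # hypothetical number of pairs N

pairs = []
for _ in range(num_pairs):
    mu = rng.uniform(-10, 10)  # a different unknown mean for every pair
    pairs.append((rng.gauss(mu, true_sigma), rng.gauss(mu, true_sigma)))

sigma2_hat = paired_variance_mle(pairs)
print(sigma2_hat, true_sigma ** 2)
```

Adding more pairs adds just as many new mean parameters, which is why collecting more data does not remove the discrepancy.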

## Examples

### Discrete distribution, finite parameter space

Consider tossing an unfair coin 80 times (i.e., we sample something like $x_1 = \mathrm{H}$, $x_2 = \mathrm{T}$, ..., $x_{80} = \mathrm{T}$, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined above) takes one of three values:

$$
\begin{aligned}
\Pr(\mathrm{H} = 49 \mid p = 1/3) &= \binom{80}{49} (1/3)^{49} (1 - 1/3)^{31} \approx 0.000, \\
\Pr(\mathrm{H} = 49 \mid p = 1/2) &= \binom{80}{49} (1/2)^{49} (1 - 1/2)^{31} \approx 0.012, \\
\Pr(\mathrm{H} = 49 \mid p = 2/3) &= \binom{80}{49} (2/3)^{49} (1 - 2/3)^{31} \approx 0.054.
\end{aligned}
$$

We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.
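The three-coin comparison is small enough to evaluate exactly. The sketch below uses rational arithmetic so that no rounding enters the comparison; the function and variable names are assumptions made for the example.

```python
from fractions import Fraction
from math import comb

def binomial_likelihood(p, heads, tosses):
    """Probability of observing `heads` heads in `tosses` tosses when P(heads) = p."""
    return comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)

candidates = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]
likelihoods = {p: binomial_likelihood(p, 49, 80) for p in candidates}

# The maximum likelihood estimate is the candidate with the largest likelihood.
best = max(likelihoods, key=likelihoods.get)

for p in candidates:
    print(p, float(likelihoods[p]))
print("MLE:", best)
```

With a finite parameter space, maximizing the likelihood reduces to evaluating it at each candidate and taking the largest value.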

### Discrete distribution, continuous parameter space

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function

$$\mathcal{L}(p) = \binom{80}{49} p^{49} (1-p)^{31}$$

over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting the derivative to zero:

$$\frac{d}{dp}\left[\binom{80}{49} p^{49} (1-p)^{31}\right] = \binom{80}{49}\, p^{48} (1-p)^{30} \left[49(1-p) - 31p\right] = 0,$$

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the ML estimator occurs at the mode, the peak (maximum) of the curve.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
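The general result t / n can be spot-checked by brute force. The sketch below evaluates the Bernoulli log-likelihood on a fine grid of p values and reports where it peaks; the grid resolution and the three (t, n) pairs are assumptions chosen so that t / n falls exactly on the grid, and the check requires 0 < t < n so that the logarithms stay finite.

```python
import math

def bernoulli_log_likelihood(p, t, n):
    """Log-likelihood of t successes in n Bernoulli trials, up to an additive constant."""
    return t * math.log(p) + (n - t) * math.log(1 - p)

def grid_argmax(t, n, steps=10000):
    """Return the grid point k/steps in (0, 1) with the largest log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: bernoulli_log_likelihood(p, t, n))

for t, n in [(49, 80), (3, 10), (7, 20)]:
    print(t, n, grid_argmax(t, n), t / n)
```

A grid search is crude compared with the calculus argument, but it makes no use of the derivative and so serves as an independent check.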

### Continuous distribution, continuous parameter space

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

$$f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n} (x_i-\mu)^2}{2\sigma^2}\right),$$

or more conveniently:

$$f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right),$$

where $\bar{x}$ is the sample mean.

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.] Differentiating the log-likelihood with respect to μ and equating to zero gives

$$0 = \frac{\partial}{\partial\mu}\left[\frac{n}{2}\log\frac{1}{2\pi\sigma^2} - \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right] = \frac{n(\bar{x}-\mu)}{\sigma^2},$$

which is solved by

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

$$E[\hat{\mu}] = \mu,$$

which means that the maximum-likelihood estimator $\hat{\mu}$ is unbiased.

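Similarly we differentiate the log likelihood with respect to σ and equate to zero:

$$0 = \frac{\partial}{\partial\sigma}\left[\frac{n}{2}\log\frac{1}{2\pi\sigma^2} - \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{\sigma^3},$$

which is solved by

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i-\hat{\mu})^2.$$

Inserting $\hat{\mu} = \bar{x}$ we obtain

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j.$$

To calculate the expectation value, note that $\hat{\sigma}^2$ is unchanged when the same constant is subtracted from every observation, so we may take μ = 0. Then $E[x_i x_j] = 0$ for i ≠ j and $E[x_i^2] = \sigma^2$, so the double sum gives a nonzero contribution only if i=j. We obtain

$$E[\hat{\sigma}^2] = \sigma^2 - \frac{1}{n^2} \cdot n\sigma^2 = \frac{n-1}{n}\sigma^2.$$

This means that the estimator $\hat{\sigma}^2$ is biased (it is, however, consistent).

Formally we say that the maximum likelihood estimator for θ = (μ,σ²) is:

$$\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2) = \left(\bar{x},\ \frac{1}{n}\sum_{i=1}^{n} (x_i-\bar{x})^2\right).$$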

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
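A short sketch of the normal-distribution estimators in code: the MLE of the mean is the sample mean, and the MLE of the variance divides the sum of squared deviations by n rather than n − 1. The sample values and the function name are hypothetical; `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n − 1) from the Python standard library are used only as cross-checks.

```python
import statistics

def normal_mle(sample):
    """Return (mu_hat, sigma2_hat), the maximum likelihood estimates for iid normal data."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

sample = [2.1, 3.4, 1.9, 4.0, 2.6]  # hypothetical observations
n = len(sample)
mu_hat, sigma2_hat = normal_mle(sample)

# Cross-checks: pvariance divides by n, variance divides by n - 1.
population_style = statistics.pvariance(sample)
unbiased_style = statistics.variance(sample)

print(mu_hat, sigma2_hat, population_style, unbiased_style)
```

The ratio between the two variance estimates is exactly (n − 1)/n, the same factor that appears in the expectation of the MLE of the variance.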

## Non-independent variables

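It may be the case that variables are correlated, in which case they are not independent. Two random variables X and Y are independent if and only if their joint probability density function is the product of the individual probability density functions, i.e.

$$f(x, y) = f(x)\, f(y).$$

Suppose one constructs an order-n Gaussian vector out of random variables $(x_1, \ldots, x_n)$, where each variable has mean given by $(\mu_1, \ldots, \mu_n)$. Furthermore, let the covariance matrix be denoted by Σ.

The joint probability density function of these n random variables is then given by:

$$f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sqrt{\det \Sigma}} \exp\left(-\frac{1}{2} \left[x_1-\mu_1, \ldots, x_n-\mu_n\right] \Sigma^{-1} \left[x_1-\mu_1, \ldots, x_n-\mu_n\right]^{T}\right).$$

In the two variable case, the joint probability density function is given by:

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right)\right],$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of the two variables and ρ is their correlation coefficient.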

In this and other cases where a joint density function exists, the likelihood function is defined as above, under Principles, using this density.
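As a sketch of the two-variable case, the code below implements the standard bivariate normal density with means mu_x and mu_y, standard deviations sigma_x and sigma_y, and correlation coefficient rho (these symbol names are assumptions for the example), and checks numerically that the joint density factorizes into the product of the marginals when rho = 0 but not when the variables are correlated. The evaluation point and parameter values are made up.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density with correlation coefficient rho, |rho| < 1."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho ** 2)
    quad = (zx ** 2 - 2 * rho * zx * zy + zy ** 2) / (1 - rho ** 2)
    return math.exp(-quad / 2) / norm

x, y = 1.0, 2.0  # hypothetical evaluation point
product = normal_pdf(x, 0.0, 1.0) * normal_pdf(y, 0.0, 1.0)

joint_uncorrelated = bivariate_normal_pdf(x, y, 0.0, 0.0, 1.0, 1.0, rho=0.0)
joint_correlated = bivariate_normal_pdf(x, y, 0.0, 0.0, 1.0, 1.0, rho=0.6)

print(product, joint_uncorrelated, joint_correlated)
```

With correlated data the likelihood has to be built from the joint density, since the product of marginals no longer describes the sample.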

## See also

• Abductive reasoning, a logical technique corresponding to maximum likelihood.
• Censoring (statistics)
• Delta method, a method for finding the distribution of functions of a maximum likelihood estimator.
• Generalized method of moments, a method related to maximum likelihood estimation.
• Inferential statistics, for an alternative to the maximum likelihood estimate.
• Likelihood function, a description of what likelihood functions are.
• Maximum a posteriori (MAP) estimator, for a contrast in the way to calculate estimators when prior knowledge is postulated.
• Mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
• Method of moments (statistics), for another popular method for finding parameters of distributions.
• Method of support, a variation of the maximum likelihood technique.
• Quasi-maximum likelihood estimator, an MLE that is misspecified but still consistent.
• The Rao–Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.
• Sufficient statistic, a function of the data through which the MLE (if it exists and is unique) will depend on the data.
