# Exponential family

In probability and statistics, an exponential family is any class of probability distributions having a certain form. The form is chosen for mathematical convenience, on account of the nice algebraic properties of these distributions, as well as for generality, as they are in a sense very natural distributions to consider. The exponential family first appeared in independent work by E. J. G. Pitman, G. Darmois and B. O. Koopman in 1935–36.

There are both discrete and continuous exponential families that are useful and important in theoretical and practical work. We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable and H(x) approaches 0 as x approaches −∞. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

$dF(x\mid\eta) = e^{-\eta^{\top} T(x) - A(\eta)}\, dH(x).$

If F is a continuous distribution with a density, one can write dF(x) = f(x) dx. The meanings of the different symbols on the right-hand side are as follows:

• H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then so is H (with the same support).
• η is the natural parameter, a column vector, so that η⊤ = (η1, ..., ηn), its transpose, is a row vector. The parameter space (that is, the set of values of η for which this function is integrable) is necessarily convex.
• T(x) is a sufficient statistic of the distribution, and it is a column vector whose number of scalar components is the same as that of η, so that η⊤T(x) is a scalar. (Note that the concept of a sufficient statistic applies more broadly than just to members of the exponential family.)
• A(η) is a normalization factor without which F would not be a probability distribution. The function A is important in its own right: in cases in which the reference measure dH(x) is a probability measure, A is the cumulant-generating function of the probability distribution of the sufficient statistic T(X) when the distribution of X is dH(x).


The normal, gamma, chi-square, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial, geometric, and Weibull distributions are all exponential families. The Cauchy and uniform distributions are not exponential families.

### Example: the binomial distribution

As a first example, consider the binomial distribution with a known number of trials n and success probability p. Its probability mass function is

$f(x)={n \choose x}p^x (1-p)^{n-x}$

for x ∈ {0, 1, 2, ..., n}. Let F be the cumulative distribution function. Then
$dF(x) = p^x (1-p)^{n-x}\,dH(x)=\exp\left(x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right)\,dH(x),$
where the reference measure dH assigns mass ${n \choose x}$ to each point x, so the natural parameter η (which plays the role of a Lagrange multiplier in the maximum entropy formulation) for this family of distributions is
$\eta = \log\frac{p}{1-p}.$
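From this form the log-normalizer is $A(\eta) = -n\log(1-p) = n\log(1 + e^{\eta})$. As a quick sanity check (a numerical sketch, not part of the original article), its first two derivatives in η recover the binomial mean np and variance np(1 − p), anticipating the differential identities of the next section:

```python
import math

def A(eta, n):
    # Log-normalizer of the binomial family in its natural parameter:
    # eta = log(p / (1 - p)), so A(eta) = n * log(1 + e^eta).
    return n * math.log(1.0 + math.exp(eta))

n, p = 10, 0.3
eta = math.log(p / (1.0 - p))
h = 1e-5

# Central finite differences for the first two derivatives of A.
mean = (A(eta + h, n) - A(eta - h, n)) / (2.0 * h)               # should be n*p = 3.0
var = (A(eta + h, n) - 2.0 * A(eta, n) + A(eta - h, n)) / h**2   # should be n*p*(1-p) = 2.1

print(mean, var)
```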


### Differential identities: an example

As mentioned above, $K(u) = A(u + \eta) - A(\eta)$ is the cumulant generating function for $T$. A consequence of this is that one can fully understand the mean and covariance structure of $T = (T_1, T_2, \dots, T_p)$ by differentiating $A(\eta)$.

$E(T_j) = \frac{\partial A(\eta)}{\partial \eta_j}$

and

$\mathrm{cov}(T_i,T_j) = \frac{\partial^2 A(\eta)}{\partial \eta_i\, \partial \eta_j}.$

The first two raw moments and all mixed moments can be recovered from these two identities. This is often useful when $T$ is a complicated function of the data whose moments are difficult to calculate by integration. As an example, consider a real-valued random variable $X$ with density

$p_\theta(x) = \frac{\theta e^{-x}}{(1 + e^{-x})^{\theta + 1}}$

indexed by the shape parameter $\theta \in (0,\infty)$ (this distribution is called the skew-logistic distribution). The density can be rewritten as

$\frac{e^{-x}}{1 + e^{-x}}\, \exp\left(-\theta \log(1 + e^{-x}) + \log(\theta)\right).$

Notice this is an exponential family with canonical parameter

$\eta = -\theta,$

sufficient statistic

$T = \log(1 + e^{-x}),$

and normalizing factor

$A(\eta) = -\log(\theta) = -\log(-\eta).$

So using the first identity,

$E(\log(1 + e^{-X})) = E(T) = \frac{\partial A(\eta)}{\partial \eta} = \frac{\partial}{\partial \eta}\left[-\log(-\eta)\right] = \frac{1}{-\eta} = \frac{1}{\theta},$

and using the second identity

$\mathrm{var}(\log(1 + e^{-X})) = \frac{\partial^2 A(\eta)}{\partial \eta^2} = \frac{\partial}{\partial \eta}\left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.$

This example illustrates a case where using this method is very simple, while a brute-force calculation of these moments by direct integration would be much more difficult.
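The identities can also be checked by simulation. Integrating the density gives the cdf $F(x) = (1 + e^{-x})^{-\theta}$, so samples can be drawn by inversion; the following Monte Carlo sketch (with an arbitrary choice of θ; not from the original article) compares the sample mean and variance of $T = \log(1 + e^{-X})$ with $1/\theta$ and $1/\theta^2$:

```python
import math
import random

def sample_skew_logistic(theta, n, rng):
    # Inverse-CDF sampling: F(x) = (1 + e^{-x})^{-theta} implies
    # x = -log(u^{-1/theta} - 1) for u ~ Uniform(0, 1).
    return [-math.log(rng.random() ** (-1.0 / theta) - 1.0) for _ in range(n)]

theta = 2.5
xs = sample_skew_logistic(theta, 200_000, random.Random(0))

# Sufficient statistic T(x) = log(1 + e^{-x}).
ts = [math.log(1.0 + math.exp(-x)) for x in xs]
mean_t = sum(ts) / len(ts)
var_t = sum((t - mean_t) ** 2 for t in ts) / len(ts)

print(mean_t, 1.0 / theta)       # both close to 0.4
print(var_t, 1.0 / theta ** 2)   # both close to 0.16
```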

## Maximum entropy derivation

The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?

The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). As an aside, this choice is largely arbitrary from a frequentist point of view, while Bayesians can simply make it part of their prior probability distribution.

The entropy of dF(x) relative to dH(x) is

$S[dF\mid dH]=-\int \frac{dF}{dH}\,\ln\frac{dF}{dH}\,dH$

or

$S[dF\mid dH]=\int \ln\frac{dH}{dF}\,dF$

where dF/dH and dH/dF are Radon–Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely

$S=-\sum_{i\in I} p_i \ln p_i$

assumes (though this is seldom pointed out) that dH is chosen to be the counting measure on I.

Consider now a collection of observable quantities (random variables) Ti. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is a member of the exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.

The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting T0 = 1 be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated with T0.
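The variational step can be sketched as follows (writing f = dF/dH and using the minus-sign convention of the sampling-distributions section below). Maximize the entropy subject to the constraints by forming the Lagrangian

$\mathcal{L}[f] = -\int f \ln f \, dH - \sum_{\alpha=0}^n \eta^\alpha \left( \int T_\alpha f \, dH - t_\alpha \right).$

Setting the functional derivative with respect to f to zero gives $-\ln f - 1 - \sum_\alpha \eta^\alpha T_\alpha = 0$, so

$f(x) = \exp\left(-1 - \sum_{\alpha=0}^n \eta^\alpha T_\alpha(x)\right),$

and the constant $e^{-1}$ is absorbed into the multiplier $\eta^0$ of the normalization constraint $T_0 = 1$, leaving $dF = e^{-\eta^\alpha T_\alpha}\, dH$.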

## Role in statistics

### Classical estimation: sufficiency

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose Xn, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases.

### Bayesian estimation: conjugate distributions

Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior π for the parameter η of an exponential family is given by

$\pi(\eta) \propto \exp(-\eta^{\top} \alpha - \beta\, A(\eta)),$

where $\alpha \in \mathbb{R}^n$ and β > 0 are hyperparameters (parameters controlling parameters).

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the success parameter of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution.
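In the beta-binomial case the conjugate update reduces to adding counts. A minimal sketch (the function name is illustrative, not standard):

```python
def beta_binomial_update(a, b, k, n):
    # Beta(a, b) prior combined with a binomial likelihood that produced
    # k successes in n trials yields a Beta(a + k, b + n - k) posterior.
    return a + k, b + (n - k)

# Prior Beta(2, 2); observe 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(2.0, 2.0, 7, 10)
print(a_post, b_post)               # 9.0 5.0
print(a_post / (a_post + b_post))   # posterior mean of p, 9/14
```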

An arbitrary likelihood will not belong to the exponential family, and thus in general no conjugate prior exists. The posterior will then have to be computed by numerical methods.

## Statistical inference

### Sampling distributions

As discussed above, the sufficient statistic (T1, ..., Tn) plays a pivotal role in statistical inference, whether classical or Bayesian. Accordingly, it is interesting to study its sampling distribution. That is, if X1, ..., Xm is a random sample (a collection of independent, identically distributed random variables) drawn from a distribution in the exponential family, we want to know the probability distribution of the statistic

$\widehat{t}_i=\frac{1}{m}\sum_{j=1}^m T_i(X_j).$

Letting $T_0 = 1$, we can write

$dF(\eta)=e^{-\eta^\alpha T_\alpha}\,dH$

using Einstein's summation convention, namely

$\eta^\alpha T_\alpha=\eta^0 T_0+\eta^i T_i=\eta^0 T_0+\eta^1 T_1+\cdots+\eta^n T_n.$

Then,

$Z[\eta]=\int dF=e^{-\eta^0+A(\eta)}$

is what physicists call the partition function in statistical mechanics. The condition that dF be normalized implies that η0 = A(η), as anticipated in the above section on information entropy.

Next, it is straightforward to check that

$\frac{\partial}{\partial\eta^i}\ln Z(\eta)=\frac{\partial}{\partial\eta^i}A(\eta)=E[T_i\mid\eta],$

denoted $t_i$, and

$\frac{\partial^2}{\partial\eta^i\,\partial\eta^j}\ln Z(\eta)=\frac{\partial^2}{\partial\eta^i\,\partial\eta^j}A(\eta)=\mathrm{Cov}[T_i,T_j\mid\eta]$

denoted $t_{ij}$. As the same information can be obtained from either Z or A, it is not necessary to normalize the probability distribution dF by setting η0 = A before taking the derivatives. Also, the function A(η) is the cumulant-generating function of the distribution of T not just for dF or dH but for the entire exponential subfamily with the given dH and T.

The equations

$E[T_i(X)\mid\eta]=t_i$

can usually be solved to find η as a function of ti, which means that either set of parameters can be used to completely specify a member of the specific subfamily under consideration. In that case, the covariances tij can also be expressed in terms of the ti, which is useful for estimation purposes as we shall see below.
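In the binomial example above, for instance, $E[T\mid\eta] = n/(1+e^{-\eta})$ with $T(x) = x$ is strictly increasing in η, so the moment equation can be inverted numerically by bisection. A sketch (function names are ours, not standard):

```python
import math

def mean_T(eta, n):
    # E[T | eta] for the binomial family with T(x) = x and
    # eta = log(p / (1 - p)): the mean n*p equals n / (1 + e^{-eta}).
    return n / (1.0 + math.exp(-eta))

def solve_eta(t, n, lo=-30.0, hi=30.0):
    # Bisection works because mean_T is strictly increasing in eta.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_T(mid, n) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n, t = 10, 3.0
eta = solve_eta(t, n)
print(eta, math.log(0.3 / 0.7))  # both close to -0.847
```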

We are now ready to consider the random samples mentioned earlier. It follows that

$E[\widehat{t}_i]=t_i,$

that is, the statistic $\widehat{t}_i$ is an unbiased estimator of $t_i$. Moreover, since the elements of a random sample are assumed to be mutually independent,

$\mathrm{Cov}[\widehat{t}_i,\widehat{t}_j]=\frac{1}{m^2}\sum_{k,l=1}^m \mathrm{Cov}[T_i(X_k),T_j(X_l)]=\frac{1}{m}t_{ij}.$

Because the covariance vanishes in the limit of large samples, the estimators $\widehat{t}_i$ are said to be consistent.
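The 1/m decay of the covariance can be seen in a small simulation with Bernoulli(p) data, where T(x) = x, t = p and $t_{11} = p(1-p)$ (an illustrative sketch; names are ours):

```python
import random

def var_of_t_hat(p, m, reps, rng):
    # Empirical variance of the sample mean of m Bernoulli(p) draws,
    # estimated over many repetitions; theory predicts p*(1-p)/m.
    estimates = []
    for _ in range(reps):
        estimates.append(sum(1 if rng.random() < p else 0 for _ in range(m)) / m)
    mu = sum(estimates) / reps
    return sum((e - mu) ** 2 for e in estimates) / reps

p = 0.3
rng = random.Random(1)
for m in (10, 100):
    # Empirical variance of t_hat versus the theoretical t_11 / m.
    print(m, var_of_t_hat(p, m, 20_000, rng), p * (1 - p) / m)
```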

More generally, the kth cumulant of the distribution of $\widehat{t}_i$ can be seen to decay with the (k − 1)th power of the sample size, so the distribution of these statistics is asymptotically a multivariate normal distribution. To use asymptotic normality (as one would in the construction of confidence intervals) one needs an estimate of the covariances. Therefore we also need to look at the sampling distribution of

$\widehat{t}_{ij}=\frac{1}{m-1}\sum_{k=1}^m \left(T_i(X_k)-\widehat{t}_i\right)\left(T_j(X_k)-\widehat{t}_j\right).$

This is easily seen to be an unbiased estimator of $t_{ij}$, but consistency and asymptotic chi-squared behaviour are rather more involved, and depend on the third and fourth cumulants of dF.

