In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is at least as large as that of every other member of a specified class of distributions.
If nothing is known about a distribution except that it belongs to a certain class, then the maximum entropy distribution for that class is often assumed "by default", according to the principle of maximum entropy. The reason is twofold: first, maximizing entropy, in a sense, means minimizing the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.
Definition of entropy
If X is a discrete random variable taking the values x_{1}, x_{2}, ... with probabilities p_{1}, p_{2}, ..., then the entropy of X is defined as

    H(X) = ∑_{k} p_{k} log(1/p_{k}).

If X is a continuous random variable with probability density p(x), then the entropy of X is defined as

    H(X) = ∫ p(x) log(1/p(x)) dx,

where p(x) log(1/p(x)) is understood to be zero whenever p(x) = 0. The base of the logarithm is not important as long as the same one is used consistently: a change of base merely rescales the entropy. Information theorists often prefer base 2, which expresses the entropy in bits; mathematicians and physicists often prefer the natural logarithm, which expresses the entropy in nats.
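As a concrete illustration of the discrete definition (a minimal sketch; the function name is mine), the entropy of a probability vector can be computed directly, using the convention that zero-probability terms contribute nothing:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution in the given base.
    Terms with p = 0 contribute zero, matching the convention above."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))               # ≈ 0.469 bits: a biased coin has less entropy
print(entropy([0.5, 0.5], base=math.e))  # the same fair coin in nats: ln 2 ≈ 0.693
```

Changing the `base` argument only rescales the result, as the paragraph above notes.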
Examples of maximum entropy distributions

Given mean and standard deviation: the normal distribution

The most important maximum entropy distribution is the normal distribution N(μ, σ^{2}). It has maximum entropy among all distributions on the real line with specified mean μ and standard deviation σ. Therefore, if all you know about a distribution is its mean and standard deviation, it is often reasonable to assume that the distribution is normal.
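This can be checked numerically (a sketch under my own naming; the integrator is a plain midpoint rule): among densities with mean 0 and standard deviation 1, the normal density has higher differential entropy than, say, the Laplace density with the same moments.

```python
import math

def diff_entropy(pdf, lo, hi, n=200_000):
    """Differential entropy -∫ p(x) ln p(x) dx via a midpoint Riemann sum (in nats)."""
    dx = (hi - lo) / n
    h = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:
            h -= p * math.log(p) * dx
    return h

sigma = 1.0
normal = lambda x: math.exp(-x**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
laplace = lambda x: math.exp(-abs(x) * math.sqrt(2) / sigma) / (sigma * math.sqrt(2))

# Both densities have mean 0 and standard deviation 1; the normal's entropy is larger.
print(diff_entropy(normal, -12, 12))   # ≈ 0.5 * ln(2*pi*e) ≈ 1.4189
print(diff_entropy(laplace, -12, 12))  # ≈ 1 + ln(sqrt(2)) ≈ 1.3466
```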
Uniform and piecewise uniform distributions

The uniform distribution on the interval [a, b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).
More generally, if we're given a subdivision a = a_{0} < a_{1} < ... < a_{k} = b of the interval [a, b] and probabilities p_{1}, ..., p_{k} which add up to one, then we can consider the class of all continuous distributions such that

    P(a_{j−1} ≤ X < a_{j}) = p_{j}   for j = 1, ..., k.
The density of the maximum entropy distribution for this class is constant on each of the intervals [a_{j−1}, a_{j}); it looks somewhat like a histogram.
The uniform distribution on the finite set {x_{1},...,x_{n}} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
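The discrete case is easy to test empirically (a quick sketch; names are mine): every randomly generated distribution on the same six points has entropy at most that of the uniform distribution, log 6.

```python
import math, random

def entropy(probs):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 6
h_max = entropy([1.0 / n] * n)  # entropy of the uniform distribution: log(6)

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    probs = [x / s for x in w]  # an arbitrary distribution on the same 6 points
    assert entropy(probs) <= h_max + 1e-12

print("uniform entropy:", h_max)  # log(6) ≈ 1.7918, never exceeded above
```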
Positive and given mean: the exponential distribution

The exponential distribution with mean 1/λ is the maximum entropy distribution among all continuous distributions supported in [0, ∞) that have a mean of 1/λ.
In physics, this occurs when gravity acts on a gas that is kept at constant pressure and temperature: if X describes the height of a molecule, then X is exponentially distributed (which also means that the density of the gas decays exponentially with height). The reason: X is clearly positive and its mean, which corresponds to the average potential energy, is fixed. Over time, the system will attain its maximum entropy configuration, according to the second law of thermodynamics.
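A numerical spot check (a sketch under my own naming, using a plain midpoint-rule integrator): the exponential density with mean 1 has higher differential entropy than a half-normal density rescaled to the same mean and support.

```python
import math

def diff_entropy(pdf, lo, hi, n=200_000):
    """Differential entropy -∫ p(x) ln p(x) dx via a midpoint Riemann sum (in nats)."""
    dx = (hi - lo) / n
    h = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:
            h -= p * math.log(p) * dx
    return h

mean = 1.0
exponential = lambda x: math.exp(-x / mean) / mean   # supported on [0, ∞), mean 1
sigma = mean * math.sqrt(math.pi / 2)                # half-normal rescaled to mean 1
half_normal = lambda x: math.sqrt(2 / (math.pi * sigma**2)) * math.exp(-x**2 / (2 * sigma**2))

h_exp = diff_entropy(exponential, 0, 40)
h_half = diff_entropy(half_normal, 0, 40)
print(h_exp)   # ≈ 1.0, i.e. 1 + ln(mean): the maximum for this class
print(h_half)  # ≈ 0.95, strictly smaller
```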
Discrete distributions with given mean

Among all the discrete distributions supported on the set {x_{1}, ..., x_{n}} with mean μ, the maximum entropy distribution has the following shape:

    P(X = x_{k}) = C r^{x_{k}}   for k = 1, ..., n,
where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

As an example, consider the following scenario: a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x_{1}, ..., x_{6}} = {1, ..., 6} and μ = S/N.

Finally, among all the discrete distributions supported on the infinite set {x_{1}, x_{2}, ...} with mean μ, the maximum entropy distribution has the shape:

    P(X = x_{k}) = C r^{x_{k}}   for k = 1, 2, ...,
where again the constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
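The dice example can be solved numerically (a sketch; the function name and the bisection bounds are my own choices). Since the mean of the distribution p_{k} ∝ r^{k} is increasing in r, a bisection on r pins down both constants:

```python
import math

def maxent_dice(mu, lo=1e-6, hi=1e6, iters=200):
    """Max-entropy distribution on {1,...,6} with mean mu, of shape p_k = C * r**k.
    The mean is increasing in r, so r is found by bisection; C comes from normalizing."""
    values = range(1, 7)

    def mean(r):
        weights = [r**k for k in values]
        return sum(k * w for k, w in zip(values, weights)) / sum(weights)

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale, since r > 0
        if mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    r = math.sqrt(lo * hi)
    weights = [r**k for k in values]
    total = sum(weights)
    return [w / total for w in weights]

# Dice whose average shown number S/N is 4.5: probabilities rise geometrically toward 6.
probs = maxent_dice(4.5)
print([round(p, 4) for p in probs])
```

For μ = 3.5 the procedure returns r = 1, i.e. the plain uniform distribution on the six faces, as one would expect.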
A theorem by Boltzmann

All the above examples are consequences of the following theorem by Ludwig Boltzmann.
Continuous version

Suppose S is a closed subset of the real numbers R and we're given n measurable functions f_{1}, ..., f_{n} and n numbers a_{1}, ..., a_{n}. We consider the class of all continuous random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n expected value conditions

    E(f_{j}(X)) = a_{j}   for j = 1, ..., n.
The maximum entropy distribution for this class (if it exists) has a probability density of the following shape:

    p(x) = C exp(λ_{1} f_{1}(x) + ... + λ_{n} f_{n}(x))   for all x in S,
where the constants C and λ_{j} have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied. Conversely, if constants C and λ_{j} like this can be found, then p(x) is indeed the density of the (unique) maximum entropy distribution for our class. This theorem is proved with the calculus of variations and Lagrange multipliers.
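As a worked instance of the theorem, taking S = R with f_{1}(x) = x and f_{2}(x) = x^{2}, and prescribing E(X) = μ and E(X^{2}) = μ^{2} + σ^{2}, recovers the normal distribution from the first example:

```latex
% The theorem gives a density of the shape
p(x) = C \exp\left(\lambda_1 x + \lambda_2 x^2\right).
% Completing the square (with \lambda_2 < 0 so that p is integrable) and
% matching the two moment constraints forces
\lambda_2 = -\frac{1}{2\sigma^2}, \qquad
\lambda_1 = \frac{\mu}{\sigma^2}, \qquad
C = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\mu^2/(2\sigma^2)},
% which is exactly the normal density:
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
```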
Discrete version

Suppose S = {x_{1}, x_{2}, ...} is a (finite or infinite) discrete subset of the reals and we're given n functions f_{1}, ..., f_{n} and n numbers a_{1}, ..., a_{n}. We consider the class of all discrete random variables X which are supported on S and which satisfy the n conditions

    E(f_{j}(X)) = a_{j}   for j = 1, ..., n.

The maximum entropy distribution for this class (if it exists) has a distribution of the following shape:

    P(X = x_{k}) = C exp(λ_{1} f_{1}(x_{k}) + ... + λ_{n} f_{n}(x_{k}))   for k = 1, 2, ...,
where the constants C and λ_{j} have to be determined so that the sum of the probabilities is 1 and the above conditions for the expected values are satisfied. Conversely, if constants C and λ_{j} like this can be found, then the above distribution is indeed the maximum entropy distribution for our class. This version of the theorem can be proved with the tools of ordinary calculus and Lagrange multipliers.
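For the infinite-support case with the single constraint f_{1}(x) = x, the constants can be found in closed form (a small check; the function name and truncation point are mine): on S = {1, 2, ...} with mean μ > 1, summing the geometric series gives r = 1 − 1/μ and C = (1 − r)/r, a geometric-type distribution.

```python
import math

def maxent_positive_integers(mu, kmax=10_000):
    """Max-entropy distribution on {1, 2, ...} with mean mu > 1, truncated at kmax.
    Shape P(X=k) = C * r**k with r = 1 - 1/mu and C = (1 - r) / r."""
    r = 1 - 1 / mu
    C = (1 - r) / r
    return [C * r**k for k in range(1, kmax + 1)]

probs = maxent_positive_integers(3.0)
print(sum(probs))                                  # ≈ 1: the probabilities normalize
print(sum(k * p for k, p in enumerate(probs, 1)))  # ≈ 3.0: the prescribed mean
```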
Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0), or that the entropies are bounded above but no distribution attains the maximal entropy (e.g. the class of all continuous distributions X on R with E(X) = 0 and E(X^{2}) = E(X^{3}) = 1).
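The first caveat is easy to see concretely (a sketch using the closed-form entropy of the normal, 0.5 ln(2πeσ^{2})): every N(0, σ^{2}) has mean 0, yet its entropy grows without bound as σ grows, so the class of continuous mean-zero distributions on R has no maximum entropy member.

```python
import math

def normal_entropy(sigma):
    """Differential entropy of N(0, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# All of these distributions have mean 0, but the entropy is unbounded in sigma.
for sigma in (1, 10, 100, 1000):
    print(sigma, round(normal_entropy(sigma), 4))
```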
Sources

T. M. Cover and J. A. Thomas, Elements of Information Theory, 1991. Chapter 11.
