
# Regression analysis

Uses of regression include curve fitting, prediction (including forecasting of time-series data), modeling of causal relationships, and testing scientific hypotheses about relationships between variables.

The term "regression" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton, a cousin of Charles Darwin, studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it. For Galton, regression had only this biological meaning, but his work[1] was later extended by Udny Yule and Karl Pearson to a more general statistical context.[2]

## Simple linear regression

Illustration of linear regression on a data set (red points).

The general form of a simple linear regression is

$y_i = \alpha + \beta x_i + \varepsilon_i$

where α is the intercept, β is the slope, and $\varepsilon_i$ is the error term, which picks up the unpredictable part of the response variable $y_i$. The error term is usually posited to be normally distributed. The x's and y's are the data quantities from the sample or population in question, and α and β are the unknown parameters ("constants") to be estimated from the data. Estimates for α and β can be derived by the method of ordinary least squares, so called because the estimates minimize the sum of squared errors for the given data set. The estimates are often denoted $\widehat{\alpha}$ and $\widehat{\beta}$ or their corresponding Roman letters. It can be shown (see Draper and Smith, 1998, for details) that the least-squares estimates are given by

$\hat{\beta}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$

and

$\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}$

where $\bar{x}$ is the mean (average) of the x values and $\bar{y}$ is the mean of the y values.
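As a quick illustration of these formulas, here is a minimal sketch in plain Python; the small data set is invented purely for the example:

```python
# Least-squares estimates of the intercept (alpha) and slope (beta)
# for simple linear regression, computed from the closed-form formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta-hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
# alpha-hat = y_bar - beta-hat * x_bar
alpha_hat = y_bar - beta_hat * x_bar
```

For this data set the estimates come out to roughly $\hat{\alpha}=0.09$ and $\hat{\beta}=1.99$, i.e. the fitted line is close to $y = 2x$.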

## Generalizing simple linear regression

The simple model above can be generalized in different ways.

• The number of predictors can be increased from one to several (see the main article on linear regression).
• The relationship between the knowns (the xs and ys) and the unknowns (α and the βs) can be nonlinear (see the main article on non-linear regression).
• The response variable may be non-continuous. For binary (zero or one) variables, there are the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is positive with low values and represents the repetition of the occurrence of an event, count models such as Poisson regression or the negative binomial model may be used.
• The form of the right hand side can be determined from the data. See Nonparametric regression. These approaches require a large number of observations, as the data are used to build the model structure as well as estimate the model parameters. They are usually computationally intensive.


## Regression diagnostics

Once a regression model has been constructed it is important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include R-squared, analyses of the pattern of residuals and construction of an ANOVA table. Statistical significance is checked by an F-test of the overall fit, followed by t-tests of individual parameters.
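A minimal sketch of two of these checks, residuals and R-squared, for a simple linear fit; the data are invented for illustration:

```python
# Fit a simple linear regression, then compute residuals and R-squared.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
       / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar

fitted = [alpha + beta * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# R-squared: the proportion of the variance in y explained by the fit.
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
```

A pattern in the residuals (e.g. curvature, or variance growing with x) would indicate that the model's assumptions are violated even when R-squared is high.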

## Estimation of model parameters

The parameters of a regression model can be estimated in many ways. The most common are the method of least squares, the method of maximum likelihood, and Bayesian methods.

For a model with normally distributed errors the method of least squares and the method of maximum likelihood coincide (see Gauss-Markov theorem).
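This equivalence can be illustrated numerically: with the error variance held fixed, the Gaussian log-likelihood is a strictly decreasing function of the sum of squared residuals, so both criteria select the same parameters. A sketch over a coarse parameter grid (data and grid are invented for illustration):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

def sse(a, b):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def loglik(a, b, sigma=1.0):
    """Gaussian log-likelihood with fixed error standard deviation sigma."""
    n = len(xs)
    return -n / 2 * math.log(2 * math.pi * sigma ** 2) - sse(a, b) / (2 * sigma ** 2)

# Coarse grid of candidate (intercept, slope) pairs.
grid = [(a / 10, b / 10) for a in range(-10, 11) for b in range(0, 21)]
best_ls = min(grid, key=lambda p: sse(*p))       # least squares
best_ml = max(grid, key=lambda p: loglik(*p))    # maximum likelihood
```

Because `loglik` is a monotone decreasing transform of `sse`, `best_ls` and `best_ml` are the same point of the grid.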

## Interpolation and extrapolation

Regression models predict a value of the y variable given known values of the x variables. If the prediction is to be done within the range of x values used to construct the model, this is known as interpolation. Prediction outside that range is known as extrapolation, and it is riskier.
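A minimal sketch of the distinction; the fitted line, its coefficients, and the data range are all hypothetical, invented for illustration:

```python
# Distinguish interpolation from extrapolation for a fitted model.
x_min, x_max = 58.0, 72.0  # range of x values used to fit the model

def predict(x, alpha=0.09, beta=1.99):
    """Hypothetical fitted line y = alpha + beta*x (coefficients made up)."""
    return alpha + beta * x

def is_extrapolation(x):
    """Predictions outside the fitted x range are extrapolation (riskier)."""
    return not (x_min <= x <= x_max)
```

A sensible workflow flags extrapolated predictions rather than silently trusting them, since the model has no data support outside the fitted range.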

## Assumptions underpinning regression

Regression analysis depends on certain assumptions:

1. The predictors must be linearly independent, i.e., it must not be possible to express any predictor as a linear combination of the others (see multicollinearity).

2. The error terms must be normally distributed and independent.

3. The variance of the error terms must be constant.

## Examples

To illustrate the various goals of regression, we will give three examples.

### Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

| Height (in) | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight (lb) | 115 | 117 | 120 | 123 | 126 | 129 | 132 | 135 | 139 | 142 | 146 | 150 | 154 | 159 | 164 |

We would like to see how the weight of these women depends on their height. We are therefore looking for a function η such that $Y=\eta(X)+\varepsilon$, where Y is the weight of the women and X their height. Intuitively, we can guess that if the women's proportions and density are constant, then their weight must depend on the cube of their height.

A plot of the data set confirms this supposition.

$\vec{X}$ will denote the vector containing all the measured heights ($\vec{X}=(58,59,60,\dots)$) and $\vec{Y}=(115,117,120,\dots)$ the vector containing all measured weights. We can suppose the heights of the women are independent of each other and have constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator; i.e., we are looking for coefficients β0, β1 and β2 satisfying as well as possible (in the least-squares sense) the equation:

$\vec{Y}=\beta_0 + \beta_1 \vec{X} + \beta_2 \vec{X}^3+\vec{\varepsilon}$

Geometrically, what we will be doing is an orthogonal projection of Y onto the subspace generated by the variables 1, X and $X^3$. The matrix X is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the X in the model), and a third column with these values cubed ($X^3$). The realization of this matrix (i.e., for the data at hand) can be written:

| 1 | x | x³ |
|---|---|---|
| 1 | 58 | 195112 |
| 1 | 59 | 205379 |
| 1 | 60 | 216000 |
| 1 | 61 | 226981 |
| 1 | 62 | 238328 |
| 1 | 63 | 250047 |
| 1 | 64 | 262144 |
| 1 | 65 | 274625 |
| 1 | 66 | 287496 |
| 1 | 67 | 300763 |
| 1 | 68 | 314432 |
| 1 | 69 | 328509 |
| 1 | 70 | 343000 |
| 1 | 71 | 357911 |
| 1 | 72 | 373248 |

The matrix $(\mathbf{X}^t \mathbf{X})^{-1}$ (sometimes called the "information matrix" or "dispersion matrix") is:

$\left[\begin{matrix} 1.9\cdot10^3 & -45 & 3.5\cdot 10^{-3} \\ -45 & 1.0 & -8.1\cdot 10^{-5} \\ 3.5\cdot 10^{-3} & -8.1\cdot 10^{-5} & 6.4\cdot 10^{-9} \end{matrix}\right]$

Vector $\widehat{\beta}_{LS}$ is therefore:

$\widehat{\beta}_{LS}=(X^tX)^{-1}X^{t}y= (147,\ -2.0,\ 4.3\cdot 10^{-4})$

hence $\eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3$

A plot of this function shows that it lies quite close to the data set.

The confidence intervals are computed using:

$\left[\widehat{\beta_j}-\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}};\ \widehat{\beta_j}+\widehat{\sigma}\sqrt{s_j}\,t_{n-p;1-\frac{\alpha}{2}}\right]$

with:

$\widehat{\sigma}=0.52$
$s_1=1.9\cdot 10^3,\ s_2=1.0,\ s_3=6.4\cdot 10^{-9}$
$\alpha=5\%$
$t_{n-p;1-\frac{\alpha}{2}}=2.2$

Therefore, we can say that the 95% confidence intervals are:

$\beta_0\in[112,\ 181]$
$\beta_1\in[-2.8,\ -1.2]$
$\beta_2\in[3.6\cdot 10^{-4},\ 4.9\cdot 10^{-4}]$


## Notes

1. ^ Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. (Galton uses the term "reversion" in this paper, which discusses the size of peas.); Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this paper, which discusses the height of humans.)
2. ^ G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54. Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", Biometrika (1903). In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925 (R.A. Fisher, "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 from 1922 and Statistical Methods for Research Workers from 1925). Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.


## References

• Audi, R., Ed. (1996). "curve fitting problem," The Cambridge Dictionary of Philosophy. Cambridge, Cambridge University Press. pp.172-173.
• William H. Kruskal and Judith M. Tanur, ed. (1978), "Linear Hypotheses," International Encyclopedia of Statistics. Free Press, v. 1,
Evan J. Williams, "I. Regression," pp. 523-41.
Julian C. Stanley, "II. Analysis of Variance," pp. 541-554.
• Lindley, D.V. (1987). "Regression and correlation analysis," New Palgrave: A Dictionary of Economics, v. 4, pp. 120-23.
• Birkes, David and Yadolah Dodge, Alternative Methods of Regression. ISBN 0-471-56881-3
• Chatfield, C. (1993) "Calculating Interval Forecasts," Journal of Business and Economic Statistics, 11. pp. 121-135.
• Draper, N.R. and Smith, H. (1998). Applied Regression Analysis. Wiley Series in Probability and Statistics.
• Fox, J. (1997). Applied Regression Analysis, Linear Models and Related Methods. Sage
• Hardle, W., Applied Nonparametric Regression (1990), ISBN 0-521-42950-1
• Meade, N. and T. Islam (1995) "Prediction Intervals for Growth Curve Forecasts," Journal of Forecasting, 14, pp. 413-430.
• Gujarati, Basic Econometrics, 4th edition
• Sykes, A.O. "An Introduction to Regression Analysis" (Inaugural Coase Lecture)
• S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Local Additive Regression of Decision Stumps, Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3955, SETN 2006, pp. 148 – 157, 2006
• S. Kotsiantis, P. Pintelas, Selective Averaging of Regression Models, Annals of Mathematics, Computing & TeleInformatics, Vol 1, No 3, 2005, pp. 66-75


## Software

Software packages that can perform regression analysis include dedicated statistical systems such as SAS, SPSS, Minitab and Stata; spreadsheet programs such as Microsoft Excel and OpenOffice.org; and programming environments such as Mathematica, R and MATLAB.
