# OCW051: Regression

## Predictions and Probabilistic Models

Regression models are often used to predict a response variable [latex]\text{y}[/latex] from an explanatory variable [latex]\text{x}[/latex].

### Learning Objectives

Explain how to estimate the relationship among variables using regression analysis

### Key Takeaways

#### Key Points

• Regression models predict a value of the [latex]\text{Y}[/latex] variable, given known values of the [latex]\text{X}[/latex] variables. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation.
• Prediction outside this range of the data is known as extrapolation. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.
• There are certain necessary conditions for regression inference: observations must be independent, the mean response has a straight-line relationship with [latex]\text{x}[/latex], the standard deviation of [latex]\text{y}[/latex] is the same for all values of [latex]\text{x}[/latex], and the response [latex]\text{y}[/latex] varies according to a normal distribution.

#### Key Terms

• interpolation: the process of estimating the value of a function at a point from its values at nearby points
• extrapolation: a calculation of an estimate of the value of some function outside the range of known values

### Regression Analysis

In statistics, regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables, called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable: correlation does not imply causation.

### Making Predictions Using Regression Inference

Regression models predict a value of the [latex]\text{Y}[/latex] variable, given known values of the [latex]\text{X}[/latex] variables. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation. Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.

It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.
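The widening of the interval can be made concrete for the simplest case. Under the usual simple linear regression model with one explanatory variable, the half-width of a prediction interval at a new value x0 is a t critical value times the residual standard deviation times the factor sqrt(1 + 1/n + (x0 - x̄)² / Sxx). The sketch below computes only that last factor, which is the part that depends on how far x0 lies from the data; the x-values and the function name `prediction_se_factor` are hypothetical and chosen for illustration.

```python
import math


def prediction_se_factor(xs, x0):
    """Return sqrt(1 + 1/n + (x0 - x_bar)^2 / Sxx) for simple linear regression.

    This factor multiplies the residual standard deviation (and a t critical
    value) to give the half-width of a prediction interval at x0.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


# Hypothetical observed x-values, covering the range 1 to 10.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The first two points lie inside the observed range, the last two outside it.
for x0 in (5.5, 10, 15, 30):
    print(x0, round(prediction_se_factor(xs, x0), 3))
```

The factor is smallest at the mean of the observed x-values and grows as x0 moves away from it, which is why intervals for extrapolated predictions are wider than those for interpolated ones.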

However, this does not cover the full set of modelling errors that may be made, in particular the assumption of a particular form for the relation between [latex]\text{Y}[/latex] and [latex]\text{X}[/latex]. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available. This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship. Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model. If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be used in selecting the model, even if the observed data set has no values particularly near such bounds. The implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered. At a minimum, it can ensure that any extrapolation arising from a fitted model is “realistic” (or in accord with what is known).

### Conditions for Regression Inference

Suppose a scatterplot shows a linear relationship between a quantitative explanatory variable [latex]\text{x}[/latex] and a quantitative response variable [latex]\text{y}[/latex]. Let’s say we have [latex]\text{n}[/latex] observations on an explanatory variable [latex]\text{x}[/latex] and a response variable [latex]\text{y}[/latex]. Our goal is to study or predict the behavior of [latex]\text{y}[/latex] for given values of [latex]\text{x}[/latex]. Here are the required conditions for the regression model:

• Repeated responses [latex]\text{y}[/latex] are independent of each other.
• The mean response [latex]\mu_\text{y}[/latex] has a straight-line (i.e., “linear”) relationship with [latex]\text{x}[/latex]: [latex]\mu_\text{y} = \alpha + \beta \text{x}[/latex]; the slope [latex]\beta[/latex] and intercept [latex]\alpha[/latex] are unknown parameters.
• The standard deviation of [latex]\text{y}[/latex] (call it [latex]\sigma[/latex]) is the same for all values of [latex]\text{x}[/latex]. The value of [latex]\sigma[/latex] is unknown.
• For any fixed value of [latex]\text{x}[/latex], the response [latex]\text{y}[/latex] varies according to a normal distribution.
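One way to read the four conditions together is as a recipe for generating data: each response is the straight-line mean plus an independent normal deviation with the same standard deviation at every x. The following is a minimal sketch of that recipe; the function name `simulate_responses` and the parameter values are hypothetical, since in practice the intercept, slope, and standard deviation are unknown and must be estimated.

```python
import random


def simulate_responses(xs, alpha, beta, sigma, seed=0):
    """Draw one response per x-value from the regression model.

    Each y is alpha + beta * x plus an independent normal deviation with
    mean 0 and the same standard deviation sigma for every x.
    """
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0, sigma) for x in xs]


# Hypothetical parameter values, chosen only for illustration.
alpha, beta, sigma = 2.0, 0.5, 1.0

# Many repeated responses at the same x-value scatter around alpha + beta * x.
repeated = simulate_responses([4.0] * 5000, alpha, beta, sigma)
print(sum(repeated) / len(repeated))  # close to 2.0 + 0.5 * 4.0 = 4.0
```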

The importance of data distribution in linear regression inference: A good rule of thumb when using the linear regression method is to look at the scatter plot of the data. This graph is a visual example of why it is important that the data have a linear relationship. Each of these four data sets has the same linear regression line and the same correlation, 0.816. That correlation may at first seem strong, but the four data distributions are very different: predictions that hold for the first data set would likely not hold for the second, even though the regression method would lead you to believe that they were more or less the same. Looking at panels 2, 3, and 4, you can see that a straight line is probably not the best way to represent these three data sets.
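The figure itself is not reproduced here. A correlation of 0.816 shared by four very different data sets matches the well-known Anscombe's quartet, so the sketch below assumes that those are the four data sets shown; that identification is an assumption, not something the text states. The helper name `summarize` is likewise hypothetical.

```python
import math

# Anscombe's quartet (Anscombe, 1973). Assumed, not confirmed by the text,
# to be the four data sets in the figure described above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (intercept, slope, correlation) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / math.sqrt(sxx * syy)


for name, (xs, ys) in QUARTET.items():
    a, b, r = summarize(xs, ys)
    print(f"panel {name}: y = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```

The summary numbers agree across the four panels to about two decimal places, so only a scatterplot reveals that the panels differ.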

## A Graph of Averages

A graph of averages and the least-squares regression line are both good ways to summarize the data in a scatterplot.

### Learning Objectives

Contrast linear regression and graph of averages

### Key Takeaways

#### Key Points

• In most cases, a line will not pass through all points in the data. A good line of regression makes the distances from the points to the line as small as possible. The most common method of doing this is called the “least-squares” method.
• Sometimes, a graph of averages is used to show a pattern between the [latex]\text{y}[/latex] and [latex]\text{x}[/latex] variables. In a graph of averages, the [latex]\text{x}[/latex]-axis is divided up into intervals. The averages of the [latex]\text{y}[/latex] values in those intervals are plotted against the midpoints of the intervals.
• The graph of averages plots a typical [latex]\text{y}[/latex] value in each interval: some of the points fall above the least-squares regression line, and some of the points fall below that line.

#### Key Terms

• interpolation: the process of estimating the value of a function at a point from its values at nearby points
• extrapolation: a calculation of an estimate of the value of some function outside the range of known values
• graph of averages: a plot of the average values of one variable (say [latex]\text{y}[/latex]) for small ranges of values of the other variable (say [latex]\text{x}[/latex]), against the value of the second variable ([latex]\text{x}[/latex]) at the midpoints of the ranges

### Linear Regression vs. Graph of Averages

Linear (straight-line) relationships between two quantitative variables are very common in statistics. Often, when we have a scatterplot that shows a linear relationship, we’d like to summarize the overall pattern and make predictions about the data. This can be done by drawing a line through the scatterplot. The regression line drawn through the points describes how the dependent variable [latex]\text{y}[/latex] changes with the independent variable [latex]\text{x}[/latex]. The line is a model that can be used to make predictions, whether it is interpolation or extrapolation. The regression line has the form [latex]\text{y}=\text{a}+\text{bx}[/latex], where [latex]\text{y}[/latex] is the dependent variable, [latex]\text{x}[/latex] is the independent variable, [latex]\text{b}[/latex] is the slope (the amount by which [latex]\text{y}[/latex] changes when [latex]\text{x}[/latex] increases by one), and [latex]\text{a}[/latex] is the [latex]\text{y}[/latex]-intercept (the value of [latex]\text{y}[/latex] when [latex]\text{x}=0[/latex]).

In most cases, a line will not pass through all points in the data. A good line of regression makes the distances from the points to the line as small as possible. The most common method of doing this is called the “least-squares” method. The least-squares regression line is of the form [latex]\hat{\text{y}} = \text{a}+\text{bx}[/latex], with slope [latex]\text{b} = \frac{\text{r}\,\text{s}_\text{y}}{\text{s}_\text{x}}[/latex] ([latex]\text{r}[/latex] is the correlation coefficient, [latex]\text{s}_\text{y}[/latex] and [latex]\text{s}_\text{x}[/latex] are the standard deviations of [latex]\text{y}[/latex] and [latex]\text{x}[/latex]). This line passes through the point [latex](\bar{\text{x}},\bar{\text{y}})[/latex] (the means of [latex]\text{x}[/latex] and [latex]\text{y}[/latex]).
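The slope formula and the fact that the line passes through the point of means are enough to compute the line from data: the slope is r times s_y divided by s_x, and the intercept then follows as the mean of y minus the slope times the mean of x. A minimal sketch, with a hypothetical function name and made-up data:

```python
import math


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x.

    The slope is b = r * s_y / s_x. The intercept a is chosen so that the
    line passes through the point of means (x_bar, y_bar).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation coefficient r.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f} x")
```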

Sometimes, a graph of averages is used to show a pattern between the [latex]\text{y}[/latex] and [latex]\text{x}[/latex] variables. In a graph of averages, the [latex]\text{x}[/latex]-axis is divided up into intervals. The averages of the [latex]\text{y}[/latex] values in those intervals are plotted against the midpoints of the intervals. If we needed to summarize the [latex]\text{y}[/latex] values whose [latex]\text{x}[/latex] values fall in a certain interval, the point plotted on the graph of averages would be good to use.

The points on a graph of averages do not usually line up in a straight line, making it different from the least-squares regression line. The graph of averages plots a typical [latex]\text{y}[/latex] value in each interval: some of the points fall above the least-squares regression line, and some of the points fall below that line.
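The construction of a graph of averages can be sketched in a few lines. The function name `graph_of_averages`, the choice of half-open intervals, and the data are assumptions made for illustration; the text does not specify how interval boundaries are handled.

```python
def graph_of_averages(xs, ys, edges):
    """Return (midpoint, average y) for each interval that contains data.

    `edges` is a sorted list of interval boundaries. Each interval is taken
    as half-open, [lo, hi), which is one possible convention.
    """
    points = []
    for lo, hi in zip(edges, edges[1:]):
        in_interval = [y for x, y in zip(xs, ys) if lo <= x < hi]
        if in_interval:  # skip intervals with no observations
            points.append(((lo + hi) / 2, sum(in_interval) / len(in_interval)))
    return points


# Made-up data for illustration.
xs = [1, 2, 4, 6, 7, 9, 11, 12, 14]
ys = [2, 3, 3, 5, 6, 6, 9, 8, 10]
for midpoint, average in graph_of_averages(xs, ys, edges=[0, 5, 10, 15]):
    print(midpoint, average)
```

Each printed pair is one point on the graph of averages; plotting these points over the scatterplot shows how they fall on both sides of the least-squares line.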

Least Squares Regression Line: Random data points and their linear regression.

## The Regression Method

The regression method utilizes the average from known data to make predictions about new data.

### Learning Objectives

Contrast interpolation and extrapolation to predict data

### Key Takeaways

#### Key Points

• If we know no information about the [latex]\text{x}[/latex]-value, it is best to make predictions about the [latex]\text{y}[/latex]-value using the average of the entire data set.
• If we know the independent variable, or [latex]\text{x}[/latex]-value, the best prediction of the dependent variable, or [latex]\text{y}[/latex]-value, is the average of all the [latex]\text{y}[/latex]-values for that specific [latex]\text{x}[/latex]-value.
• Generalizations and predictions are often made using the methods of interpolation and extrapolation.

#### Key Terms

• extrapolation: a calculation of an estimate of the value of some function outside the range of known values
• interpolation: the process of estimating the value of a function at a point from its values at nearby points

### The Regression Method

The best way to understand the regression method is to use an example. Let’s say we have some data about students’ Math SAT scores and their freshman year GPAs in college. The average SAT score is 560, with a standard deviation of 75. The average first-year GPA is 2.8, with a standard deviation of 0.5. Now, we choose a student at random and wish to predict his first-year GPA. With no other information given, it is best to predict using the average. We predict his GPA is 2.8.

Now, let’s say we pick another student. However, this time we know her Math SAT score was 680, which is significantly higher than the average. Instead of just predicting 2.8, this time we look at the graph of averages and predict her GPA is whatever the average is of all the students in our sample who also scored a 680 on the SAT. This is likely to be higher than 2.8.

To generalize the regression method:

• If you know no information (you don’t know the SAT score), it is best to make predictions using the average.
• If you know the independent variable, or [latex]\text{x}[/latex]-value (you know the SAT score), the best prediction of the dependent variable, or [latex]\text{y}[/latex]-value (in this case, the GPA), is the average of all the [latex]\text{y}[/latex]-values for that specific [latex]\text{x}[/latex]-value.
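When the averages in the graph of averages lie close to a straight line, the regression line can stand in for them, and the prediction can be computed from the summary statistics alone. The example gives the means and standard deviations but not the correlation between SAT score and GPA, so the value r = 0.4 below is purely hypothetical, as is the function name.

```python
def regression_prediction(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x using the regression line in standard units.

    A value z standard deviations above the mean in x is predicted to be
    r * z standard deviations above the mean in y.
    """
    z_x = (x - mean_x) / sd_x
    return mean_y + r * z_x * sd_y


r_assumed = 0.4  # hypothetical correlation; not given in the example

# No SAT information is equivalent to r = 0: predict the overall average.
print(regression_prediction(680, 560, 75, 2.8, 0.5, r=0.0))

# SAT score of 680 is 1.6 standard deviations above the average of 560.
print(regression_prediction(680, 560, 75, 2.8, 0.5, r=r_assumed))
```

With the assumed correlation, the student is predicted to be 0.4 × 1.6 = 0.64 standard deviations above the average GPA, which is higher than 2.8 but less extreme than her SAT score.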

### Generalization

In the example above, the college only has experience with students that have been admitted; however, it could also use the regression model for students that have not been admitted. There are some problems with this type of generalization. If the students admitted all had SAT scores within the range of 480 to 780, the regression model may not be a very good estimate for a student who only scored a 350 on the SAT.

Despite this issue, generalization is used quite often in statistics. Sometimes statisticians will use interpolation to predict data points within the range of known data points. For example, if no one before had received an exact SAT score of 650, we would predict the GPA of a student with that score by looking at the GPAs of those who scored 640 and 660 on the SAT.

Extrapolation is also frequently used, in which data points beyond the known range of values are predicted. Let’s say the highest SAT score of a student the college admitted was 780. What if we have a student with an SAT score of 800, and we want to predict her GPA? We can do this by extending the regression line. This may or may not be accurate, depending on the subject matter.
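A simple safeguard when using a fitted line is to check whether a new value falls inside the range of the data before trusting the prediction. The sketch below does only that check, using the admitted range of 480 to 780 from the example; the function name is hypothetical.

```python
def classify_prediction(x_new, x_min, x_max):
    """Label a prediction by whether x_new lies inside the observed range."""
    if x_min <= x_new <= x_max:
        return "interpolation"
    return "extrapolation"


# Observed SAT range of admitted students, taken from the example above.
SAT_MIN, SAT_MAX = 480, 780

for score in (350, 650, 800):
    print(score, classify_prediction(score, SAT_MIN, SAT_MAX))
```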

Extrapolation: An example of extrapolation, where data outside the known range of values is predicted. The red points are assumed known and the extrapolation problem consists of giving a meaningful value to the blue box at [latex]\text{x}=7[/latex].

## The Regression Fallacy

The regression fallacy fails to account for natural fluctuations and rather ascribes cause where none exists.

### Learning Objectives

Illustrate examples of regression fallacy

### Key Takeaways

#### Key Points

• Things such as golf scores, the earth’s temperature, and chronic back pain fluctuate naturally and usually regress towards the mean. The logical flaw is to make predictions that expect exceptional results to continue as if they were average.
• People are most likely to take action when variance is at its peak. Then, after results become more normal, they believe that their action was the cause of the change, when in fact, it was not causal.
• In essence, misapplication of regression to the mean can reduce all events to a “just so” story, without cause or effect. Such misapplication takes as a premise that all events are random, as they must be for the concept of regression to the mean to be validly applied.

#### Key Terms

• regression fallacy: flawed logic that ascribes cause where none exists
• post hoc fallacy: flawed logic that assumes just because A occurred before B, then A must have caused B to happen

### What is the Regression Fallacy?

The regression (or regressive) fallacy is an informal fallacy. It ascribes cause where none exists. The flaw is failing to account for natural fluctuations. It is frequently a special kind of the post hoc fallacy.

Things such as golf scores, the earth’s temperature, and chronic back pain fluctuate naturally and usually regress towards the mean. The logical flaw is to make predictions that expect exceptional results to continue as if they were average. People are most likely to take action when variance is at its peak. Then, after results become more normal, they believe that their action was the cause of the change, when in fact, it was not causal.

This use of the word “regression” was coined by Sir Francis Galton in a study from 1885 called “Regression Toward Mediocrity in Hereditary Stature.” He showed that the height of children from very short or very tall parents would move towards the average. In fact, in any situation where two variables are less than perfectly correlated, an exceptional score on one variable may not be matched by an equally exceptional score on the other variable. The imperfect correlation between parents and children (height is not entirely heritable) means that the distribution of heights of their children will be centered somewhere between the average of the parents and the average of the population as a whole. Thus, any single child can be more extreme than the parents, but the odds are against it.
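Regression towards the mean can be reproduced with simulated data and no causal mechanism at all. The sketch below draws pairs of standardized scores (mean 0, standard deviation 1) with an assumed correlation of 0.5, then looks only at pairs whose first score is exceptionally high. The correlation, the cutoff of 1.5, and the names are hypothetical choices for illustration, not Galton's data.

```python
import math
import random


def simulate_pairs(n, r, seed=0):
    """Draw n (parent, child) pairs of standardized scores with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        parent = rng.gauss(0, 1)
        # Child shares a fraction r of the parent's deviation, plus independent noise.
        child = r * parent + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        pairs.append((parent, child))
    return pairs


pairs = simulate_pairs(100_000, r=0.5)  # r = 0.5 is an assumed value

# Keep only pairs where the parent is exceptionally tall (more than 1.5 SDs).
tall = [(p, c) for p, c in pairs if p > 1.5]
mean_parent = sum(p for p, _ in tall) / len(tall)
mean_child = sum(c for _, c in tall) / len(tall)

print(f"mean parent score: {mean_parent:.2f}")
print(f"mean child score:  {mean_child:.2f}")
```

The children of the exceptional parents are, on average, still above the population mean of 0 but closer to it than their parents, even though nothing in the simulation pushes them there.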

Francis Galton: A picture of Sir Francis Galton, who coined the use of the word “regression.”

### Examples of the Regression Fallacy

• When his pain got worse, he went to a doctor, after which the pain subsided a little. Therefore, he benefited from the doctor’s treatment. The pain subsiding a little after it has gotten worse is more easily explained by regression towards the mean. Assuming the pain relief was caused by the doctor is fallacious.
• The student did exceptionally poorly last semester, so I punished him. He did much better this semester. Clearly, punishment is effective in improving students’ grades. Often, exceptional performances are followed by more normal performances, so the change in performance might better be explained by regression towards the mean. Incidentally, some experiments have shown that people may develop a systematic bias for punishment and against reward because of reasoning analogous to this example of the regression fallacy.
• The frequency of accidents on a road fell after a speed camera was installed. Therefore, the speed camera has improved road safety. Speed cameras are often installed after a road incurs an exceptionally high number of accidents, and this value usually falls (regression to mean) immediately afterwards. Many speed camera proponents attribute this fall in accidents to the speed camera, without observing the overall trend.
• Some authors have claimed that the alleged “Sports Illustrated Cover Jinx” is a good example of a regression effect: extremely good performances are likely to be followed by less extreme ones, and athletes are chosen to appear on the cover of Sports Illustrated only after extreme performances. Assuming athletic careers are partly based on random factors, attributing this to a “jinx” rather than regression, as some athletes reportedly believed, would be an example of committing the regression fallacy.

### Misapplication of the Regression Fallacy

On the other hand, dismissing valid explanations can lead to a worse situation. For example: After the Western Allies invaded Normandy, creating a second major front, German control of Europe waned. Clearly, the combination of the Western Allies and the USSR drove the Germans back.

The conclusion above is true, but what if instead we came to a fallacious evaluation: “Given that the counterattacks against Germany occurred only after they had conquered the greatest amount of territory under their control, regression to the mean can explain the retreat of German forces from occupied territories as a purely random fluctuation that would have happened without any intervention on the part of the USSR or the Western Allies.” This is clearly not the case. The reason is that political power and occupation of territories are not primarily determined by random events, making the concept of regression to the mean inapplicable (on the large scale).

In essence, misapplication of regression to the mean can reduce all events to a “just so” story, without cause or effect. Such misapplication takes as a premise that all events are random, as they must be for the concept of regression to the mean to be validly applied.

Source: Statistics