Linear Regression

Linear regression is a method for modelling the influence of one or several variables (the "independent variables" or "regressors") on a metric variable (often called the "dependent variable").

Simple example

A simple example of a regression model would be a contradiction in terms. Even though linear regression analysis is quite common in the social sciences, it is a complex strategy that should be employed with the greatest care.

Complex example

REGRESSION
	/ STATISTICS COEFF R ANOVA TOL
	/ DEPENDENT lnincome
	/ METHOD = ENTER age agesqr gender occup exper educn
	/ RESIDUALS HIST (zresid) NORM DURBIN OUTLIERS (LEVER SDRESID COOK)
	/ SCATTERPLOT (*sdresid, *zpred)
	/ PARTIALPLOT ALL
	/ SAVE DFBETA.

I will use this section to explain the most important features of the linear regression model, using the example provided above.

The STATISTICS line, as used here, displays the unstandardized and the standardized regression coefficients with their standard errors, t-values and significance levels (keyword COEFF), R² (keyword R) and the F-test for the overall model (keyword ANOVA). Finally, keyword TOL requests collinearity statistics.

A few words about collinearity: The "tolerance" column in the output is the complement of the extent to which each individual regressor (independent variable) is (linearly) dependent on the other regressors: it equals 1 minus the R² obtained when that regressor is regressed on all the other regressors. Thus, a "tolerance" near 1 means that the respective variable is largely "independent" of the other regressors. A value approaching 0 (zero) means that the variable is almost a linear combination of the other regressors; its coefficient will then be estimated very imprecisely, and you should reconsider whether (and in which form) it belongs in the model. The next column, entitled "VIF" (Variance Inflation Factor), displays the reciprocal of the tolerance and is therefore interpreted just the other way round, ranging from one to (plus) infinity: The larger the VIF, the more the respective variable is (linearly) dependent on the others. The VIF is the factor by which the variance of the specific regression coefficient is enlarged due to collinearity; the standard error is enlarged by the square root of the VIF. A value of 4 thus means that if this variable were completely independent of the other variables in the regression model, the standard error would be only one half of the actual standard error.
(Note: More sophisticated multicollinearity statistics are available, but those mentioned here will usually do their job - that is, making it your job to look at those variables that exhibit a certain amount of collinearity.)
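
To make the relation between tolerance, VIF and the standard error concrete, here is a minimal sketch in Python (not SPSS syntax). It assumes a model with exactly two regressors, in which case the R² of one regressor on the other is simply their squared correlation. The function names and the data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinearity_stats(x1, x2):
    """Tolerance, VIF and standard-error inflation for a two-regressor model."""
    r_squared = pearson_r(x1, x2) ** 2  # R² of one regressor on the other
    tolerance = 1.0 - r_squared
    vif = 1.0 / tolerance               # variance inflation factor
    return tolerance, vif, math.sqrt(vif)  # sqrt(VIF) inflates the standard error


# Hypothetical data: age and work experience are strongly related
age = [25, 30, 35, 40, 45, 50, 55, 60]
exper = [2, 8, 10, 18, 20, 24, 33, 35]

tol, vif, se_factor = collinearity_stats(age, exper)
print(f"tolerance={tol:.3f}  VIF={vif:.1f}  SE inflated by factor {se_factor:.2f}")
```

With more than two regressors the same logic applies, but the R² has to come from a regression of each regressor on all the others.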

In the line beginning with keyword DEPENDENT, you will tell SPSS which is the dependent variable in your model.

The METHOD = ENTER line tells SPSS which are the independent variables (or regressors) in your model; with ENTER, all variables listed are entered into the model together in a single step.

The commands in the RESIDUALS line first request a histogram (keyword HIST) of the standardized residuals to check how closely the residuals approximate a normal distribution. The normal P-P plot requested with keyword NORM serves the same purpose; the points for the residuals (by default displayed in red) should lie close to the straight line (usually displayed in green) running from the bottom left corner to the top right corner.

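Keyword DURBIN requests the Durbin-Watson statistic for (first-order) autocorrelation of the residuals. This statistic ranges from 0 to 4 and should be close to the value 2. Values clearly below 2 indicate positive autocorrelation, values clearly above 2 indicate negative autocorrelation; how large a deviation has to be judged critical depends on the number of cases and the number of regressors in your model.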
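
The Durbin-Watson statistic itself is easy to compute from the residuals in their case order. The following Python sketch (not SPSS syntax; the function name and the residual series are made up) shows the usual formula: the sum of squared differences between successive residuals, divided by the sum of squared residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in case (e.g. time) order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series
smooth = [1, 1, 1, -1, -1, -1]        # neighbours resemble each other
alternating = [1, -1, 1, -1, 1, -1]   # neighbours have opposite signs

print(durbin_watson(smooth))       # below 2: positive autocorrelation
print(durbin_watson(alternating))  # above 2: negative autocorrelation
```

Note that the statistic depends on the order of the cases, so it is only meaningful if that order carries information (for example, a time sequence).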

Keyword OUTLIERS together with the three keywords that follow in parentheses requests statistics by which outlying and influential cases may be identified. "Leverage" (in German: "Hebel") refers to cases that have outlying values in one or more of the regressors. SDRESID stands for "studentized deleted residuals" and refers to cases that would have large residuals if the model were estimated without the respective case (these are cases that are not well accounted for by the independent variables). However, neither high leverage nor a large residual necessarily constitutes a problem on its own. Cook's distance is a statistic that combines both of the aforementioned aspects; cases whose values in this statistic are considerably higher than those of the remaining cases should be checked carefully.
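
How the three statistics relate to each other can be sketched in Python (not SPSS syntax) for the simplest case of a single regressor. The function name and the data are made up, and the formulas are the standard textbook ones; as far as I know, SPSS reports a centered variant of leverage (the value below minus 1/n), so the numbers need not match SPSS output exactly.

```python
import math


def influence_stats(x, y):
    """Leverage, studentized deleted residual and Cook's distance per case
    for a simple regression of y on x (p = 2 parameters)."""
    n, p = len(x), 2
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    mse = sse / (n - p)
    rows = []
    for a, e in zip(x, resid):
        h = 1 / n + (a - mx) ** 2 / sxx  # (uncentered) leverage
        # residual variance estimated without the case in question
        s2_deleted = (sse - e ** 2 / (1 - h)) / (n - p - 1)
        sdresid = e / math.sqrt(s2_deleted * (1 - h))
        cook = e ** 2 * h / (p * mse * (1 - h) ** 2)
        rows.append((h, sdresid, cook))
    return rows


# Hypothetical data: the last case is far out on x and off the trend of the rest
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 12.0]

stats = influence_stats(x, y)
for case, (h, sdresid, cook) in enumerate(stats, start=1):
    print(f"case {case}: leverage={h:.3f}  sdresid={sdresid:7.2f}  cook={cook:.3f}")
```

In this made-up example the last case combines high leverage with a large deleted residual, which is exactly the situation in which Cook's distance becomes large.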

The SCATTERPLOT of studentized deleted residuals ("*sdresid") and standardized predicted values ("*zpred") will yield a plot that can be checked for heteroskedasticity. The values for the studentized deleted residuals should be evenly distributed around zero for all levels of the predicted values; a spread that widens or narrows systematically along the predicted values (a "funnel" shape) indicates heteroskedasticity.

PARTIALPLOT ALL will display what John Fox ("Regression Diagnostics", Sage Publications) calls partial regression plots (also known as added-variable plots). These plots exhibit the "net" relationship between each independent variable and the dependent variable ("net" because the influence of the other variables is "partialed out"). These plots are important for identifying possible nonlinear relationships or groups of influential cases that are not easily detected by the statistics mentioned above.

SAVE DFBETA will add, for each independent variable in the model, a variable to the data set that indicates to what extent the regression coefficient for that variable would change if the respective case were omitted from the data entering the regression model. Thus, it serves to check which specific regression coefficient might be unduly affected by a single "influential" case.
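
The idea behind DFBETA can be sketched in Python (not SPSS syntax) by simply re-estimating a simple regression once for every omitted case. The function names and the data are made up, and the sign convention (full-sample coefficient minus case-deleted coefficient) is an assumption of this sketch; statistical packages use a shortcut formula instead of refitting the model.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def dfbeta(x, y):
    """Change in (intercept, slope) caused by each case: full fit minus
    the fit obtained when that case is left out."""
    b0, b1 = fit_line(x, y)
    changes = []
    for i in range(len(x)):
        b0_i, b1_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        changes.append((b0 - b0_i, b1 - b1_i))
    return changes


# Hypothetical data: the last case pulls the slope down considerably
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 12.0]

for case, (d0, d1) in enumerate(dfbeta(x, y), start=1):
    print(f"case {case}: change in intercept={d0:+.3f}  change in slope={d1:+.3f}")
```

A case whose value stands out clearly from the others, as the last one does here, is a candidate for closer inspection.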