# LOGISTIC REGRESSION

Logistic regression estimates the influence of one or several variables on a *binary* dependent variable.

*Example with regression diagnostics displayed in the output*

```
LOGISTIC REGRESSION a10
  / METHOD=ENTER a13 a15 a16 a159 a15*a159
  / CONTRAST (a16)=INDICATOR(2)
  / CASEWISE = COOK DFBETA OUTLIER.
```

*Example with regression diagnostics saved in our data set*

```
LOGISTIC REGRESSION a10
  / METHOD=ENTER a13 a15 a16 a159 a15*a159
  / CONTRAST (a16)=INDICATOR(2)
  / SAVE COOK DFBETA.
```

In this example, the variable named a10 is the dependent variable. The subcommand `METHOD=ENTER` provides SPSS with the names of the independent variables. Note that a15*a159 is an interaction effect; SPSS computes the product of these variables or, if one or both of these variables are treated as categorical variables, the product of the respective dummy variables.
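What SPSS does with a15*a159 can be sketched in a few lines. The following Python snippet (with made-up values for a15 and a159) simply forms the case-wise product that serves as the interaction regressor; it illustrates the mechanics only, not SPSS's internal implementation.

```python
# Hypothetical data for two variables; the interaction regressor is
# simply their case-wise product.
a15 = [0, 1, 1, 0]
a159 = [2, 3, 0, 5]

interaction = [x * y for x, y in zip(a15, a159)]
print(interaction)  # [0, 3, 0, 0]
```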

In the next line, SPSS is told that variable a16 is to be treated as a categorical variable. The keyword `INDICATOR` in this line means that a16 is decomposed into a series of k-1 dummy variables (k being the number of categories of a16), with the second category as the reference category. (Note that this is one of the many things that are *not* offered when working with SPSS's menu system, where you can only choose the first or the last category. The choice of reference category cannot be discussed here, but it should not be made in an automatic way!) If there were another (or several other) categorical variable(s), you would include the respective number of `CONTRAST` lines. The advantage of telling SPSS that there are categorical variables and how to treat them consists not only in the automatic creation of dummy (or other) variables; what is more important, SPSS will test the overall influence of the set of related (dummy or other) variables on the likelihood function.
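The k-1 dummy coding performed by `INDICATOR` can be illustrated outside SPSS. The following Python sketch (with made-up values for a16 and, as in the syntax above, category 2 as the reference) shows the coding scheme itself; function and variable names here are my own, not SPSS's.

```python
def indicator_coding(values, categories, reference):
    """Return one 0/1 dummy column per non-reference category.

    Each of the k-1 dummies is 1 where the case falls in that category;
    the reference category is coded 0 on all of them.
    """
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }

# Hypothetical a16 with k = 3 categories, category 2 as reference:
a16 = [1, 2, 3, 2, 1, 3]
dummies = indicator_coding(a16, categories=[1, 2, 3], reference=2)
print(dummies)  # {1: [1, 0, 0, 0, 1, 0], 3: [0, 0, 1, 0, 0, 1]}
```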

Finally, statistics similar to Cook's D and dfbeta in linear regression are computed. Particularly high values might indicate that the respective cases exert undue influence on the results of the estimation. In the first example, these statistics are listed in the output; the keyword `OUTLIER` ensures that by default only cases with studentized residuals of size 2 or larger are displayed (otherwise all cases would be listed!). As this list is ordered by case number, with large data sets it may be tedious to discern the "really outlying" cases (i.e. those with the highest values of Cook's D). Therefore, it may be preferable to have these statistics added to the data set, as in the second example. They can then be plotted and checked for outlying values, or you might sort your data by the variable of interest, which will also help you inspect the values of the dependent and independent variables of the respective cases.
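Why sorting the saved diagnostics helps can be sketched as follows; the case numbers and Cook's D values in this Python snippet are entirely made up for illustration.

```python
# Hypothetical saved diagnostics: case id -> Cook's D.
# In a case-ordered listing the influential cases are scattered;
# sorting by Cook's D brings them to the top at once.
cooks_d = {101: 0.02, 102: 0.91, 103: 0.05, 104: 0.47, 105: 0.01}

worst_first = sorted(cooks_d.items(), key=lambda kv: kv[1], reverse=True)
print(worst_first[:2])  # [(102, 0.91), (104, 0.47)]
```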

*Warning*: The parameters of a logistic regression model are estimated by maximum likelihood methods. Not infrequently, one or more parameters cannot be estimated because there is insufficient information in the data. Most statistical packages inform the user when this occurs, but SPSS does not, because some (undocumented) tricks are used when this case arises. However, you can recognize these "unestimable" coefficients in the output: usually they are of quite abnormal size (somewhere between ±6 and ±10) and, above all, they have huge standard errors, not infrequently two or three times the size of the "estimate" for the coefficient.

Here is an especially stupid example which I have made up, but rest assured that such things happen with real data (I simply do not have a convenient example at hand right now). Consider the following crosstabulation testing the effect of two diets on weight reduction.

| W_REDUCT | DIET: Dr Jekyll's | DIET: Dr Frankenstein's | Total |
|---|---|---|---|
| no | 30 | 40 | 70 |
| yes | 10 | 0 | 10 |
| Total | 40 | 40 | 80 |

With diet 1, 10 out of 40 people have obtained a weight reduction (remember, this is not a real world example), whereas with diet 2, not a single person has lost weight (this is much more realistic ...). The decisive part of the output of a logistic regression looks as follows:

| | | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1 | DIET | -10.104 | 42.822 | .056 | 1 | .813 | .000 |
| | Constant | 9.006 | 42.826 | .044 | 1 | .833 | 8148.985 |

The large B coefficients and especially the huge standard errors indicate that something is wrong. What is it? The maximum likelihood procedure cannot arrive at an estimate for the probability of losing weight with the second diet: the data exhibit a probability of 0, but such a probability occurs only in samples (whether made-up or from real life), not in the underlying statistical theory. Note that in more complex analyses, usually most of the parameters can be estimated in a meaningful way, and only one or a few parameters exhibit the problem outlined here. However, there is only one choice: remove the respective variable(s) from your model and re-estimate it.
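The root of the problem can be demonstrated with the crosstabulated counts themselves: the coefficients of a logistic regression are built from log-odds, and the observed probability of 0 for the second diet has no finite log-odds. The following Python sketch (my own illustration, not SPSS output) uses the counts from the table above.

```python
import math

def log_odds(successes, total):
    """Empirical log-odds, or None where no finite ML estimate exists."""
    p = successes / total
    if p == 0 or p == 1:
        # log(p / (1 - p)) would be -inf or +inf: the likelihood has no
        # finite maximum, which SPSS masks with a huge B and S.E.
        return None
    return math.log(p / (1 - p))

print(log_odds(10, 40))  # diet 1: log(0.25 / 0.75) = -1.0986...
print(log_odds(0, 40))   # diet 2: None (observed probability is 0)
```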

© W. Ludwig-Mayerhofer, IGSW | Last update: 06 Oct 2016