Logistic Regression and Related Models

Logistic regression models deal with categorical dependent variables. Depending on the number of categories and on whether or not these categories are ordered, different models are available.

Model overview

Binary logistic regression

Here are three examples with variable "vote" (yes/no) as the dependent variable:

logit vote age education gender

logistic vote age education gender

logit vote age education gender, or

The first command produces the model estimates as logit coefficients; the second and third commands report odds ratios (which some people call "effect coefficients"), i.e., the multiplicative effect of each independent variable on the odds.

Alternatively, you may write

logistic vote age education gender
logit

Here, logit will "translate" the immediately preceding model (with effect coefficients) into a model with logit coefficients.
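As a minimal sketch of how the two kinds of output relate, the following uses Stata's bundled auto dataset. This is an assumption for illustration only: the binary variable foreign stands in for "vote", and mpg and weight stand in for the predictors. The odds ratio reported by logistic should equal the exponentiated logit coefficient.

```stata
* Illustrative data only: auto.dta ships with Stata; foreign is coded 0/1
sysuse auto, clear

* Model reported as logit coefficients (log odds)
logit foreign mpg weight

* Exponentiate one coefficient by hand and keep it for comparison
scalar or_mpg = exp(_b[mpg])

* Same model, reported as odds ratios
logistic foreign mpg weight

* This value should match the odds ratio shown for mpg in the table above
display "Odds ratio for mpg: " scalar(or_mpg)
```

The scalar name or_mpg is arbitrary; scalar() is used when reading it back so that it cannot be confused with a variable name.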

Multinomial logistic regression

With the Stata command mlogit, you can estimate the influence of variables on a dependent variable with several categories (such as "Brand A", "Brand B", "Brand C", "Brand D"). Note that if these categories are ordered (as in response scales running from "strongly agree" to "strongly disagree"), an ordered logistic regression model should usually be preferred.

Example

mlogit brand age sex class, baseoutcome(2) rrr

The option baseoutcome is required only if you wish to depart from Stata's default, i.e., the most frequent category. Another option is rrr, which makes Stata display the odds ratios (and the associated confidence intervals) instead of the logit coefficients. Note that for some reason Stata calls these "relative-risk ratios" (hence the name of the option), but the formula in the manual shows that they are ratios of odds, each taken relative to the base category, as you might expect.
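A runnable sketch of both options, assuming the sysdsn1 example dataset from the Stata manuals (fetched with webuse, so an internet connection is needed). The variables insure, age, male and nonwhite belong to that dataset and merely stand in for brand, age, sex and class.

```stata
* Illustrative data: insure has three unordered categories coded 1, 2, 3
webuse sysdsn1, clear

* Use the category coded 2 as the base outcome instead of the default
mlogit insure age male nonwhite, baseoutcome(2)

* Redisplay the same model with exponentiated coefficients
mlogit, rrr
```

Typing mlogit, rrr after estimation only changes how the results are displayed; the model is not re-estimated.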

Ordered logistic regression

Stata offers several ways to analyze an ordered dependent variable, say, an attitude towards abortion. The most common model is based on cumulative logits:

Example

ologit abortion age sex class, or

As before, the option or reports the effects as odds ratios.

Probit models

Probit models are alternatives to logistic regression models (or logit models). The commands for the binary, multinomial and ordered case go like this:

probit vote age education gender

mprobit brand age sex class, baseoutcome(2)

oprobit abortion age sex class

Interpretation of effects with `margins`

Stata can compute the effects of independent variables on the outcome in terms of probabilities, either literally (predicted probabilities) or as marginal effects (predicted changes of probability).

Margins for models with binary dependent variables

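margins sex	Margins for a categorical variable (which must have been entered in the model as a factor variable, e.g. i.sex)
margins, at(age=(10(10)80))	Margins for a metric variable at selected values (here 10, 20, ..., 80)
margins, dydx(_all) atmeans	Marginal effects of all independent variables, with the other covariates held at their means
margins, dydx(_all)	Average marginal effects of all covariates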
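The four commands can be tried in sequence after a binary model. The sketch below assumes the lbw example dataset from the Stata manuals (fetched with webuse); low, age, lwt, race and smoke are variables of that dataset, not of the vote example.

```stata
* Illustrative data: low (0/1) is the binary outcome
webuse lbw, clear

* Categorical predictors carry the i. prefix so that margins can use them
logit low age lwt i.race i.smoke

* Predicted probabilities for each category of race
margins race

* Predicted probabilities at ages 15, 20, ..., 40
margins, at(age=(15(5)40))

* Marginal effects with all covariates held at their means
margins, dydx(_all) atmeans

* Average marginal effects
margins, dydx(_all)
```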

Margins for dependent variables with more than two categories

Margins are particularly important in the case of the multinomial model, as the regression coefficients may be very misleading. They must be obtained separately for each category of the dependent variable. This holds true for the ordinal model as well.

To do so, you can use all the commands described above, adding an option that indicates the category for which the margins are to be computed. There are two ways to specify the category, shown here for the simplest case, a categorical independent variable:

margins sex, predict(outcome(#3))	Margins for the third category (whatever its actual value)
margins sex, predict(outcome(3))	Margins for the category that is coded as "3"
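A short sketch of both variants, again assuming the sysdsn1 example dataset from the Stata manuals. In that dataset the outcome insure is coded 1, 2, 3, so the two commands happen to select the same category; they differ only when the codes are not consecutive integers starting at 1.

```stata
* Illustrative data: insure is coded 1, 2, 3
webuse sysdsn1, clear

* male is entered as a factor variable so that margins accepts it
mlogit insure age i.male nonwhite

* Third category in sort order, whatever its actual value
margins male, predict(outcome(#3))

* Category whose value is 3
margins male, predict(outcome(3))
```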

Tests of significance

The z-based significance tests on the coefficients are not considered the best available. A superior test is based on the likelihood ratio statistic. Unfortunately, computation is a bit tedious (unless you resort to the procedure lrdrop1 described below, which has its own drawbacks): you have to save the estimates from your model first, then estimate a constrained model (e.g., a model with one parameter set to zero, or indeed a model with any constraints you like), and finally perform an LR test comparing the two models. The procedure is as follows:

`(m)(o)logit depvar indvars`		Estimation of first model
`estimates store anyname`		Estimates are stored under the name "anyname"
`(m)(o)logit depvar indvars`		Estimation of model with constraints (e.g., with one or more of the indvars omitted)
`lrtest anyname .`		Performing the LR test; note the dot at the end indicating that the last model estimated is to be tested against model "anyname".

Of course, you may estimate several models, store the estimates (under different names) and test any models you like afterwards. Make sure that the models you compare are estimated on the same cases and that one model is always nested within the other.
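The steps above can be sketched with concrete data. The example assumes the lbw dataset from the Stata manuals (fetched with webuse); the name "full" and the choice of smoke as the variable to be tested are arbitrary.

```stata
* Illustrative data: low (0/1) is the binary outcome
webuse lbw, clear

* Unconstrained model
logit low age lwt i.race smoke
estimates store full

* Constrained model: the coefficient of smoke is set to zero by omitting it
logit low age lwt i.race

* LR test of the stored model against the model estimated last
lrtest full .
```

Both models use the same cases here because smoke has no missing values in this dataset.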

Another way would be simply to compute the LR test "by hand" (or rather by brain) using the log-likelihoods from the Stata output.
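If you prefer to let Stata do the arithmetic, the test statistic is twice the difference between the two log-likelihoods, with degrees of freedom equal to the number of constraints. A minimal sketch, assuming the lbw dataset from the Stata manuals and a single omitted variable (hence one degree of freedom); the scalar names are arbitrary.

```stata
* Illustrative data: low (0/1) is the binary outcome
webuse lbw, clear

* Log-likelihood of the unconstrained model
quietly logit low age lwt i.race smoke
scalar ll_full = e(ll)

* Log-likelihood of the constrained model (smoke omitted)
quietly logit low age lwt i.race
scalar ll_restr = e(ll)

* LR statistic and its p-value from the chi-square distribution, 1 df
scalar lr_stat = 2 * (scalar(ll_full) - scalar(ll_restr))
display "LR chi2(1) = " scalar(lr_stat)
display "p-value    = " chi2tail(1, scalar(lr_stat))
```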

Procedure `lrdrop1`

This ado file, which can be installed via ssc install lrdrop1, computes LR tests for all variables in the model. However, it was written at the end of the last millennium and does not support automatically created factor (dummy) variables or interaction effects. If you create the respective variables before estimating your model, it will work fine in most cases. Note that multinomial logit commands must be preceded by the prefix version 10: (some lower versions will work as well); otherwise lrdrop1 aborts with an error message.

Measures of fit

For many purposes, Stata's output concerning overall model fit is sufficient. Both the model chi-square (i.e., the LR test for the current model compared to the null model) and McFadden's Pseudo R-square are included in the standard output.

A number of additional statistics are available from the fitstat package by J. Scott Long and Jeremy Freese. This package may be installed as follows:

ssc install fitstat

See help fitstat (after installation) for more details.