Factors

Factors are vectors (or variables in a data.frame) that represent categorical variables. This does not mean that the underlying values must be text; they may well be numeric codes. Conversely, a character vector (variable) used as a predictor in a statistical model will automatically be treated as a factor.

A factor is not a fundamentally different type of object; internally it is simply an integer vector (or a variable in a data.frame). What makes it special is the additional information it carries: a set of levels, plus the information that the vector represents a factor. Indeed, if a vector is defined as a factor (in ways that will be explained below), it will belong to a special class, i.e., the class "factor".
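
As a minimal sketch of what this class information looks like in practice, the following lines compare a plain numeric vector with its factor counterpart. The names v and f are illustrative only and are not used elsewhere in this text.

```r
# Illustrative names (v, f); not part of the running example
v <- c(3, 2, 2, 3, 3, 2, 2, 1, 2, 1)
f <- factor(v)

class(v)      # "numeric"
class(f)      # "factor"
typeof(f)     # "integer": the values are stored as integer codes
levels(f)     # "1" "2" "3": the distinct values, sorted, kept as labels
is.factor(f)  # TRUE
```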

Now, what is it all about? The gist of it is that when a factor appears in a statistical model, it will be dealt with in some predefined way. In regression models, categorical variables typically are translated into dummy variables (also called indicator variables), and this is what is automatically done if a variable is defined as a factor. In analysis of variance, which is a procedure to compare the means of several groups, it is even mandatory to use a factor to define these groups.

To give a simple example, suppose we create a dependent variable, y, as follows:

y <- c(15,16,13,12,14,10,4,8,5,7)

Now let's define a second variable, x1:

x1 <- c(3,2,2,3,3,2,2,1,2,1)

In a regression model (r1 <- lm(y ~ x1)) variable x1 will be treated as a metric variable. Now, let's create a factor from it:

x2 <- factor(x1)

If you now run the regression model with x2 as the independent (predictor) variable, two regression coefficients will appear. They represent the difference in the mean of y between levels "2" and "3", respectively, and level "1", the reference (or base) category.
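
To make the difference between the two specifications concrete, the sketch below fits both models and inspects the coefficient names and the dummy variables R builds behind the scenes. It assumes R's default treatment contrasts (the setting in effect unless options(contrasts = ...) has been changed); the name r2 is introduced here for illustration only.

```r
y  <- c(15, 16, 13, 12, 14, 10, 4, 8, 5, 7)
x1 <- c(3, 2, 2, 3, 3, 2, 2, 1, 2, 1)
x2 <- factor(x1)

r1 <- lm(y ~ x1)   # x1 as a metric predictor: a single slope
r2 <- lm(y ~ x2)   # x2 as a factor: one coefficient per non-reference level

names(coef(r1))    # "(Intercept)" "x1"
names(coef(r2))    # "(Intercept)" "x22" "x23"

# With treatment contrasts, the intercept of r2 is the mean of y in the
# reference level "1"; x22 and x23 are the mean differences of levels
# "2" and "3" from that reference. Compare with the raw group means:
coef(r2)
tapply(y, x2, mean)

# The dummy (indicator) variables generated for the factor
head(model.matrix(r2))
```

The coefficient names are formed by pasting the variable name and the level together, hence x22 and x23.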

If the factor is numeric, as in this example, you may wish to label the different values; these labels will appear in the output of statistical models or will be used to label graphs. In our example, this could work like this:

x2 <- factor(x1, labels = c("School A","School B","School C"))
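
One point worth checking when labelling is which value receives which label. The following sketch, using the same x1 as above, suggests that the labels are assigned to the sorted distinct values in order, so that 1 becomes "School A", 2 becomes "School B", and 3 becomes "School C".

```r
x1 <- c(3, 2, 2, 3, 3, 2, 2, 1, 2, 1)
x2 <- factor(x1, labels = c("School A", "School B", "School C"))

levels(x2)     # "School A" "School B" "School C"
table(x1, x2)  # cross-tabulation of original codes against the new labels
```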

Ordered factors

Factors may represent ordinal variables, and if this is the case, you may wish to define them as ordered factors. You should do so only if you know what orthogonal polynomials are about, because this is the way ordered factors will be coded in regression models. The command to create an ordered factor is simply

x2 <- factor(x1, ordered = TRUE)

or even more simply

x2 <- ordered(x1)
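
The following sketch illustrates what changes once a factor is ordered: it gains the additional class "ordered", its levels can be compared, and regression models code it with polynomial contrasts. The example reuses y and x1 from above and assumes R's default contrast settings.

```r
y  <- c(15, 16, 13, 12, 14, 10, 4, 8, 5, 7)
x1 <- c(3, 2, 2, 3, 3, 2, 2, 1, 2, 1)
x2 <- ordered(x1)

class(x2)       # "ordered" "factor"
x2[1] > x2[2]   # TRUE: level "3" ranks above level "2"

# Orthogonal polynomial contrasts: a linear (.L) and a quadratic (.Q) column
contrasts(x2)

# The regression coefficients are named after these contrasts
names(coef(lm(y ~ x2)))   # "(Intercept)" "x2.L" "x2.Q"
```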

As mentioned above, a character variable will be treated as a factor automatically in statistical models. If you define it as an ordered factor, the implied order will be alphabetical, which is often not the intended ranking. This can be changed with the levels argument of factor(), which might look like, e.g., levels = c("Basic","Intermediate","Excellent").
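
A short sketch of the difference, using a hypothetical character vector named skill (the data are invented purely for illustration):

```r
# Hypothetical data, invented for illustration
skill <- c("Intermediate", "Basic", "Excellent", "Basic", "Intermediate")

# Without the levels argument, the order is alphabetical
s_alpha <- factor(skill, ordered = TRUE)
levels(s_alpha)   # "Basic" "Excellent" "Intermediate"

# With the levels argument, the order is the one you specify
s <- factor(skill,
            levels = c("Basic", "Intermediate", "Excellent"),
            ordered = TRUE)
levels(s)         # "Basic" "Intermediate" "Excellent"
s[1] < s[3]       # TRUE: "Intermediate" now ranks below "Excellent"
```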

A fast way to create factors

Assume that you have a small experiment with three groups and five observations in each group. The command

gr1 <- gl(3,5,15)

will create a factor with the values 1,1,1,1,1,2,2,2,2,2,3,3,3,3,3. The command works like this: The first number tells R the number of levels. The second number tells R that each level is to be repeated k times in a row (k in this example is 5). The number 15 in this example actually is redundant, as the length of the result defaults to the number of levels times the number of replications. But let's assume that the 3 times 5 observations actually refer to the group of males only and we have another 3 times 5 observations on females. In this case, you might write gr <- gl(3,5,30), and as a result the pattern listed above would be repeated to create codes for 30 observations overall.
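
The behaviour described above can be checked with a few lines. The variable sex and its labels are an assumption added for illustration, following the males-then-females layout just described; labels is a regular argument of gl().

```r
gr1 <- gl(3, 5, 15)
gr1                    # a factor with levels 1, 2, 3, each repeated 5 times
all(gr1 == gl(3, 5))   # TRUE: the third argument is redundant here

gr <- gl(3, 5, 30)
as.integer(gr)         # the 15-value pattern, repeated twice

# Assumption for illustration: the first 15 observations are males,
# the last 15 are females
sex <- gl(2, 15, labels = c("male", "female"))
table(sex, gr)         # 5 observations in every sex-by-group cell
```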

Of course, the values of the y variable must be arranged in exactly the same order as the factor created this way, so that each observation is matched with the group it belongs to.
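
As a rough sketch of how the pieces fit together, the example below pairs a generated factor with an outcome variable and runs an analysis of variance. The scores are hypothetical values invented for illustration, entered group by group so that their order matches the factor.

```r
# Hypothetical scores, invented for illustration; entered group by group
scores <- c(12, 14, 11, 13, 15,   # group 1
            18, 17, 19, 16, 20,   # group 2
             9,  8, 10,  7, 11)   # group 3
gr1 <- gl(3, 5)

# Guard against a mismatch between outcome and grouping factor
stopifnot(length(scores) == length(gr1))

d <- data.frame(scores, gr1)
tapply(d$scores, d$gr1, mean)          # group means: 13, 18, 9
summary(aov(scores ~ gr1, data = d))   # one-way analysis of variance
```

Keeping both variables in one data.frame makes it harder for the outcome and the grouping factor to get out of step, for example after sorting or subsetting.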