Data Transformation

Data transformation is probably what occupies most of the time of social scientists who work with quantitative data. R was certainly not designed to make this task as easy as possible. At the same time, it offers a wealth of functions that may be useful, depending on what you set out to accomplish. Furthermore, packages have been created that make data transformation easier. I gather that the most popular of these is dplyr, which is part of a collection of packages called the tidyverse.

At the moment, I can offer only a few very simple solutions to very simple (and common) problems, making use of built-in R functions. As for dplyr, you will certainly find resources on the internet.

Mathematical transformation

In what follows, I will always suppose that data are available as a data.frame (even though most of what follows will also work with matrices or other types of objects). And I will not assume the use of attach.

In many instances, all we want to do is to apply some mathematical transformation. We will want to compute the logarithm of a variable, or we will subtract a number (or the value of another variable) from it.

Here are three examples:

mydata$loginc <- log(mydata$income)
mydata$ageat0 <- mydata$age - 18
mydata$netinc <- mydata$grossinc - mydata$taxes

Throughout, the extract operator $ is used. This ensures that R "knows" that the variable(s) on the right-hand side can be found in the data.frame "mydata" and that the new variable on the left-hand side is added to that same data.frame.
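To make the three assignments concrete, here is a minimal sketch that runs them on a small made-up data frame. All values are invented for illustration; only the column names follow the examples above.

```r
# Toy data; all values are invented for illustration
mydata <- data.frame(
  income   = c(1200, 2500, 4000),
  age      = c(18, 35, 62),
  grossinc = c(1500, 3200, 5200),
  taxes    = c(300, 700, 1200)
)

# Each assignment adds a new column to mydata
mydata$loginc <- log(mydata$income)
mydata$ageat0 <- mydata$age - 18
mydata$netinc <- mydata$grossinc - mydata$taxes

names(mydata)   # the three new variables follow the original four
mydata$ageat0   # 0 17 44
mydata$netinc   # 1200 2500 4000
```

Note that each operation is applied to the whole column at once; no loop over the rows is needed.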

The variable on the left-hand side need not be a new variable. You might just as well write:

mydata$income <- log(mydata$income)
mydata$age <- mydata$age - 18

You will do this, however, only if you are a very experienced user (and only under special circumstances), because overwriting a variable discards its original values. Normally, you will not change an existing variable but rather add a new variable to the data set.

Here are some mathematical operators and functions beyond the obvious "+" and "-":

`/`		Division
`*`		Multiplication
`^` or `**`		Power
`%%`		Modulus (remainder of a division)
`floor(), ceiling()`		Floor and ceiling (rounding down and up to the nearest integer), respectively
`log()`		Natural logarithm (to base e, i.e. Euler's number)
`log2(), log10()`		Logarithm to base 2 and 10, respectively
`log(x, base=y)` or simply `log(x, y)`		Logarithm to base y (where y can be any expression that yields a positive number)
`sqrt()`		Square root
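As a quick check of the table above, the following lines can be typed at the R prompt. The numbers are arbitrary and chosen only so that the results are easy to verify by hand.

```r
7 / 2              # 3.5
7 * 2              # 14
2^10               # 1024
2**10              # 1024, same as 2^10
7 %% 3             # 1, the remainder of 7 divided by 3
floor(2.7)         # 2
ceiling(2.1)       # 3
log(exp(1))        # 1
log2(8)            # 3
log10(1000)        # 3
log(8, base = 2)   # 3
sqrt(16)           # 4

# All of these work element by element on whole vectors (and hence columns)
log10(c(1, 10, 100))   # 0 1 2
```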

There are some more functions which may be of interest to statisticians, but less so to the average social scientist. I may mention factorial() or choose(n,k). The latter computes the binomial coefficient, i.e. the number of ways of choosing k elements from n.
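A brief illustration of these two functions, with small numbers picked so that the results can be checked by hand:

```r
factorial(5)   # 120, i.e. 5 * 4 * 3 * 2 * 1
choose(5, 2)   # 10 ways of choosing 2 elements out of 5

# The binomial coefficient written out with factorials gives the same value
factorial(5) / (factorial(2) * factorial(3))   # 10
```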

Centering variables

Variables can easily be centered around the mean, and in addition they may be z-transformed (i.e., centered and then divided by their standard deviation, which produces a variable with mean = 0 and S.D. = 1).

Centering only is accomplished by

mydata$income_c <- scale(mydata$income, scale=FALSE)

The z-transformation is accomplished by

mydata$income_z <- scale(mydata$income)
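To see what scale() does, the following sketch compares it with the z-transformation computed by hand. The vector x is made up for illustration. One detail worth knowing is that scale() returns a one-column matrix (with the mean and S.D. stored as attributes) rather than a plain vector; wrapping the call in as.numeric() is one way to obtain an ordinary numeric variable, should the matrix form cause trouble later on.

```r
x <- c(2, 4, 6, 8)   # invented values

z_by_hand <- (x - mean(x)) / sd(x)   # subtract the mean, divide by the S.D.
z_scale   <- scale(x)                # the same transformation via scale()

all.equal(z_by_hand, as.numeric(z_scale))   # TRUE
mean(z_by_hand)                             # 0 (up to rounding error)
sd(z_by_hand)                               # 1 (up to rounding error)

is.matrix(z_scale)               # TRUE: the result is a one-column matrix
is.matrix(as.numeric(z_scale))   # FALSE: now a plain numeric vector
```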

Note that a variable can be centered around any value other than the mean, as, e.g., in:

mydata$income_c5 <- scale(mydata$income, center = 5, scale = FALSE)

Likewise, if you want the variable to be divided by some factor other than the S.D., you may write, e.g.:

mydata$income_s10 <- scale(mydata$income, center = FALSE, scale = 10)

Of course, centering and scaling by predefined values may be combined.
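A minimal sketch of such a combination, using an invented vector x: the value 5 is subtracted first, and the result is then divided by 10.

```r
x <- c(5, 15, 25)   # invented values

scaled <- scale(x, center = 5, scale = 10)

as.numeric(scaled)                            # 0 1 2
all.equal(as.numeric(scaled), (x - 5) / 10)   # TRUE
```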