Working with Data

In contrast to most other statistical software, R will not automatically assume that once you have opened a data set (whether a data.frame or a matrix), all further commands will refer to these data. Rather, you have to tell R which data.frame object a given command is referring to. For instance, if you have a data.frame called "mydata" with variables "income", "education", "gender" and "jobexp", you cannot normally create a regression model just by listing the variable names.
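
The following is a minimal sketch of this behaviour. The data frame and its values are invented for illustration, and the sketch assumes that no objects called "income" or "education" exist elsewhere in your workspace.

```r
# Hypothetical example data; all values are invented for illustration
mydata <- data.frame(
  income    = c(2100, 2500, 3200, 2800, 4100, 3900),
  education = c(9, 10, 13, 12, 18, 16),
  gender    = factor(c("f", "m", "f", "m", "f", "m")),
  jobexp    = c(2, 5, 7, 4, 12, 10)
)

# Listing only the variable names fails, because R does not know
# that they are stored inside "mydata"; the error is caught here
# so that its message can be inspected
result <- tryCatch(
  lm(income ~ education),
  error = function(e) conditionMessage(e)
)
print(result)
```

The sections below describe several ways to tell R where the variables are.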

Data vectors

A linear regression model relates a dependent variable, often called "Y", to one or more independent variables, often called "X1", "X2" and so on. Mathematically, such variables can be represented as vectors, that is, columns of data values in which each row corresponds to one case in the data set.
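
As a small sketch of this idea, two such vectors can be created by hand with c(). The values are invented for illustration; position 1 in each vector belongs to the first case, position 2 to the second case, and so on.

```r
# Hypothetical values for three cases
y  <- c(2100, 2500, 3200)   # dependent variable
x1 <- c(9, 10, 13)          # independent variable

length(y)   # number of cases
y[2]        # value of y for the second case
x1[2]       # value of x1 for the same case
```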

In R, vectors are an important class of objects, and indeed you might build a regression model from vectors. I have seen it done quite often. This involves two steps:

1. Extract the vectors from the data.frame, as in the following example which refers to a data.frame called "mydata":

y <- mydata$income
x1 <- mydata$education
x2 <- mydata$gender

2. Build your regression model:

outp1 <- lm(y ~ x1 + x2)

This is particularly helpful if you want to try out several different things with these variables.
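
The two steps can be put together in a self-contained sketch. The data frame below is made up for illustration, and "outp2" is merely an example name for a second model built from the same vectors.

```r
# Hypothetical example data; all values are invented for illustration
mydata <- data.frame(
  income    = c(2100, 2500, 3200, 2800, 4100, 3900),
  education = c(9, 10, 13, 12, 18, 16),
  gender    = factor(c("f", "m", "f", "m", "f", "m"))
)

# Step 1: extract the vectors from the data frame
y  <- mydata$income
x1 <- mydata$education
x2 <- mydata$gender

# Step 2: build the model; terms in the formula are joined by "+"
outp1 <- lm(y ~ x1 + x2)

# The same vectors can be reused for another specification
outp2 <- lm(y ~ x1)

coef(outp1)
coef(outp2)
```

Note that the extracted vectors are copies. If you later change "mydata", the vectors y, x1 and x2 will not change with it.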

Data frames

If you do not extract the variables as vectors from the data frame, a bit more effort is required in each single data analysis command.

A first way available with most statistical procedures is to name the data object explicitly in the command, via the "data" argument, as in the following example:

outp1 <- lm(depvar ~ indvar1 + indvar2, data=mydata)

This informs R that the variables to be used can be found in object 'mydata'.

A second way to use variables from a data set is provided by the extract operator "$":

outp1 <- lm(mydata$depvar ~ mydata$indvar1 + mydata$indvar2)

Finally, you may use with():

outp1 <- with(mydata, lm(depvar ~ indvar1 + indvar2))
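
To see that these three ways lead to the same estimates, here is a short sketch with made-up data. The object names "m_data", "m_dollar" and "m_with" are chosen for this illustration only.

```r
# Hypothetical example data; all values are invented for illustration
mydata <- data.frame(
  depvar  = c(2100, 2500, 3200, 2800, 4100, 3900),
  indvar1 = c(9, 10, 13, 12, 18, 16),
  indvar2 = c(2, 5, 7, 4, 12, 10)
)

m_data   <- lm(depvar ~ indvar1 + indvar2, data = mydata)          # data argument
m_dollar <- lm(mydata$depvar ~ mydata$indvar1 + mydata$indvar2)    # "$" operator
m_with   <- with(mydata, lm(depvar ~ indvar1 + indvar2))           # with()

# The estimates agree; names are dropped before comparing because
# the "$" version labels its coefficients differently
all.equal(unname(coef(m_data)), unname(coef(m_dollar)))
all.equal(unname(coef(m_data)), unname(coef(m_with)))

# Coefficient labels of the "$" version include the data frame name
names(coef(m_dollar))
```

Which way you prefer is largely a matter of taste. The version with the "data" argument keeps the formula short and gives plain variable names in the output.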

"Attaching" the data

To spare the effort of repeated reference to the data object, you may 'attach' it, as in

attach(mydata)

Now, variables can be addressed by their names (provided they are available, of course). In other words, you can refer to variables by their name just as if they were a vector or some other object.

However, there are reasons to avoid attach. These reasons may, or may not, apply to you. If you are simply working with a single data set, "attaching" it may be just fine. Attaching may become problematic if there are many objects in your workspace, because if you are unlucky, a variable in the data.frame and some other object will share the same name, and R may then use a different object than the one you had in mind.
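
One such interference can be sketched as follows. The data and the stray workspace object are invented for illustration; the point is that an object in the workspace takes precedence over a variable of the same name in an attached data.frame.

```r
# Hypothetical example data; all values are invented for illustration
mydata <- data.frame(
  income    = c(2100, 2500, 3200),
  education = c(9, 10, 13)
)

# An unrelated object that happens to have the same name as a variable
income <- c(1, 2, 3)

attach(mydata)   # R usually prints a message that "income" is masked

from_name  <- mean(income)          # uses the workspace object
from_frame <- mean(mydata$income)   # uses the variable in the data frame

detach(mydata)   # clean up again

from_name
from_frame
```

The two means differ, although at first sight both commands seem to refer to the same variable.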

If you use attach, you should be aware that there is also detach(), as in detach(mydata), which ends the attachment: afterwards, the variables can no longer be addressed by their names alone.
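
A minimal sketch of attaching and detaching, again with made-up data. It uses search(), which lists the places where R looks for objects, to check whether "mydata" is currently attached.

```r
# Hypothetical example data; all values are invented for illustration
mydata <- data.frame(
  income    = c(2100, 2500, 3200),
  education = c(9, 10, 13)
)

attach(mydata)
attached_before <- "mydata" %in% search()   # is "mydata" on the search path?
outp1 <- lm(income ~ education)             # variables addressed by name

detach(mydata)
attached_after <- "mydata" %in% search()    # checked again after detaching

attached_before
attached_after
```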

© W. Ludwig-Mayerhofer, R Guide | Last update: 12 Dec 2016