Reading/Importing Data

Even though you may enter data in R via a built-in spreadsheet (the command is newdata <- edit(data.frame())), data will normally be available as an external file.

The specific command to be used to read data depends on the format in which data are available. It also depends on a number of other factors. Be prepared to find a lot of stuff here that is rather tedious.

Note that in R parlance, a data file is called a "data frame". However, data.frame is a name for a specific class of objects; sometimes what we would ordinarily call a data file is, for R, just a "matrix", and then again it might be a totally different and very special object. In what follows, I try to indicate whether you can assume that data will be available as a data.frame or not.

Reading data in R format

How to read data files that have been created (and saved) via R depends on the specific way the file has been created. Therefore we must deal with this aspect as well.

"Read" R data that have been "written"

Data can be saved to disk with the write.table command, which basically works like this: write.table(name-of-object, "name-of-file"). The object to be saved is supposed to be either a matrix or a data frame. Note that write.table produces a plain text file, and that reading this file back with read.table always yields a data.frame, even if the original object was a matrix. It is therefore advisable (but in no way necessary) to give the file an extension such as "txt"; the extension "RData" is conventionally used for files created with save (see below).

A data frame thus created is read by, e.g.

mydata <- read.table("C:/Path-to-file/name-of-dataframe")

In other words, the data are assigned to an object which in this example is called "mydata".
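As a minimal sketch of this round trip (the matrix m and the temporary file are merely illustrative assumptions, not part of any real data set), the following lines write a small matrix and read it back. They use only base R.

```r
# Hypothetical example object: a small matrix with named columns
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# A temporary file stands in for "C:/Path-to-file/name-of-dataframe"
path <- tempfile(fileext = ".txt")

write.table(m, path)         # writes plain text, including row and column names
mydata <- read.table(path)   # the header line is detected because it has one field fewer

class(mydata)                # "data.frame", although a matrix was written
str(mydata)
```

The point to take away is that whatever class the object had before writing, what comes back from read.table is a data.frame.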

"Load" R objects that have been "saved"

Any R object, including data frames and matrices, can be saved to disk with the command save. Such an object may be "loaded". Saving and loading does not influence the class of objects; thus, a class matrix object that has been saved will be loaded as a matrix.

Saving and loading are (or can be) very simple:

save(name-of-object, file = "C:/Path-to-data-set/name-of-file")

Note that "file" may have an extension, but it must be stated explicitly. The directory (path) can be omitted if the file is to be located in the current working directory.

Such an object may be retrieved by

load("C:/Path-to-data-set/name-of-file")

The important thing here is that the object is now available under the name it carried when it was saved, i.e. name-of-object. Note that there is some danger involved. If, in the meantime, you have created a new object named "name-of-object", this object will be overwritten by the content of "name-of-file".
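The following sketch illustrates both the overwriting and one possible way around it, namely loading into a separate environment via the envir argument of load. The object name scores and the temporary file are assumptions made up for this illustration.

```r
# Hypothetical object and a temporary file instead of a real path
path <- tempfile(fileext = ".RData")
scores <- c(1, 2, 3)
save(scores, file = path)

# Meanwhile, the same name is used for something else ...
scores <- "something else"
load(path)                   # ... and is silently overwritten on loading
scores                       # the saved numeric vector again

# A more cautious variant: load into an environment of its own
scores <- "something else"
e <- new.env()
loaded <- load(path, envir = e)   # load() returns the names of the loaded objects
loaded
e$scores                     # the saved object
scores                       # the object in the workspace is left untouched
```

Inspecting the return value of load is also a convenient way to find out what a file of unknown origin actually contains.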

Reading data that come with R libraries

The "core" libraries that are invoked upon starting R, as well as many special libraries, contain one or several data sets.

To find out which of these data sets are available, type data(). If you want to find out which data (if any) are provided by a particular library, type data(package="nameOfLibrary").

Any of these data sets can be loaded simply with the command

data(name-of-dataset)

Note that these data sets do not necessarily represent a single data.frame, or perhaps not a data frame at all; they may actually consist of several objects which may belong to different classes, possibly even to quite outlandish classes that have been defined in the framework of special packages. So, after loading a data set you have to inspect the current workspace (command ls()) to find out which objects are new to it, and in addition you might investigate the class and the structure of each object (commands class() and str()).
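A short sketch of this inspection, using the data set cars from the datasets package that comes with R (the helper names before, listing and new_objects are my own choice), assuming the lines are typed at the top level of an R session:

```r
# Which data sets does a particular package provide?
listing <- data(package = "datasets")
head(listing$results[, c("Item", "Title")])

# Remember what is in the workspace, load a data set, and compare
before <- ls()
data(cars)
new_objects <- setdiff(ls(), c(before, "before"))
new_objects                  # names of the objects that have been added

# Investigate class and structure of what has been loaded
class(cars)
str(cars)
```

In this case the result is an ordinary data.frame, but the same steps apply to data sets that turn out to be something more exotic.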

Importing data from other statistical packages

R can read data sets from a variety of commercial statistical software. It may also export data sets that can be read by such software.

To do so, you have to deploy special purpose libraries, particularly library foreign and library Hmisc. The former can help you to import Stata and SPSS files; the latter, as you may guess, is a package that does many different things, but it can also import SPSS and SAS files. As far as I can see, importing files with the help of these packages will yield data.frames.

Both packages offer additional options (particularly Hmisc), which I urge you to explore via the help system.

Library foreign

The general format is

library(foreign)
dataobject <- keyword("C:/Path-to-data-set/name-of-data-set")

In other words, the data set is assigned to an object (if not, its contents are printed to the screen in the console!); the specifics depend on the type of data you want to read:

`dataobject <- read.dta("C:/Path-to-data-set/mydata.dta")`	Imports Stata file "mydata.dta"
`dataobject <- read.spss("C:/Path-to-data-set/mydata.sav")`	Imports SPSS file "mydata.sav"

Note that typically R cannot read data that have been produced by the latest version of the respective package. For instance, currently (that is, with R version 3.3.0) a Stata data set to be imported must correspond to Stata version 12 or older (the handbook says that it may work only with versions 5 to 11).

Problems may result from Stata variables that have labels for all values. These variables are interpreted by R as factors. This can be prevented by the option convert.factors=FALSE.

For some packages, facilities exist to export data from R into other formats. In the case of foreign and Stata, the command is write.dta. I'm afraid that you have to look up details in the R Handbook on Data Import/Export.
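To see what convert.factors does without having a Stata file at hand, one can write a small data frame with write.dta and read it back. This is only a sketch: the data frame df and the temporary file are invented for the purpose, and I assume that the foreign package is installed (it normally ships with R).

```r
library(foreign)

# Hypothetical data frame; the factor is written to the Stata file with value labels
df <- data.frame(id = 1:4,
                 sex = factor(c("female", "male", "male", "female")))
path <- tempfile(fileext = ".dta")
write.dta(df, path)

# Default: labelled values are converted to a factor
with_factors <- read.dta(path)
class(with_factors$sex)

# With convert.factors = FALSE the underlying codes are kept instead
without_factors <- read.dta(path, convert.factors = FALSE)
class(without_factors$sex)
without_factors$sex
```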

Library Hmisc

Hmisc might be better than foreign when dealing with newer versions of SPSS files.

`dataobject <- spss.get("C:/Path-to-data-set/mydata.sav")`	Imports SPSS file "mydata.sav"
`dataobject <- sas.get("C:/Path-to-data-set", member = "mydata")`	Imports SAS data set "mydata" from the given directory (this requires a SAS installation)

Library haven

haven is another way to read Stata and SPSS files. It may have some advantages over foreign, but I'm not yet familiar with it.

Library memisc

Memisc is a package for the management of survey data and the presentation of analysis results. It can read SPSS (both portable and system files) and Stata data sets; however, to write such data sets from R you have to use other packages.

Memisc has its own class for data, namely, data.set, but to work with such a data set within R it may be helpful to convert it to a data.frame, as in the following line that reads in an SPSS data set:

mydata <- data.frame(as.data.set(spss.system.file("Some-SPSS-file.sav")))

For a Stata file, the command is:

mydata <- data.frame(as.data.set(Stata.file("Some-Stata-file.dta")))

Reading data from spreadsheets

There is a function to import Microsoft Excel® sheets, but you may frequently encounter data from such spreadsheets that have been converted to the .csv format, that is, a text file with values separated by commas:

mydata <- read.csv("name-of-data-set.csv",header=TRUE)

If the data in the text file are separated just by blanks, the command is

mydata <- read.table("name-of-data-set.txt",header=TRUE)

Two comments on this.
1. The extensions used here are not mandatory; what counts is what the data actually look like. You could give these files any extension you want, however nonsensical it may be. It might even be an extension typically reserved for proprietary data formats, such as .doc or .df. You'll just not normally want to do this.
2. The option header=TRUE, as you might imagine, indicates that the data set starts with a row that contains the column, or variable, names.
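
Both comments can be tried out with a small sketch in base R. The file contents and the deliberately nonsensical extension ".anything" are invented for this illustration.

```r
# A comma-separated file with a header row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,31", "Bob,45"), csv_path)
from_csv <- read.csv(csv_path, header = TRUE)

# The same data separated by blanks, stored under an arbitrary extension
txt_path <- tempfile(fileext = ".anything")
writeLines(c("name age", "Ann 31", "Bob 45"), txt_path)
from_txt <- read.table(txt_path, header = TRUE)

str(from_csv)
all.equal(from_csv, from_txt)   # compares the two resulting data frames
```

Note that header=TRUE is already the default for read.csv, whereas for read.table it is not; stating it explicitly in both cases does no harm.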