Beginner's guide to R: Easy ways to do basic data analysis

So you've read your data into an R object. Now what?

Examine your data object

Before you start analyzing, you might want to take a look at your data object's structure and a few row entries. If it's a 2-dimensional table of data stored in an R data frame object with rows and columns -- one of the more common structures you're likely to encounter -- here are some ideas. Many of these work on 1-dimensional vectors as well.

Many of the commands below assume that your data are stored in a variable called mydata. That name is just a placeholder for your own object's name; it is not part of the function names.

[This story is part of Computerworld's "Beginner's guide to R." To read from the beginning, check out the introduction; there are links on that page to the other pieces in the series.]

If you type:

head(mydata)

R will display mydata's column headers and first 6 rows by default. Want to see, oh, the first 10 rows instead of 6? That's:

head(mydata, n=10)

Or just:

head(mydata, 10)

Note: If your object is just a 1-dimensional vector of numbers, such as (1, 1, 2, 3, 5, 8, 13, 21, 34), head(mydata) will give you the first 6 items in the vector.

To see the last few rows of your data, use the tail() function:

tail(mydata)

Or:

tail(mydata, 10)

Tail can be useful when you've read in data from an external source, helping to see if anything got garbled (or there was some footnote row at the end you didn't notice).

To quickly see how your R object is structured, you can use the str() function:

str(mydata)

This will tell you the type of object you have; in the case of a data frame, it will also tell you how many rows (observations in statistical R-speak) and columns (variables to R) it contains, along with the type of data in each column and the first few entries in each column.

For a vector, str() tells you how many items there are -- for 8 items, it'll display as [1:8] -- along with the type of item (number, character, etc.) and the first few entries.

Various other data types return slightly different results.
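To make the difference between vectors and data frames concrete, here is a minimal sketch. It uses the sample vector mentioned above (stored under the assumed name myvector) and R's built-in mtcars data frame as a stand-in for your own data:

```r
# Sample vector from the note above (name chosen for illustration)
myvector <- c(1, 1, 2, 3, 5, 8, 13, 21, 34)

first_six  <- head(myvector)      # first 6 items by default
last_three <- tail(myvector, 3)   # last 3 items

print(first_six)
print(last_three)

# str() reports the type, the number of items and the first few values
str(myvector)

# The same functions applied to a built-in data frame
print(head(mtcars, 3))
str(mtcars)
```

On the data frame, str() should list each column on its own line with its type and first few values.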

If you want to see just the column names in the data frame called mydata, you can use the command:

colnames(mydata)

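Likewise, if you're interested in the row names -- the labels that identify each row, which R displays to the left of the first column but stores separately from the column data (if none were set, they are just the row numbers) -- use: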

rownames(mydata)
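If you'd like to try both functions without loading your own data, a quick sketch on the built-in mtcars data frame might look like this (the variable names car_columns and car_rows are just for illustration):

```r
car_columns <- colnames(mtcars)   # the column headers
car_rows    <- rownames(mtcars)   # the row labels (car models here)

print(head(car_columns, 3))
print(head(car_rows, 3))

# Row names are kept apart from the data columns,
# so they are not included in the column count
print(ncol(mtcars))
print(length(car_columns))
```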

Pull basic stats from your data frame

Because R is a statistical programming platform, it's got some pretty elegant ways to extract statistical summaries from data. To extract a few basic stats from a data frame, use the summary() function:

summary(mydata)

That returns some basic calculations for each column. If the column has numbers, you'll see the minimum and maximum values along with median, mean, 1st quartile and 3rd quartile. If it's got factors with levels such as fair, good, very good and excellent, you'll get a count of how many times each level appears in the column.

The summary() function also returns stats for a 1-dimensional vector.
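Here's a small sketch of summary() on a numeric vector and on a factor. Both objects are made-up examples, not data from this article:

```r
# A numeric vector: summary() returns min, quartiles, median, mean and max
myvector <- c(1, 1, 2, 3, 5, 8, 13, 21, 34)
vector_stats <- summary(myvector)
print(vector_stats)

# A factor (hypothetical ratings): summary() returns a count for each level
ratings <- factor(c("good", "fair", "good", "excellent", "good"))
rating_counts <- summary(ratings)
print(rating_counts)
```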

If you'd like even more statistical summaries from a single command, install and load the psych package. Install it with this command:

install.packages("psych")

You need to run this install only once on a system. Then load it with:

library(psych)

You need to run the library command each time you start a new R session if you want to use the psych package.

Now try the command:

describe(mydata)

and you'll get several more statistics from the data, including standard deviation, "mad" (median absolute deviation), skew (a measure of how asymmetrical the data distribution is) and kurtosis (whether the distribution has a sharper or flatter peak near its mean).

R has the statistical functions you'd expect, including mean(), median(), min(), max(), sd() [standard deviation], var() [variance] and range(), which you can run on a 1-dimensional vector of numbers. (Several of these functions -- such as mean() and median() -- will not work on a 2-dimensional data frame.)

Oddly, the mode() function returns information about data type instead of the statistical mode; there's an add-on package, modeest, that adds a mfv() function (most frequent value) to find the statistical mode.
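If you'd rather not install a package just to find the most frequent value, one possible base-R workaround is to count the values with table() and pick the largest count. This is only a sketch on a made-up vector, not the approach modeest uses:

```r
# Hypothetical sample data
myvector <- c(2, 4, 4, 4, 5, 5, 7, 9)

print(mean(myvector))
print(median(myvector))
print(sd(myvector))
print(range(myvector))   # smallest and largest value

# mode() describes the storage type, not the statistical mode
print(mode(myvector))

# Count each distinct value, then keep the value(s) with the highest count
counts <- table(myvector)
most_frequent <- as.numeric(names(counts)[counts == max(counts)])
print(most_frequent)
```

If two or more values tie for the highest count, most_frequent will hold all of them.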

R also contains a load of more sophisticated functions that let you do analyses with one or two commands: probability distributions, correlations, significance tests, regressions, ANOVA (analysis of variance between groups) and more.

As just one example, running the correlation function cor() on a dataframe such as:

cor(mydata)

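will give you a matrix of correlations for each column of numerical data compared with every other column of numerical data. Note that cor() expects every column to be numeric; if your data frame also holds text or factor columns, select only the numeric columns first or R will return an error.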
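When a data frame mixes numbers with other types, one way to get a correlation matrix is to pick out the numeric columns before calling cor(). The sketch below uses the built-in iris data frame, which has four numeric columns and one factor column; the variable names are just for illustration:

```r
# TRUE for each numeric column, FALSE for the Species factor
numeric_cols <- sapply(iris, is.numeric)
print(numeric_cols)

# Correlate only the numeric columns
cor_matrix <- cor(iris[, numeric_cols])
print(round(cor_matrix, 2))
```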

Note: Be aware that you can run into problems when trying to run some functions on data where there are missing values. In some cases, R's default is to return NA even if just a single value is missing. For example, while the summary() function returns column statistics excluding missing values (and also tells you how many NAs are in the data), the mean() function will return NA if even one value in a vector is missing.

In most cases, adding the argument:

na.rm=TRUE

to NA-sensitive functions will tell that function to remove any NAs when performing calculations, such as:

mean(myvector, na.rm=TRUE)

If you've got data with some missing values, read a function's help file by typing a question mark followed by the name of the function, such as:

?median

The function description should say whether the na.rm argument is needed to exclude missing values.

Checking a function's help files -- even for simple functions -- can also uncover additional useful options, such as an optional trim argument for mean() that lets you exclude some outliers.
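To see how missing values and the trim argument behave, here's a minimal sketch with two made-up vectors:

```r
# Hypothetical vector with one missing value
myvector <- c(3, 5, NA, 8, 100)

print(mean(myvector))                     # NA, because one value is missing
mean_no_na <- mean(myvector, na.rm=TRUE)  # average of the 4 remaining values
print(mean_no_na)
print(median(myvector, na.rm=TRUE))

# trim=0.2 drops the lowest 20% and highest 20% of values before averaging
outlier_vector <- c(1, 2, 3, 4, 100)
trimmed <- mean(outlier_vector, trim=0.2)
print(mean(outlier_vector))
print(trimmed)
```

With five values, a trim of 0.2 removes one value from each end, so the 100 no longer pulls the average up.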

Not all R functions need a robust data set to be useful for statistical work. For example, how many ways can you select a committee of 4 people from a group of 15? You can pull out your calculator and find 15! divided by the product of 4! and 11! ... or you can use the R choose() function:

choose(15,4)

Or, perhaps you want to see all of the possible pair combinations of a group of 5 people, not simply count them. You can create a vector with the people's names and store it in a variable called mypeople:

mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")

In the example above, c() is the combine function.

Then run the combn() function, which takes two arguments -- your entire set first and then the number you want to have in each group:

combn(mypeople, 2)

Probably most experienced R users would combine these two steps into one like this:

combn(c("Bob", "Joanne", "Sally", "Tim", "Neal"),2)

But separating the two can be more readable for beginners.
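As a quick sanity check, you can compare choose() with the factorial formula and count the pairs that combn() returns. This sketch reuses the mypeople vector from above; the other variable names are assumptions for illustration:

```r
# Number of 4-person committees from 15 people
n_committees <- choose(15, 4)
print(n_committees)

# The same count from the factorial formula
by_formula <- factorial(15) / (factorial(4) * factorial(11))
print(by_formula)

# All possible pairs from a group of 5
mypeople <- c("Bob", "Joanne", "Sally", "Tim", "Neal")
pairs <- combn(mypeople, 2)

# combn() returns a matrix with one combination per column
print(ncol(pairs))

# Transposing puts one pair on each row, which can be easier to read
print(t(pairs))
```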

Get slices or subsets of your data

Maybe you don't need correlations for every column in your data frame and you just want to work with a couple of columns, not 15. Perhaps you want to see data that meets a certain condition, such as within 3 standard deviations. R lets you slice your data sets in various ways, depending on the data type.

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).

For example, the mtcars sample data frame has these column names: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb.

Can't remember the names of all the columns in your data frame? If you just want to see the column names and nothing else, instead of functions such as str(mtcars) and head(mtcars) you can type:

names(mtcars)

That's handy if you want to store the names in a variable, perhaps called mtcars.colnames (or anything else you'd like to call it):

mtcars.colnames <- names(mtcars)

But back to the task at hand. To access only the data in the mpg column in mtcars, you can use R's dollar sign notation:

mtcars$mpg

More broadly, then, the format for accessing a column by name would be:

dataframename$columnname

That will give you a 1-dimensional vector of numbers like this:

[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8

[12] 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5

[23] 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

The numbers in brackets are not part of your data, by the way. They indicate what item number each line is starting with. If you've only got one line of data, you'll just see [1]. If there's more than one line of data and only the first 11 entries can fit on the first line, your second line will start with [12], and so on.

Sometimes a vector of numbers is exactly what you want -- if, for example, you want to quickly plot mtcars$mpg and don't need item labels, or you're looking for statistical info such as variance and mean.

Chances are, though, you'll want to subset your data by more than one column at a time. That's when you'll want to use bracket notation, what I think of as rows-comma-columns. Basically, you take the name of your data frame and follow it by [rows,columns]. The rows you want come first, followed by a comma, followed by the columns you want. So, if you want all rows but just columns 2 through 4 of mtcars, you can use:

mtcars[,2:4]

Do you see that comma before the 2:4? That's leaving a blank space where the "which rows do you want?" portion of the bracket notation goes, and it means "I'm not asking for any subset, so return all." Although it's not always required, it's not a bad practice to get into the habit of using a comma in bracket notation so that you remember whether you were slicing by columns or rows.

If you want multiple columns that aren't contiguous, such as columns 2 AND 4 but not 3, you can use the notation:

mtcars[,c(2,4)]

A couple of syntax notes here:

R indexes from 1, not 0. So your first column is at [1] and not [0].

R is case sensitive everywhere. mtcars$mpg is not the same as mtcars$MPG.

mtcars[,-1] will not get you the last column of a data frame, the way negative indexing works in many other languages. Instead, negative indexing in R means exclude that item. So, mtcars[,-1] will return every column except the first one.

To create a vector of items that are not contiguous, you need to use the combine function c(). Typing mtcars[,(2,4)] without the c will not work. You need that c in there:

mtcars[,c(2,4)]
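The syntax notes above can be tried out on mtcars with a short sketch like this one (the variable names are just for illustration):

```r
# Exclude the first column with a negative index
without_first <- mtcars[, -1]
print(names(without_first))

# To get the last column, ask for column number ncol()
last_col <- mtcars[, ncol(mtcars)]
print(last_col)

# Non-contiguous columns need c()
two_cols <- mtcars[, c(2, 4)]
print(names(two_cols))
```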

What if you want to select your data by data characteristic, such as "all cars with mpg > 20", and not column or row location? If you use the column name notation and add a condition like:

mtcars$mpg>20

you don't end up with a list of all rows where mpg is greater than 20. Instead, you get a vector showing whether each row meets the condition, such as:

[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE

[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

[19] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE

[28] TRUE FALSE FALSE FALSE TRUE

To turn that into a listing of the data you want, use that logical test condition and row-comma-column bracket notation. Remember that this time you want to select rows by condition, not columns. This:

mtcars[mtcars$mpg>20,]

tells R to get all rows from mtcars where mpg > 20, and then to return all the columns.

If you don't want to see all the column data for the selected rows but are just interested in displaying, say, mpg and horsepower for cars with an mpg greater than 20, you could use the notation:

mtcars[mtcars$mpg>20,c(1,4)]

using column locations, or:

mtcars[mtcars$mpg>20,c("mpg","hp")]

using the column names.

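Why do you need to specify mtcars$mpg in the row spot but "mpg" in the column spot? The row spot here holds a logical test, and R has to be told which data frame's mpg values to test; the column spot simply takes a vector of column names, which are plain character strings.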

If you're finding that your selection statement is starting to get unwieldy, you can put your row and column selections into variables first, such as:

mpg20 <- mtcars$mpg > 20

cols <- c("mpg", "hp")

Then you can select the rows and columns with those variables:

mtcars[mpg20, cols]

making for a more compact select statement but more lines of code.
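Because a logical vector is just a series of TRUE and FALSE values, you can count the matches with sum() and combine tests with & (and) or | (or). The following is a sketch on mtcars; the variable names and the hp > 100 threshold are made up for illustration:

```r
# TRUE for each car with mpg above 20
mpg20 <- mtcars$mpg > 20

# TRUE counts as 1, so sum() gives the number of matching rows
n_matches <- sum(mpg20)
print(n_matches)

# Combine two conditions: mpg above 20 AND horsepower above 100
efficient_and_strong <- mtcars[mtcars$mpg > 20 & mtcars$hp > 100, c("mpg", "hp")]
print(efficient_and_strong)
```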

Getting tired of including the name of the data set multiple times per command? If you're using only one data set and you are not making any changes to the data that need to be saved, you can attach and detach a copy of the data set temporarily.

The attach() function works like this:

attach(mtcars)

So, instead of having to type:

mpg20 <- mtcars$mpg > 20

You can leave out the data set reference and type this instead:

mpg20 <- mpg > 20

After using attach() remember to use the detach function when you're finished:

detach()

Some R users advise avoiding attach() because it can be easy to forget to detach(). If you don't detach() the copy, your variables could end up referencing the wrong data set.
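One alternative not covered above is base R's with() function, which evaluates a single expression using a data frame's columns and leaves nothing attached afterward. A minimal sketch on mtcars (variable names are for illustration only):

```r
# Column names can be used directly inside with(); nothing needs detaching
mpg20 <- with(mtcars, mpg > 20)
print(sum(mpg20))

# Average horsepower of the cars with mpg above 20
mean_hp <- with(mtcars, mean(hp[mpg > 20]))
print(mean_hp)
```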

Alternative to bracket notation

Bracket syntax is pretty common in R code, but it's not your only option. If you dislike that format, you might prefer the subset() function instead, which works with vectors and matrices as well as data frames. The format is:

subset(your data object, logical condition for the rows you want to return, select statement for the columns you want to return)

So, in the mtcars example, to find all rows where mpg is greater than 20 and return only those rows with their mpg and hp data, the subset() statement would look like:

subset(mtcars, mpg>20, c("mpg", "hp"))

What if you wanted to find the row with the highest mpg?

subset(mtcars, mpg==max(mpg))

If you just wanted to see the mpg information for the highest mpg:

subset(mtcars, mpg==max(mpg), mpg)

If you just want to use subset to extract some columns and display all rows, you can either leave the row conditional spot blank with a comma, similar to bracket notation:

subset(mtcars, , c("mpg", "hp"))

Or, indicate your second argument is for columns with select= like this:

subset(mtcars, select=c("mpg", "hp"))
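If you're wondering whether subset() and bracket notation really return the same thing, a quick comparison on mtcars might look like this (a sketch; the variable names are for illustration):

```r
# The same selection written both ways
by_subset  <- subset(mtcars, mpg > 20, select=c("mpg", "hp"))
by_bracket <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

# all.equal() returns TRUE when the two results match
print(all.equal(by_subset, by_bracket))

# The row with the highest mpg, showing only the mpg column
top_mpg <- subset(mtcars, mpg == max(mpg), mpg)
print(top_mpg)
```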

Counting factors

To tally up counts by factor, try the table() function. For the diamonds data set (which comes with the ggplot2 package), to see how many diamonds of each category of cut are in the data, you can use:

table(diamonds$cut)

This will return how many diamonds fall into each level of the cut factor -- fair, good, very good, premium and ideal. Want to see a cross-tab by cut and color?

table(diamonds$cut, diamonds$color)
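If you don't have the diamonds data loaded, the same idea can be sketched with the built-in mtcars data frame, counting cars by number of cylinders and then cross-tabulating cylinders against gears (variable names are for illustration):

```r
# How many cars have 4, 6 or 8 cylinders?
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Cross-tab: cylinders in rows, number of gears in columns
cross_tab <- table(mtcars$cyl, mtcars$gear)
print(cross_tab)
```

The cyl and gear columns are stored as numbers rather than factors, but table() counts each distinct value just the same.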

If you are interested in learning more about statistical functions in R and how to slice and dice your data, there are a number of free academic downloads with many more details. These include Learning statistics with R by Daniel Navarro at the University of Adelaide in Australia (500+ page PDF download, may take a little while). And although not free, books such as The R Cookbook and R in a Nutshell have a lot of good examples and well-written explanations.
