Beginner's guide to R: Easy ways to do basic data analysis

Part 3 of our hands-on series covers pulling stats from your data frame, and related topics.

Sometimes a vector of numbers is exactly what you want -- if, for example, you want to quickly plot mtcars$mpg and don't need item labels, or you're looking for statistical info such as variance and mean.
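As a quick illustration of working with a single column as a vector, here is a minimal sketch using only base R and the built-in mtcars data set. The variable name mpg is just a convenience chosen for this example.

```r
# mtcars ships with base R, so no packages are needed.
mpg <- mtcars$mpg   # a plain numeric vector, one value per car

length(mpg)         # how many values the vector holds
mean(mpg)           # average miles per gallon
var(mpg)            # sample variance
summary(mpg)        # minimum, quartiles, median, mean and maximum

# plot(mpg) would draw the values in row order, with no item labels.
```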

Chances are, though, you'll want to subset your data by more than one column at a time. That's when you'll want to use bracket notation, what I think of as rows-comma-columns. Basically, you take the name of your data frame and follow it with [rows,columns]. The rows you want come first, followed by a comma, followed by the columns you want. So, if you want all rows but just columns 2 through 4 of mtcars, you can use:

mtcars[,2:4]

Do you see that comma before the 2:4? That's leaving a blank space where the "which rows do you want?" portion of the bracket notation goes, and it means "I'm not asking for any subset, so return all." Although it's not always required, it's not a bad practice to get into the habit of using a comma in bracket notation so that you remember whether you were slicing by columns or rows.
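To see what the comma changes, the sketch below compares the same index range with the comma before it, with no comma at all, and with the comma after it. The variable names are made up for this example.

```r
cols_with_comma <- mtcars[, 2:4]   # all rows, columns 2 through 4
cols_no_comma   <- mtcars[2:4]     # a single index on a data frame also picks columns
rows_only       <- mtcars[2:4, ]   # rows 2 through 4, all columns

all.equal(cols_with_comma, cols_no_comma)   # the first two give the same result
dim(cols_with_comma)                        # 32 rows, 3 columns
dim(rows_only)                              # 3 rows, 11 columns
```

Writing the comma every time makes it obvious at a glance which of these you meant.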

If you want multiple columns that aren't contiguous, such as columns 2 AND 4 but not 3, you can use the notation:

mtcars[,c(2,4)]

A couple of syntax notes here:

  • R indexes from 1, not 0. So your first column is at [1] and not [0].
  • R is case sensitive everywhere. mtcars$mpg is not the same as mtcars$MPG.
  • mtcars[,-1] will not get you the last column of a data frame, the way negative indexing works in many other languages. Instead, negative indexing in R means exclude that item. So, mtcars[,-1] will return every column except the first one.
  • To create a vector of items that are not contiguous, you need to use the combine function c(). Typing mtcars[,(2,4)] without the c will not work. You need that c in there:

mtcars[,c(2,4)]
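The syntax notes above can be checked directly in the console. This is a small sketch on mtcars; the names no_first and last_col are just illustrative.

```r
names(mtcars)[1]                     # "mpg": the first column sits at index 1

no_first <- mtcars[, -1]             # negative index: everything except column 1
ncol(mtcars)                         # 11
ncol(no_first)                       # 10
"mpg" %in% names(no_first)           # FALSE, the first column was dropped

# One way to get the last column: index by the column count.
last_col <- mtcars[, ncol(mtcars)]
names(mtcars)[ncol(mtcars)]          # "carb"

is.null(mtcars$MPG)                  # TRUE: no column is spelled MPG in upper case
```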

What if you want to select your data by data characteristic, such as "all cars with mpg > 20", and not by column or row location? If you use the column name notation and add a condition like:

mtcars$mpg>20

you don't end up with a list of all rows where mpg is greater than 20. Instead, you get a vector showing whether each row meets the condition, such as:

[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[19] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
[28] TRUE FALSE FALSE FALSE TRUE

To turn that into a listing of the data you want, put that logical test condition in the rows spot of the rows-comma-columns bracket notation. Remember that this time you want to select rows by condition, not columns. This:

mtcars[mtcars$mpg>20,]

tells R to get all rows from mtcars where mpg > 20, and then to return all the columns.
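It can help to see the two steps separately: first build the TRUE/FALSE vector, then hand it to the rows spot. The sketch below does that, with over_20 and efficient as names chosen only for this example.

```r
over_20 <- mtcars$mpg > 20       # one TRUE or FALSE per row of mtcars
length(over_20)                  # 32, the same as nrow(mtcars)
sum(over_20)                     # TRUE counts as 1, so this counts matching rows: 14

efficient <- mtcars[over_20, ]   # keep only the rows marked TRUE, with all columns
nrow(efficient)                  # 14
all(efficient$mpg > 20)          # TRUE: every remaining car is above 20 mpg
```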

If you don't want to see all the column data for the selected rows but are just interested in displaying, say, mpg and horsepower for cars with an mpg greater than 20, you could use the notation:

mtcars[mtcars$mpg>20,c(1,4)]

using column locations, or:

mtcars[mtcars$mpg>20,c("mpg","hp")]

using the column names.
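The two forms should return the same thing, which a short check can confirm. The variable names below are illustrative only.

```r
by_position <- mtcars[mtcars$mpg > 20, c(1, 4)]
by_name     <- mtcars[mtcars$mpg > 20, c("mpg", "hp")]

names(mtcars)[c(1, 4)]              # "mpg" "hp": positions 1 and 4 are these columns
all.equal(by_position, by_name)     # TRUE
```

Names tend to be the safer choice in code you plan to reuse, since they keep working if the column order changes.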

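Why do you need to specify mtcars$mpg in the row spot but "mpg" in the column spot? It looks like just another R syntax quirk, but there is a reason: the row spot needs the actual values in the column so R can test each one against the condition, while the column spot only needs the names of the columns you want returned.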
