4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

Syntax 4: mapply()

mapply(), like the simpler sapply(), can also apply a function to some -- but not necessarily all -- columns in a data frame, without your having to number each item like x[1] and x[2] above. The mapply() format to create a new column in a data frame is:

dataFrame$newColumn <- mapply(someFunction, dataFrame$column1, dataFrame$column2, dataFrame$column3)

The code above would apply the function someFunction() to the data in column1, column2 and column3 of each row of the data frame.

Note that the first argument of mapply() here is the name of a function, not an equation or formula. So if we want (profit/revenue) * 100 as our result, we could first write our own function to do this calculation and then use it with mapply().

Here's how to create a named function, profitMargin(), that takes two variables -- in this case we're calling them netIncome and revenue just within the function -- and returns the first variable divided by the second variable times 100, rounded to one decimal place:

profitMargin <- function(netIncome, revenue) {
    # Profit as a percentage of revenue
    mar <- (netIncome/revenue) * 100
    # Round to one decimal place; the last expression is the return value
    round(mar, 1)
}

Now we can use that user-created named function with mapply():

companiesData$margin <- mapply(profitMargin, companiesData$profit, companiesData$revenue)

Or we could create an anonymous function within mapply():

companiesData$margin <- mapply(function(x, y) round((x/y) * 100, 1), companiesData$profit, companiesData$revenue)
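To try both versions end to end, here is a minimal sketch that builds a small stand-in for companiesData. The company names and figures are invented for illustration and are not the article's data; margin2 is a hypothetical second column used only to compare the two approaches.

```r
# Toy data: the companies and numbers below are made up for this example
companiesData <- data.frame(
  company = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  revenue = c(100, 120, 200, 250),
  profit  = c(20, 30, 50, 40)
)

# Named function: profit as a percentage of revenue, to one decimal place
profitMargin <- function(netIncome, revenue) {
  round((netIncome / revenue) * 100, 1)
}

# Version 1: named function passed to mapply()
companiesData$margin <- mapply(profitMargin,
                               companiesData$profit, companiesData$revenue)

# Version 2: anonymous function defined inside the mapply() call
companiesData$margin2 <- mapply(function(x, y) round((x / y) * 100, 1),
                                companiesData$profit, companiesData$revenue)

print(companiesData)

# Both columns should hold the same values
all.equal(companiesData$margin, companiesData$margin2)
```

In both calls, mapply() pairs up the first profit with the first revenue, the second with the second, and so on, which is why no row indexes are needed.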

One advantage mapply() has over transform() is that you can use columns from different data frames (although this may not work if the columns have different lengths). Another is that it has an elegant syntax for applying functions to vectors of data when a function takes more than one argument, such as:

mapply(someFunction, vector1, vector2, vector3)
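As a concrete sketch of that pattern, the example below uses a hypothetical three-argument function, addThree(), and three made-up vectors. The second call illustrates what happens with unequal lengths: mapply() recycles the shorter vector, which may or may not be what you intend.

```r
# Three made-up vectors of equal length
vector1 <- c(1, 2, 3)
vector2 <- c(10, 20, 30)
vector3 <- c(100, 200, 300)

# Hypothetical function that takes three arguments
addThree <- function(a, b, c) a + b + c

# Calls addThree(1, 10, 100), addThree(2, 20, 200), addThree(3, 30, 300)
sums <- mapply(addThree, vector1, vector2, vector3)
print(sums)

# With unequal lengths the shorter vector is recycled:
# this computes 1+1, 2+2, 3+1, 4+2
recycled <- mapply(function(x, y) x + y, 1:4, 1:2)
print(recycled)
```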

sapply() has a somewhat different syntax from mapply, and there are yet more functions in R's apply family. I won't go into them further here, but for a more detailed look at base R's various apply options, A brief introduction to 'apply' in R by bioinformatician Neil Saunders is a useful starting point.
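To give a rough idea of the difference: sapply() takes the data first and the function second, and it works through a single vector or list one element at a time. A data frame counts as a list of columns, so this sketch applies mean() to two selected columns of an invented data frame (the names and numbers are assumptions for illustration).

```r
# Toy data: values are made up for this example
companiesData <- data.frame(
  company = c("A", "B", "C"),
  revenue = c(100, 200, 300),
  profit  = c(20, 50, 90)
)

# Apply mean() to some, but not all, columns.
# The result is a named numeric vector, one entry per column.
columnMeans <- sapply(companiesData[, c("revenue", "profit")], mean)
print(columnMeans)
```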

Syntax 5: tidyverse's dplyr

Hadley Wickham's dplyr package, released in early 2014 to rationalize and dramatically speed up operations on data frames, is one of my favorite R packages and definitely worth learning. Install and load the package if you haven't already. To add a column to an existing data frame with dplyr, use dplyr's mutate() function in the format mutate(mydataframe, mynewcol = some equation):

companiesData <- mutate(companiesData, margin = round((profit/revenue) * 100, 1))
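Here is a self-contained sketch of the mutate() approach, assuming dplyr is installed and using invented figures in place of the article's data. It also shows that one mutate() call can add several columns, with a later column referring to one created earlier in the same call; highMargin and its 22 percent cutoff are hypothetical.

```r
# install.packages("dplyr")  # run once if dplyr is not yet installed
library(dplyr)

# Toy data: the companies and numbers below are made up for this example
companiesData <- data.frame(
  company = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  revenue = c(100, 120, 200, 250),
  profit  = c(20, 30, 50, 40)
)

companiesData <- mutate(
  companiesData,
  margin     = round((profit / revenue) * 100, 1),
  highMargin = margin > 22   # uses the margin column defined just above
)

print(companiesData)
```

Note that column names are written bare inside mutate(), without the companiesData$ prefix that the base R versions need.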

Getting summaries by subgroups of your data

It's easy to find, say, the highest profit margin in our data with max(companiesData$margin). To assign the value of the highest profit margin to a variable named highestMargin, this simple code does the trick:

highestMargin <- max(companiesData$margin)

That just returns:

[1] 33.09838

but you don't know anything more about the other variables in the row, such as year and company.

To see the entire row with the highest profit margin, not only the value, this is one option:

highestMargin <- companiesData[companiesData$margin == max(companiesData$margin),]

and here's another:

highestMargin <- subset(companiesData, margin==max(margin))

(For an explanation on these two techniques for extracting subsets of your data, see Get slices or subsets of your data from the Beginner's guide to R: Easy ways to do basic data analysis.)
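The sketch below runs both techniques side by side on a small invented data frame (the companies, years and margins are made up, and highestMarginRow and sameRow are hypothetical variable names), so you can check that they pick out the same row.

```r
# Toy data: values are made up for this example
companiesData <- data.frame(
  company = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  margin  = c(20, 25, 33.1, 16)
)

# Just the value
highestMargin <- max(companiesData$margin)

# Whole row, option 1: logical test inside square brackets
highestMarginRow <- companiesData[companiesData$margin == max(companiesData$margin), ]

# Whole row, option 2: subset(), where column names can be written bare
sameRow <- subset(companiesData, margin == max(margin))

print(highestMarginRow)
print(sameRow)
```

If several rows tie for the highest margin, both options return all of the tied rows.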

But what if you want to find rows with the highest profit margin for each company? That involves applying a function by groups.

The tidyverse treats this as what Hadley Wickham calls a "split-apply-combine" task: Split up your data set by one or more groups, apply some function, then combine the results back into a data set.

dplyr makes this easy. The group_by() function splits the data into groups of your choice, based on column names. Once the data frame is grouped, any subsequent actions you perform in a calculation chain are done within each group. Then you can summarize your data by group using the summarise() or summarize() functions -- that creates a new, summarized data frame; or you can simply add a new column to the existing data frame using mutate() after grouping.

Finally, the pipe operator, %>%, sends the results of one calculation into the next.
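As a small sketch of grouping followed by mutate(), the example below adds each company's best margin as a new column while keeping every original row. It assumes dplyr is installed; the data is invented and bestMargin is a hypothetical column name. The ungroup() call at the end is an optional extra step that removes the grouping so later operations act on the whole data frame again.

```r
library(dplyr)

# Toy data: values are made up for this example
companiesData <- data.frame(
  company = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  margin  = c(20, 25, 30, 16)
)

withBest <- companiesData %>%
  group_by(company) %>%               # split: one group per company
  mutate(bestMargin = max(margin)) %>% # apply: max() runs within each group
  ungroup()                            # drop the grouping afterward

print(withBest)
```

Because this uses mutate() rather than summarise(), the result still has one row per company per year.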

The format for splitting a data frame by multiple groups and then applying a function with dplyr would be:

my_summaries <- mydata %>%
  group_by(column1, column2) %>%
  summarise(
    new_column = myfunction(column name(s) I want the function to act upon)
  )

Let's take a more detailed look at that. The first line starts with the mydata data frame and says that the final result of the whole chain will be stored in a new data frame called my_summaries. The %>% pipe operator sends mydata to the next step in line 2, which groups the data by column1 and column2. Then summarise() creates a new column called new_column, which is the result of myfunction being applied within each group.
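Applied to the earlier question of the highest profit margin for each company, that format might look like the sketch below. It assumes dplyr is installed and uses invented data; highestMargins and bestRows are hypothetical names. The second chain goes a step beyond the format above by using dplyr's filter() to keep the entire row with each company's top margin, not just the value.

```r
library(dplyr)

# Toy data: values are made up for this example
companiesData <- data.frame(
  company = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  margin  = c(20, 25, 30, 16)
)

# One summary row per company
highestMargins <- companiesData %>%
  group_by(company) %>%
  summarise(highestMargin = max(margin))

print(highestMargins)

# Whole rows: keep only the row(s) matching each company's maximum
bestRows <- companiesData %>%
  group_by(company) %>%
  filter(margin == max(margin)) %>%
  ungroup()

print(bestRows)
```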
