4 data wrangling tasks in R for advanced beginners

Learn how to add columns, get summaries, sort your results and reshape your data.

Comments

To get the highest profit margins for each company, we're splitting the data frame by only one factor -- company. To get just the highest value and company name for each company, use summarize :

highestProfitMargins <- companiesData %>%
  group_by(company) %>%
  summarize(
    bestMargin = max(margin)
  )

(Here we've assigned the results to the variable highestProfitMargins, which will contain a new data frame.)

	company	bestMargin
1	Apple	26.7
2	Google	29.0
3	Microsoft	33.1

Summarize doesn't give any information from other columns in the original data frame. In what year did each of the highest margins occur? We can't tell by using summarize.

If you want all the other column data, too, change "summarize" to "mutate." That will return your existing data frame with a new column that repeats the maximum margin for each company:

highestProfitMargins <- companiesData %>%
  group_by(company) %>%
  mutate(
    bestMargin = max(margin)
  )

	fy	company	revenue	profit	margin	bestMargin
1	2010	Apple	65225	14013	21.5	26.7
2	2011	Apple	108248	25922	23.9	26.7
3	2012	Apple	156508	41733	26.7	26.7
4	2010	Google	29321	8505	29.0	29.0
5	2011	Google	37905	9737	25.7	29.0
6	2012	Google	50175	10737	21.4	29.0
7	2010	Microsoft	62484	18760	30.0	33.1
8	2011	Microsoft	69943	23150	33.1	33.1
9	2012	Microsoft	73723	16978	23.0	33.1

Note that this result shows the profit margin for each company and year in the margin column along with the bestMargin repeated for each company and year. The only way to tell which year has the best margin is to compare the two columns to see where they're equal.

summarize() and mutate() let you apply more than one function at a time, for example:

highestProfitMargins <- companiesData %>%
  group_by(company) %>%
  mutate(
    highestMargin = max(margin),
    lowestMargin = min(margin)
  )

This gets you:

	fy	company	revenue	profit	margin	highestMargin	lowestMargin
1	2010	Apple	65225	14013	21.5	26.7	21.5
2	2011	Apple	108248	25922	23.9	26.7	21.5
3	2012	Apple	156508	41733	26.7	26.7	21.5
4	2010	Google	29321	8505	29.0	29.0	21.4
5	2011	Google	37905	9737	25.7	29.0	21.4
6	2012	Google	50175	10737	21.4	29.0	21.4
7	2010	Microsoft	62484	18760	30.0	33.1	23.0
8	2011	Microsoft	69943	23150	33.1	33.1	23.0
9	2012	Microsoft	73723	16978	23.0	33.1	23.0

In some cases, though, what you want is a new data frame with just the (entire) rows that have the highest profit margins. One way to do that is to add a filter() statement to the pipe. filter() takes the syntax filter(mydata, conditional statement), or in this case:

highestProfitMargins <- companiesData %>%
  group_by(company) %>%
  mutate(
    highestMargin = max(margin)
  ) %>%
  filter(margin == highestMargin)

That last line says "only keep rows within each group where the value of the margin column equals the value of the highestMargin column." Also note the double equals sign, which means "test whether the left side of the equation equals the right side of the equation." A single equals sign assigns the value of the right side of the equation to the variable on the left side.

	fy	company	revenue	profit	margin
1	2012	Apple	156508	41733	26.7
2	2010	Google	29321	8505	29.0
3	2011	Microsoft	69943	23150	33.1

Again, a reminder that %>% is a "chaining" operation that allows you to string together multiple commands on a data frame.

Note that highestProfitMargins and myresults are a special type of data frame created by dplyr called a tibble. If you have problems running more conventional non-dplyr operations on a dplyr result, which has happened to me on occasion, convert it to a "regular" data frame with as.data.frame(), such as

highestProfitMargins <- as.data.frame(highestProfitMargins)

Join the newsletter!

Error: Please check your email address.

More about Advanced Apple GitHub Google Microsoft