To get the highest profit margins for each company, we're splitting the data frame by only one factor -- company. To get just the highest value and company name for each company, use summarize :
highestProfitMargins <- companiesData %>% group_by(company) %>% summarize( bestMargin = max(margin) )
(Here we've assigned the results to the variable highestProfitMargins, which will contain a new data frame.)
company | bestMargin | |
---|---|---|
1 | Apple | 26.7 |
2 | 29.0 | |
3 | Microsoft | 33.1 |
Summarize doesn't give any information from other columns in the original data frame. In what year did each of the highest margins occur? We can't tell by using summarize.
If you want all the other column data, too, change "summarize" to "mutate." That will return your existing data frame with a new column that repeats the maximum margin for each company:
highestProfitMargins <- companiesData %>% group_by(company) %>% mutate( bestMargin = max(margin) )
fy | company | revenue | profit | margin | bestMargin | |
---|---|---|---|---|---|---|
1 | 2010 | Apple | 65225 | 14013 | 21.5 | 26.7 |
2 | 2011 | Apple | 108248 | 25922 | 23.9 | 26.7 |
3 | 2012 | Apple | 156508 | 41733 | 26.7 | 26.7 |
4 | 2010 | 29321 | 8505 | 29.0 | 29.0 | |
5 | 2011 | 37905 | 9737 | 25.7 | 29.0 | |
6 | 2012 | 50175 | 10737 | 21.4 | 29.0 | |
7 | 2010 | Microsoft | 62484 | 18760 | 30.0 | 33.1 |
8 | 2011 | Microsoft | 69943 | 23150 | 33.1 | 33.1 |
9 | 2012 | Microsoft | 73723 | 16978 | 23.0 | 33.1 |
Note that this result shows the profit margin for each company and year in the margin column along with the bestMargin repeated for each company and year. The only way to tell which year has the best margin is to compare the two columns to see where they're equal.
summarize() and mutate() let you apply more than one function at a time, for example:
highestProfitMargins <- companiesData %>% group_by(company) %>% mutate( highestMargin = max(margin), lowestMargin = min(margin) )
This gets you:
fy | company | revenue | profit | margin | highestMargin | lowestMargin | |
---|---|---|---|---|---|---|---|
1 | 2010 | Apple | 65225 | 14013 | 21.5 | 26.7 | 21.5 |
2 | 2011 | Apple | 108248 | 25922 | 23.9 | 26.7 | 21.5 |
3 | 2012 | Apple | 156508 | 41733 | 26.7 | 26.7 | 21.5 |
4 | 2010 | 29321 | 8505 | 29.0 | 29.0 | 21.4 | |
5 | 2011 | 37905 | 9737 | 25.7 | 29.0 | 21.4 | |
6 | 2012 | 50175 | 10737 | 21.4 | 29.0 | 21.4 | |
7 | 2010 | Microsoft | 62484 | 18760 | 30.0 | 33.1 | 23.0 |
8 | 2011 | Microsoft | 69943 | 23150 | 33.1 | 33.1 | 23.0 |
9 | 2012 | Microsoft | 73723 | 16978 | 23.0 | 33.1 | 23.0 |
In some cases, though, what you want is a new data frame with just the (entire) rows that have the highest profit margins. One way to do that is to add a filter()
statement to the pipe. filter() takes the syntax filter(mydata, conditional statement)
, or in this case:
highestProfitMargins <- companiesData %>% group_by(company) %>% mutate( highestMargin = max(margin) ) %>% filter(margin == highestMargin)
That last line says "only keep rows within each group where the value of the margin column equals the value of the highestMargin column." Also note the double equals sign, which means "test whether the left side of the equation equals the right side of the equation." A single equals sign assigns the value of the right side of the equation to the variable on the left side.
fy | company | revenue | profit | margin | |
---|---|---|---|---|---|
1 | 2012 | Apple | 156508 | 41733 | 26.7 |
2 | 2010 | 29321 | 8505 | 29.0 | |
3 | 2011 | Microsoft | 69943 | 23150 | 33.1 |
Again, a reminder that %>%
is a "chaining" operation that allows you to string together multiple commands on a data frame.
Note that highestProfitMargins and myresults are a special type of data frame created by dplyr called a tibble. If you have problems running more conventional non-dplyr operations on a dplyr result, which has happened to me on occasion, convert it to a "regular" data frame with as.data.frame(), such as
highestProfitMargins <- as.data.frame(highestProfitMargins)