The Excel PivotTable is plain awesome. In this post I will show you how to make a PivotTable in R (kind of).
We will use two popular libraries,
reshape2. This exercise is doable with base R (aggregate(), apply() and others), but would leave much to be desired.
One important distinction to keep in mind is that while PivotTables are useful for both analysis and reporting, R is inherently not a great tool for reporting. For example, column and row totals are commonplace in Excel reports, but not in R. Other formatting elements like cascaded columns in PivotTables are also not practical in R. What you do get in R is the ability to summarize and reshape data frames in the same way you do with PivotTables.
We will replicate these three features:
- Basic by-group summaries with filters (sum, count, calculated fields, etc)
- Transpose columns and rows
- Advanced by-group summaries and manipulation
We will use data on marijuana prices, scraped from priceofweed.com by the folks at Mode Analytics. The data import code can be found at the end of this article.
I know nothing about marijuana, so this could be interesting…
Basic PivotTable Summaries
Here is a basic PivotTable:
It’s a three-step process with the
dplyr library in R:
# 1: filter to keep three states. basic_summ = filter(mprices, state %in% c("California", "New York", "Illinois")) # 2: set up data frame for by-group processing. basic_summ = group_by(basic_summ, quality, state) # 3: calculate the three summary metrics basic_summ = summarise(basic_summ, sum_amount = sum(amount), avg_ppo = mean(ppo), avg_ppo2 = sum(price) / sum(amount)) basic_summ
quality state sum_amount avg_ppo avg_ppo2 1 high quality California 5494.925 277.9640 226.9336 2 high quality Illinois 1538.014 376.3324 310.5967 3 high quality New York 2251.540 375.3289 306.3876 4 low quality California 356.148 275.0264 189.7413 5 low quality Illinois 118.911 227.0347 142.5184 6 low quality New York 170.510 349.5067 175.5440 7 medium quality California 6053.553 212.2781 165.8847 8 medium quality Illinois 1351.176 306.1785 218.0745 9 medium quality New York 2316.076 287.7946 222.6205
That’s it! The numbers match the Excel table. avg_ppo is the straight average of the ppo column, while avg_ppo2 is like a calculated field in a Pivot Table.
Note also, as I pointed out earlier, R is not a good tool for reporting, per se. You don’t get a beautifully formatted table as you could in Excel, though there may be packages to help with that.
Sometimes we need to restructure the data set for particular analyses. For example, with a PivotTable we might do this:
I moved the “quality” column to the top of the PivotTable. To transpose the data frame like that, we can use the
dcast() function from the reshape2 library.
dcast() can be used to turn a long data to wide format, while
melt() can be used for turning a wide data set to long:
basic_summ_t = dcast(basic_summ, state ~ quality, value.var = "avg_ppo2") basic_summ_t
state high quality low quality medium quality 1 California 226.9336 189.7413 165.8847 2 Illinois 310.5967 142.5184 218.0745 3 New York 306.3876 175.5440 222.6205
The second parameter in the function,
quality ~ state is written as a formula. You can think of the variable on the left, quality, as the PivotTable row item, and the right, state, as the PivotTable column item.
This was a simple case when we only had one metric, avg_ppo2. But what if we wanted to show multiple metrics, as we do in this PivotTable:
Here we kept two metrics. Let’s do this in R, in a two-step process:
# first melt the data frame to put all the metrics in a single column basic_summ_t2 = melt(basic_summ, id.vars = c("state","quality"), measure.vars = c("avg_ppo2", "sum_amount")) basic_summ_t2
This is what that intermediate data frame looks like. We have converted the data frame from wide to long using the
melt() function. As you can see, all the metrics are now in a single column, value, identified by the variable column.
state quality variable value 1 California high quality avg_ppo2 226.9336 2 Illinois high quality avg_ppo2 310.5967 3 New York high quality avg_ppo2 306.3876 4 California low quality avg_ppo2 189.7413 5 Illinois low quality avg_ppo2 142.5184 6 New York low quality avg_ppo2 175.5440 7 California medium quality avg_ppo2 165.8847 8 Illinois medium quality avg_ppo2 218.0745 9 New York medium quality avg_ppo2 222.6205 10 California high quality sum_amount 5494.9250 11 Illinois high quality sum_amount 1538.0140 12 New York high quality sum_amount 2251.5400 13 California low quality sum_amount 356.1480 14 Illinois low quality sum_amount 118.9110 15 New York low quality sum_amount 170.5100 16 California medium quality sum_amount 6053.5530 17 Illinois medium quality sum_amount 1351.1760 18 New York medium quality sum_amount 2316.0760
So now we can transpose that data frame as we did before with
dcast(), just adding one more variable:
# then transpose the quality and variable co basic_summ_t2 = dcast(basic_summ_t2, state ~ quality + variable, value.var = "value") basic_summ_t2
Notice the only difference between this dcast() and the one earlier is we added the
+ variable to the formula.
state high quality_avg_ppo2 high quality_sum_amount low quality_avg_ppo2 low quality_sum_amount medium quality_avg_ppo2 medium quality_sum_amount 1 California 226.9336 5494.925 189.7413 356.148 165.8847 6053.553 2 Illinois 310.5967 1538.014 142.5184 118.911 218.0745 1351.176 3 New York 306.3876 2251.540 175.5440 170.510 222.6205 2316.076
There you have it. We have transposed a data frame like a PivotTable.
Advanced by-group processing
There are many operations not supported by PivotTables that are PivotTable-like in nature. For example, taking the first PivotTable above, what if we wanted to rank each state within each quality level, based on average price per ounce? You can do that like this:
# first set up the data frame for by-group processing, by quality basic_summ_rank = group_by(basic_summ, quality) # add new column ranking states based on avg_ppo2 basic_summ_rank = mutate(basic_summ_rank, ppo_rank = rank(avg_ppo2)) basic_summ_rank
quality state sum_amount avg_ppo avg_ppo2 ppo_rank 1 high quality California 5494.925 277.9640 226.9336 1 2 high quality Illinois 1538.014 376.3324 310.5967 3 3 high quality New York 2251.540 375.3289 306.3876 2 4 low quality California 356.148 275.0264 189.7413 3 5 low quality Illinois 118.911 227.0347 142.5184 1 6 low quality New York 170.510 349.5067 175.5440 2 7 medium quality California 6053.553 212.2781 165.8847 1 8 medium quality Illinois 1351.176 306.1785 218.0745 2 9 medium quality New York 2316.076 287.7946 222.6205 3
The ppo_rank ranks the three states for high, medium and low qualities separately. E.g., California has the cheapest high-quality stuff but the most expensive low-quality. The key to this is the
group_by() function sets the data frame up for by-group processing. In this case, we are running a process separately for each value of quality.
Notice all the syntax looks similar to the group_by() + summarise() pair we saw earlier.
summarise() collapses the data set and creates a summary, while mutate maintains the same number of rows and simply creates a new column.
Hopefully you got a sense of how to replicate PivotTable-like functionality in R. As with that last example, there are a lot of things you can do with dplyr that PivotTables cannot do.
Setting up the data
Here is how the data was loaded. We will import the data stored on a web server:
# import data from github. Reading http:// is straightforward # A few extra steps needed for https urls mprices = tempfile() download.file("https://raw.githubusercontent.com/mode/blog/master/2013-10-03%20MMap/m_prices.csv",destfile=mprices, method="curl") mprices = read.csv(mprices, stringsAsFactors = FALSE) head(mprices)
state location price quantity quality date_old amount quality_score ppo date state_code 1 Arizona Phoenix, Arizona 220 an ounce high quality 2012-01-01 1 3 220 2.01201e+11 4 2 California Sacramento, California 280 an ounce high quality 2012-01-01 1 3 280 2.01201e+11 6 3 Colorado Grand Junction, Colorado 300 an ounce high quality 2012-01-01 1 3 300 2.01201e+11 8 4 Michigan West Bloomfield, Michigan 20 an ounce high quality 2012-01-01 1 3 20 2.01201e+11 26 5 Minnesota Saint Cloud, Minnesota 300 an ounce medium quality 2012-01-01 1 2 300 2.01201e+11 27 6 Mississippi Tupelo, Mississippi 120 an ounce high quality 2012-01-01 1 3 120 2.01201e+11 28
This includes data on marijuana prices, scraped from priceofweed.com by the folks at href=”http://blog.modeanalytics.com/mapping-marijuana-prices/” target=”_blank”>Mode Analytics. The data frame has about 130k rows of data, and the columns are generally straightforward. Each row represents a marijuana purchase as recorded anonymously by users.