How to Make a PivotTable in R

The Excel PivotTable is plain awesome. In this post I will show you how to make a PivotTable in R (kind of).

We will use two popular libraries, dplyr and reshape2. This exercise is doable with base R (aggregate(), apply() and others), but would leave much to be desired.

One important distinction to keep in mind is that while PivotTables are useful for both analysis and reporting, R is inherently not a great tool for reporting. For example, column and row totals are commonplace in Excel reports, but not in R. Other formatting elements like cascaded columns in PivotTables are also not practical in R. What you do get in R is the ability to summarize and reshape data frames in the same way you do with PivotTables.

Goal

We will replicate these three features:

  1. Basic by-group summaries with filters (sum, count, calculated fields, etc)
  2. Transpose columns and rows
  3. Advanced by-group summaries and manipulation

We will use data on marijuana prices, scraped from priceofweed.com by the folks at Mode Analytics. The data import code can be found at the end of this article.

I know nothing about marijuana, so this could be interesting…

Basic PivotTable Summaries

Here is a basic PivotTable:

PivotTable in R

It’s a three-step process with the dplyr library in R:

# 1: filter to keep three states.  
basic_summ = filter(mprices, state %in% c("California", "New York", "Illinois"))

# 2: set up data frame for by-group processing.  
basic_summ = group_by(basic_summ, quality, state)

# 3: calculate the three summary metrics
basic_summ = summarise(basic_summ, 
                        sum_amount = sum(amount),
                        avg_ppo = mean(ppo),
                        avg_ppo2 = sum(price) / sum(amount))

basic_summ
         quality      state sum_amount  avg_ppo avg_ppo2
1   high quality California   5494.925 277.9640 226.9336
2   high quality   Illinois   1538.014 376.3324 310.5967
3   high quality   New York   2251.540 375.3289 306.3876
4    low quality California    356.148 275.0264 189.7413
5    low quality   Illinois    118.911 227.0347 142.5184
6    low quality   New York    170.510 349.5067 175.5440
7 medium quality California   6053.553 212.2781 165.8847
8 medium quality   Illinois   1351.176 306.1785 218.0745
9 medium quality   New York   2316.076 287.7946 222.6205

That’s it! The numbers match the Excel table. avg_ppo is the straight average of the ppo column, while avg_ppo2 is like a calculated field in a Pivot Table.

Note also, as I pointed out earlier, R is not a good tool for reporting, per se. You don’t get a beautifully formatted table as you could in Excel, though there may be packages to help with that.

Transposed Summaries

Sometimes we need to restructure the data set for particular analyses. For example, with a PivotTable we might do this:

Transpose in R

I moved the “quality” column to the top of the PivotTable. To transpose the data frame like that, we can use the dcast() function from the reshape2 library. dcast() can be used to turn a long data to wide format, while melt() can be used for turning a wide data set to long:

basic_summ_t = dcast(basic_summ, state ~ quality, value.var = "avg_ppo2")
basic_summ_t
       state high quality low quality medium quality
1 California     226.9336    189.7413       165.8847
2   Illinois     310.5967    142.5184       218.0745
3   New York     306.3876    175.5440       222.6205

The second parameter in the function, quality ~ state is written as a formula. You can think of the variable on the left, quality, as the PivotTable row item, and the right, state, as the PivotTable column item.

This was a simple case when we only had one metric, avg_ppo2. But what if we wanted to show multiple metrics, as we do in this PivotTable:

Transpose data frame in R

Here we kept two metrics. Let’s do this in R, in a two-step process:

# first melt the data frame to put all the metrics in a single column
basic_summ_t2 = melt(basic_summ, id.vars = c("state","quality"), 
                                 measure.vars = c("avg_ppo2", "sum_amount"))
basic_summ_t2

This is what that intermediate data frame looks like. We have converted the data frame from wide to long using the melt() function. As you can see, all the metrics are now in a single column, value, identified by the variable column.

        state        quality   variable     value
1  California   high quality   avg_ppo2  226.9336
2    Illinois   high quality   avg_ppo2  310.5967
3    New York   high quality   avg_ppo2  306.3876
4  California    low quality   avg_ppo2  189.7413
5    Illinois    low quality   avg_ppo2  142.5184
6    New York    low quality   avg_ppo2  175.5440
7  California medium quality   avg_ppo2  165.8847
8    Illinois medium quality   avg_ppo2  218.0745
9    New York medium quality   avg_ppo2  222.6205
10 California   high quality sum_amount 5494.9250
11   Illinois   high quality sum_amount 1538.0140
12   New York   high quality sum_amount 2251.5400
13 California    low quality sum_amount  356.1480
14   Illinois    low quality sum_amount  118.9110
15   New York    low quality sum_amount  170.5100
16 California medium quality sum_amount 6053.5530
17   Illinois medium quality sum_amount 1351.1760
18   New York medium quality sum_amount 2316.0760

So now we can transpose that data frame as we did before with dcast(), just adding one more variable:

# then transpose the quality and variable co
basic_summ_t2 = dcast(basic_summ_t2, state ~ quality + variable, value.var = "value")
basic_summ_t2

Notice the only difference between this dcast() and the one earlier is we added the + variable to the formula.

       state high quality_avg_ppo2 high quality_sum_amount low quality_avg_ppo2 low quality_sum_amount medium quality_avg_ppo2 medium quality_sum_amount
1 California              226.9336                5494.925             189.7413                356.148                165.8847                  6053.553
2   Illinois              310.5967                1538.014             142.5184                118.911                218.0745                  1351.176
3   New York              306.3876                2251.540             175.5440                170.510                222.6205                  2316.076

There you have it. We have transposed a data frame like a PivotTable.

Advanced by-group processing

There are many operations not supported by PivotTables that are PivotTable-like in nature. For example, taking the first PivotTable above, what if we wanted to rank each state within each quality level, based on average price per ounce? You can do that like this:

# first set up the data frame for by-group processing, by quality
basic_summ_rank = group_by(basic_summ, quality)

# add new column ranking states based on avg_ppo2
basic_summ_rank = mutate(basic_summ_rank, ppo_rank = rank(avg_ppo2))
basic_summ_rank
         quality      state sum_amount  avg_ppo avg_ppo2 ppo_rank
1   high quality California   5494.925 277.9640 226.9336        1
2   high quality   Illinois   1538.014 376.3324 310.5967        3
3   high quality   New York   2251.540 375.3289 306.3876        2
4    low quality California    356.148 275.0264 189.7413        3
5    low quality   Illinois    118.911 227.0347 142.5184        1
6    low quality   New York    170.510 349.5067 175.5440        2
7 medium quality California   6053.553 212.2781 165.8847        1
8 medium quality   Illinois   1351.176 306.1785 218.0745        2
9 medium quality   New York   2316.076 287.7946 222.6205        3

The ppo_rank ranks the three states for high, medium and low qualities separately. E.g., California has the cheapest high-quality stuff but the most expensive low-quality. The key to this is the group_by() function sets the data frame up for by-group processing. In this case, we are running a process separately for each value of quality.

Notice all the syntax looks similar to the group_by() + summarise() pair we saw earlier. summarise() collapses the data set and creates a summary, while mutate maintains the same number of rows and simply creates a new column.

Conclusion

Hopefully you got a sense of how to replicate PivotTable-like functionality in R. As with that last example, there are a lot of things you can do with dplyr that PivotTables cannot do.

Appendix

Setting up the data

Here is how the data was loaded. We will import the data stored on a web server:

# import data from github.  Reading http:// is straightforward
# A few extra steps needed for https urls
mprices = tempfile()
download.file("https://raw.githubusercontent.com/mode/blog/master/2013-10-03%20MMap/m_prices.csv",destfile=mprices, method="curl")
mprices = read.csv(mprices, stringsAsFactors = FALSE)
head(mprices)
        state                  location price quantity        quality   date_old amount quality_score ppo        date state_code
1     Arizona          Phoenix, Arizona   220 an ounce   high quality 2012-01-01      1             3 220 2.01201e+11          4
2  California    Sacramento, California   280 an ounce   high quality 2012-01-01      1             3 280 2.01201e+11          6
3    Colorado  Grand Junction, Colorado   300 an ounce   high quality 2012-01-01      1             3 300 2.01201e+11          8
4    Michigan West Bloomfield, Michigan    20 an ounce   high quality 2012-01-01      1             3  20 2.01201e+11         26
5   Minnesota    Saint Cloud, Minnesota   300 an ounce medium quality 2012-01-01      1             2 300 2.01201e+11         27
6 Mississippi       Tupelo, Mississippi   120 an ounce   high quality 2012-01-01      1             3 120 2.01201e+11         28

This includes data on marijuana prices, scraped from priceofweed.com by the folks at href=”http://blog.modeanalytics.com/mapping-marijuana-prices/” target=”_blank”>Mode Analytics. The data frame has about 130k rows of data, and the columns are generally straightforward. Each row represents a marijuana purchase as recorded anonymously by users.

Leave a Reply

14 Comments on "How to Make a PivotTable in R"

avatar

Viola
Guest
Viola
2 years 8 months ago

Hi John,

Great article, thank you! Do you know how to add Grand Totals under the summary in R (just like you have in your excel example)?

Thanks,
Viola

Carlos Rodrigues
Guest
Carlos Rodrigues
2 years 4 months ago

Hello John, Thanks for the paper !

Im a newbie on this, just started this week playing around with R, i need to it in order to analyse some data. My questions is if it is possible to get data from 2 different csv files to build a pivot table in R ? like to correlate variables…

Cheers
Carlos

Nabha
Guest
Nabha
1 year 8 months ago

Hi John,
Thanks for sharing, really simple approach for excel users.

When I tried doing it on my data set – the variables are being read as character and not numeric. I used the as.numeric function but not getting the desired output – can you please advice.

summarise(subcat_summary,
sales = sum(as.numeric(SALES)))

Thanks
Nabha

Jenny
Guest
Jenny
7 months 2 days ago

Hi John,

I have Date to be added to the Columns, can it be done

Vasu
Guest
Vasu
4 months 15 days ago

Heloooo Is anyone here…
I need one help.

I have one data with the different product names with different amount.

i want all the product in columm. can you suggest some coding.

wpDiscuz