SQL is a fairly common skill set among business analysts. If you’re one of those lucky ones, then good news: you could write SELECT SQL in R. This is how I got started with R, and if you already know SQL, then I recommend it as a gateway into R. I will show some examples and cap off with an interactive googleVis line graph.
(If you don’t know SQL, I don’t necessarily recommend you learn just for the sake of using it in R. It is not as efficient as other techniques if your data sets are large. It is a great skill to have so you should probably learn it anyway. In this post I will assume you are already familiar with SQL, and I will not go into detail).
First, the Lingo
- A package, aka library, is a collection of functions someone has written to perform certain analysis / manipulation / etc. It’s like the Excel Add-Ins, such as Analysis Data Pack, which lets you do regression and a few other statistical analyses. In R, you install a library once with the
install.packages()function, and then every time you want to use functions from that library, you load it with the library() function
- A data frame in R is like a data table in Excel. It has rows and columns; columns have names, and all values within a column are of the same type (e.g., date, integer, string, etc). You would normally import a data set into R from somewhere, like a website or your company’s database using functions like
- A function in R is like a formula in Excel. It takes some input and returns something useful as output.
Let’s Use SQL in R…
For many examples in my blog, I will use Sean Lahman’s baseball database. If you don’t like baseball, don’t worry; this will not be a baseball geekfest. We’re using it because it is plentiful and free (and also because I love baseball myself). So please download the 2013 comma-delimited version, and unzip the files somewhere on your computer.
library(sqldf) salaries = read.csv("/Users/johnt/Google Drive/Baseball Data/salaries.csv") avg_salary_by_year = sqldf("SELECT yearID, AVG(salary) as avg_salary FROM salaries GROUP BY yearID") #spot check the first few rows head(avg_salary_by_year)
yearID avg_salary 1 1985 476299.4 2 1986 417147.0 3 1987 434729.5 4 1988 453171.1 5 1989 506323.1 6 1990 511973.7
If you know SQL, this is pretty straightforward; we’re looking at average salaries by year. You can also do JOINs instead of using R’s
#import the Teams.csv file teams = read.csv("/Users/johnt/Google Drive/Baseball Data/Teams.csv") salaries_teams = sqldf("SELECT a.*, b.lgID, b.name FROM salaries a INNER JOIN teams b ON(a.yearID = b.yearID AND a.teamID = b.teamID)")
Here we did an INNER JOIN, but we can do LEFT JOINs too. Here is a snippet of the result:
yearID teamID lgID playerID salary lgID name 1 1985 BAL AL murraed02 1472819 AL Baltimore Orioles 2 1985 BAL AL lynnfr01 1090000 AL Baltimore Orioles 3 1985 BAL AL ripkeca01 800000 AL Baltimore Orioles 4 1985 BAL AL lacyle01 725000 AL Baltimore Orioles 5 1985 BAL AL flanami01 641667 AL Baltimore Orioles 6 1985 BAL AL boddimi01 625000 AL Baltimore Orioles
Question: how have average salaries differed between the leagues?
With our data all set up, let’s use the SQL CASE statement, and then plot it with the wonderful googleVis package:
al_nl_salary = sqldf("SELECT yearID, AVG(CASE WHEN lgID = 'AL' THEN salary ELSE NULL END) as avg_AL_salary, AVG(CASE WHEN lgID = 'NL' THEN salary ELSE NULL END) as avg_NL_salary FROM salaries_teams GROUP BY 1") ## one-time install install.packages("googleVis") library(googleVis) gvisLineChart(al_nl_salary, xvar = "yearID", yvar = c("avg_AL_salary", "avg_NL_salary"), )
The American League (blue) and National League (red) had been practically trending together from 1985, and after the 2005 season the AL really started running away with it. Baseball fans would guess certain teams like the NY Yankees, Anaheim Angels and Detroit Tigers might be driving that increase.
If you know SQL, and you’re new to R, then use the sqldf package. But as your data sets get larger (200K+ rows) and more complex, you will want to upgrade to more powerful libraries, like dplyr, because sqldf gets pretty slow. If you don’t know SQL, then you can skip this package altogether (but still learn SQL for other reasons).