Using SQL in R

SQL is a fairly common skill set among business analysts. If you’re one of those lucky ones, then good news: you could write SELECT SQL in R. This is how I got started with R, and if you already know SQL, then I recommend it as a gateway into R. I will show some examples and cap off with an interactive googleVis line graph.

(If you don’t know SQL, I don’t necessarily recommend you learn just for the sake of using it in R. It is not as efficient as other techniques if your data sets are large.  It is a great skill to have so you should probably learn it anyway.  In this post I will assume you are already familiar with SQL, and I will not go into detail).

First, the Lingo

  • A package, aka library, is a collection of functions someone has written to perform certain analysis / manipulation / etc.  It’s like the Excel Add-Ins, such as Analysis Data Pack, which lets you do regression and a few other statistical analyses.  In R, you install a library once with the install.packages() function, and then every time you want to use functions from that library, you load it with the library() function
  • A data frame in R is like a data table in Excel. It has rows and columns; columns have names, and all values within a column are of the same type (e.g., date, integer, string, etc).  You would normally import a data set into R from somewhere, like a website or your company’s database using functions like read.csv() or read.table()
  • A function in R is like a formula in Excel.  It takes some input and returns something useful as output.

Let’s Use SQL in R…

For many examples in my blog, I will use Sean Lahman’s baseball database. If you don’t like baseball, don’t worry; this will not be a baseball geekfest. We’re using it because it is plentiful and free (and also because I love baseball myself). So please download the 2013 comma-delimited version, and unzip the files somewhere on your computer.

library(sqldf)

salaries = read.csv("/Users/johnt/Google Drive/Baseball Data/salaries.csv")

avg_salary_by_year = sqldf("SELECT yearID, AVG(salary) as avg_salary FROM salaries GROUP BY yearID")

#spot check the first few rows
head(avg_salary_by_year)
yearID avg_salary
1   1985   476299.4
2   1986   417147.0
3   1987   434729.5
4   1988   453171.1
5   1989   506323.1
6   1990   511973.7

If you know SQL, this is pretty straightforward; we’re looking at average salaries by year. You can also do JOINs instead of using R’s merge() function:

#import the Teams.csv file
teams = read.csv("/Users/johnt/Google Drive/Baseball Data/Teams.csv")
salaries_teams = sqldf("SELECT a.*, b.lgID, b.name
                       FROM salaries a
                       INNER JOIN teams b ON(a.yearID = b.yearID AND a.teamID = b.teamID)")

Here we did an INNER JOIN, but we can do LEFT JOINs too. Here is a snippet of the result:

head(salaries_teams)
yearID teamID lgID  playerID  salary lgID              name
1   1985    BAL   AL murraed02 1472819   AL Baltimore Orioles
2   1985    BAL   AL  lynnfr01 1090000   AL Baltimore Orioles
3   1985    BAL   AL ripkeca01  800000   AL Baltimore Orioles
4   1985    BAL   AL  lacyle01  725000   AL Baltimore Orioles
5   1985    BAL   AL flanami01  641667   AL Baltimore Orioles
6   1985    BAL   AL boddimi01  625000   AL Baltimore Orioles

Question: how have average salaries differed between the leagues?

With our data all set up, let’s use the SQL CASE statement, and then plot it with the wonderful googleVis package:

al_nl_salary = sqldf("SELECT yearID,
AVG(CASE WHEN lgID = 'AL' THEN salary ELSE NULL END) as avg_AL_salary,
AVG(CASE WHEN lgID = 'NL' THEN salary ELSE NULL END) as avg_NL_salary
FROM salaries_teams GROUP BY 1")

## one-time install
install.packages("googleVis")
library(googleVis)
gvisLineChart(al_nl_salary, xvar = "yearID", yvar = c("avg_AL_salary", "avg_NL_salary"),
)

[sc name=”avg_salary_by_league”]

The American League (blue) and National League (red) had been practically trending together from 1985, and after the 2005 season the AL really started running away with it. Baseball fans would guess certain teams like the NY Yankees, Anaheim Angels and Detroit Tigers might be driving that increase.

Conclusion

If you know SQL, and you’re new to R, then use the sqldf package. But as your data sets get larger (200K+ rows) and more complex, you will want to upgrade to more powerful libraries, like dplyr, because sqldf gets pretty slow. If you don’t know SQL, then you can skip this package altogether (but still learn SQL for other reasons).

Leave a Reply

5 Comments on "Using SQL in R"

avatar

Edward Lee
Guest
Edward Lee
3 years 2 days ago

When using R + sql behind a firewall or on secured network, is there a way to load data into R without downloading first as txt or csv? How can R connect to the sql database seamlessly?

Dr. Duru
Guest
Dr. Duru
17 days 5 hours ago

It seems the package sqldf no longer exists. Is there a different one that can be used?

Kamil
Guest
Kamil
15 days 11 hours ago

Of course it does exist!
First, you need to install it:
install.packages(“sqldf”)
Then when you are about using it you need to load that package first:
library(sqldf)

Dr. Duru
Guest
Dr. Duru
14 days 14 hours ago

My bad. I forgot the single quotes around sqldf. Loading that library sure generated a *lot* of additional downloads!

wpDiscuz