VLOOKUP is usually the first magical formula people learn when learning Excel. The magic never goes away. So let’s see how to do a VLOOKUP in R.
Let’s say we have two datasets from World Bank — one showing annual average life expectancy by country and the other showing a measure of access to sanitation facilities. To do our analysis (regression, visual, whatever), we need these two sets of values combined.
% of population with access to sanitation
Life expectancy in years
Basic VLOOKUP in R
Let’s say you already have your two datasets set up. (The code to import and set up the two data sets is at the end of this article).
Say we want to analyze the most recent year of data, 2012. We can use the
merge() function to combine two datasets. In the code below, we are merging a subset of life_expectancy and sanitation; the subsets each have just two columns, country name and the values for 2012.
data_2012 = merge(life_expectancy[, c("country.name", "le_2012")], sanitation[, c("country.name", "san_2012")]) head(data_2012, 10)
Note that I’m only keeping one numeric variable, for 2012, from each data set. If we simply fed the entire life_expectancy and sanitation data frames as is, our result would have 100+ columns. Have you ever VLOOKUP’d 50+ columns from one data set to another? It’s not nearly as pretty as
# Sample - first 10 rows country.name le_2012 san_2012 1 Afghanistan 60.50912 29.00000 2 Albania 77.35046 91.20000 3 Algeria 70.88217 95.20000 4 American Samoa NA 62.50000 5 Andean Region NA NA 6 Andorra NA 100.00000 7 Angola 51.46400 60.10000 8 Antigua and Barbuda 75.66532 NA 9 Arab World 69.94970 81.64278 10 Argentina 76.01268 97.20000
There we have it — the country name along with the 2012 values for each metric. Notice we did not specify any options in the
merge() function; we just told it which two data frames to combine. That’s because the default options of
merge() suffice in this case; this is a basic example. Specifically, the lookup columns in both data frames have the same name, and both data frames have the same number of rows (one per column), making this a simple 1-to-one lookup.
INNER / LEFT / RIGHT JOIN in R
What if the two data frames were different sizes? Say you have life expectancy on every country, but only have income disparity stats on a subset of countries? Then you have to decide what you need for your analysis. Here is a cheat sheet assuming you use the following
merge(x = data_w_all_countries, y = data_w_some_countries)
|Scenario||aka in SQL||Merge Parameters|
|Keep rows where there’s a match in both||INNER JOIN||N/A. This is default|
|Keep all rows from x, regardless of match in y||LEFT JOIN||set
|Keep all rows from y, regardless of match in x||RIGHT JOIN||set
|Keep all rows from x AND from y||OUTER JOIN||set
So if you were building a “master” data set and you need to keep all the countries from the life expectancy table, regardless of finding a match in the income disparity data set, then you would use the following:
# Left Join: keep all rows in x even if there's no match in y merge(life_expectancy, income_disparity, all.x = TRUE)
Merge by multiple columns
One of the weaknesses of VLOOKUP is that it could only merge by one column. The example above works fine in Excel because we are only VLOOKUPing data by country. But what if we had multiple columns to match on — say country and city? That is, if both data sets had stats at the country and city level.
By default, as long as the columns are named the same way in both data frames, R is smart enough to automatically join the two data frames by these columns. In any case, if we want to be explicit in our code (which I recommend), you can specify which columns to merge by, as follows:
# Merge by multiple columns merge(life_expectancy, income_disparity, by = c("country", "city"))
Here we used the
by parameter to specify which columns to merge by. We provided a vector of column names. (Remember,
c() is shorthand for creating a vector of elements).
When “by” column names differ
Often times, the column names across data frames will differ, so we need to explicitly tell R which columns to merge by. In the example above, if life_expectancy’s country column was named “cntr_nm” and sanitation’s column was named “country.name”, we would use the following
# Merge by columns with different names merge(life_expectancy, sanitation, by.x = "cntr_nm", by.y = "country.name")
There we instruct R which columns to base the merge on.
Best Practice: be explicit
I always recommend you be explicit when merging, and don’t rely on the defaults. Even if the “by” columns are named the same, I still like to specify which columns to merge by. It will make your code more readable, but also keep you out of trouble.
For example, remember how our original data set had one metric column for each year (2010, 2011, 2012, etc)? If we tried merging the raw life_expectancy and sanitation data frames without renaming the columns, and without setting
by parameters, R would have tried merging the two data frames by all common columns — namely country.name, 2010, 2011, 2012, etc. Since the numeric columns (2010-2012) likely won’t match across the two data sets, your merge will yield no results. Plus, if your data sets are big enough, the merge could really take a lot of time to process. So always lean on the side of explicit in the case of
VLOOKUP is awesome, but
merge() is even more awesome. It can do a lot of things that VLOOKUP cannot. Even though the default options are really useful, I recommend explicitly telling the function how you want things merged, by specifying the
Importing and prepping the data
Let’s import the data and rename the metrics columns. I want to rename them because they have the same names in both data sets, and (e.g., 1995 column name). This way we can distinguish the two sets of columns better.
life_expectancy = read.csv("life_expectancy.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE) sanitation = read.csv("sanitation.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE) ## rename all yearly metric columns. E.g., X1995 becomes le_1995 (for life expectancy) names(life_expectancy)[grepl("\\bX", names(life_expectancy))] = gsub("X", "le_", names(le)[grepl("\\bX", names(life_expectancy))]) names(sanitation)[grepl("\\bX", names(sanitation))] = gsub("X", "san_", names(sanitation)[grepl("\\bX", names(sanitation))]) ## lowercase column names names(life_expectancy) = tolower(names(life_expectancy)) names(sanitation) = tolower(names(sanitation))
I use regular expressions (link coming soon) for the renaming. Here is a sample of the Life Expectancy dataset:
# Sample - 6 rows, 8 columns country.name country.code indicator.name indicator.code le_1961 le_1962 le_1963 le_1964 1 Aruba ABW Life expectancy at birth, total (years) SP.DYN.LE00.IN 65.98802 66.36554 66.71398 67.04429 2 Andorra AND Life expectancy at birth, total (years) SP.DYN.LE00.IN NA NA NA NA 3 Afghanistan AFG Life expectancy at birth, total (years) SP.DYN.LE00.IN 32.09598 32.61188 33.12732 33.64124 4 Angola AGO Life expectancy at birth, total (years) SP.DYN.LE00.IN 33.38622 33.78759 34.18846 34.59034 5 Albania ALB Life expectancy at birth, total (years) SP.DYN.LE00.IN 63.27346 64.16285 64.88710 65.43820 6 Andean Region ANR Life expectancy at birth, total (years) SP.DYN.LE00.IN NA NA NA NA
You can see the year columns now have the “le_” prefix. Data is prepped and ready for VLOOKUP.