How to do VLOOKUP in R

VLOOKUP is usually the first magical formula people learn when learning Excel. The magic never goes away. So let’s see how to do a VLOOKUP in R.

Situation

Let’s say we have two datasets from World Bank — one showing annual average life expectancy by country and the other showing a measure of access to sanitation facilities. To do our analysis (regression, visual, whatever), we need these two sets of values combined.

% of population with access to sanitation

Vlookup in R Sanitation Data

Life expectancy in years

Vlookup in R Life Exp Data

The tables have 260 rows and >50 columns (one for each year). World Bank Data Links: Life Expectancy, Sanitation Access

Basic VLOOKUP in R

Let’s say you already have your two datasets set up. (The code to import and set up the two data sets is at the end of this article).

Say we want to analyze the most recent year of data, 2012. We can use the merge() function to combine two datasets. In the code below, we are merging a subset of life_expectancy and sanitation; the subsets each have just two columns, country name and the values for 2012.

data_2012 = merge(life_expectancy[, c("country.name", "le_2012")], 
                  sanitation[, c("country.name", "san_2012")])
head(data_2012, 10)

Note that I’m only keeping one numeric variable, for 2012, from each data set. If we simply fed the entire life_expectancy and sanitation data frames as is, our result would have 100+ columns. Have you ever VLOOKUP’d 50+ columns from one data set to another? It’s not nearly as pretty as merge().

# Sample - first 10 rows
          country.name  le_2012  san_2012
1          Afghanistan 60.50912  29.00000
2              Albania 77.35046  91.20000
3              Algeria 70.88217  95.20000
4       American Samoa       NA  62.50000
5        Andean Region       NA        NA
6              Andorra       NA 100.00000
7               Angola 51.46400  60.10000
8  Antigua and Barbuda 75.66532        NA
9           Arab World 69.94970  81.64278
10           Argentina 76.01268  97.20000

There we have it — the country name along with the 2012 values for each metric. Notice we did not specify any options in the merge() function; we just told it which two data frames to combine. That’s because the default options of merge() suffice in this case; this is a basic example. Specifically, the lookup columns in both data frames have the same name, and both data frames have the same number of rows (one per column), making this a simple 1-to-one lookup.

INNER / LEFT / RIGHT JOIN in R

What if the two data frames were different sizes? Say you have life expectancy on every country, but only have income disparity stats on a subset of countries? Then you have to decide what you need for your analysis. Here is a cheat sheet assuming you use the following merge() function:

merge(x = data_w_all_countries, y = data_w_some_countries)

Scenario aka in SQL Merge Parameters
Keep rows where there’s a match in both INNER JOIN N/A. This is default
Keep all rows from x, regardless of match in y LEFT JOIN set all.x = TRUE
Keep all rows from y, regardless of match in x RIGHT JOIN set all.y = TRUE
Keep all rows from x AND from y OUTER JOIN set all = TRUE

So if you were building a “master” data set and you need to keep all the countries from the life expectancy table, regardless of finding a match in the income disparity data set, then you would use the following:

# Left Join: keep all rows in x even if there's no match in y
merge(life_expectancy, income_disparity, all.x = TRUE)

Merge by multiple columns

One of the weaknesses of VLOOKUP is that it could only merge by one column. The example above works fine in Excel because we are only VLOOKUPing data by country. But what if we had multiple columns to match on — say country and city? That is, if both data sets had stats at the country and city level.

By default, as long as the columns are named the same way in both data frames, R is smart enough to automatically join the two data frames by these columns. In any case, if we want to be explicit in our code (which I recommend), you can specify which columns to merge by, as follows:

# Merge by multiple columns
merge(life_expectancy, income_disparity, by = c("country", "city"))

Here we used the by parameter to specify which columns to merge by. We provided a vector of column names. (Remember, c() is shorthand for creating a vector of elements).

When “by” column names differ

Often times, the column names across data frames will differ, so we need to explicitly tell R which columns to merge by. In the example above, if life_expectancy’s country column was named “cntr_nm” and sanitation’s column was named “country.name”, we would use the following merge():

# Merge by columns with different names
merge(life_expectancy, sanitation, by.x = "cntr_nm", by.y = "country.name")

There we instruct R which columns to base the merge on.

Best Practice: be explicit

I always recommend you be explicit when merging, and don’t rely on the defaults. Even if the “by” columns are named the same, I still like to specify which columns to merge by. It will make your code more readable, but also keep you out of trouble.

For example, remember how our original data set had one metric column for each year (2010, 2011, 2012, etc)? If we tried merging the raw life_expectancy and sanitation data frames without renaming the columns, and without setting by parameters, R would have tried merging the two data frames by all common columns — namely country.name, 2010, 2011, 2012, etc. Since the numeric columns (2010-2012) likely won’t match across the two data sets, your merge will yield no results. Plus, if your data sets are big enough, the merge could really take a lot of time to process. So always lean on the side of explicit in the case of merge().

Conclusion

VLOOKUP is awesome, but merge() is even more awesome. It can do a lot of things that VLOOKUP cannot. Even though the default options are really useful, I recommend explicitly telling the function how you want things merged, by specifying the by parameter.

Appendix

Importing and prepping the data

Let’s import the data and rename the metrics columns. I want to rename them because they have the same names in both data sets, and (e.g., 1995 column name). This way we can distinguish the two sets of columns better.

life_expectancy = read.csv("life_expectancy.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE)
sanitation = read.csv("sanitation.csv", skip = 2, header = TRUE, stringsAsFactors = FALSE)

## rename all yearly metric columns.  E.g., X1995 becomes le_1995 (for life expectancy)
names(life_expectancy)[grepl("\\bX", names(life_expectancy))] = gsub("X", "le_", names(le)[grepl("\\bX", names(life_expectancy))])
names(sanitation)[grepl("\\bX", names(sanitation))] = gsub("X", "san_", names(sanitation)[grepl("\\bX", names(sanitation))])

## lowercase column names
names(life_expectancy) = tolower(names(life_expectancy))
names(sanitation) = tolower(names(sanitation))

I use regular expressions (link coming soon) for the renaming. Here is a sample of the Life Expectancy dataset:

# Sample - 6 rows, 8 columns
   country.name country.code                          indicator.name indicator.code  le_1961  le_1962  le_1963  le_1964
1         Aruba          ABW Life expectancy at birth, total (years) SP.DYN.LE00.IN 65.98802 66.36554 66.71398 67.04429
2       Andorra          AND Life expectancy at birth, total (years) SP.DYN.LE00.IN       NA       NA       NA       NA
3   Afghanistan          AFG Life expectancy at birth, total (years) SP.DYN.LE00.IN 32.09598 32.61188 33.12732 33.64124
4        Angola          AGO Life expectancy at birth, total (years) SP.DYN.LE00.IN 33.38622 33.78759 34.18846 34.59034
5       Albania          ALB Life expectancy at birth, total (years) SP.DYN.LE00.IN 63.27346 64.16285 64.88710 65.43820
6 Andean Region          ANR Life expectancy at birth, total (years) SP.DYN.LE00.IN       NA       NA       NA       NA

You can see the year columns now have the “le_” prefix. Data is prepped and ready for VLOOKUP.

Leave a Reply

10 Comments on "How to do VLOOKUP in R"

avatar

Rick
Guest
Rick
1 year 10 months ago

Really nice explanation of ‘merge’. Thanks.

As a minor point, Excel can do the equivalent to a VLOOKUP across multiple criteria, by using INDEX and MATCH functions. http://stackoverflow.com/questions/18233144/optimization-of-vlookup-with-multiple-criteria-index-match.

However, your conclusion is right: merge() is even more awesome. And merge() is much simpler (a concise ‘R’ statement on one line versus a lengthy Excel array formula and copying it across all rows).

John
Guest
1 year 9 months ago

Thanks for the post, it’s helped me solve a bunch of different challenges! But I’m having trouble applying merge to another issue I’m facing. I have a data frame with multiple columns (player data from the NBA). Each row is the statistics from a player in a game. I need to get the full name of the opponent in a new column. I have the three digit opposing team abbreviation (stripped from a longer string in another column using substr), the three digit player team abbreviation and the player team full name.

In excel speak, I need to vlookup the team name from the three digit abbreviation (i.e. =vlookup(three digit opposing, three digit player team:player team full name, 2,0) or =index(player team full name,match(three digit opposing, three digit player,0)))

Is there a way to do this with merge? I can’t seem to get it to work. Each time I merge, I get a result with millions of rows (vs my dataset of ~9000 rows). There is probably something totally obvious that I’m missing.

Deirdre Monks
Guest
Deirdre Monks
1 year 6 months ago

Thanks for that John. Would you by any chance know how to do the equivalent of a HLOOKUP in R? I have a dataset of just under 600k rows and approx. 60 columns that I can reduce to under 20 with a simple HLOOKUP in excel, however it just keeps crashing on me. Thanks in advance.

Iren Ramadhan
Guest
Iren Ramadhan
1 year 4 months ago

Hi!

I want to ask how to do ‘Multiple VLOOKUP’ just like we do in Excel.

For example :

=VLOOKUP(E2,$A$2:$C$30,VLOOKUP(F2,$J$2:$K$3,2,FALSE),FALSE)

Thanks 🙂

Kathryn
Guest
Kathryn
1 year 3 months ago

SO VERY HELPFUL. Thx!

Tim
Guest
Tim
5 months 3 days ago

What if we wanted to match/merge data in to a column that already existed in both data sets? Say, we have the following:

authors <- data.frame(
surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4))
)

books <- data.frame(
name = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil", "R Core"),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics",
"LISP-STAT",
"Spatial Statistics",
"Interactive Data Analysis",
"An Introduction to R"),
nationality = c(rep("US", 6))
)

books_update = merge(books, authors, by.x = "name", by.y = "surname", all = TRUE)

Nationality is in both data frames, so it creates a 'nationality.x' and 'nationality.y' respectively for each field. Is there a way to overwrite the data from y (the 'authors' data frame in this instance), or would a different function be needed?

wpDiscuz