Combine delimited files with R

Want to combine delimited files into one big file?  Here I will show you how to do it with R.

Try it yourself!
Code
Data

The Social Security Administration provides several CSV files listing the top 1000 popular baby names, by US State, for every year since 1910.  I have noticed lately no one seems to have my name anymore.  It’s been a long time since I have met another John (not Jon, which are a dime a dozen these days). To analyze this data, it would be helpful to combine these files.

The Answer

Let’s start with the final answer, and then I will explain each step:

# Step 1: set the working directory (where files are saved)
setwd("/Users/johnt/Google Drive/namesbystate/")

# Step 2: get all the right file names
file_names = list.files(getwd())
file_names = file_names[grepl(".TXT",file_names)]

# Step 3: get the read.csv function working
files = read.csv("WY.TXT", header=F, stringsAsFactors = F)

# Step 4: use lapply to apply the read.csv function to all values of file_names
files = lapply(file_names, read.csv, header=F, stringsAsFactors = F)
files = do.call(rbind,files)

# Step 5: Set column names
names(files) = c("state", "gender", "year", "name", "count")

# check structure of new data set
str(files)

'data.frame':	5552452 obs. of  5 variables:
 $ state : chr  "AK" "AK" "AK" "AK" ...
 $ gender: chr  "F" "F" "F" "F" ...
 $ year  : int  1910 1910 1910 1910 1910 1910 1910 1910 1911 1911 ...
 $ name  : chr  "Mary" "Annie" "Anna" "Margaret" ...
 $ count : int  14 12 10 8 7 6 6 5 12 7 ...

The Explanation

Download the Data

The state-specific data can be downloaded from SSA’s website. Download the zip file and unzip it.  Your folder should look like this (plus a few other unrelated files):

Files to combine in R

 

Step 1: Set the Working Directory

Since we will be reading many files from the same location, it would be helpful to set the working directory, which will allow us to make file references without having to include the entire file path every time. We do that with the setwd() function:

# set the working directory (where files are saved)
setwd("/Users/johnt/Google Drive/namesbystate/")

Step 2: Get List of Files in Directory

We then use the list.files() function to get a listing of all the file names in the specified directory. We used the getwd() function to get the working directory.

That function returns a vector of all file names in the directory, including a few other files that are supplementary and do not contain name information.

To ignore those additional unrelated files, we use the grepl() function to find all filenames with extension .TXT.

file_names = list.files(getwd())
file_names = file_names[grepl(".TXT",file_names)]

# print file_names vector
file_names
 [1] "AK.TXT" "AL.TXT" "AR.TXT" "AZ.TXT" "CA.TXT" "CO.TXT" "CT.TXT" "DC.TXT"
 [9] "DE.TXT" "FL.TXT" "GA.TXT" "HI.TXT" "IA.TXT" "ID.TXT" "IL.TXT" "IN.TXT"
[17] "KS.TXT" "KY.TXT" "LA.TXT" "MA.TXT" "MD.TXT" "ME.TXT" "MI.TXT" "MN.TXT"
[25] "MO.TXT" "MS.TXT" "MT.TXT" "NC.TXT" "ND.TXT" "NE.TXT" "NH.TXT" "NJ.TXT"
[33] "NM.TXT" "NV.TXT" "NY.TXT" "OH.TXT" "OK.TXT" "OR.TXT" "PA.TXT" "RI.TXT"
[41] "SC.TXT" "SD.TXT" "TN.TXT" "TX.TXT" "UT.TXT" "VA.TXT" "VT.TXT" "WA.TXT"
[49] "WI.TXT" "WV.TXT" "WY.TXT"

Chapter 4 of my book covers vectors, an important fundamental every R user must understand.

Step 3: Import the data

Our goal is eventually to import all these files and combine them. But first let’s make sure we know how to import just one of them; i.e., that we get the right function parameters. Let’s use the read.csv() function:

# read the WY.TXT file, just for testing
file = read.csv("/Users/johnt/namesbystate/WY.TXT", header=F, stringsAsFactors=F)

# see the data structure
str(file)

'data.frame':	26600 obs. of  5 variables:
 $ V1: chr  "WY" "WY" "WY" "WY" ...
 $ V2: chr  "F" "F" "F" "F" ...
 $ V3: int  1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
 $ V4: chr  "Mary" "Margaret" "Helen" "Alice" ...
 $ V5: int  27 22 13 10 9 8 8 8 7 7 ...

Since the file has no header row, we set header=TRUE in the read.csv() function. Also, I usually set stringsAsFactors = FALSE so that character values are imported as characters and not factors. In this case, it might make sense to convert State and Gender columns to factors; we can do that later, when needed.

The table structure looks good: 26.6K rows and 5 columns. The columns do not come with names, hence the generic V1 to V5. And the data types look good: characters and integers.

Step 4: Combine the data

Now that we know how to import one file, we want to essentially loop through each filename, import the file and stack it together into a larger data set. This can be done with a for() loop, as with many languages. But explicit loops usually don’t perform well in R, so we can use the lapply() function instead.

lapply() is part of the apply family of functions, which do different kinds of looping. lapply() specifically runs a function over a vector; in this case, we want to run the read.csv() through a vector of file names. It stores the result of each read.csv() into an R list, which we can then rbind() together into one data frame with the do.call() function.

# run read.csv on all values of file_names
files = lapply(file_names, read.csv, header=F, stringsAsFactors = F)
files = do.call(rbind,files)

To test this out, you can replace file_names with c("WY.TXT","NM.TXT"). This will run read.csv on the WY.TXT file, then run it for the NM.TXT file, and both data frames will be stored in a list. Here is the structure of that list:

# run only on WY.TXT and NM.TXT
str(files)
List of 2
 $ :'data.frame':	26600 obs. of  5 variables:
  ..$ V1: chr [1:26600] "WY" "WY" "WY" "WY" ...
  ..$ V2: chr [1:26600] "F" "F" "F" "F" ...
  ..$ V3: int [1:26600] 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
  ..$ V4: chr [1:26600] "Mary" "Margaret" "Helen" "Alice" ...
  ..$ V5: int [1:26600] 27 22 13 10 9 8 8 8 7 7 ...
 $ :'data.frame':	70712 obs. of  5 variables:
  ..$ V1: chr [1:70712] "NM" "NM" "NM" "NM" ...
  ..$ V2: chr [1:70712] "F" "F" "F" "F" ...
  ..$ V3: int [1:70712] 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
  ..$ V4: chr [1:70712] "Mary" "Maria" "Margaret" "Josephine" ...
  ..$ V5: int [1:70712] 98 40 38 35 26 26 20 19 18 16 ...

A list is an arbitrary collection of objects. In this case, the list is holding two data frames, and above you see the structure of the list and the component data frames. You can then run the do.call() function above to combine those two data frames into one.

Lastly, we want to replace the generic column names (V1 to V5) with more intuitive names. We can do that with the names() function. We do that below and then print the structure of the large data frame we created (this time I ran it for all file_names, not just the two states):

Step 5: Set column names

# set column names
names(files) = c("state", "gender", "year", "name", "count")
str(files)
'data.frame':	5552452 obs. of  5 variables:
 $ state : chr  "AK" "AK" "AK" "AK" ...
 $ gender: chr  "F" "F" "F" "F" ...
 $ year  : int  1910 1910 1910 1910 1910 1910 1910 1910 1911 1911 ...
 $ name  : chr  "Mary" "Annie" "Anna" "Margaret" ...
 $ count : int  14 12 10 8 7 6 6 5 12 7 ...

Our big data set, which combines all state files, has more than 5.5M rows and five columns with the names we provided.

Conclusion

With just a few lines of code and a matter of seconds to execute, we have combined multiple delimited files stored in the same directory, into one big file that is ready for analysis. For the next blog posts, we will see some name trends to visualize what is the future of John in this country.

Leave a Reply

Be the First to Comment!

avatar

wpDiscuz