Names at a Party – Bootstrap Sampling with R

many_babies_v2

You’re strolling in the park and bump into a party of 30 babies. What is the chance that at least two of them have the same name? We’re going to answer this question via bootstrap sampling — generating thousands of random samples and then calculating the percentage of times names get repeated.

[sc name=”code_and_data” codelink=”https://github.com/nvenkataraman1/rforexcelusers/blob/master/Names%20data%20analysis/3_name_sampling.R” datalink=”https://www.rforexcelusers.com/combine-delimited-files-r/” ]

Once again we will use the baby names data provided by the Social Security Administration, which we pulled and combined a few posts ago.

Excel Analysis

You can almost take the bootstrap approach in Excel with a combination of RANDBETWEEN, OFFSET, INDEX, MATCH, COUNTIF and array formula magic. But using that many advanced formulas is a sure sign that maybe Excel is not the best tool for this. Not to mention the 5.5M row data set does not even fit in Excel.

Nonetheless here is how I approached it in Excel (just using Delaware data so it can fit):

Resampling in excel
Resampling in excel

In Step 1, I used a PivotTable to list the set of names to sample from. In Step 2, you can see the complicated formula I wrote to sample from the list; I created 100 of these random samples. And in Step 3, I have another complicated formula that determines whether any duplicates were found in a sample of 30. The full Excel file can be downloaded here.

There are some unforgivable flaws with the Excel analysis, though: (1) the full data set cannot fit in Excel, (2) no way to sample with a specified probability distribution. So the name Michael (very popular) will be just as likely to be sampled as the name Troy (much less popular).

That said, with Excel I estimate an 85% chance of having at least one name be repeated among 30 babies.

Bootstrap sampling with R

We can do the same with R, but I will take it a step further. Instead of just checking the probability of repeats happen with 30 names, I will check how many repeats happen with 1, 2, up to 100 names.

Step 1: Set up the sampling process

First I want to set up the pool of names from which to sample, just as we did with the PivotTable above.

library(dplyr)
# set up a summary data frame from which to pull name samples
names_pool = filter(files, year==2013, gender == "M") %.%
             group_by(name) %.% 
             summarise(births = sum(count))
names_pool$prob = names_pool$births / sum(names_pool$births)

Using the wonderful dplyr library, I filtered all boys names from 2013 and then with group_by() and summarise(), counted the number of births by name. I then used standard R notation to create a new column, prob, which computes the ratio of each name across all births. I could have also used dplyr’s mutate() function to do that. The result looks like this, head(arrange(name_pool, desc(births))):

     name births        prob
1    Noah  18090 0.011045042
2    Liam  18002 0.010991312
3   Jacob  17976 0.010975438
4   Mason  17591 0.010740372
5 William  16495 0.010071197
6   Ethan  16127 0.009846511

Who knew that Noah and Liam were the two most popular boys names in 2013? Not me! Now with these data points, we can draw random samples using the sample() function:

name_sample = sample(names_pool$name, size = 30, 
                     replace = TRUE, prob = names_pool$prob)

We are pulling 30 names at random. The two key parameters here are:

  • prob = names_pool$prob. I want to pick a sample that’s “as diverse as America”, so Noah should be picked with probability 1.1045%, and Ethan with a 0.9847% probability.
  • replace = TRUE. Since my data set has one row per name, I want to sample with replacement to allow for the same name to be picked more than once.

The sample() function returns a vector; let’s take a look at the result (sorted using sort()):

sort(name_sample)
[1] "Aaron"   "Adam"    "Andrew"  "Ayden"   "Connor"  "Cordell" "Daniel"  "Daniel"  "David"   "Easton"  "Elijah"  "Ethan"  
[13] "Ezekiel" "Hunter"  "Isaiah"  "Isiah"   "Jack"    "Jaden"   "Jamison" "Joseph"  "Joshua"  "Kamden"  "Kris"    "Lincoln"
[25] "Luis"    "Malek"   "Parker"  "Reid"    "Tristin" "Wyatt"  

And finally let’s count the number of duplicated names:

sum(duplicated(name_sample))

This counts the number of names that are duplicated; in this case it’s 1. The name Daniel appears twice (found by running name_sample[duplicated(name_sample)].

Step 2: Automate the sampling process

Now that we have the general framework for sampling names, we want to automate this to generate hundreds of samples and get a smooth probability curve. In programming, this is often done with loops. R comes with the “apply” family of functions useful for looping. In this case, we want to loop through a number of sample sizes, say from 1 to 100. With a random sample of 1 baby, the chance of getting a duplicate name is 0%. With 100 babies, the chance will be much higher. We want to get the full curve for everything in between.

First let’s package the sampling into its own function:

sample_has_duplicate = function(size_val) {
  x = sample(names_pool$name, size = size_val, 
             replace = TRUE, prob = names_pool$prob)
  return(sum(duplicated(x)) > 0)
}

Our function takes one input, size_val, runs a sample of length size_val, and returns a TRUE / FALSE indicating whether the sample had duplicates or not.

Now we want to run this function for sample sizes 1 through 100. This is a perfect job for sapply(), which will run a specific function over a specified vector.

sapply(X = 1:100, FUN = sample_has_duplicate)

sapply() will run the sample_has_duplicate() function, first taking as input the value 1, then 2, then 3, up to 100, and returning a vector of TRUE / FALSE values that are returned by the sample_has_duplicate() function for each of those scenarios. That is, when size_val is 1, there will be no duplicates so the answer is FALSE; when size_val is 100, there most likely will be duplicates, so we should get TRUE. This is a mouthful; feel free to reread that.

Here is what the sapply() output looks like. The first several values are FALSE because smaller samples are less likely to have duplicates.

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[22] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[43]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[64]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Creating a curve based on just this set of values is not enough. For example, in the above case, we did not get a duplicate when sampling two names (the second value is FALSE). But we know it is possible in the real world to get duplicate names with just two people in a room. So let’s run this hundreds or thousands of times in order to get a truer sense of likelihoods. Then we can average over those many scenarios.

The replicate() function is also in the “apply” family of looping functions, and can be used to run sapply() many times. In the code below, we can put it all together:

# run sapply() 1000 times.  returns a matrix with 100 rows, 1000 columns
x = replicate(n = 1000, sapply(X = 1:100, FUN = sample_has_duplicate))

# calculate average across all the rows by averaging the TRUE/FALSE's
probabilities = apply(x, 1, mean)

# plot average probability by sample size
plot(probabilities, type="l", xlab = "Sample Size", ylab = "Probability of Duplicate")

In just a few lines of code and a few seconds of processing, we sampled 100,000 names (100 names, 1000 times) to estimate the cumulative distribution of duplicate names. This is pretty amazing, if you ask me.

plot cdf with R

By the time you get to 52 babies, you are pretty much guaranteed to see a name repeated. You need just 20 names to have a 50% chance of a repeated name. Oh, and running this for the state of Delaware (as we did in Excel), the chance that a sample of 30 names has a repeated name is 94.5%; Excel underestimated by a few percentage points. This makes sense — in Excel, every name is equally likely to be selected, making it less likely for popular names to be selected multiple times. Not too shabby nonetheless for Excel.

Generating Distributions

In the last question, we were looking for a point estimate — just a single number. But what if we want to understand the range of possibilities? For example, we could ask “in a party of 100 people, how many names will be repeated?” Instead of giving a single number, like the average, we can generate the whole distribution of possible outcomes.

We already have most of the hard work done. We just need to tweak our custom function:

sample_duplicates = function(size_val) {
  x = sample(names_pool$name, size = size_val, 
             replace = TRUE, prob = names_pool$prob)
  return(sum(duplicated(x)))
}

The only thing we changed to the original function is the value being returned. Now we are returning the number of repeated names, not just a yes/no value. Now we run the replicate() and sapply() as we did before.

sample_100 = replicate(n = 10000, sapply(X = 100, FUN = sample_duplicates))

Instead of running for the range of X=1:100 as we did before (to sample 1 name, 2 names,… 100 names), now we just run it for the single value of 100. And I bumped n=1000 to n=10000 just for fun. In other words, we will create 10000 samples of 100 names. The result, sample_100, is a vector showing the number of repeated names in all 10000 samples.

We can summarize the sample_100 vector with the table() function, which is like a PivotTable summary:

table(sample_100)
   3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27 
   1    2   10   34   73  153  292  441  713  890 1070 1213 1140 1090  896  678  515  350  205   84   76   46   15   10    3 

This tells us 1 of the 10000 samples had 3 names repeated, 292 of the 10000 samples had 9 names repeated, and so forth. Now we can plot this distribution (histogram) with hist(sample_100):

R Bootstrap Distribution
R Bootstrap Distribution

And there, my friends, is a beautiful bell curve. That tends to happen when we use bootstrap resampling because most things in life are normally distributed.

Conclusion

Using R’s simple sampling functions, combined with the powerful (though initially confusing) apply functions, we can simulate data for bootstrap analysis. We almost got away with it in Excel, but as you can see, you can do so much more with it in R.

Fun exercise

Let’s compare the curve in 2013 to 1950. In 1950, with a party of 30 you were essentially guaranteed to have a name repeated. I bet it was easier to remember people’s names back then!

[sc name=”names_sampling_party” ]
[sc name=”names_sampling_party_1950″ ]

Appendix

Photo credit: humansarefree.com

Leave a Reply

Be the First to Comment!

avatar

wpDiscuz