The growing number of baby names – using dplyr

So many friends and family members have had babies in the past few years,¬†and I can’t think of a single baby whose name has been repeated. Are parents getting more creative? Is it a function of the diversity around me (in NYC / CT / Boston)? Is it a statistical¬†anomaly?

[sc name=”code_and_data” codelink=”https://github.com/nvenkataraman1/rforexcelusers/blob/master/Names%20data%20analysis/2_dplyr_name_trends.R” datalink=”https://www.rforexcelusers.com/combine-delimited-files-r/”]

In this post I will do some quick data crunching to get a sense of high-level baby-naming trends. Are we indeed seeing more names in the books? Are parents disfavoring the popular names?

I will use data from Social Security Administration, which we imported in the last post, and will make heavy use of the dplyr library for data manipulation and summarization.

How has the number of distinct names changed over time?

With 5.5M rows, this data set is already too big for Excel, so this analysis is a non-starter. Even if it did fit, this would still not be a straightforward analysis in Excel. That’s because names are repeated across states, and there is no good way of counting unique instances. However, using just one state’s worth of data, this is a simple PivotTable and charting exercise:

Question 1 in Excel: PivotTable
Question 1 in Excel: PivotTable

This is using Colorado’s data. The number of boys’ and girls’ names has nearly quadrupled over the past 100 years, and female names started outpacing male names in the mid-1940s.

Analysis with R

As usual with R, there are dozens of ways to do the same thing. We could use SQL (see my SQL in R post), we could use built-in functions like aggregate(), by(), ave(). We could use the plyr library or doBy. We can go on and on.

But my new personal favorite is dplyr. (1) it’s fast and (2) the syntax is consistent and simple.

Set up data

Let’s first replicate the PivotTable:

# Step 1: summarize the data
names_count = group_by(files, year, gender)
names_count = summarise(names_count, unique_names = n_distinct(name),
                                     names_per_1M = unique_names / (sum(count) / 1000000))

To summarize data with dplyr, we first specify the columns we wish to group by. That’s what we do in line 2, grouping the files data frame by year and gender. This does not visibly change the data; it just sets it up for by-group processing. In line 3, the summarise() function creates two new columns:

  • unique_names. Using the n_distinct() function, we get number of distinct names. This is the function that’s missing in Excel; counting distinct items in Excel is that so easy.
  • names_per_1M. Normalizes the name count by number of births. We might see fewer names way back in the day simply because there were fewer births. Normalizing takes care of that, so comparing years becomes apples-to-apples. This is like a calculated field in a PivotTable.

Here is a snippet of the summary table:

   year gender unique_names names_per_1M
1  1910      F         1083     3075.988
2  1910      M          692     4213.731
3  1911      F         1066     2862.829
4  1911      M          754     3897.849
5  1912      F         1261     2500.654
6  1912      M         1114     2903.302
7  1913      F         1350     2381.200
8  1913      M         1256     2721.053
9  1914      F         1549     2222.812
10 1914      M         1494     2504.913

Now let’s graph it. For some graphics packages, like ggplot2, this data frame would be sufficient. But we will use the googleVis package, which requires that each series be represented by a column, just like Excel charts. So let’s transpose the gender column:

# Step 2: transpose the gender column
uniq_names_count = dcast(names_count, year ~ gender, value.var="unique_names")
names_per_1M = dcast(names_count, year ~ gender, value.var="names_per_1M")

I created two new data frames, one holding the number of unique names by gender, and the other having the number of names per million births. Instead of creating two data frames, I could have also “melted” the data frame with melt() then transposed it with dcast(). But for simplicity, I’m doing it this way.

We use the dcast() function to transpose the gender column. This function is part of the reshape2 library. dcast() takes a formula, year ~ gender, as input to indicate how the new table should be structured. If you think of it like a PivotTable, the year column is on the left hand side of the PivotTable, gender is on the top part and names_per_1M is in the values part of the PivotTable.

Here is an excerpt of the names_per_1M data frame:

   year        F        M
1  1910 3075.988 4213.731
2  1911 2862.829 3897.849
3  1912 2500.654 2903.302
4  1913 2381.200 2721.053
5  1914 2222.812 2504.913
6  1915 1998.805 2226.530
7  1916 1950.397 2185.154
8  1917 1912.197 2139.653
9  1918 1873.776 1992.210
10 1919 1908.073 2124.031

Now we’re ready to plot it using the googleVis library:

# Step 3: Plot the chart with googleVis
plot(gvisLineChart(names_per_1M, xvar="year", yvar=c("F","M"),
                   options=list(title="Number of Names Per Million Births",
                                hAxis="{title: 'Year'}",
                                vAxis="{title: '# of Names'}")))

googleVis, though limited in features compared to ggplot2, has two great advantages. (1) it is interactive and web-ready and (2) the syntax is significantly (!) more simple than ggplot2’s. In this example, I specified the data frame (names_per_1M), the x-variable and y-variable. The options are aesthetic.

[sc name=”names_trend_distinct” ]

I only showed code for the graph on the right, but the graph on the left is essentially the same, only referring to the other data frame. The first chart shows that the number of names clearly been growing for many decades. In 1911, about 1,800 names covered the US population; today that number is more than 10,000. In the early 1940s, the number of female names started growing at a faster pace than that of men, and today, there are approximately 1,500 more female than male names.

The second chart shows an interesting twist — specifically the dip from the 1940s to late 1950s. For every million births, this period had the fewest number of distinct names. These are the WWII baby boomers. So many babies born, with fewer names to go around. This exercise shows why normalizing is so important — it really tells a different story from the non-normalized view.

The number of names per million births has nearly quadrupled in the past 60 years. So parents do, indeed, have more names to choose from these days.

Are popular names less popular?

Ok fine, so we see more distinct names showing up. But the lion’s share of newborns must still have the core popular names, right? To answer this question, let’s look at the percent of the population having the top 50 names. That is, let’s take the top 50 names in each state and tally up the percent of people in that state who have those name.

This would a dreadful analysis in Excel. To figure out the top 50 names in each state, I could imagine a PivotTable counting births in a state-vs-name matrix and flagging the top 50 from each name column and vlookup’ing that into the raw table. To repeat that for every year, forget about it; Excel just isn’t made for this kind of database analysis.

Let’s do it in R with some dplyr functions:

# flag top 50 names in each state
top_by_state = group_by(files, year, state, gender) %.% 
               mutate(group = ifelse( rank(desc(count), ties.method = "first") <= 50, "TopX", "Other"))

# summarise births by top vs other, across the country
top_national = group_by(top_by_state, gender, year, group) %.%
            summarise(births = sum(count))

# transpose the gender and group columns
top_national = dcast(top_national, year ~ gender+group, value.var = "births")

# calculate % of births among top names
top_national$M_topx_pct = top_national$M_TopX / (top_national$M_Other + top_national$M_TopX)
top_national$F_topx_pct = top_national$F_TopX / (top_national$F_Other + top_national$F_TopX)

In the first chunk, I find the top 50 boys and girls names in each state every year. Frankly within this whole analysis, this is the killer part. In fact, once we do this, we can easily send it back to Excel and run a PivotTable. But let’s stick with R.

As usual, we first use group_by() whenever we want to operate “by” something; in this case we want to create the Top vs Other flag by year, state and gender. Then with mutate() we use rank() to assign a “Top” vs “Other” flag to each name. The rank() function works like the Excel version of RANK(). The rest is standard summarizing and transposing.

Here is what that looks like:

  year F_Other F_TopX M_Other M_TopX M_topx_pct F_topx_pct
1 1910  135787 216295   41545 122680  0.7470239  0.6143313
2 1911  142400 229959   49814 143626  0.7424835  0.6175734
3 1912  198754 305514  114399 269302  0.7018538  0.6058564
4 1913  224439 342502  141875 319711  0.6926358  0.6041228
5 1914  278652 418213  189011 407417  0.6830950  0.6001349
6 1915  368441 540102  255586 525899  0.6729483  0.5944705

The last two columns are the ones we care about. The first row shows 75% of baby boys had one of the top 50 names in 1910. Wow! The remaining 25% of boys had the other 1000+ names. Now let’s plot the trend over time:

plot(gvisLineChart(top_national, xvar = "year", yvar = c("M_topx_pct","F_topx_pct"), 
     options = list(title = "% of newborns having the top 50 names", vAxis="{minValue:0}")))

[sc name=”names_trend_topX”]

Interesting. Starting in the early 80s, we see a decline in the popularity of popular names. Nowadays about 36% of baby boys have one of the 50 popular names in their state; 60 years ago 60% of boys had one of the fifty popular names. A similar effect is happening with girls’ names.

Conclusion

Parents these days have more names to pick from when naming their kids. Have parents gotten more creative, or is this a function of growing ethnic diversity in America? Who knows. We are also seeing that popular names are just not that popular anymore. When taking the top 50 names in each state each year, we are seeing a smaller and smaller percentage of babies having those names.

With just a few lines of code, we did in R what would have taken significantly more effort in Excel. This is a case where I would pick R over Excel, without a doubt. Stay tuned for more interesting analyses in future posts.

Leave a Reply

2 Comments on "The growing number of baby names – using dplyr"

avatar

Thomas Lumley
Guest
2 years 11 months ago

Nice post. One comment, though: I don’t think scaling the number of unique names by the population size is the ideal way.

The expected scaling will depend on name frequency, but, for example, if names behaved like words and roughly followed Zipf’s Law, you’d expect to get about log(N) unique names from N people when there was no underlying change in the name distribution. I don’t think the data really support a decrease in name diversity in the early 20th century.

wpDiscuz