Monday, March 31, 2008

Names: Boys vs. Girls

Using data from the 1990 US Census, I was amazed to discover that 90% of the US male population has one of 1,219 first names, while it takes 4,275 first names to cover 90% of the female population. That is 3.5x as many female first names as male first names.

The top 10 male first names are: James, John, Robert, Michael, William, David, Richard, Charles, Joseph and Thomas. Together they account for 23.2% of the male population, and 50% of the male population has one of only 60 names.

The top 10 female first names are: Mary, Patricia, Linda, Barbara, Elizabeth, Jennifer, Maria, Susan, Margaret and Dorothy. Together they account for 10.7% of the female population, and 50% of the female population has one of 139 names.

You can also see that all the variety in names happens between the 80% and 90% marks. For males, 80% of the population is covered by 27% of the names; for females, 80% is covered by 19% of the names.
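The coverage figures above come down to a running total over names sorted by popularity. Here is a minimal sketch of that calculation; the function name names_needed and the toy percentages are made up for illustration and are not the census figures.

```python
def names_needed(shares, target):
    """Count how many of the most common names it takes to cover
    `target` percent of a population.

    `shares` holds one percentage per name (hypothetical input format).
    Returns None if the shares never add up to the target.
    """
    running_total = 0.0
    # Walk the names from most to least common, accumulating coverage.
    for count, share in enumerate(sorted(shares, reverse=True), start=1):
        running_total += share
        if running_total >= target:
            return count
    return None


# Made-up shares for six names, NOT real census values.
toy_shares = [40, 25, 15, 10, 5, 5]

print(names_needed(toy_shares, 50))  # 2
print(names_needed(toy_shares, 90))  # 4
```

Running the same function over the male and female name lists with a target of 90 would be one way to arrive at counts like the 1,219 and 4,275 quoted earlier.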

The large number of female names appears to be because female names have many more variants than male names. A quick run through the data, calculating the Levenshtein distance between names and selecting the 10 closest for each, gives an average distance of:

Male: 2.62
Female: 2.01

So female names are more 'similar' than male names, hence the variety created by all these variants.
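For anyone who wants to try this, here is a rough sketch of the calculation. It assumes that "the 10 closest" means, for each name, the 10 other names with the smallest Levenshtein distance, and that the final figure is the mean of those per-name averages; the exact procedure may have differed. The function names are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_nearest_distance(names, k=10):
    """Average, over all names, of the mean distance to the k closest
    other names. Assumed interpretation, see the note above."""
    per_name = []
    for name in names:
        dists = sorted(levenshtein(name, other)
                       for other in names if other != name)
        nearest = dists[:k]
        if nearest:
            per_name.append(sum(nearest) / len(nearest))
    return sum(per_name) / len(per_name)


print(levenshtein("MARIA", "MARY"))  # 2
print(mean_nearest_distance(["ANN", "ANNA", "ANNE"], k=2))  # 1.0
```

This compares every name with every other name, so it is quadratic in the number of names. That is fine for a few thousand names but slow for much larger lists.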

The other thing we can extract from this data is the prevalence of names beginning with each letter, weighted by how often each name occurs.

Things are much more polarized when you look at trailing letters (for example, a trailing letter A is an almost sure sign that the name is a woman's; the opposite is true of D):

So combining the two it's possible to give a 'maleness' score (the blue part) to each final letter:
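One way such a score could be computed is sketched below. It assumes the score for a letter is the male share of names ending in that letter divided by the combined male and female share, with each name weighted by its percentage of its own sex's population (which treats the two populations as equal in size). The function name and the toy percentages are invented for illustration and are not census values.

```python
from collections import defaultdict


def maleness_by_last_letter(male, female):
    """Map each final letter to a score between 0 (female) and 1 (male).

    `male` and `female` map a name to the percentage of that sex's
    population carrying it (hypothetical input format).
    """
    male_weight = defaultdict(float)
    female_weight = defaultdict(float)
    # Sum the population share of every name, grouped by its last letter.
    for name, share in male.items():
        male_weight[name[-1].upper()] += share
    for name, share in female.items():
        female_weight[name[-1].upper()] += share

    scores = {}
    for letter in sorted(set(male_weight) | set(female_weight)):
        m = male_weight.get(letter, 0.0)
        f = female_weight.get(letter, 0.0)
        if m + f > 0:
            scores[letter] = m / (m + f)
    return scores


# Made-up shares, NOT real census values.
toy_male = {"DAVID": 2.0, "JOSHUA": 1.0}
toy_female = {"MARIA": 3.0, "LINDA": 1.0}

print(maleness_by_last_letter(toy_male, toy_female))
# {'A': 0.2, 'D': 1.0}
```

Swapping name[-1] for name[0] gives the same weighted breakdown for leading letters.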