How to Get Top 5 Baby Names Female and Male in Python


Hello Readers,

Here in the third part of the Python and Pandas series, we analyze over 1.6 million baby name records from the United States Social Security Administration from 1880 to 2010. A particular name must have at least 5 occurrences for inclusion in the data set. We will explore reading in multiple raw data files, merging them into one DataFrame, subsetting desired portions of the data, creating new variable metrics, and visualizing results.

As usual, start IPython in your command prompt if you want to follow along. You can find the data here, under National data (it unzips to a 'names' folder). Let's jump in.

Data Preview


The baby name files are split by year of birth, all in a similar format: 'yob1880.txt', 'yob1881.txt', and so on to 'yob2010.txt'. You can go ahead and import the 'pandas', 'pylab', and 'numpy' modules now or when they are required later.

Use the '.read_csv()' method to access the first text file of baby names in 1880. We see that there were 2,000 boy and girl names in the data that year (n>=5), with three variables: the name, the sex of the baby, and the birth count for that name.

Code:


    C:\Users\wayne>cd .\Documents\python\dataAnalysis\git\ch02

    C:\Users\wayne\Documents\python\dataAnalysis\git\ch02>ipython --matplotlib
    Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]
    Type "copyright", "credits" or "license" for more information.

    IPython 2.1.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.
    Using matplotlib backend: Qt4Agg

    In [1]: import pandas as pd

    In [2]: names1880 = pd.read_csv('names\yob1880.txt', names=['name','sex','births'])

    In [3]: names1880
    Out[3]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2000 entries, 0 to 1999
    Data columns (total 3 columns):
    name      2000  non-null values
    sex       2000  non-null values
    births    2000  non-null values
    dtypes: int64(1), object(2)

    In [4]: names1880.groupby('sex').births.sum()
    Out[4]:
    sex
    F     90993
    M    110493
    Name: births, dtype: int64


Performing a quick tabulation, we group the data by 'sex' and view the sum of 'births'. There are 90,993 girls and 110,493 boys in the 1880 data.
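For readers without the files at hand, the same read-and-tabulate step can be tried on a few inline rows. This is only a minimal sketch: the rows in 'sample' are toy values made up for illustration, not SSA figures, and the names 'sample', 'names_sample', and 'births_by_sex' are our own.

```python
import io

import pandas as pd

# Toy stand-in for 'yob1880.txt' (made-up counts, not real SSA data).
sample = io.StringIO("Mary,F,70\nAnna,F,26\nJohn,M,96\nWilliam,M,95\n")

# The raw files have no header row, so the column names are supplied explicitly.
names_sample = pd.read_csv(sample, names=["name", "sex", "births"])

# Sum the birth counts within each sex.
births_by_sex = names_sample.groupby("sex")["births"].sum()
print(births_by_sex)
```

Passing 'names=' matters here: without it, pandas would treat the first data row as the header.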


Assembling the Data


Now that we have an idea of the data contents, and we know the pattern of the text file names, we can create a loop to read in the data. At the same time, we add another variable denoting the year of each name entry, for when all the years are together.

Create a 'years' variable which we will use to iterate through each year's text file. Then we read the data, add a 'year' column, append the data to the list 'pieces', and then merge the pieces together. Using the '%d' string formatter, we can replace that placeholder with a given variable, 'year'. After using the '.append()' method to add the current 'frame' object to 'pieces', we take advantage of '.concat()' to merge the frames in 'pieces' by row for a completed DataFrame in 'names'. 'ignore_index' should be True because we do not want to keep the original indexes.


Code:


    # data split by year
    # so gather data into single DataFrame and add year field
    # use %d string formatter to iterate through the years
    # add 'year' column
    # append pieces together
    # .concat merges by row and does not preserve original row numbers

    In [9]: pieces = []

    In [10]: columns = ['name', 'sex', 'births']

    In [11]: years = range(1880, 2011)

    In [12]: for year in years:
       ....:     path = 'names/yob%d.txt' % year
       ....:     frame = pd.read_csv(path, names=columns)
       ....:     frame['year'] = year
       ....:     pieces.append(frame)
       ....:     names = pd.concat(pieces, ignore_index=True)
       ....:

    In [13]: names
    Out[13]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1690784 entries, 0 to 1690783
    Data columns (total 4 columns):
    name      1690784  non-null values
    sex       1690784  non-null values
    births    1690784  non-null values
    year      1690784  non-null values
    dtypes: int64(2), object(2)

    # 1,690,784 rows of data with 4 columns
    In [14]: names.save('names.pkl')


In the 'names' DataFrame, we have 1,690,784 name records from years 1880 to 2010 with four columns, including the year. Remember to pickle the DataFrame with '.save()', in other words, save it. It is a hefty file, around 63 MB in size, but Python will do all the heavy lifting!
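The read, tag, append, and concatenate pattern can be sketched on a couple of tiny files. Everything below is an assumption for illustration: the temporary folder and its two toy files stand in for the real 'names' folder, and the counts are made up. Two choices differ from the session above. First, '.concat()' is called once after the loop rather than on every pass, which gives the same final DataFrame with less repeated work. Second, '.to_pickle()' and 'pd.read_pickle()' are used because they are the current spellings of the older '.save()' and 'pd.load()' in recent pandas releases.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy files standing in for names/yob1880.txt and names/yob1881.txt.
toy_dir = tempfile.mkdtemp()
toy_rows = {
    1880: "Mary,F,70\nJohn,M,96\n",
    1881: "Mary,F,69\nJohn,M,87\nAnna,F,27\n",
}
for year, text in toy_rows.items():
    with open(os.path.join(toy_dir, "yob%d.txt" % year), "w") as handle:
        handle.write(text)

columns = ["name", "sex", "births"]
pieces = []
for year in range(1880, 1882):
    path = os.path.join(toy_dir, "yob%d.txt" % year)
    frame = pd.read_csv(path, names=columns)
    frame["year"] = year  # tag every row with the year of its source file
    pieces.append(frame)

# Concatenate once, after the loop, and renumber the rows from 0.
names_toy = pd.concat(pieces, ignore_index=True)

# Save to disk and read back to confirm the round trip.
pickle_path = os.path.join(toy_dir, "names.pkl")
names_toy.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)
print(names_toy)
```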


Exploring the Data


First off, a pivot table is in order. Let's move 'sex' to the columns and 'year' to the rows, while positioning 'births' in the values. Calling '.tail()' will give us the last 5 rows in the table. To get a bigger picture, plot the table of births stratified by sex and year.

Code:


    In [17]: total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)

    In [19]: total_births.tail()
    Out[19]:
    sex         F        M
    year
    2006  1896468  2050234
    2007  1916888  2069242
    2008  1883645  2032310
    2009  1827643  1973359
    2010  1759010  1898382

    In [20]: total_births.plot(title='Total Births by sex and year')
    Out[20]: <matplotlib.axes.AxesSubplot at 0x1a5f4730>

Figure 1. Total births by sex and year

We can observe birth trends rising and falling with economic trends: births tend to fall in times of recession, and male births started to outpace female births after WWII.
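One note for readers on a newer pandas: the 'rows=' and 'cols=' keywords in the session above were later renamed to 'index=' and 'columns='. A minimal sketch of the same pivot with the newer keywords follows, using a toy DataFrame with made-up counts; 'toy' and 'total_births_toy' are our own names.

```python
import pandas as pd

# Toy records (made-up counts) in the same layout as the 'names' DataFrame.
toy = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Mary", "John"],
    "sex":    ["F", "F", "M", "F", "M"],
    "births": [70, 26, 96, 69, 87],
    "year":   [1880, 1880, 1880, 1881, 1881],
})

# Years go down the rows, sexes across the columns, and births are summed per cell.
total_births_toy = toy.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)
print(total_births_toy)
```

Calling 'total_births_toy.plot()' would then draw one line per sex, as in Figure 1.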

Adding Proportion and Subsetting Top 1000 Names


Here we add a column of proportions to our 'names' DataFrame. Each proportion is the number of births for a name divided by the total births in its year and sex group. We define a new function, 'add_prop()', and convert the births to a float type so that the division is not floored as integer division. Then we divide the births by the sum of births in the group, and return the group with the new column.

Code:


    # group by year and sex
    # add proportion of babies given certain name relative to number of births

    In [23]: def add_prop(group):
       ....:     # integer division floors
       ....:     births = group.births.astype(float)
       ....:     group['prop'] = births / births.sum()
       ....:     return group
       ....:

    In [24]: names = names.groupby(['year','sex']).apply(add_prop)

    In [25]: names
    Out[25]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1690784 entries, 0 to 1690783
    Data columns (total 5 columns):
    name      1690784  non-null values
    sex       1690784  non-null values
    births    1690784  non-null values
    year      1690784  non-null values
    prop      1690784  non-null values
    dtypes: float64(1), int64(2), object(2)

    # check to see values by group sum to 1
    In [28]: import numpy as np

    In [29]: np.allclose(names.groupby(['year','sex']).prop.sum(), 1)
    Out[29]: True

    # subset top 1000 births
    In [30]: def get_top1000(group):
       ....:     return group.sort_index(by='births', ascending=False)[:1000]
       ....:

    In [31]: grouped = names.groupby(['year','sex'])

    In [32]: top1000 = grouped.apply(get_top1000)

    In [37]: top1000
    Out[37]:
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 261877 entries, (1880, F, 0) to (2010, M, 1677643)
    Data columns (total 5 columns):
    name      261877  non-null values
    sex       261877  non-null values
    births    261877  non-null values
    year      261877  non-null values
    prop      261877  non-null values
    dtypes: float64(1), int64(2), object(2)


We pass the groups by year and sex to the 'add_prop()' function using '.apply()'. Confirming the new 'prop' column, the new 'names' DataFrame now has five columns. To ensure the birth proportions by group are accurate, we verify using the '.allclose()' function in the 'numpy' module, and compare each group sum to 1. Python returns 'True', and we are assured the column values are right.


With this new DataFrame, we will now subset the top 1000 names by births in each year and sex group. Define a new function, 'get_top1000()', which sorts the births in descending order and returns the first 1000 entries. We pass the 'names' DataFrame grouped by year and sex to 'get_top1000()' and store the result in our new DataFrame, 'top1000'. Instead of over 1.6 million entries, we now have 261,877 entries with which to work.
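The proportion and top-N steps can also be written without custom functions. The sketch below is an alternative under our own assumptions, not the method used in the session: it relies on '.transform()' and '.sort_values()' from newer pandas releases (where 'sort_index(by=...)' is no longer available), and on Python 3, where '/' is already true division so no float conversion is needed. The data are toy values, and 'top_n' is 2 rather than 1000 so the effect is visible.

```python
import numpy as np
import pandas as pd

# Toy records (made-up counts) for a single year.
toy = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "William"],
    "sex":    ["F", "F", "F", "M", "M"],
    "births": [70, 26, 20, 96, 95],
    "year":   [1880, 1880, 1880, 1880, 1880],
})

# Total births of each (year, sex) group, repeated on every row of that group.
group_totals = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_totals

# Sanity check: the proportions within each group add up to 1.
assert np.allclose(toy.groupby(["year", "sex"])["prop"].sum(), 1)

# Top N per group: sort by births, then keep the first N rows of each group.
top_n = 2
top_toy = (
    toy.sort_values("births", ascending=False)
       .groupby(["year", "sex"])
       .head(top_n)
)
print(top_toy)
```

Unlike the '.apply()' version, this keeps the original flat row index instead of building a MultiIndex of year, sex, and row number.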

Some Naming Trends


Because the data span 1880 to 2010, we can examine trends in baby names over time. As usual, remember to save and pickle the 'top1000' DataFrame as we continue the analysis. Begin by separating the sexes into two different DataFrames for later use.

Now create a pivot table from 'top1000', with births as summed values, years in rows, and names in the columns. There are 131 rows, one for each year, and 6,865 columns, or names. We will subset by column, take only specific names, and plot the births for the selected names by year in a single plot. You can choose different names; I chose John, Harry, Mary, and Marilyn as sample names.

Code:


    # analyze naming trends
    In [38]: top1000.save('top1000.pkl')

    In [39]: boys = top1000[top1000.sex == 'M']

    In [40]: girls = top1000[top1000.sex == 'F']

    In [41]: total_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=sum)

    In [42]: total_births
    Out[42]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 131 entries, 1880 to 2010
    Columns: 6865 entries, Aaden to Zuri
    dtypes: float64(6865)

    In [43]: subset = total_births[['John','Harry','Mary','Marilyn']]

    In [46]: subset.plot(subplots=True, figsize=(12,10), grid=False, title='Number of births per year')
    Out[46]:
    array([<matplotlib.axes.AxesSubplot object at 0x1A7144D0>,
           <matplotlib.axes.AxesSubplot object at 0x14C13E30>,
           <matplotlib.axes.AxesSubplot object at 0x1AA527F0>,
           <matplotlib.axes.AxesSubplot object at 0x172A05B0>], dtype=object)

    # plot 'Wayne'
    In [47]: subsetw = total_births[['Wayne']]

    In [48]: subsetw.plot(title='Wayne')
    Out[48]: <matplotlib.axes.AxesSubplot at 0x14c0f2f0>

    # names growing out of favor?

Figure 2. Number of Births per Year, Selected Names

John, Harry, and Mary have bimodal peaks around the 1920s and 1950s. Marilyn became steadily more popular from the 1930s to the late 1950s. For all four names, we then observe a fall in births per year. Are those names really becoming more uncommon? We will discover what is happening in the data below.

Curious, I plotted my own name to see the birth trend for 'Wayne' over time. It follows the same rise, peak, and fall around the 1950s, though with less of a bimodal shape.

Figure 3. Number of Births per Year, For Name: Wayne
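A small sketch of the pivot-by-name step may help explain one detail of the session output: every column of 'total_births' is float64 even though births are whole numbers. A name that is missing from the top list in some year has no value for that cell, and pandas fills the gap with NaN, which forces a float column. The data below are toy values, and the 'index=' and 'columns=' keywords are the newer spellings of 'rows=' and 'cols='.

```python
import pandas as pd

# Toy records (made-up counts); Anna appears in 1880 only.
toy = pd.DataFrame({
    "name":   ["Mary", "Anna", "John", "Mary", "John"],
    "births": [70, 26, 96, 69, 87],
    "year":   [1880, 1880, 1880, 1881, 1881],
})

# One row per year, one column per name.
by_name = toy.pivot_table(
    values="births", index="year", columns="name", aggfunc="sum"
)

# Select only the names of interest, in the order we want them plotted.
subset_toy = by_name[["John", "Mary"]]
print(by_name)
```

With matplotlib available, 'subset_toy.plot(subplots=True)' would draw one panel per selected name, as in Figure 2.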

Baby Name Diversity


First, a spoiler: the drop in births for certain names has something to do with name diversity, that is, what parents choose to name their child. The trend changes from the 1950s onwards. To examine this change, we turn to the variable we created earlier, the proportion of births in each group by year and sex. So we create a pivot table from 'top1000', but this time with 'prop' as the summed values, 'year' as rows, and 'sex' as columns.

This will allow us to plot Figure 4. Note how the proportion total starts at 1.0 in 1880, and begins to drop around 1960 for females and around 1970 for males. The proportion of births accounted for by the top 1000 names has declined to around 74% for females and 85% for males by 2010. That means the share of births for names outside of the top 1000 has risen. More parents are choosing different, more uncommon names for their newborns.

Figure 4. Proportion of Top Births by Year and Sex

We can check this by examining the boys and girls DataFrames we created earlier. We subset the year 2010, sort by proportion in descending order, then take the cumulative sum of the proportions of births. Taking the first ten names, we see that the top name was roughly 1.15% of the total male births in 2010. Using '.searchsorted(0.5)' to find the sorted index at which the cumulative sum reaches 50%, Python returns 116. Because the index starts at zero, 117 names make up 50% of the male births in 2010. We compare this number to the same figure for male births in 1900, which is 25. From 1900 to 2010, the number of names covering the top 50% of male births increased by over 350%, from 25 to 117. Male name diversity surely increased over the years.
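To see why one is added to the '.searchsorted()' result, here is a worked toy example. The five proportions are made up so the arithmetic can be followed by hand, and 'props', 'cumulative', 'position', and 'names_needed' are our own names. It uses '.sort_values()', the newer replacement for 'sort_index(by=...)'.

```python
import pandas as pd

# Made-up shares of births for five names; they add up to 1.
props = pd.Series([0.10, 0.30, 0.05, 0.20, 0.35])

# Largest share first, then a running total: 0.35, 0.65, 0.85, 0.95, 1.00
cumulative = props.sort_values(ascending=False).cumsum()

# Zero-based position of the first running total that reaches 0.5.
position = int(cumulative.searchsorted(0.5))

# Positions start at 0, so the count of names needed is position + 1.
names_needed = position + 1
print(position, names_needed)  # prints: 1 2
```

Here the top name alone covers 35% of births, and the top two cover 65%, so two names are needed to reach half of all births.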

Code:


    # exploring increases in naming diversity
    # fewer parents choosing common names for children

    In [15]: table = top1000.pivot_table('prop', rows=
       ....: 'year', cols='sex', aggfunc=sum)

    In [17]: table.plot(title='Sum of table1000.prop by year and sex', \
       ....: yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10))
    Out[17]: <matplotlib.axes.AxesSubplot at 0x6a1ead0>

    # names proportion going down from 1 from top 1000 names
    In [20]: df = boys[boys.year==2010]

    In [21]: df
    Out[21]:
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 1000 entries, (2010, M, 1676644) to (2010, M, 1677643)
    Data columns (total 5 columns):
    name      1000  non-null values
    sex       1000  non-null values
    births    1000  non-null values
    year      1000  non-null values
    prop      1000  non-null values
    dtypes: float64(1), int64(2), object(2)

    In [22]: prop_cumsum = df.sort_index(by='prop', ascending=False).prop.cumsum()

    In [23]: prop_cumsum[:10]
    Out[23]:
    year  sex
    2010  M    1676644    0.011523
               1676645    0.020934
               1676646    0.029959
               1676647    0.038930
               1676648    0.047817
               1676649    0.056579
               1676650    0.065155
               1676651    0.073414
               1676652    0.081528
               1676653    0.089621
    dtype: float64

    In [24]: prop_cumsum.searchsorted(0.5)
    Out[24]: 116

    # index 116, so 117 names in top 50% in 2010
    In [25]: df1900 = boys[boys.year==1900]

    In [26]: prop1900 = df1900.sort_index(by='prop', ascending=False).prop.cumsum()

    In [27]: prop1900.searchsorted(0.5)+1
    Out[27]: 25

    # in 1900, top 50% of names covered with 25 names
    # so there is a big increase in name diversity
    In [28]: def get_quantile(group, q=0.5):
       ....:     group = group.sort_index(by='prop', ascending=False)
       ....:     return group.prop.cumsum().searchsorted(q)+1
       ....:

    In [29]: diversity = top1000.groupby(['year','sex']).apply(get_quantile)

    In [30]: diversity = diversity.unstack('sex')

    In [31]: diversity.head()
    Out[31]:
    sex    F   M
    year
    1880  38  14
    1881  38  14
    1882  38  15
    1883  39  15
    1884  39  16

    In [32]: diversity.plot(title='Number of popular names in top 50%')
    Out[32]: <matplotlib.axes.AxesSubplot at 0x2075c1f0>


The same diversity increase holds for female births as well. Rather than take the number of names in the top 50% for only 1900 and 2010, we calculate it for all the years, and for both male and female names. Define a new function, 'get_quantile(group, q=0.5)', which sorts the 'group' argument by the variable 'prop' in descending order, and returns the index at which the sorted cumulative sum reaches 0.5, adding 1 at the end to account for the zero-based index.


Pass this function through the top 1000 names grouped by year and sex, and store the result in 'diversity'. Reconfigure the table by placing 'sex' in the columns with '.unstack()' to finalize it. Take a peek at the first five years of the 'diversity' data with '.head()', and observe that 38 female names and 14 male names accounted for 50% of the top births in 1880. Lastly we plot the 'diversity' DataFrame, shown below.

Figure 5. Popular Baby Names in 50% percentile

We see a distinct increase in name diversity around 1985 for both males and females. Historically, female names were more diverse than male names. By 2010, the number of top female names accounting for 50% of births was more than double the male counterpart.
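The per-group calculation can be sketched on toy data small enough to check by hand. This is a variant under our own assumptions rather than the session's exact code: the helper 'get_quantile_count' is a hypothetical name, it is applied to the 'prop' column of each group instead of the whole group, and it uses '.sort_values()' from newer pandas releases. The proportions are made up.

```python
import pandas as pd

# Made-up proportions for one year; each sex sums to 1.
toy = pd.DataFrame({
    "year": [1880] * 7,
    "sex":  ["F", "F", "F", "F", "M", "M", "M"],
    "prop": [0.40, 0.30, 0.20, 0.10, 0.60, 0.25, 0.15],
})


def get_quantile_count(prop, q=0.5):
    """Return how many of the largest proportions are needed to reach q."""
    cumulative = prop.sort_values(ascending=False).cumsum()
    # searchsorted gives a zero-based position, so add 1 to get a count.
    return int(cumulative.searchsorted(q)) + 1


# One count per (year, sex) group, then move 'sex' from the index to the columns.
diversity_toy = (
    toy.groupby(["year", "sex"])["prop"]
       .apply(get_quantile_count)
       .unstack("sex")
)
print(diversity_toy)
```

In this toy year, two female names are needed to reach half of female births (0.40 + 0.30), while a single male name already covers 60% of male births.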

Once again, we reach the end of another lengthy, but I hope enjoyable, post in Python and Pandas concerning baby names. We explored and manipulated a dataset of 1.6 million rows, re-organized DataFrames, created new variables, and visualized various name metrics, all after accessing data split into 131 text files. There is more on baby names that we will explore in Part B of this post. So stay tuned for more!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Python and Pandas Series:
1. Python and Pandas: Part 1. bit.ly and Time Zones
2. Python and Pandas: Part 2. Movie Ratings
3. Python and Pandas: Part 3. Baby Names, 1880-2010
4. Python and Pandas: Part 4. More Baby Names


Source: http://beyondvalence.blogspot.com/2014/09/python-and-pandas-part-3-baby-names.html
