Pandas function operations


Data is from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015.

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)

I cannot get the state name out of my code. Please help.

My code:

import pandas as pd
import numpy as np
census_df = pd.read_csv('census.csv')
census_df.head()
def answer_five():
    return census_df.groupby('STNAME').COUNTY.sum().max()



answer_five()
9

There are 9 answers

1
Aishwarya Kanchan On
def answer_five():
    new_df = census_df[census_df['SUMLEV'] == 50]
    x = new_df.groupby('STNAME')
    return x.count()['COUNTY'].idxmax()


answer_five()
1
dfadeeff On

Here is the answer that worked for me:

def answer_five():
    return census_df.groupby(["STNAME"],sort=False).sum()["COUNTY"].idxmax()

The first part creates the aggregated DataFrame:

census_df.groupby(["STNAME"],sort=False).sum()

The second part selects the column you need:

["COUNTY"].idxmax()

and returns the index label (here, the state name) at which that column reaches its maximum.

0
Silvis Sora On

Actually, you can just count the county rows within each state instead of looking at the county details.

This should work:

census_df[census_df['SUMLEV']==50].groupby(['STNAME']).size().idxmax()
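To see why the SUMLEV filter matters, here is a minimal sketch on a made-up miniature table. The state and county names are invented for illustration. It assumes, as other answers in this thread state, that SUMLEV 50 marks a county row, and further assumes that each state also has one summary row with a different SUMLEV (40 here).

```python
import pandas as pd

# Made-up miniature of the census table (not real data).
# Assumption: SUMLEV 40 is a state summary row, SUMLEV 50 is a county row.
toy_df = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 40, 50, 50, 50],
    "STNAME": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"],
    "CTYNAME": ["Alpha", "A1 County", "A2 County",
                "Beta", "B1 County", "B2 County", "B3 County"],
})

# Without the filter, each state's summary row is counted like a county.
rows_per_state = toy_df.groupby("STNAME").size()

# With the filter, only county rows are counted.
counties_per_state = toy_df[toy_df["SUMLEV"] == 50].groupby("STNAME").size()

print(rows_per_state.to_dict())
print(counties_per_state.to_dict())
print(counties_per_state.idxmax())
```

Without the filter every state is over-counted by its summary row. The winner may come out the same, but the counts themselves are off by one.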
1
Jay Mulani On
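import pandas as pd

def answer_five():
    # Sum the columns within each state
    df = census_df.groupby(['STNAME']).sum()
    # Largest total in the COUNTY column
    fd = df['COUNTY'].max()
    # Keep only the state(s) that reach that total
    df = df[df['COUNTY'] == fd]
    return df.index[0]

answer_five()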
0
Anand Krishnan On

We can also answer this question using the sum() function:

def answer_five():
  return census_df.groupby(["STNAME"]).sum()["COUNTY"].idxmax()

sum() adds up the values in the COUNTY column for each state, and idxmax() then returns the state with the largest total. Note that COUNTY holds county codes, so this total is a sum of codes rather than a count of counties.
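To check what summing a code column actually computes, here is a minimal sketch on invented data. It assumes COUNTY is a numeric county code rather than a counter, and the codes below are chosen deliberately so that the two approaches disagree.

```python
import pandas as pd

# Invented county rows (not real data). COUNTY is treated as a numeric code.
toy_df = pd.DataFrame({
    "STNAME": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "COUNTY": [101, 103, 1, 3, 5],
})

# Sum of the codes per state: Alpha -> 204, Beta -> 9
code_sums = toy_df.groupby("STNAME")["COUNTY"].sum()

# Number of county rows per state: Alpha -> 2, Beta -> 3
row_counts = toy_df.groupby("STNAME").size()

print(code_sums.idxmax())   # state with the largest sum of codes
print(row_counts.idxmax())  # state with the most county rows
```

On this toy table the sum picks Alpha while the row count picks Beta, so a sum of codes can only match the count by coincidence.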

0
yogs On

def answer_five():
    county = census_df[census_df['SUMLEV']==50]
    county = county.groupby(['STNAME']).count()

    return county['SUMLEV'].idxmax(axis=0)

answer_five()

0
jasonlcy91 On

Just a correction to your entire code.

First, according to the source, SUMLEV of 50 means the row is a county. Two ways to answer this.

Thought process (think of it like in Excel): You want to count the number of "county rows" in each state group. First, you create the mask/condition to select all SUMLEV == 50 ("county rows"). Then group them by STNAME. Then use .size() to count the number of rows in each grouping.

# this is it!
def answer_five():
    mask = (census_df.SUMLEV == 50)
    max_index = census_df[mask].groupby('STNAME').size().idxmax()
    return max_index

# not so elegant
def answer_five():
    census_df['Counts'] = 1
    mask = (census_df.SUMLEV == 50)
    max_index = census_df[mask].groupby('STNAME')['Counts'].sum().idxmax()
    return max_index

You are welcome. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.size.html
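As a small illustration of why .size() is a natural fit here, this sketch compares it with .count() on made-up data with one missing county name. The table and names are invented. size() counts rows in each group, while count() counts only the non-null values in a column.

```python
import pandas as pd

# Invented data (not the census file); one CTYNAME is deliberately missing.
toy_df = pd.DataFrame({
    "STNAME": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "CTYNAME": ["A1 County", "A2 County", "B1 County", None, "B3 County"],
})

# size() counts every row in the group, missing values included.
sizes = toy_df.groupby("STNAME").size()

# count() skips missing values in the selected column.
counts = toy_df.groupby("STNAME")["CTYNAME"].count()

print(sizes.to_dict())
print(counts.to_dict())
```

If the column you count has no missing values the two agree, which is presumably why the count()-based answers in this thread also work.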

0
Nathan On

The key change is from .max() to .idxmax(): .max() returns the largest value itself (a large integer here), while .idxmax() returns the index label of that value, which is the STNAME.
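A minimal sketch of the difference, using an invented Series indexed by state name (the names and numbers are made up):

```python
import pandas as pd

# Invented per-state totals, indexed by state name.
counties_per_state = pd.Series({"Alpha": 2, "Beta": 3, "Gamma": 1})

largest_value = counties_per_state.max()     # the biggest number in the Series
largest_label = counties_per_state.idxmax()  # the index label where it occurs

print(largest_value)
print(largest_label)
```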

1
Terk On
def answer_five():
    return census_df.groupby('STNAME')['CTYNAME'].count().idxmax()