So I have a data frame like this
Rank State/Union territory NSDP Per Capita (Nominal)(2019–20)[1][2] state_id
0 1 Goa 466585.0 30.0
1 2 Sikkim 425656.0 11.0
2 3 Delhi 376143.0 NaN
3 4 Chandigarh NaN 4.0
4 5 Haryana 247207.0 6.0
5 6 Telangana 225756.0 0.0
6 7 Karnataka 223246.0 29.0
7 8 Kerala 221904.0 32.0
8 9 Puducherry 220949.0 34.0
9 10 Andaman and Nicobar Islands 219842.0 NaN
10 11 Tamil Nadu 218599.0 33.0
11 12 Gujarat 216329.0 24.0
12 13 Mizoram 204018.0 15.0
13 14 Uttarakhand 202895.0 5.0
14 15 Maharashtra 202130.0 27.0
15 16 Himachal Pradesh 190255.0 2.0
16 17 Andhra Pradesh 168480.0 28.0
17 18 Arunachal Pradesh 164615.0 NaN
18 19 Punjab 161083.0 3.0
20 20 Nagaland 130282.0 13.0
21 21 Tripura 125630.0 16.0
22 22 Rajasthan 115492.0 8.0
23 23 West Bengal 115348.0 19.0
24 24 Odisha 98896.0 21.0
25 25 Chhattisgarh 105281.0 22.0
26 26 Jammu and Kashmir 102882.0 NaN
27 27 Madhya Pradesh 103288.0 23.0
28 28 Meghalaya 92174.0 17.0
29 29 Assam 90758.0 18.0
30 30 Manipur 84746.0 14.0
31 31 Jharkhand 79873.0 20.0
32 32 Uttar Pradesh 65704.0 9.0
33 33 Bihar 46664.0 10.0
And my other dictionary has
{'Telangana': 0, 'Andaman & Nicobar Island': 35, 'Andhra Pradesh': 28, 'Arunanchal Pradesh': 12, 'Assam': 18, 'Bihar': 10, 'Chhattisgarh': 22, 'Daman
& Diu': 25, 'Goa': 30, 'Gujarat': 24, 'Haryana': 6, 'Himachal Pradesh': 2, 'Jammu & Kashmir': 1, 'Jharkhand': 20, 'Karnataka': 29, 'Kerala': 32, 'Lakshadweep': 31, 'Madhya Pradesh': 23, 'Maharashtra': 27, 'Manipur': 14, 'Chandigarh': 4, 'Puducherry': 34, 'Punjab': 3, 'Rajasthan': 8, 'Sikkim': 11, 'Tamil Nadu': 33, 'Tripura': 16, 'Uttar Pradesh': 9, 'Uttarakhand': 5, 'West Bengal': 19, 'Odisha': 21, 'Dadara & Nagar Havelli': 26, 'Meghalaya': 17, 'Mizoram': 15, 'Nagaland': 13, 'NCT of Delhi': 7}
So you have probably seen the problem, Andaman and Nicobar Islands
is present in both but spelled differently, like ' Andaman & Nicobar Island'
in the dictionary.
And this makes the final column NaN
9 10 Andaman and Nicobar Islands 219842.0 NaN
How can I combine this with difflib library?
I've tried
df_19_20['State/Union territory'] = df_19_20['State/Union territory'].apply(get_close_matches(df_19_20['State/Union territory'], id_d.keys()))
and
df_19_20['State/Union territory'] = get_close_matches(df_19_20['State/Union territory'], id_d.keys())
Is there something I'm missing? How can I process the columns to get the best match?
The issue was in the application of
df.apply
df.apply
needs to be given a function taking in the value from each row it is being iterated over. You also need to clean the return ofget_close_matches
which returns alist
, so you need to take the first elementshould work