I have a dataframe like the following:
BankNum | FirstName | LastName | ID |
00987772 | Michael | Brown | 123 |
00987772 | Bob | Brown | 123 |
00987772 | Michael | Mooney | 123 |
00987772 | Raven | Mallik | 245 |
00982122 | Karim | Hareche | 564 |
I am doing the following to get two dictionaries:
import pandas as pd

# Per ID: BankNum counts plus the number of distinct first and last names
cols = [
    {'col': 'BankNum', 'func': lambda x: x.value_counts().to_dict()},
    {'col': 'FirstName', 'func': pd.Series.nunique},
    {'col': 'LastName', 'func': pd.Series.nunique},
]
d = df.groupby('ID').apply(lambda x: tuple(c['func'](x[c['col']]) for c in cols)).to_dict()

# Per BankNum: number of distinct IDs
cols1 = ['ID']
df2 = df.groupby('BankNum').apply(lambda x: tuple(x[c].nunique() for c in cols1))
d1 = df2.to_dict()
where
d = {123: ({'00987772': 3}, 2, 2), 245: ({'00987772': 1}, 1, 1), 564: ({'00982122': 1}, 1, 1)}
d1 = {'00987772': (2,), '00982122': (1,)}
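For reference, the two dictionaries can be reproduced from the sample rows with the sketch below. It assumes BankNum is stored as a string (so the leading zeros survive) and that the grouping key for d is the ID column; selecting the columns before calling apply is just one way to write it.

```python
import pandas as pd

# Sample rows from the table above; BankNum is kept as a string (assumption)
df = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982122"],
    "FirstName": ["Michael", "Bob", "Michael", "Raven", "Karim"],
    "LastName": ["Brown", "Brown", "Mooney", "Mallik", "Hareche"],
    "ID": [123, 123, 123, 245, 564],
})

cols = [
    {"col": "BankNum", "func": lambda x: x.value_counts().to_dict()},
    {"col": "FirstName", "func": pd.Series.nunique},
    {"col": "LastName", "func": pd.Series.nunique},
]

# Per ID: BankNum counts plus the number of distinct first and last names
d = (
    df.groupby("ID")[["BankNum", "FirstName", "LastName"]]
    .apply(lambda x: tuple(c["func"](x[c["col"]]) for c in cols))
    .to_dict()
)

# Per BankNum: number of distinct IDs, wrapped in a 1-tuple as in the original
d1 = (
    df.groupby("BankNum")[["ID"]]
    .apply(lambda x: (x["ID"].nunique(),))
    .to_dict()
)

print(d)
print(d1)
```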
Next, I'm doing the following (below is just the relevant code; I've removed the other things I'm doing):
same_banknum = {}
l = []
w = []
m = v[2].values()
h2 = sum(i > 6 for i in m)
mod2 = sum(i in [5,6] for i in m)
l2 = sum(i in [3,4] for i in m)
if h2 != 0:
    for k2, v2 in v[2].items():
        if v2 > 6:
            l.append(k2)
            w.append(v2)
    new_l = []
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c = []
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Low", wt)
    else:
        same_banknum[k] = ("Low", 0.0)
elif mod2 != 0:
    for k2, v2 in v[2].items():
        if v2 in [5,6]:
            l.append(k2)
            w.append(v2)
    new_l = []
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c = []
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Low", wt)
    else:
        same_banknum[k] = ("Low", 0.0)
elif l2 != 0:
    for k2, v2 in v[2].items():
        if v2 in [3,4]:
            l.append(k2)
            w.append(v2)
    new_l = []
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c = []
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Low", wt)
    else:
        same_banknum[k] = ("Low", 0.0)
else:
    for k2, v2 in v[2].items():
        if v2 in [1,2]:
            l.append(k2)
    new_l = []
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c = []
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k] = ("Low", wt)
    else:
        same_banknum[k] = ("Low", 0.0)
to get a dictionary like this:
same_banknum = {123: ('Low', 0.6), 245: ('Low', 0.6), 564: ('Low', 0.0)}
The computation above checks whether the same BankNum exists for multiple IDs, then assigns each ID a value of High, Moderate, or Low, along with a weight indicating how strongly that holds. The result is the same_banknum dictionary.
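Since the four branches in the excerpt differ only in which BankNum keys they select, the logic can be condensed into small helpers. This is a sketch of that excerpt only, not the full code I run: the function names pick_keys, rate, and same_banknum_rating are hypothetical, and bank_counts stands for the per-ID dictionary of BankNum counts (v[2] in the excerpt).

```python
def pick_keys(bank_counts):
    """Return the keys in the highest non-empty tier of counts (>6, 5-6, 3-4, 1-2)."""
    tiers = [
        lambda n: n > 6,
        lambda n: n in (5, 6),
        lambda n: n in (3, 4),
        lambda n: n in (1, 2),
    ]
    for in_tier in tiers:
        keys = [key for key, n in bank_counts.items() if in_tier(n)]
        if keys:
            return keys
    return []


def rate(id_counts):
    """Map a list of distinct-ID counts to a (label, weight) pair."""
    levels = [
        ("High", lambda n: n > 8),
        ("Moderate", lambda n: n in (5, 6, 7, 8)),
        ("Low", lambda n: n in (3, 4)),
    ]
    for label, in_level in levels:
        matched = [n for n in id_counts if in_level(n)]
        if matched:
            # Weight is the sum of the counts that fall in the winning level
            return (label, sum(matched))
    return ("Low", 0.0)


def same_banknum_rating(bank_counts, d1):
    """Select BankNum keys for one ID, look up their ID counts in d1, and rate them."""
    return rate([d1[key][0] for key in pick_keys(bank_counts)])


# d1 as built earlier: BankNum -> (number of distinct IDs,)
d1 = {"00987772": (2,), "00982122": (1,)}
print(same_banknum_rating({"00987772": 3}, d1))
```

Writing the tiers as data keeps the thresholds in one place, so changing a cutoff no longer means editing four copies of the same branch.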
I can then convert this to a dataframe as follows:
df1 = pd.DataFrame.from_dict(same_banknum, orient='index').reset_index()
df1.columns = ['ID','SameBankNum_Val','SameBankNum_Wt']
which gives:
ID | SameBankNum_Val | SameBankNum_Wt
123 | Low | 0.6
245 | Low | 0.6
564 | Low | 0.0
Instead of performing this computation again and again for each new dataset that comes in, I want to use machine learning to build a predictive model that predicts SameBankNum_Val and SameBankNum_Wt for the new IDs (test data).
I could add the SameBankNum_Val and SameBankNum_Wt columns to the training dataframe above. But what I want to know is:
How do I pass multiple columns (BankNum, FirstName, LastName, ID) from the first dataframe above as the training data, and multiple columns (SameBankNum_Val, SameBankNum_Wt) from the second dataframe above as the training labels, in a machine learning model?
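To make the question concrete, here is a minimal sketch of one possible setup, assuming scikit-learn is available: the label columns are merged onto the rows by ID, the text columns are one-hot encoded, and two separate models are fitted, a classifier for SameBankNum_Val and a regressor for SameBankNum_Wt. The choice of random forests and of OneHotEncoder is an assumption for illustration only, and five rows are far too few to say anything about accuracy.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Training rows (first dataframe) and per-ID labels (second dataframe)
train = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982122"],
    "FirstName": ["Michael", "Bob", "Michael", "Raven", "Karim"],
    "LastName": ["Brown", "Brown", "Mooney", "Mallik", "Hareche"],
    "ID": [123, 123, 123, 245, 564],
})
labels = pd.DataFrame({
    "ID": [123, 245, 564],
    "SameBankNum_Val": ["Low", "Low", "Low"],
    "SameBankNum_Wt": [0.6, 0.6, 0.0],
})

# Attach the label columns to every row that shares the ID
data = train.merge(labels, on="ID", how="left")

X = data[["BankNum", "FirstName", "LastName", "ID"]]  # multiple feature columns
y_val = data["SameBankNum_Val"]                       # categorical target
y_wt = data["SameBankNum_Wt"]                         # numeric target


def make_encoder():
    # One-hot encode the text columns; pass ID through unchanged
    return ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"),
          ["BankNum", "FirstName", "LastName"])],
        remainder="passthrough",
    )


val_model = make_pipeline(
    make_encoder(), RandomForestClassifier(n_estimators=10, random_state=0)
).fit(X, y_val)
wt_model = make_pipeline(
    make_encoder(), RandomForestRegressor(n_estimators=10, random_state=0)
).fit(X, y_wt)

# Hypothetical unseen row; unknown categories are ignored by the encoder
new_rows = pd.DataFrame({
    "BankNum": ["00987772"], "FirstName": ["Alice"],
    "LastName": ["Smith"], "ID": [999],
})
print(val_model.predict(new_rows), wt_model.predict(new_rows))
```

One caveat with this shape: the labels are computed from per-group counts, while each training row here describes a single person, so a row-level model only sees those counts indirectly.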
Also, will the machine learning model be accurate enough at deciding when to assign a High, Low, or Moderate value, and with what weight, without performing that long computation again and again? For this question, I guess I'll just have to test multiple models first.
Please help! Thanks!