Passing a dataframe as train data and multiple columns of dataframe as train labels to a machine learning prediction model

I have a dataframe like the following:

BankNum   | FirstName | LastName | ID
00987772  | Michael   | Brown    | 123
00987772  | Bob       | Brown    | 123
00987772  | Michael   | Mooney   | 123
00987772  | Raven     | Mallik   | 245
00982122  | Karim     | Hareche  | 564

I am doing the following to get two dictionaries:

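import pandas as pd

# One aggregation per column; each 'func' receives that column for a single ID.
cols = [
    {'col': 'BankNum', 'func': lambda x: x.value_counts().to_dict()},
    {'col': 'FirstName', 'func': pd.Series.nunique},
    {'col': 'LastName', 'func': pd.Series.nunique}]

# ID -> (BankNum row counts, distinct first names, distinct last names)
d = df.groupby('ID').apply(lambda x: tuple(c['func'](x[c['col']]) for c in cols)).to_dict()

# BankNum -> (number of distinct IDs,)
cols1 = ['ID']
df2 = df.groupby('BankNum').apply(lambda x: tuple(x[c].nunique() for c in cols1))
d1 = df2.to_dict()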

where

d = {123: ({'00987772': 3}, 2, 2), 245: ({'00987772': 1}, 1, 1), 564: ({'00982122': 1}, 1, 1)}

d1 = {'00987772': (2,), '00982122': (1,)}
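For reference, the same per-ID and per-BankNum counts can be sketched with plain groupby aggregations. This is only an illustration built from the sample rows above; the column names n_rows, n_first and n_last are made up for the example, and BankNum is assumed to be stored as a string so the leading zeros survive.

```python
import pandas as pd

# Sample rows from the question; BankNum kept as a string (assumption)
# so that the leading zeros are preserved.
df = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982122"],
    "FirstName": ["Michael", "Bob", "Michael", "Raven", "Karim"],
    "LastName": ["Brown", "Brown", "Mooney", "Mallik", "Hareche"],
    "ID": [123, 123, 123, 245, 564],
})

# One row per ID: number of rows, distinct first names, distinct last names.
# The output column names are hypothetical.
per_id = df.groupby("ID").agg(
    n_rows=("BankNum", "size"),
    n_first=("FirstName", "nunique"),
    n_last=("LastName", "nunique"),
)

# Distinct IDs per BankNum (the same information that d1 holds).
ids_per_bank = df.groupby("BankNum")["ID"].nunique()

print(per_id)
print(ids_per_bank)
```

A flat table like per_id is usually easier to feed into a model than a dictionary of tuples, which is why it may be worth building one early.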

Next, I do the following (this is just the relevant code; I have removed the other things I'm doing):

def select_banknums(counts):
    # Keep the BankNums in the highest non-empty row-count tier:
    # > 6, then 5-6, then 3-4, then 1-2.
    tiers = [lambda n: n > 6,
             lambda n: n in [5, 6],
             lambda n: n in [3, 4],
             lambda n: n in [1, 2]]
    for in_tier in tiers:
        picked = [k2 for k2, v2 in counts.items() if in_tier(v2)]
        if picked:
            return picked
    return []

def classify(new_l):
    # Label by the highest non-empty tier of distinct-ID counts;
    # the weight is the sum of the counts in that tier.
    tiers = [("High", lambda g: g > 8),
             ("Moderate", lambda g: g in [5, 6, 7, 8]),
             ("Low", lambda g: g in [3, 4])]
    for label, in_tier in tiers:
        c = [g for g in new_l if in_tier(g)]
        if c:
            return (label, sum(c))
    return ("Low", 0.0)

same_banknum = {}
for k, v in d.items():
    # v[0] is the {BankNum: row count} dict for this ID, given the cols shown above
    l = select_banknums(v[0])
    # d1 maps BankNum -> (number of distinct IDs,)
    new_l = [d1.get(i)[0] for i in l]
    same_banknum[k] = classify(new_l)

to get a dictionary like this:

same_banknum = {123: ('Low', 0.6), 245: ('Low', 0.6), 564: ('Low', 0.0)}

The computation above checks whether the same BankNum appears under multiple IDs, then assigns each ID a High, Moderate, or Low value together with a weight for how strongly that holds. The result is the same_banknum dictionary.

I can convert it to a dataframe as follows:
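As a side note, the threshold checks on the distinct-ID counts could probably be expressed in one vectorized step with pd.cut. The sketch below uses made-up counts (not my real data) and a made-up "Below Low" label for counts of 1 or 2, which the code above reports as ("Low", 0.0).

```python
import pandas as pd

# Hypothetical distinct-ID counts per BankNum, for illustration only.
id_counts = pd.Series([1, 3, 4, 5, 8, 9, 12], name="n_ids")

# Right-closed bins: (0, 2], (2, 4], (4, 8], (8, inf)
tiers = pd.cut(
    id_counts,
    bins=[0, 2, 4, 8, float("inf")],
    labels=["Below Low", "Low", "Moderate", "High"],  # "Below Low" is a placeholder name
)

print(pd.DataFrame({"n_ids": id_counts, "tier": tiers}))
```

This only covers the labelling step; the weight would still need a separate aggregation over the counts that fall in the chosen tier.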

df1 = pd.DataFrame.from_dict(same_banknum, orient='index').reset_index()
df1.columns = ['ID','SameBankNum_Val','SameBankNum_Wt']

which gives:

ID   | SameBankNum_Val | SameBankNum_Wt
123  | Low             | 0.6
245  | Low             | 0.6
564  | Low             | 0.0

Instead of repeating this computation for each new dataset that comes in, I want to use machine learning to build a predictive model that predicts SameBankNum_Val and SameBankNum_Wt for new IDs (the test data).

I could add the SameBankNum_Val and SameBankNum_Wt columns to the training dataframe above. What I want to know is: how do I pass multiple columns (BankNum, FirstName, LastName, ID) from the first dataframe as the training data, and multiple columns (SameBankNum_Val, SameBankNum_Wt) from the second dataframe as the training labels, to a machine learning model?
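To make the question concrete, here is a minimal sketch of the shape I have in mind, assuming scikit-learn is available. It is not a tested solution. The features are assumed to be aggregated to one row per ID (hypothetical columns n_rows, n_first, n_last) so they line up with the per-ID labels, and the row with ID 777 is invented so that more than one class is present. Because one label is categorical and the other numeric, the sketch fits a separate classifier and regressor rather than a single multi-output model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical per-ID training table: aggregated features plus both labels.
# ID 777 and its values are invented for illustration.
train = pd.DataFrame({
    "ID": [123, 245, 564, 777],
    "n_rows": [3, 1, 1, 9],
    "n_first": [2, 1, 1, 4],
    "n_last": [2, 1, 1, 3],
    "SameBankNum_Val": ["Low", "Low", "Low", "High"],
    "SameBankNum_Wt": [0.6, 0.6, 0.0, 9.0],
})

feature_cols = ["n_rows", "n_first", "n_last"]  # ID is an identifier, so it is left out
X = train[feature_cols]                # several feature columns, passed as a 2-D frame
y_val = train["SameBankNum_Val"]       # categorical label -> classifier
y_wt = train["SameBankNum_Wt"]         # numeric label -> regressor

clf = DecisionTreeClassifier(random_state=0).fit(X, y_val)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_wt)

# Features for a new, unseen ID (made-up values).
new_ids = pd.DataFrame({"n_rows": [2], "n_first": [1], "n_last": [2]})
pred = pd.DataFrame({
    "SameBankNum_Val": clf.predict(new_ids),
    "SameBankNum_Wt": reg.predict(new_ids),
})
print(pred)
```

The decision trees are just a placeholder choice; any estimator with fit and predict could be swapped in the same way.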

Also, will the model be accurate enough at deciding when to assign High, Moderate, or Low, and what weight to give, without my running that long computation each time? For this part, I guess I'll just have to test several models first.
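One way I could check this is to label synthetic data with the rule itself and see whether a model reproduces it. The sketch below assumes scikit-learn and uses a single hypothetical feature (the distinct-ID count, values 1 to 12). It only shows that a tree can recover the thresholds when that count is supplied as a feature; it says nothing about training on the raw BankNum or name columns.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def rule_label(n):
    """Tier rule from the code above: > 8 High, 5 to 8 Moderate, otherwise Low."""
    if n > 8:
        return "High"
    if n >= 5:
        return "Moderate"
    return "Low"

# Hypothetical feature values: every count from 1 to 12, repeated 20 times.
train_counts = np.tile(np.arange(1, 13), 20)
test_counts = np.arange(1, 13)

X_train = train_counts.reshape(-1, 1)
y_train = [rule_label(n) for n in train_counts]
X_test = test_counts.reshape(-1, 1)
y_test = [rule_label(n) for n in test_counts]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

On real data I would hold out IDs the model has never seen and compare its predictions with the output of the original computation.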

Please help! Thanks!
