Train/test split such that each name and the proportion of each target class are present in both train and test


I am trying to solve an ML problem: predicting whether a person will deliver an order or not. The dataset is highly imbalanced. Here is a glimpse of it:

[{'order_id': '1bjhtj', 'Delivery Guy': 'John', 'Target': 0},
 {'order_id': '1aec', 'Delivery Guy': 'John', 'Target': 0},
 {'order_id': '1cgfd', 'Delivery Guy': 'John', 'Target': 0},
 {'order_id': '1bceg', 'Delivery Guy': 'Tom', 'Target': 0},
 {'order_id': '1a2fg', 'Delivery Guy': 'Tom', 'Target': 0},
 {'order_id': '1cbsf', 'Delivery Guy': 'Tom', 'Target': 1},
 {'order_id': '1bc5', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1a22', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1bzc5', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1av22', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1bsc5', 'Delivery Guy': 'Jay', 'Target': 1},
 {'order_id': '1a2t2', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1bc5b', 'Delivery Guy': 'Jay', 'Target': 0},
 {'order_id': '1a22a', 'Delivery Guy': 'Mary', 'Target': 0},
 {'order_id': '1c5bv', 'Delivery Guy': 'Mary', 'Target': 0},
 {'order_id': 'vb2er', 'Delivery Guy': 'Mary', 'Target': 0},
 {'order_id': '1bs5s', 'Delivery Guy': 'Mary', 'Target': 0},
 {'order_id': '1a22n', 'Delivery Guy': 'Mary', 'Target': 0},
 {'order_id': '122a', 'Delivery Guy': 'James', 'Target': 1},
 {'order_id': '1cw5bv', 'Delivery Guy': 'James', 'Target': 0},
 {'order_id': 'vb=er', 'Delivery Guy': 'James', 'Target': 0},
 {'order_id': '1b5s', 'Delivery Guy': 'James', 'Target': 0},
 {'order_id': '1a2n', 'Delivery Guy': 'James', 'Target': 1}]


Here is the same data as a table:

| order_id | Delivery Guy | Target |
|----------|--------------|--------|
| 1bjhtj   | John         | 0      |
| 1aec     | John         | 0      |
| 1cgfd    | John         | 0      |
| 1bceg    | Tom          | 0      |
| 1a2fg    | Tom          | 0      |
| 1cbsf    | Tom          | 1      |
| 1bc5     | Jay          | 0      |
| 1a22     | Jay          | 0      |
| 1bzc5    | Jay          | 0      |
| 1av22    | Jay          | 0      |
| 1bsc5    | Jay          | 1      |
| 1a2t2    | Jay          | 0      |
| 1bc5b    | Jay          | 0      |
| 1a22a    | Mary         | 0      |
| 1c5bv    | Mary         | 0      |
| vb2er    | Mary         | 0      |
| 1bs5s    | Mary         | 0      |
| 1a22n    | Mary         | 0      |
| 122a     | James        | 1      |
| 1cw5bv   | James        | 0      |
| vb=er    | James        | 0      |
| 1b5s     | James        | 0      |
| 1a2n     | James        | 1      |

I want my machine learning model to learn each person's attributes and predict one of two cases: will deliver ("0") and will not deliver ("1").

I want to split my data into train and test sets in such a way that both sets keep a few rows for each name and a few rows of each Target class, so that the model learns all the patterns.

I have used this so far:

from sklearn.model_selection import train_test_split

X = df.drop(columns="Target")
y = df.Target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)

It does give me rows for each Delivery Guy, but it misses the part where 'James' is split in such a way that one "1" ends up in train and the other "1" ends up in test. Could anyone help me approach this problem in a different way?

DataSciRookie On

Here's an approach to ensure that:

  • Every "Delivery Guy" is represented in both the training and test sets.
  • Each "Target" class is adequately represented in both sets.

This can be achieved by manually splitting the dataset while considering both the "Delivery Guy" and the "Target" columns. Here's a step-by-step guide to doing this:
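As an aside on why the split is done manually: a natural first idea is to pass a combined name-and-class key to `stratify`. The sketch below, which rebuilds the question's rows without `order_id` and uses a hypothetical `combined_key` column, suggests that this is rejected for the sample data, because Tom and Jay each have only one row with Target 1.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample rows from the question (order_id omitted for brevity)
names = ['John'] * 3 + ['Tom'] * 3 + ['Jay'] * 7 + ['Mary'] * 5 + ['James'] * 5
targets = [0, 0, 0,  0, 0, 1,  0, 0, 0, 0, 1, 0, 0,  0, 0, 0, 0, 0,  1, 0, 0, 0, 1]
df = pd.DataFrame({'Delivery Guy': names, 'Target': targets})

# Hypothetical combined key such as "James_1" (not part of the original data)
combined_key = df['Delivery Guy'] + '_' + df['Target'].astype(str)

try:
    train, test = train_test_split(
        df, train_size=0.7, stratify=combined_key, random_state=42
    )
    stratify_failed = False
except ValueError as err:
    # Stratification needs at least two rows per stratum
    stratify_failed = True
    print('Stratified split rejected:', err)
```

Groups with a single row therefore need a special case, which is what the manual approach in the steps below provides.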

Step 1: Split Data by "Delivery Guy"

  • First, split your dataset by "Delivery Guy" so that the rows for each individual are grouped together:
for name, group in df.groupby('Delivery Guy'):
    print(name, len(group))  # one group per delivery person

Step 2: For Each Group, Further Split by "Target"

  • For each "Delivery Guy", split their rows by "Target" so that you can handle the representation of each class separately:
for target, target_group in group.groupby('Target'):  # `group` comes from the Step 1 loop
    print(name, target, len(target_group))

Step 3: Allocate Train/Test Data

  • For each subgroup (i.e., each "Delivery Guy" with a specific "Target"), allocate a portion of the data to the training set and the rest to the test set. Given the imbalanced nature of your dataset, you might want to ensure that at least one instance of each "Target" for each "Delivery Guy" ends up in both the training and test sets, if possible.
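The allocation rule for one subgroup could be sketched as follows. The helper name `split_subgroup` and its parameters `test_fraction` and `seed` are hypothetical, and the sketch assumes the DataFrame index is unique, since test rows are removed from the subgroup by index.

```python
import pandas as pd

def split_subgroup(subgroup: pd.DataFrame, test_fraction: float = 0.3, seed: int = 42):
    """Split one (Delivery Guy, Target) subgroup into train and test parts."""
    n = len(subgroup)
    if n < 2:
        # A single row cannot be on both sides; keep it for training
        return subgroup, subgroup.iloc[0:0]
    # Clamp the test size so that at least one row lands on each side
    n_test = min(max(1, round(n * test_fraction)), n - 1)
    test_part = subgroup.sample(n=n_test, random_state=seed)
    train_part = subgroup.drop(test_part.index)  # assumes a unique index
    return train_part, test_part
```

With this rule, the two "1" rows for James would be separated into one training row and one test row, while a person with a single "1" row keeps it in training only.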

Step 4: Combine Data Back

After allocating both "Target" classes for each "Delivery Guy" to both sets, combine these allocations back into your final training and test sets.

Here's how you could implement this in Python:

import pandas as pd
from sklearn.model_selection import train_test_split

# `data` is the list of dicts shown in the question
df = pd.DataFrame(data)

train_list = []
test_list = []

# Split each 'Delivery Guy' by 'Target' so that both are represented in both sets where possible
for name, group in df.groupby('Delivery Guy'):
    for target, target_group in group.groupby('Target'):
        if len(target_group) > 1:
            # Two or more rows: each side receives at least one of them
            target_train, target_test = train_test_split(target_group, test_size=0.5, random_state=42)
            train_list.append(target_train)
            test_list.append(target_test)
        else:
            # Only one row for this person and class: it cannot be in both sets,
            # so it goes to the training set (the test set will lack this combination)
            train_list.append(target_group)

# Concatenate all the pieces into the final DataFrames
train = pd.concat(train_list, ignore_index=True)
test = pd.concat(test_list, ignore_index=True)
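
To check that a split like the one above behaves as intended, one option is to count rows per person and class on each side. The sketch below is self-contained: it rebuilds the question's rows without `order_id`, repeats the same splitting logic in a single `groupby` over both columns, and then compares the two sets. The names `train_counts` and `test_counts` are only illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample rows from the question (order_id omitted for brevity)
names = ['John'] * 3 + ['Tom'] * 3 + ['Jay'] * 7 + ['Mary'] * 5 + ['James'] * 5
targets = [0, 0, 0,  0, 0, 1,  0, 0, 0, 0, 1, 0, 0,  0, 0, 0, 0, 0,  1, 0, 0, 0, 1]
df = pd.DataFrame({'Delivery Guy': names, 'Target': targets})

train_parts, test_parts = [], []
for (name, target), sub in df.groupby(['Delivery Guy', 'Target']):
    if len(sub) > 1:
        part_train, part_test = train_test_split(sub, test_size=0.5, random_state=42)
        train_parts.append(part_train)
        test_parts.append(part_test)
    else:
        # Single-row subgroups stay in the training set
        train_parts.append(sub)

train = pd.concat(train_parts, ignore_index=True)
test = pd.concat(test_parts, ignore_index=True)

# Rows per person and class on each side of the split
train_counts = pd.crosstab(train['Delivery Guy'], train['Target'])
test_counts = pd.crosstab(test['Delivery Guy'], test['Target'])
print(train_counts)
print(test_counts)

# Every person should appear in both sets
print(set(train['Delivery Guy']) == set(test['Delivery Guy']))
```

On this sample, James has one "1" row on each side, whereas the single "1" rows of Tom and Jay appear only in the training set. Note also that `test_size=0.5` gives a roughly even split rather than the 70/30 split used in the question.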