Creating Balanced Training Sets with Sklearn's train_test_split and KBinsDiscretizer


I'm working on a machine learning task with an unbalanced dataset: three input features (3 x 189,000) and a single output (1 x 189,000). I want to balance the data with respect to the "pass energy" input feature, which is heavily skewed. My plan is to divide this feature into three bins (low, mid, and high energy) with KBinsDiscretizer, then use the bin labels for stratified sampling with train_test_split.

Here's what I've done so far:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Extract the "pass energy" feature as a column vector
p = np.array(data[1, :]).reshape((-1, 1))

# Discretize into three equal-width bins (low, mid, high energy)
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform',
                       subsample=None)
est.fit(p)
p_bins = est.transform(p)
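For context, this is roughly how the uneven bin counts arise; a minimal, self-contained sketch with synthetic data standing in for the skewed feature (the variable names and distribution are illustrative, not my real data):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for a skewed feature (exponential, so most mass is in one bin)
rng = np.random.default_rng(0)
p = rng.exponential(scale=1.0, size=1000).reshape(-1, 1)

# Three equal-width bins, ordinal labels 0/1/2
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
p_bins = est.fit_transform(p)

# Count samples per bin: equal-width bins on a skewed feature
# give very uneven counts, as in my real data
bins, counts = np.unique(p_bins, return_counts=True)
print(dict(zip(bins.astype(int), counts)))
```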

After binning, I get bin sizes of 161493, 25417, and 2985. I then add the bin labels as a fourth feature and apply train_test_split:

from sklearn.model_selection import train_test_split

Y = data[3, :]
# Append the bin labels as a fourth input feature
X = np.append(data[0:3, :], p_bins.T, axis=0)

# Split into train and test sets, stratifying on the bin labels
x, x_test, y, y_test = train_test_split(X.T, Y, test_size=0.25, random_state=42,
                                        shuffle=True,
                                        stratify=p_bins)

However, this approach doesn't balance the bins in the training set; stratification just preserves the original (imbalanced) distribution in each split. What I want instead is a training set in which each bin contributes an equal number of samples (e.g., 2221 each), with the remaining data going to the test and validation sets, including some holdout from the smallest bin.

Could someone suggest a method or modifications to achieve this balanced distribution in the training set, and a suitable distribution in the validation and test sets?
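For concreteness, here's a sketch of the kind of per-bin undersampling I have in mind, on synthetic data (the sizes, proportions, and the 75% training share of the smallest bin are illustrative assumptions, not a confirmed solution):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: 3 features, one target, and skewed bin labels
n = 10_000
X = rng.normal(size=(n, 3))
y = rng.normal(size=n)
bins = rng.choice([0, 1, 2], size=n, p=[0.85, 0.13, 0.02])

# Train on the same number of samples from every bin, sized so that
# 25% of the smallest bin is held out for test/validation
_, counts = np.unique(bins, return_counts=True)
n_train_per_bin = int(counts.min() * 0.75)

# Draw n_train_per_bin samples (without replacement) from each bin
train_idx = []
for b in (0, 1, 2):
    members = np.flatnonzero(bins == b)
    train_idx.append(rng.choice(members, size=n_train_per_bin, replace=False))
train_idx = np.concatenate(train_idx)

# Everything not in the balanced training set goes to test/validation
test_idx = np.setdiff1d(np.arange(n), train_idx)

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

This gives a perfectly balanced training set, but I'm unsure whether hand-rolled indexing like this is the idiomatic way, or whether sklearn (or something like imbalanced-learn) offers a cleaner route.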
