How to save cluster assignments and prevent them from being overwritten in subsequent iterations when calculating the bias?

I'm implementing an algorithm that calculates the bias per cluster and then splits the cluster with the highest bias into new clusters. Eventually, I want to find the cluster with the highest bias, which means that the classifier produced either more errors on these instances or substantially fewer.

This is the algorithm:

  1. Start with whole dataset as one cluster
  2. Split into two clusters with KMeans
  3. Calculate the macro F1-score for each of these clusters
  4. Calculate the bias for both of these clusters. The bias of cluster k is: the F1-score of cluster k minus the F1-score of all clusters excluding cluster k
  5. if Max(bias_cluster_i, bias_cluster_j) >= bias_previous_cluster: add the clusters cluster_i and cluster_j to the list and remove the previous cluster
  6. Proceed with the cluster from the cluster_list which has the highest standard deviation of the error metric.
  7. Split this cluster into 2 clusters with KMeans and proceed with step 3 (a rough sketch of this loop follows the list)
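
To make the steps concrete, here is a rough sketch of one way the loop could be organised. This is not the code from the question: it reuses predictions_col, scaled_matrix and mean_f_score from the code further down, and the make_cluster helper, the dict-per-cluster bookkeeping and the choice to give the undivided dataset a bias of minus infinity are my own assumptions. Every candidate cluster keeps its own row positions and scores, so nothing is overwritten between iterations.

import numpy as np
from sklearn.cluster import KMeans

def make_cluster(rows):
    # rows: positional indices into predictions_col / scaled_matrix
    sub = predictions_col.iloc[rows]
    return {"rows": rows,
            "f_score": mean_f_score(sub),
            "error_std": sub["errors"].std(ddof=0)}  # 0 instead of NaN for single-row clusters

def bias(cluster):
    rest = predictions_col.drop(predictions_col.index[cluster["rows"]])
    # the undivided dataset has nothing to compare against, so give it the lowest possible bias
    return -np.inf if len(rest) == 0 else cluster["f_score"] - mean_f_score(rest)

# Step 1: the whole (test) dataset is one cluster
cluster_list = [make_cluster(np.arange(len(predictions_col)))]

for _ in range(10):  # MAX_ITER
    # Step 6: continue with the cluster whose error metric varies the most
    parent_idx = max(range(len(cluster_list)), key=lambda idx: cluster_list[idx]["error_std"])
    parent = cluster_list[parent_idx]
    rows = parent["rows"]

    # Steps 2 and 7: bisect that cluster with KMeans
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled_matrix[rows])
    child_0 = make_cluster(rows[labels == 0])
    child_1 = make_cluster(rows[labels == 1])

    # Steps 3-5: replace the parent only if the split does not lower the bias
    if max(bias(child_0), bias(child_1)) >= bias(parent):
        cluster_list[parent_idx:parent_idx + 1] = [child_0, child_1]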

To make this algorithm work, I need to save the cluster assignments and the F-scores from the previous iterations to be able to compare them in the current iteration (step 5).

  • One of my solutions is to save the cluster assignments in a Pandas DataFrame as a new column and then compare this column with the new cluster assignments, but is there a better way of preventing these cluster assignments from being overwritten? (A minimal sketch of one alternative follows.)
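
For example, one lightweight alternative (just a sketch, with a hypothetical history dict; KMeans, scaled_matrix, predictions_col and mean_f_score come from the code below) is to freeze each iteration's assignments and F-scores in a per-iteration record, so step 5 can always look at the previous entry instead of a mutated column:

from sklearn.cluster import KMeans

history = {}
for i in range(10):  # MAX_ITER
    kmeans_algo = KMeans(n_clusters=2, n_init=10, random_state=i).fit(scaled_matrix)
    labels = kmeans_algo.labels_.copy()  # frozen copy, never overwritten later
    history[i] = {
        "assignments": labels,
        "f_scores": {k: mean_f_score(predictions_col[labels == k]) for k in (0, 1)},
    }
    if i > 0:
        previous = history[i - 1]  # step 5 compares the current biases against this entry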

This is my code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

data = load_wine()
df_data = pd.DataFrame(data.data, columns=data.feature_names)
df_target = pd.DataFrame(data = data.target)

# Merging the datasets into one dataframe
all_data = df_data.merge(df_target, left_index=True, right_index=True)
all_data.rename( columns={0 :'target_class'}, inplace=True )
all_data.head()

# Dividing X and y into train and test data (small train data to gain more errors)
X_train, X_test, y_train, y_test = train_test_split(df_data, df_target, test_size=0.60, random_state=2)

# Training a RandomForest Classifier 
model = RandomForestClassifier()
model.fit(X_train, y_train.values.ravel())

# Obtaining predictions
y_hat = model.predict(X_test)

# Converting y_hat from NumPy to a DataFrame, keeping the X_test index so later merges line up
predictions_col = pd.DataFrame(index=X_test.index)
predictions_col['predicted_class'] = y_hat
predictions_col['true_class'] = y_test.values.ravel()

# Calculating the errors with the absolute value 
predictions_col['errors'] = abs(predictions_col['predicted_class'] - predictions_col['true_class'])

# It doesn't matter whether the misclassification is between class 0 and 2 or between 0 and 1, it has the same error value. 
predictions_col['errors'] = predictions_col['errors'].replace(2.0, 1.0)

# Adding predictions to test data
df_out = pd.merge(X_test, predictions_col, left_index = True, right_index = True)

# Scaling the features (assuming the test-set features are what gets clustered, since the predictions belong to X_test)
scaled_matrix = StandardScaler().fit_transform(X_test)

# Per-class F-score computed from the true/predicted class columns of a (sub)frame
def F_score(results, class_number):
    true_pos = results.loc[(results["true_class"] == class_number) & (results["predicted_class"] == class_number)]
    true_neg = results.loc[(results["true_class"] != class_number) & (results["predicted_class"] != class_number)]  # not used below
    false_pos = results.loc[(results["true_class"] != class_number) & (results["predicted_class"] == class_number)]
    false_neg = results.loc[(results["true_class"] == class_number) & (results["predicted_class"] != class_number)]

    try:
        precision = len(true_pos) / (len(true_pos) + len(false_pos))
        recall = len(true_pos) / (len(true_pos) + len(false_neg))
        return 2 * ((precision * recall) / (precision + recall))
    except ZeroDivisionError:
        # no predicted or actual positives for this class, or precision = recall = 0
        return 0

# Calculating the macro average F-score over the classes that occur in the (sub)frame
def mean_f_score(results):
    classes = results['true_class'].unique()
    class_list = [F_score(results, class_number) for class_number in classes]
    return sum(class_list) / len(classes)
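
# Optional sanity check (assumes macro averaging over all three wine classes):
# scikit-learn's built-in macro F1 should agree with mean_f_score on the full test set.
from sklearn.metrics import f1_score
sklearn_macro_f1 = f1_score(predictions_col['true_class'], predictions_col['predicted_class'], average='macro')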

def calculate_bias(clustered_data, cluster_number):
    cluster_x = clustered_data.loc[clustered_data["assigned_cluster"] == cluster_number]
    remaining_clusters = clustered_data.loc[clustered_data["assigned_cluster"] != cluster_number]

    # Bias as used below (the negative of the definition in step 4):
    # F-score of all other clusters minus F-score of cluster k
    return mean_f_score(remaining_clusters) - mean_f_score(cluster_x)

MAX_ITER = 10
cluster_comparison = []
clus_model_kwargs = {'random_state': 0, 'n_init': 10}  # placeholder kwargs for KMeans

# Start with all instances (scaled_matrix) in one cluster and bisect on every iteration
for i in range(1, MAX_ITER):
    kmeans_algo = KMeans(n_clusters=2, **clus_model_kwargs).fit(scaled_matrix)

    # Adding the assigned cluster next to the true/predicted classes,
    # because calculate_bias/mean_f_score need those columns as well
    clustered_data = predictions_col.copy()
    clustered_data['assigned_cluster'] = kmeans_algo.predict(scaled_matrix)

    # Calculating bias per cluster
    negative_bias_0 = calculate_bias(clustered_data, 0)
    negative_bias_1 = calculate_bias(clustered_data, 1)

    # the code below doesn't work: bias_prev_iteration from the previous iteration is not saved anywhere yet
    if max(negative_bias_0, negative_bias_1) >= bias_prev_iteration: