I am trying to find outliers in several datasets from the UCI repository (thyroid, diabetes, and lymphography). I am currently working on the code for the Isolation Forest (iForest) algorithm, and I cannot get it to work. What am I doing wrong, and what can I do to fix it?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv(r'C:\Users\There\Desktop\thyroid-disease.csv')
label_encoder = LabelEncoder()
data['sex'] = label_encoder.fit_transform(data['sex'])
data['on_thyroxine'] = label_encoder.fit_transform(data['on_thyroxine'])
data['query_on_thyroxine'] = label_encoder.fit_transform(data['query_on_thyroxine'])
data.dropna(inplace=True)
data = data.astype(float)
selected_features = ['age', 'TSH']
X = data[selected_features]
clf = IsolationForest(contamination=0.1, random_state=42)
outliers = clf.fit_predict(X)
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], color='k', s=3., label='Data points')
plt.scatter(X.iloc[outliers == -1, 0], X.iloc[outliers == -1, 1], color='r', s=30., label='Outliers')
plt.legend(loc='best')
plt.title('Isolation Forest Outlier Detection')
plt.xlabel('Age')
plt.ylabel('TSH')
plt.show()
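
Without the traceback it is hard to say which line fails. One possible cause, and this is only an assumption about the CSV, is that missing values are stored as a placeholder string such as '?', in which case `dropna()` would not remove them and `data.astype(float)` would raise. A minimal sketch of a workaround follows. It uses a synthetic stand-in DataFrame (hypothetical, not the real file) and converts only the two selected columns with `pd.to_numeric(errors="coerce")` before dropping incomplete rows:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for the CSV: every value is a string, and '?' marks
# a missing entry. This layout is an assumption, not taken from the real file.
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 90, size=n).astype(str),
    "TSH": rng.gamma(shape=2.0, scale=1.5, size=n).round(2).astype(str),
    "sex": rng.choice(["F", "M"], size=n),
})
data.loc[[3, 50], "TSH"] = "?"      # simulated missing TSH values
data.loc[[10], "age"] = "?"         # simulated missing age value
data.loc[[0, 1], "TSH"] = "150.0"   # two deliberately extreme readings

selected_features = ["age", "TSH"]

# Convert only the columns the model uses; unparseable entries become NaN,
# and rows containing NaN are then dropped.
X = data[selected_features].apply(pd.to_numeric, errors="coerce").dropna()

clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(X)         # 1 = inlier, -1 = outlier

outlier_rows = X[labels == -1]
print(f"{len(X)} usable rows, {len(outlier_rows)} flagged as outliers")
```

Since only `age` and `TSH` go into the model, the other columns do not need to be label-encoded or cast to float for this step. The plotting code from the question should work unchanged on this `X` if `outliers` is replaced by `labels`.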