I don't understand why the centroids are all jammed into the lower-left corner of the plot, even though the graph shows three distinct cluster labels.
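import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

print(df.info())
print(df)

# One-hot encode State; SumOfTotalPrice passes through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['State'])
    ], remainder='passthrough'
)
kmeans = Pipeline([
    ('preprocessor', preprocessor),
    ('kmeans', KMeans(n_clusters=3, random_state=0, n_init="auto"))
]).fit(df)

labels = kmeans['kmeans'].labels_
print("Cluster Labels:", labels)
centroids = kmeans['kmeans'].cluster_centers_
print("Centroids:", centroids)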
plt.scatter(df['SumOfTotalPrice'], df['State'], c = labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='#050505')
plt.xlabel('SumOfTotalPrice')
plt.ylabel('State')
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   State            11 non-null     object
 1   SumOfTotalPrice  11 non-null     float64
dtypes: float64(1), object(1)
memory usage: 304.0+ bytes
None
   State  SumOfTotalPrice
0     AK     1.063432e+07
1     CA     4.172891e+07
2     IL     2.103149e+07
3     IN     2.270681e+08
4     KY     4.144238e+07
5     ME     2.057557e+07
6     MI     4.216375e+07
7     OH     7.970354e+08
8     PA     2.158148e+07
9     SD     1.025623e+07
10    TX     2.061534e+07
Cluster Labels: [0 0 0 2 0 0 0 1 0 0 0]
Centroids: [[1.11111111e-01 1.11111111e-01 1.11111111e-01 0.00000000e+00
1.11111111e-01 1.11111111e-01 1.11111111e-01 0.00000000e+00
1.11111111e-01 1.11111111e-01 1.11111111e-01 2.55588301e+07]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 7.97035399e+08]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 2.27068150e+08]]
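
One possible reading of the printed centroids, offered as a sketch rather than a definitive diagnosis: each centroid has 12 entries, and the last one matches the scale of SumOfTotalPrice, so the first 11 appear to be the one-hot State columns. If so, `centroids[:, 0]` and `centroids[:, 1]` are one-hot values between 0 and 1, not a price and a state. The snippet below rebuilds the centroids from the output above in plain Python (the names `states`, `one_hot_row`, `plotted_xy` and `price_x` are hypothetical helpers, and the alphabetical State order is an assumption inferred from where the zeros fall) and compares the two column choices.

```python
# Centroids rebuilt from the printed output: 11 one-hot State columns
# (assumed to be in alphabetical order) followed by SumOfTotalPrice.
states = ["AK", "CA", "IL", "IN", "KY", "ME", "MI", "OH", "PA", "SD", "TX"]


def one_hot_row(active, value):
    """Return the 11 State entries: `value` for active states, 0.0 elsewhere."""
    return [value if s in active else 0.0 for s in states]


# Cluster 0 holds every state except IN and OH, so each of its nine
# one-hot entries averages to 1/9 (printed as 1.11111111e-01).
cluster0_states = [s for s in states if s not in ("IN", "OH")]
centroids = [
    one_hot_row(cluster0_states, 1 / 9) + [2.55588301e+07],
    one_hot_row(["OH"], 1.0) + [7.97035399e+08],
    one_hot_row(["IN"], 1.0) + [2.27068150e+08],
]

# What the plot call used as (x, y): the first two one-hot columns.
plotted_xy = [(row[0], row[1]) for row in centroids]

# The price coordinate of each centroid sits in the last column.
price_x = [row[-1] for row in centroids]

print(plotted_xy)
print(price_x)
```

On an x-axis that runs up to roughly 8e8, points whose coordinates are at most 1/9 all land in the lower-left corner. Under the same assumption, using `centroids[:, -1]` for the x coordinate would place the stars at the cluster prices. The y coordinate has no single State for a cluster that spans several states, so how to position it is a separate choice.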
