I am conducting PCA on a dataset. I am attempting to add a line in my 3d graph which shows the first principal component. I have tried a few methods but have not been able to display the first principal component as a line in my 3d graph. Any help is greatly appreciated. My code is as follows:
import numpy as np
np.set_printoptions(suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
file_name = 'C:/Users/data'
input_data = pd.read_csv(file_name + '.csv', header=0, index_col=0)
A = input_data.A.values.astype(float)
B = input_data.B.values.astype(float)
C = input_data.C.values.astype(float)
D = input_data.D.values.astype(float)
E = input_data.E.values.astype(float)
F = input_data.F.values.astype(float)
X = np.column_stack((A, B, C, D, E, F))
ncompo = int(input("Number of components to study: "))
print("")
pca = PCA(n_components=ncompo)
pcafit = pca.fit(X)
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
perc = pcafit.explained_variance_ratio_
perc_x = range(1, len(perc)+1)
plt.plot(perc_x, perc)
plt.xlabel('Components')
plt.ylabel('Percentage of Variance Explained')
plt.show()
#3d Graph
plt.clf()
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
colormap = np.array(['green', 'blue', 'red', 'yellow'])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')
plt.title('PCA')
plt.show()
Some remarks to begin with:

- You compute the PCA twice: once by diagonalising the covariance matrix yourself, once with pca.fit. Either you use pca.fit, or you do it yourself, but you don't need to do both, unless you want to take pca.fit apart and see for yourself that it does exactly what you expect it to do (if that is what you wanted, fine; it is a good thing to do that kind of checking, I did it once also). Of course pca.fit has another advantage: once you have it, it also provides pca.transform to project points into the components space. But that, too, is simply a base change using the eigenvectors matrix (pca.components_, that is the change-of-basis matrix) and the eigenvalues (pca.explained_variance_).
- pca.fit is an "in place" method. It does not return a new PCA object; it just fits (and returns) the one you have. So there is no need to keep pcafit and use it.
- Please post a minimal reproducible example, as required on SO. We should be able to copy, paste and run it, to see exactly your problem, not to guess what kind of secret data you have. And it should be minimal, so it should contain example data generation (it doesn't matter if those data make no sense; sometimes that is even better, since it allows some testing. In the code below, I generate my own noisy data along an axis, which lets me verify that I am indeed able to "guess" what that axis was). Also, since your problem concerns only the 3d plot, there is no need to include the plotting of explained variance here; that part is not part of your question.

Now, to plot the principal component: you already did the hard part, twice, which is to compute it. It is the eigenvector associated with the highest eigenvalue. With the pca object there is no need to search for it, the components are already sorted, so it is simply pca.components_[0]. And since you want to plot in the space D, E, F, you simply need to draw the vector pca.components_[0][3:], with correct scaling. You can do that with plot, providing just 2 points (the first and the last).

Here is my version (which, by the way, also shows what a minimal reproducible example looks like).
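Something along these lines (a sketch: the vector vec, the number of points, the noise level, the random seed and the values of sc1 and sc2 are arbitrary choices for the illustration):

import numpy as np
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Data generation: 500 points collinear to a 6-d vector vec, plus gaussian noise
vec = np.array([1.0, -0.5, 2.0, 3.0, 1.5, -1.0])
t = rng.normal(size=(500, 1))
X = t * vec + rng.normal(scale=0.3, size=(500, 6))
A, B, C, D, E, F = X.T

# PCA computed twice: by hand (diagonalisation of the covariance matrix) and with sklearn
cov_mat = np.cov(X, rowvar=False)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
pca = PCA(n_components=6)
pca.fit(X)

print(eig_vals)                      # eigenvalues, in no particular order
print(pca.explained_variance_)       # same values, sorted from biggest to smallest
print(eig_vecs)                      # eigenvectors, one per column
print(pca.components_)               # same vectors (up to sign), one per row, sorted
print(pca.components_[0])            # first principal component...
print(vec / np.linalg.norm(vec))     # ...which should match the generating axis, up to the sign

# 3d plot in the D, E, F space
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F)
ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')

# First principal component drawn as a 2-point segment through the center of the cloud
sc1, sc2 = 10, -10                   # scaling factors, chosen to roughly match the spread of the data
u = pca.components_[0]
ax.plot([D.mean() + sc1 * u[3], D.mean() + sc2 * u[3]],
        [E.mean() + sc1 * u[4], E.mean() + sc2 * u[4]],
        [F.mean() + sc1 * u[5], F.mean() + sc2 * u[5]],
        color='red')
plt.show()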
My example is not that minimal, because I took advantage of it to illustrate my first remark, and also computed the PCA twice, to compare both results. So here I print the eigenvalues twice (the first set being your computation by diagonalisation of the covariance matrix, the second pca.explained_variance_). As you can see, they are the same, except that the first set is not sorted.

Likewise for the eigenvectors: also the same, but for sorting and transpose. Eigenvectors are presented column-wise when you diagonalise a matrix, whereas for pca.components_ each line is an eigenvector. But you can see that in the first matrix, the eigenvector associated with the biggest eigenvalue (since the biggest eigenvalue was the first one, that is the first column: -0.52, 0.52, etc.) is also the same as the first line of pca.components_. Likewise, the 4th biggest eigenvalue in your diagonalisation was the last one, and if you look at the last column of your eigenvectors (0.64, 0.72, -0.76, ...), it is the same as the 4th line of pca.components_ (with an irrelevant ×-1 factor).
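You can check that correspondence programmatically (a sketch, reusing eig_vals, eig_vecs and pca from the code above):

print(np.allclose(np.sort(eig_vals)[::-1], pca.explained_variance_))  # True: same eigenvalues, just sorted
order = np.argsort(eig_vals)[::-1]                 # indices of the eigenvalues, biggest first
for k, idx in enumerate(order):
    col = eig_vecs[:, idx]                         # eigenvector as a column of eig_vecs
    row = pca.components_[k]                       # same vector as the k-th row of components_
    print(np.allclose(col, row) or np.allclose(col, -row))  # True, up to an irrelevant ×-1 factor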
So, long story short: you already have the eigenvalues in pca.explained_variance_, sorted from the biggest to the smallest, and the eigenvectors in pca.components_, in the same order.

The last thing I print here is a comparison between the first component (pca.components_[0]) and the vector I used to generate the data in the first place (my data are all collinear to a vector vec, plus some gaussian noise). As expected, PCA did correctly find that main axis.
So, those were just side comments. What you were really looking for is the part that actually draws the first component in the D, E, F space.
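In the sketch above, that is this ax.plot call (a 2-point segment, centred on D.mean(), E.mean(), F.mean(), which is my choice for the illustration):

u = pca.components_[0]               # first principal component (6-d vector)
ax.plot([D.mean() + sc1 * u[3], D.mean() + sc2 * u[3]],
        [E.mean() + sc1 * u[4], E.mean() + sc2 * u[4]],
        [F.mean() + sc1 * u[5], F.mean() + sc2 * u[5]],
        color='red')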
sc1 and sc2 are just scaling factors (here I chose them so that the segment scales approximately like the data). Another way would have been to set ax.set_xlim, ax.set_ylim, ax.set_zlim from D.min(), D.max(), E.min(), E.max(), etc., and then just use big values for sc1 and sc2, like sc1=1000, sc2=-1000.
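For example (again just a sketch):

ax.set_xlim(D.min(), D.max())
ax.set_ylim(E.min(), E.max())
ax.set_zlim(F.min(), F.max())
sc1, sc2 = 1000, -1000               # oversized scaling factors, per the alternative described above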