Principal Component Analysis: add a line to the 3D graph showing the first principal component


I am running PCA on a dataset and am trying to add a line to my 3D graph showing the first principal component. I have tried a few methods but have not been able to display it as a line in the plot. Any help is greatly appreciated. My code is as follows:

import numpy as np
np.set_printoptions(suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

file_name = 'C:/Users/data'
input_data = pd.read_csv(file_name + '.csv', header=0, index_col=0)
A = input_data.A.values.astype(float)
B = input_data.B.values.astype(float)
C = input_data.C.values.astype(float)
D = input_data.D.values.astype(float)
E = input_data.E.values.astype(float)
F = input_data.F.values.astype(float)
X = np.column_stack((A, B, C, D, E, F))

ncompo = int(input("Number of components to study: "))
print("")
pca = PCA(n_components=ncompo)
pcafit = pca.fit(X)
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

perc = pcafit.explained_variance_ratio_
perc_x = range(1, len(perc)+1)
plt.plot(perc_x, perc)
plt.xlabel('Components')
plt.ylabel('Percentage of Variance Explained')
plt.show()

#3d Graph
plt.clf()
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)
colormap = np.array(['green', 'blue', 'red', 'yellow'])

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])

ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')

plt.title('PCA')
plt.show()

1 Answer

Answer by chrslg

Some remarks to begin with:

  • You are computing PCA twice! Computing PCA means computing the eigenvalues and eigenvectors of the covariance matrix. So either you use the sklearn method pca.fit, or you do it yourself, but you don't need to do both, unless you want to discover pca.fit and check for yourself that it does exactly what you expect (if that is what you wanted, fine; it is a good kind of check to do, and I have done it once myself). Of course pca.fit has another advantage: once the model is fitted, pca.transform projects points into the component space. But that too is simply a change of basis using the eigenvector matrix (see the sketch right after this list).
  • The pca object gives you the eigenvectors (pca.components_) and the eigenvalues (pca.explained_variance_).
  • pca.fit works in place: it does not return a new PCA object, it just fits the one you have (and returns that same object). So there is no need to keep pcafit around and use it.
  • This is not a minimal reproducible example, as required on SO. We should be able to copy, paste and run it to see exactly what your problem is, not guess what kind of secret data you have. And it should be minimal, so it should contain example data generation (it doesn't matter if those data make no sense; sometimes that is even better, since it allows some testing. In the code below, I generate my own noisy data along a known axis, which lets me verify that I am indeed able to "guess" what that axis was). Also, since your problem concerns only the 3D plot, there is no need to include the plotting of the explained variance; that part is not part of your question.
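
As a small illustration of the first remark, here is a minimal sketch (with random dummy data; the names Xdemo and pca_demo are only used here, to avoid clashing with the code below) showing that the projection into component space is just centering followed by a change of basis with the eigenvector matrix, and that the explained variances are the eigenvalues of the covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Xdemo = rng.normal(size=(200, 6))        # 200 dummy samples, 6 features

pca_demo = PCA(n_components=6).fit(Xdemo)

# pca_demo.transform done by hand: center the data, then multiply by the
# transposed eigenvector matrix (one eigenvector per row of components_)
proj = (Xdemo - Xdemo.mean(axis=0)) @ pca_demo.components_.T
print(np.allclose(proj, pca_demo.transform(Xdemo)))     # True

# The explained variances are the eigenvalues of the covariance matrix
eigvals = np.linalg.eigvalsh(np.cov(Xdemo, rowvar=False))
print(np.allclose(np.sort(eigvals)[::-1], pca_demo.explained_variance_))  # True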

Now, to plot the principal component: well, you already did the hard part (twice), which is to compute it. It is the eigenvector associated with the highest eigenvalue. With the pca object there is no need to search for it, since the components are already sorted: it is simply pca.components_[0]. And since you want to plot in the space D, E, F, you simply need to draw the vector pca.components_[0][3:], with correct scaling.

You can do that with plot, providing just two points (the two endpoints of the line).

Here is my version (which, by the way, also shows what a minimal reproducible example looks like):

import numpy as np
np.set_printoptions(suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# Generation of random data along a given vector
vec=np.array([1, -1, 0.5, -0.5, 0.75, 0.75]).reshape(-1,1)
# 10000 random data points: U[0,10]×vec + Gaussian noise with std=1
X=(vec*np.random.rand(10000)*10 + np.random.normal(0,1,(6,10000))).T
(A,B,C,D,E,F)=X.T
input_data = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E, 'F':F, 'Grade':np.random.randint(1,5, (10000,))})

ncompo=6
pca = PCA(n_components=ncompo)
pca.fit(X)
# Redundant
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# Compare both computations: eigenvalues and eigenvectors
print("Eigen values")
print(eig_vals)
print(pca.explained_variance_)
print("Eigen vec")
print(eig_vecs)
print(pca.components_)
# Note: compare the first component to the normalized generating vector
print("Main component")
print(vec/np.linalg.norm(vec))
print(pca.components_[0])

#3d Graph
le = LabelEncoder()
le.fit(input_data.Grade)
number = le.transform(input_data.Grade)

fig = plt.figure()
colormap = np.array(['green', 'blue', 'red', 'yellow'])
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
U=pca.components_[0]
sc1=max(D)/U[3]
sc2=min(D)/U[3]

# Draw the 1st principal component as a blue line
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)

ax.set_xlabel('D')
ax.set_ylabel('E')
ax.set_zlabel('F')

plt.title('PCA')
plt.show()

My example is not that minimal, because I took advantage of it to illustrate my first remark, and I also computed PCA twice to compare both results. So here I print the eigenvalues:

Eigen values
[30.88941  1.01334  0.99512  0.96493  0.97692  0.98101]
[30.88941  1.01334  0.99512  0.98101  0.97692  0.96493]

(the first line being your computation by diagonalisation of the covariance matrix, the second being pca.explained_variance_). As you can see, they are the same, except that the first line is not sorted.
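
If you want the manually computed values in the same order, you can sort them yourself. A short sketch, reusing eig_vals and eig_vecs from the code above:

# Sort the manual eigendecomposition from biggest to smallest eigenvalue,
# so that it lines up with pca.explained_variance_ and pca.components_
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]   # eigenvectors are columns, so reorder the columns
print(eig_vals_sorted)                 # now in the same order as pca.explained_variance_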

Likewise,

Eigen vec
[[-0.52251 -0.27292  0.40863 -0.06321  0.26699  0.6405 ]
 [ 0.52521  0.07577 -0.34211  0.27583 -0.04161  0.72357]
 [-0.26266 -0.41332 -0.60091  0.38027  0.47573 -0.16779]
 [ 0.26354 -0.52548  0.47284  0.59159 -0.24029 -0.15204]
 [-0.39493  0.63946  0.07496  0.64966 -0.08619  0.00252]
 [-0.3959  -0.25276 -0.35452 -0.0572  -0.79718  0.12217]]
[[ 0.52251 -0.52521  0.26266 -0.26354  0.39493  0.3959 ]
 [-0.27292  0.07577 -0.41332 -0.52548  0.63946 -0.25276]
 [-0.40863  0.34211  0.60091 -0.47284 -0.07496  0.35452]
 [-0.6405  -0.72357  0.16779  0.15204 -0.00252 -0.12217]
 [-0.26699  0.04161 -0.47573  0.24029  0.08619  0.79718]
 [-0.06321  0.27583  0.38027  0.59159  0.64966 -0.0572 ]]

Also the same, up to sorting and a transpose. Eigenvectors are presented column-wise when you diagonalize a matrix, whereas for pca.components_ each row is an eigenvector. But you can see that, in the first matrix, the eigenvector associated with the biggest eigenvalue (since the biggest eigenvalue was the first one, that is the first column: -0.52, 0.52, etc.) is the same as the first row of pca.components_, up to an irrelevant sign flip.

Likewise, the 4th biggest eigenvalue in your diagonalisation was the last one. And if you look at the last column of your eigenvectors (0.64, 0.72, -0.17, ...), it is the same as the 4th row of pca.components_ (with an irrelevant ×-1 factor).

So, long story short, you already have the eigenvalues in pca.explained_variance_, sorted from biggest to smallest, and the eigenvectors in pca.components_, in the same order.
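
If you prefer to verify that programmatically rather than by eye, here is a small sketch, reusing pca from the code above and eig_vals_sorted / eig_vecs_sorted from the previous sketch:

# Each column of the sorted eigenvector matrix should match the corresponding
# row of pca.components_, up to an irrelevant factor of -1
for i in range(len(eig_vals_sorted)):
    v = eig_vecs_sorted[:, i]
    w = pca.components_[i]
    print(i, np.allclose(v, w) or np.allclose(v, -w))   # expect True for every i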

The last thing I print here is a comparison between the first component (pca.components_[0]) and the vector I used to generate the data in the first place (my data are all collinear to a vector vec, plus Gaussian noise).

Main component
[[ 0.52523]
 [-0.52523]
 [ 0.26261]
 [-0.26261]
 [ 0.39392]
 [ 0.39392]]
[ 0.52251 -0.52521  0.26266 -0.26354  0.39493  0.3959 ]

As expected, PCA correctly found that main axis.

So, those were just side comments.

What you were really looking for is

ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)

(Resulting figure: the 3D scatter of D, E, F with the first principal component drawn as a blue line.)

sc1 and sc2 are just scaling factors (here I chose them so that the line scales approximately like the data). Another way would have been to set ax.set_xlim, ax.set_ylim, ax.set_zlim from D.min(), D.max(), E.min(), E.max(), etc., and then just use big values for sc1 and sc2, like sc1=1000 and sc2=-1000.
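
A sketch of that second option, reusing ax, D, E, F and U from the code above:

# Alternative suggested above: fix the axis limits to the data range,
# then use large values for the two scaling factors
ax.set_xlim(D.min(), D.max())
ax.set_ylim(E.min(), E.max())
ax.set_zlim(F.min(), F.max())
sc1, sc2 = 1000, -1000
ax.plot([sc1*U[3], sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)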