Different PCA results in R and Python


I'm trying to get PCA done for my data, which is a dataframe with 16 observations in rows and 11 features in columns.

In R, prcomp returns a rotation matrix with the features in the rows and the principal components in the columns. In Python using sklearn, the layout is reversed: the rows of the fit_transform output are observations (in my case, administrative units), and the columns are again the principal components. While the eigenvalues and component loadings differ between R and Python, the cumulative sums of explained variance and the correlations of features with the principal components remain the same.

I'm struggling to understand why these differences occur and how to interpret the Python results correctly. Any insights or explanations would be greatly appreciated.

R:

# prcomp's argument is `scale.` (with the dot); `scale = TRUE` also works via partial matching
data_pca <- prcomp(data, scale. = TRUE)

R result

Python:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Fit PCA and project the observations onto the principal components
pca = PCA()
data_pca = pca.fit_transform(data_scaled)

Python result
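
For reference, the two result layouts line up roughly like this, a minimal sketch reusing the pca and data_scaled objects from the snippet above (sklearn stores the loadings transposed relative to prcomp's rotation matrix):

loadings = pca.components_.T           # features x PCs, like result$rotation in R
scores = pca.transform(data_scaled)    # observations x PCs, like result$x in R (equals data_pca above)
sdev = pca.explained_variance_ ** 0.5  # per-component standard deviations, cf. result$sdev in R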


Edit: These are the results after I transposed the data to end up with the same shape as in R. The results are odd and still differ from R.

      explained_variance  explained_variance_ratio  cumulative_sum
PC1         1.541840e+01              8.760452e-01        0.876045
PC2         6.401815e-01              3.637395e-02        0.912419
PC3         5.191492e-01              2.949711e-02        0.941916
PC4         4.163386e-01              2.365560e-02        0.965572
PC5         3.616688e-01              2.054936e-02        0.986121
PC6         9.329659e-02              5.300943e-03        0.991422
PC7         8.263950e-02              4.695426e-03        0.996118
PC8         4.770578e-02              2.710556e-03        0.998828
PC9         1.481567e-02              8.417995e-04        0.999670
PC10        5.808094e-03              3.300053e-04        1.000000
PC11        8.392454e-33              4.768440e-34        1.000000
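
Transposing the input is not the right fix: fitting PCA on the transposed 11 x 16 matrix treats the 11 features as observations, which is a different decomposition altogether (note the near-zero PC11; 11 centered rows have rank at most 10). To get a features-by-components matrix like prcomp's rotation, transpose the loadings (pca.components_.T, as in the sketch above), not the data. A short check of the shapes, assuming the 16 x 11 data from the question:

PCA().fit(data_scaled).components_.shape    # (11, 11): PCs x features, the intended model
PCA().fit(data_scaled.T).components_.shape  # (11, 16): PCs x "features", which are now the 16 units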

There is 1 answer

TCamara

Just for fun, and since this is a good use case for reticulate, I built a test case comparing the two PCA calculations. Here's the test script; you can adapt it to your specific use case:

library(reticulate)
# Use mtcars as test data
daten <- mtcars
# Results of PCA calculated in R
result_r <- prcomp(daten)
# Load the scikit-learn module via reticulate
pca <- import("sklearn.decomposition")
# PCA calculated with Python
result_py <- pca$PCA(n_components = as.integer(11))
result_py <- result_py$fit(daten)
explained_var_py <- result_py$explained_variance_ratio_
result_py <- result_py$transform(daten)
# Calculate the difference of the absolute values of the result matrices;
#   calculated with abs(), because the principal components
#   may point in opposite directions
abs(result_r$x) - abs(result_py)
# Calculate the maximum and minimum difference
max(abs(result_r$x) - abs(result_py))
min(abs(result_r$x) - abs(result_py))
# Difference of explained variance between the two calculations
explained_var_py - summary(result_r)$importance[2, ]

As you will see, there are differences, but on the order of 1e-13 to 1e-14, numerically justifiable and negligible. At the level of the explained variance, the differences are on the order of 1e-6, which is also very small.
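
Comparing with abs() checks the magnitudes; if you want the matrices to match sign for sign, you can fix a sign convention before comparing. A minimal Python sketch, where align_signs is a hypothetical helper (not part of sklearn) that flips each component so its largest-magnitude loading is positive:

import numpy as np

def align_signs(components):
    # components: shape (n_components, n_features), as in sklearn's pca.components_
    idx = np.abs(components).argmax(axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    signs[signs == 0] = 1  # an all-zero component keeps its sign
    return components * signs[:, None]

Applying the same signs to the corresponding score columns (scores * signs) makes the two score matrices directly comparable without abs().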


PS. Your transformations are not necessary, as you can see from this example; I did not have to use the StandardScaler.
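
One caveat on scaling, though: StandardScaler standardizes with the population standard deviation (ddof=0), while R's prcomp(scale. = TRUE) divides by the sample standard deviation (ddof=1). Since every column is rescaled by the same factor sqrt(n / (n - 1)), the variance ratios and feature correlations agree, but the eigenvalues differ by a factor of n / (n - 1), which may explain the pattern described in the question. A sketch of standardizing to match R exactly, assuming data is the 16 x 11 matrix from the question:

import numpy as np
from sklearn.decomposition import PCA

X = np.asarray(data, dtype=float)
# Divide by the sample standard deviation (ddof=1), as R's scale() does,
# rather than StandardScaler's population standard deviation (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA().fit(X_scaled)
pca.explained_variance_  # matches prcomp(data, scale. = TRUE)$sdev^2 up to floating point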