How to build Principal Component Analysis (PCA) correlation circle from PCA object made with the package recipes?

Question

How to build Principal Component Analysis (PCA) correlation circle from PCA object made with the package recipes?

174 views Asked by Paul At 31 October 2023 at 16:54

Context: Tidymodels framework make use of a common grammar to build model. However, the FactoMineR and factoextra packages are not part of it. Instead, the package recipes is used.

Problem: in order to standardize my data analysis workflow, I try to use recipes and the tidymodels grammar to make an PCA. However, I cannot figure out how to get comparable outputs from the FactoMineR and the recipes approaches.

Question: How to build the correlation circle and other graphical representations, similar to FactoMineR::PCA() from an object made with recipes::step_pca()?

What I tried so far: I compared the PCA output form packages recipes, FactoMineR and stats, trying to identify "equivalent" information. However I do not understand how to make graphical representations equivalent to those made with FactoMineR, but using recipes + ggplot.

Here is some code to work with:

Using recipes:

# Using recipes ----
library(recipes)
rec_iris <- recipe(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_predictors())
prep_rec <- prep(rec_iris)
bake(prep_rec, new_data = NULL)
#> # A tibble: 150 × 4
#>      PC1     PC2     PC3      PC4
#>    <dbl>   <dbl>   <dbl>    <dbl>
#>  1 -2.26 -0.478   0.127   0.0241 
#>  2 -2.07  0.672   0.234   0.103  
#>  3 -2.36  0.341  -0.0441  0.0283 
#>  4 -2.29  0.595  -0.0910 -0.0657 
#>  5 -2.38 -0.645  -0.0157 -0.0358 
#>  6 -2.07 -1.48   -0.0269  0.00659
#>  7 -2.44 -0.0475 -0.334  -0.0367 
#>  8 -2.23 -0.222   0.0884 -0.0245 
#>  9 -2.33  1.11   -0.145  -0.0268 
#> 10 -2.18  0.467   0.253  -0.0398 
#> # ℹ 140 more rows
# I can get useful information using:
prep_rec %>% tidy(2, type = "variance")
#> # A tibble: 16 × 4
#>    terms                          value component id       
#>    <chr>                          <dbl>     <int> <chr>    
#>  1 variance                      2.92           1 pca_OMBbk
#>  2 variance                      0.914          2 pca_OMBbk
#>  3 variance                      0.147          3 pca_OMBbk
#>  4 variance                      0.0207         4 pca_OMBbk
#>  5 cumulative variance           2.92           1 pca_OMBbk
#>  6 cumulative variance           3.83           2 pca_OMBbk
#>  7 cumulative variance           3.98           3 pca_OMBbk
#>  8 cumulative variance           4              4 pca_OMBbk
#>  9 percent variance             73.0            1 pca_OMBbk
#> 10 percent variance             22.9            2 pca_OMBbk
#> 11 percent variance              3.67           3 pca_OMBbk
#> 12 percent variance              0.518          4 pca_OMBbk
#> 13 cumulative percent variance  73.0            1 pca_OMBbk
#> 14 cumulative percent variance  95.8            2 pca_OMBbk
#> 15 cumulative percent variance  99.5            3 pca_OMBbk
#> 16 cumulative percent variance 100              4 pca_OMBbk
prep_rec %>% tidy(2, type = "coef")
#> # A tibble: 16 × 4
#>    terms          value component id       
#>    <chr>          <dbl> <chr>     <chr>    
#>  1 Sepal.Length  0.521  PC1       pca_OMBbk
#>  2 Sepal.Width  -0.269  PC1       pca_OMBbk
#>  3 Petal.Length  0.580  PC1       pca_OMBbk
#>  4 Petal.Width   0.565  PC1       pca_OMBbk
#>  5 Sepal.Length -0.377  PC2       pca_OMBbk
#>  6 Sepal.Width  -0.923  PC2       pca_OMBbk
#>  7 Petal.Length -0.0245 PC2       pca_OMBbk
#>  8 Petal.Width  -0.0669 PC2       pca_OMBbk
#>  9 Sepal.Length  0.720  PC3       pca_OMBbk
#> 10 Sepal.Width  -0.244  PC3       pca_OMBbk
#> 11 Petal.Length -0.142  PC3       pca_OMBbk
#> 12 Petal.Width  -0.634  PC3       pca_OMBbk
#> 13 Sepal.Length  0.261  PC4       pca_OMBbk
#> 14 Sepal.Width  -0.124  PC4       pca_OMBbk
#> 15 Petal.Length -0.801  PC4       pca_OMBbk
#> 16 Petal.Width   0.524  PC4       pca_OMBbk

Using FactoMineR:

# Using FactoMineR ----
library(FactoMineR)
pca_facto <- FactoMineR::PCA(iris[, 1:4])

Using stats


# Using stats ----
pca_stats <- stats::prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, scale = TRUE, center = TRUE)

Now lets try to compare the content of the objects we just made:

# Equivalent objects:
prep_rec %>% tidy(2, type = "variance"); pca_facto$eig; summary(pca_stats)
#> # A tibble: 16 × 4
#>    terms                          value component id       
#>    <chr>                          <dbl>     <int> <chr>    
#>  1 variance                      2.92           1 pca_OMBbk
#>  2 variance                      0.914          2 pca_OMBbk
#>  3 variance                      0.147          3 pca_OMBbk
#>  4 variance                      0.0207         4 pca_OMBbk
#>  5 cumulative variance           2.92           1 pca_OMBbk
#>  6 cumulative variance           3.83           2 pca_OMBbk
#>  7 cumulative variance           3.98           3 pca_OMBbk
#>  8 cumulative variance           4              4 pca_OMBbk
#>  9 percent variance             73.0            1 pca_OMBbk
#> 10 percent variance             22.9            2 pca_OMBbk
#> 11 percent variance              3.67           3 pca_OMBbk
#> 12 percent variance              0.518          4 pca_OMBbk
#> 13 cumulative percent variance  73.0            1 pca_OMBbk
#> 14 cumulative percent variance  95.8            2 pca_OMBbk
#> 15 cumulative percent variance  99.5            3 pca_OMBbk
#> 16 cumulative percent variance 100              4 pca_OMBbk
#>        eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 2.91849782             72.9624454                          72.96245
#> comp 2 0.91403047             22.8507618                          95.81321
#> comp 3 0.14675688              3.6689219                          99.48213
#> comp 4 0.02071484              0.5178709                         100.00000
#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     1.7084 0.9560 0.38309 0.14393
#> Proportion of Variance 0.7296 0.2285 0.03669 0.00518
#> Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

Now, why are these different?

# Why are these different?
prep_rec %>% tidy(2, type = "coef"); pca_stats$rotation # --> different form $var from an abject made with FactoMineR::PCA()
#> # A tibble: 16 × 4
#>    terms          value component id       
#>    <chr>          <dbl> <chr>     <chr>    
#>  1 Sepal.Length  0.521  PC1       pca_OMBbk
#>  2 Sepal.Width  -0.269  PC1       pca_OMBbk
#>  3 Petal.Length  0.580  PC1       pca_OMBbk
#>  4 Petal.Width   0.565  PC1       pca_OMBbk
#>  5 Sepal.Length -0.377  PC2       pca_OMBbk
#>  6 Sepal.Width  -0.923  PC2       pca_OMBbk
#>  7 Petal.Length -0.0245 PC2       pca_OMBbk
#>  8 Petal.Width  -0.0669 PC2       pca_OMBbk
#>  9 Sepal.Length  0.720  PC3       pca_OMBbk
#> 10 Sepal.Width  -0.244  PC3       pca_OMBbk
#> 11 Petal.Length -0.142  PC3       pca_OMBbk
#> 12 Petal.Width  -0.634  PC3       pca_OMBbk
#> 13 Sepal.Length  0.261  PC4       pca_OMBbk
#> 14 Sepal.Width  -0.124  PC4       pca_OMBbk
#> 15 Petal.Length -0.801  PC4       pca_OMBbk
#> 16 Petal.Width   0.524  PC4       pca_OMBbk
#>                     PC1         PC2        PC3        PC4
#> Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
#> Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
#> Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
#> Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
pca_facto$var
#> $coord
#>                   Dim.1      Dim.2       Dim.3       Dim.4
#> Sepal.Length  0.8901688 0.36082989 -0.27565767 -0.03760602
#> Sepal.Width  -0.4601427 0.88271627  0.09361987  0.01777631
#> Petal.Length  0.9915552 0.02341519  0.05444699  0.11534978
#> Petal.Width   0.9649790 0.06399985  0.24298265 -0.07535950
#> 
#> $cor
#>                   Dim.1      Dim.2       Dim.3       Dim.4
#> Sepal.Length  0.8901688 0.36082989 -0.27565767 -0.03760602
#> Sepal.Width  -0.4601427 0.88271627  0.09361987  0.01777631
#> Petal.Length  0.9915552 0.02341519  0.05444699  0.11534978
#> Petal.Width   0.9649790 0.06399985  0.24298265 -0.07535950
#> 
#> $cos2
#>                  Dim.1       Dim.2       Dim.3        Dim.4
#> Sepal.Length 0.7924004 0.130198208 0.075987149 0.0014142127
#> Sepal.Width  0.2117313 0.779188012 0.008764681 0.0003159971
#> Petal.Length 0.9831817 0.000548271 0.002964475 0.0133055723
#> Petal.Width  0.9311844 0.004095980 0.059040571 0.0056790544
#> 
#> $contrib
#>                  Dim.1       Dim.2     Dim.3     Dim.4
#> Sepal.Length 27.150969 14.24440565 51.777574  6.827052
#> Sepal.Width   7.254804 85.24748749  5.972245  1.525463
#> Petal.Length 33.687936  0.05998389  2.019990 64.232089
#> Petal.Width  31.906291  0.44812296 40.230191 27.415396

^{Created on 2023-10-31 with reprex v2.0.2}

Original Q&A

There are 1 answers

**Seth** · Accepted Answer · 2023-10-31T17:48:12+00:00

The first plot can be made by mapping the first two principal components to the X and Y axes respectively. To match the example you shared I label the points using ggrepel.

library(tidyverse)
library(recipes)
library(ggrepel)
library(ggforce)

iris_pca <- recipe(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_predictors(), res = TRUE) %>%
  prep() 
  
iris_pca %>%
  bake(new_data = NULL) %>%
  mutate(id = row_number()) %>%
  ggplot(aes(PC1, PC2)) +
  geom_point() +
  geom_text_repel(aes(label = id)) +
  geom_hline(yintercept = 0, lty = 2) +
  geom_vline(xintercept = 0, lty = 2) +
  coord_equal() +
  theme_bw()

Getting the values for the circle plot requires getting the loadings from the prepped object. This is what matrix = 'v' is doing in the following.


tidy(iris_pca, 2, matrix = 'v') %>%
  pivot_wider(names_from = component, values_from = value) %>%
  ggplot() +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2)) +
  geom_label_repel(aes(x = PC1, y = PC2, label = terms)) +
  geom_circle(aes(x0 = 0, y0 = 0, r = 1)) +
  coord_equal() +
  theme_bw()

TechQA.

How to build Principal Component Analysis (PCA) correlation circle from PCA object made with the package recipes?

There are 1 answers

Related Questions in R

Related Questions in PCA

Related Questions in R-RECIPES

Related Questions in FACTOMINER

Popular Questions

Trending Questions