Parallel coordinates in ggparcoord not reporting all the observations

Question

Asked by elsich on 26 February 2024 at 18:39

I would like to create a parallel coordinates graph for a data set with four variables to report on the x axis (Std_Dim1 to Std_Dim4), using the variable cluster to separate the groups. This is an extract of my data:

dput(extract_cl[1:20, ])
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), cluster = structure(c(1L, 2L, 
2L, 2L, 2L, 1L, 3L, 1L, 2L, 2L, 3L, 2L, 1L, 4L, 2L, 1L, 5L, 1L, 
3L, 5L), levels = c("1", "2", "3", "4", "5", "6"), class = "factor"), 
    Std_Dim1 = c(-0.0728923703380469, -0.339155028394344, 0.340473420392532, 
    -0.339155028394344, -0.339155028394344, -0.0728923703380469, 
    -0.267300038214234, -0.0728923703380469, -0.339155028394344, 
    -0.339155028394344, -0.267300038214234, -0.339155028394344, 
    -0.0728923703380469, 8.24372395142489, -0.339155028394344, 
    -0.0728923703380469, -0.340751805953902, 1.23367086777029, 
    -0.144946957713102, 0.895154025143995), Std_Dim2 = c(0.380193300283392, 
    -0.21657689529506, -0.0605737618430261, -0.21657689529506, 
    -0.21657689529506, 0.380193300283392, 0.0358222042004695, 
    0.380193300283392, -0.21657689529506, -0.21657689529506, 
    0.0358222042004695, -0.21657689529506, 0.380193300283392, 
    2.24175307931177, -0.21657689529506, 0.380193300283392, 4.41531912524421, 
    0.549235501606044, 0.127794200787863, 1.07941331484527), 
    Std_Dim3 = c(-0.249302376419046, -0.301241388482618, -0.515922638345379, 
    -0.301241388482618, -0.301241388482618, -0.249302376419046, 
    -0.243613817954941, -0.249302376419046, -0.301241388482618, 
    -0.301241388482618, -0.243613817954941, -0.301241388482618, 
    -0.249302376419046, -2.55242656849512, -0.301241388482618, 
    -0.249302376419046, 3.97061868943169, -0.505287507303791, 
    -0.306929946946723, 0.594582905299551), Std_Dim4 = c(-1.78310815898572, 
    0.600501349348233, 0.970728782381305, 0.600501349348233, 
    0.600501349348233, -1.78310815898572, -0.595475416793659, 
    -1.78310815898572, 0.600501349348233, 0.600501349348233, 
    -0.595475416793659, 0.600501349348233, -1.78310815898572, 
    2.19760934092996, 0.600501349348233, -1.78310815898572, 0.0136383315437233, 
    -1.23858333677849, -0.586822354919764, 1.81841980809893)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

And this is the code I am using:

library(GGally)      # ggparcoord()
library(viridis)     # scale_color_viridis()
library(hrbrthemes)  # theme_ipsum()

ggparcoord(res.cluster,
           columns = 21:24, groupColumn = 2, order = "anyClass",
           showPoints = TRUE,
           title = "Location of cluster points in the MCA dimensions",
           alphaLines = 0.5
) +
  scale_color_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 10)
  ) +
  theme(text = element_text(family = "Calibri")) +
  xlab("")

I have two questions:

I think not all of my data (152 observations) are being displayed in the graph, since the number of dots in the graph I attach is significantly smaller than 152. Comparing it with the same plot for the iris data set (150 observations) makes this very clear (I attach it too). Is it possible that some observations are skipped? Why would that be the case?

I got the following warning: In summary.lm(lm(x ~ as.factor(classVar == class.names[i]))) : essentially perfect fit: summary may be unreliable

But I am not sure what it means, since the variables Std_Dim1 to Std_Dim4 are standardized values of the dimensions of a Multiple correspondence analysis.

In my graph Std_Dim4 appears in first place and I would like to have these variables ordered starting from Std_Dim1 and ending at Std_Dim4, but I could not figure out how to do it.

Any help will be very much appreciated!

**Allan Cameron** · Answer 1 · 2024-02-26T21:17:01+00:00

For a straightforward use case like yours, the function ggparcoord is really just pivoting and plotting your data. I find it easier to reason about by just doing that directly:

library(tidyverse)

res.cluster %>%
  pivot_longer(-(1:2)) %>%
  ggplot(aes(name, value, group = ID, color = cluster)) +
  geom_point(size = 3) +
  geom_line(linewidth = 1, alpha = 0.5) +
  scale_color_viridis_d() +
  hrbrthemes::theme_ipsum(base_size = 16) +
  labs(x = NULL, y = NULL)

Our dimensions are automatically plotted in the expected order, because ggplot2 arranges the character values of name alphabetically along the discrete x axis.
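As for why ggparcoord put Std_Dim4 first and raised the summary.lm warning: as far as I understand the GGally documentation, order = "anyClass" reorders the axes by an F-statistic computed for each class against the rest, which would also explain the lm calls named in the warning. A minimal sketch, assuming that reading is right: leaving order out lets the axes follow columns. The toy data frame and its values are made up for illustration and merely stand in for res.cluster.

```r
library(GGally)

# Toy data standing in for the question's res.cluster (values are invented)
set.seed(1)
toy <- data.frame(
  ID       = 1:12,
  cluster  = factor(rep(1:3, each = 4)),
  Std_Dim1 = rnorm(12),
  Std_Dim2 = rnorm(12),
  Std_Dim3 = rnorm(12),
  Std_Dim4 = rnorm(12)
)

# No `order` argument, so the axes are drawn in the order given by `columns`
p <- ggparcoord(toy,
                columns = 3:6, groupColumn = 2,
                showPoints = TRUE, alphaLines = 0.5)
p
```

With your own data, the same call with columns = 21:24 and without order = "anyClass" should keep Std_Dim1 to Std_Dim4 in column order.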

Note that despite having 20 rows of data, we only seem to have 8 points on each parallel axis. The reason for this is simply that many of your rows have the same values, so the points are drawn over each other.

We can see this if we add some jitter to the points:

res.cluster %>%
  pivot_longer(-(1:2)) %>%
  ggplot(aes(name, value, group = ID, color = cluster)) +
  geom_point(size = 3, position = position_jitter(0.2, 0, seed = 1)) +
  geom_line(linewidth = 1, alpha = 0.5, 
            position = position_jitter(0.2, 0, seed = 1)) +
  scale_color_viridis_d() +
  hrbrthemes::theme_ipsum(base_size = 16) +
  labs(x = NULL, y = NULL)
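The overlap can also be checked numerically rather than visually. A rough sketch, using a small made-up tibble in place of res.cluster (the column names mirror the question, the values are invented): counting identical combinations of everything except ID shows how many rows are drawn on top of each other.

```r
library(dplyr)

# Hypothetical stand-in for res.cluster with deliberately repeated rows
toy <- tibble(
  ID       = 1:6,
  cluster  = factor(c(1, 2, 2, 1, 2, 3)),
  Std_Dim1 = c(-0.07, -0.34, -0.34, -0.07, -0.34, 0.34),
  Std_Dim2 = c(0.38, -0.22, -0.22, 0.38, -0.22, -0.06)
)

# One output row per distinct profile; n_rows is how many IDs share it
dup_counts <- toy %>%
  count(across(-ID), name = "n_rows", sort = TRUE)

dup_counts
```

If the number of rows in dup_counts is much smaller than the number of rows in the data, the plot is not skipping observations; many of them simply coincide.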
