PCA plot on repeats grouped by condition


I'm trying to produce a PCA plot on expression data produced from 3 conditions with 3 repeats each. I've managed to get a plot but then am struggling to colour and group the conditions as I think I may have laid the data out wrong.

I've then run the following code but am stuck when I get to colouring each sample. I want to colour by condition (O, FO and F), each containing its 3 repeats, and then draw an ellipse around each of the 3 conditions. Any help would be appreciated.

The table:

structure(list(Gene_ID = c("gene-EHS42_RS00005", "gene-EHS42_RS00010", 
"gene-EHS42_RS00015", "gene-EHS42_RS00020", "gene-EHS42_RS00025", 
"gene-EHS42_RS00030", "gene-EHS42_RS00035", "gene-EHS42_RS00040", 
"gene-EHS42_RS00045", "gene-EHS42_RS00050", "gene-EHS42_RS00055", 
"gene-EHS42_RS00060", "gene-EHS42_RS00065", "gene-EHS42_RS00070", 
"gene-EHS42_RS00075", "gene-EHS42_RS00080"), O1 = c(757.784, 
896.264, 123.429, 985.022, 85.8583, 111.718, 10.7002, 152.577, 
17.7682, 1086.55, 2826.57, 109.637, 43.1502, 0, 3158.45, 2271.19
), O2 = c(723, 897.502, 157.31, 1075.96, 106.999, 118.593, 10.8549, 
137.093, 19.2265, 1142.01, 2841.09, 91.1191, 63.1088, 0, 2981.31, 
2136.32), O3 = c(724.17, 875.258, 133.573, 1155.09, 74.4442, 
107.826, 16.365, 164.105, 29.4387, 751.156, 2822.42, 93.7586, 
37.7846, 0, 2978.32, 2045.64), FO1 = c(688.876, 922.35, 135.935, 
1223.9, 119.83, 93.1258, 17.7483, 324.379, 77.5033, 862.804, 
2524.59, 95.5171, 53.9344, 0, 2455.88, 1462.5), FO2 = c(869.985, 
1185.33, 194.729, 882.644, 177.953, 135.183, 21.7251, 296.909, 
58.101, 1247, 2511.67, 114.952, 63.6875, 0, 1433.23, 904.294), 
    FO3 = c(840.392, 1195.88, 165.721, 937.342, 170.775, 145.854, 
    23.9473, 285.05, 44.2553, 1402.51, 2737.45, 100.696, 73.0917, 
    0, 1419.96, 1051.12), F1 = c(1718.91, 1729.51, 341.759, 1324.52, 
    86.4022, 264.029, 30.6917, 169.219, 37.1905, 1987.85, 1370.75, 
    97.2895, 69.3806, 0, 3641.66, 2916.67), F2 = c(1919.41, 1666.16, 
    323.399, 850.732, 67.4236, 271.421, 18.9667, 184.824, 18.0931, 
    1617.57, 1449.76, 86.3241, 48.5885, 0, 2524.14, 1730.51), 
    F3 = c(1951.07, 1850.52, 376.333, 1157.23, 41.8972, 277.754, 
    32.3741, 177.472, 34.1986, 1039.71, 874.081, 78.1316, 58.6108, 
    0, 3424.35, 2758.01)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -16L))

And then the code I ran:

str(PcA_Plot_Data)
head(PcA_Plot_Data)

expression.pca <- prcomp(PcA_Plot_Data[,c(2:10)],
                         center = TRUE,
                         scale. = TRUE)
summary(expression.pca)

library(ggfortify)
expression.pca.plot <- autoplot(expression.pca,
                                data = PcA_Plot_Data,
                                colour = '')

There is 1 answer

ATpoint

You're correct that it is convention that genes should be rows and samples should be columns. However, you're running the PCA on untransposed data, which treats each gene as an observation, whereas I assume you want each sample to be a single dot in the final plot. Here is a minimal version of what to do.

Note that I am not checking whether your data needs normalization or any transformation such as log, it just demonstrates how to do PCA based on such data. It's on you to check how to make them appropriate for such analysis:

# By convention, gene expression data is a numeric matrix/data.frame with genes as row names rather than as a column
# (`data` is the table from the question, called PcA_Plot_Data there)
data <- as.data.frame(data)
rownames(data) <- data$Gene_ID
data$Gene_ID <- NULL

# Run PCA on transposed data
pca <- prcomp(t(data))

# Parse group names
groups <- gsub("1|2|3", "", colnames(data))

# Scatter plot of the samples on PC1 vs PC2, coloured by group
library(ggplot2)
to_plot <- data.frame(pca$x, group=groups)

ggplot(data=to_plot, aes(x=PC1, y=PC2, color=group)) + geom_point(size=3)
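On the ellipse part of the question: as far as I know, ggplot2's `stat_ellipse()` needs at least four points per group, so with three repeats it only reports "Too few points to calculate an ellipse". A possible workaround, sketched below, is to shade the convex hull of each condition instead and to put the explained variance on the axis labels. To keep the snippet self-contained it uses only the first four genes of the table in the question; `expr`, `var_pct` and `hulls` are names I made up for illustration.

```r
library(ggplot2)

# First four genes of the question's table as a genes x samples matrix
expr <- matrix(
  c(757.784, 723,     724.17,  688.876, 869.985, 840.392, 1718.91, 1919.41, 1951.07,
    896.264, 897.502, 875.258, 922.35,  1185.33, 1195.88, 1729.51, 1666.16, 1850.52,
    123.429, 157.31,  133.573, 135.935, 194.729, 165.721, 341.759, 323.399, 376.333,
    985.022, 1075.96, 1155.09, 1223.9,  882.644, 937.342, 1324.52, 850.732, 1157.23),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("gene-EHS42_RS000", c("05", "10", "15", "20")),
    c("O1", "O2", "O3", "FO1", "FO2", "FO3", "F1", "F2", "F3")
  )
)

# PCA with samples as observations (rows after transposing)
pca <- prcomp(t(expr))

# Condition = sample name without the trailing repeat number
groups <- sub("[0-9]+$", "", colnames(expr))

# Percentage of variance explained by each component
var_pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

to_plot <- data.frame(pca$x, group = factor(groups, levels = c("O", "FO", "F")))

# Convex hull per condition: chull() returns the row indices of the hull vertices
hulls <- do.call(rbind, lapply(split(to_plot, to_plot$group),
                               function(d) d[chull(d$PC1, d$PC2), ]))

p <- ggplot(to_plot, aes(x = PC1, y = PC2, colour = group, fill = group)) +
  geom_polygon(data = hulls, alpha = 0.2, colour = NA) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 (", var_pct[1], "%)"),
       y = paste0("PC2 (", var_pct[2], "%)"))
print(p)
```

With three repeats each hull is just a triangle. If you really want ellipse shapes, `geom_mark_ellipse()` from the ggforce package may be worth a look, but I have not tested it on this data.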

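Regarding the transformation caveat above, here is a rough sketch of one common preprocessing route, on the assumption (which I have not verified) that your values are already normalized abundances for which a log2 with a pseudocount of 1 is reasonable. It also shows a pitfall in your table: one gene is 0 in every sample, and `prcomp()` with `scale. = TRUE` stops with an error on constant columns, so such genes have to be dropped first. The matrix holds four genes copied from the question; `expr`, `log_expr` and `keep` are illustrative names.

```r
# Four genes from the question's table, including the all-zero one
expr <- rbind(
  "gene-EHS42_RS00005" = c(757.784, 723, 724.17, 688.876, 869.985, 840.392,
                           1718.91, 1919.41, 1951.07),
  "gene-EHS42_RS00035" = c(10.7002, 10.8549, 16.365, 17.7483, 21.7251, 23.9473,
                           30.6917, 18.9667, 32.3741),
  "gene-EHS42_RS00070" = rep(0, 9),
  "gene-EHS42_RS00075" = c(3158.45, 2981.31, 2978.32, 2455.88, 1433.23, 1419.96,
                           3641.66, 2524.14, 3424.35)
)
colnames(expr) <- c("O1", "O2", "O3", "FO1", "FO2", "FO3", "F1", "F2", "F3")

# log2 with a pseudocount so that zeros stay finite (assumed to suit the data)
log_expr <- log2(expr + 1)

# Keep only genes that vary across samples; constant genes cannot be scaled
keep <- apply(log_expr, 1, var) > 0

# Centre and scale each gene, with samples as observations
pca_scaled <- prcomp(t(log_expr[keep, , drop = FALSE]),
                     center = TRUE, scale. = TRUE)
summary(pca_scaled)
```

Whether to scale at all is a judgment call: scaling gives every gene equal weight, which can let lowly expressed, noisy genes influence the components as much as highly expressed ones.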