I have a RNAseq dataset that I want to perform differential gene-expression analysis on. The dataset consists of 3 groups = macrophages deriving from adults (n=6), term-born infants (n=5), and preterm infants (n=3). Each sample has been treated with an immune-stimulus, or left untreated (paired) samples.
     Group  Treatment Sample_Nr Sample_within_group
1    adult    control         1                   1
2    adult    control         2                   2
3    adult stimulated         2                   2
4    adult    control         3                   3
5    adult    control         4                   4
6    adult stimulated         4                   4
7    adult    control         5                   5
8    adult stimulated         5                   5
9     term    control         7                   1
10    term stimulated         7                   1
11    term    control         8                   2
12    term stimulated         8                   2
13    term    control         9                   3
14    term stimulated         9                   3
15    term    control        10                   4
16    term stimulated        10                   4
17    term    control        11                   5
18    term stimulated        11                   5
19 preterm    control        12                   1
20 preterm stimulated        12                   1
21 preterm    control        13                   2
22 preterm stimulated        13                   2
23 preterm    control        14                   3
24 preterm stimulated        14                   3
25   adult stimulated         1                   1
26   adult stimulated         3                   3
27   adult    control         6                   6
28   adult stimulated         6                   6
So far, I have taken a look at both LIMMA and DESeq2 workflows to deal with this rather complex experimental setup (Vignette + Blogposts). In essence, I am interested in analysing both, differences caused by the treatment for each group, as well as, differences between treated cells and untreated cells between groups. Interestingly, sticking to the respective vignettes and blogposts, I have received rather disparate differential expression results.
As for LIMMA, I have used the following code, in alignment with (a) the vignette, Section 9.7 and (b) several blogposts (Nr.1, Nr.2 highlighting the concomitant usage of the "duplicateCorrelation" function and "voom".
#SummarizedExp to DGEList
x <- SE2DGEList(se)
#Introducing the "new" variable
x[["samples"]] <- x[["samples"]] %>% mutate(new_col = paste0(Group, Treatment, SEP = ""))
*#Filter dataset *
keep.exprs <- edgeR::filterByExpr(x, group=x[["samples"]][["new_col"]])
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
x <- calcNormFactors(x, method = "TMM")
*#Design model matrix *
Treat <- factor(x[["samples"]][["new_col"]])
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
#Run corfit and voom twice
v <- voom(x, design)
dupcor <- duplicateCorrelation(v,design,block=x[["samples"]][["Index"]])
v <- voom(x, design, block=x[["samples"]][["Index"]], correlation=dupcor$consensus)
corfit <- duplicateCorrelation(v, design, block=x[["samples"]][["Index"]])
#fit/contrasts/eBayes
fit <- lmFit(v,design,block=x[["samples"]][["Index"]],correlation=corfit$consensus)
cm <- makeContrasts(
  Adult_Effect = adultstimulated-adultcontrol,
  Preterm_Effect = pretermstimulated-pretermcontrol,
  Term_Effect = termstimulated-termcontrol,
  cont_AT = termcontrol-adultcontrol,
  cont_AP = pretermcontrol-adultcontrol,
  cont_PT = pretermcontrol-termcontrol,
  trtm_AT = termstimulated-adultstimulated,
  trtm_AP = pretermstimulated-adultstimulated,
  trtm_PT = pretermstimulated-termstimulated,
  levels=design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit)
ab <- decideTests(fit)
summary(ab)
#If I run the analysis like this, I get the following results:
      Adult_Effect Preterm_Effect Term_Effect cont_AT cont_AP cont_PT stim_AT stim_AP stim_PT
Down             34              0           0      38      41       0      43     180       0
NotSig        16990          17061       17063   16998   16970   17064   16968   16638   17064
Up               40              3           1      28      53       0      53     246       0
Interestingly, if I run my analyse the data the way the DESeq2 vignette recommends it (+ additional tipps/tricks by the authors on some blog-posts; Nr.1, Nr.2) I get completely different results.
Here is the code for the DESeq2 analysis:
#Following the section "matrix not full rank" from the vignette = creating model matrix "on my own"
metadata$Treatment = relevel(metadata$Treatment, "mock")
m1 <- model.matrix(~0 + Group + Group:Index_2 + Group:Treatment, metadata)
all.zero <- apply(m1, 2, function(x) all(x==0))
all.zero
idx <- which(all.zero)
m1 <- m1[,-idx]
#Initialize DESeq2dataset from summarized experiment dataset (se)
dds <- DESeqDataSet(se, design = m1)
#filter
smallestGroupSize <- 6
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]
#Run DESeq2
dds <- DESeq(dds)
#extract e.g. effects of stimulation on adult cells
x <- results(dds, contrast = list("Groupadult.Treatmentstimulated"))
summary(x)
out of 16075 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 540, 3.4%
LFC < 0 (down)     : 560, 3.5%
outliers [1]       : 0, 0%
low counts [2]     : 1247, 7.8%
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
#extract e.g. differences between stimulated adult and preterm cells
x <- results(dds, contrast =list("Grouppreterm.Treatmentstimulated","Groupadult.Treatmentstimulated" ))
> summary(x)
out of 16075 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 35, 0.22%
LFC < 0 (down)     : 23, 0.14%
outliers [1]       : 0, 0%
low counts [2]     : 4052, 25%
(mean count < 44)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
*Summary: * Now, while I get, that there are differences between using limma and DESeq2 for differential gene expression analysis, I cannot wrap my head around differences this big - there must be some kind of conceptual error in either one, or both of these workflows.
As I was unable to choose the experimental SetUp for this data, I also do not have sufficient metadata on the samples to account for e.g. gender etc. Thus, I played around with SVA, but could not get the model.matrix to work, since my number of groups is uneven and the manually produced model matrix I use in the DESeq analysis can“t be used as an input (or at least I could not find an example of how to do it).
So here are my concrete questions:
1) General: Is there a conceptual mistake in my understanding on how to tackle my data if I am interested in both, exploring treatment effects within a group and differences in treated/untreated cells between groups?
2) Limma: Is it correct to use the workflow proposed in the vignette section 9.7 for my dataset, combine voom with dupl.correlation and iterate this twice before proceeding?
3) DESeq2: Is it conceptually correct to use the workflow proposed in the section for "matrix not full rank" and manually provide DESeq2 with the model.matrix the way I did? Have any arguments in the uploaded code been used wrongly?
I am really looking forward to hearing your responses - unfortunately, I have to admit that I am a little bit in over my head with the issues/complexity that arise from this design and do not have a concrete and formal bioinformatic background. Thus, any help is highly appreciated!