I have a RNAseq dataset that I want to perform differential gene-expression analysis on. The dataset consists of 3 groups = macrophages deriving from adults (n=6), term-born infants (n=5), and preterm infants (n=3). Each sample has been treated with an immune-stimulus, or left untreated (paired) samples.
Group Treatment Sample_Nr Sample_within_group
1 adult control 1 1
2 adult control 2 2
3 adult stimulated 2 2
4 adult control 3 3
5 adult control 4 4
6 adult stimulated 4 4
7 adult control 5 5
8 adult stimulated 5 5
9 term control 7 1
10 term stimulated 7 1
11 term control 8 2
12 term stimulated 8 2
13 term control 9 3
14 term stimulated 9 3
15 term control 10 4
16 term stimulated 10 4
17 term control 11 5
18 term stimulated 11 5
19 preterm control 12 1
20 preterm stimulated 12 1
21 preterm control 13 2
22 preterm stimulated 13 2
23 preterm control 14 3
24 preterm stimulated 14 3
25 adult stimulated 1 1
26 adult stimulated 3 3
27 adult control 6 6
28 adult stimulated 6 6
So far, I have taken a look at both LIMMA and DESeq2 workflows to deal with this rather complex experimental setup (Vignette + Blogposts). In essence, I am interested in analysing both, differences caused by the treatment for each group, as well as, differences between treated cells and untreated cells between groups. Interestingly, sticking to the respective vignettes and blogposts, I have received rather disparate differential expression results.
As for LIMMA, I have used the following code, in alignment with (a) the vignette, Section 9.7 and (b) several blogposts (Nr.1, Nr.2 highlighting the concomitant usage of the "duplicateCorrelation" function and "voom".
#SummarizedExp to DGEList
x <- SE2DGEList(se)
#Introducing the "new" variable
x[["samples"]] <- x[["samples"]] %>% mutate(new_col = paste0(Group, Treatment, SEP = ""))
*#Filter dataset *
keep.exprs <- edgeR::filterByExpr(x, group=x[["samples"]][["new_col"]])
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
x <- calcNormFactors(x, method = "TMM")
*#Design model matrix *
Treat <- factor(x[["samples"]][["new_col"]])
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
#Run corfit and voom twice
v <- voom(x, design)
dupcor <- duplicateCorrelation(v,design,block=x[["samples"]][["Index"]])
v <- voom(x, design, block=x[["samples"]][["Index"]], correlation=dupcor$consensus)
corfit <- duplicateCorrelation(v, design, block=x[["samples"]][["Index"]])
#fit/contrasts/eBayes
fit <- lmFit(v,design,block=x[["samples"]][["Index"]],correlation=corfit$consensus)
cm <- makeContrasts(
Adult_Effect = adultstimulated-adultcontrol,
Preterm_Effect = pretermstimulated-pretermcontrol,
Term_Effect = termstimulated-termcontrol,
cont_AT = termcontrol-adultcontrol,
cont_AP = pretermcontrol-adultcontrol,
cont_PT = pretermcontrol-termcontrol,
trtm_AT = termstimulated-adultstimulated,
trtm_AP = pretermstimulated-adultstimulated,
trtm_PT = pretermstimulated-termstimulated,
levels=design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit)
ab <- decideTests(fit)
summary(ab)
#If I run the analysis like this, I get the following results:
Adult_Effect Preterm_Effect Term_Effect cont_AT cont_AP cont_PT stim_AT stim_AP stim_PT
Down 34 0 0 38 41 0 43 180 0
NotSig 16990 17061 17063 16998 16970 17064 16968 16638 17064
Up 40 3 1 28 53 0 53 246 0
Interestingly, if I run my analyse the data the way the DESeq2 vignette recommends it (+ additional tipps/tricks by the authors on some blog-posts; Nr.1, Nr.2) I get completely different results.
Here is the code for the DESeq2 analysis:
#Following the section "matrix not full rank" from the vignette = creating model matrix "on my own"
metadata$Treatment = relevel(metadata$Treatment, "mock")
m1 <- model.matrix(~0 + Group + Group:Index_2 + Group:Treatment, metadata)
all.zero <- apply(m1, 2, function(x) all(x==0))
all.zero
idx <- which(all.zero)
m1 <- m1[,-idx]
#Initialize DESeq2dataset from summarized experiment dataset (se)
dds <- DESeqDataSet(se, design = m1)
#filter
smallestGroupSize <- 6
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]
#Run DESeq2
dds <- DESeq(dds)
#extract e.g. effects of stimulation on adult cells
x <- results(dds, contrast = list("Groupadult.Treatmentstimulated"))
summary(x)
out of 16075 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 540, 3.4%
LFC < 0 (down) : 560, 3.5%
outliers [1] : 0, 0%
low counts [2] : 1247, 7.8%
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
#extract e.g. differences between stimulated adult and preterm cells
x <- results(dds, contrast =list("Grouppreterm.Treatmentstimulated","Groupadult.Treatmentstimulated" ))
> summary(x)
out of 16075 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 35, 0.22%
LFC < 0 (down) : 23, 0.14%
outliers [1] : 0, 0%
low counts [2] : 4052, 25%
(mean count < 44)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
*Summary: * Now, while I get, that there are differences between using limma and DESeq2 for differential gene expression analysis, I cannot wrap my head around differences this big - there must be some kind of conceptual error in either one, or both of these workflows.
As I was unable to choose the experimental SetUp for this data, I also do not have sufficient metadata on the samples to account for e.g. gender etc. Thus, I played around with SVA, but could not get the model.matrix to work, since my number of groups is uneven and the manually produced model matrix I use in the DESeq analysis canĀ“t be used as an input (or at least I could not find an example of how to do it).
So here are my concrete questions:
1) General: Is there a conceptual mistake in my understanding on how to tackle my data if I am interested in both, exploring treatment effects within a group and differences in treated/untreated cells between groups?
2) Limma: Is it correct to use the workflow proposed in the vignette section 9.7 for my dataset, combine voom with dupl.correlation and iterate this twice before proceeding?
3) DESeq2: Is it conceptually correct to use the workflow proposed in the section for "matrix not full rank" and manually provide DESeq2 with the model.matrix the way I did? Have any arguments in the uploaded code been used wrongly?
I am really looking forward to hearing your responses - unfortunately, I have to admit that I am a little bit in over my head with the issues/complexity that arise from this design and do not have a concrete and formal bioinformatic background. Thus, any help is highly appreciated!