Practical session about Normalization and differential analysis of RNAseq data

Data description and importation
- Raw counts
- Experimental design
Basic exploratory analysis of raw counts
- Pseudo raw count distribution
- PCA
Normalization
Differential analysis
Session information

This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The focus is put on testing interaction between a genotype (wt or mutant) and an environment (with or without a treatment). The data are provided by courtesy of the transcriptomic platform of IPS2. Gene names have been shuffled so that the results are not biologically interpretable. The data are composed of three files, all included in the directory data:

D2-counts.txt contains the raw counts of the experiment (25 columns: the first one contains the gene names, the others correspond to 24 different samples);
D2-genesLength.txt contains information about gene lengths (not used in this practical session);
D2-targets.txt contains information about the sample and the experimental design.

In the rest of this file, we:

1/ first import and describe the data;

2/ then perform a standard normalization (TMM) and explore its effect on data;

3/ finally perform a differential analysis using different methods and models, including a model in which the interaction is properly defined.

We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages ggplot2, reshape2, mixOmics, RColorBrewer, edgeR, VennDiagram and devtools are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file, in Section “Session information”.

The packages are loaded with:

library(ggplot2)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(gridExtra)
library(edgeR)
library(VennDiagram)
library(devtools)

Data description and importation

The data used in this practical session correspond to 24 samples of RNAseq obtained from Arabidopsis thaliana plants. There are four genotypes: the wild type and three types of mutations (“dbMT”, “oxMT” and “siMT”). For each genotype, gene expression is measured in two conditions (treated and controled) for three groups of plants grown at different time (biological replicates). In addition to the wild type plants, three types of mutations are studied (“dbMT”, “oxMT” and “siMT”) and for each of the four types of plant, a treatment and a control condition are measured.

The datasets are included in the repository data located at the root of the material for this practical session. They can be loaded using:

raw_counts <- read.table("../data/D2-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("../data/D2-targets.txt", header = TRUE, 
                     stringsAsFactors = FALSE)

Raw counts

The dimensions of the raw count matrix are equal to:

dim(raw_counts)

## [1] 33602    24

which shows that the data contains 24 columns (samples) and 33602 rows (genes). And a quick overview is obtained by:

head(raw_counts)

##             WT_control_rep1 WT_control_rep2 WT_control_rep3
## AT1G01010.1            1202            1172            1245
## AT1G01020.1               0               0               0
## AT1G01030.1            1089            1191            1161
## AT1G01040.2               9               7               9
## AT1G01046.1               0               0               0
## AT1G01050.1               1               1               0
##             dbMT_control_rep1 dbMT_control_rep2 dbMT_control_rep3
## AT1G01010.1              1069              1291              1183
## AT1G01020.1                 1                 0                 0
## AT1G01030.1              1160              1468              1347
## AT1G01040.2                 6                 9                 2
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 1
##             dbMT_treated_rep1 dbMT_treated_rep2 dbMT_treated_rep3
## AT1G01010.1              1268              1120              1036
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1113               930               894
## AT1G01040.2                 2                 2                 2
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             oxMT_control_rep1 oxMT_control_rep2 oxMT_control_rep3
## AT1G01010.1              1166              1041              1005
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1241              1242              1024
## AT1G01040.2                 5                 2                 5
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             oxMT_treated_rep1 oxMT_treated_rep2 oxMT_treated_rep3
## AT1G01010.1              1442              1490              1403
## AT1G01020.1                 0                 0                 0
## AT1G01030.1              1044              1168              1190
## AT1G01040.2                 3                 6                 4
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             siMT_control_rep1 siMT_control_rep2 siMT_control_rep3
## AT1G01010.1               943              1386              1335
## AT1G01020.1                 0                 0                 0
## AT1G01030.1               959              1389              1522
## AT1G01040.2                 5                13                 6
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 1                 0
##             siMT_treated_rep1 siMT_treated_rep2 siMT_treated_rep3
## AT1G01010.1              1151              1379              1224
## AT1G01020.1                 0                 0                 0
## AT1G01030.1               793              1041               923
## AT1G01040.2                 3                 4                 9
## AT1G01046.1                 0                 0                 0
## AT1G01050.1                 0                 0                 0
##             WT_treated_rep1 WT_treated_rep2 WT_treated_rep3
## AT1G01010.1            1309            1562            1367
## AT1G01020.1               0               0               0
## AT1G01030.1             897            1180             905
## AT1G01040.2               6               2               3
## AT1G01046.1               0               0               0
## AT1G01050.1               0               0               0

which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.

Experimental design

The experimental design is described in the object design:

design

##               labels        group replicat factor1 factor2
## 1    WT_control_rep1   WT_control  repbio1      WT control
## 2    WT_control_rep2   WT_control  repbio2      WT control
## 3    WT_control_rep3   WT_control  repbio3      WT control
## 4  dbMT_control_rep1 dbMT_control  repbio1    dbMT control
## 5  dbMT_control_rep2 dbMT_control  repbio2    dbMT control
## 6  dbMT_control_rep3 dbMT_control  repbio3    dbMT control
## 7  dbMT_treated_rep1 dbMT_treated  repbio1    dbMT treated
## 8  dbMT_treated_rep2 dbMT_treated  repbio2    dbMT treated
## 9  dbMT_treated_rep3 dbMT_treated  repbio3    dbMT treated
## 10 oxMT_control_rep1 oxMT_control  repbio1    oxMT control
## 11 oxMT_control_rep2 oxMT_control  repbio2    oxMT control
## 12 oxMT_control_rep3 oxMT_control  repbio3    oxMT control
## 13 oxMT_treated_rep1 oxMT_treated  repbio1    oxMT treated
## 14 oxMT_treated_rep2 oxMT_treated  repbio2    oxMT treated
## 15 oxMT_treated_rep3 oxMT_treated  repbio3    oxMT treated
## 16 siMT_control_rep1 siMT_control  repbio1    siMT control
## 17 siMT_control_rep2 siMT_control  repbio2    siMT control
## 18 siMT_control_rep3 siMT_control  repbio3    siMT control
## 19 siMT_treated_rep1 siMT_treated  repbio1    siMT treated
## 20 siMT_treated_rep2 siMT_treated  repbio2    siMT treated
## 21 siMT_treated_rep3 siMT_treated  repbio3    siMT treated
## 22   WT_treated_rep1   WT_treated  repbio1      WT treated
## 23   WT_treated_rep2   WT_treated  repbio2      WT treated
## 24   WT_treated_rep3   WT_treated  repbio3      WT treated

that contains 5 columns: the first one labels is the sample name, the second one is group which combines the genotype and the treatment/control factor, the third column replicat indicates the biological replicate number, the fourth column called factor1 indicates the genotype and the last column factor2 indicates condition (control or treated).

Basic exploratory analysis of raw counts

Pseudo raw count distribution

We start the analysis by filtering out the genes for which no count has been found:

raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)

## [1] 27732    24

The number of genes which have been filtered out is thus equal to 5870.

Pseudo-count distribution is then obtained and visualized using the variable “group” to colour the different boxplots:

pseudo_counts <- log2(raw_counts_wn + 1)

df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2] <- c("id", "sample")
df_raw$group <- substr(as.character(df_raw$sample), 1,
                       nchar(as.character(df_raw$sample)) - 5)
df_raw$method <- rep("Raw counts", nrow(df_raw))  

p <- ggplot(data=df_raw, aes(x=sample, y=value, fill=group))
p <- p + geom_boxplot()
p <- p + scale_fill_brewer(type = "qual", palette = 7)
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of raw pseudo counts")
p <- p + ylab(expression(log[2] ~ (raw ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

PCA

A first exploratory analysis is performed with PCA:

resPCA <- pca(t(pseudo_counts), ncomp = 12)
plot(resPCA)

3 principal components seem to be enough to retrieve most of the variability in the dataset (the colors are identical to the ones used in the previous boxplots):

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          title = "1st and 2nd PCs")

The first principal component shows a strong separation between the two groups of treatment. On the second principal component, the different genotypes are approximately organized in separated groups, even though the separation is not perfect.

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          comp = c(2,3), title = "2nd and 3rd PCs")

The third principal component also shows an imperfect separation between the different genotypes, with respect to their group of treatment.

Normalization

In this section, a normalization is performed using the TMM method implemented in the R package edgeR. Raw count data are first transformed into a DGEList object and the normalization is performed.

dge <- DGEList(raw_counts_wn)
dge <- calcNormFactors(dge, method = "TMM")

Normalized pseudo counts are obtained with the function cpm and stored in a data frame:

pseudo_TMM <- log2(cpm(dge) + 1)

df_TMM <- melt(pseudo_TMM, id = rownames(raw_counts_wn))
names(df_TMM)[1:2] <- c ("id", "sample")
df_TMM$group <- substr(as.character(df_TMM$sample), 1,
                       nchar(as.character(df_TMM$sample)) - 5)
df_TMM$method <- rep("TMM", nrow(df_TMM))

The normalization is first compared to the original dataset by displaying the distribution of the pseudo counts (before and after normalization) for the different samples:

df_allnorm <- rbind(df_raw, df_TMM)
df_allnorm$method <- factor(df_allnorm$method,
                            levels = c("Raw counts", "TMM"))

p <- ggplot(data=df_allnorm, aes(x=sample, y=value, fill=group))
p <- p + geom_boxplot()  
p <- p + scale_fill_brewer(type = "qual", palette = 7)
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of normalized pseudo counts\n
for all samples by normalization methods")
p <- p + facet_grid(method ~ .) 
p <- p + ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

which shows an improvement in the ressemblance between the different sample distributions.

Then a PCA is performed:

resPCA <- pca(t(pseudo_TMM), ncomp = 24)
plot(resPCA)

The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          title = "1st and 2nd PCs")

And their projection on the second and third PCs is obtained with:

plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(8, "Dark2"),
          comp = c(2,3), title = "2nd and 3rd PCs")

Both figures show an improvement (compared to raw data) in the way the different mutants are grouped (with respect to their treatment group).

Differential analysis

The differential analysis is performed using two models:

the first one models the overall effects of genotype and treatment type with the variable “group”. The model fit in this part is:

\(\log(\lambda_{g,\textrm{replicate},\textrm{group}}) = \lambda_{g0} + \sum_{j=2}^3 \beta_{gj} \mathbb{I}_{\textrm{replicate}_j} + \sum_{k=1}^7 \gamma_{gk} \mathbb{I}_{\textrm{group}_k}\)

in which \(\mbox{group}_k \in \{\)dbMT_control, dbMT_treated, oxMT_control, oxMT_treated, siMT_control, siMT_treated, WT_treated\(\}\) (the control reference being “WT_control”). In this part, we will extract genes that are differentially expressed between the groups “groupdbMT_control” “groupdbMT_treated” (i.e., genes that exibits a differential expression after treatment for the mutant “dbMT”). This question is equivalent to testing \(\gamma_{g1} = \gamma_{g2}\) versus \(\gamma_{g1} \neq \gamma_{g2}\) and corresponds to testing a contrast.

the second one separates the effect of mutant type and treatment type and their interaction. The model fit in this part writes:

\(\log(\lambda_{g,\textrm{replicate},\textrm{mutant},\textrm{treatment}}) = \lambda_{g0} + \sum_{j=2}^3 \beta_{gj} \mathbb{I}_{\textrm{replicate}_j} + \sum_{k=1}^3 \gamma_{gk} \mathbb{I}_{\textrm{mutant}_k} + \alpha_{g1} \mathbb{I}_{\textrm{treatment}} + \sum_{k=1}^3 \mu_{gk} \mathbb{I}_{\textrm{mutant}_k \textrm{ and treatment}}\)

in which \(\mbox{replicate}_j\in\{\)dbMT, oxMT, siMT (the reference groups being “WT” for the mutant type and “control” for the treatment type). Contrary to the previous model, the genotype and treatment effects are well separated in this model, which makes possible to test for genes having a treatment effect, genes with a mutant type effect or genes which shows an interaction effect between the two factors (genotype x treatment interaction). In this part, we will extract genes with a significant interaction effect between “dbMT” and “treatment”. This question is equivalent to testing \(\mu_{g1} = 0\) versus \(\mu_{g1} \neq 0\).

The results of the two analyses will be compared in a last section to show that the two models are not equivalent.

Replicate and group effect

The first analysis creates a group effect, in which the group is a combination of the two factors and defines the design matrix:

# create design matrix
group <- relevel(as.factor(design$group), ref = "WT_control")
design_matrix <- model.matrix(~ design$replicat + group)
design_matrix

##    (Intercept) design$replicatrepbio2 design$replicatrepbio3
## 1            1                      0                      0
## 2            1                      1                      0
## 3            1                      0                      1
## 4            1                      0                      0
## 5            1                      1                      0
## 6            1                      0                      1
## 7            1                      0                      0
## 8            1                      1                      0
## 9            1                      0                      1
## 10           1                      0                      0
## 11           1                      1                      0
## 12           1                      0                      1
## 13           1                      0                      0
## 14           1                      1                      0
## 15           1                      0                      1
## 16           1                      0                      0
## 17           1                      1                      0
## 18           1                      0                      1
## 19           1                      0                      0
## 20           1                      1                      0
## 21           1                      0                      1
## 22           1                      0                      0
## 23           1                      1                      0
## 24           1                      0                      1
##    groupdbMT_control groupdbMT_treated groupoxMT_control groupoxMT_treated
## 1                  0                 0                 0                 0
## 2                  0                 0                 0                 0
## 3                  0                 0                 0                 0
## 4                  1                 0                 0                 0
## 5                  1                 0                 0                 0
## 6                  1                 0                 0                 0
## 7                  0                 1                 0                 0
## 8                  0                 1                 0                 0
## 9                  0                 1                 0                 0
## 10                 0                 0                 1                 0
## 11                 0                 0                 1                 0
## 12                 0                 0                 1                 0
## 13                 0                 0                 0                 1
## 14                 0                 0                 0                 1
## 15                 0                 0                 0                 1
## 16                 0                 0                 0                 0
## 17                 0                 0                 0                 0
## 18                 0                 0                 0                 0
## 19                 0                 0                 0                 0
## 20                 0                 0                 0                 0
## 21                 0                 0                 0                 0
## 22                 0                 0                 0                 0
## 23                 0                 0                 0                 0
## 24                 0                 0                 0                 0
##    groupsiMT_control groupsiMT_treated groupWT_treated
## 1                  0                 0               0
## 2                  0                 0               0
## 3                  0                 0               0
## 4                  0                 0               0
## 5                  0                 0               0
## 6                  0                 0               0
## 7                  0                 0               0
## 8                  0                 0               0
## 9                  0                 0               0
## 10                 0                 0               0
## 11                 0                 0               0
## 12                 0                 0               0
## 13                 0                 0               0
## 14                 0                 0               0
## 15                 0                 0               0
## 16                 1                 0               0
## 17                 1                 0               0
## 18                 1                 0               0
## 19                 0                 1               0
## 20                 0                 1               0
## 21                 0                 1               0
## 22                 0                 0               1
## 23                 0                 0               1
## 24                 0                 0               1
## attr(,"assign")
##  [1] 0 1 1 2 2 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`design$replicat`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

Using this design matrix, the dispersions are then estimated:

dge <- estimateGLMCommonDisp(dge, design_matrix)
dge <- estimateGLMTrendedDisp(dge, design_matrix)
dge <- estimateGLMTagwiseDisp(dge, design_matrix)
plotBCV(dge, main = paste0("BCV plot"))

Then a contrast vector is created: this vector has a length equal to the number of columns in the design matrix and its entries are equal to zeros except for the two conditions for which we want to test equality (in our cases, they are in columns 4 and 5), which are equal, respectively, to -1 and to +1:

# GLM fit
fit <- glmFit(dge, design_matrix)
contrasts <- rep(0, ncol(design_matrix))
contrasts[4] <- -1
contrasts[5] <- 1
contrasts; colnames(design_matrix)[as.logical(contrasts)]

##  [1]  0  0  0 -1  1  0  0  0  0  0

## [1] "groupdbMT_control" "groupdbMT_treated"

The test (LR test) is then performed and different characteristics are saved: raw and adjusted p-values for all genes, value of the statistics (LR) for all genes, DEGs and their type (up/down regulated) for a risk of 1%:

res_GLM1 <- glmLRT(fit, contrast = contrasts)
# p-values
pvals_GLM1 <- data.frame("pvalue" = c(res_GLM1$table$PValue, 
                                      p.adjust(res_GLM1$table$PValue, "BH")),
                         "type" = rep(c("raw", "adjusted"),
                                      each = nrow(raw_counts_wn)))
topTab <- topTags(res_GLM1, n = nrow(raw_counts_wn))
# statistics (LR)
LR_GLM1 <- topTab$table$LR
names(LR_GLM1) <- rownames(topTab)
# DEG
DEGs <- decideTestsDGE(res_GLM1, adjust.method = "BH", p.value = 0.01)
sel_deg <- which(DEGs[ ,1] != 0)
DEG_GLM1 <- data.frame("name" = rownames(raw_counts_wn)[sel_deg], 
                       "UD" = DEGs[sel_deg,1])

The distribution of p-values is given by:

p <- ggplot(data = pvals_GLM1, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ .) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

The distribution of statistics (\(\log_2\) transformed) is given by:

df <- data.frame("statistics" = log2(LR_GLM1 + 1))
p <- ggplot(data = df, aes(x = statistics))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle(expression("Histogram of" ~ log[2](~"statistics" + 1)))
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

The number of DEGs and their types are finally obtained with:

nrow(DEG_GLM1)

## [1] 5891

table(DEG_GLM1$UD) # 1 means 'up-regulated'

## 
##   -1    1 
## 2662 3229

Two factors and interaction effects

The second analysis creates two factors (mutant and treatments) and creates a design matrix with an additive effects of these two factors and another additive effect corresponding to the interaction:

# create design matrix
factor1 <- relevel(as.factor(design$factor1), ref = "WT")
factor2 <- relevel(as.factor(design$factor2), ref = "control")
design_matrix <- model.matrix(~ design$replicat + factor1 + factor2 +
                                factor1:factor2)
design_matrix

##    (Intercept) design$replicatrepbio2 design$replicatrepbio3 factor1dbMT
## 1            1                      0                      0           0
## 2            1                      1                      0           0
## 3            1                      0                      1           0
## 4            1                      0                      0           1
## 5            1                      1                      0           1
## 6            1                      0                      1           1
## 7            1                      0                      0           1
## 8            1                      1                      0           1
## 9            1                      0                      1           1
## 10           1                      0                      0           0
## 11           1                      1                      0           0
## 12           1                      0                      1           0
## 13           1                      0                      0           0
## 14           1                      1                      0           0
## 15           1                      0                      1           0
## 16           1                      0                      0           0
## 17           1                      1                      0           0
## 18           1                      0                      1           0
## 19           1                      0                      0           0
## 20           1                      1                      0           0
## 21           1                      0                      1           0
## 22           1                      0                      0           0
## 23           1                      1                      0           0
## 24           1                      0                      1           0
##    factor1oxMT factor1siMT factor2treated factor1dbMT:factor2treated
## 1            0           0              0                          0
## 2            0           0              0                          0
## 3            0           0              0                          0
## 4            0           0              0                          0
## 5            0           0              0                          0
## 6            0           0              0                          0
## 7            0           0              1                          1
## 8            0           0              1                          1
## 9            0           0              1                          1
## 10           1           0              0                          0
## 11           1           0              0                          0
## 12           1           0              0                          0
## 13           1           0              1                          0
## 14           1           0              1                          0
## 15           1           0              1                          0
## 16           0           1              0                          0
## 17           0           1              0                          0
## 18           0           1              0                          0
## 19           0           1              1                          0
## 20           0           1              1                          0
## 21           0           1              1                          0
## 22           0           0              1                          0
## 23           0           0              1                          0
## 24           0           0              1                          0
##    factor1oxMT:factor2treated factor1siMT:factor2treated
## 1                           0                          0
## 2                           0                          0
## 3                           0                          0
## 4                           0                          0
## 5                           0                          0
## 6                           0                          0
## 7                           0                          0
## 8                           0                          0
## 9                           0                          0
## 10                          0                          0
## 11                          0                          0
## 12                          0                          0
## 13                          1                          0
## 14                          1                          0
## 15                          1                          0
## 16                          0                          0
## 17                          0                          0
## 18                          0                          0
## 19                          0                          1
## 20                          0                          1
## 21                          0                          1
## 22                          0                          0
## 23                          0                          0
## 24                          0                          0
## attr(,"assign")
##  [1] 0 1 1 2 2 2 3 4 4 4
## attr(,"contrasts")
## attr(,"contrasts")$`design$replicat`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$factor1
## [1] "contr.treatment"
## 
## attr(,"contrasts")$factor2
## [1] "contr.treatment"

Using this design matrix, the dispersions are then estimated:

dge <- estimateGLMCommonDisp(dge, design_matrix)
dge <- estimateGLMTrendedDisp(dge, design_matrix)
dge <- estimateGLMTagwiseDisp(dge, design_matrix)
plotBCV(dge, main = paste0("BCV plot"))

res_GLM2 <- glmLRT(fit, coef = 8)
# p-values
pvals_GLM2 <- data.frame("pvalue" = c(res_GLM2$table$PValue, 
                                      p.adjust(res_GLM2$table$PValue, "BH")),
                         "type" = rep(c("raw", "adjusted"),
                                      each = nrow(raw_counts_wn)))
topTab <- topTags(res_GLM2, n = nrow(raw_counts_wn))
# statistics (LR)
LR_GLM2 <- topTab$table$LR
names(LR_GLM2) <- rownames(topTab)
# DEG
DEGs <- decideTestsDGE(res_GLM2, adjust.method = "BH", p.value = 0.01)
sel_deg <- which(DEGs[ ,1] != 0)
DEG_GLM2 <- data.frame("name" = rownames(raw_counts_wn)[sel_deg], 
                       "UD" = DEGs[sel_deg,1])

The distribution of p-values is given by:

p <- ggplot(data = pvals_GLM2, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ .) 
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

The distribution of statistics (\(\log_2\) transformed) is given by:

df <- data.frame("statistics" = log2(LR_GLM2 + 1))

## Warning in data.frame(statistics = log2(LR_GLM2 + 1)): NaNs produced

p <- ggplot(data = df, aes(x = statistics))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle(expression("Histogram of" ~ log[2](~"statistics" + 1)))
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())
print(p)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

The number of DEGs and their types are finally obtained with:

nrow(DEG_GLM2)

## [1] 664

table(DEG_GLM2$UD) # 1 means 'up-regulated'

## 
##  -1   1 
## 322 342

Comparison

First, the two groups of genes are compared using a Venn diagram that shows that non only are the two groups very different in size but also their overlap is quite reduced:

vd <- venn.diagram(x=list("Overall group effect" = DEG_GLM1$name,
                          "Separate effects" = DEG_GLM2$name), 
                   fill = brewer.pal(3, "Set3")[1:2], 
                   cat.col = c("darkgreen", "darkred"),
                   cat.cex = 1.5, fontface="bold", filename=NULL, 
                   mar = rep(0.1,4))
grid.draw(vd)

Two more graphics show the relations between the statistics as obtained by the two models and the p-values (restricted to p-values of significant genes):

order_2in1 <- match(names(LR_GLM1), names(LR_GLM2))
df <- data.frame("statistics1" = log2(LR_GLM1 + 1),
                 "statistics2" = log2(LR_GLM2[order_2in1] + 1))

## Warning in data.frame(statistics1 = log2(LR_GLM1 + 1), statistics2 =
## log2(LR_GLM2[order_2in1] + : NaNs produced

p1 <- ggplot(data = df, aes(x = statistics1, y = statistics2))
p1 <- p1 + geom_point(colour = "red", alpha = 0.01)
p1 <- p1 + theme_bw()
p1 <- p1 + ggtitle("Statistics for both models")
p1 <- p1 + xlab("Overall group effect") + ylab("Separate effects")
p1 <- p1 + theme(title = element_text(size=10), axis.text.x = element_blank(), 
               axis.ticks.x = element_blank())

pvals1 <- pvals_GLM1$pvalue[pvals_GLM1$type == "adjusted"]
pvals2 <- pvals_GLM2$pvalue[pvals_GLM2$type == "adjusted"]
sel_pvals <- (pvals1 < 0.01 | pvals2 < 0.01)
df <- data.frame("pvals1" = pvals1[sel_pvals], "pvals2" = pvals2[sel_pvals])
p2 <- ggplot(data = df, aes(x = pvals1, y = pvals2))
p2 <- p2 + geom_point(colour = "red", alpha = 0.01)
p2 <- p2 + theme_bw()
p2 <- p2 + ggtitle("Statistics for both models")
p2 <- p2 + xlab("Overall group effect") + ylab("Separate effects")
p2 <- p2 + theme(title = element_text(size=10), axis.text.x = element_blank(), 
                 axis.ticks.x = element_blank())

grid.arrange(p1, p2, ncol = 2)

## Warning: Removed 1 rows containing missing values (geom_point).

These figures confirm the previous comment and show that both the statistics and p-values obtained by the two approaches are only slightly related.

Session information

session_info()

## Session info --------------------------------------------------------------

##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  tz       <NA>                        
##  date     2016-07-07

## Packages ------------------------------------------------------------------

##  package        * version   date       source        
##  assertthat       0.1       2013-12-06 CRAN (R 3.3.1)
##  colorspace       1.2-6     2015-03-11 CRAN (R 3.3.1)
##  corpcor          1.6.8     2015-07-08 CRAN (R 3.3.1)
##  DBI              0.4-1     2016-05-08 CRAN (R 3.3.1)
##  devtools       * 1.12.0    2016-06-24 CRAN (R 3.3.1)
##  digest           0.6.9     2016-01-08 CRAN (R 3.3.1)
##  dplyr            0.5.0     2016-06-24 CRAN (R 3.3.1)
##  edgeR          * 3.12.1    2016-07-03 Bioconductor  
##  ellipse          0.3-8     2013-04-13 CRAN (R 3.3.1)
##  evaluate         0.9       2016-04-29 CRAN (R 3.3.1)
##  formatR          1.4       2016-05-09 CRAN (R 3.3.1)
##  futile.logger  * 1.4.1     2015-04-20 CRAN (R 3.3.1)
##  futile.options   1.0.0     2010-04-06 CRAN (R 3.3.1)
##  ggplot2        * 2.1.0     2016-03-01 CRAN (R 3.3.1)
##  gridExtra      * 2.2.1     2016-02-29 CRAN (R 3.3.1)
##  gtable           0.2.0     2016-02-26 CRAN (R 3.3.1)
##  htmltools        0.3.5     2016-03-21 CRAN (R 3.3.1)
##  igraph           1.0.1     2015-06-26 CRAN (R 3.3.1)
##  knitr            1.13      2016-05-09 CRAN (R 3.3.1)
##  labeling         0.3       2014-08-23 CRAN (R 3.3.1)
##  lambda.r         1.1.7     2015-03-20 CRAN (R 3.3.1)
##  lattice        * 0.20-33   2015-07-14 CRAN (R 3.3.1)
##  limma          * 3.26.9    2016-07-03 Bioconductor  
##  magrittr         1.5       2014-11-22 CRAN (R 3.3.1)
##  MASS           * 7.3-45    2016-04-21 CRAN (R 3.3.1)
##  memoise          1.0.0     2016-01-29 CRAN (R 3.3.1)
##  mixOmics       * 6.0.0     2016-06-14 CRAN (R 3.3.1)
##  munsell          0.4.3     2016-02-13 CRAN (R 3.3.1)
##  plyr             1.8.4     2016-06-08 CRAN (R 3.3.1)
##  R6               2.1.2     2016-01-26 CRAN (R 3.3.1)
##  RColorBrewer   * 1.1-2     2014-12-07 CRAN (R 3.3.1)
##  Rcpp             0.12.5    2016-05-14 CRAN (R 3.3.1)
##  reshape2       * 1.4.1     2014-12-06 CRAN (R 3.3.1)
##  rgl              0.94.1131 2014-09-08 CRAN (R 3.1.1)
##  rmarkdown        0.9.6     2016-05-01 CRAN (R 3.3.1)
##  scales           0.4.0     2016-02-26 CRAN (R 3.3.1)
##  stringi          1.1.1     2016-05-27 CRAN (R 3.3.1)
##  stringr          1.0.0     2015-04-30 CRAN (R 3.3.1)
##  tibble           1.1       2016-07-04 CRAN (R 3.3.1)
##  tidyr            0.5.1     2016-06-14 CRAN (R 3.3.1)
##  VennDiagram    * 1.6.17    2016-04-18 CRAN (R 3.3.1)
##  withr            1.0.2     2016-06-20 CRAN (R 3.3.1)
##  yaml             2.1.13    2014-06-12 CRAN (R 3.3.1)

Practical session about Normalization and differential analysis of RNAseq data - Estimating interaction effects

Nathalie Villa-Vialaneix

SPS Summer School - July 18-19th 2016